Introduction

Multiple lines of evidence suggest that Temporal-Difference Reinforcement Learning (TDRL) is implemented in the cortico-basal ganglia-dopamine (DA) circuits in such a way that DA represents the TD reward-prediction-error (RPE) 3-7 and DA-dependent plasticity of cortico-striatal synapses implements the TD-RPE-dependent update of state/action values 8-10. Traditionally, TDRL in the cortico-basal ganglia-DA circuits was considered to serve only relatively simple behavior. However, subsequent studies suggested that more sophisticated, apparently goal-directed/model-based behavior can also be achieved by TDRL if states are appropriately represented 11-13, and that DA signals indeed reflect model-based predictions 14, 15. Conversely, issues related to state representation could potentially cause behavioral or mental-health problems 16-20. Early modeling studies treated state representations appropriate to the situation/task as given ('handcrafted' by the authors), but representation itself should be learnt in the brain 21-26. Recently it was shown that appropriate state representation can be learnt through RL in a recurrent neural network (RNN) by minimization of the squared value-error without an explicit teacher/target 2, 13, while the state value can simultaneously be learnt downstream of the RNN.

However, whether such a learning method, named the value-RNN 2, can be implemented in the brain remains unclear, because it poses problems of biological plausibility. A major problem, among others, is that the update rule proposed in the previous work for the connections onto the 'neurons' in the RNN 2, derived from the gradient-descent error-'backpropagation' (hereafter referred to as backprop) method 27, 28, involves the weights of the connections from these RNN units onto the downstream value-encoding unit. Given that the state-representing RNN and the value-encoding unit are implemented by the intra-cortical circuit and the striatal neurons, respectively, as generally suggested 3, 29, 30, this means that the update (plasticity) rule for intra-cortical connections involves the downstream cortico-striatal synaptic strengths, which cannot be accessed from within the cortex. Indeed, this is an example of the long-standing difficulty in biological implementation of backprop 31, 32, namely that the update of upstream connections requires biologically unavailable downstream connection strengths.

Recently, a potential solution for this difficulty has been proposed 33 (see also 34-41 for other potential solutions). Specifically, in supervised learning of feed-forward networks, it was shown that when the downstream connection strengths used for updating upstream connections in backprop were replaced with fixed random strengths, comparable learning performance was still achieved 33. This was suggested to occur because information about the introduced fixed random strengths transferred, through learning, to the upstream connections and then to the downstream feed-forward connections, so that these feed-forward connections became aligned to the random feedback strengths; in turn, the random feedback could then play the same role as the downstream connection strengths play in backprop. This mechanism was named 'feedback alignment' 33, and was subsequently shown to work also in supervised learning of RNNs 42 and proposed to be neurally implemented 43 (in a different way from the present study, as we discuss in the Discussion).

The value-RNN 2, 13, i.e., the above-introduced simultaneous RL of state values and state representation through minimization of the squared value-error, differs from the supervised learning considered in these previous feedback-alignment studies in two ways: i) it is TD learning, i.e., it approximates the true error by the TD-RPE because the true error, or true state value, is unknown, and ii) it uses a scalar error (TD-RPE) rather than a vector error. Therefore, it was nontrivial whether the feedback alignment mechanism could also work for the value-RNN. In the present work, we first examined this, demonstrating that it does work and providing a mechanistic insight into how it works.

After that, we further addressed other biological-plausibility problems. Specifically, we imposed biological constraints that the downstream (cortico-striatal) weights and the fixed random feedback, as well as the activities of neurons in the RNN, were all non-negative. Moreover, we modified the update rule for the RNN connection strengths so that its dependence on post-synaptic neural activity became monotonic rather than non-monotonic. We then found, unexpectedly, that the non-negative constraint appeared to aid, rather than degrade, the learning, by ensuring that the downstream weights and the fixed random feedback were loosely aligned even without operation of the feedback alignment mechanism. These results suggest how learning of state representation and value can be neurally implemented, more specifically, through synaptic plasticity in the cortex and the striatum that depends on DA representing TD-RPE.

Results

Consideration of the value-RNN with fixed random feedback

We considered an implementation of the value-RNN in the cortico-basal ganglia circuits (Fig. 1). A cortical region/population is supposed to represent information of sensory observation (o) and send it to another cortical region/population, which has rich recurrent connections and therefore can be approximated by an RNN. Activities of neurons in the RNN (x) are supposed to learn to represent states, through updates of the strengths of recurrent connections A and feed-forward connections B. The activity of a population of striatal neurons that receive inputs from the RNN is supposed to learn to represent the state values (v), by learning the weights of the cortico-striatal connections (from the RNN to the striatal neurons) (w), which serve as the value weights. DA neurons in the ventral tegmental area (VTA) receive (direct and indirect) inputs from the striatum and from other structures conveying information of obtained reward (r), and thereby the activity of the DA neurons, as well as released DA, represents TD-RPE (δ). TD-RPE-representing DA is released in the striatum and also in the cortical RNN through mesocorticolimbic projections, and is used for modifying the strengths of the cortical recurrent and feed-forward connections (A and B) and the cortico-striatal connections (w).
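The following minimal sketch (in Python; not the authors' code) illustrates this mapping of circuit elements onto model variables and a single forward step. The specific function and variable names, and the indexing convention for the reward in the TD-RPE, are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_obs, gamma = 7, 2, 0.8             # RNN units, observation dimensions (cue, reward), discount factor

A = rng.standard_normal((n, n))         # intra-cortical recurrent connections
B = rng.standard_normal((n, n_obs))     # feed-forward connections from the observation-coding population
w = np.zeros(n)                         # cortico-striatal connections (value weights)
x = rng.standard_normal(n)              # activities of the RNN ('cortical') neurons

def f(u):
    """Sigmoidal input-output function bounded between -0.5 and 0.5."""
    return 1.0 / (1.0 + np.exp(-u)) - 0.5

def forward(x, o):
    """One time step: cortical state update and striatal value readout."""
    x_next = f(A @ x + B @ o)           # x(t+1) = f(A x(t) + B o(t))
    v_next = w @ x_next                 # v(t+1) = w^T x(t+1): activity of the striatal population
    return x_next, v_next

# DA activity as TD-RPE (assumed convention: delta(t) = r(t+1) + gamma*v(t+1) - v(t))
o_cue, r_next = np.array([1.0, 0.0]), 0.0
v = w @ x
x, v_next = forward(x, o_cue)
delta = r_next + gamma * v_next - v
```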

Implementation of the value-RNN in the cortico-basal ganglia-DA circuits.

In the original value-RNN 2, 13, the update rule for the connections onto the RNN units (A and B) requires the (gradually changing) value weights (w), but this is biologically implausible because the cortico-striatal synaptic strengths are not available in the cortex, as discussed above. Therefore, we considered a modified value-RNN in which the cortico-striatal weights used in the updates of the intra-cortical connections were replaced with fixed random strengths (c). Moreover, the original value-RNN adopted a learning rule called backpropagation through time (BPTT) 44, in which the error in the output needs to be accumulated incrementally in temporally backward order; such acausality is also biologically implausible, as previously pointed out 42. Therefore, we instead used an online learning rule, which considers only the influence of the recurrent weights at the previous time step (see the Methods for details and equations).

Simulation of a Pavlovian cue-reward association task with variable inter-trial intervals

We compared the learning of the modified value-RNN with fixed random feedback (referred to as VRNNrf) and the value-RNN with backprop (referred to as VRNNbp), both of which adopted the online learning rule rather than BPTT, as well as an untrained RNN. The number of RNN units was set to 7 in all cases. Traditional TD-RL agents with punctate state representation (called the complete serial compound, CSC 3, 45) were also compared. We simulated a Pavlovian cue-reward association task, in which a cue was followed by a reward three time-steps later, and the inter-trial interval (i.e., reward to next cue) was randomly chosen from 4, 5, 6, or 7 time-steps (Fig. 2A). In this task, states can be defined by relative timings from the cue, and we estimated the true state values through simulations according to the definition of state value, i.e., expected cumulative discounted future rewards 46 (Fig. 2B, black line). Expected TD-RPE calculated from these estimated true values (Fig. 2B, red line) was almost 0 at every state, as expected. The agent having punctate/CSC state representation and state values without continuation between trials (i.e., the value of the last state in a trial was not updated by TD-RPE upon entering the next trial) developed positive values between cue and reward, and an abrupt TD-RPE upon cue (Fig. 2C). The agent having punctate/CSC state representation and continuously updated state values across trials developed positive values also for states in the inter-trial interval (Fig. 2D). VRNNbp developed state values between cue and reward, and to some extent in the inter-trial interval, and showed an abrupt TD-RPE upon cue and a smaller TD-RPE upon reward (Fig. 2E). This indicates that this agent largely learned the task structure, confirming the previously proposed effectiveness of the value-RNN in this task, which differed from those of the original study.
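As a reference for the punctate/CSC agents, the sketch below (our own illustrative code, with assumed indexing conventions) shows TD(0) learning on this task with a one-hot timing representation and without continuation between trials; it develops positive values between cue and reward and near-zero values in the inter-trial interval, as in Fig. 2C.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha, n_trials = 0.8, 0.1, 1000
V = np.zeros(10)                      # punctate/CSC: one value per within-trial timing (1..10)

for _ in range(n_trials):
    T = rng.choice([7, 8, 9, 10])     # trial length (reward-to-next-cue interval of 4-7 steps)
    reward = np.zeros(T)
    reward[3] = 1.0                   # cue at timing index 0, reward three steps later
    # TD(0) updates within the trial; the last state's value is not updated
    # upon entering the next trial ('no continuation between trials')
    for t in range(T - 1):
        delta = reward[t + 1] + gamma * V[t + 1] - V[t]
        V[t] += alpha * delta

print(np.round(V, 2))  # values are positive from the cue state to the pre-reward state and ~0 elsewhere
```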

Simulation of a Pavlovian cue-reward association task. (A) Simulated task with variable inter-trial intervals. (B) Black line: Estimated true values of states, defined by relative timings from the cue, through simulations according to the definition of state value, i.e., expected cumulative discounted future rewards. Red line: TD-RPEs calculated from the estimated true state values. (C-G) State values (black lines) and TD-RPEs (red lines) at 1000-th trial, averaged across 100 simulations (error-bars indicating ± SEM across simulations), in different types of agent: (C) TD-RL agent having punctate/CSC state representation and state values without continuation between trials (i.e., the value of the last state in a trial was not updated by TD-RPE upon entering the next trial); (D) TD-RL agent having punctate/CSC state representation and continuously updated state values across trials; (E) Value-RNN with backprop (VRNNbp). The number of RNN units was 7 (same applied to (F,G)); (F) Value-RNN with fixed random feedback (VRNNrf); (G) Agent with untrained RNN. (H) State values at 1000-th trial in individual simulations of VRNNbp (top), VRNNrf (middle), and untrained RNN (bottom). (I) Histograms of the value of the pre-reward state (i.e., the state one time-step before the reward state) at 1000-th trial in individual simulations of the three models. The vertical black dashed lines indicate the true value of the pre-reward state (estimated through simulations). (J) Learning performance of VRNNbp (red line), VRNNrf (blue line), and the untrained RNN (gray line) when the number of RNN units was varied from 5 to 40 (horizontal axis). Learning performance was measured by the sum of squares of differences between the state values developed at 1000-th trial by each of these three types of agent and the estimated true state values between cue and reward (vertical axis), averaged over 100 simulations (error-bars indicating ± SEM across simulations).

VRNNrf, having fixed random feedback instead of backprop-based feedback, developed state values that were largely similar to, although smaller (on average across simulations) than, those developed by VRNNbp (Fig. 2F black line). VRNNrf generated abrupt TD-RPEs upon cue and reward, again similarly to VRNNbp, although the relative size of the reward response was (on average) larger (Fig. 2F red line). As a comparison, the agent with untrained RNN developed (on average) even smaller state values and an even larger relative size of TD-RPE upon reward (Fig. 2G). These results indicate that the value-RNN could be trained by fixed random feedback at least to a certain extent, although somewhat less effectively (as might be expected) than by backprop-based feedback. Figure 2H shows state values developed in individual simulations of VRNNbp (top), VRNNrf (middle), and untrained RNN (bottom), and Figure 2I shows the histograms of the value of the pre-reward state (i.e., one time-step before the state where reward was obtained) developed in individual simulations of these three models. These figures indicate that VRNNrf did not simply develop moderately smaller state values than VRNNbp in every simulation. Rather, state values developed in VRNNrf were largely comparable to those developed in VRNNbp once they were successfully learned, but the success rate was lower than that of VRNNbp, while still higher than that of the untrained RNN.

So far, we examined the cases where the number of RNN units was 7. We then compared the learning performance of VRNNbp, VRNNrf, and untrained RNN when the number of RNN units was varied from 5 to 40. Learning performance was measured by the sum of squares of differences between the state values developed by each of these three types of agents and the estimated true state values (Fig. 2B) between cue and reward. As shown in Fig. 2J, on average across simulations, VRNNbp generally achieved the highest performance, but VRNNrf also exhibited largely comparable performance and always outperformed the untrained RNN. As the number of RNN units increased from 5 to 15, all three agents improved their performance, while additional increases to 20 or 25 units resulted in smaller changes. Further increase of RNN units caused a decrease in the mean performance of all three agents, and when the number of RNN units was increased to 45, there were occasions where learning appeared to diverge. We will discuss these points in the Discussion.

Occurrence of feedback alignment and an intuitive understanding of its mechanism

We asked whether feedback alignment underlay the learnability of VRNNrf. Returning to the case with 7 RNN units, we examined whether the value weight vector w became aligned to the random feedback vector c in VRNNrf, by looking at the changes in the angle between these two vectors across trials. As shown in Fig. 3A, this angle, averaged across simulations, decreased over trials, indicating that the value weight w indeed tended to become aligned to the random feedback c. We then examined whether better alignment of w to c related to better development of state value, by looking at the relation between the w-c angle and the value of the pre-reward state at the 1000-th trial. As shown in Fig. 3B, there was a negative correlation such that the smaller the angle was (i.e., the more aligned), the larger the state value tended to be (r = −0.288, p = 0.00362), in line with our expectation. These results indicate that the feedback alignment mechanism, previously shown to work for supervised learning, also worked for TD learning of value weights and recurrent/feed-forward connections.
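For concreteness, the alignment can be quantified as follows (a trivial helper of our own, not taken from the authors' code):

```python
import numpy as np

def angle_deg(w, c):
    """Angle (in degrees) between the value-weight vector w and the fixed feedback vector c;
    90 degrees corresponds to chance-level (orthogonal) alignment."""
    cos = np.dot(w, c) / (np.linalg.norm(w) * np.linalg.norm(c))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
```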

Occurrence of feedback alignment and an intuitive understanding of its mechanism. (A) Over-trial changes in the angle between the value-weight vector w and the fixed random feedback vector c in the simulations of VRNNrf (7 RNN units). The solid line and the dashed lines indicate the mean and ± SD across 100 simulations, respectively. (B) The relation between the angle between w and c (horizontal axis) and the value of the pre-reward state (vertical axis) at the 1000-th trial. The dots indicate the results of individual simulations, and the line indicates the regression line. (C) Angle between the fixed random feedback vector c and the hypothetical change in x(t) = f(Ax(t−1), Bo(t−1)) that would occur if A and B were replaced with their updated versions, multiplied by the sign of TD-RPE (sign(δ(t))), across time-steps. The black thick line and the gray lines indicate the mean and ± SD across 100 simulations, respectively (same applied to (D)). (D) Multiplication of TD-RPEs in successive trials at individual states (top: cue, 4th from the top: reward). A positive or negative value indicates that TD-RPEs in successive trials had the same or different signs, respectively. (E) Left: RNN trajectories mapped onto the primary and secondary principal components (horizontal and vertical axes, respectively) in three successive trials (red, blue, and green lines (heavily overlapped)) at different phases in an example simulation (10th-12th, 300th-302nd, 600th-602nd, and 900th-902nd trials from top to bottom). The crosses and circles indicate the cue and reward states, respectively. Right: State values (black lines) and TD-RPEs (red lines) at the 11th, 301st, 601st, and 901st trial.

How did the feedback alignment mechanistically occur? We attempted to obtain an intuitive understanding. Assume that a positive TD-RPE (δ(t) > 0) is generated at a state, S (= x(t)), in a task trial. Because of the update rule for w (w ← w + aδ(t)x(t)), w is updated in the direction of x(t). Next, what is the effect of the updates of the recurrent/feed-forward connections (A and B) on x? For simplicity, here we consider the case where observation is null (o = 0) and so x(t) = f(Ax(t−1)) holds (a similar argument can be made when observation is not null). If A is replaced with its updated version, it can be calculated that the i-th element of Ax(t−1) will hypothetically change by ci × (a positive value) (technical note: the value is aδ(t){Σj xj(t−1)²}(0.5 + xi(t))(0.5 − xi(t)), which is positive unless x(t−1) = 0), and therefore the vector Ax(t−1) as a whole will hypothetically change by a vector that is in a relatively close angle with c (in the sense that, for example, [c1 c2 c3]T and [0.5c1 1.2c2 0.8c3]T are in a relatively close angle, in the same quadrant). Then, because f is a monotonically increasing sigmoidal function, x(t) = f(Ax(t−1)) will also hypothetically change by a vector that is in a relatively close angle with c. This was indeed the case in our simulations, as shown in Fig. 3C.
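This claim can be checked numerically, as in the sketch below (our own code; the update rule for A is the VRNNrf rule given in the Methods, and the values of a and δ(t) are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, a, delta = 7, 0.1, 0.5                      # RNN size, learning rate, an assumed positive TD-RPE
f = lambda u: 1.0 / (1.0 + np.exp(-u)) - 0.5   # sigmoid bounded in (-0.5, 0.5)

A = rng.standard_normal((n, n))
c = rng.standard_normal(n)                     # fixed random feedback
x_prev = rng.uniform(-0.5, 0.5, n)             # x(t-1)
x = f(A @ x_prev)                              # x(t) = f(A x(t-1)), null observation

# VRNNrf update of A: dA_ij = a * delta * c_i * (0.5 + x_i)(0.5 - x_i) * x_prev_j
dA = a * delta * np.outer(c * (0.5 + x) * (0.5 - x), x_prev)

# Hypothetical change in A x(t-1) if A were replaced with its updated version
d = (A + dA) @ x_prev - A @ x_prev
print(d / c)   # each element equals a*delta*(0.5+x_i)(0.5-x_i)*sum_j x_prev_j^2 > 0

cos = d @ c / (np.linalg.norm(d) * np.linalg.norm(c))
print(np.degrees(np.arccos(cos)))   # well below 90 degrees: the change is roughly aligned with c
```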

In this way, at state S where TD-RPE is positive, w is updated in the direction of x(t), and x(t) will hypothetically change by a vector that is in a relatively close angle with c if A is replaced with its updated one. Then, if the update of w and the hypothetical change in x(t) due to the update of A could be integrated, w would become aligned to c (if TD-RPE is instead negative, w is updated in the opposite direction of x(t), and x(t) will hypothetically change by a vector that is in a relatively close angle with −c, and so the same story holds in the end).

There is, however, a caveat regarding how the update of w and the hypothetical change in x(t) can be integrated. Although technical, we briefly describe it here, together with a possible resolution. The updates of w and A use TD-RPE, which is calculated based on v(t) = wT x(t) and v(t+1) = wT x(t+1), and so x(t) and x(t+1) should already be determined beforehand. Therefore, the hypothetical change in x(t) due to the update of A, described above, does not actually occur (this is why we called it 'hypothetical') and thus cannot be integrated with the update of w. Nevertheless, integration could still occur across successive trials, at least to a certain extent. Specifically, although TD-RPEs at S in successive trials would generally differ from each other, they would still tend to have the same sign, as was indeed the case in our simulations (Fig. 3D). Also, although the trajectories of RNN activity (x) in successive trials would differ, a certain level of similarity could be expected because the RNN is entrained by the observation-representing inputs, again as was indeed the case in our example simulation (Fig. 3E). Then, the hypothetical change in x(t) due to the update of A, considered above, could become a reality in the next trial, to a certain extent, and could thus be integrated into the update of w, explaining the occurrence of feedback alignment.

Simulation of tasks with probabilistic structures of reward timing/existence

We also simulated two tasks (Fig. 4A) that were qualitatively similar to (though simpler than) the two tasks examined in previous experiments 1 and modeled by the original value-RNN with backprop 2. In our task 1, a cue was always followed by a reward either two or four time-steps later with equal probabilities. Task 2 was the same as task 1 except that reward was omitted with 40% probability. In task 1, if reward was not given at the early timing (i.e., two steps after the cue), the agent could predict that reward would be given at the late timing (i.e., four steps after the cue), and thus TD-RPE upon reward at the late timing is expected to be smaller than TD-RPE upon reward at the early timing (if the agent perfectly learned the task structure, TD-RPE upon reward at the late timing should be 0). By contrast, in task 2, if reward was not given at the early timing, this might indicate that reward would be given at the late timing, but might instead indicate that reward was omitted in that trial; thus TD-RPE upon reward at the late timing is expected to exist and can even be larger than TD-RPE upon reward at the early timing.

Simulation of two tasks having probabilistic structures, which were qualitatively similar to the two tasks examined in experiments 1 and modeled by value-RNN 2. (A) The two simulated tasks, in which reward was given at the early or the late timing with equal probabilities in all the trials (task 1) or 60% of trials (task 2). (B) (a) Top: Trial types. Two trial types (with early reward and with late reward) in task 1 and three trial types (with early reward, with late reward, and without reward) in task 2. Bottom: Value (expected discounted cumulative future rewards) of each timing in each trial type. (b) Agent's probabilistic belief about the current trial type, in the case where the agent was in fact in the trial with early reward (top row), the trial with late reward (second row), or the trial without reward (third row in task 2). (c) Top: States defined by considering the probabilistic beliefs at each timing from cue. Bottom: State values (expected discounted cumulative future rewards, estimated through simulations), which should theoretically match an integration (multiplication) of the values of each trial type (shown in (a)-bottom) with the probabilistic beliefs (shown in (b)). (C) Expected TD-RPE calculated from the estimated true values of the states for task 1 (left) and task 2 (right). Red lines: case where reward was given at the early timing, blue lines: case where reward was given at the late timing. (D-H) TD-RPEs at the latest trial within 1000 trials in which reward was given at the early timing (red lines) or the late timing (blue lines), averaged across 100 simulations (error-bars indicating ± SEM across simulations), in the different types of agent: (D,E) TD-RL agent having punctate/CSC state representation and state values without (D) or with (E) continuation between trials; (F) VRNNbp. The number of RNN units was 12 (same applied to (G,H)); (G) VRNNrf; (H) Untrained RNN.

In these tasks, states can be defined in the following way. There were two types of trials, with early or late reward, in task 1, and additionally one more type of trial, without reward, in task 2 (Fig. 4Ba, top). For each timing in each of these trial types, its value, i.e., expected discounted cumulative future rewards, can be estimated through simulations (Fig. 4Ba, bottom). The agent could not know the current trial type until receiving reward at the early timing or the late timing, or receiving no reward at both timings. Until these timings, the agent could have a probabilistic belief about the current trial type, e.g., 50% in the trial with early reward and 50% in the trial with late reward (in task 1), or 30% in the trial with early reward, 30% in the trial with late reward, and 40% in the trial without reward (in task 2) (Fig. 4Bb). States can be defined by incorporating these probabilistic beliefs at each timing (Fig. 4Bc, top), and state values (Fig. 4Bc, bottom: expected discounted cumulative future rewards, estimated through simulations) should theoretically match an integration (multiplication) of the values of each trial type (Fig. 4Ba, bottom) with the probabilistic beliefs (Fig. 4Bb). Expected TD-RPE calculated from these estimated state values (Fig. 4C) exhibited features that matched the conjecture mentioned above: in task 1, TD-RPE upon reception of late reward, which was actually 0, was smaller than TD-RPE upon reception of early reward, whereas in task 2, TD-RPE upon reception of late reward was larger than TD-RPE upon reception of early reward.
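The qualitative pattern can be reproduced with a simplified, within-trial belief-state calculation (our own sketch; unlike the values in Fig. 4B, which were estimated by simulation and include continuation across trials, rewards in subsequent trials are ignored here, and the TD-RPE indexing follows the convention assumed in the Methods):

```python
gamma = 0.8   # time discount factor (Methods); reward size is 1

def rpes(p_early, p_late):
    """Within-trial TD-RPEs upon early vs. late reward under belief-state values."""
    p_none = 1.0 - p_early - p_late
    # value of the state one step before the early-reward timing
    v_pre_early = p_early * 1.0 + p_late * gamma**2
    rpe_early = 1.0 - v_pre_early                      # reward delivered; post-reward value within the trial taken as 0
    # no reward at the early timing: update the belief, then wait for the late timing
    p_late_given_no_early = p_late / (p_late + p_none)
    v_pre_late = p_late_given_no_early * 1.0           # one step before the late timing
    rpe_late = 1.0 - v_pre_late                        # reward delivered at the late timing
    return round(rpe_early, 3), round(rpe_late, 3)

print(rpes(0.5, 0.5))   # task 1: (0.18, 0.0)    -> late-reward RPE is zero, smaller than early
print(rpes(0.3, 0.3))   # task 2: (0.508, 0.571) -> late-reward RPE exceeds early-reward RPE
```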

The previous experimental work 1 showed that VTA DA neurons exhibited activity patterns similar to the abovementioned TD-RPE patterns, and the theoretical work 2 showed that the original value-RNN with backprop could reproduce such TD-RPE patterns. We examined what TD-RPE patterns were generated by the agents with punctate/CSC representation, VRNNbp, VRNNrf, and untrained RNN (12 RNN units in all cases) in our two simulated tasks. VRNNbp developed similar TD-RPE patterns (smaller TD-RPE upon reward at the late than at the early timing in task 1 but the opposite pattern in task 2) (Fig. 4F), qualitatively reproducing the result of the previous work 2. Crucially, VRNNrf also developed similar TD-RPE patterns (Fig. 4G), indicating that this agent with random feedback could also learn the distinct structures of the two tasks. By contrast, the agents with punctate/CSC state representation without or with continuous value update across trials (Fig. 4D,E), as well as the agent with untrained RNN (Fig. 4H), could not develop such patterns well.

Value-RNN with further biological constraints

So far, the activities of neurons in the RNN (x) were initialized to pseudo standard normal random numbers, and thereafter took values in the range between −0.5 and 0.5, the range of the sigmoidal input-output function. The value weights (w) could also take both positive and negative values since no constraint was imposed. The fixed random feedback in VRNNrf (c) was generated by pseudo standard normal random numbers, and so could also be positive or negative. Negative neural activities and value weights could potentially be interpreted as inhibition or as smaller-than-baseline quantities. However, because neuronal firing rates are non-negative and cortico-striatal projections are excitatory, it would be biologically more plausible to assume that the activities of neurons in the RNN and the value weights are non-negative. As for the fixed random feedback, if it is negative, the update rule becomes anti-Hebbian under positive TD-RPE, and so assuming non-negativity would be plausible since a Hebbian property has been suggested for rapid plasticity of cortical synapses 47. There was another issue in the update rule for the recurrent and feed-forward connections, derived from the gradient descent: its dependence on the post-synaptic activity was non-monotonic, maximized at the middle of the activity range. It would be more biologically plausible to assume a monotonic dependence.

In order to address these issues, we considered revised models. We first considered a revised VRNNbp, referred to as revVRNNbp, in which the RNN activities and the value weights were constrained to be non-negative, while the non-monotonic dependence of the update rule on the post-synaptic activity remained unchanged (Fig. 5A). We then considered a revised VRNNrf, referred to as bioVRNNrf, in which the fixed random feedback, as well as the RNN activities and the value weights, were constrained to be non-negative, and also the update rule was modified so that the dependence on the post-synaptic activity became monotonic (with saturation) (Fig. 5B).

Modified value-RNN models with further biological constraints. (A) RevVRNNbp: VRNNbp (value-RNN with backprop) was modified so that the activities of neurons in the RNN (x) and the value weights (w) became non-negative. (B) BioVRNNrf: VRNNrf (value-RNN with fixed random feedback) was modified so that x and w, as well as the fixed random feedback (c), became non-negative, and also so that the dependence of the update rules for the recurrent/feed-forward connections (A and B) on post-synaptic activity became monotonic with saturation.

We examined how these revised models, in comparison with an untrained RNN that also had the non-negative constraint for x and w, performed in the Pavlovian cue-reward association task examined above (the numbers of RNN units and trials were set to 12 and 1500, respectively). RevVRNNbp developed state values that increased toward reward (Fig. 6A). BioVRNNrf also developed state values to a largely comparable extent (Fig. 6B). By contrast, the untrained RNN could not develop such a pattern of state values (Fig. 6C). This, however, could be because the initially set recurrent/feed-forward connections were far from those learned in the value-RNNs. Therefore, as a stricter control, we conducted simulations of an untrained RNN with non-negative x and w, in which in each simulation the recurrent/feed-forward connections were set to a shuffled version of the connections learnt in a simulation of bioVRNNrf. The untrained RNN with this setting performed somewhat better than the original untrained RNN (Fig. 6D), but still worse than revVRNNbp and bioVRNNrf. We varied the number of RNN units, and compared the performance (sum of squared errors from the true state values) of revVRNNbp and bioVRNNrf with that of the untrained RNN (both the naive one and the one with shuffled learnt connections from bioVRNNrf). As shown in Fig. 6E, regardless of the number of RNN units, the performance of bioVRNNrf was largely comparable to that of revVRNNbp, and better than the performance of both kinds of untrained RNN. Figure 6F shows the mean of the elements of the recurrent and feed-forward connections at the 1500-th trial in the different models. As shown in the figure, these connections (initialized to pseudo standard normal random numbers) came, through learning, to be negative on average, in revVRNNbp and more prominently in bioVRNNrf. This learnt negative (inhibition) dominance could possibly be related, e.g., through prevention of excessive activity, to the good performance of bioVRNNrf, and also to the better performance of the untrained RNN with connections shuffled from bioVRNNrf compared with the naive untrained RNN.

Performances of the modified value-RNN models in the cue-reward association task, in comparison with untrained RNN that also had the non-negative constraint. (A-D) State values (black lines) and TD-RPEs (red lines) at 1500-th trial in revVRNNbp (A), bioVRNNrf (B), untrained RNN with x and w constrained to be non-negative (C), and untrained RNN with non-negative x and w and having connections shuffled from those learnt in bioVRNNrf (D). The number of RNN units was 12 in all the cases. Error-bars indicate mean ± SEM across 100 simulations (same applied to (E,F)). The right histograms show the across-simulation distribution of the value of the pre-reward state in each model. The vertical black dashed lines in the histograms indicate the true value of the pre-reward state (estimated through simulations). (E) Learning performance of revVRNNbp (red line), bioVRNNrf (blue line), untrained RNN (gray solid line: partly out of view), and untrained RNN with connections shuffled from those learnt in bioVRNNrf (gray dotted line) when the number of RNN units was varied from 5 to 40 (horizontal axis). Learning performance was measured by the sum of squares of differences between the state values developed at 1500-th trial by each of these four types of agent and the estimated true state values between cue and reward (vertical axis). (F) Mean of the elements of the recurrent and feed-forward connections (at 1500-th trial) of revVRNNbp (red line), bioVRNNrf (blue line), and untrained RNN (gray solid line).

We examined how the angle between the value weights (w) and the random feedback (c) changed across trials in bioVRNNrf. As shown in Fig. 7A, the angle was on average smaller than the chance-level angle (90°) from the beginning, while there was no further alignment over trials. This could be understood as follows. Because both the value weights (w) and the random feedback (c) were now constrained to be non-negative, these two vectors were ensured to be in a relatively close angle (i.e., in the same quadrant) from the beginning. By virtue of this loose alignment, the random feedback could act similarly to backprop-derived proper feedback, even without further alignment. We examined whether the angle between w and c at the 1500-th trial was associated with the developed value of the pre-reward state across simulations, but found no association (r = 0.0117, p = 0.908) (Fig. 7B). We then examined whether the w-c angle at earlier trials (2nd - 500-th trials) was associated with the developed values at the 500-th trial, with the number of simulations increased to 1000 so that a small correlation could be detected. We found that the w-c angle at initial trials (2nd - around 10-th trials) was negatively correlated with the developed values of the reward state and the preceding states at the 500-th trial (Fig. 7C). For the reward state, a negative correlation at around the 100-th - 300-th trial was also observed. These results suggest that better alignment of w and c at initial and early timings was associated with better development of state values, in line with the conjecture that the loose alignment of w and c coming from the non-negative constraint supported learning. It should be noted, however, that there were cases where a positive (although small) correlation was observed. The exact reason for this is unclear, but it could be related to the fact that large developed values, or fast value development, do not necessarily indicate good learning.
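The chance-level and constrained angles can be illustrated numerically (our own sketch; the sampling distributions are assumptions chosen to match the initializations described in the Methods):

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_pairs = 12, 10000

def mean_angle(sampler):
    """Mean angle (degrees) between pairs of randomly sampled n-dimensional vectors."""
    a, b = sampler((n_pairs, n)), sampler((n_pairs, n))
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean()

print(mean_angle(rng.standard_normal))              # ~90 deg: unconstrained vectors (chance level)
print(mean_angle(lambda s: rng.uniform(0, 1, s)))   # ~40 deg: non-negative vectors start loosely aligned
```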

Loose alignment of the value weights (w) and the random feedback (c) in bioVRNNrf (with 12 RNN units), and its relation to the developed state values. (A) Over-trial changes in the angle between the value weights w and the fixed random feedback c. The solid line and the dashed lines indicate the mean and ± SD across 100 simulations, respectively. (B) Relation between the w-c angle (horizontal axis) and the value of the pre-reward state (vertical axis) at 1500-th trial. The dots indicate the results of individual simulations. (C) Correlation between the w-c angle at the k-th trial (horizontal axis) and the value of the cue, post-cue, pre-reward, or reward state (top-bottom panels) at 500-th trial across 1000 simulations. The solid lines indicate the correlation coefficient, and the short vertical bars at the top of each panel indicate the cases in which the p-value was less than 0.05.

We further examined how the revised value-RNN models performed in the two tasks with probabilistic structures examined above. Since the revised value-RNN models with 12 RNN units appeared unable to produce the different patterns of TD-RPEs in the two tasks (TD-RPE at early reward > TD-RPE at late reward in task 1 and the opposite pattern in task 2), we increased the number of RNN units to 20. Then, both revVRNNbp and bioVRNNrf produced such TD-RPE patterns (Fig. 8A,B), whereas untrained RNNs of both kinds (naive, and with connections shuffled from bioVRNNrf) could not (Fig. 8C,D). This indicates that the value-RNN with random feedback and further biological constraints could learn the differential characteristics of the tasks.

Performances of the modified value-RNN models in the two tasks having probabilistic structures, in comparison with untrained RNN having the non-negative constraint. TD-RPEs at the latest trial within 2000 trials in which reward was given at the early timing (red lines) or the late timing (blue lines) in task 1 (left) and task 2 (right), averaged across 100 simulations (error-bars indicating ± SEM across simulations), are shown for the four types of agent: (A) revVRNNbp; (B) bioVRNNrf; (C) untrained RNN with non-negative x and w; (D) untrained RNN with non-negative x and w and having connections shuffled from those learnt in bioVRNNrf. The number of RNN units was 20 for all the cases.

Discussion

We have shown that state representation and value can be learned in the RNN and its downstream by using random feedback instead of the backprop-derived, biologically unavailable downstream weights. In the model without the non-negative constraint, feedback alignment, previously shown for supervised learning, occurred, and we have presented an intuitive understanding of its mechanism. In the model with the non-negative constraint, loose alignment existed from the beginning because of the constraint, and it appeared to support learning. Below we discuss the implementation of the value-RNN with random feedback, pointing to a crucial role of DA outside of the striatum, and also the heterogeneity of DA signals. We further discuss limitations, relations to other proposals and suggestions, and future perspectives.

Implementation of the value-RNN with random feedback, featuring a role of DA outside of the striatum

DA neurons in the midbrain project not only to the striatum but also to the cortex, including the prefrontal cortex 48 and the hippocampus 49. Previous studies demonstrated a crucial role of prefrontal DA in working memory 50, 51, presumably through effects on synaptic/ionic conductances 52, 53. Roles of prefrontal DA in behavioral flexibility or decision making have also been suggested 54. Moreover, a role of hippocampal DA in the modulation of aversive memory formation has been demonstrated 55. However, although i) there has been increasing evidence that DA represents TD-RPE 56, ii) human fMRI experiments found TD-RPE correlates in cortical regions 57, iii) DAergic modulation or initiation of plasticity in the prefrontal cortex 58 or the hippocampus 59 has been demonstrated, and iv) lesion or inactivation of prefrontal or hippocampal regions was found to disrupt DA's encoding of RPE reflecting appropriate state representation 60-62, what computational role in RL is played by TD-RPE-representing DA in the cortex remains to be clarified. This lags behind the case of the striatum, where it has been widely considered that DAergic modulation of cortico-striatal synaptic weights implements TD-RPE-based updates of state/action values 8, 63.

The value-RNN with fixed random feedback and biological constraints considered in the present work suggests the possibility that TD-RPE-representing DA modulates plasticity of the RNN in the cortex so that state representation can be learnt. Different from the original value-RNN with backprop 2, 13, the update of intra-cortical connections does not require the downstream cortico-striatal weights but requires only a non-negative fixed random feedback strength specific to each post-synaptic neuron. The non-negativity was assumed so that the update rule became Hebbian under positive TD-RPE, since a Hebbian property has been suggested for rapid plasticity of cortical synapses 47. The fixed randomness would naturally be achieved by intrinsic heterogeneity of neurons. The successful learning performance of our model thus indicates that DA-dependent modulation of Hebbian plasticity of cortical excitatory connections can serve for learning of state representations that capture task structure.

VTA DA neurons also project to regions other than the striatum and cortex, including the basolateral amygdala (BLA) 64, and DA has been suggested to regulate plasticity also in the BLA 65. Recent work 66 demonstrated that VTA→BLA DA exhibited properties of TD-RPE, although it increased rather than decreased upon aversive events, and was not itself reinforcing but was necessary and sufficient for the formation of an environmental model. The BLA has recurrent connections 67, projects to the striatum 68, 69, and engages in abstract context representation together with the prefrontal cortex 70. Thus, given that environmental relationships needed for goal-directed behavior could be embedded in state representation 11-13, it seems possible that a mechanism partly akin to the learning of state representation (but not value) in the RNN of our model takes place in the BLA. It remains open, however, whether and how such sophisticated representation can be learned. It might require multidimensional errors 71 beyond TD-RPE, and/or multi-compartment units 26, both of which we further discuss below.

DA’s encoding of TD-RPE and other variables

Many results have suggested heterogeneity of DA signals. Recent work 72 suggested that different origins of this heterogeneity (co)exist: (i) heterogeneity of the learning target (reward or other), (ii) heterogeneity of state features, and (iii) others, such as ramping patterns. (i) is typically observed in DA neurons projecting to different regions, which can represent prediction errors of things other than reward. In contrast, (ii) applies to DA neurons projecting to the same region: even though individual DA neurons show heterogeneous responses, the resulting merged DA signal still represents a scalar error such as TD-RPE.

Referring to a result 73 of type (i) and to the fact that DA neurons receive inputs from the cerebellum 74, 75, which supposedly implements supervised learning 76, a recent modeling work 43 proposed that DA neurons convey vector-valued error signals, which are used for supervised learning of actions in continuous space. That work showed that learning occurred without adjustment of DA projection strengths because the feedback alignment mechanism worked. In contrast, in the present work, we assumed a scalar TD-RPE, which can be consistent with type (ii) heterogeneity of DA signals. We have shown that the feedback alignment mechanism works also for RL, and moreover, that learning could also occur by virtue of the loose alignment coming from the biological constraints, even without the operation of feedback alignment. Notably, the previous model 43 and our model can coexist, given that different DA neuronal populations may encode vector-valued error and TD-RPE, or even the same single DA neuron might represent both errors depending on the context, reflecting which inputs are active.

Limitations and possible reasons

We have shown that state representation and value could be learned in the value-RNN with fixed random feedback, with a relatively small number of simple RNN units and observation inputs, in simple simulated tasks. These simplicities enabled us to derive an intuitive understanding of how the feedback alignment could occur. However, in our models without the non-negativity constraint, the performance initially improved as the number of RNN units increased, but then degraded when the number of units increased beyond around 25. In contrast, in the original value-RNN with backprop 2, 13, the ability to develop belief-state-like representation was reported to improve as the number of RNN units increased to 100 or 50.

There are several possible reasons for this difference. First, as a performance measure we used the sum of squared errors between the values developed by the value-RNN and the values estimated according to the definition (expected discounted cumulative future rewards) between cue and reward, whereas the previous studies focused on the similarity between the representation developed by the value-RNN and handcrafted belief states. Second, there was a difference in the weight-update method. Specifically, as mentioned in the Results, the previous studies used BPTT 44, which considers the recursive influence of the recurrent weights in a way that lacks causality, whereas our models used an online learning rule, which considers only the influence of the recurrent weights at the previous time step.

Last but not least, there was a difference in the RNN unit. Specifically, we used a simple sigmoidal function, whereas the previous studies used the 'Gated Recurrent Unit (GRU) cell' 77. RNNs with simple nonlinear units are known to suffer from the 'vanishing gradient problem' 78: through repetitive learning, the gradient of the loss function becomes so small that the update becomes negligible. This issue can be alleviated by using an RNN unit having a memory, such as the Long Short-Term Memory (LSTM) unit 79. The GRU cell was suggested to have a memory function similar to the LSTM unit 77. We focused on resolving the biological implausibility of backprop, and stuck to the simple sigmoidal unit. However, gated units similar to the LSTM unit have actually been proposed to be implemented in cortical microcircuits 80, and incorporating features of real neurons into the value-RNN could enhance its computational power, as we discuss below.

Biological details and future perspectives

Our RNN unit did not incorporate neuronal spiking and its effects on plasticity 38, 81, 82, or neuronal morphology with nonlinear dendritic computations 41, 83, 84. Importantly, recent studies suggest that dendritic mechanisms 34, 35, potentially together with burst-dependent plasticity 38, 39, can realize credit assignment without backprop in supervised learning, and also in unsupervised learning 85, 86. Dendritic mechanisms have their own specific features, or constraints, so incorporating them is different from simply increasing the number of layers of a neural network, and it has been argued 41 that adding such biological constraints enables learning in deep neural networks. Moreover, a recent model of the hippocampus 26 has shown that a network of multi-compartment units could learn complex representations. Given these, it would be interesting to explore whether incorporating biological details into the RNN unit can improve the performance of the value-RNN.

A different alternative to backprop is the Associative Reward-Penalty (AR-P) algorithm 87-89, in which the hidden units behave stochastically, and thereby the gradient can be estimated, in effect, through stochastic sampling without explicit information about the downstream weights. More recent work 90 demonstrated that noise-induced learning of back projections could achieve better alignment and performance compared with fixed random feedback in a feed-forward network. These mechanisms could be biologically implemented, because neurons and neural networks can exhibit noisy or chaotic behavior 91-93, and could potentially improve the performance of the value-RNN.

Regarding connectivity, in our models the recurrent/feed-forward connections could take both positive and negative values. This could be justified because there are both excitatory and inhibitory connections in the cortex, and the net connection sign between two units can be positive or negative depending on whether excitation or inhibition dominates. However, recent studies have shown that feed-forward and recurrent neural networks conforming to Dale's law can perform well depending on the architecture, initialization, and update rules 94, 95. Integration of these models and ours, also with other connectivity features 96, may be a fruitful direction.

More specific to the cortico-basal ganglia circuit, the existence of D1/D2 DA receptors and of the D1-direct and D2-indirect basal ganglia pathways 97-100, as well as distinct cortical areas and cell types 101-104, was also not incorporated. Furthermore, the circuit/synaptic mechanisms of how TD-RPE is calculated in DA neurons (c.f., 105, 106) and/or how it can be learned (c.f., 107) were left unspecified. Future studies are expected to incorporate these factors.

Methods

Value-RNN with backprop (VRNNbp)

We constructed a value-RNN model based on the previous proposals 2, 13 but with several differences. We assumed that the activities of neurons in the RNN at time t+1 were determined by the activities of these neurons and neurons representing observation (cue, reward, or nothing) at time t:

x(t+1) = f(Ax(t) + Bo(t)),

where f was an element-wise sigmoidal function, f(u) = 1/(1 + exp(−u)) − 0.5, whose outputs ranged between −0.5 and 0.5. The state value was estimated as a weighted sum of the RNN activities:

v(t) = wT x(t) = Σj wj xj(t),

where

w = (w1, ..., wn)T

were the value weights. The error between this estimated value and the true value, vtrue(t), was defined as:

ε(t) = vtrue(t) − v(t).

Parameters wj, Aij, and Bik that minimize the squared error ε(t)² could be found by a gradient descent / error-backpropagation (backprop) method, i.e., by updating them in the directions of −∂(ε(t)²)/∂wj, −∂(ε(t)²)/∂Aij, and −∂(ε(t)²)/∂Bik. −∂(ε(t)²)/∂wj was calculated as follows:

−∂(ε(t)²)/∂wj = 2ε(t) ∂v(t)/∂wj
= 2ε(t) xj(t)
≈ 2δ(t) xj(t)

In the last line, since ε(t) was unavailable as vtrue(t) was unknown, it was approximated by the TD-RPE:

δ(t) = r(t+1) + γv(t+1) − v(t),

where r(t+1) was the reward obtained at the next time step and γ was the time discount factor.

−∂(ε(t)²)/∂Aij was calculated as follows, considering (as an online approximation) only the influence of Aij at the previous time step:

−∂(ε(t)²)/∂Aij = 2ε(t) ∂v(t)/∂Aij
= 2ε(t) wi ∂xi(t)/∂Aij
= 2ε(t) wi (0.5 + xi(t))(0.5 − xi(t)) xj(t−1)
≈ 2δ(t) wi (0.5 + xi(t))(0.5 − xi(t)) xj(t−1)

Similarly, −∂(ε(t)²)/∂Bik was calculated as follows:

−∂(ε(t)²)/∂Bik ≈ 2δ(t) wi (0.5 + xi(t))(0.5 − xi(t)) ok(t−1)

According to these, the update rule for the value-RNN was determined as follows:

wj ← wj + aδ(t) xj(t)
Aij ← Aij + aδ(t) wi (0.5 + xi(t))(0.5 − xi(t)) xj(t−1)
Bik ← Bik + aδ(t) wi (0.5 + xi(t))(0.5 − xi(t)) ok(t−1)

where a was the learning rate. In each simulation, the elements of A and B, as well as the elements of x, were initialized to pseudo standard normal random numbers, and the elements of w were initialized to 0.
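For illustration, the following sketch (in Python; not the authors' MATLAB code, and using the reward-indexing convention assumed above) implements one online update step of VRNNbp:

```python
import numpy as np

def f(u):
    """Element-wise sigmoid bounded between -0.5 and 0.5."""
    return 1.0 / (1.0 + np.exp(-u)) - 0.5

def vrnnbp_step(A, B, w, x_prev, o_prev, o, r_next, a=0.1, gamma=0.8):
    """One online VRNNbp update. x_prev and o_prev are x(t-1) and o(t-1); o and r_next are
    the observation and reward at the next time step. Returns updated parameters, x(t), and delta(t)."""
    x = f(A @ x_prev + B @ o_prev)                # x(t)
    x_next = f(A @ x + B @ o)                     # x(t+1), needed for the TD-RPE
    delta = r_next + gamma * (w @ x_next) - (w @ x)
    g = (0.5 + x) * (0.5 - x)                     # sigmoid derivative evaluated at x(t)
    A = A + a * delta * np.outer(w * g, x_prev)   # backprop feedback: uses the value weights w
    B = B + a * delta * np.outer(w * g, o_prev)
    w = w + a * delta * x                         # cortico-striatal (value-weight) update
    return A, B, w, x, delta
```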

Value-RNN with fixed random feedback (VRNNrf)

We considered an implementation of the value-RNN described above in the cortico-basal ganglia-DA system (Fig. 1):

x : activities of neurons in a cortical region with rich recurrent connections

A : recurrent connection strengths among x

o : activities of neurons in a cortical region processing sensory inputs

B : feed-forward connection strengths from o to x

f : sigmoidal relationship from the input to the output of the cortical neurons

w : connection strengths from cortical neurons x to a group of striatal neurons

v : activity of the group of striatal neurons

δ : activity of a group of DA neurons / released DA

The update rule for w,

wj ← wj + aδ(t) xj(t),

could be naturally implemented as cortico-striatal synaptic plasticity, which depends on DA (δ(t)) and pre-synaptic (cortical) neuronal activity (xj(t)). However, an issue emerged in implementation of the update rules for A and B:

Aij ← Aij + aδ(t) wi (0.5 + xi(t))(0.5 − xi(t)) xj(t−1)
Bik ← Bik + aδ(t) wi (0.5 + xi(t))(0.5 − xi(t)) ok(t−1)

Specifically, wi, which appears in these update rules (for the strengths of the cortico-cortical synapses Aij and Bik), is the connection strength from cortical neuron xi to the striatal neurons, i.e., the strength of cortico-striatal synapses (located within the striatum), which is considered to be unavailable at the cortico-cortical synapses (located within the cortex).

As mentioned in the Introduction, this is an example of the long-standing difficulty in biological implementation of backprop, and recently a potential solution for this difficulty, i.e., replacement of the downstream connection strengths in the update rule for upstream connections with fixed random strengths, has been demonstrated in supervised learning of feed-forward and recurrent networks 33, 42, 43. The value-RNN, which we considered here, differed from supervised learning considered in these previous studies in two ways: i) it was TD learning, apparent in the approximation of the true error ε(t) by the TD-RPE δ(t) in the derivation described above, and ii) it used a scalar error (TD-RPE) rather than a vector error. But we expected that the feedback alignment mechanism could still work at least to some extent, and explored it in this study. Specifically, we examined a modified value-RNN with fixed random feedback (VRNNrf), in which the update rules for A and B were modified as follows:

Aij ← Aij + aδ(t) ci (0.5 + xi(t))(0.5 − xi(t)) xj(t−1)
Bik ← Bik + aδ(t) ci (0.5 + xi(t))(0.5 − xi(t)) ok(t−1)

where wi in the update rules of the value-RNN with backprop (VRNNbp) was replaced with a fixed random parameter ci. Notably, these modified update rules for the cortico-cortical connections A and B required only pre-synaptic activities (xj(t−1), ok(t−1)), post-synaptic activities (xi(t)), TD-RPE-representing DA (δ(t)), and fixed random strengths (ci), which would all be available at the cortico-cortical synapses given that VTA DA neurons project not only to the striatum but also to the cortex and random ci could be provided by intrinsic heterogeneity. In each simulation, the elements of c were initialized to pseudo standard normal random numbers.
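Continuing the sketch given for VRNNbp above (our own illustrative code), the only change in VRNNrf is the feedback term in the updates of A and B:

```python
import numpy as np

def vrnnrf_update_AB(A, B, c, x_prev, o_prev, x, delta, a=0.1):
    """VRNNrf sketch: same as the VRNNbp updates of A and B, but the downstream value
    weights w are replaced by the fixed random feedback vector c (one element per RNN unit)."""
    g = (0.5 + x) * (0.5 - x)
    A = A + a * delta * np.outer(c * g, x_prev)
    B = B + a * delta * np.outer(c * g, o_prev)
    return A, B
```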

Revised value-RNN models with further biological constraints

In the later part of this study, we examined revised value-RNN models with further biological constraints. Specifically, we considered models in which the value weights and the activities of neurons in the RNN were constrained to be non-negative. In order to do so, the update rule for w was modified to:

wj ← max(wj + aδ(t) xj(t), 0),

where max(q1, q2) returned the maximum of q1 and q2. Also, the sigmoidal input-output function was replaced with a non-negative sigmoid (with outputs ranging between 0 and 1), and the elements of x were initialized to pseudo uniform [0 1] random numbers. The backprop-based update rules for A and B in VRNNbp were replaced with the corresponding gradient-based rules, in which the derivative of the new sigmoid took the place of the factor (0.5 + xi(t))(0.5 − xi(t)). We referred to the model with these modifications to VRNNbp as revVRNNbp.

As a revised value-RNN with fixed random feedback (VRNNrf), in addition to the abovementioned modifications of the update of w, the sigmoidal input-output function, and the initialization of x, the fixed random feedback c was assumed to be non-negative. Specifically, the elements of c were set to pseudo uniform [0 1] random numbers. Moreover, the update rules for A and B were replaced with modified versions, with ci in place of wi and with an altered dependence on post-synaptic activity, so that the originally non-monotonic dependence on xi(t) (post-synaptic activity) became monotonic with saturation (Fig. 5B). These update rules with non-negative ci could be said to be Hebbian with additional modulation by TD-RPE (Hebbian under positive TD-RPE). We referred to the model with these modifications to VRNNrf as bioVRNNrf.
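The following sketch illustrates the constrained updates in revVRNNbp and bioVRNNrf (our own code and our own assumptions: the non-negative sigmoid is taken to be the standard logistic, and the monotonic-saturating dependence on post-synaptic activity is a hypothetical placeholder, since the exact functions used in the model are not reproduced here):

```python
import numpy as np

def f_nonneg(u):
    """Assumed non-negative sigmoid with outputs between 0 and 1 (standard logistic)."""
    return 1.0 / (1.0 + np.exp(-u))

def revvrnnbp_updates(A, B, w, x_prev, o_prev, x, delta, a=0.1):
    """revVRNNbp sketch: backprop feedback (w), non-negative x via the sigmoid above,
    non-negative w via rectification; the post-synaptic factor x*(1-x) stays non-monotonic."""
    g = x * (1.0 - x)                              # derivative of the assumed logistic
    A = A + a * delta * np.outer(w * g, x_prev)
    B = B + a * delta * np.outer(w * g, o_prev)
    w = np.maximum(w + a * delta * x, 0.0)         # w_j <- max(w_j + a*delta*x_j, 0)
    return A, B, w

def biovrnnrf_updates(A, B, w, c, x_prev, o_prev, x, delta, a=0.1):
    """bioVRNNrf sketch: fixed non-negative random feedback c, and a monotonic-saturating
    dependence on post-synaptic activity (here a placeholder, min(x, 0.5))."""
    sat = np.minimum(x, 0.5)                       # hypothetical monotonic + saturating factor
    A = A + a * delta * np.outer(c * sat, x_prev)  # c >= 0: Hebbian under positive TD-RPE
    B = B + a * delta * np.outer(c * sat, o_prev)
    w = np.maximum(w + a * delta * x, 0.0)
    return A, B, w
```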

Simulation of the tasks

In the Pavlovian cue-reward association task, at time 1 of each trial, the cue observation was received by the RNN, and at time 4, the reward observation was received. Each trial ended pseudo-randomly at time 7, 8, 9, or 10, and the next trial started from the next time-step. Reward size was r = 1. The tasks with probabilistic structures (task 1 and task 2) were implemented in the same way except that the reward timing was not time 4 but time 3 or 5, with probabilities of 50% and 50% in task 1 and 30% and 30% in task 2, and there was no reward in the remaining 40% of trials in task 2.

The cue or reward state/timing, mentioned in the text and marked in the figures, was defined to be the timing when the RNN received the cue or reward observation, respectively. Specifically, if o(t) = (1 0)T or o(t) = (0 1)T at time t, t + 1 was defined to be a cue or reward timing, respectively. For the agents with punctate (CSC) representation, each timing in the tasks was represented by a 10-dimensional one-hot vector, starting from (1 0 0 … 0)T for the cue state, with the next state (0 1 0 … 0)T and so on.

Unless otherwise mentioned, parameters were set to the following values. Learning rate (a): 0.1 (normalization by the squared norm of feature vector was not implemented). Time discount factor (γ): 0.8.

Estimation of true state values

As for the Pavlovian cue-reward association task, we defined states by relative timings from the cue, and estimated their (true) state values by simulations according to the definition of state value. Specifically, we generated a sequence of cues and rewards corresponding to 1000 trials, and calculated cumulative discounted future rewards within the sequence:

Σt_rew γ^(t_rew − 1),

where t_rew denotes the time-step of each reward counted from the starting state, which was taken to be -2, -1, …, or +6 time steps from a cue. We repeated this 1000 times, generating 1000 sequences (i.e., 1000 simulations of 1000 trials), with different sets of pseudo-random numbers, and calculated an average over these 1000 sequences so as to estimate the expected cumulative discounted future rewards, i.e., the state value (by definition) for each state (−2, -1, …, and +6 time steps from cue). Using these estimated true state values, we calculated TD-RPE at each state (−2, -1, …, and +5 time steps from cue).
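A sketch of this Monte Carlo estimation (our own code; the discounting exponent, the exclusion of a reward delivered exactly at the starting state, and the choice of which trial's cue serves as the reference are conventions we assume here):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, n_seq, n_trials = 0.8, 1000, 1000
offsets = range(-2, 7)                        # states: -2 .. +6 time steps relative to a cue

def cue_reward_times(n_trials, rng):
    """Cue and reward times for one sequence: cue-to-reward = 3 steps, reward-to-next-cue = 4-7 steps."""
    t, cues, rewards = 0, [], []
    for _ in range(n_trials):
        cues.append(t)
        rewards.append(t + 3)
        t += 3 + rng.integers(4, 8)           # start of the next trial's cue
    return np.array(cues), np.array(rewards)

values = np.zeros(len(offsets))
for _ in range(n_seq):
    cues, rewards = cue_reward_times(n_trials, rng)
    for i, k in enumerate(offsets):
        t0 = cues[10] + k                     # a state in an early (but not the first) trial
        future = rewards[rewards > t0]        # rewards strictly after the starting state
        values[i] += np.sum(gamma ** (future - t0 - 1))
values /= n_seq
print(dict(zip(offsets, np.round(values, 3))))   # estimated true values of the -2..+6 states
```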

In a similar manner, we defined states and estimated true state values, and also calculated TD-RPE, for tasks 1 and 2, which had probabilistic structures. As for task 1, we defined the following states: -2, -1, …, and +2 time steps from cue (i.e., states visited (entered) before knowing whether reward was given at the early timing (= +2 time step from cue)), +3, 4, 5, and 6 time steps from cue after reception of reward at the early timing, and +3, 4, 5, and 6 time steps from cue after no reception of reward at the early timing (in total 5 + 4 + 4 = 13 states) (Fig. 4Bc, left-top). We generated 10000 sequences of cues and rewards corresponding to 1000 trials (i.e., 10000 simulations of 1000 trials), and for each state, calculated cumulative discounted future rewards within the sequence for each of the 10000 simulations and took an average to obtain the expected cumulative discounted future rewards (i.e., estimation of state value) (Fig. 4Bc, left-bottom). Using the estimated state values, we calculated TD-RPE (Fig. 4C, left).

As for task 2, we defined the following states: -2, -1, …, and +2 time steps from cue (i.e., states visited (entered) before knowing whether reward was given at the early timing), +3, 4, 5, and 6 time steps from cue after reception of reward at the early timing, +3 and 4 time steps from cue after no reception of reward at the early timing (states visited (entered) before knowing whether reward was given at the late timing (= +4 time step from cue)), +5 and 6 time steps from cue after reception of reward at the late timing, and +5 and 6 time steps from cue after no reception of reward at both early and late timings (in total 5 + 4 + 2 + 2 + 2 = 15 states) (Fig. 4Bc, right-top). We estimated the state values of these states (Fig. 4Bc, right-bottom), and also calculated TD-RPE (Fig. 4C, right), in a similar manner to the above.

Analyses, software, and code availability

SEM (standard error of the mean) was approximated by SD (standard deviation)/√N (N: number of samples). Linear regression and principal component analysis (PCA) were conducted by using R (functions lm and prcomp). Simulations were conducted by using MATLAB, and pseudo-random numbers were generated by using the rand, randn, and randperm functions. All the codes will be made available at GitHub upon publication of this work in a journal.

Author contributions

Conceptualization: KM; Formal analysis: KM, TT; Investigation: KM, TT, AyK; Writing – original draft: KM; Writing – review & editing: KM, TT, AyK, ArK

Acknowledgements

The authors thank Dr. Kenji Doya for valuable suggestions. KM was supported by Grants-in-Aid for Scientific Research 23H03295 and 23K27985 from Japan Society for the Promotion of Science (JSPS) and the Naito Foundation. AyK was supported by JSPS Overseas Research Fellowships. ArK was partially funded by Digital Futures (KTH) grant and StratNeuro SRA.