Abstract
How external/internal ‘state’ is represented in the brain is crucial, since appropriate representation enables goal-directed behavior. Recent studies suggest that state representation and state value can be simultaneously learnt through reinforcement learning (RL) using reward prediction error in a recurrent neural network (RNN) and its downstream weights. However, how such learning can be neurally implemented remains unclear because training the RNN through the ‘backpropagation’ method requires the downstream weights, which are biologically unavailable at the upstream RNN. Here we show that training the RNN using random feedback instead of the downstream weights still works because of ‘feedback alignment’, which was originally demonstrated for supervised learning. We further show that if the downstream weights and the random feedback are biologically constrained to be non-negative, learning still occurs without feedback alignment because the non-negative constraint ensures loose alignment. These results suggest neural mechanisms for RL of state representation/value and the power of random feedback and biological constraints.
Introduction
Multiple lines of studies have suggested that Temporal-Difference Reinforcement Learning (TDRL) is implemented in the cortico-basal ganglia-dopamine (DA) circuits, such that DA represents TD reward-prediction-error (RPE) 3–7 and DA-dependent plasticity of cortico-striatal synapses represents TD-RPE-dependent update of state/action values 8–10. Traditionally, TDRL in the cortico-basal ganglia-DA circuits was considered to serve only for relatively simple behavior. However, subsequent studies suggested that more sophisticated, apparently goal-directed/model-based behavior can also be achieved by TDRL if states are appropriately represented 11–13 and that DA signals indeed reflect model-based predictions 14, 15. Conversely, state-representation-related issues could potentially cause behavioral or mental-health problems 16–20. Early modeling studies treated state representations appropriate to the situation/task as given (’handcrafted’ by the authors), but representation itself should be learnt in the brain 21–26. Recently, it was shown that appropriate state representation can be learnt through RL in a recurrent neural network (RNN) by minimization of squared value-error without explicit teacher/target 2, 13, while state value can be simultaneously learnt in the downstream of the RNN.
However, whether such a learning method, named the value-RNN 2, can be implemented in the brain remains unclear, because there are problems in terms of biological plausibility. A major problem, among others, is that the update rule proposed in the previous work for the connections onto the ’neurons’ in the RNN 2, derived from the gradient-descent error-’backpropagation’ (hereafter referred to as backprop) method 27, 28, involves the weights of the connections from these RNN units onto the downstream value-encoding unit. Given that the state-representing RNN and the value-encoding unit are implemented by the intra-cortical circuit and the striatal neurons, respectively, as generally suggested 3, 29, 30, this means that the update (plasticity) rule for intra-cortical connections involves the downstream cortico-striatal synaptic strengths, which cannot be accessed from the cortex. Indeed, this is an example of the long-standing difficulty in biological implementation of backprop 31, 32, in which update of upstream connections requires biologically unavailable downstream connection strengths.
Recently, a potential solution for this difficulty has been proposed 33 (see also 34–41 for other potential solutions). Specifically, in supervised learning of feed-forward networks, it was shown that comparable learning performance was still achieved when the downstream connection strengths used for updating upstream connections in backprop were replaced with fixed random strengths 33. This was suggested to occur because the information of the introduced fixed random strengths transferred, through learning, to the upstream connections and then to the downstream feed-forward connections, so that these feed-forward connections became aligned to the random feedback strengths; in turn, the random feedback could then play the same role as that played by the downstream connection strengths in backprop. This mechanism was named ’feedback alignment’ 33, and was subsequently shown to work also in supervised learning of RNNs 42 and proposed to be neurally implemented 43 (in a different way from the present study, as we discuss in the Discussion).
The value-RNN 2, 13, the above-introduced simultaneous RL of state values and state representation through minimization of squared value-error, differs from supervised learning considered in these previous feedback-alignment studies in two ways: i) it is TD learning, i.e., it approximates the true error by the TD-RPE because the true error, or true state value, is unknown, and ii) it uses a scalar error (TD-RPE) rather than a vector error. Therefore it was nontrivial whether the feedback alignment mechanism could work also for the value-RNN. In the present work, we first examined this, demonstrating that it does work and providing a mechanistic insight into how it works.
After that, we further addressed other biological-plausibility problems. Specifically, we imposed biological constraints that the downstream (cortico-striatal) weights and the fixed random feedback, as well as the activities of neurons in the RNN, were all non-negative. Moreover, we also remedied the non-monotonic dependence of the update of RNN connection-strength on post-synaptic neural activity. We then found, unexpectedly, that the non-negative constraint appeared to aid, rather than degrade, the learning by ensuring that the downstream weights and the fixed random feedback are loosely aligned even without operation of the feedback alignment mechanism. These results suggest how learning of state representation and value can be neurally implemented, more specifically, through synaptic plasticity depending on DA, which represents TD-RPE, in the cortex and the striatum.
Results
Consideration of the value-RNN with fixed random feedback
We considered an implementation of the value-RNN in the cortico-basal ganglia circuits (Fig. 1). A cortical region/population is supposed to represent information of sensory observation (o) and send it to another cortical region/population, which has rich recurrent connections and therefore can be approximated by an RNN. Activities of neurons in the RNN (x) are supposed to learn to represent states, through updates of the strengths of recurrent connections A and feed-forward connections B. The activity of a population of striatal neurons that receive inputs from the RNN is supposed to learn to represent the state values (v), by learning the weights of the cortico-striatal connections (from the RNN to the striatal neurons) (w), referred to as the value weights. DA neurons in the ventral tegmental area (VTA) receive (direct and indirect) inputs from the striatum and from other structures conveying information about obtained reward (r), and thereby the activity of the DA neurons, as well as released DA, represents TD-RPE (δ). TD-RPE-representing DA is released in the striatum and also in the cortical RNN through mesocorticolimbic projections, and is used for modifying the strengths of cortical recurrent and feed-forward connections (A and B) and cortico-striatal connections (w).
In the original value-RNN 2, 13, the update rule for the connections onto the RNN (A and B) requires the (gradually changing) value weights (w), but this is biologically implausible because the cortico-striatal synaptic strengths are not available in the cortex, as discussed above. Therefore, we considered a modified value-RNN in which the cortico-striatal weights used in the updates of intra-cortical connections were replaced with fixed random strengths (c). Besides, the original value-RNN adopted a learning rule called backpropagation through time (BPTT) 44, in which the error in the output needs to be incrementally accumulated in the temporally backward order, but such acausality is also biologically implausible, as previously pointed out 42. Therefore, we instead used an online learning rule, which considers only the influence of the recurrent weights at the previous time step (see the Methods for details and equations).
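As a concrete illustration, a minimal sketch of one time-step of this online scheme is shown below (in Python/NumPy; this is not the actual simulation code, and while the sigmoid and update rules follow the Methods, the TD-RPE indexing convention and variable names are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_obs, gamma, a = 7, 2, 0.8, 0.1          # RNN units, observation units, discount, learning rate
f = lambda z: 1.0 / (1.0 + np.exp(-z)) - 0.5 # sigmoid with output range (-0.5, 0.5)

A = rng.standard_normal((n, n))              # recurrent (intra-cortical) weights
B = rng.standard_normal((n, n_obs))          # feed-forward weights (observation -> RNN)
w = np.zeros(n)                              # cortico-striatal value weights
c = rng.standard_normal(n)                   # fixed random feedback used by VRNNrf
use_random_feedback = True                   # True: VRNNrf, False: VRNNbp

# one illustrative time step; x_prev, o_prev, o, r_next would come from the task
x_prev = rng.standard_normal(n)
o_prev, o, r_next = np.zeros(n_obs), np.zeros(n_obs), 0.0
x = f(A @ x_prev + B @ o_prev)                    # current RNN state x(t)
x_next = f(A @ x + B @ o)                         # next RNN state x(t+1)
delta = r_next + gamma * (w @ x_next) - (w @ x)   # TD-RPE (assumed indexing convention)

feedback = c if use_random_feedback else w.copy() # the only difference between the two models
gate = (0.5 + x) * (0.5 - x)                      # derivative of f, written via the post-synaptic activity
A += a * delta * np.outer(feedback * gate, x_prev)  # DA-dependent update of intra-cortical weights
B += a * delta * np.outer(feedback * gate, o_prev)
w += a * delta * x                                # DA-dependent cortico-striatal (value) update
```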
Simulation of a Pavlovian cue-reward association task with variable inter-trial intervals
We compared the learning of the modified value-RNN with fixed random feedback (referred to as VRNNrf), the value-RNN with backprop (referred to as VRNNbp), both of which adopted the online learning rule rather than the BPTT, and an untrained RNN. The number of RNN units was set to 7 in all cases. Traditional TD-RL agents with punctate state representation (called the complete serial compound, CSC 3, 45) were also compared. We simulated a Pavlovian cue-reward association task, in which a cue was followed by a reward three time-steps later, and the inter-trial interval (i.e., reward to next cue) was randomly chosen from 4, 5, 6, or 7 time-steps (Fig. 2A). In this task, states can be defined by relative timings from the cue, and we estimated the true state values through simulations according to the definition of state value, i.e., expected cumulative discounted future rewards 46 (Fig. 2B, black line). Expected TD-RPE calculated from these estimated true values (Fig. 2B, red line) was almost 0 at all states, as expected. An agent with punctate/CSC state representation and state values without continuation between trials (i.e., the value of the last state in a trial was not updated by TD-RPE upon entering the next trial) developed positive values between cue and reward, and abrupt TD-RPE upon cue (Fig. 2C). An agent with punctate/CSC state representation and continuously updated state values across trials developed positive values also for states in the inter-trial interval (Fig. 2D). VRNNbp developed state values between cue and reward, and to some extent in the inter-trial interval, and showed abrupt TD-RPE upon cue and smaller TD-RPE upon reward (Fig. 2E). This indicates that this agent largely learned the task structure, confirming, in this different task, the previously proposed effectiveness of the value-RNN.
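For concreteness, the punctate/CSC agent without continuation between trials can be sketched as a tabular TD(0) learner over one-hot timing states (an illustrative sketch, not the actual simulation code; the indexing convention, with reward counted upon entering the reward state, is an assumption).

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, alpha, n_trials = 0.8, 0.1, 1000
V = np.zeros(10)                     # tabular values for the 10 one-hot (CSC) states; 0 = cue, 3 = reward

for _ in range(n_trials):
    T = rng.integers(7, 11)          # trial length 7-10 time-steps (cue at step 0, reward at step 3)
    rewards = np.zeros(T)
    rewards[3] = 1.0                 # reward of size 1, three steps after the cue
    for t in range(T - 1):           # "without continuation": the last state is not linked to the next cue
        delta = rewards[t + 1] + gamma * V[t + 1] - V[t]   # TD-RPE
        V[t] += alpha * delta

print(np.round(V, 3))                # positive values develop between cue and reward (cf. Fig. 2C)
```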
VRNNrf, having fixed random feedback instead of backprop-based feedback, developed state values that were largely similar to, although smaller (on average across simulations) than, those developed by VRNNbp (Fig. 2F, black line). VRNNrf generated abrupt TD-RPEs upon cue and reward, again similarly to VRNNbp, although the relative size of the reward response was (on average) larger (Fig. 2F, red line). As a comparison, the agent with untrained RNN developed (on average) even smaller state values and a larger relative size of TD-RPE upon reward (Fig. 2G). These results indicate that the value-RNN could be trained by fixed random feedback at least to a certain extent, although somewhat less effectively (as might be expected) than by backprop-based feedback. Figure 2H shows state values developed in individual simulations of VRNNbp (top), VRNNrf (middle), and untrained RNN (bottom), and Figure 2I shows the histograms of the value of the pre-reward state (i.e., one time-step before the state where reward was obtained) developed in individual simulations of these three models. These figures indicate that VRNNrf did not simply tend to develop moderately smaller state values than VRNNbp in each simulation. Rather, state values developed in VRNNrf were largely comparable to those developed in VRNNbp once learning succeeded, but the success rate was lower than that of VRNNbp, while still higher than that of the untrained RNN.
So far, we examined the cases where the number of RNN units was 7. We next compared the learning performance of VRNNbp, VRNNrf, and untrained RNN when the number of RNN units was varied from 5 to 40. Learning performance was measured by the sum of squares of the differences between the state values developed by each of these three types of agents and the estimated true state values (Fig. 2B) between cue and reward. As shown in Fig. 2J, on average across simulations, VRNNbp generally achieved the highest performance, but VRNNrf also exhibited largely comparable performance and always outperformed the untrained RNN. As the number of RNN units increased from 5 to 15, all three agents improved their performance, whereas additional increases to 20 or 25 units resulted in smaller changes. Further increases in the number of RNN units caused a decrease in the mean performance of all three agents, and when the number of RNN units was increased to 45, there were occasions where learning appeared to diverge. We will discuss these points in the Discussion.
Occurrence of feedback alignment and an intuitive understanding of its mechanism
We next asked whether feedback alignment underlay the learnability of VRNNrf. Returning to the case with 7 RNN units, we examined whether the value weight vector w became aligned to the random feedback vector c in VRNNrf, by looking at the changes in the angle between these two vectors across trials. As shown in Fig. 3A, this angle, averaged across simulations, decreased over trials, indicating that the value weight w indeed tended to become aligned to the random feedback c. We then examined whether better alignment of w to c was related to better development of state value, by looking at the relation between the angle between w and c and the value of the pre-reward state at the 1000-th trial. As shown in Fig. 3B, there was a negative correlation such that the smaller the angle was (i.e., the more aligned the vectors were), the larger the state value tended to be (r = −0.288, p = 0.00362), in line with our expectation. These results indicate that the mechanism of feedback alignment, previously shown to work for supervised learning, also worked for TD learning of value weights and recurrent/feed-forward connections.
How did the feedback alignment mechanistically occur? We attempted to obtain an intuitive understanding. Assume that positive TD-RPE (δ(t) > 0) is generated at a state, S (= x(t)), in a task trial. Because of the update rule for w (w ← w + aδ(t)x(t)), w is updated in the direction of x(t). Next, what is the effect of updates of recurrent/feed-forward connections (A and B) on x? For simplicity, we here consider the case where observation is null (o = 0), so that x(t) = f(Ax(t−1)) holds (a similar argument can be made for the case where observation is not null). If A is replaced with its updated version, it can be calculated that the i-th element of Ax(t−1) will hypothetically change by ci × (a positive value) (technical note: the value is aδ(t){Σjxj(t−1)2}(0.5 + xi(t))(0.5 − xi(t)), which is positive unless x(t−1) = 0), and therefore the vector Ax(t−1) as a whole will hypothetically change by a vector that is in a relatively close angle with c (in the sense that, for example, [c1 c2 c3]T and [0.5c1 1.2c2 0.8c3]T are in a relatively close angle, in the same quadrant). Then, because f is a monotonically increasing sigmoidal function, x(t) = f(Ax(t−1)) will also hypothetically change by a vector that is in a relatively close angle with c. This was indeed the case in our simulations, as shown in Fig. 3C.
In this way, at state S where TD-RPE is positive, w is updated in the direction of x(t), and x(t) will hypothetically change by a vector that is in a relatively close angle with c if A is replaced with its updated one. Then, if the update of w and the hypothetical change in x(t) due to the update of A could be integrated, w would become aligned to c (if TD-RPE is instead negative, w is updated in the opposite direction of x(t), and x(t) will hypothetically change by a vector that is in a relatively close angle with −c, and so the same story holds in the end).
There is, however, a caveat regarding how the update of w and the hypothetical change in x(t) can be integrated. Although technical, we briefly describe it here, together with a possible resolution. The updates of w and A use TD-RPE, which is calculated based on v(t) = wT x(t) and v(t+1) = wT x(t+1), and so x(t) and x(t+1) should already be determined beforehand. Therefore, the hypothetical change in x(t) due to the update of A, described above, does not actually occur (this is why we called it ‘hypothetical’) and thus cannot be integrated with the update of w. Nevertheless, integration could still occur across successive trials, at least to a certain extent. Specifically, although TD-RPEs at S in successive trials would generally differ from each other, they would still tend to have the same sign, as was indeed the case in our simulations (Fig. 3D). Also, although the trajectories of RNN activity (x) in successive trials would differ, we could expect a certain level of similarity because the RNN is entrained by observation-representing inputs, again as was indeed the case in our example simulation (Fig. 3E). Then, the hypothetical change in x(t) due to the update of A, considered above, could become a reality in the next trial, to a certain extent, and could thus be integrated into the update of w, explaining the occurrence of feedback alignment.
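The claim in the technical note above can be checked numerically; the following illustrative sketch (with assumed parameter values) verifies that the hypothetical change of Ax(t−1) under the VRNNrf update is, element-wise, ci times a common positive factor, and hence lies within 90° of c.

```python
import numpy as np

rng = np.random.default_rng(2)
n, a, delta = 7, 0.1, 0.5                        # units, learning rate, a positive TD-RPE
f = lambda z: 1.0 / (1.0 + np.exp(-z)) - 0.5

A = rng.standard_normal((n, n))
c = rng.standard_normal(n)                       # fixed random feedback
x_prev = rng.standard_normal(n)                  # x(t-1)
x = f(A @ x_prev)                                # x(t), with null observation

dA = a * delta * np.outer(c * (0.5 + x) * (0.5 - x), x_prev)   # VRNNrf update of A
change = (A + dA) @ x_prev - A @ x_prev          # hypothetical change of A x(t-1)

# each element equals c_i times a positive value, as stated in the technical note:
predicted = a * delta * np.sum(x_prev**2) * (0.5 + x) * (0.5 - x) * c
print(np.allclose(change, predicted))            # True

cos = change @ c / (np.linalg.norm(change) * np.linalg.norm(c))
print(cos > 0)                                   # True: the change is within 90 degrees of c
```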
Simulation of tasks with probabilistic structures of reward timing/existence
We also simulated two tasks (Fig. 4A) that were qualitatively similar to (though simpler than) the two tasks examined in previous experiments 1 and modeled by the original value-RNN with backprop 2. In our task 1, a cue was always followed by a reward either two or four time-steps later with equal probabilities. Task 2 was the same as task 1 except that reward was omitted with 40% probability. In task 1, if reward was not given at the early timing (i.e., two time-steps after the cue), the agent could predict that reward should be given at the late timing (i.e., four time-steps after the cue), and thus TD-RPE upon reward at the late timing is expected to be smaller than TD-RPE upon reward at the early timing (if the agent perfectly learned the task structure, TD-RPE upon reward at the late timing should be 0). By contrast, in task 2, if reward was not given at the early timing, it might indicate that reward would be given at the late timing but might instead indicate that reward was omitted in that trial, and thus TD-RPE upon reward at the late timing is expected to exist and can even be larger than TD-RPE upon reward at the early timing.
In these tasks, states can be defined in the following way. There were two types of trials, with early or late reward, in task 1, and additionally one more type of trial, without reward, in task 2 (Fig. 4Ba, top). For each timing in each of these trial types, its value, i.e., expected discounted cumulative future rewards, can be estimated through simulations (Fig. 4Ba, bottom). The agent could not know the current trial type until receiving reward at the early or the late timing, or receiving no reward at either timing. Until then, the agent could only hold a probabilistic belief about the current trial type, e.g., 50% in the trial with early reward and 50% in the trial with late reward (in task 1), or 30% in the trial with early reward, 30% in the trial with late reward, and 40% in the trial without reward (in task 2) (Fig. 4Bb). States can be defined by incorporating these probabilistic beliefs at each timing (Fig. 4Bc, top), and the state values (Fig. 4Bc, bottom: expected discounted cumulative future rewards, estimated through simulations) should theoretically match an integration (multiplication) of the values of each trial type (Fig. 4Ba, bottom) with the probabilistic beliefs (Fig. 4Bb). Expected TD-RPE calculated from these estimated state values (Fig. 4C) exhibited features that matched the conjecture mentioned above: in task 1, TD-RPE upon reception of the late reward, which was actually 0, was smaller than TD-RPE upon reception of the early reward, whereas in task 2, TD-RPE upon reception of the late reward was larger than TD-RPE upon reception of the early reward.
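The belief update underlying this difference can be made explicit with a few lines of arithmetic (an illustrative sketch; the priors are the task probabilities given above).

```python
# Posterior belief over trial types after observing "no reward at the early timing".
# Task 1 prior over (early, late): (0.5, 0.5); task 2 prior over (early, late, none): (0.3, 0.3, 0.4).

def posterior(prior, likelihood):
    unnorm = [p * l for p, l in zip(prior, likelihood)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# the "early reward" type is ruled out by the observation; the other types remain fully consistent
print(posterior([0.5, 0.5], [0.0, 1.0]))            # task 1: [0.0, 1.0] -> late reward is certain
print(posterior([0.3, 0.3, 0.4], [0.0, 1.0, 1.0]))  # task 2: approx. [0.0, 0.43, 0.57] -> reward uncertain
```

Because the belief assigns only about 43% to the late-reward type in task 2, the state value just before the late timing is lower than in task 1, so a delivered late reward produces a larger positive TD-RPE.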
The previous experimental work 1 showed that VTA DA neurons exhibited activity patterns similar to the abovementioned TD-RPE patterns, and the theoretical work 2 showed that the original value-RNN with backprop could reproduce such TD-RPE patterns. We examined what TD-RPE patterns were generated by the agents with punctate/CSC representation, VRNNbp, VRNNrf, and untrained RNN (12 RNN units in all cases) in our simulated two tasks. VRNNbp developed similar TD-RPE patterns (smaller TD-RPE upon reward at the late timing than at the early timing in task 1, and the opposite pattern in task 2) (Fig. 4F), qualitatively reproducing the result of the previous work 2. Crucially, VRNNrf also developed similar TD-RPE patterns (Fig. 4G), indicating that this agent with random feedback could also learn the distinct structures of the two tasks. By contrast, the agents with punctate/CSC state representation, without or with continuous value update across trials (Fig. 4D,E), as well as the agent with untrained RNN (Fig. 4H), could not develop such patterns well.
Value-RNN with further biological constraints
So far, the activities of neurons in the RNN (x) were initialized to pseudo standard normal random numbers, and thereafter took values in the range between −0.5 and 0.5, the range of the sigmoidal input-output function. The value weights (w) could also take both positive and negative values since no constraint was imposed. The fixed random feedback in VRNNrf (c) was generated from pseudo standard normal random numbers, and so could also be positive or negative. Negative neuronal activities and value weights could potentially be regarded as inhibitory or smaller-than-baseline quantities. However, because neuronal firing rate is non-negative and cortico-striatal projections are excitatory, it would be biologically more plausible to assume that the activities of neurons in the RNN and the value weights are non-negative. As for the fixed random feedback, if it is negative, the update rule becomes anti-Hebbian under positive TD-RPE, and so assuming non-negativity would be plausible, since a Hebbian property has been suggested for rapid plasticity of cortical synapses 47. There was another issue in the update rule for recurrent and feed-forward connections, derived from the gradient descent: its dependence on the post-synaptic activity was non-monotonic, maximized at the middle of the range of activity. It would be biologically more plausible to assume a monotonic dependence.
In order to address these issues, we considered revised models. We first considered a revised VRNNbp, referred to as revVRNNbp, in which the RNN activities and the value weights were constrained to be non-negative, while the non-monotonic dependence of the update rule on the post-synaptic activity remained unchanged (Fig. 5A). We then considered a revised VRNNrf, referred to as bioVRNNrf, in which the fixed random feedback, as well as the RNN activities and the value weights, were constrained to be non-negative, and also the update rule was modified so that the dependence on the post-synaptic activity became monotonic (with saturation) (Fig. 5B).
We examined how these revised models performed in the Pavlovian cue-reward association task examined above, in comparison with an untrained RNN that also had the non-negative constraint for x and w (the numbers of neurons and trials were set to 12 and 1500, respectively). revVRNNbp developed state values toward the reward timing well (Fig. 6A). bioVRNNrf also developed state values to a largely comparable extent (Fig. 6B). By contrast, the untrained RNN could not develop such a pattern of state values (Fig. 6C). This, however, could be because the initially set recurrent/feed-forward connections were far from those learned in the value-RNNs. Therefore, as a stricter control, we conducted simulations of untrained RNN with non-negative x and w, in which the recurrent/feed-forward connections in each simulation were set to be shuffled versions of the learnt connections from a simulation of bioVRNNrf. Untrained RNN with this setting performed somewhat better than the original untrained RNN (Fig. 6D), but still worse than revVRNNbp and bioVRNNrf. We varied the number of neurons in the RNN, and compared the performance (sum of squared errors from the true state values) of revVRNNbp and bioVRNNrf with that of untrained RNN (both the naive one and the one with shuffled learnt connections from bioVRNNrf). As shown in Fig. 6E, regardless of the number of neurons, the performance of bioVRNNrf was largely comparable to that of revVRNNbp, and better than the performance of both kinds of untrained RNN. Figure 6F shows the mean of the elements of the recurrent and feed-forward connections at the 1500-th trial in the different models. As shown in the figure, these connections (initialized to pseudo standard normal random numbers) became negative on average through learning, in revVRNNbp and more prominently in bioVRNNrf. This learnt negative-dominance (inhibition-dominance) could possibly be related, e.g., through prevention of excessive activity, to the good performance of bioVRNNrf, and also to the better performance of the untrained RNN with connections shuffled from bioVRNNrf compared with the naive untrained RNN.
We examined how the angle between the value weights (w) and the random feedback (c) changed across trials in bioVRNNrf. As shown in Fig. 7A, the angle was on average smaller than the chance-level angle (90°) from the beginning, while there was no further alignment over trials. This could be understood as follows. Because both the value weights (w) and the random feedback (c) were now constrained to be non-negative, these two vectors were ensured to be at a relatively close angle (i.e., in the same quadrant) from the beginning. By virtue of this loose alignment, the random feedback could act similarly to the backprop-derived proper feedback, even without further alignment. We examined whether the angle between the value weights (w) and the random feedback (c) at the 1500-th trial was associated with the developed value of the pre-reward state across simulations, but found no association (r = 0.0117, p = 0.908) (Fig. 7B). We then examined whether the w-c angle at earlier trials (2nd - 500-th trials) was associated with the developed values at the 500-th trial, with the number of simulations increased to 1000 so that small correlations could be detected. We found that the w-c angle at initial trials (2nd - around 10-th trials) was negatively correlated with the developed values of the reward state and preceding states at the 500-th trial (Fig. 7C). As for the reward state, a negative correlation at around the 100-th - 300-th trials was also observed. These results suggest that better alignment of w and c at initial and early timings was associated with better development of state values, in line with the conjecture that the loose alignment of w and c coming from the non-negative constraint supported learning. It should be noted, however, that there were also cases where a positive (although small) correlation was observed. The exact reason is unclear, but it could be related to the fact that large developed values, or fast value development, do not necessarily indicate good learning.
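The loose alignment expected from the non-negative constraint can be illustrated with a quick sampling experiment (a schematic sketch: sampling both vectors uniformly from [0, 1] only mimics the constraint, since w is actually initialized to 0 and becomes non-negative through learning).

```python
import numpy as np

rng = np.random.default_rng(3)

def angle_deg(u, v):
    cos = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

n, n_samples = 12, 10000
unconstrained = [angle_deg(rng.standard_normal(n), rng.standard_normal(n)) for _ in range(n_samples)]
non_negative = [angle_deg(rng.random(n), rng.random(n)) for _ in range(n_samples)]

print(round(float(np.mean(unconstrained)), 1))   # ~90 degrees: no alignment expected by chance
print(round(float(np.mean(non_negative)), 1))    # well below 90 degrees: "loose alignment" from non-negativity
```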
We further examined how the revised value-RNN models performed in the two tasks with probabilistic structures examined above. Since the revised value-RNN models with 12 neurons appeared unable to produce the different patterns of TD-RPEs in the two tasks (TD-RPE at early reward > TD-RPE at late reward in task 1 and the opposite pattern in task 2), we increased the number of neurons to 20. Then, both revVRNNbp and bioVRNNrf produced such TD-RPE patterns (Fig. 8A,B), whereas untrained RNN of both kinds (naive, and with connections shuffled from bioVRNNrf) could not (Fig. 8C,D). This indicates that the value-RNN with random feedback and further biological constraints could learn the differential characteristics of the tasks.
Discussion
We have shown that state representation and value can be learned in the RNN and its downstream by using random feedback instead of backprop-derived biologically unavailable downstream weights. In the model without non-negative constraint, the feedback alignment, previously shown for supervised learning, occurred, and we have presented an intuitive understanding of its mechanism. In the model with non-negative constraint, loose alignment occurred from the beginning because of the constraint, and it appeared to support learning. Below we discuss implementation of the value-RNN with random feedback, pointing to a crucial role of DA outside of striatum, and also heterogeneity of DA signals. We further discuss limitations, relations to other proposals and suggestions, and future perspectives.
Implementation of the value-RNN with random feedback, featuring a role of DA outside of striatum
DA neurons in the midbrain project not only to the striatum but also to the cortex, including the prefrontal cortex 48 and the hippocampus 49. Previous studies demonstrated a crucial role of prefrontal DA in working memory 50, 51, presumably through effects on synaptic/ionic conductances 52, 53. Roles of prefrontal DA in behavioral flexibility or decision making have also been suggested 54. Moreover, a role of hippocampal DA in the modulation of aversive memory formation has been demonstrated 55. However, although i) there has been increasing evidence that DA represents TD-RPE 56, ii) human fMRI experiments found TD-RPE correlates in cortical regions 57, iii) DAergic modulation or initiation of plasticity in the prefrontal cortex 58 or the hippocampus 59 has been demonstrated, and iv) lesion or inactivation of prefrontal or hippocampal regions was found to disrupt DA’s encoding of RPE reflecting appropriate state representation 60–62, what computational role in RL is played by TD-RPE-representing DA in the cortex remains to be clarified. This lags behind the case of the striatum, where it has been widely considered that DAergic modulation of cortico-striatal synaptic weights implements TD-RPE-based update of state/action values 8, 63.
The value-RNN with fixed random feedback and biological constraints considered in the present work suggests a possibility that TD-RPE-representing DA modulates plasticity of RNN in the cortex so that state representation can be learnt. Different from the original value-RNN with backprop 2, 13, update of intra-cortical connections does not require downstream cortico-striatal weights but requires only non-negative fixed random feedback specific to each post-synaptic neuron. The non-negativity was assumed so that the update rule became Hebbian under positive TD-RPE, since Hebbian plasticity has been suggested for rapid plasticity of cortical synapses 47. The fixed randomness would naturally be achieved by intrinsic heterogeneity of neurons. The successful learning performance of our model thus indicates that DA-dependent modulation of Hebbian plasticity of cortical excitatory connections serves for learning of state representation that captures task structure.
VTA DA neurons also project to regions other than the striatum and cortex, including the basolateral amygdala (BLA) 64, and DA has been suggested to regulate plasticity also in the BLA 65. Recent work 66 demonstrated that VTA→BLA DA exhibited properties of TD-RPE, although it increased rather than decreased upon aversive events, and was not itself reinforcing but was necessary and sufficient for the formation of an environmental model. The BLA has recurrent connections 67, projects to the striatum 68, 69, and engages in abstract context representation together with the prefrontal cortex 70. Thus, given that environmental relationships needed for goal-directed behavior could be embedded in state representation 11–13, it seems possible that a mechanism partly akin to the learning of state representation, but not value, in the RNN of our model takes place in the BLA. It remains open, however, whether and how such sophisticated representation can be learned. It might require multidimensional error 71 beyond TD-RPE, and/or multi-compartment units 26, both of which we discuss further below.
DA’s encoding of TD-RPE and other variables
There have been many results suggesting heterogeneity of DA signals. Recent work 72 suggested that such heterogeneity (co)exists with different origins: (i) heterogeneity of the learning target (reward or other), (ii) heterogeneity of state features, and (iii) others, such as ramping patterns. Type (i) is typically observed in DA neurons projecting to different regions, which can represent prediction errors of quantities other than reward. In contrast, type (ii) applies to DA neurons projecting to the same region, in which, even though individual DA neurons show heterogeneous responses, the resulting merged DA signal still represents a scalar error such as TD-RPE.
Referring to a result 73 of type (i) and the fact that DA neurons receive inputs from the cerebellum 74, 75, which supposedly implements supervised learning 76, a recent modeling work 43 proposed that DA neurons convey vector-valued error signals, which are used for supervised learning of actions in continuous space. This previous work showed that learning occurred without adjustment of DA projection strengths because the feedback alignment mechanism worked. In contrast, in the present work, we assumed a scalar TD-RPE, which can be consistent with type (ii) heterogeneity of DA signals. We have shown that the feedback alignment mechanism works also for RL, and moreover, that learning could also occur by virtue of the loose alignment coming from the biological constraints, even without the operation of feedback alignment. Notably, the previous model 43 and our model can coexist, given that different DA neuronal populations may encode vector-valued error and TD-RPE, or even the same single DA neuron might represent both errors depending on the context, reflecting which inputs are active.
Limitations and possible reasons
We have shown that state representation and value could be learned in the value-RNN with fixed random feedback with a relatively small number of simple RNN units and observation inputs in simple simulated tasks. These simplicities enabled us to derive an intuitive understanding of how the feedback alignment could occur. However, in our models without the non-negativity constraint, as the number of RNN units increased, the performance of the models initially improved but then degraded when the number of RNN units increased beyond around 25. In contrast, in the original value-RNN with backprop 2, 13, the ability to develop belief-state-like representation was reported to improve as the number of RNN units increased to 100 or 50.
There are several possible reasons for this difference. First, as a performance measure, we used the sum of squared errors between the values developed by the value-RNN and the values estimated according to the definition (expected discounted cumulative future rewards) between cue and reward, whereas the previous studies focused on the similarity between the representation developed by the value-RNN and the handcrafted belief states. Second, there was a difference in how the weights were updated. Specifically, as mentioned in the Results, the previous studies used the BPTT 44, which considers the recursive influence of the recurrent weights in a way that lacks causality, whereas our models used an online learning rule, which considers only the influence of the recurrent weights at the previous time step.
Last but not least, there was a difference in the RNN unit. Specifically, we used a simple sigmoidal function, whereas the previous studies used the “Gated Recurrent Unit (GRU) cell” 77. An RNN with simple nonlinear units is known to suffer from the “vanishing gradient problem” 78: the gradient of the loss function becomes vanishingly small as it is propagated back through many time steps, so that the update becomes negligible. This issue can be alleviated by using an RNN unit having a memory, called the Long Short-Term Memory (LSTM) unit 79. The GRU cell was suggested to have a memory function similar to that of the LSTM unit 77. We focused on resolving the biological implausibility of backprop, and therefore stuck to the simple sigmoidal unit. However, a gated unit similar to the LSTM unit has actually been proposed to be implemented in cortical microcircuits 80, and incorporating features of real neurons into the value-RNN could enhance its computational power, as we discuss below.
Biological details and future perspectives
Our RNN unit did not incorporate neuronal spiking and its effects on plasticity 38, 81, 82, or neuronal morphology with nonlinear dendritic computations 41, 83, 84. Importantly, recent studies suggest that dendritic mechanisms 34, 35, potentially together with burst-dependent plasticity 38, 39, can realize credit assignment without backprop in supervised learning, and also in unsupervised learning 85, 86. Dendritic mechanisms have their own specific features, or constraints, and so having them is different from merely increasing the number of layers of a neural network; it has been argued 41 that adding such biological constraints enables learning in deep neural networks. Moreover, a recent model of the hippocampus 26 showed that a network of multi-compartment units could learn complex representations. Given these findings, it would be interesting to explore whether incorporating biological details into the RNN unit can improve the performance of the value-RNN.
A different alternative to backprop is the Associative Reward-Penalty (AR-P) algorithm 87–89, in which the hidden units behave stochastically, and thereby the gradient could be estimated, in effect, through stochastic sampling without explicit information of the downstream weights. More recent work 90 demonstrated that noise-induced learning of back projections could achieve better alignment and performance compared with the case of fixed random feedback in a feed-forward network. These mechanisms can be biologically implemented because neurons and neural networks can exhibit noisy or chaotic behavior 91–93, and are expected to potentially improve the performance of value-RNN.
Regarding the connectivity, in our models, recurrent/feed-forward connections could take both positive and negative values. This could be justified because there are both excitatory and inhibitory connections in the cortex and the net connection sign between two units can be positive or negative depending on whether excitation or inhibition exceeds the other. However, recent studies have shown that feed-forward and recurrent neural networks conforming to Dale’s law can perform well depending on the architecture, initialization, and update rules 94, 95. Integration of these models and ours, also with other connectivity features 96, may be a fruitful direction.
More specific to the cortico-basal ganglia circuit, the existence of D1/D2 DA receptors and of the D1-direct and D2-indirect basal ganglia pathways 97–100, as well as distinctions among cortical areas and cell types 101–104, were also not incorporated. Furthermore, the circuit/synaptic mechanisms of how TD-RPE is calculated in DA neurons (c.f., 105, 106) and/or how it can be learned (c.f., 107) were left unspecified. Future studies are expected to incorporate these factors.
Methods
Value-RNN with backprop (VRNNbp)
We constructed a value-RNN model based on the previous proposals 2, 13 but with several differences. We assumed that the activities of neurons in the RNN at time t+1 were determined by the activities of these neurons and of neurons representing observation (cue, reward, or nothing) at time t:

x(t+1) = f(A x(t) + B o(t)),

where A and B were the recurrent and feed-forward connection strengths and f was the element-wise sigmoidal input-output function

f(z) = 1/(1 + exp(−z)) − 0.5,

whose output ranged between −0.5 and 0.5. The state value was estimated as a linear readout of the RNN activities,

v(t) = wT x(t) = Σj wj xj(t),

where w were the value weights. The error between this estimated value and the true value, vtrue(t), was defined as:

ε(t) = vtrue(t) − v(t).

Parameters wj, Aij, and Bik that minimize the squared error ε(t)² could be found by a gradient descent / error-’backpropagation’ (backprop) method, i.e., by updating them in the directions of −∂(ε(t)²)/∂wj, −∂(ε(t)²)/∂Aij, and −∂(ε(t)²)/∂Bik. −∂(ε(t)²)/∂wj was calculated as follows:

−∂(ε(t)²)/∂wj = 2ε(t) ∂v(t)/∂wj = 2ε(t) xj(t) ≈ 2δ(t) xj(t).

In the last step, since ε(t) was unavailable as vtrue(t) was unknown, it was approximated by the TD-RPE:

δ(t) = r(t+1) + γ v(t+1) − v(t),

where r was the obtained reward and γ was the time discount factor. −∂(ε(t)²)/∂Aij was calculated as follows, considering (in line with the online rule) only the dependence of x(t) on Aij through the previous time step:

−∂(ε(t)²)/∂Aij = 2ε(t) wi ∂xi(t)/∂Aij = 2ε(t) wi (0.5 + xi(t))(0.5 − xi(t)) xj(t−1) ≈ 2δ(t) wi (0.5 + xi(t))(0.5 − xi(t)) xj(t−1).

Similarly, −∂(ε(t)²)/∂Bik was calculated as follows:

−∂(ε(t)²)/∂Bik ≈ 2δ(t) wi (0.5 + xi(t))(0.5 − xi(t)) ok(t−1).

According to these, the update rule for the value-RNN was determined as follows:

wj ← wj + a δ(t) xj(t),
Aij ← Aij + a δ(t) wi (0.5 + xi(t))(0.5 − xi(t)) xj(t−1),
Bik ← Bik + a δ(t) wi (0.5 + xi(t))(0.5 − xi(t)) ok(t−1),

where a was the learning rate (the constant factor 2 was absorbed into a). In each simulation, the elements of A and B, as well as the elements of x, were initialized to pseudo standard normal random numbers, and the elements of w were initialized to 0.
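As a sanity check on this derivation, the analytic expression for −∂(ε(t)²)/∂Aij can be compared against a finite-difference estimate; the sketch below (illustrative, with vtrue treated as known for the check, i.e., before the TD-RPE approximation) does this under the online rule's one-step dependence.

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_obs = 5, 2
f = lambda z: 1.0 / (1.0 + np.exp(-z)) - 0.5

A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n_obs))
w = rng.standard_normal(n)
x_prev, o_prev = rng.standard_normal(n), rng.standard_normal(n_obs)
v_true = 0.7                                   # stand-in for the (unknown) true value

def sq_error(A_):
    x = f(A_ @ x_prev + B @ o_prev)            # online rule: only the last step depends on A
    return (v_true - w @ x) ** 2

x = f(A @ x_prev + B @ o_prev)
eps = v_true - w @ x
analytic = 2 * eps * np.outer(w * (0.5 + x) * (0.5 - x), x_prev)   # -d(eps^2)/dA

numeric = np.zeros_like(A)
h = 1e-6
for i in range(n):
    for j in range(n):
        Ap = A.copy(); Ap[i, j] += h
        numeric[i, j] = -(sq_error(Ap) - sq_error(A)) / h          # finite-difference gradient

print(np.allclose(analytic, numeric, atol=1e-4))                   # True
```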
Value-RNN with fixed random feedback (VRNNrf)
We considered an implementation of the value-RNN described above in the cortico-basal ganglia-DA system (Fig. 1):
x : activities of neurons in a cortical region with rich recurrent connections
A : recurrent connection strengths among x
o : activities of neurons in a cortical region processing sensory inputs
B : feed-forward connection strengths from o to x
f : sigmoidal relationship from the input to the output of the cortical neurons
w : connection strengths from cortical neurons x to a group of striatal neurons
v : activity of the group of striatal neurons
δ : activity of a group of DA neurons / released DA
The update rule for w,

wj ← wj + a δ(t) xj(t),

could be naturally implemented as cortico-striatal synaptic plasticity, which depends on DA (δ(t)) and pre-synaptic (cortical) neuronal activity (xj(t)). However, an issue emerged in the implementation of the update rules for A and B:

Aij ← Aij + a δ(t) wi (0.5 + xi(t))(0.5 − xi(t)) xj(t−1),
Bik ← Bik + a δ(t) wi (0.5 + xi(t))(0.5 − xi(t)) ok(t−1).

Specifically, wi included in these update rules (for the strengths of cortico-cortical synapses Aij and Bik) is the connection strength from cortical neuron xi to striatal neurons, i.e., the strength of the cortico-striatal synapses (located within the striatum), which is considered to be unavailable at the cortico-cortical synapses (located within the cortex).
As mentioned in the Introduction, this is an example of the long-standing difficulty in biological implementation of backprop, and recently a potential solution for this difficulty, i.e., replacement of the downstream connection strengths in the update rule for upstream connections with fixed random strengths, has been demonstrated in supervised learning of feed-forward and recurrent networks 33, 42, 43. The value-RNN, which we considered here, differed from supervised learning considered in these previous studies in two ways: i) it was TD learning, apparent in the approximation of the true error ε(t) by the TD-RPE δ(t) in the derivation described above, and ii) it used a scalar error (TD-RPE) rather than a vector error. But we expected that the feedback alignment mechanism could still work at least to some extent, and explored this in the present study. Specifically, we examined a modified value-RNN with fixed random feedback (VRNNrf), in which the update rules for A and B were modified as follows:

Aij ← Aij + a δ(t) ci (0.5 + xi(t))(0.5 − xi(t)) xj(t−1),
Bik ← Bik + a δ(t) ci (0.5 + xi(t))(0.5 − xi(t)) ok(t−1),

where wi in the update rules of the value-RNN with backprop (VRNNbp) was replaced with a fixed random parameter ci. Notably, these modified update rules for the cortico-cortical connections A and B required only pre-synaptic activities (xj(t−1), ok(t−1)), post-synaptic activities (xi(t)), TD-RPE-representing DA (δ(t)), and fixed random strengths (ci), which would all be available at the cortico-cortical synapses, given that VTA DA neurons project not only to the striatum but also to the cortex and that random ci could be provided by intrinsic neuronal heterogeneity. In each simulation, the elements of c were initialized to pseudo standard normal random numbers.
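Written out, the rule uses only factors that are locally available at a cortico-cortical synapse (an illustrative restatement; variable names are not from the simulation code).

```python
import numpy as np

def vrnnrf_update(A, B, x_prev, o_prev, x, delta, c, a=0.1):
    """One VRNNrf update of the intra-cortical weights, using only locally available factors:
    pre-synaptic activities x_prev = x(t-1) and o_prev = o(t-1), post-synaptic activity x = x(t),
    DA-borne TD-RPE delta, and the fixed, neuron-specific random factor c."""
    gate = (0.5 + x) * (0.5 - x)                 # post-synaptic factor (derivative of the sigmoid)
    A = A + a * delta * np.outer(c * gate, x_prev)
    B = B + a * delta * np.outer(c * gate, o_prev)
    return A, B
```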
Revised value-RNN models with further biological constraints
In the later part of this study, we examined revised value-RNN models with further biological constraints. Specifically, we considered models, in which the value weights and the activities of neurons in the RNN were constrained to be non-negative. In order to do so, the update rule for w was modified to:

wj ← max(wj + a δ(t) xj(t), 0),

where max(q1, q2) returned the maximum of q1 and q2. Also, the sigmoidal input-output function was replaced with

f(z) = 1/(1 + exp(−z)),

whose output ranged between 0 and 1, and the elements of x were initialized to pseudo uniform [0 1] random numbers. The backprop-based update rules for A and B in VRNNbp were replaced with

Aij ← Aij + a δ(t) wi xi(t)(1 − xi(t)) xj(t−1),
Bik ← Bik + a δ(t) wi xi(t)(1 − xi(t)) ok(t−1).

We referred to the model with these modifications to VRNNbp as revVRNNbp.
As a revised value-RNN with fixed random feedback (VRNNrf), in addition to the abovementioned modifications of the update of w, the sigmoidal input-output function, and the initialization of x, the fixed random feedback c was assumed to be non-negative. Specifically, the elements of c were set to pseudo uniform [0 1] random numbers. Moreover, the update rules for A and B were replaced with rules of the same form but with wi replaced by ci and with the dependence on xi(t) (post-synaptic activity) changed, so that the originally non-monotonic dependence became monotonic with saturation (Fig. 5B). These update rules with non-negative ci could be said to be Hebbian with additional modulation by TD-RPE (Hebbian under positive TD-RPE). We referred to the model with these modifications to VRNNrf as bioVRNNrf.
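An illustrative sketch of these revised updates is given below; note that the exact monotonic-with-saturation post-synaptic factor is specified by Fig. 5B and is not reproduced here, so the function g below is only a placeholder assumption.

```python
import numpy as np

def rev_sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))                  # non-negative activities in (0, 1)

def placeholder_g(x):
    return np.minimum(x, 0.5)                        # PLACEHOLDER monotonic-saturating factor (not from the paper)

def bio_vrnnrf_update(A, B, w, x_prev, o_prev, x, delta, c, a=0.1, g=placeholder_g):
    """Sketch of the bioVRNNrf updates: non-negative value weights and non-negative random feedback c,
    with a monotonic (saturating) post-synaptic dependence g in place of the non-monotonic factor."""
    w = np.maximum(w + a * delta * x, 0.0)           # value weights kept non-negative
    A = A + a * delta * np.outer(c * g(x), x_prev)   # Hebbian under positive TD-RPE (c >= 0)
    B = B + a * delta * np.outer(c * g(x), o_prev)
    return A, B, w
```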
Simulation of the tasks
In the Pavlovian cue-reward association task, at time 1 of each trial, the cue observation was received by the RNN, and at time 4, the reward observation was received. The trial was pseudo-randomly ended at time 7, 8, 9, or 10, and the next trial started from the next time-step. Reward size was r = 1. The tasks with probabilistic structures (task 1 and task 2) were implemented in the same way, except that the reward timing was not time 4 but time 3 or 5, with probabilities of 50% and 50% in task 1 and 30% and 30% in task 2, and there was no reward in the remaining 40% of trials in task 2.
The cue or reward state/timing, mentioned in the text and marked in the figures, was defined to be the timing when the RNN received the cue or reward observation, respectively. Specifically, if o(t) = (1 0)T or o(t) = (0 1)T at time t, t + 1 was defined to be a cue or reward timing, respectively. For the agents with punctate (CSC) representation, each timing in the tasks was represented by a 10-dimensional one-hot vector, starting from (1 0 0 … 0)T for the cue state, with the next state (0 1 0 … 0)T and so on.
Unless otherwise mentioned, parameters were set to the following values. Learning rate (a): 0.1 (normalization by the squared norm of feature vector was not implemented). Time discount factor (γ): 0.8.
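The task-generation procedure can be sketched as follows (illustrative; the two-dimensional observation coding [cue, reward] follows the o(t) convention above, and zero-based indexing is a choice made here).

```python
import numpy as np

rng = np.random.default_rng(5)

def generate_trial(task=0):
    """Observation (columns: [cue, reward]) and reward sequences for one trial.
    task=0: cue-reward association task (reward at time 4); task=1 or 2: probabilistic reward timing."""
    end = rng.integers(7, 11)                     # trial ends at time 7, 8, 9, or 10
    o = np.zeros((end, 2))
    r = np.zeros(end)
    o[0, 0] = 1.0                                 # cue observation at time 1 (index 0)
    if task == 0:
        t_rew = 3                                 # reward observation at time 4 (index 3)
    else:
        p = [0.5, 0.5, 0.0] if task == 1 else [0.3, 0.3, 0.4]
        t_rew = rng.choice([2, 4, -1], p=p)       # time 3, time 5, or no reward (task 2 only)
    if t_rew >= 0:
        o[t_rew, 1] = 1.0
        r[t_rew] = 1.0                            # reward size 1
    return o, r
```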
Estimation of true state values
As for the Pavlovian cue-reward association task, we defined states by relative timings from the cue, and estimated their (true) state values by simulations according to the definition of state value. Specifically, we generated a sequence of cues and rewards corresponding to 1000 trials, and calculated the cumulative discounted future rewards within the sequence,

Σ γ^(trew−1) r(trew),

where trew (≥ 1) denotes the time-step of each reward counted from the starting state, and the starting state ranged over −2, −1, …, and +6 time-steps from a cue. We repeated this 1000 times, generating 1000 sequences (i.e., 1000 simulations of 1000 trials) with different sets of pseudo-random numbers, and calculated an average over these 1000 sequences so as to estimate the expected cumulative discounted future rewards, i.e., the state value (by definition), for each state (−2, −1, …, and +6 time-steps from the cue). Using these estimated true state values, we calculated TD-RPE at each state (−2, −1, …, and +5 time-steps from the cue).
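A minimal Monte Carlo sketch of this estimation is given below (illustrative; the discounting convention, the use of one sampled cue per sequence, and the reduced numbers of sequences/trials for a quick run are assumptions).

```python
import numpy as np

rng = np.random.default_rng(6)
gamma = 0.8
n_seq, n_trials = 200, 200                    # the text uses 1000 sequences of 1000 trials
offsets = range(-2, 7)                        # states: -2 ... +6 time-steps from a cue
values = np.zeros(len(offsets))

for _ in range(n_seq):
    rewards, cue_idx, t = [], [], 0
    for _ in range(n_trials):
        end = rng.integers(7, 11)             # trial length 7-10 time-steps
        trial_r = np.zeros(end)
        trial_r[3] = 1.0                      # cue at the first step of the trial, reward 3 steps later
        cue_idx.append(t)
        rewards.extend(trial_r)
        t += end
    r = np.array(rewards)
    cue = cue_idx[1]                          # sample one cue (not the first, so that the -2 state exists)
    for k, off in enumerate(offsets):
        future = r[cue + off + 1:]            # rewards strictly after the state
        values[k] += np.sum(future * gamma ** np.arange(len(future))) / n_seq

print(np.round(values, 3))                    # estimated true state values (cf. Fig. 2B)
```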
In a similar manner, we defined states and estimated true state values, and also calculated TD-RPE, for tasks 1 and 2, which had probabilistic structures. As for task 1, we defined the following states: −2, −1, …, and +2 time steps from cue (i.e., states visited (entered) before knowing whether reward was given at the early timing (= +2 time steps from cue)), +3, 4, 5, and 6 time steps from cue after reception of reward at the early timing, and +3, 4, 5, and 6 time steps from cue after no reception of reward at the early timing (in total 5 + 4 + 4 = 13 states) (Fig. 4Bc, left-top). We generated 10000 sequences of cues and rewards corresponding to 1000 trials (i.e., 10000 simulations of 1000 trials), and for each state, calculated the cumulative discounted future rewards within the sequence for each of the 10000 simulations and took an average to obtain the expected cumulative discounted future rewards (i.e., the estimated state value) (Fig. 4Bc, left-bottom). Using the estimated state values, we calculated TD-RPE (Fig. 4C, left).
As for task 2, we defined the following states: −2, −1, …, and +2 time steps from cue (i.e., states visited (entered) before knowing whether reward was given at the early timing), +3, 4, 5, and 6 time steps from cue after reception of reward at the early timing, +3 and 4 time steps from cue after no reception of reward at the early timing (states visited (entered) before knowing whether reward was given at the late timing (= +4 time steps from cue)), +5 and 6 time steps from cue after reception of reward at the late timing, and +5 and 6 time steps from cue after no reception of reward at either the early or the late timing (in total 5 + 4 + 2 + 2 + 2 = 15 states) (Fig. 4Bc, right-top). We estimated the state values of these states (Fig. 4Bc, right-bottom), and also calculated TD-RPE (Fig. 4C, right), in similar manners to the above.
Analyses, software, and code availability
SEM (standard error of the mean) was approximated by SD (standard deviation)/√N (number of samples). Linear regression and principal component analysis (PCA) were conducted using R (functions lm and prcomp). Simulations were conducted using MATLAB, and pseudo-random numbers were generated using the rand, randn, and randperm functions. All the code will be made available on GitHub upon publication of this work in a journal.
Author contributions
Conceptualization: KM; Formal analysis: KM, TT; Investigation: KM, TT, AyK; Writing – original draft: KM; Writing – review & editing: KM, TT, AyK, ArK
Acknowledgements
The authors thank Dr. Kenji Doya for valuable suggestions. KM was supported by Grants-in-Aid for Scientific Research 23H03295 and 23K27985 from Japan Society for the Promotion of Science (JSPS) and the Naito Foundation. AyK was supported by JSPS Overseas Research Fellowships. ArK was partially funded by Digital Futures (KTH) grant and StratNeuro SRA.
References
- 1. Dopamine reward prediction errors reflect hidden-state inference across time. Nat Neurosci 20:581–589
- 2. Emergence of belief-like representations through reinforcement learning. PLoS Comput Biol 19
- 3. A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J Neurosci 16:1936–1947
- 4. A neural substrate of prediction and reward. Science 275:1593–1599
- 5. Dialogues on prediction errors. Trends Cogn Sci 12:265–272
- 6. Neuron-type-specific signals for reward and punishment in the ventral tegmental area. Nature 482:85–88
- 7. A causal link between prediction errors, dopamine neurons and learning. Nat Neurosci 16:966–973
- 8. A cellular mechanism of reward-related learning. Nature 413:67–70
- 9. Dichotomous dopaminergic control of striatal synaptic plasticity. Science 321:848–851
- 10. A critical time window for dopamine actions on the structural plasticity of dendritic spines. Science 345:1616–1620
- 11. Predictive representations can link model-based reinforcement learning to model-free mechanisms. PLoS Comput Biol 13
- 12. The hippocampus as a predictive map. Nat Neurosci 20:1643–1653
- 13. The role of prospective contingency in the control of behavior and dopamine signals during associative learning. bioRxiv
- 14. Model-based predictions for dopamine. Curr Opin Neurobiol 49:1–7
- 15. Ventral Tegmental Dopamine Neurons Participate in Reward Identity Predictions. Curr Biol 29:93–103
- 16. Reconciling reinforcement learning models with behavioral extinction and renewal: implications for addiction, relapse, and problem gambling. Psychol Rev 114:784–805
- 17. Gradual extinction prevents the return of fear: implications for the discovery of state. Front Behav Neurosci 7
- 18. Rigid reduced successor representation as a potential mechanism for addiction. Eur J Neurosci 53:3768–3790
- 19. A Reinforcement Learning Approach to Understanding Procrastination: Does Inaccurate Value Approximation Cause Irrational Postponing of a Task? Front Neurosci 15
- 20. Opponent learning with different representations in the cortico-basal ganglia pathways can develop obsession-compulsion cycle. PLoS Comput Biol 19
- 21. Learning latent structure: carving nature at its joints. Curr Opin Neurobiol 20:251–256
- 22. Learning task-state representations. Nat Neurosci 22:1544–1553
- 23. Rapid learning of predictive maps with STDP and theta phase precession. Elife 12
- 24. Learning predictive cognitive maps with spiking neurons during behavior and replays. Elife 12
- 25. Neural learning rules for generating flexible predictions and computing the successor representation. Elife 12
- 26. Latent representations in hippocampal network model co-evolve with behavioral exploration of task structure. Nat Commun 15
- 27. A Theory of Adaptive Pattern Classifiers. IEEE Transactions on Electronic Computers EC-16:299–307
- 28. Learning representations by back-propagating errors. Nature 323:533–536
- 29. Complementary roles of basal ganglia and cerebellum in learning and motor control. Curr Opin Neurobiol 10:732–739
- 30. Dissociable roles of ventral and dorsal striatum in instrumental conditioning. Science 304:452–454
- 31. Competitive learning: from interactive activation to adaptive resonance. Cognitive Science 11:23–63
- 32. The recent excitement about neural networks. Nature 337:129–132
- 33. Random synaptic feedback weights support error backpropagation for deep learning. Nat Commun 7
- 34. Towards deep learning with segregated dendrites. Elife 6
- 35. Dendritic cortical microcircuits approximate the backpropagation algorithm. Advances in Neural Information Processing Systems 31
- 36. Theories of Error Back-Propagation in the Brain. Trends Cogn Sci 23:235–250
- 37. Backpropagation and the brain. Nat Rev Neurosci 21:335–346
- 38. Burst-dependent synaptic plasticity can coordinate learning in hierarchical circuits. Nat Neurosci 24:1010–1019
- 39. Single-phase deep learning in cortico-cortical networks. Advances in Neural Information Processing Systems (NeurIPS 2022) 35
- 40. Inferring neural activity before plasticity as a foundation for learning beyond backpropagation. Nat Neurosci 27:348–358
- 41. Leveraging dendritic properties to advance machine learning and neuro-inspired computing. Curr Opin Neurobiol 85
- 42. Local online learning in recurrent networks with random feedback. Elife 8
- 43. Feasibility of dopamine as a vector-valued feedback signal in the basal ganglia. Proc Natl Acad Sci U S A 120
- 44. Learning Internal Representations by Error Propagation. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations. MIT Press, Cambridge: 318–362
- 45. Evaluating the TD model of classical conditioning. Learn Behav 40:305–319
- 46. Reinforcement Learning: An Introduction (Second Edition). MIT Press, Cambridge, MA
- 47. Synaptic mechanisms for plasticity in neocortex. Annu Rev Neurosci 32:33–55
- 48. Widespread origin of the primate mesofrontal dopamine system. Cereb Cortex 8:321–345
- 49. Dopamine Regulates Aversive Contextual Learning and Associated In Vivo Synaptic Plasticity in the Hippocampus. Cell Rep 14:1930–1939
- 50. Cognitive deficit caused by regional depletion of dopamine in prefrontal cortex of rhesus monkey. Science 205:929–932
- 51. D1 dopamine receptors in prefrontal cortex: involvement in working memory. Science 251:947–950
- 52. Dopamine-mediated stabilization of delay-period activity in a network model of prefrontal cortex. J Neurophysiol 83:1733–1750
- 53. Effects of neuromodulation in a cortical network model of object working memory dominated by recurrent inhibition. J Comput Neurosci 11:63–85
- 54. Mesocortical dopamine modulation of executive functions: beyond working memory. Psychopharmacology (Berl) 188:567–585
- 55. Midbrain dopaminergic innervation of the hippocampus is sufficient to modulate formation of aversive memories. Proc Natl Acad Sci U S A 118
- 56. A Unified Framework for Dopamine Signals across Timescales. Cell 183:1600–1616
- 57. Temporal difference models and reward-related learning in the human brain. Neuron 38:329–337
- 58. Dopaminergic modulation of long-term synaptic plasticity in rat prefrontal neurons. Cereb Cortex 13:1251–1256
- 59. Ventral tegmental area dopamine projections to the hippocampus trigger long-term potentiation and contextual learning. Nat Commun 15
- 60. Expectancy-related changes in firing of dopamine neurons depend on orbitofrontal cortex. Nat Neurosci 14:1590–1597
- 61. The Medial Prefrontal Cortex Shapes Dopamine Reward Prediction Errors under State Uncertainty. Neuron 98:616–629
- 62. Expectancy-related changes in firing of dopamine neurons depend on hippocampus. bioRxiv https://doi.org/10.1101/2023.07.19.549728
- 63. Representation of action-specific reward values in the striatum. Science 310:1337–1340
- 64. Circuit Architecture of VTA Dopamine Neurons Revealed by Systematic Input-Output Mapping. Cell 162:622–634
- 65. Bidirectional regulation of synaptic plasticity in the basolateral amygdala induced by the D1-like family of dopamine receptors and group II metabotropic glutamate receptors. J Physiol 592:4329–4351
- 66. Dopamine projections to the basolateral amygdala drive the encoding of identity-specific reward memories. Nat Neurosci 27:728–736
- 67. Gamma Oscillations in the Basolateral Amygdala: Localization, Microcircuitry, and Behavioral Correlates. J Neurosci 41:6087–6101
- 68. Synaptic and behavioral profile of multiple glutamatergic inputs to the nucleus accumbens. Neuron 76:790–803
- 69. Persistent enhancement of basolateral amygdala-dorsomedial striatum synapses causes compulsive-like behaviors in mice. Nat Commun 15
- 70. Abstract Context Representations in Primate Amygdala and Prefrontal Cortex. Neuron 87:869–881
- 71. Dopamine neuron ensembles signal the content of sensory prediction errors. Elife 8
- 72. A feature-specific prediction error model explains dopaminergic heterogeneity. Nat Neurosci 27:1574–1586
- 73. Distributional coding of associative learning in discrete populations of midbrain dopamine neurons. Cell Rep 43
- 74. Whole-brain mapping of direct inputs to midbrain dopamine neurons. Neuron 74:858–873
- 75. Cerebellar modulation of the reward circuitry and social behavior. Science 363
- 76. A theory of cerebellar cortex. J Physiol 202:437–470
- 77. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. arXiv https://doi.org/10.48550/arXiv.1406.1078
- 78. The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 6:107–30
- 79. Long short-term memory. Neural Comput 9:1735–1780
- 80. Cortical microcircuits as gated-recurrent neural networks. Advances in Neural Information Processing Systems
- 81. Spike timing dependent plasticity: a consequence of more fundamental learning rules. Front Comput Neurosci 4
- 82. A triplet spike-timing-dependent plasticity model generalizes the Bienenstock-Cooper-Munro rule to higher-order spatiotemporal correlations. Proc Natl Acad Sci U S A 108:19383–19388
- 83. Pyramidal neuron as two-layer neural network. Neuron 37:989–999
- 84. Possible role of dendritic compartmentalization in the spatial working memory circuit. J Neurosci 28:7699–7724
- 85. Supervised and unsupervised learning with two sites of synaptic integration. J Comput Neurosci 11:207–215
- 86. Local plasticity rules can learn deep representations using self-supervised contrastive predictions. Advances in Neural Information Processing Systems (NeurIPS 2021) 34
- 87. Gradient Following Without Back-Propagation in Layered Networks. Proceedings of the First Annual International Conference on Neural Networks: 629–636
- 88. A more biologically plausible learning rule for neural networks. Proc Natl Acad Sci U S A 88:4433–4437
- 89. A more biologically plausible learning rule than backpropagation applied to a network model of cortical area 7a. Cereb Cortex 1:293–307
- 90. Learning efficient backprojections across cortical hierarchies in real time. Nature Machine Intelligence 6:619–630
- 91. Noise in the nervous system. Nat Rev Neurosci 9:292–303
- 92. Chaotic oscillations and bifurcations in squid giant axons. Chaos. Princeton University Press
- 93. Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science 274:1724–1726
- 94. Learning to live with Dale’s principle: ANNs with separate excitatory and inhibitory units. bioRxiv https://doi.org/10.1101/2020.11.02.364968
- 95. Learning better with Dale’s Law: A Spectral Perspective. bioRxiv https://doi.org/10.1101/2023.06.28.546924
- 96. Linking Connectivity, Dynamics, and Computations in Low-Rank Recurrent Neural Networks. Neuron 99:609–623
- 97. Modulation of Striatal Projection Systems by Dopamine. Annu Rev Neurosci 34:441–466
- 98. Opponent actor learning (OpAL): modeling interactive effects of striatal dopamine on reinforcement learning and choice incentive. Psychol Rev 121:337–366
- 99. Learning Reward Uncertainty in the Basal Ganglia. PLoS Comput Biol 12
- 100. An opponent striatal circuit for distributional reinforcement learning. bioRxiv https://doi.org/10.1101/2024.01.02.573966
- 101. Differential innervation of direct- and indirect-pathway striatal projection neurons. Neuron 79:347–360
- 102. Differential cortical activation of the striatal direct and indirect pathway cells: reconciling the anatomical and optogenetic results by using a computational method. J Neurophysiol 112:120–146
- 103. Topographic precision in sensory and motor corticostriatal projections varies across cell type and cortical area. Nat Commun 9
- 104. Differential striatal axonal arborizations of the intratelencephalic and pyramidal-tract neurons: analysis of the data in the MouseLight database. Front Neural Circuits 13
- 105. Distributed and Mixed Information in Monosynaptic Inputs to Dopamine Neurons. Neuron 91:1374–1389
- 106. A Dual Role Hypothesis of the Cortico-Basal-Ganglia Pathways: Opponency and Temporal Difference Through Dopamine and Adenosine. Front Neural Circuits 12
- 107. Learning to express reward prediction error-like dopaminergic activity requires plastic representations of time. Nat Commun 15
Copyright
© 2025, Tsurumi et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.