Rewardbased training of recurrent neural networks for cognitive and valuebased tasks
 Cited
 3
 Views
 2,061
 Comments
 0
Abstract
Trained neural network models, which exhibit features of neural activity recorded from behaving animals, may provide insights into the circuit mechanisms of cognitive functions through systematic analysis of network activity and connectivity. However, in contrast to the graded error signals commonly used to train networks through supervised learning, animals learn from reward feedback on definite actions through reinforcement learning. Reward maximization is particularly relevant when optimal behavior depends on an animal’s internal judgment of confidence or subjective preferences. Here, we implement rewardbased training of recurrent neural networks in which a value network guides learning by using the activity of the decision network to predict future reward. We show that such models capture behavioral and electrophysiological findings from wellknown experimental paradigms. Our work provides a unified framework for investigating diverse cognitive and valuebased computations, and predicts a role for value representation that is essential for learning, but not executing, a task.
https://doi.org/10.7554/eLife.21492.001eLife digest
A major goal in neuroscience is to understand the relationship between an animal’s behavior and how this is encoded in the brain. Therefore, a typical experiment involves training an animal to perform a task and recording the activity of its neurons – brain cells – while the animal carries out the task. To complement these experimental results, researchers “train” artificial neural networks – simplified mathematical models of the brain that consist of simple neuronlike units – to simulate the same tasks on a computer. Unlike real brains, artificial neural networks provide complete access to the “neural circuits” responsible for a behavior, offering a way to study and manipulate the behavior in the circuit.
One open issue about this approach has been the way in which the artificial networks are trained. In a process known as reinforcement learning, animals learn from rewards (such as juice) that they receive when they choose actions that lead to the successful completion of a task. By contrast, the artificial networks are explicitly told the correct action. In addition to differing from how animals learn, this limits the types of behavior that can be studied using artificial neural networks.
Recent advances in the field of machine learning that combine reinforcement learning with artificial neural networks have now allowed Song et al. to train artificial networks to perform tasks in a way that mimics the way that animals learn. The networks consisted of two parts: a “decision network” that uses sensory information to select actions that lead to the greatest reward, and a “value network” that predicts how rewarding an action will be. Song et al. found that the resulting artificial “brain activity” closely resembled the activity found in the brains of animals, confirming that this method of training artificial neural networks may be a useful tool for neuroscientists who study the relationship between brains and behavior.
The training method explored by Song et al. represents only one step forward in developing artificial neural networks that resemble the real brain. In particular, neural networks modify connections between units in a vastly different way to the methods used by biological brains to alter the connections between neurons. Future work will be needed to bridge this gap.
https://doi.org/10.7554/eLife.21492.002Introduction
A major challenge in uncovering the neural mechanisms underlying complex behavior is our incomplete access to relevant circuits in the brain. Recent work has shown that model neural networks optimized for a wide range of tasks, including visual object recognition (Cadieu et al., 2014; Yamins et al., 2014; Hong et al., 2016), perceptual decisionmaking and working memory (Mante et al., 2013; Barak et al., 2013; Carnevale et al., 2015; Song et al., 2016; Miconi, 2016), timing and sequence generation (Laje and Buonomano, 2013; Rajan et al., 2015), and motor reach (Hennequin et al., 2014; Sussillo et al., 2015), can reproduce important features of neural activity recorded in numerous cortical areas of behaving animals. The analysis of such circuits, whose activity and connectivity are fully known, has therefore reemerged as a promising tool for understanding neural computation (Zipser and Andersen, 1988; Sussillo, 2014; Gao and Ganguli, 2015). Constraining network training with tasks for which detailed neural recordings are available may also provide insights into the principles that govern learning in biological circuits (Sussillo et al., 2015; Song et al., 2016; Brea and Gerstner, 2016).
Previous applications of this approach to 'cognitivetype' behavior such as perceptual decisionmaking and working memory have focused on supervised learning from graded error signals. Animals, however, learn to perform specific tasks from reward feedback provided by the experimentalist in response to definite actions, i.e., through reinforcement learning (Sutton and Barto, 1998). Unlike in supervised learning where the network is given the correct response on each trial in the form of a continuous target output to be followed, reinforcement learning provides evaluative feedback to the network on whether each selected action was 'good' or 'bad.' This form of feedback allows for a graded notion of behavioral correctness that is distinct from the graded difference between the network’s output and the target output in supervised learning. For the purposes of using model networks to generate hypotheses about neural mechanisms, this is particularly relevant in tasks where the optimal behavior depends on an animal’s internal state or subjective preferences. In a perceptual decisionmaking task with postdecision wagering, for example, on a random half of the trials the animal can opt for a sure choice that results in a small (compared to the correct choice) but certain reward (Kiani and Shadlen, 2009). The optimal decision regarding whether or not to select the sure choice depends not only on the task condition, such as the proportion of coherently moving dots, but also on the animal’s own confidence in its decision during the trial. Learning to make this judgment cannot be reduced to reproducing a predetermined target output without providing the full probabilistic solution to the network. It can be learned in a natural, ethologically relevant way, however, by choosing the actions that result in greatest overall reward; through training, the network learns from the reward contingencies alone to condition its output on its internal estimate of the probability that its answer is correct.
Meanwhile, supervised learning is often not appropriate for valuebased, or economic, decisionmaking where the 'correct' judgment depends explicitly on rewards associated with different actions, even for identical sensory inputs (PadoaSchioppa and Assad, 2006). Although such tasks can be transformed into a perceptual decisionmaking task by providing the associated rewards as inputs, this sheds little light on how valuebased decisionmaking is learned by the animal because it conflates external with 'internal,' learned inputs. More fundamentally, reward plays a central role in all types of animal learning (Sugrue et al., 2005). Explicitly incorporating reward into network training is therefore a necessary step toward elucidating the biological substrates of learning, in particular rewarddependent synaptic plasticity (Seung, 2003; Soltani et al., 2006; Izhikevich, 2007; Urbanczik and Senn, 2009; Frémaux et al., 2010; Soltani and Wang, 2010; Hoerzer et al., 2014; Brosch et al., 2015; Friedrich and Lengyel, 2016) and the role of different brain structures in learning (Frank and Claus, 2006).
In this work, we build on advances in recurrent policy gradient reinforcement learning, specifically the application of the REINFORCE algorithm (Williams, 1992; Baird and Moore, 1999; Sutton et al., 2000; Baxter and Bartlett, 2001; Peters and Schaal, 2008) to recurrent neural networks (RNNs) (Wierstra et al., 2009), to demonstrate rewardbased training of RNNs for several wellknown experimental paradigms in systems neuroscience. The networks consist of two modules in an 'actorcritic' architecture (Barto et al., 1983; Grondman et al., 2012), in which a decision network uses inputs provided by the environment to select actions that maximize reward, while a value network uses the selected actions and activity of the decision network to predict future reward and guide learning. We first present networks trained for tasks that have been studied previously using various forms of supervised learning (Mante et al., 2013; Barak et al., 2013; Song et al., 2016); they are characterized by 'simple' inputoutput mappings in which the correct response for each trial depends only on the task condition, and include perceptual decisionmaking, contextdependent integration, multisensory integration, and parametric working memory tasks. We then show results for tasks in which the optimal behavior depends on the animal’s internal judgment of confidence or subjective preferences, specifically a perceptual decisionmaking task with postdecision wagering (Kiani and Shadlen, 2009) and a valuebased economic choice task (PadoaSchioppa and Assad, 2006). Interestingly, unlike for the other tasks where we focus on comparing the activity of units in the decision network to neural recordings in the dorsolateral prefrontal and posterior parietal cortex of animals performing the same tasks, for the economic choice task we show that the activity of the value network exhibits a striking resemblance to neural recordings from the orbitofrontal cortex (OFC), which has long been implicated in the representation of rewardrelated signals (Wallis, 2007).
An interesting feature of our REINFORCEbased model is that a reward baseline—in this case, the output of a recurrently connected value network (Wierstra et al., 2009)—is essential for learning, but not for executing the task, because the latter depends only on the decision network. Importantly, learning can sometimes still occur without the value network but is much more unreliable. It is sometimes observed in experiments that rewardmodulated structures in the brain such as the basal ganglia or OFC are necessary for learning or adapting to a changing environment, but not for executing a previously learned skill (Turner and Desmurget, 2010; Schoenbaum et al., 2011; Stalnaker et al., 2015). This suggests that one possible role for such circuits may be representing an accurate baseline to guide learning. Moreover, since confidence is closely related to expected reward in many cognitive tasks, the explicit computation of expected reward by the value network provides a concrete, learningbased rationale for confidence estimation as a ubiquitous component of decisionmaking (Kepecs et al., 2008; Wei and Wang, 2015), even when it is not strictly required for performing the task.
Conceptually, the formulation of behavioral tasks in the language of reinforcement learning presented here is closely related to the solution of partially observable Markov decision processes (POMDPs) (Kaelbling et al., 1998) using either modelbased belief states (Rao, 2010) or modelfree working memory (Todd et al., 2008). Indeed, as in Dayan and Daw, (2008) one of the goals of this work is to unify related computations into a common language that is applicable to a wide range of tasks in systems neuroscience. Such policies can also be compared more directly to behaviorally 'optimal' solutions when they are known, for instance to the signal detection theory account of perceptual decisionmaking (Gold and Shadlen, 2007). Thus, in addition to expanding the range of tasks and neural mechanisms that can be studied with trained RNNs, our work provides a convenient framework for the study of cognitive and valuebased computations in the brain, which have often been viewed from distinct perspectives but in fact arise from the same reinforcement learning paradigm.
Results
Policy gradient reinforcement learning for behavioral tasks
For concreteness, we illustrate the following in the context of a simplified perceptual decisionmaking task based on the random dots motion (RDM) discrimination task as described in Kiani et al. (2008) (Figure 1A). In its simplest form, in an RDM task the monkey must maintain fixation until a 'go' cue instructs the monkey to indicate its decision regarding the direction of coherently moving dots on the screen. Thus the three possible actions available to the monkey at any given time are fixate, choose left, or choose right. The true direction of motion, which can be considered a state of the environment, is not known to the monkey with certainty, i.e., is partially observable. The monkey must therefore use the noisy sensory evidence to infer the direction in order to select the correct response at the end of the trial. Breaking fixation early results in a negative reward in the form of a timeout, while giving the correct response after the fixation cue is extinguished results in a positive reward in the form of juice. Typically, there is neither a timeout nor juice for an incorrect response during the decision period, corresponding to a 'neutral' reward of zero. The goal of this section is to give a general description of such tasks and how an RNN can learn a behavioral policy for choosing actions at each time to maximize its cumulative reward.
Consider a typical interaction between an experimentalist and animal, which we more generally call the environment $\mathcal{E}$ and agent $\mathcal{A}$, respectively (Figure 1B). At each time $t$ the agent chooses to perform actions ${\mathbf{\mathbf{a}}}_{t}$ after observing inputs ${\mathbf{\mathbf{u}}}_{t}$ provided by the environment, and the probability of choosing actions ${\mathbf{\mathbf{a}}}_{t}$ is given by the agent’s policy ${\pi}_{\theta}\left({\mathbf{\mathbf{a}}}_{t}\right{\mathbf{\mathbf{u}}}_{1:t})$ with parameters $\theta $. Here the policy is implemented as the output of an RNN, so that $\theta $ comprises the connection weights, biases, and initial state of the decision network. The policy at time $t$ can depend on all past and current inputs ${\mathbf{\mathbf{u}}}_{1:t}=({\mathbf{\mathbf{u}}}_{1},{\mathbf{\mathbf{u}}}_{2},\mathrm{\dots},{\mathbf{\mathbf{u}}}_{t})$, allowing the agent to integrate sensory evidence or use working memory to perform the task. The exception is at $t=0$, when the agent has yet to interact with the environment and selects its actions 'spontaneously' according to ${\pi}_{\theta}\left({\mathbf{\mathbf{a}}}_{0}\right)$. We note that, if the inputs give exact information about the environmental state ${\mathbf{\mathbf{s}}}_{t}$, i.e., if ${\mathbf{\mathbf{u}}}_{t}={\mathbf{\mathbf{s}}}_{t}$, then the environment can be described by a Markov decision process. In general, however, the inputs only provide partial information about the environmental states, requiring the network to accumulate evidence over time to determine the state of the world. In this work we only consider cases where the agent chooses one out of ${N}_{a}$ possible actions at each time, so that ${\pi}_{\theta}\left({\mathbf{\mathbf{a}}}_{t}\right{\mathbf{\mathbf{u}}}_{1:t})$ for each $t$ is a discrete, normalized probability distribution over the possible actions ${a}_{1},\mathrm{\dots},{a}_{{N}_{a}}$. More generally, ${\mathbf{\mathbf{a}}}_{t}$ can implement several distinct actions or even continuous actions by representing, for example, the means of Gaussian distributions (Peters and Schaal, 2008; Wierstra et al., 2009). After each set of actions by the agent at time $t$ the environment provides a reward (or special observable) ${\rho}_{t+1}$ at time $t+1$, which the agent attempts to maximize in the sense described below.
In the case of the example RDM task above (Figure 1A), the environment provides (and the agent receives) as inputs a fixation cue and noisy evidence for two choices L(eft) and R(ight) during a variablelength stimulus presentation period. The strength of evidence, or the difference between the evidence for L and R, is called the coherence, and in the actual RDM experiment corresponds to the percentage of dots moving coherently in one direction on the screen. The agent chooses to perform one of ${N}_{a}=3$ actions at each time: fixate (${a}_{t}=\text{F}$), choose L (${a}_{t}=\text{L}$), or choose R (${a}_{t}=\text{R}$). Here, the agent must choose F as long as the fixation cue is on, and then, when the fixation cue is turned off to indicate that the agent should make a decision, correctly choose L or R depending on the sensory evidence. Indeed, for all tasks in this work we required that the network 'make a decision' (i.e., break fixation to indicate a choice at the appropriate time) on at least 99% of the trials, whether the response was correct or not. A trial ends when the agent chooses L or R regardless of the task epoch: breaking fixation early before the go cue results in an aborted trial and a negative reward ${\rho}_{t}=1$, while a correct decision is rewarded with ${\rho}_{t}=+1$. Making the wrong decision results in no reward, ${\rho}_{t}=0$. For the zerocoherence condition the agent is rewarded randomly on half the trials regardless of its choice. Otherwise the reward is always ${\rho}_{t}=0$.
Formally, a trial proceeds as follows. At time $t=0$, the environment is in state ${\mathbf{\mathbf{s}}}_{0}$ with probability $\mathcal{E}\left({\mathbf{\mathbf{s}}}_{0}\right)$. The state ${\mathbf{\mathbf{s}}}_{0}$ can be considered the starting time (i.e., $t=0$) and 'task condition,' which in the RDM example consists of the direction of motion of the dots (i.e., whether the correct response is L or R) and the coherence of the dots (the difference between evidence for L and R). The time component of the state, which is updated at each step, allows the environment to present different inputs to the agent depending on the task epoch. The true state ${\mathbf{\mathbf{s}}}_{0}$ (such as the direction of the dots) is only partially observable to the agent, so that the agent must instead infer the state through inputs ${\mathbf{\mathbf{u}}}_{t}$ provided by the environment during the course of the trial. As noted previously, the agent initially chooses actions ${\mathbf{\mathbf{a}}}_{0}$ with probability ${\pi}_{\theta}\left({\mathbf{\mathbf{a}}}_{0}\right)$. The networks in this work almost always begin by choosing F, or fixation.
At time $t=1$, the environment, depending on its previous state ${\mathbf{\mathbf{s}}}_{0}$ and the agent’s action ${\mathbf{\mathbf{a}}}_{0}$, transitions to state ${\mathbf{\mathbf{s}}}_{1}$ with probability $\mathcal{E}\left({\mathbf{\mathbf{s}}}_{1}\right{\mathbf{\mathbf{s}}}_{0},{\mathbf{\mathbf{a}}}_{0})$ and generates reward ${\rho}_{1}$. In the perceptual decisionmaking example, only the time advances since the trial condition remains constant throughout. From this state the environment generates observable ${\mathbf{\mathbf{u}}}_{1}$ with a distribution given by $\mathcal{E}\left({\mathbf{\mathbf{u}}}_{1}\right{\mathbf{\mathbf{s}}}_{1})$. If $t=1$ were in the stimulus presentation period, for example, ${\mathbf{\mathbf{u}}}_{1}$ would provide noisy evidence for L or R, as well as the fixation cue. In response, the agent, depending on the inputs ${\mathbf{\mathbf{u}}}_{1}$ it receives from the environment, chooses actions ${\mathbf{\mathbf{a}}}_{1}$ with probability ${\pi}_{\theta}\left({\mathbf{\mathbf{a}}}_{1}\right{\mathbf{\mathbf{u}}}_{1:1})={\pi}_{\theta}\left({\mathbf{\mathbf{a}}}_{1}\right{\mathbf{\mathbf{u}}}_{1})$. The environment, depending on its previous states ${\mathbf{\mathbf{s}}}_{0:1}=({\mathbf{\mathbf{s}}}_{0},{\mathbf{\mathbf{s}}}_{1})$ and the agent’s previous actions ${a}_{0:1}=({\mathbf{\mathbf{a}}}_{0},{\mathbf{\mathbf{a}}}_{1})$, then transitions to state ${\mathbf{\mathbf{s}}}_{2}$ with probability $\mathcal{E}\left({\mathbf{\mathbf{s}}}_{2}\right{\mathbf{\mathbf{s}}}_{0:1},{\mathbf{\mathbf{a}}}_{0:1})$ and generates reward ${\rho}_{2}$. These steps are repeated until the end of the trial at time $T$. Trials can terminate at different times (e.g., for breaking fixation early or because of variable stimulus durations), so that $T$ in the following represents the maximum length of a trial. In order to emphasize that rewards follow actions, we adopt the convention in which the agent performs actions at $t=0,\mathrm{\dots},T$ and receives rewards at $t=1,\mathrm{\dots},T+1$.
The goal of the agent is to maximize the sum of expected future rewards at time $t=0$, or expected return
where the expectation ${\mathbb{E}}_{H}$ is taken over all possible trial histories $H=({\mathbf{\mathbf{s}}}_{0:T+1},{\mathbf{\mathbf{u}}}_{1:T},{\mathbf{\mathbf{a}}}_{0:T})$ consisting of the states of the environment, the inputs given to the agent, and the actions of the agent. In practice, the expectation value in Equation 1 is estimated by performing ${N}_{\text{trials}}$ trials for each policy update, i.e., with a Monte Carlo approximation. The expected return depends on the policy and hence parameters $\theta $, and we use Adam stochastic gradient descent (SGD) (Kingma and Ba, 2015) with gradient clipping (Graves, 2013; Pascanu et al., 2013b) to find the parameters that maximize this reward (Materials and methods).
More specifically, after every ${N}_{\text{trials}}$ trials the decision network uses gradient descent to update its parameters in a direction that minimizes an objective function ${\mathcal{L}}^{\pi}$ of the form
with respect to the connection weights, biases, and initial state of the decision network, which we collectively denote as $\theta $. Here ${\mathrm{\Omega}}_{n}^{\pi}\left(\theta \right)$ can contain any regularization terms for the decision network, for instance an entropy term to control the degree of exploration (Xu et al., 2015). The key gradient ${\nabla}_{\theta}{J}_{n}\left(\theta \right)$ is given for each trial $n$ by the REINFORCE algorithm (Williams, 1992; Baird and Moore, 1999; Sutton et al., 2000; Baxter and Bartlett, 2001; Peters and Schaal, 2008; Wierstra et al., 2009) as
where ${\mathbf{\mathbf{r}}}_{1:t}^{\pi}$ are the firing rates of the decision network units up to time $t$, ${v}_{\varphi}$ denotes the value function as described below, and the gradient ${\mathrm{\nabla}}_{\theta}\mathrm{log}{\pi}_{\theta}\left({\mathbf{a}}_{t}{\mathbf{u}}_{1:t}\right)$, known as the eligibility, [and likewise ${\nabla}_{\theta}{\mathrm{\Omega}}_{n}^{\pi}\left(\theta \right)$] is computed by backpropagation through time (BPTT) (Rumelhart et al., 1986) for the selected actions ${\mathbf{\mathbf{a}}}_{t}$. The sum over rewards in large brackets only runs over $\tau =t,\mathrm{\dots},T$, which reflects the fact that present actions do not affect past rewards. In this form the terms in the gradient have the intuitive property that they are nonzero only if the actual return deviates from what was predicted by the baseline. It is worth noting that this form of the value function (with access to the selected action) can, in principle, lead to suboptimal policies if the value network’s predictions become perfect before the optimal decision policy is learned; we did not find this to be the case in our simulations.
The reward baseline is an important feature in the success of almost all REINFORCEbased algorithms, and is here represented by a second RNN ${v}_{\varphi}$ with parameters $\varphi $ in addition to the decision network ${\pi}_{\theta}$ (to be precise, the value function is the readout of the value network). This baseline network, which we call the value network, uses the selected actions ${\mathbf{\mathbf{a}}}_{1:t}$ and activity of the decision network ${\mathbf{\mathbf{r}}}_{1:t}^{\pi}$ to predict the expected return at each time $t=1,\mathrm{\dots},T$; the value network also predicts the expected return at $t=0$ based on its own initial states, with the understanding that ${\mathbf{\mathbf{a}}}_{1:0}=\mathrm{\varnothing}$ and ${\mathbf{\mathbf{r}}}_{1:0}^{\pi}=\mathrm{\varnothing}$ are empty sets. The value network is trained by minimizing a second objective function
every ${N}_{\text{trials}}$ trials, where ${\mathrm{\Omega}}_{n}^{v}\left(\varphi \right)$ denotes any regularization terms for the value network. The necessary gradient ${\nabla}_{\varphi}{E}_{n}\left(\varphi \right)$ [and likewise ${\nabla}_{\varphi}{\mathrm{\Omega}}_{n}^{v}\left(\varphi \right)$] is again computed by BPTT.
Decision and value recurrent neural networks
The policy probability distribution over actions ${\pi}_{\theta}\left({\mathbf{\mathbf{a}}}_{t}\right{\mathbf{\mathbf{u}}}_{1:t})$ and scalar baseline ${v}_{\varphi}({\mathbf{\mathbf{a}}}_{1:t},{\mathbf{\mathbf{r}}}_{1:t}^{\pi})$ are each represented by an RNN of $N$ firingrate units ${\mathbf{\mathbf{r}}}^{\pi}$ and ${\mathbf{\mathbf{r}}}^{v}$, respectively, where we interpret each unit as the mean firing rate of a group of neurons. In the case where the agent chooses a single action at each time $t$, the activity of the decision network determines ${\pi}_{\theta}\left({\mathbf{\mathbf{a}}}_{t}\right{\mathbf{\mathbf{u}}}_{1:t})$ through a linear readout followed by softmax normalization:
for $k=1,\mathrm{\dots},{N}_{a}$. Here ${W}_{\text{out}}^{\pi}$ is an ${N}_{a}\times N$ matrix of connection weights from the units of the decision network to the ${N}_{a}$ linear readouts ${\mathbf{\mathbf{z}}}_{t}$, and ${\mathbf{\mathbf{b}}}_{\text{out}}^{\pi}$ are ${N}_{a}$ biases. Action selection is implemented by randomly sampling from the probability distribution in Equation 7, and constitutes an important difference from previous approaches to training RNNs for cognitive tasks (Mante et al., 2013; Carnevale et al., 2015; Song et al., 2016; Miconi, 2016), namely, here the final output of the network (during training) is a specific action, not a graded decision variable. We consider this sampling as an abstract representation of the downstream action selection mechanisms present in the brain, including the role of noise in implicitly realizing stochastic choices with deterministic outputs (Wang, 2002, 2008). Meanwhile, the activity of the value network predicts future returns through a linear readout
where ${W}_{\text{out}}^{v}$ is an $1\times N$ matrix of connection weights from the units of the value network to the single linear readout ${v}_{\varphi}$, and ${b}_{\text{out}}^{v}$ is a bias term.
In order to take advantage of recent developments in training RNNs [in particular, addressing the problem of vanishing gradients (Bengio et al., 1994)] while retaining intepretability, we use a modified form of Gated Recurrent Units (GRUs) (Cho et al., 2014; Chung et al., 2014) with a thresholdlinear '$f$$I$' curve ${\left[x\right]}_{+}=max(0,x)$ to obtain positive, nonsaturating firing rates. Since firing rates in cortex rarely operate in the saturating regime, previous work (Sussillo et al., 2015) used an additional regularization term to prevent saturation in common nonlinearities such as the hyperbolic tangent; the thresholdlinear activation function obviates such a need. These units are thus leaky, thresholdlinear units with dynamic time constants and gated recurrent inputs. The equations that describe their dynamics can be derived by a naïve discretization of the following continuoustime equations for the $N$ currents $\mathbf{\mathbf{x}}$ and corresponding rectifiedlinear firing rates $\mathbf{\mathbf{r}}$:
Here $\stackrel{.}{\mathbf{x}}=d\mathbf{x}/dt$ is the derivative of $\mathbf{\mathbf{x}}$ with respect to time, $\odot $ denotes elementwise multiplication, $\mathrm{s}\mathrm{i}\mathrm{g}\mathrm{m}\mathrm{o}\mathrm{i}\mathrm{d}(x)=[1+{e}^{x}{]}^{1}$ is the logistic sigmoid, ${\mathbf{\mathbf{b}}}^{\lambda}$, ${\mathbf{\mathbf{b}}}^{\gamma}$, and $\mathbf{\mathbf{b}}$ are biases, $\mathit{\bm{\xi}}$ are $N$ independent Gaussian white noise processes with zero mean and unit variance, and ${\sigma}_{\text{rec}}^{2}$ controls the size of this noise. The multiplicative gates $\mathit{\bm{\lambda}}$ dynamically modulate the overall time constant $\tau $ for network units, while the $\mathit{\bm{\gamma}}$ control the recurrent inputs. The $N\times N$ matrices ${W}_{\text{rec}}$, ${W}_{\text{rec}}^{\lambda}$, and ${W}_{\text{rec}}^{\gamma}$ are the recurrent weight matrices, while the $N\times {N}_{\text{in}}$ matrices ${W}_{\text{in}}$, ${W}_{\text{in}}^{\lambda}$, and ${W}_{\text{in}}^{\gamma}$ are connection weights from the ${N}_{\text{in}}$ inputs $\mathbf{\mathbf{u}}$ to the $N$ units of the network. We note that in the case where $\mathit{\bm{\lambda}}\to 1$ and $\mathit{\bm{\gamma}}\to 1$ the equations reduce to 'simple' leaky thresholdlinear units without the modulation of the time constants or gating of inputs. We constrain the recurrent connection weights (Song et al., 2016) so that the overall connection probability is ${p}_{c}$; specifically, the number of incoming connections for each unit, or indegree $K$, was set to $K={p}_{c}N$ (see Table 1 for a list of all parameters).
The result of discretizing Equations 9–12, as well as details on initializing the network parameters, are given in Materials and methods. We successfully trained networks with time steps $\mathrm{\Delta}t=\text{1 ms}$, but for computational convenience all of the networks in this work were trained and run with $\mathrm{\Delta}t=\text{10 ms}$. We note that, for typical tasks in systems neuroscience lasting on the order of several seconds, this already implies trials lasting hundreds of time steps. Unless noted otherwise in the text, all networks were trained using the parameters listed in Table 1.
While the inputs to the decision network ${\pi}_{\theta}$ are determined by the environment, the value network always receives as inputs the activity of the decision network ${\mathbf{\mathbf{r}}}^{\pi}$, together with information about which actions were actually selected at each time step (Figure 1B). The value network serves two purposes: first, the output of the value network is used as the baseline in the REINFORCE gradient, Equation 3, to reduce the variance of the gradient estimate (Williams, 1992; Baird and Moore, 1999; Baxter and Bartlett, 2001; Peters and Schaal, 2008); second, since policy gradient reinforcement learning does not explicitly use a value function but value information is nevertheless implicitly contained in the policy, the value network serves as an explicit and potentially nonlinear readout of this information. In situations where expected reward is closely related to confidence, this may explain, for example, certain disassociations between perceptual decisions and reports of the associated confidence (Lak et al., 2014).
A reward baseline, which allows the decision network to update its parameters based on a relative quantity akin to prediction error (Schultz et al., 1997; Bayer and Glimcher, 2005) rather than absolute reward magnitude, is essential to many learning schemes, especially those based on REINFORCE. Indeed, it has been suggested that in general such a baseline should be not only taskspecific but stimulus (taskcondition)specific (Frémaux et al., 2010; Engel et al., 2015; Miconi, 2016), and that this information may be represented in OFC (Wallis, 2007) or basal ganglia (Doya, 2000). Previous schemes, however, did not propose how this baseline critic may be instantiated, instead implementing it algorithmically. Here we use a simple neural implementation of the baseline that automatically depends on the stimulus and thus does not require the learning system to have access to the true trial type, which in general is not known with certainty to the agent.
Tasks with simple inputoutput mappings
The training procedure described in the previous section can be used for a variety of tasks, and results in networks that qualitatively reproduce both behavioral and electrophysiological findings from experiments with behaving animals. For the example perceptual decisionmaking task above, the trained network learns to integrate the sensory evidence to make the correct decision about which of two noisy inputs is larger (Figure 1C). This and additional networks trained for the same task were able to reach the target performance in $\sim $7000 trials starting from completely random connection weights, and moreover the networks learned the 'core' task after $\sim $2000 trials (Figure 1—figure supplement 1). As with monkeys performing the task, longer stimulus durations allow the network to improve its performance by continuing to integrate the incoming sensory evidence (Wang, 2002; Kiani et al., 2008). Indeed, the output of the value network shows that the expected reward (in this case equivalent to confidence) is modulated by stimulus difficulty (Figure 1E). Prior to the onset of the stimulus, the expected reward is the same for all trial conditions and approximates the overall reward rate; incoming sensory evidence then allows the network to distinguish its chances of success.
Sorting the activity of individual units in the network by the signed coherence (the strength of the evidence, with negative values indicating evidence for L and positive for R) also reveals coherencedependent ramping activity (Figure 1D) as observed in neural recordings from numerous perceptual decisionmaking experiments, e.g., Roitman and Shadlen (2002). This pattern of activity illustrates why a nonlinear readout by the value network is useful: expected return is computed by performing an 'absolute value'like operation on the accumulated evidence (plus shifts), as illustrated by the overlap of the expected return for positive and negativecoherence trials (Figure 1E).
The reaction time as a function of coherence in the reactiontime version of the same task, in which the go cue coincides with the time of stimulus onset, is also shown in Figure 1—figure supplement 2 and may be compared, e.g., to Wang (2002); Mazurek et al. (2003); Wong and Wang (2006). We note that in many neural models [e.g., Wang (2002); Wong and Wang (2006)] a 'decision' is made when the output reaches a fixed threshold. Indeed, when networks are trained using supervised learning (Song et al., 2016), the decision threshold is imposed retroactively and has no meaning during training; since the outputs are continuous, the speedaccuracy tradeoff is also learned in the space of continuous error signals. Here, the time at which the network commits to a decision is unambiguously given by the time at which the selected action is L or R. Thus the appropriate speedaccuracy tradeoff is learned in the space of concrete actions, illustrating the desirability of using rewardbased training of RNNs when modeling reactiontime tasks. Learning curves for this and additional networks trained for the same reactiontime task are shown in Figure 1—figure supplement 3.
In addition to the example task from the previous section, we trained networks for three wellknown behavioral paradigms in which the correct, or optimal, behavior is (pre)determined on each trial by the task condition alone. Similar tasks have previously been addressed with several different forms of supervised learning, including FORCE (Sussillo and Abbott, 2009; Carnevale et al., 2015), Hessianfree (Martens and Sutskever, 2011; Mante et al., 2013; Barak et al., 2013), and stochastic gradient descent (Pascanu et al., 2013b; Song et al., 2016), so that the results shown in Figure 2 are presented as confirmation that the same tasks can also be learned using reward feedback on definite actions alone. For all three tasks the prestimulus fixation period was 750 ms; the networks had to maintain fixation until the start of a 500 ms 'decision' period, which was indicated by the extinction of the fixation cue. At this time the network was required to choose one of two alternatives to indicate its decision and receive a reward of +1 for a correct response and 0 for an incorrect response; otherwise, the networks received a reward of −1.
The contextdependent integration task (Figure 2A) is based on Mante et al. (2013), in which monkeys were required to integrate one type of stimulus (the motion or color of the presented dots) while ignoring the other depending on a context cue. In training the network, we included both the 750 ms stimulus period and 300–1500 ms delay period following stimulus presentation. The delay consisted of 300 ms followed by a variable duration drawn from an exponential distribution with mean 300 ms and truncated at a maximum of 1200 ms. The network successfully learned to perform the task, which is reflected in the psychometric functions showing the percentage of trials on which the network chose R as a function of the signed motion and color coherences, where motion and color indicate the two sources of noisy information and the sign is positive for R and negative for L (Figure 2A, left). As in electrophysiological recordings, units in the decision network show mixed selectivity when sorted by different combinations of task variables (Figure 2A, right). Learning curves for this and additional networks trained for the task are shown in Figure 2—figure supplement 1.
The multisensory integration task (Figure 2B) is based on Raposo et al. (2012, 2014), in which rats used visual flashes and auditory clicks to determine whether the event rate was higher or lower than a learned threshold of 12.5 events per second. When both modalities were presented, they were congruent, which implied that the rats could improve their performance by combining information from both sources. As in the experiment, the network was trained with a 1000 ms stimulus period, with inputs whose magnitudes were proportional (both positively and negatively) to the event rate. For this task the input connection weights ${W}_{\text{in}}$, ${W}_{\text{in}}^{\lambda}$, and ${W}_{\text{in}}^{\gamma}$ were initialized so that a third of the $N=150$ decision network units received visual inputs only, another third auditory inputs only, and the remaining third received neither. As shown in the psychometric function (percentage of high choices as a function of event rate, Figure 2B, left), the trained network exhibits multisensory enhancement in which performance on multisensory trials was better than on singlemodality trials. Indeed, as for rats, the results are consistent with optimal combination of the two modalities,
where ${\sigma}_{\text{visual}}^{2}$, ${\sigma}_{\text{auditory}}^{2}$, and ${\sigma}_{\text{multisensory}}^{2}$ are the variances obtained from fits of the psychometric functions to cumulative Gaussian functions for visual only, auditory only, and multisensory (both visual and auditory) trials, respectively (Table 2). As observed in electrophysiological recordings, moreover, decision network units exhibit a range of tuning to task parameters, with some selective to choice and others to modality, while many units showed mixed selectivity to all task variables (Figure 2B, right). Learning curves for this and additional networks trained for the task are shown in Figure 2—figure supplement 2.
The parametric working memory task (Figure 2C) is based on the vibrotactile frequency discrimination task of Romo et al. (1999), in which monkeys were required to compare the frequencies of two temporally separated stimuli to determine which was higher. For network training, the task epochs consisted of a 500 ms base stimulus with 'frequency' ${f}_{1}$, a 2700–3300 ms delay, and a 500 ms comparison stimulus with frequency ${f}_{2}$; for the trials shown in Figure 2C the delay was always 3000 ms as in the experiment. During the decision period, the network had to indicate which stimulus was higher by choosing ${f}_{1}<{f}_{2}$ or ${f}_{1}>{f}_{2}$. The stimuli were constant inputs with amplitudes proportional (both positively and negatively) to the frequency. For this task we set the learning rate to $\eta =0.002$; the network successfully learned to perform the task (Figure 2C, left), and the individual units of the network, when sorted by the first stimulus (base frequency) ${f}_{1}$, exhibit highly heterogeneous activity (Figure 2C, right) characteristic of neurons recorded in the prefrontal cortex of monkeys performing the task (Machens et al., 2010). Learning curves for this and additional networks trained for the task are shown in Figure 2—figure supplement 3.
Additional comparisons can be made between the model networks shown in Figure 2 and the neural activity observed in behaving animals, for example statespace analyses as in Mante et al. (2013), Carnevale et al. (2015), or Song et al. (2016). Such comparisons reveal that, as found previously in studies such as Barak et al. 2013), the model networks exhibit many, but not all, features present in electrophysiological recordings. Figure 2 and the following make clear, however, that RNNs trained with reward feedback alone can already reproduce the mixed selectivity characteristic of neural populations in higher cortical areas (Rigotti et al., 2010, 2013), thereby providing a valuable platform for future investigations of how such complex representations are learned.
Confidence and perceptual decisionmaking
All of the tasks in the previous section have the property that the correct response on any single trial is a function only of the task condition, and, in particular, does not depend on the network’s state during the trial. In a postdecision wager task (Kiani and Shadlen, 2009), however, the optimal decision depends on the animal’s (agent’s) estimate of the probability that its decision is correct, i.e., its confidence. As can be seen from the results, on a trialbytrial basis this is not the same as simply determining the stimulus difficulty (a combination of stimulus duration and coherence); this makes it difficult to train with standard supervised learning, which requires a predetermined target output for the network to reproduce; instead, we trained an RNN to perform the task by maximizing overall reward. This task extends the simple perceptual decisionmaking task (Figure 1A) by introducing a 'sure' option that is presented during a 1200–1800 ms delay period on a random half of the trials; selecting this option results in a reward that is 0.7 times the size of the reward obtained when correctly choosing L or R. As in the monkey experiment, the network receives no information indicating whether or not a given trial will contain a sure option until the middle of the delay period after stimulus offset, thus ensuring that the network makes a decision about the stimulus on all trials (Figure 3A). For this task the input connection weights ${W}_{\text{in}}$, ${W}_{\text{in}}^{\lambda}$, and ${W}_{\text{in}}^{\gamma}$ were initialized so that half the units received information about the sure target while the other half received evidence for L and R. All units initially received fixation input.
The key behavioral features found in Kiani and Shadlen (2009); Wei and Wang (2015) are reproduced in the trained network, namely the network opted for the sure option more frequently when the coherence was low or stimulus duration short (Figure 3B, left); and when the network was presented with a sure option but waived it in favor of choosing L or R, the performance was better than on trials when the sure option was not presented (Figure 3B, right). The latter observation is taken as indication that neither monkeys nor trained networks choose the sure target on the basis of stimulus difficulty alone but based on their internal sense of uncertainty on each trial.
Figure 3C shows the activity of an example network unit, sorted by whether the decision was the unit’s preferred or nonpreferred target (as determined by firing rates during the stimulus period on all trials), for both nonwager and wager trials. In particular, on trials in which the sure option was chosen, the firing rate is intermediate compared to trials on which the network made a decision by choosing L or R. Learning curves for this and additional networks trained for the task are shown in Figure 3—figure supplement 1.
Valuebased economic choice task
We also trained networks to perform the simple economic choice task of PadoaSchioppa and Assad (2006) and examined the activity of the value, rather than decision, network. The choice patterns of the networks were modulated only by varying the reward contingencies (Figure 4A, upper and lower). We note that, on each trial there is a 'correct' answer in the sense that there is a choice which results in greater reward. In contrast to the previous tasks, however, information regarding whether an answer is correct in this sense is not contained in the inputs but rather in the association between inputs and rewards. This distinguishes the task from the cognitive tasks discussed in previous sections: although the task can be transformed into a cognitivetype task by providing the associated rewards as inputs, training in this manner conflates external with 'internal,' learned inputs.
Each trial began with a 750 ms fixation period; the offer, which indicated the 'juice' type and amount for the left and right choices, was presented for 1000–2000 ms, followed by a 750 ms decision period during which the network was required to indicate its decision. In the upper panel of Figure 4A the indifference point was set to 1A = 2.2B during training, which resulted in 1A = 2.0B when fit to a cumulative Gaussian (Figure 4—figure supplement 1), while in the lower panel it was set to 1A = 4.1B during training and resulted in 1A = 4.0B (Figure 4—figure supplement 2). The basic unit of reward, i.e., 1B, was 0.1. For this task we increased the initial value of the value network’s input weights, ${W}_{\text{in}}^{v}$, by a factor of 10 to drive the value network more strongly.
Strikingly, the activity of units in the value network ${v}_{\varphi}$ exhibits similar types of tuning to task variables as observed in the orbitofrontal cortex of monkeys, with some units (roughly 20% of active units) selective to chosen value, others (roughly 60%, for both A and B) to offer value, and still others (roughly 20%) to choice alone as defined in PadoaSchioppa and Assad (2006) (Figure 4B). The decision network also contained units with a diversity of tuning. Learning curves for this and additional networks trained for the task are shown in Figure 4—figure supplement 3. We emphasize that no changes were made to the network architecture for this valuebased economic choice task. Instead, the same scheme shown in Figure 1B, in which the value network is responsible for predicting future rewards to guide learning but is not involved in the execution of the policy, gave rise to the pattern of neural activity shown in Figure 4B.
Discussion
In this work we have demonstrated rewardbased training of recurrent neural networks for both cognitive and valuebased tasks. Our main contributions are twofold: first, our work expands the range of tasks and corresponding neural mechanisms that can be studied by analyzing model recurrent neural networks, providing a unified setting in which to study diverse computations and compare to electrophysiological recordings from behaving animals; second, by explicitly incorporating reward into network training, our work makes it possible in the future to more directly address the question of rewardrelated processes in the brain, for instance the role of value representation that is essential for learning, but not executing, a task.
To our knowledge, the specific form of the baseline network inputs used in this work has not been used previously in the context of recurrent policy gradients; it combines ideas from Wierstra et al. (2009) where the baseline network received the same inputs as the decision network in addition to the selected actions, and Ranzato et al. (2016), where the baseline was implemented as a simple linear regressor of the activity of the decision network, so that the decision and value networks effectively shared the same recurrent units. Indeed, the latter architecture is quite common in machine learning applications (Mnih et al., 2016), and likewise, for some of the simpler tasks considered here, models with a baseline consisting of a linear readout of the selected actions and decision network activity could be trained in comparable (but slightly longer) time (Figure 1—figure supplement 4). The question of whether the decision and value networks ought to share the same recurrent network parallels ongoing debate over whether choice and confidence are computed together or if certain areas such as OFC compute confidence signals locally, though it is clear that such 'metacognitive' representations can be found widely in the brain (Lak et al., 2014). Computationally, the distinction is expected to be important when there are nonlinear computations required to determine expected return that are not needed to implement the policy, as illustrated in the perceptual decisionmaking task (Figure 1).
Interestingly, a separate value network to represent the baseline suggests an explicit role for value representation in the brain that is essential for learning a task (equivalently, when the environment is changing), but not for executing an already learned task, as is sometimes found in experiments (Turner and Desmurget, 2010; Schoenbaum et al., 2011; Stalnaker et al., 2015). Since an accurate baseline dramatically improves learning but is not required—the algorithm is less reliable and takes many samples to converge with a constant baseline, for instance—this baseline network hypothesis for the role of value representation may account for some of the subtle yet broad learning deficits observed in OFClesioned animals (Wallis, 2007). Moreover, since expected reward is closely related to decision confidence in many of the tasks considered, a value network that nonlinearly reads out confidence information from the decision network is consistent with experimental findings in which OFC inactivation affects the ability to report confidence but not decision accuracy (Lak et al., 2014).
Our results thus support the actorcritic picture for rewardbased learning, in which one circuit directly computes the policy to be followed, while a second structure, receiving projections from the decision network as well as information about the selected actions, computes expected future reward to guide learning. Actorcritic models have a rich history in neuroscience, particularly in studies of the basal ganglia (Houk et al., 1995; Dayan and Balleine, 2002; Joel et al., 2002; O'Doherty et al., 2004; Takahashi et al., 2008; Maia, 2010), and it is interesting to note that there is some experimental evidence that signals in the striatum are more suitable for direct policy search rather than for updating action values as an intermediate step, as would be the case for purely value functionbased approaches to computing the decision policy (Li and Daw, 2011; Niv and Langdon, 2016). Moreover, although we have used a single RNN each to represent the decision and value modules, using 'deep,' multilayer RNNs may increase the representational power of each module (Pascanu et al., 2013a). For instance, more complex tasks than considered in this work may require hierarchical feature representation in the decision network, and likewise value networks can use a combination of the different features [including raw sensory inputs (Wierstra et al., 2009)] to predict future reward. Anatomically, the decision networks may correspond to circuits in dorsolateral prefrontal cortex, while the value networks may correspond to circuits in OFC (Schultz et al., 2000; Takahashi et al., 2011) or basal ganglia (Hikosaka et al., 2014). This architecture also provides a useful example of the hypothesis that various areas of the brain effectively optimize different cost functions (Marblestone et al., 2016): in this case, the decision network maximizes reward, while the value network minimizes the prediction error for future reward.
As in many other supervised learning approaches used previously to train RNNs (Mante et al., 2013; Song et al., 2016), the use of BPTT to compute the gradients (in particular, the eligibility) make our 'plasticity rule' not biologically plausible. As noted previously (Zipser and Andersen, 1988), it is indeed somewhat surprising that the activity of the resulting networks nevertheless exhibit many features found in neural activity recorded from behaving animals. Thus our focus has been on learning from realistic feedback signals provided by the environment but not on its physiological implementation. Still, recent work suggests that exact backpropagation is not necessary and can even be implemented in 'spiking' stochastic units (Lillicrap et al., 2016), and that approximate forms of backpropagation and SGD can be implemented in a biologically plausible manner (Scellier and Bengio, 2016), including both spatially and temporally asynchronous updates in RNNs (Jaderberg et al., 2016). Such ideas require further investigation and may lead to effective yet more neurally plausible methods for training model neural networks.
Recently, Miconi (2016) used a 'node perturbation'based (Fiete and Seung, 2006; Fiete et al., 2007; Hoerzer et al., 2014) algorithm with an error signal at the end of each trial to train RNNs for several cognitive tasks, and indeed, node perturbation is closely related to the REINFORCE algorithm used in this work. On one hand, the method described in Miconi (2016) is more biologically plausible in the sense of not requiring gradients computed via backpropagation through time as in our approach; on the other hand, in contrast to the networks in this work, those in Miconi (2016) did not 'commit' to a discrete action and thus the error signal was a graded quantity. In this and other works (Frémaux et al., 2010), moreover, the prediction error was computed by algorithmically keeping track of a stimulus (task condition)specific running average of rewards. Here we used a concrete scheme (namely a value network) for approximating the average that automatically depends on the stimulus, without requiring an external learning system to maintain a separate record for each (true) trial type, which is not known by the agent with certainty.
One of the advantages of the REINFORCE algorithm for policy gradient reinforcement learning is that direct supervised learning can also be mixed with rewardbased learning, by including only the eligibility term in Equation 3 without modulating by reward (Mnih et al., 2014), i.e., by maximizing the loglikelihood of the desired actions. Although all of the networks in this work were trained from reward feedback only, it will be interesting to investigate this feature of the REINFORCE algorithm. Another advantage, which we have not exploited here, is the possibility of learning policies for continuous action spaces (Peters and Schaal, 2008; Wierstra et al., 2009); this would allow us, for example, to model arbitrary saccade targets in the perceptual decisionmaking task, rather than limiting the network to discrete choices.
We have previously emphasized the importance of incorporating biological constraints in the training of neural networks (Song et al., 2016). For instance, neurons in the mammalian cortex have purely excitatory or inhibitory effects on other neurons, which is a consequence of Dale’s Principle for neurotransmitters (Eccles et al., 1954). In this work we did not include such constraints due to the more complex nature of our rectified GRUs (Equations 9–12); in particular, the units we used are capable of dynamically modulating their time constants and gating their recurrent inputs, and we therefore interpreted the firing rate units as a mixture of both excitatory and inhibitory populations. Indeed, these may implement the 'reservoir of time constants' observed experimentally (Bernacchia et al., 2011). In the future, however, comparison to both model spiking networks and electrophysiological recordings will be facilitated by including more biological realism, by explicitly separating the roles of excitatory and inhibitory units (Mastrogiuseppe and Ostojic, 2016). Moreover, since both the decision and value networks are obtained by minimizing an objective function, additional regularization terms can be easily included to obtain networks whose activity is more similar to neural recordings (Sussillo et al., 2015; Song et al., 2016).
Finally, one of the most appealing features of RNNs trained to perform many tasks is their ability to provide insights into neural computation in the brain. However, methods for revealing neural mechanisms in such networks remain limited to statespace analysis (Sussillo and Barak, 2013), which in particular does not reveal how the synaptic connectivity leads to the dynamics responsible for implementing the higherlevel decision policy. General and systematic methods for analyzing trained networks are still needed and are the subject of ongoing investigation. Nevertheless, rewardbased training of RNNs makes it more likely that the resulting networks will correspond closely to biological networks observed in experiments with behaving animals. We expect that the continuing development of tools for training model neural networks in neuroscience will thus contribute novel insights into the neural basis of animal cognition.
Materials and methods
Policy gradient reinforcement learning with RNNs
Here we review the application of the REINFORCE algorithm for policy gradient reinforcement learning to recurrent neural networks (Williams, 1992; Baird and Moore, 1999; Sutton et al., 2000; Baxter and Bartlett, 2001; Peters and Schaal, 2008; Wierstra et al., 2009). In particular, we provide a careful derivation of Equation 3 following, in part, the exposition in Zaremba and Sutskever (2016).
Let ${H}_{\mu :t}$ be the sequence of interactions between the environment and agent (i.e., the environmental states, observables, and agent actions) that results in the environment being in state ${\mathbf{\mathbf{s}}}_{t+1}$ at time $t+1$ starting from state ${\mathbf{\mathbf{s}}}_{\mu}$ at time $\mu $:
For notational convenience in the following, we adopt the convention that, for the special case of $\mu =0$, the history ${H}_{0:t}$ includes the initial state ${\mathbf{\mathbf{s}}}_{0}$ and excludes the meaningless inputs ${\mathbf{\mathbf{u}}}_{0}$, which are not seen by the agent:
When $t=0$, it is also understood that ${\mathbf{\mathbf{u}}}_{1:0}=\mathrm{\varnothing}$, the empty set. A full history, or a trial, is thus denoted as
where $T$ is the end of the trial. Here we only consider the episodic, 'finitehorizon' case where $T$ is finite, and since different trials can have different durations, we take $T$ to be the maximum length of a trial in the task. The reward ${\rho}_{t+1}$ at time $t+1$ following actions ${\mathbf{\mathbf{a}}}_{t}$ (we use $\rho $ to distinguish it from the firing rates $\mathbf{\mathbf{r}}$ of the RNNs) is determined by this history, which we sometimes indicate explicitly by writing
As noted in the main text, we adopt the convention that the agent performs actions at $t=0,\mathrm{\dots},T$ and receives rewards at $t=1,\mathrm{\dots},T+1$ to emphasize that rewards follow the actions and are jointly determined with the next state (Sutton and Barto, 1998). For notational simplicity, here and elsewhere we assume that any discount factor is already included in ${\rho}_{t+1}$, i.e., in all places where the reward appears we consider ${\rho}_{t+1}\to {e}^{t/{\tau}_{\text{reward}}}{\rho}_{t+1}$, where ${\tau}_{\text{reward}}$ is the time constant for discounting future rewards (Doya, 2000); we included temporal discounting only for the reactiontime version of the simple perceptual decisionmaking task (Figure 1—figure supplement 2), where we set $\tau}_{\mathrm{r}\mathrm{e}\mathrm{w}\mathrm{a}\mathrm{r}\mathrm{d}}=10\mathrm{s$. For the remaining tasks, ${\tau}_{\text{reward}}=\mathrm{\infty}$.
Explicitly, a trial ${H}_{0:T}$ comprises the following. At time $t=0$, the environment is in state ${\mathbf{\mathbf{s}}}_{0}$ with probability $\mathcal{E}\left({\mathbf{\mathbf{s}}}_{0}\right)$. The agent initially chooses a set of actions ${\mathbf{\mathbf{a}}}_{0}$ with probability ${\pi}_{\theta}\left({\mathbf{\mathbf{a}}}_{0}\right)$, which is determined by the parameters of the decision network, in particular the initial conditions ${\mathbf{\mathbf{x}}}_{0}$ and readout weights ${W}_{\text{out}}^{\pi}$ and biases ${\mathbf{\mathbf{b}}}_{\text{out}}^{\pi}$ (Equation 6). At time $t=1$, the environment, depending on its previous state ${\mathbf{\mathbf{s}}}_{0}$ and the agent’s actions ${\mathbf{\mathbf{a}}}_{0}$, transitions to state ${\mathbf{\mathbf{s}}}_{1}$ with probability $\mathcal{E}\left({\mathbf{\mathbf{s}}}_{1}\right{\mathbf{\mathbf{s}}}_{0},{\mathbf{\mathbf{a}}}_{0})$. The history up to this point is ${H}_{0:0}=({\mathbf{\mathbf{s}}}_{0:1},\mathrm{\varnothing},{\mathbf{\mathbf{a}}}_{0:0})$, where $\mathrm{\varnothing}$ indicates that no inputs have yet been seen by the network. The environment also generates reward ${\rho}_{1}$, which depends on this history, ${\rho}_{1}={\rho}_{1}\left({H}_{0:0}\right)$. From state ${\mathbf{\mathbf{s}}}_{1}$ the environment generates observables (inputs to the agent) ${\mathbf{\mathbf{u}}}_{1}$ with a distribution given by $\mathcal{E}\left({\mathbf{\mathbf{u}}}_{1}\right{\mathbf{\mathbf{s}}}_{1})$. In response, the agent, depending on the inputs ${\mathbf{\mathbf{u}}}_{1}$ it receives from the environment, chooses the set of actions ${\mathbf{\mathbf{a}}}_{1}$ according to the distribution ${\pi}_{\theta}\left({\mathbf{\mathbf{a}}}_{1}\right{\mathbf{\mathbf{u}}}_{1:1})={\pi}_{\theta}\left({\mathbf{\mathbf{a}}}_{1}\right{\mathbf{\mathbf{u}}}_{1})$. The environment, depending on its previous states ${\mathbf{\mathbf{s}}}_{0:1}$ and the agent’s previous actions ${\mathbf{\mathbf{a}}}_{0:1}$, then transitions to state ${\mathbf{\mathbf{s}}}_{2}$ with probability $\mathcal{E}\left({\mathbf{\mathbf{s}}}_{2}\right{\mathbf{\mathbf{s}}}_{0:1},{\mathbf{\mathbf{a}}}_{0:1})$. Thus ${H}_{0:1}=({\mathbf{\mathbf{s}}}_{0:2},{\mathbf{\mathbf{u}}}_{1:1},{\mathbf{\mathbf{a}}}_{0:1})$. Iterating these steps, the history at time $t$ is therefore given by Equation 15, while a full history is given by Equation 16.
The probability ${p}_{\theta}\left({H}_{0:\tau}\right)$ of a particular subhistory ${H}_{0:\tau}$ up to time $\tau $ occurring, under the policy ${\pi}_{\theta}$ parametrized by $\theta $, is given by
In particular, the probability ${p}_{\theta}\left(H\right)$ of a history $H={H}_{0:T}$ occurring is
A key ingredient of the REINFORCE algorithm is that the policy parameters only indirectly affect the environment through the agent’s actions. The logarithmic derivatives of Equation 18 with respect to the parameters $\theta $ therefore do not depend on the unknown (to the agent) environmental dynamics contained in $\mathcal{E}$, i.e.,
with the understanding that ${\mathbf{\mathbf{u}}}_{1:0}=\mathrm{\varnothing}$ (the empty set) and therefore ${\pi}_{\theta}\left({\mathbf{\mathbf{a}}}_{0}\right{\mathbf{\mathbf{u}}}_{1:0})={\pi}_{\theta}\left({\mathbf{\mathbf{a}}}_{0}\right)$.
The goal of the agent is to maximize the expected return at time $t=0$ (Equation 1, reproduced here)
where we have used the time index $\tau $ for notational consistency with the following and made the historydependence of the rewards explicit. In terms of the probability of each history $H$ occurring, Equation 19, we have
where the generic sum over $H$ may include both sums over discrete variables and integrals over continuous variables. Since, for any $\tau =0,\mathrm{\dots},T$,
(cf. Equation 18), we can simplify Equation 22 to
This simplification is used below to formalize the intuition that present actions do not influence past rewards. Using the 'likelihoodratio trick'
we can write
From Equation 20 we therefore have
where we have 'undone' Equation 23 to recover the sum over the full histories $H$ in going from Equation 30 to Equation 31. We then obtain the first terms of Equation 2 and Equation 3 by estimating the sum over all $H$ by ${N}_{\text{trials}}$ samples from the agent’s experience.
In Equation 22 it is evident that, while subtracting any constant $b$ from the reward $J\left(\theta \right)$ will not affect the gradient with respect to $\theta $, it can reduce the variance of the stochastic estimate (Equation 33) from a finite number of trials. Indeed, it is possible to use this invariance to find an 'optimal' value of the constant baseline that minimizes the variance of the gradient estimate (Peters and Schaal, 2008). In practice, however, it is more useful to have a historydependent baseline that attempts to predict the future return at every time (Wierstra et al., 2009; Mnih et al., 2014; Zaremba and Sutskever, 2016). We therefore introduce a second network, called the value network, that uses the selected actions ${\mathbf{\mathbf{a}}}_{1:t}$ and the activity of the decision network ${\mathbf{\mathbf{r}}}_{1:t}^{\pi}$ to predict the future return ${\sum}_{\tau =t}^{T}{\rho}_{\tau +1}$ by minimizing the squared error (Equations 4–5). Intuitively, such a baseline is appealing because the terms in the gradient of Equation 3 are nonzero only if the actual return deviates from what was predicted by the value network.
Discretized network equations and initialization
Carrying out the discretization of Equations 9–12 in time steps of $\mathrm{\Delta}t$, we obtain
for $t=1,\mathrm{\dots},T$, where $\alpha =\mathrm{\Delta}t/\tau $ and $\mathbf{\mathbf{N}}(0,1)$ are normally distributed random numbers with zero mean and unit variance. We note that the rectifiedlinear activation function appears in different positions compared to standard GRUs, which merely reflects the choice of using 'synaptic currents' as the dynamical variable rather than directly using firing rates as the dynamical variable. One small advantage of this choice is that we can train the unconstrained initial conditions ${\mathbf{\mathbf{x}}}_{0}$ rather than the nonnegatively constrained firing rates ${\mathbf{\mathbf{r}}}_{0}$.
The biases ${\mathbf{\mathbf{b}}}^{\lambda}$, ${\mathbf{\mathbf{b}}}^{\gamma}$, and $\mathbf{\mathbf{b}}$, as well as the readout weights ${W}_{\text{out}}^{\pi}$ and ${W}_{\text{out}}^{v}$, were initialized to zero. The biases for the policy readout ${\mathbf{\mathbf{b}}}_{\text{out}}^{\pi}$ were initially set to zero, while the value network bias ${b}_{\text{out}}^{v}$ was initially set to the 'reward' for an aborted trial, −1. The entries of the input weight matrices ${W}_{\text{in}}^{\gamma}$, ${W}_{\text{in}}^{\lambda}$, and ${W}_{\text{in}}$ for both decision and value networks were drawn from a zeromean Gaussian distribution with variance $K/{N}_{\text{in}}^{2}$. For the recurrent weight matrices ${W}_{\text{rec}}$, ${W}_{\text{rec}}^{\lambda}$, and ${W}_{\text{rec}}^{\gamma}$, the $K$ nonzero entries in each row were initialized from a gamma distribution $\mathrm{\Gamma}(\alpha ,\beta )$ with $\alpha =\beta =4$, with each entry multiplied randomly by $\pm 1$; the entire matrix was then scaled such that the spectral radius—the largest absolute value of the eigenvalues—was exactly ${\rho}_{0}$. Although we also successfully trained networks starting from normally distributed weights, we found it convenient to control the sign and magnitude of the weights independently. The initial conditions ${\mathbf{\mathbf{x}}}_{0}$, which are also trained, were set to 0.5 for all units before the start of training. We implemented the networks in the Python machine learning library Theano (The Theano Development Team, 2016).
Adam SGD with gradient clipping
We used a recently developed version of stochastic gradient descent known as Adam, for adaptive moment estimation (Kingma and Ba, 2015), together with gradient clipping to prevent exploding gradients (Graves, 2013; Pascanu et al., 2013b). For clarity, in this section we use vector notation $\mathit{\bm{\theta}}$ to indicate the set of all parameters being optimized and the subscript $k$ to indicate a specific parameter ${\theta}_{k}$. At each iteration $i>0$, let
be the gradient of the objective function $\mathcal{L}$ with respect to the parameters $\mathit{\bm{\theta}}$. We first clip the gradient if its norm $\left{\mathbf{\mathbf{g}}}^{\left(i\right)}\right$ exceeds a maximum $\mathrm{\Gamma}$ (see Table 1), i.e.,
Each parameter ${\theta}_{k}$ is then updated according to
where $\eta $ is the base learning rate and the moving averages
estimate the first and second (uncentered) moments of the gradient. Initially, ${\mathbf{\mathbf{m}}}^{\left(0\right)}={\mathbf{\mathbf{v}}}^{\left(0\right)}=0$. These moments allow each parameter to be updated in Equation 40 according to adaptive learning rates, such that parameters whose gradients exhibit high uncertainty and hence small 'signaltonoise ratio' lead to smaller learning rates.
Except for the base learning rate $\eta $ (see Table 1), we used the parameter values suggested in Kingma and Ba (2015):
Computer code
All code used in this work, including code for generating the figures, is available at http://github.com/xjwanglab/pyrl.
References

1
Gradient descent for general reinforcement learningAdvances in Neural Information Processing Systems 11:968–974.

2
From fixed points to chaos: three models of delayed discriminationProgress in Neurobiology 103:214–222.https://doi.org/10.1016/j.pneurobio.2013.02.002

3
Neuronlike adaptive elements that can solve difficult learning control problemsIEEE Transactions on Systems, Man, and Cybernetics SMC13:834–846.https://doi.org/10.1109/TSMC.1983.6313077

4
Infinitehorizon policygradient estimationThe Journal of Artificial Intelligence Research 15:319–350.https://doi.org/10.1613/jair.806
 5

6
Learning longterm dependencies with gradient descent is difficultIEEE Transactions on Neural Networks 5:157–166.https://doi.org/10.1109/72.279181

7
A reservoir of time constants for memory traces in cortical neuronsNature Neuroscience 14:366–372.https://doi.org/10.1038/nn.2752

8
Does computational neuroscience need new synaptic learning paradigms?Current Opinion in Behavioral Sciences 11:61–66.https://doi.org/10.1016/j.cobeha.2016.05.012

9
Reinforcement learning of linking and tracing contours in recurrent neural networks.PLoS Computational Biology 11:e1004489.https://doi.org/10.1371/journal.pcbi.1004489

10
Deep neural networks rival the representation of primate IT cortex for core visual object recognitionPLoS Computational Biology 10:e1003963.https://doi.org/10.1371/journal.pcbi.1003963
 11
 12
 13
 14

15
Decision theory, reinforcement learning, and the brainCognitive, Affective, & Behavioral Neuroscience 8:429–453.https://doi.org/10.3758/CABN.8.4.429

16
Reinforcement learning in continuous time and spaceNeural Computation 12:219–245.https://doi.org/10.1162/089976600300015961

17
Cholinergic and inhibitory synapses in a pathway from motoraxon collaterals to motoneuronesThe Journal of Physiology 126:524–562.https://doi.org/10.1113/jphysiol.1954.sp005226
 18

19
Gradient learning in spiking neural networks by dynamic perturbation of conductancesPhysical Review Letters 97:048104.https://doi.org/10.1103/PhysRevLett.97.048104

20
Model of birdsong learning based on gradient estimation by dynamic perturbation of neural conductancesJournal of Neurophysiology 98:2038–2057.https://doi.org/10.1152/jn.01311.2006
 21

22
GoalDirected decision making with spiking neuronsJournal of Neuroscience 36:1529–1546.https://doi.org/10.1523/JNEUROSCI.285415.2016

23
Functional requirements for rewardmodulated spiketimingdependent plasticityJournal of Neuroscience 30:13326–13337.https://doi.org/10.1523/JNEUROSCI.624909.2010

24
On simplicity and complexity in the brave new world of largescale neuroscienceCurrent Opinion in Neurobiology 32:148–155.https://doi.org/10.1016/j.conb.2015.04.003

25
The neural basis of decision makingAnnual Review of Neuroscience 30:535–574.https://doi.org/10.1146/annurev.neuro.29.051605.113038
 26

27
A survey of actorcritic reinforcement learning: Standard and natural policy gradientsIEEE Transactions on Systems, Man, and Cybernetics, Part C 42:1291–1307.https://doi.org/10.1109/TSMCC.2012.2218595
 28

29
Basal ganglia circuits for reward valueguided behaviorAnnual Review of Neuroscience 37:289–306.https://doi.org/10.1146/annurevneuro071013013924
 30
 31

32
A model of how the basal ganglia generates and uses neural signals that predict reinforcementIn: J. C Houk, J. L Davis, D. G Beisberb, editors. Models of Information Processing in the Basal Ganglia. Cambridge, MA: MIT Press. pp. 249–274.
 33
 34
 35

36
Planning and acting in partially observable stochastic domainsArtificial Intelligence 101:99–134.https://doi.org/10.1016/S00043702(98)00023X
 37
 38
 39
 40

41
Robust timing and motor patterns by taming chaos in recurrent neural networksNature Neuroscience 16:925–933.https://doi.org/10.1038/nn.3405
 42

43
Signals in human striatum are appropriate for policy update rather than value predictionJournal of Neuroscience 31:5504–5511.https://doi.org/10.1523/JNEUROSCI.631610.2011

44
Random synaptic feedback weights support error backpropagation for deep learningNature Communications 7:13276.https://doi.org/10.1038/ncomms13276

45
Functional, but not anatomical, separation of "what" and "when" in prefrontal cortexJournal of Neuroscience 30:350–360.https://doi.org/10.1523/JNEUROSCI.327609.2010

46
Twofactor theory, the actorcritic model, and conditioned avoidanceLearning & Behavior 38:50–67.https://doi.org/10.3758/LB.38.1.50
 47

48
Toward an integration of deep learning and neuroscienceFrontiers in Computational Neuroscience 10:94.https://doi.org/10.3389/fncom.2016.00094

49
Learning recurrent neural networks with Hessianfree optimizationProceedings of the 28th International Conference on Machine Learning.
 50

51
A role for neural integrators in perceptual decision makingCerebral Cortex 13:1257–1269.https://doi.org/10.1093/cercor/bhg097
 52

53
Recurrent models of visual attentionAdvances in neural information processing systems.
 54

55
Reinforcement learning with MarrCurrent Opinion in Behavioral Sciences 11:67–73.https://doi.org/10.1016/j.cobeha.2016.04.005
 56
 57
 58

59
On the difficulty of training recurrent neural networksProceedings of the 30th International Conference on Machine Learning.

60
Reinforcement learning of motor skills with policy gradientsNeural Networks 21:682–697.https://doi.org/10.1016/j.neunet.2008.02.003
 61
 62

63
Decision making under uncertainty: a neural model based on partially observable markov decision processesFrontiers in Computational Neuroscience 4:146.https://doi.org/10.3389/fncom.2010.00146

64
Multisensory decisionmaking in rats and humansJournal of Neuroscience 32:3726–3735.https://doi.org/10.1523/JNEUROSCI.499811.2012

65
A categoryfree neural population supports evolving demands during decisionmakingNature Neuroscience 17:1784–1792.https://doi.org/10.1038/nn.3865

66
Internal representation of task rules by recurrent dynamics: the importance of the diversity of neural responsesFrontiers in Computational Neuroscience 4:24.https://doi.org/10.3389/fncom.2010.00024
 67

68
Response of neurons in the lateral intraparietal area during a combined visual discrimination reaction time taskJournal of Neuroscience 22:9475–9489.
 69

70
Learning internal representations by error propagationIn: DE Rumelhart, JL McClelland, editors. Parallel Distributed Processing, 1. Cambridge, MA: MIT Press. pp. 318–362.
 71

72
Does the orbitofrontal cortex signal value?Annals of the New York Academy of Sciences 1239:87–99.https://doi.org/10.1111/j.17496632.2011.06210.x
 73
 74
 75

76
Neural mechanism for stochastic behaviour during a competitive gameNeural Networks 19:1075–1090.https://doi.org/10.1016/j.neunet.2006.05.044

77
Synaptic computation underlying probabilistic inferenceNature Neuroscience 13:112–119.https://doi.org/10.1038/nn.2450
 78

79
What the orbitofrontal cortex does not doNature Neuroscience 18:620–627.https://doi.org/10.1038/nn.3982

80
Choosing the greater of two goods: neural currencies for valuation and decision makingNature Reviews Neuroscience 6:363–375.https://doi.org/10.1038/nrn1666
 81
 82

83
Neural circuits as computational dynamical systemsCurrent Opinion in Neurobiology 25:156–163.https://doi.org/10.1016/j.conb.2014.01.008

84
A neural network that finds a naturalistic solution for the production of muscle activityNature Neuroscience 18:1025–1033.https://doi.org/10.1038/nn.4042
 85

86
Policy gradient methods for reinforcement learning with function approximationAdvances in neural information processing systems 12:1057–1063.
 87

88
Expectancyrelated changes in firing of dopamine neurons depend on orbitofrontal cortexNature Neuroscience 14:1590–1597.https://doi.org/10.1038/nn.2957
 89

90
Learning to use working memory in partially observable environments through dopaminergic reinforcementAdvances in Neural Information Processing Systems.

91
Basal ganglia contributions to motor control: a vigorous tutorCurrent Opinion in Neurobiology 20:704–716.https://doi.org/10.1016/j.conb.2010.08.022

92
Reinforcement learning in populations of spiking neuronsNature Neuroscience 12:250–252.https://doi.org/10.1038/nn.2264

93
Orbitofrontal cortex and its contribution to decisionmakingAnnual Review of Neuroscience 30:31–56.https://doi.org/10.1146/annurev.neuro.30.051606.094334
 94
 95

96
Confidence estimation as a stochastic process in a neurodynamical system of decision makingJournal of Neurophysiology 114:99–113.https://doi.org/10.1152/jn.00793.2014
 97
 98

99
A recurrent network mechanism of time integration in perceptual decisionsJournal of Neuroscience 26:1314–1328.https://doi.org/10.1523/JNEUROSCI.373305.2006

100
Show, attend and tell: Neural image caption generation with visual attentionProceedings of the 32 nd International Conference on Machine Learning.
 101
 102
 103
Decision letter

Timothy EJ BehrensReviewing Editor; University College London, United Kingdom
In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.
Thank you for submitting your article "Rewardbased training of recurrent neural networks for cognitive and valuebased tasks" for consideration by eLife. Your article has been reviewed by three peer reviewers, including Sam Gershman (Reviewer #2) and Mattia Rigotti (Reviewer #3), and the evaluation has been overseen by Timothy Behrens as the Senior and Reviewing editor.
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
Summary:
The authors propose to use rewardbased training of recurrent neural networks in order to build models able to perform behavioral tasks that are being studied in animal models by the systems neuroscience and neuroelectrophysiology communities. The trained models are then shown to capture features of both the behavior and of electrophysiological data observed in primate experiments. The rewardbased training procedure is more general and more realistic (in terms of mimicking the actual primate training) than the supervised training approaches that have been typically employed so far in computational works that compare trained recurrent neural networks to neural recordings. As a result, this training method promises to have a much wider applicability and reach, in terms of the experimental paradigms that can be modeled and the scientific questions that can be asked. The authors then give several examples of the utility of their framework by training models to reproduce a perceptual discrimination task, a contextdependent integration task, a multisensory integration task, a working memory task, and economic choice task.
The paper is wellwritten and demonstrates an approach to fit recurrent neural networks to neuroscience experiment that is much more general, principled and powerful than the mentioned supervised method typically used so far. Besides being more natural and of wider applicability, the rewardbased training technique will also potentially allow investigators to model the evolution of the neural dynamics and behavior as a subject is learning to perform a task, as the authors nicely demonstrate in a handful of landmark primate decision making tasks. These points alone arguably make for a strong paper.
Essential revisions:
Despite the laudable aspects of the work described above, the reviewers unanimously agreed that more needs to be done to demonstrate that the approach will lead to clear new biological knowledge.
All three reviewers raised the issue that it is not clear that the network learning routine is novel from the machine learning perspective, and yet it is completely clear that assumptions made in the model render it only a distant approximation to a biological learning rule. Hence, the reviewers thought that the manuscript needed to be reframed to be clearer what the contribution was. Less emphasis should be placed on the biological realism of the network, and more emphasis should be placed on the biological utility of training a network based on rewards.
Comments from the reviews that address this point are as follows:
I am having problems assessing novelty. If the main novelty is to provide the first neural implementation of the standard "actorcritic" paradigm, then the authors would need to provide an implementation where learning (both in the actor and in the critic) happens through plausible learning rules, which is clearly not the case here (use of backpropagation through time as well as Gated Recurrent Units; the authors acknowledge these limitations in the Discussion). It is not clear that the problem remains learnable (even with the same "teaching signals") under more realistic assumptions. If, on the other hand, the aim was to argue that jointly training a critic facilitates learning of the actor, the authors did not need a neural implementation to make that point, and in fact, it is already well known that policygradient methods benefit immensely from variance reduction through finegrained reward prediction. So, this leaves us with a third alternative message, which is to use these insights + their (nonplausible) neural implementations to try and map the algorithm (as opposed to a particular implementation) to specific brain structures/processes by relating model activity to recorded brain activity and behaviour. I think this could have been the main strength of this paper, but is somehow underplayed (in fact, inspection of the activity of the value network is limited to a single page).
The network is quasibiological in the sense that it is technically a neural network, but the biological constraints on architecture, dynamics and synaptic plasticity are really weak. Should we see this as essentially a phenomenological model, or is there deeper significance to the "neural" aspect? If the latter, then I think a stronger case needs to be made.
My main criticism with the paper is that some presentation choices might mislead some readers into thinking that the application of the policy gradient method to recurrent neural networks trained with backpropagation through time is novel. In fact, the training method used in this paper is essentially the "recurrent policy gradient" algorithm presented among others in one of the papers referenced by the authors (Wierstra et al. 2009). In that paper from the Schmidhuber group the authors utilize LSTMs, instead of GRUs, but otherwise the rest is essentially the same, including the use of a variance reduction method to the estimate of the gradient of the policy function, consisting in a "baseline value function". In addition, in both this paper and in Wierstra et al. 2009 the baseline value function is an RNNs (in the paper under review, the input to the value function is the hidden state of the policy network, but this seems arbitrary).
In fact both ideas – using backpropagation to estimate the policy gradient and using a baseline reinforcement function to reduce the variance of the gradient estimate – are already present in the original Williams 1992 paper (referenced by the authors). The second idea was however probably mostly popularized by a series of papers by Baxter and Bartlett (that apply policy gradients to solving POMDPs), and by the VAPS (Value and Policy Search) algorithm by Baird and More (1999). I feel that the authors should cite these papers. I also suggest that the authors explicitly say that they're using (a modified version of) the "recurrent policy gradient" algorithm presented in the series of papers by Wierstra et al. Since the algorithm already has a name, it might be beneficial to use it. That would also help make the mentioned connections across the literature most transparent.
Whilst there was clear agreement that the biologicallyrealistic training regime opened up the possibility of asking new biological questions with the network, and that this was exciting, there was also clear agreement that the biological questions and predictions that you have actually made did not lead to the clear new biological insights that the reviewers were expecting. For example, comments from the reviews that spoke to this issue were:
I found the Results section somewhat weak – the Discussion arrived at the point in the Results that I thought was only the end of the warmup phase. Up to Figure 3, I thought this is all nice but it looks like a sanity check that their architecture does learn what it's supposed to learn, i.e. a sort of reproduction (albeit in an actorcritic architecture) of the results of Song et al. (2016) where the same authors had used stochastic gradient descent to train RNNs on the same family of tasks. It is only in Figure 4 that the authors start inspecting the value network. In light of what I wrote above (namely, knowing that the learning implementation isn't realistic, and that reward prediction is already known to help a lot in policy gradient learning), this was disappointing to me. I was expecting the authors to use their trained value networks to make more specific, experimentally testable predictions about the type of signals you expect to see (and where), the form of synergy predicted between the progress of learning in the task and the quality of reward predictions (by looking at the synergies in the simultaneous training of both nets), etc.
The main empirical observation from the section "Tasks with simple inputoutput mappings", apart from the fact that the network learns the tasks, is that it exhibits mixed selectivity. This mixed selectivity observation shows up in several of the other sections, but I feel that it is not a particularly strong argument for this model. Many models could produce mixed selectivity, and in any case the prevalence of mixed selectivity is never quantified. Is this really the only empirical constraint from the neural data?
In general, need more explanation for why the network reproduces particular empirical phenomena. Which assumptions provide explanatory power? If one were to deviate from these assumptions, would the model no longer reproduce these phenomena? What is uniquely explained by this model?
The authors mention that the model allows the Markov property to be relaxed. This seems like an important observation, but there's no demonstration of this in simulations. What empirical phenomena speak to this issue?
The reward baseline idea is interesting, and the authors mention some empirical data possibly consistent with this idea, but they don't report any simulations of lesions, inactivation or pharmacological manipulations to reproduce these effects with their model.
All three reviewers reiterated this point as essential in the Discussion. The basic point is that eLife is a biology journal, not a machine learning journal. It needs to be clearer how the new ML advances have led to substantial new biological insight. There was also a clear suggestion in the Discussion that a major advantage of having a network that learns from rewards is the potential to analyse the dynamics of learning itself; the potential to elucidate the limits of task learnability under sparse delayed rewards, and to predict specific patterns of interaction between the learning of the task and the learning of the reward landscape. The reviewers thought that one potential avenue to strengthen the Results section was to focus on these dynamics.
There was also a discussion about deference to the existing literature. You can see several points above that demonstrate concerns along these lines. A further point was also raised:
The argument that actorcritic/policy gradient models are "opposite of the way value functions are usually thought of in neuroscience" (Discussion) seems extreme, since this only applies to valuebased modelfree algorithms like Qlearning and sarsa. But there is a long tradition of actorcritic models applied to the basal ganglia; see for example Houk et al. (1995), Dayan & Balleine (2002), Joel et al. (2002), O'Doherty et al. (2004), Takahashi et al. (2008), Maia (2010), to name a few.
https://doi.org/10.7554/eLife.21492.020Author response
Essential revisions:
Despite the laudable aspects of the work described above, the reviewers unanimously agreed that more needs to be done to demonstrate that the approach will lead to clear new biological knowledge.
All three reviewers raised the issue that it is not clear that the network learning routine is novel from the machine learning perspective, and yet it is completely clear that assumptions made in the model render it only a distant approximation to a biological learning rule. Hence, the reviewers thought that the manuscript needed to be reframed to be clearer what the contribution was. Less emphasis should be placed on the biological realism of the network, and more emphasis should be placed on the biological utility of training a network based on rewards.
We hope that the revised manuscript allays some of these concerns, please see below for more detailed responses to the editor and reviewers’ comments.
Here we note two points that were partially raised in the original manuscript and have now been expanded for emphasis and clarification:
We distinguish a learning rule and the neural activity that results from it. The plasticity rule used in this work is not biologically plausible in the sense that it is not one of the known synaptic plasticity rules. While several recent works have made progress toward biologically plausible backpropagation (or doing without BP altogether), we never considered this to be a point of debate. At present there is simply no biologically plausible learning rule capable of training networks for such tasks, in the same manner. What we argue is not that the physiological learning rule is biological, but rather that the resulting circuits, trained in an ethologically relevant manner, operate in a similar way to neural circuits in the brain as demonstrated by neural activity traces that are quite similar to those recorded from behaving animals. This is a nontrivial, even surprising, result, as we can think of no simple reason why this had to be [a point also raised in Zipser & Andersen (1988) for the supervised learning case].
We did not invent the core learning algorithm used in this work, namely recurrent policy gradients with a baseline value network (responsible for computing expected values). Indeed, in some cases we opted not to include stateoftheart techniques such as Asynchronous Advantage ActorCritic (Mnih et al., 2016) due to biological impossibility (rather than implausibility). What is novel is the application of these techniques, with a few modifications (that are novel), to tasks relevant to systems neuroscience, particularly to the combined study of behavior and electrophysiology for a wide range of tasks. We have therefore reframed the overall paper around its main contributions as laid out at the beginning of the Discussion, to wit: “In this work we have demonstrated rewardbased training of recurrent neural networks for both cognitive and valuebased tasks. Our main contributions are twofold: first, our work expands the range of tasks and corresponding neural mechanisms that can be studied by analyzing model recurrent neural networks, providing a unified setting in which to study diverse computations and compare to electrophysiological recordings from behaving animals; second, by explicitly incorporating reward into network training, our work makes it possible in the future to more directly address the question of rewardrelated processes in the brain, for instance the role of value representation that is essential for learning, but not executing, a task.”
Comments from the reviews that address this point are as follows:
I am having problems assessing novelty. If the main novelty is to provide the first neural implementation of the standard "actorcritic" paradigm, then the authors would need to provide an implementation where learning (both in the actor and in the critic) happens through plausible learning rules, which is clearly not the case here (use of backpropagation through time as well as Gated Recurrent Units; the authors acknowledge these limitations in the Discussion). It is not clear that the problem remains learnable (even with the same "teaching signals") under more realistic assumptions. If, on the other hand, the aim was to argue that jointly training a critic facilitates learning of the actor, the authors did not need a neural implementation to make that point, and in fact, it is already well known that policygradient methods benefit immensely from variance reduction through finegrained reward prediction. So, this leaves us with a third alternative message, which is to use these insights + their (nonplausible) neural implementations to try and map the algorithm (as opposed to a particular implementation) to specific brain structures/processes by relating model activity to recorded brain activity and behaviour. I think this could have been the main strength of this paper, but is somehow underplayed (in fact, inspection of the activity of the value network is limited to a single page).
We agree that this is not the first neural network implementation of the “actorcritic” architecture – many variations of actorcritic are routinely used in machine learning, and as addressed further below, the concept of actorcritic has a rich history in neuroscience, particularly in models of the basal ganglia. We did not intend in any way to imply otherwise, and we have reframed and expanded portions of the manuscript to make this as clear as possible. Again, what is novel is to demonstrate a reinforcement learning framework of training RNNs to perform a number of cognitive and valuebased tasks, which is of wide interest to systems neuroscience.
We have added the output of the value network (the expected reward) to Figure 1 to show how the expected return/confidence is computed by performing an “absolute value”like operation on the accumulated evidence. As we note in an expanded Discussion, however, it is also common in machine learning applications for the policy and value networks to share the same recurrent units, by using a weighted sum of the two losses (reward maximization and reward prediction error) to jointly train the two using a single loss. The question of whether the decision and value networks ought to share the same recurrent network parallels ongoing debate over whether choice and confidence are computed together, or if certain areas such as OFC compute confidence signals locally, and we view this problem as currently unresolved. Computationally, the distinction is expected to be important when there are nonlinear computations required to determine expected return that are not needed to implement the policy.
We were less focused on the learnability problem in this situation because there is no question that the tasks can be learned by animals under similar reward conditions, and we assumed that a sufficiently powerful RL algorithm should therefore be able to replicate this learning. We believe that in the future researchers will elucidate the corresponding mechanism in the brain.
The network is quasibiological in the sense that it is technically a neural network, but the biological constraints on architecture, dynamics and synaptic plasticity are really weak. Should we see this as essentially a phenomenological model, or is there deeper significance to the "neural" aspect? If the latter, then I think a stronger case needs to be made.
We agree that the biological constraints are weak, particularly the learning rule, but in our experience it would be a mistake to completely discard the “neural” aspect and view trained RNNs as a purely phenomenological model – after all, the activity of individual units in our trained networks exhibit strikingly similar features to biological neurons recorded in experiments. For instance, it is quite surprising that, after training with a valuebased choice task, three major types of units (offer value, chosen value, and choice) found in OFC physiology naturally emerged in our network (Figure 4).
Since no known biologically realistic learning rule exists with the flexibility to learn all the tasks explored in this work, our goal was to use this as a starting point for future development of increasingly biologically realistic models that can still perform the task.
My main criticism with the paper is that some presentation choices might mislead some readers into thinking that the application of the policy gradient method to recurrent neural networks trained with backpropagation through time is novel. In fact, the training method used in this paper is essentially the "recurrent policy gradient" algorithm presented among others in one of the papers referenced by the authors (Wierstra et al. 2009). In that paper from the Schmidhuber group the authors utilize LSTMs, instead of GRUs, but otherwise the rest is essentially the same, including the use of a variance reduction method to the estimate of the gradient of the policy function, consisting in a "baseline value function". In addition, in both this paper and in Wierstra et al. 2009 the baseline value function is an RNNs (in the paper under review, the input to the value function is the hidden state of the policy network, but this seems arbitrary).
In fact both ideas – using backpropagation to estimate the policy gradient and using a baseline reinforcement function to reduce the variance of the gradient estimate – are already present in the original Williams 1992 paper (referenced by the authors). The second idea was however probably mostly popularized by a series of papers by Baxter and Bartlett (that apply policy gradients to solving POMDPs), and by the VAPS (Value and Policy Search) algorithm by Baird and More (1999). I feel that the authors should cite these papers. I also suggest that the authors explicitly say that they're using (a modified version of) the "recurrent policy gradient" algorithm presented in the series of papers by Wierstra et al. Since the algorithm already has a name, it might be beneficial to use it. That would also help make the mentioned connections across the literature most transparent.
It was not in any way our intention to imply that this was the first application of policy gradients to RNNs, and it was a big mistake on our part to have taken this to be obvious. We have changed the exposition to make this much clearer, and included the suggested citations. Please note that we were using Peters & Schall (2008) as a sort of review of the GPOMDP method in addition to the Williams derivation. In any case, we now explicitly say that our work is based on Wierstra’s recurrent policy gradient, which again we felt was clear (please note that we used this term explicitly in the Discussion of the original manuscript) but could have been, and has been made, clearer.
We do think, however, that using the hidden state of the policy/decision network as inputs to the value network is 1) not completely arbitrary and 2) is, in fact, novel from the machine learning perspective. It is similar in spirit to jointly training the policy and value networks as one network with a single, combined loss (reward maximization and reward prediction error), while nevertheless treating the two networks separately. Of course, this was not explained well in the manuscript and we have also included additional discussion of this matter. In particular, we now point out how different choices for this architecture parallel ongoing debate over the question of whether decision and confidence are jointly computed in one brain area and read out by another, or if confidence can be computed locally by a different region from signals in the decision area. Within neuroscience this question remains unresolved, and we hope that our approach, and generalizations based on it, can help shed light on the matter.
Whilst there was clear agreement that the biologicallyrealistic training regime opened up the possibility of asking new biological questions with the network, and that this was exciting, there was also clear agreement that the biological questions and predictions that you have actually made did not lead to the clear new biological insights that the reviewers were expecting. For example, comments from the reviews that spoke to this issue were:
I found the Results section somewhat weak – the Discussion arrived at the point in the Results that I thought was only the end of the warmup phase. Up to Figure 3, I thought this is all nice but it looks like a sanity check that their architecture does learn what it's supposed to learn, i.e. a sort of reproduction (albeit in an actorcritic architecture) of the results of Song et al. (2016) where the same authors had used stochastic gradient descent to train RNNs on the same family of tasks. It is only in Figure 4 that the authors start inspecting the value network. In light of what I wrote above (namely, knowing that the learning implementation isn't realistic, and that reward prediction is already known to help a lot in policy gradient learning), this was disappointing to me. I was expecting the authors to use their trained value networks to make more specific, experimentally testable predictions about the type of signals you expect to see (and where), the form of synergy predicted between the progress of learning in the task and the quality of reward predictions (by looking at the synergies in the simultaneous training of both nets), etc.
The main empirical observation from the section "Tasks with simple inputoutput mappings", apart from the fact that the network learns the tasks, is that it exhibits mixed selectivity. This mixed selectivity observation shows up in several of the other sections, but I feel that it is not a particularly strong argument for this model. Many models could produce mixed selectivity, and in any case the prevalence of mixed selectivity is never quantified. Is this really the only empirical constraint from the neural data?
In general, need more explanation for why the network reproduces particular empirical phenomena. Which assumptions provide explanatory power? If one were to deviate from these assumptions, would the model no longer reproduce these phenomena? What is uniquely explained by this model?
The authors mention that the model allows the Markov property to be relaxed. This seems like an important observation, but there's no demonstration of this in simulations. What empirical phenomena speak to this issue?
The section on “Tasks with simple inputoutput mappings” was indeed meant as a sanity check, to confirm that any task that was previously trained using supervised learning could also be learned based on reward only. Please note a modified Figure 1 to show the expected return predicted by the value network during the perceptual decisionmaking task as a function of time, which was previously Figure 1—figure supplement 2. It shows that the expected reward can be computed by an “absolute value”like operation on the accumulated evidence to compute the expected reward. (It is not simply an absolute value because it also requires a shift, etc.). However, the confidence experiment (Figure 3) is new, not simulated in our previous work using supervised learning.
The tasks considered in this work are in some sense too simple for strong conclusions to be drawn about the role of the value network and we have avoided doing so, although we clearly speculate on possibilities that are of interest for future investigation. Moreover, we found the value network in the economic choice task of some interest precisely because electrophysiologists have recorded from the OFC of monkeys performing a similar task, but it is quite rare that PFC/PPC and OFC are recorded simultaneously so that we had little experimental constraint to work with – a situation that will, we hope, change in the future.
The relaxation of the Markov assumption is more relevant to ongoing work on certain games but we wanted to repeat this point, which was already made in Wierstra et al. (2009). We removed this comment to avoid confusion.
The reward baseline idea is interesting, and the authors mention some empirical data possibly consistent with this idea, but they don't report any simulations of lesions, inactivation or pharmacological manipulations to reproduce these effects with their model.
All three reviewers reiterated this point as essential in the Discussion. The basic point is that eLife is a biology journal, not a machine learning journal. It needs to be clearer how the new ML advances have led to substantial new biological insight. There was also a clear suggestion in the Discussion that a major advantage of having a network that learns from rewards is the potential to analyse the dynamics of learning itself; the potential to elucidate the limits of task learnability under sparse delayed rewards, and to predict specific patterns of interaction between the learning of the task and the learning of the reward landscape. The reviewers thought that one potential avenue to strengthen the Results section was to focus on these dynamics.
We were very mindful of the fact that eLife is a biology journal, and, importantly, that the primary intended audience of our work were neuroscientists. It would be a stretch to say that current research on RL in machine learning can be applied to neuroscience wholesale; we believe, however, that judiciously applying those methods and adapting the parts we can to be even incrementally more biologically plausible is extremely useful, and that was the main goal of our paper. We were less focused on the learnability problem in this situation because there is no question that the tasks can be learned by animals under similar reward conditions, and we assumed that a sufficiently powerful RL algorithm should therefore be able to replicate this learning.
We are very interested in comparing the dynamics of learning in our networks to animals under experimental conditions; however, we simply do not have the necessary data for meaningful comparison. We are working with experimental collaborators to make this a reality. We agree that this is a major advantage of the RL framework.
There was also a discussion about deference to the existing literature. You can see several points above that demonstrate concerns along these lines. A further point was also raised:
The argument that actorcritic/policy gradient models are "opposite of the way value functions are usually thought of in neuroscience" (Discussion) seems extreme, since this only applies to valuebased modelfree algorithms like Qlearning and sarsa. But there is a long tradition of actorcritic models applied to the basal ganglia; see for example Houk et al. (1995), Dayan & Balleine (2002), Joel et al. (2002), O'Doherty et al. (2004), Takahashi et al. (2008), Maia (2010), to name a few.
This was at best a sloppy sentence; the sentiment expressed here merely reflects the first author’s own (cortical) bias in interacting with neuroscientists working on reinforcement learning. We are of course aware that actorcritic models have a long and distinguished history in neuroscience, especially in work on basal ganglia. We have added the suggested references along with a sentence in the Discussion acknowledging this history, and removed the problematic phrase in question.
https://doi.org/10.7554/eLife.21492.021Article and author information
Author details
Funding
Office of Naval Research (N000141310297)
 H Francis Song
 Guangyu R Yang
 XiaoJing Wang
 H Francis Song
 Guangyu R Yang
 XiaoJing Wang
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank R Pascanu, K Cho, E Ohran, E Russek, and G Wayne for valuable discussions. This work was supported by Office of Naval Research Grant N000141310297 and a Google Computational Neuroscience Grant.
Reviewing Editor
 Timothy EJ Behrens, Reviewing Editor, University College London, United Kingdom
Publication history
 Received: September 13, 2016
 Accepted: January 12, 2017
 Accepted Manuscript published: January 13, 2017 (version 1)
 Version of Record published: February 6, 2017 (version 2)
Copyright
© 2017, Song et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 2,061
 Page views

 585
 Downloads

 3
 Citations
Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.