Biologically plausible learning in recurrent neural networks reproduces neural dynamics observed during cognitive tasks
 Cited
 1
 Views
 1,433
 Comments
 0
Abstract
Neural activity during cognitive tasks exhibits complex dynamics that flexibly encode taskrelevant variables. Chaotic recurrent networks, which spontaneously generate rich dynamics, have been proposed as a model of cortical computation during cognitive tasks. However, existing methods for training these networks are either biologically implausible, and/or require a continuous, realtime error signal to guide learning. Here we show that a biologically plausible learning rule can train such recurrent networks, guided solely by delayed, phasic rewards at the end of each trial. Networks endowed with this learning rule can successfully learn nontrivial tasks requiring flexible (contextdependent) associations, memory maintenance, nonlinear mixed selectivities, and coordination among multiple outputs. The resulting networks replicate complex dynamics previously observed in animal cortex, such as dynamic encoding of task features and selective integration of sensory inputs. We conclude that recurrent neural networks offer a plausible model of cortical dynamics during both learning and performance of flexible behavior.
https://doi.org/10.7554/eLife.20899.001Introduction
Recent evidence suggests that neural representations are highly dynamic, encoding multiple aspects of tasks, stimuli, and commands in the joint fluctuating activity of interconnected populations of neurons, rather than in the stable activation of specific neurons (Meyers et al., 2008; Barak et al., 2010; Stokes et al., 2013; Churchland et al., 2012; Raposo et al., 2014). Models based on recurrent neural networks (RNN), operating in the nearchaotic regime, seem wellsuited to capture similar dynamics (Jaeger, 2001; Maass et al., 2002; Buonomano and Maass, 2009; Sussillo and Abbott, 2009). For this reason, such models have been used to investigate the mechanisms by which neural populations solve various computational problems, including working memory (Barak et al., 2013; Rajan et al., 2016), motor control (Sussillo et al., 2015; Laje and Buonomano, 2013; Hennequin et al., 2014), and perceptual decisionmaking (Mante et al., 2013).
However, the methods commonly used to train these recurrent models are generally not biologically plausible. The most common training methods are based on supervised learning, in which a nonbiological algorithm (usually a form of backpropagation or regression) minimizes the difference between the network’s output and a target output signal (Pearlmutter, 1995; Jaeger, 2001; Sussillo and Abbott, 2009; Song et al., 2016; Rajan et al., 2016). Besides the nonbiological nature of these algorithms, the requirement for a constant supervisory signal is in stark contrast with most behavioral tasks, in which the only source of information about performance are temporally sparse rewards that are usually delayed with regard to the actions that caused them.
A more biologically plausible form of learning is rewardmodulated Hebbian learning: during ongoing activity, each synapse accumulates a potential weight change according to classical Hebbian learning, by multiplying pre and postsynaptic activities at any time and accumulating this product over time. These potential weight changes are then multiplied by a global reward signal, which determines the actual weight changes. This method, inspired by the effect of dopamine on synaptic plasticity, has been successfully demonstrated and analyzed in feedforward or weakly connected spiking (Izhikevich, 2007; Florian, 2007; Frémaux et al., 2010) and firingrate (Soltoggio and Steil, 2013) networks. However, simple rewardmodulated Hebbian learning does not work for stronglyconnected recurrent networks that can generate complex trajectories of the type discussed here (Fiete et al., 2007).
A method that successfully trains arbitrary recurrent networks is the socalled nodeperturbation method (Fiete et al., 2006, 2007). This method consists in applying small perturbations to neural activity, then calculating potential weight changes by multiplying the ‘normal’ (nonperturbative) inputs by the perturbations (rather than by postsynaptic output, as in Hebbian learning). These potential weight changes are then multiplied by a reward signal to provide the final weight changes. This method was successfully applied to feedforward networks to model birdsong learning (Fiete et al., 2007) and our own previous results show that it is also successful when applied to chaotic recurrent neural networks (Miconi, 2014). Interestingly, this method is largely similar to the wellknown REINFORCE algorithm, which is widely used in reinforcement learning (Williams, 1992; Mnih et al., 2014; Peters and Schaal, 2008; Kober et al., 2013) (see in particular Eq. 11 in Williams, 1992).
However, node perturbation is nonHebbian (since it multiplies two types of inputs, rather than pre and postsynaptic activities) and requires information that is not local to the synapse (namely, the perturbatory inputs, which must somehow be kept separate from the ‘normal’ inputs). Thus, it is not obvious how nodeperturbation could be implemented in biological neural networks. Legenstein and colleagues (Hoerzer et al., 2014; Legenstein et al., 2010) showed that, under certain conditions, nodeperturbation could be made more biologically plausible by leveraging momenttomoment fluctuations in postsynaptic activity: by keeping a running average of recent activity and subtracting it from the current instantaneous response at any time, we obtain a ‘highpass’ filtered trace of postsynaptic activity, which can be used as a proxy for the exploratory perturbations of postsynaptic activity. This can then be multiplied by the presynaptic inputs, and the final accumulated product is then modulated by a reward signal to recreate the nodeperturbation method in a more biologically plausible, Hebbian manner (i.e. as a product of presynaptic and postsynaptic activities rather than between two input sources) (Legenstein et al., 2010). This method can successfully train chaotic recurrent neural networks (Hoerzer et al., 2014). Unfortunately, this method critically requires an instantaneous, realtime continuous reward signal to be provided at each point in time. The continuous, realtime reward signal is necessary to allow the subtraction method to extract taskrelevant information, and to counter the effect of spurious deviations introduced by the runningaverage subtraction process (see Analysis). This is in contrast with most tasks (whether in nature or in the laboratory), which only provide sparse, delayed rewards to guide the learning process.
In summary, to our knowledge, there is currently no biologically plausible learning algorithm that can successfully train chaotic recurrent neural networks with realistic reward regimes.
Here we introduce a novel rewardmodulated Hebbian learning rule that can be used to train recurrent networks for flexible behaviors, with reward occurring in a delayed, onetime fashion after each trial, as in most animal training paradigms. This method is Hebbian and uses only synapselocal information, without requiring instantaneous reward signals (see Materials and methods and Analysis). We apply our method to several tasks that require flexible (contextdependent) decisions, memory maintenance, and coordination among multiple outputs. By investigating the network’s representation of taskrelevant aspects over time, we find that trained networks exhibit complex dynamics previously observed in recordings of animal frontal cortices, such as dynamic encoding of task features (Meyers et al., 2008; Stokes et al., 2013; Jun et al., 2010), switching from stimulusspecific to responsespecific representations (Stokes et al., 2013), and selective integration of sensory input streams (Mante et al., 2013). We conclude that recurrent networks endowed with rewardmodulated Hebbian learning offer a plausible model of cortical computation and learning, capable of building networks that dynamically represent and analyze stimuli and produce flexible responses in a way that is compatible with observed evidence in behaving animals.
Results
Description of the learning rule
Here we provide an overview of the networks and plasticity rule used in this paper. A full description is provided in Materials and methods. Furthermore, we provide an extensive discussion of the mechanisms underlying the rule in the Analysis section. Note that all the software used in this paper is available at http://github.com/ThomasMiconi/BiologicallyPlausibleLearningRNN.
Our model networks are fullyconnected continuoustime recurrent neural networks operating in the early chaotic regime, which allows them to autonomously generate rich dynamics while still being amenable to learning (Sompolinsky et al., 1988; Sussillo and Abbott, 2009; Jaeger, 2001; Maass et al., 2002). In most simulations, we use a canonical model in which responses are signed, and each neuron can send both excitatory and inhibitory connections (see Materials and methods). However, in the last section of Results, we build on recent work (Mastrogiuseppe and Ostojic, 2016) to test the rule on a recurrent network with nonnegative activations and separate populations of strictly excitatory and strictly inhibitory neurons, which still generates ongoing chaotic activity.
In all simulations, one or more neurons in the network are arbitrarily designated as the ‘output’ neurons, and their responses at any given time are used as the network’s response (these neurons are otherwise identical to all others). For the simulations reported here, networks include 200 neurons (400 for the motor control task).
We now briefly describe the learning rule that trains the network’s connectivity (a more complete description is provided in Materials and methods). First, in order to produce exploratory variation in network responses across trials, each neuron i in the network occasionally receives a random perturbation Δ_{i}(t) to its current excitation (perturbations are applied in all simulations, both in learning and in testing/decoding). During a trial, at every time step, every synapse from neuron j to neuron i accumulates a potential Hebbian weight change (also called eligibility trace [Izhikevich, 2007]) according to the following equation:
where r_{j} represents the output of neuron j, and thus the current input at this synapse. x_{i} represents the current excitation (or potential) of neuron i (see Materials and methods) and ${\overline{x}}_{i}$ represents a shortterm running average of x_{i}, and thus x(t)−$\overline{x}$ tracks the fast fluctuations of neuron output. Thus, this rule is essentially Hebbian, based on the product of inputs and output fluctuations. Importantly, S is a monotonic, supralinear function; in this paper, we simply used the cubic function S(x)=x^{3}, though the particular choice of function is not crucial as long as it is supralinear (see Materials and methods and Analysis). Note that the eligibility trace for any synapse is accumulated over the course of a trial, with each new timestep adding a small increment to the synapse’s eligibility trace (potential weight change).
At the end of each trial, a certain reward R is issued to the network, based on the network’s performance for this trial as determined by the specific task. From this reward, the system computes a reward prediction error signal, as observed in physiological experiments, by subtracting the expected reward for this trial $\overline{R}$ (generally a running average of previously received rewards for the same type of trial; see Materials and methods) from the actually received reward R. This rewardprediction signal is used to modulate the eligibility trace, producing the actual weight change:
where η is a learning rate constant. Together, equations 1 and 2 fully determine the learning rule described here. See Materials and methods for a complete description.
Task 1: Delayed nonmatchtosample task
The first task considered here is a simple delayed nonmatchtosample problem (Figure 1). In every trial, we present two brief successive inputs to the network, with an intervening delay. Each input can take either of two values, labelled A and B respectively. The task is to determine whether the two successive inputs are identical (AA or BB), in which case the network should output −1; or different (AB or BA), in which case the network should output 1 (see Materials and methods for a detailed description).
This simple task exhibits several interesting features. First, it is arguably the simplest possible flexible (that is, contextdependent) decision task: on sensing the second stimulus, the network must produce a different response depending on the identity of the first stimulus. Second, because the intervening delay is much longer than the neural time constant (τ = 30 ms, see Materials and methods), the network must maintain some memory of the first stimulus before the second stimulus arises. Third, to solve this task, some neurons in the network must necessarily possess some form of nonlinear mixed selectivity (note that the problem is in essence a delayed exclusiveor problem), a hallmark of neural activities in prefrontal cortices (Rigotti et al., 2013).
The networks consistently learn to perform the task with high accuracy. Figure 1b shows the time course of the median error over 20 training runs, each starting with a different randomly initialized network. The ‘error’ for a trial is the mean absolute difference between the output neuron’s response and the correct response, i.e. 1 or −1, over the last 200 ms of the trial (see Materials and methods). Shaded area indicates 1 st and third quartile over the 20 runs. To define a measure of successful performance, we set a criterion of 95 trials with error lower than 1 (indicating the mean response is closer to the correct than to the incorrect response, i.e. of the correct sign) over 100 successive trials (p<10^{−20} under random choice, binomial test). The median time to criterion across 20 runs is 843 trials (interquartile range: 692–1125). Response error reliably converges towards a very low residual value.
How does the network represent and maintain traces of incoming stimuli? One possibility is that certain neurons encode stimulus identity by maintaining a stable ‘register’ value over time, such that the firing rate of certain cells directly specify stimulus identity in a relatively timeindependent manner. By contrast, physiological studies suggest that neural coding during a working memory task is highly dynamic, with stimulus identity being represented by widely fluctuating patterns of neural responses, in such a way that the tuning of individual neurons significantly changes over the course of a trial (Meyers et al., 2008; Barak et al., 2010; Stokes et al., 2013). As shown in Figure 1c, trained networks do exhibit highly dynamic responses over the course of a trial. This suggests that the networks might make use of dynamic representations.
To analyze the encoding and maintenance of stimulus identity over time in the network, we used a crosstemporal classification approach (Meyers et al., 2008; Stokes et al., 2013; Dehaene and King 2016). We trained a maximumcorrelation classifier to decode various taskrelevant features (identity of first and second stimulus, and final response), based on wholepopulation activity at any given time, and then used these timespecific classifiers to try and extract the same taskrelevant features at all possible points in time. This method can detect not only whether the network encodes a certain taskrelevant variable, but also whether the representation of this variable changes over time (see Materials and methods).
The results in Figure 2 suggest a highly dynamic representation of stimuli by the network. For example, the identity of the first stimulus can be successfully decoded during both first and second stimulus presentations, as well as during the intervening delay, as shown by high classification accuracy values on the diagonal during this entire period (Figure 2, left panel). However, the crosstemporal classification performance between these two periods, as seen on the offdiagonal areas (for example, in the areas at 0–200 ms on one axis and 400–600 on the other, corresponding to training the classifier based on data from one stimulus presentation and testing it on data from the other stimulus presentation) is essentially at chance level (accuracy ~0.5), or even below chance (dark patches). This suggests that while the network reliably encodes information about 1ststimulus identity across the first 800 ms of the trial, the way in which this identity is represented changes widely between successive periods within the trial. Similarly, the 2ndstimulus identity is maintained from its onset until the beginning of the response period, but in a dynamical manner (low offdiagonal, crosstemporal accuracy between the 400–600 ms period and the 600–800 ms period, in comparison to the high diagonal accuracy over the entire 400–800 ms period).
Another feature of these plots is that the accuracy of stimulus identity decoding strongly decreases over the course of the ‘response’ period (low values along the diagonal for the 800–1000 ms in first and second panel of Figure 2). This suggests that the network largely stops maintaining information about the specific identity of previous stimuli, and instead encodes solely the actual response, as shown by the very strong classification accuracy in the upperright portion of the third panel.
To test this interpretation, following (Stokes et al., 2013), we produce Multidimensional scaling (MDS) plots of population activity at different time points and for different stimulus conditions (Figure 3). MDS attempts to find a twodimensional projection such that the distance between any two data points is as similar as possible to their actual distance in the fulldimensional space: nearby (distant) population states should thus produce nearby (distant) points on the MDS plot. Early in the trial, all possible stimulus identity combinations generate different, consistent trajectories, indicating stimulusdependent encoding. By the late response period, however (1000 ms), the trajectories have essentially merged into two clusters, corresponding to the network response (‘same’ or ‘different’) and largely erasing any distinction based on the specific identity of either first or second stimulus. Thus, during the response period, the network moves from a stimulusspecific representation to a responsespecific representation: the stimulusspecific response is flexibly routed to the appropriate, contextdependent response state, as previously observed in cortical activity during a flexible association task (Stokes et al., 2013).
Task 2: Flexible selective integration of sensory inputs
An important aspect of cognitive control is the ability to attend selectively to specific portions of the sensory input, while ignoring the rest, in a flexible manner. Recently Mante, Sussillo and colleagues have studied the neural basis of this ability in macaque monkey prefrontal cortex (Mante et al., 2013). Monkeys were trained to report either the dominant color or the dominant motion direction of randomlymoving colored dots. Thus, the same stimulus could entail different appropriate responses depending on current context (i.e. which modality  color or motion  was relevant for this trial). Furthermore, due to the noisy stimulus, the task required selective temporal integration of the relevant sensory input. In addition to neural recordings, Mante and colleagues also trained a recurrent neural network to perform the same task, using supervised learning based on Hessianfree optimization. By analyzing the trained network, they identified mechanisms for selective integration of contextdependent inputs in a single network (Mante et al., 2013). This task was also used as an example application by Song and colleagues for their recurrent network training framework (Song et al., 2016).
We trained a network to perform the same task, using our proposed plasticity rule (see Figure 4). Our settings are deliberately similar to those described by Mante, Sussillo and colleagues. The network has two ‘sensory’ inputs (representing the two stimulus modalities of motion and color), implemented as random (Gaussian) time series, with a randomly chosen mean for each trial; the mean of each timeseries thus represent the ‘value’ of the corresponding modality for this trial. In addition, two ‘context’ inputs specify which modality is relevant for each trial. The network must produce output 1 if the contextindicated sensory input has a positive mean, or −1 if it has negative mean (see Materials and methods for a detailed description).
Figure 4b shows the psychometric curves of a fullytrained network, that is, the mean response as a function of stimulus value. For either modality, we show separate psychometric curves for when this modality was the relevant one and when it was irrelevant. When trials are sorted according to the value of the relevant modality, responses form a steep sigmoid curve with a relatively sharp transition between −1 and +1 centered roughly at 0. By contrast, when trials are sorted according to the value of the irrelevant modality, responses are evenly distributed across the entire range. Thus, the network accurately responds to the relevant signal, while largely ignoring the irrelevant one in each context (Compare to Figure Extended Data 2 in Mante et al., 2013). This indicates that the network has learned not only to perform temporal integration of an ambiguous, stochastic input, but also to flexibly ‘attend’ to different input streams depending on context.
How is information represented in the network over time? We use Mante and Sussillo’s orthogonal decoding procedure, which seeks to extract independent measures of how various task features (stimulus values, context, decision) are encoded in the network (see Materials and methods). Briefly, this method consists in using multiple linear regression to measure how much certain taskrelevant features are being independently represented by the network at any time (see Materials and methods for a detailed description). The results are shown in Figure 5 (compare to Figures 2 and 5 in Mante et al., 2013). These trajectories plot the evolution of network information over time, for various combinations of context (relevant modality) and task features. Each trajectory represents the average population activity, at successive points in time, of all correct trials that have the same bias value for the averaging modality and resulted in the same final choice. Trajectories are colored according to the bias value of the averaging modality, ranging from −0.25 (bright red) to 0.25 (bright green). We project these averaged population trajectories along the orthogonal feature dimensions extracted by orthogonal decoding, which tells us how strongly the network encodes this particular feature at a given time. We then plot the resulting trajectories in feature dimension subspaces (‘final choice’ dimension is always used as the horizontal axis, while the vertical axis may be either of the two sensory modality dimensions).
As observed in cortical recordings (Mante et al., 2013), these trajectories reveal that both the relevant and the irrelevant modality are actually represented in the network: the trajectories for varying value of either modality form an ordered progression in the corresponding ‘modality’ dimension (yaxis), even when that modality is irrelevant (bottomleft and topright panels in Figure 5); however, only the relevant modality correlates with representation of final choice (compare panels where trajectories are separated by value of the relevant vs. irrelevant modality), in accordance with physiological observations (Mante et al., 2013). This confirms that the network has learnt to selectively integrate the contextindicated variable while discarding the irrelevant one for each trial.
Task 3: Controlling a musculoskeletal model of the human arm
In both of the previous tasks, the network output was a single response channel. However, flexible behavior often requires coordinating multiple outputs, especially during movement. To test whether our plasticity rule can produce coordinated multiplexed responses, we trained a network to control a biomechanical model of the human arm. The model is a custom modification of the one described in Saul et al. (2015) (itself an extension of Holzbaur et al., 2005) and uses the Thelen muscle model (Thelen, 2003). The model implements the human upper skeleton, with 4 degrees of freedom (three at the shoulder, one at the elbow), actuated by 16 muscles attached to the shoulder, chest, and upper and lower arm bones. Each of the 16 muscles is controlled by a specific network output cell. The task is to reach towards one of two spherical targets, located in front of the body on either side of the sagittal plane. The appropriate target ball for each trial is indicated by two input channels, set either to 1 and 0 (leftside target) or to 0 and 1 (rightside target) respectively for the entire duration of the trial (700 ms). No other inputs are provided to the system. The error at the end of each trial is measured by the absolute distance between the tip of the hand and the center of the target ball, plus a small penalty for total muscle activation over the entire trial. Note that while the target balls are symmetrically arranged with regard to the body, they are not symmetrical with regard to the right arm (which is the one we model): the rightside ball is closer than the leftside one, and thus reaching either target requires qualitatively different movements.
Results are shown in Figure 6. Initially, as expected, the untrained network performs random, aimless movements, resulting in high initial error (Figure 6b). Performance improves almost from the start of the training process, reaching a low residual error after about 3000 trials. To visualize the impact of training on the dynamics of population activity, we project the population activity over its first three principal components at successive points in time over the course of each trial, using 16 trials for either target context, both before and after training (i.e., 64 trials in total). The fully trained network correctly reaches the adequate target according to context (Figure 6c).
Additional experiments: long delays, variable timing, and excitatoryinhibitory networks
Here we return to the delayed nonmatchtosample task, in order to test the learning rule under various changes in experimental conditions. All results in this section are obtained with the delayed nonmatchtosample task as described above, with modifications specified below.
First, we test the ability of the rule to deal with much longer delays. We extend the stimulusabsent delay period, from 200 ms to one full second. In addition, stimulus presentations are extended from 200 ms to 400 ms. With the 200 ms response period, this sums up to a total time of 2000 ms for each trial. We found it useful to reduce the learning rate from 0.1 to 0.03. As shown in Figure 7a, the rule still learns the task reliably, although reaching criterion (95% of trials with a mean absolute error over the response period lower than 1) requires more trials (median over 20 runs: 2230 trials, interquartile range:1366–4573 trials).
We then test the ability of the rule to deal with variable trial structure, and in particular, with variable stimulus timing. For each trial, we randomly pick the interstimulus delay period between 300 and 800 ms (as opposed to the previous fixed 200 ms duration), while stimulus presentation is extended to 300 ms. The total duration of the trial is always 1600 ms, with the last 200 ms being the response period. As shown in Figure 7b, the rule still manages to learn the task; however, the number of trials required to learn the task is now much larger. Indeed, the increased variance introduced by variable stimulus timing had a strong destabilizing effect: we found it necessary to reduce the learning rate to 0.003 to obtain reliable learning. Thus, variable stimulus timing makes learning much more difficult, though still feasible.
Finally, we modify the network model to make it more realistic, by enforcing nonnegative neural responses and separate populations of strictly excitatory and strictly inhibitory neurons (Dale’s law). In accordance with most related work (Sussillo and Abbott, 2009; Hoerzer et al., 2014; Sussillo et al., 2015), previous experiments used a canonical, widely studied recurrent network model (Sompolinsky et al., 1988), where neural responses can be negative and neurons send both positive and negative connections (see Materials and methods). However, recently, Mastrogiuseppe and Ostojic (Mastrogiuseppe and Ostojic, 2016) showed that strong theoretical results could be extended to networks with nonnegative responses and separate populations of strictly excitatory and inhibitory neurons; in particular, they proved the existence of a connectivity regime that guarantees ongoing, chaotic activity in the network, without reaching saturation.
We sought to test whether the rule proposed here can be used to learn cognitive tasks in a network of this type. We implemented a network with separate excitatory and inhibitory neurons, initialized with semisparse connections, with nonnegative, piecewiselinear activation functions (see Materials and methods). Importantly, the responses are now constrained between 0 and 20 (although in practice they never exceed ~10, thus remaining far from saturation). The target response for identical stimuli is 0 (instead of −1), and the target response for different stimuli is 5 (instead of 1). The criterion is now to reach 95% of (absolute) errors below 2.5, which ensures that responses are closer to the correct than the incorrect response.
As shown in Figure 7c, the rule can still reliably learn the task with this more realistic network. While learning takes longer, a direct comparison is difficult due to the differences in network activity and the wider range of possible responses. Nevertheless, these results confirm that the proposed rule is applicable to more realistic networks with nonnegative, nonsaturating responses and separate excitatory and inhibitory neurons.
Analysis of the proposed rule
Here we attempt to reach a better understanding on how and why the learning rule actually works. We also put the rule in the context of related learning algorithms, including the nodeperturbation method (Fiete et al., 2006) and the Exploratory Hebbian method (Hoerzer et al., 2014; Legenstein et al., 2010). We support our discussion with simple quantitative experiments that allow us to isolate the impact of various factors on the performance of the learning rule.
Overview
The rule proposed here is a more biologically plausible implementation of the socalled nodeperturbation rule (Fiete et al., 2006, 2007), which is itself a variant of the classical REINFORCE algorithm (Williams, 1992). Nodeperturbation (like REINFORCE) consists in applying random perturbations to neural responses, then modifying the weights so as to make future responses more similar (resp. less similar) to the perturbationinduced responses, if the perturbed outputs led to a betterthanexpected (resp. worsethanexpected) reward. Like the ExploratoryHebbian (EH) method of Legenstein, Hoerzer and Maass (Legenstein et al., 2010; Hoerzer et al., 2014), our rule extracts exploratory perturbations from outputs by subtracting a running average from ongoing neural responses. Unlike the EH method, however, it can actually learn from sparse, delayed rewards at the end of each trial, without requiring a continuous, realtime reward signal. As we explain below, this is due to the supralinear amplification of plasticity increments.
Nodeperturbation and REINFORCE
The nodeperturbation rule (Fiete et al., 2006, 2007) is a reinforcement learning rule that can train neural networks using only sparse, delayed reward signals to guide the learning. The intuitive mechanism of the rule is to apply random exploratory perturbations to neural output, then modify the weights so that future, unperturbed outputs will be more similar to this perturbed output if this perturbed output turned out to elicit a ‘good’ reward  or conversely, less similar if it produced a ‘poor’ reward (note that this is the central idea of the REINFORCE algorithm [Williams, 1992]).
The nodeperturbation method can be summarized as follows (Fiete et al., 2006):
1 During each episode, apply random exploratory perturbations ξ(t) to the neuron’s output y_{i}(t).
2 At each synapse, compute the socalled eligibility trace e_{i,j} by accumulating the products of output perturbations by current input at this synapse at the time of perturbation: e_{i,j} = Σ_{t}ξ _{i}(t) x_{j}(t)
3 At the end of each episode, compute the reward R for this episode.
4 Add to each synaptic weight the accumulated eligibility trace, multiplied by the net (baselinesubtracted) reward for this episode: Δw_{i,j} = η * e_{i,j} * (R  R_{0}).
In step 4, η is a learning rate parameter, and R_{0} is the predicted reward for this episode in the absence of perturbations; thus, RR_{0} reflects whether the trajectory of outputs produced during this episode was ‘better’ or ‘worse’ than usual.
The effect of step 4 is to make future trajectories more similar to the justexperienced perturbed trajectory (in response to the same inputs), if this perturbed trajectory produced a ‘good’ reward; or less similar, if the trajectory produced a ‘bad’ reward. This is because adding ξ(t)x_{j}(t) to w_{i,j} will move future responses to the same input x_{j}(t) in the direction of ξ(t), and thus tend to reproduce the stochastically perturbed response. This can be easily shown analytically (notice that $\frac{\partial y\left(t\right)}{\partial w}=x\left(t\right)$, at least in the limit of a linear neuron). It also makes intuitive sense: if the perturbation was positive (resp. negative), this rule would add a positive (resp. negative) multiple of x(t) to the synaptic weight, which would make the weights more (resp. less) correlated with input x(t) and thus produce a higher (resp. lower) response to future presentations of the same input x(t).
While node perturbation was introduced by Fiete and Seung (Fiete et al., 2006), the original REINFORCE paper describes several rules that implement nodeperturbation, i.e. modifying weights by the accumulated product of inputs by perturbations, multiplied by net rewards (see e.g. Eq. 11 in Williams, 1992).
The EH method: Hebbian implementation of nodeperturbation by subtracting running averages
A difficulty with the nodeperturbation rule is that it is hard to reconcile with existing models of synaptic plasticity. It requires that each synapse maintains a distinction between the ‘actual’ inputs and the ‘perturbation’ input, and then perform learning by multiplying these two different types of inputs. The biological mechanism for such learning is not obvious. This is also in contrast with standard models of plasticity based on Hebbian learning, that is, a product between inputs and outputs rather than between two forms of input.
Legenstein, Hoerzer and Maass have proposed a method, which they call the Exploratory Hebbian (EH) method, to implement nodeperturbation in a biologically plausible, Hebbian manner, using information local to the synapse (Legenstein et al., 2010). The central idea is that if perturbations are fast and strong enough, they can be extracted from ongoing neuron output by subtracting a fast running average, which should isolate fast fluctuations in neural responses. The eligibility trace is then computed as the product of inputs by these fast fluctuations in output (which are deemed to represent mostly the exploratory perturbations), producing a Hebbian, synapselocal rule which implements the nodeperturbation rule.
The EH rule is expressed formally as follows (Legenstein et al., 2010):
Where x_{j} is the input at the synapse, y_{i} is the neuron’s output, and the overbar denotes a shortterm running average (i.e. $\overline{Y}$(t + 1) = α $\overline{Y}$(t) + (1α) y(t) where α is a constant). Thus, y_{i}(t)  $\overline{Y}$_{i}(t) extracts fast fluctuations in neural output, while R(t) $\overline{R}$(t) extracts fast fluctuations in the reward signal.
Hoerzer and colleagues (Hoerzer et al., 2014) showed that this rule still works if the R(t)$\overline{R}$(t) term is replaced by a less informative quantity M(t) which is 1 if R(t)>$\overline{R}$(t) and 0 otherwise. They also showed that this rule can be used to successfully train chaotic recurrent networks for complex tasks.
Removing the need for continuous reward signals with supralinear amplification
A difficulty with the EH rule is that it requires a continuous, realtime reward signal R(t): at every point in time, the system must know whether it is doing better or worse than before. This negates a central advantage of reinforcement learning: the ability to learn from sparse, delayed rewards.
One reason why realtime reward signals are needed in the EH rule is that simply subtracting a running average from ongoing activity does not reliably isolate external perturbations, due to spurious relaxation effects. To take a maximally simplified example, consider the example trace in Figure 8. This represents the output of a single neuron receiving constant inputs, with a single positive perturbation at time T = 100 (top panel). To extract fluctuations, we subtract a running average from the output (bottom panel).
When the perturbation is applied, initially the output is larger than the running average, and thus the difference reflects the perturbation (‘Perturbation effect’ gray area). However, the running average then increases to include the recent perturbed outputs, and now the difference between decaying ongoing activity and running average switches to negative (‘Relaxation effect’ gray area). These negative relaxation terms will be accumulated into the Hebbian product and counteract the positive, perturbationrelated initial terms. In fact, for the simple case shown in Figure 8 (constant input and a single perturbation) the total sum of all $y\left(t\right)\overline{y}\left(t\right)$ terms over a full episode will converge to zero, and so will the eligibility trace (being a product of constant inputs by fluctuations summing to zero).
In more complex cases such as recurrent networks considered here, with timevarying inputs, the cancellation will only be partial, depending on the amount of autocorrelation in the inputs. The eligibility trace will include both the product of the ‘perturbation effect’ by the inputs at the time of the perturbation (which is what we want), and also the product of immediately subsequent, unpredictable inputs by the ‘relaxation effect’ (which is undesirable). The latter part adds uncontrollable (‘garbage’) components to the eligibility trace. Furthermore, if there is any temporal correlation in successive inputs, then these ‘garbage’ components will tend to partially cancel out the perturbationrelated terms. This, we suggest, causes the inability of the EH rule to learn without continuous, realtime reward signal (we show experiments in support of our interpretation below, in Figure 9).
The Exploratory Hebbian method negates this undesirable effect by using a realtime reward signal R(t) which is also highpassed by subtracting its own running average $\overline{R}$(t). Now the same relaxation effects will occur both in the y and the R traces, and thus the product of relaxation terms in both traces results in a positive value, which will actually reinforce the perturbationrelated terms rather than cancel them (assuming that inputs have any significant temporal autocorrelation).
This analysis immediately suggests an alternative solution to the problem. Notice that the perturbationrelated fluctuation is large, while the subsequent countering relaxation terms are small (‘Perturbation effect’ vs. ‘Relaxation effect’ in Figure 8). Thus, if we impose a supralinear function on plasticity increments, the large perturbationrelated terms will be amplified, while the small relaxationrelated terms will be suppressed. This leads to the plasticity rule proposed in the present paper.
Note that this is conceptually related to recentlyproposed thresholded Hebbian rules, whereby plasticity is only triggered by events in which the Hebbian product reaches a certain threshold (Soltoggio and Steil, 2013). A supralinear amplification offers a smoother amplification of larger Hebbian events, by comparison to the allornothing effect of a threshold; however, the overall effect is similar: ignore small, possibly incidental correlations of input and output, but retain the larger ones, which are more likely to be informative.
Why not just use rewardmodulated hebbian learning?
Nodeperturbation increases weights by a product of (net) rewards, inputs and perturbations of outputs. Why use the perturbations, rather than the raw outputs, in the product? Why not just use the standard Hebbian product of inputs by outputs (rather than perturbations), and multiply this by the reward? This is because the full Hebbian products (inputsoutputs) are dominated by rewardunrelated terms. The Hebbian products tend to simply reinforce all existing cooccurrences of inputs and outputs, regardless of how this increase would affect performance. By contrast, the perturbationbased products ensure that the weights are modified specifically to reproduce the perturbationcaused change in trajectory. The two have no reason to correlate, and the former is usually larger than the latter if perturbations are sparse.
In theory, it is possible to average out the nonrewardrelated component of rewardmodulated Hebbian learning, by making sure that the reward signal is absolutely zerocentered separately for each trial type (that is, for each possible combination of input range and target outputs) (Frémaux et al., 2010; Sprekeler et al., 2009). This corresponds to the role of a critic (a complex reward predictor) in reinforcement learning, and is implemented in the present paper by maintaining separate values of $\overline{R}$ for each trial type. However, this procedure leaves us with the problem of considerable variance, which suffices to make learning all but impossible in complex tasks (see the specific discussion of this point in Legenstein et al. (2010): as connection weights increases, the purelyHebbian component will tend to take over and drown the rewardrelated component).
This problem is compounded in recurrent networks such as the one described here, because now a single perturbation can have significant, unpredictable effects on future neural responses over the rest of the trial. For example, a single positive perturbation can have an arbitrary indirect effect on the response of the perturbed neuron, even making it eventually lower than it would have been over an extended period of time. As a result, the accumulated product of inputs and outputs would not be dominated by the (singlepoint) response increase caused by the perturbation, but by the (extended) subsequent decrease in response, which would actually have unpredictable effects on weight modification and future trajectories. Note that this effect (Hebbian learning futilely learning the coactivations during the perturbed trajectories, rather than the perturbations that cause the changes in trajectories) cannot be averaged out by centering the rewards, since the trajectories are what determines the rewards.
By contrast, nodeperturbation only modifies weights to incorporate products of perturbations and inputs at the time of the perturbation. As a result, it will tend to specifically reproduce the perturbation itself, that is, the event that caused the complex changes in trajectory, and thus in reward.
Gradient visualization
To better illustrate the effects of the rule, we run a simple experiment to compare the weight changes computed by our rule with those prescribed by several variants and other rules under the same conditions, isolating the effects of specific elements in the rules.
The experiment involves a recurrent network similar to those described in the previous sections. The task of the network is simply to determine whether a set of inputs have a positive or negative mean. In each episode, a vector of 10 randomly chosen inputs is presented for 100 ms, then (after a delay of 100 ms) the output neuron of the network must return 1 if the inputs have positive mean, or −1 if the inputs had negative mean, over the last 100 ms of the trial (thus each trial lasts 300 ms in total). The reward for each trial is computed as the mean negative absolute difference between neuron output and target response (−1 or 1) over the last 100 ms. At each episode, we apply a single perturbation (of random sign), at a fixed point in time within the response period. We then use various plasticity rules, including the one proposed in this paper, to compute the appropriate change in weights (hereafter called the ‘gradient’), and compare the gradients obtained from different rules. Note that no learning actually occurs: weights are randomly initialized at each trial, since we are only interested in comparing the gradients obtained under various rules.
Figure 9a plots the gradient obtained by the rule presented here, against the gradient obtained by the nodeperturbation method, for many episodes (each with randomly generated weights). This reveals that the two rules produce largely similar gradients, with some warping introduced by the nonlinearity. Next we apply the rule described in this paper, but without supralinear amplification: the eligibility trace is now simply the accumulated produce of inputs by fluctuations, and this is then multiplied by the net reward. Note that this is equivalent to using the ExploratoryHebbian method, but with a single reward value for each episode, rather than a continuous realtime reward signal. The computed gradients (Figure 9b) are now uncorrelated with the nodeperturbation gradients, as expected from the reasoning presented above. This illustrates the crucial role of the supralinearity. To emphasize that this difference is caused by postperturbation effects, Figure 9c uses the same method as Figure 9b (EH with delayed rewards, no supralinear amplification), but now only accumulates the eligibility trace product over the 10 ms that follow the perturbation time. Since neurons have a time constant of 30 ms, this results in only partial relaxation. The gradients are now largely aligned with the nodeperturbation gradients. If only 1 ms was used, the gradients would be identical to nodeperturbation gradients and the points would fall exactly on the median line. As the time window is increased, relaxation effects eventually overwhelm the gradient and leave a residue dominated by noise, as in Figure 9b.
As a safety check, Figure 9d confirms that the full EH method (with a realtime, continuous reward signal) correctly recovers the nodeperturbation gradients in our experimental settings.
Finally, we emphasize that the choice of supralinear function is not crucial, as long as it is indeed supralinear. In Figure 9e, using a different supralinear function (the signed square function f(x) = x.x) produces largely similar result to Figure 9a. By contrast, in Figure 9f, using a sublinear function (namely the square root) largely randomizes the obtained gradients, since the (large) perturbation effects are now suppressed with regard to (smaller) nonperturbation related effects.
Discussion
This paper makes three contributions:
1 We introduce a biologically plausible learning algorithm that can train a recurrent neural network to learn flexible (contextdependent) tasks, using only timesparse, delayed rewards and synapselocal information to guide learning.
2 We show that this rule can train networks for relatively complex tasks, requiring memory maintenance, selective attention, and coordination of multiple outputs.
3 We show that the trained networks exhibit features of neural activity observed in the primate higher cortex during similar tasks. In particular, we demonstrate highly dynamic populationwide encoding of taskrelevant information, as observed in neural recordings (Meyers et al., 2008; Stokes et al., 2013; Barak et al., 2010); and we show that selective integration of sensory inputs occurs as described in both observational and modelling studies of primate prefrontal cortex during a similar selective attention task (Mante et al., 2013). In our view, the fact that these features of cortical activity arise spontaneously in networks trained with a biologically plausible rule (as opposed to training the network to directly reproduce observed neural activity traces) increases the plausibility of recurrent neural networks as a model of cortical computation, during both performance and learning of cognitive tasks.
Our proposed plasticity rule implements rewardmodulated Hebbian learning between inputs, outputs, and rewards, with the crucial introduction of a supralinear amplification applied to the Hebbian plasticity increments (see Materials and methods). In other words, we posit that plasticity is dominated by large cooccurrences of inputs and outputs, while smaller ones are relatively ignored. This hypothesis of nonlinear effects in Hebbian plasticity allows our rule to support robust learning in highly dynamic networks, without requiring nonHebbian plasticity between segregated driving and perturbatory inputs (Fiete et al., 2007), or a continuous, realtime reward signal (Legenstein et al., 2010; Hoerzer et al., 2014) (see Materials and methods and Analysis). We note that this suggestion is similar to the independent proposal of socalled thresholded Hebbian rules (Soltoggio and Steil, 2013), in which plasticity is only triggered if the Hebbian product reaches a certain threshold.
The flexible, dynamic coding observed in prefrontal activity has led to suggestions that cortex implements ‘silent’ memory traces by using shortterm synaptic plasticity (Barak et al., 2010; Stokes, 2015). Shortterm synaptic plasticity clearly plays an important role in neural responses, and may well play an important role in maintaining a ‘hidden internal state’ of the network (Buonomano and Maass, 2009). However, our network does not implement shortterm synaptic plasticity; no weight modification occurs during the course of a trial (all learning occurs between trials), and all the decoding results reported above were obtained with frozen synaptic weights. Our results suggest that the highly dynamic activities spontaneously produced by nearchaotic recurrent networks can be harnessed to produce the dynamic encodings observed in experiments, using only sparse, delayed rewards and biologically plausible plasticity rules. Thus, while shortterm synaptic plasticity clearly affects neural responses, it may not be required to explain the highly dynamic nature of workingmemory encodings.
It is unlikely that cortical connectivity should be drastically and finely remodeled through a long training process for any new task. For example, while monkeys require extensive training to perform decision tasks, human subjects can quickly perform new tasks simply by verbal instruction. Rather, it is more likely that the process of slow, rewardmodulated synaptic modification in cortical circuitry depicted here reflects the learning of functional networks capable of implementing a certain type of task (or cognitive ability), which must then be activated and parameterized for each instance of the task. The latter process of flexible task specification is likely to involve not just other cortical areas, but also the basal ganglia and dopamine system. Elucidating the interactions between cortical, limbic, and dopaminergic structures is an important future task for the study of flexible behavior and its neural implementation.
Materials and methods
Model description
Here we provide a full description of our model and proposed plasticity rule, with an emphasis on implementation details. In the Analysis section, we provide an extended discussion at a more intuitive level. Note that the source code for all simulations reported here is available online at http://github.com/ThomasMiconi/BiologicallyPlausibleLearningRNN.
Network models
For most of our experiments, the model is a fullyconnected continuoustime recurrent neural network of N neurons, governed by the classical RNN equations (Sompolinsky et al., 1988; Sussillo and Abbott, 2009; Jaeger, 2001; Maass et al., 2002):
where x_{i} is the excitation (or ‘potential’) of neuron i, r_{i} is its response (or ‘firing rate’ / activity), J_{i,j} is the connection weight from neuron j to neuron i, u_{k}(t) is the current value of each of the M external inputs to the network, and B_{i,k} is the connection weight from external input k to neuron i (τ is the relaxation time constant of neuron activation). J is initialized with weights taken from a normal distribution with mean 0 and variance g^{2}/N, while the input weights B_{k,i} are fixed and taken from a uniform distribution over the [−1,1] interval. Activations x_{i} are initialized at the start of every trial with uniform noise in the [−0.1, 0.1] range. For the simulations reported here, N = 200 (400 for the motor control task), τ = 30 ms, and g = 1.5. Note that the latter value places the networks in the early chaotic regime, where the longterm behavior generally remains nonperiodic (Sompolinsky et al., 1988).
These canonical networks have signed responses and mix positive and negative weights, which is unrealistic. However, recently, Mastrogiuseppe and Ostojic have studied the conditions under which chaotic ongoing activity can emerge in excitatoryinhibitory networks with nonnegative responses (Mastrogiuseppe and Ostojic, 2016). Using their results, we also implemented a network that produces ongoing chaotic activity with nonnegative responses, and separate populations of strictly excitatory and strictly inhibitory neurons (in accordance with Dale’s law). As we demonstrate in the last section of Results, the rule can still learn cognitive tasks in this more realistic network. In these simulations, there are 100 excitatory and 100 inhibitory neurons. The neuron response function is a nonnegative piecewiselinear function:
We set −m=−2 and M = 20. Note that the network is largely nonsaturating: neural responses usually do not reach responses close to the maximum M, which is another similarity with cortical networks. The connection matrix is initialized as semisparse: each neuron receives connections from 50 randomly chosen excitatory neurons and 50 randomly chosen inhibitory neurons. All excitatory connections have initial weight 1. All inhibitory connections have initial weight −g_{inhib} = −1.2. As learning occurs, connection weights change from their initial values; however, connections from excitatory neurons are always clipped to 0 from below, while connections from inhibitory neurons are always clipped to 0 from above. The network otherwise operates as described above (in particular, the differential equation governing x_{i}(t) is unchanged).
In all simulations, four arbitrarily chosen neurons have a constant activation x = 1 and thus provide a bias input to other neurons. There is no separate feedback or output network. Instead, one or more neurons in the network are arbitrarily designated as the ‘output’ neurons, and their responses at any given time are used as the network’s response (these neurons are otherwise identical to all others).
Learning rule
Synapses between neurons are modified according to a novel form of rewardmodulated Hebbian learning, which we now describe.
First, in order to produce exploratory variation in network responses across trials, each neuron in the network occasionally receives a random perturbation Δ_{i}(t) to its activation; these perturbations are not segregated from ‘normal’ inputs (in contrast to Fiete et al., 2006, 2007). Note that Δ_{i}(t) might also represent random noise, or a ‘teaching’ signal from a different area. Also, perturbations are applied in all simulations reported here, both at learning and testing/decoding time (perturbations correspond to the sharp spikes in the curves in Figure 1c, which would otherwise show only smooth curves; they also cause the spread of dots around their central values in Figure 3). In this paper, Δ_{i}(t) is taken from a uniform distribution within the [−0.5, 0.5] range, occurring randomly and independently for each neuron with a mean rate of 3 Hz (rates of 1 Hz or 10 Hz also give satisfactory results).
During a trial, at every time step, every synapse from neuron j to neuron i accumulates a potential Hebbian weight change (also called eligibility trace [Izhikevich, 2007]) according to the following equation:
Remember that r_{j} represents the output of neuron j, and thus the current input at this synapse. x_{i} represents the activation of neuron i and $\overline{x}$_{i} represents a shortterm running average of x_{i}, and thus $x\left(t\right)\overline{x}$ tracks the fast fluctuations of the neuron’s output. Thus this rule is essentially Hebbian, based on the product of inputs and output (fluctuations). Crucially, S is a monotonic, supralinear function of its inputs; in other words, we posit that the plasticity mechanism is dominated by large increments, and tends to suppress smaller ones. The particular choice of S is not critical, as long as it is supralinear. In this paper we simply used the cubic function S(x)=x^{3}. Signpreserving squaring S(x) = xx also gives satisfactory results; however, simply using the identity function fails to produce learning. The supralinear amplification of cooccurrences allows our learning rule to successfully learn from instantaneous deviations of activity, using only sparse, delayed rewards, without requiring a continuous, realtime reward signal (Legenstein et al., 2010; Hoerzer et al., 2014); see Discussion and Analysis.
Note that the eligibility trace for any synapse is accumulated over the course of a trial, with each new timestep adding a small increment to the synapse’s eligibility trace / potential weight change.
At the end of each trial, a certain reward R is issued to the network, based on the network’s performance for this trial as determined by the specific task. From this reward, the system computes a reward prediction error signal, as observed in physiological experiments, by subtracting the expected reward for this trial $\overline{R}$ (see below for computation of$\overline{R}$ $\overline{R}$) from the actually received reward R. This rewardprediction signal is used to modulate the eligibility trace into an actual weight change:
where η is a learning rate constant, set to 0.5 for all simulations described here.
To compute the reward prediction error signal (R$\overline{R}$), we need to estimate the expected reward in the absence of perturbation, $\overline{R}$. Following (Frémaux et al., 2010), we simply maintain a running average of recent rewards for trials of the same type (where trial type is determined by the combination of inputs). As (Frémaux et al., 2010) pointed out, it is important that separate traces should be maintained for each trial type, so as to provide an accurate estimation of the expected reward $\overline{R}$ for each trial. Thus, after the nth trial of a given type, $\overline{R}$ is updated as follows:
Where R(n) is the reward for this trial, and $\overline{R}$(n1) was the expected reward after the previous trial of the same type. In all simulations, α_{trace} = 0.33.
To stabilize learning, we clip the weight modifications for each trial to have a maximum absolute value of 10^{−4} (across experiments, roughly 10% of all potential weight modifications exceed this value and are clipped).
Description of tasks
Delayed nonmatchtosample task
The first task considered here is a simple delayed nonmatchtosample problem (Figure 1). In every trial, we present two brief successive inputs to the network, with an intervening delay. Each input can take either of two values, labelled A and B respectively. The task is to determine whether the two successive inputs are identical (AA or BB), in which case the network should output −1; or different (AB or BA), in which case the network should output 1. We specify the input stimuli by using two different input channels u1 and u2; the identity of the input stimulus is determined by which channel is activated (i.e., for stimulus A, u1 = 1 and u2 = 0; for stimulus B, u1 = 0 and u2 = 1; remember that each input channel u_{k} is transmitted to the network by its own independent set of weights  see Model Description above). In every trial, the first stimulus is presented for 200 ms, then after a 200 ms delay the second stimulus is presented for 200 ms. Outside of input presentation periods, both input channels are set to 0. The trial goes on for an additional 400 ms, thus each trial is 1000 ms long. The network’s overall response is determined by the activity of the arbitrarily chosen output neuron over the last 200 ms of the trial (the socalled ‘response’ period). The overall error for this trial is the average absolute difference between the network’s output (that is, the activity of the output neuron) and the target response (1 or −1 depending on presented stimuli), over these last 200 ms.
For details on how the network activity was analyzed for Figures 2 and 3, see below.
Selective integration of contextcued sensory inputs
This task was introduced by Mante and colleagues (Mante et al., 2013). In this study, monkeys looked at randomlymoving colored dots, in which both the value and coherence of motion direction and dot color varied from trial to trial. Monkeys had to report the dominant motion direction, or the dominant color, according to current task conditions; thus, the same stimulus could entail different appropriate responses depending on current context. Furthermore, due to the noisy stimulus, the task required temporal integration of sensory input. Importantly, the authors showed that prefrontal neurons registered inputs from both the relevant and the irrelevant modality; however, inputs from the irrelevant modality had no longterm impact on neural activities, while inputs from the relevant modality were selectively integrated over time. Thus, only information from the relevant modality contributed to the final decision.
Our settings are deliberately similar to those described by Mante, Sussillo and colleagues in their neural network implementation of the task, and Song and colleagues in their own implementation (Song et al., 2016). The network has two ‘sensory’ inputs (representing the two stimulus modalities of motion and color) and two ‘context’ inputs (which specify which of the two modalities must be attended to). The sensory inputs are noisy timeseries, centered on a specific mean which indicates the ‘value’ of this input for this trial. More precisely, each of the two sensory inputs is sampled at each time step from a Gaussian variable with variance 1 and a mean, or bias, randomly set at either −0.5 or 0.5 for each trial (this is for the learning phase; for the testing phase used to generate the psychometric curves in Figure 4, the bias is varied in increments of 0.1 from −0.5 to 0.5, inclusive). The mean/bias of the Gaussian (positive or negative) represents the ‘direction’ or ‘value’ of the corresponding sensory input (left vs. right, or red vs. green). The context inputs are set to 1 and 0, or 0 and 1, to indicate the relevant modality for this trial. The goal of the network is to determine whether the sensory input in the relevant modality has positive or negative mean.
Sensory inputs are presented for the first 500 ms of the trial, followed by a 200 ms response period during which all sensory inputs are set to 0. The expected response of the network is 1 if the relevant sensory input has a positive mean, and −1 otherwise; thus the same sensory input can entail different appropriate responses depending on the context. As for the previous task, the network’s response for a trial is the firing rate of the arbitrarily chosen output cell, and the error for a trial is the average absolute difference between the firing rate of this output cell and the appropriate response for this trial (either −1 or 1) over the 200 ms response period.
For details of network analysis (Figure 5), see below.
Analysis of network activity
1 Decoding of network information in a delayed nonmatchtosample task
In the delayed nonmatchtosample task, we used a crosstemporal classification analysis (Meyers et al., 2008; Stokes et al., 2013) to investigate how fully trained networks encode information over time (Figure 2). The interpretation of these crosstemporal decoding accuracy matrices is that they tell us not only whether the network encodes a certain taskrelevant variable, but also whether it uses similar representations to encode this variable at different points in time. If we train one such classifier using data at time t in some trials, and then use it to decode population activity from data at the same time t in other trials, then decoding accuracy measures how strongly the network encodes the feature at that time point t. However, when the decoder is trained on data at time t_{learn} and then applied to population activity data at time t_{decode}, the resulting accuracy measures the stability in the network’s ‘neural code’ for this feature across both time points, i.e., how similarly the decoded feature is represented by the network across these time points. If representations are similar across both time points (that is, if the network use similar patterns to represent each possible value of the feature across both time points), then classifiers successfully trained with population activities at time t_{learn} should also produce accurate decoding of population activities at time t_{decode}. By contrast, if the network uses different representations/encoding of task features at these two time points, crosstemporal accuracy should be poor; this should be represented as ‘bottlenecks’ of high accuracy on the crosstemporal decoding plots, whereby information is high along the diagonal (i.e. the feature is indeed encoded by the network at that given time), but awayfromdiagonal (crosstemporal) decoding accuracy is low. This is precisely what we observe in Figure 2.
We follow the maximalcorrelation classifier approach described in Meyers et al. (2008) as closely as possible. Briefly, we want to measure how well a certain taskrelevant feature (identity of first presented stimulus, or identity of second presented stimulus, or final response) can be predicted by observing network activity at time t_{1}, using a classifier trained on network activity at time t_{2}. First, we sample the activation of each neuron, every 10 ms, for each trial. This data is stored in a matrix of 100 rows and 200 columns, indicating the activities (firing rates) of all 200 neurons at each of the 100 sampling times. We first generate 80 trials (20 per possible condition, where ‘condition’ is defined as one of the four possible stimulus combination: AA, AB, BA or BB) with a trained network. The time course of neural activity will differ somewhat between successive trials, even for identical conditions, due to noise. Then we iterate the following procedure. For each of all four possible conditions, we randomly choose half the trials as ‘training’ trials, and the other half as ‘testing’ or ‘decoding’ trials. The training trials corresponding to the same category that we are trying to decode (for example, all stimuli having the same first presented stimulus) are averaged together, pointwise, for each neuron and each time point, giving a ‘prototype’ matrix of activation for each neuron at each timepoint under this category. This training data allows us to decode the category of each testing trial, at each point in time, using maximumcorrelation classification, in the following way. We compute the Pearson correlation between each row of each ‘testing’ trial and each row of each ‘prototype’ trial. Each such correlation between row i of a testing trial and row j of a training categoryaverage tells us how much the population activity at time i in the testing trial resembles the average population activity at time j for this particular category. We can then select the category for which this correlation is maximal, at each training/testing timepoint pair, as the ‘decoded’ category for each testing trial. For each testing trial, this provides a 100 × 100 matrix of decoded categories (one for each pair of training and testing timepoints). Of course, each testing trial belongs to only one category, so only one possible answer is correct, and thus we can compute another 100 × 100 matrix of binary values, indicating whether the decoded category at a given point in the decoding matrix (i.e., for any given pair of training and testing timepoints) is correct. The average of these ‘correctness matrices’, over all testing trials, provides the accuracy in crosstemporal decoding of this category for every training/testing pair of timepoints. We iterate this whole procedure 100 times and average together the resulting ‘correctness’ matrices. The resulting 100 × 100 matrix indicates at each row i and column j the proportion of times that the decoded category for population activity at timepoint j was correct, using training data from timepoint i. This is the matrix shown in each of the panels in Figure 2 (one for each of the three categories to be decoded).
2 Orthogonal decoding of network information during a selective integration task
For the selective integration task, we used the analysis method introduced by Mante, Sussillo and colleagues (Mante et al., 2013), and also used by Song and colleagues (Song et al., 2016) (see Figure 5). Intuitively, the purpose of this method is to estimate how much information the network encodes about different task feature (input value, context, final choice, etc.) independently from each other.
After generating multiple trials under various conditions (context  that is, relevant modality  and bias for each modality) with a fully trained network, we regress the activity of each neuron over the values of features of interest (context, value of each modality, and final choice) for each trial. This gives us a set of weights for each neuron, one for each feature, representing how much each feature influences the neuron’s firing rate. We then ‘switch views’ by grouping together all such weights for any given feature (200 weights  one per neuron). This in turn produces vectors in neuron population space, along which the feature is in a sense maximally represented (notice that this is quite different from, and not equivalent to, the simpler idea of simply regressing each feature over the firing rates of the neurons across trials). We then orthogonalize these vectors using QR decomposition, to ensure that these representations are as independent from each other as possible. Projecting population activity at a given time over the resulting vectors approximates the network’s current estimate of the corresponding feature at that time. For successive time slices, we average network activity vectors corresponding to the same value of bias in a certain modality, a certain attended modality, and a certain final choice. We refer the reader to Mante et al. (2013) for a complete description of the method.
We project population activity, averaged within various groups of trials, at each point in time, over these decoding axes. The trials are grouped according to final choice, value of one modality (either modality 1 or modality 2), and current context (i.e., relevant modality), and the population activity at each point in time is averaged across all trials within each group. When the resulting averages are projected over the orthogonal feature vectors, they produce trajectories, indicating the network’s encoded value for each feature, at each point in time, for trials of this group. Only correct trials are used, and thus certain combinations are impossible (for example, positive value of modality 1 bias, while attending modality 1, with a final choice of −1); this is reflected in the topleft and bottomright panels of Figure 6, which contain half as many trajectories as the topright and bottomleft panels.
References

1
From fixed points to Chaos: three models of delayed discriminationProgress in Neurobiology 103:214–222.https://doi.org/10.1016/j.pneurobio.2013.02.002

2
Neuronal population coding of parametric working memoryJournal of Neuroscience 30:9424–9430.https://doi.org/10.1523/JNEUROSCI.187510.2010

3
Statedependent computations: spatiotemporal processing in cortical networksNature Reviews Neuroscience 10:113–125.https://doi.org/10.1038/nrn2558
 4

5
Micro, Meso and MacroDynamics of the BrainDecoding the Dynamics of Conscious Perception: The Temporal Generalization Method  Springer, Micro, Meso and MacroDynamics of the Brain, Springer.

6
Model of birdsong learning based on gradient estimation by dynamic perturbation of neural conductancesJournal of Neurophysiology 98:2038–2057.https://doi.org/10.1152/jn.01311.2006

7
Gradient learning in spiking neural networks by dynamic perturbation of conductancesPhysical Review Letters 97:048104.https://doi.org/10.1103/PhysRevLett.97.048104
 8

9
Functional requirements for rewardmodulated spiketimingdependent plasticityJournal of Neuroscience 30:13326–13337.https://doi.org/10.1523/JNEUROSCI.624909.2010
 10
 11

12
A model of the upper extremity for simulating musculoskeletal surgery and analyzing neuromuscular controlAnnals of Biomedical Engineering 33:829–840.https://doi.org/10.1007/s1043900533207
 13

14
The ‘echo State’ Approach to Analysing and Training Recurrent Neural Networks – with an Erratum note1 GMD 148German National Research Center for Information Technology.

15
Heterogenous population coding of a shortterm memory and decision taskJournal of Neuroscience 30:916–929.https://doi.org/10.1523/JNEUROSCI.206209.2010

16
Reinforcement learning in robotics: a surveyThe International Journal of Robotics Research 32:1238–1274.https://doi.org/10.1177/0278364913495721

17
Robust timing and motor patterns by taming Chaos in recurrent neural networksNature Neuroscience 16:925–933.https://doi.org/10.1038/nn.3405
 18
 19
 20
 21

22
Dynamic population coding of category information in inferior temporal and prefrontal cortexJournal of Neurophysiology 100:1407–1419.https://doi.org/10.1152/jn.90248.2008
 23

24
Recurrent models of visual attentionIn: Z Ghahramani, M Welling, C Cortes, N. D Lawrence, K. Q Weinberger, editors. Advances in Neural Information Processing Systems 27 (NIPS 2014). Curran Associates, Inc. pp. 2204–2212.

25
Gradient calculations for dynamic recurrent neural networks: a surveyIEEE Transactions on Neural Networks 6:1212–1228.https://doi.org/10.1109/72.410363

26
Reinforcement learning of motor skills with policy gradientsNeural Networks 21:682–697.https://doi.org/10.1016/j.neunet.2008.02.003
 27

28
A categoryfree neural population supports evolving demands during decisionmakingNature Neuroscience 17:1784–1792.https://doi.org/10.1038/nn.3865
 29

30
Benchmarking of dynamic simulation predictions in two software platforms using an upper limb musculoskeletal modelComputer Methods in Biomechanics and Biomedical Engineering 18:1445–1458.https://doi.org/10.1080/10255842.2014.916698

31
Solving the distal reward problem with rare correlationsNeural Computation 25:940–978.https://doi.org/10.1162/NECO_a_00419

32
Chaos in random neural networksPhysical Review Letters 61:259–262.https://doi.org/10.1103/PhysRevLett.61.259
 33

34
Codespecific policy gradient rules for spiking neuronsAdvances in Neural Information Processing Systems(NIPS 2009).
 35

36
'Activitysilent' working memory in prefrontal cortex: a dynamic coding frameworkTrends in Cognitive Sciences 19:394–405.https://doi.org/10.1016/j.tics.2015.05.004
 37

38
A neural network that finds a naturalistic solution for the production of muscle activityNature Neuroscience 18:1025–1033.https://doi.org/10.1038/nn.4042

39
Adjustment of muscle mechanics model parameters to simulate dynamic contractions in older adultsJournal of Biomechanical Engineering 125:70–77.https://doi.org/10.1115/1.1531112
 40
Decision letter

Michael J FrankReviewing Editor; Brown University, United States
In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.
Thank you for submitting your article "Biologically plausible learning in recurrent neural networks reproduces neural dynamics observed during cognitive tasks" for consideration by eLife. Your article has been reviewed by two peer reviewers, and the evaluation has been overseen by Michael Frank as Reviewing Editor and Timothy Behrens as the Senior Editor. The reviewers have opted to remain anonymous.
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
Summary:
The manuscript is a theoretical/computational study of rewardbased learning in recurrent networks of (nonspiking) neurons. The author compares the simulation results to previously published analyses of experimental data. The author describes an improved rewardbased learning rule which is more biologically plausible than other supervised learning rules that have previously been used for similar purposes. The author applies the rule to a number of relevant tasks. It is shown that a network equipped with the rule can learn these tasks surprisingly well. Further, it is demonstrated that the resulting network dynamics is consistent with experimental data.
Essential revisions:
1) Both Reviewers were enthusiastic about the contribution, which they agreed was the learning rule itself, rather than its ability to produce dynamics, but both expressed concerns that it wasn't sufficiently clear why the rule works. This point was underscored in the consultation session in which it was noted that without further analysis, there is a strong risk that either i) the learning rule will be found to break down for more complex tasks and generality will be lost, or ii) someone else will soon publish a detailed theoretical study of this learning rule and greatly shadow this contribution. Thus the most important point for revision is to focus efforts on quantifying the intuitions given in the supplementary material.
Here are more specific comments and suggestions in this regard:
On the one hand, it is very nice to see a simple, plausible learning rule (in a fairly wellwritten paper) learn nontrivial computations in recurrent networks where we know the gradients of the objective function are hopelessly nonlocal. I see this paper as a much welcome addition to the upcoming literature on reverseengineering brain computations by training RNNs. On the other hand, the author seems to spend comparatively too much effort on "extra goodies", or what I would see as mainly "asides" (e.g. studying the activity of those networks after training), and not enough in trying to understand the reasons why this learning rule works so well (again, I see the learning rule itself as the main contribution of this paper  will other reviewers agree?). Dissecting the inner workings of the proposed learning rule quantitatively would considerably strengthen the paper, and would also suggest situations where one would expect it to fail (in a way, the paper could have been more "selfcritical"). In its current formulation, the learning rule looks like a hack  so, as for any hack, one has little reason to believe it is generally applicable. One could still argue that the range of tasks studied here is wide enough to suggest this rule will robustly apply to other scenarios, but I would be more convinced by some more thorough theoretical analysis. The supplementary material does give some reasonable intuitions (especially regarding the nonlinearity in the rule), but they remain handwavy and at the very least should be included in the main text, since it is central to the paper.
Here are a couple of suggestions:
i) Substantiate the relatively vague intuitions given in the supplementary material, by numerically and/or analytically quantifying the most important relationships stated there ("x is greater than y, while z is neglible etc.").
ii) To dissect the role of the supralinear amplification of cofluctuations, I would have started by much simpler RNN problems, such as simple nonlinear regression in a multilayer, feedforward network  this would get rid of all the subtleties related to recurrent processing, shortterm memory etc., which I found the supplementary material does not handle very well. How do the updates prescribed by the proposed learning rule compare with the exact gradient of the reward function? How does that depend on supralinear amplification?
iii) Somewhat more generally  we know some of the reasons why it is hard to train recurrent neural networks (without extra hacky components like LSTM etc.): vanishing/exploding gradients, very illconditioned saddle points, etc. How does the proposed learning rule overcome these difficulties? (and in fact, does it?  are the tasks studied here representative of more complex cognitive tasks with respect to these generic difficulties in training RNNs?).
2) Figure quality could be improved. Sub panels with labels (A, B, etc.) should be established and referred to. Figure 3 is ugly.
3) The description of Figure 5 in the legend and in the text is hard to follow. For example, in Figure 5, the meaning of colors should be explained.
4) Compared to experiments, the delay of 200 ms in task 1 is very short. Does it also work with more realistic delays on the order of seconds? Relatedly, where the learning rule might break down, it may be nice to address explicitly in discussion whether this rule is dependent on a fixed interval between task relevant events. One central problem for RL rules that depend on eligibility traces with fixed time constants is that without other mechanisms, these may have difficulty in tasks that have distractors, variable timing, or number of intervening events between task relevant stimuli (see for example O'Reilly & Frank 2006 Neural Computation).
5) Although it is common to use ratebased networks in such models, the model used contains a number of coarse approximations to biological reality.i) Ratebased neurons are used instead of spiking models.ii) The tanhactivation function allows for negative neuron outputs. What happens for nonnegative outputs (e.g. a logsig)?iii) Synaptic weights from one presynaptic neuron can be positive or negative (violating Dale's law) and presumably even change the sign during learning. What if more realistic constraints are included?
It would be nice if one could have at least one control simulation where ii) and iii) are considered and a brief discussion. This would strengthen the main claims.
6) It would be interesting to discuss how the learning rule compares to the supervised approach considered in other works? How much does one lose in terms of performance or convergence speed?
7) The eligibility trace is typically decaying. According to Equation 3 it is not in this model. Why, and is it important?
[Editors' note: further revisions were requested prior to acceptance, as described below.]
Thank you for resubmitting your work entitled "Biologically plausible learning in recurrent neural networks for cognitive tasks" for further consideration at eLife. Your revised article has been favorably evaluated by Timothy Behrens as Senior editor), Michael Frank as Reviewing editor, and two reviewers.
The manuscript has been improved but there are some remaining issues that need to be addressed before acceptance, as outlined below:
The revised paper is clearer in how it exposes the learning rule, but reviewers differed as to whether this was satisfactory or not. Please see concerns below, if you can respond directly to them and expand the manuscript to address them the paper will be improved.
Reviewer #1:
I think this revision only partially addresses my original concern. The supplementary material has merely been rebranded "appendix", with a few (nice) additions, where I was expecting a convincing theoretical analysis of the learning rule as a firstclass citizen in the main text. Why does this learning rule work? Why does learning success rely on moments of higher order than the usual Hebbian covariances? I am not sure I understand it better now after reading this revision.
The comparison to the exact gradient in Figure 9 is a nice starting point, but there is potential to squeeze more out of this toy example to explore the role of the supralinear gain function more systematically; e.g. how does the alignment between proposed rule updates and true gradient depend on the exact type of superlinear function used? (e.g. on the exponent n>1 if x → x is used).
The addition of Figure 8 and the explanation of the relaxation effect is not entirely convincing to me. In paragraph two of subsection “Removing the need for continuous reward signals with supralinear amplification”, all the author is saying here is that the average of $x\overline{x}$ is zero  sure  but this is not what the eligibility trace is accumulating: it's estimating a covariance. The argument works OK when the input is constant, but in recurrent nets it will be anything but constant. In this respect, I am not sure that the example of Figure 9 (which uses constant inputs) is representative. Can the author comment?
I still find the argument appealing though, if it can be made more precise/quantitative; in particular, does this argument inform us of what "a good time constant" would be for the running average of x  e.g. that it should be larger than the typical autocorrelation time of the input fluctuations? Could another fix be to remove a running average of r_{j} from the input r_{j} too, so that the pre and post "relaxation effects" cancel?
Reviewer #2:
The manuscript has significantly been improved in this revision. In particular the learning rule has been analyzed more thoroughly, which provides important insights to the general learning problem addressed in this paper. The author has also added new experiments that show that the rule also works under some more challenging setups. The discussion of the learning rule is now in the Appendix, which is not optimal but acceptable. I think no additional work is necessary at this point.
https://doi.org/10.7554/eLife.20899.011Author response
Essential revisions:
1) Both Reviewers were enthusiastic about the contribution, which they agreed was the learning rule itself, rather than its ability to produce dynamics, but both expressed concerns that it wasn't sufficiently clear why the rule works. This point was underscored in the consultation session in which it was noted that without further analysis, there is a strong risk that either i) the learning rule will be found to break down for more complex tasks and generality will be lost, or ii) someone else will soon publish a detailed theoretical study of this learning rule and greatly shadow this contribution. Thus the most important point for revision is to focus efforts on quantifying the intuitions given in the supplementary material.
Here are more specific comments and suggestions in this regard:
On the one hand, it is very nice to see a simple, plausible learning rule (in a fairly wellwritten paper) learn nontrivial computations in recurrent networks where we know the gradients of the objective function are hopelessly nonlocal. I see this paper as a much welcome addition to the upcoming literature on reverseengineering brain computations by training RNNs. On the other hand, the author seems to spend comparatively too much effort on "extra goodies", or what I would see as mainly "asides" (e.g. studying the activity of those networks after training), and not enough in trying to understand the reasons why this learning rule works so well (again, I see the learning rule itself as the main contribution of this paper  will other reviewers agree?).
We agree that the rule requires better description and explanation, and we have implemented the reviewers’ recommendations in that regard, both by improving the discussion and adding quantitative experiments (see below). On the other hand, clearly different readers will be interested in different aspects. In our opinion, the fact that the networks can not only solve those (admittedly simple) tasks, but also reproduce some of the puzzling dynamics of real cortical networks, is interesting in itself and warrants discussion. We have certainly seen some interest in these results from informal communications and presentations in poster form at conferences.
We did change the title. The previous title was “Biologically plausible learning in recurrent neural networks reproduces neural dynamics observed during cognitive tasks”. The new title is “Biologically plausible learning in recurrent neural networks for cognitive tasks”, which concentrates on the important point (as pointed out by the reviewer) and is less potentially contentious.
Dissecting the inner workings of the proposed learning rule quantitatively would considerably strengthen the paper, and would also suggest situations where one would expect it to fail (in a way, the paper could have been more "selfcritical"). In its current formulation, the learning rule looks like a hack  so, as for any hack, one has little reason to believe it is generally applicable. One could still argue that the range of tasks studied here is wide enough to suggest this rule will robustly apply to other scenarios, but I would be more convinced by some more thorough theoretical analysis. The supplementary material does give some reasonable intuitions (especially regarding the nonlinearity in the rule), but they remain handwavy and at the very least should be included in the main text, since it is central to the paper.
We agree with the reviewer. Following the reviewer’s request, we have moved the detailed discussion of the rule into the main text (Appendix), streamlined it, and added quantitative experiments described below.
Here are a couple of suggestions:
i) Substantiate the relatively vague intuitions given in the supplementary material, by numerically and/or analytically quantifying the most important relationships stated there ("x is greater than y, while z is neglible etc.").
ii) To dissect the role of the supralinear amplification of cofluctuations, I would have started by much simpler RNN problems, such as simple nonlinear regression in a multilayer, feedforward network  this would get rid of all the subtleties related to recurrent processing, shortterm memory etc., which I found the supplementary material does not handle very well. How do the updates prescribed by the proposed learning rule compare with the exact gradient of the reward function? How does that depend on supralinear amplification?
We agree with the reviewer’s observations in both of these points. We have tried to implement the reviewer’s recommendations by implementing a simple experiment (single neuron, timeconstant inputs, single perturbation, iterated over many randomly chosen weights and inputs). We use this simple experiment to compare the gradient obtained by the proposed rule with gradients obtained under various other methods, including the gradient obtained by supervised backpropagation as a “ground truth” (Figure 9). This experiment seems to confirm that the supralinear amplification is indeed the crucial element that allows the proposed rule to recover the correct direction of gradients (compare Figure 9B and 9D). We also see additional evidence to support the idea that this is caused by relaxation effects (Figure 9C).
iii) Somewhat more generally  we know some of the reasons why it is hard to train recurrent neural networks (without extra hacky components like LSTM etc.): vanishing/exploding gradients, very illconditioned saddle points, etc. How does the proposed learning rule overcome these difficulties? (and in fact, does it?  are the tasks studied here representative of more complex cognitive tasks with respect to these generic difficulties in training RNNs?).
The tasks here are relatively simple in comparison to the much more complex tasks that supervised RNNs are applied to these days (natural language processing, etc.) At any rate, as explained in the paper, the rule is an implementation of a specific algorithm that is commonly used in machine learning, namely, the REINFORCE algorithm, so we would expect performance to be on par with it – accounting for the specific characteristics of the networks discussed here (persistent chaotic activity, nontrivial time constants, etc.).
2) Figure quality could be improved. Sub panels with labels (A, B, etc.) should be established and referred to. Figure 3 is ugly.
We have added labels to all panels and tried to deuglify Figure 3 (and others!)
3) The description of Figure 5 in the legend and in the text is hard to follow. For example, in Figure 5, the meaning of colors should be explained.
Figure 5 describes a complex, multistep decoding procedure from Mante, Sussillo et al. Nature 2013. We have tried to clarify both the caption and the description in the main text, though the procedure is quite involved. We did add a description for the meaning of colors.
4) Compared to experiments, the delay of 200 ms in task 1 is very short. Does it also work with more realistic delays on the order of seconds? Relatedly, where the learning rule might break down, it may be nice to address explicitly in discussion whether this rule is dependent on a fixed interval between task relevant events. One central problem for RL rules that depend on eligibility traces with fixed time constants is that without other mechanisms, these may have difficulty in tasks that have distractors, variable timing, or number of intervening events between task relevant stimuli (see for example O'Reilly & Frank 2006 Neural Computation).
We agree with the reviewer that this is an important subject. To test this, we performed additional experiments with the delayed nonmatch to sample task. Results are shown in the new Figure 7. We first increased the delay to 1000 ms (Figure 7A). Then we tried variable delays (from 300 to 800ms Figure 7B).
Basically, the rule is fine with longer delays. It does take a hit with variable delays, requiring significant reduction in learning rate to deal with the increased variance; however, it still successfully learns the task – eventually.
5) Although it is common to use ratebased networks in such models, the model used contains a number of coarse approximations to biological reality.i) Ratebased neurons are used instead of spiking models.ii) The tanhactivation function allows for negative neuron outputs. What happens for nonnegative outputs (e.g. a logsig)?iii) Synaptic weights from one presynaptic neuron can be positive or negative (violating Dale's law) and presumably even change the sign during learning. What if more realistic constraints are included?
It would be nice if one could have at least one control simulation where ii) and iii) are considered and a brief discussion. This would strengthen the main claims.
We have actually added an experiment with a network using nonnegative responses and separate E and I neurons (i.e. observing Dale’s law), as shown in Figure 7C. The rule still manages to learn the task with these more realistic networks.
Note that this would not have been possible without the recent work of Mastrogiuseppe and Ostojic (cited in the paper). The whole paper relies on the strong theoretical guarantees developed for the canonical RNN model by Sompolinsky et al. 1993 – in particular, a welldefined regime in which ongoing nonperiodic (chaotic) activity is guaranteed.
These guarantees are less important when using supervised learning, which will fit the weights to reproduce any trajectory (as in HF Song et al. Plos Comp Biol 2016). But they are crucial when using reinforcement learning which assumes that the network can already generate the proper”kind” of trajectories in the first place. Mastrogiuseppe and Ostojic showed that a similar chaotic regime can be found and parametrized for networks with nonnegative responses and separate E and I neurons.
Furthermore, they also show that the network in this regime are nonsaturating (responses don’t reach the maximum value), which may be of interest to the reviewer (see below).
6) It would be interesting to discuss how the learning rule compares to the supervised approach considered in other works? How much does one lose in terms of performance or convergence speed?
Supervised learning is quite a different problem – it assumes that we already have a good trajectory and want to find the weights that reproduce this trajectory. The problem considered here is the more general problem of reinforcement learning, where we don’t know what the good trajectory is
we only have a trajectory evaluator that tells us whether a given trajectory is “better” or “worse”, without telling us what part of the trajectory makes it so. Supervised learning is of course faster and more reliable than reinforcement learning if you can apply it – that is, if you already know the correct trajectory.
7) The eligibility trace is typically decaying. According to Equation 3 it is not in this model. Why, and is it important?
As discussed in the Introduction and Appendix, the rule proposed here inherits much from Fiete and Seung 2006, Legenstein et al. 2010 and Hoerzer et al. 2014. All of them use a nondecaying eligibility trace. Izhikevich 2007 used a time constant of 1 sec, which should have very little effect in the present paper where all trials last 1s or less.
Just to be sure, we tried the DNMS task with a decaying eligibility trace with a 1second time constant, and as expected it worked very similarly.
[Editors' note: further revisions were requested prior to acceptance, as described below.]
The revised paper is clearer in how it exposes the learning rule, but reviewers differed as to whether this was satisfactory or not. Please see concerns below, if you can respond directly to them and expand the manuscript to address them the paper will be improved.
Reviewer #1:
I think this revision only partially addresses my original concern. The supplementary material has merely been rebranded "appendix", with a few (nice) additions, where I was expecting a convincing theoretical analysis of the learning rule as a firstclass citizen in the main text. Why does this learning rule work? Why does learning success rely on moments of higher order than the usual Hebbian covariances? I am not sure I understand it better now after reading this revision.
We have moved the analysis of the learning rule into an “Analysis” section just after the Results section. We have also tried to clarify and expand the main points of this analysis, and updated the experiments in this section with a more complex, representative setting (see below).
The comparison to the exact gradient in Figure 9 is a nice starting point, but there is potential to squeeze more out of this toy example to explore the role of the supralinear gain function more systematically; e.g. how does the alignment between proposed rule updates and true gradient depend on the exact type of superlinear function used? (e.g. on the exponent n>1 if x → x^{n} is used).
We agree with the reviewer. We have added two more examples: one with a different supralinear function, namely “signpreserving square” (x*x) which was already mentioned in the methods as satisfactory, and one with a sublinear function (square root). As expected, the former shows very similar results to the main rule, while the latter fails to recover correct gradients.
The addition of Figure 8 and the explanation of the relaxation effect is not entirely convincing to me. In paragraph two of subsection “Removing the need for continuous reward signals with supralinear amplification”: all the author is saying here is that the average of $x\overline{x}$ is zero  sure  but this is not what the eligibility trace is accumulating: it's estimating a covariance. The argument works OK when the input is constant, but in recurrent nets it will be anything but constant.
In this very simple example, the total eligibility trace, being the accumulated product of inputs by output fluctuations, will be zero because the inputs are constant, as the reviewer notes below.
If the inputs are not constant, the eligibility trace will include both the product of the perturbation effect by the inputs at the time of the perturbation (which is what we want), AND the product of immediately subsequent inputs by the relaxation terms (which is absolutely not what we want). The latter part adds “garbage” components to the eligibility trace. Furthermore, if there is any temporal correlation in successive inputs, then these “garbage” components will tend to partially cancel out the perturbationrelated terms (because they will be the product of similar inputs by an output relaxation term that has the opposite sign as the output perturbation term). This, we suggest, causes the inability of the EH rule to learn without continuous, realtime reward signal (experiments in the new Figure 9 are meant to confirm this interpretation).
We have expanded and clarified our explanation of this point, which is crucial for our discussion.
We are unsure whether the eligibility trace can be said to “compute a covariance”, either formally or intuitively. Note that the inputs are not centered or detrended in any way, and neither should they be.
Intuitively, we think of the eligibility trace as a gradient: the quantity that should be added to the weights in order to replicate the encountered perturbations (or at least, move outputs in the same direction as the perturbations) when the same inputs are presented again in the future. It serves this role, formally, in the original REINFORCE algorithm (See Section 4, especially Equation 9 in William 1992).
In this respect, I am not sure that the example of Figure 9 (which uses constant inputs) is representative. Can the author comment?
We agree with the reviewer that the simple feedforward network with constant inputs shown in the previous version of Figure 9 may not be a sufficient test of the various methods. Therefore, we have replaced this with a more complex experiment using a fully recurrent network, similar to the one used in the other experiments of the paper. In these settings, the inputs (which are the activities of the neurons in the network) are constantly varying. We show that the computed gradients are very similar and produce the expected results.
Note that we have also added a simulation of the full ExploratoryHebbian rule, confirming that it does recover the correct gradients when allowed to use of a realtime, continuous reward signal (Figure 9D).
I still find the argument appealing though, if it can be made more precise/quantitative; in particular, does this argument inform us of what "a good time constant" would be for the running average of x  e.g. that it should be larger than the typical autocorrelation time of the input fluctuations?
A long running average on the activation would start extracting slow,
nonperturbationrelated fluctuations to a significant degree, which would defeat the point. This is in addition to the necessary temporal smearing. Our early experiments suggested that shorter time constants worked best.
Could another fix be to remove a running average of r_{j} from the input r_{j} too, so that the pre and post "relaxation effects" cancel?
Again, we stress that in our rule (as in nodeperturbation or REINFORCE), we must include the full inputs r_{j}, not their fluctuations. The inputs are used “as is”, which is exactly how it should be. If we used input fluctuations rather than raw inputs, we would add the wrong quantity to the weights.
To take a very simple example, imagine that a positive perturbation is applied while the input at the synapse is high and positive, but declining. If we want to reproduce the effect of the perturbation next time this input is shown, we should add a positive quantity to the weight. This is exactly what nodeperturbation, REINFORCE, and our rule suggest – but using input fluctuations rather than raw inputs in the eligibility trace would actually force us to add a negative quantity to the weights, which would actually make subsequent responses lower, i.e. opposite to the direction of the perturbation.
https://doi.org/10.7554/eLife.20899.012Article and author information
Author details
Funding
G Harold and Leila Y. Mathers Foundation
 Thomas Miconi
The William and Jane Walsh Charitable Remainder Unitrust
 Thomas Miconi
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank W Einar Gall for useful comments and suggestions. We thank Vishwa Goudar for helpful discussions. We thank H Francis Song for important insight regarding the computation of statespace trajectories in Figure 5. This work was supported by the Neurosciences Research Foundation through funding from The G Harold and Leila YMathers Charitable Foundation and the William and Jane Walsh Charitable Remainder Unitrust, for which we are grateful.
Reviewing Editor
 Michael J Frank, Reviewing Editor, Brown University, United States
Publication history
 Received: August 23, 2016
 Accepted: February 17, 2017
 Accepted Manuscript published: February 23, 2017 (version 1)
 Version of Record published: April 20, 2017 (version 2)
Copyright
© 2017, Miconi et al
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 1,433
 Page views

 418
 Downloads

 1
 Citations
Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.
Comments
Download links
Downloads (link to download the article as PDF)
Download citations (links to download the citations from this article in formats compatible with various reference manager tools)
Open citations (links to open the citations from this article in various online reference manager services)
Further reading

 Computational and Systems Biology
 Developmental Biology and Stem Cells

 Genomics and Evolutionary Biology
 Microbiology and Infectious Disease