Local online learning in recurrent networks with random feedback
Abstract
Recurrent neural networks (RNNs) enable the production and processing of time-dependent signals such as those involved in movement or working memory. Classic gradient-based algorithms for training RNNs have been available for decades, but are inconsistent with biological features of the brain, such as causality and locality. We derive an approximation to gradient-based learning that comports with these constraints by requiring synaptic weight updates to depend only on local information about pre- and postsynaptic activities, in addition to a random feedback projection of the RNN output error. In addition to providing mathematical arguments for the effectiveness of the new learning rule, we show through simulations that it can be used to train an RNN to perform a variety of tasks. Finally, to overcome the difficulty of training over very large numbers of timesteps, we propose an augmented circuit architecture that allows the RNN to concatenate short-duration patterns into longer sequences.
https://doi.org/10.7554/eLife.43299.001
Introduction
Many tasks require computations that unfold over time. To accomplish tasks involving motor control, working memory, or other time-dependent phenomena, neural circuits must learn to produce the correct output at the correct time. Such learning is a difficult computational problem, as it generally involves temporal credit assignment, requiring synaptic weight updates at a particular time to minimize errors not only at the time of learning but also at earlier and later times. The problem is also a very general one, as such learning occurs in numerous brain areas and is thought to underlie many complex cognitive and motor tasks encountered in experiments.
To obtain insight into how the brain might perform challenging time-dependent computations, an increasingly common approach is to train high-dimensional dynamical systems known as recurrent neural networks (RNNs) to perform tasks similar to those performed by circuits of the brain, often with the goal of comparing the RNN with neural data to obtain insight about how the brain solves computational problems (Mante et al., 2013; Carnevale et al., 2015; Sussillo et al., 2015; Remington et al., 2018). While such an approach can lead to useful insights about the neural representations that are formed once a task is learned, it so far cannot address in a satisfying way the process of learning itself, as the standard learning rules for training RNNs suffer from highly nonbiological features such as nonlocality and acausality, as we describe below.
The most straightforward approach to training an RNN to produce a desired output is to define a loss function based on the difference between the RNN output and the target output that we would like it to match, then to update each parameter in the RNN (typically the synaptic weights) by an amount proportional to the gradient of the loss function with respect to that parameter. The most widely used among these algorithms is backpropagation through time (BPTT) (Rumelhart et al., 1985). As its name suggests, BPTT is acausal, requiring that errors in the RNN output be accumulated incrementally from the end of a trial to the beginning in order to update synaptic weights. Real-time recurrent learning (RTRL) (Williams and Zipser, 1989), the other classic gradient-based learning rule, is causal but nonlocal, with the update to a particular synaptic weight in the RNN depending on the full state of the network, a limitation shared by more modern reservoir computing methods (Jaeger and Haas, 2004; Sussillo and Abbott, 2009). Moreover, both BPTT and RTRL require fine tuning in the sense that the feedback weights from the RNN output back to the network must precisely match the readout weights from the RNN to its output. Such precise matching requires a highly particular initial configuration of the synaptic weights, typically with no justification as to how such a configuration might come about in a biologically plausible manner. Further, if the readout weights are modified during training of the RNN, then the feedback weights must also be updated to match them, and it is unclear how this might be done without requiring nonlocal information.
The goal of this work is to derive a learning rule for RNNs that is both causal and local, without requiring fine tuning of the feedback weights. Our results depend crucially on two approximations. First, locality is enforced by dropping the nonlocal part of the loss function gradient, making our learning rule only approximately gradient-based. Second, we replace the finely tuned feedback weights required by gradient-based learning with random feedback weights, inspired by the success of a similar approach in non-recurrent feedforward networks (Lillicrap et al., 2016; Liao et al., 2016). While these two approximations address distinct shortcomings of gradient-based learning and can be made independently (as discussed below in Results), only when both are made together does a learning rule emerge that is fully biologically plausible in the sense of being causal, local, and avoiding fine tuning of feedback weights. In the sections that follow, we show that, even with these approximations, RNNs can be effectively trained to perform a variety of tasks. In the Appendices, we provide supplementary mathematical arguments showing why the algorithm remains effective despite its use of an inexact loss function gradient.
Results
The RFLO learning rule
To begin, we consider an RNN, as shown in Figure 1, in which a time-dependent input vector $\mathbf{x}(t)$ provides input to a recurrently connected hidden layer of $N$ units described by activity vector $\mathbf{h}(t)$, and this activity is read out to form a time-dependent output $\mathbf{y}(t)$. Such a network is defined by the following equations:
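Assuming leaky discrete-time dynamics with time constant $\tau$ and no bias terms (an assumed form, chosen to be consistent with the input current ${u}_{a}(t)$ defined later in this section), Equation (1) can be written as:

$$
\mathbf{h}(t)=\mathbf{h}(t-1)+\frac{1}{\tau}\left[-\mathbf{h}(t-1)+\varphi\left(\mathbf{W}\mathbf{h}(t-1)+{\mathbf{W}}^{\mathrm{in}}\mathbf{x}(t)\right)\right],\qquad \mathbf{y}(t)={\mathbf{W}}^{\mathrm{out}}\mathbf{h}(t). \tag{1}
$$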
For concreteness, we take the nonlinear function appearing in Equation (1) to be $\varphi (\cdot )=\mathrm{tanh}(\cdot )$. The goal is to train this network to produce a target output function ${\mathbf{\mathbf{y}}}^{*}(t)$ given a specified input function $\mathbf{\mathbf{x}}(t)$ and initial activity vector $\mathbf{\mathbf{h}}(0)$. The error is then the difference between the target output and the actual output, and the loss function is the squared error integrated over time:
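With the error defined as target minus output, and assuming a normalization by the trial length $T$ (the factor $1/2T$ is an assumption made for convenience), the error and loss of Equation (2) take the form:

$$
\boldsymbol{\epsilon}(t)={\mathbf{y}}^{*}(t)-\mathbf{y}(t),\qquad L=\frac{1}{2T}\sum_{t=1}^{T}\sum_{i}{\left[{\epsilon}_{i}(t)\right]}^{2}. \tag{2}
$$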
The goal of producing the target output function ${\mathbf{\mathbf{y}}}^{*}(t)$ is equivalent to minimizing this loss function.
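A minimal NumPy sketch of the forward pass and loss may make the setup concrete. It assumes leaky discrete-time dynamics $h(t)=(1-1/\tau)h(t-1)+\varphi(u(t))/\tau$ with $\varphi=\tanh$, no bias terms, and a loss equal to the time-averaged half squared error. The function names `run_rnn` and `loss` are illustrative and are not taken from the original work.

```python
import numpy as np

def run_rnn(W, W_in, W_out, x, h0, tau):
    """Simulate a leaky tanh RNN. x has shape (T, n_in); row t-1 holds x(t)."""
    T = x.shape[0]
    h = np.empty((T + 1, h0.size))
    h[0] = h0
    for t in range(1, T + 1):
        u = W @ h[t - 1] + W_in @ x[t - 1]            # total input current u(t)
        h[t] = (1 - 1 / tau) * h[t - 1] + np.tanh(u) / tau
    y = h[1:] @ W_out.T                                # linear readout y(t)
    return h, y

def loss(y, y_target):
    """Half squared error, summed over outputs and averaged over time."""
    err = y_target - y
    return 0.5 * np.mean(np.sum(err ** 2, axis=1))
```

The hidden state is stored for every timestep so that the same trajectory can be reused by a learning rule.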
In order to minimize the loss function with respect to the recurrent weights, we take the derivative with respect to these weights:
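Assuming a loss of the form $L=\frac{1}{2T}{\sum}_{t}{|\boldsymbol{\epsilon}(t)|}^{2}$ with $\boldsymbol{\epsilon}(t)={\mathbf{y}}^{*}(t)-{\mathbf{W}}^{\mathrm{out}}\mathbf{h}(t)$ (the normalization is an assumption), the chain rule gives Equation (3) as:

$$
\frac{\partial L}{\partial {W}_{ab}}=-\frac{1}{T}\sum_{t=1}^{T}\sum_{j}{\left[{({\mathbf{W}}^{\mathrm{out}})}^{\mathrm{T}}\boldsymbol{\epsilon}(t)\right]}_{j}\frac{\partial {h}_{j}(t)}{\partial {W}_{ab}}. \tag{3}
$$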
Next, using the update Equation (1), we obtain the following recursion relation:
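Under the assumption of leaky discrete-time dynamics, ${h}_{j}(t)=(1-1/\tau ){h}_{j}(t-1)+\varphi ({u}_{j}(t))/\tau$, differentiating with respect to ${W}_{ab}$ gives a recursion of the following form for Equation (4), in which the final sum over $k$ is the nonlocal term:

$$
\frac{\partial {h}_{j}(t)}{\partial {W}_{ab}}=\left(1-\frac{1}{\tau}\right)\frac{\partial {h}_{j}(t-1)}{\partial {W}_{ab}}+\frac{1}{\tau}{\delta}_{ja}{\varphi}^{\prime}({u}_{a}(t)){h}_{b}(t-1)+\frac{1}{\tau}\sum_{k}{\varphi}^{\prime}({u}_{j}(t)){W}_{jk}\frac{\partial {h}_{k}(t-1)}{\partial {W}_{ab}}, \tag{4}
$$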
where ${\delta}_{ja}$ is the Kronecker delta function, ${u}_{a}(t)$ is the input current to unit $a$, and the recursion terminates with $\partial {h}_{j}(0)/\partial {W}_{ab}=0$. This gradient can be updated online at each timestep as the RNN is run, and implementing gradient descent to update the weights using Equation (3), we have $\mathrm{\Delta}{W}_{ab}=-\eta \partial L/\partial {W}_{ab}$, where $\eta $ is a learning rate. This approach, known as RTRL (Williams and Zipser, 1989), is one of the two classic gradient-based algorithms for training RNNs. The same approach can also be used for training the input and output weights of the RNN. The full derivation is presented in Appendix 1. (The other classic gradient-based algorithm, BPTT, involves a different approach for taking partial derivatives but is equivalent to RTRL; its derivation and relation to RTRL are also provided in Appendix 1.)
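The RTRL recursion can be sketched in NumPy and checked against finite differences. The sketch assumes an input-free leaky tanh RNN, $h(t)=(1-1/\tau)h(t-1)+\tanh(\mathbf{W}h(t-1))/\tau$, and a time-averaged half squared error loss. The names `rtrl_gradient` and `loss_of` are illustrative. The tensor `P[j, a, b]` stores $\partial {h}_{j}/\partial {W}_{ab}$ and has ${N}^{3}$ entries, which is why this method is costly.

```python
import numpy as np

def loss_of(W, W_out, y_target, h0, tau):
    """Time-averaged half squared error of the input-free RNN."""
    h, total = h0.copy(), 0.0
    for t in range(y_target.shape[0]):
        h = (1 - 1 / tau) * h + np.tanh(W @ h) / tau
        total += 0.5 * np.sum((y_target[t] - W_out @ h) ** 2)
    return total / y_target.shape[0]

def rtrl_gradient(W, W_out, y_target, h0, tau):
    """Exact dL/dW accumulated online with the RTRL recursion."""
    N, T = W.shape[0], y_target.shape[0]
    idx = np.arange(N)
    h = h0.copy()
    P = np.zeros((N, N, N))                  # P[j, a, b] = dh_j / dW_ab
    grad = np.zeros_like(W)
    for t in range(T):
        u = W @ h
        d = 1 - np.tanh(u) ** 2              # phi'(u) for phi = tanh
        local = np.zeros((N, N, N))
        local[idx, idx, :] = d[:, None] * h[None, :]   # delta_ja phi'(u_a) h_b
        # Nonlocal term: couples each synapse to all other weights through W.
        nonlocal_ = d[:, None, None] * np.einsum('jk,kab->jab', W, P)
        P = (1 - 1 / tau) * P + (local + nonlocal_) / tau
        h = (1 - 1 / tau) * h + np.tanh(u) / tau
        err = y_target[t] - W_out @ h
        grad -= np.einsum('j,jab->ab', W_out.T @ err, P) / T
    return grad
```

Gradient descent then uses `W -= eta * rtrl_gradient(...)`. Dropping `nonlocal_` from the recursion is the locality approximation discussed in the text.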
From a biological perspective, there are two problems with RTRL as a plausible rule for synaptic plasticity. The first problem is that it is nonlocal, with the update to synaptic weight ${W}_{ab}$ depending, through the last term in Equation (4), on every other synaptic weight in the RNN. This information would be inaccessible to a synapse in an actual neural circuit. The second problem is the appearance of ${({\mathbf{\mathbf{W}}}^{\mathrm{out}})}^{\mathrm{T}}$ in Equation (3), which means that the error in the RNN output must be fed back into the network with synaptic weights that are precisely symmetric with the readout weights. It is unclear how the readout and feedback weights could be made to match one another in a neural circuit in the brain.
In order to address these two shortcomings, we make two approximations to the RTRL learning rule. The first approximation consists of dropping a nonlocal term from the gradient, so that computing the update to a given synaptic weight requires only pre- and postsynaptic activities, rather than information about the entire state of the RNN including all of its synaptic weights. Second, as described in more detail below, we project the error back into the network for learning using random feedback weights, rather than feedback weights that are tuned to match the readout weights. These approximations, described more fully in Appendix 1, result in the following weight update equations:
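Written so that the readout update comes first and the recurrent and input updates second and third (the ordering and the labels ${\eta}_{1},{\eta}_{2},{\eta}_{3}$ are assumptions), the update equations referred to as Equation (5) take the form:

$$
\begin{aligned}
\delta {W}_{ab}^{\mathrm{out}}(t)&={\eta}_{1}{\epsilon}_{a}(t){h}_{b}(t),\\
\delta {W}_{ab}(t)&={\eta}_{2}{\left[\mathbf{B}\boldsymbol{\epsilon}(t)\right]}_{a}{p}_{ab}(t),\\
\delta {W}_{ab}^{\mathrm{in}}(t)&={\eta}_{3}{\left[\mathbf{B}\boldsymbol{\epsilon}(t)\right]}_{a}{q}_{ab}(t),
\end{aligned} \tag{5}
$$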
where ${\eta}_{\alpha}$ are learning rates, and $\mathbf{\mathbf{B}}$ is a random matrix of feedback weights. Here we have defined
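Assuming leaky dynamics with time constant $\tau$ and initial conditions ${p}_{ab}(0)={q}_{ab}(0)=0$, and using the time indices of the input current ${u}_{a}(t)$ (so that the input trace pairs ${\varphi}^{\prime}({u}_{a}(t))$ with ${x}_{b}(t)$, an assumption), the running traces of Equation (6) are:

$$
\begin{aligned}
{p}_{ab}(t)&=\frac{1}{\tau}{\varphi}^{\prime}({u}_{a}(t)){h}_{b}(t-1)+\left(1-\frac{1}{\tau}\right){p}_{ab}(t-1),\\
{q}_{ab}(t)&=\frac{1}{\tau}{\varphi}^{\prime}({u}_{a}(t)){x}_{b}(t)+\left(1-\frac{1}{\tau}\right){q}_{ab}(t-1),
\end{aligned} \tag{6}
$$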
which are the accumulated products of the presynaptic activity and (the derivative of the) postsynaptic activity at the recurrent and input synapses, respectively. We have also defined ${u}_{a}(t)\equiv {\sum}_{c}{W}_{ac}{h}_{c}(t-1)+{\sum}_{c}{W}_{ac}^{\mathrm{in}}{x}_{c}(t)$ as the total input current to unit $a$. While this form of the update equations does not require explicit integration and hence is more efficient for numerical simulation, it is instructive to take the continuous-time ($\tau \gg 1$) limit of Equation (5) and the integral of Equation (6), which yields
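Unrolling a leaky trace of the form $p(t)=(1-1/\tau )p(t-1)+(\cdot )/\tau$ and approximating ${(1-1/\tau )}^{s}\approx {e}^{-s/\tau}$ for $\tau \gg 1$ gives the following sketch of Equation (7), with the readout update listed first (an assumed ordering):

$$
\begin{aligned}
\delta {W}_{ab}^{\mathrm{out}}(t)&={\eta}_{1}{\epsilon}_{a}(t){h}_{b}(t),\\
\delta {W}_{ab}(t)&={\eta}_{2}{\left[\mathbf{B}\boldsymbol{\epsilon}(t)\right]}_{a}{\int}_{0}^{t}\frac{d{t}^{\prime}}{\tau}{e}^{-{t}^{\prime}/\tau}{\varphi}^{\prime}({u}_{a}(t-{t}^{\prime})){h}_{b}(t-{t}^{\prime}),\\
\delta {W}_{ab}^{\mathrm{in}}(t)&={\eta}_{3}{\left[\mathbf{B}\boldsymbol{\epsilon}(t)\right]}_{a}{\int}_{0}^{t}\frac{d{t}^{\prime}}{\tau}{e}^{-{t}^{\prime}/\tau}{\varphi}^{\prime}({u}_{a}(t-{t}^{\prime})){x}_{b}(t-{t}^{\prime}).
\end{aligned} \tag{7}
$$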
In this way, it becomes clear that the integrals in the second and third equations are eligibility traces that accumulate the correlations between pre- and postsynaptic activity over a time window of duration $\sim \tau $. The weight update is then proportional to this eligibility trace, multiplied by a feedback projection of the readout error. The fact that the timescale for the eligibility trace matches the RNN time constant $\tau $ reflects the fact that the RNN dynamics are typically correlated only up to this timescale, so that the error is associated only with RNN activity up to time $\tau $ in the past. If the error feedback were delayed rather than provided instantaneously, then eligibility traces with longer timescales might be beneficial (Gerstner et al., 2018).
Three features of the above learning rules are especially important. First, the updates are local, requiring information about the presynaptic activity and the postsynaptic input current, but no information about synaptic weights and activity levels elsewhere in the network. Second, the updates are online and can either be made at each timestep or accumulated over many timesteps and made at the end of each trial or of several trials. In either case, unlike the BPTT algorithm, it is not necessary to run the dynamics backward in time at the end of each trial to compute the weight updates. Third, the readout error is projected back to each unit in the network with weights $\mathbf{\mathbf{B}}$ that are fixed and random. An exact gradient of the loss function, on the other hand, would lead to ${({\mathbf{\mathbf{W}}}^{\mathrm{out}})}^{\mathrm{T}}$, where ${(\cdot )}^{\mathrm{T}}$ denotes matrix transpose, appearing in the place of $\mathbf{\mathbf{B}}$. As described above, the use of random feedback weights is inspired by a similar approach in feedforward networks (Lillicrap et al., 2016; see also Nøkland, 2016, as well as a recent implementation in feedforward spiking networks [Samadi et al., 2017]), and we shall show below that the same feedback alignment mechanism that is responsible for the success of the feedforward version is also at work in our recurrent version. (While an RNN is often described as being ‘unrolled in time’, so that it becomes a feedforward network in which each layer corresponds to one timestep, it is important to note that the unrolled version of the problem that we consider here is not identical to the feedforward case considered in Lillicrap et al. (2016) and Nøkland, 2016. In the RNN, a readout error is defined at every ‘layer’ $t$, whereas in the feedforward case, the error is defined only at the last layer ($t=T$) and is fed back to update weights in all preceding layers.)
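The three features above can be seen in a short NumPy sketch of one training trial. It assumes leaky tanh dynamics with time constant `tau`, no bias terms, updates applied at every timestep, and separate learning rates for the three weight matrices. The function name `rflo_trial` and its argument names are illustrative and not taken from the original work.

```python
import numpy as np

def rflo_trial(W, W_in, W_out, B, x, y_target, h0, tau,
               eta_out, eta_rec, eta_in):
    """Run one trial with online, local updates; B is fixed and random."""
    W, W_in, W_out = W.copy(), W_in.copy(), W_out.copy()
    h = h0.copy()
    p = np.zeros_like(W)        # eligibility trace at recurrent synapses
    q = np.zeros_like(W_in)     # eligibility trace at input synapses
    total = 0.0
    for t in range(x.shape[0]):
        u = W @ h + W_in @ x[t]                 # postsynaptic input current
        d = 1 - np.tanh(u) ** 2                 # phi'(u_a)
        # Local: each trace entry uses only its own pre/post quantities.
        p = (1 - 1 / tau) * p + np.outer(d, h) / tau
        q = (1 - 1 / tau) * q + np.outer(d, x[t]) / tau
        h = (1 - 1 / tau) * h + np.tanh(u) / tau
        err = y_target[t] - W_out @ h           # readout error
        fb = B @ err                            # random feedback projection
        W_out += eta_out * np.outer(err, h)     # exact gradient for readout
        W += eta_rec * fb[:, None] * p
        W_in += eta_in * fb[:, None] * q
        total += 0.5 * np.sum(err ** 2)
    return W, W_in, W_out, total / x.shape[0]
```

Replacing `B` with `W_out.T` inside the loop would restore symmetric feedback while keeping the local approximation. No backward pass over the stored trajectory is needed in either case.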
With the above observations in mind, we refer to the above learning rule as random feedback local online (RFLO) learning. In Appendix 1, we provide a full derivation of the learning rule, and describe in detail its relation to the other gradient-based methods mentioned above, BPTT and RTRL. It should be noted that the approximations applied above to the RTRL algorithm are distinct from recent approximations made in the machine learning literature (Tallec and Ollivier, 2018; Mujika et al., 2018), where the goal was to decrease the computational cost of RTRL, rather than to increase its biological plausibility.
Because the RFLO learning rule uses an approximation of the loss function gradient rather than the exact gradient for updating the synaptic weights, a natural question to ask is whether it can be expected to decrease the loss function at all. In Appendix 2 we show that, under certain simplifying assumptions including linearization of the RNN, the loss function does indeed decrease on average with each step of RFLO learning. In particular, we show that, as in the feedforward case (Lillicrap et al., 2016), reduction of the loss function requires alignment between the learned readout weights ${\mathbf{\mathbf{W}}}^{\mathrm{out}}$ and the fixed feedback weights $\mathbf{\mathbf{B}}$. We then proceed to show that this alignment tends to increase during training due to coordinated learning of the recurrent weights $\mathbf{\mathbf{W}}$ and readout weights ${\mathbf{\mathbf{W}}}^{\mathrm{out}}$. The mathematical approach for showing that alignment between readout and feedback weights occurs is similar to that used previously in the feedforward case (Lillicrap et al., 2016). In particular, the network was made fully linear in both cases in order to make mathematical headway possible, and a statistical average over inputs (in the feedforward case) or the activity vector (for the RNN) was performed. However, because a feedforward network retains no state information from one timestep to the next and because the network architectures are distinct (even if one thinks about an RNN as a feedforward network ‘unrolled in time’), the results in Appendix 2 are not simply a straightforward generalization of the feedforward case.
A number of simplifying assumptions have been made in the mathematical derivations of Appendix 2, including linear dynamics, uncorrelated neurons, and random synaptic weights, none of which will necessarily hold in a nonlinear network trained to perform a dynamical computation. Hence, although such mathematical arguments provide reason to hope that RFLO learning might be successful and insight into the mechanism by which learning occurs, it remains to be shown that RFLO learning can be used to successfully train a nonlinear RNN in practice. In the following section, therefore, we show using simulated examples that RFLO learning can perform well on a variety of tasks.
Performance of RFLO learning
In this section we illustrate the performance of the RFLO learning algorithm on a number of simulated tasks. These tasks require an RNN to produce sequences of output values and/or delayed responses to an input to the RNN, and hence are beyond the capabilities of feedforward networks. As a benchmark, we compare the performance of RFLO learning with BPTT, the standard algorithm for training RNNs. (As described in Appendix 1, the weight updates in RTRL are, when performed in batches at the end of each trial, completely equivalent to those in BPTT. Hence, in what follows, we compare RFLO learning with BPTT only.)
Autonomous production of continuous outputs
Figure 2 illustrates the performance of an RNN trained with RFLO learning to produce a one-dimensional periodic output given no external input. Figure 2a shows the decrease of the loss function (the mean squared error of the RNN output) as the RNN is trained over many trials, where each trial corresponds to one period consisting of $T$ timesteps, as well as the performance of the RNN at the end of training. As a benchmark for comparison with the RFLO learning rule, BPTT was also used to train the RNN. In addition, we show in Figure 2—figure supplement 1 that a variant of RFLO learning in which all outbound synapses from a given unit were constrained to be of the same sign (a biological constraint known as Dale’s law; Dale, 1935) also yields effective learning. (A similar result, in this case using nonlocal learning rules, was recently obtained in other modeling work [Song et al., 2016].)
Figure 2b shows that, when the number of timesteps in the target output is not too great, both versions of RFLO learning perform comparably well to BPTT. BPTT shows an advantage, however, when the number of timesteps becomes very large. Intuitively, this difference in performance is due to the accumulation of small errors in the estimated gradient of the loss function over many timesteps with RFLO learning. This is less of a problem for BPTT, in which the exact gradient is used.
Figure 2c shows the increase in the alignment between the vector of readout weights ${\mathbf{\mathbf{W}}}^{\mathrm{out}}$ and the vector of feedback weights $\mathbf{\mathbf{B}}$ during training with RFLO learning. As in the case of feedforward networks (Lillicrap et al., 2016; Nøkland, 2016), the readout weights evolve over time to become increasingly similar to the feedback weights, which are fixed during training. In Appendix 2 we provide mathematical arguments for why this alignment occurs, showing that the alignment is not due to the change in ${\mathbf{\mathbf{W}}}^{\mathrm{out}}$ alone, but rather to coordinated changes in the readout and recurrent weights.
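One simple way to quantify such alignment is the cosine of the angle between the flattened readout matrix and the flattened transpose of the feedback matrix. This particular measure is an assumption chosen for illustration, and the exact normalization used for Figure 2c may differ. The function name `alignment` is hypothetical.

```python
import numpy as np

def alignment(W_out, B):
    """Cosine similarity between W_out (n_out x N) and B^T (B is N x n_out)."""
    a = W_out.ravel()
    b = B.T.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

A value near 0 is expected for independent random matrices at the start of training, and a value approaching 1 would indicate that the readout weights have come to resemble the fixed feedback weights.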
In deriving the RFLO learning rule, two independent approximations were made: locality was enforced by dropping the nonlocal term from the loss function gradient, and feedback weights were chosen randomly rather than tuned to match the readout weights. If these approximations are instead made independently, which will have the greater effect on the performance of the RNN? Figure 2d answers this question by comparing RFLO and BPTT with two alternative learning rules: one in which the local approximation is made while symmetric error feedback is maintained, and another in which the nonlocal part of the loss function gradient is retained but the error feedback is random. The results show that the local approximation is essentially fully responsible for the performance difference between RFLO and BPTT, while there is no significant loss in performance due to the random feedback alone.
It is also worthwhile to consider the relative contributions of the two types of learning in Figure 2, namely the learning of recurrent and of readout weights. Given that the learning rule for the readout weights makes use of the exact loss function gradient while that for the recurrent weights does not, it could be that the former are fully responsible for the successful training. In Figure 2—figure supplement 2 we show that this is not the case, and that training of both recurrent and readout weights significantly outperforms training of the readout weights only (with the readout fed back as an input to the RNN for stability; see Materials and methods). Also shown is the performance of an RNN in which recurrent weights but not readout weights are trained. In this case learning is completely unsuccessful. The reason is that, in order for successful credit assignment to take place, there must be some alignment between the readout weights and feedback weights. Such alignment cannot occur, however, if the readout weights are frozen. In the case of a linearized network, the necessity of coordinated learning between the two sets of weights can be shown mathematically, as done in Appendix 2.
As with other RNN training methods, performance of the trained RNN generally improves for larger network sizes (Figure 2—figure supplement 3). While the computational cost of training the RNN increases with RNN size, leading to a tradeoff between fast training and high performance for a given number of training trials, it is worthwhile to note that the cost is much lower than that of RTRL ($\sim {N}^{4}$ operations per timestep) and is on par with BPTT (both $\sim {N}^{2}$ operations per timestep, as shown in Appendix 1).
Interval matching
Figure 3 illustrates the performance of the RFLO algorithm on a ‘Ready Set Go’ task, in which the RNN is required to produce an output pulse after a time delay matching the delay between two preceding input pulses (Jazayeri and Shadlen, 2010). This task is more difficult than the production of a periodic output because the RNN must learn to store the information about the interpulse delay, and then produce responses at different times depending on what the delay was. Figure 3b,c illustrate the testing performance of an RNN trained with either RFLO learning or BPTT. If the RNN is trained and tested on interpulse delays satisfying ${T}_{\mathrm{delay}}\le 15\tau $, the performance is similarly good for the two algorithms. If the RNN is trained and tested with longer ${T}_{\mathrm{delay}}$, however, then BPTT performs better than RFLO learning. As in the case of the periodic output task from Figure 2, RFLO learning performs well for tasks on short and intermediate timescales, but not as well as BPTT for tasks involving longer timescales. In the following subsection, we shall address this shortcoming by constructing a network in which learned subsequence elements of short duration can be concatenated to form longer-duration sequences.
Learning a sequence of actions
In the above examples, it was shown that, while the performance of RFLO learning is comparable to that of BPTT for tasks over short and intermediate timescales, it is less impressive for tasks involving longer timescales. From the perspective of machine learning, this represents a failure of RFLO learning. From the perspective of neuroscience, however, we can adopt a more constructive attitude. The brain, after all, suffers the same limitations that we have imposed in constructing the RFLO learning rule, namely causality and locality, and cannot be performing BPTT for learned movements and working memory tasks over long timescales of seconds or more. So how might recurrent circuits in the brain learn to perform tasks over these long timescales? One possibility is that they use a more sophisticated learning rule than the one that we have constructed. While we cannot rule out this possibility, it is worth keeping in mind that, due to the problem of vanishing or exploding gradients, all gradient-based training methods for RNNs fail eventually at long timescales. Another possibility is that a simple, fully connected recurrent circuit in the brain, like an RNN trained with RFLO learning, can only be trained directly with supervised learning over short timescales, and that a more complex circuit architecture is necessary for longer timescales.
It has long been recognized that long-duration behaviors tend to be sequences composed of short, stereotyped actions concatenated together (Lashley, 1951). Further, a great deal of experimental work suggests that learning of this type involves training of synaptic weights from cortex to striatum (Graybiel, 1998), the input structure of the basal ganglia, which in turn modifies cortical activity via thalamus. In this section we propose a circuit architecture, largely borrowed from Logiaco et al. (2018) and inspired by the subcortical loop involving basal ganglia and thalamus, that allows an RNN to learn and perform sequences of ‘behavioral syllables’.
As illustrated in Figure 4a, the first stage of learning in this scheme involves training an RNN to produce a distinct time-dependent output in response to the activation of each of its tonic inputs. In this case, the RNN output is a two-dimensional vector giving the velocity of a cursor moving in a plane. Once the RNN has been trained in this way, the circuit is augmented with a loop structure, shown schematically in Figure 4b. At one end of the loop, the RNN activity is read out with weights ${\mathbf{W}}^{s}$. At the other end of the loop, this readout is used to control the input to the RNN. The weights ${\mathbf{W}}^{s}$ can be learned such that, at the end of one behavioral syllable, the RNN input driving the next syllable in the sequence is activated by the auxiliary loop. This is done most easily by gating the RNN readout so that it can only drive changes at the end of a syllable.
In this example, each time the end of a syllable is reached, four readout units receive input ${z}_{i}={\sum}_{j=1}^{N}{W}_{ij}^{s}{h}_{j}$, and a winner-take-all rule is applied such that the most active unit activates a corresponding RNN input unit, which drives the RNN to produce the next syllable. Meanwhile, the weights are updated with the reward-modulated Hebbian learning rule $\mathrm{\Delta}{W}_{ij}^{s}={\eta}_{s}R{z}_{i}{h}_{j}$, where $R=1$ if the syllable transition matches the target and $R=0$ otherwise. By training over many trials, the network learns to match the target sequence of syllables. Figure 4c shows the output from an RNN trained in this way to produce a sequence of reaches and holds in a two-dimensional space. Importantly, while the duration of each behavioral syllable in this example ($20\tau $) is relatively short, the full concatenated sequence is long ($160\tau $) and would be very difficult to train directly in an RNN lacking such a loop structure.
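The transition step described above can be sketched in a few lines of NumPy. The sketch implements the winner-take-all choice and the update $\mathrm{\Delta}{W}_{ij}^{s}={\eta}_{s}R{z}_{i}{h}_{j}$ as stated in the text. The function name `syllable_transition`, the integer encoding of syllables, and the `target` argument are assumptions made for illustration.

```python
import numpy as np

def syllable_transition(W_s, h, target, eta_s):
    """Choose the next syllable and apply the reward-modulated Hebbian update."""
    z = W_s @ h                           # input to the readout units
    choice = int(np.argmax(z))            # winner-take-all selection
    R = 1.0 if choice == target else 0.0  # binary reward
    W_s_new = W_s + eta_s * R * np.outer(z, h)
    return choice, W_s_new
```

Here `h` is the RNN activity at the end of the current syllable, and `choice` indexes the tonic input that would be activated to drive the next syllable. Weights change only on rewarded transitions.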
How might the loop architecture illustrated in Figure 4 be instantiated in the brain? For learned motor control, motor cortex likely plays the role of the recurrent circuit controlling movements. In addition to projections to spinal cord for controlling movement directly, motor cortex also projects to striatum, and experimental evidence has suggested that modification of these corticostriatal synapses plays an important role in the learning of action sequences (Jin and Costa, 2010). Via a loop through the basal ganglia output nucleus GPi and motor thalamus, these signals pass back to motor cortex, as illustrated schematically in Figure 4. According to the model, then, behavioral syllables are stored in motor cortex, and the role of striatum is to direct the switching from one syllable to the next. Experimental evidence for both the existence of behavioral syllables and the role played by striatum in switching between syllables on subsecond timescales has been found recently in mice (Wiltschko et al., 2015; Markowitz et al., 2018). How might the weights from motor cortex in this model be gated so that this projection is active at behavioral transitions? It is well known that dopamine, in addition to modulating plasticity at corticostriatal synapses, also modulates the gain of cortical inputs to striatum (Gerfen et al., 2011). Further, it has recently been shown that transient dopamine signals occur at the beginning of each movement in a lever-press sequence in mice (da Silva et al., 2018). Together, these experimental results support a model in which dopamine bursts enable striatum to direct switching between behavioral syllables, thereby allowing for learned behavioral sequences to occur over long timescales by enabling the RNN to control its own input. Within this framework, RFLO learning provides a biologically plausible means by which the behavioral syllables making up these sequences might be learned.
Discussion
In this work we have derived an approximation to gradient-based learning rules for RNNs, yielding a learning rule that is local, online, and does not require fine tuning of feedback weights. We have shown that RFLO learning performs comparably well to BPTT when the duration of the task being trained is not too long, but that it performs less well when the task duration becomes very long. In this case, however, we showed that training can still be effective if the RNN architecture is augmented to enable the concatenation of short-duration outputs into longer output sequences. Further exploring how this augmented architecture might map onto cortical and subcortical circuits in the brain is an interesting direction for future work. Another promising area for future work is the use of layered recurrent architectures, which occur throughout cortex and have been shown to be beneficial in complex machine learning applications spanning long timescales (Pascanu et al., 2014). Finally, machine learning tasks with discrete timesteps and discrete outputs such as text prediction benefit greatly from the use of RNNs with cross-entropy loss functions and softmax output normalization. In general, these lead to additional nonlocal terms in gradient-based learning, and in future work it would be interesting to investigate whether RFLO learning can be adapted and applied to such problems while preserving locality, or whether new ideas are necessary about how such tasks are solved in the brain.
How might RFLO learning be implemented concretely in the brain? As we have discussed above, motor cortex is an example of a recurrent circuit that can be trained to produce a particular time-dependent output. Neurons in motor cortex receive information about planned actions (${\mathbf{y}}^{*}(t)$ in the language of the model) from premotor cortical areas, as well as information about the current state of the body ($\mathbf{y}(t)$) from visual and/or proprioceptive inputs, giving them the information necessary to compute a time-dependent error $\boldsymbol{\epsilon}(t)={\mathbf{y}}^{*}(t)-\mathbf{y}(t)$. Hence it is possible that neurons within motor cortex might use a projection of this error signal to learn to produce a target output trajectory. Such a computation might feature a special role for apical dendrites, as in recently developed theories for learning in feedforward cortical networks (Guerguiev et al., 2017; Sacramento et al., 2017), though further work would be needed to build a detailed theory for its implementation in recurrent cortical circuits.
A possible alternative scenario is that neuromodulators might encode error signals. In particular, midbrain dopamine neurons project to many frontal cortical areas including prefrontal cortex and motor cortex, and their input is known to be necessary for learning certain time-dependent behaviors (Hosp et al., 2011; Li et al., 2017). Further, recent experiments have shown that the signals encoded by dopamine neurons are significantly richer than the reward prediction error that has traditionally been associated with dopamine, and include phasic modulation during movements (Howe and Dombeck, 2016; da Silva et al., 2018; Coddington and Dudman, 2018). This interpretation of dopamine as a continuous online error signal used for supervised learning would be distinct from and complementary to its well-known role as an encoder of reward prediction error for reinforcement learning.
In addition to the gradient-based approaches (RTRL and BPTT) already discussed above, another widely used algorithm for training RNNs is FORCE learning (Sussillo and Abbott, 2009) and its more recent variants (Laje and Buonomano, 2013; DePasquale et al., 2018). The FORCE algorithm, unlike gradient-based approaches, makes use of chaotic fluctuations in RNN activity driven by strong recurrent input. These chaotic fluctuations, which are not necessary in gradient-based approaches, provide a temporally rich set of basis functions that can be summed together with trained readout weights in order to construct a desired time-dependent output. As with gradient-based approaches, however, FORCE learning is nonlocal, in this case because the update to any given readout weight depends not just on the presynaptic activity, but also on the activities of all other units in the network. Although FORCE learning is biologically implausible due to the nonlocality of the learning rule, it is, like RFLO learning, implemented online and does not require finely tuned feedback weights for the readout error. It is an open question whether approximations to the FORCE algorithm might exist that would obviate the need for nonlocal learning while maintaining sufficiently good performance.
In addition to RFLO learning, a number of other local and causal learning rules for training RNNs have been proposed. The oldest of these algorithms (Mazzoni et al., 1991; Williams, 1992) operate within the framework of reinforcement learning rather than supervised learning, meaning that only a scalar—and possibly temporally delayed—reward signal is available for training the RNN, rather than the full target function ${y}^{*}(t)$. Typical of such algorithms, which are often known as ‘node perturbation’ algorithms, is the REINFORCE learning rule (Williams, 1992), which in our notation gives the following weight update at the end of each trial:
where $R$ is the scalar reward signal (which might be defined as the negative of the loss function that we have used in RFLO learning), $\overline{R}$ is the average reward over recent trials, and ${\xi}_{a}(t)$ is noise current injected into unit $a$ during training. This learning rule means, for example, that (assuming the presynaptic unit $b$ is active) if the postsynaptic unit $a$ is more active than usual in a given trial (i.e. ${\xi}_{a}(t)$ is positive) and the reward is greater than expected, then the synaptic weight ${W}_{ab}$ should be increased so that this postsynaptic unit will be more active in future trials. A slightly more elaborate version of this learning rule replaces the summand in Equation (8) with a low-pass-filtered version of this same quantity, leading to eligibility traces of similar form to those appearing in Equation (7). This learning rule has also been adapted for a network of spiking neurons (Fiete et al., 2006).
A potential shortcoming of the REINFORCE learning rule is that it depends on the postsynaptic noise current rather than on the total postsynaptic input current (i.e. the noise current plus the input current from presynaptic units). Because it is arguably implausible that a neuron could keep track of these sources of input current separately, a recently proposed version (Miconi, 2017) replaces ${\xi}_{a}(t)\to f({u}_{a}(t)-{\overline{u}}_{a}(t))$, where $f(\cdot )$ is a supralinear function, ${u}_{a}(t)$ is the total input current (including noise) to unit $a$, and ${\overline{u}}_{a}(t)$ is the low-pass-filtered input current. This substitution is logical since the quantity ${u}_{a}(t)-{\overline{u}}_{a}(t)$ tracks the fast fluctuations of each unit, which are mainly due to the rapidly fluctuating input noise rather than to the more slowly varying recurrent and feedforward inputs.
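The node-perturbation update described here can be illustrated with a short Python sketch. It assumes the update takes the form $\Delta W_{ab} = (\eta/T)(R-\overline{R})\sum_t \xi_a(t) h_b(t)$, which is inferred from the verbal description rather than copied from Equation (8); the function name, argument names, and the $1/T$ normalization are hypothetical.

```python
import numpy as np

def reinforce_update(xi, h, R, R_bar, eta=0.01):
    """Node-perturbation weight update for one trial (assumed form).

    xi    : (T, N) noise current injected into each postsynaptic unit
    h     : (T, N) presynaptic activity at each timestep
    R     : scalar reward obtained on this trial
    R_bar : running average of the reward over recent trials
    """
    T = xi.shape[0]
    # Sum over time of xi_a(t) * h_b(t); the result is indexed [a, b].
    correlation = xi.T @ h
    return eta * (R - R_bar) * correlation / T

rng = np.random.default_rng(0)
xi = rng.normal(size=(50, 4))
h = rng.uniform(0.0, 1.0, size=(50, 4))

# Same noise and activity, but reward above vs. below the recent average.
dW_better = reinforce_update(xi, h, R=-0.2, R_bar=-0.5)
dW_worse = reinforce_update(xi, h, R=-0.8, R_bar=-0.5)
```

With identical noise and activity, a trial that beats the average reward and one that falls short by the same margin produce equal and opposite weight changes, which is the sign logic described in the text.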
A severe limitation of reinforcement learning as formulated in Equation (8) is the sparsity of reward information, which comes in the form of a single scalar value at the end of each trial. Clearly this provides the RNN with much less information to learn from than a vector of errors $\mathit{\bm{\epsilon}}(t)\equiv {\mathbf{\mathbf{y}}}^{*}(t)-\mathbf{\mathbf{y}}(t)$ at every timestep, which is assumed to be available in supervised learning. As one would expect from this observation, reinforcement learning is typically much slower than supervised learning in RNNs, as in feedforward neural networks. A hybrid approach is to assume that reward information is scalar, as in reinforcement learning, but available at every timestep, as in supervised learning. This might correspond to setting $R(t)\equiv -{|\mathit{\bm{\epsilon}}(t)|}^{2}$ and including this reward in a learning rule such as the REINFORCE rule in Equation (8). To our knowledge this has not been done for training recurrent weights in an RNN, though a similar idea has recently been used for training the readout weights of an RNN (Legenstein et al., 2010; Hoerzer et al., 2014). Ultimately, whether recurrent neural circuits in the brain use reinforcement learning or supervised learning is likely to depend on the task being learned and what feedback information about performance is available. For example, in a reach-to-target task such as the one modeled in Figure 4, it is plausible that a human or nonhuman primate might have a mental template of an ideal reach, and might make corrections to make the hand match the target trajectory at each timepoint in the trial. On the other hand, if only delayed binary feedback is provided in an interval-matching task such as the one modeled in Figure 3, neural circuits in the brain might be more likely to use reinforcement learning.
More recently, local, online algorithms for supervised learning in RNNs with spiking neurons have been proposed. Gilra and Gerstner (2017) and Alemi et al. (2017) have trained spiking RNNs to produce particular dynamical trajectories of RNN readouts. These works constitute a large step toward greater biological plausibility, particularly in their use of local learning rules and spiking neurons. Here we describe the most important differences between those works and RFLO learning. In both Gilra and Gerstner (2017) and Alemi et al. (2017), the RNN is driven by an input $\mathbf{\mathbf{x}}(t)$ as well as the error signal $\mathit{\bm{\epsilon}}(t)={\mathbf{\mathbf{y}}}^{*}(t)-\mathbf{\mathbf{y}}(t)$, where the target output is related to the input $\mathbf{\mathbf{x}}(t)$ according to
where ${g}_{i}(\mathbf{\mathbf{x}})={x}_{i}(t)$ in Alemi et al. (2017), but is arbitrary in Gilra and Gerstner (2017). In either case, however, it is not possible to learn arbitrary, time-dependent mappings between inputs and outputs in these networks, since the RNN output must take the form of a dynamical system driven by the RNN input. This is especially limiting if one desires that the RNN dynamics should be autonomous, so that $\mathbf{\mathbf{x}}(t)=0$ in Equation (9). It is not obvious, for example, what dynamical equations having the form of (9) would provide a solution to the interval-matching task studied in Figure 3. Of course, it is always possible to obtain an arbitrarily complex readout by making $\mathbf{\mathbf{x}}(t)$ sufficiently large such that $\mathbf{\mathbf{y}}(t)$ simply follows $\mathbf{\mathbf{x}}(t)$ from Equation (9). However, since $\mathbf{\mathbf{x}}(t)$ is provided as input, the RNN essentially becomes an autoencoder in this limit.
Two other features of Gilra and Gerstner (2017) and Alemi et al. (2017) differ from RFLO learning. First, the readout weights and the error feedback weights are related to one another in a highly specific way, being either symmetric with one another (Alemi et al., 2017), or else configured such that the loop from the RNN to the readout and back to the RNN via the error feedback pathway forms an autoencoder (Gilra and Gerstner, 2017). In either case these weights are preset to these values before training of the RNN begins, unlike the randomly set feedback weights used in RFLO learning. Second, both approaches require that the error signal $\mathit{\bm{\epsilon}}(t)$ be fed back to the network with (at least initially) sufficiently large gain such that the RNN dynamics are essentially slaved to produce the target readout ${\mathbf{\mathbf{y}}}^{*}(t)$, so that one has $\mathbf{\mathbf{y}}(t)\approx {\mathbf{\mathbf{y}}}^{*}(t)$ immediately from the beginning of training. (This follows as a consequence of the relation between the readout and feedback weights described above.) With RFLO learning, in contrast, forcing the output to always follow the target in this way is not necessary, and learning can work even if the RNN dynamics early in learning do not resemble the dynamics of the ultimate solution.
In summary, the random feedback learning rule that we propose offers a potential advantage over previous biologically plausible learning rules by making use of the full time-dependent, possibly multidimensional error signal, and also by training all weights in the network, including input, output, and recurrent weights. In addition, it does not require any special relation between the RNN inputs and outputs, nor any special relationship between the readout and feedback weights, nor a mechanism that restricts the RNN dynamics to always match the target from the start of training. Especially when extended to allow for sequence learning such as depicted in Figure 4, RFLO learning provides a plausible mechanism by which supervised learning might be implemented in recurrent circuits in the brain.
Materials and methods
Source code
A Python notebook implementing a simple, self-contained example of RFLO learning has been included as Source code 1 to accompany this publication. The example trains an RNN on the periodic output task from Figure 2 using RFLO learning, as well as using BPTT and RTRL for comparison.
Simulation details
In all simulations, the RNN time constant was $\tau =10$. Learning rates were selected by grid search over ${\eta}_{1,2,3}=\eta \in [{10}^{-4},3\times {10}^{-4},{10}^{-3},\mathrm{\dots},3\times {10}^{-1}]$. Input and readout weights were initialized randomly and uniformly over $[-1,1]$ and $[-1/\sqrt{N},1/\sqrt{N}]$, respectively. Recurrent weights were initialized randomly as $W\sim \mathcal{N}(0,{g}^{2}/N)$, where $g=1.5$ and $\mathcal{N}(0,{\sigma}^{2})$ is the normal distribution with zero mean and variance ${\sigma}^{2}$. The fixed feedback weights were chosen randomly as ${B}_{ij}\sim \mathcal{N}(0,1)$. The nonlinear activation function of the RNN units was $\varphi (\cdot )=\mathrm{tanh}(\cdot )$.
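The initialization described above might be set up as in the following sketch. The function name, the argument names, the array shape conventions (rows index the receiving unit), and the choice of one input and one output channel in the usage lines are assumptions made for illustration, not details taken from the accompanying notebook.

```python
import numpy as np

def init_rnn(n_in, n_rec, n_out, g=1.5, seed=0):
    """Draw initial weights following the distributions stated in the text."""
    rng = np.random.default_rng(seed)
    # Input weights: uniform over [-1, 1].
    w_in = rng.uniform(-1.0, 1.0, size=(n_rec, n_in))
    # Readout weights: uniform over [-1/sqrt(N), 1/sqrt(N)].
    bound = 1.0 / np.sqrt(n_rec)
    w_out = rng.uniform(-bound, bound, size=(n_out, n_rec))
    # Recurrent weights: normal with variance g^2 / N (std g / sqrt(N)).
    w_rec = rng.normal(0.0, g / np.sqrt(n_rec), size=(n_rec, n_rec))
    # Fixed random feedback weights: standard normal.
    b = rng.normal(0.0, 1.0, size=(n_rec, n_out))
    return w_in, w_rec, w_out, b

tau = 10.0      # RNN time constant
phi = np.tanh   # activation function of the recurrent units
w_in, w_rec, w_out, b = init_rnn(n_in=1, n_rec=30, n_out=1)
```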
In Figure 2, the RNN size was $N=30$. For task durations of $T=(200,400,800,1600)$ timesteps, the optimal learning rates after grid search were $\eta =(0.03,0.01,0.001,0.0003)$ for RFLO and $(0.03,0.03,0.01,0.03)$ for BPTT. The target output waveform was ${y}^{*}(t)=\mathrm{sin}(2\pi t/T)+0.5\mathrm{sin}(4\pi t/T)+0.25\mathrm{sin}(8\pi t/T)$. The shaded regions in panels a, b, and d are 25/75 percentiles of performance computed over nine randomly initialized networks, and the solid curves show the median performance.
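For reference, the target waveform for the periodic output task can be generated as follows. The function name is hypothetical, and starting the time index at $t=0$ is an assumption.

```python
import numpy as np

def periodic_target(T):
    """Target y*(t) for the periodic output task: a sum of three sinusoids."""
    t = np.arange(T)  # assumed to start at t = 0
    return (np.sin(2 * np.pi * t / T)
            + 0.5 * np.sin(4 * np.pi * t / T)
            + 0.25 * np.sin(8 * np.pi * t / T))

y_star = periodic_target(200)
```

Because each component completes a whole number of cycles in $T$ timesteps, the waveform starts at zero and sums to zero over one trial.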
In the version of the periodic output task that satisfies Dale’s law by enforcing sign-constrained synapses (Figure 2—figure supplement 1), half of the RNN units were assigned to be excitatory and half inhibitory. Recurrent weights were initialized as above, with the additional step of ${W}_{ij}\leftarrow {\xi}_{j}|{W}_{ij}|$, where ${\xi}_{j}=\pm 1$ for excitatory or inhibitory units, respectively. During learning in this network, recurrent weights were updated normally but clipped at zero to prevent the weights from changing sign.
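A minimal sketch of the sign constraint and clipping step is given below. It assumes that the sign $\xi_j$ belongs to the presynaptic unit $j$ (so that it applies to column $j$ of the weight matrix) and that the initial sign assignment acts on the absolute value of each weight; both function names and the stand-in random update are hypothetical.

```python
import numpy as np

def apply_dale_signs(w_rec, signs):
    """Give every outgoing weight of unit j the sign xi_j (column-wise)."""
    return np.abs(w_rec) * signs[None, :]

def clip_to_sign(w_rec, signs):
    """Clip at zero so that no weight crosses over to the opposite sign."""
    excitatory = signs[None, :] > 0
    return np.where(excitatory, np.maximum(w_rec, 0.0), np.minimum(w_rec, 0.0))

rng = np.random.default_rng(0)
n = 30
# First half of the units excitatory (+1), second half inhibitory (-1).
signs = np.where(np.arange(n) < n // 2, 1.0, -1.0)
w = apply_dale_signs(rng.normal(0.0, 1.5 / np.sqrt(n), size=(n, n)), signs)

# Stand-in for a learning update, followed by the clipping step.
w_updated = clip_to_sign(w + rng.normal(0.0, 0.1, size=(n, n)), signs)
```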
In the version of the periodic output task in which only readout weights were trained (Figure 2—figure supplement 2), the readout was fed back into the RNN as a separate input current to the recurrent units via the random feedback weights $\mathbf{\mathbf{B}}$. This is necessary to stabilize the RNN dynamics in the absence of learning of the recurrent weights, as they would be either chaotic (for large recurrent weights) or quickly decaying (for small recurrent weights) in the absence of such stabilization. The RNN was initialized as described above, and the learning rate for the readout weights was $\eta =0.03$, determined by grid search.
In Figure 3, the RNN size was $N=100$. The input and target output pulses were Gaussian with a standard deviation of 15 timesteps. The RNNs were trained for 5000 trials. With BPTT, the learning rate was ${\eta}_{1,2,3}=0.003$, while with RFLO learning it was $0.001$. Rather than performing weight updates in every trial, the updates were continuously accumulated but only implemented after batches of 10 trials.
In Figure 4, networks of size $N=100$ were used. In the version with the loop architecture, RFLO learning was first used to train the network to produce a particular reach trajectory in response to each of four tonic inputs for 10,000 trials, with a random input chosen in each trial, subject to the constraint that the trajectory could not move the cursor out of bounds. Next, the RNN weights were held fixed and the weights ${\mathbf{\mathbf{W}}}^{s}$ were learned for 10,000 additional trials while the RNN controlled its own input via the auxiliary loop. The active unit in ‘striatum’ was chosen randomly with probability ${p}_{\mathrm{explore}}=0.1$ and was otherwise chosen deterministically based on the RNN input via the weights ${W}^{s}$, again subject to the constraint that the trajectory could not move the cursor out of bounds. In the comparison shown in subpanel (c), RNNs without the loop architecture were trained for 20,000 trials with either RFLO learning or BPTT to autonomously produce the entire sequence of $160\tau $ timesteps.
Appendix 1
Gradient-based RNN learning and RFLO learning
In the first subsection of this appendix, we begin by reviewing the derivation of RTRL, the classic gradient-based learning rule. We show that the update equation for the recurrent weights under the RTRL rule has two undesirable features from a biological point of view. First, the learning rule is nonlocal, with the update to weight ${W}_{ij}$ depending on all of the other weights in the RNN, rather than just on information that is locally available to that particular synapse. Second, the RTRL learning rule requires that the error in the RNN readout be fed back into the RNN with weights that are precisely symmetric with the readout weights. In the second subsection, we implement approximations to the RTRL gradient in order to overcome these undesirable features, leading to the RFLO learning rules.
In the third subsection of this appendix, we review the derivation of BPTT, the most widely used algorithm for training RNNs. Because it is the standard gradient-based learning rule for RNN training, BPTT is the learning rule against which we compare RFLO learning in the main text. In the final subsection of this appendix, we illustrate the equivalence of RTRL and BPTT. Although this is not strictly necessary for any of the results given in the main text, we expect that readers with an interest in gradient-based learning rules for training RNNs will be interested in this correspondence, which to our knowledge has not been very clearly explicated in the literature.
Real-time recurrent learning
In this section we review the derivation of the real-time recurrent learning (RTRL) algorithm (Williams and Zipser, 1989) for an RNN such as the one shown in Figure 1. This rule is obtained by taking a gradient of the mean-squared output error of the RNN with respect to the synaptic weights, and, as we will show later in this appendix, is equivalent (when implemented in batches rather than online) to the more widely used backpropagation through time (BPTT) algorithm.
The standard RTRL algorithm is obtained by calculating the gradient of the loss function in Equation (2) with respect to the RNN weights, and then using gradient descent to find the weights that minimize the loss function (Goodfellow et al., 2016). Specifically, for each run of the network, one can calculate $\partial L/\partial {W}_{ab}$ and then update the weights by an amount proportional to the negative of this gradient: $\mathrm{\Delta}{W}_{ab}=-\eta \partial L/\partial {W}_{ab}$, where $\eta $ determines the learning rate. This can be done similarly for the input and output weights, ${W}_{ab}^{\mathrm{in}}$ and ${W}_{ab}^{\mathrm{out}}$, respectively. This results in the following update equations:
In these equations, ${(\cdot )}^{\mathrm{T}}$ denotes matrix transpose, and the gradients of the hidden layer activities with respect to the recurrent and input weights are given by
where we have defined
and $\mathbf{\mathbf{u}}(t)$ is the total input to each recurrent unit at time $t$:
The recursions in Equation (11) terminate with
As many others have recognized previously, the synaptic weight updates given in the second and third lines of Equation (10) are not biologically realistic for a number of reasons. First, the error is projected back into the network with the particular weight matrix ${({W}^{\mathrm{out}})}^{\mathrm{T}}$, so that the feedback and readout weights must be related to one another in a highly specific way. Second, the terms involving $\mathbf{\mathbf{W}}$ in Equation (11) mean that information about the entire network is required to update any given synaptic weight, making the rules nonlocal. In contrast, a biologically plausible learning rule for updating a weight ${W}_{ab}$ or ${W}_{ab}^{\mathrm{in}}$ ought to depend only on the activity levels of the pre- and postsynaptic units $a$ and $b$, in addition to the error signal that is fed back into the network. Both of these shortcomings will be addressed in the following subsection.
Random feedback local online learning
In order to obtain a biologically plausible learning rule, we can attempt to relax some of the requirements in the RTRL learning rule and see whether the RNN is still able to learn effectively. First, inspired by a recently used approach in feedforward networks (Lillicrap et al., 2016), we replace the ${({W}^{\mathrm{out}})}^{\mathrm{T}}$ appearing in the second and third lines of Equation (10) with a fixed random matrix $\mathbf{\mathbf{B}}$, so that the feedback projection of the output error no longer needs to be tuned to match the other weights in the network in a precise way. Second, we simply drop the terms involving $\mathbf{\mathbf{W}}$ in Equation (11), so that nonlocal information about all recurrent weights in the network is no longer required to update a particular synaptic weight. In this case we can rewrite the approximate weight-update equations as
where
Here we have defined rank-2 versions of the eligibility trace tensors from Equation (12):
As desired, the update rules in Equation (15) are local, depending only on the pre- and postsynaptic activity, together with a random feedback projection of the error signal. In addition, because all of the quantities appearing in Equation (15) are computed in real time as the RNN is run, the weight updates can be performed online, in contrast to BPTT, for which the dynamics over all timesteps must be run first forward and then backward before making any weight updates. Hence, we refer to the learning rule given by Equation (15), together with the eligibility traces defined above, as random feedback local online (RFLO) learning.
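The structure of the RFLO updates can be made concrete with a short Python sketch for an autonomous RNN (no external input). The discrete-time dynamics $\mathbf{h}(t)=(1-1/\tau)\mathbf{h}(t-1)+\varphi(\mathbf{W}\mathbf{h}(t-1))/\tau$, the trace recursion for $p_{ab}$, and the update forms are assumptions reconstructed from the verbal description (error fed back through $\mathbf{B}$, a low-pass-filtered product of $\varphi'(u_a)$ and presynaptic activity), not a transcription of the paper's equations or notebook. The function name, the random initial state `h0`, and the number of trials are likewise hypothetical.

```python
import numpy as np

def rflo_trial(w_rec, w_out, b, h0, y_star, tau=10.0, eta_out=0.03, eta_rec=0.03):
    """Run one trial, updating weights online; returns new weights and mean loss."""
    n = w_rec.shape[0]
    h = h0.copy()
    p = np.zeros((n, n))  # eligibility trace p[a, b], local to synapse b -> a
    loss = 0.0
    for y_target in y_star:
        h_prev = h
        u = w_rec @ h_prev                        # total input to each unit
        h = (1 - 1 / tau) * h_prev + np.tanh(u) / tau
        err = y_target - w_out @ h                # readout error eps(t)
        # Low-pass filter of phi'(u_a) * h_b; no other weights are involved.
        p = (1 - 1 / tau) * p + np.outer(1 - np.tanh(u) ** 2, h_prev) / tau
        # Readout update uses the error directly; recurrent update uses B @ err.
        w_out = w_out + eta_out * np.outer(err, h)
        w_rec = w_rec + eta_rec * (b @ err)[:, None] * p
        loss += 0.5 * float(err @ err)
    return w_rec, w_out, loss / len(y_star)

rng = np.random.default_rng(0)
n, n_out, T = 30, 1, 200
w_rec = rng.normal(0.0, 1.5 / np.sqrt(n), size=(n, n))
w_out = rng.uniform(-1.0, 1.0, size=(n_out, n)) / np.sqrt(n)
b = rng.normal(0.0, 1.0, size=(n, n_out))         # fixed random feedback weights
h0 = rng.uniform(-1.0, 1.0, size=n)               # hypothetical initial state
t = np.arange(T)
y_star = (np.sin(2 * np.pi * t / T) + 0.5 * np.sin(4 * np.pi * t / T)
          + 0.25 * np.sin(8 * np.pi * t / T))[:, None]

losses = []
for trial in range(10):
    w_rec, w_out, loss = rflo_trial(w_rec, w_out, b, h0, y_star)
    losses.append(loss)
```

Every quantity used to update `w_rec[a, b]` is either available at that synapse (`p[a, b]`) or delivered to unit `a` by the feedback projection (`b @ err`), and all updates are applied within the forward pass, so no backward sweep through time is needed.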
Backpropagation through time
Because it is the standard algorithm used for training RNNs, in this section we review the derivation of the learning rules for backpropagation through time (BPTT) (Rumelhart et al., 1985) in order to compare it with the learning rules presented above. The derivation here follows Lecun (1988).
Consider the following Lagrangian function:
The second line is the cost function that is to be minimized, while the first line uses the Lagrange multiplier $\mathbf{\mathbf{z}}(t)$ to enforce the constraint that the dynamics of the RNN should follow Equation (1). From Equation (18) we can also define the following action:
We now proceed by minimizing Equation (19) with respect to each of its arguments. First, taking $\partial S/\partial {z}_{i}(t)$ just gives the dynamical Equation (1). Next, we set $\partial S/\partial {h}_{i}(t)=0$, which yields
which applies at timesteps $t=1,\mathrm{\dots},T-1$. To obtain the value at the final timestep, we take $\mathrm{\partial}S/\mathrm{\partial}{h}_{i}(T)$, which leads to
Finally, taking the derivative with respect to the weights leads to the following:
Rather than setting these derivatives equal to zero, which may lead to an undesired solution that corresponds to a maximum or saddle point of the action and would in any case be intractable, we use the gradients in Equation (22) to perform gradient descent, reducing the error in an iterative fashion:
where ${\eta}_{i}$ are learning rates.
The BPTT algorithm then proceeds in three steps. First, the dynamical Equation (1) for $\mathbf{\mathbf{h}}(t)$ is integrated forward in time, beginning with the initial condition $\mathbf{\mathbf{h}}(0)$. Second, the auxiliary variable $\mathbf{\mathbf{z}}(t)$ is integrated backwards in time using Equation (20), together with the $\mathbf{\mathbf{h}}(t)$ saved from the forward pass and the boundary condition $\mathbf{\mathbf{z}}(T)$ from Equation (21). Third, the weights are updated according to Equation (23), using $\mathbf{\mathbf{h}}(t)$ and $\mathbf{\mathbf{z}}(t)$ saved from the preceding two steps.
Note that no approximations have been made in computing the gradients using either the RTRL or BPTT procedures. In fact, as we will show in the following section, the two algorithms are completely equivalent, at least in the case where RTRL weight updates are performed only at the end of each trial rather than at every timestep.
A unified view of gradient-based learning in recurrent networks
As pointed out previously (Beaufays and Wan, 1994; Srinivasan et al., 1994), the difference between RTRL and BPTT can ultimately be traced to distinct methods of bookkeeping in applying the chain rule to the gradient of the loss function. (Thanks to A. Litwin-Kumar for discussion about this correspondence.) In order to make this explicit, we begin by noting that, when taking implicit dependences into account, the loss function defined in Equation (2) has the form
In this section, we write ${\mathbf{\mathbf{h}}}^{t}\equiv \mathbf{\mathbf{h}}(t)$ for notational convenience, and consider only updates to the recurrent weights $\mathbf{\mathbf{W}}$, ignoring the input $\mathbf{\mathbf{x}}(t)$ to the RNN. In any gradient-based learning scheme, the weight update $\mathrm{\Delta}{W}_{ab}$ should be proportional to the gradient of the loss function, which has the form
The difference between RTRL and BPTT arises from the two possible ways of keeping track of the implicit dependencies from Equation (24), which give rise to the following equivalent formulations of Equation (25):
In RTRL, the first derivative is simple to compute because the loss function is treated as an explicit function of the variables ${\mathbf{\mathbf{h}}}^{t}$. The dependence of ${\mathbf{\mathbf{h}}}^{t}$ on $\mathbf{\mathbf{W}}$ and ${\mathbf{\mathbf{h}}}^{{t}^{\prime}}$ (where ${t}^{\prime}<t$) is then taken into account in the second derivative, which must be computed recursively due to the nested dependence on $\mathbf{\mathbf{W}}$. In BPTT, on the other hand, the implicit dependencies are dealt with in the first derivative, which in this case must be computed recursively because all terms at times ${t}^{\prime}>t$ depend implicitly on ${\mathbf{\mathbf{h}}}^{t}$. The second derivative then becomes simple since these dependencies are no longer present.
Let us define the following:
Then, using the definition of $L$ from Equation (2) and the dynamical Equation (1) for ${\mathbf{\mathbf{h}}}^{t}$ to take the other derivatives appearing in Equation (26), we have
The recursion relations follow from application of the chain rule in the definitions from Equation (27):
These recursion relations are identical to those appearing in Equation (11) and Equation (20). Notably, the first is computed forward in time, while the second is computed backward in time. Because no approximations have been made in computing the gradient in either case for Equation (28), the two methods are equivalent, at least if RTRL weight updates are made only at the end of each trial, rather than online. For this reason, only one of the algorithms (BPTT) was compared against RFLO learning in the main text.
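The equivalence can be checked numerically with a small sketch that computes $\partial L/\partial W_{ab}$ both ways for an autonomous RNN. The dynamics $\mathbf{h}^{t}=(1-1/\tau)\mathbf{h}^{t-1}+\tanh(\mathbf{W}\mathbf{h}^{t-1})/\tau$ and the loss $L=\frac{1}{2T}\sum_t |\mathit{\bm{\epsilon}}(t)|^2$ are assumed forms, and the recursions are derived from the chain rule for those forms rather than copied from Equations (11), (20), or (29). All function and variable names are hypothetical.

```python
import numpy as np

def forward(w, w_out, h0, y_star, tau):
    """Integrate h forward; record inputs u(t) and errors eps(t) for t = 1..T."""
    hs, us, errs = [h0], [None], [None]
    for y_target in y_star:
        u = w @ hs[-1]
        hs.append((1 - 1 / tau) * hs[-1] + np.tanh(u) / tau)
        us.append(u)
        errs.append(y_target - w_out @ hs[-1])
    return hs, us, errs

def grad_rtrl(w, w_out, h0, y_star, tau):
    """Forward-in-time gradient using the rank-3 tensor P[i, a, b] = dh_i/dW_ab."""
    n, T = w.shape[0], len(y_star)
    hs, us, errs = forward(w, w_out, h0, y_star, tau)
    P, grad, idx = np.zeros((n, n, n)), np.zeros((n, n)), np.arange(n)
    for t in range(1, T + 1):
        dphi = 1 - np.tanh(us[t]) ** 2
        direct = np.zeros((n, n, n))
        direct[idx, idx, :] = np.outer(dphi, hs[t - 1])   # explicit dependence on W_ab
        indirect = dphi[:, None, None] * np.einsum('ij,jab->iab', w, P)  # via h(t-1)
        P = (1 - 1 / tau) * P + (direct + indirect) / tau
        grad -= np.einsum('i,iab->ab', w_out.T @ errs[t], P) / T
    return grad

def grad_bptt(w, w_out, h0, y_star, tau):
    """Backward-in-time gradient using the vector z(t) = dL/dh(t)."""
    n, T = w.shape[0], len(y_star)
    hs, us, errs = forward(w, w_out, h0, y_star, tau)
    z, dphi_next, grad = np.zeros(n), np.zeros(n), np.zeros((n, n))
    for t in range(T, 0, -1):
        # Explicit error term plus the dependence of h(t+1) on h(t).
        z = -(w_out.T @ errs[t]) / T + (1 - 1 / tau) * z + w.T @ (dphi_next * z) / tau
        dphi = 1 - np.tanh(us[t]) ** 2
        grad += np.outer(z * dphi, hs[t - 1]) / tau
        dphi_next = dphi
    return grad

rng = np.random.default_rng(1)
n, T, tau = 5, 40, 10.0
w = rng.normal(0.0, 1.5 / np.sqrt(n), size=(n, n))
w_out = rng.uniform(-1.0, 1.0, size=(1, n)) / np.sqrt(n)
h0 = rng.uniform(-1.0, 1.0, size=n)
y_star = np.sin(2 * np.pi * np.arange(T) / T)[:, None]

g_rtrl = grad_rtrl(w, w_out, h0, y_star, tau)
g_bptt = grad_bptt(w, w_out, h0, y_star, tau)
max_diff = float(np.max(np.abs(g_rtrl - g_bptt)))
```

The two gradients should agree up to floating-point rounding. Setting `indirect` to zero in `grad_rtrl` leaves only the entries `P[a, a, b]`, which is the local eligibility trace that RFLO retains.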
As discussed in previous sections, RTRL has the advantages of obeying causality and of allowing for weights to be continuously updated. But, as discussed above, RTRL has the disadvantage of being nonlocal, and also features a greater computational cost due to the necessity of updating a rank-3 tensor ${P}_{ab}^{i}(t)$ rather than a vector ${z}_{i}(t)$ at each timestep. By dropping the second term in the first line of Equation (29), RFLO learning eliminates both of these undesirable features, so that the resulting algorithm is causal, online, local, and has a computational complexity ($\sim {N}^{2}$ per timestep, vs. $\sim {N}^{4}$ for RTRL) on par with BPTT.
Appendix 2
Analysis of the RFLO learning rule
Given that the learning rules in Equation (7) do not move the weights directly along the steepest path that would minimize the loss function (as would the learning rules in Equation (10)), it is worthwhile to ask whether it can be shown that these learning rules in general decrease the loss function at all. To answer this question, we consider the change in weights after one trial lasting $T$ timesteps, working in the continuous-time limit for convenience, and performing weight updates only at the end of the trial:
where $\delta \mathbf{\mathbf{W}}$ and $\delta {\mathbf{\mathbf{W}}}^{\mathrm{out}}$ are given by Equation (7). For simplicity in this section we ignore the updates to the input weights, since the results in this case are very similar to those for recurrent weight updates.
In the first subsection of this appendix, we show that, under some approximations, the loss function tends to decrease on average under RFLO learning if there is positive alignment between the readout weights ${\mathbf{\mathbf{W}}}^{\mathrm{out}}$ and the feedback weights $\mathbf{\mathbf{B}}$. In the second subsection, we show that this alignment tends to increase during RFLO learning.
Decrease of the loss function
We first consider the change in the loss function defined in Equation (2) after updating the weights:
Assuming the weight updates to be small, we ignore terms beyond leading order in $\mathrm{\Delta}\mathbf{\mathbf{W}}$ and $\mathrm{\Delta}{\mathbf{\mathbf{W}}}^{\mathrm{out}}$. Then, using the update rules in Equation (30) and performing some algebra, Equation (31) becomes
Clearly the first term in Equation (32) always tends to decrease the loss function, as we would expect given that the precise gradient of $L$ with respect to ${\mathbf{\mathbf{W}}}^{\mathrm{out}}$ was used to determine this part of the learning rule. We now wish to show that, at least on average and with some simplifying assumptions, the second term in Equation (32) tends to be negative as well. Before beginning, we note in passing that this term is manifestly nonpositive like the first term if we perform RTRL, in which case ${\sum}_{k}{B}_{ak}{\epsilon}_{k}({t}^{\prime}){p}_{ab}({t}^{\prime})\to {\sum}_{kl}{W}_{kl}^{out}{\epsilon}_{k}({t}^{\prime}){P}_{ab}^{l}({t}^{\prime})$ in Equation (32), making the gradient exact.
In order to analyze $\mathrm{\Delta}{L}^{(2)}$, we will assume that the RNN is linear, with $\varphi (x)=x$. Further, we will average over the RNN activity $\mathbf{\mathbf{h}}(t)$, assuming that the activities are correlated from one timestep to the next, but not from one unit to the next:
The correlation function should be peaked at a positive value at $t-{t}^{\prime}=0$ and decay to 0 at much earlier and later times. Finally, because of the antisymmetry under $x\to -x$, odd powers of $\mathbf{\mathbf{h}}$ will average to zero: ${\langle {h}_{i}\rangle}_{\mathbf{\mathbf{h}}}={\langle {h}_{i}{h}_{j}{h}_{k}\rangle}_{\mathbf{\mathbf{h}}}=0$.
With these assumptions, we can express the activity-averaged second line of Equation (32) as ${\langle \mathrm{\Delta}{L}^{(2)}\rangle}_{\mathbf{\mathbf{h}}}={F}_{1}+{F}_{2}$, with
and
In order to make further progress, we can perform an ensemble average over $\mathbf{\mathbf{W}}$, assuming that ${W}_{ij}\sim \mathcal{N}(0,{g}^{2}/N)$ is a random variable, which leads to
This leads to
and
Putting Equation (37) and Equation (38) together, changing one integration variable, and dropping the terms smaller than $O(N)$ then gives
Because we have assumed that $C(t)\ge 0$, the sign of this quantity depends only on the sign of the two terms in the second line of Equation (39).
Already we can see that Equation (39) will tend to be negative when ${\mathbf{\mathbf{W}}}^{\mathrm{out}}$ is aligned with $\mathbf{\mathbf{B}}$. To see this, suppose that $\mathbf{\mathbf{B}}=\alpha {({\mathbf{\mathbf{W}}}^{\mathrm{out}})}^{\mathrm{T}}$, with $\alpha >0$. Due to the exponential factor, the integrand will be vanishingly small except when $t\approx {t}^{\prime}$, so that the first term in the second line in this case can be written as $\approx \alpha {|{({\mathbf{\mathbf{W}}}^{\mathrm{out}})}^{\mathrm{T}}{\mathbf{\mathbf{y}}}^{*}(t)|}^{2}\ge 0$. The second term, meanwhile, becomes $\alpha C(t-{t}^{\prime})\mathrm{Tr}\left[{({({\mathbf{\mathbf{W}}}^{\mathrm{out}})}^{\mathrm{T}}{\mathbf{\mathbf{W}}}^{\mathrm{out}})}^{2}\right]\ge 0$.
The situation is most transparent if we assume that the RNN readout is one-dimensional, in which case the readout and feedback weights become vectors ${\mathbf{\mathbf{w}}}^{\mathrm{out}}$ and $\mathbf{\mathbf{b}}$, respectively, and Equation (39) becomes
In this case it is clear that, as in the case of feedforward networks (Lillicrap et al., 2016), the loss function tends to decrease when the readout weights become aligned with the feedback weights. In the following subsection we will show that, at least under similar approximations to the ones made here, such alignment does in fact occur.
Alignment of readout weights with feedback weights
In the preceding subsection it was shown that, assuming a linear RNN and averaging over activities and recurrent weights, the loss function tends to decrease when the alignment between the readout weights ${\mathbf{\mathbf{W}}}^{\mathrm{out}}$ and the feedback weights $\mathbf{\mathbf{B}}$ becomes positive. In this subsection we ask whether such alignment does indeed occur.
In order to address this question, we consider the quantity $\mathrm{Tr}({\mathbf{\mathbf{W}}}^{\mathrm{out}}\mathbf{\mathbf{B}})$ and ask how it changes following one cycle of training, with combined weight updates on $\mathbf{\mathbf{W}}$ and ${\mathbf{\mathbf{W}}}^{\mathrm{out}}$. (As in the preceding subsection, external input to the RNN is ignored here for simplicity.) The effect of modifying the readout weights is obvious from Equation (15):
The update to the recurrent weights, on the other hand, modifies $\mathbf{\mathbf{h}}(t)$ in the above equation. Because we are interested in the combined effect of the two weight updates and are free to make the learning rates arbitrarily small, we focus on the following quantity:
The goal of this subsection is thus to show that (at least on average) $G>0$.
In order to evaluate this quantity, we need to know how the RNN activity $\mathbf{\mathbf{h}}(t)$ depends on the weight modification $\mathrm{\Delta}\mathbf{\mathbf{W}}$. As in the preceding subsection, we will assume a linear RNN and will work in the continuous-time limit ($\tau \gg 1$) for convenience. In this case, the dynamics are given by
If we wish to integrate this equation to get $\mathbf{\mathbf{h}}(t)$ and expand to leading order in $\mathrm{\Delta}\mathbf{\mathbf{W}}$, care must be taken due to the fact that $\mathbf{\mathbf{W}}$ and $\mathrm{\Delta}\mathbf{\mathbf{W}}$ are noncommuting matrices. Taking a cue from perturbation theory in quantum mechanics (Sakurai, 1994), we can work in the ‘interaction picture’ and obtain
where
We can now expand Equation (44) to obtain
For a linear network, the update rule for $\mathbf{\mathbf{W}}$ from Equation (15) is then simply
where the bar denotes low-pass filtering:
Combining Equations (46)–(48), the time-dependent activity vector to leading order in ${\eta}_{2}$ is
where $\widehat{\mathbf{\mathbf{h}}}(t)$ is the unperturbed RNN activity vector (i.e. without the weight update $\mathrm{\Delta}\mathbf{\mathbf{W}}$). With this result, we can express Equation (42) as $G={G}_{1}+{G}_{2}$, where
and
Here we have defined $\widehat{\mathit{\bm{\epsilon}}}(t)\equiv {\mathbf{\mathbf{y}}}^{*}(t)-{\mathbf{\mathbf{W}}}^{\mathrm{out}}\widehat{\mathbf{\mathbf{h}}}(t)$.
In order to make further progress, we follow the approach of the previous subsection and perform an average over RNN activity vectors, which yields
and
Similar to the integral in Equation (39), both of these quantities will tend to be positive if we assume that $C(t)\ge 0$ with a peak at $t=0$, and note that the integrand is large only when $t\approx {t}^{\prime}$.
In order to make the result even more transparent, we can again consider the case of a one-dimensional readout, in which case Equation (52) becomes
and
This version illustrates even more clearly that the right-hand sides of these equations tend to be positive.
Equation (52) (or, in the case of one-dimensional readout, Equation (54)) shows that the overlap between the readout weights and feedback weights tends to increase with training. Equation (39) (or Equation (40)) then shows that the readout error will tend to decrease during training given that this overlap is positive. While these mathematical results provide a compelling plausibility argument for the efficacy of RFLO learning, it is important to recall that some limiting assumptions were required in order to obtain them. Specifically, we assumed linearity of the RNN and vanishing of the cross-correlations in the RNN activity, neither of which is strictly true in a trained nonlinear network. In order to show that RFLO learning remains effective even without these limitations, we must turn to numerical simulations such as those performed in the main text.
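As a rough illustration of the two quantities tracked in this argument, the following minimal sketch trains a small nonlinear RNN with an RFLO-style rule and records, after each trial, the mean squared readout error and the alignment between the readout weights and the feedback weights. The discrete-time dynamics, the form of the eligibility trace `p`, the tanh nonlinearity, the learning rates, the network size and the sine-wave target are all assumptions chosen for illustration; they are not the settings used for the figures in the main text.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_out, tau = 30, 1, 10.0        # network size, readout size, time constant (assumed)
T, n_trials = 100, 200             # timesteps per trial and number of trials (assumed)
eta_rec, eta_out = 0.03, 0.01      # assumed learning rates

W = rng.normal(0.0, 1.5 / np.sqrt(N), (N, N))        # recurrent weights
W_out = rng.normal(0.0, 1.0 / np.sqrt(N), (n_out, N))  # readout weights
B = rng.normal(0.0, 1.0, (N, n_out))                 # fixed random feedback weights
h0 = rng.uniform(-0.5, 0.5, N)                       # fixed initial state for every trial
target = np.sin(2 * np.pi * np.arange(T) / T)[:, None]  # hypothetical periodic target

def alignment(W_out, B):
    """Cosine of the angle between the readout weights and the transposed feedback weights."""
    return float(np.sum(W_out * B.T) / (np.linalg.norm(W_out) * np.linalg.norm(B)))

losses, alignments = [], []
for trial in range(n_trials):
    h = h0.copy()
    p = np.zeros((N, N))           # eligibility trace, one entry per recurrent synapse
    sq_err = 0.0
    for t in range(T):
        u = W @ h                                  # input current to each unit
        h_new = h + (-h + np.tanh(u)) / tau        # leaky RNN update
        # Low-pass filtered product of postsynaptic gain and presynaptic activity.
        p = (1 - 1 / tau) * p + np.outer(1 - np.tanh(u) ** 2, h) / tau
        h = h_new
        err = target[t] - W_out @ h                # readout error
        W_out += eta_out * np.outer(err, h)        # delta-rule update of the readout
        W += eta_rec * (B @ err)[:, None] * p      # error reaches each unit through B
        sq_err += float(err @ err)
    losses.append(sq_err / T)
    alignments.append(alignment(W_out, B))

print("first / last trial loss:     ", losses[0], losses[-1])
print("first / last trial alignment:", alignments[0], alignments[-1])
```

Plotting `losses` and `alignments` against trial number gives a direct check, for a given choice of parameters, on whether the error falls as the alignment grows.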
References
6. Pharmacology and Nerve-endings (Walter Ernest Dixon memorial lecture) (Section of therapeutics and pharmacology). Proceedings of the Royal Society of Medicine 28:319–332.
8. Gradient learning in spiking neural networks by dynamic perturbation of conductances. Physical Review Letters 97:048104. https://doi.org/10.1103/PhysRevLett.97.048104
9. Modulation of striatal projection systems by dopamine. Annual Review of Neuroscience 34:441–466. https://doi.org/10.1146/annurev-neuro-061010-113641
13. The basal ganglia and chunking of action repertoires. Neurobiology of Learning and Memory 70:119–136. https://doi.org/10.1006/nlme.1998.3843
16. Dopaminergic projections from midbrain to primary motor cortex mediate motor skill learning. Journal of Neuroscience 31:2481–2487. https://doi.org/10.1523/JNEUROSCI.5411-10.2011
19. Temporal context calibrates interval timing. Nature Neuroscience 13:1020–1026. https://doi.org/10.1038/nn.2590
21. Robust timing and motor patterns by taming chaos in recurrent neural networks. Nature Neuroscience 16:925–933. https://doi.org/10.1038/nn.3405
23. A theoretical framework for backpropagation. In: D Touretzky, G Hinton, T Sejnowski, editors. Proceedings of the 1988 Connectionist Models Summer School. Pittsburg, PA: Morgan Kaufmann. pp. 21–28.
26. How important is weight symmetry in Backpropagation? AAAI'16 Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. pp. 1837–1844.
27. Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications 7:13276. https://doi.org/10.1038/ncomms13276
28. The corticothalamic loop can control cortical dynamics for flexible robust motor output. Poster at Cosyne 2018.
33. Approximating real-time recurrent learning with random Kronecker factors. Advances in Neural Information Processing Systems 31, Curran. pp. 6594–6603.
34. Direct feedback alignment provides learning in deep neural networks. NIPS'16 Proceedings of the 30th International Conference on Neural Information. pp. 1045–1053.
35. How to construct deep recurrent neural networks. 2nd International Conference on Learning Representations.
37. Learning Internal Representations by Error Propagation. California Univ San Diego La Jolla Inst for Cognitive Science. https://doi.org/10.21236/ADA164453
40. Deep learning with dynamic spiking neurons and fixed feedback weights. Neural Computation 29:578–602. https://doi.org/10.1162/NECO_a_00929
42. Back propagation through adjoints for the identification of nonlinear dynamic systems using recurrent neural models. IEEE Transactions on Neural Networks 5:213–228. https://doi.org/10.1109/72.279186
43. A neural network that finds a naturalistic solution for the production of muscle activity. Nature Neuroscience 18:1025–1033. https://doi.org/10.1038/nn.4042
45. Unbiased online recurrent optimization. International Conference on Learning Representation.
Decision letter

Peter Latham, Reviewing Editor; University College London, United Kingdom

Michael J Frank, Senior Editor; Brown University, United States

Brian DePasquale, Reviewer; Princeton University, United States
In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.
Thank you for submitting your article "Local online learning in recurrent networks with random feedback" for consideration by eLife. Your article has been reviewed by three peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Michael Frank as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Brian DePasquale (Reviewer #2).
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
The manuscript develops a new algorithm for training recurrent neural networks (RFLO). The algorithm is intended to be local and online, and uses random feedback connections to send error into the network. The algorithm is presented as a contribution to a growing set of work on how learning in neural networks might take place in the brain. The results demonstrate on a series of simple tasks that RFLO works about as well as methods that use full gradient information (e.g. BPTT and RTRL) in cases where the timescale of the task is relatively short. In cases where the timescales are longer, RFLO struggles versus BPTT. The manuscript makes some effort to explore what component of RFLO damages performance and suggests that the difficulties with long timescales may be dealt with via different mechanisms altogether (e.g. by stitching together multiple shorter behaviours).
We found the paper well written and have a few essential revisions.
Some of our points are really suggestions that we hope will improve the paper. However, that's always a matter of opinion, so feel free to ignore us on those. We'll be specific by putting (suggestion) or (strong suggestion), the latter for the ones we think are important, before our comments.
Essential revisions:
1) (strong suggestion) The distinct modifications to BPTT/RTRL and the consequences of these modifications should be amplified in the existing text, to ensure that readers are sure to appreciate the results. For readers that are not familiar with gradient-based learning in RNNs and the ideas of random feedback, I fear that the separate ideas of gradient approximation and of random feedback might be muddled. Stressing that these two ideas are only related insofar as they separately address distinct shortcomings of existing algorithms would stress that they were introduced by the author not because they share a special relationship, but only to achieve that goal. This came across somewhat in the text, but I believe it could be amplified. Additionally, stressing earlier in the text why exactly finely tuned feedback is a problem would be helpful (this might not be obvious to everyone). It's touched on in Discussion but explaining the concern in the Introduction would help frame that result more clearly.
2) (strong suggestion) Much of the math is relegated to the appendix. This seems like a pity since 1) the relationship of RFLO to BPTT is not obvious from the main text (undoubtedly requiring the reader to visit the appendix) and 2) the connection between RTRL and BPTT is presented so clearly (and, as the author noted, this connection has not been presented clearly in the literature). At a minimum, I would recommend migrating a minimal amount of the appendix to the main text to illustrate how RFLO relates to BPTT and RTRL. (Maybe, for example, Equation 9 could be presented in the main text). The tradeoff between BPTT and RTRL (trading off nonlocality and causality) and how RFLO addresses these nonbiological features can only really be understood within the appendix, and I fear that not all readers will invest in it.
3) Figure 2B seems to indicate that performance decreases consistently as the period grows longer and then perhaps asymptotes. It would be nice to know how performance for the RFLO changes as a function of the period, over a broader range of values (at least for this particular periodic task), to understand a reasonable timescale under which learning could occur. This is directly relevant to the author's proposal about how the brain might learn (i.e. through a sequence of actions) because it will dictate the duration of a "behavioral syllable" based on this learning rule.
Other points:
1) The theory showing that the learning rule decreases the error makes a number of assumptions: linear dynamics, uncorrelated neurons, and random weights. This seems very problematic: linear dynamics correlates the neurons, and the whole point of learning is to make the weights nonrandom (and presumably, correlated). Thus, it's not clear what the theory adds. At the very least, this should be pointed out in the main text. Even better (but not required) would be to try to verify, numerically, some of the expressions in the Appendix. That won't be particularly easy, but it seems possible. Alternatively, the change in error versus the alignment, and the alignment versus time, could be plotted during a simulation. This would go a long way toward supporting – or refuting – the theory.
2) We couldn't find what the initial weights were in the learning rules. It would be good to know if small initial weights were needed, or if the learning rule works when the initial weights are large enough that the network is in the chaotic regime. (suggestion) In particular, a plot of performance versus initial weights (presumably the variance) would be informative.
3) In our experience, for local learning rules the variance in the error is large. The variance should be reported – not just the mean. In addition, more than 5 networks should be used to compute the error.
4) The networks were small (30 in Figure 2; 100 in Figure 3; not sure in Figure 4). What happens when the size increases? We're hesitant to ask for more simulations. However, if simulations with larger networks are not done, you need to be upfront about the fact that this study may not scale well to large networks.
5) (Very important!) The Materials and methods section should contain all simulation details. As far as we can tell, some are missing: the initial values of the weights, the learning rates (after the grid search), explicit forms for the target functions in Figure 2, and the time step. And we may have missed other details; you should make sure that there's enough information that the simulations can be replicated. It's true that the code is supplied, but not all of us like to read code.
6) You should also discuss Hoerzer et al. (2012, Cerebral cortex, 24(3), 677–690). This is an example of node perturbation without noise; it's instead based on overall performance relative to a running average. In that paper they train only the output weights, not the recurrent weights as is done here. However, if the network can do better than a training paradigm involving recurrent weights, it's worth mentioning. (suggestion) We would even go so far as to suggest comparing performance when training only the output weight against performance when training the recurrent weights.
7) The shaded regions in Figure 2 are only explained in the caption for panel D. It would be helpful to explain them in panel A as well.
8) The symbol 'i' is used to index over neurons in equation 1 (top), over outputs in the lower equation of equation 1 (and thus over the outputs in equation 2) and used to refer to the learning rates in the first sentence after equation 3. Given how technical the indexing can become, we would strongly suggest reserving 'i' (as you have for 'j','a', and 'b') for neurons only.
9) The gray line in Figure 2B is lost in the text describing what each color represents. It should be enlarged and made more prominent.
10) It would be helpful to add a more intuitive plotting convention for Figure 2C (such as a color gradient as τ gets longer).
11) (suggestion) In general, presenting the RFLO+Dale results in Figure 2 can be distracting from the main point (and the author doesn't treat it extensively in the text). We would suggest moving the Dale's Law results to a supplement to Figure 2 (see eLife's treatment of figure supplements).
12) The last line of text before equation 5 seems to contain incorrect references to equations 14 and 15 of the appendix, which I believe are the same as equations 3 and 4 of the main text.
13) Figure 2D appears to have a color mismatch between the line and the explanatory text within the figure, for "Local only".
14) The last sentence of "Interval matching" states that you will return to a point in the following subsection, but it's not clear which subsection you are referring to, or if that thread is ever discussed again later.
15) The algorithm is only run on simple toy problems that are constructed for the manuscript. The experiments hint that RFLO struggles in the context of longer timescales, but the manuscript provides no grounding for how well the algorithm performs on richer data. It is important to see performance of the algorithm gauged on a commonly used problem from the machine learning literature. Various datasets could be used, but the language modeling task on the Penn TreeBank (PTB) is very well explored and serves as a kind of MNIST for sequence modeling. Here it is important to quantify success in a fashion that is congruent with current standards in ML.
To be clear, high-level performance on such a task does not seem crucial. The paper is aimed at biology, and there is no reason to pursue top results. But it is important to be able to situate the obtained performance, and offer a benchmark for subsequent work on biologically plausible algorithms that simultaneously aim to be practical/functional.
16) (weak suggestion, since this will be tough, and possibly beyond the scope of the work, but it would be nice if it could be done) Along these lines, and in the context of an externally defined problem, it would be ideal to see the performance of the algorithm explored using an LSTM architecture. Basic RNN performance tends to be quite poor relative to LSTMs across many tasks. Is it easy to adapt the RFLO algorithm to more complex architectures, and does doing so deliver better performance on any task? The answer may simply be no; if so, that's OK.
17) (weak suggestion) On this question of architecture: how well does RFLO function in the case that there are multiple 'layers' in the RNN (e.g. as in Deep LSTMs where the connectivity matrix is not all-to-all)? Does the algorithm still function about as well, or does increasing the depth of the network slow training?
18) More could be done to emphasize that the tasks solved go well beyond what is solvable using a feedforward network and feedback alignment. This is implicit in some of the tasks, but guiding the reader to see this clearly and conclusively would be ideal.
19) In the original manuscript describing feedback alignment, convergence of error is proved under some very restrictive conditions. Please describe briefly, in the main text, how the theoretical results developed here are related to the original FA results, e.g. to what extent can they be seen as a generalization of those results? What assumptions are made differently or in addition to the FA results?
20) The manuscript briefly makes connections to the FORCE training method, but my feeling is that this currently doesn't go far enough. A few more sentences that summon more of the details of the model/algorithm from Sussillo and delineate the connections to RFLO would be useful to the reader.
21) (suggestion) The section on 'Learning a sequence of actions' is interesting, but currently feels ad hoc. The section almost feels more like a long discussion point than something that ought to sit in the results. The message is, I believe, that: RFLO and similar algorithms may suffer relative to ML approaches such as BPTT on longer time scales, but this is ok because there are ways to rescue performance for long sequences. In particular, winner-take-all and reward-modulated Hebbian learning rules are introduced along with additional sets of neurons to rescue performance on a movement sequence task. There is a brief attempt to relate these to the literature on structures that connect to motor cortex, but this feels rushed. Ideally, the manuscript would develop this section further so that it can be appreciated both with respect to biology and ML. Additionally, on this note, it would be good to see the performance of BPTT on the full sequence problem without these ad hoc approaches. There is certainly a limit to what BPTT can do: how much does it struggle with this situation? Is the long sequence one that BPTT also struggles with? This would provide grounding for where RFLO and these additional ideas sit with respect to BPTT training in this more interesting case.
22) There are a couple of citations that might be useful to include. For example, a mention of spiking variants of feedback alignment (e.g. "Deep learning with dynamic spiking neurons and fixed feedback weights"), and several recent works on approximations of RTRL in the ML literature that seem worth mentioning (e.g. "Unbiased online recurrent optimization", and "Approximating Real-Time Recurrent Learning with Random Kronecker Factors").
[Editors' note: further revisions were requested prior to acceptance, as described below.]
Thank you for submitting your article "Local online learning in recurrent networks with random feedback" for consideration by eLife. Your article has been reviewed by three peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Michael Frank as the Senior Editor. The following individual involved in review of your submission has agreed to reveal their identity: Brian DePasquale (Reviewer #2).
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
Summary:
This paper is essentially in, but there are two major (but not too hard) things left to do.
Essential revisions:
1) The treatment of Hoerzer et al. (a paper where you trained only the output weights) needs to be expanded. Our question last time was whether performance when only the output weights are trained could match performance when recurrent weights are trained. To address this, you trained only the output weights. However, as far as we could tell, there was no feedback. For a fair comparison, feedback weights are critical. Or at the very least, you should make sure you have a good weight initialization. (It is well known in the echo-state literature that when training only the output weights, the initialization of the other weight matrices is very important. The literature on echo-state networks contains lots of advice on how to ensure that these are set appropriately.) It would be nice if you included feedback weights and reran the simulations. That's not absolutely necessary, but if it's not done, then you will have to point out that you can't rule out the possibility that training the output weights actually works better than your approach of training the recurrent weights.
2) It would be good to include your response to concern 15 (attempting RFLO learning on more complicated problems) in the Discussion of the submitted manuscript. Your response at present is satisfactory, but the insight you share into why extending RFLO learning to more complex problems is itself interesting, and will likely be interesting to readers of the paper.
https://doi.org/10.7554/eLife.43299.014
Author response
Essential revisions:
1) (strong suggestion) The distinct modifications to BPTT/RTRL and the consequences of these modifications should be amplified in the existing text, to ensure that readers are sure to appreciate the results. For readers that are not familiar with gradient-based learning in RNNs and the ideas of random feedback, I fear that the separate ideas of gradient approximation and of random feedback might be muddled. Stressing that these two ideas are only related insofar as they separately address distinct shortcomings of existing algorithms would stress that they were introduced by the author not because they share a special relationship, but only to achieve that goal. This came across somewhat in the text, but I believe it could be amplified. Additionally, stressing earlier in the text why exactly finely tuned feedback is a problem would be helpful (this might not be obvious to everyone). It's touched on in Discussion but explaining the concern in the Introduction would help frame that result more clearly.
To make the independence of the two approximations clear, the following sentence has been added to the Introduction: “While these two approximations address distinct shortcomings of gradient-based learning and can be made independently (as discussed below in Results), only when both are made together does a learning rule emerge that is fully biologically plausible in the sense of being causal, local, and avoiding fine tuning of feedback weights.”
To address the second point about the need for a clearer explanation of the problem with symmetric feedback weights, the following text has been added to the Introduction: “Such precise matching corresponds to fine tuning in the sense that it requires a highly particular initial configuration of the synaptic weights, typically with no justification as to how such a configuration might come about in a biologically plausible manner. Further, if the readout weights are modified during training of the RNN, then the feedback weights must also be updated to match them, and it is unclear how this might be done without requiring nonlocal information.”
2) (strong suggestion) Much of the math is relegated to the appendix. This seems like a pity since 1) the relationship of RFLO to BPTT is not obvious from the main text (undoubtedly requiring the reader to visit the appendix) and 2) the connection between RTRL and BPTT is presented so clearly (and, as the author noted, this connection has not been presented clearly in the literature). At a minimum, I would recommend migrating a minimal amount of the appendix to the main text to illustrate how RFLO relates to BPTT and RTRL. (Maybe, for example, Equation 9 could be presented in the main text). The tradeoff between BPTT and RTRL (trading off nonlocality and causality) and how RFLO addresses these nonbiological features can only really be understood within the appendix, and I fear that not all readers will invest in it.
The author is thrilled by the reviewers’ suggestion to import more of the math from the Appendix into the main text. In response, two paragraphs following Equation 2 have been added, in which a minimal derivation of the RTRL learning rule is provided, and a precise discussion of the shortcomings of the learning rule and the two approximations made to ameliorate them is included. In addition, a more explicit encouragement for the reader to visit Appendix 1 for details about BPTT has been provided (“The other classic gradient-based algorithm, BPTT, involves a different approach for taking partial derivatives but is equivalent to RTRL; its derivation and relation to RTRL are also provided in Appendix 1.”) Details about BPTT and its relation to RTRL have been kept in the Appendix, however. This is because they are not essential for anything that follows in the main text, but rather are a bonus for the interested reader. Additionally, it would be impossible to explicate the topic clearly without introducing several more equations and a significant amount of technical discussion, providing a possible hurdle or annoyance to the less mathematically inclined of eLife’s readership.
3) Figure 2B seems to indicate that performance decreases consistently as the period grows longer and then perhaps asymptotes. It would be nice to know how performance for the RFLO changes as a function of the period, over a broader range of values (at least for this particular periodic task), to understand a reasonable timescale under which learning could occur. This is directly relevant to the author's proposal about how the brain might learn (i.e. through a sequence of actions) because it will dictate the duration of a "behavioral syllable" based on this learning rule.
This is an excellent observation. However, based on data that hasn’t been included in the manuscript, it appears that the existence of such a plateau is not a universal feature of the RNN performance as a function of task duration. Rather, it depends on quantities such as the number of training trials, network size, and the particular task that the RNN is being trained for. Taken together, there does not appear to be a universal timescale setting the duration of a behavioral syllable. Since it would take quite a lot of additional simulations to establish this definitively, and since the presumed result is a null one, the author’s preference would be not to pursue this point further. If the reviewers and editor feel strongly that this would be an important result, however, it could be done.
Other points:
1) The theory showing that the learning rule decreases the error makes a number of assumptions: linear dynamics, uncorrelated neurons, and random weights. This seems very problematic: linear dynamics correlates the neurons, and the whole point of learning is to make the weights nonrandom (and presumably, correlated). Thus, it's not clear what the theory adds. At the very least, this should be pointed out in the main text. Even better (but not required) would be to try to verify, numerically, some of the expressions in the Appendix. That won't be particularly easy, but it seems possible. Alternatively, the change in error versus the alignment, and the alignment versus time, could be plotted during a simulation. This would go a long way toward supporting – or refuting – the theory.
The reviewer is certainly right that the limitations of the highly simplified theory in Appendix 2 should be highlighted more prominently in the main text. To address this, the following text has been added to the end of the first subsection in Results: “A number of simplifying assumptions have been made in the mathematical derivations of Appendix 2, including linear dynamics, uncorrelated neurons, and random synaptic weights, none of which will necessarily hold in a nonlinear network trained to perform a dynamical computation. Hence, although such mathematical arguments provide reason to hope that RFLO learning might be successful and insight into the mechanism by which learning occurs, it remains to be shown that RFLO learning can be used to successfully train a nonlinear RNN in practice.”
Regarding the reviewers’ other suggestion, the author has declined to attempt to verify numerically the expressions for the linearized network in the Appendix. On the one hand, it seems clear that strong quantitative agreement with the simulations of the nonlinear RNN would be too much to hope for, given the drastic simplifications and assumptions made in the derivation, as pointed out by the reviewer. On the other hand, it is equally clear that qualitative agreement between the theoretical expressions and simulations must occur, since the loss function does in fact decrease (Figure 2A) and the alignment between readout and feedback does in fact increase (Figure 2C) in simulations, as the theory predicts. Hence, it doesn’t appear that the numerical simulations would add much insight, and the reviewers’ leniency on this point is greatly appreciated.
2) We couldn't find what the initial weights were in the learning rules. It would be good to know if small initial weights were needed, or if the learning rule works when the initial weights are large enough that the network is in the chaotic regime. (suggestion) In particular, a plot of performance versus initial weights (presumably the variance) would be informative.
A section has been added to the Materials and methods explaining weight initialization and other simulation details. Regarding the performance as a function of the initial weight variance, a supplementary figure to Figure 2 has been added, following the reviewers’ suggestion. In all other simulations, the initial weights were chosen to be sufficiently large to place the RNN in the chaotic regime. It is perhaps unsurprising that larger initial weights lead to better performance in a task such as the one shown in Figure 2, since only in this regime is the network able to autonomously generate rich time-varying signals.
3) In our experience, for local learning rules the variance in the error is large. The variance should be reported – not just the mean. In addition, more than 5 networks should be used to compute the error.
Following the reviewers’ suggestion, the number of networks used for the results in Figure 2 has been increased from 5 to 9.
Regarding the suggestion to report variance, the author’s opinion is that indicating the percentiles gives a clearer idea of the variability than standard deviation in logarithmic plots spanning many orders of magnitude, such as those shown in Figure 2. Hence, this format has been maintained in Figure 2 and the new supplemental figures related to it.
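The point about logarithmic plots can be illustrated with a small numerical sketch. The nine error values below are hypothetical (they are not results from the paper) and are chosen only to span several orders of magnitude, as the errors in Figure 2 do. With such data, a band of one standard deviation below the mean can extend to negative values, which cannot be drawn on a logarithmic axis, whereas percentiles always lie within the range of the data.

```python
import numpy as np

# Hypothetical final errors from 9 networks, spread over four orders of magnitude.
errors = 10.0 ** np.linspace(-4, 0, 9)

mean, std = errors.mean(), errors.std()
p25, median, p75 = np.percentile(errors, [25, 50, 75])

lower_std_band = mean - std  # lower edge of a mean +/- std band
print(f"mean - std = {lower_std_band:.3f}")
print(f"25th to 75th percentile band = [{p25:.3g}, {p75:.3g}], median = {median:.3g}")
```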
4) The networks were small (30 in Figure 2; 100 in Figure 3; not sure in Figure 4). What happens when the size increases? We're hesitant to ask for more simulations. However, if simulations with larger networks are not done, you need to be upfront about the fact that this study may not scale well to large networks.
As with other RNN training approaches, performance with RFLO learning generally improves for larger network sizes. Following the reviewers’ suggestion, a supplementary figure to Figure 2 has been added to show this. The reason that small networks have been used in simulations here is that such networks require less time to simulate, allowing for more training trials in a given amount of CPU time. A related point that was perhaps not sufficiently emphasized in the previous version of the manuscript is that the computational complexity of RFLO learning is on par with BPTT (both are $\sim N^{2}$ per timestep, as shown in Appendix 1), and greatly improved compared with RTRL ($\sim N^{4}$ per timestep). Because this fact may be of practical importance for those who wish to implement the algorithm, it has been pointed out at the end of the subsection on Figure 2 (“As for other RNN training methods, performance of the trained RNN generally improves for larger network sizes […]”).
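To give a feel for what these scalings mean at the network sizes used in the paper, the sketch below tabulates leading-order operation counts per timestep. The helper `per_step_cost` is hypothetical and drops all constant factors and lower-order terms; only the powers of $N$ are taken from the text above.

```python
def per_step_cost(n_units: int) -> dict:
    """Leading-order operation counts per timestep for an RNN with n_units recurrent units."""
    return {
        # RFLO: one eligibility-trace entry per recurrent synapse, each updated in O(1).
        "RFLO": n_units ** 2,
        # BPTT: one matrix-vector product with the recurrent weights per timestep.
        "BPTT": n_units ** 2,
        # RTRL: N^3 sensitivities dh_a/dW_bc, each updated by a sum over N units.
        "RTRL": n_units ** 4,
    }

for n in (30, 100):  # network sizes mentioned for Figures 2 and 3
    cost = per_step_cost(n)
    print(n, cost, "RTRL/RFLO ratio:", cost["RTRL"] // cost["RFLO"])
```

The ratio between RTRL and RFLO grows as $N^{2}$, so the gap widens quickly as the network gets larger.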
5) (Very important!) The Materials and methods section should contain all simulation details. As far as we can tell, some are missing: the initial values of the weights, the learning rates (after the grid search), explicit forms for the target functions in Figure 2, and the time step. And we may have missed other details; you should make sure that there's enough information that the simulations can be replicated. It's true that the code is supplied, but not all of us like to read code.
A section has been added to the Materials and methods explaining all simulation details.
6) You should also discuss Hoerzer et al. (2012, Cerebral cortex, 24(3), 677690). This is an example of node perturbation without noise; it's instead based on overall performance relative to a running average. In that paper they train only the output weights, not the recurrent weights as is done here. However, if the network can do better than a training paradigm involving recurrent weights, it's worth mentioning. (suggestion) We would even go so far as to suggest comparing performance when training only the output weight against performance when training the recurrent weights.
The learning rule from the paper by Hoerzer appears to be a minor modification of that from Legenstein et al (2010), which was already discussed in the Discussion section. (Specifically, the learning rule in Legenstein is proportional to R − R̄, where R is reward and R̄ is recent average reward, whereas Hoerzer uses just the sign of this value.) A citation to Hoerzer has been added to the updated manuscript, though there doesn’t seem to be a need for additional discussion beyond what is already said about the Legenstein paper.
The suggestion to compare performance of an RNN in which only readout weights are trained with an RNN in which both recurrent and readout weights are trained is a good one. Following this suggestion, a supplementary figure has been added to Figure 2 showing that performance of an RNN in which both recurrent and readout weights are trained is better than that of an RNN in which only recurrent or only readout weights are trained. A paragraph discussing this has also been added to the main text (“It is also worthwhile to consider the relative contributions of the two types of learning in Figure 2, namely the learning of recurrent and of readout weights […]”)
7) The shaded regions in Figure 2 are only explained in the caption for panel D. It would be helpful to explain them in panel A as well.
The shaded regions have been explained in panel A.
8) The symbol 'i' is used to index over neurons in equation 1 (top), over outputs in the lower equation of equation 1 (and thus over the outputs in equation 2) and used to refer to the learning rates in the first sentence after equation 3. Given how technical the indexing can become, we would strongly suggest reserving 'i' (as you have for 'j','a', and 'b') for neurons only.
The indexing changes have been made following the reviewers’ suggestion.
9) The gray line in Figure 2B is lost in the text describing what each color represents. It should be enlarged and made more prominent.
The gray line has been made darker and thicker.
10) It would be helpful to add a more intuitive plotting convention for Figure 2C (such as a color gradient as τ gets longer).
The plot has been redrawn using a color gradient, as suggested by the reviewer.
11) (suggestion) In general, presenting the RFLO+Dale results in Figure 2 can be distracting from the main point (and the author doesn't treat it extensively in the text). We would suggest moving the Dale's Law results to a supplement to Figure 2 (see eLife's treatment of figure supplements).
The RFLO+Dale results have been moved to a supplemental figure, as suggested.
12) The last line of text before equation 5 seems to contain incorrect references to equations 14 and 15 of the appendix, which I believe are the same as equations 3 and 4 of the main text.
Thanks to the reviewer for pointing this out. The mistake has been corrected.
13) Figure 2D appears to have a color mismatch between the line and the explanatory text within the figure, for "Local only".
The color mismatch has been fixed.
14) The last sentence of "Interval matching" states that you will return to a point in the following subsection, but it's not clear which subsection you are referring to, or if that thread is ever discussed again later.
In order to make the logic clearer, the sentence has been replaced with the following: “In the following subsection, we shall address this shortcoming by constructing a network in which learned subsequence elements of short duration can be concatenated to form longer-duration sequences.”
15) The algorithm is only run on simple toy problems that are constructed for the manuscript. The experiments hint that RFLO struggles in the context of longer timescales, but the manuscript provides no grounding for how well the algorithm performs on richer data. It is important to see performance of the algorithm gauged on a commonly used problem from the machine learning literature. Various datasets could be used, but the language modeling task on the Penn TreeBank (PTB) is very well explored and serves as a kind of MNIST for sequence modeling. Here it is important to quantify success in a fashion that is congruent with current standards in ML.
To be clear, high-level performance on such a task does not seem crucial. The paper is aimed at biology, and there is no reason to pursue top results. But it is important to be able to situate the obtained performance, and offer a benchmark for subsequent work on biologically plausible algorithms that simultaneously aim to be practical/functional.
Benchmarking the RFLO learning algorithm on a standard machine learning task such as PTB is an excellent idea. The main problem with this, though, is that training on such a task requires the use of a normalizing softmax on the RNN outputs together with a cross-entropy loss function (rather than mean squared error). In this case, additional nonlocal terms arise in the gradient-descent learning rule, forcing one to either (i) ignore these and try to use local RFLO anyway, or (ii) use a nonlocal version of RFLO learning that accounts for the different loss function and normalization. The first option leads (unsurprisingly) to terrible performance on PTB, while the second option is outside the scope of this study, since the point of the manuscript is the study of local RNN learning rules. Because of these considerations, results for the PTB task have regretfully not been added to the manuscript. Investigating whether local RNN learning rules can be applied to large-scale categorization tasks such as PTB could be an interesting question for future work, but it’s not currently obvious how this could be done.
Unfortunately, there don’t seem to be any similarly universal RNN benchmarking tasks of a sort that might provide a good test of RFLO learning. Presumably this is why other recent works on local RNN learning (e.g. Miconi, eLife 2017; Gilra and Gerstner, eLife 2018) haven’t applied their algorithms to a standard battery of tasks. Establishing such a battery would obviously be very useful for the field, but is unfortunately beyond the scope of the present work.
16) (weak suggestion, since this will be tough, and possibly beyond the scope of the work, but it would be nice if it could be done) Along these lines, and in the context of an externally defined problem, it would be ideal to see the performance of the algorithm explored using an LSTM architecture. Basic RNN performance tends to be quite poor relative to LSTMs across many tasks. Is it easy to adapt the RFLO algorithm to more complex architectures, and does doing so deliver better performance on any task? The answer may simply be no; if so, that's OK.
The theory has so far not been applied to LSTMs due to concerns about the biological plausibility of LSTM architectures in the first place, regardless of the learning rule used. It is possible to derive local LSTM learning rules in a similar manner to RFLO learning, i.e. by dropping nonlocal terms that appear in the loss function gradient. The issue hasn’t been explored further in simulations, though, since the aim of the present paper is to develop biologically plausible learning rules for recurrent networks, not to study network architectures that have no clear basis in biology.
17) (weak suggestion) On this question of architecture: how well does RFLO function in the case that there are multiple 'layers' in the RNN (e.g. as in Deep LSTMs where the connectivity matrix is not all-to-all). Does the algorithm still function about as well, or does increasing the depth of the network slow training?
No significant benefit to using a two-layer architecture vs. a single layer with the same number of parameters was found for the task shown in Figure 2 (p=0.85 for n=9 networks, data not shown). It’s possible that multilayer architectures could be more advantageous for more challenging tasks, for example tasks with compositional structure, such as the reach sequence task shown in Figure 4. Investigating whether this is the case, in addition to developing theoretical understanding of why multilayer RNN architectures might be advantageous and investigating how they might be implemented in the brain (most obviously in pre- and primary motor cortex), is a fascinating direction for further study. Because these are big questions that would require an entire independent project to address in a satisfactory way, however, the author, with the editor’s permission, would prefer to defer this as future work.
18) More could be done to emphasize that the tasks solved go well beyond what is solvable using a feedforward network and feedback alignment. This is implicit in some of the tasks, but guiding the reader to see this clearly and conclusively would be ideal.
Following the reviewers’ suggestion, the following text has been added to the beginning of the Results section: “These tasks require an RNN to produce sequences of output values and/or delayed responses to an input to the RNN, and hence are beyond the capabilities of feedforward networks.”
19) In the original manuscript describing feedback alignment, convergence of error is proved under some very restrictive conditions. Please describe briefly, in the main text, how the theoretical results developed here are related to the original FA results. e.g. To what extent can they be seen as a generalization of those results? What assumptions are made differently or in addition to the FA results?
The mathematical results in Appendix 2 share some similarities with the FA results. Both approaches linearize the network, and the statistical average over RNN state vectors is similar to Lillicrap’s average over inputs. The result is not a straightforward extension of the Lillicrap result for a one-hidden-layer network, however, since the retaining of state information from one timestep to the next in our case makes it impossible to directly apply the feedforward results to an RNN “unrolled in time”. Specifically, the fact that the update to the recurrent weight matrix changes the RNN state vector trajectory makes the case considered here somewhat trickier. A few sentences about this have been added near the end of the first subsection in Results (“The mathematical approach for showing that alignment between readout and feedback weights occurs is similar to that used previously in the feedforward case […]”).
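As a rough illustration of what "alignment between readout and feedback weights" means in practice, the sketch below computes a normalized overlap (cosine similarity) between the transposed readout matrix and the random feedback matrix. The function name `alignment` and the choice of cosine similarity as the measure are assumptions made for illustration; the exact quantity analyzed in Appendix 2 may be defined differently.

```python
import numpy as np

def alignment(W_out, B):
    """Cosine similarity between readout weights and feedback weights.

    Assumed shapes: W_out (N_out, N), B (N, N_out). Returns a value in
    [-1, 1]; 1 means B is proportional to W_out transposed, and values
    near 0 mean the two matrices are essentially unrelated.
    """
    a = W_out.T.ravel()
    b = B.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Tracking such a quantity over training trials is one simple way to check whether the readout weights drift toward the fixed feedback weights, under the assumptions stated above.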
20) The manuscript briefly makes connections to the FORCE training method, but my feeling is that this currently doesn't go far enough. A few more sentences that summon more of the details of the model/algorithm from Sussillo and delineate the connections to RFLO would be useful to the reader.
A short paragraph on this topic has been added to the Discussion section (“In addition to the gradient-based approaches (RTRL and BPTT) already discussed above, another widely used algorithm for training RNNs is FORCE learning […]”.)
21) (suggestion) The section on 'Learning a sequence of actions' is interesting, but currently feels ad hoc. The section almost feels more like a long discussion point than something that ought to sit in the results. The message is, I believe, that: RFLO and similar algorithms may suffer relative to ML approaches such as BPTT on longer time scales, but this is ok because there are ways to rescue performance for long sequences. In particular, winner-take-all and reward-modulated Hebbian learning rules are introduced along with additional sets of neurons to rescue performance on a movement sequence task. There is a brief attempt to relate these to the literature on structures that connect to motor cortex, but this feels rushed. Ideally, the manuscript would develop this section further so that it can be appreciated both with respect to biology and ML. Additionally, on this note, it would be good to see the performance of BPTT on the full sequence problem without these ad hoc approaches. There is certainly a limit to what BPTT can do: how much does it struggle with this situation? Is the long sequence one that BPTT also struggles with? This would provide grounding for where RFLO and these additional ideas sit with respect to BPTT training in this more interesting case.
The reviewers’ suggestion to compare the performance of RFLO+subcortical loop with BPTT has been addressed with a new subpanel in Figure 4. This subpanel shows that, when the number of training trials is held constant, RFLO learning with the loop architecture outperforms not only RFLO learning without the loop architecture, but even outperforms BPTT.
22) There are a couple of citations that might be useful to include. For example, a mention of spiking variants of feedback alignment (e.g. "Deep learning with dynamic spiking neurons and fixed feedback weights"), and several recent works on approximations of RTRL in the ML literature that seem worth mentioning (e.g. "Unbiased online recurrent optimization", and "Approximating Real-Time Recurrent Learning with Random Kronecker Factors").
Thanks to the reviewers for pointing these references out. All three have been added to the first subsection of the Results section.
[Editors' note: further revisions were requested prior to acceptance, as described below.]
Essential revisions:
1) The treatment of Hoerzer et al. (a paper where you trained only the output weights) needs to be expanded. Our question last time was whether performance when only the output weights are trained could match performance when recurrent weights are trained. To address this, you trained only the output weights. However, as far as we could tell, there was no feedback. For a fair comparison, feedback weights are critical. Or at the very least, you should make sure you have a good weight initialization. (It is well known in the echo-state literature that when training only the output weights, the initialization of the other weight matrices is very important. The literature on echo-state networks contains lots of advice on how to ensure that these are set appropriately.) It would be nice if you included feedback weights and reran the simulations. That's not absolutely necessary, but if it's not done, then you will have to point out that you can't rule out the possibility that training the output weights actually works better than your approach of training the recurrent weights.
The simulation in which only the readout weights were trained did in fact include feedback of the readout, and the weight initialization is typical of what is used by Hoerzer et al. and in much of the echo state literature. This is already described in the Materials and methods section, but the reviewer is certainly correct that it should also be mentioned in the main text. A parenthetical note along these lines has therefore been added to the main text [“(with the readout fed back as an input to the RNN for stability – see Materials and methods)”].
2) It would be good to include your response to concern 15 (attempting RFLO learning on more complicated problems) in the Discussion of the submitted manuscript. Your response at present is satisfactory, but the insight you share into why extending RFLO learning to more complex problems is itself interesting, and will likely be interesting to readers of the paper.
The first paragraph of the Discussion section has been extended (“Another promising area for future work…”) to highlight two promising avenues for future work: (i) stacked RNN architectures, and (ii) the topic that the reviewer suggested above, namely the possible application of local learning to discrete problems using cross-entropy loss functions and softmax normalization.
https://doi.org/10.7554/eLife.43299.015
Article and author information
Author details
Funding
National Institutes of Health (DP5 OD019897)
 James M Murray
National Science Foundation (DBI-1707398)
 James M Murray
Gatsby Charitable Foundation
 James M Murray
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
The author is grateful to LF Abbott, GS Escola, and A Litwin-Kumar for helpful discussions and feedback on the manuscript. Support for this work was provided by the National Science Foundation NeuroNex program (DBI-1707398), the National Institutes of Health (DP5 OD019897), and the Gatsby Charitable Foundation.
Senior Editor
 Michael J Frank, Brown University, United States
Reviewing Editor
 Peter Latham, University College London, United Kingdom
Reviewer
 Brian DePasquale, Princeton University, United States
Publication history
 Received: November 1, 2018
 Accepted: May 23, 2019
 Accepted Manuscript published: May 24, 2019 (version 1)
 Accepted Manuscript updated: May 31, 2019 (version 2)
 Version of Record published: June 12, 2019 (version 3)
Copyright
© 2019, Murray
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
 Page views: 1,130
 Downloads: 218
 Citations: 0
Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.