Rules and mechanisms for efficient twostage learning in neural circuits
Abstract
Trialanderror learning requires evaluating variable actions and reinforcing successful variants. In songbirds, vocal exploration is induced by LMAN, the output of a basal gangliarelated circuit that also contributes a corrective bias to the vocal output. This bias is gradually consolidated in RA, a motor cortex analogue downstream of LMAN. We develop a new model of such twostage learning. Using stochastic gradient descent, we derive how the activity in ‘tutor’ circuits (e.g., LMAN) should match plasticity mechanisms in ‘student’ circuits (e.g., RA) to achieve efficient learning. We further describe a reinforcement learning framework through which the tutor can build its teaching signal. We show that mismatches between the tutor signal and the plasticity mechanism can impair learning. Applied to birdsong, our results predict the temporal structure of the corrective bias from LMAN given a plasticity rule in RA. Our framework can be applied predictively to other paired brain areas showing twostage learning.
https://doi.org/10.7554/eLife.20944.001Introduction
Twostage learning has been described in a variety of different contexts and neural circuits. During hippocampal memory consolidation, recent memories, that are dependent on the hippocampus, are transferred to the neocortex for longterm storage (Frankland and Bontempi, 2005). Similarly, the rat motor cortex provides essential input to subcortical circuits during skill learning, but then becomes dispensable for executing certain skills (Kawai et al., 2015). A paradigmatic example of twostage learning occurs in songbirds learning their courtship songs (Andalman and Fee, 2009; Turner and Desmurget, 2010; Warren et al., 2011). Zebra finches, commonly used in birdsong research, learn their song from their fathers as juveniles, and keep the same song for life (Immelmann, 1969).
The birdsong circuit has been extensively studied; see Figure 1A for an outline. Area HVC is a timebase circuit, with projection neurons that fire sparse spike bursts in precise synchrony with the song (Hahnloser et al., 2002; Lynch et al., 2016; Picardo et al., 2016). A population of neurons from HVC projects to the robust nucleus of the arcopallium (RA), a premotor area, which then projects to motor neurons controlling respiratory and syringeal muscles (Leonardo and Fee, 2005; Simpson and Vicario, 1990; Yu and Margoliash, 1996). A second input to RA comes from the lateral magnocellular nucleus of the anterior nidopallium (LMAN). Unlike HVC and RA activity patterns, LMAN spiking is highly variable across different renditions of the song (Kao et al., 2008; Ölveczky et al., 2005). LMAN is the output of the anterior forebrain pathway, a circuit involving the songspecialized basal ganglia (Perkel, 2004).
Because of the variability in its activity patterns, it was thought that LMAN’s role was simply to inject variability into the song (Ölveczky et al., 2005). The resulting vocal experimentation would enable reinforcementbased learning. For this reason, prior models tended to treat LMAN as a pure Poisson noise generator, and assume that a reward signal is received directly in RA (Fiete et al., 2007). More recent evidence, however, suggests that the reward signal reaches Area X, the songspecialized basal ganglia, rather than RA (Gadagkar et al., 2016; Hoffmann et al., 2016; Kubikova et al., 2010). Taken together with the fact that LMAN firing patterns are not uniformly random, but rather contain a corrective bias guiding plasticity in RA (Andalman and Fee, 2009; Warren et al., 2011), this suggests that we should rethink our models of song acquisition.
Here we build a general model of twostage learning where one neural circuit ‘tutors’ another. We develop a formalism for determining how the teaching signal should be adapted to a specific plasticity rule, to best instruct a student circuit to improve its performance at each learning step. We develop analytical results in a ratebased model, and show through simulations that the general findings carry over to realistic spiking neurons. Applied to the vocal control circuit of songbirds, our model reproduces the observed changes in the spiking statistics of RA neurons as juvenile birds learn their song. Our framework also predicts how the LMAN signal should be adapted to properties of RA synapses. This prediction can be tested in future experiments.
Our approach separates the mechanistic question of how learning is implemented from what the resulting learning rules are. We nevertheless demonstrate that a simple reinforcement learning algorithm suffices to implement the learning rule we propose. Our framework makes general predictions for how instructive signals are matched to plasticity rules whenever information is transferred between different brain regions.
Results
Model
We considered a model for information transfer that is composed of three subcircuits: a conductor, a student, and a tutor (see Figure 1B). The conductor provides input to the student in the form of temporally precise patterns. The goal of learning is for the student to convert this input to a predefined output pattern. The tutor provides a signal that guides plasticity at the conductor–student synapses. For simplicity, we assumed that the conductor always presents the input patterns in the same order, and without repetitions. This allowed us to use the time $t$ to label input patterns, making it easier to analyze the online learning rules that we studied. This model of learning is based on the logic implemented by the vocal circuits of the songbird (Figure 1A). Relating this to the songbird, the conductor is HVC, the student is RA, and the tutor is LMAN. The song can be viewed as a mapping between clocklike HVC activity patterns and musclerelated RA outputs. The goal of learning is to find a mapping that reproduces the tutor song.
Birdsong provides interesting insights into the role of variability in tutor signals. If we focus solely on information transfer, the tutor output need not be variable; it can deterministically provide the best instructive signal to guide the student. This, however, would require the tutor to have a detailed model of the student. More realistically, the tutor might only have access to a scalar representation of how successful the student rendition of the desired output is, perhaps in the form of a reward signal. A tutor in this case has to solve the socalled ‘credit assignment problem’—it needs to identify which student neurons are responsible for the reward. A standard way to achieve this is to inject variability in the student output and reinforce the firing of neurons that precede reward (see for example (Fiete et al., 2007) in the birdsong context). Thus, in our model, the tutor has a dual role of providing both an instructive signal and variability, as in birdsong.
We described the output of our model using a vector ${y}_{a}(t)$ where $a$ indexed the various output channels (Figure 2A). In the context of motor control $a$ might index the muscle to be controlled, or, more abstractly, different features of the motor output, such as pitch and amplitude in the case of birdsong. The output ${y}_{a}(t)$ was a function of the activity of the student neurons ${s}_{j}(t)$. The student neurons were in turn driven by the activity of the conductor neurons ${c}_{i}(t)$. The student also received tutor signals to guide plasticity; in the songbird, the guiding signals for each RA neuron come from several LMAN neurons (Canady et al., 1988; GarstOrozco et al., 2014; Herrmann and Arnold, 1991). In our model, we summarized the net input from the tutor to the $j$th student neuron as a single function ${g}_{j}(t)$.
We started with a ratebased implementation of the model (Figure 2A) that was analytically tractable but averaged over tutor variability. We further took the neurons to be in a linear operating regime (Figure 2A) away from the threshold and saturation present in real neurons. We then relaxed these conditions and tested our results in spiking networks with initial parameters selected to imitate measured firing patterns in juvenile birds prior to song learning. The student circuit in both the ratebased and spiking models included a global inhibitory signal that helped to suppress excess activity driven by ongoing conductor and tutor input. Such recurrent inhibition is present in area RA of the bird (Spiro et al., 1999). In the spiking model we implemented the suppression as an activitydependent inhibition, while for the analytic calculations we used a constant negative bias for the student neurons.
Learning in a ratebased model
Learning in our model was enabled by plasticity at the conductor–student synapses that was modulated by signals from tutor neurons (Figure 2B). Many different forms of such heterosynaptic plasticity have been observed. For example, in ratebased synaptic plasticity high tutor firing rates lead to synaptic potentiation and low tutor firing rates lead to depression (Chistiakova and Volgushev, 2009; Chistiakova et al., 2014). In timingdependent rules, such as the one recently measured by Mehaffey and Doupe (2015) in slices of zebra finch RA (see Figure 1C), the relative arrival times of spike bursts from different input pathways set the sign of synaptic change. To model learning that lies between these rate and timingbased extremes, we introduced a class of plasticity rules governed by two parameters $\alpha $ and $\beta $ (see also Materials and methods and Figure 2B):
where ${W}_{ij}$ is the weight of the synapse from the $i$th conductor to the $j$th student neuron, $\eta $ is a learning rate, $\theta $ is a threshold on the firing rate of tutor neurons, and ${\tau}_{1}$ and ${\tau}_{2}$ are timescales associated with the plasticity. This is similar to an STDP rule, except that the dependence on postsynaptic activity was replaced by dependence on the input from the tutor. Thus plasticity acts heterosynaptically, with activation of the tutor–student synapse controlling the change in the conductor–student synaptic weight. The timescales ${\tau}_{1}$ and ${\tau}_{2}$, as well as the coefficients $\alpha $ and $\beta $, can be thought of as effective parameters describing the plasticity observed in student neurons. As such, they do not necessarily have a simple correspondence in terms of the biochemistry of the plasticity mechanism, and the framework we describe here is not specifically tied to such an interpretation.
If we set $\alpha $ or $\beta $ to zero in our rule, Equation (1), the sign of the synaptic change is determined solely by the firing rate of the tutor ${g}_{j}(t)$ as compared to a threshold, reproducing the rate rules observed in experiments. When $\alpha /\beta \approx 1$, if the conductor leads the tutor, potentiation occurs, while coincident signals lead to depression (Figure 2B), which mimics the empirical findings from Mehaffey and Doupe (2015). For general $\alpha $ and $\beta $, the sign of plasticity is controlled by both the firing rate of the tutor relative to the baseline, and by the relative timing of tutor and conductor. The overall scale of the parameters $\alpha $ and $\beta $ can be absorbed into the learning rate $\eta $ and so we set $\alpha \beta =1$ in all our simulations without loss of generality (see Materials and methods). Note that if $\alpha $ and $\beta $ are both large, it can be that $\alpha \beta =1$ and $\alpha /\beta \approx 1$ also, as needed to realize the Mehaffey and Doupe (2015) curve.
We can ask how the conductor–student weights ${W}_{ij}$ (Figure 2A) should change in order to best improve the output ${y}_{a}(t)$. We first need a loss function $L$ that quantifies the distance between the current output ${y}_{a}(t)$ and the target ${\overline{y}}_{a}(t)$ (Figure 2C). We used a quadratic loss function, but other choices can also be incorporated into our framework (see Appendix). Learning should change the synaptic weights so that the loss function is minimized, leading to a good rendition of the targeted output. This can be achieved by changing the synaptic weights in the direction of steepest descent of the loss function (Figure 2C).
We used the synaptic plasticity rule from Equation (1) to calculate the overall change of the weights, $\mathrm{\Delta}{W}_{ij}$, over the course of the motor program. This is a function of the time course of the tutor signal, ${g}_{j}(t)$. Not every choice for the tutor signal leads to motor output changes that best improve the match to the target. Imposing the condition that these changes follow the gradient descent procedure described above, we derived the tutor signal that was best matched to the student plasticity rule (detailed derivation in Materials and methods). The result is that the best tutor for driving gradient descent learning must keep track of the motor error
integrated over the recent past
where ${M}_{aj}$ are the weights describing the linear relationship between student activities and motor outputs (Figure 2A) and $\zeta $ is a learning rate. Moreover, for effective learning, the parameter ${\tau}_{\text{tutor}}$ appearing in Equation (3), which quantifies the timescale on which error information is integrated into the tutor signal, should be related to the synaptic plasticity parameters according to
is the optimal timescale for the error integration.
In short, motor learning with a heterosynaptic plasticity rule requires convolving the motor error with a kernel whose timescale is related to the structure of the plasticity rule, but is otherwise independent of the motor program. As explained in more detail in Materials and methods, this result is derived in an approximation that assumes that the tutor signal does not vary significantly over timescales of the order of the student timescales ${\tau}_{1}$ and ${\tau}_{2}$. Given Equation (4), this implies that we are assuming ${\tau}_{\text{tutor}}\gg {\tau}_{1,2}$. This is a reasonable approximation because variations in the tutor signal that are much faster than the student timescales ${\tau}_{1,2}$ have little effect on learning since the plasticity rule (1) blurs conductor inputs over these timescales.
Matched vs. unmatched learning
Our ratebased model predicts that when the timescale on which error information is integrated into the tutor signal (${\tau}_{\text{tutor}}$) is matched to the student plasticity rule as described above, learning will proceed efficiently. A mismatched tutor should slow or disrupt convergence to the desired output. To test this, we numerically simulated the birdsong circuit using the linear model from Figure 2A with a motor output ${y}_{a}$ filtered to more realistically reflect muscle response times (see Materials and methods). We selected plasticity rules as described in Equation (1) and Figure 2B and picked a target output pattern to learn. The target was chosen to resemble recordings of airsac pressure from singing zebra finches in terms of smoothness and characteristic timescales (Veit et al., 2011), but was otherwise arbitrary. In our simulations, the output typically involved two different channels, each with its own target, but for brevity, in figures we typically showed the output from only one of these.
For our analytical calculations, we made a series of assumptions and approximations meant to enhance tractability, such as linearity of the model and a focus on the regime ${\tau}_{\text{tutor}}\gg {\tau}_{1,2}$. These constraints can be lifted in our simulations, and indeed below we test our numerical model in regimes that go beyond the approximations made in our derivation. In many cases, we found that the basic findings regarding tutor–student matching from our analytical model remain true even when some of the assumptions we used to derive it no longer hold.
We tested tutors that were matched or mismatched to the plasticity rule to see how effectively they instructed the student. Figure 3A and online Video 1 show convergence with a matched tutor when the sign of plasticity is determined by the tutor’s firing rate. We see that the student output rapidly converged to the target. Figure 3B and online Video 2 show convergence with a matched tutor when the sign of plasticity is largely determined by the relative timing of the tutor signal and the student output. We see again that the student converged steadily to the desired output, but at a somewhat slower rate than in Figure 3A.
To test the effects of mismatch between tutor and student, we used tutors with timescales that did not match Equation (4). All student plasticity rules had the same effective time constants ${\tau}_{1}$ and ${\tau}_{2}$, but different parameters $\alpha $ and $\beta $ (see Equation 1), subject to the constraint $\alpha \beta =1$ described in the previous section. Different tutors had different memory time scales ${\tau}_{\text{tutor}}$ (Equation 3). Figure 3C and D demonstrate that learning was more rapid for wellmatched tutorstudent pairs (the diagonal neighborhood, where ${\tau}_{\text{tutor}}\approx {\tau}_{\text{tutor}}^{*}$). When the tutor error integration timescale was shorter than the matched value in Equation (4), $\tau}_{\text{tutor}}\text{}\text{}{\tau}_{\text{tutor}}^{\ast$, learning was often completely disrupted (many pairs below the diagonal in Figure 3C and D). When the tutor error integration timescale was longer than the matched value in Equation (4), $\tau}_{\text{tutor}}\text{}\text{}{\tau}_{\text{tutor}}^{\ast$ learning was slowed down. Figure 3C also shows that a certain amount of mismatch between the tutor error integration timescale ${\tau}_{\text{tutor}}$ and the matched timescale ${\tau}_{\text{tutor}}^{*}$ implied by the student plasticity rule is tolerated by the system. Interestingly, the diagonal band over which learning is effective in Figure 3C is roughly of constant width—note that the scale on both axes is logarithmic, so that this means that the tutor error integration timescale ${\tau}_{\text{tutor}}$ has to be within a constant factor of the optimal timescale ${\tau}_{\text{tutor}}^{*}$ for good learning. We also see that the breakdown in learning is more abrupt when $\tau}_{\text{tutor}}\text{}\text{}{\tau}_{\text{tutor}}^{\ast$ than in the opposite regime.
An interesting feature of the results from Figure 3C and D is that the difference in performance between matched and mismatched pairs becomes less pronounced for timescales shorter than about $100\mathrm{ms}$. This is due to the fact that the plasticity rule (Equation 1) implicitly smooths over timescales of the order of ${\tau}_{1,2}$, which in our simulations were equal to ${\tau}_{1}=80\mathrm{ms}$, ${\tau}_{2}=40\mathrm{ms}$. Thus, variations of the tutor signal on shorter timescales have little effect on learning. Using different values for the effective timescales ${\tau}_{1,2}$ describing the plasticity rule can increase or decrease the range of parameters over which learning is robust against tutor–student mismatches (see Appendix).
Robust learning with nonlinearities
In the model above, firing rates for the tutor were allowed to grow as large as necessary to implement the most efficient learning. However, the firing rates of realistic neurons typically saturate at some fixed bound. To test the effects of this nonlinearity in the tutor, we passed the ideal tutor activity (Equation 3) through a sigmoidal nonlinearity,
where $2\rho $ is the range of firing rates. We typically chose $\theta =\rho =80\mathrm{Hz}$ to constrain the rates to the range 0–160 Hz (Ölveczky et al., 2005; GarstOrozco et al., 2014). Learning slowed down with this change (Figure 4A and online Video 3) as a result of the tutor firing rates saturating when the mismatch between the motor output and the target output was large. However, the accuracy of the final rendition was not affected by saturation in the tutor (Figure 4A, inset). An interesting effect occurred when the firing rate constraint was imposed on a matched tutor with a long memory timescale. When this happened and the motor error was large, the tutor signal saturated and stopped growing in relation to the motor error before the end of the motor program. In the extreme case of very long integration timescales, learning became sequential: early features in the output were learned first, before later features were addressed, as in Figure 4B and online Video 4. This is reminiscent of the learning rule described in (Memmesheimer et al., 2014).
Nonlinearities can similarly affect the activities of student neurons. Our model can be readily extended to describe efficient learning even in this case. The key result is that for efficient learning to occur, the synaptic plasticity rule should depend not just on the tutor and conductor, but also on the activity of the postsynaptic student neurons (details in Appendix). Such dependence on postsynaptic activity is commonly seen in experiments (Chistiakova and Volgushev, 2009; Chistiakova et al., 2014).
The relation between student neuron activations ${s}_{j}(t)$ and motor outputs ${y}_{a}(t)$ (Figure 2A) is in general also nonlinear. Compared to the linear assumption that we used, the effect of a monotonic nonlinearity, ${y}_{a}={N}_{a}({\sum}_{j}{M}_{aj}{s}_{j})$, with ${N}_{a}$ an increasing function, is similar to modifying the loss function $L$, and does not significantly change our results (see Appendix). We also checked that imposing a rectification constraint that conductor–student weights ${W}_{ij}$ must be positive does not modify our results either (see Appendix). This shows that our model continues to work with biologically realistic synapses that cannot change sign from excitatory to inhibitory during learning.
Spiking neurons and birdsong
To apply our model to vocal learning in birds, we extended our analysis to networks of spiking neurons. Juvenile songbirds produce a ‘babble’ that converges through learning to an adult song strongly resembling the tutor song. This is reflected in the songaligned spiking patterns in premotor area RA, which become more stereotyped and cluster in shorter, betterdefined bursts as the bird matures (Figure 5A). We tested whether our model could reproduce key statistics of spiking in RA over the course of song learning. In this context, our theory of efficient learning, derived in a ratebased scenario, predicts a specific relation between the teaching signal embedded in LMAN firing patterns, and the plasticity rule implemented in RA. We tested whether these predictions continued to hold in the spiking context.
Following the experiments of Hahnloser et al. (2002), we modeled each neuron in HVC (the conductor) as firing one short, precisely timed burst of 5–6 spikes at a single moment in the motor program. Thus the population of HVC neurons produced a precise timebase for the song. LMAN (tutor) neurons are known to have highly variable firing patterns that facilitate experimentation, but also contain a corrective bias (Andalman and Fee, 2009). Thus we modeled LMAN as producing inhomogeneous Poisson spike trains with a timedependent firing rate given by Equation (5) in our model. Although biologically there are several LMAN neurons projecting to each RA neuron, we again simplified by ‘summing’ the LMAN inputs into a single, effective tutor neuron, similarly to the approach in (Fiete et al., 2007). The LMANRA synapses were modeled in a currentbased approach as a mixture of AMPA and NMDA receptors, following the songbird data (GarstOrozco et al., 2014; Stark and Perkel, 1999). The initial weights for all synapses were tuned to produce RA firing patterns resembling juvenile birds (Ölveczky et al., 2011), subject to constraints from direct measurements in slice recordings (GarstOrozco et al., 2014) (see Materials and methods for details, and Figure 5B for a comparison between neural recordings and spiking in our model). In contrast to the constant inhibitory bias that we used in our ratebased simulations, for the spiking simulations we chose an activitydependent global inhibition for RA neurons. We also tested that a constant bias produced similar results (see Appendix).
Synaptic strength updates followed the same twotimescale dynamics that was used in the ratebased models (Figure 2B). The firing rates ${c}_{i}(t)$ and ${g}_{j}(t)$ that appear in the plasticity equation were calculated in the spiking model by filtering the spike trains from conductor and tutor neurons with exponential kernels. The synaptic weights were constrained to be nonnegative. (See Materials and methods for details.)
As long as the tutor error integration timescale was not too large, learning proceeded effectively when the tutor error integration timescale and the student plasticity rule were matched (see Figure 5C and online Video 5), with mismatches slowing down or abolishing learning, just as in our ratebased study (compare Figure 5D with Figure 3C). The rate of learning and the accuracy of the trained state were lower in the spiking model compared to the ratebased model. The lower accuracy arises because the tutor neurons fire stochastically, unlike the deterministic neurons used in the ratebased simulations. The stochastic nature of the tutor firing also led to a decrease in learning accuracy as the tutor error integration timescale ${\tau}_{\text{tutor}}$ increased (Figure 5D). This happens through two related effects: (1) the signaltonoise ratio in the tutor guiding signal decreases as ${\tau}_{\text{tutor}}$ increases once the tutor error integration timescale is longer than the duration $T$ of the motor program (see Appendix); and (2) the fluctuations in the conductor–student weights lead to some weights getting clamped at 0 due to the positivity constraint, which leads to the motor program overshooting the target (see Appendix). The latter effect can be reduced by either allowing for negative weights, or changing the motor output to a pushpull architecture in which some student neurons enhance the output while others inhibit it. The signaltonoise ratio effect can be attenuated by increasing the gain of the tutor signal, which inhibits early learning, but improves the quality of the guiding signal in the latter stages of the learning process. It is also worth emphasizing that these effects only become relevant once the tutor error integration timescale ${\tau}_{\text{tutor}}$ becomes significantly longer than the duration of the motor program, $T$, which for a birdsong motif would be around 1 s.
Spiking in our model tends to be a little more regular than that in the recordings (compare Figure 5A and Figure 5B). This could be due to sources of noise that are present in the brain which we did not model. One detail that our model does not capture is the fact that many LMAN spikes occur in bursts, while in our simulation LMAN firing is Poisson. Bursts are more likely to produce spikes in downstream RA neurons particularly because of the NMDA dynamics, and thus a bursty LMAN will be more effective at injecting variability into RA (Kojima et al., 2013). Small inaccuracies in aligning the recorded spikes to the song are also likely to contribute apparent variability between renditions in experiments. Indeed, some of the variability in Figure 5A looks like it could be due to time warping and global time shifts that were not fully corrected.
Robust learning with credit assignment errors
The calculation of the tutor output in our rule involved estimating the motor error ${\u03f5}_{j}$ from Equation (2). This required knowledge of the assignment between student activities and motor output, which in our model was represented by the matrix ${M}_{aj}$ (Figure 2A). In our simulations, we typically chose an assignment in which each student neuron contributed to a single output channel, mimicking the empirical findings for neurons in bird RA. Mathematically, this implies that each column of ${M}_{aj}$ contained a single nonzero element. In Figure 6A, we show what happened in the ratebased model when the tutor incorrectly assigned a certain fraction of the neurons to the wrong output. Specifically, we considered two output channels, ${y}_{1}$ and ${y}_{2}$, with half of the student neurons contributing only to ${y}_{1}$ and the other half contributing only to ${y}_{2}$. We then scrambled a fraction $\rho $ of this assignment when calculating the motor error, so that the tutor effectively had an imperfect knowledge of the student–output relation. Figure 6A shows that learning is robust to this kind of misassignment even for fairly large values of the error fraction $\rho $ up to about 40%, but quickly deteriorates as this fraction approaches 50%.
Due to environmental factors that affect development of different individuals in different ways, it is unlikely that the student–output mapping can be innate. As such, the tutor circuit must learn the mapping. Indeed, it is known that LMAN in the bird receives an indirect evaluation signal via Area X, which might be used to effect this learning (Andalman and Fee, 2009; Gadagkar et al., 2016; Hoffmann et al., 2016; Kubikova et al., 2010). One way in which this can be achieved is through a reinforcement paradigm. We thus considered a learning rule where the tutor circuit receives a reward signal that enables it to infer the student–output mapping. In general the output of the tutor circuit should depend on an integral of the motor error, as in Equation (3), to best instruct the student. For simplicity, we start with the memoryless case, ${\tau}_{\text{tutor}}=0$, in which only the instantaneous value of the motor error is reflected in the tutor signal; we then show how to generalize this for ${\tau}_{\text{tutor}}\text{}\text{}0$.
As before, we took the tutor neurons to fire Poisson spikes with timedependent rates ${f}_{j}(t)$, which were initialized arbitrarily. Because of stochastic fluctuations, the actual tutor activity on any given trial, ${g}_{j}(t)$, differs somewhat from the average, ${\overline{g}}_{j}(t)$. Denoting the difference by ${\xi}_{j}(t)={g}_{j}(t){\overline{g}}_{j}(t)$, the update rule for the tutor firing rates was given by
where ${\eta}_{\text{tutor}}$ is a learning rate, $R(t)$ is the instantaneous reward signal, and $\overline{R}$ is its average over recent renditions of the motor program. In our implementation, $\overline{R}$ is obtained by convolving $R(t)$ with an exponential kernel (timescale = 1 s). The reward $R({t}_{\text{max}})$ at the end of one rendition becomes the baseline at the start of the next rendition $R(0)$. The baseline ${\overline{g}}_{j}(t)$ of the tutor activity is calculated by averaging over recent renditions of the song with exponentially decaying weights (one $e$fold of decay for every five renditions). Further implementation details are available in our code at https://github.com/ttesileanu/twostagelearning (Teşileanu, 2016) (with a copy archived at https://github.com/elifesciencespublications/twostagelearning).
The intuition behind this rule is that, whenever a fluctuation in the tutor activity leads to betterthanaverage reward ($R(t)\text{}\text{}\overline{R}$), the tutor firing rate changes in the direction of the fluctuation for subsequent trials, ‘freezing in’ the improvement. Conversely, the firing rate moves away from the directions in which fluctuations tend to reduce the reward.
To test our learning rule, we ran simulations using this reinforcement strategy and found that learning again converges to an accurate rendition of the target output (Figure 6B, inset and online Video 6). The number of repetitions needed for training is greatly increased compared to the case in which the credit assignment is assumed known by the tutor circuit (compare Figure 6B to Figure 5C). This is because the tutor needs to use many training rounds for experimentation before it can guide conductor–student plasticity. The rate of learning in our model is similar to the songbird (i.e., order $\mathrm{10\hspace{0.17em}000}$ repetitions for learning, given that a zebra finch typically sings about 1000 repetitions of its song each day, and takes about one month to fully develop adult song).
Because of the extra training time needed for the tutor to adapt its signal, the motor output in our rewardbased simulations tends to initially overshoot the target (leading to the kink in the error at around 2000 repetitions in Figure 6B). Interestingly, the subsequent reduction in output that leads to convergence of the motor program, combined with the positivity constraint on the synaptic strengths, leads to many conductor–student connections being pruned (Figure 6D). This mirrors experiments on songbirds, where the number of connections between HVC and RA first increases with learning and then decreases (GarstOrozco et al., 2014).
The reinforcement rule described above responds only to instantaneous values of the reward signal and tutor firing rate fluctuations. In general, effective learning requires that the tutor keep a memory trace of its activity over a timescale ${\tau}_{\text{tutor}}\text{}\text{}0$, as in Equation (4). To achieve this in the reinforcement paradigm, we can use a simple generalization of Equation (6) where the update rule is filtered over the tutor memory timescale:
We tested that this rule leads to effective learning when paired with the corresponding student, i.e., one for which Equation (4) is obeyed (Figure 6C and online Video 7).
The reinforcement rules proposed here are related to the learning rules from (Fiete and Seung, 2006; Fiete et al., 2007) and (Farries and Fairhall, 2007). However, those models focused on learning in a single pass, instead of the twostage architecture that we studied. In particular, in Fiete et al. (2007), area LMAN was assumed to generate pure Poisson noise and reinforcement learning took place at the HVC–RA synapses. In our model, which is in better agreement with recent evidence regarding the roles of RA and LMAN in birdsong (Andalman and Fee, 2009), reinforcement learning first takes place in the anterior forebrain pathway (AFP), for which LMAN is the output. A rewardindependent heterosynaptic plasticity rule then solidifies the information in RA.
In our simulations, tutor neurons fire Poisson spikes with specific timedependent rates which change during learning. The timecourse of the firing rates in each repetition must then be stored somewhere in the brain. In fact, in the songbird, there are indirect projections from HVC to LMAN, going through the basal ganglia (Area X) and the dorsolateral division of the medial thalamus (DLM) in the anterior forebrain pathway (Figure 1A) (Perkel, 2004). These synapses could store the required timedependence of the tutor firing rates. In addition, the same synapses can provide the timebase input that would ensure synchrony between LMAN firing and RA output, as necessary for learning. Our reinforcement learning rule for the tutor area, Equation (6), can be viewed as an effective model for plasticity in the projections between HVC, Area X, DLM, and LMAN, as in Fee and Goldberg (2011). In this picture, the indirect HVC–LMAN connections behave somewhat like the ‘hedonistic synapses’ from Seung (2003), though we use a simpler synaptic model here. Implementing the integral from Equation (7) would require further recurrent circuitry in LMAN which is beyond the scope of this paper, but would be interesting to investigate in future work.
Discussion
We built a twostage model of learning in which one area (the student) learns to generate a patterned motor output under guidance from a tutor area. This architecture is inspired by the song system of zebra finches, where area LMAN provides a corrective bias to the song that is then consolidated in the HVC–RA synapses. Using an approach rooted in the efficient coding literature, we showed analytically that, in a simple model, the tutor output that is most likely to lead to effective learning by the student involves an integral over the recent magnitude of the motor error. We found that efficiency requires that the timescale for this integral should be related to the synaptic plasticity rule used by the student. Using simulations, we tested our findings in more general settings. In particular, we demonstrated that tutorstudent matching is important for learning in a spikingneuron model constructed to reproduce spiking patterns similar to those measured in zebra finches. Learning in this model changes the spiking statistics of student neurons in realistic ways, for example, by producing more bursty, stereotyped firing events as learning progresses. Finally, we showed how the tutor can build its errorcorrecting signal by means of reinforcement learning.
If the birdsong system supports efficient learning, our model can predict the temporal structure of the firing patterns of RAprojecting LMAN neurons, given the plasticity rule implemented at the HVC–RA synapses. These predictions can be directly tested by recordings from LMAN neurons in singing birds, assuming that a good measure of motor error is available, and that we can estimate how the neurons contribute to this error. Moreover, recordings from a tutor circuit, such as LMAN, could be combined with a measure of motor error to infer the plasticity rule in a downstream student circuit, such as RA. This could be compared with direct measurements of the plasticity rule obtained in slice. Conversely, knowledge of the student plasticity rule could be used to predict the timedependence of tutor firing rates. According to our model, the firing rate should reflect the integral of the motor error with the timescale predicted by the model. A different approach would be to artificially tutor RA by stimulating LMAN neurons electrically or optogenetically. We would predict that if the tutor signal is delivered appropriately (e.g., in conjunction with a particular syllable [Tumer and Brainard, 2007]), then the premotor bias produced by the stimulation should become incorporated into the motor pathway faster when the timescale of the artificial LMAN signal is properly matched to the RA synaptic plasticity rule.
Our model can be applied more generally to other systems in the brain exhibiting twostage learning, such as motor learning in mammals. If the plasticity mechanisms in these systems are different from those in songbirds, our predictions for the structure of the guiding signal will vary correspondingly. This would allow a further test of our model of ‘efficient learning’ in the brain. It is worth pointing out that our model was derived assuming a certain hierarchy among the timescales that model the student plasticity and the tutor signal. A mismatch between the model predictions and observations could also imply a breakdown of these approximations, rather than failure of the hypothesis that the particular system under study evolved to support efficient learning. Of course our analysis could be extended by relaxing these assumptions, for example by keeping more terms in the Taylor expansion that we used in our derivation of the matched tutor signal.
Applied to birdsong, our model is best seen as a mechanism for learning song syllables. The ordering of syllables in song motifs seems to have a second level of control within HVC and perhaps beyond (Basista et al., 2014; Hamaguchi et al., 2016). Songs can also be distorted by warping their timebase through changes in HVC firing without alterations of the HVC–RA connectivity (Ali et al., 2013). In view of these phenomena, it would be interesting to incorporate our model into a larger hierarchical framework in which the sequencing and temporal structure of the syllables are also learned. A model of transitions between syllables can be found in Doya and Sejnowski (2000), where the authors use a ‘weight perturbation’ optimization scheme in which each HVC–RA synaptic weight is perturbed individually. We did not follow this approach because there is no plausible mechanism for LMAN to provide separate guidance to each HVC–RA synapse; in particular, there are not enough LMAN neurons (Fiete et al., 2007).
In this paper we assumed a twostage architecture for learning, inspired by birdsong. An interesting question is whether and under what conditions such an architecture is more effective than a singlestep model. Possibly, having two stages is better when a single tutor area is responsible for training several different dedicated controllers, as is likely the case in motor learning. It would then be beneficial to have an area that can learn arbitrary behaviors, perhaps at the cost of using more resources and having slower reaction times, along with the ability to transfer these behaviors into lowlevel circuitry that is only capable of producing stereotyped motor programs. The question then arises whether having more than two levels in this hierarchy could be useful, what the other levels might do, and whether such hierarchical learning systems are implemented in the brain.
Materials and methods
Equations for ratebased model
Request a detailed protocolThe basic equations we used for describing our ratebased model (Figure 2A) are the following:
In simulations, we further filtered the output using an exponential kernel,
with a timescale ${\tau}_{\text{out}}$ that we typically set to 25 ms. The smoothing produces more realistic outputs by mimicking the relatively slow reaction time of real muscles, and stabilizes learning by filtering out highfrequency components of the motor output. The latter interfere with learning because of the delay between the effect of conductor activity on synaptic strengths vs. motor output. This delay is of the order ${\tau}_{1,2}{\tau}_{\text{out}}$ (see the plasticity rule below).
The conductor activity in the ratebased model is modeled after songbird HVC (Hahnloser et al., 2002): each neuron fires a single burst during the motor program. Each burst corresponds to a sharp increase of the firing rate ${c}_{i}(t)$ from 0 to a constant value, and then a decrease $10\mathrm{ms}$ later. The activities of the different neurons are spread out to tile the whole duration of the output program. Other choices for the conductor activity also work, provided no patterns are repeated (see Appendix).
Mathematical description of plasticity rule
Request a detailed protocolIn our model the rate of change of the synaptic weights obeys a rule that depends on a filtered version of the conductor signal (see Figure 2B). This is expressed mathematically as
where $\eta $ is a learning rate and ${\stackrel{~}{c}}_{i}=K*{c}_{i}$, with the star representing convolution and $K$ being a filtering kernel. We considered a linear combination of two exponential kernels with timescales ${\tau}_{1}$ and ${\tau}_{2}$,
with ${K}_{i}(t)$ given by
Different choices for the kernels give similar results (see Appendix). The overall scale of $\alpha $ and $\beta $ can be absorbed into the learning rate $\eta $ in Equation (10). In our simulations, we fix $\alpha \beta =1$ and keep the learning rate constant as we change the plasticity rule (see Equation 3).
In the spiking simulations with and without reinforcement learning in the tutor circuit, the firing rates ${c}_{i}(t)$ and ${g}_{j}(t)$ were estimated by filtering spike trains with exponential kernels whose timescales were in the range $5\mathrm{ms}$–$40\mathrm{ms}$. The reinforcement studies typically required longer timescales for stability, possibly because of delays between conductor activity and reward signals.
Derivation of the matching tutor signal
Request a detailed protocolTo find the tutor signal that provides the most effective teaching for the student, we first calculate how much synaptic weights change according to our plasticity rule, Equation (10). Then we require that this change matches the gradient descent direction. We have
Because of the linearity assumptions in our model, it is sufficient to focus on a case in which each conductor neuron, $i$, fires a single short burst, at a time ${t}_{i}$. We write this as ${c}_{i}(t)=\delta (t{t}_{i})$, and so
where we used the definition of ${\stackrel{~}{c}}_{i}(t)$. If the time constants ${\tau}_{1}$, ${\tau}_{2}$ are short compared to the timescale on which the tutor input ${g}_{j}(t)$ varies, only the values of ${g}_{j}(t)$ around time ${t}_{i}$ will contribute to the integral. If we further assume that $T\gg {t}_{i}$, we can use a Taylor expansion of ${g}_{j}(t)$ around $t={t}_{i}$ to perform the calculation:
Doing the integrals involving the exponential kernels ${K}_{1}$ and ${K}_{2}$, we get
We would like this synaptic change to optimally reduce a measure of mismatch between the output and the desired target as measured by a loss function. A generic smooth loss function $L({y}_{a}(t),{\overline{y}}_{a}(t))$ can be quadratically approximated when ${y}_{a}$ is sufficiently close to the target ${\overline{y}}_{a}(t)$. With this in mind, we consider a quadratic loss
The loss function would decrease monotonically during learning if synaptic weights changed in proportion to the negative gradient of $L$:
where $\gamma $ is a learning rate. This implies
Using again ${c}_{i}(t)=\delta (t{t}_{i})$, we obtain
where we used the notation from Equation (2) for the motor error at student neuron $j$.
We now set Equations (16) and (20) equal to each other. If the conductor fires densely in time, we need the equality to hold for all times, and we thus get a differential equation for the tutor signal ${g}_{j}(t)$. This identifies the tutor signal that leads to gradient descent learning as a function of the motor error ${\u03f5}_{j}(t)$, Equation (3) (with the notation $\zeta =\gamma /\eta $).
Spiking simulations
Request a detailed protocolWe used spiking models that were based on leaky integrateandfire neurons with currentbased dynamics for the synaptic inputs. The magnitude of synaptic potentials generated by the conductor–student synapses was independent of the membrane potential, approximating AMPA receptor dynamics, while the synaptic inputs from the tutor to the student were based on a mixture of AMPA and NMDA dynamics. Specifically, the equations describing the dynamics of the spiking model were:
Here ${V}_{j}$ is the membrane potential of the ${j}^{\text{th}}$ student neuron and ${V}_{R}$ is the resting potential, as well as the potential to which the membrane was reset after a spike. Spikes were registered whenever the membrane potential went above a threshold ${V}_{\text{th}}$, after which a refractory period ${\tau}_{\text{ref}}$ ensued. Apart from excitatory AMPA and NMDA inputs modeled by the ${I}_{j}^{\text{AMPA}}$ and ${I}_{j}^{\text{NMDA}}$ variables in our model, we also included a global inhibitory signal ${V}_{\text{inh}}$ which is proportional to the overall activity of student neurons averaged over a timescale ${\tau}_{\text{inh}}$. The averaging is performed using the auxiliary variables ${S}_{j}$ which are convolutions of student spike trains with an exponential kernel. These can be thought of as a simple model for the activities of inhibitory interneurons in the student.
Table 1 gives the values of the parameters we used in the simulations. These values were chosen to match the firing statistics of neurons in bird RA, as described below.
The voltage dynamics for conductor and tutor neurons was not simulated explicitly. Instead, each conductor neuron was assumed to fire a burst at a fixed time during the simulation. The onset of each burst had additive timing jitter of $\pm 0.3\mathrm{ms}$ and each spike in the burst had a jitter of $\pm 0.2\mathrm{ms}$. This modeled the uncertainty in spike times that is observed in in vivo recordings in birdsong (Hahnloser et al., 2002). Tutor neurons fired Poisson spikes with a timedependent firing rate that was set as described in the main text.
The initial connectivity between conductor and student neurons was chosen to be sparse (see Table 1). The initial distribution of synaptic weights was lognormal, matching experimentally measured values for zebra finches (GarstOrozco et al., 2014). Since these measurements are done in the slice, the absolute number of HVC synapses per RA neuron is likely to have been underestimated. The number of conductor–student synapses we start with in our simulations is thus chosen to be higher than the value reported in that paper (see Table 1), and is allowed to change during learning. We checked that the learning paradigm described here is robust to substantial changes in these parameters, but we have chosen values that are faithful to birdsong experiments and which are thus able to imitate the RA spiking statistics during song.
The synapses projecting onto each student neuron from the tutor have a weight that is fixed during our simulations reflecting the finding in GarstOrozco et al. (2014) that the average strength of LMAN–RA synapses for zebra finches does not change with age. There is some evidence that individual LMAN–RA synapses undergo plasticity concurrently with the HVC–RA synapses (Mehaffey and Doupe, 2015) but we did not seek to model this effect. There are also developmental changes in the kinetics of NMDAmediated synaptic currents in both HVC–RA and LMAN–RA synapses which we do not model (Stark and Perkel, 1999). These, however, happen early in development, and thus are unlikely to have an effect on song crystallization, which is what our model focuses on. Stark and Perkel, 1999 also observed changes in the relative contribution of NMDA to AMPA responses in the HVC–RA synapses. We do not incorporate such effects in our model since we do not explicitly model the dynamics of HVC neurons in this paper. However, this is an interesting avenue for future work, especially since there is evidence that area HVC can also contribute to learning, in particular in relation to the temporal structure of song (Ali et al., 2013).
Matching spiking statistics with experimental data
Request a detailed protocolWe used an optimization technique to choose parameters to maximize the similarity between the statistics of spiking in our simulations and the firing statistics observed in neural recordings from the songbird. The comparison was based on several descriptive statistics: the average firing rate; the coefficient of variation and skewness of the distribution of interspike intervals; the frequency and average duration of bursts; and the firing rate during bursts. For calculating these statistics, bursts were defined to start if the firing rate went above 80 Hz and last until the rate decreased below 40 Hz.
To carry out such optimizations in the stochastic context of our simulations, we used an evolutionary algorithm—the covariance matrix adaptation evolution strategy (CMAES) (Hansen, 2006). The objective function was based on the relative error between the simulation statistics ${x}_{i}^{\text{sim}}$ and the observed statistics ${x}_{i}^{\text{obs}}$,
Equal weight was placed on optimizing the firing statistics in the juvenile (based on a recording from a 43 dph bird) and optimizing firing in the adult (based on a recording from a 160 dph bird). In this optimization there was no learning between the juvenile and adult stages. We simply required that the number of HVC synapses per RA neuron, and the mean and standard deviation of the corresponding synaptic weights were in the ranges seen in the juvenile and adult by GarstOrozco et al. (2014). The optimization was carried out in Python (RRID:SCR_008394), using code from https://www.lri.fr/~hansen/cmaes_inmatlab.html. The results fixed the parameter choices in Table 1 which were then used to study our learning paradigm. While these choices are important for achieving firing statistics that are similar to those seen in recordings from the bird, our learning paradigm is robust to large variations in the parameters in Table 1.
Software and data
Request a detailed protocolWe used custombuilt Python (RRID:SCR_008394) code for simulations and data analysis. The software and data that we used can be accessed online on GitHub (RRID:SCR_002630) at https://github.com/ttesileanu/twostagelearning.
Appendix 1
Effect of nonlinearities
We can generalize the model from Equation (8) by using a nonlinear transfer function from student activities to motor output, and a nonlinear activation function for student neurons:
Suppose further that we use a general loss function,
Following the same argument as that from section "Derivation of the matching tutor signal", the gradient descent condition, Equation (18), implies
The departure from the quadratic loss function, $\mathcal{L}\ne \frac{1}{2}{\sum}_{a}{({y}_{a}(t){\overline{y}}_{a}(t))}^{2}$, and the nonlinearities in the output, ${N}_{a}$, have the effect of redefining the motor error,
A proper loss function will be such that the derivatives $\partial \mathcal{L}/\partial {y}_{a}$ vanish when ${y}_{a}(t)={\overline{y}}_{a}(t)$, and so the motor error ${\u03f5}_{j}$ as defined here is zero when the rendition is perfect, as expected. If we use a tutor that ignores the nonlinearities in a nonlinear system, i.e., if we use Equation (2) instead of Equation (26) to calculate the tutor signal that is plugged into Equation (3), we still expect successful learning provided that ${N}_{a}^{\mathrm{\prime}}\text{}\text{}0$ and that $\mathcal{L}$ is itself an increasing function of ${y}_{a}{\overline{y}}_{a}$ (see section "Effect of different output functions"). This is because replacing Equation (26) with Equation (2) would affect the magnitude of the motor error without significantly changing its direction. In more complicated scenarios, if the transfer function to the output is not monotonic, there is the potential that using Equation (2) would push the system away from convergence instead of towards it. In such a case, an adaptive mechanism, such as the reinforcement rules from Equations (6) or (7) can be used to adapt to the local values of the derivatives ${N}_{a}^{\prime}$ and $\partial \mathcal{L}/\partial {y}_{a}$.
Finally, the nonlinear activation function $F$ introduces a dependence on the student output ${s}_{j}(t)$ in Equation (25), since ${F}^{\prime}$ is evaluated at ${F}^{1}({s}_{j}(t))$. To obtain a good match between the student and the tutor in this context, we can modify the student plasticity rule (Equation 10) by adding a dependence on the postsynaptic activity,
In general, synaptic plasticity has been observed to indeed depend on postsynaptic activity (Chistiakova et al., 2014; Chistiakova and Volgushev, 2009). Our derivation suggests that the effectiveness of learning could be improved by tuning this dependence of synaptic change on postsynaptic activity to the activation function of postsynaptic neurons, according to Equation (27). It would be interesting to check whether such tuning occurs in real neurons.
Effect of different output functions
In the main text, we assumed a linear mapping between student activities and motor output. Moreover, we assumed a myotopic organization, in which each student neuron projected to a single muscle, leading to a student–output assignment matrix ${M}_{aj}$ in which each column had a single nonzero entry. We also assumed that student neurons only contributed additively to the outputs, with no inhibitory activity. Here we show that our results hold for other choices of student–output mappings.
For example, assume a pushpull architecture, in which half of the student neurons controlling one output are excitatory and half are inhibitory. This can be used to decouple the overall firing rate in the student from the magnitude of the outputs. Learning works just as effectively as in the case of the purely additive student–output mapping when using matched tutors, Appendix 1—figures 1A and 1B. The consequences of mismatching student and tutor circuits are also not significantly changed, Appendix 1—figures 1C and 1D.
We can also consider nonlinear mappings between the student activity and the final output. If there is a monotonic output nonlinearity, as in Equation (23) with ${N}_{a}^{\mathrm{\prime}}\text{}\text{}0$, the tutor signal derived for the linear case, Equation (3), can still achieve convergence, though at a slower rate and with a somewhat lower accuracy (see Appendix 1—figure 1E for the case of a sigmoidal nonlinearity). For nonmonotonic nonlinearities, the direction from which the optimum is approached can be crucial, as learning can get stuck in local minima of the loss function (we thank Josh Gold for this observation). Studying this might provide an interesting avenue to test whether learning in songbirds is based on a gradient descenttype rule or on a more sophisticated optimization technique.
Different inhibition models
In the spiking model, we used an activitydependent inhibitory signal that was proportional to the average student activity. Using a constant inhibition instead, ${V}_{\text{inh}}=\text{constant}$, does not significantly change the results: see Appendix 1—figure 1F for an example.
Effect of changing plasticity kernels
In the main text, we used exponential kernels with ${\tau}_{1}=80\mathrm{ms}$ and ${\tau}_{2}=40\mathrm{ms}$ for the smoothing of the conductor signal that enters the synaptic plasticity rule, Equation (10). We can generalize this in two ways: we can use different timescales ${\tau}_{1}$, ${\tau}_{2}$, or we can use a different functional form for the kernels. (Note that in the main text we showed the effects of varying the parameters $\alpha $ and $\beta $ in the plasticity rule, while the timescales ${\tau}_{1}$ and ${\tau}_{2}$ were kept fixed.)
The values for the timescales ${\tau}_{1,2}$ were chosen to roughly match the shape of the plasticity curve measured in slices of zebra finch RA (Mehaffey and Doupe, 2015) (see Figure 1C and D). The main predictions of our model, that learning is most effective when the tutor signal is matched to the student plasticity rule, and that large mismatches between tutor and student lead to impaired learning, hold well when the student timescales change: see Appendix 1—figure 2A for the case when ${\tau}_{1}=20\mathrm{ms}$ and ${\tau}_{2}=10\mathrm{ms}$. In the main text we saw that the negative effects of tutor–student mismatch diminish for timescales that are shorter than $\sim {\tau}_{1,2}$. In Appendix 1—figure 2A, the range of timescales where a precise matching is not essential becomes very small because the student timescales are short.
Another generalization of our plasticity rule can be obtained by changing the functional form of the kernels used to smooth the conductor input. As an example, suppose ${K}_{2}$ is kept exponential, while ${K}_{1}$ is replaced by
An example of learning using an STDP rule based on kernels ${\overline{K}}_{1}$ and ${K}_{2}$ where ${\overline{\tau}}_{1}={\tau}_{2}$ is shown in Appendix 1—figure 2B. The matching tutor has the same form as before, Equation (3) with timescale ${\tau}_{\text{tutor}}={\tau}_{\text{tutor}}^{*}$ given by Equation (4), but with ${\tau}_{1}=2{\overline{\tau}}_{1}=2{\tau}_{2}$. We can see that learning is as effective as in the case of purely exponential kernels.
More general conductor patterns
In the main text, we have focused on a conductor whose activity matches that observed in area HVC of songbirds (Hahnloser et al., 2002): each neuron fires a single burst during the motor program. Our model, however, is not restricted to this case. We generated alternative conductor patterns by using arbitrarilyplaced bursts of activity, as in Appendix 1—figure 3A. The model converges to a good rendition of the target program, Appendix 1—figure 3B. Learning is harder in this case because many conductor neurons can be active at the same time, and the weight updates affect not only the output of the system at the current position in the motor program, but also at all the other positions where the conductor neurons fire. This is in contrast to the HVClike conductor, where each neuron fires at a single point in the motor program, and thus the effect of weight updates is better localized. More generally, simulations show that the sparser the conductor firing, the faster the convergence (data not shown). The accuracy of the final rendition of the motor program (Appendix 1—figure 3B, inset) is also not as good as before.
Edge effects
In our derivation of the matching tutor rule, we assumed that the system has enough time to integrate all the synaptic weight changes from Equation (10). However, some of these changes occur tens or hundreds of milliseconds after the inputs that generated them, due to the timescales used in the plasticity kernel. Since our simulations are only run for a finite amount of time, there will in general be edge effects, where periods of the motor program towards the end of the simulations will have difficulty converging. To offset such numerical issues, we ran the simulations for a few hundred milliseconds longer than the duration of the motor program, and ignored the data from this extra period. Our simulations typically run for $600\mathrm{ms}$, and the time reserved for relaxation after the end of the program was set to $1200\mathrm{ms}$. The long relaxation time was chosen to allow for cases where the tutor was chosen to have a very long memory timescale.
Parameter optimization for reproducing juvenile and adult spiking statistics
We set the parameters in our simulations to reproduce spiking statistics from recordings in zebra finch RA as closely as possible. Appendix 1—figure 4 shows how the distribution of summary statistics obtained from 50 runs of the simulation compares to the distributions calculated from recordings in birds at various developmental stages. Each plot shows a standard box and whisker plot superimposed over a kerneldensity estimate of the distribution of a given summary statistic, either over simulation runs or over recordings from birds at various stages of song learning. We ran two sets of simulations, one for a bird with juvenilelike connectivity between HVC and RA, and one with adultlike connectivity (see Materials and methods). In these simulations there was no learning to match the timecourse of songs—the goal was simply to identify parameters that lead to birdsonglike firing statistics.
The qualitative match between our simulations and recordings is good, but the simulations are less variable than the measurements. This may be due to sources of variability that we have ignored—for example, all our simulated neurons had exactly the same membrane time constants, refractory periods, and threshold potentials, which is not the case for real neurons. Another reason might be that in our simulations, all the runs were performed for the same network, while the measurements are from different cells in different birds.
Effect of spiking stochasticity on learning
As pointed out in the main text, learning is affected in the spiking simulations when the tutor error integration timescale ${\tau}_{\text{tutor}}$ becomes very long. More specifically, two distinct effects occur. First, the fluctuations in the motor output increase, leading to a poorer match to the shape of the target motor program. And second, the whole output gets shifted up, towards higher muscle activation values. Both of these effects can be traced back to the stochasticity of the tutor signal.
In the spiking simulations, tutor neurons are assumed to fire Poisson spikes following a timedependent firing rate that obeys Equation (5). By the nature of the Poisson process, the tutor output in this case will contain fluctuations around the mean, $g(t)\sim \overline{g}(t)+\xi (t)$. Recall that the scale of $g(t)$ is set by the threshold $\theta $ and thus, since this is a Poisson process, so is the scale of the variability $\xi (t)$.
As long as the tutor error integration timescale is not very long, $g(t)$ roughly corresponds to a smoothed version of the motor error $\u03f5(t)$ (cf. Equation 5). However, as ${\tau}_{\text{tutor}}$ grows past the duration $T$ of the motor program, the exponential term in Equation (5) becomes essentially constant, leading to a tutor signal $\overline{g}(t)$ whose departures from the center value $\theta $ decrease in proportion to the timescale ${\tau}_{\text{tutor}}$. As far as the student is concerned, the relevant signal is $g(t)\theta $ (Equation 1), and thus, when ${\tau}_{\text{tutor}}\text{}\text{}T$, the signaltonoise ratio in the tutor guiding signal starts to decrease as $1/{\tau}_{\text{tutor}}$. This ultimately leads to a very noisy rendition of the target program. One way to improve this would be to increase the gain factor $\zeta $ that controls the relation between the motor error and the tutor signal (see Equation 5). This improves the ability of the system to converge onto its target in the late stages of learning. In the early stages of learning, however, this could lead to saturation problems. One way to fix this would be to use a variable gain factor $\zeta $ that ensures the whole range of tutor firing rates is used without generating too much saturation. This would be an interesting avenue for future research.
Reducing the fluctuations in the tutor signal also decreases the fluctuations in the conductor–student synaptic weights, which leads to fewer weights being clamped at 0 because of the positivity constraint. This reduces the shift between the learned motor program and the target. As mentioned in the main text, another approach to reducing or eliminating this shift is to allow for negative weights or (more realistically) to use a pushpull mechanism, in which the activity of some student neurons acts to increase muscle output, while the activity of other student neurons acts as an inhibition on muscle output.
Appendix 2
Plasticity parameter values
In the heatmaps that appear in many of the figures in the main text and in the supplementary information, we kept the timescales ${\tau}_{1}$ and ${\tau}_{2}$ constant while varying $\alpha $ and $\beta $ to modify the student plasticity rule. Since the overall scale of $\alpha $ and $\beta $ is inconsequential as it can be absorbed into the learning rate (as explained in the section "Learning in a ratebased model"), we imposed the further constraint $\alpha \beta =1$. This implies that we effectively focused on a oneparameter family of student plasticity rule, as identified by the value of $\alpha $ (and the corresponding value for $\beta =\alpha 1$). In the figures, we expressed this instead in terms of the timescale of the optimallymatching tutor, ${\tau}_{\text{tutor}}^{*}$, as defined in Equation (4).
Below we give the explicit values of $\alpha $ and $\beta $ that we used for each row in the heatmaps. These can be calculated by solving for $\alpha $ in Equation (4), using $\beta =\alpha 1$, and assuming that ${\tau}_{1}=80\mathrm{ms}$ and ${\tau}_{2}=40\mathrm{ms}$.
${\tau}_{\text{tutor}}^{*}$ (ms)  $\alpha $  $\beta $ 

10  $0.75$  $1.75$ 
20  $0.5$  $1.5$ 
40  $0.0$  $1.0$ 
80  $1.0$  $0.0$ 
160  $3.0$  $2.0$ 
320  $7.0$  $6.0$ 
640  $15.0$  $14.0$ 
1280  $31.0$  $30.0$ 
2560  $63.0$  $62.0$ 
5120  $127.0$  $126.0$ 
10240  $255.0$  $254.0$ 
20480  $511.0$  $510.0$ 
References

Independent premotor encoding of the sequence and structure of birdsong in avian cortexJournal of Neuroscience 34:16821–16834.https://doi.org/10.1523/JNEUROSCI.194014.2014

Effect of testosterone on input received by an identified neuron type of the canary song system: a golgi/electron microscopy/degeneration studyJournal of Neuroscience 8:3770–3784.

Heterosynaptic plasticity: multiple mechanisms and multiple rolesThe Neuroscientist : A Review Journal Bringing Neurobiology, Neurology and Psychiatry 20:483–498.https://doi.org/10.1177/1073858414529829

Heterosynaptic plasticity in the neocortexExperimental Brain Research 199:377–390.https://doi.org/10.1007/s0022100918595

BookA computational model of avian song learningIn: Gazzaniga M. S, editors. The New Cognitive Neurosciences. arXiv:1011.1669v3. pp. 469–482.

Reinforcement learning with modulated spike timing dependent synaptic plasticityJournal of Neurophysiology 98:3648–3665.https://doi.org/10.1152/jn.00364.2007

Model of birdsong learning based on gradient estimation by dynamic perturbation of neural conductancesJournal of Neurophysiology 98:2038–2057.https://doi.org/10.1152/jn.01311.2006

Gradient learning in spiking neural networks by dynamic perturbation of conductancesPhysical Review Letters 97:048104.https://doi.org/10.1103/PhysRevLett.97.048104

The organization of recent and remote memoriesNature Reviews Neuroscience 6:119–130.https://doi.org/10.1038/nrn1607

Towards a New Evolutionary Computation. Advances on Estimation of Distribution Algorithms75–102, The CMA Evolution Strategy: A Comparing Review, Towards a New Evolutionary Computation. Advances on Estimation of Distribution Algorithms.

The development of afferent projections to the robust archistriatal nucleus in male zebra finches: a quantitative Electron microscopic studyJournal of Neuroscience 11:2063–2074.

Dopaminergic contributions to vocal learningJournal of Neuroscience 36:2176–2189.https://doi.org/10.1523/JNEUROSCI.388315.2016

Bird Vocalizations61–74, Song development in the zebra finch and other estrildid finches, Bird Vocalizations.

Dopaminergic system in Birdsong learning and maintenanceJournal of Chemical Neuroanatomy 39:112–123.https://doi.org/10.1016/j.jchemneu.2009.10.004

Ensemble coding of vocal control in birdsongJournal of Neuroscience 25:652–661.https://doi.org/10.1523/JNEUROSCI.303604.2005

Origin of the anterior forebrain pathwayAnnals of the New York Academy of Sciences 1016:736–748.https://doi.org/10.1196/annals.1298.039

Brain pathways for learned and unlearned vocalizations differ in zebra finchesJournal of Neuroscience 10:1541–1556.

Longrange inhibition within the zebra finch song nucleus RA can coordinate the firing of multiple projection neuronsJournal of Neurophysiology 81:3007–3020.

Twostage, inputspecific synaptic maturation in a nucleus essential for vocal production in the zebra finchJournal of Neuroscience 19:9107–9116.

Basal ganglia contributions to motor control: a vigorous tutorCurrent Opinion in Neurobiology 20:704–716.https://doi.org/10.1016/j.conb.2010.08.022

Learning to breathe and sing: development of respiratoryvocal coordination in young songbirdsJournal of Neurophysiology 106:1747–1765.https://doi.org/10.1152/jn.00247.2011

Mechanisms and time course of vocal learning and consolidation in the adult songbirdJournal of Neurophysiology 106:1806–1821.https://doi.org/10.1152/jn.00311.2011

Changes in the neural control of a complex motor sequence during learningJournal of Neurophysiology 106:386–397.https://doi.org/10.1152/jn.00018.2011
Decision letter

Michael J FrankReviewing Editor; Brown University, United States
In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.
Thank you for submitting your article "Matching tutor to student: rules and mechanisms for efficient twostage learning in neural circuits" for consideration by eLife. Your article has been reviewed by two peer reviewers, and the evaluation has been overseen by Michael Frank as the Reviewing Editor and Andrew King as the Senior Editor. The reviewers have opted to remain anonymous.
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
Summary:
This is a careful and welldesigned study of the role of the matching between plasticity rules and instructive signals for efficient learning in a twostage model, wherein the reinforcement signal is processed by one circuit that in turn provides input to another circuit that drives the actual behavior. This is particularly applicable to recent detailed findings in the birdsong field. Several important results are laid out in the manuscript: (1) matching is important for fast learning, (2) matching is not as important when the tutor timescale is shorter than STDP timescales (3) similar results are obtained for ratebased or spikingbased models, (4) learning drives student neurons to become more bursty, (5) learning drives pruning of connections in spiking networks, and (6) sequential learning is obtained when the tutor memory is long.
Essential revisions:
All involved found the work to be of high quality and of broad interest to researchers in sensorimotor and reinforcement learning, as well as being of particular interest to those working on neural recording in the songbird brain.
1) However, an invigorating consultation session among the reviewers was needed to get down to the bottom of whether some of the main modeling results were circular/trivial or not. It was determined that in fact they are not, and that rather some of them show that the model is robust to the precise form of the motor error. But the reason for this initial difference in opinion could be traced to a lack of clarity in the manuscript over the definition of tau_tutor. One reviewer originally understood it as the timescale over which the firing rate of the tutor neurons is modulated, which led to the conclusion that g(t) is constrained to have a timescale of tau_tutor, which would mean that the main finding that your results "predict the temporal structure of the [LMAN signal] given a plasticity rule" – would be a trivial consequence of the plasticity rule….
However, as explained in a small sentence above Equation 1, what tau_tutor actually refers to is the timescale over which motor errors are integrated into the firing of tutor neurons. This definition of tau_tutor is specific to the Taylor series approximation taken in A.8 and hence, their matching results point to the robustness of the model. One way to summarize the finding is if you want to use an STDP rule in motor learning, you should convolve your motor error signal with a particular exponential smoothing function, with a particular timescale related to the plasticity rule, independent of the form of the motor error signal.
Another interesting observation is that their derivation for the timescale tau_tutor is not valid when tau_tutor is of a comparable magnitude to tau1 and tau2. Stated simply, they make an assumption in their derivation that tau_tutor >> tau1 and tau2. Therefore, they don't see clear matching in areas where this assumption is violated. A discussion of this in subsection “Match vs. unmatched learning” may also improve the readability of the paper.
The authors should comment on what they would conclude if one observed a different form for g(t) in the songbird brain. It's a subtle point – a g(t) with the \tau^* they predict would certainly shore up their theory, but one with a different form might mean either a) that their theory is wrong, or b) that their approximation (A.8) is wrong.
2) Both reviewers strongly agreed that more is needed to motivate what the model predicts. In the Abstract and Discussion, a claim is made that the results presented here make clear and testable predictions for experimentalists. The manuscript would be greatly improved if this were made more explicit and specific. Indeed, explicitly connecting the model to falsifiable experimental predictions would make the difference between this being a paper of truly broad interest.
Perhaps the authors envision revealing the precise form of the plasticity rule by looking at the structure of the tutor signal. Perhaps they envision inferring the form of the motor error signal. Perhaps they will test coarser effects that result from this kind of tutor signal, like the progressive clumping of student responses into burstlike events during learning. If these predictions were laid out more clearly, concerns about what exactly is being claimed through the calculation in Appendix 3 would be ameliorated.
A couple of other suggestions for how the manuscript might be revised are given here:
– The point \α = \β is the case that corresponds to the study by Mehaffey and Doupe, so a detailed discussion of the tutor \tau's that yield reasonable learning in that part of parameter space would be useful. How big is the range of \tau_{tutor}'s in that case? Does it span many orders of magnitude? How well does the model activity at these points align with real LMAN activity in the developing bird?
– What precise features of the firing of LMAN neurons are predicted for the models that yield efficient learning and match known plasticity features? What should an experimentalist measure in their spike trains? Putting some numbers to the predictions would be great.
3) One reviewer thought that Figure 3C is not very convincing that learning is good only along the diagonal. It does not seem to matter at short timescales (which the authors discuss briefly), where performance is equivalent regardless of whether timescales are matched or unmatched (e.g. 10 ms vs 80 ms). But even at timescales >80 ms, the performance does not seem strictly diagonal, and the asymptotic performance of the model is significantly worse than at shorter timescales. Therefore it seems like the two key findings – timescale matching and the ability to learn – do not coexist. The other reviewer noted that the error is on a log scale is very important here. There is a steep valley sloping towards the diagonal. The larger error reached at longer timescales was thus less worrisome. 3B shows that this amount of error doesn't stem from huge differences in the output time course. We would therefore like to ask the authors for another plot like B at yet longer timescales to see how and if things break down. Also, 250 iterations isn't that many in terms of number of renditions sung during development, and we might want to see what the error surface looks like after a few thousand. Finally, the paper does discuss the flattening of the valley at short timescales, but this could be expanded.
4. It seems misleading to call τ1 and τ2 the timescales of synaptic plasticity of the student. They are simply timescales of two exponential kernels that are linearly combined to create the final filtering kernel. If the authors want to relate τ1 and τ2 to the timescales of synaptic plasticity, they should include a discussion of how and why they consider them to be related. In general it would be helpful if the authors are more clear in defining their variables through the paper – especially since tau1 and tau2 are very different from tau_tutor.
[Editors' note: further revisions were requested prior to acceptance, as described below.]
Thank you for resubmitting your work entitled "Rules and mechanisms for efficient twostage learning in neural circuits" for further consideration at eLife. Your revised article has been favorably evaluated by Andrew King (Senior editor), Reviewing editor Michael Frank, and one external reviewer.
The manuscript has been improved but there are some remaining issues that need to be addressed before acceptance, as outlined clearly by the reviewer below:
By and large we are happy with the changes the authors have made.
However, there are still a few important aspects of the paper that need to be significantly revised to improve clarity and address some issues of interpretation:
1) The (qualitative) definition of tau_tutor could still be much clearer. When we first read the initial submission, we mistakenly thought that tau_tutor represented the timescale over which the firing rate of the tutor neurons is modulated, rather than (as is actually the case) the timescale over which motor errors are integrated into the firing of tutor neurons. Based on our comments from last time, the authors have somewhat clarified this, however this could still be more clear (in fact, I got confused in the same way reading the revised paper). We think this could be easily clarified with a couple of minor changes
a) Just above Equation 4, revise to "Moreover, for effective learning, the timescale t_tutor – ***which quantifies the timescale on which error information is integrated into the tutor signal*** – appearing in Equation 3…" (or something similar).
b) Be a bit more specific when talking about "matching". For example, the first sentence in 2.3 simply reads "…when the tutor is matched to the student…". I think it would be much better (especially for less quantitative readers) to explicitly state what is being matched by expanding this to read something like "…when the ***timescale on which error information is integrated into the tutor signal (tau_tutor) is matched to the student plasticity rule***…".
c) More generally, throughout the paper replace the term "tutor timescale" – which many biologists will misinterpret as the timescale on which tutor activity varies – with "tutor error integration timescale" or something like this.
2) We appreciate the authors expanding the duration of simulations and range of timescales tested in the simulations shown in Figure 3C and 5D. However, the authors should be more explicit (and consistent) in explaining how these simulations are conducted. My understanding is that in Figure 3C, tau_1 and tau_2 are fixed at [80 40]msec, and to obtain a particular t_tutor*, plasticity parameters α and β are adjusted for each simulation (subject to the constraint αβ=1). Provided this is correct, the fact that α and β were different for different simulations (i.e. for different squares in Figure 3C, 5D) should be explicitly stated in the results text and/or legend to Figure 3. Similarly, as far as I can tell a similar approach (fix tau_1,2, vary α/β to get a particular tau_tutor*) was used in the analysis shown in Figure 5D. However, this is not made clear in the text/caption (I find the caption to 5D especially confusing). Authors should explicitly state the procedures for 5D and explain any differences in the general methods used in 3C. Furthermore, the authors should present (maybe as a supplemental figure) the values of α or β used for all simulations (e.g. a heatmap of α values to go along with Figures 3C and 5D).
3) The following is a somewhat subtle point, so I'll leave it up to the Authors to deal with as they see fit.
As noted above, tau_tutor is the timescale over which motor errors are integrated into the firing of tutor neurons. Furthermore, the various models (linear rate, spiking, etc.) can work over a big range of tau_tutors, say from 10ms1280msec, as shown in Figures 3C and 5D. In the revised Discussion, the authors correctly assert that LMAN/tutor firing "should reflect the integral of the motor error with the timescale predicted by the model". The authors also correctly point out that we don't really know what the motor error signal looks like.
However – a key physiological fact is that firing in the "tutor" brain area – LMAN – consists of both short bursts (~15 msec duration) as well as single spikes whose overall rate varies much more slowly. So, although we don't know how fast the error signal varies, even if the error signal were a pulse that only lasted, say, 1 msec, it would affect the firing of the LMAN neuron on timescale tau_tutor – e.g. over >100 msec if tau_tutor=100. So, if I'm getting this right, it seems implausible that a 15 msec burst in LMAN could possibly reflect motor error – even an infinitely brief motor error – if tau_tutor were longer than, say, 50 msec. The authors may want to discuss this issue, especially whether their model suggests that only slow changes in the rate of singlespikes, and not bursts, are carrying error information? I think that such an implication is both important for understanding the implications of the model and for guiding future work in the system.
Additionally, as indicated just below Equation 4, the model assumes that tau_tutor>>tau_1,2. In many simulations (definitely in Figure 3C, also I think in Figure 5D, see my question above), tau_1 and tau_2 have values of 40 and 80 msec, which are values derived from the Mehaffey and Doupe STDP paper. This would suggest that tau_tutor should be >>80 msec, and that cases with tau_tutor and tau_tutor* less than this value (which make up a significant region of the parameter space shown in Figures 3C and 5D) are biologically implausible. This seems to me to be something the authors should address.
https://doi.org/10.7554/eLife.20944.020Author response
Essential revisions:
All involved found the work to be of high quality and of broad interest to researchers in sensorimotor and reinforcement learning, as well as being of particular interest to those working on neural recording in the songbird brain.
1) However, an invigorating consultation session among the reviewers was needed to get down to the bottom of whether some of the main modeling results were circular/trivial or not. It was determined that in fact they are not, and that rather some of them show that the model is robust to the precise form of the motor error. But the reason for this initial difference in opinion could be traced to a lack of clarity in the manuscript over the definition of tau_tutor. One reviewer originally understood it as the timescale over which the firing rate of the tutor neurons is modulated, which led to the conclusion that g(t) is constrained to have a timescale of tau_tutor, which would mean that the main finding that your results "predict the temporal structure of the [LMAN signal] given a plasticity rule" – would be a trivial consequence of the plasticity rule….
However, as explained in a small sentence above Equation 1, what tau_tutor actually refers to is the timescale over which motor errors are integrated into the firing of tutor neurons. This definition of tau_tutor is specific to the Taylor series approximation taken in A.8 and hence, their matching results point to the robustness of the model. One way to summarize the finding is if you want to use an STDP rule in motor learning, you should convolve your motor error signal with a particular exponential smoothing function, with a particular timescale related to the plasticity rule, independent of the form of the motor error signal.
We are grateful for this suggestion. Indeed, upon rereading we realized that the text could have been clearer on this point. Your suggested explanation is indeed clear and accurate and we have incorporated a version of this text below Equation 4.
Another interesting observation is that their derivation for the timescale tau_tutor is not valid when tau_tutor is of a comparable magnitude to tau1 and tau2. Stated simply, they make an assumption in their derivation that tau_tutor >> tau1 and tau2. Therefore, they don't see clear matching in areas where this assumption is violated. A discussion of this in subsection “Match vs. unmatched learning” may also improve the readability of the paper.
We have added a discussion of these assumptions in our derivation at the end of section “Learning in a ratebased model”. Since the plasticity rule smooths conductor inputs over timescales of order tau_1 and tau_2, tutor signals that vary on shorter timescales do not have a strong effect on learning, justifying the approximation. It is also possible to improve the approximation we are making by including more terms in the Taylor expansion that we use in section A.3. We added a note about this in the Discussion.
The authors should comment on what they would conclude if one observed a different form for g(t) in the songbird brain. It's a subtle point – a g(t) with the \tau^* they predict would certainly shore up their theory, but one with a different form might mean either a) that their theory is wrong, or b) that their approximation (A.8) is wrong.
This is a very good point, and we thank the reviewers for encouraging us to discuss this issue. We added a few sentences pointing out these different ways of interpreting a potential mismatch between theory and experiment in this case (Discussion section). We have also mentioned that we could improve the approximation by going to the next order in the Taylor series expansion that we used to derive the matched tutor.
2) Both reviewers strongly agreed that more is needed to motivate what the model predicts. In the Abstract and Discussion, a claim is made that the results presented here make clear and testable predictions for experimentalists. The manuscript would be greatly improved if this were made more explicit and specific. Indeed, explicitly connecting the model to falsifiable experimental predictions would make the difference between this being a paper of truly broad interest.
Perhaps the authors envision revealing the precise form of the plasticity rule by looking at the structure of the tutor signal. Perhaps they envision inferring the form of the motor error signal. Perhaps they will test coarser effects that result from this kind of tutor signal, like the progressive clumping of student responses into burstlike events during learning. If these predictions were laid out more clearly, concerns about what exactly is being claimed through the calculation in Appendix 3 would be ameliorated.
Our model provides a general method for calculating the tutor signal needed by a student circuit during learning, under the assumption that the student and tutor circuits have evolved together to maximize learning efficiency. This can be used to make predictions in several ways, as we now explain in the Discussion: (a) we could use recordings from tutor neurons together with measurements of motor error to infer the student plasticity rule; (b) we could conversely use the plasticity rule with the motor error to predict tutor spiking statistics; (c) we could electrically or optogenetically stimulate tutor neurons and test whether the student learns as predicted by our model.
A couple of other suggestions for how the manuscript might be revised are given here:
– The point \α = \β is the case that corresponds to the study by Mehaffey and Doupe, so a detailed discussion of the tutor \tau's that yield reasonable learning in that part of parameter space would be useful. How big is the range of \tau_{tutor}'s in that case? Does it span many orders of magnitude? How well does the model activity at these points align with real LMAN activity in the developing bird?
We added a comment about the range of tau_tutor’s that allow effective learning to the bottom of section “Matched vs. unmatched learning”. Because of the normalization we use, which sets α – β = 1, we cannot directly address the question of what happens when α = β. However, the Mehaffey & Doupe data can be explained just as well using a plasticity model that has α large, α>>1, which would set α and β approximately equal to each other (leading to a large tau_tutor^*). This implies that the Mehaffey & Doupe case corresponds to the bottom part of the plots in Figures 3C and D.
A precise comparison of LMAN activity and our model would require identifying the specific muscle group to which the recorded LMAN neurons refer, which is feasible but hasn't been done yet. The relation between the muscle activations and song features (such as pitch) would also have to be known – again, this is possible in principle, but goes beyond the scope of our paper.
– What precise features of the firing of LMAN neurons are predicted for the models that yield efficient learning and match known plasticity features? What should an experimentalist measure in their spike trains? Putting some numbers to the predictions would be great.
Our model can predict how the average firing rate of LMAN neurons should vary during song production, provided we have a good measure of motor error and can estimate how specific LMAN neurons contribute to this error. Measurements of the latter kind are necessary for making specific quantitative comparisons, which thus lie beyond the scope of the present paper. Please also see the response to question (2.1), and the revised Discussion.
3) One reviewer thought that Figure 3C is not very convincing that learning is good only along the diagonal. It does not seem to matter at short timescales (which the authors discuss briefly), where performance is equivalent regardless of whether timescales are matched or unmatched (e.g. 10 ms vs 80 ms). But even at timescales >80 ms, the performance does not seem strictly diagonal, and the asymptotic performance of the model is significantly worse than at shorter timescales. Therefore it seems like the two key findings – timescale matching and the ability to learn – do not coexist. The other reviewer noted that the error is on a log scale is very important here. There is a steep valley sloping towards the diagonal. The larger error reached at longer timescales was thus less worrisome. 3B shows that this amount of error doesn't stem from huge differences in the output time course. We would therefore like to ask the authors for another plot like B at yet longer timescales to see how and if things break down. Also, 250 iterations isn't that many in terms of number of renditions sung during development, and we might want to see what the error surface looks like after a few thousand. Finally, the paper does discuss the flattening of the valley at short timescales, but this could be expanded.
We updated the heatmaps extending the range of timescales. Note that the timescales are chosen in geometric progression, so the four bins we added to each row and column of the heatmap extended the maximum tutor timescale by a factor of 2^4 = 16, to over 20 seconds. This is much longer than our typical simulation, which ran for about 1 second. We also ran the simulations for 1000 steps now instead of 250 in the previous version, and we used the same color scale throughout the paper to make it easier to make comparisons.
The new plots for the ratebased simulations show that both effective learning when the student matches the tutor, and impaired learning when there is a mismatch, persist even when the tutor timescales are very long. A more interesting effect occurs for the spiking simulations, where convergence breaks down for very long tutor memories. When the tutor timescale is significantly longer than the duration of the motor program, the tutor firing rates tend to stay close to the threshold theta. In the spiking case, the fluctuations due to Poisson spiking introduce noise in the tutor guiding signal, and this noise is more damaging to learning the less the tutor rates vary. This is one of the reasons for the failure in convergence in spiking simulations when the tutor timescale is very long. Another issue is related to the constraint that requires conductorstudent synaptic weights to be positive. A consequence of this constraint is that negative fluctuations in the synaptic weights (which happen due to the stochasticity of the tutor signal) are sometimes clamped (when the weights reach zero), and so there is a net positive trend on the weights. This leads to a residual positive shift in the motor program that does not go away with learning.
Both effects can be reduced – increasing the gain that relates the motor error to the tutor signal reduces the effect of the fluctuations, and using a pushpull mechanism in which student neurons can either act to increase or decrease muscle activity effectively eliminates the residual shift in the motor program.
In the first version of the manuscript we had chosen to keep the tutor timescales in a range that didn’t significantly exceed the typical duration of the motor program, both because that seems to be the more relevant scenario (longer memory timescales do not have enough time to take full effect during the motor program), and to avoid the complications discussed above. We now included a discussion of these issues in section “Spiking neurons and birdsong”, and in the Supplementary Information, in section “Effect of spiking stochasticity on learning”.
4. It seems misleading to call τ1 and τ2 the timescales of synaptic plasticity of the student. They are simply timescales of two exponential kernels that are linearly combined to create the final filtering kernel. If the authors want to relate τ1 and τ2 to the timescales of synaptic plasticity, they should include a discussion of how and why they consider them to be related. In general it would be helpful if the authors are more clear in defining their variables through the paper – especially since tau1 and tau2 are very different from tau_tutor.
Indeed, we did not mean to imply that tau_1 and tau_2 are directly related to particular details of the plasticity mechanism, such as the duration it takes for plasticity to occur. In our model, tau_1 and tau_2 are simply two parameters that allow us to fit the observed dynamics of synaptic plasticity in some cases. We extended the description of our rule in section “Learning in a ratebased model” to make this clearer, and also to emphasize the relation between our rule and standard STDP rules. Rather than referring to these parameters as “plasticity timescales”, we now call them “student timescales”. We also added the equation defining the plasticity rule to subsection “Learning in a ratebased model” (Equation 1) to avoid confusion regarding the meaning of the different parameters used by the plasticity rule. We further adopted the notation tau_tutor to refer to the exponential kernel used to convolve the motor error in defining the tutor signal (Equation 2), and tau_tutor^* to refer to the timescale matched to a particular student.
[Editors' note: further revisions were requested prior to acceptance, as described below.]
However, there are still a few important aspects of the paper that need to be significantly revised to improve clarity and address some issues of interpretation:
1) The (qualitative) definition of tau_tutor could still be much clearer. When we first read the initial submission, we mistakenly thought that tau_tutor represented the timescale over which the firing rate of the tutor neurons is modulated, rather than (as is actually the case) the timescale over which motor errors are integrated into the firing of tutor neurons. Based on our comments from last time, the authors have somewhat clarified this, however this could still be more clear (in fact, I got confused in the same way reading the revised paper). We think this could be easily clarified with a couple of minor changes
a) Just above Equation 4, revise to "Moreover, for effective learning, the timescale t_tutor – ***which quantifies the timescale on which error information is integrated into the tutor signal*** – appearing in Equation 3…" (or something similar).
We added the suggested clarification above Equation 4.
b) Be a bit more specific when talking about "matching". For example, the first sentence in 2.3 simply reads "…when the tutor is matched to the student…". I think it would be much better (especially for less quantitative readers) to explicitly state what is being matched by expanding this to read something like "…when the ***timescale on which error information is integrated into the tutor signal (tau_tutor) is matched to the student plasticity rule***…".
We added the extended explanation of the matching at the top of subsection “Matched vs. unmatched learning”.
c) More generally, throughout the paper replace the term "tutor timescale" – which many biologists will misinterpret as the timescale on which tutor activity varies – with "tutor error integration timescale" or something like this.
We replaced the phrase “tutor timescale” by “tutor error integration timescale” throughout the manuscript.
2) We appreciate the authors expanding the duration of simulations and range of timescales tested in the simulations shown in Figure 3C and 5D. However, the authors should be more explicit (and consistent) in explaining how these simulations are conducted. My understanding is that in Figure 3C, tau_1 and tau_2 are fixed at [80 40]msec, and to obtain a particular t_tutor*, plasticity parameters α and β are adjusted for each simulation (subject to the constraint αβ=1). Provided this is correct, the fact that α and β were different for different simulations (i.e. for different squares in Figure 3C, 5D) should be explicitly stated in the results text and/or legend to Figure 3. Similarly, as far as I can tell a similar approach (fix tau_1,2, vary α/β to get a particular tau_tutor*) was used in the analysis shown in Figure 5D. However, this is not made clear in the text/caption (I find the caption to 5D especially confusing). Authors should explicitly state the procedures for 5D and explain any differences in the general methods used in 3C. Furthermore, the authors should present (maybe as a supplemental figure) the values of α or β used for all simulations (e.g. a heatmap of α values to go along with Figures 3C and 5D).
Indeed, as the referee states, for the heatmaps the timescales tau_1 and tau_2 are kept fixed while α and β are varied, while keeping αβ = 1. This was already explained in the text in the Results section. In addition, we followed the referee’s suggestions and explicitly stated this again in the captions of Figures 3C and 5D. We further added a more detailed description of the values of α and β that we used in the supplementary information.
3) The following is a somewhat subtle point, so I'll leave it up to the Authors to deal with as they see fit.
As noted above, tau_tutor is the timescale over which motor errors are integrated into the firing of tutor neurons. Furthermore, the various models (linear rate, spiking, etc.) can work over a big range of tau_tutors, say from 10ms1280msec, as shown in Figures 3C and 5D. In the revised Discussion, the authors correctly assert that LMAN/tutor firing "should reflect the integral of the motor error with the timescale predicted by the model". The authors also correctly point out that we don't really know what the motor error signal looks like.
However – a key physiological fact is that firing in the "tutor" brain area – LMAN – consists of both short bursts (~15 msec duration) as well as single spikes whose overall rate varies much more slowly. So, although we don't know how fast the error signal varies, even if the error signal were a pulse that only lasted, say, 1 msec, it would affect the firing of the LMAN neuron on timescale tau_tutor – e.g. over >100 msec if tau_tutor=100. So, if I'm getting this right, it seems implausible that a 15 msec burst in LMAN could possibly reflect motor error – even an infinitely brief motor error – if tau_tutor were longer than, say, 50 msec. The authors may want to discuss this issue, especially whether their model suggests that only slow changes in the rate of singlespikes, and not bursts, are carrying error information? I think that such an implication is both important for understanding the implications of the model and for guiding future work in the system.
If we have understood this comment correctly, the referee is asking whether variations of the LMAN signal over short timescales are informative about motor error when tau_tutor is long. While it is true that the timing of a single spike or short burst in any given trial is noisy and thus cannot hold precise information about rapid variations in the motor error, we assume that the effects of learning are averaged over many repetitions of the motor program. In this sense, learning ends up depending on average firing rates and not on the precise moments when spikes or bursts occur. Recordings suggest that the average firing rates are similar for spikes and bursts (Kao, Wright and Doupe, 2008). It is also true that short pulses in the error signal get convolved with an exponential decay with timescale tau_tutor in the tutor signal. However, a student circuit that is matched to the tutor circuit effectively performs the appropriate deconvolution to pick out the faster variations of the motor error even from slowlyvarying tutor signals. This is assured by our matching condition from Equation 4, and is demonstrated by the results of our spiking simulations.
For the purposes of this manuscript, we chose to not explicitly take into account how the tutor signal is split between isolated spikes and bursts. This is because we focused on the effects from average tutor firing rates, which are independent of this distinction, and also because our model is applicable to systems other than the vocal production mechanism in songbirds, where the precise balance between isolated spiking and bursting may be different. It would nevertheless be very interesting to study the way in which tutor bursting affects learning in a model such as ours. We hope to look into these topics in future work.
Additionally, as indicated just below Equation 4, the model assumes that tau_tutor>>tau_1,2. In many simulations (definitely in Figure 3C, also I think in Figure 5D, see my question above), tau_1 and tau_2 have values of 40 and 80 msec, which are values derived from the Mehaffey and Doupe STDP paper. This would suggest that tau_tutor should be >>80 msec, and that cases with tau_tutor and tau_tutor* less than this value (which make up a significant region of the parameter space shown in Figures 3C and 5D) are biologically implausible. This seems to me to be something the authors should address.
Indeed the values of tau_1 and tau_2 were chosen here to qualitatively match the STDP curve from Mehaffey and Doupe when the ratio of α and β is approximately equal to 1. However, our model applies more generally, and hence we explore a wider range of parameters. Our analysis also applies to systems where twostage learning happens in which the student plasticity can be modeled with different values for these timescales, or with a different ratio of α and β. This is why we consider a wide range of values for these parameters in our simulations.
Further, although our analytical derivation relies on the assumption tau_tutor>>tau_1,2, this is simply an approximation needed to make the calculations tractable. By running the simulations in parameter ranges in which this and other assumptions of our derivation are not valid, we can check the robustness of our matching condition beyond the range in which the strict derivation held true. Indeed, we find that many of the assumptions can apparently be relaxed without affecting the matching condition (such as replacing the ratebased dynamics by spiking), while others are more rigid (such as the condition tau_tutor>>tau_1,2). We have added a paragraph explaining these points towards the beginning of subsection “Matched vs. unmatched learning”.
https://doi.org/10.7554/eLife.20944.021Article and author information
Author details
Funding
Swartz Foundation
 Tiberiu Teşileanu
National Science Foundation
 Vijay Balasubramanian
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We would like to thank Serena Bradde for fruitful discussions during the early stages of this work. We also thank Xuexin Wei and Christopher Glaze for useful discussions. We are grateful to Timothy Otchy for providing us with some of the data we used in this paper. During this work VB was supported by NSF grant PHY1066293 at the Aspen Center for Physics and by NSF Physics of Living Systems grant PHY1058202. TT was supported by the Swartz Foundation.
Reviewing Editor
 Michael J Frank, Brown University, United States
Publication history
 Received: August 25, 2016
 Accepted: March 4, 2017
 Version of Record published: April 4, 2017 (version 1)
Copyright
© 2017, Teşileanu et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 2,499
 Page views

 389
 Downloads

 14
 Citations
Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading

 Neuroscience
Artificial neural networks could pave the way for efficiently simulating largescale models of neuronal networks in the nervous system.

 Neuroscience
A key question in decision making is how humans arbitrate between competing learning and memory systems to maximize reward. We address this question by probing the balance between the effects, on choice, of incremental trialanderror learning versus episodic memories of individual events. Although a rich literature has studied incremental learning in isolation, the role of episodic memory in decision making has only recently drawn focus, and little research disentangles their separate contributions. We hypothesized that the brain arbitrates rationally between these two systems, relying on each in circumstances to which it is most suited, as indicated by uncertainty. We tested this hypothesis by directly contrasting contributions of episodic and incremental influence to decisions, while manipulating the relative uncertainty of incremental learning using a wellestablished manipulation of reward volatility. Across two large, independent samples of young adults, participants traded these influences off rationally, depending more on episodic information when incremental summaries were more uncertain. These results support the proposal that the brain optimizes the balance between different forms of learning and memory according to their relative uncertainties and elucidate the circumstances under which episodic memory informs decisions.