Naturalgradient learning for spiking neurons
Abstract
In many normative theories of synaptic plasticity, weight updates implicitly depend on the chosen parametrization of the weights. This problem relates, for example, to neuronal morphology: synapses which are functionally equivalent in terms of their impact on somatic firing can differ substantially in spine size due to their different positions along the dendritic tree. Classical theories based on Euclideangradient descent can easily lead to inconsistencies due to such parametrization dependence. The issues are solved in the framework of Riemannian geometry, in which we propose that plasticity instead follows naturalgradient descent. Under this hypothesis, we derive a synaptic learning rule for spiking neurons that couples functional efficiency with the explanation of several welldocumented biological phenomena such as dendritic democracy, multiplicative scaling, and heterosynaptic plasticity. We therefore suggest that in its search for functional synaptic plasticity, evolution might have come up with its own version of naturalgradient descent.
Editor's evaluation
The natural gradient has a long and rich history in machine learning. Here, the authors derive a biologically plausible implementation of naturalgradientbased plasticity for spiking neurons, which renders learning invariant under dendritic transformations. This new synaptic learning rule makes several very interesting experimental predictions with respect to the interplay of homo and heterosynaptic plasticity, and with regard to the scaling of plasticity by the presynaptic variance.
https://doi.org/10.7554/eLife.66526.sa0Introduction
Understanding the fundamental computational principles underlying synaptic plasticity represents a longstanding goal in neuroscience. To this end, a multitude of topdown computational paradigms have been developed, which derive plasticity rules as gradient descent on a particular objective function of the studied neural network (Rosenblatt, 1958; Rumelhart et al., 1986; Pfister et al., 2006; D’Souza et al., 2010; Friedrich et al., 2011).
However, the exact physical quantity to which these synaptic weights correspond often remains unspecified. What is frequently simply referred to as ${w}_{ij}$ (the synaptic weight from neuron $j$ to neuron i) might relate to different components of synaptic interaction, such as calcium concentration in the presynaptic axon terminal, neurotransmitter concentration in the synaptic cleft, receptor activation in the postsynaptic dendrite or the postsynaptic potential (PSP) amplitude in the spine, the dendritic shaft or at the soma of the postsynaptic cell. All these biological processes can be linked by transformation rules, but depending on which of them represents the variable with respect to which performance is optimized, the network behavior during training can be markedly different.
As an example we consider the parametrization of the synaptic strength either as PSP amplitude in the soma, ${w}^{\mathrm{s}}$, or as PSP amplitude in the dendrite, ${w}^{\mathrm{d}}$ (see also Figure 1 and Sec. ‘The naive Euclidean gradient is not parametrizationinvariant’). Reparametrizing the synaptic strength in this way implies an attenuation factor for each single synapse, but different factors are assigned across the positions on the dendritic tree. As a consequence, the weight vector will follow a different trajectory during learning depending on whether the somatic or dendritic parametrization of the PSP amplitude was chosen.
It certainly could be the case that evolution has favored a parametrizationdependent learning rule, along with one particular parametrization over all others, but this would necessarily imply suboptimal convergence for all but a narrow set of neuron morphologies and connectome configurations. An invariant learning rule on the other hand would not only be mathematically unambiguous and therefore more elegant, but could also improve learning, thus increasing fitness.
In some aspects, the question of invariant behavior is related to the principle of relativity in physics, which requires the laws of physics – in our case: the improvement of performance during learning – to be the same in all frames of reference. What if neurons would seek to conserve the way they adapt their behavior regardless of, for example, the specific positioning of synapses along their dendritic tree? Which equations of motion – in our case: synaptic learning rules – are able to fulfill this requirement?
The solution lies in following the path of steepest descent not in relation to a small change in the synaptic weights (Euclideangradient descent), but rather with respect to a small change in the inputoutput distribution (naturalgradient descent). This requires taking the gradient of the error function with respect to a metric defined directly on the space of possible inputoutput distributions, with coordinates defined by the synaptic weights. First proposed in Amari, 1998, but with earlier roots in information geometry (Amari, 1987; Amari and Nagaoka, 2000), naturalgradient methods (Yang and Amari, 1998; Rattray and Saad, 1999; Park et al., 2000; Kakade, 2001) have recently been rediscovered in the context of deep learning (Pascanu and Bengio, 2013; Martens, 2014; Ollivier, 2015; Amari et al., 2019; Bernacchia et al., 2018). Moreover, Pascanu and Bengio, 2013 showed that the naturalgradient learning rule is closely related to other machine learning algorithms. However, most of the applications focus on ratebased networks which are not inherently linked to a statistical manifold and have to be equipped with Gaussian noise or a probabilistic output layer interpretation in order to allow an application of the natural gradient. Furthermore, a biologically plausible synaptic plasticity rule needs to make all the required information accessible at the synapse itself, which is usually unnecessary and therefore largely ignored in machine learning.
The stochastic nature of neuronal outputs invivo (see, e.g. Softky and Koch, 1993) provides a natural setting for plasticity rules based on information geometry. As a model for biological synapses, natural gradient combines the elegance of invariance with the success of gradientdescentbased learning rules. In this manuscript, we derive a closedform synaptic learning rule based on naturalgradient descent for spiking neurons and explore its implications. Our learning rule equips the synapses with more functionality compared to classical error learning by enabling them to adjust their learning rate to their respective impact on the neuron’s output. It naturally takes into account relevant variables such as the statistics of the afferent input or their respective positions on the dendritic tree. This allows a set of predictions which are corroborated by both experimentally observed phenomena such as dendritic democracy and multiplicative weight dynamics and theoretically desirable properties such as Bayesian reasoning (MarceauCaron and Ollivier, 2007). Furthermore, and unlike classical errorlearning rules, plasticity based on the natural gradient is able to incorporate both homo and heterosynaptic phenomena into a unified framework. While theoretically derived heterosynaptic components of learning rules are notoriously difficult for synapses to implement due to their nonlocality, we show that in our learning rule they can be approximated by quantities accessible at the locus of plasticity. In line with results from machine learning, the combination of these features also enables faster convergence during supervised learning.
Results
The naive Euclidean gradient is not parametrizationinvariant
We consider a cost function $C$ on the neuronal level that, in the sense of cortical credit assignment (see e.g. Sacramento et al., 2017), can relate to some behavioral cost of the agent that it serves. The output of the neuron depends on the amplitudes of the somatic PSPs elicited by the presynaptic spikes. We denote these ‘somatic weights’ by $\mathit{w}}^{\mathrm{s}$, and may parametrize the neuronal cost as $C={C}^{\phantom{\rule{thinmathspace}{0ex}}\mathrm{s}}[{\mathit{w}}^{\mathrm{s}}]$.
However, dendritic PSP amplitudes ${\mathit{w}}^{\mathrm{d}}$ can be argued to offer a more unmitigated representation of synaptic weights, so we might rather wish to express the cost as $C={C}^{\text{}\mathrm{d}}[{\mathit{w}}^{\mathrm{d}}]$. These two parametrizations are related by an attenuation factor $\mathit{\alpha}$ (between 0 and 1):
In general, this attenuation factor depends on the synaptic position and is therefore described by a vector that is multiplied componentwise with the weights. For clarity, we assume a simplified picture in which we neglect, for example, the spread of PSPs as they travel along dendritic cables. However, our observations regarding the effects of reparametrization hold in general.
It may now seem straightforward to switch between the somatic and dendritic representation of the cost by simply substituting variables, for example
To derive a plasticity rule for the somatic and dendritic weights, we might consider gradient descent on the cost:
At first glance, this relation seems reasonable: dendritic weight changes affect the cost more weakly then somatic weight changes, so their respective gradient is more shallow by the factor $\mathit{\alpha}$. However, from a functional perspective, the opposite should be true: dendritic weights should experience a larger change than somatic weights in order to elicit the same effect on the cost. This conflict can be made explicit by considering that somatic weight changes are, themselves, attenuated dendritic weight changes: $\mathrm{\Delta}{\mathit{w}}^{\mathrm{s}}=\mathit{\alpha}\mathrm{\Delta}{\mathit{w}}^{\mathrm{d}}$. Substituting this into Equation 3 leads to an inconsistency: $\mathrm{\Delta}{\mathit{w}}^{\mathrm{d}}={\mathit{\alpha}}^{2}\mathrm{\Delta}{\mathit{w}}^{\mathrm{d}}$. To solve this conundrum, we need to shift the focus from changing the synaptic input to changing the neuronal output, while at the same time considering a more rigorous treatment of gradient descent (see also Surace et al., 2020).
The naturalgradient plasticity rule
We consider a neuron with somatic potential $V$ (above a baseline potential ${V}_{\mathrm{rest}}$) evoked by the spikes $x}_{i$ of $n$ presynaptic afferents firing at rates $r}_{i$. The presynaptic spikes of afferent i cause a train of weighted dendritic potentials ${w}_{i}^{\mathrm{d}}{x}_{i}^{\u03f5}$ locally at the synaptic site. The $x}_{i$ denotes the unweighted synaptic potential (USP) train elicited by the lowpassfiltered spike train $x}_{i$. At the soma, each dendritic potential is attenuated by a potentially nonlinear function that depends on the synaptic location:
The somatic voltage above baseline thus reads as
We further assume that the neuron’s firing follows an inhomogeneous Poisson process whose rate
depends on the current membrane potential through a nonlinear transfer function $\varphi $. In this case, spiking in a sufficiently short interval $[t,t+\mathrm{d}t]$ is Bernoullidistributed. The probability of a spike occurring in this interval (denoted as ${y}_{t}=1$) is then given by
which defines our generalized linear neuron model (Gerstner and Kistler, 2002). Here, we used $\mathit{x}}^{\u03f5$ as a shorthand notation for the USP vector $({x}_{i}^{\u03f5})$. The full inputoutput distribution then reads
where the probability density of the input $\mathit{x}}_{t}^{\u03f5$ is independent of synaptic weights (see also Sec. ‘Detailed derivation of the naturalgradient learning rule’). In the following, we drop the time indices for better readability.
In view of this stochastic nature of neuronal responses, we consider a neuron that strives to reproduce a target firing distribution ${p}^{\ast}\phantom{\rule{negativethinmathspace}{0ex}}\left(y{\mathit{x}}^{\u03f5}\right)$. The required information is received as teacher spike train $Y}^{\ast$ that is sampled from ${p}^{*}$ (see Sec. ‘Sketch for the derivation of the somatic naturalgradient learning rule’ for details). In this context, plasticity may follow a supervisedlearning paradigm based on gradient descent. The KullbackLeibler divergence between the neuron’s current and its target firing distribution
represents a natural cost function which measures the error between the current and the desired output distribution in an informationtheoretic sense. Minimizing this cost function is equivalent to maximizing the loglikelihood of teacher spikes, since $p}^{\ast$ does not depend on $\mathit{w}$. Using naive Euclideangradient descent with respect to the synaptic weights (denoted by $\mathrm{\nabla}}_{\mathit{w}}^{\mathrm{E}$) results in the wellknown errorcorrecting rule (Pfister et al., 2006),
which is a spikebased version of the classical perceptron learning rule (Rosenblatt, 1958), whose multilayer version forms the basis of the errorbackpropagation algorithm (Rumelhart et al., 1986). On the singleneuron level, a possible biological implementation has been suggested by Urbanczik and Senn, 2014, who demonstrated how a neuron may exploit its morphology to store errors, an idea that was recently extended to multilayer networks (Sacramento et al., 2017; Haider et al., 2021).
However, as we argued above, learning based on Euclideangradient descent is not unproblematic. It cannot account for synaptic weight (re)parametrization, as caused, for example, by the diversity of synaptic loci on the dendritic tree. Equation 10 represents the learning rule for a somatic parametrization of synaptic weights. For a local computation of Euclidean gradients using a dendritic parametrization, weight updates decrease with increasing distance towards the soma (Equation 3), which is likely to harm the convergence speed toward an optimal weight configuration. With the multiplicative USP term $\mathit{x}}^{\u03f5$ in Equation 10 being the only manifestation of presynaptic activity, there is no mechanism by which to take into account input variability. This can, in turn, also impede learning, by implicitly assigning equal importance to reliable and unreliable inputs. Furthermore, when compared to experimental evidence, this learning rule cannot explain heterosynaptic plasticity, as it is purely presynaptically gated.
In general, Euclideangradient descent is wellknown to exhibit slow convergence in nonisotropic regions of the cost function (Ruder, 2016), with such nonisotropy frequently arising or being aggravated by an inadequate choice of parametrization (see Ollivier, 2015 and Figure 2). In contrast, naturalgradient descent is, by construction, immune to these problems. The key idea of natural gradient as outlined by Amari is to follow the (locally) shortest path in terms of the neuron’s firing distribution. Argued from a normative point of view, this is the only ‘correct’ path to consider, since plasticity aims to adapt a neuron’s behavior, that is, its inputoutput relationship, rather than some internal parameter (Figure 2). In the following, we therefore drop the index from the synaptic weights $\mathit{w}$ to emphasize the parametrizationinvariant nature of the natural gradient.
For the concept of a locally shortest path to make sense in terms of distributions, we require the choice of a distance measure for probability distributions. Since a parametric statistical model, such as the set of our neuron’s realizable output distributions, forms a Riemannian manifold, a local distance measure can be obtained in form of a Riemannian metric. The Fisher metric (Rao, 1945), an infinitesmial version of the ${D}_{\mathrm{KL}}$, represents a canonical choice on manifolds of probability distributions (Amari and Nagaoka, 2000). On a given parameter space, the Fisher metric may be expressed in terms of a bilinear product with the Fisher information matrix
The Fisher metric locally measures distances in the $p$manifold as a function of the chosen parametrization. We can then obtain the natural gradient (which intuitively may be thought of as ‘$\mathrm{\partial}C/\mathrm{\partial}p$’) by correcting the Euclidean gradient $\mathrm{\nabla}}_{\mathit{w}}^{\mathrm{E}}C:=\mathrm{\partial}C/\mathrm{\partial}\mathit{w$ with the distance measure above:
This correction guarantees invariance of the gradient under reparametrization (see also Sec. ‘Reparametrization and the general naturalgradient rule’). The naturalgradient learning rule is then given as $\dot{\mathit{w}}=\eta {\mathrm{\nabla}}_{\mathit{w}}^{\mathrm{N}}C$. Calculating the righthand expression for the case of Poissonspiking neurons (for details, see Detailed derivation of the naturalgradient learning rule and Inverse of the Fisher Information Matrix), this takes the form
where $\mathit{w}$ is an arbitrary weight parametrization that relates to the somatic amplitudes via a componentwise rescaling
Note that the attenuation function in Equation 4 represents a special case of Equation 14 for dendritic amplitudes.
We used the shorthand ${\gamma}_{\mathrm{s}}$, ${\gamma}_{\mathrm{u}}$, and ${\gamma}_{\mathrm{w}}$ for three scaling factors introduced by the natural gradient that we address in detail below. For easier reading, we use a shorthand notation in which multiplications, divisions and scalar functions of vectors apply componentwise.
Equation 13 represents the complete expression of our naturalgradient rule, which we discuss throughout the remainder of the manuscript. Note that, while having used a standard sigmoidal transfer function throughout the paper, Equation 13 holds for every sufficiently smooth $\varphi $.
Naturalgradient learning conserves both the error term $\left[{Y}^{\phantom{\rule{thinmathspace}{0ex}}\ast}\varphi (V)\right]$ and the USP contribution $\mathit{x}}^{\u03f5$ from classical gradientdescent plasticity. However, by including the relationship between the parametrization of interest $\mathit{w}$ and the somatic PSP amplitudes $\mathit{f}\phantom{\rule{thinmathspace}{0ex}}(\mathit{w})$, naturalgradientbased plasticity explicitly accounts for reparametrization distortions, such as those arising from PSP attenuation during propagation along the dendritic tree. Furthermore, naturalgradient learning introduces multiple scaling factors and new plasticity components, whose characteristics will be further explored in dedicated sections below (see also Sec. ‘Global scaling factor’ and ‘Empirical Analysis of $\gamma}_{\mathrm{u}$ and $\gamma}_{\mathrm{w}$’ for more details).
First of all, we note the appearance of two scaling factors (more details in ‘Input and outputspecific scaling’). On one hand, the size of the synaptic adjustment is modulated by a global scaling factor ${\gamma}_{\mathrm{s}}$, which adjusts synaptic weight updates to the characteristics of the output nonlinearity, similarly to the synapsespecific scaling by the inverse of $\mathit{f}}^{\phantom{\rule{thinmathspace}{0ex}}\mathrm{\prime}$. Furthermore ${\gamma}_{\mathrm{s}}$ also depends on the output statistics of the neuron, harmonizing plasticity across different states in the output distribution (see Sec. ‘Global scaling factor’). On the other hand, a second, synapsespecific learning rate scaling accounts for the statistics of the input at the respective synapse, in the form of a normalization by the afferent input rate $c}_{\u03f5}/\mathit{r$, where ${c}_{\u03f5}$ is a constant that depends on the PSP kernel (see Sec. ‘Neuron model’). Unlike the global modulation introduced by ${\gamma}_{\mathrm{s}}$, this scaling only affects the USPdependent plasticity component. Just as for Euclideangradientbased learning, the latter is directly evoked by the spike trains arriving at the synapse. Therefore, the resulting plasticity is homosynaptic, affecting only synapses which receive afferent input.
However, in the case of naturalgradient learning, this inputspecific adaptation is complemented by two additional forms of heterosynaptic plasticity (Sec. ‘Interplay of homosynaptic and heterosynaptic plasticity’). First, the learning rule has a bias term ${\gamma}_{\mathrm{u}}$ which uniformly adjusts all synapses and may be considered homeostatic, as it usually opposes the USPdependent plasticity contribution. The amplitude of this bias does not exclusively depend on the afferent input at the respective synapse, but is rather determined by the overall input to the neuron. Thus, unlike the USPdependent component, this heterosynaptic plasticity component equally affects both active and inactive synaptic connections. Furthermore, naturalgradient descent implies the presence of another plasticity component ${\gamma}_{\mathrm{w}}\phantom{\rule{thinmathspace}{0ex}}\mathit{f}\left(\mathit{w}\right)$ which adapts the synapses depending on their current weight. More specifically, connections that are already strong are subject to larger changes compared to weaker ones. Since the proportionality factor ${\gamma}_{\mathrm{w}}$ only depends on global variables such as the membrane potential, this component also affects both active and inactive synapses.
The full expressions for ${\gamma}_{\mathrm{s}}$, ${\gamma}_{\mathrm{u}}$, and ${\gamma}_{\mathrm{w}}$ are functions of the membrane potential, its mean and its variance, which represent synapselocal quantities. In addition, ${\gamma}_{\mathrm{u}}$ and ${\gamma}_{\mathrm{w}}$ also depend on the total input $\sum _{i=1}^{n}{x}_{i}^{\u03f5}$ and the total instantaneous presynaptic rate $\sum _{i=1}^{n}{r}_{i}$. However, under reasonable assumptions such as a high number of presynaptic partners and for a large, diverse set of empirically tested scenarios, we have shown that these factors can be reduced to simple functions of variables that are fully accessible at the locus of individual synapses:
where ${c}_{\u03f5}$, ${c}_{\mathrm{w}}$ are constants (Sec. ‘Global scaling factor’ and ‘Empirical Analysis of $\gamma}_{\mathrm{u}$ and $\gamma}_{\mathrm{w}$’). The above learning rule along with closedform expressions for these factors represent the main analytical findings of this paper.
In the following, we demonstrate that the additional terms introduced in naturalgradientbased plasticity confer important advantages compared to Euclideangradient descent, both in terms of of convergence as well as with respect to biological plausibility. More precisely, we show that our plasticity rule improves convergence in a supervised learning task involving an anisotropic cost function, a situation which is notoriously hard to deal with for Euclideangradientbased learning rules (Ruder, 2016). We then proceed to investigate naturalgradient learning from a biological point of view, deriving a number of predictions that can be experimentally tested, with some of them related to in vivo observations that are otherwise difficult to explain with classical gradientbased learning rules.
Naturalgradient plasticity speeds up learning
Nonisotropic cost landscapes can easily be provoked by nonhomogeneous input conditions. In nature, these can arise under a wide range of circumstances, for elementary reasons that boil down to morphology (the position of a synapse along a dendrite can affect the attenuation of its afferent input) and function (different afferents perform different computations and thus behave differently). To evaluate the convergence behavior of our learning rule and compare it to Euclideangradient descent, we considered a very generic situation in which a neuron is required to map a diverse set of inputs onto a target output.
In order to induce a simple and intuitive anisotropy of the error landscape, we divided the afferent population into two equally sized groups of neurons with different firing rates (Figure 3A, firing rates were chosen as 10 Hz for group one and 50 Hz for group 2). The input spikes were lowpassfiltered with a difference of exponentials (see Sec. ‘Neuron model for details’). This resulted in an asymmetric cost function (KLdivergence between the student and the target firing distribution, Equation 9), as visible from the elongated contour lines (Figure 3D and E). We further chose a realizable teacher by simulating a different neuron with the same input populations connected via a predefined set of target weights $\mathit{w}}^{\ast$. For the weight path plots in Figure 3D and E fixed target weight ${\mathit{w}}^{*}={(\frac{0.3}{n},\frac{0.5}{n})}^{T}$ was chosen, whereas for the learning curve in Figure 3F, the target weight components were randomly sampled from a uniform distribution on $[1/n,1/n]$ (see Sec. ‘Supervised Learning Task’ for further details). Figure 3B and C shows that our naturalgradient rule enables the student neuron to adapt its weights to reproduce the teacher voltage $V}^{\phantom{\rule{thinmathspace}{0ex}}\ast$ and thereby its output distribution.
In the following, we compare learning in two student neurons, one endowed with Euclideangradient plasticity (Equation 10, Figure 3D) and one with our naturalgradient rule (Equation 13, Figure 3E). To better visualize the difference between the two rules, we used a twodimensional input weight space, that is, one neuron per afferent population. While the negative Euclideangradient vectors stand, by definition, perpendicular to the contour lines of $C$, the negative naturalgradient vectors point directly towards the target weight configuration ${\mathit{w}}^{*}$. Due to the anisotropy of $C$ induced by the different input rates (see also Figure 2B), Euclideangradient learning starts out by mostly adapting the highrate afferent weight and only gradually begins learning the lowrate afferent. In contrast, natural gradient adapts both synaptic weights homogeneously. This is clearly reflected by paths traced by the synaptic weights during learning.
Overall, this lead to faster convergence of the naturalgradient plasticity rule compared to Euclideangradient descent. In order to enable a meaningful comparison, learning rates were tuned separately for each plasticity rule in order to optimize their respective convergence speed. The faster convergence of naturalgradient plasticity is a robust effect, as evidenced in Figure 3F by the average learning curves over 1,000 trials.
In addition to the functional advantages described above, naturalgradient learning also makes some interesting predictions about biology, which we address below.
Democratic plasticity
As discussed in the introduction, classical gradientbased learning rules do not usually account for neuron morphology. Since attenuation of PSPs is equivalent to weight reparametrization and our learning rule is, by construction, parametrizationinvariant, it naturally compensates for the distance between synapse and soma. In Equation 13, this is reflected by a componentwise rescaling of the synaptic changes with the inverse of the attenuation function $\mathit{f}}^{\phantom{\rule{thinmathspace}{0ex}}\mathrm{\prime}$, which is induced by the Fisher information metric (see also Figure 8 and the corresponding section in the Materials and methods). Under the assumption of passive attenuation along the dendritic tree, we have
where d_{i} denotes the distance of the ith synapse from the soma. More specifically, $\mathit{\alpha}(\mathit{d})={e}^{\mathit{d}/\lambda}$, where λ represents the electrotonic length scale. We can write the naturalgradient rule as
For functionally equivalent synapses (i.e. with identical input statistics), synaptic changes in distal dendrites are scaled up compared to proximal synapses. As a result, the effect of synaptic plasticity on the neuron’s output is independent of the synapse location, since dendritic attenuation is precisely counterbalanced by weight update amplification.
We illustrate this effect with simulations of synaptic weight updates at different locations along a dendritic tree in Figure 4. Such ‘democratic plasticity’, which enables distal synapses to contribute just as effectively to changes in the output as proximal synapses, is reminiscent of the concept of ‘dendritic democracy’ (Magee and Cook, 2000). These experiments show increased synaptic amplitudes in the distal dendritic tree of multiple cell types, such as rat hippocampal CA1 neurons; dendritic democracy has therefore been presumed to serve the purpose of giving distal inputs a ‘vote’ on the neuronal output. Still, experiments show highly diverse PSP amplitudes in neuronal somata (Williams and Stuart, 2002). Our plasticity rule refines the notion of democracy by asserting that learning itself rather than its end result is rescaled in accordance with the neuronal morphology.
Whether such democratic plasticity ultimately leads to distal and proximal synapses having the same effective vote at the soma depends on their respective importance towards reaching the target output. In particular, if synapses from multiple afferents that encode the same information are randomly distributed along the dendritic tree, then democratic plasticity also predicts dendritic democracy, as the scaling of weight changes implies a similar scaling of the final learned weights. However, the absence of dendritic democracy does not contradict the presence of democratic plasticity, as afferents from different cortical regions might target specific positions on the dendritic tree (see, e.g., Markram et al., 2004). Furthermore, also with democratic plasticity in place, the functional efficacy of synapses at different locations on the dendritic tree will change differently based on different initial conditions. The experimental findings by Froemke et al., 2005, who report an inverse relationship between the amount of plasticity at a synapse and its distance from soma, are therefore completely consistent with the predictions of our learning rule, since they compared synapses that were not initially functionally equivalent.
Note also that dendritic democracy could, in principle, be achieved without democratic plasticity. However, this would be much slower than with naturalgradient learning, especially for distal synapses, as discussed in Sec. ‘Naturalgradient plasticity speeds up learning’.
Input and outputspecific scaling
In addition to undoing distortions induced by, for example, attenuation, the naturalgradient rule predicts further modulations of the homosynaptic learning rate. The factor ${\gamma}_{\mathrm{s}}$ in Equation 13 represents an outputdependent global scaling factor (for both homo and heterosynaptic plasticity):
Together with the ${\varphi}^{\prime}/\varphi $ factor in Equation 13, the effective scaling of the learning rule is approximately $1/{\varphi}^{\prime}$. In other words, it increases the learning rate in regions where the sigmoidal transfer function is flat (see also Sec. ‘Global scaling factor’). This represents an unmediated reflection of the philosophy of naturalgradient descent, which finds the steepest path for a small change in output, rather than in the numeric value of some parameter. The desired change in the output requires scaling the corresponding input change by the inverse slope of the transfer function.
Furthermore, synaptic learning rates are inversely correlated to the USP variance ${\sigma}^{2}({\mathit{x}}^{\u03f5})$ (Figure 5). In particular, for the homosynaptic component, the scaling is exactly equal to
(see Equation 13 and Sec. ‘Neuron model’). In other words, naturalgradient learning explicitly scales synaptic updates with the (un)reliability – more specifically, with the inverse variance – of their input. To demonstrate this effect in isolation, we simulated the effects of changing the USP variance while conserving its mean. Moreover, to demonstrate its robustness, we independently varied two contributors to the input reliability, namely input rates (which enter ${\sigma}^{2}({\mathit{x}}^{\u03f5})$ directly) and synaptic time constants (which affect the PSPkerneldependent scaling constant ${c}_{\u03f5}$). Figure 5 shows how unreliable input leads to slower learning, with an inverse dependence of synaptic weight changes on the USP variance. We note that this observation also makes intuitive sense from a Bayesian point of view, under which any information needs to be weighted by the reliability of its source. Furthermore, the approximately inverse scaling with the presynaptic firing rate is qualitatively in line with observations from Aitchison and Latham, 2014 and Aitchison et al., 2021, although our interpretation and precise relationship is different.
Interplay of homosynaptic and heterosynaptic plasticity
One elementary property of update rules based on Euclideangradient descent is their presynaptic gating, that is, all weight updates are scaled with their respective synaptic input $\mathit{x}}^{\u03f5$. Therefore, they are necessarily restricted to homosynaptic plasticity, as studied in classical LTP and LTD experiments (Bliss and Lomo, 1973; Dudek and Bear, 1992). As discussed above, naturalgradient learning retains a rescaled version of this homosynaptic contribution, but at the same time predicts the presence of two additional plasticity components. Contrary to homosynaptic plasticity, these components also adapt synapses to currently nonactive afferents, given a sufficient level of global input. Due to their lack of input specificity, they give rise to heterosynaptic weight changes, a form of plasticity that has been observed in hippocampus (Chen et al., 2013; Lynch et al., 1977), cerebellum (Ito and Kano, 1982), and neocortex (Chistiakova and Volgushev, 2009), mostly in combination with homosynaptic plasticity. A functional interpretation of heterosynaptic plasticity, to which our learning rule also alludes, is as a prospective adaptation mechanism for temporarily inactive synapses such that, upon activation, they are already useful for the neuronal output.
Our naturalgradient learning rule Equation 13 can be more summarily rewritten as
where the three additive terms represent the variancenormalized homosynaptic plasticity, the uniform heterosynaptic plasticity and the weightdependent heterosynaptic plasticity:
with the common proportionality factor $\eta {\gamma}_{\mathrm{s}}\left[{Y}^{\phantom{\rule{thinmathspace}{0ex}}\ast}\varphi (V)\right]\frac{{\varphi}^{\mathrm{\prime}}(V)}{\varphi (V)}{\mathit{f}}^{\phantom{\rule{thinmathspace}{0ex}}\mathrm{\prime}}{\left(\mathit{w}\right)}^{1}$ composed of the learning rate, the outputdependent global scaling factor, the postsynaptic error, a sensitivity factor and the inverse attenuation function, in order of their appearance. The effect of these three components is visualized in Figure 6B. The homosynaptic term $\mathrm{\Delta}{\mathit{w}}^{\mathrm{h}\mathrm{o}\mathrm{m}}$ is experienced only by stimulated synapses, while the two heterosynaptic terms act on all synapses. The first heterosynaptic term $\mathrm{\Delta}{\mathit{w}}^{\mathrm{h}\mathrm{e}{\mathrm{t}}_{\mathrm{u}}}$ introduces a uniform adjustment to all components by the same amount, depending on the global activity level. For a large number of presynaptic inputs, it can be approximated by a constant (see Sec. ‘Empirical analysis of $\gamma}_{\mathrm{u}$ and $\gamma}_{\mathrm{w}$’). Furthermore, it usually opposes the homosynaptic change, which we address in more detail below.
In contrast, the contribution of the second heterosynaptic term $\mathrm{\Delta}{\mathit{w}}^{\mathrm{h}\mathrm{e}{\mathrm{t}}_{\mathrm{w}}}$ is weightdependent, adapting all synapses in proportion to their current strength. This corresponds to experimental reports such as Loewenstein et al., 2011, which found in vivo weight changes in the neocortex to be proportional to the spine size, which itself is correlated with synaptic strength (Asrican et al., 2007). Our simulations in Sec. ‘Empirical Analysis of $\gamma}_{\mathrm{u}$ and $\gamma}_{\mathrm{w}$’ show that $\mathrm{\Delta}{\mathit{w}}^{\mathrm{h}\mathrm{e}{\mathrm{t}}_{\mathrm{w}}}$ is roughly a linear function of the membrane potential (more specifically, its deviation with respect to its baseline), which is reflected by the last approximation in Equation 21. Since the latter can be interpreted as a scalar product between the afferent input vector and the synaptic weight vector, it implies that spikes transmitted by strong synapses have a larger impact on this heterosynaptic plasticity component compared to spikes arriving via weak synapses. Thus, weaker synapses require more persistent and strong stimulation to induce significant changes and ‘override the status quo’ of the neuron. Since, following a period of learning, afferents connected via weak synapses can be considered uninformative for the neuron’s target output, this mechanism ensures a form of heterosynaptic robustness towards noise.
The homo and heterosynaptic terms exhibit an interesting relationship. To illustrate the nature of their interplay, we simulated a simple experiment (Figure 7A) with varying initial synaptic weights for both active and inactive presynaptic afferents. Stimulated synapses (Figure 7B) are seen to undergo strong potentiation (LTP) for very small initial weights; the magnitude of weight changes decreases for larger initial amplitudes until the neuron’s output matches its teacher, at which point the sign of the postsynaptic error term flips. For even larger initial weights, potentiation at stimulated synapses therefore turns into depression (LTD), which becoms stronger for higher initial values of the stimulated synapses’ weights. This is in line with the error learning paradigm, in which changes in synaptic weights seek to reduce the difference between a neuron’s target and its output.
For unstimulated synapses (Figure 7C), we observe a reversed behavior. For small weights, the negative uniform term $\mathrm{\Delta}{\mathit{w}}^{\mathrm{h}\mathrm{e}{\mathrm{t}}_{\mathrm{u}}}$ dominates and plasticity is depressing. As for the homosynaptic case, the sign of plasticity switches when the weights become large enough for the error to switch sign. Therefore, in the regime where stimulated synapses experienced potentiation, unstimulated synapses are depressed and viceversa. This reproduces various experimental observations: on one hand, potentiation of stimulated synapses has often been found to be accompanied by depression of unstimulated synapses (Lynch et al., 1977), such as in the amygdala (Royer and Paré, 2003) or the visual cortex (Arami et al., 2013); on the other hand, when the postsynaptic error term switches sign, depression at unstimulated synapses transforms into potentiation (Wöhrl et al., 2007; Royer and Paré, 2003).
While plasticity at stimulated synapses is unaffected by the initial state of the unstimulated synapses, plasticity at unstimulated synaptic connections depends on both the stimulated and unstimulated weights. In particular, when either of these grow large enough, the proportional term $\mathrm{\Delta}{\mathit{w}}^{\mathrm{h}\mathrm{e}{\mathrm{t}}_{\mathrm{w}}}$ overtakes the uniform term $\mathrm{\Delta}{\mathit{w}}^{\mathrm{h}\mathrm{e}{\mathrm{t}}_{\mathrm{u}}}$ and heterosynaptic plasticity switches sign again. Thus, for very large weights (top right corner of Figure 7C), heterosynaptic potentiation transforms back into depression, in order to more quickly quench excessive output activity. This behavior is useful for both supervised and unsupervised learning scenarios (Zenke and Gerstner, 2017), where it was shown that pairing Hebbian terms with heterosynaptic and homeostatic plasticity is crucial for stability.
In summary, we can distinguish three plasticity regimes for naturalgradient learning (Figure 7D–G). In two of these regimes, heterosynaptic and homosynaptic plasticity are opposed (O1, O2), whereas in the third, they are aligned and lead to depression (S). The two opposing regimes are separated by the zeroerror equilibrium line, at which plasticity switches sign.
Discussion
As a consequence of the fundamentally stochastic nature of evolution, it is no surprise that biology withstands confinement to strict laws. Still, physicsinspired arguments from symmetry and invariance can help uncover abstract principles that evolution may have gradually discovered and implemented into our brains. Here, we have considered parametrization invariance in the context of learning, which, in biological terms, translates to the fundamental ability of neurons to deal with diversity in their morphology and inputoutput characteristics. This requirement ultimately leads to various forms of scaling and heterosynaptic plasticity that are experimentally welldocumented, but can not be accounted for by classical paradigms that regard plasticity as Euclideangradient descent. In turn, these biological phenomena can now be seen as a means to jointly improve and accelerate errorcorrecting learning.
Inspired by insights from information geometry, we applied the framework of natural gradient descent to biologically realistic neurons with extended morphology and spiking output. Compared to classical errorcorrecting learning rules, our plasticity paradigm requires the presence of several additional ingredients. First, a global factor adapts the learning rate to the particular shape of the voltagetospike transfer function and to the desired statistics of the output, thus addressing the diversity of neuronal response functions observed in vivo (Markram et al., 2004). Second, the homosynaptic component of plasticity is normalized by the variance of presynaptic inputs, which provides a direct link to Bayesian frameworks of neuronal computation (Aitchison and Latham, 2014; Jordan et al., 2020). Third, our rule contains a uniform heterosynaptic term that opposes homosynaptic changes, downregulating plasticity and thus acting as a homeostatic mechanism (Chen et al., 2013; Chistiakova et al., 2015). Fourth, we find a weightdependent heterosynaptic term that also accounts for the shape of the neuron’s activation function, while increasing its robustness towards noise. Finally, our naturalgradientbased plasticity correctly accounts for the somatodendritic reparametrization of synaptic strengths.
These features enable faster convergence on nonisotropic error landscapes, in line with results for multilayer perceptrons (Yang and Amari, 1998; Rattray and Saad, 1999) and ratebased deep neural networks (Pascanu and Bengio, 2013; Ollivier, 2015; Bernacchia et al., 2018). Importantly, our learning rule can be formulated as a simple, fully local expression, only requiring information that is available at the locus of plasticity.
We further note an interesting property of our learning rule, which it inherits directly from the Fisher information metric that underlies natural gradient descent, namely invariance under sufficient statistics (Cencov, 1972). This is especially relevant for biological neurons, whose stochastic firing effectively communicates information samples rather than explicit distributions. Thus, downstream computation is likely to require a reliable samplebased, that is, statistically sufficient, estimation of the afferent distribution’s parameters, such as the sample mean and variance. This singles out our naturalgradient approach from other secondorderlike methods as a particularly appealing framework for biological learning.
Many of the biological phenomena predicted by our invariant learning rule are reflected in existing experimental results. Our democratic plasticity can give rise to dendritic democracy, as observed by Magee and Cook, 2000. Moreover, it also relates to results by Letzkus et al., 2006 and Sjöström and Häusser, 2006 which describe a sign switch of synaptic plasticity between proximal and distal sites that is not easily reconcilable with naive gradient descent. As the authors themselves speculate, the most likely reason is the (partial) failure of action potential backpropagation through the dendritic tree. In some sense, this is a mirror problem to the issue we discuss here: while we consider the attenuation of input signals, these experimental findings could be explained by an attenuation of the output signal. A simple way of incorporating this aspect into our learning rule (but also into Euclidean rules) is to apply a (nonlinear) attenuation function to the teacher signal $Y}^{\phantom{\rule{thinmathspace}{0ex}}\ast$ (Equations 10 and 13). At a certain distance from the soma, this attenuation would become strong enough to switch the sign of the error term ${Y}^{\phantom{\rule{thinmathspace}{0ex}}\ast}\varphi (V)$ and thereby also, for example, LTP to LTD. However, addressing this at the normative level of natural gradient descent as opposed to a direct tweak of the learning rule is less straightforward.
Our rule also requires heterosynaptic plasticity, which has been observed in neocortex, as well as in deeper brain regions such as amygdala and hippocampus (Lynch et al., 1977; Engert and Bonhoeffer, 1997; White et al., 1990; Royer and Paré, 2003; Wöhrl et al., 2007; Chistiakova and Volgushev, 2009; Arami et al., 2013; Chen et al., 2013; Chistiakova et al., 2015), often in combination with homosynaptic weight changes. Moreover, we find that heterosynaptic plasticity generally opposes homosynaptic plasticity, which qualitatively matches many experimental findings (Lynch et al., 1977; White et al., 1990; Royer and Paré, 2003; Wöhrl et al., 2007) and can be functionally interpreted as an enhancement of competition. For very large weights, heterosynaptic plasticity aligns with homosynaptic changes, pushing the synaptic weights back to a sensible range (Chistiakova et al., 2015), as shown to be necessary for unsupervised learning (Zenke and Gerstner, 2017). In supervised learning it helps speed up convergence by keeping the weights in the operating range.
These qualitative matches of experimental data to the predictions of our plasticity rule provide a welldefined foundation for future, more targeted experiments that will allow a quantitative exploration of the relationship between heterosynaptic plasticity and natural gradient learning. In contrast to the mostly deterministic protocols used in the referenced literature, such experiments would require a Poissonlike, stochastic stimulation of the student neuron. To further facilitate a quantitative analysis, the experimental setup should, in particular, allow a clear separation between student and teacher, for example via a pairing protocol (see e.g. Royer and Paré, 2003, but note that their protocol was deterministic).
A further prediction that follows from our plasticity rule is the normalization of weight changes by the presynaptic variance. We would thus anticipate that increasing the variance in presynaptic spike trains (through, e.g. temporal correlations) should reduce LTP in standard plasticity induction protocols. Also, we expect to observe a significant dependence of synaptic plasticity on neuronal response functions and output statistics. For example, flatter response functions should correlate with faster learning, in contrast to the inverse correlation predicted by classical learning rules derived from Euclideangradient descent. These propositions remain to be tested experimentally.
By following gradients with respect to the neuronal output rather than the synaptic weights themselves, we were able to derive a parametrizationinvariant errorcorrecting plasticity rule on the singleneuron level. Errorcorrecting learning rules are an important ingredient in understanding biological forms of error backpropagation (Sacramento et al., 2017). In principle, our learning rule can be directly incorporated as a building block into spikebased frameworks of error backpropagation such as (Sporea and Grüning, 2013; Schiess et al., 2016). Based on these models, topdown feedback can provide a target for the somatic spiking of individual neurons, toward which our learning rule could be used to speed up convergence.
Explicitly and exactly applying natural gradient at the network level does not appear biologically feasible due to the existence of crossunit terms in the Fisher information matrix $G$. These terms introduce nonlocalities in the naturalgradient learning rule, in the sense that synaptic changes at one neuron depend on the state of other neurons in the network. However, methods such as the unitwise naturalgradient approach (Ollivier, 2015) could be employed to approximate the natural gradient using a blockdiagonal form of $G$. The resulting learning rule would not exhibit biologically implausible codependencies between different neurons in the network, while still retaining many of the advantages of full naturalgradient learning. For spiking networks, this would reduce global naturalgradient descent to our local rule for single neurons.
Materials and methods
Neuron model
Request a detailed protocolWe chose a Poisson neuron model whose firing rate depends on the somatic membrane potential above the resting potential $V}_{\text{rest}}=70\phantom{\rule{thinmathspace}{0ex}}\phantom{\rule{thinmathspace}{0ex}}\mathrm{m}\mathrm{V$. They relate via a sigmoidal activation function
with $\beta =0.3,\phantom{\rule{thinmathspace}{0ex}}\varphi =10\mathrm{m}\mathrm{V}$, and a maximal firing rate $\varphi}_{\mathrm{m}\mathrm{a}\mathrm{x}}=100\phantom{\rule{thinmathspace}{0ex}}\mathrm{H}\mathrm{z$. This means the activation function is centered at −60 mV, saturating around −50 mV. Note that the derivation of our learning rule does not depend on the explicit choice of activation function, but holds for arbitrary monotonically increasing, positive functions that are sufficiently smooth. For the sake of simplicity, refractoriness was neglected. For the same reason, we assumed synaptic input to be currentbased, such that incoming spikes elicit a somatic membrane potential above baseline given by
where
denotes the unweighted synaptic potential (USP) evoked by a spike train
of afferent i, and w_{i} is the corresponding synaptic weight. Here, ${t}_{i}^{\mathrm{f}}$ denote the firing times of afferent i, and the synaptic response kernel $\u03f5$ is modeled as
where ${\u03f5}_{0}$ is a scaling factor with units $\text{}\mathrm{mV}\text{}\mathrm{ms}$, and unless specified otherwise, we chose ${\u03f5}_{0}=1\text{}\mathrm{mV}\text{}\mathrm{ms}$. For slowly changing input rates, mean and variance of the stationary unweighted synaptic potential are then given as (Petrovici, 2016)
and
with
Equations 29 and 30 are exact for constant input rates. Note that it is Equation 30 that induces the inverse dependence of the homosynaptic term on the input rate discussed around Equation 19.
Unless indicated otherwise, simulations were performed with a membrane time constant ${\tau}_{\mathrm{m}}=10\text{}\mathrm{ms}$ and a synaptic time constant ${\tau}_{\mathrm{s}}=3\text{}\mathrm{ms}$. Hence, USPs had an amplitude of 60 mV and were normalized with respect to area under the curve and multiplied by the synaptic weights. Initial and target weights were chosen such that the resulting average membrane potential was within operating range of the activation function. As an example, in Figure 3F, the average initial excitatory weight was 0.005, corresponding to an EPSP amplitude of 300V.
In Figure 5, the scaling factor ${\u03f5}_{0}$ was additionally normalized proportionally to the input rate r_{i} at the synapse in order to keep the mean USP constant and allow a comparison based solely on the variance.
Sketch for the derivation of the somatic naturalgradient learning rule
Request a detailed protocolThe choice of a Poisson neuron model implies that spiking in a small interval $[t,t+\mathrm{d}t]$ is governed by a Poisson distribution. For sufficiently small interval lengths $\mathrm{d}t$, the probability of having a single spike in $[t,t+\mathrm{d}t]$ becomes Bernoulli with parameter $\varphi \left({V}_{t}\right)\mathrm{d}t$. The aim of supervised learning is to bring this distribution closer to a given target distribution with density ${p}^{*}$. The latter is delivered in form of a teacher spike train
where the probability of having a teacher spike (denoted as ${Y}^{*}=1$) in $[t,t+\mathrm{d}t]$ is Bernoulli with parameter
Note that to facilitate the convergence analysis, in Figure 3 we chose a realizable teacher ${\varphi}_{t}^{\ast}=\varphi \left(\sum _{i=1}^{n}{w}_{i}^{\ast}{x}_{i}^{\u03f5}\right)$ generated by a given set of teacher weights ${\mathit{w}}^{*}$.
We measure the error between the desired and the current inputoutput spike distribution in terms of the KullbackLeibler divergence given in Equation 9, which attains its single minimum when the two distributions are equal. Note that while the ${D}_{\mathrm{KL}}$ is a standard measure to characterize ‘how far’ two distributions are apart, its behavior can sometimes be slightly unintuitive since it is not a metric. In particular, it is not symmetric and does not satisfy the triangle inequality.
Classical error learning follows the Euclidean gradient of this cost function, given as the vector of partial derivatives with respect to the synaptic weights. A short calculation (Sec. ‘Detailed derivation of the naturalgradient learning rule’) shows that the resulting Euclideangradientdescent learning rule is given by Equation 10. By correcting the vector of partial derivatives for the distance distortion between the manifold of inputoutput distributions and the synaptic weight space, given in terms of the Fisher information matrix $G\left(\mathit{w}\right)$, we obtain the natural gradient (Equation 12). We then followed an approach by Amari, 1998 to derive an explicit formula for the product on the right hand side of Equation 12.
In Sec. ‘Detailed derivation of the naturalgradient learning rule’, we show that given independent input spike trains, the Fisher information matrix defined in Equation 11 can be decomposed with respect to the vector of input rates $\mathit{r}$ and the current weight vector $\mathit{w}$ as
Here,
is the covariance matrix of the unweighted synaptic potentials, and ${c}_{1},{c}_{2},{c}_{3}$ are coefficients (see Equation 82 for their definition) depending on the mean ${\mu}_{V}$ and variance ${\sigma}_{V}^{2}$ of the membrane potential, and on the total rate
Through repeated application of the ShermanMorrisonFormula, the inverse of $G\left(\mathit{w}\right)$ can be obtained as
Here the coefficients ${\gamma}_{\mathrm{s}},{g}_{1},{g}_{2},{g}_{3},{g}_{4}$, which are defined in Equation 102, are again functions of mean and variance of the membrane potential, and of the total rate. Consequently, the naturalgradient rule in terms of somatic amplitudes is given by
Note that the formulas for,
arise from the product of the inverse Fisher information matrix and the Euclidean gradient, using $\mathbf{1}}^{T}{\mathit{x}}^{\u03f5}=\sum _{i=1}^{n}{x}_{i}^{\u03f5$ and ${{\mathit{w}}^{\mathrm{s}}}^{T}{\mathit{x}}^{\u03f5}=V$. Due to the complicated expressions for ${g}_{1},\mathrm{\dots},{g}_{4}$ (Equation 82), Equation 39 only provides limited information about the behavior of ${\gamma}_{\mathrm{u}}$ and ${\gamma}_{\mathrm{w}}$. Therefore, we performed an empirical analysis based on simulation data (Sec. ‘Empirical analysis of $\gamma}_{\mathrm{u}$ and $\gamma}_{\mathrm{w}$’; Figure 11). In a stepwise manner we first evaluated ${g}_{1},\mathrm{\dots},{g}_{4}$ under various conditions, which revealed that the products with g_{2} and g_{3} in Equation 39 are neglible in most cases compared to the other terms, hence
Furthermore, for a sufficient number of input afferents, we can approximate $g}_{1}\approx {q}^{1$. Since $q={c}_{\u03f5}{\u03f5}_{0}\mathbb{E}\left[\sum _{i=1}^{n}{x}_{i}^{\u03f5}\right]$, by the central limit theorem we have
for large $n$ and ${c}_{\mathrm{u}}={\u03f5}_{0}=1$. Moreover, while the variance of g_{4} across weight samples increases with the number and firing rate of input afferents, its mean stays approximately constant across conditions. This lead to the approximation
where ${c}_{\mathrm{w}}$ is constants across weights, input rates and the number of input afferents.
To evaluate the quality of our approximations, we tested the performance of learning in the setting of Figure 3 when ${\gamma}_{\mathrm{u}}$ and ${\gamma}_{\mathrm{w}}$ were replaced by their approximations (Equations 112; 113). The test was performed for several input patterns (Sec. ‘Evaluation of the approximated naturalgradient rule’, Sec. ‘Performance of the approximated learning rule’). It turned out that a convergence behavior very similar to naturalgradient descent could be achieved with ${c}_{\mathrm{u}}=0.95$, which worked much better in practice than ${c}_{\mathrm{u}}=1$. For ${c}_{\mathrm{w}}$ a choice of 0.05 which was close to the mean of ${c}_{\mathrm{w}}$ worked well.
For these input rate configurations and choices of constants, we additionally sampled the negative gradient vectors for random initial weights and USPs (Sec. ‘Evaluation of the approximated naturalgradient rule’, Sec. ‘Performance of the approximated learning rule’) and compared the angles and length difference between naturalgradient vectors and the approximation to the ones between natural and Euclidean gradient.
Reparametrization and the general naturalgradient rule
Request a detailed protocolTo arrive at a more general form of the naturalgradient learning rule, we consider a parametrization $\mathit{w}$ of the synaptic weights which is connected to the somatic amplitudes via a smooth componentwise coordinate change $\mathit{f}=\left(\phantom{\rule{thinmathspace}{0ex}}{f}_{1},\dots ,{f}_{n}\right)$, such that ${w}_{i}^{\mathrm{s}}={f}_{i}\left({w}_{i}\right)$. A Taylor expansion shows that small weight changes then relate via the derivative of$\mathit{f}$
On the other hand, we can also express the cost function in terms of $\mathit{w}$, with $C[\mathit{w}]={C}^{\phantom{\rule{thinmathspace}{0ex}}\mathrm{s}}[\phantom{\rule{thinmathspace}{0ex}}\mathit{f}({\mathit{w}}^{\mathrm{s}})]$, and directly calculate the Euclidean gradient of $C$ in terms of $\mathit{w}$. By the chain rule, we then have
Plugging this into Equation 43, we obtain an inconsistency: $\mathrm{\Delta}{w}_{i}^{\mathrm{s}}={f}_{i}^{\phantom{\rule{thinmathspace}{0ex}}\mathrm{\prime}}{\left({w}_{i}\right)}^{2}\mathrm{\Delta}{w}_{i}^{\mathrm{s}}$. Hence the predictions of Euclideangradient learning depend on our choice of synaptic weight parametrization (Figure 1).
In order to obtain the naturalgradient learning rule in terms of $\mathit{w}$, we first express the Fisher Information Matrix in the new parametrization, starting with Equation 60
Inserting both Equation 44 and Equation 47 into Equation 12, we obtain the naturalgradient rule in terms of $\mathit{w}$ (Equation 13). As illustrated in Figure 8, unlike for Euclideangradient descent, the result is consistent with Equation 43.
Simulation details
All simulations were performed in python and used the numpy and scipy packages. Differential equations were integrated using a forward Euler method with a time step of 0.5 ms.
Supervised learning task
Request a detailed protocolA single output neuron was trained to spike according to a given target distribution in response to incoming spike trains from n independently firing afferents. To create an asymmetry in the input, we chose one half of the afferents’ firing rates as 10 Hz, while the remaining afferents fired at 50 Hz. The supervision signal consisted of spike trains from a teacher that received the same input spikes. To allow an easy interpretation of the results, we chose a realizable teacher, firing with rate $\varphi \left({V}^{\phantom{\rule{thinmathspace}{0ex}}\ast}\right)$, where $V}^{\phantom{\rule{thinmathspace}{0ex}}\ast}={\mathit{w}}^{\ast T}{\mathit{x}}^{\u03f5$ for some optimal set of weights $\mathit{w}}^{\ast$. However, our theory itself does not include assumptions about the origin and exact form of the teacher spike train.
For the learning curves in Figure 3F, initial and target weight components were chosen randomly (but identically for the two curves) from a uniform distribution on $\mathcal{U}\left(1/n,1/n\right)$, corresponding to maximal PSP amplitudes between −600 μV and 600 μV for the simulation with $n=100$ input neurons. Learning curves were averaged over 1000 initial and target weight configurations. In addition, the minimum and maximum values are shown in Figure 9, as well as the mean Euclidean distance between student and teacher weights and between student and teacher firing rates. We did not enforce Dales’ law, thus about half of the synaptic input was inhibitory at the beginning but sign changes were permitted. This means that the mean membrane potential above rest covered a maximal range [−30 mV, 30 mV]. Learning rates were optimized as $\eta}_{\mathrm{n}}=6\ast {10}^{4$ for the naturalgradientdescent algorithm and $\eta}_{\mathrm{e}}=4.5\ast {10}^{7$ for Euclideangradient descent, providing the fastest possible convergence to a residual root mean squared error in output rates of 0.8 Hz. To confirm that the convergence behavior of both naturalgradient and Euclideangradient learning is robust, we varied the learning rate between $0.5{\eta}_{0}$ and $2{\eta}_{0}$ and measured the time until the $D}_{\mathrm{K}\mathrm{L}$ first reached a value of $5*{10}^{5}$ (Figure 9). Here $\eta}_{0}={\eta}_{\mathrm{e}$ for Euclideangradient descent and $\eta}_{0}={\eta}_{\mathrm{n}$ for naturalgradient descent.
Per trial, the expectation over USPs in the cost function was evaluated on a randomly sampled test set of 50 USPs that resulted from input spike trains of 250 ms. The expectation over output spikes was calculated analytically.
For the weight path simulation with two neurons (Figure 3D–E), we chose a fixed initial weight $\mathit{w}}_{0}={\left(\frac{0.3}{n},\frac{0.5}{n}\right)}^{T$, a fixed target weight $\mathit{w}}^{\ast}={\left(\frac{0.15}{n},\frac{0.15}{n}\right)}^{T$, and learning rates $\eta}_{\mathrm{n}}=2.5\ast {10}^{4$ and $\eta}_{\mathrm{e}}=7.\ast {10}^{7$. Weight paths were averaged over 500 trials of 6000s duration each.
The vector plots in Figure 3D–E display the average negative normalized natural and Euclideangradient vectors across 2000 USP samples per synapse ($n=2$) on a grid of weight positions on $\left[\frac{0.4}{n},\frac{0.6}{n}\right]}^{2$, with the first coordinate of the gridpoints in $\left\{\frac{0.28}{n},\frac{0.1}{n},\frac{0.08}{n},\frac{0.26}{n},\frac{0.44}{n}\right\}$ and the second in $\left\{\frac{0.22}{n},\frac{0.02}{n},\frac{0.26}{n},\frac{0,5}{n}\right\}$. Each USP sample was the result of a $1\text{}\mathrm{s}$ spike train at rate ${r}_{1}=10\text{}\mathrm{Hz}$ and ${r}_{2}=50\text{}\mathrm{Hz}$ respectively. The contour lines were obtained from 2000 samples of the ${D}_{\mathrm{KL}}$ along a grid on $\left[\frac{0.4}{n},\frac{0.6}{n}\right]}^{2$ (distance between two grid points in one dimension: $\frac{0.006}{n}$) and displayed at the levels 0.001, 0.003, 0.005, 0.009, 0.015, 0.02, 0.03, 0.04.
For the plots in Figure 3B–C, we used initial, final, and target weights from a sample of the learning curve simulation. We then randomly sampled input spike trains of 250 ms length and calculated the resulting USPs and voltages according to Equation 26 and Equation 25. The output spikes shown in the raster plot were then sampled from a discretized Poisson process with $\mathrm{d}t=5.*{10}^{4}$. We then calculated the PSTH with a bin size of 12.5 ms.
Distance dependence of amplitude changes
Request a detailed protocolA single excitatory synapse received Poisson spikes at 5 Hz, paired with Poisson teacher spikes at 20 Hz. The distance from the soma was varied between 0 μm and 460 μm. Learning was switched on for 5 s with an initial weight corresponding to 0.05 at the soma, corresponding to a PSP amplitude of 3 mV. Initial dendritic weights were scaled up with the proportionality factor $\alpha {(d)}^{1}$ depending on the distance from the soma, in order for input spikes to result in the same somatic amplitude independent of the synaptic position. Example traces are shown for $\alpha {(d)}^{1}=3$ and $\alpha {(d)}^{1}=7$.
Variance dependence of amplitude changes
Request a detailed protocolWe stimulated a single excitatory synapse with Poisson spikes, while at the same time providing Poisson teacher spike trains at 80 Hz. To change USP variance independently from mean, unlike in the other exercises, the input kernel in Equation 28 was additionally normalized by the input rate. USP variance was varied by either keeping the input rate at 10 Hz while varying the synaptic time constant ${\tau}_{s}$ between 1 ms and 20 ms, or fixing ${\tau}_{s}$ at 20 ms and varying the input rate between 10 Hz and 50 Hz.
Comparison of homo and heterosynaptic plasticity
Out of $n=10$ excitatory synapses of a neuron, we stimulated 5 by Poisson spike trains at 5 Hz, together with teacher spikes at 20 Hz, and measured weight changes after 60 s of learning. To avoid singularities in the homosynaptic term of learning rule, we assumed in the learning rule that unstimulated synapses received input at some infinitesimal rate, thus effectively setting $\mathrm{\Delta}{\mathit{w}}^{\mathrm{hom}}=0$. Initial weights for both unstimulated and stimulated synapses were varied between $\frac{1}{n}$ and $\frac{5}{n}$. For reasons of simplicity, all stimulated weights were assumed to be equal, and tonic inhibition was assumed by a constant shift in baseline membrane potential of 5 mV. Example weight traces are shown for initial weights of $\frac{1.6}{n},\frac{2.5}{n},$ and $\frac{4.5}{n}$ for both stimulated and unstimulated weights. The learning rate was chosen as $\eta =0.01$.
For the plots in Figure 7B C, we chose the diverging colormap matplotlib.cm.seismic in matplotlib. The colormap was inverted such that red indicates negative values representing synaptic depression and blue indicates positive values representing synaptic potentiation. The colormap was linearly discretized into 500 steps. To avoid too bright regions where colors cannot be clearly distinguished, the indices 243–257 were excluded. The colorbar range was manually set to a symmetric range that includes the max/min values of the data. In Figure 7D, we only distinguish between regions where plasticity at stimulated and at unstimulated synapses have opposite signs (light green), and regions where they have the same sign (dark green).
Approximation of learningrule coefficents
Request a detailed protocolWe sampled the values for ${g}_{1},\mathrm{\dots},{g}_{4}$ from Equation 82 for different afferent input rates. The input rate $r$ was varied between 5 Hz and 55 Hz for $n=100$ neurons. The coefficients were evaluated for randomly sampled input weights (20 weight samples of dimension $n$, each component sampled from a uniform distribution $\mathcal{U}\left(5/n,5/n\right)$).
In a second simulation, we varied the number $n$ of afferents between 10 and 200 for a fixed input rate of 20 Hz, again for randomly sampled input weights (20 weight samples of dimension $n$, each component sampled from a uniform distribution $\mathcal{U}\left(5/n,5/n\right)$).
In a next step, we compared the sampled values of g_{1} as a function of the total input rate $n*r$ to the values of the approximation given by $g}_{1}\approx {q}^{1$ ($r$ between 5 Hz and 55 Hz, $n$ between 10 and 200 neurons, 20 weight samples of dimension $n$, each component sampled from a uniform distribution $\mathcal{U}\left(5/n,5/n\right)$).
Afterwards, we plotted the sampled values of ${\gamma}_{\mathrm{u}}$ as a function of the approximation $s$ (Equation 111, $r$ between 5 Hz and 55 Hz, $n=100$, 20 weight samples of dimension $n$, each component sampled from a uniform distribution $\mathcal{U}\left(5/n,5/n\right)$, 20 USPsamples of dimension $n$ for each rate/weightcombination).
Next, we investigated the behavior of $\gamma}_{\mathrm{w}$ as a function of ${g}_{4}V$ ($r$ between 5 Hz and 55 Hz, $n=100$, 20 weight samples of dimension $n$, each component sampled from a uniform distribution $\mathcal{U}\left(5/n,5/n\right)$, 20 USPsamples of dimension $n$ for each rate/weightcombination), and in last step, as a function of ${c}_{w}V$ with a constant ${c}_{w}=0.05$.
Evaluation of the approximated naturalgradient rule
Request a detailed protocolWe evaluated the performance of the approximated naturalgradient rule in Equation 114 (with ${c}_{\mathrm{u}}=0.95$ and ${c}_{\mathrm{w}}=0.05$) compared to Euclideangradient descent and the full rule in Equation 13 in the learning task of Figure 3 under different input conditions (n=100, Group 1: $10\text{}\mathrm{Hz}$/ Group 2: 30 Hz, Group 1: 10 Hz/ Group 2: 50 Hz, Group 1: 20 Hz/ Group 2: 20 Hz, Group 1: 20 Hz/ Group 2: 40 Hz). The learning curves were averaged over 1000 trials with input and target weight components randomly chosen from a uniform distribution on $\mathcal{U}\left(1/n,1/n\right)$. Learning rate parameters were tuned individually for each learning rule and scenario according to Table 1. All other parameters were the same as for Figure 3F.
For the angle histograms in Figure 12AB, we simulated the natural, Euclidean and approximated natural weight updates for several input and initial weight conditions. Similar to the setup in Figure 3 we separated the $n=100$ input afferents in two groups firing at different rates (Group1/Group2: 10 Hz/10 Hz, 10 Hz/30 Hz, 10 Hz/50 Hz, 20 Hz/20 Hz, 20 Hz/40 Hz). For each input pattern, 100 Initial weight components were sampled randomly from a uniform distribution $\mathcal{U}\left(5/n,5/n\right)$, while the target weight was fixed at ${\mathit{w}}^{*}={(\frac{0.15}{n},\frac{0.15}{n})}^{T}$. For each initial weight, 100long input spike trains were sampled and the average angle between the naturalgradient weight update and the approximated naturalgradient weight update at $t=1\text{}\mathrm{s}$ was calculated. The same was done for the average angle between the natural and the Euclidean weight update.
Detailed derivation of the naturalgradient learning rule
Request a detailed protocolHere, we summarize the mathematical derivations underlying our naturalgradient learning rule (Equation 13). While all derivations in Sec. ‘Detailed derivation of the naturalgradient learning rule’ and Sec. ‘Inverse of the Fisher Information Matrix’ are made for the somatic parametrization and can then be extended to other weight coordinates as described in Sec. ‘Reparametrization and the general naturalgradient rule’, we drop the index $\mathrm{s}$ in ${\mathit{w}}^{\mathrm{s}}$ for the sake of readability.
Supervised learning requires the neuron to adapt its synapses in such a way that its inputoutput distribution approaches a given target distribution with density ${p}^{*}$. For a given input spike pattern $\mathit{x}$, at each point in time, the probability for a Poisson neuron to fire a spike during the interval $[t,t+\mathrm{d}t]$ (denoted as ${y}_{t}=1$) follows a Bernoulli distribution with a parameter ${\varphi}_{t}\mathrm{d}t=\varphi \left({V}_{t}\right)\mathrm{d}t$, depending on the current membrane potential. The probability density of the binary variable y_{t} on {0, 1}, describing whether or not a spike occurred in the interval $[t,t+\mathrm{d}t]$, is therefore given by
and we have
where $p}_{\mathrm{u}\mathrm{s}\mathrm{p}$ denotes the probability density of the unweighted synaptic potentials $\mathit{x}}_{t}^{\u03f5$. Measuring the distance to the target distribution in terms of the KullbackLeibler divergence, we arrive at
Since the target distribution does not depend on the synaptic weights, the negative Euclidean gradient of the $D}_{\mathrm{K}\mathrm{L}$ equals
We may then calculate
where Equation 55 follows from the fact that $y}_{t}\in \{0,1\$ and for Equation 56 we neglected the term of order $\mathrm{d}{t}^{2}$ which is small compared to the remainder. Plugging Equation 56 into Equation 52 leads to the Euclideangradient descent online learning rule, given by
Here,
is the teacher spike train. We obtain the negative natural gradient by multiplying Equation 57 with the inverse Fisher information matrix, since
with the Fisher information matrix $G\left(\mathit{w}\right)$ at $\mathit{w}$ being defined as
Exploiting that, just like for the target density, the density of the USPs does not depend on $\mathit{w}$, so
we can insert the previously derived formula Equation 56 for the partial derivative of the loglikelihood. Hence, using the tower property for expectation values and the definition of $p}_{\mathit{w}$ (Equation 49, Equation 50), Equation 60 transforms to
In order to arrive at an explicit expression for the naturalgradient learning rule, we further decompose the Fisher information matrix, which will then enable us to find a closed expression for its inverse.
Inspired by the approach in Amari, 1998, we exploit the fact that a positive semidefinite matrix is uniquely defined by its values as a bivariate form on any basis of ${\mathbb{R}}^{n}$. Choosing a basis for which the bilinear products with $G\left(\mathit{w}\right)$ are of a particularly simple form, we are able to decompose the Fisher Information Matrix by constructing a sum of matrices whose values as a bivariate form on the basis equal are equal to those of $G\left(\mathit{w}\right)$. Due to the structure of this particular decomposition, we may then apply wellknown formulas for matrix inversion to obtain $G{\left(\mathit{w}\right)}^{1}$.
Consider the basis $\mathcal{B}=\{\mathit{w},{\mathit{b}}_{1},\dots ,{\mathit{b}}_{n1}\}$ such that the vectors $\sqrt{{\mathrm{\Sigma}}_{\mathrm{u}\mathrm{s}\mathrm{p}}}\phantom{\rule{thinmathspace}{0ex}}\mathit{w},\sqrt{{\mathrm{\Sigma}}_{\mathrm{u}\mathrm{s}\mathrm{p}}}\phantom{\rule{thinmathspace}{0ex}}{\mathit{b}}_{1},\dots ,\sqrt{{\mathrm{\Sigma}}_{\mathrm{u}\mathrm{s}\mathrm{p}}}\phantom{\rule{thinmathspace}{0ex}}{\mathit{b}}_{n1}$ are orthogonal to each other. Here, $\mathrm{\Sigma}}_{\mathrm{u}\mathrm{s}\mathrm{p}$ denotes the covariance matrix of the USPs which in the case of independent Poisson input spike trains is given as
In this case, the matrix square root reduces to the componentwise square root. Note that for any $\mathit{b},{\mathit{b}}^{\mathbf{\prime}}\in \mathcal{B}$ with $\mathit{b}\ne {\mathit{b}}^{\mathbf{\prime}}$, the random variables $\mathit{b}}^{T}{\mathit{x}}_{t}^{\u03f5$ and $\mathit{b}}^{\mathrm{\prime}T}{\mathit{x}}_{t}^{\u03f5$ are uncorrelated, since
We make the mild assumptions of having small afferent populations firing at the same input rate, and that the basis $\mathcal{B}$ is constructed in such way that the basis vectors are not too close to the coordinate axes, such that the products $\mathit{b}}^{T}{\mathit{x}}_{t}^{\u03f5$ are not dominated by a single component. Then, for sufficiently large $n$, every linear combination of the random variables $\mathit{b}}^{T}{\mathit{x}}_{t}^{\u03f5$ and $\mathit{b}}^{\mathrm{\prime}T}{\mathit{x}}_{t}^{\u03f5$ is approximately normally distributed, thus, the two random variables follow a joint bivariate normal distribution. Furthermore, uncorrelated random variables that are jointly normally distributed are independent. Since functions of independent random variables are also independent, this allows us to calculate all products of the form $\mathit{b}}^{T}G\left(\mathit{w}\right){\mathit{b}}^{\mathbf{\prime}$ for $\mathit{b},{\mathit{b}}^{\mathbf{\prime}}\in \mathcal{B}\setminus \{\mathit{w}\}$, as we can transform the expectation of products
into products of expectations. Taking into account that $V(t)={\mathit{w}}^{T}{\mathit{x}}_{t}^{\u03f5}$ and $\mathbb{E}\left[{\mathit{x}}^{\u03f5}\right]=\mathit{r}$, we arrive at
Here, $I}_{1},{I}_{2},{I}_{3$ denote the generalized voltage moments given as
The integral formulas follow from the fact that for a large number of input afferents, and under the mild assumption of all synaptic weights roughly being of the same order of magnitude, the membrane potential approximately follows a normal distribution with mean $\mu}_{\mathrm{v}}={\u03f5}_{0}\sum _{i=1}^{n}{w}_{i}{r}_{i$ and variance $\sigma}_{\mathrm{v}}^{2}=\frac{1}{{c}_{\u03f5}}\sum _{i=1}^{n}{w}_{i}^{2}{r}_{i$. Here, $\u03f5}_{0}=1\text{mV ms$ and $c}_{\u03f5$ is given by Equation 31, see Sec. ‘Neuron model’.
Based on the above calculations, we construct a candidate $\stackrel{~}{G}(\mathit{w})$ for a decomposition of $G\left(\mathit{w}\right)$. We start with the matrix ${c}_{1}\left({\u03f5}_{0}^{2}\mathit{r}{\mathit{r}}^{T}+{\mathrm{\Sigma}}_{\mathrm{u}\mathrm{s}\mathrm{p}}\right)$, since its easy to see that
Exploiting the orthogonal properties according to which we constructed $\mathcal{B}$ to carefully add more terms such that also the other identities in Equation 73 hold, we arrive at
Here,
To check that indeed $\stackrel{~}{G}(\mathit{w})=G\left(\mathit{w}\right)$, it suffices to check the values of $\stackrel{~}{G}$ as a bilinear form on the basis $\mathcal{B}$.
Inverse of the Fisher information matrix
Request a detailed protocolAs the expectation of the outer product of a vector with itself, $G\left(\mathit{w}\right)$ is per construction symmetric and positive semidefinite. From the previous calculations, it follows that for elements $b$ of a basis $\mathcal{B}$ of ${\mathbb{R}}^{n}$ the products $\mathit{b}}^{T}G\left(\mathit{w}\right)\mathit{b$ are strictly positive. Hence $G\left(\mathit{w}\right)$ is positive definite and thus invertible. We showed that,
We introduce the notation.
Then,
For the following calculations, we will repeatedly use the identities
By the ShermanMorrisonWoodbury formula, the inverse of an invertible rank one correction $(A+\mathit{u}{\mathit{v}}^{T})$ of an invertible matrix A is given by
Applying this to invert $G\left(\mathit{w}\right)$, as a first step, we consider the term $M}_{1}+{M}_{2$ and identify ${M}_{1}=A$ and $M}_{2}=\mathit{u}{\mathit{v}}^{T$. Its inverse is given by
with
Applying the ShermanMorrisonWoodbury formula a second time, this time with ${M}_{1}+{M}_{2}=A$ and $M}_{3}=\mathit{u}{\mathit{v}}^{T$, we obtain
Here,
Plugging in the definitions of M_{3} and M_{1}, and using that
we arrive at
After resorting the terms and grouping, we obtain the inverse of the Fisher Information Matrix as
with
Note that instead of applying the ShermanMorrisonWoodbury formula twice one could also directly use the version for rank2 corrections, known as the Woodbury identity (Max, 1950). This can be seen by rewriting Equation 84 as
with
and
By the Woodbury identity for rank2 corrections, we then have
with $\stackrel{~}{U}={c}_{1}^{1}{\mathrm{\Sigma}}_{\mathrm{u}\mathrm{s}\mathrm{p}}^{1}U$.
Analysis of the learning rule coefficients
In order to gain an intuitive understanding of Equation 13 and to judge its suitability as an invivo plasticity rule, we require insights into the behavior of the coefficients ${\gamma}_{\mathrm{s}},{\gamma}_{\mathrm{u}},$ and $\gamma}_{\mathrm{w}$ under various circumstances.
Global scaling factor
Request a detailed protocolThe global scaling factor $\gamma}_{\mathrm{s}$ is given as
The above formula reveals that $\gamma}_{\mathrm{s}$ is closely tied to the firing nonlinearity of the neuron, as well as the statistics of output. The scaling by the inverse slope of the output nonlinearity amplifies the synaptic update in regions where a small change in weight and thus in the membrane potential would not lead to a noticeable change in output distribution. A further scaling with $\varphi$ additionally amplifies the synaptic change in highoutput regimes (Figure 10). This is in line with the spirit of natural gradient that ensures that the size of the synaptic weight update is homogeneous in terms of $D}_{\mathrm{K}\mathrm{L}$ change rather than in absolute weight terms. Furthermore, the rescaling is based on the average output statistics, which the synapse might have access to via backpropagating action potentials (Stuart et al., 1997), rather than an instantaneous value.
Empirical analysis of $\gamma}_{\mathrm{u}$ and $\gamma}_{\mathrm{w}$
Request a detailed protocolWhile the formula for $\gamma}_{\mathrm{s}$ provides a rather good intuition of this coefficients’ behavior, from the derivations in the previous sections, it becomes clear that such a straightforward interpretation is not readily available from the formulas for $\gamma}_{\mathrm{u}$ and $\gamma}_{\mathrm{w}$. Defined as
and
where $g}_{1},\dots {g}_{4$ are given in Equation 102, these coefficients depend, apart from the membrane potential and its first and second moments, on the total input and its mean the total rate. This raises the question whether the synapse, which in general only has limited access to global quantities, can implement $\gamma}_{\mathrm{u}$ and $\gamma}_{\mathrm{w}$. We therefore used an empirical analysis through simulations to obtain a more detailed insight.
As a starting point, we sampled $g}_{1},\dots ,{g}_{4$ for various input conditions (Figure 11A–D, refer to Sec. ‘Approximation of learningrule coefficents’ for simulation details). Here, we varied the afferent input rate $r$ between $5\text{}\mathrm{Hz}$ and $55\text{}\mathrm{Hz}$ for $n=100$ neurons and evaluated the value of the respective coefficient for a randomly sampled input weight. In a second simulation (Figure 11E–H, Sec. ‘Approximation of learningrule coefficents’), we varied the number $n$ of afferent inputs between 10 and 200 neurons for fixed input rate $20\text{}\mathrm{Hz}$, again with randomly chosen input weights. This revealed an approximately inverse proportional relationship between the average input rate $r$ and $g}_{1},{g}_{2$ and g_{3} respectively. Furthermore, these coefficients seemed to be approximately inversely proportional to the number $n$ of afferent inputs. However, this was not true for g_{4}, whose mean seemed to stay approximately constants across input rates, although the scattering across the mean value (for different weight samples) increased.
While the value of the membrane potential stays bounded also for a large number of inputs (due to the normalization of the synaptic weight range with $\frac{1}{n}$), the total sum of USPs increases with an increasing number of inputs. Therefore, for large enough $n$, the term ${g}_{3}V$ can be neglected, so
A closer look at the behavior of g_{1} shows that, for a sufficiently large number of neurons or high input rates, $g}_{1}\approx {q}^{1$, which we verified by simulations (Figure 11I, Sec. ‘Approximation of learningrule coefficents’). In consequence,
To verify this approximation, we sampled the values of $\gamma}_{\mathrm{u}$ for various conditions (Figure 11J, Sec. ‘Approximation of learningrule coefficents’) and compared them against the approximation in Equation 111, which confirmed our approximation. Since the input rate is the mean value of the USP, assuming large enough populations with the same input rate and a sufficient number of input afferents, by the central limit theorem we have
with $c}_{\mathrm{u}}={\u03f5}_{0$. However, in practice, for ${\u03f5}_{0}=1\text{}\mathrm{mV}\text{}\mathrm{ms}$, a learning behavior much closer to natural gradient was obtained when $c}_{\mathrm{u}$ was slightly smaller than 1 such as ${c}_{\mathrm{u}}=0.95$ (Sec. ‘Quadratic transfer function’).
As a starting point to approximate $\gamma}_{\mathrm{w}$, we noticed that the mean of g_{4} stayed approximately constant when varying the input rate or the number of input afferents. On the other hand, g_{2} rapidly tends to zero in those cases, so we assumed that $g}_{2}\sum _{i=1}^{n}{x}_{i}^{\u03f5$ stays either constant or goes to zero in the limit of large n. Since $c}_{\u03f5}{\u03f5}_{0}{g}_{2$ seemed to be rather small compared to g_{4}, we hypothesized ${\gamma}_{\mathrm{w}}\approx {g}_{4}V$, which was confirmed by simulations (Figure 11K, Sec. ‘Approximation of learningrule coefficents’). As a second step (Figure 11L, Sec. ‘Approximation of learningrule coefficents’), since g_{4} seemed to be constant in the mean, we approximated
where simulations with ${c}_{\mathrm{w}}=0.05$, close to the mean of g_{4} showed a learning behavior close to naturalgradient learning (Figure 12C–F). Replacing $\gamma}_{\mathrm{u}$ and $\gamma}_{\mathrm{w}$ in Equation 13 by the expressions in Equation 112 and Equation 113, we obtain the approximated naturalgradient rule
Performance of the approximated learning rule
Request a detailed protocolSimulations of natural, Euclidean and approximated naturalgradient weight updates for several input patterns and randomly sampled initial conditions (Sec. ‘Evaluation of the approximated naturalgradient rule’) showed that the average angles (both in the Euclidean metric and in the Fisher metric) between the true and approximated naturalgradient weight update were small compared to the average angle between Euclidean and naturalgradient weight update (Figure 12A–B). This was confirmed by the learning curves for several tested input conditions in the setting of Figure 3, since the performance of the approximation lay in between the natural and the Euclidean gradient’s performance (Figure 12C–F, simulation details: Sec. ‘Evaluation of the approximated naturalgradient rule’). It can hence be regarded as a tradeoff between optimal learning speed, parameter invariance and biological implementability. Note that plugging the above calculations into Equation 13 still keeps invariance under coordinatewise smooth parameter changes.
Quadratic transfer function
Request a detailed protocolWhile for all our simulations, we used the sigmoidal transfer function given in Equation 24, the derivations that lead to Equation 100 also hold for other choices of transfer function. A particularly simple and thereby instructive result can be obtained for a transfer function $\varphi $ that satisfies
This is the case for a rectified quadratic transfer function
where $\mathrm{\Theta}$ is the Heaviside step function. To prevent issues with division by zero and to ensure that Equation 115 holds for all relevant membrane voltages $V$, we assume $\theta \ll {\mu}_{\mathrm{v}}{\sigma}_{\mathrm{v}}$. Then, Equation 67 reduces to
where the right side no longer depends on $\mathit{w}$, thus yielding ${\gamma}_{\mathrm{w}}=0$. Furthermore, we have
and therefore ${c}_{2}={c}_{3}=0$ (see Equation 82). Plugging this into Equation 81, we obtain
with $\mathrm{\Sigma}}_{\mathrm{u}\mathrm{s}\mathrm{p}$ defined in Equation 35. Inverting this using the ShermanMorrisonWoodbury Formula, we arrive at
Inserting this version of the Fisher information matrix into Equation 12, we get ${\gamma}_{\mathrm{s}}=1$ and ${\gamma}_{\mathrm{w}}=0$ (see Equation 107 and 109), thus obtaining a simplified version of the naturalgradient learning rule as
with $q$ as in Equation 36. This is in line with our empirical approximation for $\gamma}_{\mathrm{u}$ for the case of the sigmoidal transfer function (Equation 111). It is interesting to note that, for a large number of inputs, we have $\gamma}_{\mathrm{u}}\to {c}_{\u03f5$ and the average of the parenthesis in Equation 123, taken in isolation, reduces to zero. However, this does not mean that, on average, there is no plasticity. Instead, we can say that, for a quadratic activation regime, the natural gradient is purely driven by the correlation between the postsynaptic error and the (normalized and centered) presynaptic input.
Continuoustime limit
Request a detailed protocolUnder the Poissonprocess assumption, firing a spike in one of the time bins is independent from the spikes in other bins, therefore the probability for firing in two disjunct time intervals $[{t}_{1}+\mathrm{d}t]$ and $[{t}_{2}+\mathrm{d}t]$ is given as
Since $\mathbb{E}{\left[\frac{\mathrm{\partial}\mathrm{log}{p}_{\mathit{w}}}{\mathrm{\partial}\mathit{w}}\right]}_{{p}_{\mathit{w}}}=0$, we have
so the Fisher information matrix is additive.
In the continuoustime limit, where the interval $[0,T]$ is decomposed as $[0,T]={\cup}_{i=0}^{k1}[{t}_{i},{t}_{i+1}]$, with ${t}_{0}=0,{t}_{k}=T$ and $k\u27f6\mathrm{\infty}$, we have $\mathrm{d}t=\frac{T}{k}\u27f60$. Therefore,
Then, under the assumption that $T$ is small and firing rates are approximately constant on $[0,T]$, for the Fisher information matrix, we have
Data availability
The data used to generate the figures and the simulation code are available on GitHub, (copy archived at swh:1:rev:4da5611348ea969aff461886f85842fe36a8571a).
References

Synaptic plasticity as Bayesian inferenceNature Neuroscience 24:565–571.https://doi.org/10.1038/s41593021008095

BookDifferential geometrical theory of StatisticsIn: Amari S, editors. In Differential Geometry in Statistical Inference. Hayward: Institute of Mathematical Statistics. pp. 19–94.

Natural Gradient Works Efficiently in LearningNeural Computation 10:251–276.https://doi.org/10.1162/089976698300017746

ConferenceFisher information and natural gradient learning in random deep networksIn The 22nd International Conference on Artificial Intelligence and Statistics. pp. 694–702.

BookMethods of Information Geometry. Translations of Mathematical MonographsAmerican Mathematical Society.

Synaptic strength of individual spines correlates with bound Ca2+calmodulindependent kinase IIThe Journal of Neuroscience 27:14007–14011.https://doi.org/10.1523/JNEUROSCI.358707.2007

ConferenceExact natural gradient in deep linear networks and its application to the nonlinear caseAdvances in Neural Information Processing Systems. pp. 5941–5950.

BookOptimal Decision Rules and Optimal InferenceAmerican Mathematical Society, Rhode Island. Translation from Russian.

Heterosynaptic plasticity prevents runaway synaptic dynamicsThe Journal of Neuroscience 33:15915–15929.https://doi.org/10.1523/JNEUROSCI.508812.2013

Heterosynaptic plasticity in the neocortexExperimental Brain Research 199:377–390.https://doi.org/10.1007/s0022100918595

Homeostatic role of heterosynaptic plasticity: models and experimentsFrontiers in Computational Neuroscience 9:89.https://doi.org/10.3389/fncom.2015.00089

Spatiotemporal credit assignment in neuronal population learningPLOS Computational Biology 7:e1002092.https://doi.org/10.1371/journal.pcbi.1002092

BookSpiking Neuron Models: Single Neurons, Populations, PlasticityCambridge University Press.https://doi.org/10.1017/CBO9780511815706

ConferenceLatent equilibrium: A unified learning theory for arbitrarily fast computation with arbitrarily slow neuronsAdvances in Neural Information Processing Systems.

ConferenceA natural policy gradientAdvances in Neural Information Processing Systems. pp. 1531–1538.

Learning rules for spike timingdependent plasticity depend on dendritic synapse locationThe Journal of Neuroscience 26:10420–10429.https://doi.org/10.1523/JNEUROSCI.265006.2006

ConferenceNatural Langevin dynamics for neural networksIn International Conference on Geometric Science of Information. pp. 451–459.https://doi.org/10.1007/9783319684451

Interneurons of the neocortical inhibitory systemNature Reviews. Neuroscience 5:793–807.https://doi.org/10.1038/nrn1519

Riemannian metrics for neural networks I: feedforward networksInformation and Inference 4:108–153.https://doi.org/10.1093/imaiai/iav006

Information and the accuracy attainable in the estimation of statistical parametersBull. Calcutta Math. Soc 37:81–91.

Analysis of natural gradient descent for multilayer neural networksPhysical Review E 59:4523–4532.https://doi.org/10.1103/PhysRevE.59.4523

The perceptron: A probabilistic model for information storage and organization in the brainPsychological Review 65:386–408.https://doi.org/10.1037/h0042519

Somatodendritic Synaptic Plasticity and Errorbackpropagation in Active DendritesPLOS Computational Biology 12:e1004638.https://doi.org/10.1371/journal.pcbi.1004638

The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPsThe Journal of Neuroscience 13:334–350.

Supervised learning in multilayer spiking neural networksNeural Computation 25:473–509.https://doi.org/10.1162/NECO_a_00396

Action potential initiation and backpropagation in neurons of the mammalian CNSTrends in Neurosciences 20:125–131.https://doi.org/10.1016/s01662236(96)100758

On the choice of metric in gradientbased theories of brain functionPLOS Computational Biology 16:e1007640.https://doi.org/10.1371/journal.pcbi.1007640

Dependence of EPSP efficacy on synapse location in neocortical pyramidal neuronsScience (New York, N.Y.) 295:1907–1910.https://doi.org/10.1126/science.1067903

Acute and longterm effects of MK801 on direct cortical input evoked homosynaptic and heterosynaptic plasticity in the CA1 region of the female ratThe European Journal of Neuroscience 26:2873–2883.https://doi.org/10.1111/j.14609568.2007.05899.x

Hebbian plasticity requires compensatory processes on multiple timescalesPhilosophical Transactions of the Royal Society of London. Series B, Biological Sciences 372:20160259.https://doi.org/10.1098/rstb.2016.0259
Decision letter

Peter LathamReviewing Editor; University College London, United Kingdom

John R HuguenardSenior Editor; Stanford University School of Medicine, United States
In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.
Decision letter after peer review:
Thank you for submitting your article "Naturalgradient learning for spiking neurons" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by a Reviewing Editor and John Huguenard as the Senior Editor. The reviewers have opted to remain anonymous.
The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.
Essential revisions:
This paper derives a natural gradient learning rule for a spiking neuron in a supervised setting. Unlike conventional gradients in Euclidean space, the natural gradient is invariant under reparameterization, thus achieving fast convergence regardless of the position of synaptic contacts on the dendritic tree. The authors relate their rule to experimentally observed properties of synaptic plasticity, such as heterosynaptic regularization. We're not aware of any other work that applied natural gradient to a spiking neuron in a formal way, and we believe it is an important contribution.
That said, we do have several comments. All are relatively minor (many have to do with presentation), and, we hope, easy to address. Because this is a combination of three reviews, they're not exactly in order of appearance in the manuscript. But hopefully not too far off.
1. We're slightly concerned with the fact that the weight update scales inversely with the firing rate. First, presumably for small enough firing rates something in the derivation breaks down. Second, this makes a very strong prediction. The only data we know of that relates to this is in Aitchison and Latham, 2014 (now published in Nature Neuroscience vol 24, pgs 565571, 2021), where they showed, in one experiment, that learning rate scales as 1/sqrt{r} for moderate firing rates (greater than about 1 Hz), and saturates at small firing rates. This is one experiment, so doesn't rule out the theory here, but these points should be discussed.
2. The link to biology is often very cursory. References to experiments are often in the form 'so and so found something a bit similar'. Wherever possible, you should try to make precision comparisons to experiments. And on page 10 you say that the weight dependent heterosynaptic terms explains the central weight distributions found in cortex. However, the weight dependent heterosynaptic term is just one term out of three, and the other terms are not weightdependent. If you want to argue this, you should be more rigorous (and in particular take into account all three terms).
3. The conclusion of the Zenke and Gerstner paper is, we believe, incorrect: A learning rule with an intrinsic weight dependence, such as Oja's rule, does not need homeostasis. This clearly doesn't affect any of your analysis, but it does mean that the statement "…… (Zenke and Gerstner, 2017), where it was shown that pairing Hebbian terms with heterosynaptic and homeostatic plasticity is crucial for stability" is, we believe, not true.
4. Figure 1B is pretty incomprehensible, and it doesn't help that the equations are small and gray, making them hard to read. We would suggest dropping it, especially since the point is made very well in the text.
Along the same lines (but much later), in Sec. ‘The naive Euclidean gradient is not parametrizationinvariant’ we don't see any contradiction between Equations 26 and Equation 27; they are just two different definitions of Δ w^{s}. It might be an inconsistency, but not a contradiction.
5. Figure 3: Considering that several approximations were made in the derivation of the learning rule (even for Equation 7), learning performance should be evaluated a bit more carefully. In particular:
a. Please plot the distance between student weights and teacher weights, on the same timescale as in Figure 3F (either in Euclidean space or in Fisher metric space).
b. Please plot performance versus learning rate, so we can get a sense of robustness. "Performance" could either be asymptotic loss or time to achieve a particular loss. Also, can you speculate how the learning rate of a neuron can be optimized?
c. What's the target output firing rate? It might have been stated somewhere, but we missed it. It would be nice if it were in the figure caption, like it is in Figure 4.
d. In Figure 3, n=2 and n=100 are both used. It should be clear which panel is which.
e. Panels B and C: what's the timescale? There's only one labeled time point (0.1), so it's impossible to tell. And how many trials did it take to learn? We would have expected considerable learning during 2500 trials.
f. The direction of the gradient depends on whether or not there's a spike on the teacher neuron. Which means the path in weight space is noisy, and the weights should not converge  instead, they should exhibit fluctuations forever. But the trajectories in D and F were very smoothed. Is that because they're averaged over a large number of trials? Or was some extra processing done? Please explain.
6. Figure 4:
a. Can you reproduce experimental results on dendritic position dependent plasticity (eg. Letzkus JJ, Kampa BM, Stuart GJ, 2006; Sjöström PJ, Häusser M, 2006). These experimental results are inconsistent with vanilla Euclidean gradient; hence they would provide support for biological relevance of the proposed learning rule.
b. "the distance from soma was varied between 1 microns to 10 microns". This seems small; the characteristic length of dendritic decay is in order of 100 microns (Williams and Stuart, 2002).
7. Figure 5: Here, you compared three neurons each receiving single excitatory input with different input characteristics. Do you expect the same result to be true for a single neuron receiving three excitatory inputs? We're curious about it because that's a more relevant scenario. Some casual remarks would be helpful.
8. Figure 7:
a. We're somewhat confused that you you can distinguish between the three forms of plasticity in Equations 1113, at least for the stimulated synapses. Don't all three forms contribute at once? But in panel B you say, "Weight change of stimulated weights (homosynaptic)". And in panel C, you should have seen homosynaptic plasticity (see next comment). Could you explain how you can single out one form of plasticity? Or else drop the comment. Or maybe we misunderstood?
b. In the legend of Figure 7, it is mentioned that the rest of synapses received 0.01Hz input, but in Sec. ‘Comparison of homo and heterosynaptic plasticity’, that information is omitted. Did you actually stimulate these synapses? If so, we would have expected a big jump in synaptic weight when there was a spike, and in 60 s there should have been 3 spikes (0.01x60x5).
c. 7G: A should be S?
d. Equation 13 is out of the blue. Could you tell us where this comes from? And from the bottom of page 10, we have
"…… is roughly a linear function of the membrane potential (more specifically, its deviation with respect to its baseline)."
What does baseline refer to? We didn't see anything about baseline in Equation 25, which is where Equation 13 comes from.
e. Could you tell us how the colors in panels BD were computed? Presumably some assumptions were made. If so, what were they?
9. Equation 4: we believe that the right hand side should not have a minus sign. More importantly, we should be told that Y* consists of a sum of δfunctions; the statement "Here, Y* denotes a teacher spike train sampled from p*" was unhelpful. It's also important to tell us exactly how Y* is sampled, which is, presumably from
P(spike from Y* \in [t, t+dt]) = phi(sum_{j} w*_{j} x_{j}^{ε}) dt.
If that's not correct, then we're thoroughly confused.
10. The second paragraph on page 4 makes some strong statements, and we're not sure all of them are true. For instance,
"Convergence of learning is therefore harmed by the slow adaptation of distal synapses compared to equally important proximal counterparts."
Presumably, "adaptation" means "learning". If so, why is adaptation necessarily slow at distal synapses? That would seem to depend on the biological learning rule. If not, what does "adaptation" mean in this context?
And,
"With the multiplicative USP term x^{ε} in Equation 4 being the only manifestation of presynaptic activity, there is no mechanism by which to take into account input variability, which can, in turn, also impede learning."
This is a bit out of the blue. Why should input variability affect learning? So far the authors have not said anything about this. It comes up later, but right now it's pretty obscure.
11. Last paragraph on page 4: the motivation for using the Fisher information matrix is a bit obscure. (For instance, what does "it is generally the unique metric that remains invariant under sufficient statistics" mean?) We strongly suspect that we won't be only ones who are lost. So it would be good to motivate it in plain language. And in particular, it would be nice to show, at the very least, that using the Fisher information matrix automatically makes the learning rule invariant with respect to a change of variables. (Which is pretty easy to do.)
Along the same lines, after Equation 9 you say
"This represents an unmediated reflection of the philosophy of natural gradient descent, which finds the steepest path for a small change in output, rather than in the numeric value of some parameter."
This has been the theme all along, but was it ever shown? As far as we can tell, you just used the inverse of the Fisher information matrix. Is it obvious that it has the above effect?
12. It would be extremely useful to write down log p_{w} in the main text.
13. The quantities in Equation 7 should be defined immediately  right now we have to read down half a page to figure out what they are. And you should tell us immediately how γ_{u} and γ_{w} could be computed locally.
14. In the section "Natural gradient speeds up learning", please tell us exactly what the cost function is (which, presumably, means telling us w* and the filters you use to convert incoming spike trains to x). And it would be good if the firing rates were given in the main text, so one doesn't have to look at the figure.
15. Because of the factor of phi'(V)/phi(V) in Equations 7 and 8, the effective scaling in Equation 9 is, approximately, 1/phi'(V). This doesn't change the point that a shallow slope implies a high learning rate, but thing's aren't quite as bad as Equation 9 implies. This seems worth mentioning.
16. You mention, on the top of page 8, "that the absence of dendritic democracy does not contradict the presence of democratic plasticity". It seems worth mentioning, just for completeness, that dendritic democracy can be achieved without democratic plasticity. In particular, as far as we can tell, the Euclidean gradient is also consistent with dendritic democracy, although convergence to the democratic state might be slower that for the natural gradient.
17. p. 10 "In comparison, input from weak synapses only has……" We couldn't get the logic of this and the following sentences. This nontrivial claim is not explicitly tested in the subsequent paragraph, but reappears in the discussion. You should expand on it, or remove it.
18. p22: "The integral formulas follow from……". It would help if you referred to Equations 20 and 21 (in Sec. ‘Neuron model’) here. We were confused by the sudden appearance of ε_{o} and c_{ε}. Also, you should mention how they are related to dt.
19. It is somewhat confusing that you set up the problem about dendritic distance, but then first address, in Figure 3, the role of firing rate, before showing the solution to dendritic plasticity. Perhaps starting with Figure 4, contrasting standard gradients to natural gradients would help.
20. Page 7/8, you say "neurons are not symmetrical geometric objects". You should make it clear what this means (presumably it means that they have dendrites, and, therefore, some weights are attenuated). In addition, it's not clear that not being a symmetrical geometric objects implies a nonisotropic cost landscapes, since how isotropic the cost landscape is depends on the input and cost function as well as the attenuation in the dendrites. In principle they former could cancel the latter, producing an isotropic cost landscapes even with strongly attenuating neurons. It seems best to drop this point.
21. A number of studies have recently emphasized that in large networks most machine learning problems have highly degenerate minima (e.g. Belkin et al. et al. PNAS). Does the algorithm generalize to such situations? Your answer may be "we don't know". But whatever the answer, it would be worth mentioning in the paper.
22. Page 7, last paragraph: should the reference to Figure 5 be to Figure 4?
23. The abundance of superscripts, subscripts and decorations in the variables throughout the manuscript make it very hard to read. We would suggest simplifying notation as much as possible. In particular:
a. The superscript on the gradient operator doesn't help, as it requires extra memorization  which is hard because it's not standard. You would be much better off putting the superscript on the cost function, and not redefining the gradient. It's especially confusing in Equation 6, since at first glance the superscript n means take n derivatives.
b. There are two w's: w^{d} and w_{s}. Except sometimes there's a w without a superscript. It should always be clear which w you're referring to.
c. The notation in Equation 7 may make for easier reading, but it makes it very difficult to figure out what's actually going on, especially since there's no easy way to tell vectors from scalars. We would strongly recommend using components (dw_{i}/dt = ……). It would make it much more clear, with very little cost.
24. Along the same line, the phrase "unweighted synaptic potential (USP) train" is pretty distracting. We would strongly suggest dropping it, since everybody agrees what a spike train is. Plus, we searched, and "weighted dendritic potentials" was used only once, in the sentence before unweighted synaptic potentials were defined. So "unweighted" seems a bit redundant.
25. Important equations should, in our opinion, be displayed. That's because readers (OK, some of these readers) always go back and search for important quantities, and they're easier to see if displayed. (This was triggered by the expression for V, a few lines up from Equation 2.)
26. The setup should be made clear up front: the neuron is receiving a teacher spike train, y_{t}, and using those spike trains to update its synapses. The statement before Equation 3, "Assuming that the neuron strives to reproduce a target firing distribution p*(yx)" was not helpful (I couldn't really make sense of it, so I kind of ignored it). And so it took me forever to figure out what was going on.
27. Along the same lines, wouldn't it make sense to just say the neuron is trying to maximize the log probability of the teacher spikes? It gives the same cost function, and it's a lot less obscure than saying the cost function is the KL distance. Clearly a matter of taste, but log probability seems more sensible.
28. On the top of page 9, it says that learning rates are inversely correlated to the variance of σ^{2}(x^{ε}). It needs to be clear why the equations say that; right now it's pretty much out of the blue.
29. End of discussion (page 12): "errorcorrecting plasticity rule". Why is it error correcting?
30. Very end of discussion:
"Explicitly and exactly applying natural gradient at the network level does not appear biologically feasible due to the existence of crossunit terms in the Fisher information matrix G. However, methods such as the unitwise naturalgradient approach (Ollivier, 2015) could be employed to approximate the natural gradient using a blockdiagonal form of G. For spiking networks, this would reduce global naturalgradient descent to our local rule for single neurons."
We didn't understand that  it should be unpacked.
31. Given that this is eLife, and there's no page limit, we would strongly urge you to get rid of the Appendix and put all the analysis in Methods. We suspect anybody willing to read Methods will want to see the algebra in the Appendix. In any case, all relevant quantities (e.g., g_{1}, ……) should be in Methods; the reader shouldn't have to hunt them down in what will potentially be another document. And certainly, the gradient, which is central to the whole endeavor, should be in Methods. As should Equation S18, so that the reader knows where G(w) comes from.
32. There's a much easier derivation of G(w), Equation 31:
int dx exp(h dot t) N(x; mu, Σ) f(a dot x)
is easy to evaluate (here x is a vector and N(x; mu, Σ) is a Gaussian with mean mu and covariance Σ). Take two derivatives, and, voila, you have.. Same answer, but easier on the reader.
In addition, the derivation of its inverse can be more concise. G(w) is a sum of a diagonal matrix and a rank2 modulation. Denoting
V = (r w') and C = [[c1 e^{2}, c2 e], [c2 e, c3]],
we may write
G(w) = c1 Σ + V C V^{T}.
Applying the Woodbury identity,
G^{1}(w) = Σ^{1}/c1 – V' (C^{1} + V Σ^{1} V^{T})^{1} V'^{T},
where
V' = (r'/e w).
Thus,
g = [[g1 g3], [g2 g4]]
is given as
g = – (C^{1} + V Σ^{1} V^{T}) ^{1}.
33. Aren't Equations S1213 a repeat of arguments made above, on the same page? Or is this something new? Either way, it should be clear (and if it's a repeat, maybe it can be dropped).
[Editors' note: further revisions were suggested prior to acceptance, as described below.]
Thank you for resubmitting your work entitled "Naturalgradient learning for spiking neurons" for further consideration by eLife. Your revised article has been evaluated by John Huguenard (Senior Editor) and a Reviewing Editor.
The manuscript has been improved but there are some remaining issues that need to be addressed, as outlined below:
The good news: The reviewers are quite positive, and there are no major concerns.
The bad news: Reviewer 1 has many comments requiring clarification. But almost all should be relatively straightforward to address in a timely way.
Reviewer #1:
The manuscript is much improved, although it's still a bit hard to read. Some of which we think could be easily fixed. Specific comments follow.
1. p 1: "It certainly could be the case that evolution has favored one particular parametrization over all others during its gradual tuning of synaptic plasticity, but this would necessarily imply suboptimal convergence for all but a narrow set of neuron morphologies and connectome configurations."
We found this confusing, since parametrization and learning rules are not directly coupled. Did you mean "evolution has favored one particular parametrization over all others, and updates weights using Euclidean gradient descent ……"?
2. One more attempt to kill USP: why not call x^{ε} filtered spike trains, since that's what they are? "Unweighted synaptic potential" is completely unfamiliar, and with a paper this complicated, that's the last thing one needs  every single time we read that phrase, we had to mentally translate it to filtered spike trains.
3. Technically, x_{iε} should depend on distance to the soma: because of dendritic filtering, the farther a psp travels along a dendrite the more it spreads out (in time). You should mention this, and speculate on whether it will make much difference to your analysis. It seems like it might, since c_{ε} scales with the timescale of the psp.
4. In both places where you mention spikes (x_{i}, above Equation 4 and Y^{*}, above Equation 9) you should point out that they are a sum of δ functions. "Spikes" isn't so well defined in this field.
5. If you want anybody besides hard core theorists do understand what you're doing, you'll need to be explicit about what p* is: it's Equation 7 with phi replaced by phi*, the target firing rate as a function of voltage. We strongly suggest you do that. In particular, the description "a target firing distribution p*(y∣x^{ε})" doesn't help much, since a firing rate distribution is usually a distribution over firing rates. What you mean, though, is that p* gives you the true probability of a spike in a time interval.
5. It would be very helpful, in Equation 9, to add
\approx dt[ (phi – phi*) – phi* log(phi/phi*) ]
(and point out that it's exact in the limit dt > 0). Then Equation 10 is easy  just say that phi* is replaced by Y*.
6. Starting in Equation 10, you switch to a weight without a superscript. We found this very confusing, since the learning rule you wrote down in Equation 10 is exactly the one for w^{s}. We strongly suggest you never use w. But if there is a reason to do so, you should be crystal clear about what it is. As far as we could tell, this switch to w is completely unsignalled. So maybe it's a typo?
7. p 4: "which is likely to harm the convergence speed towards an optimal weight configuration."
The word "likely" is not all that convincing. Can you be a lot stronger? Are there relevant refs? Or you could point to the simulations in Figure 3.
8. p 4: "In the following, we therefore drop the index from the synaptic weights w to emphasize the parametrizationinvariant nature of the natural gradient."
In fact, what's really going on is that you're deriving the natural gradient learning rule (Equation 13) with V given by
V = sum_{i} f(w_{i}) x_{iε}
which is actually very confusing, since f was used previously to relate w^{s} to w^{d} (Equation 4). Is the f in Equation 14 the same as the one in Equation 4? If so, it should be w^{d} on the right hand side of Equation 14. If not, you should use a different symbol for the function.
As you can see, your notation is confusing us. Our suggestion would be to always use superscripts s or d, and never use w by itself.
9. Figure 2:
"(A) During supervised learning, the error between the current and the target state is measured in terms of a cost function defined on the neuron's output space; in our case, this is the manifold formed by the neuronal output distributions p(y, x)."
What does "defined on the neuron's output space" mean? Doesn't the cost function depend only on the weights, since it's an average over the inputoutput function? Which is more or less what you say in the next sentence,
"As the output of a neuron is determined by the strength of incoming synapses, the cost C is an implicit function of the afferent weight vector w.!
Although we're not sure why you say "implicit" function; isn't it a very explicit function (see point 5 above).
"If, instead, we follow the gradient on the output manifold itself, it becomes independent of the underlying parametrization."
Since we don't know what the output manifold is (presumably the statistical manifold in panel A? not that that helps), this sentence doesn't make sense to us.
"In contrast, naturalgradient learning will locally correct for distortions arising from nonoptimal parametrizations."
This is a highly nontrivial statement, and there's nothing in the manuscript so far that makes this obvious. I think a reference is needed here. Or point the reader to Figure 3 for an example?
10. Equation 13: You may not like components, but it would be extremely helpful to put them in Equation 13, or maybe add an equation with components to make it clear. There is a precedent; you use components in Equation 4. And let's face it, anybody who isn't mathematically competent will have been lost long ago, and anybody who is mathematically competent will want to be sure what's going on. And things are actually ambiguous, since it's really hard to tell what's a vector and what's a scalar. In particular, γ_{u} should be treated as a vector, even though it's not bold, so it's not immediately clear whether γ_{s} should also be treated as a vector. If nothing else, you should use γ_{u} {\bf 1}; that would help a little. Also, the convention these days when writing f(w) is for f not to be bold, but instead define it as a pointwise nonlinearity. You don't have to follow conventions, but it does make it easier on the reader.
The same comment applies to Equation 17.
11. In Equation 14, presumably w^{s} on the left hand side should be bold? And why not use indices, as in Equation 4? Much more clear, and fewer symbols.
12. p 6, typo: "inactive inactive".
13. Two questions and one comment about Figure 3F. First, why don't the orange and blue lines start in the same place? Second, the blue line doesn't appear to be saturating. Can the Euclidean gradient do better than the natural gradient? If so, that's important to point out. But maybe it's not true, and is a byproduct of your optimization method?
And the comment: presumably the D_{KL} term on the yaxis has a factor of dt. That makes it pretty much impossible to interpret, since we don't know what dt is (probably it's somewhere, but it's not in an obvious place). We suggest removing the factor of dt, and making it clear that when phi is close to phi*, what's being plotted is <(phiphi*)^2/2 phi*>. That will make the plot much easier to interpret.
14. p 8: "As a result, the effect of synaptic plasticity on the neuron's output is independent of the synapse location, since dendritic attenuation is precisely counterbalanced by weight update amplification."
Because of the term γ_{w} w^{d}, this is true only if w^{d} \propto 1/α(d). But, as you point out later, this doesn't have to be the case. So this statement needs to be modified. Maybe it's approximately true?
15. Figure 5, several comments:
In the equation in panel H, r should be in the numerator, not the denominator, and a factor of ε_{02} is missing. In addition, the term on the right hand side can be simplified:
σ^{2}(x^{ε}) = r ε_{02}/2(tau_{1} + tau_{2}).
Also it would be a good idea to switch to tau_{s} and tau_{m} here rather than tau_{1} and tau_{2}, to be consistent with Methods, and to not clash with the tau's in panels AB.
And in panels AB, presumably it should be tau_{si} rather than tau_{i}?. Also, tau_{m} should be reported here.
It would make it a lot easier on the reader if you wrote down an equation for cε, which isn't so complicated: c_{ε}/r_{i} = 2 (tau_{m} + tau_{s})/ ε_{02} r_{i}. Equation 19 is a natural place to do that.
Because of the other two terms in Equation 17, it's not true that "Naturalgradient learning scales inversely with input variance." It should be clear that this is approximate.
16. In Equation 19, it should be x_{iε} on the left hand side, not x^{ε}.
17. p 10: "Furthermore, it is also consistent with data from Aitchison and Latham (2014) and Aitchison et al.et al. (2021), as well as with their observation of an inverse dependence on presynaptic firing rates, although our interpretation is different from theirs."
There is no justification for this statement: in Aitchison et al.et al. a plot of Δ w/w versus r showed 1/sqrt(r) scaling. That would be consistent with the theory in this paper only if the weight scales as 1/sqrt(r). It might, but without checking, you can't make that claim.
That data is weak, so the fact that you don't fit it is hardly the end of the world. However, what's more important is to point out that the theory here makes very different predictions that the theory in Aitchison et al.et al. That way, experimentalists will have something to do.
Finally, the inverse scaling with firing rate must eventually saturate, since there's a limit to how big weight changes can be. You should comment on this, if briefly.
18. Figure 7, we're pretty lost, for several reasons:
a. Because homosynaptic scales as 1/r, it's not possible to have unstimulated synapses (for which r=0); at least not with the current derivation. If you want to have unstimulated synapses, you need to show that it is indeed possible, by computing G when r_{i=0} for some of the i's.
b. From the bottom of page 27, γ_{u} \approx ε_{0} c_{ε}. Thus, the first two plasticity terms are
c_{ε} (x^{ε}/r – ε_{0}).
Because = r ε_{0}, these two terms approximately cancel. Which means everything should be driven by the last term, c_{w} V f(w). This doesn't seem consistent with the explanation in Figure 7.
c. Because of the term c_{w} V f(w), shouldn't there be a strong dependence on the unstimulated weights in panels BD?
All this should be clarified.
19. p 13: "We further note an interesting property of our learning rule, which it inherits directly from the Fisher information metric that underlies natural gradient descent, namely invariance under sufficient statistics (Cencov, 1972)." What does "invariance under sufficient statistics" refer to?
20. p 13: "A further prediction that follows from our plasticity rule is the normalization of weight changes by the presynaptic variance. We would thus anticipate that increasing the jitter in presynaptic spike trains should reduce LTP in standard plasticity induction protocols."
Presumably this statement comes from the fact that the learning rate scales inversely with the variance of the filtered spike train. However, it's not clear that jittering spike trains increases the variance. What would definitely increase the variance, though, is temporal correlations. Maybe this could be swapped in for jittering? Either that, or explain why jittering increases variance. And also, it would be nice to refer the reader to the dependence of the learning rate on variance.
21. You should comment on where you think the teacher spike train comes from. Presumably it's delivered by PSPs at the soma, but those don't propagate back to synapses. How do you envision the teacher spike trains communicating with the synapses? Even if the answer is "we don't know", it's important to inform the reader  this may simply be an avenue for future research.
22. In Equations 29 and 30, you should explain why you use \approx. Presumably because that's because you assumed a constant firing rate?
23. Equation 31: it would be extremely useful to point out that c_{ε} = 2(tau_{m} + tau_{s})/ε_{02}, along with a reference (or calculate it yourself somewhere).
24. After Equation 31, "Unless indicated otherwise, simulations were performed with a membrane time constant tau_{m} = 10 ms and a synaptic time constant τs = 3 ms. Hence, USPs had an amplitude of 60 mV".
Doesn't the amplitude of the USPs depend on ε_{0}? And 60 mV seems pretty big. Is that a typo?
25. Equation 34: Missing parentheses around Σ_{USP} w^{s} in the second term in parentheses.
26. Equation 24 and the line above it: should w_{i} be w_{id}? If not, you shouldn't use f, since that was used to translate somatic to dendritic amplitude.
27. Equation 115: Should it be Theta(Vtheta)?
28. Equation 117 is identical to Equation 67. Did you mean to drop the term (phi')^{2}/phi?
Also, (phi')^{2}/phi = 1 when V > theta. But presumably it's equal to 0 when V < theta. If so, Equation 117 should be
G(w) = E(dt Theta(Vtheta) x xT).
if that's correct, then the integrals I_{1}I_{3} do not reduce to the values given in Equations 118120.
Reviewer #2:
The authors have done a good job in the revision.
A number of small suggestions remaining:
– Equation 13. Wouldn't it be easier to write this for a single component $\dot{wi)}=$……
– Figure 3 shows temporal structure in the teacher signal but this is nowhere explained in the main text.
– Fig3D+E perhaps the axis or caption can indicate whether w1,2 is 10 or 50Hz.
– Figure 4A: Shouldn't the purple top solid curve (dendritic voltage) be taller than the solid orange curve?
– Fig5D+E+F might look better with the position of the yaxis left instead of right.
https://doi.org/10.7554/eLife.66526.sa1Author response
Essential revisions:
This paper derives a natural gradient learning rule for a spiking neuron in a supervised setting. Unlike conventional gradients in Euclidean space, the natural gradient is invariant under reparameterization, thus achieving fast convergence regardless of the position of synaptic contacts on the dendritic tree. The authors relate their rule to experimentally observed properties of synaptic plasticity, such as heterosynaptic regularization. We're not aware of any other work that applied natural gradient to a spiking neuron in a formal way, and we believe it is an important contribution.
That said, we do have several comments many have to do with presentation, and, we hope, easy to address. Because this is a combination of three reviews, they're not exactly in order of appearance in the manuscript. But hopefully not too far off.
1. We're slightly concerned with the fact that the weight update scales inversely with the firing rate. First, presumably for small enough firing rates something in the derivation breaks down. Second, this makes a very strong prediction. The only data we know of that relates to this is in Aitchison and Latham, 2014 (now published in Nature Neuroscience vol 24, pgs 565571, 2021), where they showed, in one experiment, that learning rate scales as 1/sqrt{r} for moderate firing rates (greater than about 1 Hz), and saturates at small firing rates. This is one experiment, so doesn't rule out the theory here, but these points should be discussed.
The inverse scaling comes from the first term in the Fisher information matrix, which is proportional to the input’s covariance matrix. This fits nicely into the philosophy of natural gradient descent and the resulting synaptic democracy, as isotropization of the cost with respect to the weights also implies a renormalization of the inputs with respect to their variance. For the particular case of Poisson firing discussed in our manuscript, the variance of an input is proportional to its rate, hence the inverse dependence of our learning rule w.r.t. the input rate. The rule is wellbehaved as long as the input rates are not exactly zero, which is arguably the case in a nonpathological brain. Of course, depending on the exact experimental scenario and stimulation protocol, the other heterosynaptic components could override this scaling effect and even invert it. We also referenced Aitchison et al. as our prediction is, indeed, similar to theirs, and also compatible with their data, but made clear that our interpretation is different (lines 270272 in the revised manuscript with tracked changes).
2. The link to biology is often very cursory. References to experiments are often in the form 'so and so found something a bit similar'. Wherever possible, you should try to make precision comparisons to experiments. And on page 10 you say that the weight dependent heterosynaptic terms explains the central weight distributions found in cortex. However, the weight dependent heterosynaptic term is just one term out of three, and the other terms are not weightdependent. If you want to argue this, you should be more rigorous (and in particular take into account all three terms).
We tried to clarify the relationship to experimental data, while remaining cautious in order to not overstate our claims, since most experiments have not been performed in the exact same settings as those in which we tested our model. For example, many experiments from the referenced literature on heterosynaptic plasticity rely on deterministic stimulation protocols rather than on Poisson input (e.g., Royer and Paré, 2003, White et al. 1990). However, the qualitative matches of our results to this data represent encouraging evidence, while also providing an inspiration for future, more targeted experiments that also allow a quantitative comparison. To emphasize this point, we included an additional paragraph in the discussion, outlining potential directions of future experiments.
Furthermore, as highlighted by Häusser, data on the dependence of plasticity on synaptic location is rare, “as most such studies to date have implicitly assumed that synaptic function at distal and proximal synapses is identical.” (Häusser, 2001, p. R12). This puts the findings of Froemke, Poo, and Dan, 2005 in relation to our predictions on democratic plasticity, as we discuss at the end of Section 2.4 (lines 245249).
As to the last point, we argue that the existence of the het_{w} term corresponds to the observation made by Loewenstein et al. (2011) about changes in spine size being proportional to the spine size itself (in light of Asrican et al. (2007), who show that synaptic strength is positively correlated with spine size). However, we agree that the remark about heterosynaptic plasticity “explaining” the central weight distribution in cortex might have been phrased too strongly – “correspondence” is, in this case, certainly a better descriptor than “explanation”.
3. The conclusion of the Zenke and Gerstner paper is, we believe, incorrect: A learning rule with an intrinsic weight dependence, such as Oja's rule, does not need homeostasis. This clearly doesn't affect any of your analysis, but it does mean that the statement "… (Zenke and Gerstner, 2017), where it was shown that pairing Hebbian terms with heterosynaptic and homeostatic plasticity is crucial for stability" is, we believe, not true.
You are absolutely right about the role of the weightdependent term in Oja’s rule, but we don’t think that this necessarily contradicts Zenke and Gernstner. It might only be a matter of wording, but we would argue that in Oja’s rule, the negative, weightdependent term is, in addition to being heterosynaptic, also homeostatic, because it stabilizes (normalizes) the weight vector by punishing too large weights.
4. Figure 1B is pretty incomprehensible, and it doesn't help that the equations are small and gray, making them hard to read. We would suggest dropping it, especially since the point is made very well in the text.
Agreed, we moved Figure 1B to the corresponding Figure 8 (Section 6.3 of the revised manuscript) in the Methods.
Along the same lines (but much later), in Sec. ‘The naive Euclidean gradient is not parametrizationinvariant’ we don't see any contradiction between Equations 26 and Equation 27; they are just two different definitions of \Δ w^{s}. It might be an inconsistency, but not a contradiction.
Agreed, done (line 621).
5. Figure 3: Considering that several approximations were made in the derivation of the learning rule (even for Equation 7), learning performance should be evaluated a bit more carefully. In particular:
a. Please plot the distance between student weights and teacher weights, on the same timescale as in Figure 3F (either in Euclidean space or in Fisher metric space).
Good point, we added another Figure (#9) in the methods. We also plotted the distance between student and teacher firing rates, as well as the “minimum” and “maximum” learning curves.
b. Please plot performance versus learning rate, so we can get a sense of robustness. "Performance" could either be asymptotic loss or time to achieve a particular loss. Also, can you speculate how the learning rate of a neuron can be optimized?
Our simulations show (panel D in the new Figure 9) that the convergence behavior is robust against perturbations of the learning rate.
Metalearning is, in general, a complex subject of active research, and we are aware of several related experimental (e.g., https://elifesciences.org/articles/51439) and algorithmic (e.g., https://elifesciences.org/articles/66273) studies. For our purposes, the optimization of eta_{0} is a relatively simple case of univariate optimization, so our simple grid search was sufficient.
c. What's the target output firing rate? It might have been stated somewhere, but we missed it. It would be nice if it were in the figure caption, like it is in Figure 4.
The teacher was modeled as a Poisson neuron that fires with a timedependent rate, given by \phi (\sum_{i=1}^{n} w*_{i} \xepsi), with a randomly sampled target weight w^{*}. We added a sentence in the caption of Figure 3 to clarify this.
d. In Figure 3, n=2 and n=100 are both used. It should be clear which panel is which.
Absolutely, we adapted the caption accordingly.
e. Panels B and C: what's the timescale? There's only one labeled time point (0.1), so it's impossible to tell. And how many trials did it take to learn? We would have expected considerable learning during 2500 trials.
We added a second labeled time point so that the time scale in the panel becomes clear. Furthermore, we renamed the ordinate of the scatter plots into PSTH, since the label “trial” might have been misleading. The scatter plots in B and C show repeated sampling of the teacher and student firing distribution before and after learning, for one of the sample traces in 3F. This means they do not depict the learning process itself, but illustrate the firing distribution of the student and teacher at fixed points in the learning process.
f. The direction of the gradient depends on whether or not there's a spike on the teacher neuron. Which means the path in weight space is noisy, and the weights should not converge  instead, they should exhibit fluctuations forever. But the trajectories in D and F were very smoothed. Is that because they're averaged over a large number of trials? Or was some extra processing done? Please explain.
You are correct, the weight paths were averaged over 500 trials. We included this information in the figure caption so it is easier to find.
6. Figure 4:
a. Can you reproduce experimental results on dendritic position dependent plasticity (eg. Letzkus JJ, Kampa BM, Stuart GJ, 2006; Sjöström PJ, Häusser M, 2006). These experimental results are inconsistent with vanilla Euclidean gradient; hence they would provide support for biological relevance of the proposed learning rule.
Thank you for raising this interesting point. It is certainly true that this is conceptually related to the issue we discuss here – in some sense, it’s a mirror problem, namely the attenuation of output signals versus the attenuation of input signals. We have added a paragraph in the discussion where we describe the relationship between these findings and the predictions of our learning rule (lines 366376).
b. "the distance from soma was varied between 1 microns to 10 microns". This seems small; the characteristic length of dendritic decay is in order of 100 microns (Williams and Stuart, 2002).
Thank you for pointing this out, this is a mistake in the caption of Figure 4 that appeared during the final editing of the paper. It is not the distance from soma that is varied between 1 and 10 microns, but rather the inverse attenuation factor α(d) which is varied between 1 and 10. Since α(d)=exp(d/λ), this means that d/λ is varied between 0 and ln(10) ≅ 2,3. With an electrotonic length scale of approximately 200 μm, as described in Williams and Stuart (2002), this corresponds to a distance d from soma between 0 and 460 μm. We corrected this in the revised manuscript (caption of Figure 4).
7. Figure 5: Here, you compared three neurons each receiving single excitatory input with different input characteristics. Do you expect the same result to be true for a single neuron receiving three excitatory inputs? We're curious about it because that's a more relevant scenario. Some casual remarks would be helpful.
When the activity of inputs does not overlap in time (at any point in time, only one is active), the results will hold exactly as in Figure 5. We chose this scenario because it is the only one where a clearcut comparison can be made, without additional complications induced by their interaction.
In the case of simultaneously active inputs, we still expect the variance to strongly impact the learning rate in an inverseproportional manner, since Equation 34 of the revised manuscript (line 585) shows that the input variance is an important contributor to the Fisher information matrix, whose inverse plays an essential role in the natural gradient update (see Equation 12 of the revised manuscript, line 134). From Equation 37 (revised manuscript, line 591), it becomes clear that the homosynaptic component of our learning rule exactly scales with the inverse variance of the unweighted synaptic potential. We also ran computer simulations for this scenario, but the resulting data is much more difficult to study, since due to the interactions of the different inputs, the behavior of the heterosynaptic components becomes much more complicated.
8. Figure 7:
a. We're somewhat confused that you you can distinguish between the three forms of plasticity in Equations 1113, at least for the stimulated synapses. Don't all three forms contribute at once? But in panel B you say, "Weight change of stimulated weights (homosynaptic)". And in panel C, you should have seen homosynaptic plasticity (see next comment). Could you explain how you can single out one form of plasticity? Or else drop the comment. Or maybe we misunderstood?
You are right, stimulated synapses are affected by all three terms simultaneously, while unstimulated synapses are only affected by the heterosynaptic terms. The confusion resulted from our overloading of the terms “homosynaptic” and “heterosynaptic”. In literature, they are often used synonymously with “stimulated” and “unstimulated”. Here, we use the terms to describe the different components of the learning rule. We have revised the caption and text to avoid the resulting ambiguity.
b. In the legend of Figure 7, it is mentioned that the rest of synapses received 0.01Hz input, but in Sec. ‘Comparison of homo and heterosynaptic plasticity’, that information is omitted. Did you actually stimulate these synapses? If so, we would have expected a big jump in synaptic weight when there was a spike, and in 60 s there should have been 3 spikes (0.01x60x5).
You are right, and we addressed this misunderstanding in the revised caption of Figure 7 and Sec. ‘Comparison of homo and heterosynaptic plasticity’ (lines 677679). In the simulation, the synapses received no input spikes at all. The assumption of infinitesimally small input rates was simply an argument that allowed us to avoid division by zero in the homosynaptic term. Biologically, this would simply correspond to zero homosynaptic plasticity for zero input.
c. 7G: A should be S?
Yes, you are right. We corrected it in the figure.
d. Equation 13 is out of the blue. Could you tell us where this comes from?
Indeed, this was not sufficiently clear in the previous version. Furthermore, Equation 13 ( = Equation 23 of the revised manuscript, line 285) contained a typo as the last identity only holds approximately. We corrected it and added a reference to the section in the Methods (former SI) where the empirical analysis of γ_{w} is explained.
And from the bottom of page 10, we have
"… is roughly a linear function of the membrane potential (more specifically, its deviation with respect to its baseline)."
What does baseline refer to? We didn't see anything about baseline in Equation 25, which is where Equation 13 comes from.
By “baseline” we refer to the somatic membrane potential V_{rest} of the neuron in the absence of afferent input. Throughout the paper, we assumed a value of 70 mV for V_{rest} (see Sec. ‘Neuron model’). In contrast, the variable V captures the part of the somatic membrane potential which is evoked by afferent input (Equation 5, line 84). Hence, the total somatic membrane potential is given as V_{rest} + V. The definitions of both V_{rest} and V can also be found in the first paragraph of the Methods section (Sec. ‘Neuron model’ of the revised manuscript, lines 542 and 547 / Equation 25). We now explicitly define both V_{rest} and V at the very beginning of Section 2.2 (line 79).
e. Could you tell us how the colors in panels BD were computed? Presumably some assumptions were made. If so, what were they?
Yes, we made the following assumptions, which we have now also included in the Methods, Sec. ‘Comparison of homo and heterosynaptic plasticity’ (lines 683689).
For the plots in 7B and 7C, we chose the diverging colormap matplotlib.cm.seismic in matplotlib. The colormap was inverted such that red indicates negative values representing synaptic depression and blue indicates positive values representing synaptic potentiation. The colormap was linearly discretized into 500 steps. To avoid too bright regions where colors cannot be clearly distinguished, the indices 243257 were excluded. The colorbar range was manually set to a symmetric range that includes the max/min values of the data.
In Figure 7D, we only distinguish between regions where plasticity at stimulated and at unstimulated synapses have opposite signs (light green), and regions where they have the same sign (dark green).
9. Equation 4: we believe that the right hand side should not have a minus sign. More importantly, we should be told that Y* consists of a sum of δfunctions; the statement "Here, Y* denotes a teacher spike train sampled from p*" was unhelpful. It's also important to tell us exactly how Y* is sampled, which is, presumably from
P(spike from Y* \in [t, t+dt]) = phi(sum_{j} w*_{j} x_{jε}) dt.
If that's not correct, then we're thoroughly confused.
You are right about the minus sign, we corrected this typo (line 101 / Equation 10 of the revised manuscript).
We also added a reference to the Methods where we provide more information regarding the teacher spike train. For brevity and readability, and since the main text is already quite heavy on equations, we opted for having these equations in the Methods.
You are also right that for the simulations in Figure 3 we choose a realizable teacher, i.e., one that we know can be learned; as you say, its spikes are sampled from
P(spike from Y* \in [t, t+dt]) = phi(sum_{j} w*_{j} x_{jε}) dt.
We made this assumption to facilitate the analysis of our learning rule, but please note that Equation 13 (line 137) of the revised manuscript can be derived in a more general setting where the teacher spikes are generated from an arbitrary (inhomogeneous) Poisson process.
10. The second paragraph on page 4 makes some strong statements, and we're not sure all of them are true. For instance,
"Convergence of learning is therefore harmed by the slow adaptation of distal synapses compared to equally important proximal counterparts."
Presumably, "adaptation" means "learning". If so, why is adaptation necessarily slow at distal synapses? That would seem to depend on the biological learning rule. If not, what does "adaptation" mean in this context?
Yes, “adaptation” means “learning” in this context. And you are right, the statement was not precise enough. We were referring specifically to a dendritic parametrization, which might appear natural for local computation. We have corrected the text accordingly (lines 109113).
And,
"With the multiplicative USP term x^{ε} in Equation 4 being the only manifestation of presynaptic activity, there is no mechanism by which to take into account input variability, which can, in turn, also impede learning."
This is a bit out of the blue. Why should input variability affect learning? So far the authors have not said anything about this. It comes up later, but right now it's pretty obscure.
You are right, we should have at least provided some intuition for the reader until they reach the corresponding section in the manuscript. We have amended the text accordingly (lines 114115).
11. Last paragraph on page 4: the motivation for using the Fisher information matrix is a bit obscure. (For instance, what does "it is generally the unique metric that remains invariant under sufficient statistics" mean?) We strongly suspect that we won't be only ones who are lost. So it would be good to motivate it in plain language.
Indeed, invariance under sufficient statistics is not a commonplace property and needs explanation. To not disrupt the flow in this already dense section, we now simply refer to some classical literature on information geometry to motivate the canonical nature of the Fisher metric (line 129). We then specifically explain invariance under sufficient statistics and address its relevance for neuronal computation in the Discussion section (lines 358364).
And in particular, it would be nice to show, at the very least, that using the Fisher information matrix automatically makes the learning rule invariant with respect to a change of variables. (Which is pretty easy to do.)
We agree that at this point in the manuscript it would be useful to state this invariance explicitly. To not disrupt the flow of reading, we added a corresponding sentence in the main manuscript (line 135), along with a reference to the detailed derivation and figure in the Methods.
Along the same lines, after Equation 9 you say
"This represents an unmediated reflection of the philosophy of natural gradient descent, which finds the steepest path for a small change in output, rather than in the numeric value of some parameter."
This has been the theme all along, but was it ever shown? As far as we can tell, you just used the inverse of the Fisher information matrix. Is it obvious that it has the above effect?
We used this argument when introducing the Fisher metric in the first place (line 125 onwards) and tried to visualize it in Figure 2:
“The key idea of natural gradient as outlined by Amari is to follow the (locally) shortest path in terms of the neuron's firing distribution. Argued from a normative point of view, this is the only "correct" path to consider, since plasticity aims to adapt a neuron's behavior, i.e., its inputoutput relationship, rather than some internal parameter (Figure 2).”
Building on your other suggestions, we made a number of modifications to Section 2.2. to better explain different aspects of the theory. We hope that now readers are provided with sufficient intuition before reaching the “unmediated reflection” remark in Section 2.5.
12. It would be extremely useful to write down log p_{w} in the main text.
Done, but without the log, for brevity (line 90 / Equation 8 of the revised manuscript).
13. The quantities in Equation 7 should be defined immediately  right now we have to read down half a page to figure out what they are. And you should tell us immediately how γ_{u} and γ_{w} could be computed locally.
Agreed. We referenced the factors briefly directly after the equations (lines 139140) and added expressions for γ_{u} and γ_{w} to the main manuscript (line 174 / Equation 15 of the revised manuscript).
14. In the section "Natural gradient speeds up learning", please tell us exactly what the cost function is (which, presumably, means telling us w* and the filters you use to convert incoming spike trains to x). And it would be good if the firing rates were given in the main text, so one doesn't have to look at the figure.
Done. We have included more information in the text and also added references to the corresponding sections in the methods (lines 195204).
15. Because of the factor of phi'(V)/phi(V) in Equations 7 and 8, the effective scaling in Equation 9 is, approximately, 1/phi'(V). This doesn't change the point that a shallow slope implies a high learning rate, but thing's aren't quite as bad as Equation 9 implies. This seems worth mentioning.
Good idea, done (lines 256257).
16. You mention, on the top of page 8, "that the absence of dendritic democracy does not contradict the presence of democratic plasticity". It seems worth mentioning, just for completeness, that dendritic democracy can be achieved without democratic plasticity. In particular, as far as we can tell, the Euclidean gradient is also consistent with dendritic democracy, although convergence to the democratic state might be slower that for the natural gradient.
Done (lines 250251).
17. p. 10 "In comparison, input from weak synapses only has…" We couldn't get the logic of this and the following sentences. This nontrivial claim is not explicitly tested in the subsequent paragraph, but reappears in the discussion. You should expand on it, or remove it.
We understand how our previous formulation may have caused some confusion here, so we modified the text in the hope of avoiding this misunderstanding (lines 296304). In short: Assuming a single spike arrives via a strong synapse, it will cause a larger PSP (and therefore a larger V compared to the baseline) than the same spike arriving via a weak synapse. Thus, on average, weak synapses require more stimulus than strong synapses for the same weight change to occur.
18. p22: "The integral formulas follow from…". It would help if you referred to Equations 20 and 21 (in Sec. ‘Neuron model’) here. We were confused by the sudden appearance of ε_{o} and c_{ε}. Also, you should mention how they are related to dt.
Done (lines 762763).
19. It is somewhat confusing that you set up the problem about dendritic distance, but then first address, in Figure 3, the role of firing rate, before showing the solution to dendritic plasticity. Perhaps starting with Figure 4, contrasting standard gradients to natural gradients would help.
Indeed, we asked ourselves the same question when writing the manuscript. In the end, we decided to present our results in this order for several reasons. First, Figure 3 is intended to be the classical “it works” figure, before delving into the details and implications. Second, it does portray some fundamental properties of the natural gradient, namely the isotropization of the cost function close to the optimum. Third, it addresses the purely functional perspective (not only does it work, but it also can surpass the standard Euclidean approach) early on, which we believe to be important. Fourth, if it came after Figure 4, we feel it would disrupt the flow by switching from physiology to function then to physiology again. So while we certainly agree that swapping the figures would carry some benefit, we believe that it would also come with certain downsides and would therefore prefer the current order.
20. Page 7/8, you say "neurons are not symmetrical geometric objects". You should make it clear what this means (presumably it means that they have dendrites, and, therefore, some weights are attenuated). In addition, it's not clear that not being a symmetrical geometric objects implies a nonisotropic cost landscapes, since how isotropic the cost landscape is depends on the input and cost function as well as the attenuation in the dendrites. In principle they former could cancel the latter, producing an isotropic cost landscapes even with strongly attenuating neurons. It seems best to drop this point.
You are absolutely right that the different factors that affect the shape of a cost function could, in principle, cancel out. We thought that at this point it would be instructive to convey the intuition that not only are cost landscapes often anisotropic, but that, given the diversity of morphologies and input signals, it would be an incredible coincidence if cost landscapes were isotropic in the first place. We modified the text (lines 188192) to make it more clear, following your suggestions.
21. A number of studies have recently emphasized that in large networks most machine learning problems have highly degenerate minima (e.g. Belkin et al. PNAS). Does the algorithm generalize to such situations? Your answer may be "we don't know". But whatever the answer, it would be worth mentioning in the paper.
The generalization of our learning rule to larger networks is certainly a research direction that we are very interested to explore. Variants of naturalgradient learning have already been successfully applied to ratebased deep networks, and their convergence behavior has often been found to compare favorably to methods based on Euclidean gradient descent (see e.g. Bernacchia et al. 2018, Ollivier 2015, Pascanu and Bengio 2013).
We are therefore optimistic that also for the case of Poissonspiking neurons, naturalgradient learning will generalize well to the large network case. In particular, since our learning rule relies on local information, we do not expect it to be affected by the presence of degenerate minima in a different way than standard backpropagation.
A detailed analysis of this scenario is however beyond the scope of the current paper, since there are still many open questions on how to best train stochastically spiking deep networks with gradient methods, even for Euclidean gradient descent.
22. Page 7, last paragraph: should the reference to Figure 5 be to Figure 4?
You are right, we corrected this typo (line 231).
23. The abundance of superscripts, subscripts and decorations in the variables throughout the manuscript make it very hard to read. We would suggest simplifying notation as much as possible. In particular:
a. The superscript on the gradient operator doesn't help, as it requires extra memorization  which is hard because it's not standard. You would be much better off putting the superscript on the cost function, and not redefining the gradient. It's especially confusing in Equation 6, since at first glance the superscript n means take n derivatives.
We very much agree with the aim of simplifying notation, and we tried our best to do so. Moreover, the point about the nth derivative is absolutely correct. We tried to address it by capitalizing the superscript and making it nonitalic. However, we think that the superscript must remain on the gradient, because it would otherwise suggest that it is the cost function that changes. While it is true that it is useful to view the natural gradient as “isotropizing the cost function”, strictly speaking, the cost function does not change. For the exact same cost function under the exact same parametrization, the two gradients predict different trajectories in weight space.
b. There are two w's: w^{d} and w^{s}. Except sometimes there's a w without a superscript. It should always be clear which w you're referring to.
Agreed, and we have amended the manuscript accordingly. We use all three of these notations to emphasize the difference between somatic, dendritic and parametrizationinvariant learning. In Equation 10 (revised manuscript, line 101), the somatic amplitude is learned, and the text below now makes it explicit. As we move to the natural gradient, the learning becomes parametrizationinvariant, and the superscriptfree version of the weights is used to make this point. We added a sentence above Equation 11 in which we say this explicitly (lines 123124).
c. The notation in Equation 7 may make for easier reading, but it makes it very difficult to figure out what's actually going on, especially since there's no easy way to tell vectors from scalars. We would strongly recommend using components (dw_{i}/dt = …). It would make it much more clear, with very little cost.
Throughout the paper, we chose to use vectors for easier reading and have consistently used bold notation for vectors. In particular, it greatly reduces the number of indices, which is already rather high, due to the complexity of the problem, as you have pointed out above. We fear that a switch of notation around one particular equation might be confusing to the reader. Alternatively, using componentwise notation everywhere would make the equations significantly more difficult to parse. We would therefore kindly ask you to consider keeping the current notation.
24. Along the same line, the phrase "unweighted synaptic potential (USP) train" is pretty distracting. We would strongly suggest dropping it, since everybody agrees what a spike train is. Plus, we searched, and "weighted dendritic potentials" was used only once, in the sentence before unweighted synaptic potentials were defined. So "unweighted" seems a bit redundant.
We understand the possible confusion and hope that writing out the voltage equation (see changes in main text, Equation 5 of the revised manuscript, line 84) helps clarify this point. The spike trains are denoted as x_{i}, whereas the filtered spike trains (with the PSP kernel \ε) are denoted by x_{iε}. These are what we call USPs, and they are quite important, as they appear explicitly in the learning rule (and its derivation).
25. Important equations should, in our opinion, be displayed. That's because readers (OK, some of these readers) always go back and search for important quantities, and they're easier to see if displayed. (This was triggered by the expression for V, a few lines up from Equation 2.)
We absolutely agree and have displayed more equations inline in the revised manuscript, both in the main text and in the supplement. Being aware that different readers have different preferences regarding equations, we tried to reduce the number of displayed equations to a minimum, but your comments encouraged us to increase the value of this minimum.
26. The setup should be made clear up front: the neuron is receiving a teacher spike train, y_{t}, and using those spike trains to update its synapses. The statement before Equation 3, "Assuming that the neuron strives to reproduce a target firing distribution p*(yx)" was not helpful (I couldn't really make sense of it, so I kind of ignored it). And so it took me forever to figure out what was going on.
Agreed. We modified the text accordingly (lines 9397).
27. Along the same lines, wouldn't it make sense to just say the neuron is trying to maximize the log probability of the teacher spikes? It gives the same cost function, and it's a lot less obscure than saying the cost function is the KL distance. Clearly a matter of taste, but log probability seems more sensible.
We agree, and added a sentence about the equivalence of minimizing the KL and maximizing the loglikelihood of teacher spikes (lines 99100).
28. On the top of page 9, it says that learning rates are inversely correlated to the variance of σ^{2}(x^{ε}). It needs to be clear why the equations say that; right now it's pretty much out of the blue.
Indeed, this statement referred more to the empirical observation provided in Figure 5. The more precise mathematical reasoning is given in the next sentence (line 262): “In particular, for the homosynaptic component, the scaling is exactly [inversely proportional].”
29. End of discussion (page 12): "errorcorrecting plasticity rule". Why is it error correcting?
This is due to the multiplicative error term [Y* – phi(V)] in the learning rule. We address this term explicitly in Section 2.2, as well as in Section 2.6 and Figure 7, where plasticity switches sign at the zeroerror line.
30. Very end of discussion:
"Explicitly and exactly applying natural gradient at the network level does not appear biologically feasible due to the existence of crossunit terms in the Fisher information matrix G. However, methods such as the unitwise naturalgradient approach (Ollivier, 2015) could be employed to approximate the natural gradient using a blockdiagonal form of G. For spiking networks, this would reduce global naturalgradient descent to our local rule for single neurons."
We didn't understand that  it should be unpacked.
We have revised and extended this paragraph to hopefully improve its clarity (lines 407414). In particular, it now explicitly addresses the nonlocalities induced by applying naturalgradient descent at the network level and explains the benefits of the unitwise solution more clearly.
31. Given that this is eLife, and there's no page limit, we would strongly urge you to get rid of the Appendix and put all the analysis in Methods. We suspect anybody willing to read Methods will want to see the algebra in the Appendix. In any case, all relevant quantities (e.g., g_{1}, …) should be in Methods; the reader shouldn't have to hunt them down in what will potentially be another document. And certainly, the gradient, which is central to the whole endeavor, should be in Methods. As should Equation S18, so that the reader knows where G(w) comes from.
Agreed, done, and thank you for the suggestion! We struggled with this decision ourselves and tried to make the manuscript more readerfriendly by shifting some of the more detailed content into an SI, but are definitely happy with the new structure.
32. There's a much easier derivation of G(w), Equation 31:
int dx exp(h dot t) N(x; mu, Σ) f(a dot x)
is easy to evaluate (here x is a vector and N(x; mu, Σ) is a Gaussian with mean mu and covariance Σ). Take two derivatives, and, voila, you have. Same answer, but easier on the reader.
We really appreciate the detailed feedback, especially given how complicated some of our derivations are and we are certainly welcoming suggestions for simplification. In this case, however, we must apologize but the derivation of Equation 31 (now Equation 48) already seems quite straightforward to us – which, admittedly, might very much be a matter of taste.
In addition, the derivation of its inverse can be more concise. G(w) is a sum of a diagonal matrix and a rank2 modulation. Denoting
V = (r w') and C = [[c1 e^{2}, c2 e], [c2 e, c3]],
we may write
G(w) = c1 \Σ + V C V^{T}.
Applying the Woodbury identity,
G{^{1}}(w) = \Σ{^{1}}/c1 – V' (C{^{1}} + V \Σ{^{1}} V^{T}) {^{1}} V'^{T},
where
V' = (r'/e w).
Thus,
g = [[g1 g3], [g2 g4]]
is given as
g = – (C{^{1}} + V \Σ{^{1}} VT) {^{1}}.
Thank you for this suggestion! Indeed, the idea of working directly with a rank2 correction is more elegant and concise than our iterative application of ShermanMorrison, and certainly more appealing to a reader interested in understanding the principle behind our method. We therefore added this idea to our derivation in the Methods (line 779780). However, since the result remains the same, we have conserved our original derivation in order to avoid an extensive and errorprone change in notation.
33. Aren't Equations S1213 a repeat of arguments made above, on the same page? Or is this something new? Either way, it should be clear (and if it's a repeat, maybe it can be dropped).
Assuming you are referring to either Equation S3 or Equation S4 (now Equations 51 and 52, lines 733735), both of them are similar to the arguments made in Eq. S12 and Eq. S13 (now Equations 61 and 62, line 742), but not exactly the same.
1. In Equation S3 (now Equation 51), the density over the USPs vanishes, since it appears in both the counter and the denominator of the fraction.
2. In Equation S4 (now Equation 52), the density of the target firing distribution vanishes when taking the derivative, since it does not depend on w.
3. In contrast, in EquationS12 and S13 (now Equations 61 and 62), the density over the USPs vanishes when taking the derivative, since it also does not depend on w.
We have adapted the text to make this difference more clear.
[Editors' note: further revisions were suggested prior to acceptance, as described below.]
Reviewer #1:
The manuscript is much improved, although it's still a bit hard to read. Some of which we think could be easily fixed. Specific comments follow.
1. p 1: "It certainly could be the case that evolution has favored one particular parametrization over all others during its gradual tuning of synaptic plasticity, but this would necessarily imply suboptimal convergence for all but a narrow set of neuron morphologies and connectome configurations."
We found this confusing, since parametrization and learning rules are not directly coupled. Did you mean "evolution has favored one particular parametrization over all others, and updates weights using Euclidean gradient descent ……"?
We adapted the sentence to address this issue.
2. One more attempt to kill USP: why not call x^{ε} filtered spike trains, since that's what they are? "Unweighted synaptic potential" is completely unfamiliar, and with a paper this complicated, that's the last thing one needs  every single time we read that phrase, we had to mentally translate it to filtered spike trains.
We believe that USP is a more exact description for x^{ε}, because it implicitly specifies the filter (same filter as the membrane, difference of exponentials) and the lack of synaptic weighting (see our reply to #24) whereas “filtered spike trains” does not (it could just be the synaptic current, for example). The first paragraph of Sec. ‘The naturalgradient plasticity rule’ and especially Equation 5 should leave no room for interpretation.
3. Technically, x_{iε} should depend on distance to the soma: because of dendritic filtering, the farther a psp travels along a dendrite the more it spreads out (in time). You should mention this, and speculate on whether it will make much difference to your analysis. It seems like it might, since c_{ε} scales with the timescale of the psp.
It is true that our model does not capture all the details of biological neurons, but our observations regarding the effects of weight reparametrization hold in general. We clarified this in a footnote above Equation 1.
4. In both places where you mention spikes (x_{i}, above Equation 4 and Y^{*}, above Equation 9) you should point out that they are a sum of δ functions. "Spikes" isn't so well defined in this field.
We agree, which is why we stated this explicitly in Equation 27.
5. If you want anybody besides hard core theorists do understand what you're doing, you'll need to be explicit about what p* is: it's Equation 7 with phi replaced by phi*, the target firing rate as a function of voltage. We strongly suggest you do that. In particular, the description "a target firing distribution p*(y∣x^{ε})" doesn't help much, since a firing rate distribution is usually a distribution over firing rates. What you mean, though, is that p* gives you the true probability of a spike in a time interval.
We agree, which is why we stated this explicitly in Sec. ‘Sketch for the derivation of the somatic naturalgradient learning rule‘, Equations 32 and 33.
5. It would be very helpful, in Equation 9, to add
\approx dt[ (phi – – phi*) – – phi* log(phi/phi*) ]
(and point out that it's exact in the limit dt > 0). Then Equation 10 is easy  just say that phi* is replaced by Y*.
We agree, but have decided – in line with suggestions from the reviewers – to keep the main manuscript as light as possible on mathematical details/derivations and rather focus on the key arguments and insights. (In this particular case, since Equation 10 is not our result and appears often in other literature, we point to Pfister et al.et al. (2006) for more details.)
6. Starting in Equation 10, you switch to a weight without a superscript. We found this very confusing, since the learning rule you wrote down in Equation 10 is exactly the one for w^{s}. We strongly suggest you never use w. But if there is a reason to do so, you should be crystal clear about what it is. As far as we could tell, this switch to w is completely unsignalled. So maybe it's a typo?
We agree, which is why we have stated this explicitly in the text above Equation 11: “In the following, we therefore drop the index from the synaptic weights w to emphasize the parametrizationinvariant nature of the natural gradient.” In particular, this means that Equation 13 holds in a more general setting than only for the somatic parametrization. If we use w_{s} instead of w, we lose this generality.
7. p 4: "which is likely to harm the convergence speed towards an optimal weight configuration."
The word "likely" is not all that convincing. Can you be a lot stronger? Are there relevant refs? Or you could point to the simulations in Figure 3.
We provide an explanation under Equation 3, which is explicitly referenced in the sentence you mention: “However, from a functional perspective, the opposite should be true: dendritic weights should experience a larger change than somatic weights in order to elicit the same effect on the cost.” We used “likely” simply because we want to be cautious. It is possible to imagine (arguably pathological) cases where the somatic learning rate is so large that it prevents convergence, so a dendritic parametrization can help by decreasing the learning rate.
8. p 4: "In the following, we therefore drop the index from the synaptic weights w to emphasize the parametrizationinvariant nature of the natural gradient."
In fact, what's really going on is that you're deriving the natural gradient learning rule (Equation 13) with V given by
V = sum_{i} f(w_{i}) x_{iε}
which is actually very confusing, since f was used previously to relate w^{s} to w^{d} (Equation 4). Is the f in Eq 14 the same as the one in Equation 4? If so, it should be w^{d} on the right hand side of Equation 14. If not, you should use a different symbol for the function.
The f here is a generalization of the f in Equation 4. It describes the relationship of an arbitrary weight parametrization to the somatic amplitude parametrization. Since Equation 4 is actually a special case for the dendritic amplitudes, we feel that using f in both cases is justified. We added a sentence below Equation 14 that states this explicitly.
As you can see, your notation is confusing us. Our suggestion would be to always use superscripts s or d, and never use w by itself.
9. Figure 2:
"(A) During supervised learning, the error between the current and the target state is measured in terms of a cost function defined on the neuron's output space; in our case, this is the manifold formed by the neuronal output distributions p(y, x)."
What does "defined on the neuron's output space" mean? Doesn't the cost function depend only on the weights, since it's an average over the inputoutput function? Which is more or less what you say in the next sentence,
"As the output of a neuron is determined by the strength of incoming synapses, the cost C is an implicit function of the afferent weight vector w.!
Although we're not sure why you say "implicit" function; isn't it a very explicit function (see point 5 above).
The cost function does indeed depend on the weights, but only indirectly, via the output. If two particular weight configurations give the same output, the cost function will be the same. This is illustrated in Figure 2A. We agree that “implicit” was not the right word here and replaced it with “indirectly”. C is defined on the space of p, but the p themselves depend on w, so C depends indirectly on w.
"If, instead, we follow the gradient on the output manifold itself, it becomes independent of the underlying parametrization."
Since we don't know what the output manifold is (presumably the statistical manifold in panel A? not that that helps), this sentence doesn't make sense to us.
In the paragraphs directly above Equation 11 we discuss the output manifold in detail. In particular, we say explicitly that “the set of our neuron’s realizable output distributions[,] forms a Riemannian manifold” and, in the first sentence under Figure 2, that “[the] cost function [is] defined on the neuron's output space; in our case, this is the manifold formed by the neuronal output distributions p(y,x)”. These distributions are, in turn, defined in Equation 7 and 8.
"In contrast, naturalgradient learning will locally correct for distortions arising from nonoptimal parametrizations."
This is a highly nontrivial statement, and there's nothing in the manuscript so far that makes this obvious. I think a reference is needed here. Or point the reader to Figure 3 for an example?
Since natural gradient multiplies the Euclidean gradient by the inverse Fisher information matrix G^{1} (and G is a Riemannian metric), provides the relationship between distances on the parameter space and the output manifold, we believe this statement is justified. We added a reference to Figure 3 as suggested.
10. Equation 13: You may not like components, but it would be extremely helpful to put them in Equation 13, or maybe add an equation with components to make it clear. There is a precedent; you use components in Equation 4. And let's face it, anybody who isn't mathematically competent will have been lost long ago, and anybody who is mathematically competent will want to be sure what's going on. And things are actually ambiguous, since it's really hard to tell what's a vector and what's a scalar. In particular, γ_{u} should be treated as a vector, even though it's not bold, so it's not immediately clear whether γ_{s} should also be treated as a vector. If nothing else, you should use γ_{u} {\bf 1}; that would help a little. Also, the convention these days when writing f(w) is for f not to be bold, but instead define it as a pointwise nonlinearity. You don't have to follow conventions, but it does make it easier on the reader.
The same comment applies to Equation 17.
Thank you for pointing this out – yes, we should have multiplied γ with the unit vector explicitly, as we did in other parts of the document. The vector notation was a conscious choice on our part, with one important reason being the avoidance of stacked indices. Given this decision, we used bold notation for vectorvalued functions such as f to help the reader differentiate them from scalarvalued ones such as C.
11. In Equation 14, presumably w^{s} on the left hand side should be bold? And why not use indices, as in Equation 4? Much more clear, and fewer symbols.
Good catch, thank you!
12. p 6, typo: "inactive inactive".
Good catch, thank you!
13. Two questions and one comment about Figure 3F. First, why don't the orange and blue lines start in the same place? Second, the blue line doesn't appear to be saturating. Can the Euclidean gradient do better than the natural gradient? If so, that's important to point out. But maybe it's not true, and is a byproduct of your optimization method?
We describe this in the simulation details (section 6.4.1), but we added a few clarifying words: “For the learning curves in Figure 3F, initial and target weight components were chosen randomly (but identically for the two curves) from a uniform distribution on U(1/n,1/n), corresponding to maximal PSP amplitudes between 600 µV and 600 µV for the simulation with n=100 input neurons. Learning curves were averaged over 1000 initial and target weight configurations.” This implies that the orange and blue line do indeed start at the same point, but it is just a bit difficult to see since they partly overlap.
Furthermore, the Euclidean gradient descent does converge, which is a bit more obvious in Figure 9A. However, we wanted to highlight the difference between naturalgradient and Euclideangradient descent and therefore chose to focus on the most relevant part of the learning curves. Please note also Figure 12CF, which shows the convergence behavior displayed in Figure 3F is consistent over multiple input scenarios.
Since we did not provide a formal proof that naturalgradientdescent learning outperforms Euclideangradientdescent learning, there could be special scenarios where Euclideangradientdescent learning exhibits, on average, a more favorable convergence behavior than natural gradient. However, we did not observe such a case in our simulations, which is consistent with the findings from other studies of naturalgradient descent.
And the comment: presumably the D_{KL} term on the yaxis has a factor of dt. That makes it pretty much impossible to interpret, since we don't know what dt is (probably it's somewhere, but it's not in an obvious place). We suggest removing the factor of dt, and making it clear that when phi is close to phi*, what's being plotted is <(phiphi*)^2/2 phi*>. That will make the plot much easier to interpret.
We see how this relates to you point #5, but we think that Figure 9C (distance of firing rates) should provide additional context to interpret Figure 3F, should the need arise.
14. p 8: "As a result, the effect of synaptic plasticity on the neuron's output is independent of the synapse location, since dendritic attenuation is precisely counterbalanced by weight update amplification."
Because of the term γ_{w} w_{d}, this is true only if w^{d} \propto 1/α(d). But, as you point out later, this doesn't have to be the case. So this statement needs to be modified. Maybe it's approximately true?
This might be a misunderstanding. Equation 17 is a special case of Equation 13 for exactly the case where Equation 16 holds. Please note the general form of the last term in Equation 13 and refer to Figure 8B for a detailed explanation on how attenuation is canceled out for naturalgradientdescent learning.
15. Figure 5, several comments:
In the equation in panel H, r should be in the numerator, not the denominator, and a factor of ε_{02} is missing. In addition, the term on the right hand side can be simplified:
σ^{2}(x^{ε}) = r ε_{02}/2(tau_{1} + tau_{2}).
Also it would be a good idea to switch to tau_{s} and tau_{m} here rather than tau_{1} and tau_{2}, to be consistent with Methods, and to not clash with the tau's in panels AB.
And in panels AB, presumably it should be tau{_{si}} rather than tau_{i}?. Also, tau_{m} should be reported here.
Good ideas, thank you, we have incorporated all of them into Figure 5.
It would make it a lot easier on the reader if you wrote down an equation for cε, which isn't so complicated: c_{ε}/r_{i} = 2 (tau_{m} + tau_{s})/ε_{02} r_{i}. Equation 19 is a natural place to do that.
Agreed, we added this expression to Equation 31 (see also # 23).
Because of the other two terms in Equation 17, it's not true that "Naturalgradient learning scales inversely with input variance." It should be clear that this is approximate.
You are certainly right, which is why we said “inversely” rather than “inversely proportional”. We added an “approximately” to make this unequivocal.
16. In Equation 19, it should be x_{iε} on the left hand side, not x^{ε}.
Good catch, thank you!
17. p 10: "Furthermore, it is also consistent with data from Aitchison and Latham (2014) and Aitchison et al.et al. (2021), as well as with their observation of an inverse dependence on presynaptic firing rates, although our interpretation is different from theirs."
There is no justification for this statement: in Aitchison et al.et al. a plot of Δ w/w versus r showed 1/sqrt(r) scaling. That would be consistent with the theory in this paper only if the weight scales as 1/sqrt(r). It might, but without checking, you can't make that claim.
That data is weak, so the fact that you don't fit it is hardly the end of the world. However, what's more important is to point out that the theory here makes very different predictions that the theory in Aitchison et al.et al. That way, experimentalists will have something to do.
Finally, the inverse scaling with firing rate must eventually saturate, since there's a limit to how big weight changes can be. You should comment on this, if briefly.
We suspect that this is simply a matter of wording, as we also pointed out above, under #15. We used “inverse” in an approximate, qualitative sense, and not as “inversely proportional”. It is meant to be taken as “if the presynaptic rate becomes larger, the weight changes become smaller”. We believe the statement to be correct and in line with Aitchison et al.et al., who write “the learning rate […] increases as the synapse’s uncertainty […] increases” and “the synapse’s uncertainty should fall as the presynaptic firing rate increases”. We adapted the corresponding sentence in our manuscript to mitigate the potential misunderstanding that our model makes the same quantitative predictions.
18. Figure 7, we're pretty lost, for several reasons:
a. Because homosynaptic scales as 1/r, it's not possible to have unstimulated synapses (for which r=0); at least not with the current derivation. If you want to have unstimulated synapses, you need to show that it is indeed possible, by computing G when r_{i}=0 for some of the i's.
We already addressed this in our response to point 8b of the previous review: “You are right, and we addressed this misunderstanding in the revised caption of Figure 7 and Sec. ‘Comparison of homo and heterosynaptic plasticity’ (lines 677679). In the simulation, the synapses received no input spikes at all. The assumption of infinitesimally small input rates [in the simulated learning rule] was simply an argument that allowed us to avoid division by zero in the homosynaptic term. Biologically, this would simply correspond to zero homosynaptic plasticity for zero input.” Please note that line numbers and highlighted text changes are now, necessarily, different; the referenced sentence is the second one in the first paragraph of Sec. ‘Comparison of homo and heterosynaptic plasticity’.
b. From the bottom of page 27, γ_{u} \approx ε_{0} c_{ε}. Thus, the first two plasticity terms are
c_{ε} (x^{ε}/r – ε_{0}).
Because = r ε_{0}, these two terms approximately cancel. Which means everything should be driven by the last term, c_{w} V f(w). This doesn't seem consistent with the explanation in Figure 7.
If we understand correctly, you are looking at the average weight change <\Δ w>. This is an average over a product between the postsynaptic error and the parenthesis containing the sum over the three terms (cf. Equation 13). The average of this product is not the product of the two averages, because the error is correlated with the input, and plasticity is driven by this very correlation (and of course the last term c_{w} V f(w)).
c. Because of the term c_{w} V f(w), shouldn't there be a strong dependence on the unstimulated weights in panels BD?
The purpose of these panels is to address this exact question. Figure 7B shows the change in the stimulated weights, so there is no dependence on the unstimulated weights (since the influence of the term c_{w} V f(w) is componentwise). In contrast, for the unstimulated weights (Figure 7C), we see a clear effect of this term which particularly manifests itself in the switch from potentiation to depression in the upper right corner of the panel. This dependence on the unstimulated weight is also clearly visible in the upper right corner of Figure 7D.
All this should be clarified.
19. p 13: "We further note an interesting property of our learning rule, which it inherits directly from the Fisher information metric that underlies natural gradient descent, namely invariance under sufficient statistics (Cencov, 1972)." What does "invariance under sufficient statistics" refer to?
A parameter of a probability distribution can often be described by a statistic (such as the sample mean) which is sufficient if it contains all necessary information about the parameter. The distributions of a sufficient statistic form themselves a parametric model and therefore a Riemannian manifold, with a onetoone relationship between the original parametric model and the one for the sufficient statistic. Due to the invariance property of the Fisher metric, the distances of corresponding points will be equal up to a constant factor. We felt the need to mention this distinguishing property of the Fisher metric, since it makes it stand out from other Riemannian metrics on statistical manifolds and because it is relevant for biological neurons, as explained in the paragraph. Since we believe that the discussion is not a good place to elaborate on mathematical details – again, to not distract the reader from the main message – we refer to the original literature, in this case Cencov, 1972.
20. p 13: "A further prediction that follows from our plasticity rule is the normalization of weight changes by the presynaptic variance. We would thus anticipate that increasing the jitter in presynaptic spike trains should reduce LTP in standard plasticity induction protocols."
Presumably this statement comes from the fact that the learning rate scales inversely with the variance of the filtered spike train. However, it's not clear that jittering spike trains increases the variance. What would definitely increase the variance, though, is temporal correlations. Maybe this could be swapped in for jittering? Eiththeyer that, or explain why jittering increases variance. And also it would be nice to refer the reader to the dependence of the learning rate on variance.
We adapted the sentence as requested.
21. You should comment on where you think the teacher spike train comes from. Presumably it's delivered by PSPs at the soma, but those don't propagate back to synapses. How do you envision the teacher spike trains communicating with the synapses? Even if the answer is "we don't know", it's important to inform the reader  this may simply be an avenue for future research.
We did, directly below Equation 10, where the teacher spike train first appears in the postsynaptic error term: “On the singleneuron level, a possible biological implementation has been suggested by Urbanczik and Senn (2014), who demonstrated how a neuron may exploit its morphology to store errors, an idea that was recently extended to multilayer networks (Sacramento et al.et al., 2017; Haider et al.et al., 2021a).”
22. In Equations 29 and 30, you should explain why you use \approx. Presumably because that's because you assumed a constant firing rate?
Exactly, “\approx” becomes “=” for constant input rates. We added an explanatory sentence below Equation 31.
23. Equation 31: it would be extremely useful to point out that c_{ε} = 2(tau_{m} + tau_{s})/ε_{0}^2, along with a reference (or calculate it yourself somewhere).
Done.
24. After Equation 31, "Unless indicated otherwise, simulations were performed with a membrane time constant tau_{m} = 10 ms and a synaptic time constant τs = 3 ms. Hence, USPs had an amplitude of 60 mV".
Doesn't the amplitude of the USPs depend on ε_{0}? And 60 mV seems pretty big. Is that a typo?
Yes, it does, and we explicitly say that ε_{0} = 1 mV ms just a few lines above. The size of the USP does not matter – the U stands for unweighted, hence our insistence on keeping the acronym to make this explicit (see our reply to #2). Just one sentence further down from the one you cite, we discuss the actual size of PSPs. (But also in other places, for example Sec. ‘Distance dependence of amplitude changes’.)
25. Equation 34: Missing parentheses around Σ_{USP} w^{s} in the second term in parentheses.
Good catch, thank you!
26. Equation 24 and the line above it: should w_{i} be w_{id}? If not, you shouldn't use f, since that was used to translate somatic to dendritic amplitude.
We are not sure what you mean, since Equation 24 does not contain a w_{i}. Could this be a typo? Maybe our reply to # 8 regarding the use of f is helpful in this context.
27. Equation 115: Should it be Theta(Vtheta)?
Good catch, thank you!
28. Equation 117 is identical to Equation 67. Did you mean to drop the term (phi')_{2}/phi?
Yes, thank you!
Also, (phi')^{2}/phi = 1 when V > theta. But presumably it's equal to 0 when V < theta. If so, Equation 117 should be
G(w) = E(dt Theta(Vtheta) x x^{T}).
if that's correct, then the integrals I_{1}I_{3} do not reduce to the values given in Equations 118120.
You are right and thank you for pointing this out. We think that everything should work out fine if we assume that the mass of the membrane potential distribution is above theta. While this is not necessarily biologically realistic, we think that this result is still instructive due to its simplicity. We adapted the text for more clarity.
Reviewer #2:
The authors have done a good job in the revision.
A number of small suggestions remaining:
– Equation 13. Wouldn't it be easier to write this for a single component $\dot{wi)}=$……
We agree that sometimes this can be easier, but in this case, we decided to use vector notation to avoid index clutter, see also our reply to #10 of Reviewer 1.
– Figure 3 shows temporal structure in the teacher signal but this is nowhere explained in the main text.
We added a few clarifying sentences to the caption of Figure 3: “During learning, the firing patterns of the student neuron align to those of the teacher neuron. The structure in these patterns comes from autocorrelations in this instantaneous rate. These, in turn, are due to mechanisms such as the membrane filter (as seen in the voltage traces) and
the nonlinear activation function.”
– Fig3D+E perhaps the axis or caption can indicate whether w1,2 is 10 or 50Hz.
Good idea, thank you! We added a corresponding sentence to the caption.
– Figure 4A: Shouldn't the purple top solid curve (dendritic voltage) be taller than the solid orange curve?
Yes, absolutely, we adapted the sketch to make this clearer.
– Fig5D+E+F might look better with the position of the yaxis left instead of right.
We chose the axes on the left so the histograms look more familiar and just rotated by 90° CCW.
https://doi.org/10.7554/eLife.66526.sa2Article and author information
Author details
Funding
European Union 7th Framework Programme (604102)
 Walter Senn
Horizon 2020 Framework Programme (720270)
 Walter Senn
 Mihai A Petrovici
Horizon 2020 Framework Programme (785907 and 945539)
 Walter Senn
 Mihai A Petrovici
Swiss National Fonds (personal grant 310030L_156863 (WS))
 Walter Senn
Swiss National Fonds (Sinergia grant CRSII5_180316)
 Walter Senn
Manfred Stärk Foundation (personal grant)
 Mihai A Petrovici
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
This work has received funding from the European Union 7th Framework Programme under grant agreement 604102 (HBP), the Horizon 2020 Framework Programme under grant agreements 720270, 785907 and 945539 (HBP) and the SNF under the personal grant 310030L_156863 (WS) and the Sinergia grant CRSII5_180316. Calculations were performed on UBELIX (http://www.id.unibe.ch/hpc), the HPC cluster at the University of Bern. We thank Oliver Breitwieser for the figure template, and Simone Surace and João Sacramento, as well as the other members of the Senn and Petrovici groups for many insightful discussions. We are also deeply grateful to the late Robert Urbanczik for many enlightening conversations on gradient descent and its mathematical foundations. Furthermore, we wish to thank the Manfred Stärk Foundation for their ongoing support.
Senior Editor
 John R Huguenard, Stanford University School of Medicine, United States
Reviewing Editor
 Peter Latham, University College London, United Kingdom
Publication history
 Received: January 13, 2021
 Accepted: February 21, 2022
 Version of Record published: April 25, 2022 (version 1)
Copyright
© 2022, Kreutzer et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 607
 Page views

 77
 Downloads

 0
 Citations
Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Further reading

 Computational and Systems Biology
 Evolutionary Biology
Studies of protein fitness landscapes reveal biophysical constraints guiding protein evolution and empower prediction of functional proteins. However, generalisation of these findings is limited due to scarceness of systematic data on fitness landscapes of proteins with a defined evolutionary relationship. We characterized the fitness peaks of four orthologous fluorescent proteins with a broad range of sequence divergence. While two of the four studied fitness peaks were sharp, the other two were considerably flatter, being almost entirely free of epistatic interactions. Mutationally robust proteins, characterized by a flat fitness peak, were not optimal templates for machinelearningdriven protein design – instead, predictions were more accurate for fragile proteins with epistatic landscapes. Our work paves insights for practical application of fitness landscape heterogeneity in protein engineering.

 Computational and Systems Biology
 Evolutionary Biology
Using a neural network to predict how green fluorescent proteins respond to genetic mutations illuminates properties that could help design new proteins.