Natural-gradient learning for spiking neurons

  1. Elena Kreutzer (corresponding author)
  2. Walter Senn
  3. Mihai A Petrovici (corresponding author)
  1. Department of Physiology, University of Bern, Switzerland
  2. Kirchhoff-Institute for Physics, Heidelberg University, Germany
12 figures, 1 table and 2 additional files

Figures

Classical gradient descent depends on chosen parametrization.

(A) The strength of a synapse can be parametrized in various ways, for example, as the EPSP amplitude at either the soma, $w_s$, or the dendrite, $w_d$. Biological processes such as attenuation govern the relationship between these variables. Depending on the chosen parametrization, Euclidean-gradient descent can yield different results. (B) Phenomenological correlates. EPSPs before learning are drawn as continuous curves, after learning as dashed curves. The light blue arrow represents gradient descent on the error as a function of the somatic EPSP, $C_s[w_s]$ (also shown in light blue). The resulting weight change leads to an increase $\Delta w_s$ in the somatic EPSP after learning. The dark blue arrows track the calculation of the same gradient, but with respect to the dendritic EPSP (also shown in dark blue): (1) taking the attenuation into account in order to compute the error as a function of $w_d$, (2) calculating the gradient, followed by (3) deriving the associated change $\Delta\tilde{w}_s$, again considering attenuation. Because the attenuation $f(w)$ enters the calculation twice, the synaptic weight updates, as well as the associated evolution of a neuron's output statistics over time, differ between the two parametrizations.
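The two update paths in panel B can be written out explicitly. The following is only a brief sketch in our own notation (we write the somatic EPSP as a function of the dendritic one, $w_s = f(w_d)$, and use a common learning rate $\eta$; it mirrors the caption rather than reproducing the paper's full derivation):

\[
\Delta w_s = -\eta\,\frac{\partial C_s}{\partial w_s},
\qquad
\Delta w_d = -\eta\,\frac{\partial C_s}{\partial w_d} = -\eta\, f'(w_d)\,\frac{\partial C_s}{\partial w_s}
\;\;\Rightarrow\;\;
\Delta\tilde{w}_s \approx f'(w_d)\,\Delta w_d = -\eta\, f'(w_d)^2\,\frac{\partial C_s}{\partial w_s}.
\]

The Jacobian of the attenuation enters once in step (1) and once in step (3), so the two somatic changes agree only in the special case $f'(w_d)^2 = 1$.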

The natural gradient represents the true gradient direction on the manifold of neuronal input-output distributions.

(A) During supervised learning, the error between the current and the target state is measured in terms of a cost function defined on the neuron’s output space; in our case, this is the manifold formed by the neuronal output distributions $p(y,x)$. As the output of a neuron is determined by the strength of incoming synapses, the cost $C$ depends indirectly on the afferent weight vector $w$. Since the gradient of a function depends on the distance measure of the underlying space, Euclidean-gradient descent, which follows the gradient of the cost as a function of the synaptic weights, $\partial C/\partial w$, is not uniquely defined, but depends on how $w$ is parametrized. If, instead, we follow the gradient on the output manifold itself, it becomes independent of the underlying parametrization. Expressed in a specific parametrization, the resulting natural gradient contains a correction term that accounts for the distance distortion between the synaptic parameter space and the output manifold. (B–C) Standard gradient descent learning is suited for isotropic (C), rather than for non-isotropic (B) cost functions. For example, the magnitude of the gradient decreases in valley regions where the cost function is flat, resulting in slow convergence to the target. A non-optimal choice of parametrization can introduce such artefacts and therefore harm the performance of learning rules based on Euclidean-gradient descent. In contrast, natural-gradient learning will locally correct for distortions arising from non-optimal parametrizations (see also Figure 3).
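The contrast between panels B and C can be reproduced with a minimal numerical sketch. This is a generic illustration, not the paper's spiking model: on an anisotropic quadratic cost, preconditioning the Euclidean gradient with the local metric (here simply the constant Hessian, standing in for the Fisher information) removes the distortion and sends the update straight towards the target.

```python
import numpy as np

# Anisotropic quadratic cost C(w) = 0.5 * (w - w_t)^T A (w - w_t),
# mimicking the elongated "valley" of panel B.
A = np.diag([1.0, 25.0])           # strongly non-isotropic curvature
w_target = np.array([1.0, 1.0])

def cost(w):
    d = w - w_target
    return 0.5 * d @ A @ d

def euclidean_grad(w):
    return A @ (w - w_target)

def natural_grad(w):
    # The metric G is taken to be the (constant) Hessian A here, playing the
    # role that the Fisher information plays on the output manifold.
    return np.linalg.solve(A, euclidean_grad(w))

w_e = np.array([-1.0, -1.0])       # Euclidean-gradient trajectory
w_n = np.array([-1.0, -1.0])       # preconditioned ("natural") trajectory
eta_e, eta_n = 0.07, 0.5           # learning rates tuned separately for stability
for _ in range(30):
    w_e = w_e - eta_e * euclidean_grad(w_e)   # zig-zags across the valley
    w_n = w_n - eta_n * natural_grad(w_n)     # heads straight for w_target

print("Euclidean:", w_e, "cost:", cost(w_e))
print("Natural:  ", w_n, "cost:", cost(w_n))
```

Even with learning rates tuned separately, the Euclidean trajectory still crawls along the flat valley direction, while the preconditioned one converges within a handful of steps.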

Natural-gradient plasticity speeds up learning in a simple regression task.

(A) We tested the performance of the natural-gradient rule in a supervised learning scenario, where a single output neuron had to adapt its firing distribution to a target distribution, delivered in the form of spikes from a teacher neuron. The latter was modeled as a Poisson neuron firing with a time-dependent instantaneous rate $\phi\!\left(\sum_{i=1}^{n} w_i^* x_i^\epsilon\right)$, where $w^*$ represents a randomly chosen target weight vector. The input consisted of Poisson spikes from $n$ afferents, one half firing at 10 Hz and the other half at 50 Hz. For our simulations, we used $n=100$ afferents, except for the weight path plots in (D) and (E), where the number of afferents was reduced to $n=2$ for illustration purposes. (B–C) Spike trains, PSTHs and voltage traces for the teacher (orange) and student (red) neuron before (B) and after (C) learning with natural-gradient plasticity. During learning, the firing patterns of the student neuron align with those of the teacher neuron. The structure in these patterns comes from autocorrelations in the instantaneous rate. These, in turn, are due to mechanisms such as the membrane filter (as seen in the voltage traces) and the nonlinear activation function. (D–E) Exemplary weight evolution during Euclidean-gradient (D) and natural-gradient (E) learning for $n=2$ afferents with the same two rates as before. Here, $w_1$ corresponds to $x_1$ in panel A (10 Hz input) and $w_2$ to $x_2$ (50 Hz input). Thick solid lines represent contour lines of the cost function $C$. The respective vector fields depict normalized negative Euclidean and natural gradients of the cost $C$, averaged over 2000 input samples. The thin solid lines represent the paths traced out by the input weights during learning, averaged over 500 trials. (F) Learning curves for $n=100$ afferents using natural-gradient and Euclidean-gradient plasticity. The plot shows averages over 1000 trials with initial and target weights randomly chosen from a uniform distribution $\mathcal{U}(-1/n, 1/n)$. Fixed learning rates were tuned for each algorithm separately to exhibit the fastest possible convergence to a root mean squared error of 0.8 Hz in the student neuron’s output rate.
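For concreteness, the input and teacher statistics described in panel A can be sketched as follows. The PSP kernel, its unit height, and the logistic transfer function below are assumptions made for illustration; only the afferent count, the input rates, and the form of the teacher rate $\phi\!\left(\sum_i w_i^* x_i^\epsilon\right)$ are taken from the caption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Setup from panel A (kernel shape and transfer function are assumptions).
n, dt, T = 100, 1e-3, 2.0                     # afferents, time step [s], duration [s]
rates = np.r_[np.full(n // 2, 10.0), np.full(n // 2, 50.0)]   # Hz
tau_s = 10e-3                                  # synaptic time constant [s] (assumed)
w_star = rng.uniform(-1.0 / n, 1.0 / n, n)     # randomly chosen target weights

def phi(u, r_max=100.0, beta=5.0, theta=0.0):
    """Assumed logistic rate function [Hz]; the paper's phi may differ."""
    return r_max / (1.0 + np.exp(-beta * (u - theta)))

steps = int(T / dt)
x_eps = np.zeros(n)                            # synaptically filtered inputs (USPs)
teacher_spikes = []
for t in range(steps):
    spikes = rng.random(n) < rates * dt        # Poisson afferent spikes
    x_eps += -dt / tau_s * x_eps + spikes      # exponential filter, unit kernel height
    rate_teacher = phi(w_star @ x_eps)         # instantaneous teacher rate
    if rng.random() < rate_teacher * dt:       # teacher emits Poisson spikes
        teacher_spikes.append(t * dt)

print(f"{len(teacher_spikes)} teacher spikes in {T} s "
      f"(mean rate ~{len(teacher_spikes) / T:.1f} Hz)")
```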

Natural-gradient learning scales synaptic weight updates depending on their distance from the soma.

We stimulated a single excitatory synapse with Poisson input at 5 Hz, paired with a Poisson teacher spike train at 20 Hz. The distance $d$ from the soma was varied between 0 μm and 460 μm, and attenuation was assumed to be linear and proportional to the inverse distance from the soma. To make weight changes comparable, we scaled dendritic PSP amplitudes with $\alpha(d)^{-1}$ in order for all of them to produce the same PSP amplitude at the soma. (A) Example PSPs before (solid lines) and after (dashed lines) learning for two synapses at 3 μm and 7 μm. Application of our natural-gradient rule results in equal changes for the somatic PSPs. (B) Example traces of synaptic weights for the two synapses in (A). (C) Absolute and relative dendritic amplitude change after 5 s as a function of a synapse’s distance from the soma.
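The distance scaling in panel C follows from parametrization invariance. As a sketch in assumed notation, write the somatic PSP amplitude as $w_s = \alpha(d)\, w_d$ with the distance-dependent attenuation factor $\alpha(d)$ used above; since natural-gradient learning produces the same somatic change $\Delta w_s$ at every location (panel A), the corresponding dendritic change is

\[
\Delta w_d = \frac{\Delta w_s}{\alpha(d)},
\]

which grows as the attenuation becomes stronger, that is, with increasing distance of the synapse from the soma.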

Natural-gradient learning scales approximately inversely with input variance.

(A–C) Exemplary USPs $x_i^\epsilon$ and (D–F) their distributions for three different scenarios between which the USP variance $\sigma^2(x_i^\epsilon)$ is varied. In each scenario, a neuron received a single excitatory input with a given rate $r$ and synaptic time constant $\tau_s$. The soma always received teacher spikes at a rate of 80 Hz. To enable a meaningful comparison, the mean USP was conserved by appropriately rescaling the height $\epsilon_0$ of the USP kernel $\epsilon$ (see Sec. ‘Neuron model’). (A, D) Reference simulation. (B, E) Reduced synaptic time constant, resulting in an increased USP variance $\sigma_2^2$. (C, F) Reduced input rate, resulting in an increased USP variance $\sigma_3^2$. (G) Synaptic weight changes over 5 s for the three scenarios above. (H) Total synaptic weight change after $t_0 = 5\,\mathrm{s}$ as a function of USP variance. Each data point represents a different pair of $r$ and $\tau_s$. The three scenarios above are marked with their respective colors.
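The link between the rescaling of $\epsilon_0$ and the resulting USP variance can be made explicit with a standard shot-noise argument. This is a sketch under the assumption of an exponential USP kernel $\epsilon(t) = \epsilon_0\, e^{-t/\tau_s}$ (the paper's kernel may differ in shape); for Poisson input at rate $r$, Campbell's theorem gives

\[
\langle x_i^\epsilon \rangle = r \int_0^\infty \epsilon(t)\,\mathrm{d}t = r\,\epsilon_0\,\tau_s,
\qquad
\sigma^2(x_i^\epsilon) = r \int_0^\infty \epsilon(t)^2\,\mathrm{d}t = \tfrac{1}{2}\, r\,\epsilon_0^2\,\tau_s .
\]

Holding the mean $\mu = r\,\epsilon_0\,\tau_s$ fixed while changing $r$ or $\tau_s$ therefore leaves $\sigma^2(x_i^\epsilon) = \mu^2 / (2\, r\, \tau_s)$, which increases when either $\tau_s$ (B, E) or $r$ (C, F) is reduced, consistent with panel H.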

Natural-gradient learning combines multiple forms of plasticity.

Spike trains to the left of the neuron represent afferent inputs to two of the synapses and teacher input to the soma. The two synapses on the right of the dendritic tree receive no stimulus. The teacher is assumed to induce a positive error. (A) The homosynaptic component adapts all stimulated synapses, leaving all unstimulated synapses untouched. (B) The uniform heterosynaptic component changes all synapses in the same manner, only depending on global activity levels. (C) The proportional heterosynaptic component contributes a weight change that is proportional to the current synaptic strength. The magnitude of this weight change is approximately proportional to a product of the current membrane potential above baseline and the weight vector.
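Read side by side, panels A–C describe three additive contributions to the weight update. The following is only a schematic transcription of the caption in assumed notation (error signal $e$, input traces $\mathbf{x}^\epsilon$, membrane potential $V$ with baseline $V_0$, and prefactors $a_{1,2,3}$ that may themselves depend on global activity); it is not the paper's exact rule:

\[
\Delta \mathbf{w} \;\propto\;
\underbrace{a_1\, e\, \mathbf{x}^\epsilon}_{\text{homosynaptic (A)}}
\;+\;
\underbrace{a_2\, e\, \mathbf{1}}_{\text{uniform heterosynaptic (B)}}
\;+\;
\underbrace{a_3\, e\, (V - V_0)\, \mathbf{w}}_{\text{proportional heterosynaptic (C)}} .
\]

The first term only affects synapses with non-vanishing input, the second shifts all synapses equally, and the third scales each weight in proportion to its current strength and the momentary depolarization above baseline.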

Interplay of homo- and heterosynaptic plasticity in natural-gradient learning.

(A) Simulation setup. Five out of 10 inputs received excitatory Poisson input at 5 Hz. In addition, we assumed the presence of tonic inhibition as a balancing mechanism for keeping the neuron’s output within a reasonable regime. Afferent stimulus was paired with teacher spike trains at 20 Hz, and plasticity at both stimulated and unstimulated synapses was evaluated in comparison with their initial weights. For simplicity, initial weights within each group were assumed to be equal. (B) Weight change at stimulated synapses (both homo- and heterosynaptic plasticity are present). These weight changes are independent of the unstimulated weights. Equilibrium (dashed black line) is reached when the neuron’s output matches its teacher and the error vanishes. For increasing stimulated weights, potentiation switches to depression at the equilibrium line. (C) Weight change at unstimulated synapses (only heterosynaptic plasticity is present). For very high activity caused by very large synaptic weights, heterosynaptic plasticity always causes synaptic depression. Otherwise, plasticity at unstimulated synapses behaves exactly opposite to plasticity at stimulated synapses. Increasing the initial stimulated weights results in a change from depression to potentiation at the same point where potentiation turns into depression at stimulated synapses. (D) Direct comparison of plasticity at stimulated and unstimulated synapses. The light green areas ($O_1$, $O_2$) represent opposing signs, the dark green area ($S$) the same sign (more specifically, depression). Their shared equilibrium is marked by the dashed green line and represents the switch from positive to negative error. (E–G) Relative changes of synaptic weights for stimulated and unstimulated synapses during learning, with initial weights picked from the different regimes indicated by the crosses in (B, C, D).

Natural-gradient descent does not depend on chosen parametrization.

Mathematical derivation and phenomenological correlates. EPSPs before learning are drawn as continuous curves, after learning as dashed curves. The light blue arrow represents gradient descent on the error as a function of the somatic EPSP, $C_s[w_s]$ (also shown in light blue). The resulting weight change leads to an increase $\Delta w_s$ in the somatic EPSP after learning. The dark blue arrows track the calculation of the same gradient, but with respect to the dendritic EPSP (also shown in dark blue): (1) taking the attenuation into account in order to compute the error as a function of $w_d$, (2) calculating the gradient, followed by (3) deriving the associated change $\Delta\tilde{w}_s$, again considering attenuation. (A) For Euclidean-gradient descent. (B) For natural-gradient descent. Unlike for Euclidean-gradient descent, the factor $f(w)^2$ is compensated, since its inverse enters via the Fisher information. As a result, both the synaptic weight updates and the associated evolution of a neuron’s output statistics over time are equal under the two parametrizations.
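The compensation can be made explicit in a one-dimensional sketch (our notation, not the paper's full derivation): with $w_s = f(w_d)$, learning rate $\eta$, and Fisher information $G$, the Fisher information transforms with the squared Jacobian, $G_d = f'(w_d)^2\, G_s$, so that

\[
\Delta w_d = -\eta\, G_d^{-1}\,\frac{\partial C}{\partial w_d}
           = -\eta\, \frac{f'(w_d)}{f'(w_d)^2\, G_s}\,\frac{\partial C}{\partial w_s}
\;\;\Rightarrow\;\;
\Delta\tilde{w}_s \approx f'(w_d)\,\Delta w_d = -\eta\, G_s^{-1}\,\frac{\partial C}{\partial w_s} = \Delta w_s .
\]

For the linear attenuation considered here, the Jacobian is simply the constant attenuation factor, so the compensated quantity is its square, as stated above.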

Further convergence analysis of natural-gradient-descent learning.

Unless stated otherwise, all simulation parameters are the same as in Figure 3. (A) In addition to the average learning curves from Figure 3F (solid lines), we show the minimum and maximum values (semi-transparent lines) during learning. (B) Mean Euclidean distance between student and teacher weight vectors. Note that a smaller distance in weights does not imply a smaller $D_{\mathrm{KL}}$, nor a smaller distance in firing rates. This is due to the non-linear relationship between weights and firing rates. (C) Development of the mean Euclidean distance between student and teacher firing rates during learning. (D) Robustness of learning against perturbations of the learning rate. We varied the learning rate for natural-gradient and Euclidean-gradient descent relative to the learning rate $\eta_0$ used in the simulations for Figure 3F (EGD: $\eta_0 = \eta_e = 4.5 \times 10^{-7}$, NGD: $\eta_0 = \eta_n = 6 \times 10^{-4}$), and measured the time until the $D_{\mathrm{KL}}$ first reached a value of $5 \times 10^{-5}$.

Global learning rate scaling $\gamma_s$ as a function of the mean membrane potential.

We sampled the global learning rate factor $\gamma_s$ (blue) for various conditions. In line with Equation 107, $\gamma_s$ is boosted in regions where the transfer function is flat, that is, where $\phi'(V)$ is small. The global scaling factor is additionally increased in regions where the transfer function reaches high absolute values.

Learning rule coefficients can be approximated by simpler quantities.

(A–D) Sampled values of $g_1, \ldots, g_4$ for different afferent input rates. (E–H) In a second simulation, we varied the number $n$ of afferent inputs. (I) Comparison of the sampled values of $g_1$ (blue), as a function of the total input rate $n \cdot r$, with the values of the approximation $g_1 \approx q^{-1}$. (J) Sampled values of $\gamma_u$ (blue) as a function of the approximation $s$ (Equation 111). The proximity of the sampled values to the diagonal indicates that $s$ may indeed serve as an approximation for $\gamma_u$. (K) Sampled values of $\gamma_w$ (blue) as a function of $g_4 V$. The proximity of the sampled values to the diagonal indicates that $g_4 V$ serves as an approximation for $\gamma_w$. (L) Same as (K), but with $g_4$ replaced by a constant $c_w = 0.05$.

Natural-gradient learning can be approximated by a simpler rule in many scenarios.

(A) Mean Fisher angles between true and approximated weight updates (orange) and between natural and Euclidean weight updates (blue), for $n=100$. Results for several input patterns were pooled (group 1 / group 2: 10 Hz/10 Hz, 10 Hz/30 Hz, 10 Hz/50 Hz, 20 Hz/20 Hz, 20 Hz/40 Hz). Initial weights and input spikes were sampled randomly (100 randomly sampled initial weight vectors per input pattern; for each, angles were averaged over 100 input spike train samples per afferent). (B) Same as (A), but with angles measured in the Euclidean metric. (C–F) Comparison of learning curves for the natural gradient (red), the Euclidean gradient (blue) and the approximation (orange) for $n=100$ afferents. Simulations were performed in the setting of Figure 3, under multiple input conditions. (C) Group 1 firing at 10 Hz, group 2 at 30 Hz. (D) Group 1 firing at 10 Hz, group 2 at 50 Hz. (E) Group 1 firing at 20 Hz, group 2 at 20 Hz. (F) Group 1 firing at 20 Hz, group 2 at 40 Hz.
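The 'Fisher angle' in panel A is the angle between two weight-update vectors measured in the metric induced by the Fisher information, rather than in the Euclidean metric of panel B. A minimal sketch of that comparison, with a generic positive-definite matrix standing in for the paper's Fisher matrix:

```python
import numpy as np

def angle(u, v, G=None):
    """Angle (degrees) between update vectors u and v; Euclidean if G is None,
    otherwise measured in the metric induced by a positive-definite matrix G."""
    if G is None:
        G = np.eye(len(u))
    inner = u @ G @ v
    norm_u = np.sqrt(u @ G @ u)
    norm_v = np.sqrt(v @ G @ v)
    return np.degrees(np.arccos(np.clip(inner / (norm_u * norm_v), -1.0, 1.0)))

# Toy example: a diagonal stand-in metric and two update directions.
rng = np.random.default_rng(1)
G = np.diag(rng.uniform(0.1, 10.0, 5))          # stand-in for the Fisher matrix
du_nat, du_approx = rng.standard_normal(5), rng.standard_normal(5)
print("Euclidean angle:", angle(du_nat, du_approx))
print("Fisher angle:   ", angle(du_nat, du_approx, G))
```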

Tables

Table 1
Learning rates.
$r_1$      $r_2$      $\eta_n$      $\eta_a$      $\eta_e$
10 Hz      30 Hz      0.000655      0.00055       0.00000110
10 Hz      50 Hz      0.000600      0.00045       0.00000045
20 Hz      20 Hz      0.000650      0.00053       0.00000118
20 Hz      40 Hz      0.000580      0.00045       0.00000055

Cite this article

Elena Kreutzer, Walter Senn, Mihai A Petrovici (2022) Natural-gradient learning for spiking neurons. eLife 11:e66526. https://doi.org/10.7554/eLife.66526