A neuronal least-action principle for real-time learning in cortical circuits

  1. Walter Senn  Is a corresponding author
  2. Dominik Dold
  3. Akos F Kungl
  4. Benjamin Ellenberger
  5. Jakob Jordan
  6. Yoshua Bengio
  7. João Sacramento
  8. Mihai A Petrovici
  1. Department of Physiology, University of Bern, Switzerland
  2. Kirchhoff-Institute for Physics, Heidelberg University, Germany
  3. European Space Research and Technology Centre, European Space Agency, Netherlands
  4. Insel Data Science Center, University Hospital Bern, Switzerland
  5. Electrical Engineering, Yale University, United States
  6. MILA, University of Montreal, Canada
  7. Department of Computer Science, ETH Zurich, Switzerland

eLife assessment

This manuscript describes a potentially important theoretical framework to link predictive coding, error-based learning, and neuronal dynamics. The provided evidence is solid, but some details would benefit from additional clarification. The exposition of the manuscript is targeted for a specialist audience.

https://doi.org/10.7554/eLife.89674.3.sa0

Abstract

One of the most fundamental laws of physics is the principle of least action. Motivated by its predictive power, we introduce a neuronal least-action principle for cortical processing of sensory streams to produce appropriate behavioral outputs in real time. The principle postulates that the voltage dynamics of cortical pyramidal neurons prospectively minimizes the local somato-dendritic mismatch error within individual neurons. For output neurons, the principle implies minimizing an instantaneous behavioral error. For deep network neurons, it implies the prospective firing to overcome integration delays and correct for possible output errors right in time. The neuron-specific errors are extracted in the apical dendrites of pyramidal neurons through a cortical microcircuit that tries to explain away the feedback from the periphery, and correct the trajectory on the fly. Any motor output is in a moving equilibrium with the sensory input and the motor feedback during the ongoing sensory-motor transform. Online synaptic plasticity reduces the somatodendritic mismatch error within each cortical neuron and performs gradient descent on the output cost at any moment in time. The neuronal least-action principle offers an axiomatic framework to derive local neuronal and synaptic laws for global real-time computation and learning in the brain.

Introduction

Wigner’s remark about the ‘unreasonable effectiveness’ of mathematics in allowing us to understand physical phenomena Wigner, 1960 is famously contrasted by Gelfand’s quip about its ‘unreasonable ineffectiveness’ in doing the same for biology (Borovik, 2021). Considering the component of randomness that is inherent to evolution, this may not be all that surprising. However, while this argument holds just as well for the brain at the cellular level, ultimately brains are computing devices. At the level of computation, machine learning, and neuroscience have revealed near-optimal strategies for information processing and storage, and evolution is likely to have found similar principles through trial and error (Hassabis et al., 2017). Thus, we have reason to hope for the existence of fundamental principles of cortical computation that are similar to those we have found in the physical sciences. Eventually, it is important for such approaches to relate these principles back to brain phenomenology and connect function to structure and dynamics.

In physics, a fundamental measure of ‘effort’ is the action of a system, which nature seeks to ‘minimize.’ Given an appropriate description of interactions between the system’s constituents, the least-action principle can be used to derive the equations of motion of any physical system (Feynman et al., 2011; Coopersmith, 2017). Here, we suggest that in biological information processing, a similar principle holds for prediction errors, which are of obvious relevance for cognition and behavior.

Based on such errors, we formulate a neuronal least-action (NLA) principle which can be used to derive neuronal dynamics and map them to observed dendritic morphologies and cortical microcircuits. Within this framework, local synaptic plasticity at basal and apical dendrites can be derived by stochastic gradient descent on errors. The errors that are minimized refer to the errors in output neurons that are typically thought to represent motor trajectories, planned and encoded in cortical motor areas and ultimately in the spinal cord and muscles. In the context of motor control, a phenomenological ‘minimal action principle’ has previously been proposed that guides the planning and execution of movements (Feldman and Levin, 2009). Our neuronal least-action principle reformulates and formalizes the classical equilibrium point hypothesis (Latash, 2010) in a dynamical setting, linking it to optimality principles in sensory-motor control (Todorov, 2004).

Other attempts exist to link biological information processing and neural networks with the least-action principle, for instance by directly learning to reproduce a given trajectory (Amirikian and Lukashin, 1992), by minimizing the physical action for the muscle force generation by motor unit recruitment (Senn et al., 1995), minimizing cognitive prediction errors (Alonso et al., 2012), minimizing output errors with a weight-change regularization (Betti and Gori, 2016), minimizing psychomotor work (Fox and Kotelba, 2018), minimizing data transport through a network (Karkar et al., 2021), minimizing the discrimination information (Summers, 2021), or minimizing the free energy (Friston, 2010; Friston et al., 2022). Apart from the latter, however, these attempts remain far from the biology that seems to resist a formalization with the tool of physics – at least, when applied too strictly.

The fundamental novelty of our NLA principle is the way it deals with time. In physics, bodies interact based on where they are now, irrespective of what happens in the future. Living systems, instead, interact based on what could happen in the near future, and react early to stay alive. This difference is also mirrored in the way our NLA principle looks for an error-minimizing trajectory of brain states. We postulate that the brain trades with near-future states and seeks for a path that minimizes errors of these future states at any moment in time. Looking ahead towards what will likely happen allows the network for correcting the internal trajectory of deep neurons early enough so that the delayed output moves along the desired path. The notion of looking into the future to gate a dynamical system is also central in optimal control theory (as expressed by the Bellman equation, see e.g. Todorov, 2006). Yet, starting with a neuronal action is more principled as it includes the derivation of the dynamical system itself that will be optimally controlled.

The insight into the time structure of biological information processing allows us to express a simple form of a total ‘mismatch energy’ for our cortical neuronal networks, from which we derive the dynamic neuronal and synaptic laws.In short, the mismatch energy within a single pyramidal neuron is the squared prediction error between basal dendrites and the soma, together with the apical dendrites receiving a top-down feedback. The apical dendrites calculate a local prospective prediction error that looks ahead in time and overcomes neuronal integration delays (Figure 1a). As a consequence, the output neurons are corrected on the fly by the prospective error processing, pushing them in real time closer to the desired path. In addition, the prospective errors are suited for gradient learning of the sensory synapses on the basal dendrites. This gradient learning is proven to reduce the error in the output neurons at any moment in time.

Somato-dendritic mismatch energies and the neuronal least-action (NLA) principle.

(a1) Sketch of a cross-cortical network of pyramidal neurons described by NLA. (a2) Correspondence between elements of NLA and biological observables such as membrane voltages and synaptic weights. (b1) The NLA principle postulates that small variations δ𝒖~ (dashed) of the trajectories 𝒖~ (solid) leave the action invariant, δA=0. It is formulated in the look-ahead coordinates 𝒖~ (symbolized by the spyglass) in which `hills' of the Lagrangian (shaded gray zones) are foreseen by the prospective voltage so that the trajectory can turn by early enough to surround them. (b2) In the absence of output nudging (β=0), the trajectory 𝒖(t) is solely driven by the sensory input, and prediction errors and energies vanish (L=0, outer blue trajectory at bottom). When nudging the output neurons towards a target voltage (β>0), somatodendritic prediction errors appear, the energy increases (red dashed arrows symbolising the growing ‘volcano’) and the trajectory 𝒖(t) moves out of the L=0 hyperplanes, riding on top of the `volcano' (red trajectory). Synaptic plasticity W˙ reduces the somatodendritic mismatch along the trajectory by optimally ‘shoveling down the volcano’ (blue dashed arrows) while the trajectory settles in a new place on the L=0 hyperplane (inner blue trajectory at bottom).

The NLA principle builds on and integrates various ingredients from existing work and theories. Output neurons, be they motor neurons or decision-making neurons, are postulated to be ‘nudged’ towards the desired target time course by additional synaptic input to the soma or the proximal apical dendrite, as described by Urbanczik and Senn, 2014. The cortical microcircuit with lateral ‘inhibition’ that seeks to cancel the top-down feedback in order to extract the apical error is inspired by Sacramento et al., 2018 and Haider et al., 2021. The energy-based approach for describing error-backpropagation for weak nudging is borrowed from the Equilibrium Propagation algorithm (Scellier and Bengio, 2017) that we generalize from a steady-state algorithm to real-time computation in cross-cortical microcircuits. Our theory covers both cases of weak and strong output nudging. For strong nudging, it likewise generalizes the least-control principle (Meulemans et al., 2022) and the prospective configuration algorithm (Song et al., 2024) from a steady-state to a dynamic real-time version, linking to optimal feedback control (Todorov and Jordan, 2002). Finally, the apical activity of our pyramidal neurons can be seen in the tradition of predictive coding (Rao and Ballard, 1999), where cortical feedback connections try to explain away lower-level activities. Yet, different from classical predictive coding, our prediction errors are integrated with the soma, and these errors are prospective in time. The errors extrapolate from current to future activities, so that their integration improves the network output in real time. The combination of an energy-based model with prospective coding in which neuronal integration delays are compensated on the fly enters also in Haider et al., 2021.

The paper is organized as follows: we first define the prospective somatodendritic mismatch error, construct out of this the mismatch energy of a network, and ‘minimize’ this energy to obtain the error-corrected, prospective voltage dynamics of the network neurons. We then show that the prospective error coding leads to an instantaneous and joint processing of low-pass filtered input signals and backpropagated errors. Applied to motor control, the instantaneous processing is interpreted as a moving equilibrium hypothesis according to which sensory inputs, network state, motor commands, and muscle feedback are in a self-consistent equilibrium at any point of the movement. We then derive a local learning rule that globally minimizes the somato-dendritic mismatch errors across the network, and show how this learning can be implemented through error-extracting cortical microcircuits and dendritic predictive plasticity.

Results

Somato-dendritic mismatch errors and the Lagrangian of cortical circuits

We consider a network of neurons – identified as pyramidal cells – with firing rates ri(t) in continuous time t. The somatic voltage ui of pyramidal neuron i is driven by the close-by basal input current, jWijrj, with presynaptic rates rj and synaptic weights Wij, and an additional distal apical input ei that will be learned to represent a prospective prediction error at any moment in time (Figure 1a). While in classical rate-based neuron models the firing rate ri of a neuron is a function of the somatic voltage, ρ(ui), the NLA principle implies that the effective firing rate of a cortical neuron is prospective. More concretely, the formalism derives a firing rate that linearly extrapolates from ρ(ui) into the future with the temporal derivative, ri=ρ(ui)+τρ˙(ui), where ρ˙(ui) represents the temporal derivative of ρ(ui(t)). There is experimental evidence for such prospective coding in cortical pyramidal neurons where the instantaneous rate ri is in fact not only a function of the underlying voltage, but also a function of how quickly that voltage increases (see Figure 2a).

Prospective coding in cortical pyramidal neurons enables instantaneous voltage-to-voltage transfer.

(a1) The instantaneous spike rate of cortical pyramidal neurons (top) in response to sinusoidally modulated noisy input current (bottom) is phase-advanced with respect to the input adapted from Köndgen et al., 2008. (a2) Similiarly, in neuronal least-action (NLA), the instantaneous firing rate of a model neuron (r=ρ(u)+τρ˙(u), black) is phase-advanced with respect to the underlying voltage (u, red, postulating that the low-pass filtered rate is a function of the voltage, r=ρ(u)). (b) Dendritic input in the apical tree (here called e) is instantaneously causing a somatic voltage modulation (u, modeling data from Ulrich, 2002). The low-pass filtering with τ along the dendritic shaft is compensated by a lookahead mechanism in the dendrite (e=e+τe˙). In (Ulrich, 2002) a phase advance is observed even with respect to the dendritic input current, not only the dendritic voltage, although only for slow modulations (as here). (c) While the voltage of the first neuron (u1) integrates the input rates rin from the past (bottom black upward arrows), the output rate r1 of that first neuron looks ahead in time, r1=ρ(u1)+τρ˙(u1) (red dashed arrows pointing into the future). The voltage of the second neuron (u2) integrates the prospective rates r1 (top black upwards arrows). By doing so, it inverts the lookahead operation, resulting in an instantaneous transfer from u1(t) to u2(t) (blue arrow and circles).

The second central notion of the theory is the prospective error ei, that we interpret as prospective somato-dendritic mismatch error in the individual network neurons, ei=(ui+τu˙i)jWijrj . It is defined as a mismatch between the prospective voltage, ui+τu˙i, and the weighted prospective input rates, jWijrj. In the same way, as the firing rates rj linearly extrapolate into the future given the current-voltages uj of the presynaptic neurons j, the postsynaptic error is based on the linear extrapolation of its current voltage ui using its temporal derivative, ui+τu˙i . If the prospective error ei is low-pass filtered with time constant τ, it takes the form ei=uijWijrj, where rj is the corresponding low-pass filtered firing rate of the presynaptic neuron j (that becomes a function of the presynaptic voltage, rj=ρ(uj) , see Methods, Sect. Euler-Lagrange equations as inverse low-pass filters). We refer to ei as a somato-dendritic mismatch error of neuron that, as compared to ei, is non-prospective and instantaneous.

We next interpret the mismatch error ei in terms of the morphology and biophysics of pyramidal neurons with basal and apical dendrites. While the error ei is formed in the apical dendrite, this error is low-pass filtered and added to the somatic voltage ui, that is also driven by the low-pass filtered basal input jWijrj, so that ui=jWijrj+ei. From the perspective of the basal dendrites, the low-pass filtered apical error ei can be calculated as the difference between the somatic voltage and the own local low-pass filtered input, ei=uijWijrj. The somatic voltage ui is assumed to be sampled in the basal dendrite by the backpropagating acting potentials (Urbanczik and Senn, 2014; Spicher et al., 2017). The apical error now appears as a ‘somato-basal’ mismatch error, that both are summarized as a somato-dendritic mismatch error. It tells the difference between ‘what a neuron does,’ which is based on the somatic voltage ui, and ‘what the basal inputs think it should do,’ which is based on its own input Wijrj (Figure 1a2). The two quantities may deviate because neuron i get additional ‘unpredicted’ apical inputs from higher-area neurons that integrate with the somatic voltage ui. What cannot be predicted in ui by the sensory-driven basal input remains as somato-basal (somato-dendritic) mismatch error ei.

Associated with this mismatch error is the somatodendritic mismatch energy defined for each network neuron iN as the squared mismatch error,

(1) EiM=12e¯i2=12(uijWijr¯j)2.

On a subset of output neurons of the whole network, ON, a cost is defined as a function of the somatic voltage and some instructive reference signal such as targets or a reward. When a target trajectory uo*(t) is available, the cost is defined at each time point as a squared target error,

(2) Co=12(e¯o)2=12(uouo)2

Much more general mismatch energies and cost functions are conceivable, for instance, errors of the form ei=uifi(𝒖,t) for general functions fi of the voltage vector 𝒖 and of time, encompassing conductance-based neurons, but also further dynamic variables can be included such as threshold adaptation (see Appendix 6). The cost represents a performance measure for the entire network that produces the output voltages uo(t) in response to some input rates 𝒓in(t). The cost directly relates to behavioral or cognitive measures such as the ability of an animal or human to perform a particular task in real time. The target could be provided by explicit external supervision, for example, target movements in time encoded by uo*(t), it could represent an expected reward signal, or it could arise via self-supervision from other internal prediction errors.

We define the Lagrangian (or ‘total energy’) of the network as a sum across all mismatch energies and costs, weighted by the nudging strength β of the output neurons,

(3) L=iNEiM+βoOCo=12iN(uijWijr¯j)2+β2oO(uouo)2.

The low-pass filtered presynaptic rates, rj, also encompass the external input neurons. While in classical energy-based approaches, L is called the total energy, we call it the ‘Lagrangian’ because it will be integrated along real and virtual voltage trajectories as done in variational calculus (leading to the Euler-Lagrange equations, see below and Appendix 6). We ‘prospectively’ minimize L locally across a voltage trajectory, so that, as a consequence, the local synaptic plasticity for Wij will globally reduce the cost along the trajectory (Theorem 1 below).

Due to the prospective coding, the Lagrangian can be minimal at any moment in time while the network dynamics evolve. This is different from the classical predictive coding (Rao and Ballard, 1999) and energy-based approaches (Scellier and Bengio, 2017; Song et al., 2024), where a stimulus needs to be fixed in time while the network relaxes to a steady state, and only there the prediction error is minimized (see Appendix 3).

The least-action principle expressed for prospective firing rates

Motivated by the prospective firing in pyramidal neurons, we postulate that cortical networks strive to look into the future to prevent instantaneous errors. Each neuron tries to move along a trajectory that minimizes its own mismatch error ei across time (Figure 1b). The ‘neuronal currency’ with which each neuron ‘trades’ with others to choose its own error-minimizing trajectory is the future discounted membrane potential,

(4) u~(t)=1τtu(t)ettτdt.

The prospective voltages u~ are the ‘canonical coordinates’ entering the NLA principle, and in these prospective coordinates the overall network searches for a ‘least-action trajectory’. Since from u~ we can recover the instantaneous voltage via u=u~τu~˙ (see Appendix 2), we can replace u in the Lagrangian and obtain L as a function of our new prospective coordinates u~ and the ‘velocities’ u~˙, i.e.,L=L[𝒖~,𝒖~˙], where bold fonts represent vectors. Inspired by the least-action principle from physics, we define the neuronal action A as a time-integral of the Lagrangian,

(5) A=t1t2L[𝒖~(t),𝒖~˙(t)]dt.

The NLA principle postulates that the trajectory 𝒖~(t) keeps the action A stationary with respect to small variations δ𝒖~ (Figure 1b1). In other words, nature chooses a trajectory such that, when deviating a little bit from it, say by δ𝒖~, the value of A will not change (or at most up to second order in the variation), formally δA=0. The motivation to search for a trajectory that keeps the action stationary is borrowed from physics. The motivation to search for a stationary trajectory by varying the near-future voltages 𝒖~, instead of 𝒖, is assigned to the evolutionary pressure in biology to ‘think ahead of time.’ To not react too late, internal delays involved in the integration of external feedback need to be considered and eventually need to be overcome. In fact, only for the ‘prospective coordinates’ defined by looking ahead into the future, even when only virtually, will a real-time learning from feedback errors become possible (as expressed by our Theorems below).

The equations of motion that keep the action stationary with respect to these prospective coordinates are known to satisfy the Euler-Lagrange equations.

(6) Lu~iddtLu~˙i=0.

Applying these equations to our Lagrangian yields a prospective version of the classical leaky integrator voltage dynamics, with rates 𝒓 and errors 𝒆 that are looking into the future (Methods, Sects. Euler-Lagrange equations as inverse low-pass filters, Deriving the network dynamics from the Euler-Lagrange equations),

(7a) τu˙=u+Wr+e,
(7b) e¯=r¯netWnetTe¯+βe¯.

The ‘’ denotes the component-wise product, and the weight matrix splits into weights from input neurons and weights from network neurons, 𝑾=(𝑾in,𝑾net). While for output neurons a target error can be defined, eo*=uo*uo, for non-output neurons i no target exists and we hence set ei*=0. In a control theoretic framework, the neuronal dynamics (Equation 7a) represent the state trajectory, and the adjoint error dynamics Equation 7b represent the integrated costate trajectory (Todorov, 2006).

From the point of view of theoretical physics, where the laws of motion derived from the least-action principle contain an acceleration term (as in Newton’s law of motion, like mx¨=x+F for a harmonic oscillator), one may wonder why no second-order time derivative appears in the NLA dynamics. As an intuitive example, consider driving into a bend. Looking ahead in time helps us to reduce the lateral acceleration by braking early enough, as opposed to braking only when the lateral acceleration is already present. This intuition is captured by minimizing the neuronal action A with respect to the discounted future voltages u~i instead of the instantaneous voltages ui. Keeping up an internal equilibrium in the presence of a changing environment requires looking ahead and compensating early for the predicted perturbations. Technically, the acceleration disappears because the Euler-Lagrange operator (Equation 6) turns into a lookahead-gradient operator, u~iddtu~˙i=(1+ddt)ui, since the u~¨i is absorbed via u~˙iτu~¨i=u˙i (see Methods, Sect. Euler-Lagrange equations as inverse low-pass filters, and Appendix 6 for the link to the least-action principle in physics).

Mathematically, the voltage dynamics in Equation 7a specifies an implicit differential equation since 𝒖˙(t) also appears on the right-hand side. This is because the prospective rates 𝒓=ρ(𝒖)+τρ˙(𝒖) include 𝒖˙ through ρ˙(u)=ρ(u)u˙. Likewise, the prospective errors 𝒆=𝒆+τ𝒆˙, with 𝒆 given in Equation 7b and plugged into Equation 7a, imply 𝒖˙ through e¯˙(u)=e¯(u)u˙. Nevertheless, the voltage dynamics can be stably run by replacing 𝒖˙(t) on the right-hand side of Equation 7a with the temporal derivative 𝒖˙(tdt) from the previous time step (technically, the Hessian (1𝑾𝝆𝒆) is required to be strictly positive definite, see Methods Sect. From implicit to explicit differential equations and Appendix 3). This ensures that the voltage dynamics of Equation 7a, Equation 7b can be implemented in cortical neurons with a prospective firing and a prospective dendritic error (see Figure 2).

The error expression in Equation 7b is reminiscent of error backpropagation Rumelhart et al., 1986 and can in fact be related (Methods, Sect. Deriving the error backpropagation formula). Formally, the errors are backpropagated via transposed network matrix, 𝑾netT, modulated by ri, the derivative of ri=ρ(ui) with respect to the underlying voltage. While the transpose can be constructed with various local methods see Akrout et al., 2019; Max et al., 2022 in our simulations we mainly adhere to the phenomenon of feedback alignment (Lillicrap et al., 2016) and consider fixed and randomized feedback weights 𝑩 (unless stated differently). Recent control theoretical work is exploiting the same prospective coding technique as expressed in Equation 7a, Equation 7b to tackle general time-varying optimization problems see Simonetto et al., 2020 for a review and Appendix 3 for the detailed connection.

Prospective coding in neurons and instantaneous propagation

The prospective rates and errors entering via 𝒓 and 𝒆 in the NLA (Equation 7a) are consistent with the prospective coding observed in cortical pyramidal neurons in vitro (Köndgen et al., 2008). Upon sinusoidal current injection into the soma, the somatic firing rate is advanced with respect to its voltage (Figure 2a), effectively compensating for the delay caused by the current integration. Likewise, sinusoidal current injection in the apical tree causes a lag-less voltage response in the soma (Figure 2b, Ulrich, 2002). While the rates and errors in general can be reconstructed from their low-pass filterings via 𝒓=𝒓+τ𝒓˙ and 𝒆=𝒆+τ𝒆˙, they become prospective in time because 𝒓 and 𝒆 are themselves instantaneous functions of the voltage 𝒖, and hence 𝒓 and 𝒆 depend on 𝒖˙. The derivative of the membrane potential implicitly also appears in the firing mechanism of Hodgkin-Huxley-type conductances, with a quick depolarization leading to a stronger sodium influx due to the dynamics of the gating variables (Hodgkin and Huxley, 1952). This advances the action potential as compared to a firing that would only depend on 𝒖, not 𝒖˙, giving an intuition of how such a prospective coding may arise. A similar prospective coding has been observed for retinal ganglion cells (Palmer et al., 2015) and cerebellar Purkinje cells (Ostojic et al., 2015), making a link from the visual input to the motor control.

To understand the instantaneous propagation through the network, we low-pass filter the dynamic equation 𝒖+τ𝒖˙=𝑾𝒓+𝒆 (obtained by rearranging Equation 7a), with 𝒆 given by Equation 7b, to obtain the somatic voltage 𝒖=𝑾𝒓(𝒖)+𝒆(𝒖). At any point in time, the voltage is in a moving equilibrium between forward and backpropagating inputs. Independently of the network architecture, whether recurrent or not, the output is an instantaneous function of the low-pass filtered input and a putative correction towards the target, 𝒖𝒐(t)=𝑭W(𝒓in(t),𝒆𝒐*(t)), see Figure 2C and Methods, Sect. Proving theorem 1 (rt-DeEP). The mapping again expresses an instantaneous propagation of voltages throughout the network in response to both, the low-pass filtered input 𝒓in and feedback error 𝒆𝒐*. This instantaneity is independent of the network size, and in a feed-forward network is independent of its depths (see also Haider et al., 2021, where the instantaneity is on the rates, not the voltages). In the absence of the look-ahead activity, each additional layer would slow down the network relaxation time.

Notice that an algorithmic implementation of the time-continuous dynamics of a N-layer feedforward network would still need N calculation steps until information from layer 1 reaches layer N. However, this does not imply that an analog implementation of the prospective dynamics will encounter delays. To see why, consider a finite step-change Δ𝒖1 in the voltage of layer 1. In the absence of the look-ahead, Δ𝒖1 was mapped within the infinitesimal time interval dt to an infinitesimal change d𝒖2 in the voltages of layer 2. But with a prospective firing rate, r1=ρ(u1)+τρ(u1)u˙1, a step-change Δ𝒖1 translates to a delta-function in 𝒓1, this in turn to a step-change in the low-pass filtered rates Δ𝒓1, and therefore within dt to a step-change Δ𝒖2 in the voltages 𝒖2 of the postsynaptic neurons (Figure 2c). Iterating this argument, a step-change Δ𝒖1 propagates ‘instantaneously’ through N layers within the ‘infinitesimal’ time interval Ndt to a step-change Δ𝒖N in the last layer. When run in a biophysical device in continuous time that exactly implements the dynamical Equation 7a, the implementation becomes an instantaneous computation (since dt0). Yet, in a biophysical device, information has to be moved across space. This typically introduces further propagation delays that may not be captured in our formalism where low-pass filtering and prospective coding cancel each other exactly. Nevertheless, analog computation in continuous time, as formalized here, offers an idea to ‘instantaneously’ realize an otherwise time-consuming numerical recipe run on time-discrete computing systems that operate with a finite clock cycle.

Prospective control and the moving equilibrium hypothesis

Crucially, at the level of the voltage dynamics (Equation 7a) the correction is based on the prospective error 𝒆. This links our framework to optimal control theory and motor control where delays are also taken into account, so that a movement can be corrected early enough (Wolpert and Ghahramani, 2000; Todorov and Jordan, 2002; Todorov, 2004). The link between energy-based models and optimal control was recently drawn for strong nudging (β) to learn individual equilibrium states (Meulemans et al., 2022). Our prospective error 𝒆(t) appears as a ‘controller’ that, when looking at the output neurons, pushes the voltage trajectories toward the target trajectories. Depending on the nudging strength β, the control is tighter or weaker. For infinitely large β, the voltages of the output neurons are clamped to the time-dependent target voltages, uo=uo* (implying eo*=0), while their errors, eo=uo(𝑾𝒓)o, instantaneously correct all network neurons. For small β, the output voltages are only weakly controlled, and they are dominated by the forward input, uo(W𝒓)o.

To show how the NLA principle with the prospective coding globally maps to cortico-spinal circuits we consider the example of motor control. In the context of motor control, our network mapping 𝒖𝒐=𝑭W(𝒓in,𝒆𝒐*) can be seen as a forward internal model that quickly calculates an estimate of the future muscle length 𝒖𝒐 based on some motor plans, sensory inputs, and the current proprioceptive feedback (Figure 3a). Forward models help to overcome delays in the execution of the motor plan by predicting the outcome, so that the intended motor plans and commands can be corrected on the fly (Kawato, 1999; Wolpert and Ghahramani, 2000).

Moving equilibrium hypothesis for motor control and real-time learning of cortical activity.

(a) A voluntary movement trajectory can be specified by the target length of the muscles in time, 𝒖𝒐*, encoded through the γ-innervation of muscle spindles, and the deviation of the effective muscle lengths from the target, 𝒖𝒐𝒖𝒐*=𝒆𝒐*. The Ia-afferents emerging from the spindles prospectively encode the error, so that their low-pass filtering is roughly proportional to the length deviation, truncated at zero (red). The moving equilibrium hypothesis states that the low-pass filtered input 𝒓in, composed of the movement plan 𝒓inplan and the sensory input (here encoding the state of the plant e.g., through visual and proprioceptive input, 𝒓invis and 𝒓inprop), together with the low-pass filtered error feedback from the spindles, 𝒆𝒐*, instantaneously generate the muscle lengths, 𝒖𝒐=𝑭W(𝒓in,𝒆𝒐*), and are thus at any point in time in an instantaneous equilibrium (defined by Equation 7a, Equation 7b). (b1) Intracortical intracortical electroencephalogram (iEEG) activity recorded from 56 deep electrodes and projected to the brain surface. Red nodes symbolize the 56 iEEG recording sites modeled alternately as input or output neurons, and blue nodes symbolize the 40 ‘hidden’ neurons for which no data is available, but used to reproduce the iEEG activity. (b2) Corresponding NLA network. During training, the voltages of the output neurons were nudged by the iEEG targets (black input arrows, but for all red output neurons). During testing, nudging was removed for 14 out of these 56 neurons (here, represented by neurons 1, 2, 3). (c1) Voltage traces for the 3 example neurons in a2, before (blue) and after (red) training, overlaid with their iEEG target traces (gray). (c2) Total cost, integrated over a window of 8 s of the 56 output nodes during training with sequences of the same duration. The cost for the test sequences was evaluated on a 8 s window not used during training.

The observation that muscle spindles prospectively encode the muscle length and velocity (Dimitriou and Edin, 2010) suggests that the prospective coding in the internal forward model mirrors the prospective coding in the effective forward pathway. This forward pathway leads from the motor plan to the spindle feedback, integrating also cerebellar and brainstem feedback (Kawato, 1999). Based on the motor plans, the intended spindle lengths and the effective muscle innervation are communicated via a descending pathway to activate the γ- and α-motoneurons, respectively (Li et al., 2015). The mapping from the intended arm trajectory to the intended spindle lengths via γ-innervation is mainly determined by the joint geometry. The mapping from the intended arm trajectory to the force-generating α-innervation, however, needs to also take account of the internal and external forces, and this is engaging our network 𝑾.

When we prepare an arm movement, spindles in antagonistic muscle pairs that measure the muscle length are tightened or relaxed before the movement starts (Papaioannou and Dimitriou, 2021). According to the classical equilibrium-point hypothesis (Feldman and Levin, 2009; Latash, 2010), top-down input adjusts the activation threshold of the spindles through (γ-) innervation from the spinal cord so that slight deviations from the equilibrium position can be signaled (Figure 3a). We postulate that this γ-innervation acts also during the movement, setting an instantaneous target 𝒖𝒐*(t) for the spindle lengths. The effective lengths of the muscle spindles is 𝒖𝒐, and the spindles are prospectively signaling back the deviation from the target through the Ia-afferents (Dimitriou and Edin, 2010; Dimitriou, 2022). The low-pass filtered Ia-afferents may be approximated by a threshold-nonlinearity, Ia=β𝒖𝒐𝒖𝒐*+, with β being interpreted as spindle gain (Latash, 2018). Combining the feedback from agonistic and antagonistic muscle pairs allows for extracting the scaled target error β𝒆𝒐*=β(𝒖𝒐*𝒖𝒐). Taking account of the prospective feedback, we postulate the moving equilibrium hypothesis according to which the instructional inputs, 𝒓in, the spindle feedback, β𝒆𝒐*, and the muscle lengths, 𝒖𝒐, are at any point of the movement in a dynamic equilibrium. The moving equilibrium hypothesis extends the classical equilibrium-point hypothesis from the spatial to the temporal domain (for a formal definition of a moving equilibrium see Methods, Sect. From implicit to explicit differential equations).

Prediction errors are also reduced when motor units within a muscle are recruited according to the size principle (Senn et al., 1997), which itself was interpreted in terms of the physical least-action principle (Senn et al., 1995). With regard to the interpretation of the prospective feedback error 𝒆𝒐* as spindle activity, it is worth noticing that in humans the spindle activity is not only ahead of the muscle activation (Dimitriou and Edin, 2010), but also share the property of a motor error (Dimitriou, 2016). The experiments show that during the learning of a gated hand movement, spindle activity is initially stronger when making movement errors, and it returns back to baseline with the success of learning. This observation is consistent with the NLA principle, saying that the proprioceptive prediction errors are minimized through the movement learning. We next address how the synaptic strengths 𝑾 involved in producing the muscle length can be optimally adapted to capture this learning.

Local plasticity at basal synapses minimizes the global cost in real time

The general learning paradigm starts with input time series rin(t),i and target time series uo*(t), while assuming that the target series are an instantaneous function of the low-pass filtered input series, 𝒖𝒐*(t)=𝑭*(𝒓in(t)). The low-pass filtering in the individual inputs could be with respect to any time constant τin,i* (that may also be learned, see Appendix 2). Yet, for simplicity, we assume the same time constant τ for low-pass filtering the rates of the network neurons and input neurons. The goal of learning is to adapt the synaptic strengths 𝑾 in the student network so that this moves towards the target mapping, 𝑭W𝑭*. The local synaptic plasticity will also reduce the global cost C defined on the output neurons o in terms of the deviation of the voltage from the target, uo*uo (Equation 2).

The problem of changing synaptic weights to correct the behavior of downstream neurons, potentially multiple synapses away, is typically referred to as the credit assignment problem and is notoriously challenging in physical or biological substrates operating in continuous time. A core aspect of the NLA principle is how it relates the global cost C to the Lagrangian L and eventually to somato-dendritic prediction errors 𝒆 that can be reduced through local synaptic plasticity 𝑾˙. We define this synaptic plasticity as a partial derivative of the Lagrangian with respect to the weights, 𝑾˙L𝑾=𝒆𝒓T. Since the somatodendritic mismatch error is 𝒆=𝒖𝑾𝒓, this leads to the local learning rule of the form ‘postsynaptic error times low-pass filtered presynaptic rate’,

(8) W˙=η(uWr¯)r¯T.

The plasticity rule runs simultaneously to the neuronal dynamics in the presence of a given nudging strength β that tells how strongly the voltage of an output neuron is pushed towards the target, uouo*. The learning rule is local in space since 𝑾𝒓 is represented as a voltage of the basal dendrites, and the somatic voltage 𝒖 may be read out at the synaptic site on the basal dendrite from the backpropagating action potentials that sample 𝒖 at a given time (Urbanczik and Senn, 2014). The basal voltage 𝑾𝒓 becomes the dendritic prediction of the somatic activity 𝒖, interpreting Equation 8 as ‘dendritic predictive plasticity’.

We have derived the neuronal dynamics as a path that keeps the action stationary. Without an external teaching signal, the errors vanish, and the voltage trajectory wriggles on the bottom of the energy landscape (L=0, Figure 1b2). If the external nudging is turned on, β>0, errors emerge and hills grow out of the landscape. The trajectory still tries to locally minimize the action, but it is lifted upwards on the hills (L>0, Figure 1b2). Synaptic plasticity reshapes the landscape so that, while keeping β fixed, the errors are reduced and the landscape again flattens. The transformed trajectory settles anew in another place (inside the ‘volcano’ in 1b2). Formally, the local plasticity rule (Equation 8) is shown to perform gradient descent on the Lagrangian and hence on the action. In the energy landscape picture, plasticity ‘shovels off’ energy along the voltage path so that this is lowered most efficiently. The error that is back-propagated through the network tells at any point on the voltage trajectory how much to ‘dig’ in each direction, i.e., how to adapt the basal input in each neuron in order to optimally lower the local error.

The following theorem tells that synaptic plasticity 𝑾˙ pushes the network mapping 𝒖𝒐=𝑭W(𝒓in) towards the target mapping 𝒖𝒐*=𝑭*(𝒓in) at any moment in time. The convergence of the mapping is a consequence of the fact the plasticity reduces the Lagrangian L=EM+βC along its gradient.

Theorem 1 (real-time dendritic error propagation, rt-DeEP)

Consider an arbitrary network 𝑾 with voltage and error dynamics following Equation 7a, Equation 7b. Then the local plasticity rule 𝑾˙𝒆𝒓T Equation 8, acting at each moment along the voltage trajectories, is gradient descent

  • (i) on the Lagrangian L for any nudging strength β0, i.e., 𝒆𝒓T=dLd𝑾, with limβ𝒆𝒓T=dEMd𝑾𝑾˙.

  • (ii) on the cost C for small nudging, β0, while up-scaling the error to 1β𝒆, i.e., limβ01β𝒆𝒓T=dCd𝑾𝑾˙.

The gradient statements hold at any point in time (long enough after initialization), even if the input trajectories 𝒓in(t) contain delta functions and the target trajectories 𝒖𝒐*(t) contain step functions.

Loosely speaking, the NLA enables the network to localize in space and time an otherwise global problem: what is good for a single neuron (the local plasticity) becomes good for the entire network (the gradient on the global cost). Learning is possible at any point in time along the trajectory because the NLA inferred a prospective voltage dynamics expressed in prospective firing rates ri and prospective errors ei of the network neurons. In the limit of strong nudging (β), the learning rule performs gradient descent on the mismatch energies EMi in the individual neurons. If the network architecture is powerful enough so that after learning all the mismatch energies vanish, EMi=0, then the cost will also vanish, C=12𝒖𝒐*𝒖𝒐2=0. This is because for the output neurons, the mismatch error includes the target error (Equation 7b). In the limit of weak nudging (β0), the learning rule performs gradient descent on C, and with this also finds a local minimum of the mismatch energies.

In the case of weak nudging and a single steady-state equilibrium, the NLA algorithm reduces to the Equilibrium Propagation algorithm (Scellier and Bengio, 2017) that minimizes the cost C for a constant input and a constant target. In the case of strong nudging and a single steady-state equilibrium, the NLA principle reduces to the Least-Control Principle (Meulemans et al., 2022) that minimizes the mismatch energy EM for a constant input and a constant target, with the apical prediction error becoming the prediction error from standard predictive coding (Rao and Ballard, 1999). While in the Least-Control Principle, the inputs and outputs are clamped to fixed values, the output errors are backpropagated and the network equilibrates in a steady state where the corrected network activities reproduce the clamped output activities. This state is called the ‘prospective configuration’ in Song et al., 2024 because neurons deep in the network are informed about the distal target already during the inference, and are correspondingly adapted to be consistent with this distal target. In the NLA principle, after an initial transient, the network always remains in the moving equilibrium due to the prospective coding. While inputs and targets dynamically change, the network moves along a continuous sequence of prospective configurations.

In the motor control example, the theorem tells that a given target motor trajectory 𝒖𝒐*(t) is learned to be reproduced with the forward model 𝒖𝒐(t)=𝑭W(𝒓in(t)), by applying the dendritic predictive plasticity for the network neurons (Equation 8). We next exemplify the theory by looking into the brain, reproducing cortical activity, and showing how a multi-layer cortical network can learn a sensory-motor mapping while staying in a moving equilibrium throughout the training.

Reproducing intracortical EEG recordings and recognizing handwritten digits

As an illustration, we consider a recurrently connected network that learns to represent intracortical electroencephalogram (iEEG) data from epileptic patients (Figure 3b). For each electrode, we assign a neuron within this network to represent the activity of the cell cluster recorded in the corresponding iEEG signal via its membrane potential. During learning, a randomly selected subset of electrode neurons are nudged towards the target activity from recorded data while learning to be reproduced by the other neurons. After learning, we can present only a subset of electrode neurons with previously unseen recordings and observe how the activity of the other neurons closely matches the recordings of their respective electrodes (Figure 3c). The network derived from NLA is thus able to learn complex correlations between signals evolving in real-time by embedding them in a recurrent connectivity structure.

As an example of sensory-motor processing in the NLA framework, we next consider a well-studied image recognition task, here reformulated in a challenging time-continuous setting, and interpreted as a motor task where 1 out of 10 fingers has to be bent upon seeing a corresponding visual stimulus (see Figure 3). In the context of our moving equilibrium hypothesis, we postulate that during the learning phase, but not the testing phase, an auditory signal identifies the correct finger and sets the target spindle lengths of 10 finger flexors, 𝒖𝒐*(t). The target spindle length encodes the desired contraction of a flexor muscle in the correct finger upon the visual input 𝒓in(t), and a corresponding relaxation for the nine incorrect fingers.

We train a hierarchical three-layer network on images of handwritten digits (MNIST, LeCun, 1998), with image presentation times between 0.5τ (=5 ms) and 20τ (=200 ms, with τ=10 the membrane time constant). Figure 4a-c depict the most challenging scenario with the shortest presentation time. Synaptic plasticity is continuously active, despite the network never reaching a temporal steady state (Figure 4b1). Due to the lookahead firing rates in the NLA, the mismatch errors ei(t) represent the correct gradient and propagate without lag throughout the network. As a consequence, our mismatch errors are almost equal to the errors obtained from classical error backpropagation applied at each time step to the purely forward network (i.e. the network that suppresses the error-correction 𝒆 of the voltage and instead considers the ‘classical’ voltage 𝒖l=𝑾lρ(𝒖l1) only, see blue dots in Figure 4b2). The network eventually learned to implement the mapping 𝒖𝒐=𝑭W(𝒓in)𝒖𝒐* with a performance comparable to error-backpropagation at each dt, despite the short presentation time of only 5 ms (Figure 4c1). The approximation is due to the fact that the NLA learns an instantaneous mapping from the low-pass filtered input rates 𝒓in to the output voltage 𝒖𝒐, while the mapping from the original input rates 𝒓in to the voltages 𝒖1 of the first-layer neurons (and hence also to the output voltages 𝒖𝒐) is delayed by τin. Since in the simulations, the target voltages 𝒖𝒐* were switched instantaneously with 𝒓in (and not with 𝒓in), however, a mismatch error between 𝒖𝒐 and 𝒖𝒐* remains for stimulus presentation times shorter than τin (Figure 4c2). The Latent Equilibrium (Haider et al., 2021) avoids these temporal limitations by implementing an instantaneous mapping on the rates instead on the voltages (Methods, Sect. From implicit to explicit differential equations).

On-the-fly learning of finger responses to visual input with real-time dendritic error propagation (rt-DeEP).

(a) Functionally feedforward network with handwritten digits as visual input (𝒓in(2)(t) in Figure 3a, here from the MNIST data set, 5 ms presentation time per image), backprojections enabling credit assignment, and activity of the 10 output neurons interpreted as commands for the 10 fingers (forward architecture: 784×500×10 neurons). (b) Example voltage trace (b1) and local error (b2) of a hidden neuron in neuronal least-action (NLA) (red) compared to an equivalent network without lookahead rates (orange). Note that neither network achieves a steady state due to the extremely short input presentation times. Errors are calculated via exact backpropagation, i.e., by using the error backpropagation algorithm on a pure feedforward NLA network at every simulation time step (with output errors scaled by β), shown for comparison (blue dots). (c) Comparison of network models during and after learning. Color scheme as in (b). (c1) The test error under NLA evolves during learning on par with classical error backpropagation performed each Euler dt based on the feedforward activities. In contrast, networks without lookahead rates are incapable of learning such rapidly changing stimuli. (c2) With increasing presentation time, the performance under NLA further improves, while networks without lookahead rates stagnate at high error rates. This is caused by transient, but long-lasting misrepresentation of errors following stimulus switches: when plasticity is turned off during transients and is only active in the steady state, comparably good performance can be achieved (dashed orange). (d) Receptive fields of 6 hidden-layer neurons after training, demonstrating that even for very brief image presentation times (5ms), the combined neuronal and synaptic dynamics are capable of learning useful feature extractors such as edge filters.

The instantaneous voltage propagation relieves an essential constraint of previous models of bio-plausible error backpropagation (e.g. Scellier and Bengio, 2017; Whittington and Bogacz, 2017; Sacramento et al., 2018), with reviews (Richards et al., 2019; Whittington and Bogacz, 2019; Lillicrap et al., 2020): without lookahead firing rates, networks need much longer to correctly propagate errors across layers, with each layer roughly adding another membrane time constant of 10 ms, and thus cannot cope with realistic input presentation times. In fact, in networks without lookahead output, learning is only successful if plasticity is switched off while the network dynamics did not reach a stationary state during a stimulus presentation interval (Figure 4c2). Notice also that the prospective coding is necessary to keep the network activity stable for an instantaneous processing of the sensory input. If, in the absence of prospective coding, we would only shrink the membrane time constant to 0, the recurrent error processing would become unstable (see Appendix 3).

Implementation in cortical microcircuits

So far, we did not specify how errors 𝒆 appearing in the differential equation for the voltage (Equation 7a) are transmitted across the network in a biologically plausible manner. Building on Sacramento et al., 2018, we propose a cortical microcircuit to enable this error transport, with all neuron dynamics evolving according to the NLA principle. Although the idea applies to arbitrarily connected networks, we use the simpler case of functionally feedforward networks to illustrate the flow of information in these microcircuits (Figure 5a).

Hierarchical plastic microcircuits implement real-time dendritic error learning (rt-DeEL).

(a) Microcircuit with ‘top-down’ input (originating from peripheral motor activity, blue line) that is explained away by the lateral input via interneurons (dark red), with the remaining activity representing the error el. Plastic connections are denoted with a small red arrow and nudging with a dashed line. (b1) Simulated network with 784-300-10 pyramidal-neurons and a population of 40 interneurons in the hidden layer used for the MNIST learning task where the handwritten digits have to be associated with the 10 fingers. (b2) Test errors for rt-DeEL with joint tabula rasa learning of the forward and lateral weights of the microcircuit. A similar performance is reached as with classical error backpropagation. For comparability, we also show the performance of a shallow network (dashed line). (b3) Angle derived from the Frobenius norm between the lateral pathway 𝑾IPl𝑾PIl and the feedback pathway 𝑩l𝑾l+1. During training, both pathways align to allow correct credit assignment throughout the network. Indices are dropped in the axis label for readability.

For such an architecture, pyramidal neurons in area l (that is a ‘layer’ of the feedforward network) are accompanied by a pool of interneurons in the same layer (area). The dendrites of the interneurons integrate in time (with time constant τ) lateral input from pyramidal neurons of the same layer (𝒓l) through plastic weights 𝑾IPl. Additionally, interneurons receive ‘top-down nudging’ from pyramidal neurons in the next layer through randomly initialized and fixed back projecting synapses 𝑩IPl targeting the somatic region, and interneuron nudging strength βI. The notion of ‘top-down’ originates from the functionally feed-forward architecture leading from sensory to ‘higher cortical areas.’ In the context of motor control, the highest ‘area’ is the last stage controlling the muscle lengths, being at the same time the first stage for the proprioceptive input (Figure 3a).

According to the biophysics of the interneuron, the somatic membrane potential becomes a convex combination of the two types of afferent input (Urbanczik and Senn, 2014),

(9) ulI=(1βI)WlIPr¯l+βIBlIPul+1.

In the biological implementation, the feedback input is mediated by the low-pass filtered firing rates 𝒓l+1=ρ(𝒖l+1), not by 𝒖l+1 as expressed in the above equation. Yet, we argue that for a threshold-linear ρ the ‘top-down nudging’ by the rate 𝒓l+1 is effectively reduced to a nudging by the voltage 𝒖l+1. This is because errors are only backpropagated when the slope of the transfer function is positive, 𝒓l+1>0, and hence when the upper-layer voltage is in the linear regime. For more general transfer functions, we argue that short-term synaptic depression may invert the low-pass filtered presynaptic rate back to the presynaptic membrane potential, 𝒓l+1𝒖l+1, provided that the recovery time constant τ matches the membrane time constant (see end of Results and Appendix 1).

Apical dendrites of pyramidal neurons in each layer receive top-down input from the pyramidal population in the upper layer through synaptic weights 𝑩l. These top-down weights could be learned to predict the lower-layer activity (Rao and Ballard, 1999) or to become the transposed of the forward weight matrix (𝑩l=𝑾l+1T, Max et al., 2022), but for simplicity, we randomly initialized them and keep them fixed (Lillicrap et al., 2020). Besides the top-down projections, the apical dendrites also receive lateral input via an interneuron population in the same layer through synaptic weights 𝑾PIl that are plastic and will be learned to obtain suitable dendritic errors. The ‘-’ sign is suggestive of these interneurons to subtract away the top-down input entering through 𝑩l (while the weights can still be positive or negative). Assuming again a conversion of rates to voltages, also for the inhibitory neurons that may operate in a linear regime, the overall apical voltage becomes

(10) e¯lA=Blul+1WlPIulI.

What cannot be explained away from the top-down input 𝑩l𝒖l+1 by the lateral feedback, 𝑾PIl𝒖Il, remains as dendritic prediction error 𝒆lA in the apical tree (Figure 5a). If the top-down and lateral feedback weights are learned as outlined next, these apical prediction errors take the role of the backpropagated errors in the classical backprop algorithm.

To adjust the interneuron circuit in each layer (‘area’), the synaptic strengths from pyramidal-to-interneurons, 𝑾IPl, are learned to minimize the interneuron mismatch energy, ElIP=12ulIWlIPr¯l2. The interneurons, while being driven by the lateral inputs 𝑾IPl𝒓l, learn to reproduce the upper-layer activity that also nudges the interneuron voltage. Learning is accomplished if the upper-layer activity, in the absence of an additional error on the upper layer, is fully reproduced in the interneurons by the lateral input.

Once the interneurons learn to represent the ‘error-free’ upper-layer activity, they can be used to explain away the top-down activities that also project to the apical trees. The synaptic strengths from the inter-to-pyramidal neurons, 𝑾PIl, are learned to minimize the apical mismatch energy, ElPI=12e¯lA2=12Blul+1WlPIulI2. While in the absence of an upper-layer error, the top-down activity 𝑩l𝒖l+1 can be fully cancelled by the interneuron activity 𝑾PIl𝒖Il, a neuron-specific error will remain in the apical dendrites of the lower-level pyramidal neurons if there is an error endowed in the upper-layer neurons. Gradient descent learning on these two energies results in the learning rules for the P-to-I and I-to-P synapses,

(11) W˙lIP=ηIP(ulIWlIPr¯l)r¯lTandW˙lPI=ηPI(Blul+1WlPIulI)ulIT.

The following theorem on dendritic error learning tells that the plasticity in the lateral feedback loop leads to an appropriate error representation in the apical dendrites of pyramidal neurons.

Theorem 2 (real-time dendritic error learning, rt-DeEL)

Consider a cortical microcircuit composed of pyramidal and interneurons, as illustrated in Figure 5a, with more interneurons in layer (‘cortical area’) l than pyramidal neurons in layer l+1, and with adaptable pyramidal-to-inhibitory weights 𝑾IPl within the same layer that are nudged through top-down weights 𝑩IPl, see Methods, Sect. Proving theorem 2 (rt-DeEL). Then, for suitable top-down nudging, learning rates, and initial conditions, the inhibitory-to-pyramidal synapses 𝑾PIl within each layer l (Equation 11) evolve such that the lateral feedback circuit aligns with the bottom-up-top-down feedback circuit,

(12) WlPIWlIP=BlWl+1.

After this horizontal-to-vertical circuit alignment, the apical voltages 𝒆lA=𝑩l𝒖l+1𝑾PIl𝒖Il of the layer-l pyramidal neurons (Equation 13) represent the ‘B-backpropagated’ errors, 𝒆lA=𝑩l𝒆l+1. When modulated by the postsynaptic rate derivatives, 𝒓l=𝝆(𝒖l), the apical voltages yield the appropriate error signals

(13) e¯l=ulWlr¯l1=r¯le¯lA=r¯lBle¯l+1

for learning the forward weights 𝑾l according to 𝑾˙l𝒆l𝒓l1T Equation 8.

The back projecting weights can also be learned by a local real-time learning rule to become transpose of the forward weights, 𝑩l=𝑾l+1T (Max et al., 2022). In this case, the error signals 𝒆l learned in the apical dendrites according to the above Theorem (Equation 13) represent the gradient errors 𝒆 appearing in the real-time dendritic error propagation (rt-DeEP, Theorem 1). There, the errors 𝒆 drive the gradient plasticity of the general weight matrix 𝑾, split up here into the forward weights 𝑾l to a layer l (for l=1,..,N).

Simultaneously learning apical errors and basal signals

Microcircuits following these neuronal and synaptic dynamics are able to learn the classification of hand-written digits from the MNIST dataset while learning the apical signal representation (Figure 5b1, b2). In this case, feedforward weights 𝑾l and lateral weights WlPI and WlIP are all adapted simultaneously. Including the WlIP˙-plasticity (by turning on the interneuron nudging from the upper layer, βI>0 in Equation 9), greatly speeds up the learning.

With and without WlIP˙-plasticity, the lateral feedback via interneurons (with effective weight WlIPWlPI) learns to align with the forward-backward feedback via upper layer pyramidal neurons (with effective weight 𝑩l𝑾l+1, Figure 5b3). The microcircuit extracts the gradient-based errors (Equation 13), while the forward weights use these errors to reduce these errors to first minimize the neuron-specific mismatch errors, and eventually the output cost.

Since the apical voltage 𝒆lA appears as a postsynaptic factor in the plasticity rule for the interneurons (WlPI˙), this I-to-P plasticity can be interpreted as Hebbian plasticity of inhbitory neurons, consistent with earlier suggestions (Vogels et al., 2011; Bannon et al., 2020). The plasticity WlIP˙ of the P-to-I synapses, in the same way as the plasticity for the forward synapses 𝑾˙l, can be interpreted as learning from the dendritic prediction of somatic activity (Urbanczik and Senn, 2014).

Crucially, by choosing a large enough interneuron population, the simultaneous learning of the lateral microcircuit and the forward network can be accomplished without fine-tuning of parameters. As an instance in case, all weights shared the same learning rate. Such stability bolsters the biophysical plausibility of our NLA framework and improves over the previous, more heuristic approach (Sacramento et al., 2018; Mesnard et al., 2019). The stability may be related to the nested gradient descent learning according to which somatic and apical mismatch errors in pyramidal neurons, and somatic mismatch errors in inhibitory neurons are minimized.

Finally, since errors are defined at the level of membrane voltages (Equation 11), synapses need a mechanism by which they can recover the presynaptic voltage errors from their afferent firing rates. While for threshold-linear transfer functions the backpropagated voltage errors translate into rate errors (Appendix 1), more general neuronal nonlinearities must be matched by corresponding synaptic nonlinearities. Pfister et al., 2010 have illustrated how spiking neurons can leverage short-term synaptic depression to estimate the membrane potential of their presynaptic partners. Here, we assume a similar mechanism in the context of our rate-based neurons. The monotonically increasing neuronal activation function, 𝒓l+1=ρ(𝒖l+1), can be approximately compensated by a vesicle release probability that monotonically decreases with the low-pass filtered presynaptic rate 𝒓l+1 (see Appendix 1 and Appendix 1—figure 1). If properly matched, this leads to a linear relationship between the presynaptic membrane potential 𝒖l+1 and the postsynaptic voltage contribution.

Discussion

We introduced a least-action principle to neuroscience for deriving the basic laws of the voltage and synaptic dynamics in networks of cortical neurons. The approach is inspired by the corresponding principle in physics where basic laws of motion are derived across the various scales. While in physics the action is defined as the time-integral of the kinetic minus potential energy, we define the action as the time-integral of instantaneous somatodendritic mismatch errors across network neurons plus a behavioral error. The ‘kinetics’ of a voltage trajectory only arises because we postulate that the action along a trajectory is minimized with respect to future voltages, not the instantaneous voltage, as would be done in physics. The postulate implies a prospective voltage dynamics that look ahead in time, together with prospective local errors, in order to minimize the action and hence the somatodendritic mismatch errors. The prospective errors nudge the firing of pyramidal neurons deep in the brain, so that motor neurons improve the output of the network right in time. A putative behavioral error, encoded in the motor feedback, propagates back through the network and produces prospective corrections of the pyramidal neuron activities that effectively manifest in instantaneous corrections of the motor trajectory. Through this prospective coding, the sensory stream, the deep network activity, and the motor feedback are in sync at any moment in time. We formulated the dynamic synchronization as a ‘moving equilibrium hypothesis’, referring to the classical equilibrium point hypothesis for motor control (Feldman and Levin, 2009; Latash, 2010). More generally, the brain activity formed by the prospective firing of cortical pyramidal neurons is in a moving equilibrium while converting sensory input streams into motor outputs, consistent with prospective sensory processing in the human cortex (Blom et al., 2020).

Because the neuronal dynamics derived from the global NLA principle is in a moving equilibrium, the prospective dendritic errors that globally correct the output trajectory are also suited to instruct local synapatic plasticity in the dendrites. In fact, working down the apical errors by adapting the sensory-driven synapses on the basal dendrites reduces the global output errors in real time. The apical errors are extracted from the top-down feedback via lateral ‘inhibition’ that tries to cancel the top-down signal. This top-down feedback includes activity from a putative erroneous motor output that was not foreseen by the local inhibition and thus survives as a local apical error. Given the prospective coding of the pyramidal neurons, the dendritic errors are also prospective and thus able to induce the correct error-minimizing plasticity online, while stimuli and targets continuously change.

The NLA principle as a bottom-up theory from neurons to behavior

To show that the NLA principle offers a viable program for a formalization of neuroscience following the example of physics, we exemplified its ramifications in dendritic computation, cortical microcircuits, synaptic plasticity, motor control, and sensory-based decision-making. The crucial point of our axiomatization is that it connects the local neuronal errors to the global behavioral errors right in the formulation of the principle, eventually leading to local gradient-based plasticity rules. Because the formulation builds upon computations that can be realized in single neurons and dendrites to produce a behavioral output, the NLA principle can be seen as a bottom-up theory of behavior. It is articulated in terms of apical and basal dendrites, somatic firing, network connectivity and behavioral outputs that jointly minimize their errors. This contrasts the related free energy principle, for instance, that leads to a top-down theory of behavior by starting with the statistical, but the more universal, notion of a free energy. It postulates that any self-organizing system, that is at a statistical equilibrium with its environment, must minimize its free energy (Friston, 2010; Friston et al., 2022), and from there work down its way to neurons and dendrites (Bastos et al., 2012; Kiebel and Friston, 2011).

Starting with a single Lagrangian function that specifies the form of the somatodendritic prediction errors leaves some freedom for the interpretation and the implementation of the emerging dynamical equations for the voltages. We interpret errors to be represented in the apical dendrites of pyramidal neurons while sensory input targets the basal dendrites, but other dendritic configurations are conceivable (Mikulasch et al., 2023) that apply also to non-pyramidal neurons. We have chosen a specific interneuron circuitry to extract our apical errors, but other microcircuits or error representations might also be considered (Keller and Mrsic-Flogel, 2018). On the other hand, the derived gradient-based synaptic plasticity is tightly linked to the specific form of the somatodendritic prediction errors expressed in the Lagrangian and its interpretation, making specific predictions for synaptic plasticity (as outlined below). The ‘external’ feedback entering through the cost function offers additional freedom to model behavioral interactions. We considered an explicit time course of a target voltage in motor neurons, for instance imposed by the feedback from muscle spindles that are themselves innervated by a prospective top-down signal to control muscle lengths (Papaioannou and Dimitriou, 2021; Dimitriou, 2022). But the cost may also link to reinforcement learning and express a delayed reward feedback delivered upon a behavioral decision of an agent acting in a changing environment (Friedrich et al., 2011; Friedrich and Senn, 2012).

A fundamental difficulty arises when the neuronal implementation of the Euler-Lagrange equations requires an additional microcircuit with its own dynamics. This is the case for the suggested microcircuit extracting the local errors. Formally, the representation of the apical feedback errors first needs to be learned before the errors can teach the feedforward synapses on the basal dendrites. We showed that this error learning can itself be formulated as minimizing an apical mismatch energy. What the lateral feedback through interneurons cannot explain away from the top-down feedback remains an apical prediction error. Ideally, while the network synapses targetting the basal tree are performing gradient descent on the global cost, the microcircuit synapses involved in the lateral feedback are performing gradient descent on local error functions, both at any moment in time. The simulations show that this intertwined system can in fact learn simultaneously with a common learning rate that is properly tuned. The cortical model network of inter- and pyramidal neurons learned to classify handwritten digits on the fly, with 10-digit samples presented per second. Yet, the overall learning is more robust if the error learning in the apical dendrites operates in phases without output teaching but with corresponding sensory activity, as may arise during sleep (see e.g., Deperrois et al., 2022; Deperrois et al., 2024).

The NLA principle integrates classical theories for cortical processing and learning

The prospective variational principle introduced with the NLA allows for integrating previous ideas on formalizing the processing and learning in cortical networks. Four such classical lines of theories come together. (i) The first line refers to the use of an energy function to jointly infer the neuronal dynamics and synaptic plasticity, originally formulated for discrete-time networks (Hopfield, 1982; Ackley et al., 1985), and recently extended to continuous-time networks (Scellier and Bengio, 2017). (ii) The second line refers to understanding error-backpropagation in the brain (Rumelhart et al., 1986; Xie and Seung, 2003; Whittington and Bogacz, 2017; Whittington and Bogacz, 2019; Lillicrap et al., 2020). (iii) The third line refers to dendritic computation and the use of dendritic compartmentalization for various functions such as nonlinear processing (Schiess et al., 2016; Poirazi and Papoutsi, 2020) and deep learning (Guerguiev et al., 2017; Sacramento et al., 2018; Haider et al., 2021). (iv) The fourth line refers to predictive coding (Rao and Ballard, 1999) and active inference (Pezzulo et al., 2022) to improve the sensory representation and motor output, respectively.

  • (i) With regard to energy functions, the NLA principle adds a variational approach to characterize continuous-time neuronal trajectories and plasticity. Variational approaches are studied in the context of optimal control theory where a cost integral is minimized across time, constrained to some network dynamics (Todorov and Jordan, 2002; Meulemans et al., 2021). The NLA represents a unifying notion that allows us to infer both, the network dynamics and its optimal control from a single Lagrangian. The error we derive represents prospective control variables that are applied to the voltages of each network neuron so that they push the output neurons towards their target trajectory. The full expression power of this control theoretic framework has yet to be proven when it is extended to genuine temporal processing that includes longer time constants, for instance, inherent in a slow threshold adaptation (Bellec et al., 2020). The NLA principle can also treat the case of strong feedback studied so far in relaxation networks only (Meulemans et al., 2022; Song et al., 2024). Our rt-DeEP Theorem makes a statement for real-time gradient descent learning while the network is in a moving equilibrium, linking to motor learning in the presence of perturbing force fields (Herzfeld et al., 2014) or perturbing visual inputs (Dimitriou, 2016).

  • (ii) With regard to error-backpropagation, the NLA principle infers a local error that originates from the error in the output neurons. In the current version, the NLA relies on feedback alignment (Lillicrap et al., 2016) to obtain a local apical error without postulating plasticity in the top-down synapses. Other works have explored learning the feedback weights (Akrout et al., 2019; Kunin et al., 2020), notably within a phase-less real-time learning framework as considered here (Max et al., 2022). It would be promising to combine these ideas to obtain a fully plastic microcircuit that adjusts from scratch to various real-time learning tasks.

  • (iii) With regard to dendritic computation, the NLA principle extends the idea of dendritic error representation (Sacramento et al., 2018; Mesnard et al., 2019) by the prospective coding of both errors and firing rates (Figure 2). As a consequence, the various dendritic delays are compensated and synaptic plasticity can operate at any moment, without need to wait for network relaxations (Scellier and Bengio, 2017; Song et al., 2024). In the present framework, the input rates are low-pass filtered by variable input time constants (τin) before the induced voltages are instantaneously processed in the network. While this offers the possibilities for a temporal processing of the inputs, step changes in the input rates cannot instantaneously propagate as made possible in Haider et al., 2021. The dendritic error representation has also been applied to spike-based learning in recurrent networks (Mikulasch et al., 2021).

  • (iv) With regard to the principle of predictive coding (Rao and Ballard, 1999; Pezzulo et al., 2022), the NLA offers a form of predictive error-coding in the apical tree of sensory pyramidal neurons that shapes the sensory representation. Both the predictive coding and the NLA principle imply an error-based adaptation of the lower and higher cortical representations. However, while predictive coding is based on propagating errors, the NLA principle directly propagates the error-corrected representations. Based on the backpropagated activities, it extracts in each area the prospective errors to reshape the local representations and induce synaptic plasticity. Otherwise, the freedom in defining the cost function in the NLA, and the inclusion of active inference in predictive coding (Pezzulo et al., 2022), make the two frameworks equally powerful. Both frameworks can explain the learning from motor and sensory prediction errors. In fact, our example of motor control minimizes the proprioceptive error formed by the target muscle lengths minus the effective muscle lengths. Active inference likewise minimizes the mismatch between an internal motor target and the proprioceptive inputs caused by the own actions. In turn, the cost function in the NLA principle may also capture the sensory prediction errors, where the ground truth lies in the external inputs that are learned to be matched by the sensory expectation. Hence, while being functionally equivalent, the neuronal interpretations are different: predictive coding focuses on the propagation of errors using the sensory- and motor representations as local auxiliary quantities, and the NLA principle directly integrates errors in ‘auxiliary dendrites’ to propagate sensory or motor representations.

The NLA integrates and predicts features of synapses, dendrites, and circuits

Motivated by the predictive power of the least-action principle in physics, we ask about experimental confirmation and predictions of the NLA principle. Given its axiomatic approach, it appears astonishing to find various preliminary matches at the dendritic, somatic, interneuron, synaptic, and even behavioral levels. Some of these are:

  1. the prospective coding of pyramidal neuron firing (Köndgen et al., 2008);

  2. the prospective processing of apical signals while propagating to the soma (Ulrich, 2002);

  3. the basal synaptic plasticity on pyramidal neurons and synaptic plasticity on interneurons, driven by the postsynaptic activity that is ‘unexplained’ by the distal dendritic voltage (Urbanczik and Senn, 2014) – while partly consistent with spike-timing dependent plasticity (Sjöström et al., 2001; Spicher et al., 2017), the postulated dendritic voltage-dependence of the plasticity rules still awaits an experimental inspection;

  4. the Hebbian homeostatic plasticity of interneurons targeting the apical dendritic tree of pyramidal neurons (Bannon et al., 2020);

  5. the short-term synaptic depression at top-down synapses targetting inhibitory neurons and apical dendrites (akin to Abbott et al., 1997), but with a faster recovery time constant that invert the presynaptic activation function (see also Pfister et al., 2010);

  6. the modulation of the apical contribution to the somatic voltage by the slope of the somatic activation function for instance by downregulating apical NMDA receptors with increasing rate of backpropagating action potentials, Theis et al., 2018; and

  7. the involvement of muscle spindles in the prospective encoding of motor errors during motor learning, with the γ-innervation setting the target length of the spindles (Dimitriou, 2016; Papaioannou and Dimitriou, 2021; Vargas et al., 2023).

More experimental and theoretical work is required to substantiate these links and test specific predictions, such as the apical error representation in cortical pyramidal neurons.

Overall, our approach adapts the least-action principle from physics to be applied to neuroscience, and couples it with a normative perspective on the prospective processing of neurons and synapses in global cortical networks and local microcircuits. Given its physical underpinnings, the approach may inspire the rebuilding of computational principles of cortical neurons and circuits in neuromorphic hardware (Bartolozzi et al., 2022). A step in this direction, building on the instantaneous computational capabilities by slowly integrating neurons, has made done by Haider et al., 2021. Given its aspiration for a theoretical framework in neurobiology, a next challenge would be to generalize the NLA principle to spiking neurons (Gerstner and Kistler, 2002; Brendel et al., 2020) with their potential for hardware implementation (Zenke and Ganguli, 2018; Göltz et al., 2021; Cramer et al., 2022), to include attentional mechanisms in terms of dendritic gain modulation (Larkum et al., 2004) with a putative link to self-attention in artificial intelligence (Vaswani et al., 2017), to add second-order errors to cope with certainties (Granier et al., 2023), and to incorporate longer temporal processing as, for instance, offered by neuronal adaptation processes (La Camera et al., 2006) or realistically modelled dendrites (Chavlis and Poirazi, 2021).

Methods

Euler-Lagrange equations as inverse low-pass filters

The theory is based on the look-ahead of neuronal quantities. In general, the look-ahead of a trajectory x(t) is defined via lookahead operator applied to x,

(14) (1+τddt)x=x+τx˙.

The lookahead operator is the inverse of the low-pass filter operator denoted by a bar,

(15) x(t)=1τtx(t)ettτdt.

This low-pass filtering can also be characterized by the differential equation τx˙(t)=x(t)+x(t), see Appendix 2. Hence, applying the low-pass filtering to x and then the lookahead operator (1+τddt) to x(t), and using the Leibnitz rule for differentiating an integral, we calculate (1+τddt)x(t)=x(t). In turn, applying first the lookahead, and then the low-pass filtering, also yields the original trace back, (1+τddt)x=x+τx˙=x.

We consider an arbitrary network architecture with network neurons that are recurrently connected and that receive external input through an overall weight matrix 𝑾=(𝑾in,𝑾net), aggregated column-wise. The instantaneous presnyaptic firing rates are 𝒓=(𝒓in,𝒓net)T, interpreted as a single-column vector. A subset of network neurons are output neurons, ON, for which target voltages 𝒖* may be imposed. Rates and voltages may change in time t. Network neurons are assigned a voltage 𝒖, generating the low-pass filtered rate 𝒓net=ρ(𝒖), and a low-pass filtered error 𝒆=𝒖𝑾𝒓. We further define output errors eo*=uo*uo for oO, and ei*=0 for non-output neurons iNO. With this, the Lagrangian from Equation 3 takes the form

(16) L=12e¯2+β2e¯2.

We next use that 𝒖=𝒖~τ𝒖~˙, with the .~ operator defined in Equation 4, to write out the Lagrangian L in the canonical coordinates (𝒖~,𝒖~˙) as (see also Equation 3)

(17) L=12iN[u~iτu~˙ijWijρ(u~jτu~˙j)]2+β2oO[uo*(u~oτu~˙o)]2.

The neuronal dynamics is derived from requiring a stationary action (see Equation 5), which is generally solved by the Euler-Lagrange equations Lu~iddtLu~˙i=0 (see Equation 6). Because u~ only arises in L in the compound u~τu~˙, the derivative of L with respect to u~ is identical to the derivative with respect to τu~˙,

(18) Lu~˙i=τLu~i.

Using the lookahead operator Equation 14, the Euler-Lagrange equations can then be rewritten as

(19) Lu~i+τddtLu~i=(1+τddt)Lu~i=0.

Since L(𝒖~,𝒖~˙)=L(𝒖) and 𝒖=𝒖~τ𝒖~˙, the derivative of L with respect to 𝒖~ is the same as the derivative of L with respect to 𝒖,

Lu~i=Lui.

Plugging this into Equation 19, the Euler-Lagrange equations become a function of 𝒖 and 𝒖˙,

(20) (1+τddt)Lui=0.

Notice that, if we had directly calculated Lu~iddtLu~˙i=0, the second-order time derivative u~¨i of the discounted future voltage would be absorbed in a first-order time derivative of the voltage. The reason is that u~˙iτu~¨i=u˙i, and u~¨i only arises in this combination because the Lagrangian L=L(𝒖) is only a function of 𝒖 and not of 𝒖˙. Hence, the acceleration term u~¨i disappears, while a voltage derivative u˙i appears.

The solution of this differential Equation 20 is Lui=ciett0τ, and hence any trajectory (u~i,u~˙i) which satisfy the Euler-Lagrange equations will hence cause Lui to converge to zero with a characteristic time scale of τ. Since we require that the initialisation is at t0=, we conclude that Lui=0, as required in the rt-DeEP Theorem. For a table with all the mathematical abbreviations see Methods-Table 1.

Table 1
Mathematical symbols.
Mathematical expressionNamingComment
uiInstantaneous (somatic) voltageonly for network neurons
ri=ρ(ui)+τρ˙(ui)Instantaneous firing rate of neuron ithat looks linearly ahead in time
r¯(t)=1τ-tr(t)e-t-tτdtDefinition of low-pass filteringSee Equation 15
r¯i=ρ(ui)=ri+τr˙i¯Low-pass filtered firing ratepostulated to be a function of ui
𝒓=𝒓¯+τ𝒓¯˙Self-consistency eq.for low-pass filtered rate
𝒓inInput rate vector, columnprojects to selected neurons
𝒓¯inLow-pass filter input ratesinstantaneously propagates
ei=(ui+τu˙i)jWijrjProspective error of neuroniin apical dendrite
e¯i=uijWijr¯jError of neuroniin soma
EMi=12e¯i2=12(uijWijr¯j)2Mismatch energy in neuron ibetween soma and basal dendrite
uo*Target voltage for output neuron ocould impose target on ro
or r¯o
e¯o*=uo*-uoError of output neuron oalso called target error
Co=12(e¯o*)2Cost contribution of output neuron obetween soma and basal dendrite
L=iNEMi+βoOCoLagrangianoutput Onetwork N
u~(t)=1τtu(t)𝒆(t-t)/τdtDiscounted future voltageprospective coordinates for NLA
𝒖=𝒖~-τ𝒖~˙Self-consistency eq.for discounted future voltage
A=t1t2L[𝒖~(t),𝒖~˙(t)]dtNeuronal Least Action (NLA)expressed in prospect. coordinates
Lu~i-ddtLu~˙i=(1+ddt)uiL=0Euler-Lagrange equationsturned into lookahead operator
𝑾inweights from input neurons 𝒓indim(N)×dim(𝒓in)
, most0
𝑾netweights between network neuronsdim(N)×dim(N)
𝑾=(𝑾in,𝑾net)total weight matrixdim(N)×(dim(𝒓in)+dim(N))
𝒓=(𝒓in,𝒓net)Tinstantaneous firing rate vectorcolumn (indicated by transpose)
𝑾˙𝒆¯𝒓¯TPlasticity of 𝑾e¯
is a column, 𝒓¯T a row vector
𝒖𝒐*(t)=𝑭*(𝒓¯in(t))Target function formulated for r¯in(t)a functional of 𝒓in(t)
𝒖𝒐(t)=𝑭W(𝒓¯in(t),𝒆¯𝒐*(t))Func. implemented by forward networkinstant. func. of 𝒓¯in(t)
, not 𝒓in(t)
NLayers in forward network, w/o rinLast-layer voltages:𝒖N=𝒖𝒐
WlIPWeights from pyr to interneuronslateral, within layer l
WlPIWeights from inter- to pyr’neuronslateral, within layer l
𝑾lBottom-up weights from layerl–1
tol
between pyramidal neurons
𝑩lTop-down weights from layerl+1
tol
between pyramidal neurons
e¯lA=Blul+1WPIluIlLow-pass filtered apical error in layerltop-down minus lateral feedback
e¯l=r¯le¯lA=r¯lBle¯l+1Somato-basal prediction erroris correct error for learning
ElIP=12ulIWlIPr¯l2Interneuron mismatch energyminimized to learn WlIP
ElPI=12Blul+1WPIluIl2Apical mismatch energyminimized to learn WlPI
η,ηIP,ηPILearning rates for plasticity of…Wl;WlIP;WlPI
𝑯=2L𝒖2=𝟏-𝑾net𝝆-𝒆¯Hessian,2L𝒖2=𝒇𝒖. If pos. definite⇒ stable dynamics
𝒇(𝒖,t)=L𝒖=𝒖-𝑾𝒓¯(𝒖)-𝒆¯(𝒖)Corrected errorbecomes 0
with τ
𝒇(𝒖,t)+τ𝒇˙(𝒖,t)=0Euler-Lagrange equationssatisfy f(u,t)=f0e(tt0)/τ
𝒇(𝒖,t)=0Always the case after transientexponentially decaying with τ
𝒖˙=-1τ𝑯-1(𝒖)(𝒇(𝒖)+τ𝒇t)Explicit diff. eq.obtained by solving for 𝒖˙
𝒈(𝒖,t)=-1τ𝑯-1(𝒖)(𝒇(𝒖)+τ𝒇t)Used to write the explicit diff. eq.𝒖˙=𝒈(𝒖,t)
𝑮(𝒚,𝒖˙)=(1+τddt)L𝒖=𝒇+τ𝒇˙Used for contraction anaylsis, Equation 53𝒚=(𝒓in,𝒖𝒐*,𝒖)
𝑴,𝑲Used to iteratively converge to𝒖˙see Equation 46
𝒖˘=𝒖+τ𝒖˙Linear lookahead voltageLatent Equilibrium, Appendix 4

Deriving the network dynamics from the Euler-Lagrange equations

We now derive the equations of motion from the Euler-Lagrange equations. Noticing that 𝒖 enters in 𝒆=𝒖𝑾𝒓 twice, directly and through 𝒓net=ρ(𝒖), and once in the output error 𝒆*, we calculate from 16, using 𝒓(𝒖)=(𝒓in,ρ(𝒖))T and 𝑾=(𝑾in,𝑾net),

(21) Lu=e¯ϵ¯βe¯, with ϵ¯=r¯netWnetTe¯.

Remember that for non-output neurons i no target exists, and for those we set 𝒆i*=0. Next, we apply the lookahead operator to this expression, as required by the Euler-Lagrange Equation 19. In general (1+τddt)𝒙=𝒙+τ𝒙˙=x, and we set for 𝒙 the expression on the right-hand side of Equation 21, 𝒙=𝒆𝝐β𝒆*, which at the same time is 𝒙=L𝒖. Hence, the Euler-Lagrange equations in the form of Equation 20, (1+τddt)x=0, translate into

(22) (1+τddt)Lu=0eϵβe=0τu˙=u+Wr+e.

To move from the middle to the last equality we replaced 𝒆 with 𝒆=(1+τddt)𝒆=𝒖+τ𝒖˙𝑾𝒓. In the last equality we interpret 𝒆 as the sum of the two errors, 𝒆=𝝐+β𝒆*, again using the middle equality. This proves Equation 7a, Equation 7b.

Notice that the differential equation τ𝒖˙=... in Equation 22 represents an implicit ordinary differential equation as on the right-hand side not only 𝒖, but also 𝒖˙ appears (in 𝒓 and 𝒆). The uniqueness of the solution 𝒖(t) for a given initial condition is only guaranteed if it can be converted into an explicit ordinary differential equation (see Sect. Appendix 3).

In taking the temporal derivative we assumed small learning rates such that terms including W˙ij can be neglected. The derived dynamics for the membrane potential of a neuron ui in Equation 22 show the usual leaky behavior of biological neurons. However, both presynaptic rates ri and prediction errors ei enter the equation of motion with lookaheads, i.e., they are advanced (ri=ri+τr˙i and ei=ei+τe˙i), cancelling the low-pass filtering. Since r˙i=ρ(ui)u˙i, the rate and error, ri and ei, can also be seen as nonlinear extrapolations from the voltage and its derivative into the future.

The instantaneous transmission of information throughout the network at the level of the voltages can now be seen by low-pass filtering Equation 22 with initialization far back in the past,

(23) 𝒖=𝒖+τ𝒖˙=𝑾𝒓+𝒆=𝑾𝒓(𝒖)+𝒆,

with column vector 𝒓(𝒖)=(𝒓in,ρ(𝒖))T and 𝒆=𝒓net𝑾netT𝒆+β𝒆*. Hence, solving the voltage dynamics for 𝒖 (Equation 7a), with apical voltage 𝒆=𝒆+τ𝒆˙ derived from Equation 7b, yields the somatic voltage 𝒖 satisfying the self-consistency Equation 23 at any time. In other words, 𝒖 and 𝒆‘propagate instantaneously’.

Deriving the error backpropagation formula

For clarity, we derive the error backpropagation algorithm for layered networks here. These can be seen as a special case of a general network with membrane potentials 𝒖 and all-to-all weight matrix 𝑾 (as introduced in Appendix 8), where the membrane potentials decompose into layerwise membrane potential vectors 𝒖l and the weight matrix into according to block diagonal matrices 𝑾l (with 𝑾l being the weights that project into layer l).

Assuming a network with N layers, by low-pass filtering the equations of motion we get

(24) ul=Wlr¯l1+e¯l,

for all l1,..,N, with the output error 𝒆𝒐 of the general recurrent network becoming the error in the last layer, that itself is the target error, 𝒆𝒐=𝒆N=β𝒆*=β(𝒖N*𝒖N). The error 𝒆=𝝐+β𝒆*, that we obtain from the general dynamics with 𝝐=𝒓net𝑾netT𝒆, see Equations 21 and 22, translates to an iterative formula for the error at the current layer l given the error at the downstream layer l+1, inherited from the drive 𝒓l=ρ(𝒖l) of that downstream layer via 𝑾l+1,

(25) e¯l=r¯lWl+1Te¯l+1   for   l<N.

and 𝒆N=β𝒆* for the output layer. The learning rule that reduces 𝒆l by gradient descent is proportional to this error and the presynaptic rate, as stated by Theorem 1, is

(26) W˙l(ulWlr¯l1)r¯l1T=e¯lr¯l1T,

for l=1...N. Equations 25 and 26 together take the form of the error backpropagation algorithm, where an output error is iteratively propagated through the network and used to adjust the weights in order to reduce the output cost C. From this, it is easy to see that without output nudging (i.e. β=0), the output error vanishes and consequently all other prediction errors vanish as well, 𝒆l=𝒖l𝑾l𝒓l=0 for all lN. This also means that in the absence of nudging, no weight updates are performed by the plasticity rule.

The learning rule for arbitrary connectivities is obtained in the same way by dropping the layer-wise notation. In this case, low-pass filtering the equations of motion yields 𝒖=𝑾𝒓+𝒆, as calculated in 23, and the low-pass filtered error 𝒆=𝝐+β𝒆*=𝒓net𝑾netT𝒆+β𝒆*, as inferred from Equations 21 and 22. Hence, the plasticity rule in general reads

(27) W˙(uWr¯)r¯T=e¯r¯T, with e¯=r¯netWnetTe¯+βe¯.

Proving theorem 1 (rt-DeEP)

The implicit assumption in Theorem 1 is that 𝒖˙ exists in the distributional sense for t>, which is the case for delta-functions in 𝒓in and step-functions in 𝒖*. Both parts (i) and (ii) of the Theorem are based on the requirement of stationary action δA=0, and hence on 𝒖 satisfying the Euler-Lagrange equations in the form of Equation 22, (1+τddt)L𝒖=0. From the solution Lui=cett0τ we conclude that for initialization at t0= we have L𝒖=0 for all t. It is the latter stronger condition that we require in the proof. With this, the main ingredient of the proof follows is the mathematical argument of Scellier and Bengio, 2017, according to which the total and partial derivative of L with respect to 𝑾 are identical, and this in our case is true for any time t,

(28) dLdW=LuTdudW+LW=LW,

For convenience we considered L𝒖 to be a column vector, deviating from the standard notations (see tutorial end of sec:Integration). Analogously to Equation 28, we infer dLdβ=Lβ. Reading Equation 28 from the right to the left, we conclude that the learning rule 𝑾˙L𝑾=𝒆𝒓T for all β>0 is gradient descent on L, i.e., 𝑾˙dLd𝑾. This total derivative of L can be analyzed for large and small β.

(i) We show that in the limit of large β, 𝑾˙ becomes gradient descent on the mismatch energy EM=12𝒆2. For this we first show that there is a solution of the self-consistency equation 𝒖=𝑭(𝒖)=𝑾𝒓+𝒓net𝑾netT𝒆+β𝒆* that is uniformly bounded for all t and β. For this we assume that the transfer function ρ(u) is non-negative, monotonically increasing, and bounded, that its derivative ρ(u) is bounded too, and that the input rates 𝒓in and the target potentials 𝒖𝒐* are also uniformly bounded. To show that under these conditions we always find a uniformly bounded solution 𝒖(t), we first consider the case where the output voltages are clamped to the target, 𝒖𝒐=𝒖𝒐* such that 𝒆*=0. For simplicity, we assume that ρ(u)=0 for |u|c0. For voltages 𝒖 with 𝒖ic0 the recurrent input current 𝑾𝒓 is bounded, say |(𝑾𝒓)j|c1 for some c1>c0. When including the error term 𝒓net𝑾netT𝒆, the total current still remains uniformly bounded, say |𝑭(𝒖)j|c2 for all 𝒖 with 𝒖ic0. Because for larger voltages 𝒖i>c0 the error term vanishes due to a vanishing derivative ρ(𝒖i)=0, the mapping 𝑭(𝒖) maps the c2-box 𝒖 (for which |𝒖i|c2) onto itself. Brouwer’s fixed point theorem then tells us that there is a fixed point 𝒖=𝑭(𝒖) within the c2-box. The theorem requires the continuity of 𝑭, and this is assured if the neuronal transfer function r=ρ(u) is continuous.

We next relax the voltages of the output neurons from their clamped stage, 𝒖𝒐=𝒖𝒐*. Remember that these voltages satisfy 𝒖𝒐=(𝑾𝒓+𝒓net𝑾netT𝒆+β𝒆*)𝒐=𝑭(𝒖)𝒐 at any time t. We determine the correction term β𝒆𝒐* such that in the limit β we get 𝒖𝒐=𝑭(𝒖)𝒐=𝒖𝒐*. The correction remains finite, and in the limit must be equal to limββ𝒆𝒐*=𝒖𝒐*(𝑾𝒓+𝒓net𝑾netT𝒆)𝒐. For arbitrary large nudging strength β, the output voltage 𝒖𝒐 deviates arbitrary little from the target voltage, 𝒖𝒐=𝒖𝒐*+o(1/β), with target error 𝒆𝒐*=1β(𝒖𝑾𝒓𝒓net𝑾netT𝒆)𝒐 shrinking like c2/β. Likewise, also for non-output neurons i, the self-consistency solution 𝒖i=𝑭(𝒖)i deviates arbitrarily little from the solution of the clamped state. To ensure the smooth drift of the fixed point while 1/β deviates from 0 we require that the Jacobian of 𝑭 at the fixed point is invertible.

Because the output 𝒆𝒐* shrinks with 1/β, the cost shrinks quadratically with increasing nudging strength, C=12𝒆*2=o(1β2), and hence the cost term β2𝒆*2 that enters in L=EM+β2𝒆*2 vanishes in the limit β. In this large β limit, where 𝒆𝒐*=0 and hence the outputs are clamped, 𝒖𝒐=𝒖𝒐*, the Lagrangian reduces to the mismatch energy, L=EM. Along the least-action trajectories, we, therefore, get 𝑾˙L𝑾=dLd𝑾=dEMd𝑾. The first equality uses Equation 28, and the second uses L=EM just derived for β=. This is a statement (i) of Theorem 1. In the case of successful learning, EM=0, we also conclude that the cost vanishes, C=0. This is the case because EM=0 implies EMo=0 for all output neurons o. Since EMo=12𝒆o2=12(𝒓net𝑾netT𝒆+β𝒆*)o2, we conclude that 𝒆o=0, and if the output neurons do not feed back to the network (which we can assume without loss of generality), we conclude that 𝒆o*=0.

(ii) To consider the case of small β, we use that the cost C can be expressed as C=Lβ. This is a direct consequence of how C enters in L=12𝒆2+β2C, see Equation 16 and Scellier and Bengio, 2017. We now put this together with Equation 28 and the finding that Lβ=dLdβ. Since for the Lipschitz continuous function L in u, W, and β (L is even smooth in these arguments), the total derivatives interchange (which is a consequence of the Moore-Osgood theorem applied to the limits of the difference quotients), we then get at any t,

(29) dCdW=ddWLβ=ddWdLdβ=ddβdLdW=ddβLW=ddβe¯r¯T.

The last expression is calculated from the specific form of the Lagrangian Equation 17, using that by definition 𝒆=𝒖𝑾𝒓.

Finally, in the absence of output nudging, β=0, we can assume vanishing errors, 𝒆=0, as they solve the self-consistency equation, 𝒆=𝒓net𝑾netT𝒆 for all t, see Equation 27. For these solutions we have 𝒆𝒓T|β=0=0. Writing out the total derivative of the function 𝒈(β)=𝒆𝒓T with respect to β at β=0 as limit of the difference quotient, d𝒈(β)dβ|β=0=limβ01β(𝒈(β)𝒈(0))=limβ01β𝒈(β), using that 𝒈(0)=0, we calculate at any t,

(30) de¯r¯Tdβ|β=0=limβ01β(e¯r¯Te¯r¯T|β=0)=limβ01βe¯r¯T.

Here, we assume that 𝒆𝒓T is evaluated at β>0 (that itself approaches 0), while 𝒆𝒓T|β=0 is evaluated at β=0. Combining Equations 29 and 30 yields the cost gradient at any t,

(31) dCdW=limβ01βe¯r¯T.

This justifies the gradient learning rule 𝑾˙ in Equation 27. Learning is stochastic gradient descent on the expected cost, where stochasticity enters in the randomization of the stimulus and target sequences 𝒓in(t) and 𝒖*(t). For the regularity statement, see ‘From implicit to explicit differential equations’ in the sec:Integration. Notice that this proof works for a very general form of the Lagrangian L, until the specific expression for L𝑾. For a proof in terms of partial derivatives only, see Appendix 8, and for a primer on partial and total derivatives see Appendix 7.

Instantaneous gradient descent on C(𝒖𝒐*,𝒓in)

The cost C=12𝒖𝒐*𝒖𝒐2 at each time t is a function of the voltage 𝒖𝒐 of the output neurons and the corresponding targets. In a feedforward network, due to the instantaneity of the voltage propagation Equation 23, 𝒖𝒐 is in the absence of output nudging (β=0) an instantaneous function of the voltage at the first layer, 𝒖1(t)=𝑾in𝒓in(t)+𝒖1(t0)ett0τ. For initialisation at t0=, the second term vanishes for all t and hence 𝒖1(t)=𝑾in𝒓in(t). The output voltage 𝒖𝒐(t), therefore, becomes a function 𝑭W of the low-pass filtered input rate 𝒓in(t) that captures the instantaneous network mapping, 𝒖𝒐(t)=𝑭W(𝒓in(t)), and with this the cost also becomes an instantaneous function of 𝒓in and 𝒖𝒐*, namely C(t)=12𝒖𝒐*(t)𝒖𝒐(t)2=12𝒖𝒐*(t)𝑭W(𝒓in(t))2.

For a general network, again assuming t0=, the voltage is determined by the vanishing gradient L𝒖=𝒇(𝒖,t)=𝒖𝑾𝒓(𝒖)𝒆(𝒖)=0 with 𝒆=𝝐β𝒆*, see Equation 21. For the inclusive treatment of the initial transient see Appendix 3 and Appendix 4. Remember that 𝒓=(𝒓in,𝒓net(𝒖))T and 𝒆*=𝒖𝒐*𝒖𝒐. For a given 𝒓in and 𝒖𝒐* at time t, the equation 𝒇(𝒖,t)=0 can be locally solved for 𝒖 if the Hessian 𝑯=2L𝒖2=𝒇𝒖=1𝑾net𝝆𝒆 is invertible, 𝒖=𝑭(𝒓in,𝒖𝒐*). This mapping can be restricted to the output voltages 𝒖𝒐 on the left-hand side, while replacing 𝒖𝒐*=𝒖𝒐+𝒆𝒐* in the argument on the right-hand side (even if this again introduces 𝒖𝒐 there). With this, we obtain the instantaneous mapping 𝒖𝒐(t)=𝑭W(𝒓in(t),𝒆𝒐*(t)) from the low-pass filtered input and the output error to the output itself. Notice that for functional feedforward network, the network weight matrix 𝑾net is lower triangular, and for small enough β the Hessian 𝑯 is, therefore, always positive definite (see also Methods, Sect. From implicit to explicit differential equations).

Proving theorem 2 (rt-DeEL)

Here, we restrict ourselves to layered network architectures. To prove Theorem 2 first assume that interneurons receive no nudging (βI=0) and only the lateral interneuron-to-pyramidal weights 𝑾PIl are plastic. This is already sufficient to prove the rt-DeEL theorem. Yet, simulations showed that learning the lateral pyramidal-to-interneuron weights 𝑾IPl via top-down nudging, so that the interneuron activity mimics the upper layer pyramidal neuron activity, helps in learning a correct error representation. We consider this case of learning 𝑾IPl later.

If the microcircuits is ought to correctly implement error backpropagation, all local prediction errors 𝒆l must vanish in the absence of output nudging (β=0) as there is no target error. Consequently, any remaining errors in the network are caused by a misalignment of the lateral microcircuit. We show how learning the interneuron-to-pyramidal weights 𝑾PIl corrects for such misalignments.

To define the gradient descent plasticity of the weights WPIl from the interneurons to the pyramidal neurons, we consider the apical error formed by the difference of top-down input and interneuron input, 𝒆lA=𝑩l𝒖l+1𝑾PIl𝒖Il, and define the apical mismatch energy as ElPI=12𝒆lA2. Gradient descent along this energy with respect to WPIl yields

(32) W˙lPI=ηPIelAulIT=ηPI(Blul+1WlPIulI)ulIT

evaluated online while presenting input patterns from the data distribution to the network. We assume that the apical contribution to the somatic voltage is further modulated by the somatic spike rate, 𝒓l𝒆lA. After successful learning, the top-down input 𝑩l𝒖l+1 is fully subtracted away by the lateral input in the apical compartment, and we have

(33) Blul+1=WlPIulI.

Once this condition is reached, the network achieves a state where, over the activity space spanned by the data, top-down prediction errors throughout the network vanish,

(34) e¯l=r¯le¯lA=r¯l(Blul+1WlPIulI)=0.

We show that this top-down prediction error, after the successful learning of the microcircuit, shares the properties of error-backpropagation for a suitable backprojection weights 𝑩.

Due to the vanishing prediction errors, pyramidal cells only receive bottom-up input 𝒖l+1=𝑾l+1𝒓l. Using this expression as well as the expression for interneuron membrane potentials without top-down nudging (βI=0 in Equation 9), 𝒖Il=𝑾IPl𝒓l, and plugging both into Equation 33, we get

(35) BlWl+1r¯l=WlPIWlIPr¯l.

Assuming that 𝑾IPl has full rank, and the low-pass filtered rates 𝒓l span the full nl dimensions of layer l when sampled across the data set, we conclude that

(36) BlWl+1=WlPIWlIP.

In other words, the loop via upper layer and back is learned to be matched by a lateral loop through the interneurons.

Equation 36 imposes a restriction on the minimal number of interneurons nlI at layer l. In fact, the matrix product 𝑩l𝑾l+1 maps a nl-dimensional space onto itself via nl+1-dimensional space. The maximal rank of the this matrix product is limited by the smallest dimension, i.e., rank(𝑩l𝑾l+1)min(nl,nl+1). Analogously, rank(𝑾PIl𝑾IPl)min(nl,nlI). But since the two ranks are the same according to Equation 36, we conclude that in general nlImin(nl,nl+1) must hold, i.e., there should be at least as many interneurons at layer l as the lowest number of pyramidal neurons at either layer l or l+1. Note that by choosing nlI=nl+1 as in Sacramento et al., 2018 (or nlI>nl+1 as in this work), the conditions is fulfilled.

With 𝒖Il=𝑾IPl𝒓l and Equation 36, the top-down prediction error from Equation 34, in the presence of output nudging (β>0), can be written in the backpropagation form

(37a) e¯l=r¯l(Blul+1WlPIulI)=r¯l(Blul+1WlPIWlIPr¯l)
(37b) =r¯l(Blul+1BlWl+1r¯l)=r¯lBl(ul+1Wl+1r¯l)
(37c) =r¯lBle¯l+1=r¯lBlr¯l+1e¯l+1A.

Finally, the simulations showed that learning the lateral weights in the microcircuit greatly benefits from also adapting the pyramidal-to-interneuron weights 𝑾IP by gradient descent on EIP=12l𝒖Il𝑾IPl𝒓l2, using top-down nudging of the inhibitory neurons (βI>0),

(38) W˙lIP=ηIP(ulIWlIPr¯l)r¯lT.

After learning we have 𝒖Il=𝑾IPl𝒓l, and plugging in 𝒖Il=(1βI)𝑾IPl𝒓l+βI𝑩IPl𝒖l+1 (Equation 9), we obtain 𝑾IPl𝒓l=𝑩IPl𝒖l+1. Since 𝒖l+1=𝑾l+1𝒓l, we conclude as before,

(39) WlIP=BlIPWl+1.

The top-down weights BIPl that nudge the lower-layer interneurons has randomized entries and may be considered as full rank. If there are less pyramidal neurons in the upper layer than interneurons in the lower layer, BIPl selects a subspace in the interneuron space of dimension nl+1<nlI. This seems to simplify the learning of the interneuron-to-pyramidal cell connections WPI. In fact, this learning now has only to match the nl+1-dimensional interneuron subspace embedded in nlI dimensions to an equal (nl+1-)dimensional pyramidal cell subspace emedded in nl dimensions.

Learning of the interneuron-to-pyramidal cell connections works with the interneuron nudging as before, and combining Equations 36 with 39 yields the ‘loop consistency’

(40) BlWl+1=WlPIBlIPWl+1.

The learning of the microcircuit was described in the absence of output nudging. Conceptually, this is not a problem as one could introduce a pre-learning phase where the lateral connections are first correctly aligned before learning of the feedforward weights begins. In simulations we find that both the lateral connections as well as the forward connections can be trained simultaneously, without the need for such a pre-learning phase. We conjecture that this is due to the fact that our plasticity rules are gradient descent on the energy functions L, EPI, and EIP, respectively.

From implicit to explicit differential equations

The voltage dynamics is solved by a forward-Euler scheme 𝒖(t+dt)=𝒖(t)+𝒖˙(t)dt. The derivative 𝒖˙(t) is calculated either through (i) the implicit differential Equation 7a yielding τ𝒖˙(t)=𝒉(𝒖(t),𝒖˙(tdt)), or (ii) by isolating 𝒖˙(t) and solving for the explicit differential equation τ𝒖˙(t)=𝒈(𝒖(t)), as explained in Appendix 3 (after Equation 51).

(i) The implicit differential equation, τ𝒖˙(t)=𝒖(t)+𝑾𝒓(t)+𝒆(t), see Equation 22, is iteratively solved by assigning 𝒓(t)=ρ(𝒖(t))+ρ(𝒖(t))𝒖˙(tdt) and calculating the error 𝒆(t)=𝒆(t)+τ𝒆˙(t) with 𝒆(𝒖)=ρ(𝒖)𝑾netT(𝒖𝑾netρ(𝒖)𝑾in𝒓in)+β𝒆* and 𝒆˙(t)=𝒆(𝒖(t))𝒖˙(tdt).

This iteration exponentially converges to a fixed point u˙(t) on a time scale dt1k, where 1k>0 is the smallest Eigenvalue of the Hessian 𝑯=2L𝒖2=1𝑾net𝝆𝒆, see Appendix 3.

(ii) The explicit differential equation is obtained by eliminating the 𝒖˙ from the right-hand side of the implicit differential equation. Since 𝒖˙ enters linearly we get τ𝑯𝒖˙=𝒇τ𝒇t with 𝒇(𝒖,t)=L𝒖=𝒖𝑾𝒓𝝐β𝒆*. The explicit form is obtained by matrix inversion, 𝒖˙=𝒈(𝒖,t)=1τ𝑯1(𝒇+τ𝒇t), as the Hessian is invertible if it is strictly positive definite (which is typically the case, see Appendix 3, after Equation 48). The external input and the target enter through 𝒇t=𝑾in𝒓˙in+β𝒖˙𝒐*, where the derivative of the target voltage is only added for the output neurons 𝒐. This explicit differential equation is shown to be contractive in the sense that for each input trajectory 𝒓in(t) and target trajectory 𝒖*(t), the voltage trajectory 𝒖(t) is locally attracting for neighbouring trajectories. This local attracting trajectory is the vanishing-gradient trajectory 𝒇(𝒖,t)=0, and the gradient remains 0 even if the input contains delta-functions, see Appendix 4.

Moving and latent equilibria: a formal definition

We showed that the motor output (𝒖𝒐), together with the low-pass filtered sensory input (𝒓in) and the motor feedback (𝒆𝒐*) is in a moving equilibrium, 𝒖𝒐=𝑭W(𝒓in,𝒆𝒐*), see Figure 3a. In general, a dynamical system in 𝒖 that is given in an implicit form 𝑮(𝒙,𝒙˙,𝒖,𝒖˙)=0 with external inputs (𝒙,𝒙˙) is said to be in a moving equilibrium if the variable 𝒖 is an instantaneous function of the input 𝒙 at any point in time, 𝒖=𝑭(𝒙). The fact that the implicit differential equation 𝑮=0 represents a dynamical system in 𝒖 implies that, in principle, it has a representation in the explicit form 𝒖˙=𝒈(𝒖,𝒙,𝒙˙), guaranteed by an invertible Jacobian 𝑮𝒖˙.

Our example is obtained from 𝑮=(1+𝝉ddt)𝒇 with 𝒇(𝒖,𝒙)=L𝒖 and 𝒙=(𝒓in,𝒆𝒐*), leading to 𝒙+𝝉𝒙˙=(𝒓in,𝒆𝒐*). Given the paramterization of 𝑮 by the weights, we get the parametrized function 𝒖=𝑭W(𝒙), and this is restricted to the output components 𝒖𝒐 of 𝒖. The condition on the Jacobian translates to 𝑮𝒖˙=𝒇𝒖=2L𝒖2 being invertible. Crucially, the description of the dynamics in the biological or physical substrate is not given in its explicit form 𝒖˙=𝒈(𝒖,𝒙,𝒙˙). However, it is given in an implicit form expressed as 𝒖˙=𝒉(𝒙,𝒙˙,𝒖,𝒖˙), where 𝒖˙ still appears on the right-hand side. This ‘hybrid’ form is directly solved either in real time by the biophysical substrate itself, or by the forward-Euler scheme on clocked hardware, see (i) above. Notice that moving equilibria 𝒖=𝑭W(𝒙) with 𝒙=(𝒓in,𝒆𝒐*) are able to capture complex temporal processing of the instantaneous input 𝒓in. In fact, the low-pass filtering 𝒓in can be obtained on various time scales through different τin’s, and 𝑭W for a general network 𝑾 can be arbitrary complex. The task is to adapt 𝑾 such that the ‘hybrid’ dynamical system eventually implements the target mapping 𝒖𝒐*=𝑭*(𝒙).

The Latent Equilibrium (Haider et al., 2021) can be analogously formalized as a dynamical system in 𝒖, implicitly given by 𝑮(𝒙,𝒖,𝒖˙)=0, and having a solution of the form 𝒖+𝝉𝒖˙=𝑭(𝒙). Abbreviating again 𝒇(𝒖,𝒙)=L𝒖 with the same Lagrangian L=12𝒖𝑾ρ(𝒖)2+β2C as in the present NLA, the Latent Equilibrium is obtained for 𝑮(𝒙,𝒖,𝒖˙)=𝒇(𝒖+𝝉𝒖˙,𝒙). The solution implies that the rate 𝒓=ρ(𝒖+𝝉𝒖˙)=ρ(𝑭(𝒙)) is an instantaneous function of 𝒙=(𝒓in,𝒆𝒐*), here without low-pass filtering. As for moving equilibria, the crucial point is that the biophysical substrate implements a hybrid form of the dynamical system, now 𝒖˙=𝒉(𝒙,𝒖,𝒖˙), that is implicitly solved by the analog substrate, and also allows for a solution in clocked hardware. For an extended stability analysis of the moving and latent equilibria see Appendix 4.

Simulation details

Solving the explicit differential equation seems to be more robust when the learning rate for 𝑾˙ gets larger. The explicit form is also less sensitive to large Euler steps dt, see Appendix 3. By this reason, the ordinary differential equations (ODE) were solved in the explicit form when including plasticity 𝑾˙. The algorithms are summarized as follows, once without interneurons (Algorithm 1), and once with interneurons (Algorithm 2):

Algorithm 1. with projection neurons only, for Figures 3 and 4 (using the explicit ODE, i.e., Step 12 instead of 11)
1: current state: u(t), W(t)
2: # consider full vectors and matrices (padded with 0’s for feedforward networks)
3:# drop time argument (t) for convenience
4: r¯netρ(u), r¯(r¯in,r¯net)T , W(Win,Wnet)
5: calculate weight derivatives
6: W˙η(uWr¯)r¯T
7: calculate low-pass-filtered errors
8: e¯ououo, e¯i=0 for non-output neurons
9: e¯r¯netWnetT(uWr¯)+βe¯
10: calculate temporal voltage derivatives either implicitly (11) or explicitly (12)
11: Implicit: τu˙u+W(r¯+τr¯˙)+(e¯+τe¯˙)
12: Explicit: fuWr¯e¯ , Hfu , u˙ solve τH(u)u˙=fτft via Cholesky decomposition
13: update voltage and weights
14: uu+u˙dt, WW+W˙dt
Algorithm 2. including plastic interneurons, for Figure 5 (using the explicit ODE, i.e., Step 13 instead of 12)
1: current state: u(t), W(t),uI(t),WPI(t),WIP(t)
2: # consider full vectors and matrices and drop time argument as in Algorithm 1
3: r¯(r¯in,ρ(u))T
4: calculate weight derivatives
5: W˙η(uWr¯)r¯T
6: W˙PIηPI(BuWPIuI)uIT
7: W˙IPηIP(uIWIPr¯)r¯T
8: calculate low-pass filtered errors
9: e¯ououo (e¯i=0 for non-output neurons i)
10: e¯BuWPIuI+βe¯ (Bo,:=Wo,:PI=0 for output neurons o)
11: calculate temporal voltage derivatives either implicitly (12) or explicitly (13)
12: Implicit: τu˙u+W(r¯+τr¯˙)+(e¯+τe¯˙)
13: Explicit: fuWr¯e¯ , Hfu , u˙ solve τH(u)u˙=fτft via Cholesky decomposition
14: update network state \ForX{u,W,WPI,WIP}
15: XX+X˙dt End For
16: uI(1βI)WIPr¯+βIBIPu

Details for Figure 3b

Color coded snapshot of cortical local field potentials (LFPs) in a human brain from 56 deep iEEG electrodes at various locations, converted with the sigmoidal voltage-to-rate function r(u)=11+eu and plotted onto a standard Talairach Brain (Talairach and Tournoux, 1988). The iEEG data is from a patient with pharmacoresistant epilepsy and electrodes implanted during presurgical evaluation, extracted from the data release of Burrello et al., 2019. The locations of the electrodes are chosen in accordance with plausibilty, as the original positions of the electrodes were omitted due to ethical standards to prevent patient identification.

Details for Figure 3c

Simulations of the voltage dynamics (Equation 7a) and weight dynamics (Equation 8), with learning rate η=103, step size dt= 1ms for the forward Euler integration, membrane time constant τ=10ms and logistic activation function. Weights were initialized randomly from a normal distribution N(0,0.12) with a cut-off at ± 0.3. The number of neurons in the network N was n=96, among them 56 output neurons ON that were simultaneously nudged, and 40 hidden neurons. During training, all output neurons were nudged simultaneously (with β=0.1), whereas during testing, only 42 out of 56 neurons were nudged, the remaining 14 left to reproduce the traces. Data points of the iEEG signal were sampled with a frequency of 512 Hz. For simplicity, we, therefore, assumed that successive data points are separated by 2ms, and up-sampled the signal via simple interpolation to 1 ms resolution as required by our integration scheme. Furthermore, the raw values were normalized by dividing them by a factor of 200 to ensure that they are approximately in a range of ±1–2. Training and testing was done on two separate 8 s traces of the iEEG recording. Same data as in Figure 3b1.

Details for Figure 4

Simulation of the neuronal and synaptic dynamics as given by Equation 8, Equation 7a, Equation 7b. For 5 ms, 10 ms, and 50 ms presentation time, we used an integration step size of dt=0.05ms, dt=0.1ms and dt=0.5ms, respectively (and dt=1ms otherwise). As an activation function, we used the step-linear function (hard sigmoidal) with r(u)=0 for u0, r(u)=1 for u1 and r(u)=u in between. The learning rate was initially set to η=103 and then reduced to η=104 after 22,000 s. The nudging strength was β=0.1 and the membrane time constant τ=10ms. In these simulations (and only for these) we assumed that at each presynaptic layer l=0,1,..,n1 there is a first neuron indexed by 0 that fires with constant rate rl,0=1, effectively allowing the postsynaptic neurons 𝒓l+1 to learn a bias through the first column of the weight matrix 𝑾l+1. Weights were initialized randomly from a normal distribution N(0,0.012) with a cut-off at ±0.03. For an algorithmic conversion see the scheme below. In Figure 4c1, ‘rt-DeEP w/o lookahead’ is based on the dynamics τ𝒖˙=𝒖+𝑾𝒓+𝒆. For ‘𝒖˙ w/o error + backprop,’ we use τ𝒖˙=𝒖+𝑾𝒓 as the forward model (so without error terms on the membrane potential, but a prospective 𝒓), and calculate weight updates using error backpropagation. In 4c2, we provide three controls: the test error for (i) a standard shallow artificial neural network trained on MNIST (black dashed line), (ii) rt-DeEP without prospective coding (as in Figure 4c1), but in Figure 4c2 with plasticity only turned on when the network is completely stationary, i.e., after waiting for several 100ms, such that synaptic weights are not changed during transients (orange dashed line, denoted by ‘w/o transients’), and (iii) an equivalent artificial neural network, 𝒖l=𝑾l𝒓l1, trained using error backpropagation (black dashed line, ‘standard backprop’).

Details for Figure 5

Simulation of neuronal and synaptic dynamics with plastic microcircuit, i.e., the pyramidal-to-interneuron and lateral weights of the microcircuit learned during training.

For the results shown in Figure 5c2, the following parameters were used. As an activation function, we used a hard sigmoid function and the membrane time constant was set to τ=10 ms. Image presentation time is 100ms. Forward, pyramidal-to-interneuron and interneuron-to-pyramidal weights were initialized randomly from a normal distribution N(0,0.012) with a cut-off at ±0.03. All learning rates were chosen equal η=103 and were subsequently reduced to η=104 after 22,000 s training time. The nudging parameters were set to β=0.1 and βI=0.11.1. The feedback connections 𝑩l and the nudging matrices 𝑩lIP were initialized randomly from a normal distribution 5N(0,0.012) with a cut-off at ±0.15. The used integration step size was dt=0.25 ms. All weights were trained simultaneously. For an algorithmic conversion see the scheme below. The interneuron membrane potential was calculated by Equation 9 with a linear transfer function.

Appendix 1

Threshold-linear transfer functions

There is an important special case where the presynaptic voltage error can directly be extracted from presynaptic firing rates, without need to invert the transfer function via synaptic depression as shown below. This is the case when voltage errors in the upper layers are small, and the voltage-to-rate transfer function has derivatives ρ=0 or 1, so that 𝒓l+1=(𝒓l+1)2. The condition is satisfied, for instance, for a doubly threshold-linear function (a ‘doubly rectified linear unit’, ReLu) defined by ρ(u˘)=0 for u˘<0 and ρ(u˘)=u˘ for 0u˘rmax, while ρ(u˘)=rmax for larger voltages. In this case we calculate

(41a) e¯l+1=r¯l+1e¯l+1A=(r¯l+1)2e¯l+1A=r¯l+1e¯l+1=r¯l+1(ul+1Wl+1r¯l)
(41b) ρ(ul+1)ρ(Wl+1r¯l)=r¯l+1ρ(Wl+1r¯l).

The approximation uses the Taylor expansion in 𝒖l+1 and assumes that 𝒆l+1 is small. The crucial point of Equation 41a is that the mismatch error defined on the voltage, 𝒆l+1=𝒖l+1𝑾l+1𝒓l, can be factorized into a product of the postsynaptic rate derivative, 𝒓l+1, and the apical error, and hence it can be expressed as an error defined on the rate. Restricted to the segment 0𝒖l+1𝒓max where the transfer function is linear and errors do not vanish, the same microcircuit delivers the feedback 𝑩l𝒓l+1 to the apical tree through the top-down projections, and 𝑩lρ(𝑾l+1𝒓l) through the lateral connections from the interneurons. While the plasticity rules for 𝑾PIl and 𝑾IPl stay the same, the top-down nudging of the interneurons, see Equation 9, can then be formulated based on the rate instead of the upper layer voltage, 𝒖Il=(1βI)𝑾IPl𝒓l+βI𝑩IPl𝒓l+1, with transfer function of the interneuron again the (doubly) threshold-linear ρ(𝒖Il). Since voltages and rates are identical in the segment 0𝒖l+1𝒓max for each component, Equations 36 with 39 can still be inferred.

Rate-to-voltage inversion by short-term synaptic depression

We wish to readout the voltage error also for other nonlinear transfer functions than clipped ReLu’s. To do so, we take inspiration from the classical short-term synaptic depression model (Tsodyks and Markram, 1997; Abbott et al., 1997; Varela et al., 1997), see also Appendix 1—figure 1a1 (that deviates from the release-independent synaptic depression, see Campagnola et al., 2022). We consider a dynamic vesicle release probability that is proportional to the pool size of available vesicles, v(r), and this pool size is postulated to depend on past presynaptic rates,

(42) p(release|r¯)v(r¯)=1+a1+dr¯,

where r is the low-pass filtered presynaptic rate, a and d are constants. The proportionality factor is 11+a, making a probability out of the vesicle pools size. The effective synaptic strength B of a ‘backprojecting top-down’ connection is the product of the absolute synaptic strength B and the vesicle pool size v, i.e., B=Bv(r). The contribution to the postsynaptic current of the synapse is Wr, and the contribution to the postsynaptic voltage is Wr.

We search for an activation function r(u) such that the postsynaptic voltage contribution is the scaled presynaptic potential, Br(u)=Bu. Plugging in the above expression for B yields Br(u)=Bv(r)r(u)=Bu, and dividing B out, v(r)r(u)=u, see Appendix 1—figure 1a2. With the expression for v(r) in Equation 42 we obtain a quadratic equation in r that is solved by the non-negative and monotonically increasing function

(43) r¯=12(uθ+(uθ)2+4u/d)0, for u0,

with ‘smooth’ threshold θ=(1+a)/d and asymptote ruθ. This gives us a transfer function r(u) that qualitatively matches those observed for pyramidal neurons recorded in the steady state (Anderson et al., 2000; Rauch et al., 2003), see Appendix 1—figure 1a3.

The approach generalizes to other pairs of strictly monotonic neuronal activation and depression functions {r(u),v(r)}, as long as u is not driven below some minimal value, here urest=0, corresponding to r=0. The last requirement can, for instance, be accomplished by offsetting the activation function into a regime that guarantees that u stays positive.

In our simulations for Figure 5, we did not explicitly implement a dynamic vesicle pool, i.e., the right-hand side of v(r)r=u, but instead directly used the recovered membrane potentials u.

Appendix 1—figure 1
Recovering presynaptic potentials through short term depression.

(a1) Relative voltage response of a depressing cortical synapse recreated from Abbott et al., 1997, identified as synaptic release probability p. (a2) The product of the low-pass filtered presynaptic firing rate r¯(u) times the synaptic release probability is p(r¯) proportional to the presynaptic membrane potential, p(r¯)r¯u. (a3) Average in vivo firing rate of a neuron in the visual cortex as a function of the somatic membrane potential recreated from Anderson et al., 2000, which can be qualitatively identified as the stationary rate r¯(u) derived in Equation 43.

Appendix 2

Looking back and forward in time with derivatives

Since dealing with extrapolations into the future is a crucial notion of the paper we present here some of the calculations. The discounted future voltage was introduced in Equation 4 as

u~(t)=1τtu(t)ettτdt.

To show that u~ satisfies u=u~τu~˙, we need to apply the Leibniz integral rule in calculating the derivative u~˙. This leads to

du~dt(t)=1τu(t)+1τtu(t)1τettτdt.

Multiplying this equation by τ and using the definition of u~ yields τu~˙(t)=u(t)+u~(t), or u=u~τu~˙.

By applying the Leibniz integral rule one also shows that x, defined in Equation 15,

x¯(t)=1τtx(t)ettτdt,

solves τx˙(t)=x(t)+x(t). This differential equation can be written as (1+τddt)x=x, with lookahead operator (1+τddt) defined in Equation 14. To show that (1+τddt)x=x+τx˙=x, one applies partial integration to x˙(t). Note that the equality x+τx˙=x+τx˙=x only holds if we integrate from , and hence if the initialization of the trajectory is far back in the past as compared to the time constant τ.

Uniqueness of the rate function

In the main text we concluded from the postulate r=ρ(u)+τρ˙(u) and the general relation r=r+τr˙ that r(t)=ρ(u(t)). This conclusion is a consequence of the uniqueness of a solution of an ordinary differential equation for a given initial condition that may include delta-functions on the right-hand side, see e.g., Nedeljkov and Oberguggenberger, 2012. In fact, we may consider both variables ρ and r as solutions of the differential equation τx˙=x+r, with x either ρ or r. Because the solution is unique, we conclude that ρ=r.

Learning the input time constants

In our applications we assumed that the input rates in the original mapping 𝒖𝒐*(t)=𝑭*(𝒓in(t)) are low-pass filtered by a common time constant τin(t),i*=τ that is also shared as membrane time constant of the neurons. But in general, we may consider a subclass of network neurons (with voltage vector 𝒖1) that receive exclusively input rates 𝒓in and integrate these rates with corresponding time constants 𝝉in. The prospective rates of these neurons, 𝒓1=ρ(𝒖1)+τρ˙(𝒖1), are then produced by the same voltage time constant τ of the down-stream neurons. The downstream voltage integration then cancels the presynaptic look-ahead in the presynaptic firing rate, leading again to an instantaneous voltage propagation across the entire network.

The general setting of learning to map time series is that input time series, 𝒓in(t) are low-pass filtered with given time constants 𝝉in*, and the target output time series 𝒖𝒐*(t) are a function of these low-pass filtered inputs, 𝒖𝒐*(t)=𝑭𝒐*(𝒓in𝝉in*(t)). In the student network, the input time constants appear as parameters that can be learned by gradient descent to match the output of the teacher network, just as the synaptic weights are learned.

We ask for a learning rule so that the student network can extract the input time constants 𝝉in* used by the teacher network. For this, we assume that only the neurons 𝒖1 obtain the external input and hence carry the time constant 𝝉in. Because 𝒓in𝝉in=𝒓in𝝉in𝒓˙in𝝉in, the gradient rule for the input time constant is

(44) 𝝉˙in=L𝝉in=L𝒓in𝝉inT𝒓in𝝉in𝝉in=𝑾inT(𝒖1𝑾in𝒓in𝝉in)T𝒓˙in𝝉in.

This local learning rule also globally reduces the Lagrangian, and in the limits of β it is gradient descent on the mismatch energy, while in the limit β0 it is gradient descent on the cost. The proof works as in Theorem 1. To learn a more complex mapping of time series that includes more complex temporal processing beyond a function of merely 𝒖𝒐*(t)=𝑭𝒐*(𝒓in𝝉in*(t)), additional variables need to be introduced that form memories (see Appendix 6).

Appendix 3

From the implicit to the explicit differential equation

Global convergence to a moving equilibrium (i.e. to L𝒖=0)

The original Euler-Lagrange equation is (1+τddt)L𝒖=0, 20, with Jacobian L𝒖=𝒇(𝒖,t)=𝒖𝑾𝒓𝝐β𝒆*. Remember the definition of 𝝐=𝒓net𝑾netT𝒆 given in 21. While 𝒆=𝒖𝑾𝒓 is an error that may not vanish, the ‘corrected error’ 𝒇=L𝒖 is shown to vanish exponentially quick, leading to the moving equilibrium with 𝒖=𝑾𝒓𝝐β𝒆*. In fact, (1+τddt)L𝒖=𝒇+τ𝒇˙=0, has the general solution

(45) f(u(t),t)=f(u0,t0)ett0τ0

as an exponentially decaying function in t with 𝒖(t0)=𝒖0. Note that the argument 𝒖 of the function 𝒇(𝒖,t)=𝒖𝑾𝒓𝝐β𝒆* does apriori not depend on time. It only becomes a function of time, 𝒖(t), by requiring that 𝒇+τ𝒇˙=0. The voltage 𝒖(t) becomes a non-trivial function of time because the time-dependence of the function 𝒇(𝒖,t), that apriori enters only through the second argument, expresses the temporally varying input 𝒓in(t), including a possible target output 𝒖𝒐*(t).

Implicit differential equation

To make the dependence of 𝒖˙ on the right-hand-side of the implicit differential Equation 22 more transparent, we rewrite this in the form

(46) u˙=fτft+u(Wr¯+ϵ¯+βe¯)u˙=fτft+Mu˙=K(rin,uo,u,u˙).

The partial derivative of 𝒇(𝒖,t) with respect to time, 𝒇t, captures the external drive from the inputs and output targets that do not depend on 𝒖, but may directly depend on time. In fact, instead of the argument t of 𝒇, we could consider the two arguments 𝒓in and 𝒖𝒐*, so that the partial derivative of 𝒇(𝒖,t)=𝒇(𝒖,𝒓in,𝒖𝒐*) with respect to t can be written as

(47) fit=fir¯inr¯˙in+fiuou˙o=Wi,inr¯˙in+βδiou˙o,

where δio is the Kronecker delta that is equal to 1 if ii is an output neuron, and 0 else. The partial derivatives of 𝒇 with respect to 𝒖 represents the (symmetric) Hessian of the Lagrangian, Hij=fiuj=2Luiuj=δijuj(𝑾𝒓+𝝐+β𝒆*)i, with δij being the Kronecker-delta, 𝝐 defined in 21 and 𝒆* defined above 16. Remember that 𝑾=(𝑾in,𝑾net) and 𝒓=(𝒓in,𝒓net)=(𝒓in,𝝆(𝒖)). In vectorial notation the Hessian of the Lagrangian is

(48) H(u)=fu=2Luu=1Wnetdiag(ρ(u))u(ϵ¯+βe¯)(u),

where 1 is the identity matrix. This Hessian is ‘typically’ positive definite, as we argue next. For a functionally feedforward network, the weight matrix 𝑾net is lower triangular, and for small enough β (for which 𝝐 is small) the diagonal elements are all positive and hence the Hessian 𝑯 is invertible. Since 𝑯 is also symmetric, it is, therefore, positive definite. In the general case, we require that 𝑾net has Eigenvalues smaller than 1, since otherwise the recurrent network dynamics may explode. If ρ(𝒖) is also smaller than 1, and β still small enough, we conclude again that the Hessian is invertible (and hence positive definite). Notice that the property of 𝑯=2L𝒖𝒖 being symmetric is a general property of differentiable functions and ways weaker than the requirement that the recurrent weight matrix 𝑾net is symmetric, as classically required for fixed point convergence in Hopfield-type networks.

Since in the implicit differential equation of Equation 46 the 𝒖˙ also arises on the right-hand-side, we need to show that this Equation 46, for fixed (𝒓in,𝒖𝒐*,𝒖), can be solved for 𝒖˙. To find the solution 𝒖˙ for given arguments (𝒓in,𝒖𝒐*,𝒖) in 𝑲, we search for a fixed point of the mapping 𝒖˙(i+1)=𝑲(𝒖˙(i)). In the argument 𝒖˙, the mapping is affine, 𝒖˙(i+1)=𝒇τ𝒇t+𝑴𝒖˙(i), with matrix 𝑴(𝒖)=𝒖(𝑾𝒓+𝝐+β𝒆*) appearing in Equation 46. The Banach fixed point theorem asserts that if 𝑲(𝒖˙) is strictly contracting with k, i.e., if 𝑲(𝒖˙(2))𝑲(𝒖˙(1))2k𝒖˙(2)𝒖˙(1) for all pairs of inputs and 0k<1, then the iteration (here with iteration time step dt) locally converges to a fixed point 𝒖˙=𝑲(𝒖˙). Because 𝑲(𝒖˙(2))𝑲(𝒖˙(1))=𝑴(𝒖˙(2)𝒖˙(1)), the mapping is k-contractive if the eigenvalues of 𝑴 have absolute value smaller than k. This is the case if the Hessian 𝑯(𝒖)=1𝑴(𝒖)=1𝑾netρ(𝒖)𝒆(𝒖), with 𝒆=𝝐β𝒆*, is strictly positive definite (which is ‘typically’ the case, see above).

Stable solution of the implicit differential equation

According to the above reasoning, the Banach fixed point theorem asserts (for the typically positive Hessian) that during convergence of the mapping 𝒖˙(i+1)=𝑲(𝒖˙(i)) the distance to the fixed point is bounded by a constant times ei1kdt, with iteration index i and ‘virtual’ Euler step dt. Formally,

(49) K(u˙(i+1))K(u˙(i))=M(u˙(i+1)u˙(1))ku˙(i+1)u˙(i)c0ei1kdt0.

Crucially, because the mapping is affine in 𝒖˙, the convergence is global in 𝒖˙ while fixing the other arguments (𝒓in,𝒖𝒐*,𝒖) of 𝑲.

In an analog physical device that implements exactly this feedback circuit in continuous time, the dt becomes truly infinitesimal and in this sense the convergence is instantaneous. If dt remains finite, 𝒖˙(i) converges to a moving target because the mapping 𝒖˙(i+1)=𝑲(𝒖˙(i)) changes with time. The target should not move quicker than the time scale dt1k of the 𝒖˙(i) convergence (but, fortunately, this convergence time-scale can be made arbitrarily quick by reducing dt). Given a time course of the input rate 𝒓in(t) and target 𝒖o*(t) that has bounded variation, the dt can be chosen so that convergence becomes arbitrary quick, and in the limit instantaneous. As a consequence, for small enough dt, the 𝒖˙ becomes the same on the left- and right-hand side of the implicit differential Equation 46.

If 𝒓in(t) contains well-separated delta-functions, while otherwise still having bounded variations, the reasoning still applies since at any point in time, except at the time points of the isolated delta-function, 𝒇(𝒖,t)=0. This statement is proven in the Appendix 4.

In practical simulations, a caveat is in place when the learning rate becomes too big and 𝑾˙ starts to change the neuronal dynamics. In this case, the Hessian becomes 𝑯=1(𝑾diag(ρ)+𝑾˙diag(ρ))𝒆, and the Eigenvalues may become negative due to the 𝑾˙ term. Simulations can in fact become unstable with a big learning rate, and this is more pronounced if also the Euler dt is large. The explicit differential equation avoids the fast iteration towards a moving target and hence allows for larger dt. This in particular pays out in the presence of a high learning rate (although the Cholesky decomposition also requires positive definiteness). For this reason, the large-scale simulations involving plasticity are performed with the explicit form described next.

Explicit differential equation

To isolate 𝒖˙ in the implicit differential equation, we rewrite (1+τddt)𝒇=0 again as

(50) τ𝒇˙=τ(𝒇𝒖𝒖˙+𝒇t)=𝒇, or τdfidt=τ(jfiuju˙j+fit)=fi.

Plugging the Hessian 𝑯(𝒖)=𝒇𝒖 from Equation 48 into Equation 50, we obtain the voltage dynamics from Equation 22 in the equivalent form

(51) τH(u)u˙=fτft.

In our applications, the Hessian 𝑯 appears to be invertible (although this may not be the case for arbitrary networks), and Equation 51 can be solved for the unique 𝒖˙ using the Cholesky decompositions. In fact, if 𝑯 is invertible, the system implicit ordinary differential equations from Equation 51 can be converted into a system of explicit ordinary differential equations (with 𝒇(𝒖,t)=L𝒖=𝒖𝑾𝒓𝝐β𝒆* and 𝑯 given in Equation 48),

(52) u˙=g(u,t)=1τH1(u)(f(u)+τft),

and as such it has a unique solution for any given initial condition. Equation 52 represents the explicit solution of the iteration 𝒖˙(i+1)=𝑲(𝒖˙(i)) after the convergence to the fixed point 𝒖˙=𝑲(𝒖˙). The condition for the convergence is the same as the condition for solving for 𝒖˙, namely an invertible (and hence positive definite) Hessian.

Because fit=Wi,in𝒓˙in+βδio𝒖˙𝒐*, see Equation 47, the regularity requirement for τ𝒖˙ to be integrable is satisfied even if 𝒓in(t) and 𝒖𝒐*(t) contain step functions (and 𝒓in(t) delta-functions), see Appendix 4 for details.

Even in the presence of such step-functions, the Euler-Lagrange equations (1+τddt)𝒇=0 lead to an 𝒇 that is a decaying exponential, fi(𝒖(t),t)=ciett0τ, see Equation 45. For initialization at t0= we have 𝒇=L𝒖=0 at any time. In fact, Equation 52 is equivalent to (1+τddt)𝒇=0, and hence any solution of Equation 52, even if 𝒇t contains a delta-function, is also a solution of (1+τddt)𝒇=0. Possible jumps in 𝒓in or 𝒖𝒐* are compensated by the jumps they induce in 𝒖 (see below for the full mathematical description with a simple example). To give an intuition, we assume that a recurrent network of our prospective neurons has separate fixed points for the two constant input currents 𝒓in(1) and 𝒓in(2). This network still shows an overall relaxation time of 𝝉in (but not longer!) when the input rates instantaneously switch from 𝒓in(1) to 𝒓in(2). Nevertheless, at any moment during this relaxation process, gradient learning of the mapping 𝒖𝒐(t)=𝑭𝒐(𝒓in(t)) towards 𝒖𝒐*(t)=𝑭𝒐*(𝒓in(t)) is still guaranteed (Theorem 1).

Link to time-varying optimal control

The explicit differential equation, Equation 52, is a special case of the one in Simonetto et al., 2020, Equation 20, where the function to be minimized (their f) can take a general (Lipschitz continuous) form (hence their f is our Lagrangian, fL). To avoid inverting the Hessian, an iteration algorithm can be applied similar to our implicit form, although more involved to deal with the more general form of L (Simonetto and Dall’Anese, 2017). The idea of tracking the solution of a time-varying optimization problem with a linear look-ahead in time has been introduced introduced in Zhao and Swamy, 1998.

The NLA keeps stability as compared to an infinitesimally fast (τ0) equilibrium propagation

What have we gained as compared to making the time constant in Equilibrium propagation very small? The answer lies in the stability. To make the comparison explicit, we remind us of the Equilibrium Propagation which we formulate in terms of a Lagrangian. In both cases, the NLA and the Equilibrium Propagation, the Lagrangians are identical. We have chosen the notation 𝒓=(𝒓in,ρ(𝒖)) to express that the quantities can be interpreted as a low-pass filtering of 𝒓+τ𝒓. In the Equilibrium propagation, they are called 𝒓=(𝒓in,ρ(𝒖)), but this only represents a renaming of the input rates (and a notation change from 𝒓 to 𝒓). Hence, the Lagrangian L=12𝒖𝑾𝒓2+βC in the NLA with 𝒓=(𝒓in,ρ(𝒖)) is rewritten by L=12𝒖𝑾𝒓2+βC with 𝒓=(𝒓in,ρ(𝒖)).

The only true difference is that in the NLA we apply the combined lookahead operator (1+τddt)𝒖 to the Lagrangian L, while in the Equilibrium Propagation one applies the reduced operator 𝒖 to L. We claim that the lookahead makes the difference for ‘living’ systems that learnt to optimize in-time with respect to future quantities. The Euler-Lagrange equations for the Equilibrium Propagation reduce to L𝒖=𝒖𝑾𝒓ρ(𝒖)𝑾T𝒆β𝒆*=0, where now 𝒆=𝒖𝑾𝒓. Since these are not differential equations (no 𝒖˙ in there), one introduces the gradient descent dynamics τ𝒖˙=L𝒖=𝒖+𝑾𝒓+ρ(𝒖)𝑾T𝒆+β𝒆* and waits for convergence for fixed input and fixed target. If τ=0 in both the NLA and the Equilibrium propagation, we obtain the same implicit and stationary equations in 𝒖.

Because the equation is implicit in 𝒖, it has to be solved first for 𝒖. While the Equilibrium propagation suggests to keep the inputs constant until a fixed point is reached, the NLA dynamically moves along the trajectory of fixed points while the inputs are changing. In the Equilibrium propagation we cannot simply take the limit τ0 since then the dynamics either disappears (when τ remains on the left, τΔ𝒖0) or explodes (when τ is moved to the right, dtτ), leading to either too small or too big jumps.

Appendix 4

Contraction analysis and delta-function inputs

Contraction property

We next show when the voltage dynamics obtained from (1+τddt)L𝒖=0 is locally contracting, i.e., when neighboring trajectories of the explicit differential equations, for given inputs and targets, locally converge. This is a different stability property as compared to the (global) convergence properties for the implicit differential equation stated in Equations 45 and 49.

For the contraction analysis we rewrite Equation 51 in the form 𝑮(𝒚,𝒖˙)=(1+τddt)L𝒖=𝒇+τ𝒇˙=0, where the explicit time dependence of E and 𝒇 is a short-cut to express the dependence on 𝒚=(𝒓in,𝒖𝒐*,𝒖). According to the implicit function theorem, at any point in time when 𝑮𝒖˙ is invertible, we can locally write 𝒖˙=𝒈(𝒚). When absorbing the dependence of 𝒖˙ on (𝒓in,𝒖𝒐*) into an explicit time dependence, we can rewrite this differential equation as 𝒖˙=𝒈(𝒖,t), see Equation 52, when explicitly expressing the dependency of 𝒈 on 𝒖 only (instead of all the variables 𝒚). This differential equation is contractive and thus stable if the Jacobian of 𝒈 with respect to 𝒖 is uniformly negative definite (Lohmiller and Slotine, 1998). The contraction analysis tells that locally, where 𝑮𝒖˙ is invertible, we can express 𝒖˙ as a function 𝒖˙=𝒈(𝒚), that has derivative 𝒈𝒚=(𝑮𝒖˙)1𝑮𝒚 according to the implicit function theorem. Restricted to the 𝒖-component of 𝒚 we get the Jacobian

(53) gu=(Gu˙)1Gu=1τ1H1Huu˙=1τ1dlogHdt.

To prove Equation 53 we note that according to Equation 51 we have 𝑮(𝒚,𝒖˙)=𝒇+τ𝑯(𝒖)𝒖˙+τ𝒇t=0, and with this calculate 𝑮𝒖˙=τ𝑯, with Hij=fiuj specified above. Since according to Equation 47 the term 𝒇t does not depend on 𝒖, we also calculate 𝑮𝒖=𝑯+τ𝑯𝒖𝒖˙=𝑯+τd𝑯dt. Hence, 𝒈𝒖=1τ𝑯1(𝑯+τd𝑯dt)=1τ1dlog𝑯dt.

If we had 𝒈𝒖=1τ1 alone, 𝒈𝒖 would be obviously negative definite, guaranteeing the local contraction property (Lohmiller and Slotine, 1998). To understand this statement, we consider the local representation of neighboring trajectories. The linear approximation of the voltage dynamics 𝒖(t), starting with 𝒖(t) in a dim-𝒖-dimensional neighbourhood around some point 𝒖(t), is ddt(𝒖𝒖)=𝒈(𝒖,t)𝒖(𝒖𝒖)=1τ(𝒖𝒖). While 𝒖(t) is the linear approximations of the original voltage dynamics starting at 𝒖(t), the trajectory 𝒖(t) is the true voltage dynamics starting at the ‘center’ 𝒖(t) of the neighborhood. This shows the exponential local contraction of surrounding trajectories to 𝒖 at times t around t, would we have 𝒈𝒖=1τ1.

The additional log-term in Equation 53 may cause a local violation of this contraction property. However, the additional term in Equation 53 becomes small for large or small voltages for which we assume that the curvature of the transfer function also becomes small, ρ(𝒖)0. In fact, based on Equation 48 we calculate 𝑯𝒖 being a function of ρ(𝒖) that vanishes with vanishing ρ(𝒖),

(54) H(u)u=Wnetdiag(ρ(u))2u2(ϵ¯+βe¯)(u)=o(ρ(u)).

Plugging Equation 54 into Equation 53 we conclude that 𝒈𝒖1τ1 where the curvature of the transfer function vanishes, ρ(𝒖)0. Yet, ρ may strongly deviate from 0 for intermediate values of 𝒖. For a rectified linear unit (ReLu), ρ(u)=u for u0 and 0 else, for instance, ρ is a delta-function at 0. Even if the deviation from ρ(𝒖)=0 in this case is on a set of measure 0, integrating across voltages u=0 can cause a jump in the voltage dynamics, preventing a strict local contraction property everywhere. The contraction property, in the example of a ReLu, only holds almost everywhere. Because the ReLu is an important transfer function in practical applications, we elaborate on this example further.

Comparison of the NLA with the latent equilibrium

Here, we point out that the appearance of a delta-function for ReLu’s in the NLA may represent a technical challenge, that this specific challenge is tamed in the simpler version of the NLA principle formulated as Latent Equilibrium (Haider et al., 2021), and that the Latent Equilibrium may represent a viable new approach for time-varying optimization in general.

The Euler-Lagrange equation for the presented NLA, (1+τddt)L𝒖=0, Equation 20, with Jacobian L𝒖=𝒇(𝒖,t)=𝒖𝑾𝒓𝝐β𝒆*, implies second-order derivatives with respect to the voltage 𝒖. This is the case because the total derivative with respect to time, ddt, applied to the error 𝝐=𝒓net𝑾netT𝒆 contained in 𝒇(𝒖,t), implies the expression ddt𝒓net=ρ(𝒖)𝒖˙, with ″ equivalent to 2𝒖𝒖. These second-order derivatives enter both in the implicit form (Equations 46 and 51) and the explicit differential Equation 52. When integrating these equations across the components u=0, a jump in u emerges that must be explicitly included in the simulations. It is the same jump that may transiently violate the contraction property as shown above (Equations 53 and 54, but does not violate the exponential convergence from Equation 45, see below).

In the Latent Equilibrium, our Lagrangian L=12𝒖𝑾𝒓2+βC with 𝒓=(𝒓in,ρ(𝒖)) is replaced by L(𝒖,𝒖˙)=12𝒖˘𝑾𝒓2+βC with 𝒓=(𝒓in,ρ(𝒖˘)) and 𝒖˘=𝒖+τ𝒖˙. If we were asking for stationary trajectories 𝒖(t) of L(𝒖,𝒖˙)dt with respect to variations in 𝒖, we would obtain the unstable Euler-Lagrange equations L𝒖ddtL𝒖˙=(1τddt)L𝒖=0, solved by an exponentially growing function of time, L𝒖=𝒄ett0τ, unless 𝒄=0. To force the trajectory moving along L𝒖=0, one again requires that variations of the action Ldt are stationary with respect to a prospective voltage, this time with respect to 𝒖˘=𝒖+τ𝒖˙. Because the Lagrangian for the Latent Equilibrium can be expressed as a function L(𝒖˘) of 𝒖˘ only, not of 𝒖˘˙, the Euler-Lagrange equations becomes L𝒖˘=𝒖˘𝑾𝒓ρ(𝒖˘)𝑾T𝒆β𝒆*=0, where now 𝒆=𝒖˘𝑾𝒓.

The Latent Equilibrium is consistent with the central idea of the NLA to deal with future quantities. While the presented version of the NLA deals with the discounted future voltage 𝒖~ (leading to the operator (1+τddt)𝒖 applied to the Lagrangian), the Latent Equilibrium deals with the linear approximation 𝒖˘ of the future voltage (leading to the operator 𝒖˘ applied to the Lagrangian). Both, the NLA and the Latent Equilibrium differ in this crucial aspect from the Equilibrium Propagation that deals with the current voltage 𝒖 (leading to the operator 𝒖 applied to the Lagrangian L(𝒖), without appearance of 𝒖˘˙, and hence without intrinsic dynamics, see above).

For the Latent Equilibrium, it is not obvious to find global convergence properties as expressed in Equations 45 and 49 for the presented version of the NLA. However, the Latent Equilibrium satisfies the strict local contraction property 𝒈𝒖=1τ1. In fact, writing the Euler-Lagrange equations as 𝑮(𝒚,𝒖˙)=L𝒖˘=0 with the abbreviation 𝒚=(𝒓in,𝒓˙in,𝒖𝒐*,𝒖˙𝒐*,𝒖), leads to 𝑮𝒖˘=𝑮𝒖=1τ𝑮𝒖˙, and the linear approximation obtained from plugging these partial derivatives into Equation 53 yields the claimed contraction property 𝒈𝒖=1τ1 for the Latent Equilibrium. The Jacobian 𝑮𝒖˙ is invertible if the Hessian 𝑯=2L𝒖˘𝒖˘ is (that looks as in Equation 48, but with 𝒖 replaced by 𝒖˘).

Crucially, for the Latent Equilibrium, the analogous theorems on real-time dendritic error propagation and learning (rt-DeEP and rt-DeEL) hold. The key property is that the trajectories of the Latent Equilibrium are stationary solution of the corresponding action A=Ldt, so that the variation of A induced by 𝑾 reduces to the partial derivative of L with respect to 𝑾.

The above strict contraction property that for non-vanishing errors only holds for the Latent Equilibrium, but not for the NLA, may be a reason why the simulations of the error-based updates in our hands seem to be more stable for the Latent Equilibrium as compared to the NLA. This does not withstand the fact that the NLA equations are used in the context of time-varying optimal control (see above). It would be interesting to consider also the Latent Equilibrium as a tool for non-stationary, time-varying optimization (Simonetto et al., 2020).

Delta-function inputs keep L𝒖=0

We next explain in more details why delta-functions in the input rates 𝒓in(t) for the NLA the stationarity (‘equilibrium’) condition L𝒖=0 is always satisfied (the delta-function in 𝒓in for the NLA corresponding to step-function in 𝒓in for the Latent Equilibrium). We reconsider the explicit differential equation 𝒖˙=𝒈(𝒖,t)=1τ𝑯1(𝒇+τ𝒇t) given in Equation 52, with 𝒇(𝒖,t)=L𝒖=𝒖𝑾𝒓𝝐β𝒆* and 𝑯 given in Equation 48.

To simplify matters, we consider a single delta-function at t=0 as input in the absence of output nudging. In this case, we get 𝒇t=𝒇𝒓in𝒓˙in+𝒇𝒓in𝒖˙𝒐*=𝑾in𝜹in(t), where the input matrix 𝑾in is typically sparse (not all network neurons receive external input), and 𝜹in(t) is a vector of delta-functions restricted to the input neurons. Following (Nedeljkov and Oberguggenberger, 2012), Proposition 2.1, we can then write the explicit differential equation, Equation 51, in the form

(55) u˙=g(u,t)=gˇ(u,t)+H(u)1Winδin(t),

where 𝒈ˇ(𝒖,t) is globally Lipschitz continuous. Due to the Lipschitz continuity the change in 𝒖 evoked by 𝒈ˇ(𝒖,t) during a small time interval [ε,ε] around t=0 vanishes when this interval shrinks, ε0. To quantify the change in 𝒖 during these intervals it is, therefore, enough to consider 𝒖˙=𝑯(𝒖)1𝑾in𝜹in(t), or equivalently 𝑯(𝒖)𝒖˙=𝑾in𝜹in(t). To estimate the jump induced by the delta-functions, we consider some mollifier ϕε(t)=ε1ϕ(t/ε), where ϕ(t) is a smooth function on the interval [1,1] with integral 1. By 𝝓in,ε(t) we denote the vector of mollifiers centered at the delta-functions of the input neurons. We now consider the two differential equations, with the second approximating the first on the interval [ε,ε], but without regular term 𝒈ˇ(𝒖,t),

(56a) τH(uε)u˙ε=τH(uε)gˇ(u,t)+Winϕin,ε(t)
(56b) τH(uˇε)uˇ˙ε=Winϕin,ε(t),uˇε(ε)=uε(ε).

We assume that for all t[ε,ε] the matrices 𝑯(𝒖ε(t)) and 𝑯(𝒖ˇε(t)) are invertible, so that the two Equation 56a, Equation 56b can be turned into an explicit differential equations. Analogously to the 1-dimensional case (Nedeljkov and Oberguggenberger, 2012), we conclude that the solution of Equation 56a, Equation 56b on the interval [ε,ε] converge to each other, supt[ε,ε]|𝒖ε(t)𝒖ˇε(t)|0 for ε0. As a consequence, the jump of 𝒖ˇε at t=0 converges to the corresponding jump of 𝒖ε in the various dimensions.

To calculate the jump of 𝒖ˇε we have to integrate τ𝑯(𝒖ˇε)𝒖ˇ˙ε across the time interval [ε,ε]. Instead of integrating 𝒖ˇ˙ε, we first integrate τ𝑯(𝒖ˇε)𝒖ˇ˙ε given in Equation 56b. Moving from right to left yields

(57) 𝑾in𝑰in=εε𝑾in𝝓in,ε(t)dt=τεε𝑯(𝒖ˇε(t))𝒗˙ε(t)dt=τ𝒖ˇε(ε)𝒖ˇε(ε)𝑯(𝒖ˇε)d𝒖ˇε,

where 𝑰in is the index vector of the input neurons having a delta-function, i.e., Iin,j=1 if there is a delta-input, and else 0. Note that v˙ε,jdt=dvε,j. Because 𝑯 is itself a derivative, 𝑯=𝒇𝒖, we can explicitly calculate the latter integral (also for β>0, but for clarity here only done for β=0). The last integral in Equation 57 is defined as a vector with i-th component being

(58) (𝒖ˇε(ε)𝒖ˇε(ε)𝑯(𝒖ˇε)d𝒖ˇε)i=j=1Nuˇε,j(ε)uˇε,j(ε)Hij(uj)duj=j=1N(δijuiWijρ(uj))|uˇε,j(ε)uˇε,j(ε)=(𝒗ε(ε)𝒗ε(ε))i,

where in the second last equality we used that Hij(u)=δijWijρ(uj) does only depend on the component uj, see Equation 48. In the last equality we introduced the ‘network voltage error’ 𝒗ˇε=𝒖ˇε𝑾netρ(𝒖ˇε). Following the 1-dimensional case treated in Equation 66, Proposition 1.2 we introduce the ‘jump function’ (called G(y) in the cited work)

(59) 𝑱(𝒖ˇε)=τ(𝒗ˇε𝒗ˇ),

with 𝒗ˇ thought to represent 𝒗ˇε(t) at some time t before the delta-kick sets in. With this setting, Equation 57 turns into

(60) WinIin=J(uˇε(ε))J(uˇε(ε)).

In the limit ε0 we get a relation between 𝒖(t) immediately before and after the jump, 𝒖(0) and 𝒖(+0), using that in this limit the boundary points of the trajectories also converge, 𝒖ˇε(±ε)𝒖(±0),

(61) J(u(+0))=J(u(0))+WinIin.

We now assume that the function 𝑱(𝒖)=τ(𝒗𝒗) is invertible around the jump. This is the case if the Jacobian 𝑱(𝒖)𝒖 is invertible, and because 𝒗=𝒖𝑾netρ(𝒖), we require invertability of 𝑱(𝒖)𝒖=τ𝑯(𝒖), with Hessian defined in Equation 48.

In the case of invertability we get the voltage after the jump as

(62) u(+0)=J1(J(u(0))+WinIin).

We next calculate the jump in 𝒗. This is easy since 𝒗=𝒖𝑾netρ(𝒖) linearly enters in the function 𝑱(𝒖)=τ(𝒗𝒗). Plugging the explicit expression for 𝑱 into Equation 61 we get τ(𝒗(+0)𝒗)=τ(𝒗(0)𝒗)+𝑾in𝑰in, or

(63) v(+0)=v(0)+1τWinIin.

Knowing the jump in 𝒗 helps to show that the equilibrium condition L𝒖=0 is always satisfied, even immediately after the delta-in put, provided the initialization is at t0=. To show this, remember that in the absence of nudging we have L𝒖=𝒇(𝒖,t)=𝒖𝑾𝒓=𝒖𝑾netρ(𝒖)𝑾in𝒓in=𝒗𝑾in𝒓in. The jump size of 𝒓in for a delta-function at t=0, 𝒓in(t)=𝜹in(t) is 1τ. This is because 𝒓in satisfies the differential equations 𝒓˙in(t)=1τ𝒓in(t)+1τ𝜹in(t), provided that t0=. Hence,

(64) r¯in(+0)=r¯in(0)+1τIin.

With Equations 63 and 64 we conclude that L𝒖=𝒗𝑾in𝒓in=0 throughout.

Appendix 5

Example of a single recurrently connected neuron

To get an intuition for the instantaneity in the recurrent case we consider the example of a single, recurrently connected neuron. We also put this into the context of the Latent Equilibrium (Haider et al., 2021). Consider the weight vector 𝑾=(Win,Wnet) with an input rate rin(t) driving the postsynaptic voltage u. The postsynaptic rate is r=ρ(u)+τρ˙(u), and its low-pass filter with respect to τ is r=ρ(u). As always, the low-pass filtering reaches back to an initialization at t0=, see Equation 15. The Lagrangian has the form

(65) L=12e2+β2e*2=12(u(Wnetρ(u)+Winrin))2+β2(u*u)2.

The Euler-Lagrange equations (1+τddt)Lu=0 are derived from Lu=eρ(u)Wneteβe*. Applying the look-ahead operator (1+τddt) (Equation 14), and abbreviating ϵ=ρ(u)Wnete, the Euler-Lagrange equations deliver the voltage dynamics,

(66) τu˙=u+Wnet(ρ(u)+τρ˙(u))+Winrin+ϵ+βe*.

To simplify matters, we consider the nudging-free case, β=0. This implies that e=ϵ=0. With ρ˙(u)=ρ(u)u˙, we obtain the differential equation

(67) τ(1Wnetρ(u))u˙=u+Wnetρ(u)+Winrin.

Abbreviating v=uWnetρ(u) as ‘network voltage error,’ the above differential equation reads as

(68) τv˙=v+Winrin.

Integrating the effective voltage dynamics (Equation 68), assuming initialization infinitely far in the past, is equal to v=Winrin. This equation is equivalent to the Euler-Lagrange equation (1+τddt)Lu=0 being integrate, and because the solution of the Euler Lagrange equation is Lu=c𝒆tt0τ, we have (using t0=)

(69) Lu=vWinrin=0.

Voltage dynamics for a delta-function input

We next apply a delta-function in the input rate, say rin(t)=δ(t) and consider the dynamics at the level of the voltage, Equation 67. As in Equation 66, Proposition 1.2 we introduce the ‘jump function’ (called G(y) in the cited work)

(70) J(u)=τuu(1Wnetρ(y))dy=τ((uWnetρ(u))v)=τ(vv).

As in Nedeljkov and Oberguggenberger, 2012, Proposition 1.2, we show that the voltage u makes a unique jump at the moment of the delta function that J(u) is invertible around the jump.

We set v=uWnetρ(u). Here, u is some voltage before the jump, say u=u(1) evaluated at time t=1, when the jump is at t=0. When u0=u(0) is the voltage immediately before the jump, the voltage immediately after the jump is u0+=u(+0) specified by

(71) J(u0+)=J(u0)+Win.

The reason is that the Win-scaled delta-function triggers a step of size Win when integrating over it as done in Equation 70. The new value u0+ is unique if J is invertible, and looking at the defining integral in Equation 70, this is the case if 1Wnetρ(u0+)0.

The jump in u translates into a jump in v=uWnetρ(u) from v0 to v0+=u0+Wnetρ(u0+). This endpoint can also be expressed as

(72) v0+=v0+Winτ.

To check this, we assume without loss of generality that v=uWnetρ(u)=0. Then J(u)=τ(uWnetρ(u))=τv and J(u0+)=J(u0)+Win=τv0+Win according to Equation 71. Since also J(u0+)=τv0+, we conclude that τv0+=τv0+Win, as claimed above.

We finally show that even far away from the initialization, the stationarity condition Lu=0 holds before and immediately after the jump. In fact, for t0=, the evolution of the ‘network voltage error’ becomes

(73) v(t)=0 for t0, and v(t)=Winτ𝒆tτ for t>0.

Here, we used that v0=0 and according to Equation 72 the v jumps to v0=Winτ. Remember, that for initialization far in the past, Lu=0 is equivalent to v=Winrin, see Equation 69. We, therefore, have to calculate the jump in rin induced by the delta-input. Since rin(t) is the solution of τr˙in(t)=rin(t)+δ(t) for t0=, we find that

(74) rin(t)=0 for t0, and rin(t)=1τ𝒆tτ for t>0.

Combing the two Equations 73 and 74 proves that Lu=vWinrin=0 holds true any moment in time, provided the initialization is far in the past.

One may ask why the delta-kink is different from resetting v at a new initialization off from 0. The reason is that at t=0 there is a cause for the jump in r(0), while at t0 there is no cause in r(t0). In fact, there is no jump initially, just the start of v at some initial condition. Differently from the initialization at t0, where v(t0)>0 implies Lu(t)>0 for finite tt0>0, the jump of v(0) at t=0 to a positive value does leave Lu(t)=0 for all t>0, provided t0=.

Linear transfer function

We first consider the case of a linear transfer-function ρ(u)=0 (or threshold linear, being in the linear regime). Then the differential equation becomes

(75) τu˙=u+Win1Wnetrin.

With initialization at t= and low-pass filtering rinτ defined in Equation 15 the solution is

(76) u(t)=Win1Wnetrinτ(t).

The point is that the time constant is τ and is not τ1Wnet, as this would be the case without prospective firing rate. In fact, for the ‘classical’ differential equation,

(77) τu˙=u+Wnetρ(u)+Winrin,

and ρ the identity, we obtain the differential equation τeffu˙=u+Win1Wnetrin with τeff=τ1Wnet and solution u=Win1Wnetrinτeff that is now the low-pass filtering with respect to the effective time constant.

Sigmoidal transfer function

For a sigmoidal transfer function ρ(u)=rmax1+𝒆ϑu, a positive feedback weight Wnet>0, and a constant external input rin, say, the solutions u(t) of Equation 66 either converge to a fixed point or diverge. When converging, the voltage satisfies the fixed point condition

(78) u=Wnetρ(u)+Winrin.

This fixed point equation can be numerically solved by time-discrete iteration process. But it can also be solved by a time-continuous process that underlies a neural or neuromorphic implementation. The prospective firing rate introduced in the NLA can be seen as a method to quickly find the fixed point in continuous time. When directly solving the implicit differential equation (as opposed to convert this into an explicit differential equation using e.g., the Cholesky decomposition), the fixed point is potentially found with a fewer number of steps.

To estimate the speed of convergence, we look at the initial speed when taking off at initial condition u(0) between the unstable and stable fixed point. The initial speed for the classical differential equation, Equation 77, and the NLA version, Equation 66, are, respectively,

(79a) u˙(0)=Δu(0)τ,
(79b) u˙(0)=Δu(0)τ(1Wnetρ(u(0))),

where we set Δu(0)=u(0)+Wnetρ(u(0))+Winrin(0). As Wnetρ>0, the initial convergence speed of the NLA solution is larger. The scheme has some resemblance to the Newton algorithm of finding zero’s of a function by using its derivative.

Appendix 6

NLA for conductance-based neurons and least-action principle in physics

The mismatch energies and costs can be generalized in different ways. Here, we focus on a biophysical version of the mismatch energy that includes conductance-based neurons. This also relates to the least-action principle in physics. But the NLA can also be generalized to include other dynamica variables such as adaptive thresholds or synaptic short-term plasticity.

Equivalent somato-dentritic circuit

For conductance based synapses, the excitatory and inhibitory conductances, gE and gI, are driven by the presynaptic firing rates and have the form gE(t)=WEr(t), and analogously gI(t)=WIr(t). The dynamics of a somatic voltage u and a dendritic voltage v reads as

(80) cu˙=gL(ELu)+gsd(vu),
(81) cdv˙=gL(ELv)+gE(EEv)+gI(EIv)+gds(uv)

where c and cd are the somatic and dendritic capacitances, EL/E/I the reversal potentials for the leak, the excitatory and inhibitory currents, gsd the transfer conductance from the dendrite to the soma, and gds in the other direction.

We consider the case when the dendritic capacitance cd is small as compared to the sum of conductances gd on the right-hand-side of Equation 81, yielding a fast dendritic time constant. In this case we can solve this equation in the steady state for v, plug this into Equation 80, and get

(82) cu˙=g(Vu),

with effective reversal potential V=(gLEL+gsdgffgff+gdsvff)/g, total conductance g=gL+gsdgffgff+gds, feedforward dendritic voltage vff=(gLEL+gEEE+gIEI)/gff and feedforward dendritic conductance gff=gL+gE+gI. Because gE/I=WE/Ir(upre,u˙pre), the conductance depends on the presynaptic voltage and its derivative. Equation 82 describes the effective circuit that has the identical voltage time course as Equations 80 and 81 with v˙=0, but with a single time-dependent ‘battery voltage’ V and Ohmic resistance R=1/g.

Somato-dentritic mismatch power and action

The synaptic inputs gE(t) and gI(t) are continuously driving V(t), and the best what one can hope for the dynamics of u is that it traces V with some integration delay determined by the time constant τ=c/g. In fact, if u follows the dynamics of Equation 82, then u becomes the low-pass filtered target potential, u=V, where V(t) is filtered with the dynamic time constant τ(t). The defining equations for the low-pass filtering is

(83) u¯+τu¯˙=u,

and this self-consistency equation is equivalent to the explicit form

(84) u(t)=tdtu(t)τ(t)ettdt1τ(t).

To capture the voltage dynamics with our NLA principle we recall that the somatic voltage u can be nudged by an ‘apical voltage’ e that causes a somatic voltage error e=uV. The voltage error drives a current I=ge through the conductance g. The electrical power of this current I driven by the voltage e is P=Ie2. This motivates the definition of the mismatch power in a network of N neurons by

(85) P=iNgi2(uiVi)2=12𝒈T(𝒖𝑽)2.

P is a virtual power that, nevertheless, is related to some physical flow of ions. Assume we could measure all the ions flowing in the original circuit of Equations 80 and 81 (in the limit of small ratio Cd/gd). From this flow, delete the ion flow that cancels at the level of electrical charge exchange due to the counter directed flow. The remaining effective ion flow defines an effective current flowing through the conductance g with driving force (Vu), Equation 82. If it were only this effective current in the reduced circuit, the voltage u(t), starting at u(0), would converge with time-constant τ to the low-pass filtering V. Without additional ‘hidden’ current, the voltage u would then instantaneously follow V that is itself given by the forward dendritic input conductances. The deviation of u from V, caused by some initial conditions in u or by a feedback current from the network affecting u, builds the mismatch power P. The feedback may originate from a target imposed downstream, and the neuron is ‘free’ in how to dynamically match u and V. It is, therefore, tempting to see P as a ‘free power’, and the NLA principle as minimizing the corresponding ‘free energy’. In fact, the free-energy principle says that any self-organizing system that is in a dynamic equilibrium with its environment (here u=V in the absence of output nudging) must minimize its free energy (that here builds up by imposing a target; Friston, 2010).

The NLA principle states that the time-integral of P is minimized with respect to the look-ahead voltage u~. We, therefore, define the physical mismatch energy as

(86) A=t1t2Pdt,

that has the units of energy. EM takes the role of our neural action (A in the main text) for conductance-based neurons.

Euler-Lagrange equations for conductance-based neurons

The NLA for conductance-based neurons seeks to minimize A=P(𝒖)dt with respect to variations of 𝒖, such that δAδ𝒖=0. In the simplest example of the main text we considered prospective rates, 𝒓=ρ(𝒖)+τρ˙(𝒖), so that the low-pass filtered rates become a function of the instantaneous voltage, 𝒓=ρ(𝒖). These low-pass filtered presynaptic rates, 𝒓, determine the postsynaptic voltage u. Analogously, the low-pass filtered reversal potential, 𝑽, determines the postsynaptic voltage u, and we again postulate that 𝑽 is an instantaneous function of the presynaptic voltage, 𝑽=ϕ(𝒖). Here, we argue that active dendritic mechanisms advance to postsynaptic reversal potential V, so that the delayed V again becomes instantaneous, similarly to the advancement of the apical dendritic potential observed in cortical pyramidal neurons (Ulrich, 2002), see also Figure 2b. With this instantaneity, the stationarity of the action with respect to generalized (compact and non-compact) variations, δAδ𝒖=0, translates to the condition P𝒖=0.

Calculating P𝒖=0 with P given in Equation 85 and 𝝉=c/𝒈 for the total conductance 𝒈(𝒖) specified after Equation 82 is a bit more demanding. For a probabilistic version, where P is derived from the negative log-likelihood of a Gaussian density of the voltage, P=logp(𝒖|𝑽)=12𝒈T(𝒖𝑽)212log𝒈+const, the calculation is done in Jordan et al., 2022. In this probabilistic version, there is an additional normalization term that enters as log𝒈. In the deterministic version considered here, this log term is not present and we calculate

(87) Pu=g(uV¯)ϵ¯with ϵ¯=r¯WnetT((uV¯)(EE/IV¯)12(uV¯)2).

Notice that the transpose 𝑾netT selects the downstream network neuron to backpropagate from there the first- and second-order errors. From P𝒖=0 and 𝝉=c/𝒈 we conclude that

(88) cτ(uV¯)ϵ¯=0 or u=V¯+τcϵ¯.

We next apply the look ahead operator to the expression in this Equation 88. Assuming an initialization at t0=, the condition P𝒖=0 becomes equivalent to (1+τddt)P𝒖=0, and hence Equation 88 becomes equivalent to𝒈(𝒖𝑽)𝝐+τ𝒈(𝒖˙𝑽˙)τ𝝐˙+τ𝒈˙(𝒖𝑽)=c𝒖˙𝒈(𝑽𝒖)𝝐+τ𝒈˙(𝒖𝑽)=0, and with the total conductance 𝒈(𝒖) specified after Equation 82 this is calculated to become (see Jordan et al., 2022).

(89) cu˙=g(Vu)+ϵ+τ˙ϵ¯.

A learning rule of the form W˙PW=e¯postr¯T with an appropriate postsynaptic error e¯post can again be derived (see Jordan et al., 2022) for a single neuron in the probabilistic framework.

Generalizations: Long memories, reinforcement learning

One could also extend the NLA principle by adding e.g., threshold adaptation that endows the dynamics with additional and longer time constants. For this, the rate function is parametrized by an additional threshold, r=ρ(uϑ), an the Lagrangian is added by an error term on the threshold. Such an error addition can take the form L=...+12ϑrτϑ2, where rτϑ now represents a low-pass filtering of the rate with a long threshold adaptation time constant τϑ. Neurons that show an additional threshold adaptation will still be able to instantaneously transmit a voltage jump through a prospective firing rate, but this will now also depend on the neuron’s history. Short-term plasticity may be included in the same way. Due to the history dependence, the stationarity of the action δA=0 cannot anymore be reduced to the stationarity of the Lagrangian at any moment in time. As a consequence, the errors will look-ahead into to future more than only based on the local derivatives.

Generalizations are further possible for the cost function that favor voltage regions with high cost, say corresponding to punishment, or negative cost, corresponding to punishment to reward. These extensions will be considered in future work.

Voltage dynamics from the least-action principle in physics

We have shown that through the look-ahead in Hamilton’s least-action principle, the notion of friction enters through the backdoor. In the least-action formalism in physics, friction is directly introduced by extending the Hamiltonian principle to a generalized D’Alembert principle, where at the level of the Euler-Lagrange equations the generalized force is equated to the dissipation force (Flannery, 2005). For an introduction to the least-action principle in physics see e.g., (Feynman et al., 2011), and for an introduction to the calculus of variation in general with a derivation of the Euler-Lagrange equation see e.g., (Chachuat, 2007).

The electro-chemical properties of a membrane can be captured by an equivalent circuit consisting of a battery voltage V, a capacitance C and a resistance R, arranged in parallel. The voltage dynamics is derived from the Euler-Lagrange equations that are added by a dissipative force.

Formally, in the absence of an inductance defining the kinetic energy, the Lagrangian L becomes identical to the potential energy U that itself is

(90) L=U=QVQ2/(2C),

where Q represents the charge across the membrane. Looking for stationary trajectory of the action (time integral of the Lagrangian) with only the potential energy in the Lagrangian would not give any dynamics. In fact, the Euler-Lagrange equation QLdtQ˙L=0 yields Q=CV. The potential energy is, therefore, extended by the dissipative Rayleigh energy F that introduces friction,

(91) F=RQ˙2/2.

According to D’Alembert’s principle, the Rayleigh energy enters as a gradient at the level of the Euler-Lagrange equation. At this point, the theory steps out from the original principle by adding the Rayleigh gradient ‘by hand’ to the Euler-Lagrange equation (different from the NLA, see below). Otherwise, no first order differential equation with friction in the desired form would be obtained from the least-action principle in physics (if we were adding F to L, for instance, the dtQ˙L-term in the Euler-Lagrange equations would yield the term RQ¨ instead of RQ˙). Hence, with D’Alembert’s patch the dynamics is characterized by the Euler-Lagrange equation added by a dissipative force,

QLdtQ˙L+Q˙F=0,

and this equation reduces to V+Q/C+RQ˙=0. Identifying the charge by means of voltage across the capacitance, Q=Cu, this equation can also be written as V+u+RCu˙=0, or

(92) τu˙=u+V,

with τ=RC, similarly as derived in Equation 89.

The link to our NLA principle is most easily made by assuming that the ‘membrane’ conductance g=1/R does not depend on the voltage u (of the post- or presynaptic neuron, as this is the case in general, see explanation after Equation 82). In this simplified case, the Lagrangian from Equation 86 becomes the power

(93) P=(uV¯)2/(2R).

The Euler-Lagrange equation with respect to u~, applied to the action A=Pdt is then calculated by

(94) Pu~ddtPu~˙=(1+τddt)Pu=1R(u+τu˙(V¯+τV¯˙))=0,or
(95) τu˙=u+V.

We used that τ=RC while u=u~τu~˙ and V=V+τV˙.

The dynamics derived from the NLA Equation 95 is identical to the dynamics derived from the least-action principle with friction in physics (Equation 92). Hence, the minimization of the energy (i.e. the time-integral of the power, A=Pdt) by looking ahead in time is equivalent to the stationarity of the physical action without looking ahead, but by taking account of the Rayleigh dissipation. The trick of looking ahead in time generates a dynamics out of an action A that encompasses only a potential energy, no kinetic energy. Without looking ahead, no dynamics would emerge from such an action. Notice, though, that the NLA action has units of energy (the time-integral of a power), while the physical action has units of energy times time (the time-integral of energies).

Appendix 7

A tutorial on total and partial derivatives as used in the paper

The proof of Theorem 1 given in the Methods (Sect. Proving theorem 1 (rt-DeEP)) makes use of partial and total derivatives and follows the notation of Scellier and Bengio, 2017 and Meulemans et al., 2022. As there is some variability (and community-dependent sloppiness) in the notation of partial and total derivatives here and in general, we provide some explanations on how this notation is interpreted, and why, for instance, total derivatives commute with each other, and also partial with each other (although not total with partial). For a standard basic textbook introducing the concepts of partial derivatives, total differentials, the implicit function theorem etc., see Stewart, 2010, being historically based on the classics of Courant, 1934.

  1. In a differential geometric setting, the derivative of real-valued function E(𝒖) on a point 𝒖 of a manifold (like the flat Euclidean space) is considered as a mapping of tangent vectors at u to the real numbers. When interpreting the derivative as mapping, the E𝒖 is living in the dual space of u and is, therefore, a row vector if is a column vector. If a function 𝒇(𝒖) of 𝒖 is a vector valued, then its derivative is a matrix with entries (𝒇𝒖)ij=fiuj, where i is indexing the rows (i.e. running down) and j is indexing the columns (i.e. running right). When 𝒓=ρ(𝒖) is a column vector with ρ applied to each component of 𝒖, we consider the (partial) derivative 𝒓=ρ(𝒖) for convenience as column vector with components ρ(ui). To strictly follow the formalism, it should be a diagonal matrix.

  2. Because we introduced the error vector 𝒆 as column vector, it is easier to write L𝒖=𝒆+..., where now L𝒖 is also considered as column vector. To be consistent with the above, we should have written L𝒖T=𝒆+.... The .T appeared us as too heavy so that we neglected it, where it did not have further mathematical consequences (hoping it does not cause confusions). Sticking here to a column vector also renders the backpropagation error to be a column vector in Equation 21. This error then gets the classical form with the weight transpose, 𝝐=𝒓net𝑾netT𝒆.

  3. We typically have real-valued functions of the form E(𝒖𝜽,𝜽), with 𝜽=(𝑾,β) being a vector of parameters, and 𝒖 being a function of 𝜽. To get the total derivative of E with respect to 𝜽 we consider the values E as a function of 𝜽. This can be done by introducing a new function E(𝜽) defined as E(𝜽)=E(𝒖𝜽,𝜽), where the components θi are considered as independent variables. The total derivative of E with respect to 𝜽 is then defined as vector-valued function dEd𝜽=(dEdθ1,...,dEdθn)=(dEdθ1,...,dEdθn)=E𝜽 for a n-dimensional 𝜽. It can be helpful to think of the components of this total derivative as a (total) directional derivatives in the unit directions. For the last (n-th) unit direction Δ𝜽(n)=(0,..,0,1), for instance, we have dEdθn=limε01ε(E(𝒖𝜽+εΔ𝜽(n),𝜽+εΔ𝜽(n))E(𝒖𝜽,𝜽)).

  4. The cost gradient, dCd𝑾(𝒖𝑾𝒓)𝒓T has the same dimension as 𝑾. Recall that by the cost gradient we mean dCd𝑾=C𝑾, where C is defined as C(𝑾)=C(𝒖𝐨(𝑾),𝒖𝐨*), with the voltage 𝒖𝐨 of the output neurons being itself a function of 𝑾.

  5. To calculate the partial derivative 𝜽E(𝒖,𝜽) with respect to 𝜽, we fix the first argument 𝒖𝜽, even if for 𝒖 we often plugged in the components of the trajectory 𝒖=𝒖𝜽(t) that now does depend on 𝜽. In contrast, the total derivative is dd𝜽E(𝒖,𝜽)=E(𝒖,𝜽)𝒖d𝒖d𝜽+E(𝒖,𝜽)𝜽. Here, E(𝒖,𝜽)𝒖 is a row vector, as also E(𝒖,𝜽)𝜽, consistent with the convention (that we have broken in Equation 28 to keep vectors as columns). When 𝒖 is considered as trajectory, d𝒖d𝜽 does not vanish in general, but it does when 𝒖 is simply considered as independent variable.

  6. When replacing the argument 𝒖 in E(𝒖,𝜽) by 𝒖=𝒖~τ𝒖~˙ we get the ‘Lagrangian’ L(𝒖~,𝒖~˙,𝜽)=E(𝒖~τ𝒖~˙,θ). The partial derivative of L with respect to 𝒖~, for instance, is then 𝒖~L(𝒖~,𝒖~˙,𝜽)=𝒖E(𝒖,𝜽)𝒖𝒖~=𝒖E(𝒖,𝜽). The partial derivative 𝒖~L considers L as a function of independent arguments 𝒖~, 𝒖~˙ and 𝜽.

  7. We also used that the total derivatives commute, in the current example dd𝑾ddβE=ddβdd𝑾E. This is generally true for derivatives of Lipschitz continuous functions, for which derivatives exist almost everywhere. The total derivatives (where they exist) then commute because the difference quotients in 𝑾 and β are uniformly bounded. The Moore-Osgood theorem tells that two limits, of which at least one is uniform in the other, can be commuted. This also applies to the double difference quotients involved in the definition of dd𝑾ddβ. Remember that the total derivative, for instance with respect to the n-th parameter, can be written as dEdθn=limε01ε(E(𝒖𝜽+εΔ𝜽(n),𝜽+εΔ𝜽(n))E(𝒖𝜽,𝜽)). But note again that, while for Lipschitz functions partial derivatives commute among themselves, and total derivatives commute among themselves, partial and total derivatives in general do not commute (not even with additional smoothness conditions)!

Appendix 8

Proof of theorem 1, part (ii), using only partial derivatives

This section proves Equations 29–31 in terms of only partial derivatives, banning nested functions and banning the main theorem to calculate total derivatives in higher dimensions as presented in item (v) of the above Tutorial. While a reduction to partial derivatives only may represent a conceptual simplification, it requires many more additional (unnested) functions to be defined (and in fact being the reason why in mathematics the concept of a total differential / total derivative is introduced, see e.g. Courant, 1934). – The cited three equations are also the core for the proof for Equilibrium Propagation (Scellier and Bengio, 2017), although there only applied in the steady state after the network converged to a constant activity.

We assume a network of d neurons whose membrane potential is given by 𝒖d and which are connected via weights 𝑾d×d. By 𝒖, we denote the gradient with respect to the membrane potentials, i.e., 𝒖=(u1,...,ud). Similarly, 𝑾 is a matrix containing the derivatives with respect to the weights, (𝑾)ij=Wij.

To prove the rt-DeEP theorem 1, we first have to make a few definitions and observations:

  1. For a given (𝑾,β), the dynamics yield certain membrane potentials f𝒖d. Formally, we define this as

    (96) f𝒖:d×d×d,(𝑾,β)f𝒖(𝑾,β).

    The ith element of f𝒖 is denoted by fui and hence f𝒖=(fu1,...,fud).

  2. We define the following functions:

    • the mismatch energy EM:d×d×d,(𝒖,𝑾)EM(𝒖,𝑾)=12i=1d[uijWijrj]2 ,

    • the cost function C:d,𝒖C(𝒖)=12kOuk*uk2 ,

    • the Lagrangian L:d×d×d×,(𝒖,𝑾,β)L(𝒖,𝑾,β)=EM(𝒖,𝑾)+βC(𝒖) . To make the dependency of the cost and energies on β and W explicit, we further introduce three auxiliary functions FM, FC and FL :

    • for the mismatch energy FM:d×d×,(𝑾,β)FM(𝑾,β)=EM(f𝒖(𝑾,β),𝑾) ,

    • for the cost function FC:d×d×,(𝑾,β)FC(𝑾,β)=C(f𝒖(𝑾,β)) ,

    • for the Lagrangian FL:d×d×,(𝑾,β)FL(𝑾,β)=FM(𝑾,β)+βFC(𝑾,β)

  3. The Euler-Lagrange equations can be written as τddt𝒖L=𝒖L.

    Hence, far enough away from initialization and for smooth enough input (and targets) we have f𝒖L=0 at all times, even when changing the network input continuously. Note that both the cost and the mismatch energies are defined on low-pass-filtered signals, and it is with respect to the low-pass filtered external input that the low-pass-filtered output error is minimized.

  4. Without output nudging (i.e. β=0), the output error vanishes and consequently all other prediction errors vanish as well, 𝒆=𝒖𝑾𝒓=0. This can be easily shown for layered network architectures and holds true for arbitrary connections (e.g. recurrent networks) as long as f𝒖 uniquely exists, i.e., 𝒖=𝑾𝒓 has a unique solution for 𝒖. From the form of the mismatch energy, we then get

    (97) WEM|β=0=(Wr¯u)r¯T|β=0=0.

    Since we are assuming smooth functions, this also implies that

    (98) limϵβ0WEM|β=ϵβ=0.
  5. From the assumption of well-behaved (smooth) functions, it also follows that partial derivatives commute 𝑾β=β𝑾.

Our goal is to find a plasticity rule that minimizes the cost C, which we do by calculating 𝑾FC|β=0. Similar to Scellier and Bengio, 2017, to achieve this, we first calculate the partial derivatives of FL with respect to the nudging strength β

(99a) FLβ=FMβ+βFCβ+FC
(99b) =i=1d(EMfuifuiβ+βCfuifuiβ)+C
(99c) =i=1d(EM+βC)fui=0fuiβ+C
(99d) =C,

and the weights 𝑾

(100a) FLWij=FMWij+βFCWij
(100b) k=1d(EMfukfukWij+βCfukfukWij)+EMWij
(100c) =k=1d(EM+βC)fuk=0fukWij+EMWij
(100d) =EMWij,

or in vectorized form

(101) WFL=WEM.

With these identities in place, we can calculate the plasticity rule:

(102a) limϵβ0WFC=limϵβ0WFLβ
(102b) limϵβ0βWFL
(102c) =limϵβ0limδ01δ(WFL|β=ϵβ+δWFL|β=ϵβ)
(102d) =limδ01δlimϵβ0(WEM|β=ϵβ+δWEM|β=ϵβ)
(102e) =limδ01δ(limϵβ0WEM|β=ϵβ+δlimϵβ0WEM|β=ϵβ=0 from (equation 98))
(102f) =limδ01δWEM|β=δ
(103g) 1δWEM|β=δfor δ1.

where we used that limits can be exchanged for smooth functions. Using the definition of EM, we obtain a plasticity rule that minimizes the cost function

(103) limϵβ0WC|β=ϵβ1ϵβWEM|β=ϵβ=1ϵβ(uWr¯)r¯T|β=ϵβ  for ϵβ1.

In practice, the prefactor 1ϵβ is absorbed into the learning rate.

Data availability

The current manuscript is a computational study, and the modelling code is available at GitHub, copy archived at Ellenberger, 2024.

References

  1. Conference
    1. Alonso E
    2. Fairbank M
    3. Mondragón E
    (2012)
    Conditioning for least action
    Proceedings of the 11th International Conference on Cognitive Modeling, ICCM. pp. 234–239.
  2. Report
    1. Chachuat B
    (2007)
    Nonlinear and dynamic optimization: from theory to practice
    EPFL - Swiss Federal Institute of Technology Lausanne.
  3. Book
    1. Courant R
    (1934)
    Differential and Integral Calculus
    John Wiley & Sons.
  4. Book
    1. Feynman RP
    2. Leighton RB
    3. Sands M
    (2011)
    The Feynman Lectures on Physics, Vol. II: Mainly Electromagnetism and Matter
    Feynmanlectures.
  5. Book
    1. Meulemans A
    2. Farinha MT
    3. Ordóñez JG
    4. Aceituno PV
    5. Sacramento J
    6. Grewe BF
    (2021)
    Credit assignment in neural networks through deep feedback control
    In: Meulemans A, Farinha MT, editors. Advances in Neural Information Processing Systems. Semantic Scholar. pp. 1–47.
  6. Conference
    1. Meulemans A
    2. Zucchet N
    3. Kobayashi S
    4. von Oswald J
    5. Sacramento J
    (2022)
    The least-control principle for local learning at equilibrium
    Conference on Neural Information Processing Systems. pp. 1–47.
  7. Conference
    1. Sacramento J
    2. Costa RP
    3. Bengio Y
    4. Senn W
    (2018)
    Dendritic cortical microcircuits approximate the backpropagation algorithm
    Advances in Neural Information Processing Systems. pp. 8721–8732.
  8. Book
    1. Stewart J
    (2010)
    Multivariable calculus
    Udmey.
  9. Book
    1. Talairach J
    2. Tournoux P
    (1988)
    Co-planar stereotaxic atlas of the human brain: 3-dimensional proportional system: an approach to cerebral imaging
    New York: Thieme Medical Publishers, Inc.
  10. Book
    1. Todorov E
    (2006) Optimal control theory
    In: Kenji D, Shin I, Alexandre P, Rajesh P, editors. The Bayesian Brain. MIT Press. pp. 1–28.
    https://doi.org/10.7551/mitpress/9780262042383.003.0012
  11. Conference
    1. Zhao Y
    2. Swamy MNS
    (1998) A novel technique for tracking time-varying minimum and its applications
    Canadian Conference on Electrical and Computer Engineering. pp. 910–913.
    https://doi.org/10.1109/CCECE.1998.685646

Article and author information

Author details

  1. Walter Senn

    Department of Physiology, University of Bern, Bern, Switzerland
    Contribution
    Conceptualization, Resources, Formal analysis, Supervision, Funding acquisition, Validation, Investigation, Visualization, Methodology, Writing – original draft, Project administration, Writing – review and editing
    Contributed equally with
    Dominik Dold and Mihai A Petrovici
    For correspondence
    walter.senn@unibe.ch
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-3622-0497
  2. Dominik Dold

    1. Department of Physiology, University of Bern, Bern, Switzerland
    2. Kirchhoff-Institute for Physics, Heidelberg University, Heidelberg, Germany
    3. European Space Research and Technology Centre, European Space Agency, Noordwijk, Netherlands
    Contribution
    Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing – original draft
    Contributed equally with
    Walter Senn and Mihai A Petrovici
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-7626-9960
  3. Akos F Kungl

    1. Department of Physiology, University of Bern, Bern, Switzerland
    2. Kirchhoff-Institute for Physics, Heidelberg University, Heidelberg, Germany
    Contribution
    Data curation, Software, Formal analysis, Investigation, Methodology
    Competing interests
    No competing interests declared
  4. Benjamin Ellenberger

    1. Department of Physiology, University of Bern, Bern, Switzerland
    2. Insel Data Science Center, University Hospital Bern, Bern, Switzerland
    Contribution
    Data curation, Software, Investigation, Visualization
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-4787-0471
  5. Jakob Jordan

    1. Department of Physiology, University of Bern, Bern, Switzerland
    2. Electrical Engineering, Yale University, New Haven, United States
    Contribution
    Software, Supervision, Investigation, Methodology, Writing – original draft
    Competing interests
    No competing interests declared
  6. Yoshua Bengio

    MILA, University of Montreal, Montreal, Canada
    Contribution
    Conceptualization, Methodology, Writing – original draft
    Competing interests
    No competing interests declared
  7. João Sacramento

    Department of Computer Science, ETH Zurich, Zurich, Switzerland
    Contribution
    Conceptualization, Formal analysis, Investigation, Methodology, Writing – original draft
    Competing interests
    No competing interests declared
  8. Mihai A Petrovici

    1. Department of Physiology, University of Bern, Bern, Switzerland
    2. Kirchhoff-Institute for Physics, Heidelberg University, Heidelberg, Germany
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Supervision, Funding acquisition, Validation, Investigation, Visualization, Methodology, Writing – original draft, Project administration
    Contributed equally with
    Walter Senn and Dominik Dold
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-2632-0427

Funding

European Union 7th Framework Programme (720270)

  • Walter Senn
  • Mihai A Petrovici

Swiss National Science Foundation (CRSII5180316)

  • Walter Senn

Swiss National Science Foundation (PZ00P3_186027)

  • João Sacramento

European Union 7th Framework Programme (785907)

  • Walter Senn
  • Mihai A Petrovici

European Union 7th Framework Programme (945539)

  • Walter Senn
  • Mihai A Petrovici

European Union 7th Framework Programme (604102)

  • Walter Senn

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We would like to thank Federico Benitez, Jonathan Binas, Paul Haider, Kevin Max, Alexander Mathis, Alexander Meulemans, and Jean-Pascal Pfister for helpful discussions, Kaspar Schindler for providing the human intracortical EEG data, and the late Karlheinz Meier for his dedication and support throughout the early stages of the project. WS thanks Nicolas Zucchet for mathematical discussions and hints to the literature on time-varying optimal control. DD acknowledges support through the European Space Agency's Postdoctoral Research Fellowship Programme. We express our particular gratitude towards the Manfred Stärk Foundation for their continued support.

Version history

  1. Sent for peer review:
  2. Preprint posted:
  3. Reviewed Preprint version 1:
  4. Reviewed Preprint version 2:
  5. Version of Record published:

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.89674. This DOI represents all versions, and will always resolve to the latest one.

Copyright

© 2023, Senn, Dold et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 1,692
    views
  • 137
    downloads
  • 0
    citations

Views, downloads and citations are aggregated across all versions of this paper published by eLife.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Walter Senn
  2. Dominik Dold
  3. Akos F Kungl
  4. Benjamin Ellenberger
  5. Jakob Jordan
  6. Yoshua Bengio
  7. João Sacramento
  8. Mihai A Petrovici
(2024)
A neuronal least-action principle for real-time learning in cortical circuits
eLife 12:RP89674.
https://doi.org/10.7554/eLife.89674.3

Share this article

https://doi.org/10.7554/eLife.89674

Further reading

    1. Neuroscience
    Cristina Gil Avila, Elisabeth S May ... Markus Ploner
    Research Article

    Chronic pain is a prevalent and debilitating condition whose neural mechanisms are incompletely understood. An imbalance of cerebral excitation and inhibition (E/I), particularly in the medial prefrontal cortex (mPFC), is believed to represent a crucial mechanism in the development and maintenance of chronic pain. Thus, identifying a non-invasive, scalable marker of E/I could provide valuable insights into the neural mechanisms of chronic pain and aid in developing clinically useful biomarkers. Recently, the aperiodic component of the electroencephalography (EEG) power spectrum has been proposed to represent a non-invasive proxy for E/I. We, therefore, assessed the aperiodic component in the mPFC of resting-state EEG recordings in 149 people with chronic pain and 115 healthy participants. We found robust evidence against differences in the aperiodic component in the mPFC between people with chronic pain and healthy participants, and no correlation between the aperiodic component and pain intensity. These findings were consistent across different subtypes of chronic pain and were similarly found in a whole-brain analysis. Their robustness was supported by preregistration and multiverse analyses across many different methodological choices. Together, our results suggest that the EEG aperiodic component does not differentiate between people with chronic pain and healthy individuals. These findings and the rigorous methodological approach can guide future studies investigating non-invasive, scalable markers of cerebral dysfunction in people with chronic pain and beyond.

    1. Neuroscience
    Claire Meissner-Bernard, Friedemann Zenke, Rainer W Friedrich
    Research Article

    Biological memory networks are thought to store information by experience-dependent changes in the synaptic connectivity between assemblies of neurons. Recent models suggest that these assemblies contain both excitatory and inhibitory neurons (E/I assemblies), resulting in co-tuning and precise balance of excitation and inhibition. To understand computational consequences of E/I assemblies under biologically realistic constraints we built a spiking network model based on experimental data from telencephalic area Dp of adult zebrafish, a precisely balanced recurrent network homologous to piriform cortex. We found that E/I assemblies stabilized firing rate distributions compared to networks with excitatory assemblies and global inhibition. Unlike classical memory models, networks with E/I assemblies did not show discrete attractor dynamics. Rather, responses to learned inputs were locally constrained onto manifolds that ‘focused’ activity into neuronal subspaces. The covariance structure of these manifolds supported pattern classification when information was retrieved from selected neuronal subsets. Networks with E/I assemblies therefore transformed the geometry of neuronal coding space, resulting in continuous representations that reflected both relatedness of inputs and an individual’s experience. Such continuous representations enable fast pattern classification, can support continual learning, and may provide a basis for higher-order learning and cognitive computations.