Abstract
Synaptic connections in many brain circuits fluctuate, exhibiting substantial turnover and remodelling over hours to days. Surprisingly, experiments show that most of this flux in connectivity persists in the absence of learning or known plasticity signals. How can neural circuits retain learned information despite a large proportion of ongoing and potentially disruptive synaptic changes? We address this question from first principles by analysing how much compensatory plasticity would be required to optimally counteract ongoing fluctuations, regardless of whether fluctuations are random or systematic. Remarkably, we find that the answer is largely independent of plasticity mechanisms and circuit architectures: compensatory plasticity should be at most equal in magnitude to fluctuations, and often less, in direct agreement with previously unexplained experimental observations. Moreover, our analysis shows that a high proportion of learning-independent synaptic change is consistent with plasticity mechanisms that accurately compute error gradients.
Introduction
Learning depends upon systematic changes to the connectivity and strengths of synapses in neural circuits. This has been shown across experimental systems (Moczulska et al., 2013; Lai et al., 2012; Hayashi-Takagi et al., 2015) and is assumed by most theories of learning (Hebb, 1949; Bienenstock et al., 1982; Gerstner et al., 1996).
Neural circuits are required not only to learn, but also to retain previously learned information. One might therefore expect synaptic stability in the absence of an explicit learning signal. However, many recent experiments in multiple brain areas have documented substantial ongoing synaptic modification in the absence of any obvious learning or change in behaviour (Attardo et al., 2015; Pfeiffer et al., 2018; Holtmaat et al., 2005; Loewenstein et al., 2015; Yasumatsu et al., 2008; Loewenstein et al., 2011).
This ongoing synaptic flux is heterogeneous in its magnitude and form. For instance, the expected lifetime of dendritic spines in mouse CA1 hippocampus has been estimated as 1–2 weeks (Attardo et al., 2015). Elsewhere in the brain, over 70% of spines in mouse barrel cortex are found to persist for 18 months (Zuo et al., 2005), although these persistent spines exhibited large deviations in size over the imaging period (on average, a >25% deviation in spine head diameter).
The sources of these ongoing changes remain unaccounted for, but are hypothesised to fall into systematic changes associated with learning, development and homeostatic maintenance, and unsystematic changes due to random turnover (Rule et al., 2019; Mongillo et al., 2017; Ziv and Brenner, 2018). A number of experimental studies have attempted to disambiguate and quantify the contributions of different biological processes to overall synaptic changes, either by directly interfering with synaptic plasticity, or by correlating changes to circuit-wide measurements of ongoing physiological activity (Nagaoka et al., 2016; Quinn et al., 2019; Yasumatsu et al., 2008; Minerbi et al., 2009; Dvorkin and Ziv, 2016). Consistently, these studies find that the total rate of ongoing synaptic change is reduced by only 50% or less in the absence of neural activity or when plasticity pathways are blocked.
Thus, the bulk of steady-state synaptic changes seem to arise from fluctuations that are independent of activity patterns at pre- and postsynaptic neurons or known plasticity induction pathways. As such, it seems unlikely that their source is some external learning signal or internal reconsolidation mechanism. This is surprising, because maintenance of neural circuit properties and learned behaviour would intuitively require changes across synapses to be highly coordinated. To our knowledge, there is no theoretical account or model prediction that explains these observations.
One way of reconciling stable circuit function with unstable synapses is to assume that ongoing synaptic changes are localised to ‘unimportant’ synapses, which do not affect circuit function. While this may hold in particular circuits and contexts (Mongillo et al., 2017), at least some of the ongoing synaptic changes are likely associated with ongoing learning, which must somehow affect overall circuit function to be effective (Rule et al., 2020). Furthermore, this model does not account for the dominant contribution of fluctuations among those synapses that do not remain stable over time.
In this work we explore another, non-mutually exclusive hypothesis: that active plasticity mechanisms continually maintain the overall function of a neural circuit by compensating for changes that degrade memories and learned task performance. This fits within the broad framework of memory maintenance via internal replay and reconsolidation, a widely hypothesised class of mechanisms for which there is substantial evidence (Carr et al., 2011; Foster, 2017; Nader and Einarsson, 2010; Tronson and Taylor, 2007).
Compensatory plasticity can be induced by external reinforcement signals (Kappel et al., 2018), interactions between different brain areas and circuits (Acker et al., 2018), or spontaneous, network-level reactivation events (Fauth and van Rossum, 2019). Either way, we can conceptually divide plasticity processes into two types: those that degrade previously learned information, and those that protect against such degradation. We will typically refer to memory-degrading processes as ‘fluctuations’. While these may be stochastic in origin, for example due to intrinsic molecular noise in synapses, we do not demand that this is the case. Fluctuations will therefore account for any synaptic change, random or systematic, that disrupts stored information.
The central question we address in this work is how compensatory plasticity should act in order to optimally maintain stored information at the circuit level, in the presence of ongoing synaptic fluctuations. To do this, we develop a general modelling framework and conduct a first-principles mathematical analysis that is independent of specific plasticity mechanisms and circuit architectures. We find that the rate of compensatory plasticity should not exceed that of the synaptic fluctuations, in direct agreement with experimental measurements. Moreover, fluctuations should dominate as the precision of compensatory plasticity mechanisms increases, where ‘precision’ is defined as the quality of approximation of an error gradient. This provides a potential means of accounting for differences in relative magnitudes of fluctuations in different neural circuits. We validate our theoretical predictions through simulation. Together, our results explain a number of consistent but puzzling experimental findings by developing the hypothesis that synaptic plasticity is optimised for dynamic maintenance of learned information.
Results
Review of key experimental findings
To motivate the main analysis in this paper we begin with a brief survey of quantitative, experimental measurements of ongoing synaptic dynamics. These studies, summarised in Table 1, provide quantifications of the rates of systematic/activity-dependent plasticity relative to ongoing synaptic fluctuations.
We focused on studies that measured ‘baseline’ synaptic changes that occur outside of any behavioural learning paradigm, and which controlled for stimuli that may induce widespread changes in synaptic strength. The approaches fall into two categories:
Those that chemically suppress neural activity, and/or block known synaptic plasticity pathways, quantifying consequent changes in the rate of synaptic dynamics, in vitro (Yasumatsu et al., 2008; Minerbi et al., 2009; Quinn et al., 2019) and in vivo (Nagaoka et al., 2016). The latter study included a challenging experiment in which neural activity was pharmacologically suppressed in the visual cortex of mice raised in visually enriched conditions.
Those that compare ‘redundant’ synapses sharing pre- and postsynaptic neurons, and quantify the proportion of synaptic strength changes attributable to spontaneous processes independent of their shared activity history. These included in vitro studies that involved precise longitudinal imaging of dendritic spines in cultured cortical neurons (Dvorkin and Ziv, 2016). They also included in vivo studies that used electron microscopy to reconstruct and compare the sizes of redundant synapses post mortem (Kasthuri et al., 2015).
The studies in Table 1 consistently report that the main component (more than 50%) of baseline synaptic dynamics is due to synaptic fluctuations that are independent of neural activity and/or easily identifiable plasticity signals. This is surprising because such a large contribution of fluctuations might be expected to disrupt circuit function. A key question that we address in this study is whether such a large relative magnitude of fluctuations can be accounted for from first principles, assuming that neural circuits need to protect overall function against perturbations.
The hypothesis we adopt is that some active plasticity mechanism compensates for the degradation of a learned memory trace or circuit function caused by ongoing synaptic fluctuations. We will thus express overall plasticity as a combination of synaptic fluctuations (task-independent processes that degrade memory quality) and compensatory plasticity, which counteracts this effect. There are various ways such a compensatory mechanism might access information on the integrity of overall circuit function, memory quality or ‘task performance’. It could use external reinforcement signals (Kappel et al., 2018; Rule et al., 2020). Alternatively, such information could come from another brain region, as hypothesised for example by Acker et al., 2018, where cortical memories are stabilised by hippocampal replay events. Spontaneous, network-level reactivation events internal to the neural circuit itself could also plausibly induce performance-increasing plasticity (Fauth and van Rossum, 2019). Regardless, the decomposition of total ongoing plasticity into fluctuations and systematic plasticity allows us to derive relationships between the two that are independent of the underlying mechanisms, which are not the focus of this study.
We must acknowledge that it is difficult, experimentally, to pin down and control for all physiological factors that regulate synaptic changes, or indeed to measure such changes accurately. However, even if one does not take the observations in Table 1 – or their interpretation – at face value, the conceptual question we ask remains relevant for any neural circuit that needs to retain information in the face of ongoing synaptic change.
Modelling setup
Suppose a neural circuit is maintaining previously learned information on a task. The circuit is subject to task-independent synaptic fluctuations which can degrade the quality of learned information. Meanwhile, some compensatory plasticity mechanism counteracts this degradation. Throughout this paper, we treat ‘memory’ and ‘task performance’ as interchangeable because our framework analyses the effect of synaptic weight change on overall circuit function. In this context, we ask:
if a network optimally maintains learned task performance, what rate of compensatory plasticity is required relative to the rate of synaptic fluctuations?
By ‘rate’ we mean magnitude of change in a given time interval. Our setup is depicted in Figure 1. We make the following assumptions, which are also stated mathematically in Box 1:
The neural network has $N$ adaptive elements that we call ‘synaptic weights’ for convenience, although they could include parameters controlling intrinsic neural excitability. We represent these elements through a vector $\mathbf{\mathbf{w}}(t)$, which we call the neural network state. Changes to $\mathbf{\mathbf{w}}(t)$ correspond to plasticity.
Any state $\mathbf{\mathbf{w}}(t)$ is associated with a quantifiable (scalar) level of task error, denoted $F[\mathbf{\mathbf{w}}(t)]$, and called the loss function. A higher value of $F[\mathbf{\mathbf{w}}(t)]$ implies greater corruption of previously learned information.
The network state can be varied continuously. Task error varies smoothly with respect to changes in $\mathbf{\mathbf{w}}(t)$.
At any point in time, we can represent the rate of change (i.e. time-derivative) of the synaptic weights as
$\dot{\mathbf{w}}(t)=\dot{\mathbf{c}}(t)+\dot{\boldsymbol{\epsilon}}(t),$
where $\dot{\mathbf{c}}(t)$ and $\dot{\boldsymbol{\epsilon}}(t)$ correspond to compensatory plasticity and synaptic fluctuations, respectively, as discussed previously.
The magnitude and direction of plasticity may or may not change continually over time. Correspondingly, we may pick an appropriately small time interval, $\Delta t$ (which is not necessarily infinitesimally small), over which the directions of plasticity can be assumed constant, and write
$\Delta \mathbf{w}(t)=\Delta \mathbf{c}(t)+\Delta \boldsymbol{\epsilon}(t), \qquad (1)$
where for any time-dependent variable $x(t)$, we use the notation $\Delta x(t):=x(t+\Delta t)-x(t)$. We regard $\Delta \mathbf{c}(t)$ and $\Delta \boldsymbol{\epsilon}(t)$ as coming from unknown probability distributions, which obey the following constraints:
Synaptic fluctuations $\Delta \boldsymbol{\epsilon}(t)$: We want to capture ‘task-independent’ plasticity mechanisms. As such, we demand that the probability of the mechanism increasing or decreasing any particular synaptic weight over $\Delta t$ is independent of whether such a change increases or decreases task error. A trivial example would be white noise, but systematic mechanisms, such as homeostatic plasticity, could also contribute (O’Leary, 2018; O'Leary and Wyllie, 2011).
Compensatory plasticity $\Delta \mathbf{c}(t)$: We demand that compensatory plasticity mechanisms change the network state in a direction of decreasing task error, on average. As such, they cause the network to preserve previously stored information, though not in general by restoring synaptic weights to their previous values following a perturbation.
Box 1.
Mathematical assumptions on plasticity.
To quantify memory quality/task performance we consider a loss function $F[\mathbf{w}(t^{*})]$, which is twice differentiable in $\mathbf{w}$. This loss function is simply an implicit measure of memory quality; we do not assume that the network explicitly represents $F$, or has direct access to it. Consider an infinitesimal weight change $\Delta \mathbf{w}=\Delta \mathbf{c}+\Delta \boldsymbol{\epsilon}$ over the infinitesimal time interval $\Delta t$. We apply a second-order Taylor expansion to express the consequent change in task error, $\Delta F=F[\mathbf{w}(t^{*})+\Delta \mathbf{w}]-F[\mathbf{w}(t^{*})]$:
$\Delta F=\nabla F[\mathbf{w}(t^{*})]^{T}(\Delta \mathbf{c}+\Delta \boldsymbol{\epsilon})+\frac{1}{2}(\Delta \mathbf{c}+\Delta \boldsymbol{\epsilon})^{T}\nabla^{2}F[\mathbf{w}(t^{*})](\Delta \mathbf{c}+\Delta \boldsymbol{\epsilon})+\mathcal{O}(\|\Delta \mathbf{c}+\Delta \boldsymbol{\epsilon}\|_{2}^{3}). \qquad (2)$
Here, $\nabla F[\mathbf{w}(t^{*})]$ and $\nabla^{2}F[\mathbf{w}(t^{*})]$ represent the first two derivatives (gradient and Hessian) of $F[\mathbf{w}(t^{*})]$, with respect to a change in the weights $\mathbf{w}(t^{*})$. We assume that $\Delta \mathbf{c}$ and $\Delta \boldsymbol{\epsilon}$ are sufficiently small (due to the short time interval) that the third-order term $\mathcal{O}(\|\Delta \mathbf{c}+\Delta \boldsymbol{\epsilon}\|_{2}^{3})$ can be ignored.
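The second-order expansion can be illustrated numerically. The following is a minimal sketch, assuming a quadratic loss with a diagonal Hessian, for which the expansion is exact; all numerical values and the names `lam`, `dc` and `de` are hypothetical and chosen only for illustration. It splits the change in task error into a first-order term, the separate curvature contributions of the compensatory change and the fluctuation, and the cross term coupling the two.

```python
# Hypothetical toy values; none of these numbers come from the paper.
lam = [0.5, 2.0, 5.0]        # diagonal of the Hessian of F (assumed)
w = [1.0, -1.0, 0.4]         # current network state
dc = [-0.05, 0.10, -0.08]    # compensatory change over the interval
de = [0.03, 0.02, -0.04]     # synaptic fluctuation over the interval

def loss(weights):
    """Quadratic task error F(w) = 0.5 * sum_i lam_i * w_i^2."""
    return 0.5 * sum(l * x * x for l, x in zip(lam, weights))

def quad(a, b):
    """Bilinear form a^T H b for the diagonal Hessian H = diag(lam)."""
    return sum(l * x * y for l, x, y in zip(lam, a, b))

grad = [l * x for l, x in zip(lam, w)]      # gradient of F at w
dw = [c + e for c, e in zip(dc, de)]        # total weight change

first_order = sum(g * d for g, d in zip(grad, dw))
own_curvature = 0.5 * quad(dc, dc) + 0.5 * quad(de, de)
cross_term = quad(dc, de)                   # couples the two plasticity sources

predicted = first_order + own_curvature + cross_term
exact = loss([x + d for x, d in zip(w, dw)]) - loss(w)
downhill = sum(g * c for g, c in zip(grad, dc))   # negative: dc reduces error to first order
```

For a loss that is not quadratic, `predicted` and `exact` would differ by the neglected third-order term.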
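Next, we assume that $\Delta \mathbf{c}$ and $\Delta \boldsymbol{\epsilon}$ are generated from unknown probability distributions. We place some constraints on these distributions. Firstly, synaptic fluctuations should be uncorrelated, in expectation, with the derivatives of $F[\mathbf{w}]$, which govern learning. Accordingly,
$\mathbb{E}[\nabla F[\mathbf{w}(t^{*})]^{T}\Delta \boldsymbol{\epsilon}]=0, \qquad (3a)$
$\mathbb{E}[\Delta \mathbf{c}^{T}\nabla^{2}F[\mathbf{w}(t^{*})]\Delta \boldsymbol{\epsilon}]=0. \qquad (3b)$
Secondly, we require that $\Delta \mathbf{c}$ points in a direction of plasticity that decreases task error, for sufficiently small $\Delta t$:
$\Delta \mathbf{c}^{T}\nabla F[\mathbf{w}(t^{*})]\le 0. \qquad (3c)$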
Motivating example
Having described a generic modelling framework, we next uncover a key observation using a simple simulation.
Figure 1 depicts an abstract, artificial neural network trying to maintain a given input-output mapping over time, which is analogous to preservation of a memory trace or learned task. At every timestep, synaptic fluctuations corrupt the weights, and a compensatory plasticity mechanism acts to reduce any error in the input-output mapping (see Equation (1)). We fix the rate (i.e. magnitude per timestep) of synaptic fluctuations throughout. We increase the compensatory plasticity rate in stages, ranging from a level far below the synaptic fluctuation rate, to a level far above it. Each stage is maintained so that task error can settle to a steady state.
Two interesting phenomena emerge. First, the task error of the network is smallest when the compensatory plasticity rate is smaller than the synaptic fluctuation rate (Figure 1b). Second, individual weights in the network continually change even as overall task error remains stable, owing to redundancy in the weight configuration (Figure 1c; see e.g. Rule et al., 2019 for a review).
In this simple simulation, we made a number of arbitrary and non-biologically motivated choices. In particular, we used an abstract, rate-based network, and synthesised compensatory plasticity directions using the biologically questionable backpropagation rule (see Materials and methods for full simulation details). Nevertheless, Figure 1 highlights a phenomenon that we claim is more general:
The ‘sweet-spot’ compensatory plasticity rate that leads to optimal, steady-state retention of previously learned information is at most equal to the rate of synaptic fluctuations, and often less.
In the remainder of the results section, we will build intuition as to when and why this claim holds. We will also explore factors that influence the precise ‘sweet-spot’ compensatory plasticity rate.
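The qualitative claim can be checked in a much simpler setting than the network of Figure 1. The sketch below is not the simulation described in Materials and methods. It assumes a quadratic loss with a diagonal Hessian (`LAM`), plain gradient descent with a hypothetical learning rate `eta` as the compensatory mechanism, and independent Gaussian fluctuations with standard deviation `SIGMA` per weight per step. Under these assumptions each weight follows a first-order autoregressive process whose stationary variance has a standard closed form, so the steady-state task error and the ratio of compensatory to fluctuation magnitudes can be computed without sampling.

```python
import math

# Hypothetical toy: F(w) = 0.5 * sum_i LAM[i] * w[i]**2.
# Per step: w <- w - eta * grad F(w) + noise, with noise_i ~ N(0, SIGMA^2).
LAM = [1.0, 4.0]   # Hessian eigenvalues (assumed values)
SIGMA = 0.1        # per-weight fluctuation std per step (assumed value)

def steady_state(eta, lam=LAM, sigma=SIGMA):
    """Return (mean task error, RMS |dc| / RMS |d_eps|) at steady state."""
    # Each coordinate obeys w_i <- (1 - eta*lam_i) * w_i + noise, an AR(1)
    # process with stationary variance sigma^2 / (1 - (1 - eta*lam_i)^2).
    var = [sigma**2 / (1.0 - (1.0 - eta * l) ** 2) for l in lam]
    error = 0.5 * sum(l * v for l, v in zip(lam, var))
    comp_sq = sum((eta * l) ** 2 * v for l, v in zip(lam, var))
    fluct_sq = len(lam) * sigma**2
    return error, math.sqrt(comp_sq / fluct_sq)

# Sweep the compensatory rate over the stable range eta * max(LAM) < 2.
etas = [k / 100 for k in range(1, 50)]
best_eta = min(etas, key=lambda e: steady_state(e)[0])
best_error, best_ratio = steady_state(best_eta)
```

In this toy setting, `best_ratio` falls below one, while learning rates at either end of the sweep give a larger steady-state error. Setting all entries of `LAM` equal removes the anisotropy and moves the optimum to a ratio of one.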
The loss landscape
In order to analyse a general learning scenario that can accommodate biologically relevant assumptions about synaptic plasticity, we will develop a few general mathematical constructs that will allow us to draw conclusions about how synaptic weights affect the overall function of a network.
We first describe the ‘loss landscape’: a conceptually useful, geometrical visualisation of task error $F[\mathbf{w}]$ (see also Figure 2). Every point on the landscape corresponds to a different network state $\mathbf{w}$. Whereas any point on a standard three-dimensional landscape has two lateral (xy) coordinates, any point on the loss landscape has $N$ coordinates representing each synaptic strength. Plasticity changes $\mathbf{w}$, and thus corresponds to movement on the landscape. Any movement $\Delta \mathbf{w}$ has both a direction $\widehat{\Delta \mathbf{w}}$ (where hats denote normalised vectors), and a magnitude $\|\Delta \mathbf{w}\|_{2}$. Meanwhile, the elevation of a point $\mathbf{w}$ on the landscape represents the degree of task error, $F[\mathbf{w}]$. Compensatory plasticity improves task error, and thus moves downhill, regardless of the underlying plasticity mechanism.
Understanding curvature in the loss landscape
Intuitively, one would expect task-independent synaptic fluctuations to increase task error. This is true even if fluctuations are unbiased in moving in an uphill or downhill direction on the loss landscape (see Equation (3a)), due to the curvature of the landscape (see Figure 2c). For instance, the slope (mathematically represented by the gradient $\nabla F[\mathbf{w}]$) at the bottom of a valley is zero. However, every direction is positively curved, so any movement leads uphill. More generally, consider a fluctuation that is unbiased in selecting uphill or downhill directions, at a network state $\mathbf{w}$. The fluctuation will increase task error in expectation if the total curvature of the upwardly curved directions at $\mathbf{w}$ exceeds that of the downwardly curved directions, as illustrated in Figure 2c. We refer to such a state as partially trained. If all directions are upwardly curved, such as at/near the bottom of a valley, we refer to the state as highly trained. Mathematical definitions for these terms are provided in Box 2.
Box 2.
Curvature and the loss landscape.
Consider a fluctuation $\Delta \mathbf{w}$ at a state $\mathbf{w}$. The change in task error, to second order, can be written as
$\Delta F=\nabla F[\mathbf{w}]^{T}\Delta \mathbf{w}+\frac{1}{2}\Delta \mathbf{w}^{T}\nabla^{2}F[\mathbf{w}]\Delta \mathbf{w} \qquad (4)$
via a Taylor expansion. Suppose the fluctuation is task-independent, so that it is unbiased with respect to selecting uphill/downhill, and more/less curved, directions on the loss landscape. In this case, for a fluctuation of given magnitude,
$\mathbb{E}[\nabla F[\mathbf{w}]^{T}\Delta \mathbf{w}]=0, \qquad \mathbb{E}[\Delta \mathbf{w}^{T}\nabla^{2}F[\mathbf{w}]\Delta \mathbf{w}]=\frac{\|\Delta \mathbf{w}\|_{2}^{2}}{N}Tr(\nabla^{2}F[\mathbf{w}]).$
In expectation, Equation (4) thus becomes
$\mathbb{E}[\Delta F]=\frac{\|\Delta \mathbf{w}\|_{2}^{2}}{2N}Tr(\nabla^{2}F[\mathbf{w}]).$
If $Tr(\nabla^{2}F[\mathbf{w}])>0$, then the expected change in task error is positive, and we refer to the network state as ‘partially trained’. If additionally, $\nabla^{2}F[\mathbf{w}]\succeq 0$, that is, $\Delta \mathbf{w}^{T}\nabla^{2}F[\mathbf{w}]\Delta \mathbf{w}\ge 0$ for any choice of $\Delta \mathbf{w}$, then we refer to the network as highly trained. The ‘highly trained’ condition always holds in a neighbourhood of a local minimum of task error.
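Comparison of the upward curvature of different plasticity directions plays an important role in the remainder of the section. Therefore, we introduce the following operator:
$Q_{\mathbf{w}}[\mathbf{v}]:=\frac{\mathbf{v}^{T}\nabla^{2}F[\mathbf{w}]\mathbf{v}}{\|\mathbf{v}\|_{2}^{2}}.$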
${Q}_{\mathbf{\mathbf{w}}}[\mathbf{\mathbf{v}}]$ is mathematical shorthand for the degree of curvature in the direction $\mathbf{\mathbf{v}}$, at point $\mathbf{\mathbf{w}}$ on the loss landscape, and is depicted in Figure 3a. Note that ${Q}_{\mathbf{\mathbf{w}}}[\mathbf{\mathbf{v}}]$ depends solely upon the direction, and not the magnitude, of $\mathbf{\mathbf{v}}$.
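As an illustration of the curvature operator, the following minimal sketch evaluates the normalised quadratic form $\mathbf{v}^{T}\nabla^{2}F\,\mathbf{v}/\|\mathbf{v}\|_{2}^{2}$ for a small, hypothetical Hessian `H` (the matrix entries and the test vector `v` are arbitrary assumptions). It checks that the result is unchanged when the vector is rescaled, and that averaging the curvature over the coordinate axes, one simple set of directions with no preference for more or less curved directions, recovers the trace of the Hessian divided by the number of weights.

```python
def curvature(hessian, v):
    """Curvature along v: v^T H v / ||v||^2, with H given as a list of rows."""
    n = len(v)
    hv = [sum(hessian[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * hv[i] for i in range(n)) / sum(x * x for x in v)

# Hypothetical symmetric Hessian; the values are for illustration only.
H = [[2.0, 0.5, 0.0],
     [0.5, 1.0, 0.0],
     [0.0, 0.0, 6.0]]

v = [1.0, -2.0, 0.5]
q_v = curvature(H, v)
q_scaled = curvature(H, [10.0 * x for x in v])   # same direction, larger magnitude

# Average curvature over the coordinate axes versus Tr(H) / N.
n = len(H)
axes = [[1.0 if i == k else 0.0 for i in range(n)] for k in range(n)]
mean_axis_q = sum(curvature(H, e) for e in axes) / n
trace_over_n = sum(H[i][i] for i in range(n)) / n
```

Here the direction along the third axis has curvature 6, twice the average, so a step of a given size along it raises task error more than an average step would.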
An expression for the optimal degree of compensatory plasticity during learning
The rates of compensatory plasticity and synaptic fluctuations, at time $t$, are $\dot{\mathbf{c}}(t)$ and $\dot{\boldsymbol{\epsilon}}(t)$, respectively. These rates may change continually over time. Let’s temporarily assume they are fixed over a small time interval $[t,t+\Delta t]$. Thus,
$\Delta \mathbf{c}=\dot{\mathbf{c}}(t)\Delta t, \qquad \Delta \boldsymbol{\epsilon}=\dot{\boldsymbol{\epsilon}}(t)\Delta t. \qquad (5)$
What magnitude of compensatory plasticity, $\|\Delta \mathbf{c}\|_{2}$, most decreases task error over $\Delta t$? The answer is
$\|\Delta \mathbf{c}\|_{2}^{*}=\frac{-\|\nabla F[\mathbf{w}]\|_{2}\,\Delta \widehat{\mathbf{c}}^{T}\nabla \widehat{F}[\mathbf{w}]}{Q_{\mathbf{w}}[\Delta \mathbf{c}]}. \qquad (6)$
A mathematical derivation is contained in Box 3, with geometric intuition in Figure 3b. Note that our answer turns out to be independent of the synaptic fluctuation rate $\dot{\boldsymbol{\epsilon}}(t)$. Here,
${\parallel \nabla F[\mathbf{\mathbf{w}}]\parallel}_{2}$ represents the sensitivity of the task error to changes (i.e. the steepness of the loss landscape).
$\mathrm{\Delta}{\widehat{\mathbf{\mathbf{c}}}}^{T}\nabla \widehat{F}[\mathbf{\mathbf{w}}]$ represents the accuracy of the compensatory plasticity direction in conforming to the steepest downhill direction on the loss landscape (in particular, their normalised correlation).
${Q}_{\mathbf{\mathbf{w}}}[\mathrm{\Delta}\mathbf{\mathbf{c}}]$ represents the upward curvature of the compensatory plasticity direction. As shown in Figure 3b, excessive plasticity in an upwardly curved, but downhill, direction, can eventually increase task error. Thus, upward curvature limits the ideal magnitude of compensatory plasticity in the direction $\mathrm{\Delta}\widehat{\mathbf{\mathbf{c}}}$.
Box 3.
Optimal magnitude of compensatory plasticity.
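Let us rewrite Equation (2), using the operator $Q$ and omitting higher order terms, as justified in Box 1:
$\Delta F=\nabla F[\mathbf{w}(t^{*})]^{T}(\Delta \mathbf{c}+\Delta \boldsymbol{\epsilon})+\frac{1}{2}\|\Delta \mathbf{c}\|_{2}^{2}Q_{\mathbf{w}(t^{*})}[\Delta \mathbf{c}]+\frac{1}{2}\|\Delta \boldsymbol{\epsilon}\|_{2}^{2}Q_{\mathbf{w}(t^{*})}[\Delta \boldsymbol{\epsilon}]+\Delta \mathbf{c}^{T}\nabla^{2}F[\mathbf{w}(t^{*})]\Delta \boldsymbol{\epsilon}. \qquad (7)$
We can substitute our assumptions on synaptic fluctuations (Equations (3)) into Equation (7) to get
$\mathbb{E}[\Delta F]=\nabla F[\mathbf{w}(t^{*})]^{T}\Delta \mathbf{c}+\frac{1}{2}\|\Delta \mathbf{c}\|_{2}^{2}Q_{\mathbf{w}(t^{*})}[\Delta \mathbf{c}]+\frac{1}{2}\|\Delta \boldsymbol{\epsilon}\|_{2}^{2}Q_{\mathbf{w}(t^{*})}[\Delta \boldsymbol{\epsilon}]. \qquad (8)$
Note that the requirement for assumption (3b) can be removed, but the alternative resulting derivation is more involved (see SI section two for this alternative).
We can differentiate Equation (8) in $\|\Delta \mathbf{c}\|_{2}$, writing $\Delta \mathbf{c}=\|\Delta \mathbf{c}\|_{2}\Delta \widehat{\mathbf{c}}$, to get:
$\frac{d\,\mathbb{E}[\Delta F]}{d\|\Delta \mathbf{c}\|_{2}}=\nabla F[\mathbf{w}(t^{*})]^{T}\Delta \widehat{\mathbf{c}}+\|\Delta \mathbf{c}\|_{2}\,Q_{\mathbf{w}(t^{*})}[\Delta \mathbf{c}].$
The root of this derivative gives a global minimum of Equation (8) in $\|\Delta \mathbf{c}\|_{2}$, as long as $Q_{\mathbf{w}(t^{*})}[\Delta \mathbf{c}]\ge 0$ holds (justified in SI section 2.1). We get Equation (6), which defines the compensatory plasticity magnitude that minimises $\Delta F$, and thus overall task error, at time $t^{*}+\Delta t$.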
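The optimal magnitude can be checked numerically in a toy setting. The sketch below assumes a quadratic loss with a hypothetical diagonal Hessian `lam` and an imperfect descent direction obtained by zeroing one component of the negative gradient; these choices are illustrative and are not taken from the paper. It compares the closed-form optimum, the downhill slope along the direction divided by the curvature along it, with a brute-force scan over step magnitudes. Fluctuations are omitted because, under the stated assumptions, they add a term that does not depend on the compensatory magnitude.

```python
import math

lam = [1.0, 3.0, 9.0]     # hypothetical diagonal Hessian entries
w = [1.0, -0.5, 0.2]      # hypothetical current weights
grad = [l * x for l, x in zip(lam, w)]   # gradient of 0.5 * sum(lam * w^2)

# Imperfect descent direction: negative gradient with the last component zeroed.
c = [-grad[0], -grad[1], 0.0]
norm_c = math.sqrt(sum(x * x for x in c))
c_hat = [x / norm_c for x in c]

slope = sum(g * x for g, x in zip(grad, c_hat))     # grad^T c_hat, negative when downhill
q_c = sum(l * x * x for l, x in zip(lam, c_hat))    # curvature along c_hat (diagonal H)

def delta_f(m):
    """Change in task error for a step of magnitude m along c_hat."""
    return m * slope + 0.5 * m * m * q_c

m_star = -slope / q_c                       # closed-form optimal magnitude
grid = [k / 1000 for k in range(0, 3001)]   # brute-force scan, step 0.001
m_grid = min(grid, key=delta_f)
```

Steps larger than twice `m_star` overshoot far enough along the upwardly curved direction that task error increases, which is the effect sketched geometrically in Figure 3b.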
For now, Equation (6) is valid only if the compensatory plasticity direction is fixed during $\Delta t$. If we want Equation (6) to also be compatible with continually changing compensatory plasticity directions, it needs to be valid for an arbitrarily small $\Delta t$. However, enacting a non-negligible magnitude $\|\Delta \mathbf{c}\|_{2}^{*}$ of plasticity over an arbitrarily small time interval $\Delta t$ would require an unattainable, ‘infinitely fast’ plasticity rate.
In fact, we show in the next section that our expression for $\|\Delta \mathbf{c}\|_{2}^{*}$ does become compatible with continuously changing plasticity at the end of learning, when task error is stable.
Characterising the optimal rate of compensatory plasticity at steady state
Consider a scenario where task error is approximately stable. In this case, $\Delta F\approx 0$ over $\Delta t$. In this scenario, Equation (6) simplifies to
$\frac{\|\Delta \mathbf{c}\|_{2}^{*}}{\|\Delta \boldsymbol{\epsilon}\|_{2}}=\sqrt{\frac{Q_{\mathbf{w}}[\Delta \boldsymbol{\epsilon}]}{Q_{\mathbf{w}}[\Delta \mathbf{c}]}}, \qquad (9a)$
as derived in Box 4 and illustrated geometrically in Figure 3c. We see that the magnitude $\|\Delta \mathbf{c}\|_{2}^{*}$ is proportional to $\|\Delta \boldsymbol{\epsilon}\|_{2}$, which is itself proportional to $\Delta t$ from Equation (5), given some fixed rate of synaptic fluctuations. Thus, $\|\Delta \mathbf{c}\|_{2}^{*}$ is attainable even as $\Delta t$ shrinks to zero, and is thus compatible with continually changing compensatory plasticity directions. In this case, Equation (9) can be rewritten as
$\frac{\|\dot{\mathbf{c}}\|_{2}^{*}}{\|\dot{\boldsymbol{\epsilon}}\|_{2}}=\sqrt{\frac{Q_{\mathbf{w}}[\dot{\boldsymbol{\epsilon}}]}{Q_{\mathbf{w}}[\dot{\mathbf{c}}]}}. \qquad (9b)$
Equation (9) is a key result of the paper. It applies regardless of the underlying plasticity mechanisms that induced $\Delta \mathbf{c}$ and $\Delta \boldsymbol{\epsilon}$. It is compatible with continually or occasionally changing directions of compensatory plasticity (i.e. infinitesimal or non-infinitesimal $\Delta t$). It says that the optimal compensatory plasticity rate, relative to the rate of synaptic fluctuations, depends on the relative upward curvature of these two plasticity directions on the loss landscape.
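A corollary is that the optimal rate of compensatory plasticity is greater during learning than at steady state. If we replace the steady-state requirement, $\mathbb{E}[\Delta F]=0$, with the condition for learning, $\mathbb{E}[\Delta F]<0$, in the derivation of Box 4, then we get
$\frac{\|\Delta \mathbf{c}\|_{2}^{*}}{\|\Delta \boldsymbol{\epsilon}\|_{2}}>\sqrt{\frac{Q_{\mathbf{w}}[\Delta \boldsymbol{\epsilon}]}{Q_{\mathbf{w}}[\Delta \mathbf{c}]}}. \qquad (10)$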
Indeed, the faster the optimal potential learning rate $\mathbb{E}[\mathrm{\Delta}F]$, the greater the optimal compensatory plasticity rate. Thus ${\parallel \mathrm{\Delta}\mathbf{\mathbf{c}}\parallel}_{2}^{*}$ decreases as learning slows to a halt, eventually reaching the level of Equation (9b).
Box 4.
Optimal compensatory plasticity magnitude at steady state error.
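Let us substitute the special condition $\mathbb{E}[\Delta F]=0$ (steady-state task error) into Equation (8). This gives
$\nabla F[\mathbf{w}]^{T}\Delta \mathbf{c}+\frac{1}{2}\|\Delta \mathbf{c}\|_{2}^{2}Q_{\mathbf{w}}[\Delta \mathbf{c}]+\frac{1}{2}\|\Delta \boldsymbol{\epsilon}\|_{2}^{2}Q_{\mathbf{w}}[\Delta \boldsymbol{\epsilon}]=0.$
Next, we substitute in our optimal compensatory plasticity magnitude (Equation (6)), for which $\nabla F[\mathbf{w}]^{T}\Delta \mathbf{c}=-(\|\Delta \mathbf{c}\|_{2}^{*})^{2}Q_{\mathbf{w}}[\Delta \mathbf{c}]$. This gives
$(\|\Delta \mathbf{c}\|_{2}^{*})^{2}Q_{\mathbf{w}}[\Delta \mathbf{c}]=\|\Delta \boldsymbol{\epsilon}\|_{2}^{2}Q_{\mathbf{w}}[\Delta \boldsymbol{\epsilon}],$
which in turn implies the result (Equation (9)).
Note that Equation (9) is only valid when the numerator and denominator of the right hand side are both positive. The converse is unlikely in a partially trained network, and impossible in a highly trained network (see SI section 2.1).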
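The steady-state argument of Box 4 can be traced with concrete numbers. The following minimal sketch uses hypothetical values for the fluctuation magnitude `eps_norm` and the two curvatures `q_eps` and `q_c`; the function names are illustrative only. It computes the compensatory magnitude as the fluctuation magnitude scaled by the square root of the curvature ratio, then confirms that, when this magnitude is also the optimal one for the local slope, the expected change in task error vanishes.

```python
import math

def steady_state_magnitude(eps_norm, q_eps, q_c):
    """Optimal ||dc|| at steady state: ||d_eps|| * sqrt(Q[d_eps] / Q[dc])."""
    if q_eps <= 0 or q_c <= 0:
        raise ValueError("both curvatures must be positive")
    return eps_norm * math.sqrt(q_eps / q_c)

def expected_delta_f(c_norm, slope, q_c, eps_norm, q_eps):
    """Expected change in task error; slope is grad^T c_hat (negative when downhill)."""
    return c_norm * slope + 0.5 * c_norm**2 * q_c + 0.5 * eps_norm**2 * q_eps

# Hypothetical values: the compensatory direction is four times more curved.
eps_norm, q_eps, q_c = 0.2, 1.5, 6.0
c_star = steady_state_magnitude(eps_norm, q_eps, q_c)

# Slope for which c_star is the optimal magnitude: c_star = -slope / q_c.
slope = -c_star * q_c
residual = expected_delta_f(c_star, slope, q_c, eps_norm, q_eps)
```

With these numbers the compensatory magnitude is half the fluctuation magnitude, and `residual` is zero up to rounding.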
Main claim
We now claim that generically, the optimal compensatory plasticity rate should not outcompete the rate of synaptic fluctuations at steady state error. We will first provide geometric intuition for our claim, before bolstering with analytical arguments and making precise our notion of ‘generically’.
From Equation (9), our main claim holds if
$Q_{\mathbf{w}}[\Delta \mathbf{c}]\ge Q_{\mathbf{w}}[\Delta \boldsymbol{\epsilon}], \qquad (11)$
that is, $\Delta \mathbf{c}$ points in a more upwardly curved direction than $\Delta \boldsymbol{\epsilon}$. When would this be true?
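First consider $\Delta \boldsymbol{\epsilon}$. Statistical independence from the task error means it should point in an ‘averagely’ curved direction. Mathematically (see SI section 2.1), this means
$\mathbb{E}[Q_{\mathbf{w}}[\Delta \boldsymbol{\epsilon}]]=\frac{Tr(\nabla^{2}F[\mathbf{w}])}{N}. \qquad (12)$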
Our assumption of ‘average’ curvature fails if synaptic fluctuations are specialised to ‘unimportant’ synapses whose changes have little effect on task error. In this case $Q_{\mathbf{w}}[\Delta \boldsymbol{\epsilon}]$ would be even smaller, since $\Delta \boldsymbol{\epsilon}$ would be constrained to consistently shallow, less-curved directions. Thus, this possibility does not interfere with our main claim.
For Equation (11) to hold, $\Delta \mathbf{c}$ should point in directions of ‘more-than-average’ upward curvature. This follows intuitively because a steep downhill direction, which effectively reduces task error, will usually have higher upward curvature than an arbitrary direction (see Figure 3c for intuition). It remains to formalise this argument mathematically, and consider edge cases where it doesn’t hold.
Dependence of the optimal magnitude of steady-state, compensatory plasticity on the mechanism
Compensatory plasticity is analogous to learning, since it acts to reduce task error. We do not yet know the algorithms that neural circuits use to learn, although constructing biologically plausible learning algorithms is an active research area. Nevertheless, all the potential learning algorithms we are aware of fit into three broad categories. For each category, we shall show why and when our main claim holds. We will furthermore investigate quantitative differences in the optimal compensatory plasticity rate, across and within categories. A full mathematical justification of all the assertions we make is found in SI section 1.3.
We first highlight a few general points:
For any compensatory plasticity mechanism, ${Q}_{\mathbf{\mathbf{w}}}[\mathrm{\Delta}\mathbf{\mathbf{c}}]$ depends not only on the algorithm, but the point $\mathbf{\mathbf{w}}$ on the landscape. We cannot ever claim that Equation (11) holds for all network states.
We calculate the expected value of $Q_{\mathbf{w}}[\Delta \mathbf{c}]$ for an ‘average’, trained, state $\mathbf{w}$, across classes of algorithm. This corresponds to a plausible best-case tuning of compensatory plasticity that a neural circuit might be able to achieve. Any improvement would rely on online calculation of $Q_{\mathbf{w}}[\Delta \mathbf{c}]$, which we do not believe would be plausible biologically.
Learning algorithms attempt to move to the bottom of the loss landscape. But they are blind. Spying a distant valley equates to ‘magically’ predicting that a very different network state will have very low task error. How do they find their way downhill? There are three broad strategies (Raman and O'Leary, 2021):
0th-order algorithms take small, exploratory steps in random directions. Information from the change in task error over these steps informs retained changes. For instance, steps that improve task error are retained. A notable 0th-order algorithm is REINFORCE (Williams, 1992). Many computational models of biological learning in different circuits derive from this algorithm (Seung, 2003; Fee and Goldberg, 2011; Bouvier et al., 2018; Kornfeld et al., 2020).
1st-order algorithms explicitly approximate/calculate, and then step down, the locally steepest direction (i.e. the gradient $\nabla F[\mathbf{w}]$). The backpropagation algorithm implements perfect gradient descent. Many approximate gradient descent methods with more biologically plausible assumptions have been developed in the recent literature (see e.g. Murray, 2019; Whittington and Bogacz, 2019; Bellec et al., 2020; Lillicrap et al., 2016; Guerguiev et al., 2017, and Lillicrap et al., 2020 for a review).
2nd-order algorithms additionally approximate/calculate the Hessian $\nabla^{2}F[\mathbf{w}]$, which provides information on local curvature. They look for descent directions that are both steep, and less upwardly curved. We doubt it is possible for biologically plausible learning rules to accurately approximate the Hessian, which has $N^{2}$ entries representing the interaction between every possible pair of synaptic weights.
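The difference between the first two strategies can be sketched on a toy problem. The code below assumes a quadratic loss with hypothetical curvatures `lam`, and the step sizes `scale` and `lr` are arbitrary. The 0th-order step shown is a generic accept-if-better random search, used here only to illustrate the idea of retaining exploratory steps that reduce task error; it is not REINFORCE, nor any of the cited biological models. The 1st-order step follows the exact gradient, which is available in closed form for this loss.

```python
import random

lam = [1.0, 4.0]   # hypothetical curvatures of a quadratic loss

def loss(w):
    """Task error F(w) = 0.5 * sum_i lam_i * w_i^2."""
    return 0.5 * sum(l * x * x for l, x in zip(lam, w))

def zeroth_order_step(w, rng, scale=0.05):
    """Try a random perturbation; retain it only if task error decreases."""
    trial = [x + rng.gauss(0.0, scale) for x in w]
    return trial if loss(trial) < loss(w) else w

def first_order_step(w, lr=0.1):
    """Step down the exact gradient, grad_i = lam_i * w_i."""
    return [x - lr * l * x for l, x in zip(lam, w)]

rng = random.Random(0)   # fixed seed so the run is repeatable
w0 = [1.0, 1.0]

w_zero = w0
for _ in range(200):
    w_zero = zeroth_order_step(w_zero, rng)

w_first = w0
for _ in range(200):
    w_first = first_order_step(w_first)
```

Both procedures reduce task error from the starting point, but only the 1st-order step uses directional information, so it needs no trial-and-error evaluation of the loss.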
Table 2 shows the categories for which our main claim holds.
We first consider the simplest case of a quadratic loss function $F[\mathbf{w}]$. Here, the curvature in any given direction is constant (mathematically, the Hessian ${\nabla}^{2}F[\mathbf{w}]$ does not vary with network state). Moreover, the gradient obeys a consistent relationship with the Hessian:

$$\nabla F[\mathbf{w}]={\nabla}^{2}F[{\mathbf{w}}^{*}]\,(\mathbf{w}-{\mathbf{w}}^{*}). \qquad (13)$$
Components of $(\mathbf{w}-{\mathbf{w}}^{*})$ with high upward curvature are magnified under the transformation ${\nabla}^{2}F[{\mathbf{w}}^{*}]$, since they correspond to eigenvectors of ${\nabla}^{2}F[{\mathbf{w}}^{*}]$ with high eigenvalue. Conversely, components with low upward curvature are shrunk. As the gradient $\nabla F[\mathbf{w}]$ is the output of such a transformation from Equation (13), this suggests it is biased towards directions of high upward curvature. Indeed, we can quantify this bias. Let $\{{\lambda}_{i}\}$ be the eigenvalues of ${\nabla}^{2}F[{\mathbf{w}}^{*}]$, and $\{{c}_{i}\}$ the projections of the corresponding eigenvectors onto $\mathbf{w}-{\mathbf{w}}^{*}$. Then

$${Q}_{\mathbf{w}}[\nabla F[\mathbf{w}]]=\frac{{\sum}_{i}{\lambda}_{i}^{3}{c}_{i}^{2}}{{\sum}_{i}{\lambda}_{i}^{2}{c}_{i}^{2}}. \qquad (14)$$
The value of Equation (14) depends on the values $\{{c}_{i}\}$. In the ‘average’ case, where they are equal, and $\mathbf{w}-{\mathbf{w}}^{*}$ is thus a direction of ‘average’ curvature, ${Q}_{\mathbf{w}}[\nabla F[\mathbf{w}]]\ge {Q}_{\mathbf{w}}[\mathrm{\Delta}\epsilon]$ holds. This inequality gap widens with increasing anisotropy in the curvature of different directions (i.e. with a wider spread of eigenvalues ${\lambda}_{i}$, corresponding to more elliptical/less circular level sets in the illustration of Figure 4b). Indeed, simulation results in Figure 5—figure supplement 1 (top row) show how the ratio ${\parallel \mathrm{\Delta}\mathbf{c}\parallel}_{2}:{\parallel \mathrm{\Delta}\epsilon\parallel}_{2}$ that optimises steady-state task error is significantly less than one, in a quadratic error function where compensatory plasticity accurately follows the gradient, and for different synaptic fluctuation rates.
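The bias of the gradient towards high-curvature directions can be checked numerically. The sketch below assumes a quadratic loss, for which the gradient has components ${\lambda}_{i}{c}_{i}$ in the Hessian eigenbasis, so that the normalised curvature along the gradient is $\sum_i \lambda_i^3 c_i^2 / \sum_i \lambda_i^2 c_i^2$. It compares this with the mean eigenvalue, which is the expected curvature along an unbiased random direction. The function names and the example eigenvalues are illustrative assumptions.

```python
def q_gradient(lam, c):
    """Normalised curvature along the gradient of a quadratic loss.

    lam: Hessian eigenvalues; c: projections of (w - w*) onto the eigenvectors.
    The gradient has components lam_i * c_i in the eigenbasis.
    """
    num = sum(l ** 3 * ci ** 2 for l, ci in zip(lam, c))
    den = sum(l ** 2 * ci ** 2 for l, ci in zip(lam, c))
    return num / den

def q_isotropic(lam):
    """Expected curvature along an unbiased random direction: the mean eigenvalue."""
    return sum(lam) / len(lam)

if __name__ == "__main__":
    equal_c = [1.0, 1.0, 1.0]
    for lam in ([1.0, 1.0, 1.0], [0.5, 1.0, 2.0], [0.1, 1.0, 10.0]):
        # The gap between the two values grows with the spread of eigenvalues.
        print(lam, q_gradient(lam, equal_c), q_isotropic(lam))
```

With equal eigenvalues the two quantities coincide; as the eigenvalue spread increases, the curvature along the gradient pulls away from the average.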
What about the case of a nonlinear loss function? Close to a minimum ${\mathbf{w}}^{*}$, the relationship of Equation (13) approximately holds (the loss function is locally quadratic). So if steady-state error is very low, we can directly transport the intuition of the quadratic case. However, when steady-state error increases, Equation (13) becomes increasingly approximate. In the limiting case, we could consider $\nabla F[\mathbf{w}]$ as being completely uncorrelated with ${\nabla}^{2}F[\mathbf{w}]$, in which case ${Q}_{\mathbf{w}}[\nabla F[\mathbf{w}]]\approx {Q}_{\mathbf{w}}[\mathrm{\Delta}\epsilon]$ would hold. Numerical results in Figure 5 support this assertion in nonlinear networks: the optimal ratio satisfies ${\parallel \mathrm{\Delta}\mathbf{c}\parallel}_{2}:{\parallel \mathrm{\Delta}\epsilon\parallel}_{2}\approx 1$ in conditions where steady-state task error is high, and ${\parallel \mathrm{\Delta}\mathbf{c}\parallel}_{2}:{\parallel \mathrm{\Delta}\epsilon\parallel}_{2}\le 1$ in conditions where it is low.
Overall, we see that if $\mathrm{\Delta}\mathbf{c}\propto -\nabla F[\mathbf{w}]$ (i.e. compensatory plasticity enacts gradient descent), then we would expect compensatory plasticity to be outcompeted by synaptic fluctuations when steady-state error is optimal.
Even if compensatory plasticity does not move in the steepest direction of error decrease (i.e. the error gradient), it must move in an approximate downhill direction to improve task error (see e.g. Raman et al., 2019). Furthermore, the worse the quality of the gradient approximation, the larger the optimal level of compensatory plasticity (illustrated conceptually in Figure 4b–c, and numerically in Figure 5 and Figure 5—figure supplement 1). Why? We can rewrite such a learning rule as
where $\nu $ represents systematic error in the gradient approximation. The upward curvature in the direction $\mathrm{\Delta}\mathbf{\mathbf{c}}$ becomes a (nonlinear) interpolation of the upward curvatures in the directions $\nabla F[\mathbf{\mathbf{w}}]$ and $\nu $ (see Equation (A6) of the SI). As long as $\nu $ is less biased towards high curvature directions than $\nabla F[\mathbf{\mathbf{w}}]$ itself, then this decreases the upward curvature in the direction $\mathrm{\Delta}\widehat{\mathbf{\mathbf{c}}}$, and thus increases the optimal compensatory plasticity rate. Indeed Figure 5 shows in simulation that this rate increases for more inaccurate compensatory plasticity mechanisms.
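The interpolation effect can be illustrated with a small Monte Carlo sketch. It assumes a diagonal Hessian, a plasticity direction of the form $-{\gamma}_{1}\widehat{g}+{\gamma}_{2}\widehat{\nu}$ (unit-norm gradient and unit-norm white noise, mirroring the hyperparameters described in the Methods), and noise drawn from a standard Gaussian. These specifics, and all names below, are assumptions made for illustration.

```python
import math
import random

def curvature(direction, lam):
    """Normalised curvature Q = d^T H d / ||d||^2 for a diagonal Hessian H = diag(lam)."""
    num = sum(l * d * d for l, d in zip(lam, direction))
    den = sum(d * d for d in direction)
    return num / den

def unit(v):
    """Rescale a vector to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def mean_curvature_of_noisy_gradient(lam, c, gamma1, gamma2, trials, rng):
    """Average Q along -gamma1 * g_hat + gamma2 * nu_hat over random noise draws."""
    g_hat = unit([l * ci for l, ci in zip(lam, c)])  # gradient of a quadratic loss
    total = 0.0
    for _ in range(trials):
        nu_hat = unit([rng.gauss(0.0, 1.0) for _ in lam])
        d = [-gamma1 * g + gamma2 * v for g, v in zip(g_hat, nu_hat)]
        total += curvature(d, lam)
    return total / trials

if __name__ == "__main__":
    rng = random.Random(0)
    lam = [0.1, 0.5, 1.0, 2.0, 10.0]
    c = [1.0] * len(lam)
    for gamma2 in (0.0, 0.5, 1.0, 2.0):
        # Curvature along the plasticity direction falls as noise corruption grows.
        print(gamma2, mean_curvature_of_noisy_gradient(lam, c, 1.0, gamma2, 2000, rng))
```

Under these assumptions, a less accurate gradient estimate moves along directions of lower average curvature, which is the condition under which a larger compensatory plasticity rate becomes optimal.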
We now turn to zero-order learning algorithms, such as REINFORCE. These do not explicitly approximate a gradient, but generate random plasticity directions, which are retained/opposed based upon their observed effect on task error. We would expect randomly generated plasticity directions to have ‘average’ upward curvature, similarly to synaptic fluctuations. In this case, we would therefore get ${Q}_{\mathbf{w}}[\mathrm{\Delta}\mathbf{c}]\approx {Q}_{\mathbf{w}}[\mathrm{\Delta}\epsilon]$, and compensatory plasticity should thus equal synaptic fluctuations in magnitude.
Finally, we consider second-order learning algorithms, and in particular the Newton update:

$$\mathrm{\Delta}\mathbf{c}\propto -{({\nabla}^{2}F[\mathbf{w}])}^{-1}\nabla F[\mathbf{w}].$$
As previously discussed, we assume that learning algorithms that require detailed information about the Hessian are biologically implausible. As such, our treatment is brief, and mainly contained in SI section 2.2.2.
In a linear network, the Newton update corresponds to compensatory plasticity making a direct ‘beeline’ for ${\mathbf{w}}^{*}$ (see Figure 4d). As such ${Q}_{\mathbf{w}}[\mathrm{\Delta}\mathbf{c}]={Q}_{\mathbf{w}}[\mathrm{\Delta}\epsilon]$ and the optimal magnitude of compensatory plasticity should match synaptic fluctuations. The same is true for a nonlinear network in a near-optimal state. However, if steady-state task error is high in a nonlinear network, then compensatory plasticity should outcompete synaptic fluctuations. This case does not contradict our central claim, however, since high task error at steady state implies that the task is not truly learned.
Together, our results and analyses show that the magnitude of compensatory plasticity, at steady-state task error, should be less than or equal to that of synaptic fluctuations. This conclusion does not depend upon circuit architecture, or choice of biologically plausible learning algorithm.
Discussion
A longstanding question in neuroscience is how neural circuits maintain learned memories while being buffeted by synaptic fluctuations from noise and other task-independent processes (Fusi et al., 2005). There are several hypotheses that offer potential answers, none of which are mutually exclusive. One possibility is that fluctuations only occur in a subset of volatile connections that are relatively unimportant for learned behaviours (Moczulska et al., 2013; Chambers and Rumpel, 2017; Kasai et al., 2003). Following this line of thought, circuit models have been proposed that only require stability in a subset of synapses for stable function (Clopath et al., 2017; Mongillo et al., 2018; Susman et al., 2018).
Another hypothesis is that any memory degradation due to fluctuations is counteracted by restorative plasticity processes that allow circuits to continually ‘relearn’ stored associations. The information source directing this restorative plasticity could come from an external reinforcement signal (Kappel et al., 2018), from interactions with other circuits (Acker et al., 2018), or from spontaneous, network-level reactivation events (Fauth and van Rossum, 2019). A final possibility is that ongoing synaptic fluctuations are accounted for by behavioural changes unrelated to learned task performance.
All these hypotheses share two core assumptions that we make, and several include a third that our results depend on:
Not all synaptic changes are related to learning.
Unchecked, these learning-independent plasticity sources generically hasten the degradation of previously stored information within a neural circuit.
Some internal compensatory plasticity mechanism counteracts the degradation of previously stored information.
We extracted mathematical consequences of these three assumptions by building a general framework. We first modelled the degree of degradation of previously learned information in terms of an abstract, scalar-valued, task error function or ‘loss landscape’. The brain may not have, and in any case does not require, explicit representation of such a function for a specific task. All that is required is error feedback from the environment and/or some internal prediction.
We then noted that compensatory plasticity should act to decrease task error, and thus point in a downhill direction on the ‘loss landscape’. We stress that we do not assume a gradient-based learning rule such as the backpropagation algorithm, the plausibility of which is an ongoing debate (Whittington and Bogacz, 2019).
Our results do not depend on whether synaptic changes during learning are gradual, or occur in large, abrupt steps. Although most theory work assumes plasticity to be gradual, there is evidence that plasticity can proceed in discrete jumps. For instance, abrupt potentiation of synaptic inputs that lead to the formation of place fields in mouse CA1 hippocampal neurons can occur within seconds as an animal explores a new environment (Bittner et al., 2017). Even classical plasticity paradigms that depend upon millisecond-level precision in the relative timing of pre/post synaptic spikes follow a paradigm where there is a short ‘induction phase’ of a minute or so, following which there is a large and sustained change in synaptic efficacy (e.g. Markram et al., 1997; Bi and Poo, 1998). It is therefore an open question as to whether various forms of synaptic plasticity are best accounted for as an accumulation of small changes or a threshold phenomenon that results in a stepwise change. Our analysis is valid in either case. We quantify plasticity rate by picking a (large or small) time interval over which the net plasticity direction is approximately constant, and evaluate the optimal, steady-state magnitude of compensatory plasticity over this interval, relative to the magnitude of synaptic fluctuations.
A combination of learning-induced and learning-independent plasticity should lead to an eventual steady-state level of task error, at which point the quality of stored information does not decay appreciably over time. The absolute quality of this steady state depends upon both the magnitude of the synaptic fluctuations, and the effectiveness of the compensatory plasticity.
Our main finding was that the quality of this steady state is optimal when the rate of compensatory plasticity does not outcompete that of the synaptic fluctuations. This result, which is purely mathematical in nature, is far from obvious. While it is intuitively clear that retention of circuit function will suffer when compensatory plasticity is absent or too weak, it is far less intuitive that the same is true generally when compensatory plasticity is too strong.
We also found that the precision of compensatory plasticity influenced its optimal rate. When ‘precision’ corresponds to the closeness of an approximation to a gradient-based compensatory plasticity rule, an increase in precision resulted in the optimal rate of compensatory plasticity being strictly less than that of fluctuations. In other words, sophisticated learning rules need to do less work to optimally overcome the damage done by learning-independent synaptic fluctuations. Indeed, experimental estimates (see Table 1) suggest that activity-independent synaptic fluctuations can significantly outcompete systematic, activity-dependent changes in certain experimental contexts. Tentatively, this means that the high degree of synaptic turnover in these systems is in fact evidence for the operation of precise synaptic plasticity mechanisms as opposed to crude and imprecise mechanisms.
Our results are generic, in that they follow from fundamental mathematical relationships in optimisation theory, and hence are not dependent on particular circuit architectures or plasticity mechanisms. We considered cases in which synaptic fluctuations were distributed across an entire neural circuit. However, the basic framework easily extends, allowing for predictions in more specialised cases. For instance, recent theoretical work (Clopath et al., 2017; Mongillo et al., 2018; Susman et al., 2018) has hypothesised that synaptic fluctuations could be restricted to ‘unimportant’ synapses. These correspond to low curvature (globally insensitive) directions in the ‘loss landscape’. Our framework (Equation (9) in particular) immediately predicts that the optimal rate of compensatory plasticity will decrease proportionately with this curvature.
Precise experimental isolation/elimination of the plasticity sources attributable to learning and retention of memories remains challenging. Nevertheless, in conventional theories of learning (e.g. Hebbian learning), neural networks learn through plasticity induced by patterns of pre- and postsynaptic neural activity. A reasonable approximation, therefore, is to equate the ‘compensatory/learning-induced’ plasticity of our paper with ‘activity-dependent’ plasticity in experimental setups. With this assumption, our results provide several testable predictions.
Firstly, our results show that the rate of compensatory (i.e. learning-dependent) plasticity is greater when a neural circuit is in a phase of active learning, as opposed to maintaining previously learned information (see Equation (10) and the surrounding discussion). Consequently, the relative contribution of synaptic fluctuations to the overall plasticity rate should be lower in this case. It would be interesting to test whether this were indeed the case, by comparing brain circuits in immature vs mature organisms, and in neural circuits thought to be actively learning vs those thought to be retaining previously learned information. One way to do this would be to measure the covariance of functional synaptic strengths at co-innervated synapses using EM reconstructions of neural tissue. A higher covariance implies a lower proportion of activity-dependent (i.e. compensatory) plasticity, since co-innervated synapses share presynaptic activity histories. Interestingly, two very similar experiments (Bartol et al., 2015; Dvorkin and Ziv, 2016) did indeed examine covariance in EM reconstructions of hippocampus and neocortex, respectively. This covariance appears to be much lower in hippocampus (compare Figure 1 of Bartol et al., 2015 to Figure 8 of Dvorkin and Ziv, 2016). Many cognitive theories characterise hippocampus as a continual learner and neocortex as a consolidator of previously learned information (e.g. O'Reilly and Rudy, 2001). Our analysis provides support for this hypothesis at a mechanistic level by linking low covariance in co-innervated hippocampal synapses to continual learning.
Secondly, a number of experimental studies (Nagaoka et al., 2016; Quinn et al., 2019; Yasumatsu et al., 2008; Minerbi et al., 2009; Dvorkin and Ziv, 2016) note a persistence of the bulk of synaptic plasticity in the absence of activity-dependent plasticity or other correlates of an explicit learning signal, as explained in our review of key experimental findings. However, there are two important caveats for relating our work to these experimental observations:
Experimentally isolating different plasticity mechanisms, measuring synaptic changes, and accounting for confounding behavioural/physiological changes is extremely challenging. The most compelling in vivo support comes from Nagaoka et al., 2016, where an analogue of compensatory plasticity in the mouse visual cortex was suppressed both chemically (by suppression of spiking activity) and behaviourally (by raising the mouse in visually impoverished conditions). Synaptic turnover was reduced by about half for both suppression protocols, and also when they were applied simultaneously. Further studies that quantified changes in synaptic strength in addition to spine turnover in an analogous setup would lend further credence to our results.
We do not know if observed synaptic plasticity in the experiments we cite truly reflects a neural circuit attempting to minimise steady-state error on a particular learning goal (as captured through an abstract, implicit, ‘loss function’). Our analysis simply shows that somewhat surprising levels of ongoing plasticity can be explained parsimoniously in such a framework. In particular, the concepts of ‘learning’ and behaviour have no clear relationship with neural circuit dynamics in vitro. Nevertheless, we might speculate that synapses could tune the extent to which they respond to ‘endogenous’ (task-independent) signals versus external signals that could convey task information in the intact animal. Even if the information conveyed by activity-dependent signals were disrupted in vitro, the fact that activity-dependent signals determined such a small proportion of plasticity is notable, and seems to carry over to the in vivo case.
Thus, while our results offer a surprising agreement with a number of experimental observations, we believe it is important to further replicate measurements of synaptic modification in a variety of settings, both in vivo and in vitro. We hope our analysis provides an impetus for this difficult experimental work by offering a first-principles theory for the volatility of connections in neural circuits.
Materials and methods
Simulations
We simulated two types of network, which we refer to as linear (Figure 5—figure supplement 1) and nonlinear (Figures 1 and 5) respectively. We ran our simulations in the Julia programming language (version 1.3), and in particular used the Flux.jl software package (version 0.9) to construct and update networks. Source code is available at https://github.com/Dhruva2/OptimalPlasticityRatios (copy archived at swh:1:rev:fcb1717a822f90b733c49d62bfc2f970155b7364, Raman, 2021).
Nonlinear networks
Networks were rate-based, with the firing rate $r(t)$ of a given neuron defined as
where $w$ is the vector of presynaptic strengths, $u$ represents the firing rate of the associated presynaptic neurons, and $\sigma (x):=\frac{1}{1+\mathrm{exp}(-x)}$ is the sigmoid function. Initial weight values were generated randomly, according to the standard Xavier distribution (Glorot and Bengio, 2010). Networks were organised into three layers, containing 12, 20, and 10 neurons, respectively. Any given neuron was connected to all neurons in the previous layer. For the first layer, the firing rates of the ‘previous layer’ corresponded to the network inputs.
Linear networks
Networks were organised into an input layer of 12 neurons, and an output layer of 10 neurons. Each output neuron was connected to all input layer neurons. Networks were rate-based, with the firing rate $r(t)$ of a given neuron defined as
where ${u}_{i}(t)$ corresponds to the ${i}^{th}$ input (input-layer neuron) or the firing rate of the ${i}^{th}$ input-layer neuron (output-layer neuron). Initial weight values were generated randomly, according to the Xavier distribution (Glorot and Bengio, 2010).
Task error
For each network, we generated 1000 different, random, input vectors. Each component of the vector was generated from a unit Gaussian distribution. Task error, at the ${t}^{th}$ timestep, was taken as the mean squared error of the network in recreating the outputs of the initial ($t=0$) network, in response to the suite of inputs. Mathematically, this equates to
where $y(\mathbf{w}(t),u)$ denotes the output of the network given the synaptic strengths at time $t$, in response to input $u\in \mathcal{U}$. Note that this task error recreates the ‘student-teacher’ framework of e.g. (Levin et al., 1990; Seung et al., 1992), where a fixed copy of the initial network is the teacher.
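The student-teacher task error can be sketched for the linear network as follows. This is an illustrative re-implementation in plain Python, not the Julia/Flux.jl code used for the simulations. The uniform Xavier-style initialisation bound, the normalisation of the error by the number of inputs only, and the reduced number of inputs in the example are all assumptions.

```python
import random

def make_network(n_in, n_out, rng):
    """Random weight matrix (one row per output neuron), Xavier-style uniform init (assumed)."""
    bound = (6.0 / (n_in + n_out)) ** 0.5
    return [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]

def output(weights, u):
    """Linear rate network: each output is a weighted sum of the inputs."""
    return [sum(wi * ui for wi, ui in zip(row, u)) for row in weights]

def task_error(weights, teacher, inputs):
    """Mean (over inputs) squared distance between student and teacher outputs."""
    total = 0.0
    for u in inputs:
        y, y0 = output(weights, u), output(teacher, u)
        total += sum((a - b) ** 2 for a, b in zip(y, y0))
    return total / len(inputs)

if __name__ == "__main__":
    rng = random.Random(0)
    teacher = make_network(12, 10, rng)  # frozen copy of the t = 0 network
    inputs = [[rng.gauss(0.0, 1.0) for _ in range(12)] for _ in range(100)]
    student = [[w + 0.01 * rng.gauss(0.0, 1.0) for w in row] for row in teacher]
    print(task_error(teacher, teacher, inputs))  # zero by construction
    print(task_error(student, teacher, inputs))  # positive after perturbation
```

Because the teacher is the network's own initial state, the error is zero at $t=0$ and any subsequent weight change can only raise it or leave it unchanged.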
Weight dynamics
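At each simulation timestep, synaptic weights were updated as

$$\mathbf{w}(t+1)=\mathbf{w}(t)+\mathrm{\Delta}{\mathbf{c}}_{t}+\mathrm{\Delta}{\epsilon}_{t}.$$

We took the synaptic fluctuations term, $\mathrm{\Delta}{\epsilon}_{t}$, as scaled white noise, that is,

$$\mathrm{\Delta}{\epsilon}_{t}\propto \mathcal{N}(0,\mathbb{I}).$$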
The constant of proportionality was calculated so that the magnitude ${\parallel \mathrm{\Delta}\epsilon\parallel}_{2}$ conformed to a pre-specified value. This magnitude was 2 in the simulation of Figure 1, and was a graphed variable in the simulations of Figure 5 and Figure 5—figure supplement 1.
The compensatory plasticity term, $\mathrm{\Delta}{\mathbf{c}}_{t}$, was calculated in two stages. First we applied the backpropagation algorithm, using $y(\mathbf{w}(0),u)$ as the ideal network outputs to train against. This generated an ‘ideal’ direction of compensatory plasticity, proportional to the negative gradient $-\nabla F[\mathbf{w}(t)]$. For Figure 5 and Figure 5—figure supplement 1 we then corrupted this gradient with a tunable proportion of white noise. Overall, this gives,
where ${\nu}_{t}\sim \mathcal{N}(0,\mathbb{I})$ is the noise corruption term, and ${\gamma}_{1},{\gamma}_{2}>0$ are tunable hyperparameters. The higher the ratio ${\gamma}_{2}:{\gamma}_{1}$, the greater the noise corruption. Meanwhile, $\sqrt{{\gamma}_{1}^{2}+{\gamma}_{2}^{2}}$ sets the overall magnitude of compensatory plasticity. By tuning ${\gamma}_{1}$ and ${\gamma}_{2}$, we can therefore independently modify the magnitude and precision of the compensatory plasticity term. In Figure 1, we set ${\gamma}_{2}=0$.
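A single timestep of these weight dynamics can be sketched as below. This is a hedged illustration rather than the published implementation: it assumes that both the gradient and the corrupting noise are rescaled to unit norm before being weighted by ${\gamma}_{1}$ and ${\gamma}_{2}$ (so that $\sqrt{{\gamma}_{1}^{2}+{\gamma}_{2}^{2}}$ approximates the step size when the two are nearly orthogonal), and it takes the gradient from a user-supplied function `grad_fn` instead of backpropagation. All names are hypothetical.

```python
import math
import random

def normalise(v):
    """Rescale to unit Euclidean norm (a zero vector is returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def white_noise(n, rng):
    """Draw n independent standard Gaussian samples."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def update_weights(w, grad_fn, fluct_norm, gamma1, gamma2, rng):
    """One step of w <- w + delta_c + delta_eps.

    delta_eps: white noise rescaled so that its norm equals fluct_norm.
    delta_c:   -gamma1 * (unit gradient) + gamma2 * (unit white noise)  [assumed form].
    """
    n = len(w)
    d_eps = [fluct_norm * x for x in normalise(white_noise(n, rng))]
    g_hat = normalise(grad_fn(w))
    nu_hat = normalise(white_noise(n, rng))
    d_c = [-gamma1 * g + gamma2 * v for g, v in zip(g_hat, nu_hat)]
    return [wi + c + e for wi, c, e in zip(w, d_c, d_eps)]

if __name__ == "__main__":
    rng = random.Random(0)
    lam = [0.1, 0.5, 1.0, 2.0, 10.0]  # toy quadratic loss with minimum at 0

    def grad(w):
        return [l * x for l, x in zip(lam, w)]

    def error(w):
        return 0.5 * sum(l * x * x for l, x in zip(lam, w))

    w = [0.0] * len(lam)
    for _ in range(500):
        w = update_weights(w, grad, fluct_norm=0.05, gamma1=0.04, gamma2=0.0, rng=rng)
    print(error(w))  # fluctuating steady-state error for this ratio
```

Sweeping `gamma1` for a fixed `fluct_norm` and averaging the late-time error is one way to locate the ratio that minimises steady-state error in this toy setting.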
Appendix 1
Alternative derivation of Equation (9a)
We provide an alternative derivation of Equation (9a) that removes the need for assumption (3b). We did not put this derivation in the main text as we perceive it to have less clarity.
The derivation proceeds identically to that given in the main text until Equation (7). We can then use (3a) to simplify Equation (7). We get
Recall that the expectation is taken over an unknown probability distribution from which $\mathrm{\Delta}\epsilon$ is drawn, which satisfies Equation (3a).
We then assume that we are in a phase of stable memory retention, so that $\mathbb{E}[\mathrm{\Delta}F]=0$. Now if the magnitude of compensatory plasticity ${\parallel \mathrm{\Delta}\mathbf{\mathbf{c}}\parallel}_{2}$ is tuned to minimise steady state error $F$, then any change to ${\parallel \mathrm{\Delta}\mathbf{\mathbf{c}}\parallel}_{2}$ will result in an increase in $\mathbb{E}[\mathrm{\Delta}F]$. So $\mathbb{E}[\mathrm{\Delta}F]$ is locally minimal in ${\parallel \mathrm{\Delta}\mathbf{\mathbf{c}}\parallel}_{2}$. This implies
We also claim that local minimality implies
Why? $\mathbb{E}[\mathrm{\Delta}F]=0$ implies that $\mathbb{E}[\frac{\mathrm{\Delta}F}{{\parallel \mathrm{\Delta}\mathbf{\mathbf{c}}\parallel}_{2}}]=0$. If a small change to ${\parallel \mathrm{\Delta}\mathbf{\mathbf{c}}\parallel}_{2}$ results in $\mathbb{E}[\mathrm{\Delta}F]\ge 0$, then it also results in $\mathbb{E}[\frac{\mathrm{\Delta}F}{{\parallel \mathrm{\Delta}\mathbf{\mathbf{c}}\parallel}_{2}}]\ge 0$, since ${\parallel \mathrm{\Delta}\mathbf{\mathbf{c}}\parallel}_{2}$ is nonnegative.
Expanding the LHS of Equation (1), we get
Differentiating, we get
from which (9a) follows.
Positivity of the numerator and denominator in Equation (9a)
Equation (9a) of the main text asserts that
holds as long as both the numerator and denominator of the RHS are positive. Here we describe sufficient conditions for positivity.
The inequality ${\nabla}^{2}F[\mathbf{w}]\succeq 0$ must hold in some neighbourhood of any minimum ${\mathbf{w}}^{*}$. Recall that we referred to such a neighbourhood as a highly trained state of the network in the main text. In such a state, our assertion follows immediately, as ${Q}_{\mathbf{w}}[\mathbf{v}]:=\frac{1}{{\parallel \mathbf{v}\parallel}_{2}^{2}}{\mathbf{v}}^{T}({\nabla}^{2}F[\mathbf{w}])\mathbf{v}\ge 0$, for any vector $\mathbf{v}$. Therefore, ${Q}_{\mathbf{w}}[\mathrm{\Delta}\epsilon]\ge 0$ and ${Q}_{\mathbf{w}}[\mathrm{\Delta}\mathbf{c}]\ge 0$.
We now consider a partially trained network state, which we defined in the main text as any $\mathbf{\mathbf{w}}$ satisfying $Tr({\nabla}^{2}F)\ge 0$. Note that
We assumed in the main text (Equation (3a)) that $\mathrm{\Delta}\epsilon$ is uncorrelated with the gradient $\nabla F[\mathbf{w}]$ in expectation, since $\mathrm{\Delta}\epsilon$ is realised by memory-independent processes. Similarly, we can assume that $\mathrm{\Delta}\epsilon$ is unbiased in how it projects onto the eigenvectors of ${\nabla}^{2}F[\mathbf{w}]$. In other words,
for any normalised eigenvectors ${\widehat{v}}_{i}$, ${\widehat{v}}_{j}$ of ${\mathrm{\nabla}}^{2}F[\mathbf{w}]$. In expectation, we can therefore simplify to
where $N$ is the dimensionality of the vector $\mathbf{w}$. So a partially trained network is one for which small, memory-independent weight fluctuations (such as $\mathrm{\Delta}\epsilon$, or white noise) are expected to decrease task performance.
Now recall that ${Q}_{\mathbf{w}}[\mathrm{\Delta}\epsilon]=\frac{1}{{\parallel \mathrm{\Delta}\epsilon\parallel}_{2}^{2}}\mathrm{\Delta}{\epsilon}^{T}{\nabla}^{2}F[\mathbf{w}]\mathrm{\Delta}\epsilon$. So we have
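The claim that unbiased fluctuations see the average curvature can be checked by simulation. The sketch below assumes a diagonal Hessian (so the eigenvectors are the coordinate axes) and isotropic Gaussian fluctuations. It estimates the mean squared projection of a unit-norm fluctuation onto each eigenvector, which should approach $1/N$, and the resulting mean curvature, which should approach $Tr({\nabla}^{2}F)/N$. Names and example eigenvalues are illustrative.

```python
import math
import random

def mean_squared_projections(n, trials, rng):
    """Monte Carlo estimate of E[(v_i . eps_hat)^2] for each coordinate eigenvector v_i."""
    sums = [0.0] * n
    for _ in range(trials):
        eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm_sq = sum(x * x for x in eps)
        for i, x in enumerate(eps):
            sums[i] += x * x / norm_sq
    return [s / trials for s in sums]

def expected_curvature(lam, projections):
    """E[Q] = sum_i lam_i * E[(v_i . eps_hat)^2] for a diagonal Hessian diag(lam)."""
    return sum(l * p for l, p in zip(lam, projections))

if __name__ == "__main__":
    rng = random.Random(0)
    lam = [-0.5, 0.5, 1.0, 2.0, 10.0]  # one negative eigenvalue, positive trace
    proj = mean_squared_projections(len(lam), 5000, rng)
    print(proj)                           # each entry close to 1/N
    print(expected_curvature(lam, proj))  # close to the mean eigenvalue
    print(sum(lam) / len(lam))
```

The example deliberately includes a negative eigenvalue: the expected curvature along a fluctuation stays positive as long as the trace is positive, which is the partially trained condition.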
where the positivity constraint comes from being in a partially trained network.
We now consider why ${Q}_{\mathbf{w}}[\mathrm{\Delta}\mathbf{c}]$ should be generically positive in a partially trained network. Suppose ${Q}_{\mathbf{w}}[\mathrm{\Delta}\mathbf{c}]<0$ holds. We can rewrite this as $\mathrm{\Delta}{\mathbf{c}}^{T}{\nabla}^{2}F[\mathbf{w}]\mathrm{\Delta}\mathbf{c}<0$. In this case, maintaining the same compensatory plasticity $\mathrm{\Delta}\mathbf{c}$ over the time interval $[{t}^{*}+\mathrm{\Delta}t,{t}^{*}+2\mathrm{\Delta}t]$ would result in increased improvement in loss, as
Effectively, memory improvement due to compensatory plasticity $\mathrm{\Delta}\mathbf{\mathbf{c}}$ would be in an ‘accelerating’ direction, and maintaining the same direction $\mathrm{\Delta}\mathbf{\mathbf{c}}$ of compensatory plasticity would lead to ever faster learning. However, by assumption, we are in a regime of steady state task performance, where
Optimal plasticity ratios in specific learning rules
Noise-free learning rules (first-order)
Let us first consider the case where $\mathrm{\Delta}\mathbf{c}$ can be computed with perfect access to the gradient $\nabla F[\mathbf{w}]$, but without access to ${\nabla}^{2}F[\mathbf{w}]$. Such a $\mathrm{\Delta}\mathbf{c}$ is known as a first-order learning rule, as it has access only to the first derivative of $F$ (Polyak, 1987). Imperfect access is considered subsequently. In this case, the optimal direction of compensatory plasticity is
In other words, $\mathrm{\Delta}\mathbf{\mathbf{c}}$ would implement perfect gradient descent on $F[\mathbf{\mathbf{w}}]$. The condition of Equation (11) for synaptic fluctuations to outcompete reconsolidation plasticity evaluates to
To what extent can we quantify ${Q}_{\mathbf{\mathbf{w}}}[\nabla F[\mathbf{\mathbf{w}}]]$? First let us relate the gradient and Hessian of $F[\mathbf{\mathbf{w}}]$. Let ${\mathbf{\mathbf{w}}}^{*}$ be an optimal state of the network (i.e. one where $F$ is minimised). Let us parameterise the straight line connecting $\mathbf{\mathbf{w}}$ with ${\mathbf{\mathbf{w}}}^{*}$:
Then
This gives
First let us rewrite
where $({\lambda}_{i},{v}_{i})$ is the ${i}^{th}$ eigenvalue/eigenvector pair of ${\nabla}^{2}F$ (sorted in ascending order of ${\lambda}_{i}$), and ${c}_{i}$, ${d}_{i}$ are some scalar weights. Now
The value of ${Q}_{\mathbf{w}}[\nabla F[\mathbf{w}]]$ now depends upon the distribution of mass of the sequence $\{{d}_{i}\}$. If later elements of the sequence are larger (i.e. $M(\mathbf{w}-{\mathbf{w}}^{*})$ projects more highly onto eigenvectors of ${\mathrm{\nabla}}^{2}F[\mathbf{w}]$ with large eigenvalue), then ${Q}_{\mathbf{w}}[\nabla F[\mathbf{w}]]$ becomes larger, and the optimal magnitude of reconsolidation plasticity decreases, relative to the magnitude of synaptic fluctuations. The opposite is true if earlier elements of the sequence are larger.
Guaranteed bounds on the value of Equation (2) are vacuous. If we do not restrict $M$, then we can tailor the sequence $\{{d}_{i}\}$ as we like, and we end up with ${\lambda}_{1}\le {Q}_{\mathbf{\mathbf{w}}}[\nabla F[\mathbf{\mathbf{w}}]]\le {\lambda}_{N}$. However, pragmatic bounds are much tighter. Let us now consider two plausibly extremal cases.
First consider the simplest case of a network that linearly transforms its outputs, and which has a quadratic loss function $F[\mathbf{w}]$. In this case ${\nabla}^{2}F$ is a constant (independent of $\mathbf{w}$), positive-semidefinite matrix, and $M={\nabla}^{2}F$. This means that
Condition (11) then becomes
A conservative sufficient condition for (Equation 3), using Chebyshev’s summation inequality, is that
Under what conditions would a plausible reconsolidation mechanism choose to ‘outcompete’ synaptic fluctuations, in this linear example? For ${Q}_{\mathbf{w}}[\nabla F[\mathbf{w}]]<{Q}_{\mathbf{w}}[\mathrm{\Delta}\epsilon]$ to even hold, (26) would have to be broken, and significantly so due to conservatism in the inequality. In other words, $\mathbf{w}-{\mathbf{w}}^{*}$ must project with a strong bias onto the eigenvectors of ${\nabla}^{2}F$ with smaller-than-average eigenvalue. If the discrepancy between $\mathbf{w}$ and ${\mathbf{w}}^{*}$ were caused by fluctuations (which are independent of ${\nabla}^{2}F$), then this would not be the case, in expectation. Even if this were the case, the reconsolidation mechanism would have to know about the described bias. This requires knowledge of both ${\mathbf{w}}^{*}$ and ${\nabla}^{2}F$, and is thus implausible.
Now let us consider the case of a generic nonlinear network. At one extreme, if ${\parallel \mathbf{w}-{\mathbf{w}}^{*}\parallel}_{2}$ is small, then $M\approx {\nabla}^{2}F[\mathbf{w}]$, and the discussion of the linear case is valid. This corresponds to the case where steady-state error is close to the minimum achievable by the network. As ${\parallel \mathbf{w}-{\mathbf{w}}^{*}\parallel}_{2}$ increases (i.e. steady-state error gets worse), the correspondence between $M$ and ${\nabla}^{2}F[\mathbf{w}]$ will likely decrease. Thus the optimal magnitude of reconsolidation plasticity, relative to the level of synaptic fluctuations, will rise.
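The ‘average’ linear case can be worked through explicitly. As a sketch, assume that all squared projections ${c}_{i}^{2}$ are equal and that all eigenvalues are non-negative with at least one positive. The curvature along the gradient then reduces to a ratio of eigenvalue power sums, and Chebyshev's sum inequality, applied to the similarly ordered sequences $\{{\lambda}_{i}\}$ and $\{{\lambda}_{i}^{2}\}$, bounds it below by the mean eigenvalue:

```latex
Q_{\mathbf{w}}[\nabla F[\mathbf{w}]]
  = \frac{\sum_{i=1}^{N}\lambda_i^{3}}{\sum_{i=1}^{N}\lambda_i^{2}},
\qquad
\frac{1}{N}\sum_{i=1}^{N}\lambda_i\,\lambda_i^{2}
  \;\ge\;
  \Big(\frac{1}{N}\sum_{i=1}^{N}\lambda_i\Big)
  \Big(\frac{1}{N}\sum_{i=1}^{N}\lambda_i^{2}\Big)
\;\Longrightarrow\;
Q_{\mathbf{w}}[\nabla F[\mathbf{w}]]
  \;\ge\; \frac{1}{N}\sum_{i=1}^{N}\lambda_i
  = \frac{Tr(\nabla^{2}F)}{N}.
```

Equality holds only when all eigenvalues coincide, so any anisotropy in curvature pushes the gradient's curvature strictly above the average.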
We could consider another ‘extreme’ case in which $M$ and ${\nabla}^{2}F[\mathbf{\mathbf{w}}]$ were completely independent of each other. In this case,
In other words, the projection of $M(\mathbf{w}-{\mathbf{w}}^{*})$ onto the different eigenvectors of ${\nabla}^{2}F[\mathbf{w}]$ is approximately even. Using (24), this gives
In summary, we have two plausible extremes. One occurs where $M={\nabla}^{2}F[\mathbf{w}]$, and another occurs where $M$ is completely independent of ${\nabla}^{2}F[\mathbf{w}]$. In either case, ${Q}_{\mathbf{w}}[\nabla F[\mathbf{w}]]\ge {Q}_{\mathbf{w}}[\mathrm{\Delta}\epsilon]$, and so the magnitude of synaptic fluctuations should optimally outcompete/equal the magnitude of reconsolidation plasticity. Of course, there might be particular values of $\mathbf{w}$ where the correspondence between $M$ and ${\nabla}^{2}F[\mathbf{w}]$ is ‘worse’ than chance. In other words, eigenvectors of $M$ with large eigenvalue preferentially project onto eigenvectors of ${\nabla}^{2}F[\mathbf{w}]$ with small eigenvalue. In such cases, we would have ${Q}_{\mathbf{w}}[\nabla F[\mathbf{w}]]\le {Q}_{\mathbf{w}}[\mathrm{\Delta}\epsilon]$. However, we find it implausible that a reconsolidation mechanism would be able to gain sufficient information on $M$ to determine this at particular points in time, and thereby increase its plasticity magnitude.
Noise-free learning rules (second-order)
Let us now suppose that $\Delta\mathbf{c}$ can be computed with perfect access to both $\nabla F[\mathbf{w}]$ and $\nabla^{2}F[\mathbf{w}]$. In this case, the reconsolidation mechanism would optimally apply plasticity in the direction of the Newton step: we would have
Note that the Newton step is often conceptualised as a weighted form of gradient descent, where movement on the loss landscape is biased towards directions of lower curvature. Thus we would expect $Q_{\mathbf{w}}[\Delta\mathbf{c}]$ to be smaller, and the optimal proportion of reconsolidation plasticity to be larger. This is indeed the case. For mathematical tractability, we will restrict our discussion to the case in which $\nabla^{2}F[\mathbf{w}]\succ 0$, and $M\succ 0$. This would hold if $F[\mathbf{w}]$ were convex, or if $\mathbf{w}$ were sufficiently close to a unique local minimum $\mathbf{w}^{*}$. In this case we can rewrite
which gives
Once again, we first consider the case of a linear network with quadratic loss function, and hence with constant Hessian $\nabla^{2}F$. This gives $M=\nabla^{2}F$, and
We again assume that the reconsolidation mechanism does not have knowledge of the relative projections of $\mathbf{w}-\mathbf{w}^{*}$ onto the different eigenvectors of $\nabla^{2}F$, which requires knowledge of $\mathbf{w}^{*}$. Without such information, we can use an analogous argument to that preceding (Equation 5) to argue that the approximation $c_{i}^{2}\approx \frac{1}{N}\sum_{j=1}^{N}c_{j}^{2}$ is reasonable. This gives $Q_{\mathbf{w}}[\Delta\mathbf{c}]\approx Q_{\mathbf{w}}[\Delta\epsilon]$.
Note that the Newton step, in the linear-quadratic case just considered, corresponds to a direction $\mathbf{w}^{*}-\mathbf{w}$, that is, a direct path to a local minimum. So we could consider a compensatory plasticity mechanism implementing the Newton step as one directly undoing synaptic changes caused by $\Delta\epsilon$.
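The statement that the Newton step points along $\mathbf{w}^{*}-\mathbf{w}$ in the linear-quadratic case can be checked numerically. The sketch below assumes a quadratic loss $F[\mathbf{w}]=\frac{1}{2}(\mathbf{w}-\mathbf{w}^{*})^{T}H(\mathbf{w}-\mathbf{w}^{*})$ with a randomly generated positive-definite Hessian $H$; the variable names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
a = rng.standard_normal((n, n))
hessian = a @ a.T / n + 0.1 * np.eye(n)   # constant Hessian of a quadratic loss
w_star = rng.standard_normal(n)           # minimiser (hypothetical values)
w = w_star + rng.standard_normal(n)       # current weights after fluctuations

gradient = hessian @ (w - w_star)         # grad F for the quadratic loss above
newton_step = -np.linalg.solve(hessian, gradient)  # -(Hessian)^{-1} grad F

# In the quadratic case the Newton step lands exactly on the minimiser.
print(np.allclose(newton_step, w_star - w))
```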
We now consider the case of a nonlinear network. As before, if ${\parallel \mathbf{w}-\mathbf{w}^{*}\parallel}_{2}$ is small, then we have $M\approx \nabla^{2}F[\mathbf{w}]$, and the arguments of the linear network hold. As ${\parallel \mathbf{w}-\mathbf{w}^{*}\parallel}_{2}$ increases, the correspondence between $M$ and $\nabla^{2}F$ will decrease. We again consider the plausible extreme where $M$ is completely uncorrelated with $\nabla^{2}F[\mathbf{w}]$, and so the approximation (Equation 5) holds. In this case, Equation (6) can be simplified to give
We assumed that $\nabla^{2}F[\mathbf{w}]\succ 0$. Therefore, all eigenvalues are positive. This allows us to use Chebyshev’s summation inequality to arrive at
So as ${\parallel \mathbf{w}-\mathbf{w}^{*}\parallel}_{2}$ increases, the magnitude of reconsolidation plasticity will optimally outcompete that of synaptic fluctuations. This is the one case that contradicts our main claim.
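The Chebyshev step can be spelled out as follows. This is a sketch under stated assumptions, not a reproduction of the displayed equations of the original: it assumes $Q_{\mathbf{w}}[\mathbf{v}]=\mathbf{v}^{T}\nabla^{2}F[\mathbf{w}]\mathbf{v}/{\parallel \mathbf{v}\parallel}_{2}^{2}$, writes $\lambda_{i}>0$ and $\mathbf{v}_{i}$ for the eigenvalues and eigenvectors of $\nabla^{2}F[\mathbf{w}]$, takes the gradient to be $M(\mathbf{w}-\mathbf{w}^{*})=\sum_{i}c_{i}\mathbf{v}_{i}$, and uses the even-projection approximation $c_{i}^{2}\approx \frac{1}{N}\sum_{j}c_{j}^{2}$.

```latex
\begin{aligned}
\Delta\mathbf{c}
  &\propto -\left(\nabla^{2}F[\mathbf{w}]\right)^{-1} M(\mathbf{w}-\mathbf{w}^{*})
   = -\sum_{i=1}^{N} \frac{c_{i}}{\lambda_{i}}\,\mathbf{v}_{i}, \\
Q_{\mathbf{w}}[\Delta\mathbf{c}]
  &= \frac{\sum_{i=1}^{N} \lambda_{i}\, c_{i}^{2}/\lambda_{i}^{2}}
          {\sum_{i=1}^{N} c_{i}^{2}/\lambda_{i}^{2}}
   \approx \frac{\sum_{i=1}^{N} \lambda_{i}^{-1}}{\sum_{i=1}^{N} \lambda_{i}^{-2}}, \\
\frac{1}{N}\sum_{i=1}^{N} \lambda_{i}\,\lambda_{i}^{-2}
  &\le \left(\frac{1}{N}\sum_{i=1}^{N} \lambda_{i}\right)
       \left(\frac{1}{N}\sum_{i=1}^{N} \lambda_{i}^{-2}\right)
   \quad\Longrightarrow\quad
   \frac{\sum_{i=1}^{N} \lambda_{i}^{-1}}{\sum_{i=1}^{N} \lambda_{i}^{-2}}
   \le \frac{1}{N}\sum_{i=1}^{N} \lambda_{i}.
\end{aligned}
```

The last line is Chebyshev's sum inequality for the oppositely ordered sequences $\lambda_{i}$ and $\lambda_{i}^{-2}$, which requires all $\lambda_{i}$ to be positive. Its right-hand side, $\frac{1}{N}\mathrm{Tr}(\nabla^{2}F[\mathbf{w}])$, is the value expected of $Q_{\mathbf{w}}[\Delta\epsilon]$ when fluctuations are uncorrelated with the Hessian, so under these assumptions $Q_{\mathbf{w}}[\Delta\mathbf{c}]\le Q_{\mathbf{w}}[\Delta\epsilon]$.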
Imperfect learning rules
The previous section applied to the implausible case where a reconsolidation mechanism had perfect access to $\nabla F[\mathbf{w}]$ and/or $\nabla^{2}F[\mathbf{w}]$. Recall from the main text that at least some information on $\nabla F[\mathbf{w}]$ is required, in order for compensatory plasticity to move in a direction of decreasing task error. What if $\Delta\mathbf{c}$ contains a mean-zero noise term, corresponding to unbiased noise corruption of these quantities? We will now show how such noise pushes $Q_{\mathbf{w}}[\Delta\mathbf{c}]$ towards equality with $Q_{\mathbf{w}}[\Delta\epsilon]$, and thus pushes the optimal magnitude of reconsolidation plasticity towards the magnitude of synaptic fluctuations. Let us use the model
where $\nu$ is some mean-zero random variable, and $\widetilde{\Delta\mathbf{c}}$ is the ideal output of the reconsolidation mechanism, assuming perfect access to the derivatives of $F[\mathbf{w}]$. Here $\nu$ represents the portion of compensatory plasticity attributable to systematic error in the algorithm, due to imperfect information on $F[\mathbf{w}]$. This could arise due to imperfect sensory information or limited communication between synapses. We can therefore assume, as for $\Delta\epsilon$, that it does not contain information on $\nabla^{2}F[\mathbf{w}]$. We therefore get
analogously to Equation (12). Now the operator $Q_{\mathbf{w}}$ satisfies
So depending upon the relative magnitudes of $\widetilde{\Delta\mathbf{c}}$ and $\nu$, $Q_{\mathbf{w}}[\Delta\mathbf{c}]$ interpolates between $Q_{\mathbf{w}}[\widetilde{\Delta\mathbf{c}}]$ and $Q_{\mathbf{w}}[\nu]$. In particular, as the crudeness of the learning rule (i.e. the ratio $\frac{\parallel \nu \parallel}{\parallel \widetilde{\Delta\mathbf{c}}\parallel}$) grows, $Q_{\mathbf{w}}[\Delta\mathbf{c}]$ approaches equality (from below) with $Q_{\mathbf{w}}[\nu]$, and thus $Q_{\mathbf{w}}[\Delta\epsilon]$, completing our argument.
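The interpolation can be illustrated numerically. The sketch below assumes an additive model $\Delta\mathbf{c}=\widetilde{\Delta\mathbf{c}}+\nu$ with isotropic Gaussian $\nu$, and again assumes $Q_{\mathbf{w}}[\mathbf{v}]=\mathbf{v}^{T}\nabla^{2}F[\mathbf{w}]\mathbf{v}/{\parallel \mathbf{v}\parallel}_{2}^{2}$. The ideal step is taken to be a noise-free gradient step on a random quadratic loss. All names and values are hypothetical. Because this example uses a first-order ideal step, $Q$ starts above the noise-only value; the direction of approach depends on the ideal rule.

```python
import numpy as np

def curvature_q(hessian, v):
    """Assumed form of Q_w[v]: Rayleigh quotient of the Hessian along v."""
    return float(v @ hessian @ v) / float(v @ v)

def mean_q_with_noise(hessian, ideal_step, noise_ratio, rng, n_samples=2000):
    """Average Q over noisy steps ideal_step + nu, ||nu|| = noise_ratio * ||ideal_step||."""
    n = ideal_step.size
    values = []
    for _ in range(n_samples):
        nu = rng.standard_normal(n)
        nu *= noise_ratio * np.linalg.norm(ideal_step) / np.linalg.norm(nu)
        values.append(curvature_q(hessian, ideal_step + nu))
    return float(np.mean(values))

rng = np.random.default_rng(2)
n = 50
a = rng.standard_normal((n, n))
hessian = a @ a.T / n + 0.1 * np.eye(n)          # random positive-definite Hessian
eigvals, eigvecs = np.linalg.eigh(hessian)
ideal_step = -hessian @ (eigvecs @ np.ones(n))   # noise-free gradient step direction
q_noise_only = float(eigvals.mean())             # expected Q for isotropic noise

ratios = [0.0, 1.0, 10.0, 100.0]
q_values = [mean_q_with_noise(hessian, ideal_step, r, rng) for r in ratios]
for r, q in zip(ratios, q_values):
    print(f"noise ratio {r:6.1f}: mean Q = {q:.3f} (noise-only value {q_noise_only:.3f})")
```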
Data availability
All code is publicly available on GitHub at https://github.com/Dhruva2/OptimalPlasticityRatios (copy archived at https://archive.softwareheritage.org/swh:1:rev:fcb1717a822f90b733c49d62bfc2f970155b7364).
Decision letter

Srdjan Ostojic, Reviewing Editor; Ecole Normale Superieure Paris, France

Timothy E Behrens, Senior Editor; University of Oxford, United Kingdom

Yonatan Loewenstein, Reviewer; Hebrew University of Jerusalem, Israel

Matthias H Hennig, Reviewer; University of Edinburgh, United Kingdom
In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.
Acceptance summary:
The half-century of research on synaptic plasticity has primarily focused on how neural activity gives rise to changes in synaptic connections, and how these changes underlie learning and memory. However, recent studies have shown that, in fact, most synaptic changes are activity-independent. This result is surprising given the generally held belief that activity-dependent changes in connectivity underlie network functionality. This manuscript proposes a theoretical explanation of why this should be the case. Specifically, this work presents a mathematical analysis of the amount of synaptic plasticity required to maintain learned circuit function in the presence of random synaptic changes. The central finding, supported by simulations, is that for an "optimal" learning algorithm that rectifies random changes in connectivity, task-related plasticity should generally be smaller than the magnitude of the fluctuations. All reviewers agreed that this is a very interesting theoretical perspective on an important biological problem.
Decision letter after peer review:
Thank you for submitting your article "Optimal synaptic dynamics for memory maintenance in the presence of noise" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, one of whom is a member of our Board of Reviewing Editors, and the evaluation has been overseen by Timothy Behrens as the Senior Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Yonatan Loewenstein (Reviewer #2); Matthias H Hennig (Reviewer #3).
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
The half-century of research on synaptic plasticity has primarily focused on how neural activity gives rise to changes in synaptic connections, and how these changes underlie learning and memory. However, recent studies have shown that, in fact, most synaptic changes are activity-independent. This result is surprising given the generally held belief that activity-dependent changes in connectivity underlie network functionality. This manuscript proposes an explanation of why this should be the case.
Specifically, this work presents a mathematical analysis of the amount of synaptic plasticity required to maintain learned circuit function in the presence of random synaptic changes. The central finding, supported by simulations, is that for an "optimal" learning algorithm that rectifies random changes in connectivity, task-related plasticity should generally be smaller than the magnitude of the fluctuations.
All reviewers agreed that this is a very interesting perspective on an important biological problem. However, the reviewers have had difficulties fully understanding the specific claim, its derivation and potential implications. These issues are detailed in the Main Comments below, and need to be clarified. Additional suggestions are summarised in Other Comments.
Main comments:
1. How much does the main claim depend on the assumed learning algorithm? What is the family of learning algorithms that the authors refer to as "optimal" that have the property that directed changes are smaller in their magnitude than the noise-driven changes?
The reviewers' current understanding is the following; please clarify whether it is correct:
A) If the learning algorithm simply backtracks the noise, then the magnitude of learning-induced changes will be trivially equal to that of the noise-induced changes. An optimal algorithm will not be worse than this trivial one.
B) If the learning algorithm (slowly) goes in the direction of the gradient of the objective function then (1) if the network is in a maximum of the objective function and changes are small then the magnitude of learning-induced changes will be equal to that of the noise-induced changes; (2) if the network is NOT in a maximum of the objective function then unless the noise is in the opposite direction to the gradient, the learning-induced path to the same level of performance would be shorter. Thus, the magnitude of learning-induced changes would be smaller than that of the noise-induced changes.
C) There exist (inefficient) learning algorithms such that the magnitude of the learning-induced changes is larger than the magnitude of the noise-induced ones. A learning algorithm that overshoots because of a too-large learning rate could be one of them, and it is trivial to construct other inefficient learning algorithms.
2. The reviewers have found the mathematical derivation in the main text difficult to follow, in part because it seems to use an unnecessarily complex set of mathematical notations and arguments. The reviewers suggest focusing on intuitive arguments in the main text, to make the arguments more transparent, and the paper more accessible to the broad neuroscience audience. Ultimately, this is up to the authors to decide. In any case, the main arguments need to be clarified as laid out above.
3. The reviewers were not convinced by the role played by the numerical simulations. On one hand, it was not clear what the simulations added with respect to the analytical derivations. Is the purpose to check any specific approximation? On the other hand, the model does not seem to help link the derivations to the original biological question, as it is quite far removed from a biological setup. A minimal suggestion would be to use the model to illustrate more directly the geometric intuition behind the derivation, for instance by showing movements of weights along with some view of the gradient, contrasting optimal and suboptimal.
Other comments:
4. The whole argument rests on an assumption that biological networks optimise a cost function, in particular during reconsolidation. How that assumption applies to the experimental setups detailed in the first part is unclear. At the very least, a clear statement and motivation of this assumption is needed.
5. One of the reviewers was not convinced (but happy to debate) that cellular noise is such a major contributor to synaptic changes as stated in the introduction and in Table 1, as silencing neural activity pharmacologically will almost certainly affect synaptic function strongly. Strong synapses (big spines) tend to be more stable, and in any case it would be very difficult to know which of the observed modifications reported in the referenced papers have functional consequences. It would be very interesting to see what turnover (or weight dynamics) this model would predict under optimal and non-optimal conditions. In Figure 1 it is implied that weight changes continue unchanged in the presence of noise (and optimal learning to maintain the objective); is this actually the case? What concrete experimental predictions would the authors (dare to) make?
[Editors' note: further revisions were suggested prior to acceptance, as described below.]
Thank you for resubmitting your work entitled "Optimal plasticity for memory maintenance in the presence of synaptic fluctuations" for further consideration by eLife. Your revised article has been evaluated by Timothy Behrens (Senior Editor) and a Reviewing Editor.
All reviewers have found that the manuscript has been very much improved and should eventually be published. The reviewers now fully understand the mathematical framework and the main mathematical result. During the consultation, however, it became apparent that all reviewers feel that the main message of the manuscript, and in particular its implications for biology, needs to be further clarified. Two main issues remain, which we suggest should be either addressed in the Results section or discussed in detail:
1. How much do the results rely on the assumption that synaptic changes are "strong"? To what extent is this assumption consistent with experiments? Is this theoretical framework really needed when infinitesimally small noise is immediately corrected by a learning signal, as often assumed (see below for more details)? Is the main result trivial when changes are small? Is the main contribution to show that the result also holds far from this trivial regime, when noise and corrections are large?
2. What are the implications of the main mathematical result for interpreting measurable experimental quantities? The relation with experiments listed in the Discussion seems rather indirect (e.g. on lines 330-335 the interpretation of EM reconstructions in the authors' modelling framework seems unclear; it would be worth unpacking how the papers listed on lines 347-348 are consistent with the main result). Moreover, in many of the panels shown in Figures 5-6, the dependence of the loss on the ratio between compensatory plasticity and synaptic fluctuations is rather flat; what does this imply for experimental data?
More details on the first point from one of the reviewers:
When studying learning in neuronal networks, the underlying assumption is always ("always" to the best of my knowledge) that learning-induced changes are gradual. For example, some form of activity-dependent plasticity has, on average, a negative projection on the gradient of the loss function of the current state. Small changes to synaptic efficacies are made and now the network is in a slightly different (improved) state. Activity-dependent plasticity in that new state has, again, a negative projection on the (new) gradient of the loss function at the new state, etc. If learning is sufficiently slow, we can average over the stochasticities in the learning process, organism's actions, rewards etc. and learning will improve performance.
In contrast to this approach, this paper suggests a very different learning process: activity-independent "noise" induces a LARGE change in the synaptic efficacies. This change is followed by a SINGLE LARGE compensatory learning-induced change. The question addressed in this manuscript is how large this single optimal compensatory learning-induced change should be relative to the single noise-induced change. The fact that the compensatory changes are not assumed to be small and that learning is done in a single step, rather than learning being gradual, allowing the local sampling of the loss function, complicates the mathematical analysis. While for analyzing infinitesimally small changes we only need to consider the local gradient of the loss function, higher-order terms are required when considering single large changes.
What is the justification for this approach given that gradual learning is what we seem to observe in the experiments, specifically those cited in this manuscript? There is a lot of evidence of gradual changes in numbers of spines or synaptic efficacies, etc. If everything is gradual, why not "recompute" the gradient on the fly as is done in all previous models of learning?
https://doi.org/10.7554/eLife.62912.sa1
Author response
Main comments:
1. How much does the main claim depend on the assumed learning algorithm? What is the family of learning algorithms that the authors refer to as "optimal" that have the property that directed changes are smaller in their magnitude than the noise-driven changes?
We have made this much clearer in the main manuscript, with an additional table that summarises the families of learning algorithm satisfying the main claim.
Before proceeding, let us emphasise that we do not consider a learning algorithm itself as ‘optimal’. Learning algorithms induce both a direction and a magnitude of plasticity. We study the optimal magnitude of plasticity, for a given (probably imperfect) direction set by the learning algorithm. This leads to the equation:
in the main text, which is valid for any direction Δc of learning-induced plasticity.
Our main claim (learning-independent plasticity should outcompete learning-induced plasticity) follows as long as equation (1) is less than one. The value of equation (1), at steady-state error, does depend on the direction Δc of learning-induced plasticity and current network state (and hence on the learning algorithm).
For a network to exactly calculate equation (1), and thus work out the optimal magnitude of learning-induced plasticity, it would need to exactly calculate the Hessian ∇²F[w], which we consider implausible. Instead, the network could set the optimal magnitude using an expected value of (1) for an ‘average’ weight w. We calculate this value for different families of learning algorithm (see the new Table 2 in the main text). We now summarise these results (but they are also discussed in the section: Dependence of the optimal, steady-state magnitude of compensatory, learning-induced plasticity on learning algorithm).
The entire space of learning algorithms we consider is the space of incremental error-based learning rules; that is, learning rules that induce small synaptic changes on small time intervals, for which the increment depends on some recent measure of error in a given task or deviation from some pre-specified goal. This covers all standard rules assumed in theoretical studies: supervised, unsupervised and reinforcement learning that express weight change as a differential quantity.
We divide this space of learning algorithms into three cases: 0th-order algorithms (i.e. perturbation-based), 1st-order algorithms (which approximate/calculate the gradient), and 2nd-order algorithms (which approximate/calculate both the gradient and the Hessian).
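To make the three cases concrete, the following NumPy sketch computes one plasticity direction of each type on a random quadratic loss. The 0th-order rule shown is weight perturbation, one common perturbation-based scheme chosen here for illustration; it is an assumption, not necessarily the scheme analysed in the manuscript. The loss, the variable names and the parameter values are likewise hypothetical.

```python
import numpy as np

def loss(w, hessian, w_star):
    """Quadratic task error F[w] = 0.5 (w - w*)^T H (w - w*)."""
    d = w - w_star
    return 0.5 * float(d @ hessian @ d)

def cosine(u, v):
    return float(u @ v) / float(np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(3)
n = 20
a = rng.standard_normal((n, n))
hessian = a @ a.T / n + 0.1 * np.eye(n)
w_star = rng.standard_normal(n)
w = w_star + rng.standard_normal(n)

# 1st-order: exact gradient direction.
gradient = hessian @ (w - w_star)
first_order = -gradient

# 2nd-order: Newton step, using both the gradient and the Hessian.
second_order = -np.linalg.solve(hessian, gradient)

# 0th-order: correlate random weight perturbations with the change in loss
# they cause. No derivative of the loss is evaluated.
sigma, n_perturbations = 1e-3, 5000
base = loss(w, hessian, w_star)
estimate = np.zeros(n)
for _ in range(n_perturbations):
    delta = sigma * rng.standard_normal(n)
    estimate += (loss(w + delta, hessian, w_star) - base) * delta
zeroth_order = -estimate / (n_perturbations * sigma**2)

print("cos(0th-order, 1st-order):", cosine(zeroth_order, first_order))
print("cos(2nd-order, w* - w):   ", cosine(second_order, w_star - w))
```

With many perturbations the 0th-order estimate aligns with the negative gradient; with fewer, it becomes a noisier direction, which is the sense in which such rules are ‘approximate’.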
In the case of a quadratic error function, we show that all three cases should obey our main claim. We then note that nonlinear error functions, near a local minimum, should look quadratic, and thus have analogous conclusions. For nonlinear error functions far from a local minimum, we find that second-order algorithms do not obey our main claim: learning-induced plasticity should exceed learning-independent plasticity at steady-state error. However, in the main text we question the biological plausibility of an accurate second-order algorithm operating at a steady-state error far from a local minimum.
We have made the assumptions underlying the previously described results much clearer. In particular, we describe how we calculate our results for ‘perfect’ (0th/1st/2nd)-order algorithms, and extend these insights to ‘approximate’ (0th/1st/2nd)-order algorithms by assuming that the approximation error term will not project onto the Hessian ∇²F[w] in a more biased way than the ‘perfect’ component of the relevant term.
The reviewers' current understanding is the following, please clarify whether it is correct:
A) If the learning algorithm simply backtracks the noise, then the magnitude of learning-induced changes will be trivially equal to that of the noise-induced changes. An optimal algorithm will not be worse than this trivial one.
The first sentence is correct. The second sentence is not. The word ‘optimal’ has been misconstrued: it applies to the magnitude of plasticity induced by a given learning algorithm, not the algorithm itself (which our results are essentially agnostic to, as explained above). Consider an arbitrary learning algorithm falling into the previously described space of algorithms we consider. The algorithm, especially if it obeys biological constraints, may be inefficient in selecting a ‘good’ direction of plasticity (i.e. one that effectively decreases task error). Thus it may perform worse (in maintaining good steady-state error) than the ‘trivial’ algorithm that backtracks the noise. It may also perform better, especially if the ‘backtracking’ is onto some highly suboptimal network state. Regardless of how good the directions of plasticity induced by the learning algorithm are, there will be an optimal, associated magnitude of plasticity. We claim that this magnitude of plasticity should optimally be smaller than or equal to the magnitude of ongoing synaptic fluctuations. We do not make claims about the direction of plasticity induced by different algorithms.
B) If the learning algorithm (slowly) goes in the direction of the gradient of the objective function then (1) if the network is in a maximum of the objective function and changes are small then the magnitude of learning-induced changes will be equal to that of the noise-induced changes; (2) if the network is NOT in a maximum of the objective function then unless the noise is in the opposite direction to the gradient, the learning-induced path to the same level of performance would be shorter. Thus, the magnitude of learning-induced changes would be smaller than that of the noise-induced changes.
This paragraph talks about gradient descent, which is one of the cases explored in the manuscript. Other cases are described in the main reply above. For the remainder of this reply, we assume that learning-induced plasticity is in the direction of the gradient of the objective function.
If the network is at steady-state task error, and close to a minimum w* of the task error (i.e. maximum of the objective function), we would expect the optimal magnitude of learning-induced plasticity to be less than that of the learning-independent plasticity. The more anisotropic the curvature (i.e. eigenvalues of ∇²F[w*]), the smaller this learning-induced magnitude should be, relative to the learning-independent plasticity. At the extreme, it reaches equality with the magnitude of learning-independent plasticity when the eigenvalues of ∇²F[w*] are completely isotropic.
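The dependence on anisotropy can be illustrated with a short calculation. The sketch below assumes that Q[v] is the normalised quadratic form vᵀ∇²F v / ‖v‖², that fluctuations are uncorrelated with the Hessian (so Q[Δε] is approximately the mean eigenvalue), and that the displacement from w* projects evenly onto the eigenvectors (so Q[∇F] is approximately Σλ³/Σλ²). It reports the ratio Q[Δε]/Q[∇F], where values below one correspond to the regime in which fluctuations should dominate. The log-spaced spectra and the function name are hypothetical.

```python
import numpy as np

def curvature_ratio(eigvals):
    """Q[fluctuation] / Q[gradient] under the even-projection approximation.

    Q[fluctuation] ~ mean(lambda); Q[gradient] ~ sum(lambda^3) / sum(lambda^2).
    """
    eigvals = np.asarray(eigvals, dtype=float)
    q_fluct = eigvals.mean()
    q_grad = np.sum(eigvals**3) / np.sum(eigvals**2)
    return float(q_fluct / q_grad)

n = 100
for condition_number in (1.0, 10.0, 100.0):
    # Hypothetical Hessian spectra, log-spaced between 1 and the condition number.
    eigvals = np.geomspace(1.0, condition_number, n)
    print(f"condition number {condition_number:6.1f}: "
          f"ratio = {curvature_ratio(eigvals):.3f}")
```

An isotropic spectrum gives a ratio of exactly one, and the ratio falls as the spectrum spreads out, in line with the statement above.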
The further the steady-state task error is from a minimum w*, the closer to parity we would expect the optimal magnitude of learning-induced plasticity to be, relative to the level of learning-independent plasticity.
We have revamped the figures, and in particular the new Figure 4 directly explains the geometric intuition behind this claim. Meanwhile, the section ‘Dependence of the optimal magnitude of steady-state, compensatory plasticity on the mechanism’ justifies this claim.
C) There exist (inefficient) learning algorithms such that the magnitude of the learning-induced changes is larger than the magnitude of the noise-induced ones. A learning algorithm that overshoots because of a too-large learning rate could be one of them, and it is trivial to construct other inefficient learning algorithms.
Our understanding of the reviewer's comment is as follows: a learning algorithm could be ‘inefficient’ for two reasons:
1. It selects noisy/imperfect directions of plasticity due to biological constraints. The degree to which this occurs in different learning systems is an open scientific question.
2. Given a (possibly imperfect) direction of plasticity, the accompanying magnitude of plasticity is too high/low. We agree that any learning algorithm could set the magnitude of plasticity associated with a particular direction too high. In this case the magnitude of learning-induced changes could indeed be larger than the learning-independent ones. By lowering the magnitude of learning-induced changes in this case, better steady-state task error would be achieved.
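The effect of setting the magnitude too low or too high can be seen in a toy simulation. The sketch below is an illustrative assumption rather than the model used in the manuscript: each time step applies a random fluctuation of fixed norm, followed by a compensatory step of norm `rho` times that of the fluctuation along the normalised gradient of a quadratic loss. It is run on an isotropic Hessian because that case can be checked by hand; other Hessians can be passed in to explore anisotropic curvature. All names and parameter values are hypothetical.

```python
import numpy as np

def steady_state_loss(hessian, rho, sigma=0.1, n_steps=2000, seed=0):
    """Mean late-time loss when each fluctuation of norm sigma is followed by a
    normalised-gradient step of norm rho * sigma (quadratic loss, minimum at 0)."""
    rng = np.random.default_rng(seed)
    n = hessian.shape[0]
    w = np.zeros(n)
    losses = []
    for step in range(n_steps):
        eps = rng.standard_normal(n)
        w = w + sigma * eps / np.linalg.norm(eps)           # synaptic fluctuation
        grad = hessian @ w
        w = w - rho * sigma * grad / np.linalg.norm(grad)   # compensatory step
        if step >= n_steps // 2:                            # discard the transient
            losses.append(0.5 * float(w @ hessian @ w))
    return float(np.mean(losses))

hessian = np.eye(30)   # isotropic curvature: the simplest, hand-checkable case
for rho in (0.5, 1.0, 3.0):
    print(f"rho = {rho}: steady-state loss = {steady_state_loss(hessian, rho):.4f}")
```

In this isotropic setting, undershooting (`rho = 0.5`) lets error accumulate and overshooting (`rho = 3.0`) repeatedly jumps past the minimum, so both give a higher steady-state loss than `rho = 1.0`.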
2. The reviewers have found the mathematical derivation in the main text difficult to follow, in part because it seems to use an unnecessarily complex set of mathematical notations and arguments. The reviewers suggest focusing on intuitive arguments in the main text, to make the arguments more transparent, and the paper more accessible to the broad neuroscience audience. Ultimately, this is up to the authors to decide. In any case, the main arguments need to be clarified as laid out above.
We've tried our best to incorporate this suggestion. In particular, almost all of the maths is now contained in yellow boxes, which are separated from the main text. A reader who does not want to engage with the mathematics can read the entirety of the Results section without referring to the yellow boxes. We've expanded the description of the geometric intuition behind our results, and added new figures to help with this.
3. The reviewers were not convinced by the role played by the numerical simulations. On one hand, it was not clear what the simulations added with respect to the analytical derivations. Is the purpose to check any specific approximation? On the other hand, the model does not seem to help link the derivations to the original biological question, as it is quite far removed from a biological setup. A minimal suggestion would be to use the model to illustrate more directly the geometric intuition behind the derivation, for instance by showing movements of weights along with some view of the gradient, contrasting optimal and suboptimal.
The numerical simulations themselves were there just to check the validity of the analytic derivations under different conditions (i.e. different magnitudes of learning-independent plasticity, and different accuracies of learning-induced plasticity). We agree that they did not provide much geometric intuition into the results, and that this geometric intuition was lacking in the original submission. We have rectified this by providing more detailed figures highlighting the geometric intuition (Figure 4 in particular). These depict your ‘minimal suggestion’. These were drawn, rather than derived by simulation, since they were depicting precise geometrical features of weight changes at a particular timepoint that we found difficult to cleanly show through simulation. They contrast ‘optimal’ and ‘suboptimal’, as requested, and provide intuition into why the optimal magnitude of learning-induced plasticity is usually lower than the fixed, learning-independent term. We added an extra ‘motivating’ simulation in Figure 1.
The main claim of the paper is quite generic, and not specific to a particular circuit architecture and/or ‘biologically plausible’ learning rule. We therefore decided to test the claim on an abstract, simple-to-explain setup, where we could easily and intuitively manipulate the ‘accuracy’ of learning-induced plasticity. We decided not to run simulations on a more biologically motivated circuit architecture/learning rule. If we had done so, we would have had to choose a particular, but arbitrary, biologically motivated setup. This would have required a detailed explanation that was unrelated to the point of the paper. It may have additionally confused casual readers as to the generic nature of the results. We have more clearly described the motivation behind the examples in the new section: Motivating example.
Other comments:
4. The whole argument rests on an assumption that biological networks optimise a cost function, in particular during reconsolidation. How that assumption applies to the experimental setups detailed in the first part is unclear. At the very least, a clear statement and motivation of this assumption is needed.
We've rewritten this part of the paper to make this assumption much more explicit. We now list our exact assumptions at the beginning of the 'Modelling setup' section. We note that any descriptive quantity such as 'memory quality' or 'learning performance' makes the implicit assumption of a loss function. Of course, the loss function may not be explicitly optimised by a neural circuit, as the reviewer notes, but this does not actually matter from a mathematical point of view. As long as there is some organised state that a network evolves toward, one can posit an implicit loss function (or set of loss functions) that is being optimised in order to carry out the kind of analysis we performed here.
This is analogous to very widely known 'cost functions' in physics: closed thermodynamic systems tend to maximise entropy over time; conservative mechanical systems minimise an abstract quantity called action. The components in these systems don't represent or interact with such abstract cost functions, yet the theories that assume them capture the relevant phenomena very successfully.
We would say that any view of reconsolidation that relies on retaining (to the greatest possible extent) some previously learned information is implicitly trying to optimise a steady state for some implicit loss function. Clearly, the notion of retaining previously learned information does not make sense in e.g. an in vitro setup. Nevertheless, the low-level plasticity mechanisms are likely to have similarities to those in an intact setup, even if the signals they are receiving (e.g. patterns of incoming neural activity) are pathological. That said, we might speculate that synapses could tune the extent to which they respond to 'endogenous' (task-independent) signals versus external signals that could convey task information in the intact animal. If true, this is a potential way to interpret the in vitro data. We have included this point in the discussion.
We have reworded the introduction, making the link between experimental results and our analysis clearer. Note that we do not directly analyse the data provided by the relevant experimental papers. We explored the consequences of a particular hypothesis: that neural circuits attempt to retain previously learned information in the face of synaptic fluctuations, through compensatory plasticity mechanisms. We then found that many experiments across the literature seemed consistent with our view, even if they don't conclusively imply it. Our work serves as a motivation to conduct further, specific experiments that attempt to isolate plasticity attributable to learning-independent mechanisms.
5. One of the reviewers was not convinced (but happy to debate) that cellular noise is such a major contributor to synaptic changes as stated in the introduction and in Table 1, as silencing neural activity pharmacologically will almost certainly affect synaptic function strongly. Strong synapses (big spines) tend to be more stable, and in any case it would be very difficult to know which of the observed modifications reported in the referenced papers have functional consequences. It would be very interesting to see what turnover (or weight dynamics) this model would predict under optimal and non-optimal conditions. In Figure 1 it is implied that weight changes continue unchanged in the presence of noise (and optimal learning to maintain the objective); is this actually the case? What concrete experimental predictions would the authors (dare to) make?
Firstly, we have reworded the manuscript to make clearer that synaptic fluctuations may not only represent noise. They represent any plasticity process independent of the learned task: the probability of such a process increasing a weight is independent of whether such an increase is locally beneficial to task performance. Intrinsic processes such as homeostatic plasticity may also be important.
We agree that pharmacological silencing will strongly affect synaptic activity. Note that we also found experiments in the literature where the proportion of learning/activity-independent plasticity was estimated without pharmacological intervention. [2] looked at commonly innervated spines sharing a pre/post neuron, and by observing their in vitro weight dynamics found that the majority of such dynamics were accounted for by activity-independent processes. [2] also reached the same conclusion in vivo by analysing commonly innervated synapses from an EM reconstruction of brain tissue from [3]. Meanwhile, [4] did a control experiment of raising mice in a visually impoverished environment, to compare against pharmacological silencing.
The literature does strongly suggest that large spines are more stable (e.g. [6]). We don't believe this contradicts any of the conclusions of the paper. We have added some more detail in the results and the discussion about how our results should be interpreted in the case where synaptic fluctuations are not distributed evenly within the neural circuit. In particular, the more that synaptic fluctuations are biased to more heavily alter less functionally important synapses, the lower the optimal magnitude of compensatory plasticity.
Note that even for simplified probabilistic models of evenly distributed synaptic fluctuations, such as a white noise process, larger spines would still be more stable, as the probability of constant-magnitude fluctuations eliminating a spine within a time period would decrease with spine size. In general, our results apply for any probabilistic form of synaptic fluctuation, as long as the probability of synaptic fluctuations increasing/decreasing a particular synaptic strength is independent of whether such a change would be beneficial for task performance.
Figure 1 does imply that weight changes continue at steady state error. We have changed Figure 1 to show an explicit numerical simulation showing precisely that. As long as compensatory plasticity is not directly backtracking synaptic fluctuations, there will be an overall change in synaptic weights over time.
During learning, the magnitude of compensatory (i.e. learning) plasticity will be larger than at the point that steady state error is achieved. We have an extra section on the optimal magnitude of compensatory plasticity during learning describing this. Thus, the overall magnitude of plasticity (compensatory plus synaptic fluctuations) is predicted to be greater during learning.
A concrete experimental prediction follows from this observation, which we now outline in the discussion: the proportion of learning-independent plasticity should be lower in a system that is actively learning, as opposed to retaining previously learned information. Interestingly, the literature seems to tentatively support this. Both [2] and [1] considered the covariance of functional synaptic strengths for co-innervated synapses, but in neocortex and hippocampus respectively. The hippocampal experiment showed much less activity-independent plasticity (compare Figure 1 of [1] to Figure 8 of [2]). This would make sense in light of our results if the hippocampus was in a phase of active learning, while the neocortex was in a phase of retaining previously learned information. In fact, many conventional cognitive theories of hippocampus and neocortex characterise the hippocampus as a continual, active learner, with the neocortex as a consolidator of previously learned information (see e.g. [5]).
References
[1] Thomas M Bartol Jr, Cailey Bromer, Justin Kinney, Michael A Chirillo, Jennifer N Bourne, Kristen M Harris, and Terrence J Sejnowski. Nanoconnectomic upper bound on the variability of synaptic plasticity. eLife, 4:e10778, 2015.
[2] Roman Dvorkin and Noam E. Ziv. Relative Contributions of Specific Activity Histories and Spontaneous Processes to Size Remodeling of Glutamatergic Synapses. PLOS Biology, 14(10):e1002572, 2016.
[3] Narayanan Kasthuri, Kenneth Jeffrey Hayworth, Daniel Raimund Berger, Richard Lee Schalek, Jose Angel Conchello, Seymour Knowles-Barley, Dongil Lee, Amelio Vazquez Reina, Verena Kaynig, Thouis Raymond Jones, Mike Roberts, Josh Lyskowski Morgan, Juan Carlos Tapia, H. Sebastian Seung, William Gray Roncal, Joshua Tzvi Vogelstein, Randal Burns, Daniel Lewis Sussman, Carey Eldin Priebe, Hanspeter Pfister, and Jeff William Lichtman. Saturated reconstruction of a volume of neocortex. Cell, 162(3):648–661, 2015.
[4] Akira Nagaoka, Hiroaki Takehara, Akiko Hayashi-Takagi, Jun Noguchi, Kazuhiko Ishii, Fukutoshi Shirai, Sho Yagishita, Takanori Akagi, Takanori Ichiki, and Haruo Kasai. Abnormal intrinsic dynamics of dendritic spines in a fragile X syndrome mouse model in vivo. Scientific Reports, 6:26651, 2016.
[5] Randall C. O'Reilly and Jerry W. Rudy. Conjunctive representations in learning and memory: principles of cortical and hippocampal function. Psychological Review, 108(2):311, 2001.
[6] Nobuaki Yasumatsu, Masanori Matsuzaki, Takashi Miyazaki, Jun Noguchi, and Haruo Kasai. Principles of Long-Term Dynamics of Dendritic Spines. Journal of Neuroscience, 28(50):13592–13608, 2008.
[Editors' note: further revisions were suggested prior to acceptance, as described below.]
1. How much do the results rely on the assumption that synaptic changes are "strong"? To what extent is this assumption consistent with experiments? Is this theoretical framework really needed when infinitesimally small noise is immediately corrected by a learning signal, as often assumed (see below for more details)? Is the main result trivial when changes are small? Is the main contribution to show that the result also holds far from this trivial regime, when noise and corrections are large?
There are a number of questions to unpack here. Before doing so we'd like to point out that the main results go beyond the calculations that corroborate magnitudes of fluctuations and systematic plasticity in experiments. The contributions include an analysis framework that lets us query and understand general relationships between learning rule quality and ongoing synaptic change without making detailed assumptions about circuit architecture and plasticity rules. For example, the relationships we derived reveal that fluctuations should dominate at steady state (significantly exceeding 50% of total ongoing change) when a learning rule closely approximates the gradient of a loss function. Given the intense interest in whether approximations of gradient descent occur biologically in synaptic learning rules, this observation alone says that the high degree of turnover observed in some parts of the brain is in fact consistent with, and may be regarded as circumstantial evidence for, gradient-like learning rules. We'd argue that this insight and the framework that provides it are far from trivial.
We now turn to the specifics of the reviewers' comment.
More details on the first point from one of the reviewers:
When studying learning in neuronal networks, the underlying assumption is always ("always" to the best of my knowledge) that learning-induced changes are gradual. For example, some form of activity-dependent plasticity has, on average, a negative projection on the gradient of the loss function of the current state. Small changes to synaptic efficacies are made and now the network is in a slightly different (improved) state. Activity-dependent plasticity in that new state has, again, a negative projection on the (new) gradient of the loss function at the new state, etc.
It is an open question whether 'gradual' change is the only way by which synaptic plasticity manifests experimentally. For instance, recent results from Jeff Magee's lab show that a single burst of synaptic plasticity over a single behavioural trial can activate place fields in mouse hippocampal CA1 neurons [1]. Moreover, this plasticity burst can be triggered even where there is a difference of seconds between the necessary factors of synaptic transmission and postsynaptic activation: the synapse integrates information over a long window before potentiating (or not). Even the more classical LTP papers from the last few decades, which we now cite, are somewhat equivocal on whether synaptic changes occur in one lump or can be reduced to incremental changes. What is undeniable, empirically, is that a large change (e.g. 200% potentiation) can occur on a timescale of a few minutes, typical of most 'induction windows'. Relative to behavioural timescales this is rather fast, so should it be modelled as gradual? In the end it might simply be mathematical convention or convenience that has led most theory papers to assume continuous changes. Fortunately, the setup of our paper accounts for both cases: continual, gradual change or temporally sparse bursts of synaptic plasticity. We have added detailed discussion points in the paper to make this clear.
If learning is sufficiently slow, we can average over the stochasticities in the learning process, organism's actions, rewards etc. and learning will improve performance.
We disagree with this statement in the context where learning-independent synaptic fluctuations are present. Learning must be fast enough to compensate for the synaptic fluctuations. If learning is arbitrarily slow, synaptic fluctuations will grow unchecked. How fast should learning (i.e. compensatory plasticity) be to achieve optimal learning performance in the presence of a given degree of synaptic fluctuations? That is the subject of the paper. We do agree with this statement in the context where the only source of stochasticity is the learning rule. In this case, the magnitude of stochasticity decreases with the speed of learning.
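The trade-off described here can be illustrated with a toy calculation. The sketch below is not the simulation used in the paper: it assumes a quadratic task error with a diagonal Hessian, exact gradient descent with a fixed learning rate `eta`, and additive white-noise fluctuations of standard deviation `sigma` per step (all names and values are hypothetical). Under those assumptions each weight follows an AR(1) process whose stationary variance is known in closed form, so the steady-state error can be evaluated directly for each learning rate.

```python
import numpy as np

def steady_state_loss(eta, curvatures, sigma):
    """Expected steady-state loss of w <- w - eta*H*w + noise, for diagonal H.

    Each coordinate is an AR(1) process with stationary variance
    sigma^2 / (1 - (1 - eta*lam)^2); the loss is 0.5 * sum(lam * variance).
    """
    lam = np.asarray(curvatures, dtype=float)
    decay = 1.0 - eta * lam
    if np.any(np.abs(decay) >= 1.0):
        return np.inf  # no correction at all (eta = 0) or unstable updates
    variance = sigma**2 / (1.0 - decay**2)
    return 0.5 * np.sum(lam * variance)

curvatures = np.array([0.5, 1.0, 2.0])  # hypothetical Hessian eigenvalues
sigma = 0.1                             # hypothetical fluctuation size per step
etas = np.linspace(0.01, 0.99, 99)      # learning rates inside the stable range
losses = np.array([steady_state_loss(e, curvatures, sigma) for e in etas])

best_eta = etas[np.argmin(losses)]
print(f"lowest steady-state loss {losses.min():.4f} at eta = {best_eta:.2f}")
```

In this toy setting the steady-state error grows without bound as `eta` approaches zero (fluctuations accumulate unchecked) and again as `eta` approaches the stability limit (corrections overshoot), with a minimum in between.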
In contrast to this approach, this paper suggests a very different learning process: activity-independent "noise" induces a LARGE change in the synaptic efficacies. This change is followed by a SINGLE LARGE compensatory learning-induced change. The question addressed in this manuscript is how large should this single optimal compensatory learning-induced change be relative to the single noise-induced change. The fact that the compensatory changes are not assumed to be small and that learning is done in a single step, rather than learning being gradual, allowing the local sampling of the loss function, complicates the mathematical analysis. While for analyzing infinitesimally small changes we only need to consider the local gradient of the loss function, higher-order terms are required when considering single large changes.
There is no sequential ordering of the compensatory plasticity and synaptic fluctuation terms, and we are not considering an alternating scenario, where a compensatory plasticity change reacts to a noisy change. Instead, we consider the relative proportions of the two, ongoing, plasticity terms over a (potentially infinitesimally small) time window. Our mathematics is in fact a first order, and not a second order analysis, and the results (in the regime of steady state task error) thus hold in the infinitesimal limit of small changes over the considered time window. We appreciate that this wasn't clear in the previous iteration, and have rewritten the Results section to make this more clear.
How can our analysis, which critically depends upon the Hessian (i.e. second order term) of the loss function, be a first order (i.e. linear) analysis, which is the relevant analysis in the limit of infinitesimally small changes? To find the optimal rate of compensatory plasticity, we have to differentiate the effect of compensatory plasticity with respect to its magnitude, and set the derivative equal to zero. This derivative-taking (unlabelled equation in Box 3, between equations 8 and 9) turns coefficients that were previously quadratic (second order) in the rate of compensatory plasticity (i.e. the Hessian), into first order (linear) coefficients. Indeed, our formula is locally linear: if we double the rate of fluctuations, it says that the optimal rate of compensatory plasticity should correspondingly double. The aforementioned derivative also turns the first order term in the loss function (i.e. the local gradient mentioned by the reviewer) into a zeroth order (constant) term that is independent of plasticity rates. As a demonstration, suppose an ongoing compensatory plasticity mechanism (gradient descent, for simplicity) corrects the effects of white-noise synaptic fluctuations, and that task error is at a steady state F[w] = k. Over an infinitesimal time period δt, the white noise fluctuations change the synaptic weights by a magnitude ϵ(δt). What rate of compensatory plasticity is required to cancel out the effect of white noise on the task error?
White noise is uncorrelated in expectation with the gradient of the task error. A first order Taylor expansion that only considers the local gradient would therefore give
E[F[w(t) + ϵ(δt)] − F[w(t)]] = E[∇F[w(t)]^{T} ϵ(δt)] = 0.
This analysis would suggest that the optimal rate of compensatory plasticity over δt is zero, since the synaptic fluctuations have no effect, to first order, on the task error. This is a zeroth order approximation of the optimal magnitude of compensatory plasticity. In other words, it is independent of the rate of synaptic fluctuations. They could double, and this analysis would still suggest that the optimal rate of compensatory plasticity should still be zero. This is why an analysis only considering the local gradient of the loss function is insufficient, even in the infinitesimal limit of small weight changes.
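The distinction between the first-order and second-order effects of white noise can be checked numerically. The following sketch assumes a hypothetical quadratic task error F(w) = ½ wᵀHw with a randomly generated positive-definite Hessian `H`, for which the expansion F(w + ϵ) − F(w) = ∇Fᵀϵ + ½ ϵᵀHϵ is exact. The dimension, seed and noise levels are arbitrary choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20

# Hypothetical quadratic task error F(w) = 0.5 * w^T H w, H positive definite.
A = rng.standard_normal((n, n))
H = A @ A.T / n + 0.1 * np.eye(n)
w = rng.standard_normal(n)
grad = H @ w  # gradient of F at w

def mean_changes(sigma, samples=100_000):
    """Monte Carlo means of the first- and second-order changes in F."""
    eps = sigma * rng.standard_normal((samples, n))    # white-noise fluctuations
    first_order = eps @ grad                           # grad^T eps, per sample
    second_order = 0.5 * ((eps @ H) * eps).sum(axis=1) # 0.5 * eps^T H eps
    return first_order.mean(), second_order.mean()

for sigma in (0.01, 0.02):
    lin, quad = mean_changes(sigma)
    predicted = 0.5 * sigma**2 * np.trace(H)
    print(f"sigma={sigma}: first-order mean {lin:+.2e}, "
          f"second-order mean {quad:.2e}, 0.5*sigma^2*tr(H) {predicted:.2e}")
```

Up to sampling error, the averaged first-order term is zero at every noise level, whereas the averaged second-order term matches ½σ² tr(H) and quadruples when σ doubles. It is this second term that compensatory plasticity has to offset.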
We have rewritten the Results section to make clear the correspondence between magnitudes of plasticity over small time intervals Δt, and instantaneous rates of plasticity, and have rephrased our terminology, where appropriate, to refer to plasticity 'rates'.
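The order-counting argument given in this response can be written out as a short calculation. This is a sketch under stated assumptions rather than a reproduction of Box 3: the symbols $c$ (magnitude of compensatory plasticity over the window), $\hat{c}$ (its unit direction), and $g$, $H$ (gradient and Hessian of $F$ at $w$) are shorthand introduced here; the fluctuations $\epsilon$ are assumed to be uncorrelated with both $g$ and $H\hat{c}$; and $\hat{c}^{T} H \hat{c}$ is assumed to be positive.

```latex
\begin{align*}
% Weight change over a short window: compensatory plus fluctuation terms.
\Delta w &= c\,\hat{c} + \epsilon,
  \qquad \mathbb{E}[g^{T}\epsilon] = 0,
  \qquad \mathbb{E}[\hat{c}^{T} H \epsilon] = 0, \\[4pt]
% Second-order expansion of the expected change in task error.
\mathbb{E}[\Delta F] &\approx c\, g^{T}\hat{c}
  + \tfrac{1}{2}\, c^{2}\, \hat{c}^{T} H \hat{c}
  + \tfrac{1}{2}\, \mathbb{E}[\epsilon^{T} H \epsilon], \\[4pt]
% Differentiating in c: the Hessian term becomes linear, the gradient term constant.
\frac{\partial\, \mathbb{E}[\Delta F]}{\partial c}
  &= g^{T}\hat{c} + c\, \hat{c}^{T} H \hat{c} = 0
  \;\;\Longrightarrow\;\;
  c^{*} = -\frac{g^{T}\hat{c}}{\hat{c}^{T} H \hat{c}}, \\[4pt]
% Imposing steady state, E[\Delta F] = 0, at c = c^{*}.
0 &= -\tfrac{1}{2}\,(c^{*})^{2}\, \hat{c}^{T} H \hat{c}
  + \tfrac{1}{2}\, \mathbb{E}[\epsilon^{T} H \epsilon]
  \;\;\Longrightarrow\;\;
  c^{*} = \sqrt{\frac{\mathbb{E}[\epsilon^{T} H \epsilon]}{\hat{c}^{T} H \hat{c}}}.
\end{align*}
```

The final expression is linear in the fluctuation magnitude: scaling $\epsilon$ by a factor of two scales $c^{*}$ by the same factor, while the gradient term $g^{T}\hat{c}$ enters the stationarity condition only as a constant.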
What is the justification to this approach given that gradual learning is what we seem to observe in the experiments, specifically those cited in this manuscript? There is a lot of evidence of gradual changes in numbers of spines or synaptic efficacies, etc. If everything is gradual, why not "recompute" the gradient on the fly as is done in all previous models of learning?
In light of the previous comments, we can say that the manuscript is consistent with, and indeed describes, the gradual learning observed in the mentioned experiments. In the numerical simulations, each timestep corresponds to an "on-the-fly recomputation" of the (noisy) gradient, consistent with previous models of learning.
2. What are the implications of the main mathematical result for interpreting measurable experimental quantities? The relation with experiments listed in the Discussion seems rather indirect (e.g. on lines 330–335 the interpretation of EM reconstructions in the authors' modelling framework seems unclear; it would be worth unpacking how the papers listed on lines 347–348 are consistent with the main result). Moreover, in many of the panels shown in Figures 5–6, the dependence of the loss on the ratio between compensatory plasticity and synaptic fluctuations is rather flat; what does this imply for experimental data?
We have expanded the discussion on how our modelling framework relates to the mentioned papers. In particular, we have spelled out how our notion of 'compensatory plasticity' may be approximated using:
– the covariance in synaptic strengths for co-innervated synapses in EM reconstructions;
– the 'activity-independent' plasticity in experiments that suppress neural activity.
We also go into greater detail about the biological assumptions inherent in our modelling. Even more detail is provided in the first subsection of the Results (review of key experimental findings), as we are aware of the need to keep the discussion reasonably concise.
As to the relative flatness of the curves in Figure 6, this is for a linear network and is included for mathematical completeness. We have now made this a supplement to Figure 5 (nonlinear network), which is likely more relevant biologically. In Figure 5 itself, the reviewers will note that the relationship is only flat for an interval of relatively poor quality learning rules (middle row, where steady state error is almost 2 orders of magnitude worse than the case in the top row, which itself has a learning rule with a correlation of only 0.1 with a gradient). We included this along with the third row (where the dependence is once again steep) to show the general trend, in which a strong U-shape is more typical. In any case, a flat dependence for high compensatory plasticity, while consistent with the theory in certain regimes, is less consistent with the experimental data we reviewed and sought to account for.
Additional comment
Lines 180, 198: Figure 3d is referred to from the text but seems to be missing!
We corrected this typo, which occurred when we merged panels c and d in a draft copy of Figure 3.
References
[1] Katie C Bittner, Aaron D Milstein, Christine Grienberger, Sandro Romani, and Jeffrey C Magee. Behavioral time scale synaptic plasticity underlies CA1 place fields. Science, 357(6355):1033–1036, 2017.
https://doi.org/10.7554/eLife.62912.sa2
Article and author information
Author details
Funding
European Commission (StG 2016 716643 FLEXNEURO)
 Dhruva V Raman
 Timothy O'Leary
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
This work was supported by ERC grant StG 2016 716643 FLEXNEURO.
Senior Editor
 Timothy E Behrens, University of Oxford, United Kingdom
Reviewing Editor
 Srdjan Ostojic, Ecole Normale Superieure Paris, France
Reviewers
 Yonatan Loewenstein, Hebrew University of Jerusalem, Israel
 Matthias H Hennig, University of Edinburgh, United Kingdom
Publication history
 Preprint posted: August 19, 2020
 Received: September 8, 2020
 Accepted: September 13, 2021
 Accepted Manuscript published: September 14, 2021 (version 1)
 Version of Record published: October 11, 2021 (version 2)
Copyright
© 2021, Raman and O'Leary
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.