Selective consolidation of learning and memory via recall-gated plasticity
Abstract
In a variety of species and behavioral contexts, learning and memory formation recruits two neural systems, with initial plasticity in one system being consolidated into the other over time. Moreover, consolidation is known to be selective; that is, some experiences are more likely to be consolidated into long-term memory than others. Here, we propose and analyze a model that captures common computational principles underlying such phenomena. The key component of this model is a mechanism by which a long-term learning and memory system prioritizes the storage of synaptic changes that are consistent with prior updates to the short-term system. This mechanism, which we refer to as recall-gated consolidation, has the effect of shielding long-term memory from spurious synaptic changes, enabling it to focus on reliable signals in the environment. We describe neural circuit implementations of this model for different types of learning problems, including supervised learning, reinforcement learning, and autoassociative memory storage. These implementations involve synaptic plasticity rules modulated by factors such as prediction accuracy, decision confidence, or familiarity. We then develop an analytical theory of the learning and memory performance of the model, in comparison to alternatives relying only on synapse-local consolidation mechanisms. We find that recall-gated consolidation provides significant advantages, substantially amplifying the signal-to-noise ratio with which memories can be stored in noisy environments. We show that recall-gated consolidation gives rise to a number of phenomena that are present in behavioral learning paradigms, including spaced learning effects, task-dependent rates of consolidation, and differing neural representations in short- and long-term pathways.
eLife assessment
This fundamental work proposes a novel mechanism for memory consolidation where short-term memory provides a gating signal for memories to be consolidated into long-term storage. The work combines extensive analytical and numerical work applied to three different scenarios and provides a convincing analysis of the benefits of the proposed model, although some of the analyses are limited to the type of memory consolidation the authors consider (and don't consider), which limits the impact. The work will be of interest to neuroscientists and many other researchers interested in the mechanistic underpinnings of memory.
https://doi.org/10.7554/eLife.90793.3.sa0
Introduction
Systems that learn and remember confront a tradeoff between memory acquisition and retention. Plasticity enables learning but can corrupt previously stored information. Consolidation mechanisms, which stabilize or render more resilient certain plasticity events associated with memory formation, are key to navigating this tradeoff (Kandel et al., 2014). Consolidation may be mediated both by molecular dynamics at the synapse level (synaptic consolidation) and dynamics at the neural population level (systems consolidation).
In this work, we present a model and theoretical analysis of selective systems consolidation, with a focus on understanding the computational advantages it offers in terms of long-term learning and memory storage. Several prior theoretical studies have examined synaptic consolidation models from this perspective, providing descriptions of how synaptic consolidation affects the strength and lifetime of memories (Fusi et al., 2005; Lahiri and Ganguli, 2013; Benna and Fusi, 2016). In these studies, synapses are modeled with multiple internal variables, operating at distinct timescales, which enable individual synapses to exist in more labile or more rigid (‘consolidated’) states. Such models can prolong memory lifetime and recapitulate certain memory-related phenomena, notably power-law forgetting curves. Moreover, this line of work has established theoretical limits on the memory retention capabilities of any such synaptic model, and shown that biologically realistic models can approximately achieve these limits (Lahiri and Ganguli, 2013; Benna and Fusi, 2016). These theoretical frameworks leave open the question of what computational benefit is provided by systems consolidation mechanisms that take place in a coordinated fashion across populations of neurons.
The term systems consolidation most often refers to the process by which memories stored in the hippocampus are transferred to the neocortex (Figure 1A; Squire and Alvarez, 1995; Frankland and Bontempi, 2005; McClelland et al., 1995; McClelland and Goddard, 1996). Prior work has described the hippocampus and the neocortex as ‘complementary learning systems’, emphasizing their distinct roles: the hippocampus stores information about individual experiences, and the neocortex extracts structure from many experiences (McClelland et al., 1995; McClelland and Goddard, 1996). Related phenomena also occur in other brain systems. In rodents, distinct pathways underlie the initial acquisition and long-term execution of some motor skills, with motor cortex apparently passing off responsibility to basal ganglia structures as learning progresses (Kawai et al., 2015; Dhawale et al., 2021). A similar consolidation process is observed during vocal learning in songbirds, where song learning is dependent on the region LMAN but song execution can, after multiple days of practice, become LMAN-independent and rely instead on the song motor pathway (Figure 1B; Warren et al., 2011). Some insects also display a form of systems consolidation. Olfactory learning experiments in fruit flies reveal that short-term memory (STM) and long-term memory (LTM) retrieval recruit distinct neurons within the mushroom body, and the short-term pathway is necessary for LTM formation (Figure 1C; Cervantes-Sandoval et al., 2013; Dubnau and Chiang, 2013).
The examples above are all characterized by two essential features: the presence of two systems involved in learning similar information and an asymmetric relationship between them, such that learning in one system (the ‘LTM’) is facilitated by the other (the ‘STM’). Moreover, mounting evidence indicates that across all these systems, there exist mechanisms that selectively modulate or gate consolidation into long-term storage. In flies, for instance, recent work has shown that short-term olfactory memory recall gates LTM storage via a disinhibitory circuit, such that repeated stimulus-outcome pairings are consolidated into LTM but once-presented pairings are not (Awata et al., 2019). A recent study in songbirds indicates that the rate at which song learning is consolidated into the song motor pathway is modulated by performance quality (Tachibana et al., 2022). Finally, a large body of work has shown that the propensity of hippocampal memories to be cortically consolidated is modulated by a variety of factors including repetition, reliability, and novelty (Terada et al., 2022; Gorriz et al., 2023; Jackson et al., 2006; Brodt et al., 2016).
The ubiquity of the systems consolidation motif across species, neural circuits, and behaviors suggests that it offers broadly useful computational advantages that are complementary to those offered by synaptic mechanisms. In this work, we propose that the ability to selectively consolidate memories is the key computational feature that distinguishes systems from synaptic memory consolidation. To formalize this idea, we generalize prior theoretical studies by considering environments in which some memories should be prioritized more than others for long-term retention. We then introduce a model of selective systems consolidation and show that it can provide substantial performance advantages in such environments. In the model, synaptic updates are consolidated into LTM depending on their consistency with knowledge already stored in STM. We term this mechanism ‘recall-gated consolidation’. This model is well-suited to prioritize storage of reliable patterns of synaptic updates which are consistently reinforced over time. We derive neural circuit implementations of this model for several tasks of interest. These involve plasticity rules modulated by globally broadcast factors such as prediction accuracy, confidence, and familiarity. We develop an analytical theory that describes the limits on learning performance achievable by synaptic consolidation mechanisms and shows that recall-gated systems consolidation exhibits qualitatively different and complementary benefits. Our theory depends on a quantitative treatment of environmental statistics, in particular the statistics with which similar events recur over time. Different model parameter choices suit different environmental statistics, and give rise to different learning behavior, including spaced training effects. The model makes predictions about the dependence of consolidation rate on the consistency of features in an environment, and the amount of time spent in it. It also predicts that STM benefits from employing sparser representations compared to LTM. A variety of experimental data support predictions of the model, which we review in the Discussion.
Results
Following prior work (Fusi et al., 2005; Benna and Fusi, 2016), we consider a network of neurons connected by an ensemble of N synapses whose weights are described by a vector $\mathbf{w}\in {\mathbb{R}}^{N}$ (our analysis also generalizes to synapses that store additional auxiliary information besides their weight; see Methods). For now, we are agnostic as to the structure of the network and its synaptic connections. The network’s synapses are subject to a stream of patterns of candidate synaptic potentiation and depression events. We refer to such a pattern as a memory, defined by a vector ${w}^{*}\in {\mathbb{R}}^{N}$. Synaptic weights are updated by memories according to a plasticity rule (see Methods). The plasticity rule always updates $\mathbf{w}$ to be more aligned with the memory vector ${w}^{*}$; thus, the memory may be interpreted as a ‘target’ value for the synaptic weights. One simple example of a synaptic update rule is a ‘binary switch’ model, in which synapses can exist in two states (active or inactive), and candidate synaptic updates are binary (potentiation or depression). In this model, inactive synapses activate (resp. active synapses inactivate) in response to potentiation (resp. depression) events with some probability p. However, our systems consolidation model can be used with any underlying synaptic mechanisms, and we will consider a variety of synaptic plasticity rules as underlying substrates.
In our framework, the same memory can be encountered repeatedly over time, and we will refer to such repeated encounters as ‘reinforcement’ of a memory (not to be confused with reward-contingent notions of reinforcement). We distinguish memories by the reliability, or frequency, with which they are reinforced. The notion of reliability in our framework is meant to capture the idea that the structure of events in the world which drive synaptic plasticity is in some cases consistent over time, and in other cases inconsistent. For now, we focus on a simple environment model which captures this essential distinction, in which there are two kinds of memories: ‘reliable’ and ‘unreliable’. Reliable memories are consistent patterns of synaptic updates that are reinforced regularly over time, whereas unreliable memories are spurious, randomly sampled patterns of synaptic updates. Concretely, in simulations, we assume that a given reliable memory is reinforced with independent probability λ at each timestep, and otherwise a randomly sampled unreliable memory is encountered.
A useful measure of system performance is the memory recall factor, defined as the overlap $r=\mathbf{w}\cdot {w}^{*}$ between a memory ${w}^{*}$ and the current synaptic state $\mathbf{w}$. Specifically, we are interested in the signal-to-noise ratio (SNR) of the recall factor for reliable memories (Fusi et al., 2005; Benna and Fusi, 2016), which normalizes the recall strength relative to the expected value of the fluctuations in recall factors for random memory vectors ${w}_{\text{rand}}^{*}$. We assume for simplicity that memories and weight vectors are mean-centered, so that the SNR may be written as follows:
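One reading of this definition, for mean-centered vectors, is $\text{SNR}=(\mathbf{w}\cdot {w}^{*})/\sqrt{E[{(\mathbf{w}\cdot {w}_{\text{rand}}^{*})}^{2}]}$. The following Python sketch estimates that quantity for the binary switch model described above. It is an illustration under stated assumptions, not the authors' simulation code: synaptic states and memories are coded as ±1, the denominator is estimated from a finite set of random probe memories, and all function names and parameter values are hypothetical.

```python
import math
import random

def random_memory(n, rng):
    """Random, mean-centered memory: each entry is +1 or -1 with equal probability."""
    return [rng.choice((-1, 1)) for _ in range(n)]

def binary_switch_update(w, memory, p, rng):
    """Each synapse switches to the memory's target state with probability p."""
    return [m if rng.random() < p else wi for wi, m in zip(w, memory)]

def recall(w, memory):
    """Recall factor r = w . w*."""
    return sum(wi * mi for wi, mi in zip(w, memory))

def recall_snr(w, memory, rng, n_probe=200):
    """Recall of `memory` divided by the RMS recall of random probe memories."""
    n = len(w)
    mean_sq = sum(recall(w, random_memory(n, rng)) ** 2 for _ in range(n_probe)) / n_probe
    return recall(w, memory) / math.sqrt(mean_sq)

rng = random.Random(0)
n, p = 4000, 0.2                      # illustrative values, not taken from the paper
w = random_memory(n, rng)             # arbitrary initial synaptic state
target = random_memory(n, rng)
w = binary_switch_update(w, target, p, rng)
snr_after_storage = recall_snr(w, target, rng)
print(round(snr_after_storage, 1), "vs sqrt(N)*p =", round(math.sqrt(n) * p, 1))
```

Immediately after a single presentation the estimate should lie near $\sqrt{N}p$, up to sampling noise.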
Recall-gated systems consolidation
In our model (Figure 2A), we propose that the population of N synapses is split into two subpopulations which we call the ‘short-term memory’ (STM) and ‘long-term memory’ (LTM). Upon every presentation of a memory ${w}^{*}$, the STM recall ${r}_{\text{STM}}={w}_{\text{STM}}\cdot {w}_{\text{STM}}^{*}$ is computed. Learning in the LTM is modulated by a factor $g({r}_{\text{STM}})$. We refer to g as the ‘gating function’. For now we assume g to be a simple threshold function, equal to 0 for ${r}_{\text{STM}}<\theta$ and 1 for ${r}_{\text{STM}}\ge \theta $, for some suitable threshold θ. This means that consolidation occurs only when a memory is reinforced at a time when it can be recalled sufficiently strongly by the STM. Later we will consider different choices of the gating function g, which may be more appropriate depending on the statistics of memory recurrence in the environment.
We refer to this mechanism as recall-gated consolidation. Its function is to filter out unreliable memories, preventing them from affecting LTM synaptic weights. With an appropriately chosen gating function, reliable memories will pass through the gate at a higher rate than unreliable memories. Consequently, events that trigger plasticity in the LTM will consist of a higher proportion of reliable memories (Figure 2B), and hence the LTM will attain a higher SNR than the STM. The cost of this gating is to incur some false negatives, that is, reliable memory presentations that fail to update the LTM. However, some false negatives can be tolerated given that we expect reliable memories to recur multiple times, and information about these events is still present in the STM. As a proof of concept of the efficacy of recall-gated consolidation, we conducted a simulation in which memories correspond to random binary patterns and plasticity follows a binary switch rule (Figure 2C). This simulation implements an ‘ideal observer’ model in which we assume the system has direct access to the memory vector and can compute the recall factor exactly (realistic implementations are discussed below). Notably, recall-dependent consolidation results in reliable memory recall with a much higher SNR than an alternative model in which LTM weight updates proceed independently of STM recall.
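The mechanism can be illustrated with a small ideal-observer simulation in the spirit of the one described above. This is a hedged sketch rather than a reproduction of Figure 2C: the ±1 coding, the normalization of STM recall by $\sqrt{N}$ so that θ is expressed in units of the noise standard deviation, the function name `simulate`, and every parameter value are assumptions made for illustration.

```python
import math
import random

def simulate(n=400, p=0.2, lam=0.1, theta=2.5, steps=1500, seed=0):
    """Compare a recall-gated LTM with an ungated LTM (binary switch plasticity)."""
    rng = random.Random(seed)

    def rand_vec():
        return [rng.choice((-1, 1)) for _ in range(n)]

    def switch(w, target):
        # Binary switch rule: each synapse adopts the target state with probability p.
        for i in range(n):
            if rng.random() < p:
                w[i] = target[i]

    # Ideal observer: each memory has an STM component and an LTM component.
    reliable_stm, reliable_ltm = rand_vec(), rand_vec()
    w_stm, w_gated, w_ungated = rand_vec(), rand_vec(), rand_vec()

    for _ in range(steps):
        if rng.random() < lam:        # reliable memory is reinforced
            m_stm, m_ltm = reliable_stm, reliable_ltm
        else:                         # fresh unreliable memory
            m_stm, m_ltm = rand_vec(), rand_vec()
        # STM recall is computed before the STM itself is updated.
        r_stm = sum(a * b for a, b in zip(w_stm, m_stm)) / math.sqrt(n)
        if r_stm >= theta:            # threshold gating function g
            switch(w_gated, m_ltm)
        switch(w_ungated, m_ltm)      # control: LTM plasticity ignores STM recall
        switch(w_stm, m_stm)

    def normalized_recall(w):
        return sum(a * b for a, b in zip(w, reliable_ltm)) / math.sqrt(n)

    return normalized_recall(w_gated), normalized_recall(w_ungated)

gated, ungated = simulate()
print("gated LTM recall:", round(gated, 1), "ungated LTM recall:", round(ungated, 1))
```

Under these assumptions, the gated LTM receives mostly reliable updates, so its recall of the reliable memory should end up well above that of the ungated control.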
Neural circuit implementations of recall-gated consolidation
Our model requires a computation of recall strength, which we defined as the overlap between a memory and the current state of the synaptic population. From this definition, it is not clear how recall strength can be computed biologically. The mechanisms underlying computation of recall strength will depend on the task, network architecture, and learning rule giving rise to memory vectors. A simple example is the case of a population of input neurons connected to a single downstream output neuron, subject to a plasticity rule that potentiates synapses corresponding to active inputs. In this case, the recall strength quantity corresponds exactly to the total input received by the output neuron, which acts as a familiarity detector. Below, we give the corresponding recall factors for other learning and memory tasks: supervised learning, reinforcement learning, and unsupervised autoassociative memory, summarized in Figure 3A. The expressions for the recall factors are derived for each learning rule of interest in the Appendix. We emphasize that our use of the term ‘recall’ refers to the familiarity of synaptic update patterns (specifically, memory vectors), and does not necessarily correspond to familiarity of stimuli or other task variables. Thus, although we will continue to use the term ‘recall factor’, for a given task the recall factor quantity may have a different semantic interpretation, summarized in Figure 3A. Note also that we use the term ‘learning rule’ to refer to how the memory vector is constructed from task quantities, while ‘plasticity rule’ is reserved for the mechanism by which memory vectors update synaptic weights.
Supervised learning
Suppose a population of neurons with activity $\mathbf{x}$ representing stimuli is connected via feedforward weights to a readout population with activity $\hat{\mathbf{y}}=\mathbf{W}\mathbf{x}$. The goal of the system is to predict ground-truth outputs $\mathbf{y}$. A simple choice of the form of the memory ${W}^{*}$ (written with a capital letter since the synaptic weights are being interpreted as a matrix) that will train the system is ${W}^{*}=y{x}^{T}$, corresponding to an associative Hebbian learning rule (Hebb, 1949). The corresponding recall factor is $\mathbf{y}\cdot \hat{\mathbf{y}}$, corresponding to prediction accuracy (Figure 3B; see Appendix for derivation).
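A short worked version of this identity may help. It assumes that the overlap between the weight matrix and a matrix-valued memory is the entrywise inner product, which is the natural matrix analogue of $r=\mathbf{w}\cdot {w}^{*}$; the full derivation is in the Appendix.

```latex
\begin{aligned}
r &= \sum_{i,j} W_{ij}\, W^{*}_{ij}
   = \sum_{i,j} W_{ij}\, y_i\, x_j \\
  &= \sum_{i} y_i \Big( \sum_{j} W_{ij}\, x_j \Big)
   = \mathbf{y} \cdot (\mathbf{W}\mathbf{x})
   = \mathbf{y} \cdot \hat{\mathbf{y}}
\end{aligned}
```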
Reinforcement learning
Suppose a population of neurons with activity $\mathbf{x}$ representing an animal’s state is connected to a population with activity $\pi =Wx$ which controls the action selection probabilities. Specifically, the log probability of selecting action a is proportional to $\pi_{a}$ (see Methods). Following action selection, the animal receives a reward, the value of which depends on the state and chosen action. A simple approach to reinforcement learning is to use a memory vector arising from a ‘three-factor’ rule (Joel et al., 2002; Frémaux and Gerstner, 2015; Lindsey et al., 2024) ${W}^{*}=\text{reward}\cdot a{x}^{T}$, where $\mathbf{a}$ is a vector with 1 in the index corresponding to the selected action and 0 elsewhere. This learning rule reinforces actions that lead to reward. For this model, the corresponding recall factor is $\text{reward}\cdot {\pi}_{a}$, a multiplicative combination of reward and the animal’s confidence in its selected action, as measured by its a priori log likelihood of selecting that action (see Appendix for derivation). Intuitively, the recall factor will be high when a confidently chosen action leads to reward (Figure 3C).
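The relation between the three-factor memory and its recall factor can be checked numerically. In this sketch the softmax action selection, the toy dimensions, and the function names are assumptions introduced for illustration; only the memory ${W}^{*}=\text{reward}\cdot a{x}^{T}$ and the recall factor $\text{reward}\cdot {\pi}_{a}$ come from the text.

```python
import math
import random

def policy_outputs(W, x):
    """pi = W x, one entry per action."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def softmax(values):
    """Action probabilities whose logs equal the outputs up to an additive constant."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def three_factor_memory(reward, action, x, n_actions):
    """W* = reward * a x^T, with a the one-hot vector of the selected action."""
    return [[reward * xj if i == action else 0.0 for xj in x] for i in range(n_actions)]

def overlap(W, W_star):
    """Entrywise inner product between the weights and a memory."""
    return sum(w * m for row_w, row_m in zip(W, W_star) for w, m in zip(row_w, row_m))

rng = random.Random(1)
n_actions, n_inputs = 3, 5
W = [[rng.gauss(0, 1) for _ in range(n_inputs)] for _ in range(n_actions)]
x = [rng.gauss(0, 1) for _ in range(n_inputs)]

pi = policy_outputs(W, x)
action = rng.choices(range(n_actions), weights=softmax(pi))[0]
reward = 1.0                                   # hypothetical reward for this trial
recall_factor = overlap(W, three_factor_memory(reward, action, x, n_actions))
print(recall_factor, reward * pi[action])
```

The two printed numbers agree up to floating-point error, since only the row of $\mathbf{W}$ belonging to the chosen action contributes to the overlap.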
Unsupervised autoassociative memory
Suppose a population of neurons with activity $\mathbf{x}$ and recurrent weights $\mathbf{W}$ stores memories as attractors according to an autoassociative Hebbian rule, where memories correspond to ${W}^{*}=x{x}^{T}$, similar to a Hopfield network (Hopfield, 1982). In this case, the recall factor can be expressed as $\mathbf{x}\cdot (\mathbf{W}\mathbf{x})$, a comparison between stimulus input $\mathbf{x}$ and recurrent input $Wx$ (see Appendix for derivation). Intuitively, the recall factor measures the familiarity of the stimulus, as highly familiar stimuli will exhibit attractor behavior, making $\mathbf{x}$ and $Wx$ highly correlated. Such a quantity could in principle be computed directly, for instance if separate dendritic compartments represent the feedforward inputs and the recurrent inputs, though such a mechanism is speculative. This quantity can also be approximated using a separate novelty readout trained alongside the recurrent weights, which is the implementation we use in our simulation. In this approach, a set of familiarity readout weights $\mathbf{u}$ receives the neural population activity as input and outputs a scalar signal indicating the familiarity of that activity pattern. These familiarity readouts are updated according to their own corresponding memory vector ${u}^{*}=x$. The output of these weights, $u\cdot x$, estimates the familiarity of the activity pattern, and is used as the recall factor (see Appendix for more details on when this approximation is equal to the ideal recall factor).
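Both familiarity signals can be compared on a toy example. The sketch below is only illustrative: it stores a few random ±1 patterns with a plain additive Hebbian rule (an assumption; the paper's plasticity rules differ), and the pattern count, network size, and function names are hypothetical.

```python
import random

def hebbian_recurrent_weights(patterns):
    """Additive storage of memories W* = x x^T, scaled by 1/n."""
    n = len(patterns[0])
    W = [[0.0] * n for _ in range(n)]
    for x in patterns:
        for i in range(n):
            for j in range(n):
                W[i][j] += x[i] * x[j] / n
    return W

def familiarity_readout_weights(patterns):
    """Additive storage of readout memories u* = x."""
    n = len(patterns[0])
    return [sum(x[i] for x in patterns) for i in range(n)]

def ideal_recall(W, x):
    """x . (W x): agreement between stimulus input and recurrent input."""
    recurrent = [sum(wij * xj for wij, xj in zip(row, x)) for row in W]
    return sum(xi * ri for xi, ri in zip(x, recurrent))

def readout_recall(u, x):
    """u . x: scalar familiarity estimate from the separate readout."""
    return sum(ui * xi for ui, xi in zip(u, x))

rng = random.Random(2)
n = 200
stored = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(3)]
novel = [rng.choice((-1, 1)) for _ in range(n)]

W = hebbian_recurrent_weights(stored)
u = familiarity_readout_weights(stored)
print("ideal:  ", ideal_recall(W, stored[0]), ideal_recall(W, novel))
print("readout:", readout_recall(u, stored[0]), readout_recall(u, novel))
```

In this toy setting both quantities should be markedly larger for a stored pattern than for a novel one, which is the property the gate relies on.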
To verify that the advantages of recall-gated consolidation illustrated in Figure 2 apply in these tasks, we simulated the three architectures and learning rules described above (see Methods for simulation details). In each case, learning takes place online, with reliable task-relevant stimuli appearing a fraction λ of the time, interspersed among randomly sampled unreliable stimuli. In the case of supervised and reinforcement learning tasks, unreliable stimuli are paired with random labels and random rewards, respectively. Reliable stimuli are associated with consistent labels or action-reward contingencies. We find that recall-gated consolidation provides significant benefits in each case, illustrating that the theoretical benefits of increased SNR in memory storage translate to improved performance on meaningful tasks (Figure 3E, F and G).
An analytical theory of the recall of repeatedly reinforced memories
We now turn to analyzing the behavior of the recall-gated systems consolidation model more systematically, to understand the source of its computational benefits and characterize other predictions it makes. To do so, we developed an analytic theory of memory system performance, with and without recall-gated consolidation. To make the analysis tractable, our subsequent results assume an ideal observer model as in Figure 2C, where we assume the system has direct access to memory vectors and can compute the recall factor exactly. Importantly, our framework differs from prior work (Fusi et al., 2005; Benna and Fusi, 2016) in considering environments with intermittent repeated presentations of the same memory. We adopt several assumptions for analytical tractability. First, as in previous studies, we assume that inputs have been preprocessed so that the representations of different memories are random and uncorrelated with one another (Gluck and Myers, 1993; Benna and Fusi, 2016). We also assume, for now, that each memory consists of an equal number of positive and negative entries, although later we will relax this assumption. We are interested in tracking the SNR of recall for a given reliable memory. We emphasize that this quantity is an abstract measure of system performance reflecting the degree to which a specific set of synaptic changes (a memory trace) is retained in the system, and its interpretation varies according to the task in question (Figure 3).
The dynamics of memory storage depend strongly on the underlying synapse model and plasticity rule. Given a synaptic model, an important quantity to consider is its associated ‘forgetting curve’ m(t), defined as the average SNR of recall for a memory ${w}^{*}$ at t timesteps following its first presentation, assuming a new randomly sampled memory has been presented at each timestep since. For example, the binary switch model with transition probability p has an associated forgetting curve $m(t)\approx \sqrt{N}p{e}^{-pt}$ (Fusi et al., 2005). More sophisticated synapse models, such as the cascade model of Fusi et al., 2005 and the multivariable model of Benna and Fusi, 2016, achieve power-law forgetting curves (see Methods). In the limit of large system size N and under the assumption that memories are random, uncorrelated patterns, the forgetting curve is an exact description of the decay of recall strength.
Forgetting curves capture the behavior of a system in response to a single presentation of a memory, but we are concerned with the behavior of memory systems in response to multiple reinforcements of the same memory trace. Thus, another key quantity in our theory is the interarrival distribution p(I), which describes the distribution of intervals between repeated presentations of the same memory, and its expected value $\tau =E[I]$, the average interval length. Our simplest case of interest is the case in which a given memory recurs according to a Poisson process; that is, it is reinforced with probability λ at each timestep, independent of the history of recent reinforcements (as in the simulation in Figure 2C). This case corresponds to an exponential interarrival distribution $p(I)=\lambda {e}^{-\lambda I}$, with mean interarrival time $\tau =1/\lambda $.
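As a quick sanity check of this correspondence, one can draw intervals from the per-timestep reinforcement process and compare their mean with $1/\lambda$. The helper name and parameter values below are hypothetical. In discrete time the intervals are geometrically distributed, which the exponential form approximates when λ is small.

```python
import random

def sample_intervals(lam, n_intervals, seed=0):
    """Intervals between reinforcements when each timestep reinforces with probability lam."""
    rng = random.Random(seed)
    intervals, gap = [], 0
    while len(intervals) < n_intervals:
        gap += 1
        if rng.random() < lam:
            intervals.append(gap)
            gap = 0
    return intervals

lam = 0.1
intervals = sample_intervals(lam, 5000)
mean_interval = sum(intervals) / len(intervals)
print("empirical mean:", round(mean_interval, 2), "theory 1/lambda:", 1 / lam)
```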
We now quantify the recall strength for a memory that has been reinforced R times. For the synapse models we consider, this quantity can be approximated accurately (see Appendix) by summing the strengths of preceding forgetting curves, that is:
$\text{SNR}\approx \sum _{i=1}^{R}m({t}_{i})$
where $t_{i}$ is the time elapsed since the ith reinforcement of the memory. This quantity is a random variable whose value depends on the history of interarrival intervals of the memory, and the specific unreliable memories that have been stored in intervening timesteps. To more concisely characterize a system’s memory performance, we introduce a new summary metric, the learnable timescale of the system. For a given target SNR value β and allowable probability of error ϵ, the learnable timescale ${\tau}_{\beta}^{\epsilon}$ is defined as the maximum interarrival timescale τ for which the $\text{SNR}$ of recall will exceed β with probability $1-\epsilon$. We fix $\epsilon=0.1$ throughout this work; this choice has no qualitative effect on our results. Learnable timescale captures the system’s ability to reliably recall memories that are presented intermittently. We note that there exists a close relationship between learnable timescale and the memory capacity of the system (the number of memories it can store), with the two quantities becoming linearly related in environments with a high frequency of unreliable memory presentations (see Appendix and Figure 2—figure supplement 1).
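A rough Monte Carlo estimate of the learnable timescale can be sketched as follows. Several choices here are assumptions rather than the authors' procedure: recall is evaluated one further interarrival interval after the Rth reinforcement, the binary switch forgetting curve is used with illustrative values of N and p, intervals are exponential, and the search is over integer values of τ.

```python
import math
import random

def forgetting_curve(t, n=10_000, p=0.05):
    """Binary switch forgetting curve m(t) = sqrt(N) * p * exp(-p t)."""
    return math.sqrt(n) * p * math.exp(-p * t)

def recall_snr(elapsed_times):
    """Sum of forgetting curves, one term per past reinforcement."""
    return sum(forgetting_curve(t) for t in elapsed_times)

def prob_recall_exceeds(beta, tau, reps, trials=1000, seed=0):
    """Fraction of sampled reinforcement histories whose recall SNR is at least beta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        elapsed, total = [], 0.0
        for _ in range(reps):
            total += rng.expovariate(1.0 / tau)   # interval with mean tau
            elapsed.append(total)                 # time since an earlier reinforcement
        hits += recall_snr(elapsed) >= beta
    return hits / trials

def learnable_timescale(beta, reps, eps=0.1, max_tau=100):
    """Largest integer tau for which recall exceeds beta with probability >= 1 - eps."""
    best = 0
    for tau in range(1, max_tau + 1):
        # With a shared seed the sampled intervals scale with tau, so the
        # success probability cannot increase with tau; stop at the first failure.
        if prob_recall_exceeds(beta, tau, reps) < 1 - eps:
            break
        best = tau
    return best

lt_single = learnable_timescale(beta=1.0, reps=1)
lt_repeated = learnable_timescale(beta=1.0, reps=5)
print("R=1:", lt_single, "R=5:", lt_repeated)
```

For R=1 the result can be checked by hand under the same assumptions: recall exceeds β only if the interval is shorter than ${m}^{-1}(\beta )$, so the estimate should be close to ${m}^{-1}(\beta )/\mathrm{ln}(1/\epsilon )$.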
The quantifications of recall SNR and learnable timescale we present in figures are computed numerically, as deriving exact analytical expressions for learnable timescale is difficult due to the randomness of the interarrival distribution. However, to gain theoretical intuition, we find it useful to consider the following approximation, corresponding to an environment in which memories are reinforced at deterministic intervals of length τ:
This approximation is an upper bound on the true SNR in the limit of small ϵ, and empirically provides a close match to the true dependence of $\text{SNR}$ on R (Figure 2—figure supplement 2). Using this approximation allows us to provide closed-form analytical estimates of the behavior of SNR and learnable timescale as a function of system and environment parameters.
Theory of recall-gated consolidation
In the recall-gated consolidation model, the behavior of the STM is identical to that of a model without systems consolidation. The LTM, on the other hand, behaves differently, updating only in response to the subset of memory presentations that exceed a threshold level of recall in the STM. From the perspective of the LTM, this phenomenon has the effect of changing the distribution of interval lengths between repeated reinforcements of a reliable memory. For exponentially distributed interarrival times, the induced effective interarrival distribution in the LTM is also exponential with new time constant ${\tau}_{\text{LTM}}$ given by
where I is the (stochastic) length of intervals between presentations of the same reliable memory, θ is the consolidation threshold, and $\Phi $ is the cumulative distribution function of the Gaussian distribution with mean 0 and variance 1. This approximation is valid in the limit of large system sizes N, where responses to unreliable memories are nearly Gaussian. For general (non-exponential) interarrival distributions, the shape of the effective LTM interarrival distribution may change, but the above expression for ${\tau}_{\text{LTM}}$ remains valid.
We note that although the consolidation threshold θ can be chosen arbitrarily, setting it to too high a value has the effect of reducing the probability with which reliable memories are consolidated, by a factor of $P(I<{m}^{-1}(\theta ))$. For large values of θ this probability can become unacceptably small. For a given number of memory repetitions R, we restrict ourselves to values of θ for which the probability that no consolidation takes place after R repetitions is smaller than the allowable probability of error ϵ. Where R is not explicitly reported, we set R=2, which corresponds to analyzing the behavior of the model when a memory is reinforced once following its initial presentation.
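The quantities in this restriction are easy to evaluate for a concrete case. The sketch below assumes the binary switch forgetting curve, exponential intervals, a threshold expressed in SNR units, and independence of successive repetitions when computing the chance that no consolidation occurs; function names and parameter values are hypothetical.

```python
import math
from statistics import NormalDist

def inverse_forgetting_curve(theta, n, p):
    """m^{-1}(theta) for m(t) = sqrt(n) * p * exp(-p t); zero if theta is above m(0)."""
    m0 = math.sqrt(n) * p
    return math.log(m0 / theta) / p if theta < m0 else 0.0

def reliable_gate_prob(theta, tau, n, p):
    """P(I < m^{-1}(theta)) for exponentially distributed intervals with mean tau."""
    return 1.0 - math.exp(-inverse_forgetting_curve(theta, n, p) / tau)

def unreliable_gate_prob(theta):
    """1 - Phi(theta): chance that a previously unseen memory passes the gate."""
    return 1.0 - NormalDist().cdf(theta)

def threshold_allowed(theta, tau, n, p, reps=2, eps=0.1):
    """True if the chance of no consolidation over reps - 1 repeats is below eps."""
    miss = 1.0 - reliable_gate_prob(theta, tau, n, p)
    return miss ** (reps - 1) < eps

n, p, tau, theta = 10_000, 0.05, 10.0, 3.0     # illustrative values only
print("reliable pass probability:  ", round(reliable_gate_prob(theta, tau, n, p), 3))
print("unreliable pass probability:", round(unreliable_gate_prob(theta), 5))
print("allowed for R=2:", threshold_allowed(theta, tau, n, p, reps=2))
print("allowed for R=4:", threshold_allowed(theta, tau, n, p, reps=4))
```

With these illustrative numbers the threshold is too strict when only one repeat is available but becomes admissible with more repetitions, which is the trade-off described in the text.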
Recall-gated consolidation increases SNR and learnable timescale of memories
For fixed statistics of memory presentations, as the SNR of the STM increases (say, due to increasing N), stricter thresholds can be chosen for consolidation which filter out an increasing proportion of unreliable memory presentations, without reducing the consolidation rate of reliable memories (Figure 4A, Figure 4—figure supplement 1). Consequently, the SNR of the LTM can grow much larger than that of the STM, and the amplification of SNR increases with the SNR of the STM. Notably, the SNR of the LTM in the recall-gated consolidation model also exceeds that of a control model in which STM and LTM modules are both present but do not interact (Figure 4B, Figure 4—figure supplement 1), which performs comparably to the STM by itself due to the lack of selective consolidation.
We may also view the benefits of consolidation in terms of the learnable timescale of the system. Recall-gated consolidation enables longer learnable timescales, particularly at high target SNRs (Figure 4C, Figure 4—figure supplement 2). We note that our definition of SNR considers only noise arising from random memory sampling and presentation order. High SNR values may be essential for adequate task performance in the face of additional sources of noise, or when the system is asked to generalize by recalling partially overlapping memory traces (Benna and Fusi, 2016).
Recall-gated consolidation enables better scaling of memory retention with repeated reinforcement
As mentioned previously, higher consolidation thresholds reduce the rate at which reliable memories are consolidated. However, the consolidation rate of unreliable memories decreases even more quickly as a function of the threshold (Figure 4D and E). Hence, higher thresholds increase the fraction of consolidated memories which are reliable, at the expense of reducing the rate of consolidation into LTM. This tradeoff may be acceptable if reliable memories are reinforced a large number of times, as in this case they can still be consolidated despite infrequent LTM plasticity. In other words, as the number of anticipated repetitions R of a single reliable memory increases, higher thresholds can be used in the gating function, without preventing the eventual consolidation of that memory. Doing so allows more unreliable memory presentations to be filtered out and consequently increases the SNR in the LTM (Figure 4F).
Assuming, as we have so far, that reliable memories are reinforced at independently sampled times at a constant rate, we show (calculations in Appendix) that the dependence of learnable timescale on R is linear, regardless of the underlying synaptic model (Figure 4G, Figure 4C, Figure 4—figure supplement 3). Synaptic models with a small number of states, such as binary switch or cascade models, are unable to achieve this scaling without systems consolidation (Figure 4G). In particular, the learnable timescale is roughly invariant to R for the binary switch model, and scales approximately logarithmically with R for the cascade model (see Appendix for derivation). Synaptic models employing a large number of internal states (growing exponentially with the intended timescale of memory retention), like the multivariable model of Benna and Fusi, 2016, can also achieve linear scaling of learnable timescale on R. However, these models still suffer a large constant factor reduction in learnable timescale compared to models employing recall-gated consolidation (Figure 4G).
Consolidation dynamics depend on the statistics of memory recurrence
The benefit of recall-gated consolidation is even more notable when the reinforcement of reliable memories does not occur at independently sampled times, but rather in clusters. Such irregular interarrival times might naturally arise in real-world environments. For instance, exposure to one environmental context might induce a burst of high-frequency reinforcement of the same pattern of synaptic updates, followed by a long drought when the context changes. Intentional bouts of study or practice could also produce such effects. The systems consolidation model can capitalize on such bursts of reinforcement to consolidate memories while they can still be recalled.
To formalize this intuition, we extend our theoretical framework to allow for more general patterns of memory recurrence. In particular, we let p(I) indicate the probability distribution of interarrival intervals I between reliable memory presentations. So far, we have considered the case of reliable memories whose occurrence times follow Poisson statistics, corresponding to an exponential interval distribution. To consider more general occurrence statistics, we consider a family of interarrival distributions known as Weibull distributions. This class allows control over an additional parameter k which modulates the “burstiness” of reinforcement, and contains the exponential distribution as a special case (k=1). For k<1, reliable memory presentations occur with probability that decays with time since the last presentation. In this regime, the same memory is liable to recur in bursts separated by long gaps (details in Methods).
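The effect of the shape parameter k can be illustrated with a short sampling sketch. The code below is not taken from our simulations; the function names, the mean interval of 100 timesteps, and the choice to hold the mean interval fixed while varying k are assumptions made purely for illustration.

```python
import math
import numpy as np


def sample_intervals(k, mean_interval, size, rng):
    """Draw Weibull interarrival intervals with shape k and the requested mean."""
    # Generator.weibull draws with unit scale, whose mean is Gamma(1 + 1/k),
    # so rescale to keep the mean interval fixed while k varies (an assumption).
    scale = mean_interval / math.gamma(1.0 + 1.0 / k)
    return scale * rng.weibull(k, size=size)


def coefficient_of_variation(intervals):
    """Standard deviation divided by mean; close to 1 for exponential intervals."""
    return float(np.std(intervals) / np.mean(intervals))


rng = np.random.default_rng(0)
poisson_like = sample_intervals(1.0, 100.0, 100_000, rng)  # k = 1: exponential case
bursty = sample_intervals(0.5, 100.0, 100_000, rng)        # k < 1: bursts and long gaps

cv_poisson = coefficient_of_variation(poisson_like)
cv_bursty = coefficient_of_variation(bursty)

# Fraction of intervals much shorter than the mean (short gaps within a burst).
short_poisson = float(np.mean(poisson_like < 10.0))
short_bursty = float(np.mean(bursty < 10.0))
```

With the mean held fixed, the k<1 samples contain both more very short intervals and more very long ones than the exponential case, which is the sense in which reinforcement becomes bursty.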
Without systems consolidation, the most sophisticated synapse model we consider, the multivariable model of Benna and Fusi, 2016, achieves a scaling of learnable timescale that is linear with R regardless of the regularity factor k. In fact, we show (see Appendix) that the best possible learnable timescale that can be achieved by any synaptic consolidation mechanism scales approximately linearly in R, up to logarithmic factors. However, for the recall-gated consolidation model, the learnable timescale scales as ${R}^{1/k}$ when $k\le 1$ (Figure 4H, Figure 4C, Figure 4—figure supplement 3). In this sense, recall-gated consolidation outperforms any form of synaptic consolidation at learning from irregularly spaced memory reinforcement.
Alternative gating functions suit different environmental statistics and predict spaced training effects
Thus far, we have considered a threshold gating function, which is wellsuited to environments in which unreliable memories are each only encountered once. We may also imagine an environment in which unreliable memories tend to recur multiple times, but over a short timescale (Figure 5A, top). In such an environment, the strongest evidence for a memory’s reliability is if it overlaps to an intermediate degree with the synaptic state (Figure 5A, bottom). The appropriate gating function in this case is no longer a threshold, but rather a nonmonotonic function of STM memory overlap, meaning that memories are most likely to be consolidated if reinforced at intermediatelength intervals (Figure 5B). Such a mechanism is straightforward to implement using neurons tuned to particular ranges of recall strengths. This model behavior is consistent with spaced learning effects reported in flies (Beck et al., 2000), rodents (Glas et al., 2021), and humans (RoveeCollier et al., 1995; Verkoeijen et al., 2005), which all show a characteristic inverted Ushaped dependence of memory performance on spacing interval.
While some synapselevel models (such as the multivariable synapse model of Benna and Fusi, 2016) can also give rise to spaced training effects, these effects require that a synapse undergoes few additional potentiation or depression events between the spaced reinforcements (Figure 5C, Figure 5—figure supplement 1). This is because spacing effects in such models arise when synapselocal variables are saturated, and saturation effects are disrupted when other events are interspersed between repeated presentations of the same memory. Hence, the spacing effects arising from such models are unlikely to be robust over long timescales. Recallgated systems consolidation, on the other hand, can yield spaced training effects robustly in the presence of many intervening plasticity events.
Heterogeneous gating functions suit complex environments with multiple memories reinforced at different timescales
Thus far we have assumed a dichotomy between unreliable, one-off memories and reliable memories which recur according to particular statistics. In more realistic scenarios, there will exist many repeatedly reinforced memories, which may be reinforced at distinct timescales. We may be interested in ensuring good recall performance over a distribution of memories with varying recurrence statistics. For concreteness, we consider the specific case of an environment with a large number of distinct reliably reinforced memories, whose characteristic interarrival timescales are log-uniformly distributed. As before, unreliable memories are also presented with a constant probability per timestep.
The recallgated plasticity model already described, using a threshold function for consolidation, still provides the benefit of filtering unreliable memory traces from the LTM. However, further improved memory recall performance is achieved with a simple extension to the model. The LTM can be subdivided into a set of subpopulations, each with distinct gating functions that specialize for different memory timescales by selecting for different recall strengths (Figure 5D and E). That is, one subpopulation consolidates strongly recalled memories, another consolidates weakly recalled memories, and others lie on a spectrum between these extremes. The effect of this arrangement is to assign infrequently reinforced memory traces to subpopulations which experience less plasticity, allowing these traces to persist until their next reinforcement. This heterogeneity of timescales is consistent with observations in a variety of species of intermediate timescale memory traces (Rosenzweig et al., 1993; Cepeda et al., 2008; Davis, 2011).
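The difference between a threshold gate and a set of subpopulation-specific gates can be sketched as follows. This is an illustrative sketch only: the band edges, the function names, and the choice of non-overlapping recall-strength bands are hypothetical and are not the gating functions used in our simulations.

```python
import numpy as np


def threshold_gate(recall, theta):
    """Permit LTM plasticity whenever STM recall exceeds a threshold."""
    return bool(recall > theta)


def band_gate(recall, low, high):
    """Permit LTM plasticity only for recall strengths in [low, high)."""
    return bool((recall >= low) and (recall < high))


def assign_subpopulation(recall, band_edges):
    """Index of the LTM subpopulation whose band contains `recall`, or -1 if none.

    Subpopulation j is gated by the band [band_edges[j], band_edges[j + 1]).
    """
    edges = np.asarray(band_edges, dtype=float)
    idx = int(np.searchsorted(edges, recall, side="right")) - 1
    if idx < 0 or idx >= len(edges) - 1:
        return -1
    return idx


# Hypothetical, roughly log-spaced recall-strength bands (four subpopulations).
band_edges = [0.05, 0.1, 0.2, 0.4, 0.8]

weakly_recalled = assign_subpopulation(0.07, band_edges)    # lowest band
strongly_recalled = assign_subpopulation(0.3, band_edges)   # a higher band
```

In this arrangement a weakly recalled memory, which under the model was last reinforced long ago, is routed to a different subpopulation than a strongly recalled one, so each subpopulation sees plasticity events at its own characteristic rate.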
Studies of spaced training effects have found that the optimal spacing interval during training depends on the interval between training and evaluation (Cepeda et al., 2006; Cepeda et al., 2008). In particular, the timescale of memory retention is observed to increase smoothly with the spacing interval used during training. Our extended model naturally gives rise to this behavior (Figure 5F, Figure 5—figure supplement 2, Figure 5—figure supplement 3), due to the fact that the lifetime of a consolidated memory scales inversely with the frequency with which memories are consolidated into its corresponding LTM subpopulation.
Predicted features of memory representations and consolidation dynamics
The recall-gated consolidation model makes a number of key predictions. The most basic consequence of the model is that responsibility for recalling a memory will gradually shift from the STM to the LTM as consolidation progresses, rendering the recall performance of the system increasingly robust to the inactivation of the STM (Figure 6A). A more specific prediction of the model is that the rate of updates to the LTM increases with time, as STM recall grows stronger (Figure 6B). The rate of LTM updates also increases with the reliability of the environment (operationalized as the proportion of synaptic update events which correspond to reliable memories; Figure 6B).
The recall-gated consolidation model also makes predictions regarding neural representations in the STM and LTM. Until now we have assumed that memories consist of balanced potentiation and depression events distributed across the population. However, memories may involve only a sparse subset of synapses, for instance if synaptic plasticity arises from neural activity which is itself sparse. To formalize this notion, we consider memories that potentiate a fraction f of synapses, and a correspondingly modified binary switch plasticity rule such that potentiation activates synapses with probability p and depression inactivates synapses with probability $\frac{f}{1-f}p$. We show analytically (see Appendix) that in the limit of low f, the SNR-optimizing choice of f is proportional to the rate λ of reliable memory reinforcement (Figure 6C). Other factors, such as energetic constraints and noise robustness, may also affect the optimal coding level. In general, however, our analysis shows that environments with infrequent reinforcement of a given reliable memory incentivize sparser representations. As the effective value of λ is amplified in the LTM module, it follows that the LTM benefits from a denser representation than the STM. Interestingly, we also find that the optimal coding level in the STM decreases when optimizing for the overall SNR of the system; that is, the optimal STM representation is even more sparse in the context of supporting LTM consolidation than it would be in isolation. Taken together, these two effects result in much denser representations being optimal in the LTM than in the STM (Figure 6D). One consequence of denser representations is greater generalization in the face of input noise (Babadi and Sompolinsky, 2014), implying that an optimal STM/LTM system should employ more robust and generalizable representations in the LTM.
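The role of the factor $\frac{f}{1-f}$ in the depression probability can be checked with a small simulation. The sketch below assumes that every synapse not potentiated by a memory receives a depression event, and that synapses are either active or inactive; these assumptions, the function names, and the parameter values are ours for illustration only. Under these assumptions an inactive synapse is activated with probability fp per memory and an active synapse is inactivated with probability $(1-f)\cdot \frac{f}{1-f}p=fp$, so the two fluxes balance.

```python
import numpy as np


def sparse_switch_update(active, potentiated, p, f, rng):
    """Apply one sparse memory to binary (active/inactive) synapses."""
    q = f / (1.0 - f) * p  # inactivation probability for depression events
    u = rng.random(active.shape)
    updated = active.copy()
    updated[potentiated & (u < p)] = True     # potentiation activates with probability p
    updated[~potentiated & (u < q)] = False   # depression inactivates with probability q
    return updated


def run(n_synapses=20_000, n_memories=500, p=0.5, f=0.1, seed=0):
    """Store a stream of random sparse memories; return the final active fraction."""
    rng = np.random.default_rng(seed)
    active = rng.random(n_synapses) < 0.5
    for _ in range(n_memories):
        # Each memory potentiates a random fraction (about f) of the synapses.
        potentiated = rng.random(n_synapses) < f
        active = sparse_switch_update(active, potentiated, p, f, rng)
    return float(active.mean())


active_fraction = run()
```

Because activation and inactivation occur at equal rates per synapse, the fraction of active synapses stays near one half even though each memory potentiates only a small fraction f of them.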
Discussion
We have presented a theory of systems memory consolidation via recallgated longterm plasticity, which provides complementary benefits to synaptic consolidation mechanisms in terms of memory lifetime and retrieval accuracy. Its advantage arises from the ability to integrate over the information present in an entire neuronal population, rather than individual synapses, in order to decide which memory traces are consolidated. This capability is important in environments that induce a mixture of reliable and unreliable synaptic updates, in which a system must prioritize which updates to store longterm.
Experimental evidence for recallgated consolidation
The recallgated consolidation model is by design agnostic to the underlying neural circuit and hence potentially applicable to a wide variety of species and brain regions. Here, we summarize evidence consistent with recallgated consolidation in several model organisms. As our proposal is new, the experiments we describe were not designed to directly test our model predictions, and thus provide incomplete evidence for them. We hope that future work will more directly clarify the relevance of our model to these systems as well as others, the mechanisms by which it is implemented, and the shortcomings it may have in accounting for experimental results.
Associative learning in insects
In the Drosophila mushroom body, plasticity is driven by activity of dopamine neurons innervating a collection of anatomically defined compartments. These contain mushroom body output neurons (MBONs) that drive learned behavioral responses, such as approach or avoidance, to sensory stimuli (Aso et al., 2014). The compartments are grouped into anatomically defined lobes referred to by Greek letters: $\gamma ,\alpha ,\beta ,{\alpha}^{\mathrm{\prime}},{\beta}^{\mathrm{\prime}}$. In general the γ lobe compartments are implicated in STM while the $\alpha /\beta $ compartments are implicated in LTM (Aso et al., 2014). Mushroom body dopamine neurons receive a wide variety of inputs, including from MBONs themselves (Li et al., 2020). Such inputs provide a substrate by which long-term learning can be modulated by the outputs of short-term pathways. To implement recall-gated consolidation, the activity of dopamine neurons modulating plasticity in LTM compartments should be gated by learning in corresponding short-term pathways. A recent study found an instance of this motif (Awata et al., 2019). Short-term aversive learning decreases the activity of the $\gamma 1$ MBON (implicated in short-term aversive memory). This $\gamma 1$ MBON is inhibitory and synapses onto a dopamine neuron innervating the $\alpha 2$ compartment (which is associated with long-term aversive learning). Thus, short-term aversive learning in the $\gamma 1$ MBON disinhibits the $\alpha 2$ dopamine neuron, allowing for learning to proceed in the LTM $\alpha 2$ compartment. This circuit is a precise mechanistic implementation of our recall-gated consolidation model. More work is needed to determine if other examples of this motif can be found in Drosophila or other insects.
Motor learning
Several lines of work have indicated that the neural substrate of motor skills can shift with practice. In songbirds, learned changes to song performance are initially driven by a cortico-basal ganglia circuit called the anterior forebrain pathway (AFB) but eventually are consolidated into the song motor pathway (SMP) and become AFB-independent (Andalman and Fee, 2009; Warren et al., 2011). Using transient inactivations of LMAN, a region forming part of the AFB, a recent study quantified the degree of AFB-to-SMP consolidation over time and found that it strongly correlated with the bird’s motor performance at the time (Tachibana et al., 2022). Although this finding does not establish the mechanism for this phenomenon, the behavioral result is consistent with our model’s prediction that the rate of consolidation should increase as learning progresses in the short-term pathway.
A related motor consolidation process has been observed during motor learning in rats. Experiments have shown that motor cortex disengages from heavily practiced skills (Kawai et al., 2015; Hwang et al., 2019), transferring control at least in part to the basal ganglia (Dhawale et al., 2019; Dhawale et al., 2021), and that the degree of cortical disengagement tracks motor performance, as measured by the variability of learned trajectories (Hwang et al., 2021). This finding is broadly consistent with recall-gated consolidation, with short-term learning being mediated by motor cortex and long-term learning being mediated by basal ganglia. However, we note that unlike the song learning study referenced above, it neither confirms nor rejects our stronger prediction that the rate (rather than overall extent) of motor consolidation increases with learning.
Spatial learning and hippocampal replay
Hippocampal replay is thought to be crucial to the consolidation of episodic memories to cortex (Carr et al., 2011; Ólafsdóttir et al., 2018). Replay has many proposed computational functions, such as enabling continual learning (van de Ven et al., 2020), or optimizing generalization performance (Sun et al., 2021), which are outside the scope of our model. However, under the assumption that replay enables LTM storage in cortex, the recallgated consolidation model makes predictions about which memories should be replayed—namely, replay should disproportionately emphasize memories that are familiar to the hippocampus. That is, we would predict more frequent replay of events or associations that are frequently encountered than of those that were experienced only once, or unreliably.
Recent experimental work supports this hypothesis. A recent study found that, among CA3 axonal projections to CA1, those that respond to visual cues associated with a fixed spatial location are recruited more readily in sharp-wave ripple events than those that respond to randomly presented cues (Terada et al., 2022). This observation is consistent with our model’s prediction that repeatedly experienced patterns of activity are more likely to be consolidated, though other interpretations are possible. Earlier work found that sharp-wave ripple events occur more frequently during maze navigation sessions with regular trajectories, and increase in frequency over the course of a session (Jackson et al., 2006), similar to the behavior of our model in Figure 6B. Thus, existing evidence suggests that hippocampal replay is biased toward familiar patterns of activity, consistent with a form of recall-gated consolidation. Other experiments provide preliminary evidence for signatures of such a bias in cortical plasticity. For instance, an fMRI study of activity in hippocampus and posterior parietal cortex (PPC) during a human virtual navigation experiment found that the recruitment of PPC during the task, which was linked with memory performance, tended to strengthen with experience in a static environment, but did not strengthen when subjects were exposed to a constantly changing environment, consistent with consolidation of only reliable memories (Brodt et al., 2016).
Comparison with synaptic consolidation mechanisms
Recallgated consolidation improves memory performance regardless of the underlying synapse model (Figure 4), indicating that its benefits are complementary to those of synaptic consolidation. Our theory quantifies these benefits in terms of the scaling behavior of the model’s maximum learnable timescale with respect to other parameters. First, for any underlying synapse model, recallgated consolidation allows the learnable timescale to decay much more slowly as a function of the desired SNR of memory storage. Second, recallgated consolidation achieves (at worst) linear scaling of learnable timescale as a function of the number of memory reinforcements R. For models with a fixed, finite number of internal states per synapse, this scaling is at best logarithmic. Our results therefore illustrate that systemslevel consolidation mechanisms allow relatively simple synaptic machinery to support LTM storage. We note that more sophisticated synaptic models, which involve a large number of internal states that scales with the memory timescale of interest (Benna and Fusi, 2016), can also achieve linear scaling of learnable timescale with R (although recallgated consolidation still improves their performance by a large constant factor). However, for environmental statistics characterized by concentrated bursts of repeated events separated by long gaps, recallgated consolidation achieves superlinear powerlaw scaling, which we showed is not achievable by any synapselocal consolidation mechanism.
Our model provides an explanation for spaced training effects (Figure 5) based on optimal gating of LTM consolidation depending on the recurrence statistics of reliable stimuli. It is important to note that, depending on the specific form of internal dynamics present in individual synapses, synaptic consolidation models can also reproduce spacing effects. For example, the initial improvement of memory strength with increased spacing arises in the model of Benna and Fusi, 2016 due to saturation of fast synaptic variables, meaning that the timescale of these internal variables determines optimal spacing, and that intervening stimuli can block the effect by preventing saturation (Figure 5). In contrast, in our model this timescale is set by populationlevel forgetting curves, rendering spacing effects robust over long timescales and in the presence of intervening events. It is likely that mechanisms at both the synaptic and systems level contribute to spacing effects; our results suggest that effects observed at longer timescales are likely to arise from memory recall mechanisms at the systems level.
Other models of systems consolidation
Unlike previous theories, our study emphasizes the role of repeated memory reinforcement and selective consolidation. As such, our model has novel capabilities, but also has limitations compared to other models of consolidation. As the key insight of our model is distinct from most other theoretical work on the subject, we believe that future work will be able to fruitfully integrate the notion of recallgated plasticity into other models of consolidation and attain the benefits of both.
Much prior work focuses on consolidation via hippocampal replay. Prior work has proposed that replay (or similar mechanisms) can prolong memory lifetimes (Shaham et al., 2021; Remme et al., 2021), alleviate the problem of catastrophic forgetting of previously learned information (van de Ven et al., 2020; González et al., 2020; Shaham et al., 2021), and facilitate generalization of learned information (McClelland et al., 1995; Sun et al., 2021). One prior theoretical study (Roxin and Fusi, 2013), which uses replayed activity to consolidate synaptic changes from short to longterm modules, explored how systems consolidation extends forgetting curves. Unlike our work, this model (and related work, such as that of Brea et al., 2023) essentially models consolidation as ‘copying’ synaptic weights from one system to another. While such a mechanism has potentially useful consequences, such as enabling decoding of the age of a memory (Brea et al., 2023), it does not involve gating of memory consolidation, and consequently provides no additional benefit in consolidating repeatedly reinforced memories. Our model is thus distinct from, but also complementary to, these prior studies. In particular, recallgated consolidation can be implemented in realtime, without replay of old memories. However, as discussed above, selective replay of familiar memories is one possible implementation of recallgated consolidation. Selective replay is a feature of some of the work cited above (Shaham et al., 2021; Sun et al., 2021), which suggests it can provide advantages for retention and generalization (Shaham et al., 2021; Sun et al., 2021).
Other work has proposed models of consolidation, particularly in the context of motor learning, in which one module ‘tutors’ another to perform learned behaviors by providing it with target outputs (Murray and Escola, 2017; Teşileanu et al., 2017). Murray and Escola, 2020 propose a fast-learning pathway (which learns using reward or supervision) which tutors the slow-learning long-term module via a Hebbian learning rule. In machine learning, a similar concept has become popular (typically referred to as ‘knowledge distillation’), in which the outputs of a trained neural network are used to supervise the learning of a second neural network on the same task (Hinton et al., 2015; Gou et al., 2021). Empirically, this procedure is found to improve generalization performance and enable the use of smaller networks. Our model can be interpreted as a form of partial tutoring of the LTM by the STM, as learning in the LTM is partially dictated by outputs of the STM. In this sense, our work provides a theoretical justification for the use of tutoring signals between two neural populations.
Limitations and future work
In addition to motivating new experiments to test the predictions of a recallgated consolidation model, our work leaves open a number of theoretical questions that future modeling could address. Our theory assumes fixed and random representations of memory traces. Subject to this assumption, we showed that STM benefits from sparser representations than LTM. In realistic scenarios, synaptic updates are likely to be highly structured, and the optimal representations in each module could differ in more sophisticated ways. Moreover, adapting representations online—for instance, in order to decorrelate consolidated memory traces—may improve learning performance further. Addressing these questions requires extending our theory to handle memory statistics with nontrivial correlations. Another possibility we left unaddressed is that of more complex interactions between memory modules—for instance, reciprocal rather than unidirectional interactions—or the use of more than two interacting systems.
Finally, in this work we considered only a limited family of ways in which long-term consolidation may be modulated, namely according to threshold-like functions of recall in the short-term pathway. Considering richer relationships between recall and consolidation rate may enable improved memory performance and/or better fits to experimental data. Moreover, in real neural circuits, additional factors besides recall, such as reward or salience, are likely to influence consolidation as well. For instance, a sufficiently salient event should be stored in LTM even if encountered only once. Furthermore, while in our model familiarity drives consolidation, certain forms of novelty may also incentivize consolidation, raising the prospect of a nonmonotonic relationship between consolidation probability and familiarity. Unlike our notion of recall, which can be modeled in a task-agnostic fashion, the impact of such additional factors on learning likely depends strongly on the behavior in question. Our work provides a theoretical framework that will facilitate more detailed models of the rich dynamics of consolidation in specific neural systems of interest.
Methods
Theoretical framework
We consider a population of N synapses, indexed by $i\in \{1,2,\mathrm{...},N\}$, each with a synaptic weight ${w}_{i}\in \mathbb{R}$. The set of synaptic weights across the population can be denoted by the vector $\mathbf{w}\in {\mathbb{R}}^{N}$. The synapses may retain additional information besides strength as well; if each synapse carries d-dimensional state information in addition to its strength, the synaptic state can be written as ${\tilde{\mathbf{w}}}_{i}\in {\mathbb{R}}^{d}$, with the scalar synaptic strengths ${w}_{i}\in \mathbb{R}$ defined as a function of the high-dimensional state ${\tilde{\mathbf{w}}}_{i}$. We define memories as patterns of target synaptic weights, following prior work (Benna and Fusi, 2016; Fusi et al., 2005). More specifically, we model each memory as a vector ${\mathbf{w}}^{\ast}\in {\mathbb{R}}^{N}$. By defining memories in this fashion, our analysis can remain agnostic to the network architecture and plasticity rule that give rise to synaptic modifications. We will typically model memories as binary potentiation/depression events for simplicity, but in principle, memories can be continuous valued. Synapses are updated by memories according to a plasticity rule ${({\tilde{\mathbf{w}}}_{i})}_{\text{new}}=\ell (\tilde{\mathbf{w}},{m}_{i})$, which maps the synaptic state at the time of a memory event to the subsequent synaptic state.
For theoretical calculations, we assume as in prior work (Fusi et al., 2005; Benna and Fusi, 2016), that the components of each memory ${w}^{*}$ are independent and uncorrelated with those of other memories (although this assumption is violated in our task learning simulations). We also assume for simplicity that memories are meancentered so that $E[\mathbf{w}\cdot {\mathbf{w}}_{\text{rand}}^{\ast}]=0$ over randomly sampled memories ${w}_{\text{rand}}^{*}$.
We define the recall strength associated with a memory as the overlap $r=w\cdot {w}^{*}$. This definition reflects an ‘ideal observer’ perspective, as it requires direct and complete access to the state of the synaptic population. The ideal observer perspective provides an upper bound on the recall performance of a real system, and should be a fairly good approximation assuming that memory readout mechanisms are sophisticated enough. We are particularly interested in the normalized recall strength
$$\frac{r}{\sqrt{E\left[{\left(\mathbf{w}\cdot {\mathbf{w}}_{\text{rand}}^{\ast}\right)}^{2}\right]}},$$
where the expectation is taken over randomly sampled memories ${w}_{\text{rand}}^{*}$. We refer to this quantity as the signaltonoise ratio (SNR) of memory recall.
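As a concrete illustration of these definitions, the following sketch computes the recall overlap and an SNR estimate for binary synapses. It assumes that the normalizer is the root-mean-square overlap with randomly sampled ±1 memories, estimated by Monte Carlo sampling; the function names and the example in which half of the synapses store the memory are hypothetical.

```python
import numpy as np


def recall_strength(w, w_star):
    """Ideal-observer overlap r = w . w* between synaptic state and memory."""
    return float(np.dot(w, w_star))


def recall_snr(w, w_star, n_random, rng):
    """Overlap with w*, normalized by the RMS overlap with random memories."""
    # Assumption: random memories are mean-centered ±1 vectors.
    random_memories = rng.choice([-1.0, 1.0], size=(n_random, w.size))
    noise_overlaps = random_memories @ w
    return recall_strength(w, w_star) / float(np.sqrt(np.mean(noise_overlaps ** 2)))


rng = np.random.default_rng(0)
n = 1000
w_star = rng.choice([-1.0, 1.0], size=n)   # memory as a pattern of target weights
w = rng.choice([-1.0, 1.0], size=n)        # synaptic state before storage
stored = rng.random(n) < 0.5               # hypothetical: about half the synapses store w*
w[stored] = w_star[stored]

snr = recall_snr(w, w_star, n_random=2000, rng=rng)
```

Synapses that have not stored the memory contribute overlaps of random sign that largely cancel, so the SNR grows with the fraction of synapses aligned with the memory and with the square root of the population size.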
Synaptic models and plasticity rules
In this paper, we primarily consider three synaptic models and corresponding plasticity rules, taken from prior work.
The first and simplest is a ‘binary switch’ model in which synapses take on binary (±1) values and stochastically activate (resp. inactivate) in response to positive (resp. negative) values of a memory vector with probability p (Amit and Fusi, 1994). No auxiliary state variables are used in this model.
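A minimal sketch of the binary switch rule, as described above, is given below. The function name and the handling of zero entries in the memory vector (left unchanged) are assumptions for illustration.

```python
import numpy as np


def binary_switch_update(w, memory, p, rng):
    """Stochastically switch ±1 synapses toward the sign of the memory vector.

    Each synapse independently adopts the sign of its memory entry with
    probability p; otherwise it keeps its current value.
    """
    target = np.sign(memory)
    switch = (rng.random(w.shape) < p) & (target != 0)  # zero entries: no event (assumption)
    return np.where(switch, target, w)


rng = np.random.default_rng(0)
w = rng.choice([-1.0, 1.0], size=1000)
memory = rng.choice([-1.0, 1.0], size=1000)
w_after = binary_switch_update(w, memory, p=0.1, rng=rng)
```

Since a synapse that already matches the memory is unaffected by a switch, only about p/2 of the synapses change value when a random memory is stored in a random state.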
The second is the ‘cascade’ model of Fusi et al., 2005, in which synapses are modeled as a Markov chain with a finite number 2k of discrete states, with transition probabilities dependent on the kind of memory event (potentiation or depression). Half the states (states ${a}_{1},\mathrm{...},{a}_{k}$) are considered potentiated (strength +1) and half (states ${b}_{1},\mathrm{...},{b}_{k}$) are depressed (strength –1). Intuitively, states of the same potentiation level correspond to different propensities for plasticity in the synapse, enabling a form of synaptic consolidation. Formally, for i<k, the potentiated state ${a}_{i}$ (resp. depressed state ${b}_{i}$) transitions to state ${a}_{i+1}$ (resp. ${b}_{i+1}$) with probability $\frac{{\alpha}^{i}}{1-\alpha}$ following a potentiation (resp. depression) event. Also for i<k, the potentiated state ${a}_{i}$ (resp. depressed state ${b}_{i}$) transitions to state ${b}_{1}$ (resp. ${a}_{1}$) with probability ${\alpha}^{i-1}$ following a depression (resp. potentiation) event. For i=k this latter transition occurs with probability $\frac{{\alpha}^{i-1}}{1-\alpha}$; as described in Fusi et al., 2005, this choice is made for convenience to ensure equal occupancy of the different synaptic states. We assume $\alpha =0.5$ throughout.
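The cascade transitions can be written as a single-synapse update function. The sketch below follows the transition probabilities stated above, reading the garbled exponents and denominators as $\frac{{\alpha}^{i}}{1-\alpha}$, ${\alpha}^{i-1}$, and $\frac{{\alpha}^{k-1}}{1-\alpha}$; the encoding of a state as a (sign, level) pair and the function name are assumptions for illustration.

```python
import numpy as np


def cascade_update(sign, level, event, k, alpha, rng):
    """One cascade-model transition for a single synapse.

    sign:  +1 for potentiated states a_i, -1 for depressed states b_i
    level: index i in 1..k
    event: +1 for a potentiation event, -1 for a depression event
    Returns the new (sign, level) pair.
    """
    if event == sign:
        # Same-direction event: move one level deeper (defined only for i < k).
        if level < k and rng.random() < alpha ** level / (1.0 - alpha):
            level += 1
        return sign, level
    # Opposite-direction event: switch to the shallowest state of the other sign.
    prob = alpha ** (level - 1)
    if level == k:
        prob /= (1.0 - alpha)
    if rng.random() < prob:
        return -sign, 1
    return sign, level


rng = np.random.default_rng(0)
state = (+1, 1)
for event in (+1, +1, -1, +1):
    state = cascade_update(state[0], state[1], event, k=4, alpha=0.5, rng=rng)
```

With $\alpha =0.5$ every expression above is a valid probability, and the shallowest states (i=1) respond to every event, while deeper states become progressively harder both to deepen and to reverse.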
The third synaptic model is the model of Benna and Fusi, 2016, which we refer to as the ‘multivariable’ model. In this model, synapses are described by a chain of m interacting continuousvalued variables ${u}_{1},\mathrm{...},{u}_{m}$, the first of which corresponds to synaptic strength. Potentiation and depression events increment or decrement the value of the first synaptic variable, and a set of difference equations governs the evolution of the multidimensional state at each time step:
where n and α are parameters of the model (we assume $n=2,\alpha =0.5$ throughout). This model also provides the ability for synapses to store information at different timescales, due to the information retained in auxiliary variables.
Model implementation for example tasks
Supervised Hebbian learning
We simulated a single-layer feedforward network with a population of N=1,000 input neurons (activity denoted by $\mathbf{x}$) and a single output neuron (activity denoted by $\widehat{y}$), connected with a 1×N binary weight matrix $\mathbf{W}$, such that $\hat{\mathbf{y}}=\mathbf{W}\mathbf{x}$. In each simulation, P=20 reliable stimuli were randomly generated, which corresponded to binary ($\pm \frac{1}{N}$) random N-dimensional activity patterns in the input neurons. Note that due to the scaling of the inputs and use of binary synaptic weights, the activity $\widehat{y}$ is constrained to lie in the interval $[-1,1]$. Each reliable stimulus was associated with a randomly chosen (but consistent across the simulation) label y, 1 or –1. At each time step, one of the reliable stimuli (along with its label) was presented to the network with probability ${\lambda}_{i}=0.01$ for all $i=1,\mathrm{...},P$. Otherwise (with probability $1-{\sum}_{i}{\lambda}_{i}$), a randomly sampled unreliable stimulus was presented with a randomly chosen label. Memory vectors (written as matrices since the synaptic weights are interpreted as a matrix) were given by a Hebbian learning rule ${W}^{*}=y{x}^{T}$, corresponding to the product of the binary input neuron activity and the corresponding label. Learning followed the binary switch rule with p=0.1; that is, positive entries in the memory vector resulted in potentiation with probability p, and likewise for negative entries and depression. At each timestep, the product of the STM output and the ±1 label was computed, and if it exceeded the consolidation threshold $\theta =0.125$, plasticity was permitted in the LTM network.
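The procedure above can be sketched in a few lines. This is a simplified illustration rather than our simulation code: it assumes ±1 synaptic weights, random initial weights, and that the gating product is computed from the STM state before the STM itself is updated on that timestep. The function names and the default number of timesteps are likewise hypothetical.

```python
import numpy as np


def binary_switch_update(w, memory, p, rng):
    """Switch each ±1 synapse to the sign of its memory entry with probability p."""
    switch = rng.random(w.shape) < p
    return np.where(switch, np.sign(memory), w)


def simulate(n_steps=5000, n_inputs=1000, n_reliable=20, lam=0.01,
             p=0.1, theta=0.125, seed=0):
    rng = np.random.default_rng(seed)
    # Reliable stimuli: fixed sign patterns (inputs are sign / N) with fixed labels.
    signs = rng.choice([-1.0, 1.0], size=(n_reliable, n_inputs))
    labels = rng.choice([-1.0, 1.0], size=n_reliable)
    w_stm = rng.choice([-1.0, 1.0], size=n_inputs)  # random initial state (assumption)
    w_ltm = rng.choice([-1.0, 1.0], size=n_inputs)
    consolidated = {"reliable": 0, "unreliable": 0}
    for _ in range(n_steps):
        # Each reliable stimulus appears with probability lam per timestep.
        reliable = rng.random() < n_reliable * lam
        if reliable:
            i = rng.integers(n_reliable)
            s, y = signs[i], labels[i]
        else:
            s = rng.choice([-1.0, 1.0], size=n_inputs)
            y = rng.choice([-1.0, 1.0])
        x = s / n_inputs
        memory = y * x                      # Hebbian memory vector W* = y x^T
        recall = y * float(w_stm @ x)       # STM output times label, before STM update
        if recall > theta:                  # gate LTM plasticity on STM recall
            w_ltm = binary_switch_update(w_ltm, memory, p, rng)
            consolidated["reliable" if reliable else "unreliable"] += 1
        w_stm = binary_switch_update(w_stm, memory, p, rng)
    return w_stm, w_ltm, consolidated


w_stm, w_ltm, consolidated = simulate()
```

The dictionary `consolidated` counts how many gated LTM updates were triggered by reliable versus unreliable stimuli, which gives a direct readout of how selectively the threshold filters the input stream.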
Reinforcement learning
We used the same setup as in the supervised learning task, with the following modifications. Instead of a single readout, the network had A=3 output neurons corresponding to different possible actions. The activity of the output neuron a (denoted by ${\pi}_{a}$) represented the unnormalized log probability of taking action a: $p(a)=\frac{{e}^{\beta {\pi}_{a}}}{\sum _{j}{e}^{\beta {\pi}_{j}}}$, where β is a parameter controlling the stochasticity of the action selection (we set β=10 in our simulations). For the purposes of learning, the STM outputs were used to compute action probabilities, but both the STM and the LTM were evaluated throughout training. Each of the 5 reliable stimuli was associated with a correct action. Taking the correct action yielded a reward of 1, while taking any other action yielded a reward of 0. Unreliable stimuli were associated with randomly sampled correct actions. Memory vectors were derived from the following three-factor learning rule: ${\mathbf{W}}^{\ast}=\text{reward}\cdot (\mathbf{a}{\mathbf{x}}^{T})$, where $\mathbf{a}$ is a one-hot vector with a value of 1 at the index of the chosen action. At each timestep the product ${\pi}_{a}\cdot \text{reward}$ was computed, and if it exceeded the consolidation threshold, plasticity was permitted in the LTM network. All other parameters were the same as in the supervised learning simulation.
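The three ingredients of this setup (softmax action selection, the three-factor memory vector, and the reward-weighted gate) can be sketched as follows. The function names are hypothetical, and subtracting the maximum before exponentiating is a standard numerical-stability step that we add here; it does not change the resulting probabilities.

```python
import numpy as np


def action_probabilities(pi, beta=10.0):
    """Softmax over output activities: p(a) proportional to exp(beta * pi_a)."""
    z = beta * (pi - np.max(pi))  # shift for numerical stability; softmax is unchanged
    e = np.exp(z)
    return e / e.sum()


def three_factor_memory(x, action, reward, n_actions):
    """Memory matrix W* = reward * (a x^T), with a one-hot at the chosen action."""
    a = np.zeros(n_actions)
    a[action] = 1.0
    return reward * np.outer(a, x)


def ltm_gate(pi, action, reward, theta):
    """Permit LTM plasticity when pi_a * reward exceeds the consolidation threshold."""
    return bool(pi[action] * reward > theta)


rng = np.random.default_rng(0)
n_inputs, n_actions = 1000, 3
x = rng.choice([-1.0, 1.0], size=n_inputs) / n_inputs
pi_stm = np.array([0.2, 0.0, -0.1])                 # hypothetical STM output activities
probs = action_probabilities(pi_stm)
action = int(rng.choice(n_actions, p=probs))        # sample an action from the STM policy
reward = 1.0 if action == 0 else 0.0                # hypothetical: action 0 is correct
memory = three_factor_memory(x, action, reward, n_actions)
gate_open = ltm_gate(pi_stm, action, reward, theta=0.125)
```

Note that an unrewarded action yields an all-zero memory matrix and a gating product of zero, so under a positive threshold neither pathway is updated toward unrewarded actions.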
Unsupervised Hebbian learning
We simulated two recurrent neural networks with N=1,000 binary neurons each and with binary recurrent weight matrices ${W}_{\text{STM}}$ and ${W}_{\text{LTM}}$, respectively. Memories consisted of binary (entries equal to $\pm \frac{1}{N}$) random N-dimensional vectors that provided direct input $\mathbf{x}$ to the network neurons at each timestep. The network state $\mathbf{h}$ evolved for T=5 timesteps according to the following dynamics equation:
where $\varphi $ is a binary threshold nonlinearity with threshold set so that 50% of neurons were active at each time step (corresponding to a mechanism that normalizes activity across the network). The weights $\mathbf{W}$ of the network were binary and initialized as binary random variables with equal on/off probability. On each trial a stimulus was presented, which with probability $\lambda =0.25$ was a (randomly sampled but consistent) reliable stimulus, and otherwise was a newly randomly sampled unreliable stimulus. The network weights W_{ij} were subjected to potentiation events when x_{i} and x_{j} were both active at t=0, and otherwise subjected to depression events. Synaptic updates followed the binary switch rule with probability p=1.0.
Additionally, a set of N weights $\mathbf{u}$ connected the STM neurons to a single readout neuron that measured familiarity. These weights were also binary and updated according to their own memory vector ${u}^{*}$. They experienced candidate potentiation/depression events when their corresponding stimulus input neuron was active/inactive, respectively (i.e. the memory entry ${u}_{i}^{*}$ was equal to x_{i}). These weights followed the binary switch rule with probability p=1.0.
Plasticity in the LTM proceeded according to the same rule as in the STM but was gated by recall strength $r=u\cdot {x}_{\text{STM}}$, according to a threshold function with threshold equal to 0.25.
The performance of the network was determined by presenting noise-corrupted versions of the single reliable stimulus and measuring the correlation between the network state and the uncorrupted memory after T=5 time steps. The corrupted patterns were obtained by adding Gaussian noise of variance $\frac{1}{N}=0.001$ to the ground-truth pattern, and binarizing the result by choosing the fraction 0.5 of neurons with the highest values to be active.
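The corruption procedure can be illustrated as follows. This is a sketch only: the function name `corrupt_and_binarize` is hypothetical, and representing the binarized result as 0/1 activity indicators is an assumption made for illustration.

```python
import random

def corrupt_and_binarize(pattern, noise_variance, rng, active_fraction=0.5):
    """Add Gaussian noise, then mark the top `active_fraction` of neurons as active (1) and the rest as inactive (0)."""
    sigma = noise_variance ** 0.5
    noisy = [v + rng.gauss(0.0, sigma) for v in pattern]
    n_active = int(round(active_fraction * len(pattern)))
    ranked = sorted(range(len(pattern)), key=lambda i: noisy[i], reverse=True)
    active = set(ranked[:n_active])
    return [1 if i in active else 0 for i in range(len(pattern))]

rng = random.Random(0)
N = 1000
pattern = [rng.choice((-1, 1)) / N for _ in range(N)]        # ground-truth pattern, entries +/- 1/N
corrupted = corrupt_and_binarize(pattern, 1.0 / N, rng)      # noise variance 1/N as in the text
```

Ranking rather than thresholding guarantees that exactly half of the neurons are active regardless of the noise draw.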
Forgetting curves for different synaptic plasticity rules
Prior work (Fusi et al., 2005; Benna and Fusi, 2016) has considered environments in which a given memory is presented to the system only once. In this case, the performance of a single population of synapses with a given plasticity rule depends crucially on the memory trace function m(t). This is defined as the recall SNR at time t for a memory ${w}^{*}$ presented at t=0, assuming randomly sampled memories have been presented in the intervening timesteps. For the binary switch model, $m(t)\approx \sqrt{N}p{e}^{-pt}$. More sophisticated synaptic models, like the cascade and multivariable models, can achieve power-law scalings (Fusi et al., 2005; Benna and Fusi, 2016). The key feature of these models that enables power-law forgetting is that their synapses maintain additional information besides their weight, which encodes their propensity to change state. In this fashion, memories can be consolidated at the synaptic level into more stable, slowly decaying traces. The cascade model of Fusi et al. achieves
$$m(t)\approx \frac{\sqrt{N}}{t\log T}{e}^{-t/T}$$
for some characteristic timescale T which can be chosen as a model parameter. Hence, its performance is upper bounded by
$$m(t)\lesssim \frac{\sqrt{N}}{t}.$$
The model of Benna and Fusi can achieve
$$m(t)\approx \sqrt{\frac{N}{t\log T}}{e}^{-t/T},$$
which is upper bounded by
$$m(t)\lesssim \sqrt{\frac{N}{t}}.$$
Benna and Fusi also show that $\sqrt{N/t}$ scaling is an upper bound on the performance of any synapse model with finite dynamic range.
Implementation of SNR and learnable timescale computations
To compute recall strengths associated with single synaptic populations, we first sampled interarrival intervals I from the environmental statistics p(I). Given a number of repetitions R, we computed recall strength samples ${r}^{\prime}=\sum _{i=1}^{R}m({t}_{i})$, where m is the forgetting curve associated with the underlying synaptic plasticity rule, ${t}_{i}=\sum _{j\ge i}{I}_{j}$, and the I_{j} are independent samples from p(I). We scaled recall strengths by a factor of $\frac{1}{\sqrt{N}}$ to compute the recall SNR (an approximation that is exact in the large-N limit).
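As an illustration of this procedure, the following sketch accumulates the forgetting curve over a list of sampled intervals. The function names are hypothetical, and the binary switch trace is used only as an example of a forgetting curve m.

```python
import math

def binary_switch_trace(t, n_synapses, p):
    """Forgetting curve m(t) ~ sqrt(N) * p * exp(-p t) for the binary switch rule."""
    return math.sqrt(n_synapses) * p * math.exp(-p * t)

def recall_strength_sample(trace, intervals):
    """r' = sum_i m(t_i), where t_i = sum_{j >= i} I_j is the time elapsed since the i-th presentation."""
    strength, elapsed = 0.0, 0.0
    for interval in reversed(intervals):   # walk backwards from the most recent interval
        elapsed += interval
        strength += trace(elapsed)
    return strength

# Example: three presentations separated by intervals I_1=1, I_2=2, I_3=3,
# so the elapsed times are t_3=3, t_2=5, t_1=6.
example = recall_strength_sample(lambda t: math.exp(-t), [1.0, 2.0, 3.0])
```

Iterating over the intervals in reverse builds each elapsed time $t_i$ as a running sum, which avoids recomputing the tail sums for every presentation.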
To compute recall strengths associated with the recall-gated consolidation model, we repeated the above procedure using a new interarrival distribution $p({I}_{\text{LTM}})$ induced by the gating model. The induced distribution ${I}_{\text{LTM}}$ is obtained by drawing as samples the lengths of intervals between consecutive interarrival interval samples for which the corresponding recall SNR in the STM exceeds the gating threshold θ (corresponding to the interval between consolidated reliable memory presentations), and rescaling it by the fraction of unreliable memories that are consolidated. Strictly speaking, in the general case this distribution is nonstationary, as the probability of STM recall exceeding the threshold can change as synaptic updates accumulate across repetitions for sophisticated synapse models like that of Benna and Fusi, 2016. We adopt a conservative approximation that ignores such effects and thus slightly underestimates the rate of consolidation when such synaptic models are used (and consequently underestimates the SNR and learnable timescale of the recall-gated consolidation model). With this approximation, the random variable ${I}_{\text{LTM}}$ is defined as the following mixture distribution
where each ${I}_{i}\sim p(I)$, ${\zeta}_{t}\sim N(0,1)$, q indicates the probability of a reliable memory presentation inducing consolidation, and $1[\cdot ]$ denotes an indicator function, equal to 1 or 0 depending on whether the condition is met. The value of j corresponds to the number of reinforcements that go by between instances of consolidation. For sufficiently large τ this distribution can be approximated by
where $\Phi $ is the CDF of the standard normal distribution. For large N, the probability of consolidation is $q=P(I<{m}^{-1}(\theta ))$.
We note that for an exponential interarrival distribution with mean τ, the induced distribution of ${I}_{\text{LTM}}$ is also exponential, with mean ${\tau}_{\text{LTM}}=\frac{P\left(I<{m}^{-1}\left(\theta \right)\right)}{1-\Phi \left(\theta \right)}$. This is because the sums of j independent samples I_{i} are distributed according to a Gamma distribution with shape parameter j, and the mixture of such Gamma distributions with geometrically distributed mixture weights $p(j)=q{(1-q)}^{j-1}$ is itself an exponential distribution with mean $\tau /q$.
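The claim that a geometric mixture of sums of exponential intervals is itself exponential with mean $\tau /q$ can be checked numerically. The sketch below is an illustration only: `sample_gated_interval` is a hypothetical helper, the values τ=5 and q=0.25 are arbitrary, and the rescaling by the fraction of consolidated unreliable memories is deliberately omitted.

```python
import random
import statistics

def sample_gated_interval(tau, q, rng):
    """Sum exponential intervals (mean tau) until a presentation is consolidated (probability q each)."""
    total = rng.expovariate(1.0 / tau)
    while rng.random() >= q:               # not consolidated: wait for the next presentation
        total += rng.expovariate(1.0 / tau)
    return total

rng = random.Random(1)
tau, q = 5.0, 0.25
samples = [sample_gated_interval(tau, q, rng) for _ in range(20000)]
sample_mean = statistics.fmean(samples)    # should be close to tau / q
sample_std = statistics.stdev(samples)     # an exponential has std equal to its mean
```

Agreement between the sample mean and standard deviation is a quick signature of an exponential distribution, since other Gamma mixtures would generally not have equal mean and standard deviation.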
For a given number of expected memory repetitions R, the gating threshold θ was set such that at least one of the R repetitions would be consolidated with probability $1-\epsilon$, $\epsilon =0.1$. Where R is not reported, we assume it equal to 2, the minimum number of repetitions for the notion of consolidation to be meaningful in our model.
To compute learnable timescales, we repeated the above SNR computations over a range of mean interarrival times $\tau =\mathbb{E}\left[I\right]$, keeping the interarrival distribution family (Weibull distributions with a fixed value of k, see below) constant. We report the maximum value of τ for which the SNR exceeds the designated target threshold with probability $1-\epsilon$, $\epsilon =0.1$.
Throughout, for our interarrival distributions we use Weibull distributions with regularity parameter k. The corresponding cumulative distribution function is
$$P(I\le t)=1-\exp \left[-{\left(\frac{t\,\Gamma (1+1/k)}{\tau}\right)}^{k}\right],$$
where $\tau =E[I]$, and Γ is the Gamma function. In the case k=1, this reduces to an exponential distribution of interarrival intervals, which corresponds to memory reinforcements that occur according to a Poisson process with rate $\lambda =1/\tau $. In the limit $k\to \infty $, it corresponds to interarrival intervals of deterministic length τ. For k<1, the interarrival distribution is ‘bursty’, with periods of dense reinforcement separated by long gaps.
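For readers who wish to reproduce such interarrival samples, a minimal sketch using Python's standard library follows. The helper names are hypothetical; the only fact relied upon is that a Weibull distribution with shape k and scale s has mean $s\,\Gamma (1+1/k)$, so fixing the mean at τ determines the scale.

```python
import math
import random

def weibull_scale(tau, k):
    """Scale parameter giving a Weibull distribution with shape k a mean of tau."""
    return tau / math.gamma(1.0 + 1.0 / k)

def weibull_cdf(t, tau, k):
    """P(I <= t) for a Weibull interarrival distribution with mean tau and shape k."""
    return 1.0 - math.exp(-((t / weibull_scale(tau, k)) ** k))

def sample_interarrival(tau, k, rng):
    """Draw one interarrival interval; random.weibullvariate takes (scale, shape)."""
    return rng.weibullvariate(weibull_scale(tau, k), k)

rng = random.Random(2)
draws = [sample_interarrival(10.0, 2.0, rng) for _ in range(50000)]
empirical_mean = sum(draws) / len(draws)   # should be close to tau = 10
```

Setting k=1 in `weibull_cdf` recovers the exponential case, and smaller k produces the bursty regime described above.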
Spacing effect simulations
We simulated the multivariable synapse model of Benna and Fusi, 2016, in which each synapse is described by m continuous-valued dynamical variables ${u}_{1},\mathrm{...},{u}_{m}$ which evolve as follows:
For the first variable u_{1}, in place of ${u}_{i-1}$ we substitute components m_{j} of the memory traces. For the last variable u_{m}, in place of ${u}_{i+1}$ we substitute 0. The strength of each synapse corresponds to the value of its first dynamical variable. For our simulations, we chose m=10 dynamical variables, n=2, $\alpha =0.5$, and N=400 synapses. The value of α is also varied in Figure 5—figure supplement 3. A spacing interval Δ was selected and a randomly drawn reliable memory was presented at Δ-length intervals (the same pattern at each presentation). In the case without intervening memories, the dynamics of each synapse ran unimpeded between these presentations. In the case with intervening memories, new randomly drawn patterns were presented to the system at each timestep between the reliable memory presentations. Each pattern was drawn with values equal to ± 1/2, with equal probability.
Generalized model with multiple memory timescales
In our generalized environment model, the environment contains a variety of distinct reliable memories x_{i} which recur with Poisson statistics at a variety of rates λ_{i}. Timescales ${\tau}_{i}=1/{\lambda}_{i}$ are distributed such that $\log \tau$ is uniform on $[0,A]$, where A is a large constant. This corresponds to the value of $\log \lambda$ being uniformly distributed in $[-A,0]$, or equivalently to $p(\lambda )\propto 1/\lambda$, with λ bounded between ${e}^{-A}$ and 1. The environment also contains an additional fraction of unreliable memories as before, sampled randomly and presented with a fixed probability at each timestep. The natural generalization of learnable timescale to this setting is the maximum interarrival interval timescale for which the lifetime of a corresponding memory (the time after its last reinforcement at which its recall strength decays to an SNR below the target SNR) exceeds that timescale.
The distribution of interarrival intervals for memory i is
Integrating across the distribution of λ, we get the distribution of interarrival intervals for reliable memories observed by the system:
for large I.
The full distribution of interval strengths (including unreliable memories) is a mixture of ${p}_{\text{reliable}}$ and a delta function at $I=\infty $, with the latter’s weight corresponding to the probability with which an unreliable memory is sampled at a given timestep (in our simulations we chose 0.9).
From here we can compute a distribution of STM recall strengths r
We simulated a model in which an ensemble of LTM subpopulations are assigned gating functions ${g}_{i}(r)$ equal to 1 for $\log r\in [{A}_{i},{A}_{i+1}]$ and 0 elsewhere, with the A_{i} spaced evenly over $\left[0,\frac{1}{2}\log N\right]$. The expected lifetime of a memory reinforced with a given interval ${I}^{\prime}$ is given approximately by the STM lifetime divided by the fraction of memory presentations for which the recall strength lies in the same interval as $m({I}^{\prime})$. This quantity reflects the proportion of memory presentations that are consolidated into the same LTM subpopulation as the memory in question.
Appendix 1
Derivation of recall strength quantity for specific plasticity rules
In the following, we derive the recall factors for various learning rules, which correspond to different choices of ${W}^{*}$. The recall factor is defined as the element-wise dot product between W and ${W}^{*}$, which we denote as $W\cdot {W}^{*}$. We make use of the fact that this element-wise dot product between the matrices is equal to $\mathrm{Tr}[{\mathbf{W}}^{T}{\mathbf{W}}^{\ast}]$.
Supervised Hebbian learning
Let x be the input population activity, W be the prediction weights, $\hat{\mathbf{y}}=\mathbf{W}\mathbf{x}$ the output population activity (predicted probabilities), and y indicate ground-truth target values. A supervised Hebbian plasticity rule gives rise to a memory vector (interpreted here as a matrix since the synaptic weights form a matrix) ${W}^{*}=y{x}^{T}$, and thus the recall strength $r=W\cdot {W}^{*}$ can be written as
$$r=\mathrm{Tr}[{\mathbf{W}}^{T}\mathbf{y}{\mathbf{x}}^{T}]=\mathrm{Tr}[{\mathbf{x}}^{T}{\mathbf{W}}^{T}\mathbf{y}]={(\mathbf{W}\mathbf{x})}^{T}\mathbf{y}=\hat{\mathbf{y}}\cdot \mathbf{y},$$
corresponding to the accuracy of the prediction $\widehat{y}$.
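The identity between the weight-space overlap and the output-space product can be verified on a small example. The sketch below uses plain Python lists with arbitrary integer values chosen for illustration; the helper names are hypothetical.

```python
def elementwise_dot(a, b):
    """Sum of element-wise products of two equally shaped matrices, i.e. Tr[A^T B]."""
    return sum(a_ij * b_ij
               for row_a, row_b in zip(a, b)
               for a_ij, b_ij in zip(row_a, row_b))

def outer(y, x):
    """Outer product y x^T as a list of rows."""
    return [[y_k * x_j for x_j in x] for y_k in y]

def matvec(w, x):
    """Matrix-vector product W x."""
    return [sum(w_kj * x_j for w_kj, x_j in zip(row, x)) for row in w]

W = [[1, -1, 1, 1]]     # 1 x N binary weight matrix
x = [2, 1, -1, 3]       # input activity
y = [-1]                # target label

recall_from_weights = elementwise_dot(W, outer(y, x))                  # W . W*
recall_from_output = sum(yh * yk for yh, yk in zip(matvec(W, x), y))   # y_hat . y
```

Both quantities evaluate to the same number, which is why the recall strength can be read out from the network's output without direct access to the synaptic weights.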
Reinforcement learning
Let x be the input population activity representing state information, W be the output weights, and $\pi =Wx$ be the output population activity representing unnormalized log probabilities of taking different actions $a\in \{1,\mathrm{...},A\}$, so that $p(a)=\frac{{e}^{\beta {\pi}_{a}}}{\sum _{j}{e}^{\beta {\pi}_{j}}}$, where β is a parameter controlling the stochasticity of the action selection. Let a be a one-hot vector indicating the sampled action, and let $\text{reward}=\pm 1$ be the scalar reward that results. For a reinforcement learning rule giving rise to a memory vector ${\mathbf{W}}^{\ast}=\text{reward}\cdot (\mathbf{a}{\mathbf{x}}^{T})$, the recall strength $r=W\cdot {W}^{*}$ can be written as
$$r=\mathrm{Tr}\left[{\mathbf{W}}^{T}\,\text{reward}\cdot (\mathbf{a}{\mathbf{x}}^{T})\right].$$
Following the same steps as the derivation for the supervised learning case, with a in place of y, gives
$$r=\text{reward}\cdot (\pi \cdot \mathbf{a})=\text{reward}\cdot {\pi}_{a},$$
corresponding to the unnormalized log probability with which the action a was selected (which can be interpreted as the confidence in the selection), modulated by reward.
Computing this factor requires preserving the network’s action probability distribution, extracting from it the probability of the sampled action, and multiplicatively scaling the result by the obtained reward.
Autoassociative memory
Let x be the population activity and W be the recurrent weight matrix. For an autoassociative memory storage rule with memory vector ${W}^{*}=x{x}^{T}$, assuming that the weight matrix W can be approximated as a sum ${\sum}_{i}{x}_{i}{x}_{i}^{T}$ over prior plasticity-driven updates, then the recall strength $r=W\cdot {W}^{*}$ can be written as
$$r=\mathrm{Tr}\left[{\left({\sum}_{i}{x}_{i}{x}_{i}^{T}\right)}^{T}x{x}^{T}\right]={\sum}_{i}{({x}_{i}\cdot x)}^{2},$$
corresponding to the familiarity of the current pattern x relative to all previously seen patterns ${x}_{i}$.
Familiarity can also be computed with a separate familiarity readout trained alongside the recurrent weights. If the familiarity readout employs a Hebbian rule, the resulting estimate of familiarity will be equal to
For uncorrelated patterns in a network below capacity, this strategy corresponds exactly to the true recall factor in the limit of large network size.
Relationship between learnable timescale and capacity
We note that theoretical work on memory systems often focuses on memory capacity, the number of memories that can be reliably stored in the system (Gardner, 1988; Fusi et al., 2005; Benna and Fusi, 2016). Our learnable timescale metric is distinct from capacity. However, the two are closely linked in a particular regime. Suppose P distinct reliable memories are reinforced independently at rates ${\lambda}_{1},\mathrm{...},{\lambda}_{P}$. In the regime in which the overall rate of reliable memory presentation ${\lambda}_{\text{tot}}={\sum}_{i}{\lambda}_{i}$ is small, the SNR of memory recall for memory i will be the same as in the case of a single reliable memory with $\lambda ={\lambda}_{i}$ (Figure 2—figure supplement 1). Hence, for a fixed ${\lambda}_{\text{tot}}$, and for simplicity assuming that distinct reliable memories are presented at equal rates ${\lambda}_{i}=\frac{1}{P}{\lambda}_{\text{tot}}$ for all i, the learnable timescale ${\tau}^{*}$ of the system dictates its capacity, equal to ${\tau}^{*}{\lambda}_{\text{tot}}$. We note that this correspondence does not hold in the case where most observed memories are reliable. In this work, however, we are interested primarily in the regime of scarce reliability, where recall-gated consolidation provides the most benefit. In this regime, we regard the learnable timescale as the most natural measure of system performance, as the primary obstacle to memory storage is the presence of long gaps between reinforcements of reliable memories.
The effect of repeated reinforcement on memory dynamics without recallgated consolidation
When memories can recur multiple times, the memory trace function m(t) is no longer an adequate description of system behavior, as the synaptic updates from multiple presentations can combine. For the synaptic plasticity rules we consider here (the binary switch, the cascade model of Fusi et al., and the multivariable model of Benna and Fusi), this combination is approximately additive (Benna and Fusi, 2016). This is because for each of these plasticity rules, the change in distribution of synaptic states following the presentation of a memory is approximately independent of the existing synaptic state. The only dependencies are saturation effects (synapses which have reached the edge of their dynamic range), which can only lead to subadditive behavior. Saturation effects can be avoided by making the dynamic range of synapses sufficiently large. Thus for these plasticity rules of interest we may consider additive memory trace combination to represent a close approximation to (and a tight upper bound on) the combined memory trace strength.
For a reliable memory presented at times ${t}_{1},\mathrm{...},{t}_{R}$, and a population of synapses using additive plasticity rules, the current SNR at time t can therefore be approximated as
$$\mathrm{SNR}(t)\approx \sum _{i=1}^{R}m(t-{t}_{i}).$$
If memory presentations occur separated by regular intervals of length $\tau =\frac{1}{\lambda}$, we have
$$\mathrm{SNR}(t)\approx m(t-{t}_{R})+\sum _{j=1}^{R-1}m(t-{t}_{R}+j\tau ).$$
For the binary switch model, m(t) decays exponentially with time constant $1/p$, and so the second term is negligible compared to the first. Hence the learnable timescale of the system is the same as the memory lifetime, approximately $1/p$. For target SNR threshold δ, we require $p\ge \delta /\sqrt{N}$, so the best possible learnable timescale, optimizing over p, is $O(\sqrt{N}/\delta )$.
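The $O(\sqrt{N}/\delta )$ scaling can be checked numerically from the single-presentation forgetting curve $m(t)\approx \sqrt{N}p{e}^{-pt}$. The sketch below is an illustration under that assumption: the function names are hypothetical, and a simple grid search over p stands in for the optimization. Setting the derivative of $\log (\sqrt{N}p/\delta )/p$ to zero gives an optimum at $p=e\delta /\sqrt{N}$ with lifetime $\sqrt{N}/(e\delta )$.

```python
import math

def binary_switch_lifetime(p, n_synapses, delta):
    """Time at which m(t) = sqrt(N) p exp(-p t) falls to the target SNR delta (0 if never above it)."""
    initial_snr = math.sqrt(n_synapses) * p
    if initial_snr <= delta:
        return 0.0
    return math.log(initial_snr / delta) / p

def best_lifetime(n_synapses, delta, grid_size=20000):
    """Grid search over p in (0, 1]; returns (longest lifetime, p achieving it)."""
    candidates = (i / grid_size for i in range(1, grid_size + 1))
    return max((binary_switch_lifetime(p, n_synapses, delta), p) for p in candidates)

lifetime, p_opt = best_lifetime(n_synapses=10**6, delta=1.0)
```

For N = 10^6 and δ = 1, the search returns a lifetime close to 1000/e, in agreement with the analytical optimum.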
For the cascade model, m(t) decays as $\frac{\sqrt{N}}{t\log T}{e}^{-t/T}$. For $t\gg T$ the exponential factor dominates, resulting in the same behavior as the binary switch model. For $t\ll T$, the exponential factor is approximately 1, so the following expression for $\mathrm{SNR}(t)$ is a close approximation and tight upper bound:
Again, for computing learnable timescale we are interested in when $t-{t}_{R}\approx \tau $, in which case:
For the multivariable model, m(t) decays as $\sqrt{\frac{N}{t\log T}}{e}^{-t/T}$. Again we are primarily interested in the $t\ll T$ regime, in which the expression for $\mathrm{SNR}(t)$ is approximately
This SNR is maximized for $T\approx R\cdot \tau $. And for computing learnable timescale we are interested in when $t-{t}_{R}\approx \tau $. So we have
To compute the learnable timescale at target SNR δ for $1\ll R\ll \tau $, we have $4NR\ge \tau \log (R\tau ){\delta}^{2}$, the solution of which is within logarithmic factors of $\mathcal{O}(RN)$.
The above calculations assume deterministic interarrival intervals of length τ. In general, we are interested in an interarrival distribution p(I) with mean τ. However, we show numerically that for Weibull distributions with reasonable values of k (not too close to zero), the true learnable timescales are bounded very closely by our results above (Figure 2—figure supplement 2). Moreover, for the purpose of computing learnable timescale ${\tau}_{\epsilon}^{\beta}$ with error probability tolerance ϵ, for sufficiently small ϵ the deterministic approximation represents an upper bound on the SNR for distributions with mean τ. This is because to ensure high SNR with very high probability, deterministic intervals are a best-case scenario, as stochastic interval lengths will with some nonzero probability deviate far above the mean.
Bounds on an ideal synaptic consolidation model
In this section we show that no realistic synapse-local mechanism can achieve significantly better learnable timescale than $\mathcal{O}(RN)$, and hence that the ability of recall-gated systems consolidation to achieve learnable timescale scaling superlinearly with R in some environments (see previous section) represents a qualitative advantage.
We consider a very general class of synaptic plasticity rules. In particular, we suppose a synapse can maintain a history of sequences of potentiation and depression events for arbitrarily long time windows and track the number of windows for which Δ, the difference in number of potentiation and depression events, exceeds a threshold δ. Let ${p}_{\text{reliable}}(\Delta ;\tau )$ refer to the probability distribution of values of Δ after τ timesteps, given that a synapse is potentiated by the reliable memory of interest, and let ${p}_{\text{unreliable}}$ refer to the analogous distribution for synapses subject only to potentiation by unreliable memories. After $\frac{T}{\tau}$ intervals of length τ, for a synapse potentiated by the reliable memory, we have that
The memory can be considered retrievable with SNR of order $\mathcal{O}\left(1\right)$ once the expression above exceeds $\mathcal{O}\left(\frac{1}{N}\right)$ (since evidence can be accumulated across the N synapses) for any choice of τ (since we are interested in the best achievable performance).
Now, for large enough τ, ${p}_{\text{unreliable}}(\Delta ;\tau )$ is approximately Gaussian with mean 0 and standard deviation $\sqrt{\tau}$. Conditioned on the reliable memory being presented r times in τ timesteps, ${p}_{\text{reliable}}(\Delta )$ is approximately Gaussian with mean r and standard deviation $\sqrt{\tau}$. The KL divergence between these two distributions is $\frac{{r}^{2}}{2\tau}$. Now consider the distribution ${p}_{\tau}(r)$ of number of repetitions r that occur in a time window τ. We want to find a value ${r}^{\max}$ such that ${P}_{\tau}(r\ge {r}^{\max})\le \frac{\tau}{T}$. From there we can assume that $r<{r}^{\max}$ in any of the τ-length intervals, since after T timesteps we cannot reliably count on r exceeding ${r}^{\max}$ in any of the intervals.
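The stated KL divergence follows from the standard closed form for two univariate Gaussians, which reduces to $\frac{{r}^{2}}{2\tau}$ when both have standard deviation $\sqrt{\tau}$. A short check (the function name `gaussian_kl` and the example values τ=400, r=6 are chosen here for illustration):

```python
import math

def gaussian_kl(mean_p, std_p, mean_q, std_q):
    """KL(P || Q) for univariate Gaussians P = N(mean_p, std_p^2) and Q = N(mean_q, std_q^2)."""
    return (math.log(std_q / std_p)
            + (std_p ** 2 + (mean_p - mean_q) ** 2) / (2.0 * std_q ** 2)
            - 0.5)

tau, r = 400.0, 6.0
kl = gaussian_kl(r, math.sqrt(tau), 0.0, math.sqrt(tau))   # reliable vs. unreliable
```

Because the two distributions share a variance, the divergence is symmetric in this case and depends only on the squared mean shift relative to τ.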
For $\tau \le M[I]$, the median of the distribution p(I), note that ${r}^{\max}\le \log (T/\tau )\le \log T$. For $\tau \le c\cdot M[I]$, if $R>c\log T$ then at least one interval of length less than $M[I]$ contains at least $\log T$ repetitions. Hence ${r}^{\max}\le c\log T$. So conservatively we can take ${r}^{\max}=\frac{\tau}{M\left[I\right]}\log T$.
Thus, our log probability expression above is bounded as follows
Hence the KL divergence criterion becomes
or equivalently,
The number of repetitions is $R\approx T/\mathbb{E}\left[I\right]$, giving
assuming $M[I]\sim O(\mathbb{E}\left[I\right])$. For the interarrival distributions we consider (of the Weibull family), $M\left[I\right]<\mathbb{E}\left[I\right]$ so this is a conservative assumption. Hence the learnable timescale of any population using only synapse-local plasticity rules is no greater than the solution for $\mathbb{E}\left[I\right]$ of the equation above. We have
the solution of which is within logarithmic factors of $\mathcal{O}\left(RN\right)$.
Scaling behavior of the STM/LTM model
In the recall-gated consolidation model, the overlap $r={w}_{\text{STM}}\cdot {w}_{\text{STM}}^{*}$ indicates the recall strength of memory x given the current synaptic state of the STM. LTM plasticity is modulated by a factor g(r) – we refer to g as the “gating function” and r as the STM recall strength. We assume for now that the gating function g(r) is chosen to be a threshold function, $g(r)=H(r-\theta )$, where r is the SNR of the memory overlap, H is the Heaviside step function, and θ is referred to as the “consolidation threshold.” With this choice, unreliable memories will be consolidated at a rate of $1-\Phi (\theta )$, where $\Phi $ is the CDF of the normal distribution, in the limit of large system size N.
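To give a sense of scale, the following sketch evaluates the consolidation rate of unreliable memories for a given threshold. The function names are hypothetical. The conversion of a raw overlap threshold to SNR units by multiplying by $\sqrt{N}$ is an assumption made here for illustration, based on the overlap for a random pattern having standard deviation $1/\sqrt{N}$ when weights are ±1 and inputs are ±1/N.

```python
import math
from statistics import NormalDist

def unreliable_consolidation_rate(theta_snr):
    """Probability 1 - Phi(theta) that a standard-normal recall SNR exceeds the threshold."""
    return 1.0 - NormalDist().cdf(theta_snr)

def raw_threshold_to_snr(theta_raw, n_synapses):
    """Assumed conversion: divide the raw overlap threshold by the noise std 1/sqrt(N)."""
    return theta_raw * math.sqrt(n_synapses)

rate_at_two_sigma = unreliable_consolidation_rate(2.0)
rate_in_simulation = unreliable_consolidation_rate(raw_threshold_to_snr(0.125, 1000))
```

Under this assumed conversion, the threshold of 0.125 with N=1,000 sits near four standard deviations, so only a very small fraction of unreliable memories would pass the gate.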
Suppose a memory ${w}^{*}$ is presented twice with interval I. Then the SNR at the second presentation will be lower-bounded by m(I), in expectation. It follows that the rate at which reliable memories will be consolidated for the gating function above is lower-bounded by $P(I<{m}^{-1}(\theta ))$. After R repetitions of the reliable memory, the probability that consolidation has occurred will be at least
We are interested in the maximum θ for which this expression exceeds $1-\epsilon$; this is the most stringent consolidation threshold we can set while still ensuring consolidation of the reliable memory with high probability. This value of θ is given by
If R is large then the solution will be such that $P(I<{m}^{-1}(\theta ))$ is small, enabling the approximation:
For tractability we consider, as our family of interarrival distributions, Weibull distributions with regularity parameter k. The cumulative distribution function is
$$P(I\le t)=1-\exp \left[-{\left(\frac{t\,\Gamma (1+1/k)}{\tau}\right)}^{k}\right].$$
For $t\ll \tau$, this is approximated as
$$P(I\le t)\approx {\left(\frac{t\,\Gamma (1+1/k)}{\tau}\right)}^{k}.$$
Importantly, $P(I\le t)$ scales as t^{k} for small t. Thus, increasing the number of repetitions R has the effect of scaling the τ that satisfies Equation 40 by ${R}^{1/k}$. That is, for a fixed θ, and hence a fixed degree of amplification ${\tau}_{\text{LTM}}/\tau $ of the effective rate of reliable memories in the LTM, the maximum τ achieving that SNR with probability $1-\epsilon$ (i.e. the learnable timescale ${\tau}_{\beta}^{\epsilon}$) scales as $\mathcal{O}\left({R}^{1/k}\right)$.
For a gating function threshold θ, the corresponding SNR in the LTM will be the SNR induced by an interarrival distribution ${I}_{\text{LTM}}$ with mean value
Since $1-\Phi (\theta )$ decays much more rapidly than any power of ${m}^{-1}(\theta )$, it follows that $\mathbb{E}\left[{I}_{\text{LTM}}\right]$ can be made O(1), and thus the SNR of the LTM can become $O(\sqrt{N})$, for relatively small values of θ (and hence a small number of required repetitions). In other words, for a fixed number of expected memory repetitions, the learnable timescale of the LTM decreases only slightly as the target SNR is raised from O(1) to $O(\sqrt{N})$.
Note that if P different reliable memories are present, then $\mathbb{E}\left[{I}_{\text{LTM}}\right]$ for any given reliable memory will be lower-bounded by $\mathcal{O}\left(P\right)$ instead of $\mathcal{O}\left(1\right)$. The induced SNR for any given reliable memory in the LTM will in this case be of order m(P), rather than $O(\sqrt{N})$.
Optimal sparsity calculations
We now consider the case of sparse memories – those which potentiate a fraction f of synapses. Consider the behavior of a single population of binary synapses employing the binary switch plasticity rule. We modify the plasticity rule slightly so that potentiation flips the state of a synapse with probability p and depression with probability $\frac{f}{1-f}p$, to ensure that the fractions of potentiated and depressed synapses remain balanced.
We consider an environment with a single reliable memory that is presented with probability λ at each time step (otherwise, a randomly sampled unreliable memory is presented). We can compute the behavior analytically by tracking how the distributions of u (the output neuron response to true stimuli) and v (the output neuron response to noise) evolve over time. We assume that the coding level f is sufficiently small that terms of order $O({f}^{2})$ may be ignored.
Due to the balanced plasticity rule, $\frac{1}{2}$ of synapses are strong at any given time, so the mean response ${v}^{*}$ to a randomly sampled noise pattern is $\frac{1}{2}$. The variance of v is also constant and equal to $\frac{1}{4Nf}$.
The evolution of u is a stochastic process that, in the limit of large Nf (i.e. a large number of active neurons for each stimulus), can be described as an Ornstein-Uhlenbeck (OU) process:
where $\epsilon \sim N(0,{\sigma}^{2})$.
In the limit of small f we have:
The quantity ${u}^{*}$ determines the asymptotic mean of u and the quantity θ determines the rate at which u converges to this mean. Immediately we see that ${u}^{*}$ scales with the frequency λ with which the true stimulus is presented, and that the rate of convergence (speed of learning) is proportional to p.
By well-known properties of OU processes, the asymptotic variance of u is equal to $\frac{{\sigma}^{2}}{2\theta}$. In the small-p limit, this quantity comes out to
Note that in the low-p limit (slow learning rate) this is the same as the variance of v. Thus in the limit of slow learning, we have that
And thus
From this expression we can see that for a given f, the asymptotic SNR always increases with λ and N. For a given λ, we would like to maximize this expression with respect to f.
This expression equals zero when
so the asymptotic SNR is maximized for $f=\frac{1}{2}\lambda $. That is, the optimal coding level is proportional to the frequency with which reliable (as opposed to unreliable) stimuli are observed in the environment.
Data availability
The current manuscript is a theoretical study, so no data have been generated for this manuscript.
References

Learning in neural networks with material synapsesNeural Computation 6:957–982.https://doi.org/10.1162/neco.1994.6.5.957

Learning performance of normal and mutant Drosophila after repeated conditioning trials with discrete stimuliThe Journal of Neuroscience 20:2944–2953.https://doi.org/10.1523/JNEUROSCI.200802944.2000

Computational principles of synaptic memory consolidationNature Neuroscience 19:1697–1706.https://doi.org/10.1038/nn.4401

Computational models of episodiclike memory in foodcaching birdsNature Communications 14:2979.https://doi.org/10.1038/s4146702338570x

Distributed practice in verbal recall tasks: A review and quantitative synthesisPsychological Bulletin 132:354–380.https://doi.org/10.1037/00332909.132.3.354

Spacing effects in learning: A temporal ridgeline of optimal retentionPsychological Science 19:1095–1102.https://doi.org/10.1111/j.14679280.2008.02209.x

Systemlike consolidation of olfactory memories in DrosophilaThe Journal of Neuroscience 33:9846–9854.https://doi.org/10.1523/JNEUROSCI.045113.2013

The basal ganglia control the detailed kinematics of learned motor skillsNature Neuroscience 24:1256–1269.https://doi.org/10.1038/s41593021008893

Systems memory consolidation in DrosophilaCurrent Opinion in Neurobiology 23:84–91.https://doi.org/10.1016/j.conb.2012.09.006

The organization of recent and remote memoriesNature Reviews. Neuroscience 6:119–130.https://doi.org/10.1038/nrn1607

Neuromodulated spiketimingdependent plasticity, and theory of threefactor learning rulesFrontiers in Neural Circuits 9:85.https://doi.org/10.3389/fncir.2015.00085

The space of interactions in neural network modelsJournal of Physics A 21:257–270.https://doi.org/10.1088/03054470/21/1/030

Knowledge distillation: A surveyInternational Journal of Computer Vision 129:1789–1819.https://doi.org/10.1007/s1126302101453z

Disengagement of motor cortex during longterm learning tracks the performance level of learned movementsThe Journal of Neuroscience 41:7029–7047.https://doi.org/10.1523/JNEUROSCI.304920.2021

Hippocampal sharp waves and reactivation during awake states depend on repeated sequential experience. The Journal of Neuroscience 26:12415–12426. https://doi.org/10.1523/JNEUROSCI.411806.2006

A memory frontier for complex synapses. Advances in Neural Information Processing Systems. pp. 1034–1042.

The role of hippocampal replay in memory and planning. Current Biology 28:R37–R50. https://doi.org/10.1016/j.cub.2017.10.073

Hebbian plasticity in parallel synaptic pathways: A circuit mechanism for systems memory consolidation. PLOS Computational Biology 17:e1009681. https://doi.org/10.1371/journal.pcbi.1009681

Short-term, intermediate-term, and long-term memories. Behavioural Brain Research 57:193–198. https://doi.org/10.1016/01664328(93)90135d

The time window hypothesis: Spacing effects. Infant Behavior and Development 18:69–78. https://doi.org/10.1016/01636383(95)90008X

Efficient partitioning of memory systems and its importance for memory consolidation. PLOS Computational Biology 9:e1003146. https://doi.org/10.1371/journal.pcbi.1003146

Retrograde amnesia and memory consolidation: a neurobiological perspective. Current Opinion in Neurobiology 5:169–177. https://doi.org/10.1016/09594388(95)800239

Performance-dependent consolidation of learned vocal changes in adult songbirds. The Journal of Neuroscience 42:1974–1986. https://doi.org/10.1523/JNEUROSCI.194221.2021

Mechanisms and time course of vocal learning and consolidation in the adult songbird. Journal of Neurophysiology 106:1806–1821. https://doi.org/10.1152/jn.00311.2011
Article and author information
Author details
Funding
Gatsby Charitable Foundation
 Jack W Lindsey
 Ashok Litwin-Kumar
National Science Foundation (DBI1707398)
 Jack W Lindsey
 Ashok Litwin-Kumar
Department of Energy (CSGF (DESC0020347))
 Jack W Lindsey
Burroughs Wellcome Fund
 Ashok Litwin-Kumar
McKnight Foundation
 Ashok Litwin-Kumar
Mathers Foundation
 Ashok Litwin-Kumar
National Institutes of Health (R01EB029858)
 Ashok Litwin-Kumar
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank Stefano Fusi and Samuel Muscinelli for helpful discussions and comments on the manuscript. ALK and JL were supported by the Gatsby Charitable Foundation and NSF award DBI1707398. JL was also supported by the DOE CSGF (DESC0020347). ALK was also supported by the Burroughs Wellcome Foundation, the McKnight Endowment Fund, the Mathers Foundation, and NIH award R01EB029858.
Version history
 Preprint posted: July 6, 2023
 Sent for peer review: July 14, 2023
 Preprint posted: November 24, 2023
 Preprint posted: July 2, 2024
 Version of Record published: July 18, 2024 (version 1)
Cite all versions
You can cite all versions using the DOI https://doi.org/10.7554/eLife.90793. This DOI represents all versions, and will always resolve to the latest one.
Copyright
© 2023, Lindsey and Litwin-Kumar
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.