Abstract
Recent studies show that, even in constant environments, the tuning of single neurons changes over time in a variety of brain regions. This representational drift has been suggested to be a consequence of continuous learning under noise, but its properties are still not fully understood. To uncover the underlying mechanism, we trained an artificial network on a simplified navigational task, inspired by the predictive coding literature. The network quickly reached a state of high performance, and many neurons exhibited spatial tuning. We then continued training the network and noticed that the activity became sparser with time. We observed vastly different time scales between the initial learning and the ensuing sparsification. We verified the generality of this phenomenon across tasks, learning algorithms, and parameters. This sparseness is a manifestation of the movement within the solution space: the networks drift until they reach a flat loss landscape. This is consistent with recent experimental results demonstrating that CA1 neurons increase sparseness with exposure to the same environment and become more spatially informative. We conclude that learning is divided into three overlapping phases: fast familiarity with the environment, slow implicit regularization, and a steady state of null drift. The variability in drift dynamics opens the possibility of inferring learning algorithms from observations of drift statistics.
What do we mean when we say that the brain represents the external world? One interpretation is the existence of neurons whose activity is tuned to world variables. Such neurons have been observed in many contexts: place cells [1, 2] – which are tuned to position in a specific context, visual cells [3] – which are tuned to specific visual cues, neurons that are tuned to the execution of actions [4] and more. This tight link between the external world and neural activity might suggest that, in the absence of environmental or behavioral changes, neural activity is constant. In contrast, recent studies show that, even in constant environments, the tuning of single neurons to outside world variables gradually changes over time in a variety of brain regions, even long after good representations of the stimuli were achieved. This phenomenon has been termed representational drift, and has changed the way we think about the stability of memory and perception, but its driving forces and properties are still unknown [5, 6, 7, 8] (see [9, 10] for an alternative account).
There are at least two immediate theoretical questions arising from the observation of drift – why does it happen, and whether and how behavior is resistant to it [11, 12]? One mechanistic explanation is that the underlying anatomical substrates are themselves undergoing constant change, such that drift is a direct manifestation of this structural morphing [13]. A normative interpretation posits that drift is a solution to a computational demand, such as temporal encoding [14], ‘drop-out’ regularization [15], exploration of the solution space [16], or re-encoding during continual learning [11]. Several studies also address the resistance question, providing possible explanations on how behavior can be robust to such phenomena [17, 18, 19, 20].
Here, we focus on the mechanistic question, and leverage analyses of drift statistics for this purpose. Specifically, recent studies showed that representational drift in the CA1 is driven by active experience [21, 22]. Namely, rate maps decorrelate more when mice are active for a longer time in a given context. This implies that drift is not just a passive process, but rather an active learning one. As drift seems to occur after an adequate representation has formed, it seems fitting to model it as a form of a continuous learning process.
This approach has been recently explored by [23, 24]. They considered continuous learning in noisy, overparameterized neural networks. Because the system is overparameterized, a manifold of zero-loss solutions exists. [23] showed that for feedforward neural networks (FNNs) trained using Hebbian learning with added parameter noise, neurons change their tuning over time. This was due to an undirected random walk within the manifold of solutions. The coordinated drift of neighboring place fields was used as evidence to support this view. The phenomenon of undirected motion within the space of solutions seems plausible, as all members of this space achieve equally good performance (Fig 1A left). However, there may be other properties of the solutions (Fig 1B) that vary along this manifold, which could potentially bias drift in a certain direction (Fig 1A right). It is likely that the drift observed in experiments is a combination of both an undirected and directed movement. We will now introduce theoretical results from machine learning that support the possibility of directed drift.
Recent work provided a tractable analytical framework for the learning dynamics of Stochastic Gradient Descent (SGD) with added noise in an overparameterized regime [25, 26]. These studies showed that, after the network has converged to the zero-loss manifold, a second-order effect biases the random walk along a specific direction within this manifold. This direction reduces an implicit regularizer, determined by the type of noise the network is exposed to. The regularizer is related to the Hessian of the loss, which measures the flatness of the loss landscape in the vicinity of the solutions. Since this directed movement is a second-order effect, its timescale is orders of magnitude larger than that of the initial convergence.
Consider a biological neural network performing a task. The ML implicit regularization mentioned above requires three components: an overparameterized regime, noise, and SGD. Both biological and artificial networks possess a large number of synapses, or parameters, and hence can reasonably be expected to be overparameterized. Noise can emerge from the external environment or from internal biological elements. It is not reasonable to assume that a precise form of gradient descent is implemented in the brain [27], thereby casting doubt on the third element. Nevertheless, biologically plausible rules could be considered as noisy versions of gradient descent, as long as there is a coherent improvement in performance [28, 29]. Motivated by this analogy, we explore representational drift in models and experimental data.
Because drift is commonly observed in spatially-selective cells, we base our analysis on a model which has been shown to contain such cells [30]. Specifically, we trained artificial neural networks on a predictive coding task in the presence of noise. In this task, an agent moves along a linear track while receiving visual input from the walls, such that the goal is to predict the subsequent input. We observed that neurons became tuned to the latent variable, which is position, in accordance with previous results [30]. We continued training and found that, in addition to the gradual change of tuning curves, similar to [23], the number of active neurons decreased slowly while their tuning specificity increased. These results align with recent experimental observations [21]. Finally, we demonstrated the connection between this sparsification effect and changes to the Hessian, in accordance with ML theory.
Results
Spontaneous sparsification in a predictive coding network
To model representational drift in the CA1 area, we chose a simple model that could give rise to spatially-tuned cells [30]. In this model, an agent traverses a corridor while slightly modulating its angle with respect to the main axis (Fig 2A). The walls are textured by a fixed smooth noisy signal, and the agent receives this as input according to its current field of view. The model itself is a single hidden layer feedforward network, with the velocity and visual field as inputs. The desired output is the predicted visual input in the next time step. The model equations are given by:
where m and n are the input and output matrices respectively, b is the bias vector, and σ is the ReLU activation function. The task is for the network’s output, y, to match the visual input, x of the following time step, resulting in the following loss function:
We train the network using Gradient Descent (GD), while adding update noise to the learning dynamics:
where θ = (m, n, b) is the vectorized parameter vector, τ is the current training step, and the added term is Gaussian noise. We let the network converge to a good solution, demonstrated by a loss plateau, and continued training for an additional period. Note that this additional period can be orders of magnitude longer than the initial training period. The network quickly converged to a low loss and stayed at the same loss during the additional training period (Fig 2B). Surprisingly, when looking at the activity within the hidden layer, we noticed that it slowly became sparse. This sparsification did not hurt performance, because individual units became more informative, as quantified by the average mutual information between unit activity and the position of the agent (Fig 2C). The rate maps of neurons, i.e. their tuning to position, resemble the representational drift observed in experiments [5] – namely, neurons changed their tuning over time (Fig 2D). Additionally, their tuning specificity increased in accordance with the information increase. The correlation matrix of the rate maps over time shows a gradual change that slowed down (Fig 2E). To summarize, we observed a spontaneous sparsification over a timescale much longer than the initial convergence, without introducing any explicit regularization. This is comparable to experimental data from [21], where drift was indeed characterized by a decrease in the fraction of active place cells and an increase in cells’ information, while the decoding error for the position of the mouse stayed relatively constant (Fig 2F). [31] also reported a decrease in CA1 neural activity and a rise in specificity with environment familiarity. Another recent study further demonstrated an increase in information over days [32].
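The training procedure described above can be illustrated with a minimal sketch. It assumes a hidden layer h = σ(m x + b), an output y = n h, a mean squared error loss, and isotropic Gaussian update noise. The toy random-mapping data, the layer sizes, the learning rate, the noise scale, and all function names are hypothetical choices for illustration and are not taken from the original simulations. The run is far too short to show the slow sparsification; it only shows the mechanics of noisy gradient descent and of counting active units.

```python
import numpy as np

def init_params(n_in, n_hidden, n_out, rng):
    """Random initial parameters theta = (m, n, b); the scaling is an assumption."""
    m = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_hidden, n_in))
    n = rng.normal(0.0, 1.0 / np.sqrt(n_hidden), size=(n_out, n_hidden))
    b = np.zeros(n_hidden)
    return m, n, b

def forward(params, x):
    """Hidden ReLU activity and network output for a batch of inputs x."""
    m, n, b = params
    h = np.maximum(0.0, x @ m.T + b)      # shape (batch, n_hidden)
    return h, h @ n.T                     # output shape (batch, n_out)

def loss_and_grads(params, x, target):
    """Mean squared error (assumed loss) and its gradients w.r.t. (m, n, b)."""
    m, n, b = params
    h, y = forward(params, x)
    err = y - target
    batch = x.shape[0]
    loss = 0.5 * np.mean(np.sum(err ** 2, axis=1))
    g_n = err.T @ h / batch
    delta = (err @ n) * (h > 0)           # backpropagate through the ReLU
    g_m = delta.T @ x / batch
    g_b = delta.mean(axis=0)
    return loss, (g_m, g_n, g_b)

def noisy_gd_step(params, grads, lr, noise_std, rng):
    """Gradient descent step with added Gaussian update noise."""
    return tuple(p - lr * g + noise_std * rng.normal(size=p.shape)
                 for p, g in zip(params, grads))

def fraction_active(params, x):
    """Fraction of hidden units with non-zero activation for at least one input."""
    h, _ = forward(params, x)
    return float(np.mean(np.any(h > 0, axis=0)))

def run_demo(n_steps=3000, lr=0.02, noise_std=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(32, 8))          # toy inputs (hypothetical task)
    target = rng.normal(size=(32, 4))     # toy targets (random mapping)
    params = init_params(8, 64, 4, rng)
    initial_loss, _ = loss_and_grads(params, x, target)
    for _ in range(n_steps):
        _, grads = loss_and_grads(params, x, target)
        params = noisy_gd_step(params, grads, lr, noise_std, rng)
    final_loss, _ = loss_and_grads(params, x, target)
    return initial_loss, final_loss, fraction_active(params, x)
```

Calling `run_demo()` returns the initial loss, the final loss, and the fraction of active units; tracking the last quantity over a much longer run is how the sparsification would be monitored in this sketch.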
Generality of the phenomenon
To explore the sensitivity of our results to specific modeling choices, we systematically varied many of them (Fig 3A). Specifically, we replaced the task with either a simplified predictive coding, random mappings or smoothed random mappings. Noise was introduced to the outputs (label noise), instead of the update noise. We simulated different activation functions. Perhaps most important, we varied the learning rules, as SGD is not a biologically plausible one. We used both Adam [33] and RMSprop [34], from the ML literature. We also used Stochastic Error-Descent (SED) [35], which does not require gradient calculation and is more biologically plausible (5). All cases demonstrated an initial, fast, phase of convergence to low loss, followed by a much slower phase of directed random motion within the low-loss space.
The results of the simulations supported our main conclusion, though several qualitative phenomena could be observed. First of all, sparsification dynamics were not sensitive to most of the parameters. The main qualitative difference observed was that the timescales could vary by orders of magnitude as a function of the noise scale (Fig 3B bottom). Note that we calculate the timescale of sparsification by fitting an exponential curve to the fraction of active units over time, and take the time constant of the fitted exponential. Additionally, apart from simulations that did not converge because their timescales were too long, the final sparsity was the same for all networks of the same size (Fig 3B top), in accordance with results from [23]. In a sense, once noise is introduced the network is driven to maximal sparsification. For Adam, RMSprop and SED, sparsification ensued in the absence of any added noise. For SED the explanation is straightforward, as the parameter updates are driven by noise. For Adam and RMSprop, we suggest that in the vicinity of the zero-loss manifold, the second moment acts as noise.
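The timescale estimate mentioned above can be sketched as follows. The fitted form a·exp(−t/τ) + c, the grid search over τ, and the function name are assumptions made for illustration; the original fitting procedure may differ.

```python
import numpy as np

def fit_exponential_timescale(t, frac_active, tau_grid):
    """Fit frac_active ~ a * exp(-t / tau) + c and return (tau, a, c).

    For each candidate tau the amplitude a and offset c are found by linear
    least squares; the tau with the smallest squared error is returned.
    """
    best = None
    for tau in tau_grid:
        design = np.column_stack([np.exp(-t / tau), np.ones_like(t)])
        coef, *_ = np.linalg.lstsq(design, frac_active, rcond=None)
        sse = np.sum((design @ coef - frac_active) ** 2)
        if best is None or sse < best[0]:
            best = (sse, tau, coef[0], coef[1])
    _, tau, a, c = best
    return tau, a, c

# Synthetic example: a decay with a known time constant of 800 steps.
t_demo = np.linspace(0.0, 5000.0, 200)
frac_demo = 0.6 * np.exp(-t_demo / 800.0) + 0.2
tau_fit, a_fit, c_fit = fit_exponential_timescale(
    t_demo, frac_demo, np.geomspace(10.0, 1e5, 400))
```

A logarithmically spaced grid is used here because the timescales are reported to span orders of magnitude; a nonlinear optimizer would be an equally reasonable choice.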
For label noise, the dynamics were qualitatively different: the fraction of active units did not decrease, but the activity of the units did sparsify. In some cases, the networks quickly collapsed to a sparse solution, most likely as a result of the learning rate being too high in relation to the input statistics [36]. Importantly, for GD without noise, there was no change after the initial convergence.
As a further test of the generality of this phenomenon, we consider the recent simulation from [23]. The learning rule used in this work was very different from the ones we applied. We therefore simulated that network using the published code. We found the same type of dynamics as shown above, namely that the network initially converged to a good solution, followed by a longer period of sparsification (Fig 3C). Note that in the original publication [23] the focus was on the stage following this sparsification, in which the network indeed maintained a constant fraction of active cells.
In conclusion, we see that noisy learning leads to three phases under rather general conditions. First, fast learning of the task and convergence to the manifold of low-loss solutions. The second phase is directed movement on this manifold driven by a second-order effect of implicit regularization. The third phase is an undirected random walk within the sub-manifold of low loss and maximum regularization.
Mechanism of sparsification
What are the mechanisms that give rise to sparsification? As illustrated in Fig. 1, different solutions in the zero-loss manifold might vary in some of their properties. The specific property suggested from theory [25] is the flatness of the loss landscape in the vicinity of the solution. This can be demonstrated with a simple example. Consider a two-dimensional loss function. The function is shaped like a valley with a continuous one-dimensional zero-loss manifold at its bottom (Fig 4A). Crucially, the loss on the entire manifold is exactly zero, while the vicinity of the manifold becomes systematically flatter in one direction. We simulated gradient descent with added noise on this function from a random starting point (red dot). The trajectory quickly converged to the zero-loss manifold, and began a random walk on it. This walk was clearly biased towards the flatter area of the manifold, as can be seen by the spread of the trajectory. This bias can be understood by noting that the gradient was orthogonal to the contour lines of the loss, and therefore had a component directed towards the flat region.
In higher dimensions, flatness is captured by the eigenvalues of the Hessian of the loss. Because these eigenvalues are a collection of numbers, different scenarios could lead to minimizing different aspects of this collection. Specifically, according to [25], update noise should regularize the sum of the log of the non-zero eigenvalues while label noise should do the same for the sum of eigenvalues. In our predictive coding example, where update noise was added, each inactivated unit translates into a set of zero-rows in the Hessian, and thus also into a set of zero-eigenvalues (Fig 4B). The slope of the regularizer approaches infinity as the eigenvalue approaches zero, and thus small eigenvalues are driven to zero much faster than large eigenvalues (Fig 4C). So in this case, update noise leads to an increase in the number of zero eigenvalues, which are manifested as a sparse solution. Another, perhaps more intuitive, way to understand these results is that units below the activation threshold are insensitive to noise perturbations. In other scenarios, in which we simulated with label noise, we indeed observed a gradual decrease in the sum of eigenvalues (Fig 4D).
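The two flatness measures named above can be computed from a Hessian as in the following sketch. The function name and the numerical tolerance used to decide which eigenvalues count as zero are assumptions for illustration.

```python
import numpy as np

def hessian_flatness_measures(hessian, tol=1e-10):
    """Summaries of the Hessian spectrum associated with the two noise types.

    sum_log_nonzero: sum of the log of the non-zero eigenvalues
                     (the quantity linked to update noise in the text).
    sum_eigenvalues: sum of all eigenvalues, i.e. the trace
                     (the quantity linked to label noise in the text).
    n_zero:          number of eigenvalues at or below the tolerance.
    """
    eig = np.linalg.eigvalsh(hessian)     # symmetric matrix, real spectrum
    nonzero = eig[eig > tol]
    return {
        "sum_log_nonzero": float(np.sum(np.log(nonzero))),
        "sum_eigenvalues": float(np.sum(eig)),
        "n_zero": int(np.sum(eig <= tol)),
    }

# Toy spectrum with one zero eigenvalue.
measures = hessian_flatness_measures(np.diag([0.0, 0.5, 2.0]))
```

Because the derivative of log λ is 1/λ, the pull on an eigenvalue under the first measure grows without bound as λ approaches zero, which is the sense in which small eigenvalues are driven to zero fastest.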
Discussion
We showed that representational drift could arise from ongoing learning in the presence of noise, after a network has already reached good performance. We suggest that learning is divided into three overlapping phases: a fast initial phase, where good performance is achieved, a second slower phase in which directed drift along the low-loss manifold leads to an implicit regularization and finally, a third undirected phase ensues once the regularizer is minimized. In our results, the directed component was associated with sparsification of the neural code, a phenomenon we also observed in experimental data.
Interpreting drift as a learning process has recently been suggested by [23, 24]. Both studies focused on the final phase in which the statistics of the representations were constant. Experimentally, [7] reported a decrease in activity at the beginning of the experiment, which they suggested was correlated with some behavioral change, but we believe it could also be a result of the directed drift phase. [37] also reported a slow directed change in representation long after familiarity with the stimuli. There is another consequence of the timescale separation. Unlike in the setting of drift experiments, natural environments are never truly constant. Thus, it is possible that the second phase of learning never stops because the task is slowly changing. This would imply that the second, directed, phase may be the natural regime in which neural networks reside.
Here, we reported directed drift in the space of solutions of neural networks. This drift could be observed by examining changes to the representation of external world variables, and hence is related to the phenomenon of representational drift. Note, however, that representations are not a full description of a network’s behavior [38]. The statistics of representational changes can be used as a window into changes of network dynamics and function.
The phenomenon of directed drift is very robust to various modeling choices, and also consistent with recent theoretical results [25, 26]. The details of the direction of the drift, however, are dependent on specific choices. Specifically, which aspects of the Hessian are minimized during the second phase of learning, as well as the timescale of this phase, depend on the specifics of the learning rule and the noise in the system. This suggests an exciting opportunity – inferring the learning rule of a network from the statistics of representational drift.
Our explanation of drift invoked the concept of a low-loss manifold – a family of network configurations that have identical performance on a task. The definition of low-loss, however, depends on the specific task and context analyzed. Challenging a system with new inputs could dissociate two configurations that otherwise appear identical [39]. It will be interesting to explore whether various environmental perturbations could uncover the motion along the low-loss manifold in the CA1 population. For instance, remapping was interpreted as an indicator of the detection of a context switch [40]. One can therefore speculate that the probability for remapping given the same environmental change will systematically vary as the network moves to flatter areas of the loss landscape.
Machine learning has been suggested as a model tool for neuroscience research [41, 42, 43]. However, implicit regularization in ML has not previously been used to explain representational drift in neuroscience, and earlier modeling work may have been done without awareness of this phenomenon. It’s worth noting that this isn’t a phenomenon specific to neural networks, but rather a general property of overparameterized systems that optimize a cost function. Importing insights from this domain into neuroscience shows the utility of studying general phenomena in systems that learn. For example, another complex learning system in which a similar idea has been proposed is evolution – “survival of the flattest” suggests that, under a high mutation rate, the fittest replicators are not just the ones with the highest fitness, but also with a flat fitness function which is more robust to mutations [44]. One can hope that more such insights will arise as we open our eyes.
Materials and methods
Predictive coding task
The agent moves in an arena of size (Lx, Ly), with a constant velocity V0 in the y direction. The agent’s heading direction is θ, which changes by a small increment at every time step. The agent’s visual field has an angle θvis and is represented as a vector of size Lvis. The texture of the walls is generated from a random Gaussian vector of size Lwalls = 2(Lx + Ly)Lvis, smoothed with a Gaussian filter with σ² = KsmoothLwalls. At each time step the agent receives the visual input from the walls, determined by the intersection points of its visual field with the walls. When the agent reaches a distance of LyLbuffer from the wall, it turns to the opposite direction.
Tuning properties of units
For each unit we calculated a tuning curve. We divided the arena into 100 equal bins and computed the number of time steps in each bin and the mean unit activation. We then obtained the tuning curve by dividing the mean activity for each bin by the occupancy. We treated movement in each direction as a separate location. We calculated the spatial information (SI) of the tuning curves for each unit:
where i is the index of the bin, pi is the probability of being in the bin, ri is the value of the tuning curve in the bin, and r̄ is the unit’s mean activity rate. An active unit was defined as a unit with non-zero activation for at least one input.
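The tuning-curve and spatial-information computation can be sketched as below. The sketch assumes the commonly used form SI = Σ pi (ri / r̄) log2(ri / r̄), which matches the symbols defined above but is an assumption about the exact formula used. The function names and the convention that each time step already carries a bin index are hypothetical.

```python
import numpy as np

def tuning_curve(bin_index, activity, n_bins):
    """Occupancy-normalized activity per spatial bin.

    bin_index: integer bin of the agent at every time step.
    activity:  activation of one unit at every time step.
    Returns the tuning curve and the occupancy probability of each bin.
    """
    occupancy = np.bincount(bin_index, minlength=n_bins).astype(float)
    summed = np.bincount(bin_index, weights=activity, minlength=n_bins)
    rate = np.divide(summed, occupancy, out=np.zeros(n_bins), where=occupancy > 0)
    return rate, occupancy / occupancy.sum()

def spatial_information(rate, p):
    """Spatial information in bits, under the assumed formula stated above."""
    mean_rate = np.sum(p * rate)
    if mean_rate <= 0:
        return 0.0                        # a silent unit carries no information
    ratio = rate / mean_rate
    valid = (p > 0) & (ratio > 0)         # bins with zero rate contribute zero
    return float(np.sum(p[valid] * ratio[valid] * np.log2(ratio[valid])))

# Toy check: 100 equally visited bins, a unit active in a single bin.
bins_demo = np.tile(np.arange(100), 5)
act_demo = (bins_demo == 0).astype(float)
rate_demo, p_demo = tuning_curve(bins_demo, act_demo, 100)
si_single_bin = spatial_information(rate_demo, p_demo)
si_flat = spatial_information(np.ones(100), p_demo)
```

To treat the two movement directions as separate locations, as described above, one option is to offset the bin index by the number of bins whenever the agent moves in the opposite direction.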
Simulations
For the random simulations, we trained each network for 10⁷ training steps while choosing a random learning algorithm and random parameters. The ranges and relevant values of parameters are specified in Table 1. For Adam and SED there was no added noise.
Stochastic Error Descent
The equation for parameter updates under this learning rule is given by:
In this learning rule, the parameters are randomly perturbed at each training step by a Gaussian noise denoted by ξτ and then updated in proportion to the change in loss.
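A minimal sketch of this rule follows, assuming the update θ ← θ − η (L(θ + ξ) − L(θ)) ξ, which is one way to read the description above. The quadratic test loss, the step size, the perturbation scale, and the function names are hypothetical choices for illustration.

```python
import numpy as np

def sed_step(theta, loss_fn, lr, perturb_std, rng):
    """One stochastic error-descent step (assumed form, see lead-in).

    The parameters are perturbed by Gaussian noise xi, the resulting change
    in loss is measured, and the parameters move against xi in proportion
    to that change. No gradient is computed.
    """
    xi = perturb_std * rng.normal(size=theta.shape)
    delta_loss = loss_fn(theta + xi) - loss_fn(theta)
    return theta - lr * delta_loss * xi

def run_sed_demo(n_steps=2000, lr=5.0, perturb_std=0.1, seed=0):
    rng = np.random.default_rng(seed)
    loss_fn = lambda th: 0.5 * float(np.sum(th ** 2))   # toy quadratic loss
    theta = 2.0 * np.ones(5)
    initial_loss = loss_fn(theta)
    for _ in range(n_steps):
        theta = sed_step(theta, loss_fn, lr, perturb_std, rng)
    return initial_loss, loss_fn(theta)
```

On average the step follows the negative gradient scaled by the perturbation variance, but each individual step is noisy, which is consistent with the statement that no additional noise is needed for this rule.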
Label noise
Label noise is introduced to the loss function given by the following formula:
where the added term is Gaussian noise.
Gradient descent dynamics around the zero-loss manifold
The function we used for the two-dimensional example was given by:
which has zero loss on the x and y axes. For small enough update noise, GD will converge to the vicinity of this manifold (the axes). We consider a point on the x axis: (x0, 0), and calculate the direction of the gradient near that point. Because we are interested in motion along the zero-loss manifold, we consider a small perturbation in the orthogonal direction (x0, 0 + Δy) where x0 >> 1 and |Δy| << 1. Any component of the gradient in the x direction will lead to motion along the manifold. The update step at this point is given by:
One can observe that the step has a large component in the y direction, quickly returning to the manifold. There is also a smaller component in the x direction, reducing the value of x. Reducing x also reduces the Hessian’s eigenvalues:
Thus, it becomes clear that the trajectory will have a bias that reduces the curvature in the y direction.
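The bias along the manifold can be checked numerically with a small sketch. It uses an assumed loss L(x, y) = ½x²y², chosen only because it vanishes on both axes and its curvature in the y direction grows with x; it is not necessarily the function used for Fig 4A. The learning rate, noise scale, number of trajectories, and function names are hypothetical.

```python
import numpy as np

def grad_toy_loss(x, y):
    """Gradient of the assumed loss L(x, y) = 0.5 * x**2 * y**2."""
    return x * y ** 2, x ** 2 * y

def simulate_noisy_gd(x0, n_traj, n_steps, lr, noise_std, seed=0):
    """Run many independent noisy gradient descent trajectories in parallel.

    All trajectories start on the zero-loss manifold at (x0, 0).
    """
    rng = np.random.default_rng(seed)
    x = np.full(n_traj, float(x0))
    y = np.zeros(n_traj)
    for _ in range(n_steps):
        gx, gy = grad_toy_loss(x, y)
        x = x - lr * gx + noise_std * rng.normal(size=n_traj)
        y = y - lr * gy + noise_std * rng.normal(size=n_traj)
    return x, y

x_final, y_final = simulate_noisy_gd(x0=2.0, n_traj=4000, n_steps=5000,
                                     lr=0.02, noise_std=0.01)
mean_x_shift = float(x_final.mean() - 2.0)   # negative values mean drift to flatter regions
```

Averaging over many trajectories matters here because the directed component is a second-order effect and is small compared with the undirected diffusion of any single trajectory.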
For general loss functions and various noise models, rigorous proofs can be found in [25], and a different approach can be found in [26]. Here, we will briefly outline the intuition for the general case. Consider again the update rule for GD:
In order to understand the dynamics close to the zero-loss manifold, we consider a point θ for which L(θ) = 0 and expand the loss around it:
We can then take the gradient of this expansion with respect to θ:
The first term is zero, because the gradient is zero on the manifold. The second term is the largest one, as it is linear in δθ. Note that the Hessian matrix has zero eigenvalues in directions on the zero-loss manifold, and non-zero eigenvalues in other directions. Thus, the second term corresponds to projecting δθ in a direction that is orthogonal to the zero-loss manifold. The third term can be interpreted as the gradient of some auxiliary loss function. Thus, we expect gradient descent to minimize this new loss, which corresponds to a quadratic form with the Hessian. This is the reason for the implicit regularization along the manifold. Note that the auxiliary loss function is defined by δθ, and thus different noise statistics will correspond, on average, to different implicit regularizations. In conclusion, the update step will have a large component that moves the parameter vector towards the zero-loss manifold, and a small component that moves the parameter vector on the manifold in a direction that minimizes some measure of the Hessian.
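As a sketch of the expansion described above, assume θ lies on the zero-loss manifold, δθ is a small displacement from it, and H(θ) is the Hessian of L at θ. This notation is an assumption and may differ from the original equations, but the three terms of the second line correspond, in order, to the three terms discussed in the text:

```latex
\begin{align}
L(\theta + \delta\theta) &\approx L(\theta) + \nabla L(\theta)^{\top}\delta\theta
  + \tfrac{1}{2}\,\delta\theta^{\top} H(\theta)\,\delta\theta \\
\nabla_{\theta} L(\theta + \delta\theta) &\approx \nabla L(\theta)
  + H(\theta)\,\delta\theta
  + \tfrac{1}{2}\,\nabla_{\theta}\!\left[\delta\theta^{\top} H(\theta)\,\delta\theta\right]
\end{align}
```

In this form the auxiliary loss mentioned above is the quadratic form ½ δθᵀH(θ)δθ with δθ held fixed, so averaging it over the noise statistics of δθ gives a Hessian-dependent regularizer.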
Hessian and sparseness
In the main text, we show that the implicit regularization of the Hessian leads to sparse representations. Here, we show this relationship for a single-hidden layer feed-forward neural network with ReLU activation and Mean Squared Error loss:
The gradient and Hessian at the zero-loss manifold are given by [45]:
where 𝕝(xi; θ) is an indicator vector denoting whether each unit is active for some input xi. Sparseness means that a unit has become inactive for all inputs. All the partial derivatives of input, output and bias weights associated with such a unit are zero, and thus the relevant rows of the Hessian are zero as well. Thus, every inactive unit leads to several zero eigenvalues.
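The link between inactive units and zero eigenvalues can be checked numerically. The sketch below assumes the network y = n σ(m x + b) with a mean squared error loss, for which the Hessian at zero loss reduces to the average of JᵀJ, where J is the Jacobian of the output with respect to the parameters. The parameter ordering, the normalization, and the function names are hypothetical.

```python
import numpy as np

def output_jacobian(m, n, b, x):
    """Jacobian of y = n @ relu(m @ x + b) w.r.t. [m.ravel(), n.ravel(), b]."""
    n_out = n.shape[0]
    n_hidden = m.shape[0]
    pre = m @ x + b
    active = (pre > 0).astype(float)      # indicator of active units for input x
    h = pre * active
    j_m = (n * active)[:, :, None] * x[None, None, :]   # dy_o / dm_hd
    j_n = np.zeros((n_out, n_out, n_hidden))
    j_n[np.arange(n_out), np.arange(n_out), :] = h      # dy_o / dn_oh
    j_b = n * active                                    # dy_o / db_h
    return np.concatenate(
        [j_m.reshape(n_out, -1), j_n.reshape(n_out, -1), j_b], axis=1)

def zero_loss_hessian(m, n, b, inputs):
    """Average of J^T J over inputs: the Hessian of the MSE loss at zero loss."""
    n_params = m.size + n.size + b.size
    hess = np.zeros((n_params, n_params))
    for x in inputs:
        jac = output_jacobian(m, n, b, x)
        hess += jac.T @ jac
    return hess / len(inputs)

def unit_parameter_indices(unit, n_hidden, n_in, n_out):
    """Flat indices of the input, output and bias weights of one hidden unit."""
    idx_m = np.arange(unit * n_in, (unit + 1) * n_in)
    idx_n = n_hidden * n_in + np.arange(n_out) * n_hidden + unit
    idx_b = np.array([n_hidden * n_in + n_out * n_hidden + unit])
    return np.concatenate([idx_m, idx_n, idx_b])

rng = np.random.default_rng(0)
n_hidden, n_in, n_out = 6, 4, 3
m_demo = rng.normal(size=(n_hidden, n_in))
n_demo = rng.normal(size=(n_out, n_hidden))
b_demo = np.zeros(n_hidden)
b_demo[0] = -100.0                        # silence unit 0 for every input
hess_demo = zero_loss_hessian(m_demo, n_demo, b_demo, rng.normal(size=(20, n_in)))
idx_demo = unit_parameter_indices(0, n_hidden, n_in, n_out)
eig_demo = np.linalg.eigvalsh(hess_demo)
```

In this toy setting the silenced unit owns n_in + n_out + 1 parameters, so at least that many eigenvalues of the Hessian are zero.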
Acknowledgements
We thank Ron Teichner and Kabir Dabholkar for comments on the manuscript. This research was supported by the ISRAEL SCIENCE FOUNDATION (grants Nos. 2655/18 and 2183/21 to DD, and 1442/21 to OB), by the German-Israeli Foundation (GIF I-1477-421.13/2018) to DD, by a grant from the US-Israel Binational Science Foundation (NIMH-BSF CRCNS BSF:2019807, NIMH:R01 MH125544-01 to DD), by an HFSP research grant (RGP0017/2021) to OB, a Rappaport Institute Collaborative research grant to DD, by Israel PBC-VATAT and by the Technion Center for Machine Learning and Intelligent Systems (MLIS) to DD and OB, by the Prince Center for the Aging Brain, and by a University of Michigan – Israel Partnership for Research and Education Collaborative Research stipend to DK. (data science)
References
- [1] The hippocampus as a cognitive map. Behavioral and Brain Sciences 2:487–494
- [2] The hippocampus as a spatial map: preliminary evidence from unit activity in the freely-moving rat. Brain Research
- [3] Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of Physiology 160
- [4] Cortical representation of motion during unrestrained spatial navigation in the rat. Cerebral Cortex 4:27–39
- [5] Long-term dynamics of CA1 hippocampal place codes. Nature Neuroscience 16:264–266
- [6] Dynamic Reorganization of Neuronal Activity Patterns in Parietal Cortex. Cell 170:986–999
- [7] Representational drift in the mouse visual cortex. Current Biology 31:4327–4339
- [8] Representational drift in primary olfactory cortex. Nature 594
- [9] Publisher correction: A stable hippocampal code in freely flying bats. Nature 606
- [10] Contribution of behavioural variability to representational drift. eLife 11
- [11] Causes and consequences of representational drift. Curr. Opin. Neurobiol 58:141–147
- [12] Representational drift: Emerging theories for continual learning and experimental future directions. Current Opinion in Neurobiology 76
- [13] Synaptic tenacity or lack thereof: spontaneous remodeling of synapses. Trends in Neurosciences 41:89–99
- [14] Hippocampal ensemble dynamics timestamp events in long-term memory. eLife 4
- [15] The geometry of representational drift in natural and artificial neural networks. PLOS Computational Biology 18
- [16] Network plasticity as Bayesian inference. PLOS Computational Biology 11
- [17] Motor learning with unstable neural representations. Neuron 54:653–666
- [18] Stable memory with unstable synapses. Nature Communications 10
- [19] Intrinsic volatility of synaptic connections – a challenge to the synaptic trace theory of memory. Current Opinion in Neurobiology 46:7–13
- [20] Drifting assemblies for persistent memory: Neuron transitions and unsupervised compensation. Proceedings of the National Academy of Sciences 118
- [21] Active experience, not time, determines within-day representational drift in dorsal CA1. Neuron
- [22] Time and experience differentially affect distinct aspects of hippocampal representational drift. Neuron
- [23] Coordinated drift of receptive fields in Hebbian/anti-Hebbian network models during noisy representation learning. Nature Neuroscience :1–11
- [24] Stochastic gradient descent-induced drift of representation in a two-layer neural network. arXiv preprint
- [25] Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process. In Conference on Learning Theory, PMLR :483–513
- [26] What happens after SGD reaches zero loss? – a mathematical framework. arXiv preprint
- [27] Towards biologically plausible deep learning. arXiv preprint
- [28] Beyond accuracy: generalization properties of bio-plausible temporal credit assignment rules. Advances in Neural Information Processing Systems 35:23077–23097
- [29] A unified framework of online learning algorithms for training recurrent neural networks. The Journal of Machine Learning Research 21:5320–5353
- [30] Predictive learning as a network mechanism for extracting low-dimensional latent space representations. Nature Communications 12
- [31] Network dynamics underlying the formation of sparse, informative representations in the hippocampus. Journal of Neuroscience 28:14271–14281
- [32] Bias-free estimation of information content in temporally sparse neuronal activity. PLOS Computational Biology 18
- [33] Adam: A method for stochastic optimization. arXiv preprint
- [34] Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. Cited on 14
- [35] A fast stochastic error-descent algorithm for supervised learning and optimization. Advances in Neural Information Processing Systems 5
- [36] The implicit bias of minima stability: A view from function space. Advances in Neural Information Processing Systems 34:17749–17761
- [37] Cortical reactivations predict future sensory responses. bioRxiv :2022–11
- [38] Is coding a relevant metaphor for the brain? Behavioral and Brain Sciences 42
- [39] Charting and navigating the space of solutions for recurrent neural networks. Advances in Neural Information Processing Systems 34:25320–25333
- [40] Hippocampal remapping as hidden state inference. eLife 9
- [41] A deep learning framework for neuroscience. Nature Neuroscience 22:1761–1770
- [42] Toward an integration of deep learning and neuroscience. Frontiers in Computational Neuroscience
- [43] If deep learning is the answer, what is the question? Nature Reviews Neuroscience 22:55–67
- [44] The fittest versus the flattest: experimental confirmation of the quasispecies effect with subviral pathogens. PLOS Pathogens 2
- [45] The implicit bias of minima stability in multivariate shallow ReLU networks.
Copyright
© 2023, Ratzon et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.