Representational drift as a result of implicit regularization

  1. Aviv Ratzon (corresponding author)
  2. Dori Derdikman
  3. Omri Barak
  1. Rappaport Faculty of Medicine, Technion - Israel Institute of Technology, Israel
  2. Network Biology Research Laboratory, Technion - Israel Institute of Technology, Israel
7 figures, 3 tables and 1 additional file

Figures

Figure 1
Two types of possible movements within the solution space.

(A) Two options for how drift may look in the solution space: a random walk within the space of equally good solutions that is either undirected (left) or directed (right). (B) The qualitative consequences of the two movement types. For an undirected random walk, all properties of the solution remain roughly constant (left). For directed movement, some property of the solution should gradually increase or decrease (right).

Figure 2
Continuous noisy learning leads to drift and spontaneous sparsification.

(A) Illustration of an agent in a corridor receiving high-dimensional visual input from the walls. (B) Loss as a function of training steps (log scale). Zero loss corresponds to a mean estimator. Note the rapid drop in loss at the beginning, after which it remains roughly constant. (C) Mean spatial information (SI, blue) and fraction of units with non-zero activation for at least one input (red) as a function of training steps. (D) Rate maps sampled at four different time points (columns). Maps in each row are sorted according to a different time point. Sorting is done based on the peak tuning value to the latent variable. (E) Correlation of rate maps between different time points along training. Only active units are used.
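For readers who want to reproduce the qualitative effect, below is a minimal sketch (not the authors' code) of noisy learning on a toy version of this setting: a small ReLU network trained by gradient descent with Gaussian update noise, while tracking the fraction of hidden units that remain active. The task construction (the target here is simply the latent position, a simplification of the predictive task), network size, learning rate, and noise amplitude are all illustrative assumptions.

```python
# Minimal sketch of noisy learning leading to sparsification (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

# Toy "corridor" task: high-dimensional input that is a smooth random function
# of a 1D latent position; the target is the latent position itself.
n_pos, d_in, d_hid = 100, 50, 100
positions = np.linspace(0, 1, n_pos)
X = np.stack([np.cos(2 * np.pi * (k + 1) * positions + rng.uniform(0, 2 * np.pi))
              for k in range(d_in)], axis=1)          # (n_pos, d_in)
y = positions[:, None]                                # (n_pos, 1)

W1 = rng.normal(0, 1 / np.sqrt(d_in), (d_in, d_hid))
W2 = rng.normal(0, 1 / np.sqrt(d_hid), (d_hid, 1))
lr, noise_std, steps = 1e-2, 1e-2, 20000              # assumed hyperparameters

for t in range(steps):
    h = np.maximum(X @ W1, 0)                         # ReLU hidden layer
    err = h @ W2 - y                                  # prediction error
    gW2 = h.T @ err / n_pos
    gW1 = X.T @ ((err @ W2.T) * (h > 0)) / n_pos
    # gradient step plus Gaussian update noise on every weight
    W1 -= lr * gW1 + noise_std * rng.normal(size=W1.shape)
    W2 -= lr * gW2 + noise_std * rng.normal(size=W2.shape)
    if t % 5000 == 0:
        frac_active = np.mean(np.maximum(X @ W1, 0).max(axis=0) > 0)
        print(f"step {t}: loss={np.mean(err**2):.4f}, active fraction={frac_active:.2f}")
```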

Figure 3
Experimental data consistent with simulations.

Data from four different labs show sparsification of CA1 spatial code, along with an increase in the information of active cells. Values are normalized to the first recording session in each experiment. Error bars show standard error of the mean. (A) Fraction of place cells (slope=-0.0003, p < .001) and mean spatial information (SI) (slope=0.002, p < .001) per animal over 200 min (Khatib et al., 2023). (B) Number of cells per animal (slope=-0.052, p = .004) and mean SI (slope=0.094, p < .001) over all cells pooled together over 10 days. Note that we calculated the number of active cells rather than fraction of place cells because of the nature of the available data (Jercog et al., 2019b). (C) Fraction of place cells (slope=-0.048, p = .011) and mean SI per animal (slope=0.054, p < .001) over 11 days (Karlsson and Frank, 2008). (D) Fraction of place cells (slope=-0.026, p < .001) and mean SI (slope=0.068, p < .001) per animal over 8 days (Sheintuch et al., 2023).
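The spatial information values reported here follow a standard Skaggs-style definition. A hedged sketch of how SI and the fraction of place cells could be computed per session is shown below; the binning, occupancy weighting, and place-cell threshold are illustrative assumptions, not the exact pipelines of the cited labs.

```python
# Sketch of spatial information (bits/spike) and a per-session summary.
import numpy as np

def spatial_information(rate_map, occupancy):
    """Skaggs-style spatial information for one cell.

    rate_map:  mean firing (or event) rate in each spatial bin
    occupancy: time spent in each bin (same shape as rate_map)
    """
    p = occupancy / occupancy.sum()                   # occupancy probability
    r_mean = np.sum(p * rate_map)
    if r_mean <= 0:
        return 0.0
    valid = rate_map > 0                              # 0 * log(0) contributes nothing
    return np.sum(p[valid] * (rate_map[valid] / r_mean)
                  * np.log2(rate_map[valid] / r_mean))

def session_summary(rate_maps, occupancy, si_threshold=0.5):
    """Fraction of 'place cells' and mean SI of active cells in one session.

    si_threshold is an arbitrary example criterion, not the labs' actual one.
    """
    si = np.array([spatial_information(rm, occupancy) for rm in rate_maps])
    active = np.array([rm.max() > 0 for rm in rate_maps])
    place = si > si_threshold
    return place.mean(), si[active].mean()
```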

Figure 4 with 1 supplement
Generality of the results.

Summary of 616 simulations with various parameters, excluding stochastic gradient descent (SGD) with label noise (see Table 2). (A) Fraction of active units normalized by the first timestep for all simulations. Red line is the mean. Note that all simulations exhibit a stochastic decrease in the fraction of active units. See Figure 4—figure supplement 1 for further breakdown. (B) Dependence of sparseness (top) and sparsification time scale (bottom) on noise amplitude. Each point is one of 178 simulations with the same parameters except noise variance. (C) Learning a similarity matching task with Hebbian and anti-Hebbian learning using published code from Qin et al., 2023. Performance of the network (blue) and fraction of active units (red) as a function of training steps. Note that the loss axis does not start at zero, and the dynamic range is small. The background colors indicate which phase is dominant throughout learning (1 - red, 2 - yellow, 3 - green).
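The two noise types compared across these simulations differ in where the noise enters the update: label noise perturbs the targets before the gradient is computed, whereas update noise perturbs the parameter update itself. A minimal sketch of the distinction (our own illustration, not the published simulation code; function names and scales are assumptions):

```python
# Sketch of one SGD step with either label noise or update noise.
import numpy as np

rng = np.random.default_rng(1)

def sgd_step(params, grad_fn, X, y, lr, noise_type, noise_std):
    """params: list of weight arrays; grad_fn returns a matching list of gradients."""
    if noise_type == "label":
        y = y + noise_std * rng.normal(size=y.shape)            # corrupt the targets
        grads = grad_fn(params, X, y)
        return [p - lr * g for p, g in zip(params, grads)]
    elif noise_type == "update":
        grads = grad_fn(params, X, y)
        return [p - lr * g + noise_std * rng.normal(size=p.shape)  # perturb the step
                for p, g in zip(params, grads)]
    raise ValueError(noise_type)
```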

Figure 4—figure supplement 1
Noisy learning leads to spontaneous sparsification.

Summary of 516 simulations with three different learning algorithms: Stochastic error descent (SED, Cauwenberghs, 1992), SGD, Adam. All values are normalized to the first time step of each simulation. The red lines indicate mean over all simulations. (A) Fraction active units – number of units with any response. (B) Active fraction – overall activity across all units (see methods).
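Stochastic error descent (SED) updates weights without explicit gradients: all weights are perturbed at once, the resulting change in loss is measured, and the weights move along the perturbation in proportion to the improvement. A minimal sketch under that description, with illustrative step sizes (not the paper's implementation):

```python
# Sketch of stochastic error descent (SED) on a flat parameter vector.
import numpy as np

rng = np.random.default_rng(2)

def sed_step(w, loss_fn, lr=0.2, perturb_std=0.1):
    """One SED update: perturb, measure the loss change, move along the perturbation."""
    pi = perturb_std * rng.choice([-1.0, 1.0], size=w.shape)   # random perturbation
    delta_e = loss_fn(w + pi) - loss_fn(w)                     # measured loss change
    return w - lr * delta_e * pi

# Example on a simple quadratic loss
loss = lambda w: np.sum((w - 1.0) ** 2)
w = rng.normal(size=10)
print("initial loss:", loss(w))
for _ in range(5000):
    w = sed_step(w, loss)
print("final loss:", loss(w))
```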

Figure 5 with 1 supplement
Noisy learning leads to a flat landscape.

(A) Gradient Descent dynamics over a two-dimensional loss function with a one-dimensional zero-loss manifold (colors from blue to yellow denote loss). Note that the loss is identically zero along the horizontal axis, but the left area is flatter. The orange trajectory begins at the red dot. Note the asymmetric extension into the left area. (B) Fraction of active units is highly correlated with the number of non-zero eigenvalues of the Hessian. (C) Update noise reduces small eigenvalues. Log of non-zero eigenvalues at two consecutive time points for learning with update noise. Note that eigenvalues do not correspond to one another when calculated at two different time points, and this plot demonstrates the change in their distribution rather than changes in eigenvalues corresponding to specific directions. The distribution of larger eigenvalues hardly changes, while the distribution of smaller eigenvalues is pushed to smaller values. (D) Label noise reduces the sum over eigenvalues. Same as (C), but for actual values instead of log.
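The eigenvalue analysis in panels (B–D) requires the Hessian of the loss at the current parameters. A hedged sketch of how the spectrum, and the fraction of non-zero eigenvalues, could be estimated numerically for a small model is given below; the finite-difference step and the zero threshold are assumptions, and for large networks one would use automatic differentiation instead.

```python
# Sketch: finite-difference Hessian of a scalar loss and its eigenvalue spectrum.
import numpy as np

def numerical_hessian(loss_fn, w, eps=1e-4):
    """Central-difference Hessian of loss_fn at parameter vector w (small models only)."""
    n = w.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            e_i, e_j = np.zeros(n), np.zeros(n)
            e_i[i], e_j[j] = eps, eps
            f_pp = loss_fn(w + e_i + e_j)
            f_pm = loss_fn(w + e_i - e_j)
            f_mp = loss_fn(w - e_i + e_j)
            f_mm = loss_fn(w - e_i - e_j)
            H[i, j] = H[j, i] = (f_pp - f_pm - f_mp + f_mm) / (4 * eps ** 2)
    return H

def fraction_nonzero_eigs(loss_fn, w, tol=1e-6):
    """Fraction of eigenvalues above a (assumed) zero threshold, plus the spectrum."""
    eigs = np.linalg.eigvalsh(numerical_hessian(loss_fn, w))
    return np.mean(np.abs(eigs) > tol), eigs
```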

Figure 5—figure supplement 1
Label and update noise impose different regularization over the Hessian with distinct signatures in activity statistics.

Summary of 362 simulations with either label or update noise added to the stochastic gradient descent (SGD) learning algorithm. All values are normalized to the first time step of each simulation. Lines indicate the mean over simulations and shaded regions indicate one standard deviation. Loss convergence varies between simulations, and is achieved within no more than 10^5 time steps. (A) Active fraction as a function of training time. Note that this metric decreases significantly for both types of noise. (B) Fraction of active units as a function of training time. For label noise, the change is much smaller. (C) Sum of the loss Hessian’s eigenvalues as a function of training time. Here the difference is apparent: label noise imposes slow implicit regularization over this metric while update noise does not. (D) Fraction of non-zero eigenvalues in the loss Hessian as a function of training time. As explained in the main text, update noise imposes implicit regularization over the sum of log-eigenvalues, which manifests as a zeroing of eigenvalues over time and thus a reduction in the fraction of active units.
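In equation form, the two regularizers described in this caption can be summarized schematically as follows (our notation, not verbatim from the paper; H(θ) is the loss Hessian near the zero-loss manifold and λ_i(θ) are its eigenvalues):

```latex
\begin{align}
  \text{label noise:}  \quad & R_{\mathrm{label}}(\theta)  \;\propto\; \operatorname{Tr} H(\theta) \;=\; \sum_i \lambda_i(\theta), \\
  \text{update noise:} \quad & R_{\mathrm{update}}(\theta) \;\propto\; \sum_{i:\,\lambda_i(\theta) > 0} \log \lambda_i(\theta).
\end{align}
```

Minimizing the first shrinks the total curvature (panel C), while minimizing the second pushes individual eigenvalues toward zero (panel D).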

Illustration of sparsity metrics.
Author response image 1

(A) PV correlation between training time points, averaged over 362 simulations. (B) Mean SI of units normalized to the first time step, averaged over 362 simulations. The red line shows the average time point of loss convergence; the shaded area represents one standard deviation.
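A hedged sketch of a population-vector (PV) correlation between two training time points is given below; preprocessing choices such as active-unit selection and normalization are assumptions.

```python
# Sketch: population-vector correlation between two sets of tuning curves.
import numpy as np

def pv_correlation(maps_t1, maps_t2):
    """maps_t*: arrays of shape (n_units, n_bins), one tuning curve per unit.

    Correlates the population vector bin by bin, then averages over bins.
    """
    corrs = []
    for b in range(maps_t1.shape[1]):
        v1, v2 = maps_t1[:, b], maps_t2[:, b]
        if v1.std() > 0 and v2.std() > 0:             # skip constant population vectors
            corrs.append(np.corrcoef(v1, v2)[0, 1])
    return np.mean(corrs)
```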

Tables

Table 1
The three phases of noisy learning.
Phase            | Duration | Performance | Activity statistics | Representations
learning of task | short    | changing    | changing            | changing
directed drift   | long     | stationary  | changing            | changing
null drift       | endless  | stationary  | stationary          | changing
Table 2
Parameter ranges for random simulations.
Parameter                     | Possible values
learning algorithm            | {SGD, Adam, SED}
noise type                    | {update, label}
number of samples             | O’keefe and Nadel, 1979; Susman et al., 2019
initialization regime         | {lazy, rich}
task                          | {abstract predictive, random, random smoothed}
input dimension               | O’keefe and Nadel, 1979; Susman et al., 2019
output dimension              | O’keefe and Nadel, 1979; Susman et al., 2019
noise variance (label/update) | [0.1, 1] / [0.01, 0.1]
hidden layer size             | 100
Table 3
Description of experimental datasets.
                | Khatib et al., 2023 | Jercog et al., 2019b | Karlsson et al., 2015 | Sheintuch et al., 2023 | Geva et al., 2023
Familiarity     | 3–5 days | novel | novel | novel | 6–9 days
Species         | mice | mice | rats | mice | mice
# Animals       | 8 | 12 | 9 | 8 | 8
Recording days  | 1 day | 10 days | max. 11 days | 10 days | 10 days
Session length  | 200 min | 40 min | 15–30 min | 20 min | 20 min
Recording type  | calcium imaging | electrophysiology | electrophysiology | calcium imaging | calcium imaging
Arena           | linear track | square or circle | W-shaped | linear or L-shaped track | linear track
Activity metric | fraction of place cells decrease | number of active cells decrease | fraction of place cells decrease | fraction of place cells decrease | fraction of place cells stationary
Mean SI change  | increase | increase | increase | increase | stationary

Additional files

Cite this article

Ratzon A, Derdikman D, Barak O (2024) Representational drift as a result of implicit regularization. eLife 12:RP90069. https://doi.org/10.7554/eLife.90069.3