Beyond gradients: Factorized, geometric control of interference and generalization

  1. Department of Neuroscience, Brown University, Providence, United States
  2. Carney Institute for Brain Science, Brown University, Providence, United States
  3. Department of Cognitive and Psychological Sciences, Brown University, Providence, United States

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.


Editors

  • Reviewing Editor
    Friedemann Zenke
    Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland
  • Senior Editor
    Timothy Behrens
    University of Oxford, Oxford, United Kingdom

Reviewer #1 (Public review):

Summary:

This paper advances a new understanding of plasticity in artificial neural networks. It shows that weight changes can be decomposed into two components: the first governs the magnitude (or gain) of responses in a particular layer; the second governs the relationship of those responses to the input to that layer. Then, it shows that separate control of these two factors via a surprise-based metaplasticity can avoid catastrophic forgetting as well as induce successful generalization in different conditions, through a series of simulation experiments in linear networks. The authors argue that separate control of the two factors may be at work in the brain and may underlie the ability of humans and other animals to perform successful sequential learning. The paper is hampered by confusing terminology and the precise setup of some of the simulations is unclear. The paper also focuses exclusively on the linear case, which limits confidence in the generality of the results. The paper would also benefit from the inclusion of specific predictions for neural data that would confirm the idea that the separate control of these two factors underlies successful continual learning in the brain.

Strengths:

(1) The theoretical framework developed by the paper is interesting, and could have wide applicability for both training networks and for understanding plasticity.

(2) The simulations convincingly show benefits to the coordinated eligibility model of plasticity advanced by the authors.

Weaknesses:

(1) The simulation results are limited to simple tasks in linear networks. It would be interesting to see how the intuitions developed in the linear case extend to nonlinear networks.

(2) The terminology is somewhat confusing and this can make the paper difficult to follow in some places.

(3) The details of some of the simulations are lacking.

Reviewer #2 (Public review):

Summary:

Scott and Frank propose a new method for controlling generalization and interference in neural networks that undergo continual learning. Their method, called coordinated eligibility models (CEM), relies on the factorization of synaptic updates into input-driven and output-driving factors. They then use the fact that orthogonalizing either one of these two factors across different data points is sufficient to nullify interference during learning. They illustrate this on a number of toy tasks, comparing their results to vanilla gradient descent.
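The sufficiency claim can be made concrete with a short worked equation. This is a sketch under stated assumptions, not the authors' derivation: it assumes a single linear layer y = w x, and the symbols η (learning rate), g₁, g₂ (output-side factors for data points 1 and 2) and x₁, x₂ (their inputs) are introduced here for illustration only.

```latex
\Delta w = \eta\, g_1 x_1^{\top}
\quad\Longrightarrow\quad
\Delta w\, x_2 = \eta\, g_1 \,\bigl(x_1^{\top} x_2\bigr),
\qquad
g_2^{\top}\, \Delta w\, x_2 = \eta\, \bigl(g_2^{\top} g_1\bigr)\bigl(x_1^{\top} x_2\bigr).
```

The last expression is the effect of the update for point 1 on the output for point 2, measured along the output direction g₂. It is a product of two inner products, so it vanishes when either the input-side factors or the output-side factors are orthogonal.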

Strengths:

The specific mechanism proposed here is novel (although, as the authors acknowledge, there are many other mechanisms for selectively recruiting synapses to prevent catastrophic forgetting). Furthermore, it is simple, elegant, and to a large extent biologically plausible, potentially pointing to specific and testable aspects of learning dynamics.

Weaknesses:

(1) Scope and toy nature of experiments: the model was only applied to very simple problems tailored specifically to demonstrate the strengths of the CEM method. Furthermore, a single hyperparameter setting is presented for every scenario, which leaves it questionable how general the numerical results are. The selection of input and output dimensionality and data set size also seems to be underexplored. Will a larger curriculum, or a smaller or larger dimension, compromise any of the CEM ingredients? The restriction to linear models seems arbitrary (adding a non-linearity should take almost no time within the PyTorch framework that the authors used), and applicability to any non-synthetic problem is not obvious.

It is also unclear how much domain knowledge is needed for surprise signals to be successfully generated. Can the authors make a stronger case that novel curriculum entries are easily recognizable by cosine distance, either in the brain or in machine learning? Alternatively, can they demonstrate their method on a less toy benchmark (e.g. permuted MNIST from Kirkpatrick et al. 2017, which they cite)?

Another limitation is that, unlike smoother models of plasticity budgets (e.g. Kirkpatrick et al. 2017, Zenke et al. 2017), here eligibility seems to be lost forever once surprise is applied. What happens to the model if more data from a previously visited task becomes available? Will the system be able to continue learning within the right context, and how does CEM perform compared to other strategies for preventing catastrophic forgetting?

(2) The clarity and organization must be improved. Specifically, the balance between verbal descriptions, equations, figures, and their captions needs to be improved. For example, two full-size equations are dedicated to the application of linear regression (around lines 183 and 236), while far less obvious math, such as the settings for Figure 7, including 'feature loadings', 'demands', etc., is presented in a hard-to-read mixture of figure and main text. Similarly, the surprise mechanism, which is a key ingredient of the model, is presented in a very non-straightforward fashion, scattered between the main text, figure, and methods. The figure legends are also poorly informative in many cases (see minor comments for examples).

Reviewer #3 (Public review):

Summary:

This paper describes a modification of gradient descent learning, and shows in several simulations that this modification allows online learning of linear regression problems where naive gradient descent fails. The modification starts from the observation that the rank-1 weight update of online gradient learning can be written as the outer product Δw ∝ g xᵀ of a vector g and the input x. Modifying this update rule, by projecting g or x to some subspaces, i.e. Δw ∝ Pg (Qx)ᵀ, allows for preventing the typical catastrophic forgetting behavior of online gradient descent, as confirmed in the simulations. The projection matrices P and Q are updated with a "surprise"-modulation rule.
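The projected update Δw ∝ Pg (Qx)ᵀ described above can be sketched in a few lines. This is a minimal illustration under assumptions, not the authors' implementation: P is taken to be the identity, Q is a fixed projector that removes the component along one previously seen input (the paper instead updates P and Q with a surprise-modulated rule), and the names `project_out`, `update`, `x_old`, `x_new`, and the values of `g` and `lr` are hypothetical.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def matvec(m, v):
    return [dot(row, v) for row in m]

def outer(a, b):
    return [[ai * bj for bj in b] for ai in a]

def project_out(v, u):
    """Apply Q = I - u u^T / |u|^2 to v, i.e. remove v's component along u."""
    scale = dot(v, u) / dot(u, u)
    return [vi - scale * ui for vi, ui in zip(v, u)]

def update(w, g, x, lr=0.1, protect=None):
    """Return w + lr * g (Q x)^T. Q projects out `protect` if given; P is the identity (assumption)."""
    qx = project_out(x, protect) if protect is not None else x
    dw = outer(g, qx)
    return [[wij + lr * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, dw)]

# Hypothetical toy setup: 2 outputs, 3 inputs.
w = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
x_old = [1.0, 0.0, 0.0]   # input from an earlier task
x_new = [1.0, 1.0, 0.0]   # new input that overlaps with x_old
g = [0.5, -1.0]           # stand-in for the error-driven output factor

y_old_before = matvec(w, x_old)
w_naive = update(w, g, x_new)                 # plain rank-1 update g x^T
w_proj = update(w, g, x_new, protect=x_old)   # projected update g (Q x)^T

y_old_naive = matvec(w_naive, x_old)   # response to the old input has drifted
y_old_proj = matvec(w_proj, x_old)     # response to the old input is preserved
y_new_proj = matvec(w_proj, x_new)     # response to the new input still changes

print(y_old_before, y_old_naive, y_old_proj, y_new_proj)
```

In this toy case the naive update changes the network's response to the old input, whereas the projected update leaves it untouched while still changing the response to the new input.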

Strengths:

I find it interesting to explore the benefits of alternatives to naive online gradient learning for continual learning.

Weaknesses:

The novelty and advancement in our theoretical understanding of plasticity in neural systems are unclear. I appreciate gaining insights from simple mathematical arguments and simulations with toy models, but for this paper, I do not yet clearly see what I learned. On the mathematical/ML/simulation side, it is unclear how it relates to the continual learning literature; on the neuroscience/surprise side, I see only a number of papers cited but no clear connection to data or novel insights.

More specifically:

(1) It is unclear what exactly the "coordinated eligibility theory" is. Is any update rule that satisfies Equation 4 included in the coordinated eligibility theory? If yes, what is the point: any update rule can be written in this way, including standard online gradient descent. If no, what is it? It is not Equation 5 it seems, because this is called "one of the simplest coordinated eligibility models".

(2) There is a lot of work on continual learning which is not discussed, e.g. "Orthogonal Gradient Descent for Continual Learning" (Farajtabar et al. 2019), "Continual learning in low-rank orthogonal subspaces" (Chaudhry et al. 2020), or "Keep Moving: identifying task-relevant subspaces to maximise plasticity for newly learned tasks" (Anthes et al. 2024), to name just a few. What is the novelty of this work relative to these existing works? Is the novelty in the specific projection operator? If yes, what are the benefits of this projection operator in theory and simulations? How would, for example, the approach of Farajtabar et al. 2019 perform on the tasks in Figures 3-7?

(3) There is also work on using surprise signals for multitask learning in models of biological neural networks, e.g. "Fast adaptation to rule switching using neuronal surprise" (Barry et al. 2023).

(4) What is the motivation for the projection to the unit sphere in Equation 5?

(5) What is the motivation for the surprise definition? For example, why cos(x⋅μ) = cos(|x||μ|cos(θ)) = cos(cos(θ))? (Assuming x and μ have unit length and θ is the angle between x and μ).
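To see why the expression in point (5) raises a question, it can help to evaluate it numerically. The sketch below evaluates only the expression cos(cos(θ)) as written in the review, assuming unit-length x and μ; it is not taken from the paper's code, and the function name `reviewer_expression` is hypothetical.

```python
import math

def reviewer_expression(theta: float) -> float:
    """cos(cos(theta)): the quantity in point (5) for unit-length x and mu at angle theta."""
    return math.cos(math.cos(theta))

cases = [("aligned", 0.0), ("orthogonal", math.pi / 2), ("opposite", math.pi)]
for name, theta in cases:
    print(f"{name:>10}: {reviewer_expression(theta):.4f}")
```

Because cos(θ) lies in [-1, 1] and cosine is an even function, the expression takes the same value, cos(1) ≈ 0.54, for aligned and opposite vectors, and reaches its maximum of 1 for orthogonal vectors.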
