Conceptual overview of this work. (A) Animals have goals, which they must learn to achieve; as an example, consider revising a paper. (B) Some feedback signals (e.g., from reviewers or co-authors) will conflict with one another. An appropriate way to deal with these conflicts is to integrate over learning signals, averaging out noise and clarifying signal (upper arrow). A poor way to deal with conflict is to adhere completely to each piece of feedback, even when it conflicts with other feedback (lower arrow), i.e., to perform sequential gradient descent, which can undo previous learning rather than reconciling new learning with old. (C) A related situation arises when contextual information suggests generalizing learning. For example, recognizing that two types of feedback reflect the same principle can support learning based on the principle (upper arrow) rather than solely the particulars of the feedback (lower arrow), generalizing learning. (D) Within a network, these goals can be accomplished by regularizing plasticity towards particular activity subspaces (which can be shared across contexts) and minimizing the overlap of these subspaces when they interfere. (E) Examining gradient descent, we observe that “input” and “output” (or “receptive field” and “population response”) factors in a network layer’s plasticity partition this plasticity into subspaces. (F) We explore the idea that independently controlling these two biologically meaningful factors (using functions u and v in the panel) would be useful for avoiding interference and promoting generalization. (G) Example of two tasks that overlap in either RFs (bottom) or PRs (top). (H) We investigate four different scenarios, showing that managing population-response and receptive-field plasticity can avoid interference and promote generalization. (I) These properties result from the fact that coordinating plasticity factors allows updates to take arbitrary paths through weight space, whereas gradients always move directly towards individual tasks’ solutions.
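To make the factorization in (E) and the gating idea in (F) concrete, here is a minimal numpy sketch (not the authors' code; the functions u and v are placeholders): for a linear layer under squared-error loss, the weight gradient is the outer product of a population-response (output error) factor and a receptive-field (input) factor, and these two factors can in principle be gated independently.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(20, 100))
x = rng.normal(size=100)             # RF factor: presynaptic activity
target = rng.normal(size=20)
err = W @ x - target                 # PR factor: postsynaptic error

grad = np.outer(err, x)              # dL/dW for L = 0.5 * ||W x - target||**2

# Coordinated-eligibility idea (our reading): gate the two factors
# independently with functions u and v before forming the update.
u = lambda e: e                      # e.g., orthogonalize PR components across tasks
v = lambda a: a                      # e.g., orthogonalize RF components across tasks
delta_W = -0.1 * np.outer(u(err), v(x))
```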

Illustrations of interference from PR and RF plasticity. (A) A simple linear network, with one input neuron (x), one response neuron (y), and a loss function (L). If an input x has a target output y(1) for one task and y(2) for a second, then alternating training will pull the weight connecting x and y in opposite directions. Notice that if the output activity started on one side of both y(1) and y(2), there would first be a period of generalization, during which performance improved on both tasks while training either one. In the lower panel, showing weight/activity dynamics over time, these regions of the weight/activity space are denoted with + and − symbols indicating generalization and interference, and the task solutions are denoted with dashed lines. This basic illustration is also representative of higher-dimensional cases; in more complex networks the main question is how these phenomena are distributed over groups of neurons and weights rather than individual ones, and over multiple tasks rather than task pairs. (B) A network with two inputs, illustrating RF-change-induced interference. Here, training changes weights in two dimensions, generating a pattern of regions of interference and generalization based on the current weights and the angles between task solutions. Lower left panel: gradient descent is subject to the same oscillatory behavior in the interference-producing region between task solutions. Lower right panel: by restricting weight-update dimensions, one can avoid interference. (C) A network with two outputs, illustrating PR-change-induced interference, analogous to B.
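A toy numerical version of panel (A), with parameters of our choosing: one weight w, input x = 1, and two tasks with conflicting targets y1 and y2. Starting with w on the same side of both targets, the first updates help both tasks (generalization); once w lies between y1 and y2, alternating training pulls it back and forth, with each task's update partly undoing the other's (interference).

```python
import numpy as np

x, y1, y2, lr = 1.0, 1.0, -1.0, 0.5
w = 3.0                                        # starts on the same side of both targets
trajectory = []
for step in range(20):
    target = y1 if step % 2 == 0 else y2       # alternate tasks
    w -= lr * (w * x - target) * x             # gradient of 0.5 * (w * x - target)**2
    trajectory.append(w)
```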

Receptive-field eligibility separation. (A) The network we used was a single linear perceptron layer with a single readout. (B) Tasks in this simulation were each defined by random, unit-normal, 100-dimensional input vectors and similarly distributed 20-dimensional target vectors. (C) Training was performed sequentially over 80 such tasks. When outputs were within 0.05 units of Euclidean distance of targets, training proceeded to the next task. (D) Networks computed surprise over inputs, which was used to determine task change-points and orthogonalize new input plasticity vectors against previous plasticity. (E) The surprise function used was a logistic curve over input cosine angle. (F) The optimal set of weights for the curriculum was computed, for comparison with network outputs, using the pseudo-inverse of the inputs. Intuitively, the solution is the intersection of individual task solutions, which themselves are rank-1 outer products between the (unit-normal) inputs and targets, shown here as lines intersecting a unit sphere. (G) Error on each task, computed after training that task. (H) Backward transfer on each task, i.e., task errors at the end of curriculum training. (I) Initial task error on new tasks at each point in curriculum learning (forward transfer). The CEM shows negative transfer because it remembers previous inputs; this effect is reduced, but still present, for GD. (J) Layer weight norms in both models over the course of learning. CEM weight norms grew over the course of learning to match the optimal network weight norm, given by the dashed red line, whereas GD's did not. GD struggles to leave a region of weight space proximal to each individual task solution but not to their intersection. (K) Distance from the optimal set of weights, indicating that not only is the weight norm of the CEM solution growing properly, the network is also converging to the optimum rather than diverging in an inappropriate direction. By contrast, GD gets further from the curriculum solution over time.
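A sketch of how panels (D) and (E) could be implemented (our reading, not the published code; parameters k and theta are illustrative): keep an orthonormal basis of previously used input (RF) plasticity directions, flag task change-points with a logistic surprise over the cosine between a new input and that basis, and orthogonalize new RF plasticity directions against the basis before using and storing them.

```python
import numpy as np

def rf_surprise(x, basis, k=20.0, theta=0.5):
    """Logistic surprise over the cosine angle between x and its projection
    onto the stored RF basis."""
    if not basis:
        return 1.0
    B = np.stack(basis, axis=1)
    proj = B @ (B.T @ x)                       # basis assumed orthonormal
    cos = proj @ x / (np.linalg.norm(proj) * np.linalg.norm(x) + 1e-12)
    return 1.0 / (1.0 + np.exp(-k * ((1.0 - cos) - theta)))

def new_rf_direction(x, basis):
    """Orthogonalize a new RF plasticity direction against previous plasticity."""
    r = x.copy()
    for b in basis:
        r -= (r @ b) * b
    n = np.linalg.norm(r)
    return None if n < 1e-8 else r / n
```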

Population-response eligibility separation. (A) These simulations used linear network layers subject to linear readouts, one for each task. (B) Each task is defined by a new random 100-dimensional readout. (C) Tasks are solved sequentially, to within a small error bound, as in the previous simulation, before a new random readout is drawn and applied to initiate learning a new task. (D) Updates to firing-rate response subspaces are orthogonalized based on surprise computed over feedback (gradient) components. (E) Surprise is computed as previously, using a logistic function over relative angles. (F) Optimal weights are analogous to those of the previous simulation. (G) Training error: both networks completely solve each task in the curriculum. (H) Gradient descent shows significant forgetting, whereas the CEM does not. (I) Remembering earlier learning produces negative forward transfer, as previously. (J) GD fails to push weights outside the region around the origin, causing them to (K) become increasingly far from optimal as new tasks are seen.
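The PR-side analogue of the previous sketch (again an assumption about the mechanism, with hypothetical names): surprise is computed over the feedback (gradient) vector arriving at the layer's outputs, and the output-side factor of each update is orthogonalized against the PR directions used for earlier tasks.

```python
import numpy as np

def pr_gated_update(W, x, err, pr_basis, lr=0.1, store=False):
    """Orthogonalize the PR (error) factor against stored PR plasticity
    directions; store the new direction only at a detected change-point."""
    e = err.copy()
    for b in pr_basis:
        e -= (e @ b) * b                       # remove previously used PR components
    n = np.linalg.norm(e)
    if store and n > 1e-8:
        pr_basis.append(e / n)                 # remember this task's PR direction
    return W - lr * np.outer(e, x)             # update = gated PR factor x RF factor
```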

Receptive-field pattern completion. (A) The network we used was an MLP with one readout (head) for the association curriculum and one for the generalization curriculum. (B) Inputs were 100-dimensional unit random vectors. Targets were 20-dimensional unit random vectors. (C) During the association phase, the network learned to map pairs of inputs, presented sequentially, to targets (of which there was one for each pair). (D) Surprise was computed over target prediction errors and used to chunk learning temporally, producing a coarse code for gradient components between elevated-surprise events. (E) A generalization curriculum re-used inputs from the association curriculum, but paired them with new targets. (F) Training was performed on only one item out of each pair; testing was performed on the held-out item, subject to coarse-coded plasticity. (G) Initial error, error at the end of the association-learning phase, error at the end of training during the generalization phase (for trained items), and generalization (test) error at the end of the generalization phase. Coarse coding generalized learning.
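A rough sketch of the temporal chunking in (D) and (F) (our paraphrase, with parameters of our choosing): between elevated-surprise events, the RF (input) factors of successive updates are pooled into a coarse code, so plasticity driven by one item of a pair also covers its held-out partner.

```python
import numpy as np

def coarse_coded_rf_updates(inputs, errors, surprises, threshold=0.5, lr=0.1):
    """inputs/errors: per-trial RF and PR factors; surprises: per-trial surprise.
    Each update uses the mean RF factor of the current surprise-delimited chunk."""
    updates, chunk = [], []
    for x, e, s in zip(inputs, errors, surprises):
        if s > threshold and chunk:            # elevated surprise starts a new chunk
            chunk = []
        chunk.append(x)
        rf = np.mean(chunk, axis=0)            # coarse-coded RF factor
        updates.append(-lr * np.outer(e, rf))
    return updates
```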

Population-response pattern completion. (A) The network we used was an MLP with one readout (head) per target in the association curriculum, and these were re-used for the generalization curriculum. (B) Inputs were 20-dimensional unit random vectors. Targets were 100-dimensional unit random vectors. (C) During the association phase, the network learned to map inputs, presented sequentially, to pairs of targets (one pair per input). (D) Surprise was computed over inputs and used to chunk learning temporally and coarse-code gradient components between elevated-surprise events. (E) A generalization curriculum re-used readouts from the association curriculum, but paired them with new inputs. (F) Training was performed on only one target out of each pair associated with a given input, while testing was performed on the held-out readout, subject to coarse-coded plasticity. (G) Initial error, error at the end of the association-learning phase, error at the end of training during the generalization phase (for trained targets), and generalization (test) error at the end of the generalization phase. Coarse coding generalized learning.
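This mirrors the receptive-field sketch above with the roles of the factors swapped (still our assumption about the mechanism): here the PR (error/target) factors are pooled within a surprise-delimited chunk, so training on one target of a pair also drives plasticity toward its held-out partner readout.

```python
import numpy as np

def coarse_coded_pr_factors(errors, surprises, threshold=0.5):
    """Return one coarse-coded PR factor per trial: the running mean of error
    vectors since the last elevated-surprise event."""
    coarse, chunk = [], []
    for e, s in zip(errors, surprises):
        if s > threshold and chunk:
            chunk = []
        chunk.append(e)
        coarse.append(np.mean(chunk, axis=0))
    return coarse
```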

Compositional plasticity. (A) Networks were MLPs with single readouts. (B) Tasks for a curriculum were produced by taking 10 feature loadings, converting them via scaling to 10 task demands (latent outputs), and then generating observed inputs and targets as sums of features and demands, respectively. (C) Once input and target vectors were computed, we produced a curriculum of overlapping (interfering) tasks by circularly shifting them. (D) During training, networks learned to produce each target given each input. (E) Plasticity in the network was restricted to a sum of de-mixed subspaces (the representation in the network of each latent task demand could only learn about the representation of its corresponding feature(s)). Unlike in our other simulations, we did not perform unsupervised plasticity on gradient elements to first learn this de-mixing, as this learning problem is itself complex, and our main concern is the use of the PR-vs-RF eligibility decomposition itself. (F) Training error over 10 passes through the data, averaged over all tasks and over 50 simulation repetitions. The CEM converges to sub-criterion error with far fewer passes through the data than the network learning via GD. (G) Average weights between feature representations and their demand representations, with 1 being optimal. (H) Average weights between demands and the features that are irrelevant to them, with 0 being optimal. Note that there are many more such spurious relationships than true ones. (I) Impact of linking number l (and hence dimensionality reduction) on cumulative training error over the 10-repetition window in F. Linking number 1 (corresponding to plasticity width 1 in the figure) represents completely accurate prior associations, whereas a linking number greater than or equal to 20 (plasticity width 41) represents all-to-all plasticity between representations (gradient descent). The y-axis is cumulative error of the CEM as a fraction of GD.
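An illustrative sketch of the restriction in (E) and the linking number in (I); the mask construction and parameter names are ours, and the sketch operates directly on the latent feature and demand representations for clarity. A binary mask confines plasticity so each demand representation learns only about features within a given linking distance of its own feature; a small linking number gives near one-to-one plasticity, while a sufficiently large one recovers all-to-all plasticity, i.e., ordinary gradient descent.

```python
import numpy as np

def plasticity_mask(n=10, link=1):
    """Mask over (demand, feature) pairs allowed to change."""
    idx = np.arange(n)
    dist = np.abs(idx[:, None] - idx[None, :])
    dist = np.minimum(dist, n - dist)          # circular distance (tasks are circular shifts)
    return (dist < link).astype(float)

def masked_update(W, x, err, mask, lr=0.1):
    """Confine the outer-product update to the de-mixed subspaces allowed by mask."""
    return W - lr * mask * np.outer(err, x)
```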

RF splitting in nonlinear neural networks that learn according to policy-gradient-like updates. (A) Training error for a simulation mirroring the RF splitting simulation (figure 3 from the main text). (B) The policy-gradient learner shows significant interference, whereas the coordinated eligibility model avoids this. (C) Sampled population-response changes mirror those derived from analytic gradients in both models.

PR splitting in nonlinear neural networks that learn according to policy-gradient-like updates. (A) Training error for a simulation mirroring the PR splitting simulation (figure 4 from the main text). (B) The gradient-based method shows significant interference, whereas the coordinated eligibility model reduces this interference. Here, the CEM does not totally abolish interference, because updates are only locally optimal, rather than globally optimal, owing to the curvature of the loss landscape induced by network nonlinearities. (C) Angles between CEM PR changes and gradient-derived ones. CEM PR changes are consistently at a fairly high angle to analytic gradients, indicating that the locally optimal updates are nearly orthogonal to the gradient updates.