Neuroscience

Aligned and oblique dynamics in recurrent neural networks

Friedrich Schuessler
Francesca Mastrogiuseppe
Srdjan Ostojic
Omri Barak

Faculty of Electrical Engineering and Computer Science, Technical University Berlin, Germany
Science of Intelligence, Research Cluster of Excellence, Berlin, Germany
Champalimaud Research, Lisbon, Portugal
Laboratoire de Neurosciences Cognitives et Computationnelles, INSERM U960, Ecole Normale Superieure - PSL Research University, Paris, France
Rappaport Faculty of Medicine and Network Biology Research laboratories, Technion - Israeli Institute of Technology, Haifa, Israel

https://doi.org/10.7554/eLife.93060.1

Open access

Figures and data

Schematic of aligned and oblique dynamics in recurrent neural networks.
A Output generated by both networks. B Neural activity of aligned (top) and oblique (bottom) dynamics, visualized in the space spanned by three neurons. Here, the activity (green) is three-dimensional, but most of the variance is concentrated along the two largest PCs (blue). For aligned dynamics, the output weights (red) are small and lie in the subspace spanned by the largest PCs; they are hence correlated to the activity. For oblique dynamics, the output weights are large and lie outside of the subspace spanned by the largest PCs; they are hence poorly correlated to the activity. C Projection of activity onto the two largest PCs. For oblique dynamics, the output weights are orthogonal to the leading PCs. D Evolution of PC projections over time. For aligned dynamics, the projection on the PCs resembles the output z(t), and reconstructing the output from the largest two components is possible. For the oblique dynamics, such reconstruction is not possible, because the projections oscillate much more slowly than the output.

Aligned and oblique dynamics for a cycling task [54].
A A network with two outputs needed to generate either clockwise or anticlockwise rotations, depending on the context (top). Our model RNN (bottom) received a context input pulse, generated dynamics x(t) via recurrent weights W, and yielded output as linear projections of the states. We trained the recurrent weights W with gradient descent. B-C Resulting internal dynamics for two networks with small (top) and large (bottom) output weights, corresponding to aligned and oblique dynamics, respectively. B Dynamics projected on the first 2 PCs and the remaining direction w_{out,⊥} of the first output vector (for z₁). The output weights are amplified to be visible. Arrowheads indicate the direction of the dynamics. Note that for the large output weights, the dynamics in the first two PCs co-rotated, despite the counter-rotating output. C Output reconstructed from the largest PCs, with dimension D = 2 (full lines) or 8 (dotted). Two dimensions already yield a fit with R² = 0.99 for aligned dynamics (top), but almost no output for oblique (bottom, R² = 0.005, no arrows shown). For the latter, a good fit with R² > 90% is only reached with D = 8.

Testing how the magnitude of output weights determines regimes across multiple neuroscience tasks [37, 52, 54, 69].
A Correlation and norms of output weights and neural activity. For each task, we initialized networks with small or large output weights (dark vs light orange). The initial norms ||w_out|| are indicated by the dashed lines. Learning does not change the norm dramatically. Note that all y-axes are logarithmically scaled. B Variance of x explained and R² of reconstructed output for projections of x on increasing number of PCs. Results from one example network trained on the cycling task for each condition are shown. C Number of PCs necessary to reach 90% of the variance of x(t) or of the R² of the output reconstruction (top/bottom; dotted lines in B). In A, C, violin plots show the distribution over 5 sample networks, with vertical bars indicating the mean and the extreme values (where visible).
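The two dimension counts in panel C can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names, the array layout (time points by neurons), and the use of an absolute R² threshold of 0.9 are assumptions made here.

```python
import numpy as np

def pcs_for_variance(X, frac=0.9):
    """Smallest number of PCs whose cumulative variance reaches `frac`.

    X has shape (time points, neurons). Hypothetical helper, not from the paper.
    """
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    return int(np.argmax(cum >= frac - 1e-12) + 1)

def pcs_for_output_fit(X, z, threshold=0.9):
    """Smallest D such that a linear fit of z on the top-D PC projections
    of X reaches R^2 >= threshold. Returns None if no D suffices."""
    Xc = X - X.mean(axis=0)
    zc = z - z.mean()
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    U = U[:, s > 1e-10 * s[0]]  # drop numerically empty components
    # The columns of U are orthonormal, so the R^2 of the least-squares fit
    # on the first D PCs is a cumulative sum of squared overlaps with zc.
    r2 = np.cumsum((U.T @ zc) ** 2) / np.sum(zc**2)
    hits = np.nonzero(r2 >= threshold)[0]
    return int(hits[0] + 1) if hits.size else None
```

If the output is read out along a high-variance PC, the second count is small; if it depends on a low-variance direction, the second count exceeds the first.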

Variability between learners for the two regimes.
A-B Examples of networks trained on the cycling task with small (aligned) or large (oblique) output weights. The top left and central networks, respectively, are the same as those plotted in Fig. 2. C Dissimilarity between solutions across different tasks. Aligned solutions (dark) were less dissimilar to each other than oblique ones (light). The violin plots show the distribution over all possible different pairs for five samples (mean and extrema as bars).

Perturbations differentially affect dynamics in the aligned and oblique regimes.
A Cartoon illustrating the relation between perturbations along output weights or PCs and the feedback loops driving autonomous dynamics. B Output after perturbation for aligned (top) and oblique (bottom) networks trained on the cycling task. The unperturbed network (light red line) yields a sine wave along the first output direction z₁. At t_p = 9, a perturbation with amplitude ∥Δx∥ = 34 is applied along the output weights (dashed red) or the first PC (dashed-dotted blue). The perturbations only differ in the directions applied. While the immediate response for the oblique network to a perturbation along the output weights is much larger, z₁(t_p) ≈ 80, the long-term dynamics yield the same output as the unperturbed network. See also Fig. 19 for more details. C Loss for perturbations of different amplitudes for the two networks in B. Lines and shades are means and std devs over different perturbation times t_p ∈ [5, 15] and random directions spanned by the output weights (red) or the two largest PCs (blue). The loss is the mean squared error between output and target for t > 20. Gray dot indicates example in B. D Relative susceptibility of networks to perturbation directions for different tasks and dynamical regimes. We measured the area under the curve (AUC) of loss over perturbation amplitude for perturbations along the output weights or two largest PCs. The relative susceptibility is the ratio between the two AUCs. The example in C is indicated by gray triangles.
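The relative susceptibility in panel D can be sketched as a ratio of two areas under loss-versus-amplitude curves. The caption does not state which AUC is the numerator, so the choice below (output-weight direction over PC direction) is an assumption, as are the function names and the trapezoidal integration rule.

```python
import numpy as np

def auc(amplitudes, losses):
    """Area under the loss curve over perturbation amplitude (trapezoidal rule)."""
    a = np.asarray(amplitudes, dtype=float)
    l = np.asarray(losses, dtype=float)
    return float(np.sum(0.5 * (l[1:] + l[:-1]) * np.diff(a)))

def relative_susceptibility(amplitudes, loss_output_dir, loss_pc_dir):
    """Ratio of AUCs; assumed here to be output direction over PC direction."""
    return auc(amplitudes, loss_output_dir) / auc(amplitudes, loss_pc_dir)
```

Under this convention, a value above 1 means the network is more sensitive to perturbations along the output weights than along the leading PCs.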

Noise suppression along the output direction in the oblique regime.
A A cartoon of the feedback loop structure for aligned (top) and oblique (bottom) dynamics. The latter develops a negative feedback loop which suppresses fluctuations along the output direction. B Comparing the distribution of variance of mean-subtracted activity along different directions for a network trained on the cycling task (see Fig. 20): PCs of trial-averaged activity (blue), readout (red), and random (grey) directions. For the PCs and output weights, we sampled 100 normalized combinations of either the first two PCs or the two output vectors. For the random directions, we drew 1000 random vectors in the full, N-dimensional space. C Noise compression across tasks as measured by the ratio between variance along output and random directions. The dashed line indicates neither compression nor expansion. Black markers indicate the values for the two examples in B-C. Note the log-scales in B-C.
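The noise-compression measure in panel C can be sketched as below. This is a simplified illustration under stated assumptions: fluctuations are given as a mean-subtracted array (samples by neurons), variances are averaged over the supplied output directions and over Gaussian random directions, and all names are hypothetical.

```python
import numpy as np

def variance_along(fluct, directions):
    """Variance of mean-subtracted activity along each normalized direction.

    fluct: (samples, N); directions: (k, N). Returns an array of shape (k,).
    """
    d = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    return np.var(fluct @ d.T, axis=0)

def noise_compression(fluct, w_out, n_random=1000, seed=0):
    """Ratio of variance along output directions to variance along random ones.

    Values below 1 indicate compression along the output direction.
    """
    rng = np.random.default_rng(seed)
    random_dirs = rng.standard_normal((n_random, fluct.shape[1]))
    var_out = variance_along(fluct, np.atleast_2d(w_out)).mean()
    var_rand = variance_along(fluct, random_dirs).mean()
    return float(var_out / var_rand)
```

A ratio of 1 corresponds to the dashed line in the panel: neither compression nor expansion.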

Quantifying aligned and oblique dynamics in experimental data [11, 20, 23, 47, 54].
A Cartoon of the two types of experimental data considered. In motor control experiments (top), we first needed to obtain the output weights w_out via linear regression. We then computed the correlation ρ and the reconstruction dimension D_fit,90, i.e. the number of PCs of x necessary to obtain a coefficient of determination R² > 90%. In BCI experiments (bottom), the output (cursor velocity) is generated from neural activity x(t) via output weights w_out defined by the experimenter. This allowed us to directly compute correlation and fitting dimension. B Correlation ρ (top) and relative fitting dimension D_fit,90/D_x,90 (bottom) for a number of publicly available data sets. The cycling task data (purple) were trial-conditioned averages, the BCI experiments (red) and NLB tasks (yellow) single-trial data. Results for the full data sets are shown as dots; violin plots indicate results for 20 random subsets of 25% of the data points in each data set (see small text below x-axis).

Task, simulation, and network parameters for Figs. 3 to 6

Scaling of the correlation ρ with the number of neurons N in experimental data.
We fitted the output weights to subsets of N neurons and computed the quality of fit (top) and the correlation between the resulting output weight and firing rates (bottom). To compare with random vectors, the correlation is scaled by √N. Dashed lines are N^p/2, for p ∈ {1/2, 1/4, 0}, for comparison. The aligned regime corresponds to p = 1/2, and the oblique one to p = 0.
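The comparison with random vectors rests on a standard fact: the correlation between two independent random vectors in N dimensions is of order 1/√N, so multiplying by √N gives a baseline that does not depend on N. The sketch below checks this numerically. It assumes that "correlation" means cosine similarity and uses Gaussian vectors; both choices, and the function name, are assumptions made for illustration.

```python
import numpy as np

def scaled_random_correlation(n_neurons, n_pairs=500, seed=0):
    """Mean absolute cosine similarity between random vector pairs, times sqrt(N)."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((n_pairs, n_neurons))
    v = rng.standard_normal((n_pairs, n_neurons))
    cos = np.sum(u * v, axis=1) / (
        np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1)
    )
    return float(np.sqrt(n_neurons) * np.mean(np.abs(cos)))

# The scaled value should stay roughly constant as N grows.
for n in (64, 256, 1024):
    print(n, scaled_random_correlation(n))
```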

Different solutions for networks trained on sine wave task.
All networks have N = 512 neurons. Four regimes: A: aligned for small output weights, B: marginal for large output weights, small recurrent weights, C: lazy for both large output and recurrent weights, D: oblique for large output weights and noise added during training. Left: Output (dark), target (purple dots) and four states (light) of the network after training. Black bars indicate the scales for output and states (length = 1; same for all regimes). The output beyond the target interval t ∈ [1, 21] can be considered as extrapolation. The network in the oblique regime, D, receives white noise during training, and the evaluation is shown with the same noise. Without noise, this network still produces a sine wave (not shown). Right: Projection of states on the first 2 PCs and the orthogonal component w_out,⊥ of the output vector. All axes have the same scale, which allows for comparison between the dynamics. Vectors show the (amplified) output weights, dotted lines the projection on the PCs (not visible for lazy and oblique). The insets for the marginal solution (B, left and right) show the dynamics magnified.

Noise-induced learning for linear network with input-driven fixed point.
Learning separates into fast learning of the bias part of the loss (left), and slow learning reducing the variance part (right). Learning rates are η = η₀/N and η = η₀, respectively, with η₀ = 0.002 and network size N = 256. Learning epochs in the first phase are counted from -1000, so that the second phase starts at 0. In the right column, the initial learning phase with learning time steps multiplied by 1/N is shown for comparison. In all plots, simulations (full lines) are compared with theory (dashed lines). A Loss L = L_bias + L_var. The two components are obtained by averaging over a batch with 32 examples at each learning step. The full loss is not plotted in the slow phase, because it is indistinguishable from L_var. B Coefficients of the 2-by-2 coupling matrix M. M₁₁ = λ_- is the feedback loop along the output weights, M₁₂ a feedforward coupling from input to output. The theory predicts M₂₁ ~ M₂₂ ~ O(1/N). C Norm of the state changes during training. The theory predicts that it remains constant during the second phase and small compared to . Other parameters: target ẑ = 1, σ_noise = 1, overlap between input and output vectors .
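The split L = L_bias + L_var in panel A can be illustrated with the standard bias-variance identity for a mean squared error over a batch of noisy trials. The sketch below assumes an unweighted mean squared error without extra prefactors; the paper's exact normalization is not given in this caption, and the function name is hypothetical.

```python
import numpy as np

def bias_variance_loss(z, target):
    """Split the batch-averaged squared error into bias and variance parts.

    z: (batch, time) outputs over noisy trials; target: (time,).
    Returns (loss_bias, loss_var); their sum equals mean((z - target)**2).
    """
    z_mean = z.mean(axis=0)                       # trial-averaged output
    loss_bias = float(np.mean((z_mean - target) ** 2))
    loss_var = float(np.mean((z - z_mean) ** 2))  # fluctuations around the mean
    return loss_bias, loss_var
```

With a batch of 32 examples, as in the caption, z would have 32 rows.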

Cartoon illustrating the split into bias and variance components of the loss, and noise suppression along the output direction.
The two-dimensional subspace spanned by U illustrates the main directions under consideration: the PCs of the average trajectories (here only a fixed point), and the direction of output weights. Left: During learning, a fast process keeps the average output close to the target so that L_bias = 0. Center: The variance component, L_var, is determined by the projection of the fluctuations δτ onto the output vector. Note that the noise in the low-D subspace is very small, but the output is still affected due to the large output weights. Right: During training, the noise becomes non-isotropic. Along the average direction, the fluctuations are increased as a byproduct of the positive feedback λ₊. Meanwhile, a slow learning process suppresses the output variance via a negative feedback λ_-.

Mechanisms behind oblique solutions predicted by mean field theory.
A-D: Mean-field theory predictions as a function of positive feedback strength λ₊. The dotted lines indicate λ_+,min, the minimal eigenvalue necessary to generate fixed points. A: Norm of fixed point. B: Correlation ρ so that L_bias = 0. C,D: Loss due to fluctuations for different λ_- or network sizes N. Dots indicate minima. E-G: Latent states τ of simulated networks for randomly drawn projections U. The symmetric matrix M is fixed by setting λ₊ as noted, λ_- = -5, and demanding L_bias = 0 (for the mean field prediction). Dots are samples from the simulation interval t ∈ [20, 100]. H-J: Histogram for the corresponding output z. Mean is indicated by full lines, the dashed lines indicate the target ẑ. Other parameters: N = 256, σ_noise = 1, ẑ = 1.

Mean field theory predicts learning with gradient descent.
A-C: Learning dynamics with gradient descent for an example network with N = 1024 neurons and with noise variance . A: Loss with separate bias and variance components. B: Matrix coefficients M_ij. The dotted lines almost identical to M₂₂ and M₁₁ indicate the eigenvalues λ₊ and λ_-, respectively. The dashed line indicates λ_+,min. C: Fixed point norm and correlation. D-F: Final loss, fixed point norm and correlation for networks of different size N. Shown are mean (dots and lines) and std dev (shades) for 5 sample networks, and the prediction by the mean field theory. Grey lines indicate scaling as aN^k, with k ∈ {0, -1/4, -1/2}. Note the log-log axes for E, F.

Path to oblique solutions for networks with large output weights.
Left: all networks produce the same output (cf. Fig. 1). Center: Unstable solutions that arise early in learning. For lazy solutions, initial chaotic activity is slightly adapted, without changing the dynamics qualitatively. For marginal solutions, vanishingly small initial activity is replaced with very small dynamics sufficient to generate the output. Right: With more learning time and noise added during the learning process, stable, oblique solutions arise. The neural dynamics along the largest PCs can be either decoupled from the output (center right), or coupled (right). For decoupled dynamics, the components along the largest PCs (blue subspace) differ qualitatively from those generating the output (same as Fig. 1B, bottom). The dynamics along the largest PCs inherit task-unrelated components from the initial dynamics or randomness during learning. Another possibility is oblique but coupled dynamics (right). Such solutions do not inherit task-unrelated components of the dynamics at initialization. They are qualitatively similar to aligned solutions, and the output is generated by a small projection of the output weights onto the largest PCs (dashed orange arrow).

Solution to linearized network dynamics in the lazy regime.
A Network output for weight changes ΔW_lin obtained from linearized dynamics for different network sizes. Each plot shows 10 different networks (one example in bold). Target points in purple. B Output for networks trained with GD from the same initial conditions as those above. C Loss on training set for the linear (lin) and GD solutions. D Frobenius norms of weight changes ΔW of the linear and GD solutions, as well as of the difference between the two, ΔW_lin - ΔW_GD (dashed green). Grey and black dashed lines for comparison of scales.

Detailed network dynamics for cycling task for aligned (top) and oblique (bottom) dynamics.
A Cumulative explained variance along PCs taken for the dynamics in both contexts (orange), and separately for each context (blue, green). For the aligned network, the dynamics of each context separately are essentially 2D. B Projection of dynamics onto the PCs obtained from the separate PCAs. Arrows indicate directions of trajectories. Three arrows divide one output cycle into equal parts, and hence indicate time: the internal dynamics for the oblique network are slower than the output, cf. Fig. 2C. C Histograms of state norms along the limit cycles and distance between limit cycles. The distance at each time point is defined as d₁₂(t) = min_t' ∥x₁(t) - x₂(t')∥, where the subscripts indicate contexts 1 and 2.
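The distance d₁₂(t) defined in panel C can be computed directly from two sampled trajectories. The following is a minimal sketch; the function name and the array layout (time points by neurons) are assumptions made here.

```python
import numpy as np

def limit_cycle_distance(x1, x2):
    """d12(t) = min over t' of ||x1(t) - x2(t')||.

    x1: (T1, N) states in context 1; x2: (T2, N) states in context 2.
    Returns an array of shape (T1,).
    """
    diff = x1[:, None, :] - x2[None, :, :]   # all pairwise differences (T1, T2, N)
    return np.linalg.norm(diff, axis=-1).min(axis=1)
```

The pairwise array has T1 × T2 × N entries, so long trajectories may need to be processed in chunks.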

Power spectral densities for the six networks shown in Fig. 4.
Dashed orange line indicates the output frequency. Note the high power for non-target frequencies in the first PCs in some of the large output solutions.

Summary of all measures for initially decaying or chaotic networks.

Neural activity in response to the perturbations applied in Fig. 5.
Activity is plotted in the space spanned by the leading two PCs and the output weights w_out,1. We first show the unperturbed trajectories in each network (left), then the perturbed ones for perturbations along the first output direction (center), and along the first PC (right). The unperturbed trajectories are also plotted for comparison. Yellow dots indicate the point where the perturbation is applied. All perturbations but the one along the output for aligned lead to trajectories on the same attractor, but potentially with a phase shift. Note that in general, perturbations can also lead to the activity converging on a different attractor. Here, we see a specific example of this happening for the cycling task in the aligned regime.

Noise compression over training time for the cycling task.
Example networks trained on the cycling task with σ_noise = 0.2 and N = 256. The network at the end of training is analyzed in Fig. 6B. A Full loss (grey) and decomposition in bias (golden) and variance (purple) parts over learning time. The bias part decays rapidly (y-axis is clipped, initial loss L₀ = 1.4), whereas the variance part needs many more training steps to decrease. Dotted lines indicate the two examples in B-D. B Output fluctuations around the trial-conditioned average. Mean is over 16 samples for each of the two trial conditions (clockwise and anticlockwise rotation). Because both output dimensions i ∈ {1, 2} are equivalent in scale, we collected both for the histogram. C-D Example output trajectories early (C) and late (D) in learning. Shown are the mean (dark) and 5 samples (light).

Learning curves for neuroscience tasks analyzed in Fig. 3.
The learning rates are , where η₀ is the number in the upper right corner of each plot. The network size is N = 512.

Example of task variability for the flipflop task.
The titles in each row indicate the spectral radius g of the initial recurrent connectivity (g > 1 for initially chaotic, else decaying activity), and the norm of initial output weights.

Fitting neural activity to other output modalities (hand position, velocity, acceleration, EMG).
Output modality is indicated by the x-ticks, the corresponding datasets by the labels below and the color. A: Quality of fit. B: Correlation, normalized by , with N the number of neurons (see Fig. 7B). The dashed line at corresponds to the correlation between two random vectors. C: Embedding dimension of firing rates X, i.e. the number of PCs necessary to span 90% of the variance of X. D: Fitting dimension: number of PCs necessary to reach , where the latter is the R² value obtained for fitting with all dimensions. For each output modality, the delay between activity and output is optimized. Position decodes earlier (300-200ms) than velocity or acceleration (100-50ms), no delay for EMG. The data X is the same with each dataset apart from a potential shift by the respective delay, so that dimension D_x,90 in C is almost the same.

Learning curves and eigenvalue spectra for sine wave task.
(a) Loss over training steps for four networks. The light red line is the bias term of the loss for the oblique network. (b) Norm of weight changes over learning time. (c) Eigenvalue spectra of connectivity matrix W_f after training. The dashed line indicates the stability line for the fixed point at the origin.

The aligned regime is robust to the choice of other hyperparameters.

The oblique regime is robust to the choice of other hyperparameters.

Training history leads to order-one fixed point norm.
We trained RNNs on the example fixed point task. Similar to Fig. 13, but with smaller noise, σ_noise = 0.2, and with learning rate increased by 2 and number of epochs by 2.5. A-C: Learning dynamics with gradient descent for one network with N = 256 neurons. The first 400 epochs are dominated by L_bias, and M₁₁ ≈ λ_- becomes positive. The negative feedback loop, λ_- < 0 only forms later in learning. The matrix M does not become symmetric during learning. D-F: Fixed point norm and correlation for different N evaluated when λ_- = 0 (left) and at the end of learning (right). The time points are indicated by a square and triangle in C, respectively. At λ_- = 0, simulation and theory agree for the scaling: and . At the end of training, the theory predicts a decreasing fixed point norm, but the simulated networks inherit the order-one norm from the training history.
