Schematic of aligned and oblique dynamics in recurrent neural networks. A Output generated by both networks. B Neural activity of aligned (top) and oblique (bottom) dynamics, visualized in the space spanned by three neurons. Here, the activity (green) is three-dimensional, but most of the variance is concentrated along the two largest PCs (blue). For aligned dynamics, the output weights (red) are small and lie in the subspace spanned by the largest PCs; they are hence strongly correlated with the activity. For oblique dynamics, the output weights are large and lie outside the subspace spanned by the largest PCs; they are hence only weakly correlated with the activity. C Projection of activity onto the two largest PCs. For oblique dynamics, the output weights are orthogonal to the leading PCs. D Evolution of PC projections over time. For aligned dynamics, the projection onto the PCs resembles the output z(t), so the output can be reconstructed from the two largest components. For oblique dynamics, such a reconstruction is not possible, because the projections oscillate much more slowly than the output.
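
The alignment pictured here can be quantified directly from data. A minimal sketch (the array shapes and the specific alignment measure are illustrative assumptions, not the paper's exact definitions): given activity X (time × neurons) and output weights w_out, measure how much of w_out lies in the subspace of the top PCs.

```python
import numpy as np

def alignment_with_top_pcs(X, w_out, k=2):
    """Fraction of w_out lying in the subspace of the k largest PCs of X."""
    Xc = X - X.mean(axis=0)                        # mean-center activity
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V_k = Vt[:k].T                                 # (neurons, k) PC directions
    w = w_out / np.linalg.norm(w_out)
    return np.linalg.norm(V_k.T @ w)               # ~1: aligned, ~0: oblique

# Toy usage with random data (hypothetical numbers):
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50)) @ rng.standard_normal((50, 50))
w_out = rng.standard_normal(50)
print(alignment_with_top_pcs(X, w_out, k=2))
```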

Aligned and oblique dynamics for a cycling task [61]. A A network with two outputs was trained to generate either clockwise or anticlockwise rotations, depending on the context (top). Our model RNN (bottom) received a context input pulse, generated dynamics x(t) via recurrent weights W, and yielded the output as linear projections of the states. We trained the recurrent weights W with gradient descent. B-C Resulting internal dynamics for two networks with small (top) and large (bottom) output weights, corresponding to aligned and oblique dynamics, respectively. B Dynamics projected onto the first 2 PCs and the orthogonal component wout,⊥ of the first output vector (for z1). The output weights are amplified to be visible. Arrowheads indicate the direction of the dynamics. Note that for the large output weights, the dynamics in the first two PCs co-rotate, despite the counter-rotating output. C Output reconstructed from the largest PCs, with dimension D = 2 (full lines) or 8 (dotted). Two dimensions already yield a fit with R2 = 0.99 for aligned dynamics (top), but almost no output for oblique dynamics (bottom, R2 = 0.005, no arrows shown). For the latter, a good fit with R2 > 0.9 is only reached at D = 8.
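
A minimal sketch of the model class described in A (the Euler discretization, tanh nonlinearity, and the specific input and weight scalings are illustrative assumptions, not the trained networks' exact settings):

```python
import numpy as np

def simulate_rnn(W, W_out, w_in, context, T=500, dt=0.1, pulse_len=20):
    """Rate RNN: context pulse -> recurrent dynamics x(t) -> linear output z(t)."""
    N = W.shape[0]
    x = np.zeros(N)
    xs, zs = [], []
    for t in range(T):
        u = context if t < pulse_len else 0.0      # context input pulse
        x = x + dt * (-x + W @ np.tanh(x) + w_in * u)
        xs.append(x.copy())
        zs.append(W_out @ x)                       # output: linear projection of states
    return np.array(xs), np.array(zs)

# Hypothetical sizes: N neurons, two outputs (z1, z2)
rng = np.random.default_rng(1)
N = 256
W = 1.5 * rng.standard_normal((N, N)) / np.sqrt(N)
W_out = rng.standard_normal((2, N)) / N            # "small" output weights
w_in = rng.standard_normal(N)
xs, zs = simulate_rnn(W, W_out, w_in, context=1.0)
```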

The magnitude of output weights determines regimes across multiple neuroscience tasks [41, 59, 61, 76]. A Correlation and norms of output weights and neural activity. For each task, we initialized networks with small or large output weights (dark vs light orange). The initial norms ║wout║ are indicated by the dashed lines. Learning only weakly changes the norm of the output weights. Note that all y-axes are logarithmically scaled. B Explained variance of x and R2 of the reconstructed output for projections of x onto an increasing number of PCs. Results from one example network trained on the cycling task are shown for each condition. C Number of PCs necessary to reach 90% of the variance of x(t) or of the R2 of the output reconstruction (top/bottom; dotted lines in B). In A, C, violin plots show the distribution over 5 sample networks, with vertical bars indicating the mean and the extreme values (where visible).
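
A sketch of the two dimension measures used in B-C (the variable names and the rank-D reconstruction via SVD are assumptions about the exact procedure):

```python
import numpy as np

def dimensions(X, z, thresh=0.9):
    """PCs needed for 90% activity variance and for 90% R2 of output reconstruction."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    cum_var = np.cumsum(S**2) / np.sum(S**2)
    D_var = int(np.argmax(cum_var >= thresh)) + 1    # 90% variance of x(t)
    sst = np.sum((z - z.mean(axis=0))**2)
    for D in range(1, len(S) + 1):
        X_D = (U[:, :D] * S[:D]) @ Vt[:D]            # rank-D approximation of X
        beta, *_ = np.linalg.lstsq(X_D, z, rcond=None)
        r2 = 1 - np.sum((z - X_D @ beta)**2) / sst
        if r2 >= thresh:                             # 90% R2 of reconstructed output
            return D_var, D
    return D_var, None
```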

Variability between learners for the two regimes. A-B Examples of networks trained on the cycling task with small (aligned) or large (oblique) output weights. The top left and central networks, respectively, are the same as those plotted in Fig. 2. C Dissimilarity between solutions across different tasks. Aligned dynamics (red) were less dissimilar to each other than oblique ones (yellow). The violin plots show the distribution over all pairs among five sample networks (mean and extrema as bars).
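
One plausible way to compute such a dissimilarity (an illustrative assumption; the paper's exact measure may differ) is one minus the mean canonical correlation between the leading PC subspaces of two networks' activities:

```python
import numpy as np

def dissimilarity(X1, X2, k=4):
    """1 - mean canonical correlation between top-k PC subspaces of X1 and X2."""
    def top_pcs(X):
        Xc = X - X.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Vt[:k].T                              # (neurons, k), orthonormal
    V1, V2 = top_pcs(X1), top_pcs(X2)
    cc = np.linalg.svd(V1.T @ V2, compute_uv=False)  # canonical correlations
    return 1.0 - float(cc.mean())                    # 0: identical subspaces
```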

Perturbations differentially affect dynamics in the aligned and oblique regimes. A Cartoon illustrating the relationship between perturbations along output weights or PCs and the feedback loops driving autonomous dynamics. B Output after perturbation for aligned (top) and oblique (bottom) networks trained on the cycling task. The unperturbed network (light red line) yields a sine wave along the first output direction z1. At tp = 9, a perturbation with amplitude ║Δx║ = 34 is applied along the output weights (dashed red) or the first PC (dashed-dotted blue). The two perturbations differ only in their direction. While the immediate response of the oblique network to a perturbation along the output weights is much larger, z1(tp) ≈ 80, the long-term dynamics yield the same output as the unperturbed network. See also Fig. 20 for more details. C Loss for perturbations of different amplitudes for the two networks in B. Lines and shades are means and std devs over different perturbation times tp ∈ [5, 15] and random directions within the span of the output weights (red) or the two largest PCs (blue). The loss is the mean squared error between output and target for t > 20. The gray dot indicates an example in B. D Relative susceptibility of networks to perturbation directions for different tasks and dynamical regimes. We measured the area under the curve (AUC) of loss over perturbation amplitude for perturbations along the output weights or the two largest PCs. The relative susceptibility is the ratio between the two AUCs. The example in C is indicated by gray triangles.
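
A sketch of the perturbation protocol and the relative susceptibility in D (the function signatures and simulation loop are illustrative assumptions):

```python
import numpy as np

def perturbed_loss(step, readout, target, x0, d, amp, t_p, T=40):
    """Apply amp*d to the state at t_p; return MSE of output vs target for t > 20."""
    x, errs = x0.copy(), []
    for t in range(T):
        x = step(x)
        if t == t_p:
            x = x + amp * d                  # instantaneous perturbation along unit d
        if t > 20:
            errs.append(np.sum((readout(x) - target(t)) ** 2))
    return float(np.mean(errs))

def relative_susceptibility(losses_wout, losses_pc, amps):
    """Ratio of the areas under the two loss-vs-amplitude curves (AUC)."""
    return np.trapz(losses_wout, amps) / np.trapz(losses_pc, amps)
```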

Noise suppression along the output direction in the oblique regime. A A cartoon of the feedback loop structure for aligned (top) and oblique (bottom) dynamics. The latter develops a negative feedback loop which suppresses fluctuations along the output direction. B Distribution of the variance of mean-subtracted activity along different directions for networks trained on the cycling task (see Fig. 21): PCs of trial-averaged activity (blue), readout (red), and random (grey) directions. For the PCs and output weights, we sampled 100 normalized combinations of either the first two PCs or the two output vectors. For the random directions, we drew 1000 random vectors in the full, N-dimensional space. C Noise compression across tasks as measured by the ratio between the variance along output and random directions. The dashed line indicates neither compression nor expansion. Black markers indicate the values for the two example networks in B. Note the log-scales in B and C.
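
A sketch of the variance measure in B (the array layout, trials × time × neurons, is an assumption):

```python
import numpy as np

def variance_along(X_trials, d):
    """Variance of mean-subtracted activity along the unit direction d."""
    X_noise = X_trials - X_trials.mean(axis=0, keepdims=True)  # remove trial average
    return float((X_noise @ d).var())                          # project, then variance

# Toy usage: compare variances along random unit directions
rng = np.random.default_rng(2)
X_trials = rng.standard_normal((32, 100, 64))
d = rng.standard_normal(64)
d /= np.linalg.norm(d)
print(variance_along(X_trials, d))
```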

Quantifying aligned and oblique dynamics in experimental data [12, 22, 25, 52, 61]. A Diagram of the two types of experimental data considered. Here we always took the velocity as the output (hand, finger, or cursor). In motor control experiments (top), we first needed to obtain the output weights wout via linear regression. We then computed the correlation ρ and the reconstruction dimension Dfit,90, i.e. the number of PCs of x necessary to obtain a coefficient of determination R2 > 0.9. In BCI experiments (bottom), the output (cursor velocity) is generated from neural activity x(t) via output weights wout defined by the experimenter. This allowed us to compute correlations and fitting dimensions directly. B Correlation ρ (top) and relative fitting dimension Dfit,90/Dx,90 (bottom) for several publicly available data sets. The cycling task data (purple) were trial-conditioned averages; the BCI experiments (red) and NLB tasks (yellow) used single-trial data. Results for the full data sets are shown as dots. Violin plots indicate results for 20 random subsets of 25% of the data points in each data set (bars indicate means). The small text below the x-axis gives the number of time points and neurons.
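
A sketch of the motor-control branch of this analysis (the ridge regularization and the normalization of ρ are assumptions for illustration):

```python
import numpy as np

def fit_output_weights(X, z, alpha=1e-3):
    """Ridge regression of output z (time x outputs) on activity X (time x neurons)."""
    N = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(N), X.T @ z)

def correlation_rho(X, w):
    """Correlation-like alignment between output weights and activity."""
    return np.linalg.norm(X @ w) / (np.linalg.norm(X) * np.linalg.norm(w))
```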

Task, simulation, and network parameters for Figs. 3 to 6

Scaling of the correlation ρ with the number of neurons N in experimental data. We fitted the output weights to subsets of N neurons and computed the quality of fit (top) and the correlation between the resulting output weights and firing rates (bottom). To compare with random vectors, the correlation is scaled by √N. Dashed lines show N^p, for p ∈ {1/2, 1/4, 0}, for comparison. The aligned regime corresponds to p = 1/2, and the oblique one to p = 0.
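
A sketch of the subsampling procedure behind this figure (the subset sizes, replication count, and ridge regularization are placeholders):

```python
import numpy as np

def scaled_correlation_vs_N(X, z, Ns, n_rep=10, alpha=1e-3, seed=0):
    """Fit w_out on random neuron subsets of size N; return sqrt(N)-scaled rho."""
    rng = np.random.default_rng(seed)
    result = {}
    for N in Ns:
        vals = []
        for _ in range(n_rep):
            idx = rng.choice(X.shape[1], size=N, replace=False)
            Xs = X[:, idx]
            w = np.linalg.solve(Xs.T @ Xs + alpha * np.eye(N), Xs.T @ z)
            rho = np.linalg.norm(Xs @ w) / (np.linalg.norm(Xs) * np.linalg.norm(w))
            vals.append(np.sqrt(N) * rho)            # scale to compare with random
        result[N] = float(np.mean(vals))
    return result
```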

Different solutions for networks trained on a sine wave task. All networks have N = 512 neurons. Four regimes: A: aligned for small output weights; B: marginal for large output weights and small recurrent weights; C: lazy for both large output and large recurrent weights; D: oblique for large output weights with noise added during training. Left: Output (dark), target (purple dots), and four states (light) of the network after training. Black bars indicate the scales for output and states (length = 1; same for all regimes). The output beyond the target interval t ∈ [1, 21] can be considered extrapolation. The network in the oblique regime, D, receives white noise during training, and the evaluation is shown with the same noise. Without noise, this network still produces a sine wave (not shown). Right: Projection of states onto the first 2 PCs and the orthogonal component wout,⊥ of the output vector. All axes have the same scale, which allows for comparison between the dynamics. Vectors show the (amplified) output weights, dotted lines the projection onto the PCs (not visible for lazy and oblique). The insets for the marginal solution (B, left and right) show the dynamics magnified.

Noise-induced learning for a linear network with input-driven fixed point. Learning separates into fast learning of the bias part of the loss (left) and slow learning reducing the variance part (right). Learning rates are η = η0/N and η = η0, respectively, with η0 = 0.002 and network size N = 256. Learning epochs in the first phase are counted from -1000, so that the second phase starts at 0. In the right column, the initial learning phase is shown for comparison, with learning time steps rescaled by 1/N. In all plots, simulations (full lines) are compared with theory (dashed lines). A Loss L = Lbias + Lvar. The two components are obtained by averaging over a batch of 32 examples at each learning step. The full loss is not plotted in the slow phase, because it is indistinguishable from Lvar. B Coefficients of the 2-by-2 coupling matrix M. M11 = λ is the feedback loop along the output weights; M12 is a feedforward coupling from input to output. The theory predicts M21 ~ M22 ~ O(1/N). C Norm of state changes during training. The theory predicts that it remains constant during the second phase and small compared to the norm of the states. Other parameters: target = 1, σnoise = 1, and fixed overlap between input and output vectors.
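
The bias/variance split in A can be estimated from a batch of noisy outputs; a minimal sketch (the batch estimate and names are illustrative):

```python
import numpy as np

def loss_decomposition(z_samples, z_target):
    """L = L_bias + L_var from a batch of outputs under noise."""
    L_bias = (z_samples.mean() - z_target) ** 2   # error of the average output
    L_var = z_samples.var()                       # fluctuations around the average
    return L_bias, L_var

# Toy usage with a batch of 32 noisy outputs (as in A):
rng = np.random.default_rng(3)
z_samples = 1.0 + 0.1 * rng.standard_normal(32)
print(loss_decomposition(z_samples, z_target=1.0))
```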

Cartoon illustrating the split into bias and variance components of the loss, and noise suppression along the output direction. The two-dimensional subspace spanned by U illustrates the main directions under consideration: the PCs of the average trajectories (here only a fixed point) and the direction of the output weights wout. Left: During learning, a fast process keeps the average output close to the target so that Lbias = 0. Center: The variance component, Lvar, is determined by the projection of the fluctuations δκ onto the output vector. Note that the noise in the low-D subspace is very small, but the output is still affected due to the large output weights. Right: During training, the noise becomes non-isotropic. Along the direction of the average activity, the fluctuations are increased as a byproduct of the positive feedback λ+. Meanwhile, a slow learning process suppresses the output variance via a negative feedback λ−.

Mechanisms behind oblique solutions predicted by mean-field theory. A-D: Mean-field theory predictions as a function of the positive feedback strength λ+. The dotted lines indicate λ+,min, the minimal eigenvalue necessary to generate fixed points. A: Norm of the fixed point. B: Correlation ρ required so that Lbias = 0. C, D: Loss due to fluctuations for different λ− or network sizes N. Dots indicate minima. E-G: Latent states κ of simulated networks for randomly drawn projections U. The symmetric matrix M is fixed by setting λ+ as noted, λ− = −5, and demanding Lbias = 0 (for the mean-field prediction). Dots are samples from the simulation interval t ∈ [20, 100]. H-J: Histograms of the corresponding output z. Means are indicated by full lines, the target by dashed lines. Other parameters: N = 256, σnoise = 1, target = 1.
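
A minimal sketch of two-dimensional latent dynamics of the kind shown in E-G (the tanh nonlinearity, the Euler-Maruyama integration, and the eigenvector parametrization are assumptions about the reduced model; in this toy version, nonzero fixed points require λ+ > 1):

```python
import numpy as np

def simulate_latent(lam_plus, lam_minus, theta=0.3, T=20000, dt=0.01,
                    sigma=1.0, seed=0):
    """Latent states kappa driven by a symmetric 2x2 coupling matrix M plus noise."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    M = R @ np.diag([lam_plus, lam_minus]) @ R.T     # eigenvalues lam_plus, lam_minus
    rng = np.random.default_rng(seed)
    kappa, traj = np.zeros(2), []
    for _ in range(T):
        drift = -kappa + M @ np.tanh(kappa)          # positive/negative feedback loops
        kappa = kappa + dt * drift + sigma * np.sqrt(dt) * rng.standard_normal(2)
        traj.append(kappa.copy())
    return np.array(traj)

traj = simulate_latent(lam_plus=3.0, lam_minus=-5.0)
```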

Mean-field theory predicts learning with gradient descent. A-C: Learning dynamics with gradient descent for an example network with N = 1024 neurons trained with noise. A: Loss with separate bias and variance components. B: Matrix coefficients Mij. The dotted lines, almost identical to M22 and M11, indicate the eigenvalues λ+ and λ−, respectively. The dashed line indicates λ+,min. C: Fixed point norm and correlation. D-F: Final loss, fixed point norm, and correlation for networks of different sizes N. Shown are the mean (dots and lines) and std dev (shades) for 5 sample networks, and the prediction of the mean-field theory. Grey lines indicate scaling as a·N^k, with k ∈ {0, −1/4, −1/2}. Note the log-log axes for E, F.

Path to oblique solutions for networks with large output weights. Left: all networks produce the same output (cf. Fig. 1). Center: Unstable solutions that arise early in learning. For lazy solutions, the initial chaotic activity is slightly adapted, without changing the dynamics qualitatively. For marginal solutions, vanishingly small initial activity is replaced with very small dynamics sufficient to generate the output. Right: With more learning time and noise added during the learning process, stable, oblique solutions arise. The neural dynamics along the largest PCs can be either decoupled from the output (center right) or coupled (right). For decoupled dynamics, the components along the largest PCs (blue subspace) differ qualitatively from those generating the output (same as Fig. 1B, bottom). The dynamics along the largest PCs inherit task-unrelated components from the initial dynamics or from randomness during learning. Another possibility is oblique but coupled dynamics (right). Such solutions do not inherit task-unrelated components of the dynamics at initialization. They are qualitatively similar to aligned solutions, and the output is generated by a small projection of the output weights onto the largest PCs (dashed orange arrow).

Solution to linearized network dynamics in the lazy regime. A Network output for weight changes ΔWlin obtained from linearized dynamics for different network sizes. Each plot shows 10 different networks (one example in bold). Target points in purple. B Output for networks trained with GD from the same initial conditions as those above. C Loss on the training set for the linear (lin) and GD solutions. D Frobenius norms of the weight changes ΔW of the linear and GD solutions, as well as of the difference between the two, ΔWlin − ΔWGD (dashed green). Grey and black dashed lines are shown for scale comparison.

Summary of all measures for initially decaying or chaotic networks.

Projection of neural dynamics onto the first 4 PCs for the cycling task, Fig. 2. The x-axis for all plots is PC 1 of the respective dynamics (left: aligned, right: oblique). The y-axes are PCs 2 to 4. Axis labels indicate the relative variance explained by each PC. Arrows indicate direction. Note the co-rotation for the aligned network in PCs 1 and 3, as well as the counter-rotation for the oblique network in PCs 1 and 4.

Power spectral densities for the six networks shown in Fig. 4. The dashed orange line indicates the output frequency. Note the high power for non-target frequencies in the first PCs in some of the large output solutions.
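
A sketch of the PSD computation for PC projections (rectangular window and no segment averaging are simplifications relative to standard spectral estimation):

```python
import numpy as np

def psd_of_pcs(X, dt, k=2):
    """Power spectral density of the activity projected onto the first k PCs."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    proj = Xc @ Vt[:k].T                              # (time, k) PC projections
    freqs = np.fft.rfftfreq(proj.shape[0], d=dt)
    psd = np.abs(np.fft.rfft(proj, axis=0)) ** 2 / proj.shape[0]
    return freqs, psd
```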

Example of task variability for the flip-flop task. The titles of each row indicate the spectral radius g of the initial recurrent connectivity (g > 1 yields initially chaotic activity, g < 1 decaying activity) and the norm of the initial output weights.

Neural activity in response to the perturbations applied in Fig. 5. Activity is plotted in the space spanned by the leading two PCs and the output weights wout,1. We first show the unperturbed trajectories of each network (left), then the perturbed ones for perturbations along the first output direction (center) and along the first PC (right). The unperturbed trajectories are also plotted for comparison. Yellow dots indicate the point at which the perturbation is applied. All perturbations except the one along the output direction for the aligned network lead to trajectories on the same attractor, though potentially with a phase shift. Note that in general, perturbations can also lead to the activity converging to a different attractor; here, we see a specific example of this happening for the cycling task in the aligned regime.

Noise compression over training time for the cycling task. Example networks trained on the cycling task with σnoise = 0.2 and N = 256. The network at the end of training is analyzed in Fig. 6B. A Full loss (grey) and its decomposition into bias (golden) and variance (purple) parts over learning time. The bias part decays rapidly (the y-axis is clipped; initial loss L0 = 1.4), whereas the variance part needs many more training steps to decrease. Dotted lines indicate the two examples in B-D. B Output fluctuations around the trial-conditioned average. Means are taken over 16 samples for each of the two trial conditions (clockwise and anticlockwise rotation). Because both output dimensions i ∈ {1, 2} are equivalent in scale, we pooled both for the histogram. C-D Example output trajectories early (C) and late (D) in learning. Shown are the mean (dark) and 5 samples (light).
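
A sketch of the fluctuation measure in B (the array layout, trials × time × outputs, is an assumption; both output dimensions are pooled as described):

```python
import numpy as np

def output_fluctuations(z_trials):
    """Fluctuations of the output around the trial-conditioned average."""
    z_bar = z_trials.mean(axis=0, keepdims=True)   # trial-conditioned average
    return (z_trials - z_bar).ravel()              # pool dimensions for the histogram

# Toy usage: 16 trials, 100 time steps, 2 output dimensions
rng = np.random.default_rng(4)
z_trials = rng.standard_normal((16, 100, 2))
print(output_fluctuations(z_trials).std())
```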

Fitting neural activity to different output modalities (hand position, velocity, acceleration, EMG). The output modality is indicated by the x-ticks, the corresponding datasets by the color and the labels below. A, B: Correlation and relative fitting dimension, similar to Fig. 7B, where these are shown for velocity alone. C-E: Additional details. C: Coefficient of determination R2 of the linear regression. D: Number of PCs necessary to reach 90% of the variance of the neural activity X. E: Number of PCs necessary to reach 90% of the R2 value of the full neural activity. For each output modality, the delay between activity and output is optimized. Position is decoded with longer delays (200–300 ms) than velocity or acceleration (50–100 ms); no delay is used for EMG. Within each dataset, the data X are the same apart from a shift by the respective delay, so that the dimension Dx,90 in D is almost the same. Note that we also computed trial averages for the NLB maze task to test for the effect of trial averaging. These, however, have a small sample number (189 data points for 162 neurons), which leads to considerable discrepancy between the dots (full data) and the subsamples with only 25% of the data points, for example in the correlation (A).
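
A sketch of the delay optimization (grid search over shifts in time bins; the grid and the ridge regularization are assumptions):

```python
import numpy as np

def best_delay(X, z, delays, alpha=1e-3):
    """Return the delay (in bins) maximizing R2 of the activity-to-output regression."""
    best_d, best_r2 = None, -np.inf
    for d in delays:
        Xd, zd = (X[:-d], z[d:]) if d > 0 else (X, z)   # activity leads output by d
        w = np.linalg.solve(Xd.T @ Xd + alpha * np.eye(X.shape[1]), Xd.T @ zd)
        res = zd - Xd @ w
        r2 = 1 - np.sum(res**2) / np.sum((zd - zd.mean(axis=0))**2)
        if r2 > best_r2:
            best_d, best_r2 = d, r2
    return best_d, best_r2
```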

Learning curves and eigenvalue spectra for the sine wave task. A Loss over training steps for four networks. The light red line is the bias term of the loss for the oblique network. B Norm of weight changes over learning time. C Eigenvalue spectra of the connectivity matrix W after training. The dashed line indicates the stability line for the fixed point at the origin.

The aligned regime is robust to the choice of other hyperparameters.

The oblique regime is robust to the choice of other hyperparameters.

Training history leads to an order-one fixed point norm. We trained RNNs on the example fixed point task. Similar to Fig. 13, but with smaller noise, σnoise = 0.2, and with the learning rate increased by a factor of 2 and the number of epochs by a factor of 2.5. A-C: Learning dynamics with gradient descent for one network with N = 256 neurons. The first 400 epochs are dominated by Lbias, and M11 ≈ λ− becomes positive. The negative feedback loop, λ− < 0, only forms later in learning. The matrix M does not become symmetric during learning. D-F: Fixed point norm and correlation for different N, evaluated when λ− = 0 (left) and at the end of learning (right). The time points are indicated by a square and a triangle in C, respectively. At λ− = 0, simulation and theory agree on the scaling of both quantities. At the end of training, the theory predicts a decreasing fixed point norm, but the simulated networks inherit the order-one norm from the training history.