Abstract
The relation between neural activity and behaviorally relevant variables is at the heart of neuroscience research. When strong, this relation is termed a neural representation. There is increasing evidence, however, for partial dissociations between activity in an area and relevant external variables. While many explanations have been proposed, a theoretical framework for the relationship between external and internal variables is lacking. Here, we utilize recurrent neural networks (RNNs) to explore the question of when and how neural dynamics and the network’s output are related from a geometrical point of view. We find that RNNs can operate in two regimes: dynamics can either be aligned with the directions that generate output variables, or oblique to them. We show that the magnitude of the readout weights can serve as a control knob between the regimes. Importantly, these regimes are functionally distinct. Oblique networks are more heterogeneous and suppress noise in their output directions. They are furthermore more robust to perturbations along the output directions. Finally, we show that the two regimes can be dissociated in neural recordings. Altogether, our results open a new perspective for interpreting neural activity by relating network dynamics and their output.
1. Introduction
The relation between neural activity and behavioral variables is often expressed in terms of neural representations. Sensory input and motor output have been related to the tuning curves of single neurons [21, 24, 41] and, since the advent of large-scale recordings, to population activity [6, 57, 73]. Both input and output can be decoded from population activity, [10, 37], even in real time, closed-loop settings [56, 74]. However, neural activity is often not fully explained by observable behavioral variables. Some components of the unexplained neural activity have been interpreted as random trial-to-trial fluctuations [15], potentially linked to unobserved behavior [40, 65]. Activity may further be due to other ongoing computations not immediately related to behavior, such as preparatory motor activity in a null-space of the motor readout [22, 29]. Finally, neural activity may partially be due to other constraints, for example related to the underlying connectivity [2, 44], the process of learning [56], or stability, i.e., the robustness of the neural dynamics to perturbations [55].
Here we aim for a theoretical understanding of neural representations: What determines how strongly activity and behavioral output variables are related? To this end, we use trained recurrent neural networks (RNNs). In this setting, output variables are determined by the task at hand, and neural activity can be described by its projection onto the principal components (PCs). We show that these networks can operate between two extremes: an “aligned” regime in which the output weights and the largest PCs are strongly correlated. In the second, “oblique” regime, the output weights and the largest PCs are poorly correlated.
What determines the regime a network operates in? We show that quite general considerations lead to a link between the magnitude of output weights and the regime of the network. As a consequence, we can use output magnitude as a control knob for trained RNNs. Indeed, when we train RNN models on different neuroscience tasks, large output weights led to oblique dynamics, and small output weights to aligned dynamics.
We then considered the functional consequences of the two regimes. Building on the concept of feedback loops driving the network dynamics [50, 68], we show that in the aligned regime, the largest PCs and the output are qualitatively similar. In the oblique regimes, in contrast, the two may be qualitatively different. This functional decoupling in oblique networks leads to large freedom for the neural dynamics. Different networks with oblique dynamics thus tend to employ different dynamics for the same tasks. Aligned dynamics, in contrast, are much more stereotypical. Furthermore, as a result of how neural dynamics and output are coupled, oblique and aligned networks react differently to perturbations of the neural activity along the output direction. Further, oblique (but not aligned) networks develop an additional negative feedback loop that suppresses output noise. We finally link our theoretical results to experimental data by showing that these different regimes can be identified in neural recordings from several experiments.
Altogether, our work opens a new perspective relating network dynamics and their output, yielding both important insights for modeling brain dynamics as well as experimentally accessible questions about learning and dynamics in the brain.

Schematic of aligned and oblique dynamics in recurrent neural networks.
A Output generated by both networks. B Neural activity of aligned (top) and oblique (bottom) dynamics, visualized in the space spanned by three neurons. Here, the activity (green) is three-dimensional, but most of the variance is concentrated along the two largest PCs (blue). For aligned dynamics, the output weights (red) are small and lie in the subspace spanned by the largest PCs; they are hence correlated to the activity. For oblique dynamics, the output weights are large and lie outside of the subspace spanned by the largest PCs; they are hence poorly correlated to the activity. C Projection of activity onto the two largest PCs. For oblique dynamics, the output weights are orthogonal to the leading PCs. D Evolution of PC projections over time. For aligned dynamics, the projection on the PCs resembles the output z(t), and reconstructing the output from the largest two components is possible. For the oblique dynamics, such reconstruction is not possible, because the projections oscillate much more slowly than the output.
2. Results
2.1. Aligned and oblique population dynamics
We consider an animal performing a task while both behavior and neural activity are recorded. For example, the task might be to produce a periodic motion, described by the output z(t) of Fig. 1A. For simplicity, we assume that the behavioral output can be decoded linearly from the neural activity [18, 37, 54, 56, 74]. We can thus write
with readout weights wout. The activity of neuron i ∈ {1, …, N} is given by xi (t), and we refer to the vector x as the state of the network.
Neural activity has to generate the output in some subspace of the state space, where each axis represents the activity of one neuron. In the simplest case (Fig. 1B, top), the output is produced along the largest PCs of activity, as shown by the fact that projecting the neural activity x(t) onto the largest PCs returns the target oscillation (Fig. 1D, top). We call such dynamics “aligned” because of the alignment between the subspace spanned by the largest PCs and the output vector (red).
There is, however, another possibility. Neural activity may have many other components not directly related to the output, and these other components may even dominate the overall activity. In this case (Fig. 1B-D, bottom), the two largest PCs are not enough to read out the output, and smaller PCs are needed. We call such dynamics “oblique”, because the subspace spanned by the largest PCs and the output vector are poorly aligned.
We consider these two possibilities as distinct regimes, noting that intermediate solutions are also possible. The actual regime of neural dynamics has important consequences for how one interprets neural recordings. For aligned dynamics, analyzing the dynamics within the largest PCs may lead to insights about the computations generating the output [73]. For oblique dynamics, such an analysis is hampered by the dissociation between the large PCs and the components generating the output [54].

Aligned and oblique dynamics for a cycling task [54].
A A network with two outputs needed to generate either clockwise or anticlockwise rotations, depending on the context (top). Our model RNN (bottom) received a context input pulse, generated dynamics x(t) via recurrent weights W , and yielded output as linear projections of the states. We trained the recurrent weights W with gradient descent. B-C Resulting internal dynamics for two networks with small (top) and large (bottom) output weights, corresponding to aligned and oblique dynamics, respectively. B Dynamics projected on the first 2 PCs and the remaining direction wout, ⊥, of the first output vector (for z1). The output weights are amplified to be visible. Arrowheads indicate the direction of the dynamics. Note that for the large output weights, the dynamics in the first two PCs co-rotated, despite the counter-rotating output. C Output reconstructed from the largest PCs, with dimension D = 2 (full lines) or 8 (dotted). Two dimensions already yield a fit with R2 = 0.99 for aligned dynamics (top), but almost no output for oblique (bottom, R2 = 0.005, no arrows shown). For the latter, a good fit with R2 > 90% is only reached with D = 8.
2.2. Magnitude of output weights controls regime
What determines which regime the network operates in? To probe this, we directly use the notion of alignment between output weights and states. For this, we assume that the output weights are known, and compute their correlation ρ, to the states:
For aligned dynamics, the correlation is large in magnitude, corresponding to the alignment between the large PCs of the neural activity and the output weights (Fig. 1B top). In contrast, for oblique dynamics, this correlation is small (Fig. 1B bottom). Note that the concept of correlation can be generalized to accommodate multiple time points and multidimensional output (see Section 4.3).
This allows us to express the output in terms of this correlation:
This factorization into correlation and magnitudes (as quantified by vector norms) has one immediate consequence. We consider a situation in which a network learns to generate output by adapting its dynamics for given, fixed output weights. The magnitude of the output ∥z∥ is set by the definition of the task. We moreover assume that the magnitude of the activity ∥x∥ is constrained - which we show below is the case for robust RNNs in presence of noise. In this case, correlation and output weight magnitude must compensate each other. For aligned dynamics, we expect small output weights, because these suffice to generate the output via strongly correlated activity. For oblique dynamics, we need large output weights to amplify small correlation (Fig. 1B bottom).

Testing how the magnitude of output weights determines regimes across multiple neuroscience tasks [37, 52, 54, 69].
A Correlation and norms of output weights and neural activity. For each task, we initialized networks with small or large output weights (dark vs light orange). The initial norms ||wout|| are indicated by the dashed lines. Learning does not change the norm dramatically. Note that all y-axes are logarithmically scaled. B Variance of x explained and R2 of reconstructed output for projections of x on increasing number of PCs. Results from one example network trained on the cycling task for each condition are shown. C Number of PCs necessary to reach 90% of the variance of x(t) or of the R2 of the output reconstruction (top/bottom; dotted lines in B). In A, C, violin plots show the distribution over 5 sample networks, with vertical bars indicating the mean and the extreme values (where visible).
We tested whether output weights can serve as a control knob to select dynamical regimes using an RNN model trained on an abstract version of the cycling task introduced in Ref. [54]. The networks were trained to generate a 2D signal that rotated in the plane spanned by two outputs z1 (t) and z2(t) (Fig. 2A). An input pulse at the beginning of each trial indicated the desired direction of rotation. We set up two models with either small or large output weights, and trained the recurrent weights of each with gradient descent.
After both models learned the task (Section 4.1), we projected the network activity into a threedimensional space spanned by the two largest PCs of the dynamics x(t). A third direction, wout,⊥, spanned the remaining part of the first output vector wout,1. The resulting plots, Fig. 2B, corroborate our hypothesis: Small output weights led to aligned dynamics with large correlation between the largest PCs and the output weights. In contrast, the large output weights of the second network were almost orthogonal, or oblique, to the two leading PCs. Further qualitative differences between the two solutions in terms of the direction of trajectories will be discussed below.
Another way to quantify these regimes is by the ability to reconstruct the output from the large PCs of neural activity, as quantified by the coefficient of determination R2. For the aligned network, the projection on the two largest PCs (Fig. 2C, solid) already led to a good reconstruction. For the oblique networks, the two largest PCs were not sufficient, and we needed eight dimensions (Fig. 2C, dashed) to obtain a good reconstruction (R2 > 0.9). In contrast to the differences in these fits, the neural dynamics themselves were much more similar between the networks. Specifically, 90% of the variance was explained by 4 and 5 dimensions for the aligned and oblique networks, respectively.

Variability between learners for the two regimes.
A-B Examples of networks trained on the cycling task with small (aligned) or large (oblique) output weights. The top left and central networks, respectively, are the same as those plotted in Fig. 2. C Dissimilarity between solutions across different tasks. Aligned solutions (dark) were less dissimilar to each other than oblique ones (light). The violin plots show the distribution over all possible different pairs for five samples (mean and extrema as bars).
Can we use the output weights to induce aligned or oblique dynamics in more general settings? We trained RNN models with small or large initial output weights on five different neuroscience tasks. All weights (input, recurrent, and output) were trained using the Adam algorithm (Section 4.1). After training, we measured the three quantities of Eq. (3): magnitudes of neural activity and output weights, and the correlation between the two. The results in Fig. 3A show that across tasks, initialization with large output weights led to oblique solutions (small correlation), and with small output weights to aligned solutions (large correlation). The output weights magnitude generally increased during training if it was initially small, but the initial difference in scale largely remained.
In Fig. 3B-C, we adopted the perspective of Fig. 2C and quantified how well we can reconstruct the output from a projection of x onto its largest D PCs. As expected, for an increasing number D, both the variance of x explained and the quality of output reconstruction increased (Fig. 3B). The manner in which both quantities increased, however, differs between the two regimes. While the variance explained increased similarly in both cases, the quality of the reconstruction increased much more slowly for the model with large output weights. We quantified this phenomenon by comparing the dimensions at which either the variance of x explained or R2 reaches 90%, denoted by Dx,90 and Dfit,90 , respectively.
In Fig. 3C, we compare Dx,90 and Dfit,90 across multiple networks and tasks. Generally, larger output weights led to larger numbers for both. However, the number of PCs necessary to obtain a good reconstruction increased much more drastically for large output weights than the dimension of the data. Thus, the output was less well represented by the large PCs of the dynamics for networks with large output weights, in accordance with our notion of oblique dynamics.
Importantly, reaching the aligned and oblique regimes relies on ensuring robust and stable solutions, which we achieve by adding noise to the dynamics during training. This yields a similar magnitude of neural activity ∥x∥ across networks and tasks (Fig. 3A). We show in Methods, Section 4.6, that learning in simple, noise-free conditions with large output weights can lead to solutions not captured by either aligned or oblique dynamics; those solutions, however, are unstable. Furthermore, we observed that some of the qualitative differences between aligned and oblique dynamics are less pronounced if we initialized networks with small recurrent weights and initially decaying dynamics (Fig. 18).
2.3. Neural dynamics decouple from output for the oblique regime
What are the functional consequences of the two regimes? A hint might be seen in an intriguing qualitative difference between the aligned and oblique solutions for the cycling task in Fig. 2. For the aligned network, the two trajectories for the two different contexts (green and purple) are counterrotating (Fig. 2B, top). This agrees with the output, which also counter-rotates as demanded by the task (Fig. 2A). In contrast, the neural activity of the oblique network co-rotates in the leading two PCs (Fig. 2B, bottom). This is despite the counter-rotating output, since this network also solves the task (not shown). This also indicates why reconstructing the output from the leading two PCs is not possible (Fig. 2C). Taken together, aligned and oblique dynamics differ in the coupling between leading neural dynamics and output. For aligned dynamics, the two are strongly coupled. For oblique dynamics, the two decouple qualitatively.
Such a decoupling for oblique, but not aligned, dynamics leads to a prediction regarding the universality of solutions [35, 45, 72]. For aligned dynamics, the coupling implies that the internal dynamics are strongly constrained by the task. We thus expect different learners to converge to similar solutions, even if their initial connectivity is random and unstructured. In Fig. 4A, we show the dynamics of three randomly initialized aligned networks trained on the cycling task, projected onto the three leading PCs. Apart from global rotations, the dynamics in the three networks are very similar.
For oblique dynamics, the task-defined output exerts weaker constraints on the internal dynamics. Any variability experienced during learning can potentially build up, and eventually create qualitatively different solutions. Three examples of oblique networks solving the cycling tasks indeed show visibly different dynamics (Fig. 4B). Further analysis shows that the models also differ in the frequency components in the leading dynamics (Fig. 17).
The degree of variability between learners depends on the task. The differences observable in the PC projections were most striking for the cycling task. For the flipflop task, for example, solutions were generally noisier in the oblique regime than in the aligned, but did not have observable qualitative differences in either regime (Fig. 22). We quantified the difference between models for the different neuroscience tasks considered before. To compare different neural dynamics, we used a dissimilarity measure invariant under rotation (Section 4.4) [75]. The results are shown in Fig. 4C. Two observations stand out: First, across tasks, the dissimilarity was higher for networks in the oblique regime than for those in the aligned. Second, both overall dissimilarity and the discrepancy between regimes differed strongly between tasks. The largest dissimilarity (for oblique solutions) and the largest discrepancy between regimes was found for the cycling. The smallest discrepancy between regimes was found for the flipflop task. Such difference between tasks is consistent with the differences in the range of possible solutions for different tasks, as reported in Refs. [35, 72].
What are the underlying mechanisms for the qualitative decoupling in oblique networks, but not aligned? For aligned dynamics, we saw that the small output weights demand large activity to generate the output. In other words, the activity along the largest PCs must be coupled to the output. For oblique dynamics, this constraint is not present, which opens the possibility for small components outside the largest PCs to generate the output. If this is the case, we have a decoupling, such as the observed co-rotation in the cycling task, and the possible variability between solutions. We discuss this point in more detail in Section 4.9.
In the following two sections, we will explore how the decoupling between neural dynamics and output for oblique, but not aligned, dynamics influences the response to perturbations and the effects of noise during learning.
2.4. Differences in response to perturbations
To understand how networks respond to external perturbations and internal noise requires some understanding how dynamics are generated. Dynamics of trained networks are mostly generated internally, through recurrent interactions. In robust networks, these internally generated dynamics are a prominent part of the largest PCs (among input-driven components; Sections 4.6 and 4.8). Internally-generated dynamics are sustained by positive feedback loops, through which neurons excite each other. Those loops are low-dimensional, with activity along a few directions of the dynamics being amplified and fed back along the same directions. This results in dynamics being driven by effective feedback loops along the largest PCs (Fig. 5A). As shown above, the largest PCs can either be aligned, or not aligned, with the output weights. This leads to predictions about how aligned and oblique networks differentiate in their responses to perturbations along different directions.

Perturbations differentially affect dynamics in the aligned and oblique regimes.
A Cartoon illustrating the relation between perturbations along output weights or PCs and the feedback loops driving autonomous dynamics. B Output after perturbation for aligned (top) and oblique (bottom) networks trained on the cycling task. The unperturbed network (light red line) yields a sine wave along the first output direction z1. At tp = 9, a perturbation with amplitude ∥Δx∥ = 34 is applied along the output weights (dashed red) or the first PC (dashed-dotted blue). The perturbations only differ in the directions applied. While the immediate response for the oblique network to a perturbation along the output weights is much larger, z1(tp) ≈ 80, the long-term dynamics yield the same output as the unperturbed network. See also Fig. 19 for more details. C Loss for perturbations of different amplitudes for the two networks in B. Lines and shades are means and std devs over different perturbation times tp ∈ [5, 15] and random directions spanned by the output weights (red) or the two largest PCs (blue). The loss is the mean squared error between output and target for t > 20. Gray dot indicates example in B. D Relative susceptibility of networks to perturbation directions for different tasks and dynamical regimes. We measured the area under the curve (AUC) of loss over perturbation amplitude for perturbations along the output weights or two largest PCs. The relative susceptibility is the ratio between the two AUCs. The example in C is indicated by gray triangles.
We apply perturbations to the neural activity at a single point in time: x(t) evolves undisturbed until time tp. At that point, it is shifted to x(tp) + Δx. After the perturbation, we let the network evolve freely and compare this evolution to that of an unperturb ed copy. Such a perturbation mimics a very short optogenetic perturbation applied to a selected neural population [13, 42].
Our intuition about feedback loops suggests that networks respond strongly to a perturbation that is aligned with the directions contributing to the feedback loop, but weakly to a perturbation that is orthogonal to them. In particular, if a perturbation is applied along the output weights, aligned and oblique dynamics should dissociate, with a strong disruption of dynamics for aligned, but not for oblique dynamics (Fig. 5A).
To test this, we compare the response to perturbations along the output direction and along the largest PCs. In Fig. 5B, we show the output after such perturbations for an aligned (top) and an oblique network (bottom) trained on the cycling task. The time point and amplitude are the same for both directions and networks. For each network and type of perturbation, there is an immediate deflection and a long-term response. For both networks, perturbing along the PCs (blue) leads to a long-term phase shift. Only in the aligned network, however, perturbation along the output direction (red) leads to a visible long-term response. In the oblique network, the amplitude of the immediate response is larger, but the long-term response is smaller. Our results for the oblique network, but not for the aligned, agree with simulations of networks generating EMG data from the cycling experiments [58].
To quantify the relative long-term susceptibility of networks to perturbations along output weights or PCs, we sampled from different times tp and different directions in the 2D subspaces spanned either by the two output vectors or the two largest PCs. For each perturbation, we measured the loss of the perturbed networks on the original task (excluding the five time points after tp that correspond to the immediate deflection after the perturbation). Fig. 5C shows that the aligned network is almost equally susceptible to perturbations along the PCs and the output weights. In contrast, the oblique network is much more susceptible to perturbations along the PCs.

Noise suppression along the output direction in the oblique regime.
A A cartoon of the feedback loop structure for aligned (top) and oblique (bottom) dynamics. The latter develops a negative feedback loop which suppresses fluctuations along the output direction. B Comparing the distribution of variance of mean-subtracted activity along different directions for network trained on cycling task (see Fig. 20): PCs of trial-averaged activity (blue), readout (red), and random (grey) directions. For the PCs and output weights, we sampled 100 normalized combinations of either the first two PCs or the two output vectors. For the random directions, we drew 1000 random vectors in the full, N-dimensional space. C Noise compression across tasks as measured by the ratio between variance along output and random directions. The dashed line indicates neither compression nor expansion. Black markers indicate the values for the two examples in B-C. Note the log-scales in B-C.
We repeated this analysis for oblique and aligned networks trained on five different tasks. We computed the area under the curve (AUC) for the two loss profiles in Fig. 5C. We then defined the “relative susceptibility” as the ratio
2.5. Noise suppression for oblique dynamics
In the oblique regime, the output weights are large. To produce the correct output (and not a too large one), the large PCs of the dynamics are almost orthogonal to the output weights. The large output weights, however, pose a robustness problem: Small noise in the direction of the output weights is also amplified at the level of the readout. We show that learning leads to a slow process of sculpting noise statistics to avoid this effect (Fig. 11). Specifically, a negative feedback loop is generated that suppresses fluctuations along the output direction (Fig. 6A, Fig. 10). Because the positive feedback loop that gives rise to the large PCs is mostly orthogonal to the output direction, it remains unaffected by this additional negative feedback loop. A detailed analysis of how learning is affected by noise shows that, for large output weights, the network first learns a solution that is not robust to noise. This solution is then transformed to increasingly stable and oblique dynamics over longer time scales (Sections 4.7 and 4.8).
To illustrate the effect of the negative feedback loop, we consider the fluctuations around trial averages. We take a collection of states x(t) and then subtract the task-conditioned averages

Quantifying aligned and oblique dynamics in experimental data [11, 20, 23, 47, 54].
A Cartoon of a two types of experimental data considered. In motor control experiments (top), we first needed to obtain the output weights wout via linear regression. We then computed the correlation p and the reconstruction dimension Dfit,90, i.e. the number of PCs of x necessary to obtain a coefficient of determination R2 > 90%. In BCI experiments (bottom), the output (cursor velocity) is generated from neural activity x(t) via output weights wout defined by the experimenter. This allowed us to directly compute correlation and fitting dimension. B Correlation ρ (top) and relative fitting dimension Dfit,90/Dx,90 (bottom) for a number of publicly available data sets. The cycling task data (purple) were trial-conditioned averages, the BCI experiments (red) and NLB tasks (yellow) single-trial data. Results for the full data sets are shown as dots, violin plots indicate results for 20 random subsets of 25% of the data points in each data set (see small text below x-axis).
How strongly the activity fluctuates along each direction is quantified by the variance of the projections (Fig. 6B). For both aligned and oblique dynamics, the variance is much larger along the PCs than along random directions. This is not necessarily expected, because the PCA was performed on the averaged activity, hence without the fluctuations. Instead, it is a dynamical effect: the same positive feedback that generates the autonomous dynamics also amplifies the noise (Section 4.8).
The two network regimes, however, dissociate when considering the variance along the output direction. For aligned dynamics, there is no negative feedback loop, and wout is correlated with the PCs. The variance along the output direction is hence similar to that along the PCs, and larger than along random directions. For oblique dynamics, the negative feedback loop suppresses the fluctuations along the output direction, so that they become weaker than along random directions.
In Fig. 6C, we quantify this dissociation across different tasks. We measured the ratio between variance along output and random directions. Aligned networks have a ratio much large than one, indicating that the fluctuations along the output direction are increased due to the autonomous dynamics along the PCs. In contrast, oblique networks have a ratio smaller than 1 for all tasks, which indicates noise compression along the output.
2.6. Aligned and oblique dynamics in experimental settings
For the cycling task, we observed that solutions were qualitatively different for the two regimes, with trajectories either counter- or co-rotating (Fig. 2B). Interestingly, the experimental results of Russo et al. [54] matched the oblique, but not the aligned dynamics. The authors observed co-rotating dynamics in the leading PCs of motor cortex activity despite counter-rotating activity of simultaneously recorded muscle activity. Here we test whether our theory can help to more clearly and quantitatively distinguish between the two regimes in experimental settings.
In typical experimental settings, we do not have direct access to output weights. We can, however, approximate these by fitting neural data to simultaneously recorded behavioral output, such as hand velocity in a motor control experiment (Fig. 7A top). Following the model above, where the output is a weighted average of the states, we reconstruct the output from the neural activity with linear regression. To quantify the dynamical regime, we then compute the correlation ρ between the weights from fitting and the neural data. Additionally, we can also quantify the regime by the “relative fitting dimension” Dfit,90/Dx,90, where Dfit,90 is the number of PCs necessary to recover the output and Dx,90 number to the number of PCs necessary to represent 90% of the variance of the neural data. We computed both the correlation and the relative fitting dimension for different publicly available data sets (Fig. 7B). For details, see Section 4.5.
We started with data sets from two monkeys performing the cycling task [54]. The data contained motor cortex activity, hand movement, and EMG from the arms, all averaged over multiple trials of the same condition. In Fig. 7B, we show results for reconstructing the hand velocity. The correlation was small ρ ∈ {0.10, 0.12}. To obtain a good reconstruction, we needed a substantial fraction of the dimension of the neural data: The relative fitting dimension was Dfit,90/Dx,90 ∈ {0.57, 0.48}. Both quantities are consistent with oblique dynamics, as suggested by the qualitative agreement between oblique RNN solutions and experimental data. Fitting other behavioral outputs yielded similar results (Fig. 23). Our results are in agreement with previous studies, showing that the best decoding directions for EMG are only weakly correlated with the leading PCs of motor cortex activity [59].
We also analyzed data made available through the Neural Latents Benchmark (NLB) [47]. In two different tasks, monkeys needed to perform movements along a screen. In a maze task, the monkeys were trained to follow a trajectory through a maze with their hand [9]. In a random target task (RTT), the monkeys had to reach for a randomly generated target position on a screen, with a successive target point generated once the previous one was reached [36]. In both cases, we reconstructed the output from neural activity on single trials. The correlation was slightly smaller than for the cycling task, ρ ∈ {0.07, 0.08}, and the relative fitting dimension was higher, Dfit,90/Dx,90 ∈ {0.65, 0.62}. These values indicate that the neural dynamics for these two tasks are also oblique.
Do we always observe oblique dynamics in experimental data? One setting in which this may not be the case are brain computer interface (BCI) experiments [56]. In these experiments, monkeys were trained to control a cursor on a screen via activity read out from their motor cortex (Fig. 7A bottom). The output weights were set by the experimenter (so we don’t need to fit). Importantly, the output weights were typically chosen to be spanned by the largest PCs (the “neural manifold”), suggesting aligned dynamics. We tested this hypothesis on three example data sets [11, 20, 23]. We obtained higher correlation values, ρ = 0.2 (Fig. 7B). The relative fitting dimension was much smaller than for the non-BCI data sets, especially for the two largest data sets, where Dfit,90/Dx,90 ∈ {0.03, 0.04}. Both measures are thus consistent with the hypothesis that the neural dynamics arising in BCI experiments is more aligned than those in non-BCI settings.
The results above show that our measures can indeed be used to distinguish dynamical regimes in neural data. It would be interesting to test the differences between BCI and non-BCI data on larger data sets, and different experiments with different dimensions of neural data [19, 64]. Finally, we hypothesize that larger and out-of-manifold weights in the BCI setting, if learnable [43], would lead to qualitatively different solutions. One interesting scenario would be to train a monkey on the cycling task [54] but using a BCI either with small, inside-manifold or large, outside-manifold weights.
3. Discussion
We analyzed the relationship between neural dynamics and behavior, asking to which extent a network’s output is represented in its dynamics. We identified two different limiting regimes: aligned dynamics, in which most of the neural activity in a network is related to its output, and oblique dynamics, where the output is only a small modulation on top of the dominating dynamics. We demonstrated that these two regimes have different functional implications. We also showed how they might arise through learning, and how they relate to experimental findings.
Linking neural activity to external variables is one of the core challenges of neuroscience [24]. In most cases, however, such links are far from perfect. The activity of single neurons can be related in a nonlinear, mixed manner, to task variables [49]. Even when considering populations of neurons, a large fraction of neural activity is not easily explained by external variables [1]. Various explanations have been proposed for this disconnect. In the visual cortex, activity has been shown to be related to “irrelevant” external variables, such as body movements [65]. Follow-up work showed that, in primates, some of these effects can be explained by the induced changes on retinal images [71], but this study still explained only half of the neural variability. An alternative explanation hinges on the redundancy of the neural code, which allows “null spaces” in which activity can visit without affecting behavior [28, 29, 51]. While such accounts can explain why irrelevant activity is al lowed, they do not explain whether and why it is necessary. Here, we showed that merely ensuring stability of neural dynamics can lead to the oblique regime with a decoupling between output and dynamics.
We showed theoretically and in simulations that, when training recurrent neural networks, the magnitude of output weights is a central parameter that controls which regime is reached. This finding is vital for the use of RNNs as hypothesis generators [3, 67, 73], where it is often implicitly assumed that training results in universal solutions [35] (even though biases in the distribution of solutions have been discussed [70]). Here, we show that a specific control knob allows to move between qualitatively different solutions of the same task, thereby expanding the control over the hypothesis space [45, 72]. Note in particular that the default initialization in standard learning frameworks has large output weights, which results in oblique dynamics (or unstable solutions if training without noise, see Methods, Section 4.6).
The role of the magnitude of output weights is also discussed in machine learning settings, where different learning regimes have been found [8, 25, 26, 39]. In particular, “lazy” solutions were observed for large output weights in feed-forward networks. We show in Methods, Section 4.6, that these are unstable for recurrent networks, and are replaced in a second phase of learning by oblique solutions. It would be interesting to see whether an analog to the oblique regime also exists in feed-forward settings. On a broader scale, which learning regime is relevant when modeling biological learning is an open question that is just started to be explored [14].
The particular control knob we studied has an analog in the biological circuit - the synaptic weights. We can thus use experimental data to study whether the brain might rely on oblique or aligned dynamics. Existing experimental work has partially addressed this question. In particular, the work by Russo et al. [54] has been a major inspiration for our study. Our results share some of the key findings from that paper - the importance of stability leading to “untangled” dynamics [66] and a dissociation between hidden dynamics and output. In addition, we suggest a specific mechanism to reach oblique dynamics - via large output weights. Furthermore, we characterize the aligned and oblique regimes along experimentally accessible axes.
We see three avenues for exploring our results experimentally. First, it would be interesting to test whether simultaneous measurements of neural dynamics and muscle activity, such as in Ref. [54], support the idea of noise compression along the output direction. We make a suggestion on how to test this in Fig. 6C. Second, we show how the dynamical regimes dissociate under perturbations along specific directions. Experiments along these lines have recently become possible [7, 13, 53]. Future work is left to combine our model with biological constraints that induce additional effects during perturbations, e.g., through non-normal synaptic connectivity [5, 30, 34, 42]. Third, our work connects to the setting of brain-computer interfaces [20, 56, 74]. In this case, the output weights are directly controlled by the experimenter, hence allowing to test the effect of output weights on network dynamics.
Overall, our results provide an explanation for the plethora of relationships between neural activity and external variables.
Acknowledgements
FS was supported by the Deutsche Forschungsgemeinschaft (German Research Foundation) under Germany’s Excellence Strategy - EXC 2002/1 “Science of Intelligence” - project no. 390523135. OB was supported by the Israeli Science Foundation (grant 1442/21) and an HFSP research grant (RGP0017/2021). SO was supported by the program “Ecoles Universitaires de Recherche” ANR-17-EURE-0017
Data availability
The code for all simulations and generation of figures is available on github, https://github.com/frschu/aligned_oblique_in_rnns/.
4. Methods
4.1. Details on RNN models and training
We consider rate-based RNNs with N neurons. The states x(t) ∈ ℝN are governed by
where W is a recurrent weight matrix and ϕ = tanh a nonlinearity applied element-wise. The network receives a low-dimensional input
The initial output weights are drawn from centered normal distributions with variance
To simulate the noisy RNN dynamics numerically, we used the Euler-Maruyama method [32] with a time step of Δt. We used the Adam algorithm [31] implemented in PyTorch [46]. Apart from the learning rate, we kept the parameters for Adam at the default (some filtering, no weight decay). We selected learning rates and number of training steps such that learning was relatively smooth and converged sufficiently within the given number of trails. Learning rates were set to η = η0/N. Details for all simulation parameters can be found in Table 1.
For the comparisons over different tasks (Figs. 3 to 6), we trained 5 networks for each task. All weights (Wout, W, Win) were adapted. For the example networks trained on the cycling task (Figs. 2, 5 and 6), we used networks with N = 256 neurons and only changed the recurrent weights W. We also trained for longer (5000 training steps) and with a higher learning rate (η = 0.1/N).
4.2. Task details
The tasks the networks were trained on are taken from the neuroscience literature: a cycling task [54], a 3- bit flip-flop task and a “complex sine” task (with input-dependent frequencies) [69], a context-dependent decision making task (“Mante”) [37, 61], and a working memory task comparing the amplitudes of two pulse stimuli (“Romo”) [52, 61]. All tasks have similar structure [62]: A trial of length T starts with a fixation period (length tfix). This is followed by an input for tstim. For the cycling and flip-flop task, the inputs are pulses of amplitude 1; else see below. After a delay tdelay, the output of the network is required to reach an input-dependent value during a time-period tdec. During this decision period, we set target points ti every Δttarget time steps. The loss was defined as the mean squared error between network output and target at these time points. Below we provide further details for each task.
Cycling task
The network receives an initial pulse, whose direction ([1, 0]T or [0, 1]T) determines the sense of direction of the target. The target is a given by a rotation in 2D, ẑ(t) = [sin( a2πft), cos(2πft)]T, with frequency f = 0.1 and a = ±1 for the two directions (clockwise or anticlockwise).
Flip-flop task
The network repeatedly receives input pulses along one of 3 directions, followed by decision periods. In each decision period, the output coordinate corresponding to the last input should reach ±1 depending on the sign of the input. All other coordinates should remain at ±1 as defined by the last time they were triggered. To make sure that this is well-defined, we trigger all inputs at time steps kΔt for k ∈ [Nin] with random signs.
Mante task
Input channels for this task are split into two groups of size Nin/2: half of the channels for the signal and the other half for the context, indicating which of the signal channels is relevant. All signal channels si(t) deliver a constant mean ŝi plus additional white noise: si(t) = ŝi + anoiseηi(t). The mean is drawn uniformly from
Romo task
For the Romo task, the input consists of two input pulses separated by random delays tsd. The amplitude of the inputs is drawn independently from U(0.5, 1.5) with the condition of being at least 0.2 apart (else both are redrawn). During the decision period, the network needs to yield ẑ(t) = ±1, depending on which of the two pulses was larger.
Complex Sine
The target is ẑ(t) = sin(2πft), with frequency f = (1 — a)fmin + afmax, and boundaries fmin = 0.04, fmax = 0.2, and where a ~ U(0, 1). The input is a constant input of amplitude s = a + 0.25.
4.3. Generalized correlation
For Fig. 3A, we used a generalized correlation measure which allows for multiple output dimensions, multiple time points, and noisy data. Consider neural activity of N neurons at P time points stacked into the matrix X = [x(1), …, x(P)]. We assume the states to be centered in time,
with weights Wout ∈ ℝN × D. We define the generalized correlation as
The norm is the Frobenius norm,
The case of one-dimensional output and a single time step discussed in the main text, Eq. (3), is recovered up to the sign, which we discard. Note that in that case, the vectors wout and x(t) should be centered along coordinates to receive a valid correlation. Our numerical results did not change qualitatively when centering across coordinates only or both coordinates and time.
For trajectories with multiple conditions, we stack these instances in a matrix
4.4. Dissimilarity measure
For measuring the dissimilarity between learners in Fig. 4, we apply a measure following Williams et al. [75]. We define the distance between two sets with P data points X, Y ∈ ℝN × P as
where the hat corresponds to centering along the rows, and Q is an orthogonal matrix. The solution to this so-called orthogonal Procrustes’ problem is found via the singular value decomposition
Note that this is more restricted than canonical correlation analysis (CCA), which is also commonly used [16, 17]. In particular, CCA involves whitening the matrices X and Y before applying a rotation [75]. This sets all singular values to 1. For originally low-D data, this mostly means amplifying the noise, unless the data was previously projected onto a small number of PCs. In the latter case, the procedure still removes the information about how much each PC contributes.
4.5. Experimental data
We detail the analyses of neural data in Section 2.6. We made use of publicly available data sets: data from the cycling task of Ref. [54], two data sets available through the Neural Latents Benchmark (NLB) [47], and data from monkeys trained on a center out reaching task with a BCI [11, 20, 23]. For all data sets, we first obtain firing rates X ∈ ℝN × T where N is the number of measured neurons, and T the number of data points, (see Fig. 7B for these numbers). We also collect the simultaneously measured behavior in the matrix Z ∈ ℝD × T. In Fig. 7, we only analyzed cursor or hand velocity for behavior, so that the output dimension is D = 2. See Fig. 23 for similar results for hand position or the largest two PCs of the EMG data recorded for the cycling task.
For the cycling task, the firing rates were binned in 1 ms bins, and convolved with a 25 ms Gaussian filter. The mean firing rate was 22 and 18 Hz for the two monkeys, respectively. For the NLB data, spikes came in 1 ms bins. We binned data to 45 ms bins, and also applied a Gaussian filter with 45 ms width. This increased the quality of the fit, as firing rates where much lower (mean of 5 Hz for both) than in the cycling data set. For the BCI experiments, firing rates came as spike counts in 45 ms bins. The mean firing rate was (45, 45, 55) Hz for the data of Refs. [11, 20, 23], respectively. In agreement with the original BCI experiments, we did not apply a filter to the neural data.
For fitting, we z-scored both firing rates X and output Z across time (but not coordinates). We also added a delay of 100 ms between firing rates and behavior for the cycling and NLB data sets, which increased the quality of the fits. We then fitted the output Z to the firing rates X with ridge regression, with regularization obtained from cross validation. We treated the coefficients as output weights Wout. The trial average data of the cycling tasks was very well fitted for both monkeys, R2 = [0.97, 0.98]. For the NLB tasks with single trial data, the fits were not as good, R2 = [0.73, 0.69]. For two of the BCI data sets [20, 23], the output weights were also given, and we checked that the fit recovers these. For the third BCI data set [11], we did not have access to the output weights, and only access to the cursor velocity after Kalman filtering. Here, fitting yielded R2 = 0.83.
For the fitting dimension in Fig. 7B bottom, we used an adapted definition: Because R2 = 90% is not reached for all data sets, we asked for the number of PCs necessary to obtain 90% of the value obtained for the full data set.

Scaling of the correlation ρ with the number of neurons N in experimental data.
We fitted the output weights to subsets of N neurons and computed the quality of fit (top) and the correlation between the resulting output weight and firing rates (bottom). To compare with random vectors, the correlation is scaled by
We also considered whether the correlation ρ scales with the number of neurons N. In our model, oblique and aligned dynamics can be defined in terms of such a scaling: aligned dynamics have highly correlated output weights and low-dimensional dynamics, so that ρ ~ N0 = 1, i.e. independent of the network size. For oblique dynamics, large output weights with norm wout ~ 1 lead to vanishing correlation,
4.6. Analysis of solutions under noiseless conditions
In the sections below, we explore in detail under which conditions aligned and oblique solutions arise, and which other solutions arise if these conditions are not met.
We first consider small output weights and show that these lead to aligned solutions. Then, for large output weights, we show that without noise, two different, unstable solutions arise. Finally, we consider how adding noise affects learning dynamics. For a linear model, we can solve the dynamics of learning analytically and show how a negative feedback loop arises, that suppresses noise along the output direction. However, the linear model does not yield an oblique solution, so we also consider a nonlinear model for which we show in detail why oblique solutions arise.

Different solutions for networks trained on sine wave task.
All networks have N = 512 neurons. Four regimes: A: aligned for small output weights, B: marginal for large output weights, small recurrent weights, C: lazy for both large output and recurrent weights, D: oblique for large output weights and noise added during training. Left: Output (dark), target (purple dots) and four states (light) of the network after training. Black bars indicate the scales for output and states (length = 1; same for all regimes). The output beyond the target interval t ∈ [1, 21] can be considered as extrapolation. The network in the oblique regime, D, receives white noise during training, and the evaluation is shown with the same noise. Without noise, this network still produces a sine wave (not shown). Right: Projection of states on the first 2 PCs and the orthogonal component wout,⊥ of the output vector. All axes have the same scale, which allows for comparison between the dynamics. Vectors show the (amplified) output weights, dotted lines the projection on the PCs (not visible for lazy and oblique). The insets for the marginal solution (B, left and right) show the dynamics magnified by
We start by analyzing a simplified version of the network dynamics Eq. (4): autonomous dynamics without noise,
with fixed initial condition x(0). We assume a one-dimensional output z(t) and a target ẑ(ti) defined on a finite set of time points ti.
We illustrate the theory with an example of a simple sine wave task (Fig. 9). We demand the network to autonomously produce a sine wave with fixed frequency f = 0.1. At the beginning of the task, the network receives an input pulse that sets the starting point of the trajectory. We set the noise σinit on the initial state x(0) to zero. We define the target as 20 target points in the interval t ∈ [1, 21] (two cycles; purple dots in Fig. 9).
4.6.1. Small weights lead to aligned solutions
For small output weights,
so that
For the sine wave task, training with small output weights converges to an intuitive solution (Fig. 9A). Neural activity evolves on a limit cycle, and the output is a sine wave that extrapolates beyond the training interval. Plotting activity and output weights along the largest two PCs and the remaining direction wout,⊥ confirms substantial correlation, ρ = O(1), as expected. The solution was robust to adding noise after training (not shown). Changes in the initial dynamics or the presence of noise during training did not lead to qualitatively different solutions (Fig. 25). A further look at the eigenvalue spectrum of the trained recurrent weights revealed a pair of complex conjugate outliers corresponding to the limit cycle [38, 61], and a bulk of remaining eigenvalues concentrated on a disk with radius g, Fig. 24.
4.6.2. Large weights, no noise: linearization of dynamics
We now consider learning with large output weights, ∥wout∥ = 1, for noise-less dynamics, Eq. (9). We start with the assumption that the activity changes for each neuron are small,
where ρΔ = corr(wout, Δx). This shows that our assumption of small state changes Δx is consistent - it allows for a solution - and that such small changes need to be strongly correlated to the output weights. Note that we make the distinction between the changes Δx and the final activity x = x0 + Δx, because the latter may be dominated by x0. In the main text, we only consider the correlation between x and wout, because (as we show below) solutions with small Δx are not robust, and the final x will be dominated by Δx.
For now, however, we ignore robustness and continue with the assumption of small Δx. Given this assumptions, we linearize the dynamics around the initial trajectory x0(t):
with diagonal matrix R’(x)ij = δijϕ’(xi) and the weights changes ΔW = W — W0 that induce Δx. Note that we haven’t yet constrained the weight changes ΔW, so that we cannot discard the terms of the kind ΔW Δx. The next steps depend on the initial trajectories x0(t).
4.6.3. Initially decaying dynamics lead to a marginal regime
We first consider networks with decaying dynamics before learning. This is obtained by drawing the initial recurrent weights independently from
To have self-sustained dynamics, the matrix W0 + ΔW must have a leading eigenvalue λ+ with real part above the stability line: ℜλ+ = 1 + ε.
The distance ε > 0 must be small, else the states would become large. To understand how ε needs to scale with N, we turn to a simple model studied before [38, 61]: an autonomously generated fixed point and rank-one connectivity
where
Sigmoidal functions have ϕ‴ 0, e.g. ϕ‴(0) = —2 for ϕ = tanh. Hence λ = 1 + ε, with
The insights from this simplified setting extend to the example of the sine wave task (Fig. 9B). The mo del with large output weights, g = 0.7, and no noise yields a limit cycle. The output extrapolates in time, but the states are very small, scaling as
4.6.4. Initially chaotic dynamics lead to a lazy regime
In contrast to the situation before, initially chaotic dynamics (for g > 1) imply order-one initial states, (x0(t))i = O(1) for all trial times t. The driving term a in Eq. (12) can thus not be ignored and we expect it to be on the same scale as Δx:
The smallest possible weight changes ΔW will be those for which ϕ(x0) yields a maximal response, but other vectors do not yield a strong response. This is captured by the operator norm, ∥ΔW∥2 = max{∥ΔWx∥ : x ∈ ℝN with ∥x∥ = 1}. We can then write ∥ΔWϕ(x0)∥ ~ ∥ΔW∥2∥x0∥, and hence
For the sine wave task, the network with initially chaotic dynamics indeed converges to such a solution (Fig. 9C). The output does not extrapolate beyond the training interval t ∈ [1, 21], and dynamics remain qualitatively similar to those before training. During the training interval, the dynamics also remain close to the initial trajectories (dashed line). Testing the response to small perturbations in x(0) indicated that the dynamics remain chaotic (not shown). No limit cycle was formed, and the spectrum of eigenvalues did not show outliers (Fig. 24).
We called this regime “lazy”, following similar settings in feedforward networks [8, 25]. Note that there, the output z(t) is linearized around the weights at initialization (as opposed to the dynamics, Eq. (9)). This can be done in our case as well:
Demanding ẑ(t) = z(t) yields a linear equation for each time point t. As we have N2 parameters, this system is typically underconstrained. Gradient descent for this linear system leads to the minimal norm solution, which can also be found directly using the Moore-Penrose pseudo-inverse. Numerically, we found that the weights ΔWlin obtained by this linearization are very close to those found by gradient descent (GD) on the nonlinear system, ΔWGD, with Frobenius norm
4.6.5. Marginal and lazy solutions disappear with noise during training
The deduction ab ove hinges on the assumption that the Ax are small. This assumption does not hold if dynamics are noisy, Eq. (4). For marginal dynamics, the noise would push solutions to different attractors or different positions along the limit cycle. For lazy dynamics, the chaotic dynamics would amplify any perturbations along the trajectory (if chaos persist under noise [60]).
We will explore how learning is affected by noise in the sections below. Here we only show that adding noise for our example task abolishes the marginal or lazy solutions and leads to oblique ones (Fig. 9D). We added white noise with amplitude
4.7. Learning with noise for linear RNNs
In the next two sections, we aim to understand how adding noise affects dynamics and training. We start with a simple setting of a linear RNN which allows to track the learning dynamics analytically. Despite its simplicity, this setting already captures a range of observations: different times scales for learning of the bias and variance part, and the rise of a negative feedback loop for noise suppression. Oblique dynamics, however, do not arise, showing that these need autonomously generated, nonlinear dynamics, covered in Section 4.8.

Noise-induced learning for linear network with input-driven fixed point.
Learning separates into fast learning of the bias part of the loss (left), and slow learning reducing the variance part (right). Learning rates are η = η0/N and η = η0, respectively, with η0 = 0.002 and network size N = 256. Learning epochs in the first phase are counted from -1000, so that the second phase starts at 0. In the right column, the initial learning phase with learning time steps multiplied by 1/N is shown for comparison. In all plots, simulations (full lines) are compared with theory (dashed lines). A Loss L = Lbias + Lvar. The two components are obtained by averaging over a batch with 32 examples at each learning step. The full loss is not plotted in the slow phase, because it is indistinguishable from Lvar. B Coefficients of the 2-by-2 coupling matrix M. M11 = λ- is the feedback loop along the output weights,
We consider a linear network driven by a constant input and additional white noise. The dynamics read
with noise ξ as in Eq. (4). We focus on a simplified task, which is to produce a constant nonzero output ẑ once the average dynamics converged, i.e., for large trial times. The output is
The loss separates into two parts, L = Lbias + Lvar with
Learning based on the bias part alone converges to a lazy solution (see Appendix A.2). For no initial weights, W0 = 0, we have to leading order
with
Thus, ΔW is rank one with norm
4.7.1. Learning to reduce noise alone slowly produces a negative feedback loop
Next, we consider learning based on the variance part Lvar alone, that is, to reduce fluctuations in the output while ignoring the mean. The network dynamics are linear, so that δx is an Ornstein-Uhlenbeck process. Its stationary variance ∑ is the solution to the Lyapunov equation
where A — -I + W. The variance part of the loss is then
One can state the gradient of this loss in terms of a second Lyapunov equation [76]:
where Ω is the solution to
Generally, solving both Lyapunov equations analytically is not possible, and even results for random matrices are still sparse (e.g. for symmetric Wigner matrices W [48]). For the purpose of gaining intuition, we thus restrict ourselves to the case of no initial connectivity, which leads to the connectivity spanned by input and output weights only. We start with the simplest case of a rank-one matrix only spanned by the output weights,
The gradient Gvar = 2Ω∑ is therefore in the same subspace as W, and we can evaluate the onedimensional dynamics
where τ is the number of update steps. We assume τ to be continuous, that is, we assume a sufficiently small learning rate and approximate the discrete dynamics of gradient descent with gradient flow. With initial condition λ- (0) = 0, the solution is
which is negative for τ > 0. The loss then decays as
namely at order O(τ-1/3) in learning time. We thus obtained that, during the variance phase of learning, connectivity develops a negative feedback look aligned with the output weights, which serves to suppress output noise. For very long learning times, τ ~ N3/η, learning can in principle reduce the output fluctuations to
4.7.2. Optimizing mean and fluctuations occurs on different time scales
We now consider learning both the mean and the variance part. For zero initial recurrent weights, W0 = 0, the input and output vectors make up the only relevant directions in space. We thus express the recurrent weights as a rank-two matrix,
We computed the learning dynamics in terms of the coefficient matrix M , using the same tools introduced above, and the insight that learning the mean is much faster than learning to reduce the variance. The details are relegated to Appendix A.3, here we summarize the results. We obtained
Before discussing the temporal evolution of the two components, we analyze the structure of the matrix. The eigenvalue M11 = λ- is a negative feedback loop along the eigenvector wout, and
The dynamics unfold on two time scales. First, there is a very fast learning of the bias via the feedforward coefficient
During this phase, the eigenvalue M11 = λ- remains at zero, so that overall weight changes remain small,
The variance part of the loss does not change during this phase.
In a second, slower learning phase, the eigenvalue λ- evolves like above, Eq. (28), the case where only the variance part is learned. The second coefficient compensates for the resulting change in the output. This compensation happens at a much faster time scale (N3 times faster than λ-), so we consider its steady state:
Because of this compensation, the bias part of the loss always remains at zero, and the full loss L(τ) = Lvar(τ) evolves as before, Eq. (29). Meanwhile, the average fixed point does not change anymore; we have
We compare our theoretical predictions against numerical simulations in Fig. 10. For the task, we let the linear network dynamics converge from x(0) = 0 until t = 15, and demand the output z(t) to be at the target z during the interval t ∈ [15, 20]. Because the first learning phase converges N times faster than the second one (with N = 256), using a single learning rate n is problematic. One can either observe the first phase only (for small η), or one risks unstable learning during the first phase (for large η). We thus split learning into two parts with adapted learning rates. For the initial phase, we set a learning rate to η = η0/N, with n0 = 0.002 (Fig. 10 left column). For the second phase, we set η = η0 (Fig. 10 right column). Theory and simulation agree well for both phases. Small deviations can be observed for the second phase and long learning times: nonzero coefficients M21 and M22 and a corresponding increase in the norm

Cartoon illustrating the split into bias and variance components of the loss, and noise suppression along the output direction.
The two-dimensional subspace spanned by U illustrates the main directions under consideration: the PCs of the average trajectories (here only a fixed point
Our results show that learning in this linear system is a hybrid between oblique and aligned during the second phase. We have a large term that compresses the output noise, but a very small, “lazy” correction in the feedforward component that corrects the output, and the states are only marginally changed. Although this system is a somewhat degenerate limiting case, we can still derive important insights. The two most striking features - the different time scales of the two learning processes, and the slow emergence of a negative feedback λ-(τ) ~ — τ1/3 along the output - are found robustly also for nonlinear networks and other tasks.
4.8. Oblique solutions arise for noisy, nonlinear systems
We now examine the origin of oblique solutions. The linear system did not yield oblique solutions, so we turn to a nonlinear model. We consider a 1D flip-flop task, where the network has to yield a constant, nonzero output z depending on the sign of the last input pulse. We further simplify the analysis by only considering the steady state of a network, not how the input pulse mediates the transition. At the output, we thus only consider the average
Because following the learning dynamics in nonlinear networks is difficult, we take a different approach. We develop a mean field theory to show how noise affects the dynamics of a nonlinear network with autonomous fixed points. This allows us to compute the loss components Lbias and Lvar in terms of M. We then show that the minimum of the loss function corresponds to oblique solutions. This leads to a clear interpretation of the mechanisms pushing for oblique solutions. Finally, we show that the theory quantitatively predicts the outcome of learning with gradient descent.
4.8.1. Rank-two connectivity model with fixed point
We first introduce the connectivity model and compute the latent dynamics using mean field theory. We constrain the recurrent connectivity to a rank-two model of the form
In order to understand the dynamics Eq. (4) analytically, we make use of the low-rank connectivity. Following previous work [27, 50], we split the dynamics into two parts: a parallel part x∥ in the subspace spanned by U, and an orthogonal part x⊥. This yields
Notice that the parallel part is partially driven by the orthogonal one, but not vice versa. The parallel part can be expressed in terms of the latent variable
The scaling here ensures that τ is order one if X⊥ has order-one states. Because the readout is assumed to be spanned by U, the output is fully determined by the parallel part. We can write
with projected output weights
We split the latent state into its average over the noise ξ and fluctuations,
We want to understand situations with small loss. For the bias term, the average
To understand the underlying mechanisms in detail, we next explore how the relevant variables
The
with variance
(We have
This is very similar to the noiseless situation discussed briefly above, Eq. (14). However, here the additional variance
In other words, the noise decreases the effective gain
4.8.2. Fluctuations of the latent variable
The fluctuations δτ around a fixed point
where the order-one term ζ contains both the white and the colored noise term. The Jacobian A depends on M and
The averages are again evaluated at the joint variance
Next we compute the covariance of the fluctuations at steady state, see Appendix A.4. The outcome is
where
The second term in Eq. (47) stems from the colored noise component of ζ.
The loss Eq. (39) is obtained by projecting the covariance on the output weights. Importantly, the factor 1/N in the covariance is compensated by the factor N in the loss. Hence, even if the covariance shrinks with increasing network size, the output is still affected at order one. We next explore the implications of minimizing this loss.

Mechanisms behind oblique solutions predicted by mean field theory.
A-D: Mean-field theory predictions as a function of positive feedback strength λ+. The dotted lines indicate λ+,min, the minimal eigenvalue necessary to generate fixed points. A: Norm of fixed point
4.8.3. Minimizing the loss by balancing saturation and negative feedback loop
To gain understanding of how the the output fluctuations responsible for Lbias can be reduced, we first consider the case of a symmetric coefficient matrix M. Simulations of networks trained with gradient descent below show that this approximation is reasonable. For symmetric M, the orthogonal eigenvectors v± with eigenvalues A± of the recurrent weights M are also eigenvectors of A, in that case corresponding to the eigenvalues γ±. This allows to diagonalize the Lyapunov Eq. (48) and yields the solution
For the loss Eq. (39), we further need the relation between the eigenvectors v± and the output weights. Because the fixed point
For more explicit insight, we choose the nonlinearity ϕ(x) = erf(αx) with
as shown in Fig. 12A. The larger eigenvalue of the Jacobian is
so that increasing λ+ also decreases the correlation (Fig. 12B). Finally, we obtain an expression for the variance part of the loss only in terms of the eigenvalues of M :
We show Lvar over λ+ for different negative feedback loop sizes λ- (Fig. 12C) and different network sizes N (Fig. 12D). The first term diverges at the phase transition where the fixed point appears, λ+ ↘ λ+,min. Learning will thus push the weights away from the phase transition towards larger λ+. With such increasing λ+, the fixed point norm increases, and the fixed point rotates away from the output, decreasing the correlation. This in term emphasizes the last term, scaled by 1 — ρ2. The last term is reduced with increasingly negative A- , corresponding to the negative feedback loop that suppresses noise.
Learning can in principle strengthen this feedback further and further, λ- → — -∞ (apart from possible stability issues [27]). However, as for the linear network, Section 4.7, this process takes time. We thus assume λ- to be fixed and search for a minimum across λ+. For λ- < 0, the last term in Eq. (53) increases with increasing λ+ : the effective feedback loop in the full, nonlinear system is weakened by saturation. The loss Lvar thus has a minimum at some moderate λ+.
To illustrate the mechanisms described above, we simulated networks at different λ+ and with λ- = —5. For each λ+ , we compute ρ according to Eq. (52). Setting vout = [1, 0]T, we then set the resulting symmetric M = VΛVT. For each network sample, we then draw independent random projections U ∈ ℝN × 2. We started simulations at either one of the two nonzero fixed points. For λ+ just above λ+,min, the noise pushes activity from one basin of attraction to the next (Fig. 12C). The resulting output becomes centered around zero and independent of the initial condition for long simulation times (Fig. 12F). For the optimal λ+, the trajectories remain close to either one fixed point (Fig. 12F). The output forms two overlapping distributions, each closely matching the target on average (Fig. 12I). For larger λ+, the fixed points become increasingly larger (Fig. 12G). While this decreases the probability to leave the basin of attraction even further, the variance along the output weights becomes larger (slightly wider histograms in Fig. 12J). Note that the mean starts to deviate from the prediction. This is not covered by our theory, and potentially due to the linearization of the fluctuations.
All-in-all, this section revealed a potential path to oblique solutions, initiated by the large fluctuations close to a phase transition, and the interplay between the negative feedback loop and saturation. In the following section, we show that learning via gradient descent actually follows this path, and that the parameters predicted by the minimum of Lvar quantitatively predict solutions from learning.
4.8.4. Oblique solutions from learning are predicted by the mean field theory
We trained neural networks on the fixed points task described above by applying gradient descent to the 2 × 2 matrix M, initialized at M0 = 0. The gradients GM with respect to M are equivalent to those with respect to

Mean field theory predicts learning with gradient descent.
A-C: Learning dynamics with gradient descent for example network with N = 1024 neurons and with noise variance
The output weights were set to
The loss over learning time for one network is shown in Fig. 13A. Learning consisted of two phases: a first phase in which the network did not posses a fixed point, and hence did not match the target on average, Lbias > 0. At some point, Lbias rapidly decreases, and Lvar dominates the overall loss. During the second phase, Lvar slowly decreases, with Lbias hovering around zero.
The coefficients of the matrix M indicate the underlying learning dynamics (Fig. 13B). The coefficient along the output weights, M11, is almost identical with λ-. It continually grows in the negative direction, unaffected by the different phases. Its time course is very similar to the time course observed for the linear system (Fig. 10B). In contrast, the coefficient along the input weights, M22, mirrors the two phases. It grows increasingly fast in the first phase, and saturates during the second phase. Its value is very close to the larger eigenvalue, λ+. The transition between the two phases of learning happens at the phase transition of the dynamical system, when the fixed point emerges for λ+ = λ+,min. The off-diagonal entries show that M is asymmetric during the first phase and becomes symmetric later on. The coefficient M12 corresponds to the feedforward mode mapping the state decaying from x(0) = u2 after the pulse to the output weights
Tracing the fixed point norm
We compared the outcome of learning to our mean field theory. Given that M is approximately symmetric at the end of learning, we directly applied our results from the previous sections, again assuming Lbias = 0. We fixated λ- to match the value at the end of training, and computed the λ+ that minimized Lvar, Eq. (53). The results for the norm and correlation match the those values obtained with gradient descent very closely.
Our high-level description of oblique and aligned dynamics did not involve scaling with network size (Section 2.1). However, the underlying assumption was that the activity of single neurons xi is not vanishing for large networks, i.e., xi = O(1). This would imply
In summary, the last two sections capture the main mechanisms that drive solutions to the oblique regime in nonlinear, noise-driven networks. Although marginally small solutions seem possible for large output weights, such solutions are close to a phase transition, so that the resulting system is very susceptible to noise. To accommodate robust solutions, the trajectories (here the fixed points) must increase in magnitude, while rotating away from the output weights. This further allows to implement a negative feedback loop, that suppresses noise along the output direction. Its efficacy can be reduced by too much saturation, which in turn keeps solutions from growing ever larger. The resulting sweet spot is a network with oblique dynamics.
4.9. Mechanisms behind decoupling of neural dynamics and output
Here we discuss in more detail the underlying mechanisms for the qualitative decoupling in oblique networks. We make a high-level argument that splits into two parts: first the possibility of decoupling in oblique, but not aligned, networks, and second a putative mechanism driving the decoupling.
For the first part, we make the observation that the output in oblique networks can be obtained from the leading components of the dynamics (along the PCs), but importantly also from the non-leading ones. To see this, we unpack the output in Eq. (1) in a slightly different way than before, Eq. (3). Namely, we split the activity vector x into its component along the leading PCs, xlead, and the remaining, trailing component, xtrail. By definition of the leading PCs as the directions of largest variance, the leading component is expected to be large, and the trailing one small. Inserting this decomposition x = xlead + xtrail, into Eq. (1) leads to
with separately defined correlations ρlead = corr(wout, xlead) and ρtrail = corr(wout, xtrail).
For aligned networks, we recover the results from before: wout is small so that the output can only be generated by the leading part with large correlation ρlead. The trailing part is unconstrained, but also not contributing to either the output or the leading dynamics, and hence not of interest.
For oblique networks, wout is large so that the output can be generated by either of the two terms in Eq. (54). The correlation ρlead has to be small because else the output would be too large. The other correlation, ρtrail, can be large, because non-dominant component xtrail is small. Both terms are potentially of the same magnitude, which means both can potentially contribute to the output. If the dominant part alone generates the output, then neural dynamics and output are coupled and the solution is similar to an aligned one (Fig. 14, right). If, however, the non-dominant part alone generates the output, and the correlation ρlead is so small that the dominant part does not contribute to the output, then the dominant part is not constrained by the task (Fig. 14, center right). In that case, the dominant dynamics and the output can decouple qualitatively, and we may see the large variability between learners observed above.

Path to oblique solutions for networks with large output weights.
Left: all networks produce the same output (cf. Fig. 1). Center: Unstable solutions that arise early in learning. For lazy solutions, initial chaotic activity is slightly adapted, without changing the dynamics qualitatively. For marginal solutions, vanishingly small initial activity is replaced with a very small dynamics sufficient to generate the output. Right: With more learning time and noise added during the learning process, stable, oblique solutions arise. The neural dynamics along the largest PCs can be either decoupled from the output (center right), or coupled (right). For decoupled dynamics, the components along the largest PCs (blue subspace) differ qualitatively from those generating the output (same as Fig. 1B, bottom). The dynamics along the largest PCs inherit task-unrelated components from the initial dynamics or randomness during learning. Another possibility are oblique, but coupled dynamics (right). Such solutions don’t inherit task-unrelated components of the dynamics at initialization. They are qualitatively similar to aligned solutions, and the output is generated by a small projection of the output weights onto the largest PCs (dashed orange arrow).
The existence of decoupled solutions for oblique dynamics leads to the second question: Why and when do such solutions arise? Understanding this requires more detailed insights into the learning process. Roughly speaking, learning in the oblique regime has two opposite goals: first, to generate the desired output as fast as possible, and hence to induce changes to activity that are as small as possible (Section 4.6); and second, to generate solutions that are robust and stable, and hence to induce changes in activity that are large enough to not be disrupted by noise (Section 4.8). During the process of learning, small, unstable solutions appear first (Fig. 14 center). These may be highly variable, depending strongly on random initialization or other randomness experienced during learning. Such solutions then slowly solidify into stable solutions, that may inherit the variability of the early solutions (Fig. 14 right).
The process of how learning transforms small, unstable solutions to larger, robust ones is analyzed in Sections 4.7 and 4.8. The details of how this process introduces variability between learners, however, are not discussed there and left for future work.

Solution to linearized network dynamics in the lazy regime.
A Network output for weight changes ΔWlin obtained from linearized dynamics for different network sizes. Each plot shows 10 different networks (one example in bold). Target points in purple. B Output for networks trained with GD from the same initial conditions as those above. C Loss on training set for the linear (lin) and GD solutions. D Frobenius norms of weight changes ΔW of the linear and GD solutions, as well as of the difference between the two. ΔWlin - ΔWGD (dashed green). Grey and black dashed lines for comparison of scales.
A. Supplementary Information
A.1. Linear approximation for lazy learning
We numerically investigated the linearization, Eq. (17), of the output trajectory z(t) in weight changes ΔW, Eq. (17). For this, we used the sine wave task and the same configuration as for the lazy network in Fig. 9(c). In Fig. 15(a), we plot the output of the full network after inserting the linear solution ΔWlin. The fit to the training points got more accurate with increasing network size. However, it did not reach zero error for the networks shown. In general, the error increased with increasing trial time. This indicates that the linear approximation was not valid. This is in line with the nature of the dynamical system: Trajectories become more nonlinear in W with increasing time. The loss on the training set [Fig. 15(c)] summarizes the observations seen before: The linear solution became more accurate with increasing network size.
How close was the linear solution to one found by GD? To test this, we optimized the network with GD starting the same initial condition W0. The output of this network is shown in orange in Fig. 15(b). As seen before [Fig. 9(c)] training points were met, but the dynamics after the last training points remain chaotic. We then quantified the norms of the two solutions and the difference between both. As shown in Fig. 15(d), both solutions had similar norm, which decays as
A.2. Details linear model: Bias only: lazy learning
We consider learning for the linear, input-driven model, Eq. (4). We start by the bias part, which is equivalent to the noise-free model [62]. Disregarding initial transients, the average output is
We make the ansatz of small weight changes, and hence linearize the output in weights (like in feedforward networks [33]):
with Gz,0 the gradient of the output at initialization,
and B0 = (1 - W0)-1. The dot-product signifies sum over all entries, A · B = ∑ijAijBij for two matrices A, B. Because the output is linear, the weight changes are spanned by the gradient, ΔW = b(τ)Gz,0/∥Gz,0∥. Insertion into gradient descent dynamics yields
with
We used that
The corresponding output changes are
where the hats indicate normalized vectors. Consistent with linearization, the norm (both Frobenius and operator) of the changes is small:
To connect with the following paragraph, we consider the case g = 0, so that ΔW = W. This is for reference, and Fig. 10; see details below. We express the results in the orthonormalized bases introduced below:
with
The corresponding activity changes are
A.3. Details linear model:, Bias and variance combined
We now combine results from the previous section, concerning the bias, with results from the Methods, concerning the variance, to derive the full learning dynamics of the linear model. For the case no initial connectivity, W0 = 0, the gradients for both the bias and the variance component, Eqs. (23) and (55), are spanned by the output and input vector. By orthonormalizing, we define
The projection of the deterministic gradient is then
with deterministic output
and projection
For the stochastic part, we need to solve the two Lyapunov Eqs. (21) and (24). They reduce to low-D versions after projection. For the first one, we have
and the projection solves the 2-dimensional Lyapunov equation
with
The full learning dynamics for gradient flow are then given by
with M(0) = 0 and
For a small output scale, the deterministic part is order one, while the stochastic one is order 1/N (through Ω). Assuming η = O(1), learning thus reaches an arbitrary small order-one loss after O(1) updates, with noise adding a 1/N stochastic loss, and noise-driven learning equally adding order 1/N deviations to the entries of M. Hence, learning remains unaffected by the noise unless it is continued for O(ηN2) steps.
For a large output scale, the deterministic part is fast, but only leads to
In detail: For a large readout, σout = 1, we expand M to first order in
The leading order of the output is
which implies that b0 = 0 throughout learning. Under this assumption, we have
Furthermore, the deterministic gradient is
This component of learning converges on the order of η/ε2 learning steps due to the prefactor 1/ε and the fact that b1 enters
The stochastic part of the gradient simplifies by assuming b0 = c0 = d0 = 0. The latter two are justified self-consistently, because neither the deterministic nor the stochastic part contribute to c and d at order one if we start with M = 0. With this assumption, we have
Consistent with our assumption, we only have order-one growth for the entry a. The resulting dynamics are
where τ is the number of learning steps. Further, because the deterministic part is so much faster, we can assume that the second entry, b = b1, is always constrained by the target. Namely, from Eq. (73), we have
Lastly, the first-order corrections for the other entries all vanish due to vanishing initial conditions:
The loss due to the stochastic part is then
while the deterministic part is kept at O(1/N).
During learning, the average states change only very little. To see this, we express the connectivity as
and orthonormalized input weights
The output is then
The initial states are
where the scalar part is also the norm. Thus,
A.4. Details nonlinear autonomous system with noise
This section contains detailed derivations for the average and fluctuations of the latent projection κ. The theoretical treatment of the network dynamics is very similar to that for a network with only a negative feedback loop for “balance” by Kadmon et al. [27]. What is added here is a colored noise term that enters via the variance
The dynamics of the latent variable are
That is, the low-D variable is driven autonomously by itself, but externally by the high-D orthogonal part x⊥. To resolve this into low-D dynamics only, we apply two averages: the trial-conditioned average, which is over the noise ξ, and denoted by the bar,
One can show that the error is small, EU[εa] = 0 and EU[εaεb] ~ 1 /N. For the average, we apply partial integration as in the noise-free case, Eq. (14), and arrive at
where the average on the right-hand side is over the standard normal variable u. In contrast to the noise-free case, the variance is now comprised of two terms,
and
We now split the term Eq. (85) into average and fluctuations around this. We self-consistently assume that all fluctuations, denoted with a δ, are small, which allows to linearize around the respective averages. We have
We only keep terms up to order
The orthogonal part is an Ornstein-Uhlenbeck process with steady state covariance
where the variance is inherited from the white noise,
We take the average over the latent dynamics Eq. (84). Apart from an initial transient, which we discard, this leaves us with a fixed point
We now discuss the fluctuations in Eq. (88). There are two sources of fluctuations, δκ and
All derivatives are evaluated at the averages (we discard the bar in the following lines to keep notation at bay). We further compute
Inserting these terms into the dynamics Eq. (84), we arrive at
The white noise term has variance
For the fluctuations in the populations average of the variance, we use Eq. (90) to compute the covariance function between different time points,
For third line, the average is taken over the independent standard normal variables u, u′. Putting all pieces together, we have
with Jacobian
and noise term ζ(t) with zero average and covariance
with
We inserted the eigenvalue equation,
We now compute the variance of the fluctuations δκ. The computation is simplified by the fact the
The eigenvector is also direction in which the colored noise acts. Because of this, we can write
The first summand stems from the white noise term,
The integral I(t) can be computed to yield
For the steady-state variance, we take t → ∞. The first term yields limt → ∞∑A(t) = ∑A, which can be expressed as the solution of the Lyapunov equation
The integral converges to
Joining all the bits and pieces, we have the steady state covariance
with normalized eigenvector
For the loss, we only need to consider the variance along the output vector. Namely, we have the output fluctuations
with projected output weights
Thus,
A.5. Additional figures

Detailed network dynamics for cycling task for aligned (top) and oblique (bottom) dynamics.
A Cumulative explained variance along PCs taken for the dynamics in both context (orange), and separately for each context (blue, green). For the aligned network, the dynamics of each context separately are essentially 2D. B Projection of dynamics onto the PCs obtained from the separate PCAs. Arrows indicate directions of trajectories. Three arrows divide one output cycle into equal parts, and hence indicate time - the internal dynamics for the oblique network are slower than the output, cf. Fig. 2C. C Histograms of state norms along the limit cycles and distance between limit cycles. The distance at each time point is defined as d12(t) = mint’∥x1(t) — x2(t’)∥, where the subcripts indicate contexts 1 and 2.

Power spectral densities for the six networks shown in Fig. 4.
Dashed orange line indicates the output frequency. Note the high power for non-target frequencies in the first PCs in some of the large output solutions.

Summary of all measures for initially decaying or chaotic networks.

Neural activity in response to the perturbations applied in Fig. 5.
Activity is plotted in the space spanned by the leading two PCs and the output weights wout,1. We first show the unperturbed trajectories in each network (left), then the perturbed ones for perturbations along the first output direction (center), and along the first PC (right). The unperturbed trajectories are also plotted for comparison. Yellow dots indicate the point where the perturbation is applied. All perturbations but the one along the output for aligned lead to trajectories on the same attractor, but potentially with a phase shift. Note that in general, perturbations can also lead to the activity converging on a different attractor. Here, we see a specific example of this happening for the cycling task in the aligned regime.

Noise compression over training time for the cycling task.
Example networks trained on the cycling tasks with σnoise = 0.2 and N = 256. The network at the end of training is analyzed in Fig. 6B. A Full loss (grey) and decomposition in bias (golden) and variance (purple) parts over learning time. The bias part decays rapidly (y-axis is clipped, initial loss L0 = 1.4), whereas the variance part needs many more training steps to decrease. Dotted lines indicate the two examples in B-D. B Output fluctuations

Learning curves for neuroscience tasks analyzed in Fig. 3.
The learning rates are

Example of task variability for the flipflop task.
The titles in each row indicate the spectral radius g of the initial recurrent connectivity (g > 1 for initially chaotic, else decaying activity), and the norm of initial output weights.

Fitting neural activity to other output modalities (hand position, velocity, acceleration, EMG).
Output modality is indicated by the x-ticks, the corresponding datasets by the labels below and the color. A: Quality of fit. B: Correlation, normalized by

Learning curves and eigenvalue spectra for sine wave task.
(a) Loss over training steps for four networks. The light red line is the bias term of the loss for the oblique network. (b) Norm of weight changes over learning time. (c) Eigenvalue spectra of connectivity matrix Wf after training. The dashed line indicates the stability line for the fixed point at the origin.

The aligned regime is robust to the choice of other hyperparameters.

The oblique regime is robust to the choice of other hyperparameters.

Training history leads to order-one fixed point norm.
We trained RNNs on the example fixed point task. Similar to Fig. 13, but with smaller noise, σnoise = 0.2, and with learning rate increased by 2 and number of epochs by 2.5. A-C: Learning dynamics with gradient descent for one network with N = 256 neurons. The first 400 epochs are dominated by Lbias, and M11 ≈ λ- becomes positive. The negative feedback loop, λ- < 0 only forms later in learning. The matrix M does not become symmetric during learning. D-F: Fixed point norm and correlation for different N evaluated when λ- = 0 (left) and at the end of learning (right). The time points are indicated by a square and triangle in C, respectively. At λ- = 0, simulation and theory agree for the scaling:
References
- [1]Dynamics of ongoing activity: explanation of the large variability in evoked cortical responsesScience 273:1868–1871
- [2]Instantaneous modulation of gamma oscillation frequency by balancing excitation with inhibitionNeuron 62:566–577
- [3]Recurrent neural networks as versatile tools of neuroscience researchCurrent Opinion in Neurobiology 46:1–6
- [4]Shaping dynamics with multiple populations in low-rank recurrent networksNeural computation 33:1572–1615
- [5]Coding with transient trajectories in recurrent neural networksPLoS computational biology 16:e1007655
- [6]State-dependent computations: spatiotemporal processing in cortical networksNature Reviews Neuroscience 10:113–125
- [7]Single-neuron perturbations reveal feature-specific competition in V1Nature 567:334–340
- [8]On lazy training in differentiable programmingAdvances in Neural Information Processing Systems :2937–2947
- [9]Cortical preparatory activity: representation of movement or first cog in a dynamical machine?Neuron 68:387–400
- [10]Neural population dynamics during reachingNature 487:51–56
- [11]Stabilization of a brain-computer interface via the alignment of lowdimensional spaces of neural activityNature biomedical engineering 4:672–685
- [12]The role of population structure in computations through neural dynamicsbioRxiv :2020–7
- [13]Attractor dynamics gate cortical information flow during decisionmakingNature Neuroscience 24:843–850
- [14]Orthogonal representations for robust context-dependent task performance in brains and neural networksNeuron 110:1258–1270
- [15]Residual dynamics resolves recurrent contributions to neural computationbioRxiv
- [16]Cortical population activity within a preserved neural manifold underlies multiple motor behaviorsNature communications 9:1–13
- [17]Long-term stability of cortical population dynamics underlying consistent behaviorNature neuroscience 23:260–270
- [18]Neural manifolds for the control of movementNeuron 94:978–984
- [19]A theory of multineuronal dimensionality, dynamics and measurementbioRxiv doi: 10.1101/214262v2
- [20]Learning by neural reassociationNature neuroscience 21:607–616
- [21]Microstructure of a spatial map in the entorhinal cortexNature 436:801–806
- [22]Optimal control of transient dynamics in balanced networks supports generation of complex movementsNeuron 82:1394–1406
- [23]Constraints on neural redundancyElife 7:e36774
- [24]Receptive fields, binocular interaction and functional architecture in the cat’s visual cortexThe Journal of physiology 160:106
- [25]Neural tangent kernel: Convergence and generalization in neural networksAdvances in neural information processing systems :8571–8580
- [26]Saddle-to-Saddle Dynamics in Deep Linear Networks: Small Initialization Training, Symmetry, and SparsityarXiv:2106.15933
- [27]Predictive coding in balanced neural networks with noise, chaos and delaysAdvances in Neural Information Processing Systems 33:16677–16688
- [28]Optimal anticipatory control as a theory of motor preparation: A thalamo-cortical circuit modelNeuron 109:15671581
- [29]Cortical activity in the null space: permitting preparation without movementNature neuroscience 17:440–448
- [30]Distributing task-related neural activity across a cortical network through task-independent connectionsNature Communications 14:2851
- [31]Adam: A method for stochastic optimizationarXiv: 1412.6980
- [32]Numerical Solution of Stochastic Differential EquationsSpringer
- [33]On the linearity of large non-linear models: when and why the tangent kernel is constantAdvances in Neural Information Processing Systems 33
- [34]Thalamic control of cortical dynamics in a model of flexible motor sequencingCell reports 35:109090
- [35]Universality and individuality in neural dynamics across large populations of recurrent networksAdvances in Neural Information Processing Systems :15603–15615
- [36]Superior arm-movement decoding from cortex with a new, unsupervised-learning algorithmJournal of neural engineering 15:026010
- [37]Context-dependent computation by recurrent dynamics in prefrontal cortexNature 503:78
- [38]Linking connectivity, dynamics, and computations in low-rank recurrent neural networksNeuron 99:609–623
- [39]A mean field view of the landscape of two-layer neural networksProceedings of the National Academy of Sciences 115:E7665–E7671
- [40]Single-trial neural dynamics are dominated by richly varied movementsNature neuroscience 22:1677–1686
- [41]The hippocampus as a spatial map: preliminary evidence from unit activity in the freely-moving ratBrain research
- [42]Direct neural perturbations reveal a dynamical mechanism for robust computationbioRxiv
- [43]New neural activity patterns emerge with long-term learningProceedings of the National Academy of Sciences 116:15210–15215
- [44]Instantaneous correlation of excitation and inhibition during ongoing and sensory-evoked activitiesNature neuroscience 11:535–537
- [45]A new theoretical framework jointly explains behavioral and neural variability across subjects performing flexible decision-makingbioRxiv
- [46]Automatic differentiation in PyTorchnone
- [47]Neural Latents Benchmark ‘21: Evaluating latent variable models of neural population activityAdvances in Neural Information Processing Systems :1
- [48]Controllability gramian spectra of random networks2016 American Control Conference (ACC) IEEE :3874–3879
- [49]The importance of mixed selectivity in complex cognitive tasksNature 497:585
- [50]Local dynamics in trained recurrent neural networksPhysical Review Letters 118:258101
- [51]Motor learning with unstable neural representationsNeuron 54:653–666
- [52]Neuronal correlates of parametric working memory in the prefrontal cortexNature 399:470
- [53]All-optical interrogation of neural circuits in behaving miceNature Protocols 17:1579–1620
- [54]Motor cortex embeds muscle-like commands in an untangled population responseNeuron 97:953–966
- [55]Neural trajectories in the supplementary motor area and motor cortex exhibit distinct geometries, compatible with different classes of computationNeuron 107:745–758
- [56]Neural constraints on learningNature 512:423–426
- [57]Towards the neural population doctrineCurrent opinion in neurobiology 55:103–111
- [58]Motor cortex activity across movement speeds is predicted by network-level strategies for generating muscle activityElife 11:e67620
- [59]Cortical control of virtual self-motion using task-specific subspacesJournal of Neuroscience 42:220–239
- [60]Optimal sequence memory in driven random networksPhysical Review X 8:041029
- [61]Dynamics of random recurrent networks with correlated low-rank structurePhysical Review Research 2:013111
- [62]The interplay between randomness and structure during learning in RNNsAdvances in Neural Information Processing Systems Curran Associates, Inc. :13352–13362Ed. by
- [63]Chaos in random neural networksPhysical Review Letters 61:259
- [64]High-dimensional geometry of population responses in visual cortexNature 571:361–365
- [65]Spontaneous behaviors drive multidimensional, brainwide activityScience 364:eaav7893
- [66]Quality of internal representation shapes learning performance in feedback neural networksPhysical Review Research 3:013176
- [67]Neural circuits as computational dynamical systemsCurrent Opinion in Neurobiology 25:156–163
- [68]Generating coherent patterns of activity from chaotic neural networksNeuron 63:544–557
- [69]Opening the black box: low-dimensional dynamics in highdimensional recurrent neural networksNeural computation 25:626–649
- [70]A neural network that finds a naturalistic solution for the production of muscle activityNature Neuroscience 18:1025
- [71]Activity in primate visual cortex is minimally driven by spontaneous movementsbioRxiv :2022–9
- [72]Charting and navigating the space of solutions for recurrent neural networksAdvances in Neural Information Processing Systems 34
- [73]Computation through neural population dynamicsAnnual Review of Neuroscience 43:249–275
- [74]High-performance brain-to-text communication via handwritingNature 593:249–254
- [75]Generalized Shape Metrics on Neural RepresentationsAdvances in Neural Information Processing Systems 34
- [76]Adjoint methods of sensitivity analysis for Lyapunov equationStructural and Multidisciplinary Optimization 53:225–237
- [77]Feature Learning in Infinite-Width Neural NetworksarXiv: 2011.14522
Article and author information
Author information
Version history
- Preprint posted:
- Sent for peer review:
- Reviewed Preprint version 1:
- Reviewed Preprint version 2:
- Version of Record published:
Copyright
© 2024, Schuessler et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
- views
- 902
- downloads
- 105
- citations
- 2
Views, downloads and citations are aggregated across all versions of this paper published by eLife.