Introduction

Understanding how motor cortex encodes and decodes movement behaviors is a fundamental goal of neuroscience (Kriegeskorte and Douglas, 2019; Saxena and Cunningham, 2019). Here, we define behaviors as behavioral variables of interest measured within a given task, such as arm kinematics during a motor control task; we employ terms like ‘behaviorally-relevant’ and ‘behaviorally-irrelevant’ only regarding such measured behavioral variables. However, achieving this goal faces significant challenges because behaviorally-relevant neural responses are entangled with behaviorally-irrelevant factors such as responses to other variables of no interest (Fusi et al., 2016; Rigotti et al., 2013) and ongoing noise (Azouz and Gray, 1999; Faisal et al., 2008). Generally, irrelevant signals hinder the accurate investigation of the relationship between neural activity and movement behaviors. This raises concerns about whether irrelevant signals conceal critical facts about neural encoding and decoding mechanisms.

If the answer is yes, a natural question arises: what critical facts about neural encoding and decoding would irrelevant signals conceal? In terms of neural encoding, irrelevant signals may mask some small neural components, making their encoded information difficult to detect (Moreno-Bote et al., 2014), thereby misleading us to neglect the role of these signals and leading to a partial understanding of neural mechanisms. For example, at the single-neuron level, weakly-tuned neurons are often assumed useless and not analyzed (Georgopoulos et al., 1986; Hochberg et al., 2012; Wodlinger et al., 2014; Inoue et al., 2018); at the population level, neural signals composed of lower variance principal components (PCs) are typically treated as noise and discarded (Churchland et al., 2012; Gallego et al., 2018, 2020; Cunningham and Yu, 2014). So are these ignored signals truly useless, or do they only appear so because they are obscured by irrelevant signals? And what role do these ignored signals play? In terms of neural decoding, irrelevant signals would significantly complicate the information readout (Pitkow et al., 2015; Yang et al., 2021), potentially hindering the discovery of the true readout mechanism of behaviorally-relevant responses. Specifically, in motor cortex, in what form (linear or nonlinear) downstream neurons read out behavioral information from neural responses is an open question. The linear readout is biologically plausible and widely used (Georgopoulos et al., 1986; Hochberg et al., 2012; Wodlinger et al., 2014), but recent studies (Glaser et al., 2020; Willsey et al., 2022) demonstrate that nonlinear readout outperforms linear readout. So which readout scheme is more likely to be used by the motor cortex? And are irrelevant signals the culprit behind this performance gap? Unfortunately, all the above issues remain unclear.

One approach to address the above issues is to accurately separate behaviorally-relevant and irrelevant signals at the single-neuron level and then analyze noise-free behaviorally-relevant signals, which enables us to gain a more accurate and comprehensive understanding of the underlying neural mechanisms. However, this approach is hampered by the fact that the ground truth of behaviorally-relevant signals is unknown, which makes the definition, extraction, and validation of behaviorally-relevant signals a challenging task. As a result, methods of accurate separation remain elusive to date. Existing methods for extracting behaviorally-relevant patterns mainly focus on the latent population level (Churchland et al., 2012; Sani et al., 2021; Pandarinath et al., 2018; Yu et al., 2008; Zhou and Wei, 2020; Keshtkaran et al., 2022) rather than the single-neuron level, and they extract neural activities based on assumptions about specific neural properties, such as linear or nonlinear dynamics (Sani et al., 2021; Pandarinath et al., 2018). Although these methods have shown promising results, they fail to capture other parts of behaviorally-relevant neural activity that do not meet their assumptions, thereby providing an incomplete picture of behaviorally-relevant neural activity. To overcome these limitations and obtain accurate behaviorally-relevant signals at the single-neuron level, we propose a novel framework that defines, extracts, and validates behaviorally-relevant signals by simultaneously considering such signals’ encoding properties (behaviorally-relevant signals should be similar to raw signals to preserve the underlying neuronal properties) and decoding properties (behaviorally-relevant signals should contain as much behavioral information as possible) (see Methods and Fig. 1). This framework establishes a prerequisite foundation for the subsequent detailed analysis of neural mechanisms.

Schematic illustration of extracting and validating behaviorally-relevant signals.

a-e, The ideal decomposition of raw signals. a, The temporal neuronal activity of raw signals, where the x-axis denotes time and the y-axis represents firing rate. Raw signals are decomposed into relevant (b) and irrelevant (d) signals. The red dotted line indicates the decoding performance of raw signals. The red and blue bars represent the decoding performance of relevant and irrelevant signals, respectively. The purple bar represents the generating performance of relevant signals, which measures the neural similarity between generated signals and raw signals. The longer the bar, the better the performance. The ground truth of relevant signals decodes information perfectly (c, red bar) and is similar to raw signals to some extent (c, purple bar), and the ground truth of irrelevant signals contains little behavioral information (e, blue bar). f-h, Three different cases of behaviorally-relevant signal distillation. f, When the model is biased toward generating relevant signals that are similar to raw signals, it will achieve high generating performance, but the decoding performance will suffer due to the inclusion of too many irrelevant signals. As it is difficult for models to extract complete relevant signals, the residuals will also contain some behavioral information. g, When the model is biased toward generating signals that prioritize decoding over similarity to raw signals, it will achieve high decoding performance, but the generating performance will be low. Meanwhile, the residuals will contain a significant amount of behavioral information. h, When the model balances the trade-off between the decoding and generating capabilities of relevant signals, both decoding and generating performance will be good, and the residuals will contain only a little behavioral information.

Here, we conducted experiments using datasets recorded from the motor cortex of three monkeys performing different reaching tasks, where the behavioral variable is movement kinematics. After signal separation by our approach, we first explored how the presence of behaviorally-irrelevant signals affects the analysis of neural activity. We found that behaviorally-irrelevant signals account for a large amount of trial-to-trial neuronal variability and are evenly distributed across the neural dimensions of behaviorally-relevant signals. Then we explored whether irrelevant signals conceal some facts of neural encoding and decoding. For neural encoding, irrelevant signals obscure the behavioral information encoded by neural responses, especially for neural responses with a large degree of nonlinearity. Surprisingly, neural responses that are usually ignored (weakly-tuned neurons and neural signals composed of small variance PCs) actually encode rich behavioral information in complex nonlinear ways. These responses underpin an unprecedented neuronal redundancy and reveal that movement behaviors are distributed in a higher-dimensional neural space than previously thought. In addition, we found that the integration of smaller and larger variance PCs results in a synergistic effect, allowing the smaller variance PC signals that cannot be linearly decoded to significantly enhance the linear decoding performance, particularly for finer speed control. This finding suggests that lower variance PC signals are involved in regulating precise motor control. For neural decoding, irrelevant signals complicate information readout. Strikingly, when uncovering small neural components obscured by irrelevant signals, linear decoders can achieve decoding performance comparable to that of nonlinear decoders, providing strong evidence for the presence of linear readout in motor cortex. Together, our findings reveal unexpected complex encoding but simple decoding mechanisms in the motor cortex. Finally, our study also has implications for developing accurate and robust brain-machine interfaces (BMIs) and, more generally, provides a powerful framework for separating behaviorally-relevant and irrelevant signals, which can be applied to other cortical data to uncover more neural mechanisms masked by behaviorally-irrelevant signals.

Results

Framework for defining, extracting, and validating behaviorally-relevant neural signals

What are behaviorally-relevant neural signals?

Since the ground truth of behaviorally-relevant signals is unknown, the definition of behaviorally-relevant signals is not well established yet. In this study, we propose that behaviorally-relevant signals should meet two requirements: (1) they should be similar to raw signals in order to preserve the underlying neuronal properties (encoding requirement), and (2) they should contain as much behavioral information as possible (decoding requirement). If the signals meet both of these requirements, we consider them to be effective behaviorally-relevant signals.

How to extract behaviorally-relevant signals?

One way to extract behaviorally-relevant signals is to use a distillation model to generate them from raw signals, while considering the remaining signals as behaviorally-irrelevant. However, due to the unknown ground truth of behaviorally-relevant signals, a key challenge for the model is to determine the optimal degree of similarity between the generated signals and raw signals. If the generated signals are too similar to raw signals, they may contain a large amount of irrelevant information, which would hinder the exploration of neural mechanisms. Conversely, if the generated signals are too dissimilar to raw signals, they may lose behaviorally-relevant information, also hindering the exploration of neural mechanisms. To overcome this challenge, we exploited the trade-off between the similarity of generated signals to raw signals (encoding requirement) and their decoding performance of behaviors (decoding requirement) to extract effective behaviorally-relevant signals (for details, see Methods). The core assumption of our model is that behaviorally-irrelevant signals are noise relative to behaviorally-relevant signals, and thereby irrelevant signals would degrade the decoding generalization of generated behaviorally-relevant signals. Generally, the distillation model is faced with three cases: a bias toward generating (Fig. 1f), a bias toward decoding (Fig. 1g), and a proper trade-off (Fig. 1h). If the distillation model is biased toward extracting signals similar to raw signals, the distilled behaviorally-relevant signals will contain an excessive amount of behaviorally-irrelevant information, affecting the decoding generalization of these signals (Fig. 1f). If the model is biased toward extracting parsimonious signals that are discriminative for decoding, the distilled signals will not be similar enough to raw signals, and some redundant but useful signals will be left in the residuals (Fig. 1g). Neither of these two cases is desirable because the former loses decoding performance, while the latter loses some useful neural signals, which are not conducive to our subsequent analysis of the relationship between behaviorally-relevant signals and behaviors. The behaviorally-relevant signals we want should be similar to raw signals and preserve the behavioral information maximally, which can be obtained by balancing the encoding and decoding properties of generated behaviorally-relevant signals (Fig. 1h).
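To make the trade-off concrete, the sketch below writes it as a composite objective with a reconstruction term (encoding requirement) and a decoding term (decoding requirement). This is a minimal illustration assuming mean-squared-error losses and a hypothetical trade-off weight alpha; it is not the actual d-VAE objective, which is defined in Methods.

```python
import torch
import torch.nn.functional as F

def distillation_loss(x_raw, x_relevant, y_true, decoder, alpha=1.0):
    """Hypothetical composite objective balancing the two requirements.

    x_raw      : raw firing rates, shape (batch, n_neurons)
    x_relevant : generated behaviorally-relevant signals, same shape
    y_true     : behavioral variables (e.g., velocity), shape (batch, n_dims)
    decoder    : module mapping x_relevant to predicted behavior
    alpha      : assumed trade-off weight (illustrative, not from the paper)
    """
    # Encoding requirement: generated signals should resemble raw signals.
    reconstruction = F.mse_loss(x_relevant, x_raw)
    # Decoding requirement: generated signals should predict behavior well.
    decoding = F.mse_loss(decoder(x_relevant), y_true)
    # A small alpha biases the model toward generating (Fig. 1f), a large
    # alpha toward decoding (Fig. 1g); a proper alpha balances both (Fig. 1h).
    return reconstruction + alpha * decoding
```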

How to validate behaviorally-relevant signals?

To validate the effectiveness of the distilled signals, we proposed three criteria. The first criterion is that the decoding performance of the behaviorally-relevant signals (red bar, Fig. 1) should surpass that of raw signals (red dotted line, Fig. 1). Since decoding models, such as deep neural networks, are more prone to overfit noisy raw signals than behaviorally-relevant signals, the distilled signals should demonstrate better decoding generalization than the raw signals. The second criterion is that the behaviorally-irrelevant signals should contain minimal behavioral information (blue bar, Fig. 1). This criterion can assess whether the distilled signals maximally preserve behavioral information from the opposite perspective and effectively exclude undesirable cases, such as over-generated and under-generated signals. Specifically, in the case of over-generation, suppose z = x + y, where z, x, and y represent raw, relevant, and irrelevant signals, respectively. If extra signals m that do not exist in the real behaviorally-relevant signals are added to the distilled relevant signals, that is, x̂ = x + m, then the corresponding residuals will be equal to the ideal irrelevant signals y plus the negative extra signals −m, namely, z − x̂ = y − m; thus the residuals contain the amount of information carried by the negative extra signals −m. Similarly, in the case of under-generation, if the distilled behaviorally-relevant signals are incomplete and lose some useful information, this lost information will also be reflected in the residuals. In these cases, the distilled signals are not suitable for analysis. The third criterion is that the distilled behaviorally-relevant signals should be similar to raw signals to maintain essential neuronal properties (purple bar, Fig. 1). If the distilled signals do not resemble raw signals, they will not retain the critical characteristics of raw signals and are therefore not qualified for subsequent analysis. Overall, if the distilled signals satisfy the above three criteria, we consider them to be effective.
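In practice, the three criteria can be checked with a few lines of code. The sketch below uses a cross-validated ridge regression as a simple linear stand-in for the KF/ANN decoders used in the paper, and an assumed tolerance of 0.05 for residual information; both the stand-in decoder and the tolerance are illustrative choices, not values from the paper.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict

def decode_r2(signals, behavior):
    """Cross-validated decoding R2 of behavior from neural signals
    (a linear stand-in for the KF/ANN decoders used in the paper)."""
    pred = cross_val_predict(Ridge(), signals, behavior, cv=5)
    return r2_score(behavior, pred)

def check_criteria(raw, relevant, behavior, tol=0.05):
    """raw, relevant: (n_samples, n_neurons); behavior: (n_samples, n_dims)."""
    residual = raw - relevant                    # behaviorally-irrelevant part
    c1 = decode_r2(relevant, behavior) > decode_r2(raw, behavior)  # criterion 1
    c2 = decode_r2(residual, behavior) < tol     # criterion 2 (assumed tolerance)
    c3 = r2_score(raw, relevant)                 # criterion 3: neural similarity
    return c1, c2, c3
```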

d-VAE extracts effective behaviorally-relevant signals

To demonstrate the effectiveness of our model (d-VAE) in extracting behaviorally-relevant signals, we conducted experiments on a synthetic dataset where the ground truth of relevant and irrelevant signals is already known (see Methods) and three benchmark datasets with different paradigms (Fig. 2a,e,i; see Methods for details), and compared d-VAE with four other distillation models: pi-VAE (Zhou and Wei, 2020), PSID (Sani et al., 2021), LFADS (Pandarinath et al., 2018), and a vanilla VAE (Kingma and Welling, 2013). Specifically, we first applied these distillation models to raw signals to obtain the distilled behaviorally-relevant signals, considering the residuals as behaviorally-irrelevant signals. We then evaluated the decoding R2 between the predicted and actual velocity for the two signal partitions using a linear Kalman filter (KF) and a nonlinear artificial neural network (ANN), and measured the neural similarity between behaviorally-relevant and raw signals. In addition, we compared the decoding performance of behaviorally-relevant signals extracted by d-VAE with that of behaviorally-relevant embeddings extracted by CEBRA (a non-generative method which outperforms pi-VAE) (Schneider et al., 2022).

Evaluation of separated signals.

a, The obstacle avoidance paradigm. b, The decoding R2 between true velocity and predicted velocity of raw signals (purple bars with slash lines) and behaviorally-relevant signals obtained by d-VAE (red), PSID (pink), pi-VAE (green), LFADS (blue), and VAE (light green) on dataset A. Error bars denote mean ± standard deviation (s.d.) across five cross-validation folds. Asterisks represent significance of Wilcoxon rank-sum test with *P < 0.05, **P < 0.01. c, Same as b, but for behaviorally-irrelevant signals obtained by the five different methods. d, The neural similarity (R2) between raw signals and behaviorally-relevant signals extracted by d-VAE, PSID, pi-VAE, LFADS, and VAE. Error bars represent mean ± s.d. across five cross-validation folds. Asterisks indicate significance of Wilcoxon rank-sum test with **P < 0.01. e-h and i-l, Same as a-d, but for dataset B with the center-out paradigm (e) and dataset C with the self-paced reaching paradigm (i). m, The firing rates of raw signals and distilled signals obtained by d-VAE in five held-out trials under the same condition of dataset B.

Overall, d-VAE successfully extracts effective behaviorally-relevant signals that meet the three criteria outlined above on both synthetic and real data. On the synthetic data (Supplementary Fig. S1), results show that d-VAE can indeed utilize a proper trade-off between generating and decoding to extract effective neural signals that are similar to the ground truth signals (Supplementary Fig. S1a-g) and outperforms other distillation models (Supplementary Fig. S1h-k). In the obstacle avoidance task (Fig. 2a), the monkey is required to move the ball from the start point (yellow) to the target point (blue) without hitting the obstacle. For the decoding performance of behaviorally-relevant signals (Fig. 2b), the signals distilled by d-VAE outperform the raw signals (purple bars with slash lines) and the signals distilled by all other distillation models (PSID, pink; pi-VAE, green; LFADS, blue; and VAE, light green) with the KF as well as the ANN. For the decoding performance of behaviorally-irrelevant signals (Fig. 2c), the behaviorally-irrelevant signals obtained by d-VAE achieve the lowest decoding performance compared with those obtained by other approaches. Therefore, the combination of d-VAE's highest decoding performance for behaviorally-relevant signals and lowest decoding performance for behaviorally-irrelevant signals demonstrates its superior ability to extract behaviorally-relevant signals from noisy signals. For the neural similarity between behaviorally-relevant and raw signals (Fig. 2d), the distilled signals obtained by d-VAE achieve the highest performance among competitors (P < 0.01, Wilcoxon rank-sum test). Similar results were obtained for the center-out task (Fig. 2e-h) and the self-paced reaching task (Fig. 2i-l), indicating the consistency of d-VAE's distillation ability across a range of motor tasks. To provide a more intuitive illustration of the similarity between raw and distilled signals, we displayed the firing rates of neuronal activity in five trials under the same condition (Fig. 2m), and the results clearly show that the firing pattern of distilled signals is similar to that of the corresponding raw signals. Finally, we compared d-VAE with CEBRA on the four datasets; results show that d-VAE achieves comparable performance with CEBRA using the ANN decoder but outperforms CEBRA with the KF decoder (Supplementary Fig. S2), demonstrating that d-VAE can effectively extract behavioral information.

In summary, d-VAE distills effective behaviorally-relevant signals that maximally preserve behavioral information and are similar to raw signals. Meanwhile, the behaviorally-irrelevant signals discarded by d-VAE contain little behavioral information. Therefore, these signals are reliable for exploring the encoding and decoding mechanisms of relevant signals.

How do behaviorally-irrelevant signals affect the analysis of neural activity at the single-neuron level?

Following signal separation, we first explored how behaviorally-irrelevant signals affect the analysis of neural activity at the single-neuron level. Specifically, we examined the effect of irrelevant signals on two critical properties of neuronal activity: the preferred direction (PD) (Georgopoulos et al., 1986) and trial-to-trial variability. Our objective was to determine how irrelevant signals affect the PD of neurons and whether irrelevant signals contribute significantly to neuronal variability.

To explore how irrelevant signals affect the PD of neurons, we first calculated the PD of both raw and distilled signals separately and then quantified the PD deviation by the angle difference between these two signals. Results show that the PD deviation increases as the neuronal R2 decreases (red curve, Fig. 3a,e and Supplementary Fig. S3a); neurons with larger R2 (strongly-tuned neurons) exhibit stable PDs with signal distillation (see example PDs in the inset), while neurons with smaller R2 (weakly-tuned neurons) show a larger PD deviation. These results indicate that irrelevant signals have a small effect on strongly-tuned neurons but a large effect on weakly-tuned neurons. One possible reason for the larger PD deviation in weakly-tuned neurons is that they have a lower degree of linear encoding but a higher degree of nonlinear encoding, and highly nonlinear structures are more susceptible to interference from irrelevant signals (Nogueira et al., 2023). Moreover, after filtering out the behaviorally-irrelevant signals, the cosine tuning fit (R2) of neurons increases (P < 10^−20, Wilcoxon signed-rank test; Fig. 3b,f and Supplementary Fig. S3b), indicating that irrelevant signals reduce the neurons’ tuning expression. Notably, even after removing the interference of irrelevant signals, the R2 of neurons remains relatively low. These results demonstrate that the linear encoding model only explains a small fraction of neural responses, and that neuronal activity encodes behavioral information in complex nonlinear ways.
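For reference, the PD and the cosine tuning fit can be computed by regressing each neuron's firing rate on the cosine and sine of movement direction (the standard cosine tuning model); the helper below is a minimal sketch under that assumption.

```python
import numpy as np

def preferred_direction(rates, theta):
    """Fit the cosine tuning model r = b0 + b1*cos(theta) + b2*sin(theta)
    and return the neuron's PD and the tuning fit R2.
    rates: (n_samples,) firing rates; theta: movement directions in radians."""
    X = np.column_stack([np.ones_like(theta), np.cos(theta), np.sin(theta)])
    b, *_ = np.linalg.lstsq(X, rates, rcond=None)
    pred = X @ b
    r2 = 1 - np.sum((rates - pred) ** 2) / np.sum((rates - rates.mean()) ** 2)
    return np.arctan2(b[2], b[1]), r2            # PD, tuning fit

def pd_deviation(pd_raw, pd_distilled):
    """Angle difference between two PDs, wrapped to [0, pi] (as in Fig. 3a,e)."""
    d = np.abs(pd_raw - pd_distilled) % (2 * np.pi)
    return np.minimum(d, 2 * np.pi - d)
```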

The effect of irrelevant signals on analyzing neural activity at the single-neuron level.

a, The angle difference (AD) of the preferred direction (PD) between raw and distilled signals as a function of the R2 of raw signals on dataset A, where R2 represents the proportion of neuronal activity explained by the linear encoding model. Each black point represents a neuron (n = 90). The red curve is the fitting curve between R2 and AD. Five example larger R2 neurons’ PDs are shown in the inset plot, where the solid and dotted line arrows represent the PDs of relevant and raw signals, respectively. b, Comparison of the cosine tuning fit (R2) before and after distillation of single neurons (black points), where the x-axis and y-axis represent neurons’ R2 of raw and distilled signals, respectively. c, Comparison of neurons’ Fano factor (FF) averaged across conditions of raw (x-axis) and distilled (y-axis) signals, where FF is used to measure the neuronal variability across different trials in the same condition. d, Boxplots of the FFs of raw (purple) and distilled (red) signals under different conditions for all neurons (12 conditions). Boxplots represent medians (lines), quartiles (boxes), and whiskers extending to ± 1.5 times the interquartile range. The broken lines represent the mean FF across all neurons. e-h, Same as a-d, but for dataset B (n = 159, 8 conditions). i, Example of three neurons’ raw firing activity decomposed into behaviorally-relevant and irrelevant parts using all trials under two conditions (2 of 8 directions) in held-out test sets of dataset B.

To investigate whether irrelevant signals significantly contribute to neuronal variability, we compared the neuronal variability (measured with the Fano factor (Churchland et al., 2010), FF) of relevant and raw signals. Results show that the condition-averaged FF of each neuron of distilled signals is lower than that of raw signals (P < 10^−20, Wilcoxon signed-rank test; Fig. 3c,g), and the mean (broken line) and median FFs of all neurons under different conditions are also significantly lower than those of raw signals (P < 0.01, Wilcoxon signed-rank test; Fig. 3d,h), indicating that irrelevant signals significantly contribute to neuronal variability. We then visualized the single-trial neuronal activity of example neurons under different conditions (Fig. 3i and Supplementary Fig. S4). Results demonstrate that the patterns of relevant signals are more consistent and stable across different trials than those of raw signals, and the firing activity of irrelevant signals varies randomly. These results indicate that irrelevant signals significantly contribute to neuronal variability, and that eliminating the interference of irrelevant signals enables us to observe changes in neural patterns more accurately.
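The FF analysis follows the standard definition (across-trial variance of spike counts divided by their mean); a minimal sketch, assuming spike counts are organized per condition:

```python
import numpy as np

def fano_factor(counts):
    """counts: (n_trials, n_neurons) spike counts for one condition.
    FF = across-trial variance / across-trial mean, per neuron."""
    return counts.var(axis=0, ddof=1) / counts.mean(axis=0)

def condition_averaged_ff(counts_by_condition):
    """Condition-averaged FF per neuron (as compared in Fig. 3c,g).
    counts_by_condition: list of (n_trials, n_neurons) arrays, one per condition."""
    return np.mean([fano_factor(c) for c in counts_by_condition], axis=0)
```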

How do behaviorally-irrelevant signals affect the analysis of neural activity at the population level?

The neural population structure is an essential characteristic of neural activity. Here, we examined how behaviorally-irrelevant signals affect the analysis of neural activity at the population level, including four aspects: (1) the population properties of relevant and irrelevant signals, (2) the subspace overlap relationship between the two signal components, (3) how the two partitions contribute to raw signals, and (4) the difference in population properties between raw and distilled signals.

To explore the population properties of relevant and irrelevant signals, we separately applied principal component analysis (PCA) to each partition to obtain the corresponding cumulative variance curve in descending variance order. Our results show that the primary subspace (capturing the top 90% variance) of relevant signals (thick red line, Fig. 4a,e and Supplementary Fig. S5a) is explained by only a few dimensions (7, 13, and 9 for each dataset), indicating that the primary part of behaviorally-relevant signals exists in a low-dimensional subspace. In contrast, the primary subspace of irrelevant signals (thick blue line, Fig. 4b,f and Supplementary Fig. S5b) requires more dimensions (46, 81, and 59). The variance distribution of behaviorally-irrelevant signals across dimensions (thick blue line, Fig. 4b,f and Supplementary Fig. S5b) is more even than that of behaviorally-relevant signals (thick red line, Fig. 4a,e and Supplementary Fig. S5a) but not as uniform as Gaussian noise N(0, I) (thin gray line, Fig. 4b,f and Supplementary Fig. S5b), indicating that irrelevant signals are not pure noise but rather carry some significant structure, which may represent information from other irrelevant tasks.
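The primary-subspace dimensionality reported here is simply the number of PCs needed to reach 90% cumulative variance; a sketch of that count, assuming signals are arranged as a (samples × neurons) matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

def primary_subspace_dim(signals, var_threshold=0.9):
    """Number of PCs needed to capture the top 90% of variance
    (the 'primary subspace' as defined in Fig. 4).
    signals: (n_samples, n_neurons), e.g., the relevant or irrelevant partition."""
    ratios = PCA().fit(signals).explained_variance_ratio_
    return int(np.searchsorted(np.cumsum(ratios), var_threshold) + 1)
```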

The effect of irrelevant signals on analyzing neural activity at the population level.

a,b, PCA is separately applied to relevant and irrelevant signals to get relevant PCs and irrelevant PCs. The thick lines represent the cumulative variance explained for the signals on which PCA has been performed, while the thin lines represent the variance explained by those PCs for the other signals. Red, blue, and gray colors indicate relevant signals, irrelevant signals, and random Gaussian noise N(0, I) (for chance level), where the mean vector is zero and the covariance matrix is the identity matrix. The horizontal lines represent the percentage of variance explained. The vertical lines indicate the number of dimensions accounting for 90% of the variance in behaviorally-relevant (left) and irrelevant (right) signals. For convenience, we defined the principal component subspace describing the top 90% variance as the primary subspace and the subspace capturing the last 10% variance as the secondary subspace. The cumulative variance explained for behaviorally-relevant (a) and irrelevant (b) signals obtained by d-VAE on dataset A. c,d, PCA is applied to raw signals to get raw PCs. c, The bar plot shows the composition of each raw PC. The inset pie plot shows the overall composition of raw signals, where red, blue, and purple colors indicate relevant signals, irrelevant signals, and the correlation between relevant and irrelevant signals. The PC marked with a red triangle indicates the last PC where the variance of relevant signals is greater than or equal to that of irrelevant signals. d, The cumulative variance explained by raw PCs for different signals, where the thick line represents the cumulative variance explained for raw signals (purple), while the thin lines represent the variance explained for relevant (red) and irrelevant (blue) signals. e-h, Same as a-d, but for dataset B.

To investigate the subspace overlap between relevant and irrelevant signals, we calculated how much variance of irrelevant signals can be captured by relevant PCs by projecting irrelevant signals onto relevant PCs, and vice versa (Elsayed et al., 2016; Rouse and Schieber, 2018; Jiang et al., 2020) (see Methods). We found that the variance of irrelevant signals increases relatively uniformly over relevant PCs (blue thin line, Fig. 4a,e and Supplementary Fig. S5a), like the variance accumulation of random noise explained by relevant PCs (gray thin line, Fig. 4a,e and Supplementary Fig. S5a); similar results are observed for relevant signals explained by irrelevant PCs (red thin line, Fig. 4b,f and Supplementary Fig. S5b). These results indicate that relevant PCs cannot match the informative dimensions of irrelevant signals and vice versa, suggesting that the dimensions of behaviorally-relevant and irrelevant signals are unrelated. It is worth mentioning that the signals obtained by pi-VAE stand in contrast to our findings. Their results show that a few relevant PCs can explain a considerable variance of irrelevant signals (thin red line, Supplementary Fig. S6b,f,j), which indicates that the relevant and irrelevant PCs are closely related. The possible reason is that pi-VAE leaks a lot of information into the irrelevant signals. Notably, Fig. 4a,e and Supplementary Fig. S5a show that irrelevant signals obtained by d-VAE, when projected onto the behaviorally-relevant primary subspace, capture only a little variance (9%, 12%, and 9%), indicating that the primary subspace of behaviorally-relevant signals is nearly orthogonal to the irrelevant space.
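The overlap measure is the fraction of one partition's variance captured by the other partition's PCs; a sketch under the assumption that the separated partitions are (samples × neurons) arrays, reusing the primary_subspace_dim helper above:

```python
import numpy as np
from sklearn.decomposition import PCA

def variance_captured(signals, pcs):
    """Fraction of the variance of `signals` captured by the subspace
    spanned by `pcs` (rows are orthonormal principal components)."""
    centered = signals - signals.mean(axis=0)
    projected = centered @ pcs.T
    return projected.var(axis=0).sum() / centered.var(axis=0).sum()

# Usage (with the separated partitions as (samples, neurons) arrays):
#   rel_pcs = PCA().fit(relevant).components_
#   k = primary_subspace_dim(relevant)            # helper sketched above
#   variance_captured(irrelevant, rel_pcs[:k])    # ~9-12% in Fig. 4a,e
```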

To investigate how the two partitions compose the raw signals, we performed PCA on raw neural signals to obtain raw PCs, and then projected the relevant and irrelevant signals onto these PCs to assess the proportion of the variance of raw signals explained by each partition. First, we analyzed the overall proportions of relevant and irrelevant signals that constitute the raw signals (the inset pie plot, Fig. 4c,g and Supplementary Fig. S5c). The variance of the raw signals is composed of three parts: the variance of relevant signals, the variance of irrelevant signals, and the correlation between relevant and irrelevant signals (see Methods). The results demonstrate that the irrelevant signals account for a large proportion of raw signals, suggesting that the motor cortex encodes considerable information unrelated to the measured behaviors. In addition, there is only a weak correlation between relevant and irrelevant signals, implying that behaviorally-relevant and irrelevant signals are nearly uncorrelated.
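Concretely, this three-part decomposition follows from z = x + y: along any raw PC direction u, the variance of the raw signals splits into the two partitions' variances plus a cross-covariance term, as below (a restatement of the identity underlying the pie plots; the exact computation is given in Methods).

```latex
\operatorname{Var}(u^{\top} z)
  = \operatorname{Var}(u^{\top} x)
  + \operatorname{Var}(u^{\top} y)
  + 2\,\operatorname{Cov}\!\left(u^{\top} x,\; u^{\top} y\right)
```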

We then examined the proportions of relevant and irrelevant signals in each PC of raw signals. We found that relevant signals (red) occupy the dominant proportions in the larger variance raw PCs (before the PC marked with a red triangle), while irrelevant signals (blue) occupy the dominant proportions in the smaller variance raw PCs (after the PC marked with a red triangle) (Fig. 4c,g and Supplementary Fig. S5c). Similar results are observed in the cumulative variance of raw PCs (Fig. 4d,h and Supplementary Fig. S5d). Specifically, the results show that the variance accumulation of raw signals (purple line) in larger variance PCs is mainly contributed by relevant signals (red line), while irrelevant signals (blue line) primarily contribute to the lower variance PCs. These results demonstrate that irrelevant signals have a small effect on larger variance raw PCs but a large effect on smaller variance raw PCs. This finding eliminates the concern that irrelevant signals would significantly affect the top few PCs of raw signals and thus produce inaccurate conclusions. To further validate this finding, we used the top six PCs, as jPCA (Churchland et al., 2012) does, to examine the rotational dynamics of distilled and raw signals (Supplementary Fig. S7). Results show that the rotational dynamics of distilled signals are similar to those of raw signals.

Finally, to directly compare the population properties of raw and relevant signals, we plotted the cumulative variance curves of raw and relevant signals (Supplementary Fig. S8). Results (upper left corner curves, Supplementary Fig. S8) show that the cumulative variance curve of relevant signals (red line) accumulates faster than that of raw signals (purple line) in the preceding larger variance PCs, indicating that the variance of relevant signals is more concentrated in the larger variance PCs than that of raw signals. Furthermore, we found that the dimensionality of the primary subspace of raw signals (26, 64, and 45 for datasets A, B, and C) is significantly higher than that of behaviorally-relevant signals (7, 13, and 9), indicating that behaviorally-irrelevant signals lead to an overestimation of the neural dimensionality of behaviorally-relevant signals.

Distilled behaviorally-relevant signals uncover that smaller R2 neurons encode rich behavioral information in complex nonlinear ways

The results presented above regarding PDs (Fig. 3 and Supplementary Fig. S3) demonstrate that irrelevant signals significantly impact smaller R2 neurons and weakly impact larger R2 neurons. Under the interference of irrelevant signals, it is difficult to explore the amount of behavioral information in neuronal activity. Given that we have accurately separated the behaviorally-relevant and irrelevant signals, we explored whether irrelevant signals mask some encoded information of neuronal activity, especially for smaller R2 neurons.

To answer this question, we divided the neurons into two groups, smaller R2 (R2 ≤ 0.03) and larger R2 (R2 > 0.03), and then used decoding models to assess how much information is encoded in raw and distilled signals. As shown in Fig. 5a, for the smaller R2 neuron group, both KF and ANN decode behavioral information poorly on raw signals but achieve high decoding performance using relevant signals. Specifically, the KF decoder (left plot, Fig. 5a) improves the decoding R2 significantly from 0.044 to 0.616 (an improvement of about 1300%; P < 0.01, Wilcoxon rank-sum test) after signal distillation; the ANN decoder (right plot, Fig. 5a) improves from 0.374 to 0.753 (an improvement of about 100%; P < 0.01, Wilcoxon rank-sum test). For the larger R2 neuron group, the decoding performance of relevant signals with the ANN does not improve much compared with that of raw signals, but the decoding performance of relevant signals with the KF is significantly better than that of raw signals (P < 0.01, Wilcoxon rank-sum test). Similar results are obtained with datasets B (Fig. 5d) and C (Supplementary Fig. S9d). These results indicate that irrelevant signals mask behavioral information encoded by neuronal populations, especially for smaller R2 neurons with a higher degree of nonlinearity, and that smaller R2 neurons actually encode rich behavioral information.
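This group analysis is a straightforward split on the cosine-tuning R2 (computed per neuron, e.g., with the preferred_direction helper above); a sketch reusing the decode_r2 helper sketched earlier, with the 0.03 threshold stated above:

```python
import numpy as np

def group_decoding(raw, relevant, behavior, tuning_r2):
    """Decoding comparison per neuron group split by cosine-tuning R2
    (as in Fig. 5a,d); decode_r2 is the helper sketched earlier."""
    groups = {"smaller R2": tuning_r2 <= 0.03,
              "larger R2": tuning_r2 > 0.03,
              "all": np.ones(len(tuning_r2), dtype=bool)}
    return {name: {"raw": decode_r2(raw[:, mask], behavior),
                   "distilled": decode_r2(relevant[:, mask], behavior)}
            for name, mask in groups.items()}
```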

Smaller R2 neurons encode rich behavioral information in complex nonlinear ways.

a, The comparison of decoding performance between raw (purple) and distilled signals (red) on dataset A with different neuron groups, including smaller R2 neurons (R2 ≤ 0.03), larger R2 neurons (R2 > 0.03), and all neurons. Error bars indicate mean ± s.d. across five cross-validation folds. Asterisks denote significance of Wilcoxon rank-sum test with *P < 0.05, **P < 0.01. b, The correlation matrix of all neurons of raw (left) and behaviorally-relevant (right) signals on dataset A. Neurons are ordered to highlight correlation structure (details in Methods). c, The decoding performance of KF (left) and ANN (right) with neurons dropped out from larger to smaller R2 on dataset A. The vertical gray line indicates the number of dropped neurons at which raw and behaviorally-relevant signals have the greatest performance difference. d-f, Same as a-c, but for dataset B.

The fact that the smaller R2 neurons encode rich information seems unexpected, and interestingly, we cannot obtain this rich information by distilling the smaller R2 neurons alone. This leads to two alternative cases. The first is that the larger R2 neurons leak their information to the smaller R2 neurons, causing them to contain too much behavioral information. The second is that the smaller R2 neurons per se contain a lot of information, and the larger R2 neurons help the smaller R2 neurons restore their original appearance. We first tested the first case and found it does not hold, for two reasons. Firstly, our model constrains distilled neuronal activity to be similar to the corresponding original neuronal activity (see Fig. 3i and Supplementary Fig. S4), inhibiting the generation of arbitrarily shaped neuronal activity, such as that of other neurons, so it is impossible for our model to add the activity of larger R2 neurons to that of smaller R2 neurons. To check this, we compared the neural similarity (R2) of each distilled neuron to all raw neurons. We found that 78/90 (87%, dataset A), 153/159 (96%, dataset B), and 91/91 (100%, dataset C) distilled neurons are most similar to their corresponding raw neurons, and the remaining distilled neurons are among the top four most similar to their corresponding neurons, demonstrating that distilled neuronal activity is similar to the corresponding raw neuronal activity. Secondly, if this large amount of information were compensated by other neurons, the residuals should also contain a large amount of information, but they contain only a little information (Fig. 2c,g,k). Given these two reasons, the first case is rejected. We then tested the second case. To verify it, we simulated a situation where the smaller R2 neurons contain a certain amount of behavioral information that cannot be decoded from these neurons because it is covered by noise (see Methods), to see whether d-VAE can restore these neurons' original appearance. Results (Supplementary Fig. S10a-c) show that for smaller R2 neurons, the decoding R2 of distilled signals (0.362 and 0.401 for KF and ANN) is significantly higher than that of raw signals (0.019 and 0.063 for KF and ANN; P < 0.01, Wilcoxon rank-sum test) and is closer to that of the ground truth (0.454 and 0.467 for KF and ANN). These results support the second case. Taken together, these results demonstrate that the smaller R2 neurons indeed contain rich behavioral information.

Given that both smaller and larger R2 neurons encode rich behavioral information, it is worth noting that the sum of the decoding performance of smaller R2 neurons and larger R2 neurons is significantly greater than that of all neurons for relevant signals (red bars, Fig. 5a,d and Supplementary Fig. S9a), demonstrating that movement parameters are encoded very redundantly in the neuronal population. In contrast, we cannot find this degree of neural redundancy in raw signals (purple bars, Fig. 5a,d and Supplementary Fig. S9a) because the encoded information of smaller R2 neurons is masked by irrelevant signals. Therefore, these smaller R2 neurons, which are usually ignored, are actually useful and play a critical role in supporting neural redundancy. Generally, cortical redundancy can arise from neuronal correlations, which are critical for revealing certain aspects of neural circuit organization (Yatsenko et al., 2015). Accordingly, we visualized the ordered correlation matrix of neurons (see Methods) for both raw and relevant signals (Fig. 5b,e and Supplementary Fig. S9b) and found that the neuronal correlation of relevant signals is stronger than that of raw signals. These results demonstrate that irrelevant signals weaken the neuronal correlation, which may hinder the accurate investigation of neural circuit organization.

Considering the rich redundancy and strong correlation of neuronal activity, we wondered whether the neuronal population could utilize redundant information from other neurons to exhibit robustness under the perturbation of neuronal destruction. To investigate this question, we evaluated the decoding performance on raw and relevant signals as neurons were dropped out from larger R2 to smaller R2. The results (Fig. 5c,f and Supplementary Fig. S9c) show that the decoding performance of the KF and ANN on raw signals (purple line) decreases steadily as neurons are dropped, up to the marked number of dropped neurons (vertical gray line), and the remaining smaller R2 neurons decode behavioral information poorly. In contrast, even if many neurons are lost, the decoding performance of the KF and ANN on relevant signals (red line) maintains high accuracy. This finding indicates that behaviorally-relevant signals are robust to the disturbance of neuron drop-out, and that smaller R2 neurons play a critical role in compensating for the failure of larger R2 neurons. In contrast, this robustness cannot be observed in raw signals because irrelevant signals mask neurons' information and weaken their correlation. Notably, the ANN outperforms the KF when only smaller R2 neurons are left (Fig. 5c,f and Supplementary Fig. S9c), suggesting that smaller R2 neurons can fully exploit their nonlinear ability to cope with large-scale neuronal destruction.
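The drop-out analysis can be sketched as below, removing neurons in descending order of tuning R2 and re-scoring after each removal (again reusing the decode_r2 helper; the actual analysis uses the KF and ANN decoders).

```python
import numpy as np

def dropout_curve(signals, behavior, tuning_r2):
    """Decoding performance as neurons are dropped from larger to smaller
    cosine-tuning R2 (as in Fig. 5c,f); decode_r2 is sketched earlier."""
    order = np.argsort(tuning_r2)[::-1]          # larger R2 first
    scores = []
    for n_dropped in range(len(order)):
        kept = order[n_dropped:]                 # drop the n largest-R2 neurons
        scores.append(decode_r2(signals[:, kept], behavior))
    return scores
```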

Distilled behaviorally-relevant signals uncover that signals composed of smaller variance PCs encode rich behavioral information in complex nonlinear ways

The results presented above regarding subspace overlap (Fig. 4 and Supplementary Fig. S5) show that irrelevant signals have a small impact on larger variance PCs but dominate smaller variance PCs. Therefore, we aimed to investigate whether irrelevant signals mask some encoded information of the neural population, especially in signals composed of smaller variance PCs.

To answer this question, we compared the decoding performance of raw and distilled signals composed of different raw PC groups. Specifically, we first divided the raw PCs into two groups, smaller variance PCs and larger variance PCs, defined by the ratio of relevant to irrelevant signals in the raw PCs (the red triangle; see Fig. 4c,g and Supplementary Fig. S5c). Then we projected raw and distilled signals onto these two PC groups to obtain the corresponding signals. Results show that, for the smaller variance PC group, both KF and ANN achieve much better performance on distilled signals than on raw signals (P < 0.01, Wilcoxon rank-sum test, for ANN), whereas for the larger variance PC group, the decoding performance of relevant signals does not improve much compared with that of raw signals (see Fig. 6a,d and Supplementary Fig. S9a). These results demonstrate that irrelevant signals mask the behavioral information encoded by different PC groups, especially for signals composed of smaller variance PCs (smaller variance PC signals), and that smaller variance PC signals actually encode rich behavioral information.
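Projecting both partitions onto the raw-PC groups can be sketched as follows, where n_large is the index of the red-triangle PC (the last raw PC whose relevant variance is at least the irrelevant variance):

```python
import numpy as np
from sklearn.decomposition import PCA

def split_by_raw_pcs(raw, relevant, n_large):
    """Reconstruct raw and distilled signals within the larger- and
    smaller-variance raw-PC groups (as compared in Fig. 6a,d)."""
    pca = PCA().fit(raw)
    large, small = pca.components_[:n_large], pca.components_[n_large:]
    out = {}
    for name, sig in [("raw", raw), ("distilled", relevant)]:
        centered = sig - pca.mean_
        out[name] = {"larger PCs": centered @ large.T @ large,
                     "smaller PCs": centered @ small.T @ small}
    return out
```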

Signals composed of smaller variance PCs encode rich behavioral information in complex nonlinear ways.

a, The comparison of decoding performance between raw (purple) and distilled signals (red) composed of different raw PC groups, including smaller variance PCs (the proportion of irrelevant signals that make up the raw PCs is higher than that of relevant signals) and larger variance PCs (the proportion of irrelevant signals is lower than that of relevant ones), on dataset A. Error bars indicate mean ± s.d. across five cross-validation folds. Asterisks denote significance of Wilcoxon rank-sum test with *P < 0.05, **P < 0.01. b, The cumulative decoding performance of signals composed of cumulative PCs ordered from smaller to larger variance, using KF (left) and ANN (right) on dataset A. The red patches indicate the decoding ability of the last 10% variance of relevant signals. c, The cumulative decoding performance of signals composed of cumulative PCs ordered from larger to smaller variance, using KF (left) and ANN (right) on dataset A. The red patches indicate the decoding gain of the last 10% variance signals of relevant signals superimposed on their top 90% variance signals. The inset shows a partially enlarged view for clarity. d-f, Same as a-c, but for dataset B.

The above results are based on raw PCs. However, raw PCs are biased by irrelevant signals and thus cannot faithfully reflect the characteristics of relevant signals. As we have successfully separated the behaviorally-relevant signals, we aimed to explore how the behavioral information of distilled signals is distributed across relevant PCs. To do so, we used decoding models to evaluate the amount of behavioral information contained in cumulative PCs of relevant signals (using raw signals as a comparison). The cumulative variance explained by PCs in descending and ascending order of variance, and the dimensionality corresponding to the top 90% variance signals (called primary signals) and the last 10% variance signals (called secondary signals), are shown in Supplementary Fig. S8.

Here we first investigated the decoding ability of secondary signals alone by accumulating PCs from smaller to larger variance. The results show that, for relevant signals, KF can hardly decode behavioral information using secondary signals alone (red line; left plot, Fig. 6b,e and Supplementary Fig. S9e), but ANN can decode rich information (red line; right plot, Fig. 6b,e and Supplementary Fig. S9e). These results indicate that smaller variance PC signals encode rich information in complex nonlinear ways. In contrast, when using raw signals composed of the same number of dimensions as the secondary signals (purple line, Fig. 6b,e and Supplementary Fig. S9e), the amount of information identified by ANN is significantly smaller than that of relevant secondary signals (P < 0.01, Wilcoxon rank-sum test). These results demonstrate that irrelevant signals obscure these neural dimensions and make them seem insignificant, indicating that behavioral information is distributed in a higher-dimensional subspace than expected from raw signals.

We then investigated the effect of superimposing secondary signals on primary signals by accumulating PCs from larger to smaller variance. The results (Fig. 6c,f and Supplementary Fig. S9f) show that secondary signals improve the decoding performance of ANN only a little but improve the decoding performance of KF a lot. The discrepancy between the two decoders reflects their different abilities to utilize the information within the signals. KF cannot use the nonlinear information in primary signals as effectively as ANN can and thus requires secondary signals to improve decoding performance. Notably, the improvement of KF using secondary signals is significantly higher than that using raw signals composed of the same number of dimensions as the secondary signals (P < 0.01, Wilcoxon rank-sum test). These results demonstrate that irrelevant signals conceal the smaller variance PC signals, making their encoded information difficult to decode linearly, and suggest that behavioral information exists in a higher-dimensional subspace than anticipated from raw signals. Interestingly, although secondary signals nonlinearly encode behavioral information and are decoded poorly by linear decoders, they considerably improve KF performance when superimposed on primary signals (Fig. 6c,f and Supplementary Fig. S9f, left); moreover, the sum of the sole decoding performance of primary and secondary signals is lower than the decoding performance of the full signals. These results indicate that the combination of smaller and larger variance PCs produces a synergy effect (Narayanan et al., 2005), enabling secondary signals that cannot be linearly decoded to improve the linear decoding performance.
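Both cumulative orderings (ascending for Fig. 6b,e; descending for Fig. 6c,f) amount to decoding from signals reconstructed with a growing set of PCs; a sketch, again with the decode_r2 helper standing in for the KF/ANN decoders:

```python
import numpy as np
from sklearn.decomposition import PCA

def cumulative_decoding(signals, behavior, ascending=True):
    """Decode from signals reconstructed with the first k PCs, for k = 1..D,
    with PCs ordered by variance (ascending or descending)."""
    pca = PCA().fit(signals)
    order = np.argsort(pca.explained_variance_)      # smallest variance first
    if not ascending:
        order = order[::-1]
    centered = signals - pca.mean_
    scores = []
    for k in range(1, len(order) + 1):
        comps = pca.components_[order[:k]]           # (k, n_neurons)
        recon = centered @ comps.T @ comps + pca.mean_
        scores.append(decode_r2(recon, behavior))
    return scores
```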

Finally, we explored which aspect of movement parameters is improved by superimposing the secondary signals on the primary ones with KF. In BMIs, directional control has achieved great success (Georgopoulos et al., 1986; Hochberg et al., 2012), but precise speed control, especially at lower speeds such as hold or stop, has always been challenging (Wodlinger et al., 2014; Inoue et al., 2018). We therefore hypothesized that these signals might improve velocity estimation at lower speeds. To test this, we divided the samples into lower-speed and higher-speed regions and assessed which region improved the most by superimposing the secondary signals (see details in Methods). After superimposing the secondary signals, the absolute improvement ratio of the lower-speed region is significantly higher than that of the higher-speed region (P < 0.05, Wilcoxon rank-sum test; Supplementary Fig. S11a-c). Furthermore, we visualized the relative improvement ratio of five example trials for the two regions, and the results (Supplementary Fig. S11d) demonstrate that secondary signals significantly improve the estimation of lower speeds. These results demonstrate that the secondary signals enhance lower-speed control, suggesting that smaller variance PC signals are involved in regulating precise motor control.

Distilled behaviorally-relevant signals suggest that motor cortex may use a linear readout mechanism to generate movement behaviors

Understanding the readout mechanism of the motor cortex, which remains unclear, is crucial for both neuroscience and neural engineering. By filtering out the interference of behaviorally-irrelevant signals, we found a striking result: the linear decoder KF achieves performance comparable to that of the nonlinear decoder ANN (P = 0.10, 0.15, and 0.55 for datasets A, B, and C, Wilcoxon rank-sum test; Fig. 2b,f,j). Considering the decoding performance and model complexity (the simplicity principle, also known as Occam's razor), movement behaviors are more likely to be generated by linear readout, suggesting that linear readout is performed in the motor cortex.

However, one might doubt whether it is our distillation model that makes signals which originally require nonlinear decoding linearly decodable. To address this concern, we conducted a simulation experiment in which linear decoding is significantly inferior to nonlinear decoding, to see whether signals processed by our model could enable linear decoders to achieve performance similar to that of nonlinear decoders (see Methods). Results (Supplementary Fig. S10d-f) show that the difference in decoding R2 between KF and ANN on distilled signals is 0.146, which is comparable to the difference (0.162) on raw signals (P = 0.15, Wilcoxon rank-sum test). This demonstrates that our model cannot make linear decoders achieve performance similar to that of nonlinear decoders, thus eliminating the doubt.

In summary, these findings demonstrate that behaviorally-irrelevant signals significantly complicate the readout of behavioral information and provide compelling evidence supporting the notion that the motor cortex uses a linear readout mechanism to generate movement behaviors.

Discussion

In this study, we proposed a new perspective for studying neural mechanisms, namely, using separated accurate behaviorally-relevant signals instead of raw signals; and we provided a novel distillation framework to define, extract, and validate behaviorally-relevant signals. By separating behaviorally-relevant and irrelevant signals, we found that neural responses previously considered trivial encode rich behavioral information in complex nonlinear ways, and they play an important role in neural encoding and decoding. Furthermore, we found that linear decoders can achieve comparable performance to that of nonlinear decoders, providing compelling evidence for the presence of linear readout in the motor cortex. Overall, our results reveal unexpected complex encoding but simple decoding mechanisms in the motor cortex.

Signal separation by d-VAE

Behaviorally-relevant patterns can be extracted either at single-neuron or latent neural population levels. In our study, we focused on the single-neuron level, aiming to preserve the underlying properties of individual neurons. By maintaining the properties of each neuron, researchers can investigate how the neuronal population performs when one of the neurons is destroyed. This kind of analysis is particularly useful in closed-loop stimulation experiments that use electrophysiological (Sun et al., 2022) or optogenetic (Zhang et al., 2022) interventions. Furthermore, behaviorally-relevant signals also allow for population-level analysis and provide clean benchmark signals to test and compare the variance capture ability of different hypothesis-driven models.

At the single-neuron level, it is common practice to use trial-averaged neuronal responses of the same task parameters to analyze neural mechanisms (Kobak et al., 2016; Rouse and Schieber, 2018). However, trial averaging sacrifices single-trial information, thereby providing an incomplete characterization of neural activity. Furthermore, trial-averaged responses still contain a significant amount of behaviorally-irrelevant signals caused by uninstructed movements (Musall et al., 2019), which can lead to a contaminated version of behaviorally-relevant signals. In contrast, our model is capable of extracting clean behaviorally-relevant neural activity for every single trial. At the latent population level, existing latent variable models (Sani et al., 2021; Pandarinath et al., 2018; Yu et al., 2008; Zhou and Wei, 2020) focus on modeling specific properties of latent population representations, such as linear or nonlinear dynamics (Sani et al., 2021; Pandarinath et al., 2018; Churchland et al., 2012), temporal smoothness (Yu et al., 2008), and interpretability (Zhou and Wei, 2020). Since these models make restrictive assumptions that characterize specific neural properties, they fail to capture other parts of behaviorally-relevant signals that do not meet their assumptions, providing no guarantee that the generated signals preserve behavioral information maximally. In contrast, our model aims to extract maximal behavioral information at the single-neuron level. To ensure this, we explicitly constrain the model to generate signals that maintain neuronal properties while preserving behavioral information to the greatest extent possible. At the population level, dimensionality reduction methods aided by task parameters (Kobak et al., 2016; Schneider et al., 2022) are another important way to discover latent neural embeddings relevant to task parameters, which may provide new insight into neural representations. In contrast with this class of methods, our model focuses on the signal level, not the latent embedding level.

Although we made every effort, our model is still not able to perfectly extract behaviorally-relevant neural signals, resulting in a small amount of behavioral information leaking into the residuals. Nevertheless, the signals distilled by our model are reliable, and the minor imperfections do not affect the conclusions drawn from our analysis. In the future, better models can be developed to extract behaviorally-relevant signals more accurately.

Implications for analyzing neural activity by separation

Studying neural mechanisms through noisy signals is akin to looking at flowers in a fog, making it difficult to discern the truth. Thus, removing the interference of irrelevant signals is necessary and beneficial for analyzing neural activity, whether at the single-neuron level or the population level.

At the single-neuron level, trial-to-trial neuronal variability poses a significant challenge to identifying actual neuronal pattern changes. The variability can arise from various sources, including meaningless noise (Faisal et al., 2008), meaningful but behaviorally-irrelevant neural processes (Musall et al., 2019), and intrinsic components of neural encoding (Walker et al., 2020). However, it is still unclear to what extent each source contributes to the variability (Faisal et al., 2008). By separating behaviorally-relevant and irrelevant parts, we could roughly determine the extent to which these two parts contribute to the variability and explore which types of variability these two parts may contain. Our results demonstrate that behaviorally-irrelevant signals are a significant contributor to variability, and they may include both meaningless noise and meaningful but behaviorally-irrelevant signals, as behaviorally-irrelevant signals are not pure noise and may carry some structure (thick blue line, Fig. 4b,f and Supplementary Fig. S5b). Notably, behaviorally-relevant signals also exhibit some variability, which may arise from intrinsic components of neural encoding and provide the neural basis for motor learning (Dhawale et al., 2017). Moreover, eliminating the variability caused by irrelevant signals enables us to better observe and compare actual neuronal pattern changes and may facilitate the study of learning mechanisms (Sadtler et al., 2014; Hennig et al., 2021).

At the population level, determining the neural dimensionality of specific behaviors from raw signals is challenging since it is difficult to discern how much variance corresponds to irrelevant signals, and such estimates often depend heavily on signal quality. A previous study (Altan et al., 2021) demonstrated, through simulation experiments involving different levels of noise, that such noise leads methods to overestimate the neural dimensionality. Our results, consistent with theirs, indicate that behaviorally-irrelevant signals can cause an overestimation of the neural dimensionality of behaviorally-relevant responses (Supplementary Fig. S8). These findings highlight the need to filter out irrelevant signals when estimating neural dimensionality. It is important to note that the dimensionality of neural subspaces encoding behavioral information cannot be accurately determined by the variance they capture. Estimating the dimensionality accurately requires considering the amount of behavioral information encoded by signals distributed within these subspaces, because the variance of neural signals does not directly reflect the amount of behavioral information they encode. For example, behaviorally-irrelevant signals occupy some neural subspaces without containing behavioral information. Furthermore, this perspective of signal separation has broader implications for other studies. For instance, researchers can isolate neural signals corresponding to different behaviors and explore their shared and exclusive patterns to uncover the underlying common and unique mechanisms of different behaviors (Gallego et al., 2018).

Implications for exploring neural mechanisms by separation

At the single-neuron level, previous studies (Carmena et al., 2005; Narayanan et al., 2005) have shown that neuronal ensembles redundantly encode movement behaviors in the motor cortex. However, our results reveal a significantly higher level of redundancy than previously reported. Specifically, prior studies found that the decoding performance steadily declines as neurons drop out, which is consistent with our results drawn from raw signals. In contrast, our results show that decoders maintain high performance on distilled signals even when many neurons drop out. Our findings reinforce the idea that movement behavior is redundantly encoded in the motor cortex and demonstrate that the brain is robust enough to tolerate large-scale neuronal destruction while maintaining brain function (Alstott et al., 2009).

At the population level, previous studies have proposed that motor control is achieved through low-dimensional neural manifolds (Cunningham and Yu, 2014; Gallego et al., 2017). However, our results challenge this idea by showing that signals composed of smaller variance PCs nonlinearly encode a significant amount of behavioral information. This suggests that behavioral information is distributed in a higher-dimensional neural space than previously thought. Interestingly, although smaller variance PC signals nonlinearly encode behavioral information, their behavioral information can be linearly decoded by superimposing them onto larger variance PC signals. This result is consistent with the finding that nonlinear mixed selectivity can yield high-dimensional neural responses and thus allow linear readout of behavioral information by downstream neurons (Rigotti et al., 2013; Fusi et al., 2016). Moreover, we found that smaller variance PC signals can improve precise motor control, such as lower-speed control. Analogously, recent studies have found that smaller variance PCs of hand postures are task-dependent and relate to precise and complex postures (Yan et al., 2020). These findings suggest that neural signals composed of lower variance PCs are involved in the regulation of precise motor control.

In the motor cortex, in what form downstream neurons read out behavioral information is still an open question. By filtering out behaviorally-irrelevant signals, our study found that accurate decoding performance can be achieved through linear readout, suggesting that the motor cortex may perform linear readout to generate movement behaviors. Similar findings have been reported in other cortices, such as the inferotemporal cortex (IT) (Majaj et al., 2015), perirhinal cortex (PRH) (Pagan et al., 2013), and somatosensory cortex (Nogueira et al., 2023), supporting the idea of linear readout as a general principle in the brain. However, further experiments are needed to verify this hypothesis across a wider range of cortical regions. While neurons encode significant behavioral information nonlinearly, our study demonstrates that when neuronal assemblies with diverse firing patterns collaborate to decode information, they do not necessarily need to fully exert their nonlinear ability. Instead, a simple linear combination of these assemblies can accomplish the decoding task. This finding suggests that the complexity of encoding mechanisms underlies the simplicity of decoding mechanisms.

With regard to studying decoding mechanisms, recent studies (Pitkow et al., 2015; Yang et al., 2021) have focused on investigating how the brain decodes task information in the presence of noise. Unlike their works, we focus on studying the decoding mechanism after removing the noise (irrelevant signals). Although our research perspectives differ, our results support their idea that the brain needs nonlinear operations to suppress noise interference (Yang et al., 2021). However, in principle, denoising (or filtering out behaviorally-irrelevant signals) and decoding are two separate steps. Using a communication system as a metaphor, the receiver should first denoise the signal received from the sender and then decrypt the message. The same should be true for the brain. Therefore, removing irrelevant signals to study the readout mechanism is reasonable. Additionally, investigating how the brain filters out noise or separates behaviorally-irrelevant factors is a promising research direction (Schneider et al., 2018); distilled behaviorally-relevant signals can serve as reliable reference signals to facilitate this study. Furthermore, our study reveals that irrelevant signals are the most critical factor affecting accurate and robust decoding, and that achieving accurate and robust linear decoding requires weak neural responses. These findings have two important implications for developing accurate and robust BMIs: designing preprocessing filtering algorithms or decoding algorithms that filter out behaviorally-irrelevant signals, and paying attention to the role of weak neural responses in motor control. More generally, our study provides a powerful framework for separating behaviorally-relevant and irrelevant signals, which can be applied to other cortical data to uncover more hidden neural mechanisms.

Methods

Dataset and preprocessing

Three datasets with different paradigms are employed: an obstacle avoidance task dataset (Wang et al., 2015), a center-out reaching task dataset (Dyer et al., 2017), and a self-paced reaching task dataset (O'Doherty et al., 2017).

The first dataset is the obstacle avoidance dataset. An adult male Rhesus monkey was trained to use a joystick to move a computer cursor to bypass an obstacle and reach the target. Neural data were recorded from the upper limb area of the monkey's dorsal premotor cortex (PMd) using a 96-electrode Utah array (Blackrock Microsystems Inc., USA). Multi-unit activity (MUA) signals are used in the present study. The corresponding behavioral data (velocity) were simultaneously collected. There are two days of data (20140106 and 20140107), and each day contains 171 trials on average. All animal handling procedures were authorized by the Animal Care Committee at Zhejiang University, China, and conducted following the Guide for Care and Use of Laboratory Animals (China Ministry of Health).

The second dataset is publicly available and provided by the Kording Lab (Dyer et al., 2017). The monkey was trained to complete two-dimensional 8-direction center-out reaching tasks. We used two days of data from subject C (20161007 and 20161011); each day contains 190 trials on average. The neural data are spike-sorted PMd signals. The behavioral data were simultaneously collected as instantaneous velocity.

The third dataset is publicly available and provided by the Sabes Lab (Zenodo dataset) (O'Doherty et al., 2017). An adult male Rhesus monkey was trained to perform self-paced reaching tasks within an 8-by-8 square grid, with no inter-trial intervals during the experiment. Neural data were recorded from the monkey's primary motor cortex (M1) with a 96-channel silicon microelectrode array. The neural data are multi-unit activity (MUA) signals. Hand velocity was obtained from the position through a discrete derivative. The recording period for the data (20170124 01) is about 10 minutes.

For all datasets, the neural signals were binned with a 100 ms sliding window without overlap. As a preprocessing step, we smoothed the neural signals using a moving average filter spanning three bins. We excluded electrodes with low mean firing rates (<0.5 Hz across all bins) and did not perform any other pre-selection of neurons. For the computation of the Fano factor (FF), we chose twelve and fourteen time points as the trial-length thresholds for datasets A and B, respectively; trials shorter than the threshold were discarded (about 7% and 2% of trials for datasets A and B), and trials longer than the threshold were truncated to the threshold length from the starting point. Since dataset C has no trial information, FF is not calculated for this dataset. For the analyses of datasets A and B, we selected one day of each dataset (20140107 for dataset A and 20161011 for dataset B).
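
To make this pipeline concrete, the sketch below applies the moving-average smoothing and the low-firing-rate exclusion to already-binned counts; the function name and the `spike_counts` array layout are illustrative assumptions, not the study's released code.

```python
import numpy as np

def preprocess(spike_counts, bin_width_s=0.1, min_rate_hz=0.5, smooth_bins=3):
    """spike_counts: [time_bins x units] array of 100 ms bin counts."""
    # Exclude units whose mean firing rate across all bins is below 0.5 Hz.
    mean_rate = spike_counts.mean(axis=0) / bin_width_s
    kept = spike_counts[:, mean_rate >= min_rate_hz]
    # Three-bin moving-average filter along the time axis.
    kernel = np.ones(smooth_bins) / smooth_bins
    return np.apply_along_axis(
        lambda u: np.convolve(u, kernel, mode="same"), 0, kept)
```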

The synthetic dataset

The synthetic dataset is used to demonstrate that d-VAE can extract effective behaviorally-relevant signals that are similar to the ground truth signals. The specific process of generating the synthetic data is as follows. First, we selected the top ten R2 neurons of dataset B (20161011). Second, we used a deep neural network to learn the encoding model between movement kinematics (movement velocity of dataset B) and neural signals. The network is a multilayer perceptron (MLP) with two hidden layers of 500 units each and ReLU activations; a SoftPlus activation function follows the last layer. The reconstruction loss is the Poisson likelihood function. After learning the encoding model, we used it to generate the ground truth behaviorally-relevant signals from kinematics. Then we added white Gaussian noise to the behaviorally-relevant signals such that the noisy signals have a signal-to-noise ratio (SNR) of 10 dB. We regarded the noisy signals as raw signals and the added Gaussian noise as behaviorally-irrelevant signals. Finally, we separated the synthetic data into five folds for cross-validated model evaluation.
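
A minimal PyTorch sketch of this pipeline, assuming 2-D velocity input and ten output neurons, is given below; the variable names and the exact noise-scaling routine are assumptions.

```python
import torch
import torch.nn as nn

# Encoding model: velocity (2-D) -> firing rates of 10 neurons, with two
# 500-unit ReLU hidden layers and a SoftPlus output, trained under a
# Poisson likelihood.
encoder = nn.Sequential(
    nn.Linear(2, 500), nn.ReLU(),
    nn.Linear(500, 500), nn.ReLU(),
    nn.Linear(500, 10), nn.Softplus())
poisson_loss = nn.PoissonNLLLoss(log_input=False)

def add_noise_at_snr(signal: torch.Tensor, snr_db: float = 10.0):
    """Add white Gaussian noise so the noisy signal has the requested SNR."""
    noise_power = signal.pow(2).mean() / (10 ** (snr_db / 10))
    noise = torch.randn_like(signal) * noise_power.sqrt()
    return signal + noise, noise  # raw signals, behaviorally-irrelevant part
```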

Simulation experiments

In this experiment, we simulated four neurons, including two larger R2 neurons (R2 greater than 0.4) and two smaller R2 neurons (R2 lower than 0.03). The larger R2 neurons are strongly tuned and are corrupted with small Gaussian noise. In contrast, the smaller R2 neurons strongly but nonlinearly encode behavioral information and are corrupted with large Gaussian noise. We regarded the signals without added Gaussian noise as ground-truth signals and those with added Gaussian noise as raw signals. We used the velocity of dataset B (20161011) for the simulation. Simulation dataset A includes all four neurons and simulates the situation where the smaller R2 neurons contain a lot of behavioral information, but this information cannot be decoded from these neurons because it is covered by noise. Simulation dataset B includes only the two larger R2 neurons and simulates the situation where linear decoding is significantly inferior to nonlinear decoding.

The encoding model of larger R2 neurons is as follows:

$$f_i(t) = b + m\,|v(t)|\cos\big(\theta_v(t) - \theta_{PD_i}\big) + b_s\,|v(t)| + \epsilon_i(t),$$

where the subscript i refers to the ith of N neurons (N = 2 in this experiment), t represents time, b = 30 Hz represents the baseline firing rate of neurons, m = 0.25 Hz per cm s−1 indicates the modulation depth, the PDs of neurons θPDi (i = 1, 2) are set to 30° and 60°, εi(t) ∼ N(0, 0.3) denotes random Gaussian noise where the mean is zero and the standard deviation is 0.3, v represents velocity, |v| represents the norm of velocity (speed), θv(t) denotes the direction of v(t), and bs = 0.25 Hz per cm s−1 denotes the speed bias.

The encoding model of smaller R2 neurons is as follows:

$$f_i(t) = b + m\,|v(t)|\cos\big(\theta_v(t) - \theta_{PD_i}\big) + b_s\,|v(t)| + b_{ss}\,|v(t)|^2 + \epsilon_i(t),$$

where b = 30 Hz, m = 0.25 Hz per cm s−1, the PDs of neurons θPDi (i = 1, 2) are set to 40° and 50°, εi(t) ∼ N(0, 1) denotes random Gaussian noise where the mean is zero and the standard deviation is one, bs = 0.25 Hz per cm s−1, and bss = 0.1 Hz per (cm s−1)² denotes the speed-square bias for increasing the encoding nonlinearity of neurons.
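
For illustration, the snippet below generates rates under the two models with the constants stated above; the helper name `tuned_rate` is hypothetical, and the cosine-tuning form follows the symbol definitions where the printed equations were lost.

```python
import numpy as np

def tuned_rate(v, pd_deg, b=30.0, m=0.25, bs=0.25, bss=0.0, noise_sd=0.3):
    """Firing rate for velocity v (T x 2 array): baseline + cosine tuning
    + speed bias (+ optional speed-square bias) + Gaussian noise."""
    speed = np.linalg.norm(v, axis=1)
    theta = np.arctan2(v[:, 1], v[:, 0])          # movement direction
    rate = (b + m * speed * np.cos(theta - np.deg2rad(pd_deg))
            + bs * speed + bss * speed ** 2)
    return rate + np.random.normal(0.0, noise_sd, size=rate.shape)

# Larger-R2 neurons: PDs of 30 and 60 degrees, small noise (sd = 0.3).
# Smaller-R2 neurons: PDs of 40 and 50 degrees, bss = 0.1, large noise (sd = 1).
```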

Variational autoencoders

Variational autoencoders (VAEs) (Kingma and Welling, 2013) are well-known latent variable generative models. They use latent representations z ∼ p(z) to generate data x through the conditional distribution pθ(x | z), that is, pθ(x) = ∫ p(z) pθ(x | z) dz. The usual training objective is the maximum marginal likelihood log pθ(x). Since directly computing this integral is intractable, VAEs introduce an amortized inference distribution qφ(z | x) and jointly optimize the evidence lower bound (ELBO) on the log-likelihood:

$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{KL}\big(q_\phi(z \mid x)\,\big\|\,p(z)\big), \tag{3}$$

where the first term on the right-hand side of Eq. 3 is called the reconstruction term and the second the regularization term. qφ(z | x) is usually parameterized with a neural network, called the encoder or inference model, with parameters φ, and pθ(x | z) is parameterized with a neural network, called the decoder or generative model, with parameters θ. p(z) is the prior over the latent representations and is set to the Gaussian distribution N(0, I) in vanilla VAE. DKL denotes the Kullback–Leibler divergence.

Distill-VAE (d-VAE)

Notations. x ∊ ℝn denotes raw neural signals. xr ∊ ℝn represents behaviorally-relevant signals. xi = x − xr ∊ ℝn represents behaviorally-irrelevant signals. z ∊ ℝd denotes the latent neural representations. y ∊ ℝk represents kinematics. f: ℝn → ℝd represents the inference model (encoder) of VAE. g: ℝd → ℝn represents the generative model (decoder) of VAE.

To distill behaviorally-relevant neural signals, our approach utilizes the trade-off between the decoding and generating abilities of the distilled behaviorally-relevant signals xr. The basic assumption is that generated behaviorally-relevant signals containing behaviorally-irrelevant signals harm decoding ability. Our method for distilling behaviorally-relevant signals consists of three steps: identifying latent representations z, generating behaviorally-relevant signals xr, and decoding behaviorally-relevant signals xr.

Identifying latent representations

Identifying latent representations containing behaviorally-relevant information is the crucial part because the latent representations influence the subsequent generation. Effective representations are more likely to generate proper behaviorally-relevant neural signals xr. d-VAE identifies latent representations with the inference model f, that is, z = f(x), and guides the latent representations to contain behavioral information through an affine map h: ℝd → ℝk under the loss ℒdec1:

$$\mathcal{L}_{dec1} = \mathrm{MSE}\big(h(f(x)),\, y\big),$$

where MSE(·, ·) denotes the mean squared error loss. In other words, we encourage the latent representations to decode kinematics in order to distill behaviorally-relevant information.

Generating behaviorally-relevant signals

After identifying the latent representations z, we send them to the generative model g to generate behaviorally-relevant neural signals xr, that is, xr = g(z). We use the following loss to make the behaviorally-relevant signals reconstruct the raw signals as closely as possible:

$$\mathcal{L}_{rec} = \mathrm{Poisson}\big(x_r,\, x\big),$$

where Poisson(·, ·) denotes the Poisson negative log-likelihood loss. It is important to note that optimizing the generation of behaviorally-relevant signals to accurately reconstruct noisy raw signals may result in the inclusion of many behaviorally-irrelevant signals in the generated signals, which deviates from our initial goal of extracting behaviorally-relevant signals. In the following subsection, we introduce how to avoid generating behaviorally-irrelevant signals.

Decoding behaviorally-relevant signals

As mentioned above, if the generation of behaviorally-relevant signals xr is guided only by ℒrec, the generated signals may contain many behaviorally-irrelevant signals. To avoid this, we introduce the decoding loss ℒdec2 to constrain xr to decode behavioral information. The basic assumption is that behaviorally-irrelevant signals act like noise for decoding behavioral information and are detrimental to decoding. Thus, there is a trade-off between the generating and decoding abilities of xr: the more behaviorally-irrelevant signals xr contains, the more decoding performance xr loses. The decoding loss ℒdec2 is as follows:

$$\mathcal{L}_{dec2} = \mathrm{MSE}\big(h(f(x_r)),\, y\big).$$

We use the same networks f and h for xr and x in our experiments, because xr can act as data augmentation and help f distill robust representations without increasing the number of model parameters. In addition, we combine the two decoding losses into one:

$$\mathcal{L}_{dec} = \mathcal{L}_{dec1} + \mathcal{L}_{dec2}.$$

Learning the prior distribution with behavioral information

The prior distribution of the latent representations is crucial because inappropriate prior assumptions can degrade the latent representations and the generated neural signals. Vanilla VAE uses a Gaussian prior N(0, I) to regularize the space of the latent representations z. However, in neuroscience, the distribution of latent representations is unknown and may exceed the scope of a Gaussian distribution. Therefore, we adopt a neural network m: ℝk → ℝd to learn the prior representation zprior from kinematics y, that is, zprior = m(y). The prior representation zprior and z are aligned by the Kullback–Leibler divergence:

$$\mathcal{L}_{KL} = D_{KL}\big(q_\phi(z \mid x)\,\big\|\,p(z_{prior} \mid y)\big),$$

Since zprior and z are aligned and the generative network g models the relationship between z and xr, this is equivalent to indirectly establishing a neural encoding model from y to xr. Thus, we can observe the change of zprior and xr by changing y and can better understand the encoding mechanism of neural signals.

End-to-end optimization

d-VAE is optimized in an end-to-end manner under the following loss:

$$\mathcal{L} = \mathcal{L}_{rec} + \alpha\,\mathcal{L}_{dec} + \beta\,\mathcal{L}_{KL},$$

where β and α are hyperparameters: β is used to adjust the weight of the KL divergence, and α determines the trade-off between the reconstruction loss ℒrec and the decoding loss ℒdec.

In the training stage, we feed raw neural signals into the inference network f to get the latent representation z, which is regularized by zprior coming from the kinematics y and the network m. Then we use z to decode the kinematics y through the affine layer h and send z to the generative network g to generate neural signals xr. To ensure that xr preserves decoding ability, we send xr to f and h to decode y. The whole model is trained in an end-to-end manner under the guidance of the total loss. Once the model has been trained, we can feed raw neural signals into it to obtain behaviorally-relevant neural signals xr, and we can also generate behaviorally-relevant neural signals from the prior distribution zprior.
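
The sketch below assembles the losses defined in the preceding subsections into one training objective; the module names (`f_enc`, `g_dec`, `h_dec`, `m_prior`) are assumptions, and a squared-distance surrogate stands in for the KL term, whose exact form depends on the encoder's posterior parameterization.

```python
import torch.nn.functional as F

def dvae_loss(x, y, f_enc, g_dec, h_dec, m_prior, alpha, beta):
    """x: raw neural signals; y: kinematics."""
    z = f_enc(x)                       # latent representations, z = f(x)
    x_r = g_dec(z)                     # behaviorally-relevant signals, x_r = g(z)
    # L_dec = L_dec1 + L_dec2: x and x_r share the same networks f and h.
    l_dec = F.mse_loss(h_dec(z), y) + F.mse_loss(h_dec(f_enc(x_r)), y)
    # L_rec: Poisson negative log-likelihood of raw signals given x_r.
    l_rec = F.poisson_nll_loss(x_r, x, log_input=False)
    # Alignment of z with the kinematics-driven prior z_prior = m(y).
    l_kl = F.mse_loss(z, m_prior(y))   # stand-in for the KL divergence
    return l_rec + alpha * l_dec + beta * l_kl
```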

The identifiability of d-VAE

Recent studies (Khemakhem et al., 2020) show that, to obtain identifiability of the latent variables, one should either restrict the structure of the generative model or introduce additional constraints on the distribution of the latent variables; they prove that additive-noise models are identifiable under certain assumptions. Zhou and Wei (2020) proved that pi-VAE is identifiable under some assumptions. Since d-VAE is a straightforward extension of pi-VAE and, like pi-VAE, uses auxiliary variables (kinematics) to constrain the latent variables, d-VAE is also identifiable under the same assumptions.

Cross-validation evaluation of models

For each model, we use five-fold cross-validation to assess performance. Specifically, we divide the data into five equal folds. In each experiment, we take one fold for testing and use the rest for training (three folds as the training set and one as the validation set). The reported performance is averaged over the test sets of the five experiments. The validation sets are used to choose hyperparameters. To avoid overfitting, we apply an early stopping strategy. Specifically, we assess the criterion (loss for training distillation methods, R2 for training ANN and KF) on the validation set at every epoch during training. The model is saved when it achieves better validation performance than in earlier epochs. If the model cannot improve on the best previously obtained performance by 1% within a specified number of epochs, we stop the training process.

The strategy for selecting effective behaviorally-relevant signals

As previously mentioned, the hyperparameter α plays a crucial role in balancing the trade-off between the reconstruction and decoding losses. Once the appropriate value of α is determined, we can use it to obtain accurate behaviorally-relevant signals for subsequent analysis. To determine the optimal value of α, we first enumerated different values of α to guide d-VAE in distilling the behaviorally-relevant signals. Next, we used ANN to evaluate the decoding R2 of the behaviorally-relevant (Dre) and irrelevant (Dir) signals generated under each α value. Finally, we selected the α value using the criterion 0.75 × Dre + 0.25 × (1 − Dir); the α value that obtained the highest criterion score on the validation set was selected as the optimal value.
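
A tiny sketch of this selection rule, with a hypothetical `candidates` mapping from α to validation decoding R2 values:

```python
def select_alpha(candidates):
    """candidates: dict mapping alpha -> (D_re, D_ir) validation decoding R2."""
    score = lambda d_re, d_ir: 0.75 * d_re + 0.25 * (1.0 - d_ir)
    return max(candidates, key=lambda a: score(*candidates[a]))

# e.g. select_alpha({0.3: (0.80, 0.10), 0.9: (0.83, 0.35)}) returns 0.3.
```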

Note that we did not use the neural similarity between behaviorally-relevant and raw signals as a criterion for selecting behaviorally-relevant signals, because determining a threshold for neural similarity is challenging. However, not using similarity as a criterion does not affect the selection of suitable signals, because the decoding performance of the behaviorally-irrelevant signals indirectly reflects the degree of similarity between the generated behaviorally-relevant signals and the raw signals. Specifically, if the generated behaviorally-relevant signals are dissimilar to the raw signals, the behaviorally-irrelevant signals will contain many useful signals; in other words, when the neural similarity between behaviorally-relevant and raw signals is low, the decoding performance of the behaviorally-irrelevant signals is high. Therefore, the decoding performance of the irrelevant signals is a reasonable alternative to neural similarity.

Implementation details for methods

All the VAE-based models use the Poisson observation function. The details of the different methods are as follows:

  • d-VAE. The encoder of d-VAE is a two-hidden-layer multilayer perceptron (MLP) with 300 and 100 hidden units. The activation function of the hidden layers is ReLU. The dimension of the latent variable is set to 50. The decoder of d-VAE is symmetric with the encoder, and the last layer of the decoder is followed by a SoftPlus activation function. The prior network is a one-hidden-layer MLP with 300 units. β is set to 0.001. α is set to 0.3, 0.4, 0.7, and 0.9 for datasets A, B, and C and the synthetic dataset, respectively.

  • pi-VAE. The original paper utilizes label information (as shown in Eq. 6) to approximate the posterior of the latent variable and performs Monte Carlo sampling for decoding during the test stage. However, in our signal generation setting, it is inappropriate to use label information (kinematics) for extracting behaviorally-relevant signals during the test stage. As a result, we modified the model to exclude the use of label information in approximating the posterior. The encoder, decoder, and prior networks of our pi-VAE are kept the same as those in d-VAE.

  • VAE. The parameters of the encoder and decoder of VAE are the same as those of d-VAE.

  • LFADS. The hidden units of the encoder for the generator's initial conditions, the controller, and the generator are set to 200. The dimensionality of the latent factors is set to 50. The dimensionality of the inferred inputs is set to 1. The training strategy follows the practice of the original paper.

  • PSID. The horizon parameter of PSID is set to 3, and the dimensionality of latent variable is set to 6.

  • CEBRA. We use the CEBRA-behavior mode. The learning rate is set to 0.003, 0.003, 0.001, and 0.005 for datasets A, B, and C and the synthetic dataset, respectively. The time offset is set to 10, 10, 15, and 10, respectively. The output dimension is set to 10 for all datasets. The number of iterations is set to 5000. The temperature is set to 1, and the temperature mode is set to auto. The batch size is set to 512. We performed a grid search for the learning rate in {0.0001, 0.0005, 0.001, 0.003, 0.005}, the time offset in {5, 10, 15, 20}, and the output dimension in {5, 10, 15, 20, 50}.

  • ANN. ANN has two hidden layers with 300 and 100 hidden units.

  • KF. The matrix parameters of the observation and state transition processes are optimized with the least squares algorithm.

Percentage of explained variance captured in a subspace

We applied PCA to behaviorally-relevant and irrelevant signals to get relevant PCs and irrelevant PCs. Then we used the percentage of variance captured (also called the alignment index) to quantify how much variance of the irrelevant signals can be captured by the relevant PCs by projecting the irrelevant signals onto the subspace composed of some relevant PCs, and vice versa. The percentage of variance captured is:

$$\text{Percentage variance captured} = \frac{\mathrm{Tr}\big(D^{\top} C D\big)}{\sum_{i=1}^{d} \lambda_i} \times 100\%,$$

where D ∊ ℝN×d is a d-dimensional population basis over N neurons, C ∊ ℝN×N is the covariance matrix of the neural signals, Tr(·) denotes the trace of a matrix, and λi is the ith largest eigenvalue of C. The matrices D and C can be computed for both behaviorally-irrelevant and relevant signals. The percentage variance captured is a quantity between 0% and 100%.
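
A sketch of this computation, assuming `D` holds the top-d PCs (as columns) of one signal type and `C` is the covariance of the other:

```python
import numpy as np

def alignment_index(D, C):
    """D: N x d orthonormal PC basis; C: N x N covariance matrix."""
    d = D.shape[1]
    captured = np.trace(D.T @ C @ D)
    top_d = np.sort(np.linalg.eigvalsh(C))[::-1][:d].sum()  # sum of top-d eigenvalues
    return 100.0 * captured / top_d
```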

The composition of raw signals’ variance

Suppose x ∊ ℝT, y ∊ ℝT, and z ∊ ℝT are the random variables of a single neuron for behaviorally-relevant signals, behaviorally-irrelevant signals, and raw signals, respectively, where z = x + y and T denotes the number of samples. The composition of raw signals' variance is as follows:

$$\mathrm{Var}(z) = \mathbb{E}\big[(x + y)^2\big] - \big(\mathbb{E}[x + y]\big)^2 = \mathrm{Var}(x) + \mathrm{Var}(y) + 2\,\mathrm{Cov}(x, y),$$

where 𝔼, Var, and Cov denote expectation, variance, and covariance, respectively. Thus, the variance of raw signals is composed of the variance of relevant signals Var(x), the variance of irrelevant signals Var(y), and the correlation between relevant and irrelevant signals 2Cov(x, y). If there are N neurons, we calculate the composition for each neuron separately and then sum over neurons to get the total composition of raw signals.
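
A sketch of this decomposition over all neurons (array names assumed):

```python
import numpy as np

def variance_composition(x, y):
    """x, y: T x N relevant and irrelevant signals; raw signals z = x + y."""
    var_rel = x.var(axis=0, ddof=1).sum()
    var_irr = y.var(axis=0, ddof=1).sum()
    cov_term = 2.0 * sum(np.cov(x[:, n], y[:, n])[0, 1]
                         for n in range(x.shape[1]))
    return var_rel, var_irr, cov_term  # the three terms sum to Var(z)
```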

Reordered correlation matrix of neurons

The correlation matrix is reordered with a simple grouping method. The order of the reordered neurons is determined from the raw neural signals and is then applied to the behaviorally-relevant signals.

The steps of the group method are as follows:

Step 1: We compute the correlation matrix in the original neuron order and create a list A containing all neuron numbers and an empty list B.

Step 2: We select the neuron whose row contains the largest off-diagonal value of the correlation matrix and choose the nine neurons in list A that are most similar to the selected neuron.

Step 3: We remove these selected neurons from list A and add them to list B in descending order of correlation value.

Step 4: Repeat steps 2 and 3 until list A is empty.
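
A compact implementation of these steps under stated assumptions (group size of ten, i.e., the seed neuron plus its nine most similar neighbors; the final group may be smaller):

```python
import numpy as np

def reorder_neurons(signals):
    """signals: T x N array; returns a neuron ordering for the correlation matrix."""
    corr = np.corrcoef(signals.T)
    np.fill_diagonal(corr, -np.inf)               # ignore the diagonal (step 1)
    remaining, order = list(range(corr.shape[0])), []
    while remaining:                              # step 4: repeat until A is empty
        # Step 2: seed = neuron on the row of the largest remaining correlation.
        sub = corr[np.ix_(remaining, remaining)]
        seed = remaining[np.unravel_index(np.argmax(sub), sub.shape)[0]]
        # Step 3: seed plus its nine most similar neighbors, in descending order.
        sims = sorted(remaining, key=lambda n: corr[seed, n], reverse=True)
        group = [seed] + [n for n in sims if n != seed][:9]
        order.extend(group)
        remaining = [n for n in remaining if n not in group]
    return order
```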

The improvement ratio of lower and higher speed regions

Split lower and higher speed regions

Since the speed ranges of different datasets are different, it is hard to determine a common speed threshold to split the lower- and higher-speed regions. Here we used the squared error as the criterion to split the two speed regions. For the convenience of calculating the absolute improvement ratio, we also need a unified benchmark for comparison; therefore, we use half of the total squared error as the threshold. Specifically, we first calculated the total squared error (Ep) of all samples between the actual velocity and the velocity predicted from primary signals only. Then we enumerated the speed from 1 to 50 with a step of 0.1 and calculated the total squared error of the selected samples whose speed is less than the enumerated speed. Once the total squared error of the selected samples is greater than or equal to half the total squared error of all samples (0.5Ep), the enumerated speed is set as the speed threshold. The samples whose speed is less than or equal to the speed threshold belong to the lower-speed region, and those whose speed is greater than the speed threshold belong to the higher-speed region. By construction, the squared error of the lower-speed part is approximately equal to that of the higher one, that is, El ≈ Eh ≈ 0.5Ep (the difference is negligible).
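
A sketch of the threshold search (per-sample arrays assumed):

```python
import numpy as np

def split_speed_threshold(speed, sq_error):
    """speed, sq_error: per-sample speed and squared error from primary signals."""
    half_total = 0.5 * sq_error.sum()             # 0.5 * E_p
    for thr in np.arange(1.0, 50.0 + 1e-9, 0.1):  # enumerate 1 to 50, step 0.1
        if sq_error[speed < thr].sum() >= half_total:
            return thr
    return 50.0
```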

The absolute improvement ratio

After splitting the speed regions, we calculated the improvement of the two regions obtained by superimposing the secondary signals on the primary signals, and obtained the squared errors of the lower and higher regions. Then we calculated the absolute improvement ratio (AIR):

$$\mathrm{AIR} = \frac{E_{\mathrm{primary}} - E_{\mathrm{full}}}{0.5 E_p},$$

where Eprimary and Efull denote a region's squared error decoded from primary signals alone and from full signals (primary plus secondary), respectively.

Since the two regions refer to a common standard (0.5Ep), the improvement ratios of the two regions can be directly compared; that is why we call it the absolute improvement ratio.

The relative improvement ratio

The relative improvement ratio (RIR) measures the improvement ratio of each sample relative to itself before and after superimposing the secondary signals. The relative improvement ratio is computed as follows:

$$\mathrm{RIR}_i = \frac{e_i^{\mathrm{primary}} - e_i^{\mathrm{full}}}{e_i^{\mathrm{primary}}},$$

where i denotes the ith sample, and e_i^primary and e_i^full denote the squared error of the ith sample decoded from primary signals and full signals, respectively.
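
Both ratios can be computed as below; the variable names are assumptions matching the reconstructed formulas above:

```python
import numpy as np

def improvement_ratios(err_primary, err_full, region_mask, total_primary_err):
    """err_primary, err_full: per-sample squared errors before/after adding
    secondary signals; region_mask selects one speed region."""
    # Absolute improvement ratio of a region, referenced to 0.5 * E_p.
    air = (err_primary[region_mask].sum() - err_full[region_mask].sum()) \
          / (0.5 * total_primary_err)
    # Relative improvement ratio of each sample with respect to itself.
    rir = (err_primary - err_full) / err_primary
    return air, rir
```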

Data availability

The datasets are available for research purposes from the corresponding author on reasonable request.

Code availability

The code will be publicly available on GitHub (see https://github.com/eric0li/d-VAE).

Acknowledgements

We would like to thank Yuxiao Yang and Huaqin Sun for valuable discussions. This work was partly supported by grants from the National Key R&D Program of China (2018YFA0701400), the Key R&D Program of Zhejiang (no. 2022C03011), the Chuanqi Research and Development Center of Zhejiang University, the Starry Night Science Fund of Zhejiang University Shanghai Institute for Advanced Study (SN-ZJU-SIAS-002), and the Fundamental Research Funds for the Central Universities.

Author contributions

Y.G.L. conceived the study, designed the experiments, analyzed the data, and wrote the manuscript.

X.Y.Z. designed correlation analysis experiments and modified the manuscript. Y.M.W. and Y.Q. supervised this study.

Competing interests

The authors declare no competing interests.

Supplementary Information

Evaluation of separated signals on the synthetic dataset.

a, The temporal neuronal activity of raw signals (purple line) of an example test trial, which is decomposed into relevant (b) and irrelevant (c) signals. b, Relevant signals (orange lines) extracted by d-VAE under three distillation cases, where bold gray lines represent the ground truth relevant signals. Results show that when α = 0.09, the relevant signals are overly similar to the raw signals but not to the ground truth; when α = 0.9, the relevant signals closely match the ground truth; when α = 9, the relevant signals are not similar to the ground truth. c, Same as b, but for irrelevant signals (blue lines). Notably, when α = 9, some useful signals are left in the irrelevant signals. d, The decoding R2 of distilled relevant signals in the three cases. Error bars indicate mean ± s.d. across five cross-validation folds. Results demonstrate that decoding R2 increases as α increases. e, Same as d, but for irrelevant signals. Notably, when α = 9, the irrelevant signals contain substantial behavioral information. f, The neural similarity between relevant and raw signals. Results show that the neural R2 decreases as α increases. g, The neural R2 between relevant signals and the ground truth. Results show that d-VAE can utilize a proper trade-off to extract effective relevant signals that are similar to the ground truth. h, The decoding R2 between true velocity and predicted velocity of raw signals (purple bars with slash lines), the ground truth signals (gray), and behaviorally-relevant signals obtained by d-VAE (red), PSID (pink), pi-VAE (green), LFADS (blue), and VAE (light green) on dataset A. Error bars denote mean ± standard deviation (s.d.) across five cross-validation folds. Asterisks represent significance of Wilcoxon rank-sum test with *P < 0.05, **P < 0.01. i, Same as h, but for irrelevant signals. j, The neural R2 between generated relevant signals and raw signals. k, Same as j, but for the ground truth.

Decoding performance comparison with CEBRA.

a, The decoding R2 comparison between d-VAE and CEBRA on synthetic dataset. The red bar represents the behaviorally-relevant signals extracted by d-VAE, and the light purple bar represents the behaviorally-relevant embeddings extracted by CEBRA. Error bars indicate mean ± s.d. across five cross-validation folds. Asterisks denote significance of Wilcoxon rank-sum test with *P < 0.05, **P < 0.01. b-d, Same as a, but for datasets A, B, and C, respectively.

The effect of irrelevant signals on relevant signals at the single-neuron level.

a,b, Same as Fig. 3, but for dataset C. a, The angle difference (AD) of preferred direction (PD) between raw and distilled signals as a function of the R2 of raw signals. Each black point represents a neuron (n=91). The red curve is the fitting curve between R2 and AD. Five example larger R2 neurons' PDs are shown in the inset plot, where the solid line arrows represent the PDs of relevant signals, and the dotted line arrows represent the PDs of raw signals. b, Comparison of the cosine tuning fit (R2) before and after distillation of single neurons (black points), where the x-axis and y-axis represent neurons' R2 of raw and distilled signals.

The firing activity of example neurons.

a, Example of three neurons’ raw firing activity decomposed into behaviorally-relevant and irrelevant parts using all trials in held-out test sets for four conditions (4 of 8 directions) of center-out reaching task. b, Example of three neurons’ raw firing activity decomposed into behaviorally-relevant and irrelevant parts using all trials in held-out test sets for four conditions (4 of 12 conditions) of obstacle avoidance task.

The effect of irrelevant signals on analyzing neural activity at the population level.

a-d, Same as Fig. 4, but for dataset C. a,b, PCA is separately applied on relevant and irrelevant signals to get relevant PCs and irrelevant PCs. The thick lines represent the cumulative variance explained for the signals on which PCA has been performed, while the thin lines represent the variance explained by those PCs for the other signals. Red, blue, and gray colors indicate relevant signals, irrelevant signals, and random Gaussian noise (for chance level). The cumulative variance explained for behaviorally-relevant (a) and irrelevant (b) signals obtained by d-VAE. c,d, PCA is applied on raw signals to get raw PCs. c, The bar plot represents the composition of each raw PC. The inset pie plot shows the overall proportion of raw signals, where red, blue, and purple colors indicate relevant signals, irrelevant signals, and the correlation between relevant and irrelevant signals. The PC marked with a red triangle indicates the last PC where the variance of relevant signals is greater than or equal to that of irrelevant signals. d, The cumulative variance explained by raw PCs for different signals, where the thick lines represent the cumulative variance explained for raw signals (purple), while the thin lines represent the variance explained for relevant (red) and irrelevant (blue) signals.

The effect of irrelevant signals obtained by pi-VAE on analyzing neural activity at the population level.

a-l, Same as Fig. 4 and Fig. S5, but for pi-VAE. a,b, PCA is separately applied on relevant and irrelevant signals to get relevant PCs and irrelevant PCs. The thick lines represent the cumulative variance explained for the signals on which PCA has been performed, while the thin lines represent the variance explained by those PCs for the other signals. Red, blue, and gray colors indicate relevant signals, irrelevant signals, and random Gaussian noise (for chance level). The cumulative variance explained for behaviorally-relevant (a) and irrelevant (b) signals on dataset A. c,d, PCA is applied on raw signals to get raw PCs. c, The bar plot represents the composition of each raw PC. The inset pie plot shows the overall proportion of raw signals, where red, blue, and purple colors indicate relevant signals, irrelevant signals, and the correlation between relevant and irrelevant signals. The PC marked with a red triangle indicates the last PC where the variance of relevant signals is greater than or equal to that of irrelevant signals. d, The cumulative variance explained by raw PCs for different signals, where the thick line represents the cumulative variance explained for raw signals (purple), while the thin lines represent the variance explained for relevant (red) and irrelevant (blue) signals. e-h, i-l, Same as a-d, but for datasets B and C.

The rotational dynamics of raw, relevant, and irrelevant signals.

Datasets A and B have twelve and eight conditions, respectively. We computed the trial-averaged neural responses for each condition and then applied jPCA to the raw, relevant, and irrelevant signals to get the top two jPCs, respectively. a, The rotational dynamics of raw neural signals. b, The rotational dynamics of relevant signals obtained by d-VAE. c, The rotational dynamics of irrelevant signals obtained by d-VAE. The rotational dynamics of behaviorally-relevant signals are similar to those of raw signals, whereas the rotational dynamics of behaviorally-irrelevant signals are irregular. d-f, Same as a-c, but for dataset B.

The cumulative variance of raw and behaviorally-relevant signals.

a, PCA is applied separately on raw and distilled behaviorally-relevant signals to get raw PCs and relevant PCs. The cumulative variance of raw (purple) and behaviorally-relevant (red) signals on dataset A (n=90). The two upper-left curves denote the variance accumulated from larger- to smaller-variance PCs; the two lower-right curves indicate accumulation from smaller- to larger-variance PCs. The horizontal lines represent 10% and 90% variance explained. The vertical lines indicate the number of dimensions accounting for the last 10% and top 90% of the variance of behaviorally-relevant (red) and raw (purple) signals. Here we call the subspace composed of the PCs capturing the top 90% of variance the primary subspace, and the subspace composed of the PCs capturing the last 10% of variance the secondary subspace. The dimensionality of the primary subspace of raw signals is significantly higher than that of relevant signals, indicating that irrelevant signals lead us to overestimate the dimensionality of specific behaviors. b,c, Same as a, but for datasets B (n=159) and C (n=91).

Trivial neural activities encode rich behavioral information in complex nonlinear ways.

a-c, Same as Fig. 5, but for dataset C (n=91). d-f, Same as Fig. 6, but for dataset C. a, The comparison of decoding performance between raw (purple) and distilled (red) signals with different neuron groups, including smaller R2 neurons (R2 <= 0.03), larger R2 neurons (R2 > 0.03), and all neurons. Error bars indicate mean ± s.d. across cross-validation folds. Asterisks denote significance of Wilcoxon rank-sum test with *P < 0.05, **P < 0.01. b, The correlation matrix of all neurons of raw and behaviorally-relevant signals. c, The decoding performance of KF (left) and ANN (right) with neurons dropped out from larger to smaller R2. The vertical gray lines indicate the number of dropped neurons at which raw and behaviorally-relevant signals have the greatest performance difference. d, The comparison of decoding performance between raw (purple) and distilled (red) signals composed of different raw-PC groups, including smaller variance PCs (the proportion of irrelevant signals that make up raw PCs is higher than that of relevant signals) and larger variance PCs (the proportion of irrelevant signals is lower than that of relevant ones). Error bars indicate mean ± s.d. across five cross-validation folds. Asterisks denote significance of Wilcoxon rank-sum test with *P < 0.05, **P < 0.01. e, The cumulative decoding performance of signals composed of cumulative PCs that are ordered from smaller to larger variance using KF (left) and ANN (right). The red patches indicate the decoding ability of the last 10% variance of relevant signals. f, Same as e, but with PCs ordered from larger to smaller variance. The red patches indicate the decoding gain of the last 10% variance signals of relevant signals superimposed on their top 90% variance signals.

Simulation results.

a-c, The simulation dataset A simulated a situation where the smaller R2 neurons contain a certain amount of behavioral information, but this information cannot be decoded from these neurons because it is covered by noise. We used this experiment to demonstrate that d-VAE can utilize the larger R2 neurons to help the smaller R2 neurons recover their original activity. a, The comparison of decoding performance between raw (purple), distilled (red), and ground truth (gray) signals with different neuron groups, including smaller R2 neurons (R2 <= 0.03), larger R2 neurons (R2 > 0.03), and all neurons. Error bars indicate mean ± s.d. (n=5 folds). Asterisks denote the significance of Wilcoxon rank-sum test with **P < 0.01. b, The decoding performance of behaviorally-irrelevant signals obtained by d-VAE with different neuron groups. Error bars indicate mean ± s.d. (n=5 folds). c, The neural similarity between distilled behaviorally-relevant (red) and raw (purple) signals and between relevant and ground truth (gray) signals with different neuron groups. Error bars indicate mean ± s.d. (n=5 folds). d-f, The simulation dataset B simulated a situation where linear decoding is significantly inferior to nonlinear decoding. We used this experiment to demonstrate that d-VAE cannot make the linear decoder achieve performance similar to that of the nonlinear decoder. d, The comparison of decoding performance between raw (purple), relevant (red), and ground truth (gray) signals. Error bars indicate mean ± s.d. (n=5 folds). e, The decoding performance of behaviorally-irrelevant signals obtained by d-VAE. Error bars indicate mean ± s.d. (n=5 folds). f, The neural similarity between distilled behaviorally-relevant (red) and raw (purple) signals and between relevant and ground truth (gray) signals. Error bars indicate mean ± s.d. (n=5 folds).

Smaller variance PC signals preferentially improve lower-speed velocity.

a, The comparison of the absolute improvement ratio between lower-speed (red) and higher-speed (purple) velocities when superimposing secondary signals on primary signals with KF on dataset A. Error bars indicate mean ± s.d. across five cross-validation folds. Asterisks denote significance of Wilcoxon rank-sum test with *P < 0.05, **P < 0.01. b,c, Same as a, but for datasets B and C. d, The comparison of the relative improvement ratio between lower-speed (red patch) and higher-speed (no patch) velocities when superimposing secondary signals on primary signals with KF on dataset B. The first-row plot shows five example trials' speed profiles of the velocity decoded using primary signals (light blue line) and full signals (dark blue line; secondary signals superimposed on primary signals), together with the true velocity (red line). The black horizontal line denotes the speed threshold. The second- and third-row plots are the same as the first-row plot, but for the X and Y velocities. The fourth-row plot shows the relative improvement ratio for each point in the trials.