Semantic illustration of extracting and validating behaviorally-relevant signals.

a-e, The ideal decomposition of raw signals. a, The temporal neuronal activity of raw signals, where the x-axis denotes time and the y-axis represents firing rate. Raw signals are decomposed into relevant (b) and irrelevant (d) signals. The red dotted line indicates the decoding performance of raw signals. The red and blue bars represent the decoding performance of relevant and irrelevant signals, respectively. The purple bar represents the reconstruction performance of relevant signals, which measures the neural similarity between generated signals and raw signals. The longer the bar, the better the performance. The ground truth of relevant signals decodes behavioral information perfectly (c, red bar) and is similar to raw signals to some extent (c, purple bar), and the ground truth of irrelevant signals contains little behavioral information (e, blue bar). f-h, Three different cases of behaviorally-relevant signal distillation. f, When the model is biased toward generating relevant signals that are similar to raw signals, it achieves high reconstruction performance, but the decoding performance suffers due to the inclusion of too many irrelevant signals. Because it is difficult for models to extract complete relevant signals, the residuals also contain some behavioral information. g, When the model is biased toward generating signals that prioritize decoding over similarity to raw signals, it achieves high decoding performance, but the reconstruction performance is low. Meanwhile, the residuals contain a significant amount of behavioral information. h, When the model balances the trade-off between the decoding and reconstruction capabilities of relevant signals, both decoding and reconstruction performance are good, and the residuals contain only a little behavioral information.
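
For clarity, the three bar quantities above can be summarized in a minimal sketch (illustrative only; the ridge readout and variable names are assumptions, not the paper's implementation, and the fit/evaluation split is omitted for brevity):

```python
# Minimal sketch of the three bar quantities, assuming raw = relevant + irrelevant.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

def decoding_r2(signals, velocity):
    """Decoding R2: how well behavior is predicted from the signals (linear readout here)."""
    dec = Ridge().fit(signals, velocity)           # the paper uses KF/ANN; Ridge is a stand-in
    return r2_score(velocity, dec.predict(signals))

def neural_r2(raw, generated):
    """Reconstruction R2: neural similarity between generated relevant signals and raw signals."""
    return r2_score(raw, generated)

# relevant + irrelevant should reproduce the raw signals:
# assert np.allclose(raw, relevant + irrelevant)
# red bar    = decoding_r2(relevant, velocity)
# blue bar   = decoding_r2(irrelevant, velocity)
# purple bar = neural_r2(raw, relevant)
```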

Evaluation of separated signals.

a, The obstacle avoidance paradigm. b, The decoding R2 between the true velocity and the velocity predicted from raw signals (purple bars with slash lines) and from behaviorally-relevant signals obtained by d-VAE (red), PSID (pink), pi-VAE (green), TNDM (blue), and LFADS (light green) on dataset A. Error bars denote mean ± standard deviation (s.d.) across five cross-validation folds. Asterisks represent significance of the Wilcoxon rank-sum test with *P < 0.05, **P < 0.01. c, Same as b, but for behaviorally-irrelevant signals obtained by the five different methods. d, The neural similarity (R2) between raw signals and behaviorally-relevant signals extracted by d-VAE, PSID, pi-VAE, TNDM, and LFADS. Error bars represent mean ± s.d. across five cross-validation folds. Asterisks indicate significance of the Wilcoxon rank-sum test with **P < 0.01. e-h and i-l, Same as a-d, but for dataset B with the center-out paradigm (e) and dataset C with the self-paced reaching paradigm (i). m, The firing rates of raw signals and distilled signals obtained by d-VAE in five held-out trials under the same condition of dataset B.
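
The fold-wise statistics behind the error bars and asterisks could be computed along the lines of the following sketch (the placeholder R2 values and the scipy-based test are assumptions for illustration only):

```python
# Hedged sketch of the fold-wise comparison: five cross-validation R2 values per method,
# compared with a Wilcoxon rank-sum test.
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)
r2_raw      = rng.uniform(0.60, 0.70, size=5)   # placeholder: decoding R2 of raw signals per fold
r2_relevant = rng.uniform(0.70, 0.80, size=5)   # placeholder: decoding R2 of relevant signals per fold

stat, p = ranksums(r2_relevant, r2_raw)
print(f"mean ± s.d.: {r2_relevant.mean():.3f} ± {r2_relevant.std():.3f}, P = {p:.3g}")
```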

The effect of irrelevant signals on analyzing neural activity at the single-neuron level.

a, The angle difference (AD) of preferred direction (PD) between raw and distilled signals as a function of the R2 of raw signals on dataset A. Here R2 characterizes the extent to which neuronal activity is explained by the linear encoding model: neurons with smaller R2 have a lower capacity for linearly tuning (encoding) behaviors, while neurons with larger R2 have a higher capacity. Each black point represents a neuron (n = 90). The red curve is the fitting curve between R2 and AD. Five example larger R2 neurons’ PDs are shown in the inset plot, where the solid and dotted line arrows represent the PDs of relevant and raw signals, respectively. b, Comparison of the cosine tuning fit (R2) before and after distillation of single neurons (black points), where the x-axis and y-axis represent neurons’ R2 of raw and distilled signals, respectively. c, Comparison of neurons’ Fano factor (FF) averaged across conditions of raw (x-axis) and distilled (y-axis) signals, where FF measures the neuronal variability across trials in the same condition. d, Boxplots of the FF of raw (purple) and distilled (red) signals under different conditions for all neurons (12 conditions). Boxplots represent medians (lines), quartiles (boxes), and whiskers extending to ± 1.5 times the interquartile range. The broken lines represent the mean FF across all neurons. e-h, Same as a-d, but for dataset B (n = 159, 8 conditions). i, Example of three neurons’ raw firing activity decomposed into behaviorally-relevant and irrelevant parts using all trials under two conditions (2 of 8 directions) in held-out test sets of dataset B.
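
A minimal sketch of the single-neuron metrics used in this figure (cosine-tuning R2, PD, angle difference, and Fano factor); the function and variable names are illustrative assumptions:

```python
# Assumptions: per-neuron trial-averaged rates for each reach direction, and per-trial
# spike counts within one condition.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def cosine_tuning(rates, directions_rad):
    """Fit f(theta) = b0 + b1*cos(theta) + b2*sin(theta); return R2 and PD (radians)."""
    X = np.column_stack([np.cos(directions_rad), np.sin(directions_rad)])
    model = LinearRegression().fit(X, rates)
    r2 = r2_score(rates, model.predict(X))
    pd = np.arctan2(model.coef_[1], model.coef_[0])
    return r2, pd

def angle_difference(pd_raw, pd_relevant):
    """Absolute angle between two PDs, wrapped to [0, pi]."""
    return np.abs(np.angle(np.exp(1j * (pd_relevant - pd_raw))))

def fano_factor(counts_per_trial):
    """Fano factor within one condition: variance / mean of spike counts across trials."""
    return np.var(counts_per_trial) / np.mean(counts_per_trial)
```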

The effect of irrelevant signals on analyzing neural activity at the population level.

a,b, PCA is separately applied on relevant and irrelevant signals to get relevant PCs and irrelevant PCs. The thick lines represent the cumulative variance explained for the signals on which PCA has been performed, while the thin lines represent the variance explained by those PCs for the other signals. Red, blue, and gray colors indicate relevant signals, irrelevant signals, and random Gaussian noise N(0, I) (for chance level), where the mean vector is zero and the covariance matrix is the identity matrix. The horizontal lines represent the percentage of variance explained. The vertical lines indicate the number of dimensions accounting for 90% of the variance in behaviorally-relevant (left) and irrelevant (right) signals. For convenience, we define the principal component subspace describing the top 90% of the variance as the primary subspace and the subspace capturing the last 10% of the variance as the secondary subspace. The cumulative variance explained for behaviorally-relevant (a) and irrelevant (b) signals obtained by d-VAE on dataset A. c,d, PCA is applied on raw signals to get raw PCs. c, The bar plot shows the composition of each raw PC. The inset pie plot shows the overall proportion of raw signals, where red, blue, and purple colors indicate relevant signals, irrelevant signals, and the correlation between relevant and irrelevant signals. The PC marked with a red triangle indicates the last PC where the variance of relevant signals is greater than or equal to that of irrelevant signals. d, The cumulative variance explained by raw PCs for different signals, where the thick line represents the cumulative variance explained for raw signals (purple), while the thin lines represent the variance explained for relevant (red) and irrelevant (blue) signals. e-h, Same as a-d, but for dataset B.
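
The thick/thin cumulative-variance curves in a,b can be illustrated with the following sketch (the cross-projection implementation and variable names are assumptions):

```python
# Hedged sketch: PCs fitted on one signal set, cumulative variance they explain
# in that set (thick line) and in another set (thin line).
import numpy as np
from sklearn.decomposition import PCA

def cumulative_explained(signals_fit, signals_other):
    pca = PCA().fit(signals_fit)
    thick = np.cumsum(pca.explained_variance_ratio_)          # variance in the fitted signals
    other = signals_other - signals_other.mean(axis=0)
    proj_var = np.var(other @ pca.components_.T, axis=0)      # variance of the other signals
    thin = np.cumsum(proj_var) / np.var(other, axis=0).sum()  # ... captured by the same PCs
    return thick, thin

# e.g. relevant PCs applied to irrelevant signals:
# thick, thin = cumulative_explained(relevant, irrelevant)
```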

Smaller R2 neurons encode rich behavioral information in complex nonlinear ways.

a, The comparison of decoding performance between raw (purple) and distilled signals (red) on dataset A with different neuron groups, including smaller R2 neurons (R2 ≤ 0.03), larger R2 neurons (R2 > 0.03), and all neurons. Error bars indicate mean ± s.d. across five cross-validation folds. Asterisks denote significance of the Wilcoxon rank-sum test with *P < 0.05, **P < 0.01. b, The correlation matrix of all neurons of raw (left) and behaviorally-relevant (right) signals on dataset A. Neurons are ordered to highlight the correlation structure (details in Methods). c, The decoding performance of KF (left) and ANN (right) with neurons dropped out from larger to smaller R2 on dataset A. The vertical gray line indicates the number of dropped neurons at which raw and behaviorally-relevant signals have the greatest performance difference. d-f, Same as a-c, but for dataset B.
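
The neuron-dropping analysis in panel c could be implemented roughly as in the sketch below (a ridge decoder stands in for KF/ANN; the function and variable names are illustrative assumptions):

```python
# Drop neurons in order of decreasing encoding R2 and re-evaluate decoding from the rest.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

def neuron_dropping_curve(train_X, train_y, test_X, test_y, encoding_r2):
    order = np.argsort(encoding_r2)[::-1]          # larger encoding R2 first
    scores = []
    for n_dropped in range(len(order)):
        keep = order[n_dropped:]                   # remaining (smaller R2) neurons
        dec = Ridge().fit(train_X[:, keep], train_y)
        scores.append(r2_score(test_y, dec.predict(test_X[:, keep])))
    return np.array(scores)                        # decoding R2 vs. number of dropped neurons
```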

Signals composed of smaller variance PCs encode rich behavioral information in complex nonlinear ways.

a, The comparison of decoding performance between raw (purple) and distilled signals (red) composed of different raw PC groups, including smaller variance PCs (the proportion of irrelevant signals that make up the raw PCs is higher than that of relevant signals) and larger variance PCs (the proportion of irrelevant signals is lower than that of relevant ones), on dataset A. Error bars indicate mean ± s.d. across five cross-validation folds. Asterisks denote significance of the Wilcoxon rank-sum test with *P < 0.05, **P < 0.01. b, The cumulative decoding performance of signals composed of cumulative PCs ordered from smaller to larger variance, using KF (left) and ANN (right) on dataset A. The red patches indicate the decoding ability of the last 10% variance of relevant signals. c, The cumulative decoding performance of signals composed of cumulative PCs ordered from larger to smaller variance, using KF (left) and ANN (right) on dataset A. The red patches indicate the decoding gain of the last 10% variance of relevant signals superimposed on their top 90% variance signals. The inset shows a partially enlarged view for clarity. d-f, Same as a-c, but for dataset B.
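
A hedged sketch of the cumulative-PC decoding in panels b,c (a ridge decoder stands in for KF/ANN; the reconstruction-from-k-PCs scheme and names are assumptions, not the paper's code):

```python
# Decode from signals reconstructed with a cumulative set of PCs, ordered by variance.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

def cumulative_pc_decoding(train_X, train_y, test_X, test_y, ascending=True):
    pca = PCA().fit(train_X)
    comps = pca.components_                        # PCA orders PCs by decreasing variance
    order = comps[::-1] if ascending else comps    # smaller-to-larger variance if ascending
    scores = []
    for k in range(1, len(order) + 1):
        P = order[:k].T                            # project and reconstruct with the first k PCs
        Xtr = (train_X - pca.mean_) @ P @ P.T + pca.mean_
        Xte = (test_X - pca.mean_) @ P @ P.T + pca.mean_
        dec = Ridge().fit(Xtr, train_y)
        scores.append(r2_score(test_y, dec.predict(Xte)))
    return np.array(scores)                        # cumulative decoding R2 curve
```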

Semantic overview of distill-VAE (d-VAE).

On the left, we present a set of raw signal examples (depicted by purple lines). The input neural signals fed into d-VAE consist of single-time-step samples. Initially, the encoder compresses these input signals into latent variables. We constrain the latent variables to decode behaviors in order to preserve behavioral information. Subsequently, these latent variables are passed to the generator, which produces behaviorally-relevant signals (depicted by red lines). To maintain the underlying neuronal properties, we constrain these generated signals to closely resemble the raw signals. However, relying solely on the constraint that the generated signals resemble the raw signals makes it challenging to determine the extent of irrelevant signals present within the generated relevant signals. To tackle this hurdle, we feed the generated signals back into the encoder and impose constraints on the resulting latent variables to decode behaviors. This approach is rooted in the assumption that irrelevant signals function as noise relative to relevant signals; consequently, an excessive presence of irrelevant signals within the generated signals would degrade their decoding performance. In essence, there exists a trade-off between the decoding performance and the reconstruction performance of the generated signals. By striking a balance between these two constraints, we can effectively extract behaviorally-relevant signals.
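
A minimal, PyTorch-style sketch of the constraints described above (an illustration of the loss structure only, not the released d-VAE implementation; the module names and the weights alpha and beta are assumptions):

```python
import torch
import torch.nn.functional as F

def dvae_step(x_raw, behavior, encoder, generator, behavior_decoder, alpha, beta):
    z, kl = encoder(x_raw)                                  # latent variables + VAE KL term
    x_rel = generator(z)                                    # generated behaviorally-relevant signals

    loss_rec = F.mse_loss(x_rel, x_raw)                     # resemble raw signals (reconstruction)
    loss_dec = F.mse_loss(behavior_decoder(z), behavior)    # latents must decode behavior

    z2, _ = encoder(x_rel)                                  # feed generated signals back to encoder
    loss_dec2 = F.mse_loss(behavior_decoder(z2), behavior)  # their latents must also decode behavior

    # alpha balances reconstruction against decoding: too small and the relevant signals
    # merely copy raw signals; too large and reconstruction degrades while useful signals
    # are left in the residuals.
    return loss_rec + alpha * (loss_dec + loss_dec2) + beta * kl
```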

Evaluation of separated signals on the synthetic dataset.

a, The temporal neuronal activity of raw signals (purple line) of an example test trial, which is decomposed into relevant (b) and irrelevant (c) signals. b, Relevant signals (red lines) extracted by d-VAE under three distillation cases, where bold gray lines represent the ground truth relevant signals. The hyperparameter α is critical for extracting behaviorally-relevant signals, as it balances the trade-off between the reconstruction loss and the decoding loss. Results show that when α = 0.09, the relevant signals are too similar to the raw signals but not to the ground truth; when α = 0.9, the relevant signals closely resemble the ground truth; and when α = 9, the relevant signals are not similar to the ground truth. c, Same as b, but for irrelevant signals (blue lines). Notably, when α = 9, some useful signals are left in the irrelevant signals. d, The decoding R2 of distilled relevant signals in the three cases. Error bars indicate mean ± s.d. across five cross-validation folds. Results demonstrate that the decoding R2 increases as α increases. e, Same as d, but for irrelevant signals. Notably, when α = 9, the irrelevant signals contain substantial behavioral information. f, The neural similarity between relevant and raw signals. Results show that the neural R2 decreases as α increases. g, The neural R2 between relevant signals and the ground truth of relevant signals. Results show that d-VAE can use a proper trade-off to extract effective relevant signals that are similar to the ground truth. h, The neural R2 between irrelevant signals and the ground truth of irrelevant signals. Results show that, with a proper trade-off, d-VAE removes irrelevant signals that are similar to the ground truth. i, The decoding R2 between true velocity and predicted velocity of raw signals (purple bars with slash lines), the ground truth signals (gray), and behaviorally-relevant signals obtained by d-VAE (red), PSID (pink), pi-VAE (green), TNDM (blue), and LFADS (light green). Error bars denote mean ± standard deviation (s.d.) across five cross-validation folds. Asterisks represent significance of the Wilcoxon rank-sum test with *P < 0.05, **P < 0.01. j, Same as i, but for irrelevant signals. k, The neural R2 between generated relevant signals and raw signals. l, Same as k, but for the ground truth of relevant signals.

Decoding performance comparison with CEBRA.

a, The decoding R2 comparison between d-VAE and CEBRA on the synthetic dataset. The red bar represents the behaviorally-relevant signals extracted by d-VAE, and the light purple bar represents the behaviorally-relevant embeddings extracted by CEBRA. Error bars indicate mean ± s.d. across five cross-validation folds. Asterisks denote significance of the Wilcoxon rank-sum test with *P < 0.05, **P < 0.01. b-d, Same as a, but for datasets A, B, and C, respectively.

The effect of irrelevant signals on relevant signals at the single-neuron level.

a,b, Same as Fig. 3, but for dataset C. a, The angle difference (AD) of preferred direction (PD) between raw and distilled signals as a function of the R2 of raw signals. Each black point represents a neuron (n = 91). The red curve is the fitting curve between R2 and AD. Five example larger R2 neurons’ PDs are shown in the inset plot, where the solid line arrows represent the PDs of relevant signals and the dotted line arrows represent the PDs of raw signals. b, Comparison of the cosine tuning fit (R2) before and after distillation of single neurons (black points), where the x-axis and y-axis represent neurons’ R2 of raw and distilled signals, respectively.

The firing activity of example neurons.

a, Example of three neurons’ raw firing activity decomposed into behaviorally-relevant and irrelevant parts using all trials in held-out test sets for four conditions (4 of 8 directions) of the center-out reaching task. b, Example of three neurons’ raw firing activity decomposed into behaviorally-relevant and irrelevant parts using all trials in held-out test sets for four conditions (4 of 12 conditions) of the obstacle avoidance task.

The effect of irrelevant signals on analyzing neural activity at the population level.

a-d, Same as Fig. 4, but for dataset C. a,b, PCA is separately applied on relevant and irrelevant signals to get relevant PCs and irrelevant PCs. The thick lines represent the cumulative variance explained for the signals on which PCA has been performed, while the thin lines represent the variance explained by those PCs for the other signals. Red, blue, and gray colors indicate relevant signals, irrelevant signals, and random Gaussian noise (for chance level). The cumulative variance explained for behaviorally-relevant (a) and irrelevant (b) signals obtained by d-VAE. c,d, PCA is applied on raw signals to get raw PCs. c, The bar plot represents the composition of each raw PC. The inset pie plot shows the overall proportion of raw signals, where red, blue, and purple colors indicate relevant signals, irrelevant signals, and the correlation between relevant and irrelevant signals. The PC marked with a red triangle indicates the last PC where the variance of relevant signals is greater than or equal to that of irrelevant signals. d, The cumulative variance explained by raw PCs for different signals, where the thick line represents the cumulative variance explained for raw signals (purple), while the thin lines represent the variance explained for relevant (red) and irrelevant (blue) signals.

The effect of irrelevant signals obtained by pi-VAE on analyzing neural activity at the population level.

a-l, Same as Fig. 4 and Fig. S6, but for pi-VAE. a,b, PCA is separately applied on relevant and irrelevant signals to get relevant PCs and irrelevant PCs. The thick lines represent the cumulative variance explained for the signals on which PCA has been performed, while the thin lines represent the variance explained by those PCs for the other signals. Red, blue, and gray colors indicate relevant signals, irrelevant signals, and random Gaussian noise (for chance level). The cumulative variance explained for behaviorally-relevant (a) and irrelevant (b) signals on dataset A. c,d, PCA is applied on raw signals to get raw PCs. c, The bar plot represents the composition of each raw PC. The inset pie plot shows the overall proportion of raw signals, where red, blue, and purple colors indicate relevant signals, irrelevant signals, and the correlation between relevant and irrelevant signals. The PC marked with a red triangle indicates the last PC where the variance of relevant signals is greater than or equal to that of irrelevant signals. d, The cumulative variance explained by raw PCs for different signals, where the thick line represents the cumulative variance explained for raw signals (purple), while the thin lines represent the variance explained for relevant (red) and irrelevant (blue) signals. e-h and i-l, Same as a-d, but for datasets B and C, respectively.

The rotational dynamics of raw, relevant, and irrelevant signals.

Datasets A and B have twelve and eight conditions, respectively. We obtain the trial-averaged neural responses for each condition and then apply jPCA to raw, relevant, and irrelevant signals to get the top two jPCs, respectively. a, The rotational dynamics of raw neural signals. b, The rotational dynamics of relevant signals obtained by d-VAE. c, The rotational dynamics of irrelevant signals obtained by d-VAE. We can see that the rotational dynamics of behaviorally-relevant signals are similar to those of raw signals, whereas the rotational dynamics of behaviorally-irrelevant signals are irregular. d-f, Same as a-c, but for dataset B.
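
A simplified sketch of the jPCA idea used here (an approximation, not Churchland et al.'s exact algorithm; the PCA preprocessing, dimensionality, and function names are assumptions):

```python
# Fit linear dynamics to trial-averaged, PCA-reduced responses, keep the skew-symmetric
# (rotational) part, and project onto its strongest rotational plane.
import numpy as np
from sklearn.decomposition import PCA

def rotational_plane(trial_avg, n_pcs=6):
    """trial_avg: (conditions, time, neurons). Returns projections on the top jPC plane."""
    C, T, N = trial_avg.shape
    X = PCA(n_components=n_pcs).fit_transform(trial_avg.reshape(C * T, N)).reshape(C, T, n_pcs)
    states = X[:, :-1].reshape(-1, n_pcs)
    dX = np.diff(X, axis=1).reshape(-1, n_pcs)
    A, *_ = np.linalg.lstsq(states, dX, rcond=None)   # unconstrained dynamics, dX ≈ states @ A
    M = (A - A.T) / 2                                  # skew-symmetric (purely rotational) part
    v = np.linalg.eig(M)[1][:, np.argmax(np.abs(np.linalg.eig(M)[0].imag))]
    plane, _ = np.linalg.qr(np.column_stack([v.real, v.imag]))  # orthonormal 2D rotational plane
    return X @ plane                                   # (conditions, time, 2) trajectories
```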

The cumulative variance of raw and behaviorally-relevant signals.

a, PCA is applied separately on raw and distilled behaviorally-relevant signals to get raw PCs and relevant PCs. The cumulative variance of raw (purple) and behaviorally-relevant (red) signals on dataset A (n = 90). The two upper-left curves denote the variance accumulated from larger to smaller variance PCs, and the two lower-right curves denote the accumulation from smaller to larger variance PCs. The horizontal lines represent the 10% and 90% variance explained. The vertical lines indicate the number of dimensions accounting for the last 10% and top 90% of the variance of behaviorally-relevant (red) and raw (purple) signals. Here we call the subspace spanned by the PCs capturing the top 90% of the variance the primary subspace, and the subspace spanned by the PCs capturing the last 10% of the variance the secondary subspace. We can see that the dimensionality of the primary subspace of raw signals is significantly higher than that of relevant signals, indicating that irrelevant signals lead to an overestimation of the neural dimensionality associated with specific behaviors. b,c, Same as a, but for datasets B (n = 159) and C (n = 91).
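
Counting the primary-subspace dimensionality could look like the following sketch (names are illustrative):

```python
# Number of PCs needed to capture the top 90% (primary subspace) of a signal set's variance.
import numpy as np
from sklearn.decomposition import PCA

def primary_dim(signals, threshold=0.90):
    ratios = PCA().fit(signals).explained_variance_ratio_
    return int(np.searchsorted(np.cumsum(ratios), threshold) + 1)

# e.g. comparing primary_dim(raw) with primary_dim(relevant): a larger raw dimensionality
# suggests irrelevant signals inflate the estimated dimensionality.
```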

Neural responses usually considered useless encode rich behavioral information in complex nonlinear ways.

a-c, Same as Fig. 5, but for dataset C (n = 91). d-f, Same as Fig. 6, but for dataset C. a, The comparison of decoding performance between raw (purple) and distilled signals (red) with different neuron groups, including smaller R2 neurons (R2 ≤ 0.03), larger R2 neurons (R2 > 0.03), and all neurons. Error bars indicate mean ± s.d. across five cross-validation folds. Asterisks denote significance of the Wilcoxon rank-sum test with *P < 0.05, **P < 0.01. b, The correlation matrix of all neurons of raw and behaviorally-relevant signals. c, The decoding performance of KF (left) and ANN (right) with neurons dropped out from larger to smaller R2. The vertical gray lines indicate the number of dropped neurons at which raw and behaviorally-relevant signals have the greatest performance difference. d, The comparison of decoding performance between raw (purple) and distilled signals (red) composed of different raw PC groups, including smaller variance PCs (the proportion of irrelevant signals that make up the raw PCs is higher than that of relevant signals) and larger variance PCs (the proportion of irrelevant signals is lower than that of relevant ones). Error bars indicate mean ± s.d. across five cross-validation folds. Asterisks denote significance of the Wilcoxon rank-sum test with *P < 0.05, **P < 0.01. e, The cumulative decoding performance of signals composed of cumulative PCs ordered from smaller to larger variance, using KF (left) and ANN (right). The red patches indicate the decoding ability of the last 10% variance of relevant signals. f, Same as e, but with PCs ordered from larger to smaller variance. The red patches indicate the decoding gain of the last 10% variance of relevant signals superimposed on their top 90% variance signals.

Using synthetic data to demonstrate that conclusions are not a by-product of d-VAE.

a,b, These results demonstrate that d-VAE can utilize the larger R2 neurons to help the smaller R2 neurons restore their original appearance. a, The decoding R2 of the ground truth (gray), raw signals (purple), and distilled relevant signals (red) of smaller R2 neurons of the synthetic data. Error bars indicate mean ± s.d. (n = 5 folds). Asterisks denote significance of the Wilcoxon rank-sum test with **P < 0.01. We can see that the ground truth of smaller R2 neurons contains a certain amount of behavioral information, but this behavioral information cannot be decoded from raw signals because it is covered by noise; d-VAE can indeed utilize the larger R2 neurons to help the smaller R2 neurons restore their damaged information. b, The neural similarity of raw signals and relevant signals to the ground truth of smaller R2 neurons. We can see that d-VAE obtains effective relevant signals that are more similar to the ground truth than raw signals are. c, The decoding R2 of the ground truth (gray) and distilled relevant signals (red) of smaller R2 neurons of the synthetic data. These results demonstrate that d-VAE cannot make the linear decoder achieve performance similar to that of the nonlinear decoder. We can see that KF is significantly inferior to ANN on ground truth signals. The KF decoding performance of the ground truth signals is notably low, leaving substantial room for compensation by d-VAE. However, after processing with d-VAE, the KF decoding performance of distilled signals does not surpass its ground truth performance, and the disparity between KF and ANN remains substantial. These results demonstrate that d-VAE cannot make signals that originally require nonlinear decoding linearly decodable.

Signals composed of smaller variance PCs preferentially improve the decoding of lower-speed velocity.

a, The comparison of the absolute improvement ratio between lower-speed (red) and higher-speed (purple) velocity when superimposing secondary signals on primary signals with KF on dataset A. Error bars indicate mean ± s.d. across five cross-validation folds. Asterisks denote significance of the Wilcoxon rank-sum test with *P < 0.05, **P < 0.01. b,c, Same as a, but for datasets B and C. d, The comparison of the relative improvement ratio between lower-speed (red patch) and higher-speed (no patch) velocity when superimposing secondary signals on primary signals with KF on dataset B. The first-row plot shows the speed profiles of five example trials for the velocity decoded from primary signals (light blue line), from full signals (dark blue line; secondary signals superimposed on primary signals), and the true velocity (red line). The black horizontal line denotes the speed threshold. The second- and third-row plots are the same as the first-row plot, but for the X and Y velocity. The fourth-row plot shows the relative improvement ratio for each time point in the trials.
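
A hedged sketch of splitting the improvement by speed (the exact ratio definition, error metric, and threshold here are illustrative assumptions, not necessarily those used in the figure):

```python
# Compare decoding errors of primary-only vs. full (primary + secondary) signals separately
# for low- and high-speed time points of the true velocity.
import numpy as np

def improvement_by_speed(v_true, v_primary, v_full, speed_threshold):
    speed = np.linalg.norm(v_true, axis=1)
    low = speed < speed_threshold
    def abs_improvement(mask):
        err_primary = np.abs(v_true[mask] - v_primary[mask]).mean()
        err_full = np.abs(v_true[mask] - v_full[mask]).mean()
        return (err_primary - err_full) / err_primary   # fraction of error removed
    return abs_improvement(low), abs_improvement(~low)  # (lower-speed, higher-speed)
```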

Visualization of latent variables.

a, The velocity samples of one fold of test data. Different colors denote different directions of the eight-direction center-out task. b, The distribution plot of the top three principal components (PCs) of the latent variables. The points on the bottom plane represent the two-dimensional projections of the three-dimensional data. c, The distribution plot of the top three PCs of the learned prior latent variables. We can see that the distribution of the prior latent variables closely resembles that of the latent variables, illustrating the effectiveness of the KL divergence constraint.
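
Panels b,c could be produced along the lines of this sketch (standard matplotlib/sklearn calls only; the data and direction labels are assumed):

```python
# Project latent variables onto their top three PCs and scatter them in 3D,
# colored by reach direction.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_latent_pcs(latents, direction_labels):
    pcs = PCA(n_components=3).fit_transform(latents)
    ax = plt.figure().add_subplot(projection="3d")
    ax.scatter(pcs[:, 0], pcs[:, 1], pcs[:, 2], c=direction_labels, cmap="hsv", s=5)
    ax.set_xlabel("PC1"); ax.set_ylabel("PC2"); ax.set_zlabel("PC3")
    plt.show()
```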