Demixed principal component analysis of neural population data
 Cited 34
 Views 9,985
 Annotations
Abstract
Neurons in higher cortical areas, such as the prefrontal cortex, are often tuned to a variety of sensory and motor variables, and are therefore said to display mixed selectivity. This complexity of single neuron responses can obscure what information these areas represent and how it is represented. Here we demonstrate the advantages of a new dimensionality reduction technique, demixed principal component analysis (dPCA), that decomposes population activity into a few components. In addition to systematically capturing the majority of the variance of the data, dPCA also exposes the dependence of the neural representation on task parameters such as stimuli, decisions, or rewards. To illustrate our method we reanalyze population data from four datasets comprising different species, different cortical areas and different experimental tasks. In each case, dPCA provides a concise way of visualizing the data that summarizes the taskdependent features of the population response in a single figure.
https://doi.org/10.7554/eLife.10989.001eLife digest
Many neuroscience experiments today involve using electrodes to record from the brain of an animal, such as a mouse or a monkey, while the animal performs a task. The goal of such experiments is to understand how a particular brain region works. However, modern experimental techniques allow the activity of hundreds of neurons to be recorded simultaneously. Analysing such large amounts of data then becomes a challenge in itself.
This is particularly true for brain regions such as the prefrontal cortex that are involved in the cognitive processes that allow an animal to acquire knowledge. Individual neurons in the prefrontal cortex encode many different types of information relevant to a given task. Imagine, for example, that an animal has to select one of two objects to obtain a reward. The same group of prefrontal cortex neurons will encode the object presented to the animal, the animal’s decision and its confidence in that decision. This simultaneous representation of different elements of a task is called a ‘mixed’ representation, and is difficult to analyse.
Kobak, Brendel et al. have now developed a data analysis tool that can ‘demix’ neural activity. The tool breaks down the activity of a population of neurons into its individual components. Each of these relates to only a single aspect of the task and is thus easier to interpret. Information about stimuli, for example, is distinguished from information about the animal’s confidence levels.
Kobak, Brendel et al. used the demixing tool to reanalyse existing datasets recorded from several different animals, tasks and brain regions. In each case, the tool provided a complete, concise and transparent summary of the data. The next steps will be to apply the analysis tool to new datasets to see how well it performs in practice. At a technical level, the tool could also be extended in a number of different directions to enable it to deal with more complicated experimental designs in future.
https://doi.org/10.7554/eLife.10989.002Introduction
In many state of the art experiments, a subject, such as a rat or a monkey, performs a behavioral task while the activity of tens to hundreds of neurons in the animal’s brain is monitored using electrophysiological or imaging techniques. The common goal of these studies is to relate the external task parameters, such as stimuli, rewards, or the animal’s actions, to the internal neural activity, and to then draw conclusions about brain function. This approach has typically relied on the analysis of single neuron recordings. However, as soon as hundreds of neurons are taken into account, the complexity of the recorded data poses a fundamental challenge in itself. This problem has been particularly severe in higherorder areas such as the prefrontal cortex, where neural responses display a baffling heterogeneity, even if animals are carrying out rather simple tasks (Brody et al., 2003; Machens et al., 2010; Hernández et al., 2010; Mante et al., 2013; Rigotti et al., 2013).
Traditionally, this heterogeneity has often been neglected. In neurophysiological studies, it is common practice to preselect cells based on particular criteria, such as responsiveness to the same stimulus, and to then average the firing rates of the preselected cells. This practice eliminates much of the richness of singlecell activities, similar to imaging techniques with low spatial resolution, such as MEG, EEG, or fMRI. While population averages can identify some of the information that higherorder areas process, they ignore much of the fine structure of the single cell responses (Wohrer et al., 2013). Indeed, most neurons in higher cortical areas will typically encode several task parameters simultaneously, and therefore display what has been termed mixed selectivity (Rigotti et al., 2013; Pagan and Rust, 2014; Park et al., 2014; Raposo et al., 2014).
Instead of looking at single neurons and selecting from or averaging over a population of neurons, neural population recordings can be analyzed using dimensionality reduction methods (for a review, see Cunningham and Yu, 2014). In recent years, several such methods have been developed that are specifically targeted to electrophysiological data, working on the level of single spikes (Pfau et al., 2013), accommodating different time scales of latent variables (Yu et al., 2009), or accounting for the dynamics of the population response (Buesing et al., 2012a; 2012b; Churchland et al., 2012). However, these approaches reduce the dimensionality of the data without taking task parameters, i.e., sensory and motor variables controlled or monitored by the experimenter, into account. Consequently, mixed selectivity remains in the data even after the dimensionality reduction step.
The problem can be addressed by dimensionality reduction methods that are informed by the task parameters (Machens et al., 2010; Machens, 2010; Brendel et al., 2011; Mante et al., 2013; Raposo et al., 2014). We have previously introduced a dimensionality reduction technique, demixed principal component analysis (dPCA) (Brendel et al., 2011), that emphasizes two goals. It aims to find a decomposition of the data into latent components that (a) are easily interpretable with respect to the experimentally controlled and monitored task parameters; and (b) preserve the original data as much as possible, ensuring that no valuable information is thrown away. Here we present a radically modified version of this method, and illustrate that it works well on a wide variety of experimental data. The new version of the method has the same objectives as the older version (Brendel et al., 2011), but is more principled, more flexible, and has an analytical solution, meaning that it does not suffer from any numerical optimization problems. Furthermore, the new mathematical formulation highlights similarities to and differences from related wellknown methods such as principal component analysis (PCA) and linear discriminant analysis (LDA).
The dPCA code is available at http://github.com/machenslab/dPCA for Matlab and Python.
Results
Existing approaches
We illustrate the classical approaches to analyzing neural activity data from higherorder areas in Figure 1. To be specific, we consider recordings from the prefrontal cortex (PFC) of monkeys performing a somatosensory working memory task (Romo et al., 1999; Brody et al., 2003). In this task, monkeys were required to discriminate two vibratory stimuli presented to the fingertip. The stimuli F1 and F2 were separated by a 3 s delay, and the monkeys had to report which stimulus had a higher frequency by pressing one of the two available buttons (Figure 1a).
When we focus on the neural representation of the stimulus F1 and the decision, we have to take 12 experimental conditions into account (six possible values of F1 and two possible decisions). For each of these conditions, we can average each neuron’s spike trains over trials and then smooth the resulting time series in order to estimate the neuron’s timedependent firing rate (also known as peristimulus time histogram or PSTH). We find that the PSTHs of many neurons are tuned to the stimulus F1, the decision, or both (Figure 1b; socalled mixed selectivity), and different neurons generally show different tuning. Our goal is to characterize and summarize the tuning of all $N$ recorded neurons.
The most standard and widespread approach is to resort to a statistical test (e.g. a twoway analysis of variance or ANOVA), in order to check whether the firing rate of a neuron depends significantly on the frequency F1 or on the monkey’s decision. Such a test can be run for each neuron and each time point, in which case the population tuning over time is often summarized as the fraction of cells significantly tuned to stimulus or decision at each time point ($p<0.05$, Figure 1c). In addition to providing such a 'summary statistics', this approach is also used to directly visualize the population activity. For that purpose, one selects the subset of neurons significantly tuned to stimulus or decision (e.g. by focusing on a particular time point, Figure 1d) and then averages their PSTHs. The resulting 'population average' is shown in Figure 1e, where we also took the sign of the effect size into account. The population average is generally thought to demonstrate the 'most typical' firing pattern among the cells encoding the corresponding parameter. Importantly, this method yields one single population average or 'component' for each parameter. Each such component can be understood as a linear combination (or a linear readout) of the individual PSTHs, with all ${N}_{s}$ significant neurons for a parameter having the same weights $\pm 1/{N}_{s}$ and all others having weight zero.
In a related approach, the firing rates of each neuron at each time point are linearly regressed on stimulus and decision (Figure 1f) (Brody et al., 2003). Mante et al. (2013) suggested to use the regression coefficients of all $N$ neurons (Figure 1g) as weights to form linear combinations of PSTHs representing stimulus and decision tuning (Figure 1h). This approach, which the authors call 'targeted dimensionality reduction' (TDR), also yields one component per task parameter: in our example, we obtain one component for the stimulus and one for the decision (Figure 1h; see Materials and methods for details).
Both of these approaches are supervised, meaning that they are informed by the task parameters. At the same time, they do not seek to faithfully represent the whole dataset and are prone to losing some information about the neural activities. Indeed, the two components from Figure 1e explain only 23% of the total variance of the population firing rates and the two components from Figure 1h explain only 22% (see Materials and methods). Consequently, a naive observer would not be able to infer from the components what the original neural activities looked like.
While such supervised approaches can be extended in various ways to produce more components and capture more variance, a more direct way to avoid this loss of information is to resort to unsupervised methods such as principal component analysis (PCA). This method extracts a set of principal components (PCs) that are linear combinations of the original PSTHs, just as the population averages above. However, the weights to form these linear combinations are chosen so as to maximize the amount of explained variance (first six components explain 69% of variance, see Figure 1i–k). The principal components can be thought of as 'building blocks' of neural activity: PSTHs of actual neurons are given by linear combinations of PCs, with the first PCs being more informative than the later ones. However, since PCA is an unsupervised method, information about stimuli and decisions is not taken into account, and the resulting components can retain mixed selectivity and therefore fail to highlight neural tuning to the task parameters.
The most striking observation when comparing supervised and unsupervised approaches is how different the results look. Indeed, PCA paints a much more complex picture of the population activity, dominated by strong temporal dynamics, with several stimulus and decisionrelated components. At the same time, none of the methods can fully demix the stimulus and decision information: even the supervised methods show decisionrelated activity in the stimulus components and stimulusrelated activity in the decision components (Figure 1e,h).
Demixed principal component analysis (dPCA)
To address these problems, we developed a modified version of PCA that not only compresses the data, but also demixes the dependencies of the population activity on the task parameters. We will first explain that these two goals generally constitute a tradeoff, then suggest a solution to this tradeoff for a single task parameter, and then generalize to multiple task parameters.
The tradeoff between demixing and compression is illustrated in Figure 2, where we compare linear discriminant analysis (LDA, Figure 2a,b), PCA (Figure 2c,d), and dPCA (Figure 2e–h). We will first focus on a single task parameter and seek to reduce the activity of $N=2$ neurons responding to three different stimuli. For each stimulus, the joint activity of the two neurons traces out a trajectory in the space of firing rates as time progresses (Figure 2b). The aim of 'demixing' in this simplified case is to find a linear mapping (decoder) of the neural activity that separates the different stimuli (Figure 2a) and ignores the timedependency. We can use LDA in order to determine a projection of the data that optimally separates the three stimuli. However, LDA will generally not preserve the 'geometry' of the original neural activity: firing patterns for stimuli 1 and 2 are close to each other and far away from stimulus 3, whereas in the LDA projection all three stimuli are equally spaced (Figure 2b). More generally, decoding is always prone to distorting the data and therefore tends to impede a proper reconstruction of the original data from the reduced description.
The aim of compression is to find a linear mapping (decoder) that reduces the dimensionality and preserves the original data as much as possible (Figure 2c,d). Using PCA, we determine a projection of the data that minimizes the reconstruction error between the projections and the original points. In contrast to LDA, PCA seeks to preserve the geometry of the neural activity, and thereby yields the most faithful reduction of the data (Figure 2d). However, the PCA projection does not properly separate the stimuli and mixes the timedependency with the stimulusdependency.
The wildly different projection axes for LDA (Figure 2b) and PCA (Figure 2d) seem to suggest that the goals of demixing and compression are essentially incompatible in this example. However, we can achieve both goals by assuming that the reconstruction of the original data works along a separate encoder axis (Figure 2f,h). Given this additional flexibility, we first choose a decoder axis that reconciles the decoding and compression objectives. Once projected onto this axis, all three stimuli are separated from each other, as in LDA, yet their geometrical arrangement is approximately preserved, as in PCA (Figure 2f). In turn, when reconstructed along the encoder axis, the projected data still approximates the original data (Figure 2h).
To define these ideas more formally, we assume that we simultaneously recorded the spike trains of $N$ neurons. Let $\mathbf{\mathbf{X}}$ be our data matrix with $N$ rows, in which the $i$th row contains the instantaneous firing rate (i.e. binned or smoothed spike train) of the $i$th neuron for all task conditions and all trials (assumed to be centered, i.e., with row means subtracted). Classical PCA compresses the data with a decoder matrix $\mathbf{\mathbf{D}}$. The resulting principal components can then be linearly decompressed through an encoder matrix ${\mathbf{\mathbf{D}}}^{\top}$, approximately reconstructing the original data (Hastie et al., 2009). The optimal decoder matrix is found by minimizing the squared error between the original data, $\mathbf{\mathbf{X}}$, and the reconstructed data, ${\mathbf{\mathbf{D}}}^{\top}\mathbf{\mathbf{D}\mathbf{X}}$, given by
In the toy example of Figure 2, the data matrix $\mathbf{\mathbf{X}}$ is of size $2\times 15$, and the decoder matrix $\mathbf{\mathbf{D}}$ is of size $1\times 2$. Crucially, the information about task parameters does not enter the loss function and hence PCA neither decodes nor demixes these parameters.
In our method, which we call demixed PCA (dPCA), we make two changes to this classical formulation. First, we require that the compression and decompression steps reconstruct not the neural activity directly, but the neural activity averaged over trials and over some of the task parameters. In the toy example, the reconstruction target is the matrix of stimulus averages, ${\mathbf{\mathbf{X}}}_{s}$, which has the same size as $\mathbf{\mathbf{X}}$, but in which every data point is replaced by the average neural activity for the corresponding stimulus, as shown in Figure 2h. Second, we gain additional flexibility in this quest by compressing the data with a linear mapping $\mathbf{\mathbf{D}}$, yet decompressing it with another linear mapping $\mathbf{\mathbf{F}}$ (Figure 2e). The respective matrices are chosen by minimizing the loss function
Accordingly, for each stimulus, the neural activities are projected close to the average stimulus, which allows us both to decode the stimulus value and to preserve the relative distances of the neural activities.
In order to see how this approach preserves all aspects of the original data, and not just some averages, we note that the data in our toy example included both stimulus and time. The matrix ${\mathbf{\mathbf{X}}}_{s}$ can be understood as part of a linear decomposition of the full data $\mathbf{\mathbf{X}}$ into parameterspecific averages: a timevarying part, ${\mathbf{\mathbf{X}}}_{t}$, that is obtained by averaging $\mathbf{\mathbf{X}}$ over stimuli, and a stimulusvarying part, ${\mathbf{\mathbf{X}}}_{s}$, that is obtained by averaging $\mathbf{\mathbf{X}}$ over time. Any remaining parts of the activity are captured in a noise term (Figure 2g). In turn, we can find separate decoder and encoder axes for each of these averages. Once more than $N=2$ neurons are considered, these decoder and encoder axes constitute a dimensionality reduction step that reduces the data into a few components, each of which properly decodes one of the task parameters. In turn, the original neural activity can be reconstructed through linear combinations of these components, just as in PCA.
The key ideas of this toy example can be extended to any number of task parameters. In this manuscript, all datasets will have three parameters: time, stimulus, and decision, and we will decompose the neural activities into five parts: conditionindependent, stimulusdependent, decisiondependent, dependent on the stimulusdecision interaction, and noise (see Figure 8 in the Materials and methods):
Individual terms are again given by a series of averages. This decomposition is fully analogous to the variance (covariance) decomposition done in ANOVA (MANOVA). The only important difference is that the standard (M)ANOVA decomposition for three parameters A, B, and C, would normally have ${2}^{3}=8$ terms corresponding to the main effects of A, B, C, pairwise interactions AB, BC, and AC, threeway interaction ABC, and the noise. Here we join some of these terms together, as we are not interested in demixing those (see Materials and methods).
Once this decomposition is performed, dPCA finds separate decoder and encoder matrices for each term $\varphi $ by minimizing the loss function
Each term within the sum can be minimized separately by using reducedrank regression, the solution of which can be obtained analytically in terms of singular value decompositions (see Materials and methods). Each row $\mathbf{\mathbf{d}}$ of each ${\mathbf{\mathbf{D}}}_{\varphi}$ yields one demixed principal component $\mathbf{\mathbf{d}\mathbf{X}}$ and, similar to PCA, we order the components by the amount of explained variance. Note that the decoder/encoder axes corresponding to two different task parameters ${\varphi}_{1}$ and ${\varphi}_{2}$ are found independently from each other and may end up being nonorthogonal (in contrast to PCA where principal axes are all orthogonal). In a nutshell, the loss function ensures that each set of decoder/encoder axes reconstructs the individual, parameterspecific terms, ${\mathbf{\mathbf{X}}}_{\varphi}$, thereby yielding proper demixing, and the data decomposition ensures that the combination of all decoder/encoder pairs allows to reconstruct the original data, $\mathbf{\mathbf{X}}$.
There are a few other technical subtleties (see Materials and methods for details). (1) We formulated dPCA for simultaneously recorded neural activities. However, all datasets analyzed in this manuscript have been recorded sequentially across many sessions, and so to apply dPCA we have to use 'pseudotrials'. (2) Similar to any other decoding method, dPCA is prone to overfitting and so we introduce a regularization term and perform crossvalidation to choose the regularization parameter. (3) The data and variance decompositions from above are exact only if the dataset is balanced, i.e., if the same number of trials were recorded in each condition. If this is not the case, one can use a rebalancing procedure. (4) A previous version of dPCA (Brendel et al., 2011) used the same variance decomposition but a different and less flexible loss function. The differences are layed out in the Materials and methods section.
Somatosensory working memory task in monkey PFC
We first applied dPCA to the dataset presented above (Romo et al., 1999; Brody et al., 2003), encompassing 832 neurons from two animals. As is typical for PFC, each neuron has a distinct response pattern and many neurons show mixed selectivity (some examples are shown in Figure 1b). Several previous studies have sought to make sense of these heterogeneous response patterns by separately analyzing different task periods, such as the stimulation and delay periods (Romo et al., 1999; Brody et al., 2003; Machens et al., 2010; Barak et al., 2010), the decision period (Jun et al., 2010), or both (Hernández et al., 2010). With dPCA, however, we can summarize the main features of the neural activity across the whole trial in a single figure (Figure 3).
Just as in PCA, we can think of the demixed principal components (Figure 3b) as the 'building blocks' of the observed neural activity, in that the activity of each single neuron is a linear combination (weighted average) of these components. These building blocks come in four distinct categories: some are conditionindependent (Figure 3b, top row); some depend only on stimulus F1 (second row); some depend only on decision (third row); and some depend on stimulus and decision together (bottom row). The components can be easily seen to demix the parameter dependencies, which is exactly what dPCA aimed for. Indeed, the components shown in Figure 3b are projections of the PSTHs of all neurons onto the most prominent decoding axes; each projection (each subplot) shows 12 lines corresponding to 12 conditions. As intended, conditionindependent components have all 12 lines closely overlapping, stimulus components have two lines for each stimulus closely overlapping, etc.
The overall variance explained by the dPCA components (Figure 3c, red line) is very close to the overall variance explained by the PCA components (black line). Accordingly, we barely lost any variance by imposing the demixing constraint, and the population activity is accurately represented by the obtained dPCA components.
The dPCA analysis captures the major findings previously obtained with these data: the persistence of the F1 tuning during the delay period (component #5; Romo et al., 1999; Machens et al., 2005), the temporal dynamics of shortterm memory (components ##5, 10, 13; Brody et al., 2003; Machens et al., 2010; Barak et al., 2010), the 'ramping' or 'climbing' activities in the delay period (components ##1–3; Brody et al., 2003; Machens et al., 2010); and pronounced decisionrelated activities (component #6, Jun et al., 2010). We note that the decision components resemble derivatives of each other; these higherorder derivatives likely arise due to slight variations in the timing of responses across neurons (see Appendix B for more details).
The first stimulus component (#5) looks similar to the stimulus components that we obtained with standard regressionbased methods (Figure 1e,h) but now we have further components as well. Together they show how stimulus representation evolves in time. In particular, plotting the first two stimulus components against each other (see Video 1) illustrates how stimulus representation rotates in the neural space during the delay period so that the encoding subspaces during F1 and F2 periods are not the same (but far from orthogonal either).
As explained above, the demixed principal axes are not constrained to be orthogonal. The angles between the encoding axes are shown in Figure 3e, upperright triangle; we discuss them later, together with other datasets. Pairwise correlations between components are all close to zero (Figure 3e, lowerleft triangle), as should be expected since the components are considered to represent independent signals.
To assess whether the condition tuning of individual dPCA components was statistically significant, we used each component as a linear decoder to classify conditions. Specifically, stimulus components were used to classify stimuli, decision components to classify decisions, and interaction components to classify all 12 conditions. We used crossvalidation to measure timedependent classification accuracy and a shuffling procedure to assess whether it was significantly above chance (see Materials and methods). Time periods of significant tuning are marked in Figure 3b with horizontal black lines.
Visuospatial working memory task in monkey PFC
We next applied dPCA to recordings from the PFC of monkeys performing a visuospatial working memory task (Qi et al., 2011, 2012; Meyer et al., 2011). In this task, monkeys first fixated a small white square at the centre of a screen, after which a square S1 appeared for 0.5 s in one of eight locations around the centre (Figure 4a). After a 1.5 s delay, a second square S2 appeared for 0.5 s in either the same ('match') or the opposite ('nonmatch') location. Following another 1.5 s delay, a green and a blue choice target appeared in locations orthogonal to the earlier presented stimuli. Monkeys had to saccade to the green target to report a match condition, and to the blue one to report a nonmatch.
We analyzed the activity of 956 neurons recorded in the lateral PFC of two monkeys performing this task. Proceeding exactly as before, we obtained the average timedependent firing rate of each neuron for each condition. Following the original studies, we eliminated the trivial rotational symmetry of the task by collapsing the eight possible stimulus locations into five locations that are defined with respect to the preferred location of each neuron (0°, 45°, 90°, 135°, or 180° away from the preferred location, see Materials and methods). As a consequence, we obtained ten conditions: five possible stimulus locations, each paired with two possible decisions of the monkey.
The dPCA results are shown in Figure 4. As before, stimulus and decision are well separated at the population level despite being intermingled at the singleneuron level; at the same time dPCA captures almost the same amount of variance as PCA. One notable difference from before is the presence of strong interaction components in Figure 4b. However, these interaction components are in fact stimulus components in disguise. In match trials, S2 and S1 appear at the same location, and in nonmatch trials at opposite locations. Information about S2 is therefore given by a nonlinear function of stimulus S1 and the trial type (i.e. decision), which is here captured by the interaction components.
Here again, our analysis summarizes previous findings obtained with this dataset. For instance, the first and the second decision components show tuning to the match/nonmatch decision during the S2 period and in the subsequent delay period. Using these components as fixed linear decoders, we achieve singletrial classification accuracy of match vs. nonmatch of 75% for $t\phantom{\rule{thinmathspace}{0ex}}>\phantom{\rule{thinmathspace}{0ex}}2$ (crossvalidated, see Materials and methods, Figure 12), which is approximately equal to the stateoftheart classification performance reported previously (Meyers et al., 2012).
Constantinidis et al. have also recorded population activity in PFC before starting the training (both S1 and S2 stimuli were presented exactly as above, but there were no cues displayed and no decision required). When analyzing this pretraining population activity with dPCA, the first stimulus and the first interaction components come out close to the ones shown in Figure 4, but there are no decision and no 'memory' components present (data not shown), in line with previous findings (Meyers et al., 2012). These taskspecific components appear in the population activity only after extensive training.
Olfactory discrimination task in rat OFC
Next, we applied dPCA to recordings from the OFC of rats performing an odor discrimination task (Feierstein et al., 2006). This behavioral task differs in two crucial aspects from the previously considered tasks: it requires no active storage of a stimulus, and it is selfpaced. To start a trial, rats entered an odor port, which triggered delivery of an odor with a random delay of 0.2–0.5 s. Each odor was uniquely associated with one of the two available water ports, located to the left and to the right from the odor port (Figure 5a). Rats could sample the odor for as long as they wanted (up to 1 s), and then had to move to one of the water ports. If they chose the correct water port, reward was delivered following an anticipation period of random length (0.2–0.5 s).
We analyzed the activity of 437 neurons recorded in five rats in four conditions: two stimuli (left and right) each paired with two decisions (left and right). Two of these conditions correspond to correct (rewarded) trials, and two correspond to error (unrewarded) trials. Since the task was selfpaced, each trial had a different length; in order to align events across trials, we restretched (timewarped) the firing rates in each trial (see Materials and methods). Alignment methods without time warping led to similar results (data not shown).
Just as neurons from monkey PFC, neurons in rat OFC exhibit diverse firing patterns and mixed selectivity (Feierstein et al., 2006). Nonetheless, dPCA was able to demix the population activity (Figure 5). In this dataset, interaction components separate rewarded and unrewarded conditions (thick and thin lines in Figure 5b, bottom row), i.e., correspond to neurons tuned either to reward, or to the absence of reward.
The overall pattern of neural tuning across task epochs agrees with the findings of the original study (Feierstein et al., 2006). Interaction components are by far the most prominent among all the conditiondependent components, corresponding to the observation that many neurons are tuned to the presence/absence of reward. Decision components come next, with the caveat that decision information may also reflect the rat’s movement direction and/or position, as was pointed out previously (Feierstein et al., 2006). Stimulus components are less prominent, but nevertheless show clear stimulus tuning, demonstrating that even in error trials there is reliable information about stimulus identity in the population activity.
Curiously, the first interaction component (#4) already shows significant tuning to reward in the anticipation period. In other words, neurons tuned to presence/absence of reward start firing before the reward delivery (or, on error trials, before the reward could have been delivered). We return to this observation in the next section.
Olfactory categorization task in rat OFC
Kepecs et al. (2008) extended the experiment of Feierstein et al. (2006) by using odor mixtures instead of pure odors, thereby varying the difficulty of each trial (Uchida and Mainen, 2003). In each trial, rats experienced mixtures of two fixed odors with different proportions (Figure 6a). Left choices were rewarded if the proportion of the 'left' odor was above 50%, and right choices otherwise. Furthermore, the waiting time until reward delivery (anticipation period) was increased to 0.3–2 s.
We analyzed the activity of 214 OFC neurons from three rats recorded in 8 conditions, corresponding to four odor mixtures, each paired with two decisions (left and right). During the presentation of pure odors (100% right and 100% left) rats made essentially no mistakes, and so we excluded these data from the dPCA computations (which require that all parameter combinations are present, see Discussion). Nevertheless, we displayed these additional two conditions in Figure 6.
The dPCA components shown in Figure 6b are similar to those presented in Figure 5b. Here again, some of the interaction components (especially the second one, #5) show strong tuning already during the anticipation period, i.e. before the actual reward delivery. The inset in Figure 6b shows the mean value of the component #5 during the anticipation period, separating correct (green) and incorrect (red) trials for each stimulus. The characteristic Ushape for the error trials and the inverted Ushape for the correct trials agrees well with the predicted value of the rat’s uncertainty in each condition (Kepecs et al., 2008). Accordingly, this component can be interpreted as corresponding to the rat’s uncertainty or confidence about its own choice, confirming the results of Kepecs et al. (2008). In summary, both the main features of this dataset, as well as some of the subtleties, are picked up and reproduced by dPCA.
Universal features of the PFC population activity
One of the key advantages of applying dPCA to these four datasets is that we can now compare them far more easily than was previously possible. This comparison allows us to highlight several general features of the population activity in prefrontal areas.
First, most of the variance of the neural activity is always captured by the conditionindependent components that together amount to 65–90% of the signal variance (see pie charts in Figures 3–6d; see Materials and methods for definition of 'signal variance'). These components capture the temporal modulations of the neural activity throughout the trial, irrespective of the task condition. Their striking dominance in the data may come as a surprise, as such conditionindependent components are usually not analyzed or shown (cf. Figure 1e,h), even though conditionindependent firing has been described even in sensory areas (Sornborger et al., 2005). These components are likely explained in part by an overall firing rate increase during certain task periods (e.g. during stimulus presentation). More speculatively, they could also be influenced by residual sensory or motor variables that vary rhythmically with the task, but are not controlled or monitored (Renart and Machens, 2014). The attentional or motivational state of animals, for instance, often correlates with breathing (Huijbers et al., 2014), pupil dilation (Eldar et al., 2013), body movements (Gouvêa et al., 2014), etc.
Second, even though dPCA, unlike PCA, does not enforce orthogonality between encoding axes corresponding to different task parameters, most of them turned out to be close to orthogonal to each other (Figures 3–6e, upper triangle), as has been observed before (Brendel et al., 2011; Rishel et al., 2013; Raposo et al., 2014). Nevertheless, many pairs were significantly nonorthogonal, meaning that neurons expressing one of the components tended to also express the other one. Throughout the four datasets, we identified 277 pairs of axes (among the first 15 axes) corresponding to different parameters. Of these, 38, i.e. 14%, were significantly nonorthogonal with $p<0.001$ (8 out of 53 if we do not take time axes into account).
Third, all dPCA components in each of the datasets are distributed across the whole neural population (as opposed to being exhibited only by a subset of cells). For each component and each neuron, the corresponding encoder weight shows how much this particular component is exhibited by this particular neuron. For each component, the distribution of weights is strongly unimodal, centred at zero (Figure 7a), and rather symmetric (although it is skewed to one side for some components). In other words, there are no distinct subpopulations of neurons predominantly expressing a particular component; rather, each individual neuron can be visualized as a random linear combination of these components. We confirmed this observation by applying a recently developed clustering algorithm (Rodriguez and Laio, 2014) to the population of neurons in the 15dimensional space of dPC weights. In all cases, the algorithm found only one cluster (Figure 7b). An alternative clustering analysis with Gaussian mixture models yielded similar results (data not shown). This absence of any detectable clusters of neurons has been noted before (Machens et al., 2010) and was recently observed in other datasets as well (Raposo et al., 2014).
Discussion
Mixed selectivity of neurons in higher cortical areas has been increasingly recognized as a problem for the analysis of neurophysiological recordings, with many different approaches suggested to deal with it (Brody et al., 2003; Machens et al., 2010; Machens, 2010; Brendel et al., 2011; Rigotti et al., 2013; Pagan and Rust, 2014; Park et al., 2014; Raposo et al., 2014; Cunningham and Yu, 2014). The main strength and the main novelty of the method suggested here (dPCA) is that it offers a unified and principled way of analyzing such data.
Demixed PCA combines the strengths of existing supervised and unsupervised approaches to neural population data analysis (Table 1, see also the first section of the Results). Supervised methods can characterize population tuning to various parameters of interest but often do not faithfully represent the whole dataset. Unsupervised methods can capture the overall variance but are not informed by task parameters. Our method yields components that capture almost as much variance as PCA does, but are demixed.
We view both properties as equally important. On one hand, demixing can greatly simplify visualization and interpretation of neural population data. Indeed, in all cases presented here, all the major aspects of the population activity that had previously been reported are directly visible on the dPCA summary figure. On the other hand, faithful representation of the population activity (i.e. 'capturing variance') avoids that a particular interpretation distorts characteristic features of the data. The latter feature is particularly important for the development of theoretical models, which otherwise may inherit an interpretation bias without being aware of it.
Apart from being a useful tool for analyzing any particular dataset, dPCA highlights common features of neural activity when applied to several datasets, allowing to adopt a comparative approach to study population activity.
Relationship to other methods, including our earlier work
The method presented here is conceptually based on our previous work (Machens, 2010; Machens et al., 2010; Brendel et al., 2011), but is technically very different. The original approach from Machens et al. (2010) only works for two parameters of interest, such as time and stimulus. Machens (2010) suggested a partial generalization to multiple parameters and Brendel et al. (2011) introduced the full covariance decomposition and developed a probabilistic model. However, all of them imposed orthogonality on the decoder/encoder axes (and as a result did not distinguish them), a constraint that cannot be easily relaxed. While we have previously argued that orthogonality is a desirable feature of the decomposition, we now believe that it is better not to impose it upfront. First, by looking across many datasets, we have learnt that encoding subspaces can sometimes be highly nonorthogonal (Figures 3–6e) and hence not demixable under orthogonality constraints. Second, by not imposing orthogonality, we can easier identify components that are truly orthogonal. Third, removing the orthogonality constraint allowed us to obtain a simple analytical solution in terms of singular value decompositions (see Materials and methods) and hence to avoid local minima, convergence issues, and any additional optimizationrelated hyperparameters.
To demonstrate these advantages, we ran the algorithm of Brendel et al. (2011), dPCA2011, on all our datasets. The resulting components were similar to the components presented here, with the amount of variance captured by the first 15 components being very close; but the achieved demixing was worse. For each component we defined a demixing index (see Materials and methods) that is equal to 1 if the component is perfectly demixed. For all datasets, these indices were significantly higher with our current dPCA2015 method than with dPCA2011. Moreover, dPCA2011 failed to find some weak components at all. For comparison, see Figure 14 in the Materials and methods.
Another method, called 'targeted dimensionality reduction' (TDR) has recently been suggested for neural data analysis and is similar in spirit to dPCA in that it looks for demixing linear projections (Mante et al., 2013). As mentioned above, the original application of this method yields only one component per task parameter and ignores the conditionindependent components. While TDR can be extended in various ways to yield more components, no principled way of doing it has been suggested so far. Comparison of dPCA with TDR on our datasets shows that dPCA demixes the taskparameter dependencies better than TDR (see Figure 14 in the Materials and methods).
For an indepth discussion of the relationship between dPCA and LDA/MANOVA, we refer the reader to the Methods. Briefly, LDA is a oneway technique, meaning that only one parameter (class id) is associated with each data point. Therefore, LDA cannot directly be applied to the demixing problem. While LDA could be generalized to deal with several parameters in a systematic way, such a generalization has not been used for dimensionality reduction of neural data and does not have an established name in the statistical literature (we call it factorial LDA). We believe that for the purposes of dimensionality reduction, dPCA is a superior approach since it combines a reasonably high class separation with low reconstruction error, whereas LDA only optimizes class separation without taking the (potential) reconstruction error into account (see Figure 2). MANOVA, on the other hand, is a statistical test closely related to LDA that deals with multiple parameters. However, it deals with isolating the contribution of each parameter from residual noise rather than from the other parameters, and is therefore not suited for demixing.
Limitations and future work
While we believe that dPCA is an easytouse method of visualizing complex data sets with multiple task parameters, several limitations should be kept in mind. First, dPCA as presented here works only with discrete parameters, and all possible parameter combinations must be present in the data. This limitation is the downside of the large flexibility of the method: apart from the demixing constraint, we do not impose any other constraints on the latent variables and their estimation remains essentially nonparametric. In order to be able to treat continuous parameters or missing data (missing parameter combinations), we would need to further constrain the estimation of these latent variables, using e.g. a parametric model. One simple possibility is to directly use a parametric model for the activity of the single neurons, such as the linear model used in Mante et al. (2013), in order to fill in any missing data points, and then run dPCA subsequently.
Second, the number of neurons needs to be sufficiently high in order to obtain reliable estimates of the demixed components. In our datasets, we found that at least $\sim $100 neurons were needed to achieve satisfactory demixing. The number is likely to be higher if more than three task parameters are to be demixed, as the number of interaction terms grows exponentially with the number of parameters. This tradeoff between model complexity and demixing feasibility should be kept in mind when deciding how many parameters to put into the dPCA procedure. In cases when there are many task parameters of interest, dPCA is likely to be less useful than the more standard parametric singleunit approaches (such as linear regression). As a trivial example, imagine that only $N=1$ neuron has been recorded; it might have strong and significant tuning to various parameters of interest, but there is no way to demix (or decode) these parameters from the recorded 'population.'
Third, even with a large number of neurons, a dataset may be nondemixable, in which case dPCA would fail. For instance, if the highvariance directions of the stimulus and the decision parts of the neural activities fully overlap, then there is no linear decoder that can demix the two parameters.
Finally, dPCA components corresponding to the same parameter (e.g. successive stimulus components) are here chosen to be orthogonal, similarly to PCA. This can make successive components difficult to interpret (e.g. the second and the third stimulus components in Figure 3). To make them more interpretable, the orthogonality constraint could be replaced with some other constraints, such as e.g. requiring each component to have activity 'localized' in time. This problem may be addressed in future work.
Materials and methods
We will first explain the dPCA algorithm in the most wellbehaved case of simultaneously recorded and fully balanced data. A dataset with categorical predictors is called balanced when there is the same number of data points for each combination of predictors; in our case this means that there is the same number of trials for each combination of task parameters. This scenario is unlikely in most practical applications where the experimenter often does not have full control over some of the task parameters (such as e.g. animal’s decisions). Our suggestion for unbalanced datasets is to use what amounts to a 'rebalancing' procedure as explained below. Finally, we will deal with the case of sequentially recorded neurons (all datasets analyzed in this manuscript fall into this category).
Mathematical notation
In each of the datasets analyzed in this manuscript, trials can be labeled with two parameters: 'stimulus' and 'decision'. Note that a 'reward' label is not needed, because its value can be deduced from the other two due to the deterministic reward protocols in all tasks. In this situation, for each stimulus $s$ (out of $S$) and decision $d$ (out of $Q$), we have a collection of $K$ trials with $N$ neurons recorded in each trial. For each trial $k$ (out of $K$) and neuron $n$ (out of $N$) we have a recorded spike train. We denote the filtered (or binned) spike train by $x(t)$, and assume that it is sampled at $T$ time points $t$. To explicitly denote all task parameters, we will write either $x(t,s,d,k)$ or ${x}_{tsdk}$ for the filtered spike train of one neuron and ${\mathbf{\mathbf{x}}}_{tsdk}$ for the vector of filtered spike trains of all $N$ neurons. The latter notation is more compact and also highlights the tensorial character of the data.
These data can be thought of as $KSQ$ timedependent neural trajectories ($K$ trials for each of the $SQ$ conditions) in the $N$dimensional space ${\mathbb{R}}^{N}$ (Figure 2b). The number of distinct data points in this $N$dimensional space is $KSQT$. We collect the full data with all single trials in a matrix $\mathbf{\mathbf{X}}$ of size $N\times KSQT$, i.e. $N$ rows and $KSQT$ columns. Averaging all $K$ trials for each neuron, stimulus, and decision, yields mean firing rates (PSTHs) that can be collected in a smaller matrix $\stackrel{~}{\mathbf{\mathbf{X}}}$ of size $N\times SQT$.
Marginalization procedure
Consider one single neuron first. We can decompose its filtered spike trains, ${x}_{tsdk}$, into a set of averages (which we call marginalizations) over various combinations of parameters. We will denote the average over a set of parameters $\{a,b,\mathrm{\dots}\}$ by angular brackets ${\u27e8\cdot \u27e9}_{ab\mathrm{\dots}}$. Let us define the following marginalized averages:
Here $\overline{x}$ is simply the overall mean firing rate of our neuron, ${\overline{x}}_{t}$ is the average timevarying firing rate once the overall mean has been subtracted, etc. The righthand side shows the same averaging procedure in the more explicit form using ANOVAstyle notation, in which averages of $x$ over everything apart from the explicitly mentioned parameters, e.g., the stimulus $s$, are denoted by terms of the form ${\overline{x}}_{\cdot s\cdot \cdot}$. One can directly see that the original neural activities are given by the sum of all marginalizations:
This decomposition is identical to the one used in factorial ANOVA (Rutherford, 2001; Christensen, 2011) where task parameters are called factors. The ANOVA literature uses a slightly different notation with task parameters ($t,\phantom{\rule{thinmathspace}{0ex}}s,\phantom{\rule{thinmathspace}{0ex}}d,\phantom{\rule{thinmathspace}{0ex}}k$) replaced by indices ($i,\phantom{\rule{thinmathspace}{0ex}}j,\phantom{\rule{thinmathspace}{0ex}}k,\phantom{\rule{thinmathspace}{0ex}}l$) and with Greek letters designating individual terms:
We will use our notation, though, to keep the connection with the task parameters more explicit.
For the purposes of demixing neural signals in the context of our datasets, we combine some of these terms together. Indeed, demixing a timeindependent pure stimulus term ${\overline{x}}_{s}$ from a stimulustime interaction term ${\overline{x}}_{ts}$ makes little sense because we expect all neural components to change with time. Hence, we group the terms as follows (without changing the notation):
Here the first term on the righthand side is the mean firing rate, the last term is the trialtotrial noise, and we call the other terms conditionindependent term, stimulus term, decision term, and stimulusdecision interaction term. This decomposition is illustrated in Figure 8 for several exemplary neurons (we only show the decomposition of the PSTH part, leaving out the noise term).
We apply this marginalization procedure to every neuron, splitting the whole data matrix $\mathbf{\mathbf{X}}$ into parts. Assuming from now on that the data matrix is centered (i.e. $\overline{x}=0$ for all neurons), we can write the decomposition in the matrix form
Here $t$, $ts$, $td$, and $tsd$ are labels and not indices, and all terms are understood to be matrices of the same $N\times KSQT$ size, so e.g. ${\mathbf{\mathbf{X}}}_{t}$ is not an $N\times T$ sized matrix, but the full size $N\times KSQT$ matrix with $N\times T$ unique values replicated $KSQ$ times. Crucially, the marginalization procedure ensures that all terms are uncorrelated and that the $N\times N$ covariance matrix $\mathbf{\mathbf{C}}={\mathbf{\mathbf{X}\mathbf{X}}}^{\top}/(KSQT)$ is linearly decomposed into the sum of covariance matrices from each marginalization (see Appendix A for the proof):
Here all covariance matrices are defined with the same denominator, i.e. ${\mathbf{\mathbf{C}}}_{\varphi}={\mathbf{\mathbf{X}}}_{\varphi}{\mathbf{\mathbf{X}}}_{\varphi}^{\top}/(KSQT)$.
Core dPCA: loss function and algorithm
Given a decomposition $\mathbf{\mathbf{X}}={\sum}_{\varphi}{\mathbf{\mathbf{X}}}_{\varphi}+{\mathbf{\mathbf{X}}}_{\mathrm{noise}}$, the loss function of dPCA is given by
with
where each ${\mathbf{\mathbf{F}}}_{\varphi}$ is an encoder matrix with ${q}_{\varphi}$ columns and each ${\mathbf{\mathbf{D}}}_{\varphi}$ is a decoder matrix with ${q}_{\varphi}$ rows. Here and below, matrix norm signifies Frobenius norm, i.e. ${\parallel \mathbf{\mathbf{X}}\parallel}^{2}={\sum}_{i}{\sum}_{j}{X}_{ij}^{2}$. In the remaining discussion, it will often be sufficient to focus on the individual loss functions ${L}_{\varphi}$, in which case we will drop the indices $\varphi $ on the decoder and encoder matrices for notational convenience, and simply write $\mathbf{\mathbf{F}}$ and $\mathbf{\mathbf{D}}$.
Without any additional constraints, the decoder and encoder are only defined up to their product $\mathbf{\mathbf{F}\mathbf{D}}$ of rank $q$. To make the decomposition unique, we will assume that $\mathbf{\mathbf{F}}$ has orthonormal columns and that components are ordered such that their variance (row variance of $\mathbf{\mathbf{D}\mathbf{X}}$) is decreasing. The reason for this choice will become clear below.
This loss function penalizes the difference between the marginalized data ${\mathbf{\mathbf{X}}}_{\varphi}$ and the reconstructed full data $\mathbf{\mathbf{X}}$, i.e., the full data projected with the decoders $\mathbf{\mathbf{D}}$ onto a lowdimensional latent space and then reconstructed with the encoders $\mathbf{\mathbf{F}}$ (see Video 2). The loss function thereby favours variance in marginalization $\varphi $ and punishes variance coming from all other marginalizations and from trialtotrial noise. Given that the marginalized averages are uncorrelated with each other, we can make this observation clear by writing,
Here the first term corresponds to the nonexplained variance in marginalization $\varphi $ and the second term corresponds to the variance coming from all other marginalizations and from trialtotrial noise. The dPCA objective is to minimize both.
We note that the loss function ${L}_{\varphi}$ is of the general form ${\parallel {\mathbf{\mathbf{X}}}_{\varphi}\mathbf{\mathbf{A}\mathbf{X}}\parallel}^{2}$, with $\mathbf{\mathbf{A}}=\mathbf{\mathbf{F}\mathbf{D}}$. For an arbitrary $N\times N$ matrix $\mathbf{\mathbf{A}}$, minimization of the loss function amounts to a classical regression problem with the wellknown ordinary least squares (OLS) solution, ${\mathbf{\mathbf{A}}}_{\mathrm{OLS}}={\mathbf{\mathbf{X}}}_{\varphi}\mathbf{\mathbf{X}}(\mathbf{\mathbf{X}\mathbf{X}})_{\top}_{1}^{\top}$. In our case, $\mathbf{\mathbf{A}}=\mathbf{\mathbf{F}\mathbf{D}}$ is an $N\times N$ matrix of rank $q$, which we will make explicit by writing ${\mathbf{\mathbf{A}}}_{q}$. The dPCA loss function therefore amounts to a linear regression problem with an additional rank constraint on the matrix of regression coefficients. This problem is known as reducedrank regression (RRR) (Izenman, 1975; Reinsel and Velu, 1998; Izenman, 2008) and can be solved via the singular value decomposition.
To see this, we write ${\mathbf{X}}_{\varphi}{\mathbf{A}}_{q}\mathbf{X}=({\mathbf{X}}_{\varphi}{\mathbf{A}}_{\mathrm{O}\mathrm{L}\mathrm{S}}\mathbf{X})+({\mathbf{A}}_{\mathrm{O}\mathrm{L}\mathrm{S}}\mathbf{X}{\mathbf{A}}_{q}\mathbf{X})$. The first term, ${\mathbf{\mathbf{X}}}_{\varphi}{\mathbf{\mathbf{A}}}_{\mathrm{OLS}}\mathbf{\mathbf{X}}$, consists of the regression residuals that cannot be accounted for by any linear transformation of $\mathbf{\mathbf{X}}$. It is straightforward to verify that these regression residuals, ${\mathbf{\mathbf{X}}}_{\varphi}{\mathbf{\mathbf{A}}}_{\mathrm{OLS}}\mathbf{\mathbf{X}}$, are orthogonal to $\mathbf{\mathbf{X}}$ (Hastie et al., 2009, Section 3.2) and hence also orthogonal to $({\mathbf{\mathbf{A}}}_{\mathrm{OLS}}{\mathbf{\mathbf{A}}}_{q})\mathbf{\mathbf{X}}$. This orthogonality allows us to split the loss function into two terms,
where the first term captures the (unavoidable) error of the least squares fit while the second term describes the additional loss suffered through the rank constraint. Since the first term does not depend on ${\mathbf{\mathbf{A}}}_{q}$, the problem reduces to minimizing the second term.
To minimize the second term, we note that the best rank$q$ approximation to ${\mathbf{\mathbf{A}}}_{\mathrm{OLS}}\mathbf{\mathbf{X}}$ is given by its first $q$ principal components (EckartYoungMirsky theorem). Accordingly, if we write ${\mathbf{\mathbf{U}}}_{q}$ for the matrix of the $q$ leading principal directions (left singular vectors) ${\mathbf{\mathbf{u}}}_{i}$ of ${\mathbf{\mathbf{A}}}_{\mathrm{OLS}}\mathbf{\mathbf{X}}$, then the best approximation is given by ${\mathbf{\mathbf{U}}}_{q}{\mathbf{\mathbf{U}}}_{q}^{\top}{\mathbf{\mathbf{A}}}_{\mathrm{OLS}}\mathbf{\mathbf{X}}$ and hence ${\mathbf{\mathbf{A}}}_{q}={\mathbf{\mathbf{U}}}_{q}{\mathbf{\mathbf{U}}}_{q}^{\top}{\mathbf{\mathbf{A}}}_{\mathrm{OLS}}$.
To summarize, the reducedrank regression problem posed above can be solved in a threestep procedure:
Compute the OLS solution ${\mathbf{\mathbf{A}}}_{\mathrm{OLS}}={\mathbf{\mathbf{X}}}_{\varphi}{\mathbf{\mathbf{X}}}^{\top}{({\mathbf{\mathbf{X}\mathbf{X}}}^{\top})}^{1}$.
Perform PCA of ${\mathbf{\mathbf{A}}}_{\mathrm{OLS}}\mathbf{\mathbf{X}}$ and take the $q$ leading principal components to obtain the best lowrank approximation: ${\mathbf{\mathbf{A}}}_{q}={\mathbf{\mathbf{U}}}_{q}{\mathbf{\mathbf{U}}}_{q}^{\top}{\mathbf{\mathbf{A}}}_{\mathrm{OLS}}$ where ${\mathbf{\mathbf{U}}}_{q}$ is the $N\times q$ matrix of the $q$ leading principal directions (left singular vectors) of ${\mathbf{\mathbf{A}}}_{\mathrm{OLS}}\mathbf{\mathbf{X}}$.
Factorize the matrix ${\mathbf{\mathbf{A}}}_{q}$ into decoder and encoder matrices, ${\mathbf{\mathbf{A}}}_{q}=\mathbf{\mathbf{F}\mathbf{D}}$, by choosing $\mathbf{\mathbf{F}}={\mathbf{\mathbf{U}}}_{q}$ and $\mathbf{\mathbf{D}}={\mathbf{\mathbf{U}}}_{q}^{\top}{\mathbf{\mathbf{A}}}_{\mathrm{OLS}}$.
Conveniently, the extracted decoder/encoder pairs do not depend on how many pairs are extracted: the $i$th pair is given by $\mathbf{\mathbf{f}}={\mathbf{\mathbf{u}}}_{i}$ and $\mathbf{\mathbf{d}}={\mathbf{\mathbf{u}}}_{i}^{\top}{\mathbf{\mathbf{A}}}_{\mathrm{OLS}}$, independent of $q$. Indeed, this feature motivated the above choice that $\mathbf{\mathbf{F}}$ should have orthonormal columns.
Regularization
A standard way to avoid overfitting in regression problems is to add a quadratic penalty to the cost function, which is often called ridge regression (RR). This approach can be used in reducedrank regression as well. Specifically, we can add a ridge penalty term to the loss function ${L}_{\varphi}$:
The RR solution modifies the OLS solution from above to
In turn, the reducedrank solution can be obtained as described above: $\mathbf{\mathbf{F}}={\mathbf{\mathbf{U}}}_{q}$ and $\mathbf{\mathbf{D}}={\mathbf{\mathbf{U}}}_{q}^{\top}{\mathbf{\mathbf{A}}}_{\mathrm{RR}}$ where ${\mathbf{\mathbf{U}}}_{q}$ are the first $q$ principal directions of ${\mathbf{\mathbf{A}}}_{\mathrm{RR}}\mathbf{\mathbf{X}}$.
We found it convenient to define $\mu ={(\lambda \parallel \mathbf{\mathbf{X}}\parallel )}^{2}$, since this makes the values of $\lambda $ comparable across datasets. As explained below, we used crossvalidation to select the optimal value of $\lambda $ in each dataset.
Unbalanced data
The data and variance decomposition carried out by the marginalization procedure can break down when the dataset is unbalanced, i.e., when the number of data points (trials) differs between conditions. We illustrate this problem with a twodimensional toy example in Figure 9. We assume two task parameters (factors), each of which can take only two possible values. The overall mean as well as the interaction term are taken to be zero, so that $\mathbf{x}}_{ijk}={\mathbf{a}}_{i}+{\mathbf{b}}_{j}+{\mathbf{e}}_{ijk$. Since the number of trials, $K={K}_{ij}$, depends on the condition, the trial index runs through the values $k=1\dots {K}_{ij}$. As shown in Figure 9a, all three terms on the righthand side exhibit zero correlation between ${x}_{1}$ and ${x}_{2}$. A balanced dataset with the same number of data points in each of the four possible conditions (Figure 9b) also has zero correlation. However, an unbalanced dataset, as shown in Figure 9c, exhibits strong positive correlation ($\rho =0.8$). Accordingly, the covariance matrix of the full data can no longer be split into marginalized covariances. To avoid this and other related problems, we can perform a 'rebalancing' procedure by reformulating dPCA in terms of PSTHs and noise covariance.
In the balanced case, the dPCA loss function ${L}_{\varphi}$ can be rewritten as the sum of two terms with one term depending on the PSTHs and another term depending on the trialtotrial variations,
where we used the fact that ${\mathbf{\mathbf{X}}}_{\varphi}$ and $\mathbf{\mathbf{X}}{\mathbf{\mathbf{X}}}_{\mathrm{noise}}$ are orthogonal to ${\mathbf{\mathbf{X}}}_{\mathrm{noise}}$ (see Appendix A). We now define ${\mathbf{\mathbf{X}}}_{\mathrm{PSTH}}=\mathbf{\mathbf{X}}{\mathbf{\mathbf{X}}}_{\mathrm{noise}}$ which is simply a matrix of the same size as $\mathbf{\mathbf{X}}$ with the activity of each trial replaced by the corresponding PSTH. In addition, we observe that the squared norm of any centered data matrix $\mathbf{\mathbf{Y}}$ with $n$ data points can be written in terms of its covariance matrix ${\mathbf{\mathbf{C}}}_{Y}={\mathbf{\mathbf{Y}\mathbf{Y}}}^{\top}/n$, namely $\parallel \mathbf{Y}{\parallel}^{2}=\mathrm{t}\mathrm{r}[\mathbf{Y}{\mathbf{Y}}^{\mathrm{\top}}]=n\phantom{\rule{thinmathspace}{0ex}}\mathrm{t}\mathrm{r}[{\mathbf{C}}_{Y}]=n\phantom{\rule{thinmathspace}{0ex}}\mathrm{t}\mathrm{r}[{\mathbf{C}}_{Y}^{1/2}{\mathbf{C}}_{Y}^{1/2}]=n\parallel {\mathbf{C}}_{Y}^{1/2}{\parallel}^{2}$, and so
The first term consists of $K$ replicated copies: ${\mathbf{\mathbf{X}}}_{\mathrm{PSTH}}$ contains $K$ replicated copies of $\stackrel{~}{\mathbf{\mathbf{X}}}$ (which we defined above as the matrix of PSTHs) and ${\mathbf{\mathbf{X}}}_{\varphi}$ contains $K$ replicated copies of ${\stackrel{~}{\mathbf{\mathbf{X}}}}_{\varphi}$ (which we take to be a marginalization of $\stackrel{~}{\mathbf{\mathbf{X}}}$, with $\stackrel{~}{\mathbf{\mathbf{X}}}={\sum}_{\varphi}{\stackrel{~}{\mathbf{\mathbf{X}}}}_{\varphi}$). We can eliminate the replications and drop the factor $K$ to obtain
In the unbalanced case, we can directly use this last formulation where all occurrences of $\mathbf{\mathbf{X}}$ have been replaced by $\stackrel{~}{\mathbf{\mathbf{X}}}$. This is especially useful for neural data, where some combinations of task parameters may occur more often than others. The 'rebalanced' dPCA loss function treats all parameter combinations as equally important, independent of their occurrence frequency. It stands to reason to 'rebalance' the noise covariance matrix as well by defining it as follows:
where ${\mathbf{\mathbf{C}}}_{\mathrm{noise}}(s,d,t)$ is the covariance matrix for the $(s,d,t)$ parameter combination. This formulation, again, treats noise covariance matrices from different parameter combinations as equally important, independent of how many data points there are for each parameter combination.
Putting everything together and including the regularization term as well, we arrive at the following form of the dPCA loss function:
This loss function can be minimized as described in the previous section. Specifically, the full rank solution with $\mathbf{\mathbf{A}}=\mathbf{\mathbf{F}\mathbf{D}}$ becomes
The reducedrank solution can then be obtained by setting $\mathbf{\mathbf{F}}={\mathbf{\mathbf{U}}}_{q}$ and $\mathbf{\mathbf{D}}={\mathbf{\mathbf{U}}}_{q}^{\top}\mathbf{\mathbf{A}}$, where ${\mathbf{\mathbf{U}}}_{q}$ are the first $q$ principal directions of ${\mathbf{\mathbf{A}}}_{\mathrm{RR}}\stackrel{~}{\mathbf{\mathbf{X}}}$.
Missing data
Even when using the rebalanced formulation of the loss function, we still need data from all possible parameter combinations. In neurophysiological experiments, however, one may run into situations where not all combinations of stimuli could be presented to an animal before it decided to abort the task, or where an animal never carried out a particular decision, etc. This problem is particularly severe if individual task parameters can take many values. What should one do in these cases? The key problem here is that dPCA as formulated above makes no assumptions about how the firing rates of individual neurons depend on the task parameters. (Nor is there an explicit assumption about how the demixed components depend on the task parameters.) If some task conditions have not been recorded, then the only way out is to add more assumptions, or, more formally, to replace the nonparametric estimates of individual neural firing rates (or demixed components) by parametric estimates. We could for instance fit a simple linear model to the firing rate of each neuron at each time step (Mante et al., 2013; Brody et al., 2003),
and then use this model to 'fill in' the missing data. More sophisticated ways of dealing with missing data could be envisaged as well and may provide a venue for future research.
Sequentially recorded data
For sequentially recorded datasets, the matrix $\mathbf{\mathbf{X}}$ cannot be meaningfully constructed. However, we can still work with the PSTH matrix $\stackrel{~}{\mathbf{\mathbf{X}}}$ that can be decomposed into marginalizations: $\stackrel{~}{\mathbf{X}}={\sum}_{\varphi}{\stackrel{~}{\mathbf{X}}}_{\varphi}$. Consequently, we can use the same formulation of the loss function as in the simultaneously recorded unbalanced case (see above). The only difference is that the noise covariance matrix is not available (noise correlations cannot be estimated when neurons are recorded in different sessions). In this manuscript we took as ${\mathbf{\mathbf{C}}}_{\mathrm{noise}}$ the diagonal matrix with individual noise variances of each neuron on the diagonal. We used the rebalanced version ${\stackrel{~}{\mathbf{\mathbf{C}}}}_{\mathrm{noise}}$ (average noise covariance matrix across all conditions), but found that the difference between rebalanced and nonrebalanced noise covariance matrices was always minor and did not noticeably influence the dPCA solutions.
Variance calculations
As all datasets analyzed in this manuscript were sequentially recorded, we always reported fractions of the PSTH variance (as opposed to the total PSTH+noise variance) explained by our components, i.e. fractions of variance explained in $\stackrel{~}{\mathbf{\mathbf{X}}}$. We defined the fraction of explained variance in a standard way:
This formula can be used to compute the fraction of variance explained by each dPCA component (by plugging in its encoder $\mathbf{\mathbf{f}}$ and decoder $\mathbf{\mathbf{d}}$); these are the numbers reported on Figures 3–6b,d and used to order the components. The same formula can be used to compute the cumulative fraction of variance explained by the first $q$ components (by stacking their encoders and decoders as columns and rows of $\mathbf{\mathbf{F}}$ and $\mathbf{\mathbf{D}}$ respectively); these are the numbers reported on Figures 3–6c. Note that the cumulative explained variance is close to the sum of individually explained variances but not exactly equal to it since the dPCA components are not completely uncorrelated. The same formula holds for standard PCA using $\mathbf{\mathbf{F}}={\mathbf{\mathbf{D}}}^{\top}={\mathbf{\mathbf{U}}}_{\mathrm{pca}}$, i.e., the matrix of stacked together principal directions (Figures 3–6c).
Using the decomposition $\stackrel{~}{\mathbf{X}}={\sum}_{\varphi}{\stackrel{~}{\mathbf{X}}}_{\varphi}$, we can split the fraction of explained variance into additive contributions from different marginalizations:
We used this decomposition to produce the bar plots in Figures 3–6d, showing how the explained variance of each single dPCA component is split between marginalizations.
Following the approach of Machens et al. (2010), we note that our PSTH estimates $\stackrel{~}{\mathbf{\mathbf{X}}}$ must differ from the 'true' underlying PSTHs due to the finite amount of recorded trials. Hence, some fraction of the total variance of $\stackrel{~}{\mathbf{\mathbf{X}}}$ is coming from this residual noise. We can estimate this fraction as follows. Our estimate of the noise variance of the $n$th neuron is given by ${\stackrel{~}{C}}_{nn}$, the $n$th diagonal element of ${\stackrel{~}{\mathbf{\mathbf{C}}}}_{\mathrm{noise}}$. There are on average ${\stackrel{~}{K}}_{n}=\frac{1}{SQ}\sum {K}_{nsd}$ trials being averaged to compute the PSTHs for this neuron. So a reasonable estimate of the residual noise variance of the $n$th neuron is ${\stackrel{~}{C}}_{nn}/{\stackrel{~}{K}}_{n}$. Accordingly, we define the total residual noise sum of squares as
In turn, the fraction of total signal variance is computed as $1\mathrm{\Theta}/{\parallel \stackrel{~}{\mathbf{\mathbf{X}}}\parallel}^{2}$ which is the dashed line shown in Figures 3–6c. Note that each component likewise has contributions from both signal and noise variance, and hence the fraction of total signal variance does not constitute an upper bound on the number of components.
The residual noise variance is not split uniformly across marginalizations: the fraction falling into marginalization $\varphi $ is proportional to the respective number of degrees of freedom, ${K}_{\varphi}$. This can be explicitly computed; for a centered dataset with $S$ stimuli, $Q$ decisions, and $T$ time points the total number of degrees of freedom (per neuron) is $SQT1$ and is split into $T1$ for time, $STT$ for stimulus, $QTT$ for decision, and $SQTSTQT+T$ for the stimulusdecision interaction (compare with the formulas in the Marginalization Procedure section). Accordingly, we computed the residual noise sum of squares falling into marginalization $\varphi $ as
The pie charts in Figures 3–6d show the amount of variance in each marginalization, with estimated contributions of the residual noise variance subtracted: $\left({\parallel {\stackrel{~}{\mathbf{\mathbf{X}}}}_{\varphi}\parallel}^{2}{\mathrm{\Theta}}_{\varphi}\right)/\left({\parallel \stackrel{~}{\mathbf{\mathbf{X}}}\parallel}^{2}\mathrm{\Theta}\right)$. To display the percentage values on the pie charts, percentages were rounded using the 'largest remainder method', so that the sum of the rounded values remained 100%.
Demixing indices
We defined the demixing index of each component as $\mathrm{m}\mathrm{a}\mathrm{x}}_{\varphi}\{\parallel \mathbf{d}{\stackrel{\mathbf{~}}{\mathbf{X}}}_{\varphi}{\parallel}^{2}\}/\parallel \mathbf{d}\stackrel{\mathbf{~}}{\mathbf{X}}{\parallel}^{2$. This index can range from 1/4 to 1 (since there are four marginalizations) and the closer it is to 1, the better demixed the component is. As an example, for the somatosensory working memory dataset, the average demixing index over the first 15 PCA components is 0.76$\pm $0.16 (mean$\pm $SD), and over the first 15 dPCA components is 0.98$\pm $0.02, which means that dPCA achieves much better demixing ($p=0.0002$, MannWhitneyWilcoxon ranksum test). For the first 15 components of dPCA2011 (Brendel et al., 2011) it was 0.95$\pm $0.03, significantly less than for the current dPCA ($p=0.0008$). This difference may seem small, but is clearly visible in the projections by the naked eye. For comparison, the average demixing index of individual neurons in this dataset is 0.55$\pm $0.18. In other datasets these numbers are similar, and the same differences were significant in all cases.
Angles between dPCs
In Figures 3–6e, stars mark the pairs of components whose encoding axes ${\mathbf{\mathbf{f}}}_{1}$ and ${\mathbf{\mathbf{f}}}_{2}$ are significantly and robustly nonorthogonal. These were identified as follows: In Euclidean space of $N$ dimensions, two random unit vectors (from a uniform distribution on the unit sphere) have dot product (cosine of the angle between them) distributed with mean zero and standard deviation ${N}^{1/2}$. For large $N$ the distribution is approximately Gaussian. To avoid the problems inherent to multiple comparisons, we chose a conservative significance level of $p<0.001$, which means that two axes are significantly nonorthogonal if ${\mathbf{f}}_{1}\cdot {\mathbf{f}}_{2}\phantom{\rule{thinmathspace}{0ex}}>\phantom{\rule{thinmathspace}{0ex}}3.3/{N}^{1/2}$.
Coordinates of ${\mathbf{\mathbf{f}}}_{1}$ quantify how much this component contributes to the activity of each neuron. Hence, if cells exhibiting one component also tend to exhibit another, the dot product between the axes ${\mathbf{f}}_{1}\cdot {\mathbf{f}}_{2}\phantom{\rule{thinmathspace}{0ex}}>\phantom{\rule{thinmathspace}{0ex}}0$ is positive (note that ${\mathbf{\mathbf{f}}}_{1}\cdot {\mathbf{\mathbf{f}}}_{2}$ is approximately equal to the correlation between the coordinates of ${\mathbf{\mathbf{f}}}_{1}$ and ${\mathbf{\mathbf{f}}}_{2}$). Sometimes, however, the dot product has large absolute value only due to several outlying cells. To ease interpretation, we marked with stars only those pairs of axes for which the Kendall (robust) correlation was significant at $p<0.001$ level (in addition to the above criterion on ${\mathbf{\mathbf{f}}}_{1}\cdot {\mathbf{\mathbf{f}}}_{2}$).
Experimental data
Brief descriptions of experimental paradigms are provided in the Results section and readers are referred to the original publications for all further details. Here we describe the selection of animals, sessions, and trials for the present manuscript. In all experiments neural recordings were obtained in multiple sessions, so most of the neurons were not recorded simultaneously. All four datasets used in this manuscript have been made available at http://crcns.org (Romo et al., 2016; Constantinidis et al., 2016; Feierstein et al., 2016; Uchida et al., 2016).
Somatosensory working memory task in monkeys (Romo et al., 1999; Brody et al., 2003). We used data from two monkeys (code names RR14 and RR15) that were trained with the same set of vibratory frequencies, and we selected only the sessions where all six frequencies {10, 14, 18, 26, 30, 34} Hz were used for the first stimulation (other sessions were excluded). Monkeys made few mistakes (overall error rate was 6%), and here we analyzed only correct trials. Monkey RR15 had an additional 3 s delay after the end of the second stimulation before it was cued to provide the response. Using the data from monkey RR13 (that experienced a different frequency set) led to very similar dPCA components (data not shown).
Visuospatial working memory task in monkeys (Qi et al., 2011; Meyer et al., 2011; Qi et al., 2012). We used the data from two monkeys (code names AD and EL) that were trained with the same spatial task. Monkeys made few mistakes (overall error rate was 8%), and here we analysed only correct trials. The first visual stimulus was presented at 9 possible spatial locations arranged in a 3$\times $3 grid (Figure 4a); here we excluded all the trials where the first stimulus was presented in the centre position.
Olfactory discrimination task in rats (Feierstein et al., 2006). We used the data from all five rats (code names N1, P9, P5, T5, and W1). Some rats were trained with two distinct odors, some with four, some with six, and one rat experienced mixtures of two fixed odors in varying proportions. In all cases each odor was uniquely associated with one of the two available water ports (left/right). Following the original analysis (Feierstein et al., 2006), we grouped all odors associated with the left/right reward together as a 'left/right odor'. For most rats, caproic acid and 1hexanol (shown in Figures 5–6a) were used as the left/right odor. We excluded from the analysis all trials that were aborted by rats before reward delivery (or before waiting 0.2 s at the reward port for the error trials).
Olfactory categorization task in rats (Kepecs et al., 2008). We used the data from all three rats (code names N1, N48, and N49). Note that recordings from one of the rats (N1) were included in both this and previous datasets; when we excluded it from either of the datasets, the results stayed qualitatively the same (data not shown). We excluded from the analysis all trials that were aborted by rats before reward delivery (or before waiting 0.3 s at the reward port for the error trials).
Selection of neurons
For our analysis, we only selected neurons which had been recorded in each possible condition (combination of parameters), which avoids the missing data problems explained above. Additionally, we required that in each condition there were at least ${K}_{\mathrm{m}\mathrm{i}\mathrm{n}}\phantom{\rule{thinmathspace}{0ex}}>\phantom{\rule{thinmathspace}{0ex}}1$ trials, to reduce the standard error of the mean when averaging over trials, and also for crossvalidation purposes. The cutoff was set to ${K}_{\mathrm{min}}=5$ for both working memory datasets, and to ${K}_{\mathrm{min}}=2$ for both olfactory datasets (due to less neurons available).
We have further excluded very few neurons with mean firing rates over 50 Hz, as such neurons can bias the variancebased analysis. Firing rates above 50 Hz were atypical in all datasets (number of excluded neurons for each dataset: 5 / 2 / 1 / 0). This exclusion had a minor positive effect on the components. We did not apply any variancestabilizing transformations, but if the squareroot transformation was applied, the results stayed qualitatively the same (data not shown).
No other preselection of neurons was used. This procedure left 832 neurons (230 / 602 for individual animals, order as above) in the somatosensory working memory dataset, 956 neurons (182 / 774) in the visuospatial working memory dataset, 437 neurons in the olfactory discrimination dataset (166 / 30 / 9 / 106 / 126), and 214 neurons in the olfactory categorization dataset (67 / 38 / 109).
Preprocessing of the neural data
The spike trains were filtered with a Gaussian kernel ($\sigma =50$ ms) and sampled at 100 Hz to produce singletrial instantaneous firing rates.
In the visuospatial working memory dataset we identified the preferred location of each neuron as the location that evoked maximum mean firing rate in the 500 ms time period while the first stimulus was shown. The neural tuning was shown before to have a symmetric bell shape (Qi et al., 2011; Meyer et al., 2011), with each neuron having its own preferred location. We then resorted the trials (separately for each neuron) such that only five distinct stimuli were left: preferred location, 45°, 90°, 135°, and 180° away from the preferred location.
In both olfactory datasets trials were selfpaced. Accordingly, trials last different amounts of time, and firing rates cannot simply be averaged over trials. We used the following time warping (restretching) procedure to equalize the length of all trials and to align several events of interest (Figure 10) separately in each dataset. We defined five alignment events: odor poke in, odor poke out, water poke in, reward delivery, and water poke out. First, we aligned all trials on odor poke in (${T}_{1}=0$) and computed median times of the four other events ${T}_{i},\phantom{\rule{thinmathspace}{0ex}}i=2\dots 5$ (for the time of reward delivery, we took the median over all correct trials). Second, we set $\mathrm{\Delta}T$ to be the minimal waiting time between water port entry and reward delivery across the whole experiment ($\mathrm{\Delta}T=0.2$ s for the olfactory discrimination task and $\mathrm{\Delta}T=0.3$ s for the olfactory categorization task). Finally, for each trial with instantaneous firing rate $x(t)$ we set ${t}_{i},i=1\mathrm{\dots}5$, to be the times of alignment events on this particular trial (for error trials we took ${t}_{4}={t}_{3}+\mathrm{\Delta}T$), and stretched $x(t)$ along the time axis in a piecewiselinear manner to align each ${t}_{i}$ with the corresponding ${T}_{i}$.
We made sure that time warping did not introduce any artifacts by considering an alternative procedure, where short ($\pm $450 ms) time intervals around each ${t}_{i}$ were cut out of each trial and concatenated together; this procedure is similar to the pooling of neural data performed in the original studies (Feierstein et al., 2006; Kepecs et al., 2008). The dPCA analysis revealed qualitatively similar components (data not shown).
Crossvalidation to select regularization parameter
As noted above, we renormalized the regularization parameter $\mu ={(\lambda \parallel \mathbf{\mathbf{X}}\parallel )}^{2}$, and then used crossvalidation to find the optimal value of $\lambda $ for each dataset. To separate the data into training and testing sets, we held out one random trial for each neuron in each condition as a set of $SQ$ test 'pseudotrials' ${\mathbf{\mathbf{X}}}_{\mathrm{test}}$ (as the neurons were not recorded simultaneously, we do not have recordings of all $N$ neurons in any actual trial). Remaining trials were averaged to form a training set of PSTHs ${\stackrel{~}{\mathbf{\mathbf{X}}}}_{\mathrm{train}}$ and an estimate of the noise covariance matrix ${\stackrel{~}{\mathbf{\mathbf{C}}}}_{\text{train}}$. Note that ${\mathbf{\mathbf{X}}}_{\mathrm{test}}$ and ${\stackrel{~}{\mathbf{\mathbf{X}}}}_{\mathrm{train}}$ have the same dimensions.
We then performed dPCA on ${\stackrel{~}{\mathbf{\mathbf{X}}}}_{\mathrm{train}}$ for various values of $\lambda $ between ${10}^{7}$ and ${10}^{3}$ (on a logarithmic grid). For each $\lambda $, we selected ten components in each marginalization (i.e. 40 components in total) to obtain ${\mathbf{\mathbf{F}}}_{\varphi}(\lambda )$ and ${\mathbf{\mathbf{D}}}_{\varphi}(\lambda )$, and computed the normalized reconstruction error ${L}_{\mathrm{CV}}(\lambda )$ on the test set (see below). We repeated this procedure ten times for different traintest splittings and averaged the resulting functions ${L}_{\mathrm{CV}}(\lambda )$. In all cases the average function ${\overline{L}}_{\mathrm{CV}}(\lambda )$ had a clear minimum (Figure 11) that we selected as the optimal $\lambda $. The values of $\lambda $ selected for each dataset were $2.6\cdot {10}^{6}$ / $5.8\cdot {10}^{6}$ / $5.8\cdot {10}^{6}$ / $5.8\cdot {10}^{6}$. We also performed the same procedure in each marginalization separately, but in all datasets the optimal values of $\lambda $ were similar across marginalizations (Figure 11). We therefore chose to use the same value of $\lambda $ for all marginalizations.
Interestingly, for all our datasets ${\mathrm{min}}_{\lambda}\{{\overline{L}}_{\mathrm{CV}}(\lambda )\}$ was only slightly smaller than ${\overline{L}}_{\mathrm{CV}}(0)$, so the regularization term had almost no influence. Presumably, this result stems from our diagonal (and thus nonsingular) noise covariance matrices, and therefore does not necessarily hold for simultaneously recorded data.
To compute ${L}_{\mathrm{CV}}(\lambda )$, we used ${\mathbf{\mathbf{X}}}_{\mathrm{test}}$ to predict ${\stackrel{~}{\mathbf{\mathbf{X}}}}_{\mathrm{train}}$:
This is the residual trainingset variance not explained by the test data. Note that it would not make sense to exchange ${\mathbf{\mathbf{X}}}_{\mathrm{test}}$ and ${\stackrel{~}{\mathbf{\mathbf{X}}}}_{\mathrm{train}}$ in this formula: the decoder and encoder are fitted to the training data, and should only be applied to the test data for the purposes of crossvalidation. An alternative approach, in which we predicted the test data rather than the training data, yielded similar results (data not shown).
Crossvalidation to measure classification accuracy
We used decoding axis $\mathbf{\mathbf{d}}$ of each dPC in stimulus, decision, and interaction marginalizations as a linear classifier to decode stimulus, decision, or condition respectively. Black lines on Figures 3–6b show time periods of significant classification. A more detailed description follows below.
We used 100 iterations of stratified Monte Carlo leavegroupout crossvalidation, where on each iteration we held out one trial for each neuron in each condition as a set of $SQ$ test 'pseudotrials' ${\mathbf{\mathbf{X}}}_{\mathrm{test}}$ and averaged over remaining trials to form a training set ${\stackrel{~}{\mathbf{\mathbf{X}}}}_{\mathrm{train}}$ (see above). After running dPCA on ${\stackrel{~}{\mathbf{\mathbf{X}}}}_{\mathrm{train}}$, we used decoding axes of the first three stimulus/decision/interaction dPCs as a linear classifier to decode stimulus/decision/condition respectively. Consider e.g. the first stimulus dPC: first, for each stimulus, we computed the mean value of this dPC separately for every timepoint. Then we projected each test trial on the corresponding decoding axis and classified it at each timepoint according to the closest class mean. The proportion of test trials (out of $SQ$) classified correctly resulted in a timedependent classification accuracy, which we averaged over 100 crossvalidation iterations. Note that this is a stratified procedure: even though in reality some conditions have many fewer trials than others, here we classify exactly the same number of 'pseudotrials' per condition. At the same time, as the coordinates of individual data points in each pseudotrial are pooled from different sessions, the influence of noise correlations on the classification accuracies is neglected, similar to Meyers et al. (2012).
We then used 100 shuffles to compute the distribution of classification accuracies expected by chance. On each iteration and for each neuron, we shuffled all available trials between conditions, respecting the number of trials per condition (i.e. all $\sum}_{sd}{K}_{nsd$ trials were shuffled and then randomly assigned to the conditions such that all values ${K}_{nsd}$ stayed the same). Then exactly the same classification procedure as above (with 100 crossvalidation iterations) was applied to the shuffled dataset to find mean classification accuracy for the first stimulus, decision, and interaction components. All 100 shuffling iterations resulted in a set of 100 timedependent accuracies expected by chance.
The time periods when actual classification accuracy exceeded all 100 shuffled decoding accuracies in at least ten consecutive time bins are marked by black lines on Figures 3–6. Components without any periods of significant classification are not shown. See Figure 12 for classification accuracies in each dataset. The Monte Carlo computations took $\sim $8 hr for each of the larger datasets on a 6 core 3.2 Ghz Intel i73930K processor.
Implementation of classical approaches (Figure 1)
The twoway ANOVA shown in Figure 1c–e was performed as follows. The two factors were stimulus (with six levels) and decision (with two levels), the interaction term was included, and a separate ANOVA was run for the firing rate of each neuron at each time point. Significance level was set at $\alpha =0.05$. Effect size was defined as partial omega squared with a sign given by the sign of the correlation coefficient between firing rate and the corresponding parameter. It can take values between $1$ and 1, with 0 meaning no effect. For oneway ANOVA with a single twolevel factor (which is a ttest), it would reduce to the signed ${R}^{2}$ between firing rate and factor level.
For Figure 1f–g, we ran linear regressions for the firing rate of each neuron at $t=0.25$ s and $t=3.75$ s, taking F1 stimulus value in Hz as one predictor, decision as another one, and including an interaction effect. Predictors were standardized (so regression coefficients in Figure 1g are standardized coefficients). The components shown in Figure 1f were constructed following the 'targeted dimensionality reduction' method presented in Mante et al. (2013) (see below for more details).
To compute the proportion of explained PSTH variance, we arranged the two components (obtained by either method) into a matrix $\mathbf{\mathbf{Z}}$ of $2\times SQT$ size. Both the PSTH data matrix $\stackrel{~}{\mathbf{\mathbf{X}}}$ and $\mathbf{\mathbf{Z}}$ were centered by subtracting row means. Linear regression was used to find reconstruction weights $\mathbf{\mathbf{B}}=\stackrel{~}{\mathbf{\mathbf{X}}}{\mathbf{\mathbf{Z}}}^{\top}{({\mathbf{\mathbf{Z}\mathbf{Z}}}^{\top})}^{1}$ minimizing reconstruction error ${\parallel \stackrel{~}{\mathbf{\mathbf{X}}}\mathbf{\mathbf{B}\mathbf{Z}}\parallel}^{2}$. Then the proportion of explained variance was computed as ${R}^{2}=1{\parallel \stackrel{~}{\mathbf{\mathbf{X}}}\mathbf{\mathbf{B}\mathbf{Z}}\parallel}^{2}/{\parallel \stackrel{~}{\mathbf{\mathbf{X}}}\parallel}^{2}$.
The PCA on Figure 1i–k was done on the centered PSTH data matrix $\stackrel{~}{\mathbf{\mathbf{X}}}$. Let its singular value decomposition be $\stackrel{~}{\mathbf{\mathbf{X}}}={\mathbf{\mathbf{U}\mathbf{S}\mathbf{V}}}^{\top}$. Then each subplot on Figure 1j is a histogram of elements of one column of $\mathbf{\mathbf{U}}$ and each subplot on Figure 1j is one column of $\mathbf{\mathbf{V}}$.
Comparison of dPCA to PCA in each marginalization
To understand the differences of dPCA with respect to other (demixing) methods, we will make several explicit comparisons. The first method we will consider is performing a series of standard PCAs in each marginalization separately. This procedure can be understood in two ways: after performing PCA on ${\mathbf{\mathbf{X}}}_{\varphi}$ and obtaining the matrix ${\mathbf{\mathbf{U}}}_{\varphi}$ for the $k$ leading principal directions, we can use this matrix to project either the marginalized data or the full data.
In the first case we obtain the principal components of the corresponding marginalization, ${\mathbf{\mathbf{U}}}_{\varphi}{\mathbf{\mathbf{X}}}_{\varphi}$. However, while these components provide a particular decomposition or visualization of the data, they do not constitute readouts of the neural activity, since they are based on projecting the marginalized data. One particular advantage of the dPCA formulation is that it operates on the raw data, so that the decoders (and encoders) can actually be used on single trials. In turn, the visualization of the data found through dPCA also provides insights into the utility of the respective population code for the brain.
In the second case we obtain ${\mathbf{\mathbf{U}}}_{\varphi}\mathbf{\mathbf{X}}$ components from the full data, so that these components could be obtained by projecting singletrial activities. However, now there is no guarantee that these components will be demixed. For a simple counterexample, consider Figure 13: the stimulus marginalization ${\mathbf{\mathbf{X}}}_{s}$ consists of three points (one for each stimulus) located roughly on a horizontal axis, and so the first principal axis of ${\mathbf{\mathbf{X}}}_{s}$ is roughly horizontal. It is easy to see that the projection of the full data onto this axis will be not only stimulus, but also timedependent.
Nonetheless, we can obtain a reasonable approximation to the dPCA solution using PCA in each marginalization. Namely, ${\mathbf{\mathbf{U}}}_{\varphi}$ can be taken to constitute the encoders ${\mathbf{\mathbf{F}}}_{\varphi}$. In turn, the decoders ${\mathbf{\mathbf{D}}}_{\varphi}$ are obtained by a pseudoinverse $\mathbf{\mathbf{D}}={\mathbf{\mathbf{U}}}^{+}$, where $\mathbf{\mathbf{U}}$ is a matrix with $4k$ columns obtained by joining together all ${\mathbf{\mathbf{U}}}_{\varphi}$. We found that this procedure provides a close approximation of the actual decoder and encoder matrices, provided one chooses a reasonable value of $k$: choosing $k$ too small results in poor demixing, and choosing $k$ too large results in overfitting. In our datasets, $k=10$ provides a good tradeoff.
This approximate solution highlights the conditions under which dPCA will work well, i.e., result in welldemixed components that capture most of the variance of the data: the main principal axes of different marginalizations ${\mathbf{\mathbf{X}}}_{\varphi}$ need to be noncollinear. In other words, principal subspaces of different marginalizations should not overlap.
Comparison of dPCA with targeted dimensionality reduction
Next, we compare dPCA with 'targeted dimensionality reduction' (TDR), the method proposed by Mante et al. (2013). Briefly, the algorithm underlying TDR works as follows:
Perform PCA of the trialaverage neural data $\stackrel{~}{\mathbf{\mathbf{X}}}$ and define a 'denoising' matrix $\mathbf{\mathbf{K}}$ as a linear projector on the space spanned by the leading principal axes. Here we used 20 principal axes: $\mathbf{\mathbf{K}}={\mathbf{\mathbf{U}}}_{20}{\mathbf{\mathbf{U}}}_{20}^{\top}$.
For each neuron $i$, regress its firing rate at each time point $t$ on stimulus, decision, and interaction between them:
${x}_{i}(t)={\beta}_{i}^{o}(t)+{\beta}_{i}^{s}(t)s+{\beta}_{i}^{d}(t)d+{\beta}_{i}^{sd}(t)sd+\u03f5,$where $d$ is any suitable parametrization of decision, e.g. $d=\pm 1$, and $s$ is stimulus value (e.g. actual stimulation frequency in the somatosensory working memory task in monkeys).
Take $N$ values of ${\beta}_{i}^{s}(t)$ as defining an $N$dimensional vector ${\mathit{\bm{\beta}}}^{s}(t)$, and analogously define ${\mathit{\bm{\beta}}}^{d}(t)$, ${\mathit{\bm{\beta}}}^{sd}(t)$, and ${\mathit{\bm{\beta}}}^{o}(t)$. Mante et al. (2013) did not use the conditionindependent term ${\mathit{\bm{\beta}}}^{o}(t)$ in their original study, but we treat it here on equal footing.
Project each ${\mathit{\bm{\beta}}}^{\varphi}(t)$ onto the space spanned by the leading PCA axes: $\mathbf{\mathbf{K}}{\mathit{\bm{\beta}}}^{\varphi}(t).$
For each of the three parameters select one vector ${\mathit{\bm{\beta}}}_{*}^{\varphi}=\mathbf{\mathbf{K}}{\mathit{\bm{\beta}}}^{\varphi}({t}_{*}^{\varphi})$, where ${t}_{*}^{\varphi}$ is the time point at which the norm of $\mathbf{\mathbf{K}}{\mathit{\bm{\beta}}}^{\varphi}(t)$ is maximal.
Finally, stack the obtained vectors to form a matrix $\mathbf{\mathbf{B}}=[{\mathit{\bm{\beta}}}_{*}^{s}{\mathit{\bm{\beta}}}_{*}^{d}{\mathit{\bm{\beta}}}_{*}^{sd}{\mathit{\bm{\beta}}}_{*}^{o}]$ and perform QR decomposition $\mathbf{\mathbf{B}}=\mathbf{\mathbf{Q}\mathbf{R}}$ with $\mathbf{\mathbf{Q}}$ orthonormal and $\mathbf{\mathbf{R}}$ upper triangular to find a set of three orthogonal demixing axes (as columns of $\mathbf{\mathbf{Q}}$). If the individual vectors in $\mathbf{\mathbf{B}}$ are far from orthogonal, then the resulting axes will strongly depend on the order of stacking them into the matrix $\mathbf{\mathbf{B}}$. We found that we obtain best results if we select this order ad hoc for each dataset; the orders we used were stimulus $\to $ decision $\to $ interaction for the somatosensory working memory task, stimulus $\to $ interaction $\to $ decision for the visuospatial working memory task, and interaction $\to $ decision $\to $ stimulus for both olfactory tasks. The vector ${\mathit{\bm{\beta}}}^{o}(t)$ was always used last.
We applied TDR to all our datasets and observed that dPCA consistently outperforms it in terms of capturing variance and demixing task parameters. First, unlike dPCA, TDR yields only one component per task parameter. Second, even this component tends to retain more mixed selectivity than the corresponding dPCA component. Some representative components are shown in Figure 14.
Comparison of dPCA with LDA
Linear Discriminant Analysis (LDA) is usually understood as a oneway technique: there is only one parameter (class id) associated with each data point, whereas in this manuscript we dealt with three parameters simultaneously. Therefore, LDA in its standard form cannot directly be applied to the demixing problem. We can, however, use the same data and covariance decomposition
that dPCA is using and construct a separate LDA for each marginalization $\varphi $. To the best of our knowledge, this framework does not have an established name, so we call it factorial LDA.
Let us first consider the case of finding demixed components for marginalization ${\mathbf{\mathbf{X}}}_{\varphi}$. We will denote the remaining part of the data matrix as ${\mathbf{\mathbf{X}}}_{\varphi}=\mathbf{\mathbf{X}}{\mathbf{\mathbf{X}}}_{\varphi}$ and the remaining part of the covariance matrix as ${\mathbf{\mathbf{C}}}_{\varphi}=\mathbf{\mathbf{C}}{\mathbf{\mathbf{C}}}_{\varphi}$. In turn, the goal of LDA will be to find linear projections that have high variance in ${\mathbf{\mathbf{C}}}_{\varphi}$ and low variance in ${\mathbf{\mathbf{C}}}_{\varphi}$. In LDA, these matrices are usually called betweenclass and withinclass covariance matrices (Hastie et al., 2009). The standard treatment of LDA is to maximize the multivariate signaltonoise ratio
where $\mathbf{\mathbf{D}}$ is the matrix with discriminant axes in rows. The wellknown solution is that ${\mathbf{\mathbf{D}}}_{\mathrm{LDA}}$ is given by the leading eigenvectors of ${\mathbf{\mathbf{C}}}_{\varphi}^{1}{\mathbf{\mathbf{C}}}_{\varphi}$ (stacked together as rows), or, equivalently, as eigenvectors of ${\mathbf{\mathbf{C}}}^{1}{\mathbf{\mathbf{C}}}_{\varphi}$.
More useful for our purposes is the reformulation of LDA as a reducedrank regression problem (Izenman, 2008; De la Torre, 2012). When classes are balanced, it can be formulated as
where ${\mathbf{\mathbf{G}}}_{\varphi}$ is a class indicator matrix. This matrix has as many rows as there are possible values of parameter $\varphi $ and specifies which data point is labeled with which parameter value: ${G}_{ij}=1$ if the $j$th data point belongs to class $i$ (has $i$th value of the parameter $\varphi $) and ${G}_{ij}=0$ otherwise. In the toy example shown in Figure 2, there are three classes with five points each, and so ${\mathbf{\mathbf{G}}}_{\varphi}$ will be a $3\times 15$ matrix of zeros and ones. In this reformulation of LDA, the main interest is in the decoder matrix $\mathbf{\mathbf{D}}$, whereas the encoder matrix $\mathbf{\mathbf{F}}$, which serves to map the lowdimensional representation onto the class indicator matrix, plays only an auxiliary role.
In contrast, the dPCA loss function is
where ${\mathbf{\mathbf{X}}}_{\varphi}$ is the matrix of the same size as ${\mathbf{\mathbf{G}}}_{\varphi}$ with $j$th column being the class centroid of the class to which the $j$th point belongs. This comparison highlights the difference between the two methods: LDA looks for decoders that allow to reconstruct class identity (as encoded by ${\mathbf{\mathbf{G}}}_{\varphi}$) whereas dPCA looks for decoders that allow to reconstruct class means (as encoded by ${\mathbf{\mathbf{X}}}_{\varphi}$). Figure 2b,f,h provides a toy example of a situation when these two goals yield very different solutions: the LDA projection separates the three classes better than the dPCA projection, but the dPCA projection preserves the information about the distance between classes.
Using the explicit solution for reducedrank regression, one can show that ${L}_{\mathrm{LDA}}$ does indeed have eigenvectors of ${\mathbf{\mathbf{C}}}^{1}{\mathbf{\mathbf{C}}}_{\varphi}$ as a solution ${\mathbf{\mathbf{D}}}_{\mathrm{LDA}}$ for decoder (see Section 8.5.3 in Izenman, 2008). Following the similar logic for ${L}_{\mathrm{dPCA}}$, one can derive the corresponding expression for the dPCA decoder: ${\mathbf{\mathbf{D}}}_{\mathrm{dPCA}}$ is given by the eigenvectors of ${\mathbf{\mathbf{C}}}^{1}{\mathbf{\mathbf{C}}}_{\varphi}^{2}$ (personal communication with Maneesh Sahani).
A statistical test known as MANOVA can be seen as another possible factorial generalization of LDA. Given the same data and covariance decomposition, MANOVA tests if the effect of $\varphi $ is statistically significant by analyzing eigenvalues of ${\mathbf{\mathbf{C}}}_{\mathrm{noise}}^{1}{\mathbf{\mathbf{C}}}_{\varphi}$. The eigenvectors of this matrix can in principle serve as decoders, but these projections are optimized to separate the contribution of $\varphi $ from noise, not from the contributions of noise and other parameters. Hence, MANOVA is not the appropriate method for demixing purposes.
While the toy example of Figure 2 illustrates that dPCA and LDA will in principle have very different solutions, we note that in all datasets considered here factorial LDA and dPCA yielded very similar components. This may reflect several pecularities of the data: for instance, the population activity for different values of the same parameter was spaced rather evenly, and all decisions were binary. Nevertheless, we emphasize that dPCA is better suited for (demixed) dimensionality reduction due to its focus on reconstructing the original data, as explained and discussed in the Results (Figure 2).
Comparison of dPCA with previous versions
Demixed PCA as presented here is conceptually based on our previous work. Machens et al. (2010) suggested a demixing method called difference of covariances (DOC) that can only handle two parameters, e.g. stimulus $s$ and time $t$. Given PSTHs $\mathbf{\mathbf{x}}(s,t)$, DOC first constructs stimulusdependent and timedependent marginalizations $\overline{\mathbf{\mathbf{x}}}(s)={\u27e8\mathbf{\mathbf{x}}(s,t)\u27e9}_{t}$ and $\overline{\mathbf{\mathbf{x}}}(t)={\u27e8\mathbf{\mathbf{x}}(s,t)\u27e9}_{s}$, and then computes the difference between the stimulus and time covariance matrices ${\mathbf{\mathbf{C}}}_{s}=\u27e8\overline{\mathbf{\mathbf{x}}}(s)\overline{\mathbf{\mathbf{x}}}{(s)}^{\top}\u27e9$ and ${\mathbf{\mathbf{C}}}_{t}=\u27e8\overline{\mathbf{\mathbf{x}}}(t)\overline{\mathbf{\mathbf{x}}}{(t)}^{\top}\u27e9$,
Eigenvectors of S with maximum (positive) eigenvalues correspond to directions with maximum stimulus variance and minimum decision variance. Vice versa, eigenvectors with minimum (negative) eigenvalues correspond to directions with maximum decision variance and minimum stimulus variance. In the toy example presented in Figure 2 DOC finds the axis that is very close to the first PCA axis of class centroids (which is also very close to the dPCA encoder axis shown on the figure), providing worse demixing than both LDA and dPCA.
A possible extension of DOC to more than two parameters is described in Machens (2010). Here the PSTHs are assumed to depend on $M$ parameters, and the method constructs $M$ marginalizations by averaging over all parameters except one. The respective covariance matrices ${\mathbf{\mathbf{C}}}_{\varphi}$ are then formed as above. The extension of DOC seeks to find the matrix of orthogonal directions $\mathbf{\mathbf{U}}$ such that
is maximized subject to ${\mathbf{\mathbf{U}}}^{\top}\mathbf{\mathbf{U}}=\mathbf{\mathbf{I}}$ where $\mathbf{\mathbf{U}}=[{\mathbf{\mathbf{U}}}_{1}\mathrm{\dots}{\mathbf{\mathbf{U}}}_{M}]$. For $M=2$ this can be shown to be equivalent to the original DOC. Note that Machens (2010) did not address the interaction terms.
The connection between the current dPCA and the DOC approach can be made more explicit if we consider the full covariance decomposition $\mathbf{C}={\sum}_{\varphi}{\mathbf{C}}_{\varphi}$ and introduce into the dPCA loss function an additional constraint that both encoder and decoder should be given by the same matrix with orthonormal columns: ${\mathbf{\mathbf{F}}}_{\varphi}={\mathbf{\mathbf{D}}}_{\varphi}^{\top}={\mathbf{\mathbf{U}}}_{\varphi}$. Then
where the first equality follows from properties of the decomposition, the second equality from the properties of the orthonormal matrices ${\mathbf{\mathbf{U}}}_{\varphi}$, and the third equality uses the definition of the covariance. This derivation shows that the difference of covariances ${\mathbf{\mathbf{C}}}_{\varphi}{\mathbf{\mathbf{C}}}_{\varphi}$ emerges from the dPCA loss function if the decoder and encoder are given by the same set of orthogonal axes. However, such axes ${\mathbf{\mathbf{U}}}_{\varphi}$ from different marginalizations $\varphi $ will in general not be orthogonal to each other, whereas both DOC and its generalization insisted on orthogonal axes.
Both the original DOC and its extension ignored interaction terms. Brendel et al. (2011) introduced interaction terms and the full covariance splitting $\mathbf{C}={\sum}_{\varphi}{\mathbf{C}}_{\varphi}$ as described in this manuscript, and developed a probabilistic dPCA model based on probabilistic PCA (PPCA); to remove ambiguity we call this method dPCA2011. Similar to PPCA, dPCA2011 assumes that the data are described by a linear model with Gaussian residuals, i. e.
but the prior over the components z is chosen such that the components are sparsely distributed over marginalizations. In other words, the prior is chosen such that those components are favored that have variance in only one marginalization. Under the constraint that decoding directions W are orthogonal, the model can be fit using the expectationmaximization algorithm. However, the probabilistic formulation of Brendel et al. (2011) still suffers from the orthogonality constraint. As explained in the Discussion, the orthogonality constraint is too rigid and can prevent successful demixing if parameter subspaces are sufficiently nonorthogonal. Indeed, we applied dPCA2011 to all our datasets and observed that dPCA2015 showed better demixing (Figure 14). Moreover, dPCA2011 failed to find any decision components in the visuospatial working memory task.
In addition, the formulation of dPCA in this manuscript is radically simplified compared to Brendel et al. (2011), features an analytic solution and is easier to compare with other linear dimensionality reduction techniques.
Appendix A. Mathematical properties of the marginalization procedure
Above we presented marginalization procedure for three parameters. In order to generalize it for an arbitrary number of parameters, we introduce a more general notation. We denote as $\mathrm{\Psi}$ the set of parameters (in the previous section $\mathrm{\Psi}=\{t,s,d\}$; note that the trial index is not included into $\mathrm{\Psi}$) and write ${\overline{x}}_{\psi}$ to denote a decomposition term that depends on a subset of parameters $\psi \subseteq \mathrm{\Psi}$. In particular, $\overline{x}}_{\mathrm{\varnothing}}=\overline{x$. In full analogy to the 3parameter case, each term can be iteratively computed via
where ${\u27e8\cdot \u27e9}_{\mathrm{\Psi}\setminus \psi}$ denotes averaging over all parameters that are not elements of $\psi $ and averaging over the trial index. This equation can be rewritten in a noniterative way by expanding the sum; this yields the expression with alternating signs that is similar to our ANOVAstyle equations above:
One can verify that this formula correctly describes the 3parameter case presented above; the general case can be proven by induction. The noise term is defined via
This decomposition has several useful properties. First, the average of any marginalization ${\overline{x}}_{\psi}$ over any parameter $\gamma \in \psi $ is zero. This can be seen from the equation $(\star )$ because after averaging over $\gamma $ all terms will split into pairs with opposite signs (indeed, for each $\tau \ni \gamma $ there is another ${\tau}^{\prime}=\tau \setminus \gamma $). Second, all marginalizations are pairwise uncorrelated, i.e. their covariance is zero: ${\u27e8{\overline{x}}_{\psi}{\overline{x}}_{\chi}\u27e9}_{\mathrm{\Psi}}=0$. This can be seen from equation $(\star )$ because ${\overline{x}}_{\psi}$ and ${\overline{x}}_{\chi}$ both consist of an even number of terms with alternating signs, so their product will also consist of an even number of terms with alternating signs, and after averaging over $\mathrm{\Psi}$ all terms will become equal to $\pm {\overline{x}}^{2}$ and cancel each other. Third, from the definition of the noise term it follows that any marginalization ${\overline{x}}_{\psi}$ is uncorrelated with the noise term: ${\u27e8{\overline{x}}_{\psi}{\overline{x}}_{\mathrm{noise}}\u27e9}_{\mathrm{\Psi}}=0$.
The fact that all marginalizations and the noise are pairwise uncorrelated allows to segregate the variance of $x$ (here we assume that $x$ is centered, i.e. $\overline{x}=0$):
Turning now to the multivariate case, if we replace $x$ with $\mathbf{x}\in {\mathbb{R}}^{N}$, everything remains true but variances should be replaced by covariance matrices:
Note that in ANOVA literature one usually talks about decomposing sums of squares $\sum {x}^{2}$ and in MANOVA literature about decomposing scatter matrices $\sum {\mathbf{\mathbf{x}\mathbf{x}}}^{\top}$, because (co)variances of different terms are computed from these sums using different denominators (depending on the corresponding number of degrees of freedom) and do not add up. We do not make this distinction and prefer to talk about decomposing the (co)variance, i.e. all (co)variances here are defined with the same denominator equal to the total number of sample points.
Appendix B. Fourierlike components from temporal variations
Consider the decision components in the somatosensory working memory task, Figure 3. Here the second and the third components are closely resembling the first and second temporal derivatives of the leading decision component. To illustrate why these components are likely to be artifacts of the underlying sampling process, consider a highly simplified example in which a population of $N$ neurons is encoding a onedimensional bellshaped signal $z(t)$ in the population vector $\mathbf{\mathbf{a}}$, i.e the population response is given by $\mathbf{\mathbf{y}}(t)=\mathbf{\mathbf{a}}z(t)$. In this case, the population response lies in the onedimensional subspace spanned by $\mathbf{\mathbf{a}}$ and the covariance matrix has rank one:
Now consider the case in which the neurons are not recorded simultaneously but are pooled from different sessions. In behavioural experiments it is unavoidable that the onset of (selftimed) neural responses will vary by tenths or hundreds of milliseconds. Hence, the individual response ${y}_{i}(t)$ of neuron $i$ will experience a small timeshift ${\tau}_{i}$ so that ${y}_{i}(t)={a}_{i}z(t+{\tau}_{i})$, see Figure 15 for an example with Gaussian tuning curves. If ${\tau}_{i}$ is small we can do a Taylor expansion around $t$,
where we neglect higherorder corrections for simplicity, but the extension is straightforward. Let $\mathit{\bm{\tau}}$ be the vector of timeshifts of all neurons and let $\mathbf{\mathbf{b}}=\mathbf{\mathbf{a}}\circ \mathit{\bm{\tau}}$ be the elementwise vector product of $\mathbf{\mathbf{a}}$ and $\mathit{\bm{\tau}}$, i.e. ${[\mathbf{\mathbf{a}}\circ \mathit{\bm{\tau}}]}_{i}={a}_{i}{\tau}_{i}$. Then the population response can be written as
Hence, the covariance matrix becomes approximately
where we assumed for simplicity that $\mathbf{\mathbf{a}}\u27c2\mathbf{\mathbf{b}}$. In other words, timeshifts between observations will result in additional PCA components that roughly resemble the temporal derivatives of the source component.
Data and code
The dPCA code is available at http://github.com/machenslab/dPCA for Matlab and Python. All four datasets used in this manuscript have been made available at http://crcns.org (Romo et al., 2016; Constantinidis et al., 2016; Feierstein et al., 2016; Uchida et al., 2016). Our preprocessing and the main analysis scripts (Matlab) are available at http://github.com/machenslab/elife2016dpca.
References

1
Neuronal population coding of parametric working memoryThe Journal of Neuroscience 30:9424–9430.https://doi.org/10.1523/JNEUROSCI.187510.2010

2
Demixed principal component analysisAdvances in Neural Information Processing Systems 24. pp. 2654–2662.
 3
 4

5
Spectral learning of linear dynamics from generalisedlinear observations with application to neural population dataAdvances in Neural Information Processing Systems 25. pp. 1682–1690.

6
Springer Science & Business MediaPlane answers to complex questions: the theory of linear models, Springer Science & Business Media, 10.1111/j.17515823.2011.00159_25.x.
 7
 8

9
Dimensionality reduction for largescale neural recordingsNature Neuroscience, 17, 10.1038/nn.3776.

10
A leastsquares framework for component analysisPattern Analysis and Machine Intelligence, IEEE Transactions 34:1041–1055.https://doi.org/10.1109/tpami.2011.184

11
The effects of neural gain on attention and learningNature Neuroscience 16:1146–1153.https://doi.org/10.1038/nn.3428
 12
 13

14
Ongoing behavior predicts perceptual report of interval durationFrontiers in Neurorobotics, 8, 10.3389/fnbot.2014.00010.
 15
 16
 17

18
Reducedrank regression for the multivariate linear modelJournal of Multivariate Analysis 5:248–264.https://doi.org/10.1016/0047259X(75)900421
 19

20
Heterogenous population coding of a shortterm memory and decision taskThe Journal of Neuroscience 30:916–929.https://doi.org/10.1523/JNEUROSCI.206209.2010
 21

22
Demixing population activity in higher cortical areasFrontiers in Computational Neuroscience, 4, 10.3389/fncom.2010.00126.
 23

24
Functional, but not anatomical, separation of "what" and "when" in prefrontal cortexThe Journal of Neuroscience 30:350–360.https://doi.org/10.1523/JNEUROSCI.327609.2010
 25

26
Incorporation of new information into prefrontal cortical activity after learning working memory tasksProceedings of the National Academy of Sciences of the United States of America 109:4651–4656.https://doi.org/10.1073/pnas.1201022109

27
Stimulus selectivity in dorsal and ventral prefrontal cortex after training in working memory tasksThe Journal of Neuroscience 31:6266–6276.https://doi.org/10.1523/JNEUROSCI.679810.2011

28
Quantifying the signals contained in heterogeneous neural responses and determining their relationships with task performanceJournal of Neurophysiology 112:1584–1598.https://doi.org/10.1152/jn.00260.2014

29
Encoding and decoding in parietal cortex during sensorimotor decisionmakingNature Neuroscience 17:1395–1403.https://doi.org/10.1038/nn.3800

30
Robust learning of lowdimensional dynamics from large neural ensembles2391–2399, Advances in Neural Information Processing Systems 26.

31
Changes in prefrontal neuronal activity after learning to perform a spatial working memory taskCerebral Cortex, 21, 10.1093/cercor/bhr058.

32
Neural correlates of a decision variable before learning to perform a match/nonmatch taskThe Journal of Neuroscience 32:6161–6169.https://doi.org/10.1523/JNEUROSCI.636511.2012

33
A categoryfree neural population supports evolving demands during decisionmakingNature Neuroscience 17:1784–1792.https://doi.org/10.1038/nn.3865
 34

35
Variability in neural activity and behaviorCurrent Opinion in Neurobiology 25:211–220.https://doi.org/10.1016/j.conb.2014.02.013
 36
 37
 38
 39
 40

41
Flutter discrimination: Neural codes, perception, memory and decision makingNature Reviews. Neuroscience 4:203–218.https://doi.org/10.1038/nrn1058
 42

43
Extraction of the average and differential dynamical response in stimuluslocked experimental dataJournal of Neuroscience Methods 141:223–229.https://doi.org/10.1016/j.jneumeth.2004.06.012
 44

45
Speed and accuracy of olfactory discrimination in the ratNature Neuroscience 6:1224–1229.https://doi.org/10.1038/nn1142
 46

47
Populationwide distributions of neural activity during perceptual decisionmakingProgress in Neurobiology 103:156–193.https://doi.org/10.1016/j.pneurobio.2012.09.004

48
Gaussianprocess factor analysis for lowdimensional singletrial analysis of neural population activityJournal of Neurophysiology 102:614–635.https://doi.org/10.1152/jn.90941.2008
Decision letter

Mark CW van RossumReviewing Editor; University of Edinburgh, United Kingdom
In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.
Thank you for submitting your work entitled "Demixed principal component analysis of neural population data" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Eve Marder as the Senior Editor.
The reviewers have discussed the reviews with one another and the Reviewing editor has drafted this decision to help you prepare a revised submission.
Essential revisions:
As you will see from the individual reports below, there were rather divergent views on your manuscript. While we are not asking for additional analysis, we are asking for editorial revisions to make the method and its novelty more apparent.
We are asking for a major rewrite that 1) defines the method better 2) clarifies the difference and similarities with earlier dPCA approaches, 3) the interpretation of the results. Being a methods paper we feel that the presentation is particularly important. Also publication/public presentation of the code is a necessary requirement for eLife publication (this can be any of a number of public and maintained sites).
We are treating this as a Tools and Resources paper, as it is really a data analysis Methods paper, and should have been submitted under the TR heading.
Reviewer #1:
This paper is about an algorithm to analyze neural activity in response to experimental manipulations of various factors such as time, stimulus property, and behavioral response. There is a new technique proposed and discussed in the context of previous methods. The results of the different methods do not look all that different.
The main issue I have with the paper is that the method is described 3 times in 3 separate sections (Results, Methods, supplement), with increasing level of detail, but seemingly inconsistently in concept and notation. It is therefore very hard to make sense of it all.
At the end of the day it seems that the novel idea advocated here is rather simple: find a set of projections of the raw data which explain most of the variance in the mean response (mean over all trials and factors while holding one factor fixed). This in turn seems to be simply solved, for each factor separately, by an existing algorithm known as "reduced rank regression".
Complicating the description, the authors seem to have at least 3 algorithms which they call dPCA. First, one they proposed in 2011 under that exact title. Then, a second, apparently the new one, which is based on a cost function that was not entirely clear in the first reading in the Results section, but became clear only after reading the supplement. There is a third "approximate" version suggested, which concatenates the eigenvectors of the covariance for each mean response. (Is that what is called Naive demixing (Figure 12)?) They seem to argue at one point that this approximate method gives similar results to the superior new technique.
The methods seem to have all in common in their goal of explaining the covariance of the mean responses (mean across the factors, and at times, mean across trials) with a few dimensions. There is much talk about an additive model similar to what is used in conventional MANOVA, where the total data is explained as a sum of means across various combinations of factors. Though I don't see that this is necessary to motivate the dimensionality reduction proposed here, it does lead to comparison with yet one more technique. With the current presentation, I find it hard to keep it all straight.
In Figure 12 the various methods are then finally compared to each other, but as stated below, the performance metric presented there left me confused, just when I thought I knew what is going on. Importantly, the traces presented there don't seem all that different to each other. So then, what is the main new contribution of this paper?
Reviewer #2:
This manuscript gives a discursive presentation of how a new version of remixed principal component analysis (dPCA2015) may be used to analyze multivariate neural data. The presentation is, I believe, complete enough that a user could reconstruct, from zero, the analysis method and the particulars of data preprocessing, etc. It is wellwritten and logical. The method itself is a nice compromise between a principal component approach to data analysis and a MANOVAlike approach. In particular, components are easily visualizable and the manuscript figures do a nice job of showing, at a glance, the results from fairly sophisticated experiments.
Other than a number of small textual changes that I would suggest, and one reference, I think this manuscript is in publishable form.
Reviewer #3:
This paper describes a statistical method, demixedPCA, for finding lowdimensional "interpretable" structure in trial based neural recordings. In many cases, neurons exhibit "mixed selectivity", meaning that they exhibit tuning to multiple experimental variables (e.g., the visual stimulus and the upcoming decision). dPCA differs from PCA in that it tries to find a lowdimensional projection that separates ('demixes') the tuning to different kinds of variables. The paper applies dPCA to four different datasets from monkey PFC and rat OFC, and shows that it recovers structure consistent with previously reported findings using these datasets.
The paper is interesting and this technique is likely to be of considerable interest and importance to the field, particularly given recent interest in mixed selectivities in different brain areas. However, I have some concerns about novelty and about the intended contribution of this paper. A similar demixing approach has been described previously in Machens 2010 and Brendel 2011 (which was a NIPS paper, so perhaps shouldn't be counted against the novelty of this one since it's only a conference proceedings paper). But it would be helpful to describe a little bit more clearly the intended contribution. Is this just the journallength writeup of the NIPS paper? How does the method described differ from that in the 2010 Frontiers paper?
It would also be nice to spell out a little bit more clearly what scientific insights the method is likely to provide (or indeed, provides for the datasets being analyzed here). It seemed that for each dataset, the paper mostly says that dPCA confirmed the findings in the original paper. But if that's the case, why do we need dPCA? What can we learn by analyzing data with this method that we didn't already have access to using the analysis methods from those papers?
I have two other highlevel comments:
1) I think the authors don't do enough to describe how they decided which components to include when setting up the model. Presumably this is a choice made before running the method, i.e., how to map the elements of the experiment onto discrete conditions. How did the authors solve this problem, and how should potential users go about solving it when applying the method to new datasets. For example, why don't we have "reward" components for any of the datasets considered here? How did you decide which interaction terms to include in the first place? In the olfactory data, did you include different components for each mixture or just one for each mixing component? What are the keys to setting up the analysis? Are there any exploratory data analysis techniques that the authors used to make these decisions?
2) For a methods paper like this, providing code should be a mandatory requirement for publication. Please say something about where to find an implementation, so that would be users can try it out.
[Editors' note: further revisions were requested prior to acceptance, as described below.]
Thank you for resubmitting your work entitled "Demixed principal component analysis of neural population data" for further consideration at eLife. Your revised article has been favorably evaluated by Eve Marder (Senior editor), a Reviewing editor, and three reviewers. The manuscript has been greatly improved but there are some smaller remaining issues outlined below.
Reviewer #1 (General assessment and major comments (Required)):
This manuscript is like night and day compared to the previous version. I barely recognize the material any more. It is so much clearer that it makes me much more confident that this is all sound and well. I can hardly believe that these were the same authors.
Reviewer #1 (Minor Comments):
A few places where I got stuck were for example:
Equation 4 in subsection “Core dPCA: loss function and algorithm”, why can one separate the quadratic term like this? Also not clear why stefs 13 minimize this cost function, but I trust that all this is in the cited paper, though I could only retrieve the Izenman (1975) paper, which looked like yet more work to answer my questions.
Similarly, second equation in subsection “Unbalanced data”, why can one separate the square into two terms? and why can one replace X_{noise} by C^{1/2}_{noise}? and why is X_{PSTH}=X tilde?
In general I have to say that, while the issue of balancing the data is intuitively clear, the corresponding math seems cumbersome, but again, this may be my lack of time.
Reviewer #2 (General assessment and major comments (Required)):
The revised manuscript is acceptable for publication.
Reviewer #3 (General assessment and major comments (Required)):
I thank the author for the detailed reply to the original review comments, and the revised manuscript is substantially improved. I have a few lingering technical comments and questions, but overall I think the paper is now suitable for publication.
Reviewer #3 (Minor Comments):
Figure 1K: not entirely clear what principal components these are, i.e., why are there multiple traces per component? I guess this is because the authors have taken the principal components of the rows of the X matrix (i.e., each of which contains the PSTH of a single neuron for all components). This is a slightly odd choice: I would have thought one would take PCA of the PSTHs, which is closer in spirit to the dPCA and other methods discussed here, so that each component is just a lengthT vector (i.e., where T is the number of time bins in the PSTH). I guess this is ok as is, but you should add something to the text or the caption to clarify what these are (e.g., each principal component is a PSTH across all conditions). This seems a little bit confusing, however, because the dPCA components you'll show later aren't the same size. (Or am I misinterpreting the figure, and what you're showing is instead the projection from all conditions on each PC?)
Indeed, the two components from Figure 1e explain only 23% of the total variance of the population firing rates and the two components from Figure 1h explain only 22% (see Methods). Consequently, a naive observer would not be able to infer from the components what the original neural activities looked like.
This seems a little bit uncharitable to the Mante et al. paper. After all, the regression model used by those authors did indeed have an "untuned component" for each neuron (referred to as Β0), they simply didn't do anything with these components when it came time to construct a lowd projection that captured information about the stimulus and decision variables. But one could certainly have performed a dimensionality reduction of the Β_0's if one were interested in capturing information about time. So, in my view, while this passage is technically correct, I would encourage the authors to rephrase it to be slightly less disparaging to Mante et al. They simply weren't interested in the coding of time, so the fact that they don't capture those components is a choice about what variables to include more than a fundamental limitation of the method.
"where averaging takes place over all irrelevant parameters." A bit unclear – could use an extra sentence unpacking what this means.
"The overall variance explained by the dPCA components (Figure 3c, red line) is very close to the overall variance explained by the PCA components (black line).": Why does PCA only get 80% of the variance? Is this because you've determined that the last 20% belongs to the noise component?
Figure 3 caption: "Thick black lines show time intervals during which the respective task parameters can be reliably extracted from singletrial activity". Does this mean a pseudotrial with 832 neurons? Start of this section mentions the # of neurons but I think it's important to say it here in the Figure caption what this means. (Presumably, if you have enough neurons recorded then singletrial decoding becomes perfect?)
https://doi.org/10.7554/eLife.10989.021Author response
Essential revisions:
As you will see from the individual reports below, there were rather divergent views on your manuscript. While we are not asking for additional analysis, we are asking for editorial revisions to make the method and its novelty more apparent.
We thank all the reviewers and the editors for their careful reading of our manuscript. The comments and suggestions were very helpful and have guided our efforts to clarify the manuscript and to improve the presentation. We provide a detailed rebuttal below.
Large portions of the text have been rewritten and rearranged, for that reason we found it difficult to provide a version of the manuscript with all changes tracked. The parts that are completely rewritten are (i) the presentation of dPCA method in the Results, (ii) the Methods. Please see the new manuscript file.
We are asking for a major rewrite that 1) defines the method better 2) clarifies the difference and similarities with earlier dPCA approaches, 3) the interpretation of the results. Being a methods paper we feel that the presentation is particularly important.
We rearranged, clarified and streamlined the presentation throughout the paper. To address point (1), we expanded and completely rewrote the presentation of the dPCA method in the Results section and also changed the accompanying figure. Furthermore, we merged the Supplementary Information with the Methods section, which now contains all the mathematical details. To address point (2), we now address these differences explicitly in both Introduction and Discussion. More importantly, we included a new section in the Methods that provides a detailed comparison between our older approaches and the current paper. To address point (3), we have clarified the description and layout of the figures taking into account all of the reviewers' comments. See below for more specific replies.
Also publication/public presentation of the code is a necessary requirement for eLife publication (this can be any of a number of public and maintained sites).
We fully support this policy. Indeed, we already published our code on github prior to the original submission. This was stated at the end of our Introduction.
“The dPCA code is available at http://github.com/wielandbrendel/dPCA for Matlab and Python.”
and we now reiterate this information in the Methods section.
We are treating this as a Tools and Resources paper, as it is really a data analysis Methods paper, and should have been submitted under the TR heading.
We were not aware that eLife has a Tools and Resources section. We agree that this is where our manuscript belongs the best.
Reviewer #1:
This is a partial review only. I did not manage to understand all the pieces of this manuscript, and feel that without a few answers from the Authors, and a more streamlined presentation, I will struggle to provide a complete review. So I will list what I did and did not understand so far.
We thank the Reviewer for a thorough reading of our paper and for a lot of detailed comments. Based on the issues raised by the Reviewer, we have made a number of major changes in how we present the method and structure the text, see below.
This paper is about an algorithm to analyze neural activity in response to experimental manipulations of various factors such as time, stimulus property, and behavioral response. There is a new technique proposed and discussed in the context of previous methods. The results of the different methods do not look all that different.
We believe that this “lack of difference" is a misunderstanding that was caused by a lack of clarity from our side. Indeed, dPCA should mainly be compared with the methods presented in our Figure 1, i.e. with the methods that are actually being used in neuroscience to analyze and visualize neurophysiological recordings. The difference between dPCA and all these methods is obvious.
In contrast, the “previous methods" that the Reviewer mentions here probably refers to our original Figure 12, which compared our own preliminary approach dPCA2011 (Brendel et al. 2011) to a variety of ad hoc demixing methods (such as MANOVAbased demixing", “naive demixing", etc.). However, none of these methods actually exist in the literature; rather, they were being presented here as alternative ways of doing something similar to dPCA with different, if sometimes subtle, tradeoffs.
Our preliminary approach, dPCA2011, was published as a conference proceedings paper, and is now substantially reformulated and improved (we extended our description in the Discussion of what are the differences and included a new section in the Methods); we do not claim that the difference between dPCA2011 and dPCA2015 is enormous, but we do argue that the improvement is quite noticeable and that the method, in its updated form and with an application to a variety of neural datasets, deserves a detailed journallength treatment. Apart from changes in the method, we now provide an indepth analysis of several experimental datasets, make a comparison between them, and clarify the relationship between dPCA and related methods such as LDA or the method of Mante et al. 2013. (Please see our largely updated presentation of LDA/dPCA in the Results and also the updated Discussion and Methods.)
All other approaches are ad hoc demixing methods; none of them has ever been used or even suggested for the analysis of neural data. We now see that what used to be our Figure 12 only caused confusion, so we removed all mentions of MANOVAbased demixing",”naive demixing", etc. as these methods do not exist in the literature. See below.
The main issue I have with the paper is that the method is described 3 times in 3 separate sections (Results, Methods, supplement), with increasing level of detail, but seemingly inconsistently in concept and notation. It is therefore very hard to make sense of it all.
The reviewer has a point. We undertook the following measures in order to address this issue:
1. We have merged the Supplementary Information with the Materials & Methods section.
2. We have largely expanded the presentation of dPCA in the Results, providing more geometrical intuitions, so that the main concepts are clearer.
3. We have reformatted our whole manuscript in LATEX which allowed us to use a consistent notation throughout the text.
The changes are too numerous to copy here entirely.
At the end of the day it seems that the novel idea advocated here is rather simple: find a set of projections of the raw data which explain most of the variance in the mean response (mean over all trials and factors while holding one factor fixed). This in turn seems to be simply solved, for each factor separately, by an existing algorithm known as "reduced rank regression".
That is correct. In fact, we are quite happy that the method has such a simple formulation and solution; we consider it a strength and not a weakness. Note that most of the related linear dimensionality reduction methods such as PCA, LDA, CCA, etc. can be formulated and solved just as easily.
Complicating the description, the authors seem to have at least 3 algorithms which they call dPCA. First, one they proposed in 2011 under that exact title. Then, a second, apparently the new one, which is based on a cost function that was not entirely clear in the first reading in the Results section, but became clear only after reading the supplement. There is a third "approximate" version suggested, which concatenates the eigenvectors of the covariance for each mean response. (Is that what is called Naive demixing (Figure 12)?) They seem to argue at one point that this approximate method gives similar results to the superior new technique.
It is true that the method published in 2011 in the NIPS conference proceedings has the same name; we decided that in the long run it will be less confusing if we keep using the name “dPCA" instead of inventing some other name. The core idea of the dPCA framework is to find linear projections of the full data that capture most of the variance and are demixed; dPCA2011 and dPCA2015 are two algorithmic approaches to solve this same task. DPCA2015 supersedes the approach of 2011.
Note that this is not an uncommon situation; e.g. independent component analysis (ICA) is a name that encompasses a variety of related methods, and the same is true for sparse PCA or factor analysis (FA). In both cases there are various suggested implementations, but all of them are united by their goal.
We did not intend to advocate what we called “naive demixing" as a “third version". Rather, we intended to briefly describe this procedure in order to provide additional intuitions about dPCA and to explain a common misperception that we have encountered when talking to experimentalists. Since this is a minor technical point, we have downgraded the discussion of this “version" by eliminating it from the comparison figure, and by discussing it only briefly in a separate section in the methods (see section “Comparison of dPCA to PCA in each marginalization").
The methods seem to have all in common in their goal of explaining the covariance of the mean responses (mean across the factors, and at times, mean across trials) with a few dimensions. There is much talk about an additive model similar to what is used in conventional MANOVA, where the total data is explained as a sum of means across various combinations of factors. Though I don't see that this is necessary to motivate the dimensionality reduction proposed here, it does lead to comparison with yet one more technique. With the current presentation, I find it hard to keep it all straight.
We agree with the reviewer that the multitude of comparisons with other “methods", most of which do not even exist, was, with hindsight, more confusing than helpful. Essentially, all these comparisons were driven by conversations with e.g. someone claiming that what we are doing has been done by and is not different from e.g. MANOVA, so we felt we need to explain the difference. We have downgraded most of these discussions to just highlight the technical differences without giving each “ad hoc" method the prominence it had in the original submission. To straighten the reference to MANOVA, we have rewritten both the Discussion and Methods section about LDA/MANOVA.
In Figure 12 the various methods are then finally compared to each other, but as stated below, the performance metric presented there left me confused, just when I thought I knew what is going on. Importantly, the traces presented there don't seem all that different to each other. So then, what is the main new contribution of this paper?
We see the referee's point and we agree that Figure 12 was quite unfortunate. First, we reemphasize that two of the “methods" presented in Figure 12 were just ad hoc methods invented by us to contrast against dPCA (e.g. naive demixing or MANOVAbased demixing). Since this is indeed confusing, we eliminated these methods from the plot, leaving the only two published methods, dPCA 2011 and targeted dimensionality reduction by Mante et al., 2013 (Figure 14 in the revised version). Second, we notice that there are differences between the remaining methods, even though they may appear subtle. dPCA2011 fails to find one component; the method of Mante et al. does not produce any conditionindependent components. The differences are not enormous, but noticeable and consistent.
It's especially strong where the encoding subspaces are not orthogonal. Third, we note that applying these methods to data is also not necessarily the best way of showing how they differ – too much depends on the specifics of the data set. To address these problems, we completely rewrote our explanation of dPCA and added additional geometric intutions that show how dPCA differs from PCA and LDA. We hope that these changes clarify the contributions of dPCA2015: a method that is far simpler and more direct than any of the existing (or ad hoc) methods, and that achieves overall better demixing, although, as the reviewer noticed, the differences on real data are not necessarily huge.
Concerning the performance metric, we compare all methods using linear projections of the full data and the variance is the variance of this projection.
Reviewer #2:
I am not an experimentalist, so this review will have nothing to say about the details of the experimental design, etc. I was not able to view the video(s) with my QuickTime viewer, so I also have nothing to say about the video(s).
This manuscript gives a discursive presentation of how a new version of remixed principal component analysis (dPCA2015) may be used to analyze multivariate neural data. The presentation is, I believe, complete enough that a user could reconstruct, from zero, the analysis method and the particulars of data preprocessing, etc. It is wellwritten and logical. The method itself is a nice compromise between a principal component approach to data analysis and a MANOVAlike approach. In particular, components are easily visualizable and the manuscript figures do a nice job of showing, at a glance, the results from fairly sophisticated experiments.
Other than a number of small textual changes that I would suggest, and one reference, I think this manuscript is in publishable form.
We thank the reviewer for these comments.
Reviewer #3:
This paper describes a statistical method, demixedPCA, for finding lowdimensional "interpretable" structure in trial based neural recordings. In many cases, neurons exhibit "mixed selectivity", meaning that they exhibit tuning to multiple experimental variables (e.g., the visual stimulus and the upcoming decision). dPCA differs from PCA in that it tries to find a lowdimensional projection that separates ('demixes') the tuning to different kinds of variables. The paper applies dPCA to four different datasets from monkey PFC and rat OFC, and shows that it recovers structure consistent with previously reported findings using these datasets.
The paper is interesting and this technique is likely to be of considerable interest and importance to the field, particularly given recent interest in mixed selectivities in different brain areas.
We thank the Reviewer for these comments.
However, I have some concerns about novelty and about the intended contribution of this paper. A similar demixing approach has been described previously in Machens 2010 and Brendel 2011 (which was a NIPS paper, so perhaps shouldn't be counted against the novelty of this one since it's only a conference proceedings paper). But it would be helpful to describe a little bit more clearly what the intended contribution. Is this just the journallength writeup of the NIPS paper? How does the method described differ from that in the 2010 Frontiers paper?
We understand that this history of publications makes it look as if we are trying to publish the same method over and over again. That is not the case. Rather, we see the current paper as the culmination of our efforts in developing demixing dimensionality reduction. It is similar in spirit to the previous papers, but technically very different. Briefly, here is our reasoning: Machens et al. 2010 (JNeurosci) introduced a “difference of covariances" method that can work with only two parameters, and has therefore rather limited applicability. Machens 2010 (Frontiers) papers offered one way to generalize that method to more parameters, but only in some specific cases; and it did not properly capture interaction terms. Indeed, this latter problem made its application to real data rather unattractive (unpublished observations). Finally, the Brendel et al. 2011 NIPS paper suggested a fully general way of splitting the data matrix and the covariance matrix into additive parts and then offered a probabilistic model to find the demixing projections. The latter paper also introduced the name “demixed principal component analysis".
The NIPS paper was a purely technical paper, and, back then, for us, the end of method development. At this point, we knew that dPCA worked very nicely on the Romo data, which is only briefly discussed within the NIPS paper. However, for a journallength writeup, we intended to go beyond the Romo data to demonstrate the applicability of dPCA2011 on a wider range of data sets. Alas, when we applied dPCA2011 to other datasets, we saw that the demixing was not as good as in the Romo data. Further investigation of this problem showed that the orthogonality constraint of dPCA2011, i.e., our emphasis on finding demixed components that are orthogonal to each other, posed a problem in some of the cases. Hence, we needed a more flexible approach, which caused us to revisit the method and design dPCA2015.
This history is probably not very interesting for the general reader. However, we agree with the reviewer that it is important to explain to readers how the methods differ. We have now included a more explicit paragraph into the Introduction, extended a corresponding subsection in the Discussion, and wrote a new section in the Methods that explicitly contrasts the older methods with dPCA2015.
It would also be nice to spell out a little bit more clearly what scientific insights the method is likely to provide (or indeed, provides for the datasets being analyzed here). It seemed that for each dataset, the paper mostly says that dPCA confirmed the findings in the original paper. But if that's the case, why do we need dPCA? What can we learn by analyzing data with this method that we didn't already have access to using the analysis methods from those papers?
The key problem that someone who has recorded 100s^{1}000s of neurons in a behavioral task faces is how to make sense of the incredible diversity of responses. The classical solutions are summarized in our Figure 1, and they essentially focus on extracting features of the data. Through clever combination of these classical techniques, researchers have always been able to extract important, or even key features in their data. The reviewer is correct in noting that dPCA is not guaranteed to discover anything beyond that, and, in the data sets we analyzed, we mostly confirmed the findings of the original studies (see below for some differences). So why dPCA?
There are several answers to this question. First, the classical techniques are actually quite laborious, and they require a lot of intuitions and work on the side of the researchers. Which part of the data to focus on first? Which timeperiods or which parameters? dPCA is supposed to radically simplify this process by allowing researchers to boil down complex data sets and represent them in a few components, or, in essence, print out pretty much everything that's happening on a single page. First and foremost, dPCA is therefore a quick visualization method for the data, something that can serve as an entry point to analyzing certain aspects of the data in more depth and detail.
Second, dPCA is designed to not `lose' any important part of the data, where importance is measured in terms of neural variability, i.e., changes in _ring rates. Our emphasis on (approximately) lossless dimensionality reduction or compression highlights several features of the data that have mostly been disregarded in the past (although they may be somewhat familiar at least to the people who have access to the raw data): these include the large percentage of neural variance falling into the condition independent components, the lack of separate cell classes, the precise dynamics of the embedding of the task parameters in the neural population response etc, see the section “Universal features of the PFC population activity". Of course, the question then is, why should we care about these features? That brings us to the next point:
Third, dPCA is designed to facilitate the comparison of population activity across data sets, which is one of the main emphasis of this paper. Since there is no unique way of applying the classical methods (Figure 1; statistical significance, regression, etc.) to neural population data, different studies usually rely on different combination of methods. In turn, it becomes essentially hard to impossible for a reader to appreciate the similarities and/or differences of monkey PFC data in different conditions, especially if the studies were carried out in different labs. dPCA provides a simple, standardized way of visualizing the data, which in turn does allow one to compare neural activities across data sets. In turn, if certain dominating aspects reoccur, such as the conditionindependent components, or the dynamics of the representation of information, then maybe these aspects deserve special attention, because they tell us something about what these areas are doing.
I have two other highlevel comments:
1) I think the authors don't do enough to describe how they decided which components to include when setting up the model. Presumably this is a choice made before running the method, i.e., how to map the elements of the experiment onto discrete conditions. How did the authors solve this problem, and how should potential users go about solving it when applying the method to new datasets. For example, why don't we have "reward" components for any of the datasets considered here? How did you decide which interaction terms to include in the first place? In the olfactory data, did you include different components for each mixture or just one for each mixing component? What are the keys to setting up the analysis? Are there any exploratory data analysis techniques that the authors used to make these decisions?
This is a very important issue, especially with respect to the goals of dPCA as explained above, and we updated both our Discussion and the methods (sections on balanced and unbalanced data and missing data) to cover it in more depth. To answer the questions above:
Why are there no “reward" components? In the monkey datasets we analyzed only correct trials because we lacked data on error trials since the monkeys often did not make enough mistakes. Accordingly, reward category is not very meaningful. In the olfactory datasets, we analyzed both correct and incorrect trials. Here, however, the presence/absence of reward is fully determined by stimulus and decision, so the “reward" label is superfluous given stimulus and decision labels. We note, however, that the activity related to reward shows up in the “interaction" terms, precisely because reward is based on the interaction of stimulus and decision.
How did you decide which interaction terms to include? Since we had stimulus and decision labels in all datasets, there really is only one interaction term. This interaction term was always included. (Time plays a special role in neurophysiological data and is treated differently as a parameter, as explained in the dPCA section in the Results and Methods.)
In the olfactory data, did you include different components for each mixture or just one for each mixing component? In the Kepecs et al. olfactory dataset, different odour mixtures were simply treated as different stimuli.
What are the keys to setting up the analysis? Are there any exploratory data analysis techniques that the authors used to make these decisions? The question of what labels to include is indeed very meaningful. We can provide some guidelines, but not a precise stepbystep recipe, as this depends too much on the scientific question addressed by an experiment. For instance, one could e.g. have some extra label available (imagine that for each stimulus and decision there were several possible values of reaction time, or movement type, etc.). Should they be included in the dPCA analysis as another parameter? That depends on the question, and there are some tradeoffs involved. As the number of task parameters included increases, so does the number of parameters (decoders and encoders) in dPCA, especially since there will be more and more interaction terms. Hence, the statistical power decreases and so does the usefulness of dPCA. Here is our updated Discussion:
“Second, the number of neurons needs to be sufficiently high in order to obtain reliable estimates of the demixed components. In our datasets, we found that at least _100 neurons were needed to achieve satisfactory demixing. The number is likely to be higher if more than three task parameters are to be demixed, as the number of interaction terms grows exponentially with the number of parameters. This tradeoff between model complexity and demixing feasibility should be kept in mind when deciding how many parameters to put into the dPCA procedure. In cases when there are many task parameters of interest, dPCA is likely to be less useful than the more standard parametric singleunit approaches (such as linear regression). As a trivial example, imagine that only N = 1 neuron has been recorded; it might have strong and significant tuning to various parameters of interest, but there is no way to demix (or decode) these parameters from the recorded population."
2) For a methods paper like this, providing code should be a mandatory requirement for publication. Please say something about where to find an implementation, so that would be users can try it out.
We published our code on github prior to the original submission, which is stated in the end of our Introduction. We now included a reminder of this in the Methods.
“The dPCA code is available at http://github.com/wielandbrendel/dPCA for Matlab and Python.”
[Editors' note: further revisions were requested prior to acceptance, as described below.]
The manuscript has been greatly improved but there are some smaller remaining issues outlined below.
Reviewer #1 (General assessment and major comments (Required)):
This manuscript is like night and day compared to the previous version. I barely recognize the material any more. It is so much clearer that it makes me much more confident that this is all sound and well. I can hardly believe that these were the same authors.
We are very happy to hear that and thank the reviewer for the valuable comments.
Reviewer #1 (Minor Comments):
A few places where I got stuck were for example:
Equation 4 in subsection “Core dPCA: loss function and algorithm”, why can one separate the quadratic term like this? Also not clear why stefs 13 minimize this cost function, but I trust that all this is in the cited paper, though I could only retrieve the Izenman 1975 paper, which looked like yet more work to answer my questions.
We updated this section to make it as selfcontained as possible and to avoid sending readers to the original papers from the 1970s. Reducedrank regression is quite obscure in our field and we have therefore tried to clarify all the steps as much as possible. Please let us know if something remains unclear here.
The updated paragraphs now read:
“We note that the loss function L_{Ø} is of the general form X_{Ø} – AX^{2}, with A = FD. For an arbitrary N x N matrix A, minimization of the loss function amounts to a classical regression problem with the wellknown ordinary least squares (OLS) solution, A_{OLS} =X_{Ø}X^{T} (XX^{T)1}. In our case, A = FD is an N x N matrix of rank q, which we will make explicit by writing A_{q}. The dPCA loss function therefore amounts to a linear regression problem with an additional rank constraint on the matrix of regression coefficients. This problem is known as reducedrank regression (RRR) (Izenman, 1975; Reinsel and Velu, 1998; Izenman, 2008) and can be solved via the singular value decomposition.
To see this, we write X_{Ø} –A_{q}X = (X_{Ø} –A_{OLS}X) + (A_{OLS}X – A_{q}X). The first term, X_{Ø} –A_{OLS}X, consists of the regression residuals that cannot be accounted for by any linear transformation of X. It is straightforward to verify that these regression residuals, X_{Ø} –A_{OLS}X, are orthogonal to X (Hastie et al., 2009, Section 3.2) and hence also orthogonal to (A_{OLS} – A_{q})X. This orthogonality allows us to split the loss function into two terms,
 X_{Ø} –A_{q}X ^{2} =  X_{Ø} –A_{OLS}X^{2}+ A_{OLS}X – A_{q}X^{2}
where the first term captures the (unavoidable) error of the least squares fit while the second term describes the additional loss suffered through the rank constraint. Since the first term does not depend on Aq, the problem reduces to minimizing the second term.
To minimize the second term, we note that the best rankq approximation to A_{OLS}X
is given by its first q principal components (EckartYoungMirsky theorem). Accordingly, if we write U_{q} for the matrix of the q leading principal directions (left singular vectors) u_{i} of A_{OLS}X, then the best approximation is given by U_{q}U^{T}_{q}A_{OLS}X and hence A_{q} =U_{q}U^{T}_{q}A_{OLS}.
To summarize, the reducedrank regression problem posed above can be solved in a threestep procedure: […]”
Similarly, second equation in subsection “Unbalanced data, why can one separate the square into two terms? and why can one replace X_{noise} by C^{1/2}_{noise}? and why is X_{PSTH}=X?
We have expanded the explanation of this bit as follows:
“In the balanced case, the dPCA loss function L_{Ø} can be rewritten as the sum of two terms with one term depending on the PSTHs and another term depending on the trialtotrial variations,
L_{Ø} = X_{Ø} – FDX^{2} = X_{Ø} – FD (X – X_{noise})^{2} + FDX_{noise}^{2};
where we used the fact that X_{Ø} and X – X_{noise} are orthogonal to X_{noise} (see Appendix A). We now define X_{PSTH} = X – X_{noise} which is simply a matrix of the same size as X with the activity of each trial replaced by the corresponding PSTH. In addition, we observe that the squared norm of any centered data matrix Y with n data points can be written in terms of its covariance matrix C_{Y} = YY^{T}/n, namely Y^{2} = tr[YY^{T}] = n tr[C_{Y}] = n tr[C^{1/2} _{Y}C^{1/2} _{Y}] = n C^{1/2} _{Y} ^{2}, and so
L_{Ø} = X_{Ø} – FDX_{PSTH}^{2} + K SQT FCD^{½}_{noise}^{2}
The first term consists of K replicated copies: X_{PSTH} contains K replicated copies of X (which we defined above as the matrix of PSTHs) and X_{Ø} contains K replicated copies of X_{Ø} (which we take to be a marginalization of X, with X = Ʃ_{Ø} X_{Ø}). We can eliminate the replications and drop the factor K to obtain
L_{Ø} = X_{Ø} – FDX ^{2} + SQT FDC^{1}^{/2}_{noise}^{2}”
In general I have to say that, while the issue of balancing the data is intuitively clear, the corresponding math seems cumbersome, but again, this may be my lack of time.
The balancing math can indeed appear somewhat confusing, however we should point out that this whole part is not essential for the understanding of our method. We included the full treatment of the balancing issue because it can be important in practical applications when the conditions are very unbalanced. Otherwise, we can expect the direct formulation of dPCA without any rebalancing modifications with X _{Ø} –FDX^{2} loss function to work fine, and the generalization to the loss function with PSTH matrices X _{Ø} – FDX^{2} for the sequentially recorded datasets is intuitive.
Reviewer #2 (General assessment and major comments (Required)): The revised manuscript is acceptable for publication. Reviewer #2 (Minor Comments): No author was designated with author association superscript 6 – Harvard University.
Author designation with superscript 2 is not provided.
 There's probably an error in the numbering of author associations.
We fixed the affiliation list.
Reviewer #3 (General assessment and major comments (Required)):
I thank the author for the detailed reply to the original review comments, and the revised manuscript is substantially improved. I have a few lingering technical comments and questions, but overall I think the paper is now suitable for publication.
Reviewer #3 (Minor Comments):
Figure 1K: not entirely clear what principal components these are, i.e., why are there multiple traces per component? I guess this is because the authors have taken the principal components of the rows of the X matrix (i.e., each of which contains the PSTH of a single neuron for all components). This is a slightly odd choice: I would have thought one would take PCA of the PSTHs, which is closer in spirit to the dPCA and other methods discussed here, so that each component is just a lengthT vector (i.e., where T is the number of time bins in the PSTH). I guess this is ok as is, but you should add something to the text or the caption to clarify what these are (e.g., each principal component is a PSTH across all conditions). This seems a little bit confusing, however, because the dPCA components you'll show later aren't the same size. (Or am I misinterpreting the figure, and what you're showing is instead the projection from all conditions on each PC?)
There is some confusion here. As it might partially be due to the terminology (“principal axes" vs “principal components"), we should clarify it here. The PSTH matrix X has N rows (N is the number of neurons) and SQT columns (number of data points in the PSTHs in all conditions, S is the number of stimuli, D the number of decisions, T the number of timepoints). We consider this as SQT points in the Ndimensional space and perform PCA on these data. Covariance matrix is N x N, each eigenvector (a direction in the Ndimensional space) defines a principal axis, and the projection onto this axis we call “principal component" and it has SQT points. These projections are shown on Figure 1K. We clarified this in the legend and also in the Methods section “Implementation of classical approaches".
(Perhaps by “PCA of the PSTHs" the reviewer means reshaping these data as N SQ different PSTHs of length T? Then each PC would indeed be of length T; but we want to work in the Ndimensional (and not NSQdimensional) space of neurons because we want PCA/dPCA projections to be interpreted as linear readouts from the neural population.)
We furthermore emphasize that in the dPCA treatment we always work in the same Ndimensional space. In all subsequent figures (e.g. Figure 3), we always display all conditions in each panel, so we emphasize that the size of the PCA/dPCA components does not change throughout the manuscript. Indeed, although the conditionindependent component may sometimes appear to consist of only one line, for instance, we are plotting one line for each condition, and they are just on top of each other (as they should be when the components are properly demixed).
Indeed, the two components from Figure 1e explain only 23% of the total variance of the population firing rates and the two components from Figure 1h explain only 22% (see Methods). Consequently, a naive observer would not be able to infer from the components what the original neural activities looked like. This seems a little bit uncharitable to the Mante et al. paper. After all, the regression model used by those authors did indeed have an "untuned component" for each neuron (referred to as Β0), they simply didn't do anything with these components when it came time to construct a lowd projection that captured information about the stimulus and decision variables. But one could certainly have performed a dimensionality reduction of the Β_0's if one were interested in capturing information about time. So, in my view, while this passage is technically correct, I would encourage the authors to rephrase it to be slightly less disparaging to Mante et al.
They simply weren't interested in the coding of time, so the fact that they don't capture those components is a choice about what variables to include more than a fundamental limitation of the method.
This is a fair point. Strictly speaking, Mante et al. define their targeted dimensionality reduction (TDR) procedure (in the supplementary materials to their paper) by using only linear terms in the regression, and they disregard the β0term. However, the reviewer is correct that one could treat _0 in the same way as other regression terms, and we thank the reviewer for pointing this out. More generally, as with any supervised or regressionbased method, there are many possible extensions of TDR one could envision (e.g. including quadratic terms into the regression, using several timepoints instead of choosing only one timepoint, etc.). However, our point here was to contrast the tradeo_ between supervised and unsupervised methods, and so we chose to present TDR exactly as it was originally introduced.
To address the reviewer's comments, we have now updated our “Comparison of dPCA with targeted dimensionality reduction" section in the Methods to be clear about the possibility of using β0, and updated our Figure 14 to show these conditionindependent components produced by TDR. We do not think the introductory Results section is the right place to elaborate on possible extensions of TDR, but we have slightly reworded the section to be more careful about our comparison of the supervised and unsupervised methods, and we now write:
“While such supervised approaches can be extended in various ways to produce more components and capture more variance, a more direct way to avoid this loss of information is to resort to unsupervised methods such as principal component analysis (PCA).”
To emphasize the importance of having more than one component per task parameter, we have now prepared another Video and included one additional paragraph in the Results section describing our first dataset:
“The first stimulus component (#5) looks similar to the stimulus components that we obtained with standard regressionbased methods (Figure 1e, h) but now we have further components as well. Together they show how stimulus representation evolves in time. In particular, plotting the first two stimulus components against each other (see Video 1) illustrates how stimulus representation rotates in the neural space during the delay period so that the encoding subspaces at F1 and F2 periods are not the same (but far from orthogonal either).”
"where averaging takes place over all irrelevant parameters." A bit unclear – could use an extra sentence unpacking what this means.
We slightly changed the formulation of this bit; what is meant here is illustrated in the next sentence using the toy example:
“First, we require that the compression and decompression steps reconstruct not the neural activity directly, but the neural activity averaged over trials and over some of the task parameters. In the toy example, the reconstruction target is the matrix of stimulus averages, Xs, which has the same size as X, but in which every data point is replaced by the average neural activity for the corresponding stimulus, as shown in Figure 2h.”
This averaging is also explained more formally below and is illustrated on Figure 2g.
"The overall variance explained by the dPCA components (Figure 3c, red line) is very close to the overall variance explained by the PCA components (black line)." Why does PCA only get 80% of the variance? Is this because you've determined that the last 20% belongs to the noise component?
The location of the legend was suboptimal on this figure panel and made it look as if “PCA" label refers to the horizontal dashed line at 80%. Instead, “PCA" label refers to the black line of cumulative explained variance that can be hardly seen because of the red dPCA line. So the correct interpretation is that PCA explained variance grows with the number of components and reaches 80% at 15 components. With more components it would of course exceed 80%. At the same time, the dashed horizontal line shows our estimate of the amount of “signal variance", i.e. we estimate that 20% of the variance is due to the noise in our PSTH estimates. This computation is explained in the Methods, but is not essential for anything else.
We have changed the legend location to remove the ambiguity.
Figure 3 caption: "Thick black lines show time intervals during which the respective task parameters can be reliably extracted from singletrial activity". Does this mean a pseudotrial with 832 neurons? Start of this section mentions the # of neurons but I think it's important to say it here in the Figure caption what this means. (Presumably, if you have enough neurons recorded then singletrial decoding becomes perfect?)
Yes, the decoding was done using pseudotrials with 832 neurons. The exact number does not seem to be important in the _gure caption: we simply used all available neurons. But the reviewer is right in that it is important to point out that the decoding was done using pseudotrials (and hence ignores noise correlations). We clarified the figure caption as follows:
“Thick black lines show time intervals during which the respective task parameters can be reliably extracted from singletrial activity (using pseudotrials with all recorded neurons)”
https://doi.org/10.7554/eLife.10989.022Article and author information
Author details
Funding
Fundação Bial (Fellowship, 389/14)
 Dmitry Kobak
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
This work was supported by the Bial Foundation, grant #389/14. We thank Matthew Kauffman, Valerio Mante, and Emmanuel Procyk for helpful input along the way, and Maneesh Sahani for a valuable discussion on the connection between dPCA and LDA. We furthermore thank Nuno Calaim for help with creating animations.
Reviewing Editor
 Mark CW van Rossum, Reviewing Editor, University of Edinburgh, United Kingdom
Publication history
 Received: August 19, 2015
 Accepted: April 7, 2016
 Accepted Manuscript published: April 12, 2016 (version 1)
 Version of Record published: May 31, 2016 (version 2)
Copyright
© 2016, Kobak et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 9,985
 Page views

 2,792
 Downloads

 34
 Citations
Article citation count generated by polling the highest count across the following sources: Scopus, Crossref, PubMed Central.