# Demixed principal component analysis of neural population data

- Dmitry Kobak
- Wieland Brendel
- Christos Constantinidis
- Claudia E Feierstein
- Adam Kepecs
- Zachary F Mainen
- Xue-Lian Qi
- Ranulfo Romo
- Naoshige Uchida
- Christian K Machens

Champalimaud Centre for the Unknown, Portugal; École Normale Supérieure, France; Centre for Integrative Neuroscience, University of Tübingen, Germany; Wake Forest University School of Medicine, United States; Cold Spring Harbor Laboratory, United States; Universidad Nacional Autónoma de México, Mexico; El Colegio Nacional, Mexico; Harvard University, United States

Published April 12, 2016

Cite as eLife 2016;5:e10989

Figure 1. Existing approaches to population analysis, illustrated with recordings from monkey PFC during a somatosensory working memory task (Romo et al., 1999). (**a**) Cartoon of the paradigm, adapted from Romo and Salinas (2003). Legend shows 12 experimental conditions. (**b**) Average per-condition firing rates (PSTHs) for four exemplary neurons out of $N=832$. Colors refer to the stimulus frequency F1 and line styles (dashed/solid) refer to the decision, see legend in (**a**). (**c**) Fraction of cells significantly ($p<0.05$, two-way ANOVA) tuned to stimulus and decision at each time point. (**d**) Left: Distribution of stimulus tuning effect sizes across the neural population during the F1 period (black arrow in **c**). Significantly tuned neurons are shown in dark gray. Right: Same for decision during the F2 period (gray arrow in **c**). (**e**) The average of zero-centered PSTHs over all significantly tuned neurons (for neurons with negative effect size, the sign of the PSTH was flipped). Arrows mark the time points that were used to select the significant cells. (**f**) Fraction of cells significantly ($p<0.05$, linear regression) tuned to stimulus and decision at each time point. (**g**) Distribution of regression coefficients of neural firing rates against stimulus (during the F1 period) and decision (during the F2 period). (**h**) Stimulus and decision components produced by the method of Mante et al. (2013). Briefly, neural PSTHs are weighted by the regression coefficients. (**i**) Fraction of variance captured by the first 20 principal components. (**j**) Distributions of the weights used to produce the first six principal components (weights are elements of the eigenvectors of the $N\times N$ covariance matrix). (**k**) First six principal components (projections of the full data onto the eigenvector directions).

**DOI:** http://dx.doi.org/10.7554/eLife.10989.003
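The PCA summary in panels (i–k) boils down to an eigendecomposition of the neuron-by-neuron covariance matrix of the PSTHs. A minimal sketch with synthetic numbers standing in for the recorded data (only the array shapes mirror the setup described above; everything else is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the recordings: N neurons x (12 conditions x T time points)
N, n_conditions, T = 50, 12, 100
X = rng.standard_normal((N, n_conditions * T))

# Center each neuron's PSTH before computing the covariance
X = X - X.mean(axis=1, keepdims=True)

# Eigendecomposition of the N x N covariance matrix (panel j: eigenvector weights)
cov = X @ X.T / X.shape[1]
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]            # sort by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Panel (i): fraction of variance captured by each component
explained = eigvals / eigvals.sum()

# Panel (k): principal components = projections onto the eigenvector directions
components = eigvecs.T @ X
```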

Figure 2. Linear dimensionality reduction. (**a**) Linear discriminant analysis maps the firing rates of individual neurons onto a latent component that allows us to decode a task parameter of interest. Shades of gray inside each neuron show the proportion of variance due to the various task parameters (e.g. stimulus, decision, and time), illustrating mixed selectivity. In contrast, the LDA component is maximally demixed. (**b**) At any moment in time, the population firing rate of $N$ neurons is represented by a point in $N$-dimensional space; here $N=2$. Each trial is represented by a trajectory in this space. Colors indicate different stimuli and dot sizes represent time. The LDA component for stimulus is given by the projection onto the LDA axis (black line); projections of all points are shown along this line. All three stimuli are clearly separated, but their geometrical relation to each other is lost. (**c**) Principal component analysis linearly maps the firing rates into a few principal components such that a second linear transformation can reconstruct the original firing rates. (**d**) The same set of points as in (**b**) projected onto the first PCA axis. However, the stimuli are no longer separated. Rather, the points along the PCA axis have complex dependencies on stimulus and time (mixed selectivity). The PCA axis minimizes the distances between the original points and their projections. (**e**) Demixed principal component analysis also compresses and decompresses the firing rates through two linear transformations. However, here the transformations are found by both minimizing the reconstruction error and enforcing a demixing constraint on the latent variables. (**f**) The same set of points as in (**b**) projected onto the first dPCA decoder axis. The three stimuli are clearly separated (as in LDA), but some information about the relative distances between classes is preserved as well (as in PCA). (**g**) The same data as in (**b**) linearly decomposed into the time effect, the stimulus effect, and the noise. (**h**) The dPCA projection from (**f**) has to be mapped onto a different axis, given by the dPCA encoder, in order to reconstruct the stimulus class means (large colored circles). The decoder and encoder axes together minimize the reconstruction error between the original data and the stimulus class means.

**DOI:** http://dx.doi.org/10.7554/eLife.10989.004
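The compress-then-decompress idea of panel (e) can be sketched as a reduced-rank regression: find a decoder $D$ and encoder $F$ minimizing $\|X_\phi - FDX\|^2$, where $X_\phi$ is one marginalization of the data. The sketch below is unregularized and runs on random matrices (the paper's actual implementation differs in details such as regularization; all names here are illustrative):

```python
import numpy as np

def dpca_axes(X, X_phi, n_comp):
    """Decoder D and encoder F minimizing ||X_phi - F @ D @ X||^2
    via reduced-rank regression (no regularization in this sketch)."""
    # Full-rank least-squares solution: A = X_phi X^+
    A = X_phi @ np.linalg.pinv(X)
    # The best rank-q factorization of the fit A @ X comes from its
    # top-q left singular vectors U; then F = U and D = U^T A
    U, _, _ = np.linalg.svd(A @ X, full_matrices=False)
    U = U[:, :n_comp]
    return U.T @ A, U        # decoder (q x N), encoder (N x q)

rng = np.random.default_rng(1)
N, K = 20, 300
X = rng.standard_normal((N, K))        # full data: neurons x samples
X_phi = rng.standard_normal((N, K))    # one marginalization (e.g. stimulus)

D, F = dpca_axes(X, X_phi, n_comp=3)
approx = F @ (D @ X)                   # rank-3 reconstruction of X_phi
```

Because $D$ and $F$ are optimized jointly rather than constrained to be transposes of each other, the decoder and encoder axes can differ, which is exactly what panels (f) and (h) illustrate.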

Figure 3. Demixed PCA applied to recordings from monkey PFC during a somatosensory working memory task (Romo et al., 1999). (**a**) Cartoon of the paradigm, adapted from Romo and Salinas (2003). (**b**) Demixed principal components. Top row: first three condition-independent components; second row: first three stimulus components; third row: first three decision components; last row: first stimulus/decision interaction component. In each subplot, the full data are projected onto the respective dPCA decoder axis, so that there are 12 lines corresponding to 12 conditions (see legend). Thick black lines show time intervals during which the respective task parameters can be reliably extracted from single-trial activity (using pseudotrials with all recorded neurons), see Materials and methods. Note that the vertical scale differs across rows. The ordinal number of each component is shown in a circle; explained variances are shown as percentages. (**c**) Cumulative variance explained by PCA (black) and dPCA (red). Demixed PCA explains almost the same amount of variance as standard PCA. The dashed line shows an estimate of the fraction of 'signal variance' in the data; the remaining variance is due to noise in the PSTH estimates (see Materials and methods). (**d**) Variance of the individual demixed principal components. Each bar shows the proportion of total variance and is composed of four stacked bars of different colors: gray for condition-independent variance, blue for stimulus variance, red for decision variance, and purple for variance due to stimulus-decision interactions. Each bar appears to be single-colored, which signifies nearly perfect demixing. The pie chart shows how the total signal variance is split among parameters. (**e**) The upper-right triangle shows dot products between all pairs of the first 15 demixed principal axes. Stars mark the pairs that are significantly and robustly non-orthogonal (see Materials and methods). The bottom-left triangle shows correlations between all pairs of the first 15 demixed principal components. Most of the correlations are close to zero.

**DOI:** http://dx.doi.org/10.7554/eLife.10989.005

Figure 4. Demixed PCA applied to recordings from monkey PFC during a visuospatial working memory task (Qi et al., 2011). Same format as Figure 3. (**a**) Cartoon of the paradigm, adapted from Romo and Salinas (2003). (**b**) Demixed principal components. In each subplot there are ten lines corresponding to ten conditions (see legend). Color corresponds to the position of the last shown stimulus (first stimulus for $t<2$ s, second stimulus for $t>2$ s). In non-match conditions (dashed lines) the color changes at $t=2$ s. Solid lines correspond to match conditions and do not change color. (**c**) Cumulative variance explained by PCA and dPCA components. The dashed line marks the fraction of signal variance. (**d**) Explained variance of the individual demixed principal components. The pie chart shows how the total signal variance is split between parameters. (**e**) The upper-right triangle shows dot products between all pairs of the first 15 demixed principal axes; the bottom-left triangle shows correlations between all pairs of the first 15 demixed principal components.

**DOI:** http://dx.doi.org/10.7554/eLife.10989.007

Figure 5. Demixed PCA applied to recordings from rat OFC during an olfactory discrimination task (Feierstein et al., 2006). Same format as Figure 3. (**a**) Cartoon of the paradigm, adapted from Wang et al. (2013). (**b**) Each subplot shows one demixed principal component. In each subplot there are four lines corresponding to four conditions (see legend). Two of these four conditions were rewarded and are shown by thick lines. (**c**) Cumulative variance explained by PCA and dPCA components. (**d**) Explained variance of the individual demixed principal components. The pie chart shows how the total signal variance is split between parameters. (**e**) The upper-right triangle shows dot products between all pairs of the first 15 demixed principal axes; the bottom-left triangle shows correlations between all pairs of the first 15 demixed principal components.

**DOI:** http://dx.doi.org/10.7554/eLife.10989.008

Figure 6. Demixed PCA applied to recordings from rat OFC during an olfactory categorization task (Kepecs et al., 2008). Same format as Figure 3. (**a**) Cartoon of the paradigm, adapted from Wang et al. (2013). (**b**) Each subplot shows one demixed principal component. In each subplot there are ten lines corresponding to ten conditions (see legend). Six of these ten conditions were rewarded and are shown with thick lines; note that the pure left (red) and pure right (blue) odors did not have error trials. The inset shows the mean rate of the second interaction component during the anticipation period. (**c**) Cumulative variance explained by PCA and dPCA components. (**d**) Explained variance of the individual demixed principal components. The pie chart shows how the total signal variance is split between parameters. (**e**) The upper-right triangle shows dot products between all pairs of the first 15 demixed principal axes; the bottom-left triangle shows correlations between all pairs of the first 15 demixed principal components.

**DOI:** http://dx.doi.org/10.7554/eLife.10989.009

Figure 7. Encoder weights for the leading dPCA components across the neural population. (**a**) Distributions of encoder weights for the 15 leading dPCA components across the neural population, in each of the four datasets. Each subplot shows 15 probability density curves, one curve per component (bin width 0.005). The distribution corresponding to the first component is highlighted in red. (**b**) Clustering of neurons by density peaks (Rodriguez and Laio, 2014). For each dataset we took the first 15 dPCA components and then ran the clustering algorithm in the 15-dimensional space of encoding weights. The clustering algorithm works in two steps: first, it computes a local density for each point (i.e. for each neuron), using a Gaussian kernel with $\sigma^2=0.01$. Second, for each point it finds the minimal distance to a point with higher local density (if there is no such point, the distance to the furthest point is taken). Each subplot shows local density on the horizontal axis plotted against distance to the nearest point of higher density on the vertical axis; each dot corresponds to one of the $N$ neurons. Cluster centers are characterized by high local density and a large distance to a point of even higher density; they should appear as outliers in the upper-right corner of the plot (see Rodriguez and Laio, 2014, for details). In each case there is only one such outlier (bigger dot), indicating a single cluster.

**DOI:** http://dx.doi.org/10.7554/eLife.10989.010
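The two-step computation described in (b) is straightforward to reproduce. This sketch uses random weights in place of the actual encoder weights; the kernel width matches the $\sigma^2=0.01$ stated above:

```python
import numpy as np

def density_peaks(points, sigma2=0.01):
    """Rodriguez & Laio (2014) quantities: local density rho for each point,
    and distance delta to the nearest point of higher density."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    rho = np.exp(-d2 / sigma2).sum(axis=1)     # Step 1: Gaussian-kernel density
    dist = np.sqrt(d2)
    delta = np.empty(len(points))
    for i in range(len(points)):               # Step 2: distance to a denser point
        higher = rho > rho[i]
        # densest point overall: fall back to the distance to the furthest point
        delta[i] = dist[i, higher].min() if higher.any() else dist[i].max()
    return rho, delta

rng = np.random.default_rng(2)
weights = rng.normal(0.0, 0.1, size=(200, 15))  # stand-in for the N x 15 encoder weights
rho, delta = density_peaks(weights)
# Cluster centres = points with both high rho and high delta (upper-right outliers)
```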

Figure 9. Balanced and unbalanced data. (**a**) In this toy example there are two task parameters (factors), with two possible values each. Parameter A (left) is represented by the size of the dot, parameter B (middle) is represented by the color of the dot, and the noise is Gaussian with zero mean and zero correlation (right). The interaction term is zero. (**b**) Balanced case with $N=10$ data points in each of the four parameter combinations. The overall correlation is zero. (**c**) Unbalanced case with $N=10$ for two parameter combinations and $N=100$ for the other two. The overall correlation is 0.8.

**DOI:** http://dx.doi.org/10.7554/eLife.10989.014
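The contrast between (b) and (c) is easy to reproduce numerically: two independent factors that shift two noisy responses produce zero correlation under a balanced design, but over-sampling two of the four combinations induces a strong spurious correlation. The group sizes below mirror the caption; all other specifics (shift size, noise scale) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

def sample(counts):
    """Draw (x, y) points: factor A shifts x, factor B shifts y, unit noise.
    `counts` maps each (A, B) value combination to its number of data points."""
    xs, ys = [], []
    for (a, b), n in counts.items():
        xs.append(a + rng.standard_normal(n))
        ys.append(b + rng.standard_normal(n))
    return np.concatenate(xs), np.concatenate(ys)

# Balanced design (panel b): every combination equally sampled -> correlation ~ 0
x, y = sample({(0, 0): 10, (0, 3): 10, (3, 0): 10, (3, 3): 10})
r_balanced = np.corrcoef(x, y)[0, 1]

# Unbalanced design (panel c): two combinations over-sampled -> strong correlation
x, y = sample({(0, 0): 100, (0, 3): 10, (3, 0): 10, (3, 3): 100})
r_unbalanced = np.corrcoef(x, y)[0, 1]
```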

Figure 10. Re-stretching (time warping) procedure. We defined several alignment events (such as odor poke in, odor poke out, etc.) and for each trial found the times $t_i$ of these events. After aligning all trials on $t_1=0$ (left), we computed the median times $T_i$ for all other events. Then for each trial we re-stretched the firing rate on each interval $[t_i, t_{i+1}]$ to align it with $[T_i, T_{i+1}]$ (right). After such re-stretching, all events are aligned and the trials corresponding to one condition can be averaged.

**DOI:** http://dx.doi.org/10.7554/eLife.10989.015
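The per-trial re-stretching amounts to a piecewise-linear change of the time axis followed by resampling on a common grid. A minimal sketch (the event times and the firing-rate trace below are made up for illustration):

```python
import numpy as np

def restretch(rate, t, event_times, median_times):
    """Piecewise-linearly warp one trial's firing-rate trace so that its
    event times t_i land on the across-trial median times T_i."""
    # Map each original time point onto the median-aligned time axis ...
    warped_t = np.interp(t, event_times, median_times)
    # ... then resample the rate on a common (median-aligned) time grid
    common_t = np.linspace(median_times[0], median_times[-1], len(t))
    return np.interp(common_t, warped_t, rate)

# One synthetic trial: events at 0, 1.2, 2.5 s; across-trial medians at 0, 1.0, 2.0 s
t = np.linspace(0.0, 2.5, 250)
rate = np.exp(-(t - 1.2) ** 2 / 0.01)    # a firing burst at the second event
warped = restretch(rate, t, [0.0, 1.2, 2.5], [0.0, 1.0, 2.0])
# After warping, the burst sits at ~1.0 s on the common grid
```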

Figure 11. Cross-validation errors as a function of the regularization parameter $\lambda$. Each subplot corresponds to one dataset and shows the mean (solid lines) and min/max (boundaries of shaded regions) of the relative cross-validation errors over ten repetitions. Different colors refer to different marginalizations (see legend); the minima are marked by dots. Black shows all marginalizations together, i.e. $L_{\mathrm{CV}}(\lambda)$.

**DOI:** http://dx.doi.org/10.7554/eLife.10989.016

Figure 12. Cross-validated time-dependent classification accuracies of linear classifiers (black lines) given by the first three stimulus/decision/interaction dPCs (columns) in each dataset (rows). Shaded gray regions show the distribution of classification accuracies expected by chance, as estimated by 100 iterations of the shuffling procedure.

**DOI:** http://dx.doi.org/10.7554/eLife.10989.017

Figure 13. Toy example illustrating the pseudo-inverse intuition. (**a**) Firing rate trajectories of two neurons for three different stimuli. (**b**) Same data with dPCA decoding and encoding axes. In this case the encoding axes are approximately equivalent to the principal component axes.

**DOI:** http://dx.doi.org/10.7554/eLife.10989.018

Figure 14. Some demixed components as given by three different demixing methods (rows) in various datasets and marginalizations (columns). Empty subplots mean that the corresponding method did not find any components. All projections were $z$-scored to put them on the same scale. Barplots on the right show the fraction of variance in each marginalization for each component (stimulus in blue, decision in red, interaction in purple, condition-independent in gray): $\|\mathbf{d}\tilde{\mathbf{X}}_{\phi}\|^2 / \|\mathbf{d}\tilde{\mathbf{X}}\|^2$. Barplots consisting of a single color correspond to perfect demixing.

**DOI:** http://dx.doi.org/10.7554/eLife.10989.019
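The barplot quantity $\|\mathbf{d}\tilde{\mathbf{X}}_{\phi}\|^2 / \|\mathbf{d}\tilde{\mathbf{X}}\|^2$ is a one-liner once the decoder axis $\mathbf{d}$ and the marginalized data are available. A sketch on random matrices (in the real analysis the marginalizations sum to the centered data, as imposed here, but are otherwise structured rather than random):

```python
import numpy as np

rng = np.random.default_rng(4)
N, K = 30, 400

# Synthetic marginalizations; their sum plays the role of the centered data X-tilde
marginalizations = {m: rng.standard_normal((N, K))
                    for m in ("stimulus", "decision", "interaction", "time")}
X = sum(marginalizations.values())

d = rng.standard_normal(N)               # one decoder axis (row vector)
total = np.sum((d @ X) ** 2)             # ||d X||^2

# Fraction of this component's variance falling into each marginalization
fractions = {m: np.sum((d @ X_phi) ** 2) / total
             for m, X_phi in marginalizations.items()}
```

With real marginalized averages, which are uncorrelated across marginalizations in the balanced case, these fractions sum to one; for the random matrices above they do so only approximately.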

Figure 15. Fourier-like artifacts in PCA. (Left) In this toy example, single-neuron responses are generated from the same underlying Gaussian but are randomly shifted in time. (Right) First three PCA components of the population data. While the leading component resembles the true signal, higher-order components look like higher Fourier harmonics; they are artifacts of the jitter in time.

**DOI:** http://dx.doi.org/10.7554/eLife.10989.020
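The toy example is easy to reproduce: identical Gaussian bumps with random time jitter, then PCA across 'neurons'. Everything below (bump width, jitter range, counts) is illustrative; the SVD is taken on the uncentered data so that the leading component retains the mean bump, as in the figure:

```python
import numpy as np

rng = np.random.default_rng(5)

# Each 'neuron' emits the same Gaussian bump, randomly jittered in time
t = np.linspace(0.0, 1.0, 200)
shifts = rng.uniform(-0.1, 0.1, size=100)
X = np.stack([np.exp(-(t - 0.5 - s) ** 2 / (2 * 0.1 ** 2)) for s in shifts])

# SVD of the data matrix: right singular vectors are the temporal components
U, s, Vt = np.linalg.svd(X, full_matrices=False)
pcs = Vt[:3]   # leading component ~ the mean bump; the following ones resemble
               # its derivatives, i.e. harmonic-like artifacts of the jitter
```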

Table 1. Demixed PCA in comparison with existing methods. **Columns:** 'Signif.' refers to the method of counting significantly tuned cells, as shown in Figure 1c–e. TDR refers to the 'targeted dimensionality reduction' of Mante et al. (2013), shown in Figure 1f–h. LDA stands for linear discriminant analysis, but this column applies to any classification method (e.g. support vector machine, ordinal logistic regression, etc.). All classification methods can be used to summarize population tuning via a time-dependent classification accuracy (e.g. Meyers et al., 2012). PCA stands for principal component analysis, as shown in Figure 1i–k. FA stands for factor analysis, GPFA for Gaussian process factor analysis (Yu et al., 2009), LDS for hidden linear dynamical system (Buesing et al., 2012a; 2012b), and jPCA is the method introduced in Churchland et al. (2012). Some of the existing methods can be extended to become more general, but here we refer to how these methods are actually used in the original research. **Rows:** The first two rows are the two defining goals of dPCA. The following rows highlight notable features of other methods.

**DOI:** http://dx.doi.org/10.7554/eLife.10989.011

| | Signif. | TDR | LDA | PCA | FA | GPFA | jPCA | LDS | **dPCA** |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Takes task parameters into account & provides summary statistics of population tuning | ✓ | ✓ | ✓ | | | | | | ✓ |
| Allows reconstruction of neural firing (captures variance) | | | | ✓ | ✓ | ✓ | | ✓ | ✓ |
| Based on a dynamical model | | | | | | | ✓ | ✓ | |
| Based on a probabilistic model | | | | | ✓ | ✓ | | ✓ | |
| Takes spike trains as input | | | | | | ✓ | | ✓ | |