1. Neuroscience
Download icon

Unsupervised discovery of temporal sequences in high-dimensional datasets, with applications to neuroscience

  1. Emily L Mackevicius
  2. Andrew H Bahle
  3. Alex H Williams
  4. Shijie Gu
  5. Natalia I Denisenko
  6. Mark S Goldman  Is a corresponding author
  7. Michale S Fee  Is a corresponding author
  1. Massachusetts Institute of Technology, United States
  2. Stanford University, United States
  3. ShanghaiTech University, China
  4. University of California, Davis, United States
Tools and Resources
  • Cited 1
  • Views 2,875
  • Annotations
Cite this article as: eLife 2019;8:e38471 doi: 10.7554/eLife.38471

Abstract

Identifying low-dimensional features that describe large-scale neural recordings is a major challenge in neuroscience. Repeated temporal patterns (sequences) are thought to be a salient feature of neural dynamics, but are not succinctly captured by traditional dimensionality reduction techniques. Here, we describe a software toolbox—called seqNMF—with new methods for extracting informative, non-redundant, sequences from high-dimensional neural data, testing the significance of these extracted patterns, and assessing the prevalence of sequential structure in data. We test these methods on simulated data under multiple noise conditions, and on several real neural and behavioral data sets. In hippocampal data, seqNMF identifies neural sequences that match those calculated manually by reference to behavioral events. In songbird data, seqNMF discovers neural sequences in untutored birds that lack stereotyped songs. Thus, by identifying temporal structure directly from neural data, seqNMF enables dissection of complex neural circuits without relying on temporal references from stimuli or behavioral outputs.

https://doi.org/10.7554/eLife.38471.001

Introduction

The ability to detect and analyze temporal sequences embedded in a complex sensory stream is an essential cognitive function, and as such is a necessary capability of neuronal circuits in the brain (Clegg et al., 1998; Janata and Grafton, 2003; Bapi et al., 2005; Hawkins and Ahmad, 2016), as well as artificial intelligence systems (Cui et al., 2016; Sutskever et al., 2014). The detection and characterization of temporal structure in signals is also useful for the analysis of many forms of physical and biological data. In neuroscience, recent advances in technology for electrophysiological and optical measurements of neural activity have enabled the simultaneous recording of hundreds or thousands of neurons (Chen et al., 2013; Kim et al., 2016; Scholvin et al., 2016; Jun et al., 2017), in which neuronal dynamics are often structured in sparse sequences (Hahnloser et al., 2002; Harvey et al., 2012; MacDonald et al., 2011; Okubo et al., 2015; Fujisawa et al., 2008; Pastalkova et al., 2008). Such sequences can be identified by averaging across multiple trials, but only in cases where an animal receives a temporally precise sensory stimulus, or executes a sufficiently stereotyped behavioral task.

Neural sequences have been hypothesized to play crucial roles over a much broader range of natural settings, including during learning, sleep, or diseased states (Mackevicius and Fee, 2018). In these applications, it may not be possible to use external timing references, either because behaviors are not stereotyped or are entirely absent. Thus, sequences must be extracted directly from the neuronal data using unsupervised learning methods. Commonly used methods of this type, such as principal component analysis (PCA) or clustering methods, do not efficiently extract sequences, because they typically only model synchronous patterns of activity, rather than extended spatio-temporal motifs of firing.

Existing approaches that search for repeating neural patterns require computationally intensive or statistically challenging analyses (Brody, 1999; Mokeichev et al., 2007; Quaglio et al., 2018; Brunton et al., 2016). While progress has been made in analyzing non-synchronous sequential patterns using statistical models that capture cross-correlations between pairs of neurons (Russo and Durstewitz, 2017; Gerstein et al., 2012; Schrader et al., 2008; Torre et al., 2016; Grossberger et al., 2018; van der Meij and Voytek, 2018), such methods may not have statistical power to scale to patterns that include many (more than a few dozen) neurons, may require long periods (≥105 timebins) of stationary data, and may have challenges in dealing with (non-sequential) background activity. For a review highlighting features and limitations of these methods see (Quaglio et al., 2018).

Here, we explore a complementary approach, which uses matrix factorization to reconstruct neural dynamics using a small set of exemplar sequences. In particular, we build on convolutional non-negative matrix factorization (convNMF) (Smaragdis, 2004; Smaragdis, 2007) (Figure 1B), which has been previously applied to identify recurring motifs in audio signals such as speech (O’Grady and Pearlmutter, 2006; Smaragdis, 2007; Vaz et al., 2016), as well as neural signals (Peter et al., 2017). ConvNMF identifies exemplar patterns (factors) in conjunction with the times and amplitudes of pattern occurrences. This strategy eliminates the need to average activity aligned to any external behavioral references.

Figure 1 with 1 supplement see all
Convolutional NMF factorization.

(A) NMF (non-negative matrix factorization) approximates a data matrix describing the activity of N neurons at T timepoints as a sum of K rank-one matrices. Each matrix is generated as the outer product of two nonnegative vectors: 𝐰k of length N, which stores a neural ensemble, and 𝐡k of length T, which holds the times at which the neural ensemble is active, and the relative amplitudes of this activity. (B) Convolutional NMF also approximates an N×T data matrix as a sum of K matrices. Each matrix is generated as the convolution of two components: a non-negative matrix 𝐰k of dimension N×L that stores a sequential pattern of the N neurons at L lags, and a vector of temporal loadings, 𝐡k, which holds the times at which each factor pattern is active in the data, and the relative amplitudes of this activity. (C) Three types of inefficiencies present in unregularized convNMF: Type 1, in which two factors are used to reconstruct the same instance of a sequence; Type 2, in which two factors reconstruct a sequence in a piece-wise manner; and Type 3, in which two factors are used to reconstruct different instances of the same sequence. For each case, the factors (𝐖 and 𝐇) are shown, as well as the reconstruction (𝐗~=𝐖𝐇=𝐰1𝐡1+𝐰2𝐡2+).

https://doi.org/10.7554/eLife.38471.002

While convNMF may produce excellent reconstructions of the data, it does not automatically produce the minimal number of factors required. Indeed, if the number of factors in the convNMF model is greater than the true number of sequences, the algorithm returns overly complex and redundant factorizations. Moreover, in these cases, the sequences extracted by convNMF will often be inconsistent across optimization runs from different initial conditions, complicating scientific interpretations of the results (Peter et al., 2017; Wu et al., 2016).

To address these concerns, we developed a toolbox of methods, called seqNMF, which includes two different strategies to resolve the problem of redundant factorizations described above. In addition, the toolbox includes methods for promoting potentially desirable features such as orthogonality or sparsity of the spatial and temporal structure of extracted factors, and methods for analyzing the statistical significance and prevalence of the identified sequential structure. To assess these tools, we characterize their performance on synthetic data under a variety of noise conditions and also show that they are able to find sequences in neural data collected from two different animal species using different behavioral protocols and recording technologies. Applied to extracellular recordings from rat hippocampus, seqNMF identifies neural sequences that were previously found by trial-averaging. Applied to functional calcium imaging data recorded in vocal/motor cortex of untutored songbirds, seqNMF robustly identifies neural sequences active in a biologically atypical and overlapping fashion. This finding highlights the utility of our approach to extract sequences without reference to external landmarks; untutored bird songs are so variable that aligning neural activity to song syllables would be difficult and highly subjective.

Results

Matrix factorization framework for unsupervised discovery of features in neural data

Matrix factorization underlies many well-known unsupervised learning algorithms, including PCA (Pearson, 1901), non-negative matrix factorization (NMF) (Lee and Seung, 1999), dictionary learning, and k-means clustering (see Udell et al., 2016 for a review). We start with a data matrix, 𝐗, containing the activity of N neurons at T timepoints. If the neurons exhibit a single repeated pattern of synchronous activity, the entire data matrix can be reconstructed using a column vector 𝐰 representing the neural pattern, and a row vector 𝐡 representing the times and amplitudes at which that pattern occurs (temporal loadings). In this case, the data matrix 𝐗 is mathematically reconstructed as the outer product of 𝐰 and 𝐡. If multiple component patterns are present in the data, then each pattern can be reconstructed by a separate outer product, where the reconstructions are summed to approximate the entire data matrix (Figure 1A) as follows:

(1) 𝐗nt𝐗~nt=k=1K𝐖nk𝐇kt=(𝐖𝐇)nt

where 𝐗nt is the (nt)th element of matrix 𝐗, that is, the activity of neuron n at time t. Here, in order to store K different patterns, 𝐖 is a N×K matrix containing the K exemplar patterns, and 𝐇 is a K×T matrix containing the K timecourses:

(2) W=[|||w1w2wK|||] H=[h1h2hK]

Given a data matrix with unknown patterns, the goal of matrix factorization is to discover a small set of patterns, 𝐖, and a corresponding set of temporal loading vectors, 𝐇, that approximate the data. In the case that the number of patterns, K, is sufficiently small (less than N and T), this corresponds to a dimensionality reduction, whereby the data is expressed in more compact form. PCA additionally requires that the columns of 𝐖 and the rows of 𝐇 are orthogonal. NMF instead requires that the elements of 𝐖 and 𝐇 are nonnegative. The discovery of unknown factors is often accomplished by minimizing the following cost function, which measures the element-by-element sum of all squared errors between a reconstruction 𝐗~=𝐖𝐇 and the original data matrix 𝐗 using the Frobenius norm, 𝐌F=ij𝐌ij2:

(3) (W,H)=arg minW,HX~XF2

(Note that other loss functions may be substituted if desired, for example to better reflect the noise statistics; see (Udell et al., 2016) for a review). The factors 𝐖* and 𝐇* that minimize this cost function produce an optimal reconstruction 𝐗~*=𝐖*𝐇*. Iterative optimization methods such as gradient descent can be used to search for global minima of the cost function; however, it is often possible for these methods to get caught in local minima. Thus, as described below, it is important to run multiple rounds of optimization to assess the stability/consistency of each model.

While this general strategy works well for extracting synchronous activity, it is unsuitable for discovering temporally extended patterns—first, because each element in a sequence must be represented by a different factor, and second, because NMF assumes that the columns of the data matrix are independent ‘samples’ of the data, so permutations in time have no effect on the factorization of a given data. It is therefore necessary to adopt a different strategy for temporally extended features.

Convolutional matrix factorization

Convolutional nonnegative matrix factorization (convNMF) (Smaragdis, 2004; Smaragdis, 2007) extends NMF to provide a framework for extracting temporal patterns, including sequences, from data. While in classical NMF each factor 𝐖 is represented by a single vector (Figure 1A), the factors 𝐖 in convNMF represent patterns of neural activity over a brief period of time. Each pattern is stored as an N×L matrix, 𝐰𝐤, where each column (indexed by =1 to L) indicates the activity of neurons at different timelags within the pattern (Figure 1B). The times at which this pattern/sequence occurs are encoded in the row vector 𝐡𝟏, as for NMF. The reconstruction is produced by convolving the N×L pattern with the time series 𝐡𝟏 (Figure 1B).

If the data contains multiple patterns, each pattern is captured by a different N×L matrix and a different associated time series vector 𝐡. A collection of K different patterns can be compiled together into an N×K×L array (also known as a tensor), 𝐖 and a corresponding K×T time series matrix 𝐇. Analogous to NMF, convNMF generates a reconstruction of the data as a sum of K convolutions between each neural activity pattern (𝐖), and its corresponding temporal loadings (𝐇):

(4) XntX~nt=k=1K=0L1WnkHk(t)(WH)nt

The tensor/matrix convolution operator (notation summary, Table 1) reduces to matrix multiplication in the L=1 case, which is equivalent to standard NMF. The quality of this reconstruction can be measured using the same cost function shown in Equation 3, and 𝐖 and 𝐇 may be found iteratively using similar multiplicative gradient descent updates to standard NMF (Lee and Seung, 1999; Smaragdis, 2004; Smaragdis, 2007).

Table 1
Notation for convolutional matrix factorization
https://doi.org/10.7554/eLife.38471.004
Shift operator

The operator (H) shifts a matrix 𝐇 in the direction by timebins:

(H)t=H(tl) and likewise (H)t=H(t+)

 where indicates all elements along the respective matrix dimension.

The shift operator inserts zeros when (t)<0 or (t+)>T

Tensor convolution operator

Convolutive matrix factorization reconstructs a data matrix 𝐗

 using a N×K×L tensor 𝐖 and a K×T matrix 𝐇:

𝐗~=𝐖𝐇=𝐖𝐇

Note that each neuron n is reconstructed as the sum of k convolutions:

𝐗~nt=k𝐖nk𝐇k(t-)(𝐖𝐇)nt

Transpose tensor convolution operator

The following quantity is useful in several contexts:

𝐖𝐗=(𝐖)𝐗

Note that each element (𝐖𝐗)kt=l(𝐖k)𝐗(t+)=ln𝐖nk𝐗n(t+) measures

 the overlap (correlation) of factor k with the data at time t

convNMF reconstruction

𝐗𝐗~=k𝐖k𝐇k=𝐖𝐇

Note that NMF is a special case of convNMF, where L=1

L1 entrywise norm excluding diagonal elements

For any K×K matrix 𝐂, 𝐂1,ijkjk𝐂jk
Special matrices

𝟏 is a K×K matrix of ones

𝐈 is the K×K identity matrix

𝐒 is a T×T smoothing matrix: 𝐒ij=1 when |ij|<L and otherwise 𝐒ij=0

While convNMF can perform extremely well at reconstructing sequential structure, it can be challenging to use when the number of sequences in the data is not known (Peter et al., 2017). In this case, a reasonable strategy would be to choose K at least as large as the number of sequences that one might expect in the data. However, if K is greater than the actual number of sequences, convNMF often identifies more significant factors than are minimally required. This is because each sequence in the data may be approximated equally well by a single sequential pattern or by a linear combination of multiple partial patterns. A related problem is that running convNMF from different random initial conditions produces inconsistent results, finding different combinations of partial patterns on each run (Peter et al., 2017). These inconsistency errors fall into three main categories (Figure 1C):

  • Type 1: Two or more factors are used to reconstruct the same instances of a sequence.

  • Type 2: Two or more factors are used to reconstruct temporally different parts of the same sequence, for instance the first half and the second half.

  • Type 3: Duplicate factors are used to reconstruct different instances of the same sequence.

Together, these inconsistency errors manifest as strong correlations between different redundant factors, as seen in the similarity of their temporal loadings (𝐇) and/or their exemplar activity patterns (𝐖).

We next describe two strategies for overcoming the redundancy errors described above. Both strategies build on previous work that reduces correlations between factors in NMF. The first strategy is based on regularization, a common technique in optimization that allows the incorporation of constraints or additional information with the goal of improving generalization performance or simplifying solutions to resolve degeneracies (Hastie et al., 2009). A second strategy directly estimates the number of underlying sequences by minimizing a measure of correlations between factors (stability NMF; Wu et al., 2016).

Optimization penalties to reduce redundant factors

To reduce the occurrence of redundant factors (and inconsistent factorizations) in convNMF, we sought a principled way of penalizing the correlations between factors by introducing a penalty term, , into the convNMF cost function:

(5) (W,H)=arg minW,H(X~XF2+)

Regularization has previously been used in NMF to address the problem of duplicated factors, which, similar to Type 1 errors above, present as correlations between the 𝐇’s (Choi, 2008; Chen and Cichocki, 2004). Such correlations are measured by computing the correlation matrix 𝐇𝐇, which contains the correlations between the temporal loadings of every pair of factors. The regularization may be implemented using the penalty term =λ𝐇𝐇1,ij, where the seminorm 1,ij sums the absolute value of every matrix entry except those along the diagonal (notation summary, Table 1) so that correlations between different factors are penalized, while the correlation of each factor with itself is not. Thus, during the minimization process, similar factors compete, and a larger amplitude factor drives down the temporal loading of a correlated smaller factor. The parameter λ controls the magnitude of the penalty term .

In convNMF, a penalty term based on 𝐇𝐇 yields an effective method to prevent errors of Type 1, because it penalizes the associated zero lag correlations. However, it does not prevent errors of the other types, which exhibit different types of correlations. For example, Type 2 errors result in correlated temporal loadings that have a small temporal offset and thus are not detected by 𝐇𝐇. One simple way to address this problem is to smooth the 𝐇’s in the penalty term with a square window of length 2L-1 using the smoothing matrix 𝐒 (𝐒ij=1 when |i-j|<L and otherwise 𝐒ij=0). The resulting penalty, =λ𝐇𝐒𝐇, allows factors with small temporal offsets to compete, effectively preventing errors of Types 1 and 2.

This penalty does not prevent errors of Type 3, in which redundant factors with highly similar patterns in 𝐖 are used to explain different instances of the same sequence. Such factors have temporal loadings that are segregated in time, and thus have low correlations, to which the cost term 𝐇𝐒𝐇 is insensitive. One way to resolve errors of Type 3 might be to include an additional cost term that penalizes the similarity of the factor patterns in 𝐖. This has the disadvantage of requiring an extra parameter, namely the λ associated with this cost.

Instead we chose an alternative approach to resolve errors of Type 3 that simultaneously detects correlations in 𝐖 and 𝐇 using a single cross-orthogonality cost term. We note that, for Type 3 errors, redundant 𝐖 patterns have a high degree of overlap with the data at the same times, even though their temporal loadings are segregated at different times. To introduce competition between these factors, we first compute, for each pattern in 𝐖, its overlap with the data at time t. This quantity is captured in symbolic form by 𝐖𝐗 (see Table 1). We then compute the pairwise correlation between the temporal loading of each factor and the overlap of every other factor with the data. This cross-orthogonality penalty term, which we refer to as 'x-ortho’, sums up these correlations across all pairs of factors, implemented as follows:

(6) =λ(𝐖𝐗)𝐒𝐇1,ij

When incorporated into the update rules, this causes any factor that has a high overlap with the data to suppress the temporal loadings (𝐇) of any other factors that have high overlap with the data at that time (Further analysis, Appendix 2). Thus, factors compete to explain each feature of the data, favoring solutions that use a minimal set of factors to give a good reconstruction. The resulting global cost function is:

(7) (W,H)=arg minW,H(X~XF2+λ(WX)SH1,ij)

The update rules for 𝐖 and 𝐇 are based on the derivatives of this global cost function, leading to a simple modification of the standard multiplicative update rules used for NMF and convNMF (Lee and Seung, 1999; Smaragdis, 2004; Smaragdis, 2007) (Table 2). Note that the addition of this cross-orthogonality term does not formally constitute regularization, because it also includes a contribution from the data matrix 𝐗, rather than just the model variables 𝐖 and 𝐇. However, at least for the case that the data is well reconstructed by the sum of all factors, the x-ortho penalty can be shown to be approximated by a formal regularization (Appendix 2). This formal regularization contains both a term corresponding to a weighted smoothed orthogonality penalty on 𝐖 and a term corresponding to a weighted smoothed) orthogonality penalty on 𝐇, consistent with the observation that the x-ortho penalty simultaneously punishes factor correlations in 𝐖 and 𝐇.

Table 2
Regularized NMF and convNMF: cost functions and algorithms
https://doi.org/10.7554/eLife.38471.005
NMF
=12||𝐗~-𝐗||22+
𝐗~=𝐖𝐇
WW×XHX~H+W
HH×WXWX~+H
convNMF
=12||𝐗~-𝐗||22+
𝐗~=𝐖𝐇
WW×X HX~ H+W
HH×WXWX~+H
L1 regularization for 𝐇 ( L1 for 𝐖 is analogous)
=λ||𝐇||1𝐖=0
𝐇=λ𝟏
Orthogonality cost for 𝐇
=λ2||𝐇𝐇||1,ij𝐖=0
𝐇=λ(𝟏-𝐈)𝐇
Smoothed orthogonality cost for 𝐇 (favors ‘events-based’)
=λ2||𝐇𝐒𝐇||1,ij𝐖=0
𝐇=λ(𝟏-𝐈)𝐇𝐒
Smoothed orthogonality cost for 𝐖 (favors ‘parts-based’)
=λ2||𝐖flat𝐖flat||1,ij
 where
(𝐖flat)nk=𝐖nk
𝐖=λ𝐖flat(𝟏-𝐈)
𝐇=0
Smoothed cross-factor orthogonality (x-ortho penalty)
=λ||(𝐖𝐗)𝐒𝐇||1,ij𝐖=λ𝐗𝐒𝐇(𝟏-𝐈)
𝐇=λ(𝟏-𝐈)𝐖𝐗𝐒

There is an interesting relation between our method for penalizing correlations and other methods for constraining optimization, namely sparsity. Because of the non-negativity constraint imposed in NMF, correlations can also be reduced by increasing the sparsity of the representation. Previous efforts have been made to minimize redundant factors using sparsity constraints; however, this approach may require penalties on both 𝐖 and 𝐇, necessitating the selection of two hyper-parameters (λw and λh) (Peter et al., 2017). Since the use of multiple penalty terms increases the complexity of model fitting and selection of parameters, one goal of our work was to design a simple, single penalty function that could regularize both 𝐖 and 𝐇 simultaneously. The x-ortho penalty described above serves this purpose (Equation 6). As we will describe below, the application of sparsity penalties can be very useful for shaping the factors produced by convNMF, and our code includes options for applying sparsity penalties on both 𝐖 and 𝐇.

Extracting ground-truth sequences with the x-ortho penalty when the number of sequences is not known

We next examined the effect of the x-ortho penalty on factorizations of sequences in simulated data, with a focus on convergence, consistency of factorizations, the ability of the algorithm to discover the correct number of sequences in the data, and robustness to noise (Figure 2A). We first assessed the model’s ability to extract three ground-truth sequences lasting 30 timesteps and containing 10 neurons in the absence of noise (Figure 2A). The resulting data matrix had a total duration of 15,000 timesteps and contained on average 60±6 instances of each sequence. Neural activation events were represented with an exponential kernel to simulate calcium imaging data. The algorithm was run with the x-ortho penalty for 1000 iterations andit reliably converged to a root-mean-squared-error (RMSE) close to zero (Figure 2B). RMSE reached a level within 10% of the asymptotic value in approximately 100 iterations.

Figure 2 with 2 supplements see all
Effect of the x-ortho penalty on the factorization of sequences.

(A) A simulated dataset with three sequences. Also shown is a factorization with x-ortho penalty (K=20, L=50, λ=0.003). Each significant factor is shown in a different color. At left are the exemplar patterns (𝐖) and on top are the timecourses (𝐇). (B) Reconstruction error as a function of iteration number. Factorizations were run on a simulated dataset with three sequences and 15,000 timebins ( 60 instances of each sequence). Twenty independent runs are shown. Here, the algorithm converges to within 10% of the asymptotic error value within 100 iterations. (C) The x-ortho penalty produces more consistent factorizations than unregularized convNMF across 400 independent fits (K=20, L=50, λ=0.003). (D) The number of statistically significant factors (Figure 2—figure supplement 1) vs. the number of ground-truth sequences for factorizations with and without the x-ortho penalty. Shown for each condition is a vertical histogram representing the number of significant factors over 20 runs (K=20, L=50, λ=0.003). (E) Factorization with x-ortho penalty of two simulated neural sequences with shared neurons that participate at the same latency. (F) Same as E but for two simulated neural sequences with shared neurons that participate at different latencies.

https://doi.org/10.7554/eLife.38471.006

While similar RMSE values were achieved using convNMF with and without the x-ortho penalty; the addition of this penalty allowed three ground-truth sequences to be robustly extracted into three separate factors (𝐰1, 𝐰2, and 𝐰3 in Figure 2A) so long as K was chosen to be larger than the true number of sequences. In contrast, convNMF with no penalty converged to inconsistent factorizations from different random initializations when K was chosen to be too large, due to the ambiguities described in Figure 1. We quantified the consistency of each model (see Materials and methods), and found that factorizations using the x-ortho penalty demonstrated near perfect consistency across different optimization runs (Figure 2C).

We next evaluated the performance of convNMF with and without the x-ortho penalty on datasets with a larger number of sequences. In particular, we set out to observe the effect of the x-ortho penalty on the number of statistically significant factors extracted. Statistical significance was determined based on the overlap of each extracted factor with held out data (see Materials and methods and code package). With the penalty term, the number of significant sequences closely matched the number of ground-truth sequences. Without the penalty, all 20 extracted sequences were significant by our test (Figure 2D).

We next considered how the x-ortho penalty performs on sequences with more complex structure than the sparse uniform sequences of activity ediscussed above. We further examined the case in which a population of neurons is active in multiple different sequences. Such neurons that are shared across different sequences have been observed in several neuronal datasets (Okubo et al., 2015; Pastalkova et al., 2008; Harvey et al., 2012). For one test, we constructed two sequences in which shared neurons were active at a common pattern of latencies in both sequences; in another test, shared neurons were active in a different pattern of latencies in each sequence. In both tests, factorizations using the x-ortho penalty achieved near-perfect reconstruction error, and consistency was similar to the case with no shared neurons (Figure 2E,F). We also examined other types of complex structure and have found that the x-ortho penalty performs well in data with large gaps between activity or with large overlaps of activity between neurons in the sequence. This approach also worked well in cases in which the duration of the activity or the interval between the activity of neurons varied across the sequence (Figure 3—figure supplement 3).

Figure 3 with 3 supplements see all
Testing factorization performance on sequences contaminated with noise.

Performance of the x-ortho penalty was tested under four different noise conditions: (A) probabilistic participation, (B) additive noise, (C) temporal jitter, and (D) sequence warping. For each noise type, we show: (top) examples of synthetic data at three different noise levels; (middle) similarity of extracted factors to ground-truth patterns across a range of noise levels (20 fits for each level); and (bottom) examples of extracted factors 𝐖’s for one of the ground-truth patterns. Examples are shown at the same three noise levels illustrated in the top row. In these examples, the algorithm was run with K=20, L=50 and λ = 2λ0 (via the procedure described in Figure 4). For C, jitter displacements were draw from a discrete guassian distribution with the standard deviation in timesteps shown above For D, timewarp conditions 1–10 indicate: 0, 66, 133, 200, 266, 333, 400, 466, 533 and 600 max % stretching respectively. For results at different values of λ, see Figure 3—figure supplement 1.

https://doi.org/10.7554/eLife.38471.009
Figure 4 with 5 supplements see all
Procedure for choosing λ for a new dataset based on finding a balance between reconstruction cost and x-ortho cost.

(A) Simulated data containing three sequences in the presence of participation noise (50% participation probability). This noise condition is used for the tests in (B–F). (B) Normalized reconstruction cost (||𝐗~-𝐗||F2) and cross-orthogonality cost (||(𝐖𝐗)𝐒𝐇||1,ij) as a function of λ for 20 fits of these data. The cross-over point λ0 is marked with a black circle. Note that in this plot the reconstruction cost and cross-orthogonality cost are normalized to vary between 0 and 1. (C) The number of significant factors obtained as a function of λ; 20 fits, mean plotted in orange. Red arrow at left indicates the correct number of sequences (three). (D) Fraction of fits returning the correct number of significant factors as a function of λ. (E) Similarity of extracted factors to ground-truth sequences as a function of λ. (F) Composite performance, as the product of the curves in (D) and (E) (smoothed using a three sample boxcar, plotted in orange with a circle marking the peak). Shaded region indicates the range of λ that works well (± half height of composite performance). (G–L) same as (A–F) but for simulated data containing three noiseless sequences. (M) Summary plot showing the range of values of λ (vertical bars), relative to the cross-over point λ0, that work well for each noise condition (± half height points of composite performance). Circles indicate the value of λ at the peak of the smoothed composite performance. For each noise type, results for all noise levels from Figure 3 are shown (increasing color saturation at high noise levels; Green, participation: 90, 80, 70, 60, 50, 40, 30, and 20%; Orange, additive noise 0.5, 1, 2, 2.5, 3, 3.5, and 4%; Purple, jitter: SD of the distribution of random jitter: 5, 10, 15, 20, 25, 30, 35, 40, and 45 timesteps; Grey, timewarp: 66, 133, 200, 266, 333, 400, 466, 533, 600, and 666 max % stretching. Asterisk (*) indicates the noise type and level used in panels (A–F). Gray band indicates a range between 2λ0 and 5λ0, a range that tended to perform well across the different noise conditions. In real data, it may be useful to explore a wider range of λ.

https://doi.org/10.7554/eLife.38471.013

Robustness to noisy data

The cross-orthogonality penalty performed well in the presence of types of noise commonly found in neural data. In particular, we considered: participation noise, in which individual neurons participate probabilistically in instances of a sequence; additive noise, in which neuronal events occur randomly outside of normal sequence patterns; temporal jitter, in which the timing of individual neurons is shifted relative to their typical time in a sequence; and finally, temporal warping, in which each instance of the sequence occurs at a different randomly selected speed. To test the robustness of the algorithm with the x-ortho penalty to each of these noise conditions, we factorized data containing three neural sequences at a variety of noise levels (Figure 3, top row). The value of λ was chosen using methods described in the next section. Factorizations with the x-ortho penalty proved relatively robust to all four noise types, with a high probability of returning the correct numbers of significant factors (Figure 4—figure supplement 5). Furthermore, under low-noise conditions, the algorithm produced factors that were highly similar to ground-truth, and this similarity declined gracefully at higher noise levels (Figure 3). Visualization of the extracted factors revealed a good qualitative match to ground-truth sequences even in the presence of high noise except for the case of temporal jitter (Figure 3). We also found that the x-ortho penalty allows reliable extraction of sequences in which the duration of each neuron’s activity exhibits substantial random variation across different renditions of the sequence, and in which the temporal jitter of neural activity exhibits systematic variation at different points in the sequences (Figure 3—figure supplement 3).

Finally, we wondered how our approach with the x-ortho penalty performs on datasets with only a small number of instances of each sequence. We generated data containing different numbers of repetitions ranging from 1 to 20, of each underlying ground-truth sequence. For intermediate levels of additive noise, we found that three repetitions of each sequence were sufficient to correctly extract factors with similarity scores close to those obtained with much larger numbers of repetitions (Figure 3—figure supplement 2).

Methods for choosing an appropriate value of λ

The x-ortho penalty performs best when the strength of the regularization term (determined by the hyperparameter λ) is chosen appropriately. For λ too small, the behavior of the algorithm approaches that of convNMF, producing a large number of redundant factors with high x-ortho cost. For λ too large, all but one of the factors are suppressed to zero amplitude, resulting in a factorization with near-zero x-ortho cost, but with large reconstruction error if multiple sequences are present in the data. Between these extremes, there exists a region in which increasing λ produces a rapidly increasing reconstruction error and a rapidly decreasing x-ortho cost. Thus, there is a single point, which we term λ0, at which changes in reconstruction cost and changes in x-ortho cost are balanced (Figure 4A). We hypothesized that the optimal choice of λ (i.e. the one producing the correct number of ground-truth factors) would lie near this point.

To test this intuition, we examined the performance of the x-ortho penalty as a function ofλ in noisy synthetic data consisting of three non-overlapping sequences (Figure 4A). Our analysis revealed that, overall, values of λ between 2λ0 and 5λ0 performed well for these data across all noise types and levels (Figure 4B,C). In general, near-optimal performance was observed over an order of magnitude range of λ (Figure 1). However, there were systematic variations depending on noise type: for additive noise, performance was better when λ was closer to λ0, while with other noise types, performance was better at somewhat higher values of λs (10λ0).

Similar ranges of λ appeared to work for datasets with different numbers of ground-truth sequences—for the datasets used in Figure 2D, a range of λ between 0.001 and 0.01 returned the correct number of sequences at least 90% of the time for datasets containing between 1 and 10 sequences (Figure 2—figure supplement 2). Furthermore, this method for choosing λ also worked on datasets containing sequences with shared neurons (Figure 4—figure supplement 2).

The value of λ may also be determined by cross-validation (see Materials and methods). Indeed, the λ chosen with the heuristic described above coincided with a minimum or distinctive feature in the cross-validated test error for all the cases we examined (Figure 4—figure supplement 3). The seqNMF code package accompanying this paper provides functions to determine λ both by cross-validation or in reference to λ0.

Sparsity constraints to reduce redundant factors

One of the advantages of the x-ortho penalty is that it includes only a single term to penalize correlations between different factors, and thus requires only a single hyperparameter λ. This contrasts with the approach of incorporating a sparsity constraint on 𝐖 and 𝐇 of the form λw𝐖1+λh𝐇1 (Peter et al., 2017). We have found that the performance of the sparsity approach depends on the correct choice of both hyperparameters λw and λh (Figure 4—figure supplement 4). Given the optimal choice of these parameters, the L1 sparsity constraint yields an overall performance approximately as good as the x-ortho penalty (Figure 4—figure supplement 4). However, there are some consistent differences in the performance of the sparsity and x-ortho approaches depending on noise type; an analysis at moderately high noise levels reveals that the x-ortho penalty performs slightly better with warping and participation noise, while the L1 sparsity penalty performs slightly better on data with jitter and additive noise (Figure 4—figure supplement 5). However, given the added complexity of choosing two hyperparameters for L1 sparsity, we prefer the x-ortho approach.

Direct selection of K to reduce redundant factors

An alternative strategy to minimizing redundant factorizations is to estimate the number of underlying sequences and to select the appropriate value of K. An approach for choosing the number of factors in regular NMF is to run the algorithm many times with different initial conditions, at different values of K, and choose the case with the most consistent and uncorrelated factors. This strategy is called stability NMF (Wu et al., 2016) and is similar to other stability-based metrics that have been used in clustering models (von Luxburg, 2010). The stability NMF score, diss, is measured between two factorizations, 𝐅1={𝐖1,𝐇1} and 𝐅2={𝐖2,𝐇2}, run from different initial conditions:

diss(F1,F2)=12K(2Kj=1Kmax1kKCjkk=1Kmax1jKCjk)

where 𝐂 is the cross-correlation matrix between the columns of the matrix 𝐖1 and the the columns of the matrix 𝐖2. Note that diss is low when there is a one-to-one mapping between factors in 𝐅1 and 𝐅2, which tends to occur at the correct K in NMF (Wu et al., 2016; Ubaru et al., 2017). NMF is run many times and the diss metric is calculated for all unique pairs. The best value of K is chosen as that which yields the lowest average diss metric.

To use this approach for convNMF, we needed to slightly modify the stability NMF diss metric. Unlike in NMF, convNMF factors have a temporal degeneracy; that is, one can shift the elements of 𝐡k by one time step while shifting the elements of 𝐰k by one step in the opposite direction with little change to the model reconstruction. Thus, rather than computing correlations from the factor patterns 𝐖 or loadings 𝐇, we computed the diss metric using correlations between factor reconstructions (𝐗~k=𝐰𝐤𝐡𝐤).

𝐂ij=Tr[𝐗~iT𝐗~j]𝐗~iF𝐗~jF

where Tr[] denotes the trace operator, Tr[𝐌]=i𝐌ii. That is, 𝐂ij measures the correlation between the reconstruction of factor i in 𝐅1 and the reconstruction of factor j in 𝐅2. Here, as for stability NMF, the approach is to run convNMF many times with different numbers of factors (K) and choose the K which minimizes the diss metric.

We evaluated the robustness of this approach in synthetic data with the four noise conditions examined earlier. Synthetic data were constructed with three ground-truth sequences and 20 convNMF factorizations were carried out for each K ranging from 1 to 10. For each K the average diss metric was computed over all 20 factorizations. In many cases, the average diss metric exhibited a minimum at the ground-truth K (Figure 5—figure supplement 1). As shown below, this method also appears to be useful for identifying the number of sequences in real neural data.

Not only does the diss metric identify factorizations that are highly similar to the ground truth and have the correct number of underlying factors, it also yields factorizations that minimize reconstruction error in held out data (Figure 5, Figure 5—figure supplement 2), as shown using the same cross-validation procedure described above (Figure 5—figure supplement 2). For simulated datasets with participation noise, additive noise, and temporal jitter, there is a clear minimum in the test error at the K given by diss metric. In other cases, there is a distinguishing feature such as a kink or a plateau in the test error at this K (Figure 5—figure supplement 2).

Figure 5 with 2 supplements see all
Direct selection of K using the diss metric, a measure of the dissimilarity between different factorizations.

Panels show the distribution of diss as a function of K for several different noise conditions. Lower values of diss indicate greater consistency or stability of the factorizations, an indication of low factor redundancy. (A) probabilistic participation (60%), (B) additive noise (2.5% bins), (C) timing jitter (SD = 20 bins), and (D) sequence warping (max warping = 266%). For each noise type, we show: (top) examples of synthetic data; (bottom) the diss metric for 20 fits of convNMF for K from 1 to 10; the black line shows the median of the diss metric and the dotted red line shows the true number of factors.

https://doi.org/10.7554/eLife.38471.019

Strategies for dealing with ambiguous sequence structure

Some sequences can be interpreted in multiple ways, and these interpretations will correspond to different factorizations. A common example arises when neurons are shared between different sequences, as is shown in Figure 6A and B. In this case, there are two ensembles of neurons (1 and 2), that participate in two different types of events. In one event type, ensemble one is active alone, while in the other event type, ensemble one is coactive with ensemble 2. There are two different reasonable factorizations of these data. In one factorization, the two different ensembles are separated into two different factors, while in the other factorization the two different event types are separated into two different factors. We refer to these as ’parts-based’ and ’events-based’ respectively. Note that these different factorizations may correspond to different intuitions about underlying mechanisms. ‘Parts-based’ factorizations will be particularly useful for clustering neurons into ensembles, and ‘events-based’ factorizations will be particularly useful for correlating neural events with behavior.

Figure 6 with 1 supplement see all
Using penalties to bias toward events-based and parts-based factorizations.

Datasets that have neurons shared between multiple sequences can be factorized in different ways, emphasizing discrete temporal events (events-based) or component neuronal ensembles (parts-based), by using orthogonality penalties on 𝐇 or 𝐖 to penalize factor correlations (see Table 2). (Left) A dataset with two different ensembles of neurons that participate in two different types of events, with (A) events-based factorization obtained using an orthogonality penalty on 𝐇 and (B) parts-based factorizations obtained using an orthogonality penalty on 𝐖. (Right) A dataset with six different ensembles of neurons that participate in three different types of events, with (C) events-based and (D) parts-based factorizations obtained as in (A) and (B).

https://doi.org/10.7554/eLife.38471.022

Here, we show that the addition of penalties on either 𝐖 or 𝐇 correlations can be used to shape the factorizations of convNMF, with or without the x-ortho penalty, to produce ‘parts-based’ or ‘events-based’ factorization. Without this additional control, factorizations may be either ‘parts-based’, or ‘events-based’ depending on initial conditions and the structure of shared neurons activities. This approach works because, in ‘events-based’ factorization, the 𝐇’s are orthogonal (uncorrelated) while the 𝐖’s have high overlap; conversely, in the ‘parts-based’ factorization, the 𝐖’s are orthogonal while the 𝐇’s are strongly correlated. Note that these correlations in 𝐖 or 𝐇 are unavoidable in the presence of shared neurons and such correlations do not indicate a redundant factorization. Update rules to implement penalties on correlations in 𝐖 or 𝐇 are provided in Table 2 with derivations in Appendix 1. Figure 9—figure supplement 2 shows examples of using these penalties on the songbird dataset described in Figure 9.

L1 regularization is a widely used strategy for achieving sparse model parameters (Zhang et al., 2016), and has been incorporated into convNMF in the past (O’Grady and Pearlmutter, 2006; Ramanarayanan et al., 2013). In some of our datasets, we found it useful to include L1 regularization for sparsity. The multiplicative update rules in the presence of L1 regularization are included in Table 2, and as part of our code package. Sparsity on the matrices 𝐖 and 𝐇 may be particularly useful in cases when sequences are repeated rhythmically (Figure 6—figure supplement 1A). For example, the addition of a sparsity regularizer on the 𝐖 update will bias the 𝐖 exemplars to include only a single repetition of the repeated sequence, while the addition of a sparsity regularizer on 𝐇 will bias the 𝐖 exemplars to include multiple repetitions of the repeated sequence. Like the ambiguities described above, these are both valid interpretations of the data, but each may be more useful in different contexts.

Quantifying the prevalence of sequential structure in a dataset

While sequences may be found in a variety of neural datasets, their importance and prevalence is still a matter of debate and investigation. To address this, we developed a metric to assess how much of the explanatory power of a seqNMF factorization was due to synchronous vs. asynchronous neural firing events. Since convNMF can fit both synchronous and sequential events in a dataset, reconstruction error is not, by itself, diagnostic of the ‘sequenciness’ of neural activity. Our approach is guided by the observation that in a data matrix with only synchronous temporal structure (i.e. patterns of rank 1), the columns can be permuted without sacrificing convNMF reconstruction error. In contrast, permuting the columns eliminates the ability of convNMF to model data that contains sparse temporal sequences (i.e. high rank patterns) but no synchronous structure. We thus compute a ‘sequenciness’ metric, ranging from 0 to 1, that compares the performance of convNMF on column-shuffled versus non-shuffled data matrices (see Materials and methods), and quantify the performance of this metric in simulated datasets containing synchronous and sequential events with varying prevalence (Figure 7C). We found that this metric varies approximately linearly with the degree to which sequences are present in a dataset. Below, we apply this method to real experimental data and obtain high ‘sequenciness’ scores, suggesting that convolutional matrix factorization is a well-suited tool for summarizing neural dynamics in these datasets.

Using seqNMF to assess the prevalence of sequences in noisy data.

(A) Example simulated datasets. Each dataset contains 10 neurons, with varying amounts of additive noise, and varying proportions of synchronous events versus asynchronous sequences. For the purposes of this figure, 'sequence' refers to a sequential pattern with no synchrony between different neurons in the pattern. The duration of each dataset used below is 3000 times, and here 300 timebins are shown. (B) Median percent power explained by convNMF (L = 12; K = 2; λ=0) for each type of dataset (100 examples of each dataset type). Different colors indicate the three different levels of additive noise shown in A. Solid lines and filled circles indicate results on unshuffled datasets. Note that performance is flat for each noise level, regardless of the probability of sequences vs synchronous events. Dotted lines and open circles indicate results on column-shuffled datasets. When no sequences are present, convNMF performs the same on column-shuffled data. However, when sequences are present, convNMF performs worse on column-shuffled data. (C) For datasets with patterns ranging from exclusively synchronous events to exclusively asynchronous sequences, convNMF was used to generate a ‘Sequenciness’ score. Colors correspond to different noise levels shown in A. Asterisks denote cases where the power explained exceeds the Bonferroni-corrected significance threshold generated from column-shuffled datasets. Open circles denote cases that do not achieve significance. Note that this significance test is fairly sensitive, detecting even relatively low presence of sequences, and that the ‘Sequenciness’ score distinguishes between cases where more or less of the dataset consists of sequences.

https://doi.org/10.7554/eLife.38471.024

Application of seqNMF to hippocampal sequences

To test the ability of seqNMF to discover patterns in electrophysiological data, we analyzed multielectrode recordings from rat hippocampus (https://crcns.org/data-sets/hc), which were previously shown to contain sequential patterns of neural firing (Pastalkova et al., 2015). Specifically, rats were trained to alternate between left and right turns in a T-maze to earn a water reward. Between alternations, the rats ran on a running wheel during an imposed delay period lasting either 10 or 20 seconds. By averaging spiking activity during the delay period, the authors reported long temporal sequences of neural activity spanning the delay. In some rats, the same sequence occurred on left and right trials, while in other rats, different sequences were active in the delay period during each trial types.

Without reference to the behavioral landmarks, seqNMF was able to extract sequences in both datasets. In Rat 1, seqNMF extracted a single factor, corresponding to a sequence active throughout the running wheel delay period and immediately after, when the rat ran up the stem of the maze (Figure 8A); for 10 fits of K ranging from 1 to 10, the average diss metric reached a minimum at 1 and with λ=2λ0, most runs using the x-ortho penalty extracted a single significant factor (Figure 8C–E). Factorizations of thes data with one factor captured 40.8% of the power in the dataset on average, and had a ‘sequenciness’ score of 0.49. Some runs using the x-ortho penalty extracted two factors (Figure 8E), splitting the delay period sequence and the maze stem sequence; this is a reasonable interpretation of the data, and likely results from variability in the relative timing of running wheel and maze stem traversal. At somewhat lower values of λ, factorizations more often split these sequences into two factors. At even lower values of λ, factorizations had even more significant factors. Such higher granularity factorizations may correspond to real variants of the sequences, as they generalize to held-out data or may reflect time warping in the data (Figure 5—figure supplement 2J). However, a single sequence may be a better description of the data because the diss metric displayed a clear minimum at K=1 (Figure 8C). In Rat 2, seqNMF typically identified three factors (Figure 8B). The first two correspond to distinct sequences active for the duration of the delay period on alternating left and right trials. A third sequence was active immediately following each of the alternating sequences, corresponding to the time at which the animal exits the wheel and runs up the stem of the maze. For 10 fits of K ranging from 1 to 10, the average diss metric reached a minimum at three and with λ=1.5λ0, most runs with the x-ortho penalty extracted between 2 and 4 factors (Figure 8F–H). Factorizations of these data with three factors captured 52.6% of the power in the dataset on average, and had a pattern ‘sequenciness’ score of 0.85. Taken together, these results suggest that seqNMF can detect multiple neural sequences without the use of behavioral landmarks.

Application of seqNMF to extract hippocampal sequences from two rats.

(A) Firing rates of 110 neurons recorded in the hippocampus of Rat 1 during an alternating left-right task with a delay period (Pastalkova et al., 2015). The single significant extracted x-ortho penalized factor. Both an x-ortho penalized reconstruction of each factor (left) and raw data (right) are shown. Neurons are sorted according to the latency of their peak activation within the factor. The red line shows the onset and offset of the forced delay periods, during which the animal ran on a treadmill. (B) Firing rates of 43 hippocampal neurons recorded in Rat 2 during the same task (Mizuseki et al., 2013). Neurons are sorted according to the latency of their peak activation within each of the three significant extracted sequences. The first two factors correspond to left and right trials, and the third corresponds to running along the stem of the maze. (C) The diss metric as a function of K for Rat 1. Black line represents the median of the black points. Notice the minimum at K = 1. (D) (Left) Reconstruction (red) and correlation (blue) costs as a function of λ for Rat 1. Arrow indicates λ=8×105, used for the x-ortho penalized factorization shown in (A). (E) Histogram of the number of significant factors across 30 runs of x-ortho penalized convNMF. (D) The diss metric as a function of K for Rat 2. Notice the minimum at K = 3. (G–H) Same as in (D–E) but for Rat 2. Arrow indicates λ=8×105, used for the factorization shown in (B).

https://doi.org/10.7554/eLife.38471.025

Application of seqNMF to abnormal sequence development in avian motor cortex

We applied seqNMF methods to analyze functional calcium imaging data recorded in the songbird premotor cortical nucleus HVC during singing. Normal adult birds sing a highly stereotyped song, making it possible to detect sequences by averaging neural activity aligned to the song. Using this approach, it has been shown that HVC neurons generate precisely timed sequences that tile each song syllable (Hahnloser et al., 2002; Picardo et al., 2016; Lynch et al., 2016). Songbirds learn their song by imitation and must hear a tutor to develop normal adult vocalizations. Birds isolated from a tutor sing highly variable and abnormal songs as adults (Fehér et al., 2009). Such ‘isolate’ birds provide an opportunity to study how the absence of normal auditory experience leads to pathological vocal/motor development. However, the high variability of pathological ‘isolate’ song makes it difficult to identify neural sequences using the standard approach of aligning neural activity to vocal output.

Using seqNMF, we were able to identify repeating neural sequences in isolate songbirds (Figure 9A). At the chosen λ (Figure 9B), x-ortho penalized factorizations typically extracted three significant sequences (Figure 9C). Similarly, the diss measure has a local minimum at K=3 (Figure 9—figure supplement 1B). The three-sequence factorization explained 41% of the total power in the dataset, with a sequenciness score of 0.7 andhe extracted sequences included sequences deployed during syllables of abnormally long and variable durations (Figure 9D–F, Figure 9—figure supplement 1A).

Figure 9 with 2 supplements see all
SeqNMF applied to calcium imaging data from a singing isolate bird reveals abnormal sequence deployment.

(A) Functional calcium signals recorded from 75 neurons, unsorted, in a singing isolate bird. (B) Reconstruction and cross-orthogonality cost as a function of λ. The arrow at λ=0.005 indicates the value selected for the rest of the analysis. (C) Number of significant factors for 100 runs with the x-ortho penalty with K=10, λ=0.005. Arrow indicates three is the most common number of significant factors. (D) X-ortho factor exemplars (𝐖’s). Neurons are grouped according to the factor in which they have peak activation, and within each group neurons are sorted by the latency of their peak activation within the factor. (E) The same data shown in (A), after sorting neurons by their latency within each factor as in (D). A spectrogram of the bird’s song is shown at top, with a purple ‘*’ denoting syllable variants correlated with 𝐰2. (F) Same as (E), but showing reconstructed data rather than calcium signals. Shown at top are the temporal loadings (𝐇) of each factor.

https://doi.org/10.7554/eLife.38471.026

In addition, the extracted sequences exhibit properties not observed in normal adult birds. We see an example of two distinct sequences that sometimes, but not always, co-occur (Figure 9). We observe that a shorter sequence (green) occurs alone on some syllable renditions while a second, longer sequence (purple) occurs simultaneously on other syllable renditions. We found that biasing x-ortho penalized convNMF towards ’parts-based’ or ’events-based’ factorizations gives a useful tool to visualize this feature of the data (Figure 9—figure supplement 2). This probabilistic overlap of different sequences is highly atypical in normal adult birds (Hahnloser et al., 2002; Long et al., 2010; Picardo et al., 2016; Lynch et al., 2016) and is associated with abnormal variations in syllable structure—in this case resulting in a longer variant of the syllable when both sequences co-occur. This acoustic variation is a characteristic pathology of isolate song (Fehér et al., 2009).

Thus, even though we observe HVC generating sequences in the absence of a tutor, it appears that these sequences are deployed in a highly abnormal fashion.

Application of seqNMF to a behavioral dataset: song spectrograms

Although we have focused on the application of seqNMF to neural activity data, these methods naturally extend to other types of high-dimensional datasets, including behavioral data with applications to neuroscience. The neural mechanisms underlying song production and learning in songbirds is an area of active research. However, the identification and labeling of song syllables in acoustic recordings is challenging, particularly in young birds in which song syllables are highly variable. Because automatic segmentation and clustering often fail, song syllables are still routinely labelled by hand (Okubo et al., 2015). We tested whether seqNMF, applied to a spectrographic representation of zebra finch vocalizations, is able to extract meaningful features in behavioral data. Using the x-ortho penalty, factorizations correctly identified repeated acoustic patterns in juvenile songs, placing each distinct syllable type into a different factor (Figure 10). The resulting classifications agree with previously published hand-labeled syllable types (Okubo et al., 2015). A similar approach could be applied to other behavioral data, for example movement data or human speech, and could facilitate the study of neural mechanisms underlying even earlier and more variable stages of learning. Indeed, convNMF was originally developed for application to spectrograms (Smaragdis, 2004); notably it has been suggested that auditory cortex may use similar computations to represent and parse natural statistics (Młynarski and McDermott, 2018).

SeqNMF applied to song spectrograms.

(A) Spectrogram of juvenile song, with hand-labeled syllable types (Okubo et al., 2015). (B) Reconstruction cost and x-ortho cost for these data as a function of λ. Arrow denotes λ=0.0003, which was used to run convNMF with the x-ortho penalty (C) 𝐖’s for this song, fit with K=8, L=200ms, λ=0.0003. Note that there are three non-empty factors, corresponding to the three hand-labeled syllables a, b, and c. (D) X-ortho penalized 𝐇’s (for the three non-empty factors) and reconstruction of the song shown in (A) using these factors.

https://doi.org/10.7554/eLife.38471.029

Discussion

As neuroscientists strive to record larger datasets, there is a need for rigorous tools to reveal underlying structure in high-dimensional data (Gao and Ganguli, 2015; Sejnowski et al., 2014; Churchland and Abbott, 2016; Bzdok and Yeo, 2017). In particular, sequential structure is increasingly regarded as a fundamental property of neuronal circuits (Hahnloser et al., 2002; Harvey et al., 2012; Okubo et al., 2015; Pastalkova et al., 2008), but standardized statistical approaches for extracting such structure have not been widely adopted or agreed upon. Extracting sequences is particularly challenging when animal behaviors are variable (e.g. during learning) or absent entirely (e.g. during sleep).

Here, we explored a simple matrix factorization-based approach to identify neural sequences without reference to animal behavior. The convNMF model elegantly captures sequential structure in an unsupervised manner (Smaragdis, 2004; Smaragdis, 2007; Peter et al., 2017). However, in datasets where the number of sequences is not known, convNMF may return inefficient and inconsistent factorizations. To address these challenges, we introduced a new regularization term to penalize correlated factorizations, and developed a new dissimilarity measure to assess model stability. Both proposed methods can be used to infer the number of sequences in neural data and are highly robust to noise. For example, even when (synthetic) neurons participate probabilistically in sequences at a rate of 50%, the model typically identifies factors with greater than 80% similarity to the ground truth (Figure 3A). Additionally, these methods perform well even with very limited amounts of data: for example successfully extracting sequences that only appear a handful of times in a noisy data stream (Figure 3—figure supplement 2).

The x-ortho penalty developed in this paper may represent a useful improvement over traditional orthogonality regularizations or suggest how traditional regularization penalties may be usefully modified. First, it simultaneously provides a penalty on correlations in both 𝐖 and 𝐇, thus simplifying analyses by having only one penalty term. Second, although the x-ortho penalty does not formally constitute regularization due to its inclusion of the data 𝐗, we have described how the penalty can be approximated by a data-free regularization with potentially useful properties (Appendix 2). Specifically, the data-free regularization contains terms corresponding to weighted orthogonality in (smoothed) 𝐇 and 𝐖, where the weights focus the orthogonality penalty preferentially on those factors that contribute the most power to the reconstruction. This concept of using power-weighted regularization penalties may be applicable more generally to matrix factorization techniques.

As in many data analysis scenarios, a variety of statistical approaches may be brought to bear on finding sequences in neural data. A classic method is to construct cross-correlogram plots, showing spike time correlations between pairs of neurons at various time lags. However, other forms of spike rate covariation, such as trial-to-trial gain modulation, can produce spurious peaks in this measure (Brody, 1999); recent work has developed statistical corrections for these effects (Russo and Durstewitz, 2017). After significant pairwise correlations are identified, one can heuristically piece together pairs of neurons with significant interactions into a sequence. This bottom-up approach may be better than seqNMF at detecting sequences involving small numbers of neurons, since seqNMF specifically targets sequences that explain large amounts of variance in the data. On the other hand, bottom-up approaches to sequence extraction may fail to identify long sequences with high participation noise or jitter in each neuron (Quaglio et al., 2018). One can think of seqNMF as a complementary top-down approach, which performs very well in the high-noise regime since it learns a template sequence at the level of the full population that is robust to noise at the level of individual units.

Statistical models with a dynamical component, such as Hidden Markov Models (HMMs) (Maboudi et al., 2018), linear dynamical systems (Kao et al., 2015), and models with switching dynamics (Linderman et al., 2017), can also capture sequential firing patterns. These methods will typically require many hidden states or latent dimensions to capture sequences, similar to PCA and NMF which require many components to recover sequences. Nevertheless, visualizing the transition matrix of an HMM can provide insight into the order in which hidden states of the model are visited, mapping onto different sequences that manifest in population activity (Maboudi et al., 2018). One advantage of this approach is that it can model sequences that occasionally end prematurely, while convNMF will always reconstruct the full sequence. On the other hand, this pattern completion property makes convNMF robust to participation noise and jitter. In contrast, a standard HMM must pass through each hidden state to model a sequence, and therefore may have trouble if many of these states are skipped. Thus, we expect HMMs and related models to exhibit complementary strengths and weaknesses when compared to convNMF.

Another strength of convNMF is its ability to accommodate sequences with shared neurons, as has been observed during song learning (Okubo et al., 2015). Sequences with shared neurons can be interpreted either in terms of ‘parts-based’ or ‘events-based’ factorizations (Figure 9—figure supplement 2). This capacity for a combinatorial description of overlapping sequences distinguishes convNMF from many other methods, which assume that neural patterns/sequences do not co-occur in time. For example, a vanilla HMM can only model each time step with a single hidden state and thus cannot express parts-based representations of neural sequences. Likewise, simple clustering models would assign each time interval to a single cluster label. Adding hierarchical and factorial structure to these models could allow them to test for overlapping neural sequences (see e.g. Ghahramani and Jordan, 1997); however, we believe seqNMF provides a simpler and more direct framework to explore this possibility.

Finally, as demonstrated by our development of new regularization terms and stability measures, convolutional matrix factorization is a flexible and extensible framework for sequence extraction. For example, one can tune the overall sparsity in the model by introducing additional L1 regularization terms. The loss function may also be modified, for example substituting in KL divergence or more general β-divergence (Villasana et al., 2018). Both L1 regularization and β-divergence losses are included in the seqNMF code package so that the model can be tuned to the particular needs of future analyses. Future development could incorporate outlier detection into the objective function (Netrapalli et al., 2014), or online optimization methods for large datasets (Wang et al., 2013). Other extensions to NMF, for example, Union of Intersections NMF Cluster (Ubaru et al., 2017), have yielded increased robustness and consistency of NMF factorizations, and could potentially also be modified for application to convNMF. Thus, adding convolutional structure to factorization-based models of neural data represents a rich opportunity for statistical neuroscience.

Despite limiting ourselves to a relatively simple model for the purposes of this paper, we extracted biological insights that would have been difficult to otherwise achieve. For example, we identified neural sequences in isolated songbirds without aligning to song syllables, enabling new research into songbird learning on two fronts. First, since isolated and juvenile birds sing highly variable songs that are not easily segmented into stereotyped syllables, it is difficult and highly subjective to identify sequences by aligning to human-labeled syllables. SeqNMF enables the discovery and future characterization of neural sequences in these cases. Second, while behaviorally aligned sequences exist in tutored birds, it is that possible many neural sequences—for example, in different brain areas or stages of development—are not closely locked to song syllables. Thus, even in cases where stereotyped song syllables exist, behavioral alignment may overlook relevant sequences and structure in the data. These lessons apply broadly to many neural systems, and demonstrate the importance of general-purpose methods that extract sequences without reference to behavior. Our results show that convolutional matrix factorization models are an attractive framework to meet this need.

Materials and methods

Key resources table
Reagent type
(species) or resource
DesignationSource or
reference
IdentifiersAdditional
information
Software,
algorithm
seqNMFthis paperhttps://github.com/FeeLab/seqNMFstart with demo.m
Software,
algorithm
convNMFSmaragdis, 2004;
Smaragdis, 2007
https://github.com/colinvaz/nmf-toolbox
Software,
algorithm
sparse convNMFO’Grady and Pearlmutter, 2006;
Ramanarayanan et al., 2013
https://github.com/colinvaz/nmf-toolbox
Software,
algorithm
NMF orthogonality penaltiesChoi, 2008;
Chen and Cichocki, 2004
Software,
algorithm
other NMF extensionsCichocki et al., 2009
Software,
algorithm
NMFLee and Seung, 1999
Software,
algorithm
CNMF_E (cell extraction)Zhou et al., 2018https://github.com/zhoupc/CNMF_E
Software,
algorithm
MATLABMathWorkswww.mathworks.com,
RRID:SCR_001622
Strain, strain
background
(adeno-associated virus)
AAV9.CAG.GCaMP6f.
WPRE.SV40
Chen et al., 2013Addgene viral prep # 100836-AAV9,
http://n2t.net/addgene:100836,
RRID:Addgene_100836
Commercial
assay or kit
Miniature
microscope
Inscopixhttps://www.inscopix.com/nvista

Contact for resource sharing

Request a detailed protocol

Further requests should be directed to Michale Fee (fee@mit.edu).

Software and data availability

Request a detailed protocol

The seqNMF MATLAB code is publicly available as a github repository, which also includes our songbird data (Figure 9) for demonstration (Mackevicius et al., 2018; copy archived at https://github.com/elifesciences-publications/seqNMF).

The repository includes the seqNMF function, as well as helper functions for selecting λ, testing the significance of factors, plotting, and other functions. It also includes a demo script with an example of how to select λ for a new dataset, test for significance of factors, plot the seqNMF factorization, switch between parts-based and events-based factorizations, and calculate cross-validated performance on a masked test set.

Generating simulated data

Request a detailed protocol

We simulated neural sequences containing between 1 and 10 distinct neural sequences in the presence of various noise conditions. Each neural sequence was made up of 10 consecutively active neurons, each separated by three timebins. The binary activity matrix was convolved with an exponential kernel (τ=10 timebins) to resemble neural calcium imaging activity.

SeqNMF algorithm details

The x-ortho penalized convNMF algorithm is a direct extension of the multiplicative update convNMF algorithm (Smaragdis, 2004), and draws on previous work regularizing NMF to encourage factor orthogonality (Chen and Cichocki, 2004).

The uniqueness and consistency of traditional NMF has been better studied than convNMF. In special cases, NMF has a unique solution comprised of sparse, ‘parts-based’ features that can be consistently identified by known algorithms (Donoho and Stodden, 2004; Arora et al., 2011). However, this ideal scenario does not hold in many practical settings. In these cases, NMF is sensitive to initialization, resulting in potentially inconsistent features. This problem can be addressed by introducing additional constraints or regularization terms that encourage the model to extract particular, e.g. sparse or approximately orthogonal features (Huang et al., 2014; Kim and Park, 2008). Both theoretical work and empirical observations suggest that these modifications result in more consistently identified features (Theis et al., 2005; Kim and Park, 2008).

For x-ortho penalized seqNMF, we added to the convNMF cost function a term that promotes competition between overlapping factors, resulting in the following cost function:

(8) (W~,H~)=arg minW,H(||X~X||F2+λ||(WX)SH||1,ij)

We derived the following multiplicative update rules for 𝐖 and 𝐇 (Appendix 1):

(9) WW×X(H)X~(H)+λXSH(1I)
(10) HH×WXWX~+λ(1I)(WXS)

where the division and × are element-wise. The operator () shifts a matrix in the direction by timebins, that is a delay by timebins, and () shifts a matrix in the direction by timebins (notation summary, Table 1). Note that multiplication with the K×K matrix (𝟏-𝐈) effectively implements factor competition because it places in the kth row a sum across all other factors. These update rules are derived in Appendix 1 by taking the derivative of the cost function in Equation 8 and choosing an appropriate learning rate for each element.

In addition to the multiplicative updates outlined in Table 2, we also renormalize so rows of 𝐇 have unit norm; shift factors to be centered in time such that the center of mass of each 𝐖 pattern occurs in the middle; and in the final iteration run one additional step of unregularized convNMF to prioritize the cost of reconstruction error over the regularization (Algorithm 1). This final step is done to correct a minor suppression in the amplitude of some peaks in 𝐇 that may occur within 2L timebins of neighboring sequences.

Testing the significance of each factor on held-out data

Request a detailed protocol

In order to test whether a factor is significantly present in held-out data, we measured the distribution across timebins of the overlaps of the factor with the held-out data, and compared the skewness of this distribution to the null case (Figure 1). Overlap with the data is measured as 𝐖𝐗, a quantity which will be high at timepoints when the sequence occurs, producing a distribution of 𝐖𝐗 with high skew. In contrast, a distribution of overlaps exhibiting low skew indicates a sequence is not present in the data, since there are few timepoints of particularly high overlap. We estimated what skew levels would appear by chance by constructing null factors where temporal relationships between neurons have been eliminated. To create such null factors, we start from the real factors then circularly shift the timecourse of each neuron by a random amount between 0 and L. We measure the skew of the overlap distributions for each null factor, and ask whether the skew we measured for the real factor is significant at p-value α, that is, if it exceeds the Bonferroni corrected ((1-αK)×100)th percentile of the null skews (see Figure 2—figure supplement 1).

Algorithm 1: SeqNMF (x-ortho algorithm)
    Input: Data matrix 𝐗, number of factors K, factor duration L, regularization
                  strength λ
    Output: Factor exemplars 𝐖, factor timecourses 𝐇
1 Initialize 𝐖 and 𝐇 randomly
2 Iter = 1
3 While (Iter < maxIter) and (Δ cost > tolerance) do
4       Update 𝐇 using multiplicative update from Table 2
5       Shift 𝐖 and 𝐇 to center 𝐖’s in time
6       Renormalize 𝐖 and 𝐇 so rows of 𝐇 have unit norm
7       Update 𝐖 using multiplicative update from Table 2
8       Iter = Iter + 1
9 Do one final unregularized convNMF update of 𝐖 and 𝐇
10 return

Note that if λ is set too small, seqNMF will produce multiple redundant factors to explain one sequence in the data. In this case, each redundant candidate sequence will pass the significance test outlined here. We will address below a procedure for choosing λ and methods for determining the number of sequences.

Calculating the percent power explained by a factorization

Request a detailed protocol

In assessing the relevance of sequences in a dataset, it can be useful to calculate what percentage of the total power in the dataset is explained by the factorization (𝐗~). The total power in the data is 𝐗2 (abbreviating n=1Nt=1Txnt2 to X2). The power unexplained by the factorization is (XX~)2. Thus, the percent of the total power explained by the factorization is:

(11) 𝐗2-(𝐗-𝐗~)2𝐗2=2𝐗𝐗~-𝐗~2𝐗2

‘Sequenciness’ score

Request a detailed protocol

The ‘sequenciness’ score was developed to distinguish between datasets with exclusively synchronous patterns, and datasets with temporally extended sequential patterns. This score relies on the observation that synchronous patterns are not disrupted by shuffling the columns of the data matrix. The ‘sequenciness’ score is calculated by first computing the difference between the power explained by seqNMF in the actual and column-shuffled data. This quantity is then divided by the power explained in the actual data minus the power explained in data where each neuron is time-shuffled by a different random permutation.

Choosing appropriate parameters for a new dataset

Request a detailed protocol

The choice of appropriate parameters (λ, K and L) will depend on the data type (sequence length, number, and density; amount of noise; etc.).

In practice, we found that results were relatively robust to the choice of parameters. When K or L is set larger than necessary, seqNMF tends to simply leave the unnecessary factors or times empty. For choosing λ, the goal is to find the ‘sweet spot’ (Figure 4) to explain as much data as possible while still producing sensible factorizations, that is, minimally correlated factors, with low values of ||(𝐖𝐗)𝐒𝐇||1,ij. Our software package includes demo code for determining the best parameters for a new type of data, using the following strategy:

  1. Start with K slightly larger than the number of sequences anticipated in the data

  2. Start with L slightly longer than the maximum expected factor length

  3. Run seqNMF for a range of λ’s, and for each λ measure the reconstruction error (||𝐗-𝐖𝐇||F2) and the correlation cost term (||(𝐖𝐗)𝐒𝐇||1,ij)

  4. Choose a λ slightly above the crossover point λ0

  5. Decrease K if desired, as otherwise some factors will be consistently empty

  6. Decrease L if desired, as otherwise some times will consistently be empty

In some applications, achieving the desired accuracy may depend on choosing a λ that allows some inconsistency. It is possible to deal with this remaining inconsistency by comparing factors produced by different random initializations, and only considering factors that arise from several different initializations, a strategy that has been previously applied to standard convNMF on neural data (Peter et al., 2017).

During validation of our procedure for choosing λ, we compared factorizations to ground truth sequences as shown in Figure 4. To find the optimal value of λ, we used the product of two curves. The first curve was obtained by calculating the fraction of fits in which the true number of sequences was recovered as a function of λ. The second curve was obtained by calculating similarity to ground truth as a function of λ (see Materials and methods section ‘Measuring performance on noisy fits by comparing seqNMF sequence to ground-truth sequences’). The product of these two curves was smoothed using a three-sample boxcar sliding window, and the width was found as the values of λ on either side of the peak value that correspond most closely to the half-maximum points of the curve.

Preprocessing

Request a detailed protocol

While seqNMF is generally quite robust to noisy data, and different types of sequential patterns, proper preprocessing of the data can be important to obtaining reasonable factorizations on real neural data. A key principle is that, in minimizing the reconstruction error, seqNMF is most strongly influenced by parts of the data that exhibit high variance. This can be problematic if the regions of interest in the data have relatively low amplitude. For example, high firing rate neurons may be prioritized over those with lower firing rate. As an alternative to subtracting the mean firing rate of each neuron, which would introduce negative values, neurons could be normalized divisively or by subtracting off a NMF reconstruction fit using a method that forces a non-negative residual (Kim and Smaragdis, 2014). Additionally, variations in behavioral state may lead to seqNMF factorizations that prioritize regions of the data with high variance and neglect other regions. It may be possible to mitigate these effects by normalizing data, or by restricting analysis to particular subsets of the data, either by time or by neuron.

Measuring performance on noisy data by comparing seqNMF sequences to ground-truth sequences

Request a detailed protocol

We wanted to measure the ability of seqNMF to recover ground-truth sequences even when the sequences are obstructed by noise. Our noisy data consisted of three ground-truth sequences, obstructed by a variety of noise types. For each ground-truth sequence, we found its best match among the seqNMF factors. This was performed in a greedy manner. Specifically, we first computed a reconstruction for one of the ground-truth factors. We then measured the correlation between this reconstruction and reconstructions generated from each of the extracted factors, and chose the best match (highest correlation). Next, we matched a second ground-truth sequence with its best match (highest correlation between reconstructions) among the remaining seqNMF factors, and finally we found the best match for the third ground-truth sequence. The mean of these three correlations was used as a measure of similarity between the seqNMF factorization and the ground-truth (noiseless) sequences.

Testing generalization of factorization to randomly held-out (masked) data entries

Request a detailed protocol

The data matrix 𝐗 was divided into training data and test data by randomly selecting 5 or 10% of matrix entries to hold out. Specifically, the objective function (Equation 5, in the Results section) was modified to:

(12)arg minW,H||M×(WHX)||F2+

where × indicates elementwise multiplication (Hadamard product) and 𝐌 is a binary matrix with 5 or 10% of the entries randomly selected to be zero (held-out test set) and the remaining 95 or 90% set to one (training set). To search for a solution, we reformulate this optimization problem as:

(13) arg minW,H,Z||WHZ||F2+subjecttoM×Z=M×X

where we have introduced a new optimization variable 𝐙, which can be thought of as a surrogate dataset that is equal to the ground truth data only on the training set. The goal is now to minimize the difference between the model estimate, 𝐗~=𝐖𝐇, and the surrogate, 𝐙, while constraining 𝐙 to equal 𝐗 at unmasked elements (where mij=1) and allowing 𝐙 to be freely chosen at masked elements (where mij=0). Clearly, at masked elements, the best choice is to make 𝐙 equal to the current model estimate 𝐗~ as this minimizes the cost function without violating the constraint. This leads to the following update rules which are applied cyclically to update 𝐙, 𝐖, and 𝐇.

(14) Znt{XntifMnt=1(WH)ntifMnt=0
(15) WW×Z(H)X~(H)+λZSH(1I)
(16) HH×WZWX~+λ(1I)(WZS)

The measure used for testing generalization performance was root mean squared error (RMSE). For the testing phase, RMSE was computed from the difference between 𝐗~ and the data matrix 𝐗 only for held-out entries.

Hippocampus data

Request a detailed protocol

The hippocampal data was collected in the Buzsaki lab (Pastalkova et al., 2015; Mizuseki et al., 2013), and is publicly available on the Collaborative Research in Computational Neuroscience (CRCNS) Data sharing website. The dataset we refer to as ‘Rat 1’ is in the hc-5 dataset, and the dataset we refer to as ‘Rat 2’ is in the hc-3 dataset. Before running seqNMF, we processed the data by convolving the raw spike trains with a gaussian kernel of standard deviation 100 ms.

Animal care and use

Request a detailed protocol

We used male zebra finches (Taeniopygia guttata) from the MIT zebra finch breeding facility (Cambridge, MA). Animal care and experiments were carried out in accordance with NIH guidelines, and reviewed and approved by the Massachusetts Institute of Technology Committee on Animal Care (protocol 0715-071-18).

In order to prevent exposure to a tutor song, birds were foster-raised by female birds, which do not sing, starting on or before post-hatch day 15. For experiments, birds were housed singly in custom-made sound isolation chambers.

Data acquisition and preprocessing

Request a detailed protocol

The calcium indicator GCaMP6f was expressed in HVC by intracranial injection of the viral vector AAV9.CAG.GCaMP6f.WPRE.SV40 (Chen et al., 2013) into HVC. In the same surgery, a cranial window was made using a GRIN (gradient index) lens (1 mm diamenter, 4 mm length, Inscopix). After at least one week, in order to allow for sufficient viral expression, recordings were made using the Inscopix nVista miniature fluorescent microscope.

Neuronal activity traces were extracted from raw fluorescence movies using the CNMF_E algorithm, a constrained non-negative matrix factorization algorithm specialized for microendoscope data by including a local background model to remove activity from out-of-focus cells (Zhou et al., 2018).

We performed several preprocessing steps before applying seqNMF to functional calcium traces extracted by CNMF_E. First, we estimated burst times from the raw traces by deconvolving the traces using an AR-2 process. The deconvolution parameters (time constants and noise floor) were estimated for each neuron using the CNMF_E code package (Zhou et al., 2018). Some neurons exhibited larger peaks than others, likely due to different expression levels of the calcium indicator. Since seqNMF would prioritize the neurons with the most power, we renormalized by dividing the signal from each neuron by the sum of the maximum value of that row and the 95th percentile of the signal across all neurons. In this way, neurons with larger peaks were given some priority, but not much more than that of neurons with weaker signals.

Appendix 1

Deriving multiplicative update rules

Standard gradient descent methods for minimizing a cost function must be adapted when solutions are constrained to be non-negative, since gradient descent steps may result in negative values. Lee and Seung invented an elegant and widely-used algorithm for non-negative gradient descent that avoids negative values by performing multiplicative updates (Lee and Seung, 2001; Lee and Seung, 1999). They derived these multiplicative updates by choosing an adaptive learning rate that makes additive terms cancel from standard gradient descent on the cost function. We will reproduce their derivation here, and detail how to extend it to the convolutional case (Smaragdis, 2004) and apply several forms of regularization (O’Grady and Pearlmutter, 2006; Ramanarayanan et al., 2013; Chen and Cichocki, 2004). See Table 2 for a compilation of cost functions, derivatives and multiplicative updates for NMF and convNMF under several different regularization conditions.

Standard NMF

NMF performs the factorization 𝐗𝐗~=𝐖𝐇. NMF factorizations seek to solve the following problem:

(17) (𝐖~,𝐇~)=argmin𝐖,𝐇(𝐖,𝐇)
(18) (W,H) = 12||X~X||F2
(19) W~,H~0

This problem is convex in 𝐖 and 𝐇 separately, not together, so a local minimum is found by alternating 𝐖 and 𝐇 updates. Note that:

(20) 𝐖(𝐖,𝐇)=𝐗~𝐇-𝐗𝐇
(21) 𝐇(𝐖,𝐇)=𝐖𝐗~-𝐖𝐗

Thus, gradient descent steps for 𝐖 and 𝐇 are:

(22) WWηW(X~HXH)
(23) HHηH(WX~WX)

To arrive at multiplicative updates, Lee and Seung (Lee and Seung, 2001) set:

(24) η𝐖=𝐖𝐖𝐇𝐇
(25) η𝐇=𝐇𝐖𝐖𝐇

Thus, the gradient descent updates become multiplicative:

(26) WW×XHWHH=W×XHX~H
(27) HH×WXWWH=H×WXWX~

where the division and × are element-wise.

Standard convNMF

Convolutional NMF factorizes data 𝐗𝐗~=𝐖𝐇=𝐖𝐇. ConvNMF factorizations seek to solve the following problem:

(28) (𝐖~,𝐇~)=argmin𝐖,𝐇(𝐖,𝐇)
(29) (𝐖,𝐇)=12||𝐗~-𝐗||F2
(30) 𝐖~,𝐇~0

The derivation above for standard NMF can be applied for each , yielding the following update rules for convNMF (Smaragdis, 2004):

(31) WW×XHX~H
(32) HH×WXWX~ = H×WXWX~

Where the operator shifts a matrix in the direction by timebins, that is a delay by timebins, and shifts a matrix in the direction by timebins (Table 1). Note that NMF is a special case of convNMF where L=1.

Incorporating regularization terms

Suppose we want to regularize by adding a new term, to the cost function:

(33) (𝐖~,𝐇~)=argmin𝐖,𝐇(𝐖,𝐇)
(34) (𝐖,𝐇)=12||𝐗~-𝐗||F2+
(35) 𝐖~,𝐇~0

Using a similar trick to Lee and Seung, we choose a η𝐖,η𝐇 to arrive at a simple multiplicative update. Below is the standard NMF case, which generalizes trivially to the convNMF case.

Note that:

(36) 𝐖=𝐗~𝐇-𝐗𝐇+𝐖
(37) 𝐇=𝐖𝐗~-𝐖𝐗+𝐇

We set:

(38) η𝐖=𝐖𝐗~𝐇+𝐖
(39) η𝐇=𝐇𝐖𝐗~+𝐇

Thus, the gradient descent updates become multiplicative:

(40) WWηWW=W×XHX~H+W
(41) HHηHH=H×WXWX~+H

where the division and × are element-wise.

The above formulation enables flexible incorporation of different types of regularization or penalty terms into the multiplicative NMF update algorithm. This framework also extends naturally to the convolutional case. See Table 2 for examples of several regularization terms, including L1 sparsity (O’Grady and Pearlmutter, 2006; Ramanarayanan et al., 2013) and spatial decorrelation (Chen and Cichocki, 2004), as well as the terms we introduce here to combat the types of inefficiencies and cross correlations we identified in convolutional NMF, namely, smoothed orthogonality for 𝐇 and 𝐖, and the x-ortho penalty term. For the x-ortho penalty term, λ||(𝐖𝐗)𝐒𝐇||1,ij, the multiplicative update rules are:

(42) WW×X(H)X~(H)+λXSH(1I)
(43) HH×WXWX~+λ(1I)(WXS)

where the division and × are element-wise. Note that multiplication with the K×K matrix (𝟏-𝐈) effectively implements factor competition because it places in the kth row a sum across all other factors.

https://doi.org/10.7554/eLife.38471.032

Appendix 2

Relation of the x-ortho penalty to traditional regularizations

As noted in the main text, the x-ortho penalty term is not formally a regularization because it includes the data 𝐗. In this Appendix, we show how this penalty can be approximated by a data-free regularization. The resulting regularization contains three terms corresponding to a weighted orthogonality penalty on pairs of 𝐇 factors, a weighted orthogonality penalty on pairs of 𝐖 factors, and a term that penalizes interactions among triplets of factors. We analyze each term in both the time domain (Equation 50) and in the frequency domain (Equations 50 and 69).

Time domain analysis

We consider the cross-orthogonality penalty term:

(44) =(𝐖𝐗)𝐒𝐇1,ij

and define, 𝐑=(𝐖𝐗)𝐒𝐇, which is a K×K matrix. Each element 𝐑ij is a positive number describing the overlap or correlation between factor i and factor j in the model. Each element of 𝐑 can be written explicitly as:

(45) 𝐑ij=tn𝐖ni𝐗n(t+)τ𝐒tτ𝐇jτ

Where the index variables t and τ range from 1 to T, n ranges from 1 to N, and ranges from 1 to L.

Our goal here is to find a close approximation to this penalty term that does not contain the data 𝐗. This can readily be done if 𝐗 is well-approximated by the convNMF decomposition:

(46) 𝐗nt(𝐖𝐇)nt=k𝐖nk𝐇k(t-)

Substituting this expression into Equation 45 and defining the smoothed matrix τ𝐒tτ𝐇jτ as 𝐇jtsmooth gives:

(47) 𝐑ijtnk𝐖ni𝐖nk𝐇kt𝐇jtsmooth

Making the substitution u=- gives:

(48) 𝐑ijtnu=-(L-1)L-1k𝐖ni(+u)𝐖nk𝐇k(t+u)𝐇jtsmooth

where in the above expression we have taken u=- to extend over the full range from -(L-1) to (L-1) under the implicit assumption that 𝐖 and 𝐇 are zero padded such that values of 𝐖 for lag indices outside the range 0 to L-1 and values of 𝐇 for time indices outside the range 1 to T are taken to be zero.

Relabeling as and gathering terms together yields

(49) 𝐑ijku=-(L-1)L-1(n𝐖ni(+u)𝐖nk)(t𝐇k(t+u)𝐇jtsmooth)

We note that the above expression contains terms that resemble penalties on orthogonality between two 𝐖 factors (first parenthetical) or two 𝐇 factors (one of which is smoothed, second parenthetical), but in this case allowing for different time lags u between the factors. To understand this formula better, we decompose the above sum over k into three contributions corresponding to k=i, k=j, and ki,j

(50) Riju=(L1)L1(nWni(+u)Wni)(tHi(t+u)Hjtsmooth)+u=(L1)L1(nWni(+u)Wnj)(tHj(t+u)Hjtsmooth)+ki,ju=(L1)L1(nWni(+u)Wnk)(tHk(t+u)Hjtsmooth)

The first term above contains, for u=0, a simple extension of the (i,j)th element of the 𝐇 orthogonality condition 𝐇𝐒𝐇. The extension is that the orthogonality is weighted by the power, that is the sum of squared elements, in the ith factor of 𝐖 (the apparent lack of symmetry in weighing by the ith rather than the jth factor can be removed by simultaneously considering the term Rji, as shown in the Fourier representation of the following section). This weighting has the benefit of applying the penalty on 𝐇 orthogonality most strongly to those factors whose corresponding 𝐖 components contain the most power. For u0, this orthogonality condition is extended to allow for overlap of time-shifted 𝐇’s, with weighting at each time shift by the autocorrelation of the corresponding 𝐖 factor. Qualitatively, this enforces that (even in the absence of the smoothing matrix 𝐒), 𝐇’s that are offset by less than the width of the autocorrelation of the corresponding 𝐖’s will have overlapping convolutions with these 𝐖’s due to the temporal smoothing associated with the convolution operation. We note that, for sparse sequences as in the examples of Figure 1, there is no time-lagged component to the autocorrelation, so this term corresponds simply to a smoothed 𝐇 orthogonality regularization, weighted by the strength of the corresponding 𝐖 factors.

The second term above represents a complementary orthogonality condition on the 𝐖 components, in which orthogonality in the ith and jth𝐖 factors are weighted by the (smoothed) autocorrelation of the 𝐇 factors. For the case in which the 𝐇 factors have no time-lagged autocorrelations, this corresponds to a simple weighting of 𝐖 orthogonality by the strength of the corresponding 𝐇 factors.

Finally, we consider the remaining terms of the cost function, for which ki,j. We note that these terms are only relevant when the factorization contains at least three factors, and thus their role cannot be visualized from the simple Type 1 to Type 3 examples of Figure 1. These terms have the form:

(51) ki,ju=-(L-1)L-1(n𝐖ni(+u)𝐖nk)(t𝐇k(t+u)𝐇jtsmooth)

To understand how this term contributes, we consider each of the expressions in parentheses. The first expression corresponds, as described above, to the time-lagged cross correlation of the ith and kthW components. Likewise, the second expression corresponds to the time-lagged correlation of the (smoothed) jth and kthH components. Thus, this term of 𝐑ij contributes whenever there is a factor (𝐖k,𝐇k) that overlaps, at the same time lags, with the ith factor’s 𝐖 component and the jth factor’s 𝐇 component. Thus, this term penalizes cases where, rather than (or in addition to) two factors i and j directly overlapping one another, they have a common factor k with which they overlap.

An example of the contribution of a triplet penalty term, as well as of the paired terms of Equation 50, is shown in Figure 1 of this Appendix. By inspection, there is a penalty 𝐑23 due to the overlapping values of the pair (𝐡𝟐,𝐡𝟑). Likewise, there is a penalty 𝐑13 due to the overlapping values of the pair (𝐰𝟏,𝐰𝟑). The triplet penalty term contributes to 𝐑12 and derives from the fact that 𝐰𝟏 overlaps with 𝐰𝟑 at the same time (and with the same, zero time lag) as 𝐡𝟐 overlaps with 𝐡𝟑.

In summary, the above analysis shows that for good reconstructions of the data where 𝐗𝐖𝐇, the x-ortho penalty can be well-approximated by the sum of three contributions. The first corresponds to a penalty on time-lagged (smoothed) 𝐇 orthogonality weighted at each time lag by the autocorrelation of the corresponding 𝐖 factors. The second similarly corresponds to a penalty on time-lagged 𝐖 orthogonality weighted at each time lag by the (smoothed) autocorrelation of the corresponding 𝐇 factors. For simple cases of sparse sequences, these contributions reduce to orthogonality in 𝐇 or 𝐖 weighted by the power in the corresponding 𝐖 or 𝐇, respectively, thus focusing the penalties most heavily on those factors which contribute most heavily to the data reconstruction. The third, triplet contribution corresponds to the case in which a factor in 𝐖 and a different factor in 𝐇 both overlap (at the same time lag) with a third common factor, and may occur even when the factors 𝐖 and 𝐇 themselves are orthogonal. Further work is needed to determine whether this third contribution is critical to the x-ortho penalty or is simply a by-product of the x-ortho penalty procedure’s direct use of the data 𝐗.

Appendix 2—figure 1
Example of redundancy with three factors.

In addition to the direct overlap of 𝐡2 and 𝐡3, and of 𝐰1 and 𝐰3, there is a ‘triplet’ penalty 𝐑12 on factors 1 and 2 that occurs because each has an overlap (in either 𝐖 or 𝐇) with the 3rd factor (𝐰3,𝐡3). This occurs even though neither 𝐰1 and 𝐰2, nor 𝐡1 and 𝐡2, are themselves overlapping.

https://doi.org/10.7554/eLife.38471.034

Frequency domain analysis

Additional insight may be obtained by analyzing these three components of 𝐑 in the Fourier domain. Before doing so, we below derive the Fourier domain representation of 𝐑, and provide insights suggested by this perspective.

Fourier representation of the x-ortho penalty

As in the time domain analysis, we start with defining:

(52) 𝐑ij=tn𝐖ni𝐗n(t+)τ𝐒tτ𝐇jτ

Unpacking the notation above, we note that:

(53) 𝐑ij=[n=1NCorr(𝐖i(n),𝐗(n))]Conv(𝐇j,𝐬)

where 𝐖i(n) is the nth row of 𝐖i, 𝐗(n) is the nth row of 𝐗, 𝐇j is 𝐇j, 𝐬 is a smoothing vector corresponding to the entries of each row of the smoothing matrix 𝐒, and “” is a dot product. For ease of mathematical presentation, in the following, we work with continuous time rather than the discretely sampled data and extend the 𝐖 factors, 𝐇 factors, and data matrix 𝐗 through zero-padding on both ends so that:

(54) Corr(𝐖i(n),𝐗(n))(t)=-𝐖i(n)𝐗+t(n)d

and

(55) Conv(𝐇j,𝐬)=-𝐇jτ𝐬t-τdτ

Recall that the Fourier transform is defined as:

(56) f^(ω)(f(t))-f(t)e-iωtdt

with inverse Fourier transform:

(57) f(t)=-1(f^(ω))12π-f^(ω)e+iωtdω

Now recall some basic features of Fourier transforms of correlation and convolution integrals:

(58) (Conv(f(t),g(t)))=f^(ω)g^(ω)
(59) (Corr(f(t),g(t)))=f^(-ω)g^(ω)
(60) f(t)g(t)=f(t)g(t)dt=Corrt=0(f,g)=1[f^(ω)g^(ω)dω]t=0f(t)g(t)=12πf^(ω)g^(ω)dω

This final identity, known as Parseval’s theorem, says that the inner product (dot product) between two functions evaluated in the time and frequency domain are equivalent up to a proportionality constant of 1/(2π). With the above identities, we can calculate our quantity of interest:

(61) 𝐑ij=[n=1NCorr(𝐖i(n),𝐗(n))]Conv(𝐇j,𝐬)

First, define:

(62) A(t)=[n=1NCorr(𝐖i(n),𝐗(n))]
(63) B(t)=Conv(𝐇j,𝐬)

From Equation 60 (Parseval’s theorem):

(64) 𝐑ij=12π-A^(-ω)B^(ω)dω

Then, from Equations 58 and 59, we have:

(65) 𝐑ij=12πn=1N-dω𝐖^i(n)(ω)𝐗^(n)(-ω)𝐇^j(ω)𝐬^(ω)

The above formula shows that:

  1. Viewed in the frequency domain, the x-ortho penalty reduces to a (sum over neurons and frequencies of a) simple product of Fourier transforms of the four matrices involved in the penalty.

  2. The smoothing can equally well be applied to 𝐇 or 𝐖 or 𝐗. (For 𝐗, note that for symmetric smoothing function 𝐬(t)=𝐬(-t), we also have 𝐬^(ω)=𝐬^(-ω).)

  3. One can view this operation as either of the below:

    1. First correlate 𝐖 and 𝐗 by summing correlations of each row, and then calculate the overlap with the smoothed 𝐇, as described in the main text:𝐑ij=12πn=1N-dω[𝐖^i(n)(ω)𝐗^(n)(-ω)][𝐇^j(ω)𝐬^(ω)]

    2. Correlate 𝐇 with each row of 𝐗 and then calculate the overlap of this correlation with the corresponding smoothed row of 𝐖. Then sum over all rows:𝐑ij=12πn=1N-dω[𝐇^j(ω)𝐗^(n)(-ω)][𝐖^i(n)(ω)𝐬^(ω)]

Fourier representation of the traditional regularization approximation of the x-ortho penalty

We now proceed to show how the x-ortho penalty can be approximated by a traditional (data-free) regularization, expressing the results in the frequency domain. As in the time domain analysis, we consider the approximation in which the data 𝐗 are nearly perfectly reconstructed by the convNMF decomposition (𝐗𝐖𝐇).

Noting that this decomposition is a sum over factors of row-by-row ordinary convolutions, we can write the Fourier analog for each row of 𝐗 as:

(66) 𝐗^(n)(ω)k𝐖^k(n)(ω)𝐇^k(ω)

Thus, substituting 𝐗 with the reconstruction, 𝐖𝐇 in Equation 65, we have:

(67) 𝐑ij12πk-dω[n=1N𝐖^i(n)(ω)𝐖^k(n)(-ω)][𝐇^k(-ω)𝐇^j(ω)𝐬^(ω)]

As in the time domain analysis, we separate the sum over k into three cases: k=i, k=j, and ki,j. Recall that for real numbers, f^(-ω)=f^*(ω), and f^(ω)f^*(ω)=|f^(ω)|2. Thus, separating the sum over k into the three cases, we have:

(68) Rij=12πdω[n=1N|W^i(n)(ω)|2][H^i(ω)H^j(ω)s^(ω)]+12π|H^j(ω)|2[n=1NW^i(n)(ω)W^j(n)(ω)s^(ω)]+Y

where Y represents the remaining terms for which ki,j.

We can obtain a more symmetric form of this equation by summing the contributions of factors i and j, 𝐑ij+𝐑ji. For symmetric smoothing functions 𝐬(t)=𝐬(-t), for which 𝐬^(ω)=𝐬^(-ω), we obtain:

(69) Rij+Rji=12πdω[n=1N|W^i(n)(ω)|2+|W^j(n)(ω)|2][H^i(ω)H^j(ω)s^(ω)]+12π[|H^i(ω)|2+|H^j(ω)|2][n=1NW^i(n)(ω)W^j(n)(ω)s^(ω)]+Y

As in the time domain analysis, the first two terms above have a simple interpretation in comparison to traditional orthogonality regularizations: The first term resembles a traditional regularization of orthogonality in (smoothed) 𝐇, but now weighted frequency-by-frequency by the summed power at that frequency in the corresponding 𝐖 factors. For sparse (delta-function-like) sequences, the power in 𝐖 at each frequency is a constant and can be taken outside the integral. In this case, the regularization corresponds precisely to orthogonality in (smoothed) H, weighted by the summed power in the corresponding 𝐖’s. Likewise, the second term above corresponds to a traditional regularization of orthogonality in (smoothed) 𝐖, weighted by the summed power at each component frequency in the corresponding 𝐇 factors.

Altogether, we see that these terms represent a Fourier-power weighted extension of (smoothed) traditional orthogonality regularizations in 𝐖 and 𝐇. This weighting may be beneficial relative to traditional orthogonality penalties, since it makes the regularization focus most heavily on the factors and frequencies that contribute most to the data reconstruction.

Finally, we consider the remaining terms in the cost function, for which ki,j. As noted previously, these terms are only relevant when the factorization contains at least three terms, so cannot be seen in the simple Type 1, 2 and 3 cases illustrated in Figure 1. These terms have the form:

(70) 12πki,j-dω[n=1N𝐖^i(n)(ω)𝐖^k(n)(-ω)][𝐇^k(-ω)𝐇^j(ω)𝐬^(ω)]

To understand how this term contributes, we consider each of the expressions in parentheses. The first expression contains each frequency component of the correlation of the ith and kth factors’ 𝐖’s. The second expression likewise contains each frequency component of the correlation of the jth and kth factors’ 𝐇’s. Thus, analogous to the time domain analysis, this term of 𝐑ij contributes whenever there is a factor (𝐖k,𝐇k) that overlaps at any frequency with the ith factor’s 𝐖 component and the jth factor’s 𝐇 component. In this manner, this three-factor interaction term effectively enforces competition between factors i and j even if they are not correlated themselves, as demonstrated in Figure 1 of this Appendix.

https://doi.org/10.7554/eLife.38471.033

References

  1. 1
  2. 2
    Investigation of sequence processing: a cognitive and computational neuroscience perspective
    1. RS Bapi
    2. VSC Pammi
    3. KP Miyapuram
    4. Ahmed
    (2005)
    Current Science 89:1690–1698.
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
    Nonnegative matrix factorization with temporal smoothness and/or spatial decorrelation constraints
    1. Z Chen
    2. A Cichocki
    (2004)
    Signal Processing, 11.
  9. 9
    Algorithms for orthogonal nonnegative matrix factorization
    1. S Choi
    (2008)
    IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). pp. 1828–1832.
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
    When does non-negative matrix factorization give a correct decomposition into parts?
    1. D Donoho
    2. V Stodden
    (2004)
    In: S Thrun, L. K Saul, B Schölkopf, editors. Advances in Neural Information Processing Systems, 16. MIT Press. pp. 1141–1148.
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
  24. 24
  25. 25
  26. 26
  27. 27
  28. 28
  29. 29
  30. 30
    Sparse nonnegative matrix factorization for clustering
    1. J Kim
    2. H Park
    (2008)
    Georgia Institute of Technology. Technical Report GT-CSE.
  31. 31
  32. 32
  33. 33
    Algorithms for non-negative matrix factorization
    1. DD Lee
    2. HS Seung
    (2001)
    In: T. K Leen, T. G Dietterich, V Tresp, editors. Advances in Neural Information Processing Systems, 13. MIT Press. pp. 556–562.
  34. 34
    Bayesian learning and inference in recurrent switching linear dynamical systems
    1. S Linderman
    2. M Johnson
    3. A Miller
    4. R Adams
    5. D Blei
    6. L Paninski
    (2017)
    Proceedings of the 20th International Conference on Artificial Intelligence and Statistics.
  35. 35
  36. 36
  37. 37
  38. 38
  39. 39
  40. 40
  41. 41
  42. 42
  43. 43
  44. 44
    Non-convex robust PCA
    1. P Netrapalli
    2. UN Niranjan
    3. S Sanghavi
    4. A Anandkumar
    5. P Jain
    (2014)
    In: Z Ghahramani, M Welling, C Cortes, ND Lawrence, KQ Weinberger, editors. Advances in Neural Information Processing Systems, 27. Curran Associates, Inc. pp. 1107–1115.
  45. 45
  46. 46
  47. 47
  48. 48
  49. 49
    On lines and planes of closest fit to systems of points in space
    1. K Pearson
    (1901)
    The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2:559–572.
    https://doi.org/10.1080/14786440109462720
  50. 50
    Sparse convolutional coding for neuronal assembly detection
    1. S Peter
    2. E Kirschbaum
    3. M Both
    4. L Campbell
    5. B Harvey
    6. C Heins
    7. D Durstewitz
    (2017)
    In: I Guyon, U. V Luxburg, S Bengio, H Wallach, R Fergus, S Vishwanathan, R Garnett, editors. Advances in Neural Information Processing Systems, 30. Curran Associates, Inc. pp. 3675–3685.
  51. 51
  52. 52
  53. 53
  54. 54
  55. 55
  56. 56
  57. 57
  58. 58
  59. 59
  60. 60
    Sequence to sequence learning with neural networks
    1. I Sutskever
    2. O Vinyals
    3. QV Le
    (2014)
    In: Z Ghahramani, M Welling, C Cortes, KQ Weinberger, ND Lawrence, editors. Advances in Neural Information Processing Systems, 27. MIT Press. pp. 3104–3112.
  61. 61
    First results on uniqueness of sparse non-negative matrix factorization
    1. FJ Theis
    2. K Stadlthanner
    3. T Tanaka
    (2005)
    13th European Signal Processing Conference.
  62. 62
  63. 63
    UoI-NMF cluster: a robust nonnegative matrix factorization algorithm for improved parts-based decomposition and reconstruction of noisy data
    1. S Ubaru
    2. K Wu
    3. KE Bouchard
    (2017)
    2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA).
  64. 64
    Generalized low rank models
    1. M Udell
    2. C Horn
    3. R Zadeh
    4. S Boyd
    (2016)
    Foundations and Trends in Machine Learning 9:1–118.
    https://doi.org/10.1561/2200000055
  65. 65
  66. 66
    Convex hull convolutive non-negative matrix factorization for uncovering temporal patterns in multivariate time-series data
    1. C Vaz
    2. A Toutios
    3. S Narayanan
    (2016)
    Interspeech 2016.
  67. 67
    Multiplicative updates for convolutional NMF under β-Divergence
    1. TPJ Villasana
    2. S Gorlow
    3. AT Hariraman
    (2018)
    arVix.
  68. 68
    Clustering stability: an overview
    1. U von Luxburg
    (2010)
    Foundations and Trends in Machine Learning 2:235–274.
  69. 69
  70. 70
  71. 71
  72. 72
  73. 73

Decision letter

  1. Laura Colgin
    Reviewing Editor; The University of Texas at Austin, United States
  2. Timothy E Behrens
    Senior Editor; University of Oxford, United Kingdom

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for sending your article entitled "Unsupervised discovery of temporal sequences in high-dimensional datasets, with applications to neuroscience" for peer review at eLife. Your article is being evaluated by Timothy Behrens as the Senior Editor, a Reviewing Editor, and three reviewers.

While the reviewers were generally enthusiastic about the work, major concerns were brought up that raise the question of whether this work is appropriate for the general readership of eLife. All reviewers agreed that these major concerns require essential revisions and thus would like to see an action plan that addresses these concerns before they issue a formal decision. In particular, the reviewers were not convinced that this method constitutes a major advance over current state-of-the-art methods in the field. Also, the selection of hyperparameters was not convincingly justified. The reviews are included below in their entirety.

Reviewer #1:

This paper introduces a matrix factorization-based method, seqNMF, for extracting temporal sequences from high-dimensional data. The authors convincingly demonstrate that seqNMF performs well in artificial and real datasets. I believe seqNMF will be a useful tool for a wide range of applications. After its appearance at COSYNE, the tool has already created considerable excitement in the computational and systems neuroscience community.

I found the paper to be very well written and the results clearly presented. However, some points need improvement:

1) While Type 1, 2, and 3 errors are clearly explained, I thought better naming could improve the readability of the paper a lot.

2) Building into equation 6, the authors discuss two alternative regularizers, and claim that they would fail at preventing Type 1, 2 and 3 errors simultaneously. While the arguments are clear, an experimental demonstration would be more convincing.

3) In subsection “SeqNMF: A constrained convolutional non-negative matrix factorization”, authors call ||M||1,i≠j a norm on M, which technically isn't correct (Under this definition, a diagonal matrix would have a zero norm).

4) I have trouble with understanding the first method of choosing λ. Is this method only applicable to data for which ground-truth is known? Or are the authors suggesting to choose λ between 2λ0 and 5λ0 for any dataset? Please clarify what the suggested method is.

Reviewer #2:

This paper describes a targeted dimensionality reduction approached, called seqNMF, to identify sequences in neural data. The approach is an improvement on previously published methods. The method will be of interest to its target audience. Also, the method is well described in straightforward terms, and the authors do a great job of describing how to use the method. They provide good examples, both from simulated data and multiple types of real data. They also give a good understanding of how the method can be tuned, such as using the λ value, to explore sequences with different levels of granularity or to test different ideas about the data. The paper is well written and is presented in a fair manner. I am therefore supportive of this paper and suggest publication.

Reviewer #3:

Overview: This manuscript aims to address the problem of automated discovery of temporal sequences from neural data. The authors report on a modification (seqNMF) of the convolutional non-negative matrix factorization (convNMF) algorithm targeted to address purported issues in the base method. The fundamental issue that seqNMF attempts to address is the redundancy/replication of learned bases by the vanilla convNMF algorithm. The authors use extensive numerical studies to characterize the performance of their algorithm. The target biological datasets are neural activity (spike trains and calcium signals) from hippocampus and nucleus HVC of the zebra finch. This paper is generally well written, though it could be streamlined substantially to accelerate the reader through the manuscript. However, for reasons elaborated below, it is unclear how much of an advance the reported method is over the current state of the art in the field. Most importantly, the authors seem unaware of recent advances in the statistical-machine learning literature that aim at addressing the issue of non-redundant bases learning by NMF algorithms. Furthermore, the heuristic used for selection of the penalty hyperparameter is not adequately justified. As such, the impact/novelty of the proposed algorithm has not been evaluated or demonstrated: quantitative comparisons to the state-of-the-art is required for evaluation of proposed new analytic methods. While these issues could potentially be addressed, in my opinion, this work does not present a clear major improvement over existing methods and thus belongs in a much more specialized venue.

Major Critiques:

1) Insufficient comparison to state-of-the-art methods. The central issue that the seqNMF algorithm attempts to solve is "To reduce the occurrence of redundant factors in convNMF […]". The authors identify three types of bases learning consistency errors. To address these errors, the authors augment an existing penalty on the reconstruction cost function to encourage uncorrelated bases and loadings. This is a reasonable strategy but is by no means the only approach. Indeed, as mentioned by the authors:

-Direct selection of the rank K vs. hyperparameter strength (λ)

-Use of a sparsity penalty (e.g., L1-regularization)

While these approaches are mentioned, and sometimes examined, the benefit of the proposed method over these approaches has not been demonstrated in either synthetic or neurobiological data. For example, if one were to attempt to optimize the number of bases (K) and the regularization strength (λ) in a sparse convNMF, how do the performance measures on the synthetic data sets with various noise levels stack up? Note that the results of Figure 2C,D in this regard are not at all convincing, as the rank K of the factorization of convNMF is not optimized: indeed, I find this figure to be misleading, as it is not an 'apples-to-apples' comparison (also, I believe there is a mislabeling of convNMF vs. seqNMF in this figure).

More seriously, the authors seem unaware of recent advances in the statistical-machine learning community which directly address the issue of learning non-redundant bases. For example, the 'stability NMF' algorithm (Wu et al., 2016) was recently successfully applied to extract biologically meaningful patterns from diverse data sets. More recently, the UoI-NMFcluster algorithm (Ubaru et al., 2018) has been shown to extract the precise generative bases, and assign sparse reconstruction weights to those bases, from noisy synthetic and neurobiological data. Both of these methods aim to select bases that are minimally correlated with each other. While these methods have not been applied to sequences per se, there is no conceptual reason why they could not be. For example, if one allows for a sliding cross-correlation, it is likely that both stability NMF and UoI-NMF could deal with temporal jitter. Can these methods be modified to provide even greater robustness for learning non-redundant bases for sequences?

In summary, the authors have not demonstrated the broad and robust improvement of the proposed algorithm relative to other methods in the field.

2) Method for selecting regularization parameters. In the seqNMF algorithm, the balance between reconstruction and un-correlated bases is modulated by the λ hyperparamter. The authors provide various heuristics for choosing λ to find the correct bases. The main heuristic used for the main figure use a heuristic of 2-5x of the λ value that gives the cross-over in normalized reconstruction vs. correlation costs (λ0). The authors justify this as giving good results in the synthetic data sets and having a reasonable overlap with cross-validated reconstruction error. However, for many of the synthetic and real data sets, lambda0 also lies at the most sensitive part of the range, and small modulations lead to large changes in the trade-off of reconstruction vs. correlation. That is, the results are likely to be very sensitive to this selection. Furthermore, the selected values often result in very poor data reconstruction (e.g., Figure 6C,D and Figure 7B,F). As quantified, this is a big problem for 'interpretability', as it is very hard to interpret the results of a method that does not have good data prediction quality. In science, ground-truth is not known, and a major metric we have to quantitatively evaluate methods is reconstruction error/parsimony. If the authors could show that, something like cross-validated Bayesian Information Criterion, was optimized for the selected values of λ, that would be much more convincing.

In summary, the heuristic method used to select the regularization strength, which is a major component of the reported work, is not sufficiently justified from a statistical-machine learning perspective.

https://doi.org/10.7554/eLife.38471.040

Author response

[Editors' note: the authors’ plan for revisions was approved and the authors made a formal revised submission.]

While the reviewers were generally enthusiastic about the work, major concerns were brought up that raise the question of whether this work is appropriate for the general readership of eLife. All reviewers agreed that these major concerns require essential revisions and thus would like to see an action plan that addresses these concerns before they issue a formal decision. In particular, the reviewers were not convinced that this method constitutes a major advance over current state-of-the-art methods in the field. Also, the selection of hyperparameters was not convincingly justified. The reviews are included below in their entirety.

We thank the reviewers for their thorough and insightful reading of our manuscript. We believe their criticisms will lead to a substantial improvement in our work. Here we address these criticisms and outline improvements to the paper.

We would like to emphasize that the aim of this paper is to provide the neuroscience community with a broadly useful and largely automated method for identifying temporal patterns (which we refer to as “sequences”) in neural data. Such methods do not currently exist.

Thus, we think that, for the neuroscience community, seqNMF is a major advance and fills an important gap. Sequential firing patterns have appeared in a large range of important studies in different brain regions and different behaviors (Hahnloser, 2002; Harvey, 2012; Okubo, 2015; Fujisawa, 2008; Pastalkova, 2008; MacDonald, 2011). These studies are all from top neuroscience labs with strong quantitative expertise, and yet in each of these studies, the sequences were extracted manually by alignment of neural activity with external behavioral or task events. This suggests a profound lack of existing methods in the field for automatic extraction of sequential structure in neuronal data.

Towards our goal of making our method practical and easy-to-adopt, we have intentionally worked to build upon core existing approaches from the machine learning community. We have attempted to write the paper in an accessible and pedagogical format so that the broader neuroscience community can easily apply the methods described, using the easy-to-use code we have posted publicly.

Our goal has therefore been to bring together a variety of statistical approaches: Convolutional matrix factorization, regularization strategies to encourage orthogonal factors, cross-validation, and now (thanks to reviewer 3) stability-based measures for determining model hyperparameters. Many of these individual concepts can be found by digging through different pieces of previous literature; however, repurposing and refining models for practical applications has a strong history in machine learning and computational neuroscience and can constitute a significant advance for the field. To make this clearer in our current version of the manuscript we have used seqNMF only to refer to a toolbox containing all of our contributions. The penalty term previously referred to as seqNMF is not referred to as the cross-orthogonality (x-ortho) penalty term. Thus, even if our paper did not represent a technical advance in statistical methodology, our careful analysis of the performance of regularized convNMF on different types of synthetic and real neural (and behavioral) data represents a significant advance of interest to the broad readership of eLife.

That said, we still take very seriously the concerns of Reviewer 3 about the novelty of our method. We thank the reviewer for pointing out the richness of recent advances in the area of preventing bases learning consistency errors, some of which we were unaware of. Thanks to the suggestion of the reviewer we have now incorporated some of these recent advances in our method.

In short, we would prefer to think of our contribution not as a detailed benchmarking of different methods of regularizing convolutional NMF, but rather as a toolkit for neuroscientists (who may not be statistics experts) to automatically and reliably extract sequences from neural data, to test their statistical significance, and to flexibly fine-tune the structure of the factorization. Finally, we have endeavored to present these ideas within a pedagogical framework so that readers can readily understand, and potentially build on the method.

Reviewer #1:

This paper introduces a matrix factorization-based method, seqNMF, for extracting temporal sequences from high-dimensional data. The authors convincingly demonstrate that seqNMF performs well in artificial and real datasets. I believe seqNMF will be a useful tool for a wide range of applications. After its appearance at COSYNE, the tool has already created considerable excitement in the computational and systems neuroscience community.

I found the paper to be very well written and the results clearly presented. However, some points need improvement:

1) While Type 1, 2, and 3 errors are clearly explained, I thought better naming could improve the readability of the paper a lot.

We attempted to change the error types to: duplication, fragmentation and duplication with independence but felt that this made the paper more difficult to read. Perhaps there are other better names which would improve readability however we have chosen to keep Type 1, 2 and 3 for the time being.

2) Building into equation 6, the authors discuss two alternative regularizers, and claim that they would fail at preventing Type 1, 2 and 3 errors simultaneously. While the arguments are clear, an experimental demonstration would be more convincing.

We have added a supplementary figure (Figure 1—figure supplement 1) illustrating each type of error on reconstructions of synthetic data. We have also quantified the rates of occurrence of each error type for the different regularizers described in the paper. This analysis illustrates the pattern described in the text.

In addition, we have added an appendix analyzing how the penalty used in seqNMF relates to traditional regularizers.

3) In subsection “SeqNMF: A constrained convolutional non-negative matrix factorization”, authors call ||M||1,i≠j a norm on M, which technically isn't correct (Under this definition, a diagonal matrix would have a zero norm).

We changed the text to refer to this as a seminorm.

4) I have trouble with understanding the first method of choosing λ. Is this method only applicable to data for which ground-truth is known? Or are the authors suggesting to choose λ between 2 λ0 and 5λ0 for any dataset? Please clarify what the suggested method is.

The reviewer’s interpretation of our method is correct; we are suggesting that the choice of λ between 2λ0 and 5λ0 is a good starting point for any dataset. This range is based on the performance of the algorithm on synthetic datasets for which the ground truth is known.

Further clarification: The procedure for choosing λ is motivated by the common-sense observation that λ provides a way to weight the priority given to reconstruction error vs redundancy (correlation cost). In the datasets we tested, values between 2λ0 and 5λ0 tended to yield reasonable results, reducing redundancy without sacrificing too much reconstruction error. In certain applications, the desired tradeoff may be different (e.g., when the main priority is reconstruction error, and some redundancy is tolerable), in which case the intuition behind our procedure may be useful for selecting λ (e.g., choose lower λ). We have made this intuition clearer in the text, so that future users of our algorithm will have a better idea of the tradeoffs being made when choosing different values of λ. That being said, based on our synthetic data and examining the results with different values for the real data, we suggest starting with a λ between 2 and 5 for any data set. In the cases we have considered where the ground-truth is known, we have shown that this strategy produces results that agree with the ground-truth (Figure 4). Furthermore, these results were relatively insensitive to the precise choice of λ. That is, ground-truth factors could be recovered across a wide range of λ (Figures 4 and Figure 2—figure supplement 2; also compare Figure 3 with Figure 3—figure supplement 1).

Reviewer #3:

Overview: This manuscript aims to address the problem of automated discovery of temporal sequences from neural data. The authors report on a modification (seqNMF) of the convolutional non-negative matrix factorization (convNMF) algorithm targeted to address purported issues in the base method. The fundamental issue that seqNMF attempts to address is the redundancy/replication of learned bases by the vanilla convNMF algorithm. The authors use extensive numerical studies to characterize the performance of their algorithm. The target biological datasets are neural activity (spike trains and calcium signals) from hippocampus and nucleus HVC of the zebra finch. This paper is generally well written, though it could be streamlined substantially to accelerate the reader through the manuscript. However, for reasons elaborated below, it is unclear how much of an advance the reported method is over the current state of the art in the field. Most importantly, the authors seem unaware of recent advances in the statistical-machine learning literature that aim at addressing the issue of non-redundant bases learning by NMF algorithms. Furthermore, the heuristic used for selection of the penalty hyperparameter is not adequately justified. As such, the impact/novelty of the proposed algorithm has not been evaluated or demonstrated: quantitative comparisons to the state-of-the-art is required for evaluation of proposed new analytic methods. While these issues could potentially be addressed, in my opinion, this work does not present a clear major improvement over existing methods and thus belongs in a much more specialized venue.

We thank the reviewer for a thorough reading and thoughtful response to our manuscript.

The central aim of our paper is to present a simple and widely applicable toolset for neuroscientists to detect sequences in their data. Reducing the occurrence of redundant factors in convNMF was an important part of this aim, but not an endpoint in itself. Thus, we looked to the statistical-machine learning literature for inspiration, not for competition. (Indeed, we thank the reviewer for pointing out advances in this field that have turned out to be very effective.) We purposely targeted eLife, rather than a computer science or machine learning venue, because the appropriate audience for this work is the neuroscience community, which has a long-standing interest in neural sequences from both experimental and theoretical perspectives (e.g. Hahnloser, 2002; Harvey, 2012; Okubo, 2015; Fujisawa, 2008; Pastalkova, 2008; MacDonald, 2011; Brody, 1999; Mokeichev, 2007; Schrader, 2008; Gerstein, 2012; Torre, 2016; Russo, 2017’ Grossberger, 2018; Quaglio, 2018).

Thus, our contribution is less about a particular advance in constrained optimization, but rather about adapting and applying a collection of statistical tools to the problem of discovering interpretable sequential structure in neuroscience data. These include: the use of convolutional NMF on a variety of neural and behavioral datasets, constraints to reduce redundant factorizations, methods to select hyperparameters, estimation of statistical significance, quantification of robustness to different common noise types, and regularization to shape the structure of factorizations (i.e. parts vs events).

Despite the interest in neural sequences, factorization-based approaches have been underutilized in the field. This is likely because the appropriate set of statistical tools had not yet been assembled, adapted and applied to neuroscience data in a convincing manner. We believe that our work represents such an advance.

Major Critiques:

1) Insufficient comparison to state-of-the-art methods. The central issue that the seqNMF algorithm attempts to solve is "To reduce the occurrence of redundant factors in convNMF […]". The authors identify three types of bases learning consistency errors. To address these errors, the authors augment an existing penalty on the reconstruction cost function to encourage uncorrelated bases and loadings. This is a reasonable strategy but is by no means the only approach. Indeed, as mentioned by the authors:

-Direct selection of the rank K vs. hyperparameter strength (λ)

-Use of a sparsity penalty (e.g., L1-regularization)

While these approaches are mentioned, and sometimes examined, the benefit of the proposed method over these approaches has not been demonstrated in either synthetic or neurobiological data. For example, if one were to attempt to optimize the number of bases (K) and the regularization strength (λ) in a sparse convNMF, how do the performance measures on the synthetic data sets with various noise levels stack up?

We agree with the reviewer that it would be useful to compare our approach to the others mentioned (i.e. direct selection of rank K, or the use of a sparsity penalty) and we have added additional results exploring these approaches.

Direct selection of K

We had previously shown that direct selection of K can sometimes be carried out using cross-validated error performance on held-out data. However, following the reviewer’s suggestion of extending stability NMF (and the ​diss​ metric, see below) to the convolutional case has yielded significantly better performance for directly estimating the rank K. We have therefore added a new main figure (Figure 5) and a new supplementary figure (Figure 5—figure supplement 1) to the paper illustrating the stability NMF approach and have modified the text to present this as an additional, complementary method for minimizing bases consistency errors.

L1-sparsity

One of the advantages of the cross orthogonality penalty is that it consists of only a single term to penalize correlations between different factors, and thus requires the determination of only a single hyperparameter (λ). This contrasts with a different potential approach to minimize redundant factorizations, namely incorporating a sparsity constraint on W and H of the form λw||W||1+ λh||H||1. We first wanted to understand the dependence of the sparsity approach on the hyperparameters (λh and λw). Performance was quantified by measuring both the similarity to ground-truth and the probability of obtaining the correct number of significant factors using synthetic data sets with intermediate levels of each of the four noise types. We found that the similarity to ground-truth and probability of obtaining the correct number of significant factors varies in a complex manner as a function of the two λ parameters (​Figure 4—figure supplement 4).

We next wondered how the ‘optimal’ performance of the sparsity approach (at the best choice of λh and λw) compares with the ‘optimal’ performance of the x-ortho approach (at the best choice of the cross orthogonality λ). [See below for more information on how this was done.] Having chosen the best λ for L1-sparsity and cross orthogonality for each noise condition, we ran the factorization for each method and quantified the distribution of performance (again using similarity and the number of significant factors). We found that L1 sparsity constraint yielded an overall performance quite similar to the cross orthogonality regularizer (Figure 4—figure supplement 5). Interestingly though, there are some consistent differences in the performance of these two constraints depending on the noise type: the x-ortho approach performs slightly better with warping and participation noise, while L1 sparsity performs slightly better with jitter and additive noise (Figure 4—figure supplement 5).

Given these findings, we feel that the x-ortho penalty represents a practical advance over the sparsity approach. The two methods perform similarly once the optimal values of the hyperparameters is determined, but in cases where the ground truth is not known, it will be simpler to find the optimal value of a single parameter (by using the seqNMF method for selecting λ), than two parameters for sparsity. Of course, it may be also possible to develop a method to find suitable values of λh and λw for the sparsity approach, but given the strong performance of the x-ortho penalty term, there is little motivation to develop such a method.

More detailed methods for L1 sparsity comparison

Optimal values of λh and λw for the sparsity method were selected by computing the composite performance score at each value of λh and λw for each noise type (see arrow in Figure 4—figure supplement 4A). The composite performance is the same as that used in Figure 4F, namely the element-wise product of the similarity to ground-truth and the fraction of fits with the correct number of factors. The peak value of this composite performance score yielded a pair of lambdas at which the sparsity penalties gave the best performance on this dataset. For the comparison, the same procedure was used to select the optimal λ for the x-ortho penalty in each noise condition. The performance of sparsity and x-ortho were each evaluated at their optimal lambdas by fitting the data 100 times from random initial conditions.

Note that the results of Figure 2C,D in this regard are not at all convincing, as the rank K of the factorization of convNMF is not optimized: indeed, I find this figure to be misleading, as it is not an 'apples-to-apples' comparison (also, I believe there is a mislabeling of convNMF vs. seqNMF in this figure).

With regard to Figure 2, we apologize that our intention was not clear. Our intention was simply to illustrate the effect of the regularizer, rather than to formally compare to convNMF as a competing method. This is now clarified in the text.

More seriously, the authors seem unaware of recent advances in the statistical-machine learning community which directly address the issue of learning non-redundant bases. For example, the 'stability NMF' algorithm (Wu et al., 2016) was recently successfully applied to extract biologically meaningful patterns from diverse data sets. More recently, the UoI-NMFcluster algorithm (Ubaru et al., 2018) has been shown to extract the precise generative bases, and assign sparse reconstruction weights to those bases, from noisy synthetic and neurobiological data. Both of these methods aim to select bases that are minimally correlated with each other. While these methods have not been applied to sequences per se, there is no conceptual reason why they could not be. For example, if one allows for a sliding cross-correlation, it is likely that both stability NMF and UoI-NMF could deal with temporal jitter. Can these methods be modified to provide even greater robustness for learning non-redundant bases for sequences?

In summary, the authors have not demonstrated the broad and robust improvement of the proposed algorithm relative to other methods in the field.

We thank the reviewer for these suggestions for alternative possible methods to constrain convNMF factorizations, and to reduce bases learning consistency errors. Indeed, we were not aware of these recent exciting advances in the field and have explored the application of both stability NMF and UoI to convNMF.

Stability NMF

We have successfully adapted stability NMF (Wu et al., 2016) using the dissimilarity (diss) metric to account for temporally-shifted correlations. As the reviewer pointed out, different runs of convNMF may produce factors which are shifted in time (in both H and W) and thus have artificially low correlations, confounding the stability measure described in Wu et al. In our initial exploration of this metric, we observed that smoothing the factors worked sometimes but found that a much more robust way of overcoming this misalignment problem is to apply the stability analysis to the reconstructions of each individual factor (that is, wk⊛ hk). Because this is a more reliable method of optimizing K for convNMF than the cross-validation approach we developed previously, we have added a new main figure (Figure 5) and a new supplementary figure (Figure 5—figure supplement 1) and have incorporated this into our Results section “Strategy for choosing K rather than choosing λ?”

Union of Intersections

We next wondered if the approach of UoI-NMF could be adapted to improve the performance of seqNMF. To apply this method to the convolutional case we first had to make several adjustments to the UoI algorithm. First, we simplified things by skipping the first step of estimating the appropriate rank using stability NMF and simply chose K to be the ground-truth rank of our synthetic data. Because convolutional NMF relies on structure across observations in time it is not possible to bootstrap our data in the conventional way by taking random samples of observations. Instead we took the approach that we used for cross-validation in Figure 5—figure supplement 2; we randomly held out individual bins from the dataset during fitting and defined the remaining data as our bootstrapped sample. Finally, rather than solving a non-negative least squares problem, we took advantage of the multiplicative update rules to constrain our factorizations. After taking the intersections of our factors from many different bootstrapped fits we defined an intersection mask where an element was defined as 1 if it was a member of the intersection and zero otherwise. We then used this mask as an initialization of our factorization, and the multiplicative nature of the updates ensures that elements which are initialized as 0 remain fixed at zero throughout the optimization thus enforcing the intersection.

Before applying our implementation of UoI-NMF to convNMF, we first tested that our changes to the UoI algorithm did not negatively impact its ability to robustly identify bases in an NMF model (Author response image 1). We generated a ground-truth dataset using synchronous patterns and compared the ability of UoI-NMF and NMF to extract the ground-truth bases. As expected, our adapted implementation of UoI-NMF extracted factors that were much more similar to the ground-truth set than NMF for a range of noise conditions (Author response image 1).

Author response image 1
Validation of our adaptations of UoI-NMF.

Similarity to ground-truth as a function of additive noise level for UoI-NMF factorizations and NMF factorizations using the algorithmic changes detailed above.

Note that UoI-NMF extracts bases which are closer to the ground-truth patterns than regular NMF.

https://doi.org/10.7554/eLife.38471.036

After ensuring that our adaptations of the UoI-NMF algorithm still yielded good results in the NMF case, we tested it on convNMF, but found that it gave mixed results. In the best cases, UoI-convNMF returned factors which were a 5-10% better match to the ground-truth factors than regular convNMF, but in some conditions, UoI-convNMF returned factors which were a 5-10% worse match, and our attempts to further adapt the algorithm or change parameters sometimes yielded even worse results. We believe that this is primarily due to the challenge that convNMF factors on individual runs can be shifted in time relative to one another, as mentioned above. This could, in principle, cause problems during the intersection step of the algorithm, because misaligned bases will tend not to overlap. In summary, the UoI approach is a promising way to address bases correlations in convNMF, but we believe that the additional extensions necessary for good performance in the convNMF model are beyond the scope of our paper. In our Discussion section, we have added a description of the UoI-NMF method, highlighting it as a promising possible extension for discovering robust and sparse sequential bases.

2) Method for selecting regularization parameters. In the seqNMF algorithm, the balance between reconstruction and un-correlated bases is modulated by the λ hyperparamter. The authors provide various heuristics for choosing λ to find the correct bases. The main heuristic used for the main figure use a heuristic of 2-5x of the λ value that gives the cross-over in normalized reconstruction vs. correlation costs (λ0). The authors justify this as giving good results in the synthetic data sets and having a reasonable overlap with cross-validated reconstruction error. However, for many of the synthetic and real data sets, lambda0 also lies at the most sensitive part of the range, and small modulations lead to large changes in the trade-off of reconstruction vs. correlation. That is, the results are likely to be very sensitive to this selection. Furthermore, the selected values often result in very poor data reconstruction (e.g., Figure 6C,D and Figure 7B,F). As quantified, this is a big problem for 'interpretability', as it is very hard to interpret the results of a method that does not have good data prediction quality. In science, ground-truth is not known, and a major metric we have to quantitatively evaluate methods is reconstruction error/parsimony. If the authors could show that, something like cross-validated Bayesian Information Criterion, was optimized for the selected values of λ, that would be much more convincing.

In summary, the heuristic method used to select the regularization strength, which is a major component of the reported work, is not sufficiently justified from a statistical-machine learning perspective.

We agree that our method for choosing λ is heuristic and is not well motivated formally from a machine learning perspective. However, we have also confirmed that our heuristic method agrees with the results of a cross-validation procedure, as presented in Figure 5—figure supplement 2. The reason we kept the heuristic in the main text is because it is more robust in low-noise cases (where overfitting is minimal) and is more interpretable for some noise types such as warping. In our action plan we stated that we would explore the use of cross-validated reconstruction error in the selection of λ, but we were unable to find any methods that performed better than our heuristic. Furthermore, our analysis of synthetic datasets reveals that the performance on a given dataset is not very sensitive to the choice of λ in the range we have described. In fact, in most cases, we recover the correct number of factors over a range of λ spanning nearly an order of magnitude. We have quantified this in a new figure (Figure 4—figure supplement 1).

The reviewer suggests that the optimal value of λ often results in very poor data reconstruction. This interpretation of Figure 6C,D and 7B,F, is incorrect. The plots of reconstruction that the reviewer refers to are normalized to go between 0 and 1, and the overall reconstruction error cannot be interpreted from these plots. At low λ, reconstruction error reaches a minimum which corresponds to the unregularized convNMF reconstruction error. In contrast, at high λ, reconstruction error saturates at a value that is determined by the amount of variance which can be explained by a single factor (often quite a lot). This is because the x-ortho penalty does not affect K=1 factorizations. For the case of single sequential pattern embedded in high noise, the low λ region reflects massive over-fitting of noise, whereas the saturated region at high lambdas will represent ideal performance (even though it is the “worst” reconstruction error of any range of λ values). For the case of multiple sequences embedded in noise, the leftmost region still represents massive over-fitting while the right-most region represents the error incurred by incorrectly using a 1 factor model. Thus, by definition the best parameter value must lie on the steep region of the reconstruction error as factorizations move from an essentially un-penalized 20 factor factorization to an un-penalized 1 factor factorization. In other words, reconstruction error is large because the data are noisy, and seqNMF is fitting only the repeatable sequences, and not the noise. Furthermore, values of λ that produce better reconstruction error do so by overfitting the data, that is, producing factors with lower similarity to ground-truth (Figure 4E and 4K).

https://doi.org/10.7554/eLife.38471.041

Article and author information

Author details

  1. Emily L Mackevicius

    McGovern Institute for Brain Research, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, United States
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing—original draft, Writing—review and editing
    Contributed equally with
    Andrew H Bahle
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-6593-4398
  2. Andrew H Bahle

    McGovern Institute for Brain Research, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, United States
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing—original draft, Writing—review and editing
    Contributed equally with
    Emily L Mackevicius
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-0567-7195
  3. Alex H Williams

    Neurosciences Program, Stanford University, Stanford, United States
    Contribution
    Conceptualization, Formal analysis, Validation, Methodology, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-5853-103X
  4. Shijie Gu

    1. McGovern Institute for Brain Research, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, United States
    2. School of Life Sciences and Technology, ShanghaiTech University, Shanghai, China
    Contribution
    Data curation
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-6257-5756
  5. Natalia I Denisenko

    McGovern Institute for Brain Research, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, United States
    Contribution
    Investigation
    Competing interests
    No competing interests declared
  6. Mark S Goldman

    1. Center for Neuroscience, Department of Neurobiology, Physiology and Behavior, University of California, Davis, Davis, United States
    2. Department of Ophthamology and Vision Science, University of California, Davis, Davis, United States
    Contribution
    Conceptualization, Formal analysis, Supervision, Funding acquisition, Methodology, Writing—original draft, Project administration, Writing—review and editing
    For correspondence
    msgoldman@ucdavis.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-8257-2314
  7. Michale S Fee

    McGovern Institute for Brain Research, Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, United States
    Contribution
    Conceptualization, Resources, Formal analysis, Supervision, Funding acquisition, Validation, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing
    For correspondence
    fee@mit.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-7539-1745

Funding

Simons Foundation (Simons Collaboration for the Global Brain)

  • Mark S Goldman
  • Michale S Fee

National Institute on Deafness and Other Communication Disorders (R01 DC009183)

  • Michale S Fee

G Harold and Leila Y. Mathers Foundation

  • Michale S Fee

U.S. Department of Defense (NDSEG Fellowship program)

  • Emily L Mackevicius

Department of Energy, Labor and Economic Growth (Computational Science Graduate Fellowship (CSGF))

  • Alex H Williams

NIH Office of the Director (Training Grant 5T32EB019940-03)

  • Andrew H Bahle

National Institute of Neurological Disorders and Stroke (U19 NS104648)

  • Mark S Goldman

National Institute of Mental Health (R25 MH062204)

  • Mark S Goldman
  • Michale S Fee

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

This work was supported by a grant from the Simons Collaboration for the Global Brain, the National Institutes of Health (NIH) [grant number R01 DC009183] and the G Harold and Leila Y Mathers Charitable Foundation. ELM received support through the NDSEG Fellowship program. AHB received support through NIH training grant 5T32EB019940-03. MSG received support from the NIH [grant number U19NS104648]. AHW received support from the U.S. Department of Energy Computational Science Graduate Fellowship (CSGF) program. Thanks to Pengcheng Zhou for advice on his CNMF_E calcium data cell extraction algorithm. Thanks to Wiktor Młynarski for helpful convNMF discussions. Thanks to Michael Stetner, Galen Lynch, Nhat Le, Dezhe Jin, Edward Nieh, Adam Charles, Jane Van Velden and Yiheng Wang for comments on the manuscript and on our code package. Thanks to our reviewers for wonderful suggestions, including the use of diss to select K, and using seqNMF to measure ‘sequenciness’. Special thanks to the 2017 Methods in Computational Neuroscience course [supported by NIH grant R25 MH062204 and Simons Foundation] at the Woods Hole Marine Biology Lab, where this collaboration was started.

Ethics

Animal experimentation: Animal care and experiments were carried out in accordance with NIH guidelines, and reviewed and approved by the Massachusetts Institute of Technology Committee on Animal Care (protocol 0715-071-18).

Senior Editor

  1. Timothy E Behrens, University of Oxford, United Kingdom

Reviewing Editor

  1. Laura Colgin, The University of Texas at Austin, United States

Publication history

  1. Received: May 19, 2018
  2. Accepted: January 4, 2019
  3. Version of Record published: February 5, 2019 (version 1)
  4. Version of Record updated: February 7, 2019 (version 2)
  5. Version of Record updated: August 6, 2019 (version 3)

Copyright

© 2019, Mackevicius et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 2,875
    Page views
  • 445
    Downloads
  • 1
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

Further reading

    1. Neuroscience
    Ethan A Heming et al.
    Research Article
    1. Neuroscience
    Caroline Lei Wee et al.
    Research Article