A spike sorting toolbox for up to thousands of electrodes validated with ground truth recordings in vitro and in vivo
 Cited 4
 Views 2,462
 Annotations
Abstract
In recent years, multielectrode arrays and large silicon probes have been developed to record simultaneously between hundreds and thousands of electrodes packed with a high density. However, they require novel methods to extract the spiking activity of large ensembles of neurons. Here, we developed a new toolbox to sort spikes from these largescale extracellular data. To validate our method, we performed simultaneous extracellular and loose patch recordings in rodents to obtain ‘ground truth’ data, where the solution to this sorting problem is known for one cell. The performance of our algorithm was always close to the best expected performance, over a broad range of signaltonoise ratios, in vitro and in vivo. The algorithm is entirely parallelized and has been successfully tested on recordings with up to 4225 electrodes. Our toolbox thus offers a generic solution to sort accurately spikes for up to thousands of electrodes.
https://doi.org/10.7554/eLife.34518.001Introduction
As local circuits represent information using large populations of neurons throughout the brain (Buzsáki, 2010), technologies to record hundreds or thousands of them are therefore essential. One of the most powerful and widespread techniques for neuronal population recording is extracellular electrophysiology. Recently, newly developed microelectrode arrays (MEA) have allowed recording of local voltage from hundreds to thousands of extracellular sites separated only by tens of microns (Berdondini et al., 2005; Fiscella et al., 2012; Lambacher et al., 2004), giving indirect access to large neural ensembles with a high spatial resolution. Thanks to this resolution, the spikes from a single neuron will be detected on several electrodes and produce extracellular waveforms with a characteristic spatiotemporal profile across the recording sites. However, this high resolution comes at a cost: each electrode receives the activity from many neurons. To access the spiking activity of individual neurons, one needs to separate the waveform produced by each neuron and identify when it appears in the recording. This process, called spike sorting, has received a lot of attention for recordings with a small number of electrodes (typically, a few tens of electrodes). However, for largescale and dense recordings, it is still unclear how to extract the spike contributions from extracellular recordings. In particular, for thousands of electrodes, this problem is still largely unresolved.
Classical spike sorting algorithms cannot process this new type of data for several reasons. First, many algorithms do not take into account that the spikes of a single neuron will evoke a voltage deflection on many electrodes. Second, most algorithms do not scale up to hundreds or thousands of electrodes in vitro and in vivo, because their computation time would increase exponentially with the number of electrodes (Rossant et al., 2016). A few algorithms have been designed to process largescale recordings (Marre et al., 2012; Pillow et al., 2013; Pachitariu et al., 2016; Leibig et al., 2016; Hilgen et al., 2017; Chung et al., 2017; Jun et al., 2017), but they have not been tested on real ‘ground truth’ data.
In ground truth data, one neuron is cherry picked from among all the neurons recorded using an extracellular array using another technique, and simultaneousy recorded. Unfortunately, such data are rare. Dual loose patch and extracellular recordings have been performed for culture of neurons or in cortical slices (Anastassiou et al., 2015; Franke et al., 2015). However, in this condition, only one or two neurons emit spikes, and this simplifies drastically the spike sorting problem. Ground truth data recorded in vivo are scarce (Henze et al., 2000; Neto et al., 2016) and in many cases the patchrecorded neuron is relatively far from the extracellular electrodes. As a result, most algorithms have been tested in simulated cases where an artificial template is added at random times to an actual recording. However, it is not clear if this simulated data reproduce the conditions of actual recordings. In particular, waveforms triggered by a given neuron can vary in amplitude and shape, and most simulations do not reproduce this feature of biological data. Also, spike trains of different cells are usually correlated, and these correlations can make extracellular spikes that do overlap.
Here, we present a novel toolbox for spike sorting in vitro and in vivo, validated on ground truth recordings. Our sorting algorithm is based on a combination of densitybased clustering and template matching. To validate our method, we performed experiments where a largescale extracellular recording was performed while one of the neurons was recorded with a patch electrode. We showed that the performance of our algorithm was always close to an optimal classifier, both in vitro and in vivo. We demonstrate that our sorting algorithm could process recordings from up to thousands of electrodes with similar accuracy. To handle data from thousands of electrodes, we developed a tool automating the step that is usually left to manual curation.
Our method is a fast and accurate solution for spike sorting for up to thousands of electrodes, and we provide it as a freely available software that can be run on multiple platforms and several computers in parallel. Our ground truth data are also publicly available and will be a useful resource to benchmark future improvements in spike sorting methods.
Results
Spike sorting algorithm
We developed an algorithm (called SpyKING CIRCUS) with two main steps: a clustering followed by a template matching step (see Materials and methods for details). First, spikes are detected as threshold crossings (Figure 1A) and the algorithm isolated the extracellular waveforms for a number of randomly chosen spike times. In the following text, we will refer to the extracellular waveforms associated with a given spike time as snippets.
We divided the snippets into groups, depending on their physical positions: for every electrode we grouped together all the spikes having their maximum peak on this electrode. Thanks to this division, the ensemble of spikes was divided into as many groups as there were electrodes. The group associated with electrode $k$ contains all the snippets with a maximum peak on electrode $k$. It was possible that, even among the spikes peaking on the same electrode, there could be several neurons. We thus performed a clustering separately on each group, in order to separate the different neurons present in a single group.
For each group, the snippets were first masked: we assumed that a single cell can only influence the electrodes in its vicinity, and only kept the signal on electrodes close enough to the peak (Figure 1B, see Materials and methods). Due to to this reduction, the memory needed for each clustering did no scale with the total number of electrodes. The snippets were then projected into a lower dimensional feature space using Principal Component Analysis (PCA) (usually five dimensions, see Materials and methods), as is classically done in many spike sorting algorithms (Rossant et al., 2016; Einevoll et al., 2012). Note that the simple division in groups before clustering allowed us to parallelize the clustering step, making it scalable for even thousands of electrodes. ierreEven if a spike is detected on several electrodes, it is only assigned to the electrode where the voltage peak is the largest: if a spike has its largest peak on electrode 1, but is also detected on electrode 2, it will only be assigned to electrode1 (see Figure 1—figure supplement 1).
For each group, we performed a densitybased clustering inspired by (Rodriguez and Laio, 2014) (see Materials and methods). The idea of this algorithm is that the centroid of a cluster in the feature space should have many neighbours, that is a high density of points in their neighborhood. The centroid should also be the point where this density is a local maximum: there should not be a point nearby with a higher density. To formalize this intuition, for each point we measured the average distance of the 100 closest points $rho$ (intuitively, this is inversely proportional to the local density of points), and the distance $\delta $ to the closest point of higher density (i.e. with a lower $\rho $). Centroids should have a low $\rho $ and a high $\delta $. We hypothesized that there was a maximum of ten clusters in each group (i.e. at most ten different cells peaking on the same electrode) and took the ten points with the largest $\delta /\rho$ ratio as the centroids. Each remaining point was then assigned iteratively to the nearest point with highest density, until they were all assigned to the centroids (see Materials and methods for details  note that all the numbers mentioned here are parameters that are tunable in our toolbox).
Thanks to this method we could find many clusters, corresponding to putative neurons. In many spikesorting methods, it is assumed that finding clusters is enough to solve the spikesorting problem (Chung et al., 2017). However, this neglects the specific issue of overlapping spikes (see Figure 1C). When two nearby cells spike synchronously, the extracellular waveforms evoked by each cell will superimpose (Figure 1C, see also [Pillow et al., 2013]). This superimposition of two signals coming from two different cells will distort the feature estimation. As a result, these spikes will appear as points very far away from the cluster associated to each cell. An example of this phenomena is illustrated in Figure 1C. Blue and red points correspond to the spikes associated to two different cells. In green, we show the spikes of one cell when the waveform of another one was added at different delays. For short delays, the presence of this additional waveform strongly distort the feature estimation. As a result, the corresponding point is far from the initial cluster, and will be missed by the clustering. To overcome this issue, we performed a template matching as the next step of our algorithm.
For this, we first extracted a template from each cluster. This template is a simplified description of the cluster and is composed of two waveforms. The first one is the average extracellular waveform evoked by the putative cell (Figure 1C, left and red waveforms). The second is the direction of largest variance that is orthogonal to this average waveform (see Materials and methods). We assumed that each waveform triggered by this cell is a linear combination of these two components. Thanks to these two components, the waveform of the spikes attributed to one cell could vary both in amplitude and in shape.
At the end of this step, we should have extracted an ensemble of templates (i.e. pairs of waveforms) that correspond to putative cells. Note that we only used the clusters to extract the templates. Our algorithm is thus tolerant to some errors in the clustering. For example, it can tolerate errors in the delineation of the cluster border. The clustering task here is therefore less demanding than in classical spike sorting algorithms where finding the correct cluster borders is essential to minimize the final error rate. By focusing on only getting the cluster centroids, we should thus have made the clustering task easier, but all the the spikes corresponding to one neuron have yet to be found. We therefore used a template matching algorithm to find all the instances where each cell has emitted a spike.
In this step, we assumed that the templates of different cells spiking together sum linearly and used a greedy iterative approach inspired by the projection pursuit algorithm to match the templates to the raw data (Figure 1D, see Materials and methods). Within a piece of raw data, we looked for the template whose first component had the highest similarity to the raw signal (here similarity is defined as the scalar product between the first component of the template and the raw data) and matched its amplitude to the signal. If this amplitude falls between predetermined thresholds (see Materials and methods), we matched and subtracted the two components to the raw signal. These predetermined thresholds reflect the prior that the amplitude of the first component should be around 1, which corresponds to the average waveform evoked by the cell. We then reiterated this matching process until no more spike could be matched (Figure 1D,E) (see Materials and methods). We found many examples where allowing amplitude variation wasa desirable feature (see Figure 1F).
After this template matching step, the algorithm outputs putative cells, described by the templates, and associated spike trains, that is spike times where the template was matched to the data.
Performance on ground truth data
To test our algorithm, we performed dual recordings (Figure 2A,B) using both a multielectrode array to record many cells (see schematic on Figure 2A), and simultaneous loose patch to record the spikes of one of the cell (Figure 2B). For this cell we know what should be the output of the spike sorting. In vitro, we recorded 18 neurons from 14 retinas with a 252 electrode MEA where the spacing between electrodes was 30 $\mu \text{m}$ (see Materials and methods, (Spampinato et al., 2018)). We also generated datasets where we removed the signals of some electrodes, such that the density of the remaining electrodes was either 42 or 60 $\mu \text{m}$ (by removing half or 3 quarters of the electrodes, respectively).
We then ran the spike sorting algorithm only on the extracellular data, and estimated the error rate (as the mean of false positives and false negatives, see Materials and methods) for the cell recorded in loose patch, where we know where the spikes occurred. The performance of the algorithm is limited not only by imperfections of the algorithm, but also by several factors related to the ground truth data themselves. In particular, some of the cells recorded with loose patch can evoke only a small spike on the extracellular electrode, for example if they are far from the nearest electrode or if their soma is small (Buzsáki, 2004). If a spike of the patchrecorded cell triggers a large voltage deflection, this cell should be easy to detect. However, if the triggered voltage deflection is barely detectable, even the best sorting algorithm will not perform well. Looking at Figure 2C, for in vitro data (see Materials and methods), we found that there was a correlation between the error rate of our algorithm and the size of the extracellular waveform evoked by the spikes of the patchrecorded cell: the higher the waveform, the lower the error rate.
We then asked if our algorithm is close to the ‘best’ possible performance, that is the only errors are due to intrinsic limitations in the ground truth data (e.g. small waveform size).
There is no method to exactly estimate this best possible performance. However, a proxy can be found by training a nonlinear classifier on the ground truth data (Harris et al., 2000; Rossant et al., 2016). We trained a nonlinear classifier on the extracellular waveforms triggered by the spikes of the recorded cell, similar to (Harris et al., 2000; Rossant et al., 2016) (referred to as the Best Ellipsoidal Error Rate (BEER), see Materials and methods). This classifier ‘knows’ where the true spikes are and simply quantifies how well they can be separated from the other spikes based on the extracellular recording. Note that, strictly speaking, this BEER estimate is not a lower bound of the error rate. It assumes that spikes can be all found inside a region of the feature space delineated by ellipsoidal boundaries. As we have explained above, spikes that overlap with spikes from another cell will probably be missed and this ellipsoidal assumption is also likely to be wrong in case of bursting neurons or electrodetissue drifts. However, we used the BEER estimate because it has been used in several papers describing spike sorting methods (Harris et al., 2000; Rossant et al., 2016) and has been established as a commonly accepted benchmark. In addition, because we used rather stationary recordings (few minutes long, see Materials and methods), we did not see strong electrodetissue drifts.
We estimated the error made by the classifier and found that the performance of our algorithm almost always was in the same order of magnitude as the performance of this classifier, (Figure 2D, left; $r=0.89$, $p<{10}^{5}$) over a broad range of spike sizes. For 37 neurons with large waveform sizes (above) the average error of the classifier is 2.7% and the one for our algorithm is 4.8% (see Figure 2E). For 22 neurons with lower spike size (below), the average error of the classifier is 11.1% and the one for our algorithm is 15.2%. This suggests that our algorithm reached an almost optimal performance on this in vitro data.
We also used similar ground truth datasets recorded in vivo in rat cortex using dense silicon probes with either 32 or 128 recording sites (Neto et al., 2016). With the same approach as for in vitro data, we also found that our algorithm achieved near optimal performance (Figure 2F, right; $r=0.92$, $p<{10}^{5}$), although there were only two recordings where the spike size of the patchrecorded neuron was large enough to be sorted with a good accuracy. For only two available neurons with low optimal error rate, the average error of the classifier is 13.9% and the one for our algorithm is 14.8%. For 24 neurons with lower spike size, the average error of the classifier is 64.0% and the one for our algorithm is 67.8%. Together, these results show that our algorithm can reach a satisfying performance (i.e. comparing to the classifier error) over a broad range of spike sizes, for both in vivo and in vitro recordings.
We also compared our performance to the Kilosort algorithm (Pachitariu et al., 2016) and found similar performances (4.4% on average over all nondecimated neurons for SpyKING CIRCUS against 4.2% for Kilosort). Because Kilosort used a GPU, it could be run faster than our algorithm on a single machine: on a 1 hr recording with 252 electrodes, Kilosort can achieve a four times speedup on a standard desktop machine (see Materials and methods). But without using a GPU, Kilosort was only marginally faster (1.5 speedup), and slower if we started using several cores of the machine. However, is it worth noticing that the speedup of Kilosort comes at the cost of an increased usage of memory. Kilosort used 32 GB of RAM for a maximal number of 500 neurons, while our algorithm had a much lower memory footprint, because of design choices. We have therefore found a trade off where execution speed is slightly slower, but much less memory is used. Thanks to this, we could run our algorithm to process recordings with thousands of electrodes, while Kilosort does not scale up to this number. In the next section, we demonstrate that our algorithm still works accurately at that scale.
Scaling up to thousands of electrodes
A crucial condition to process recordings performed with thousands of electrodes is that every step of the algorithm should be run in parallel over different CPUs. The clustering step of our algorithm was run in parallel on different subsets of snippets as explained above. The template matching step could be run independently on different blocks of data, such that the computing time only scaled linearly with the data length. Each step of the spike sorting algorithm was therefore parallelized. The runtime of the full algorithm decreased proportionally with the numbers of CPU cores available (Figure 3A, grey area indicates where the software is ‘realtime’ or faster). As a result, the sorting algorithm could process 1 hr of data recorded with 252 electrodes in 1 hr with 9 CPU cores (spread over three computers) (Figure 3A,B). It also scaled up linearly with the number of electrodes (Figure 3B), and with the number of templates (Figure 3C). It was therefore possible to run it on long recordings with more than 4000 electrodes, and the runtime could be be divided by simply having more CPUs available.
To test the accuracy of our algorithm on 4225 electrodes, we generated hybrid ground truth datasets where artificial spikes were added to real recordings performed on mouse retina in vitro (see Materials and methods). We ran the spikesorting algorithm on different datasets, picked some templates and used them to create new artificial templates that we added at random places to the real recordings (see Materials and methods). This process, as shown in Figure 3D allowed us to generate ‘hybrid’ datasets were we know the activity of a number of artificially injected neurons. We then ran our sorting algorithm on these datasets and measured if the algorithm was able to find at which times the artificial spikes were added. We counted a falsenegative error when an artificial spike was missed and a falsepositive error when the algorithm detected a spike when there was not any (see Materials and methods). Summing these two types of errors, the total error rate remained below 5% for all the spikes whose size was significantly above spike detection threshold (normalized amplitude corresponds to the spike size divided by the spike threshold). Error rates were similar for recordings with 4255 electrodes in vitro, 128 electrodes in vivo or 252 electrodes in vitro. Performance did not depend on the firing rate of the injected templates (Figure 3E) and only weakly on the normalized amplitude of the templates (Figure 3F), as long as it was above the spike threshold. The accuracy of the algorithm is therefore invariant to the size of the recordings.
A crucial issue when recording thousands of densely packed electrodes is that more and more spikes overlap with each other. If an algorithm misses overlapping spikes, then the estimation of the amplitude of correlations between cells will be biased. To test if our method was able to solve the problem of overlapping spikes and thus estimate correlations properly, we generated hybrid datasets where we injected templates with a controlled amount of overlapping spikes (see Materials and methods). We then ran the sorting algorithm and compared the injected spike trains and the spike trains recovered by SpyKING CIRCUS. We then compared the correlation between both pairs. If some overlapping spikes were missed by the algorithm, the correlation between the sorted spike trains should be lower than the correlation between the injected spike trains. We found that our method was always able to estimate the pairwise correlation between the spike trains with no underestimation (Figure 3G). Overlapping spikes were therefore correctly detected by our algorithm. The ability of our template matching algorithm to resolve overlapping spikes thus allowed an unbiased estimation of correlations between spike trains, even for thousands of electrodes.
These different tests, described above, show that SpyKING CIRCUS reached a similar performance for 4225 electrodes than for hundreds electrodes, where our ground truth recordings showed that performance was near optimal. Our algorithm is therefore also able to sort accurately recordings from thousands of electrodes.
Automated merging
As in most spikesorting algorithms, our algorithm may split one cell into several units. After running the entire algorithm, it is therefore necessary to merge together the units corresponding to the same cell. However, for hundreds or thousands of electrodes, going through all the pairs of units and merging them by hand would take a substantial amount of time. To overcome this problem, we designed a tool to merge automatically many units at once so that the time spent on this task does not scale with the number of electrodes and this allows us to automate this final step.
Units that likely belong to the same cell (and thus should be merged) have templates that look alike and in addition, the combined crosscorrelogram between the two cell’s spike trains shows a clear dip near 0 ms, indicating that the merged spike trains do not show any refractory period violation (Figure 4A, blue example). In order to automate this merging process, we formalized this intuition by estimating for each pair of units two factors that reflect if they correspond to the same cell or not. For each pair of templates, we estimated first the similarity between templates and second the dip in the center of the crosscorrelogram. This dip is measured as the difference between the geometrical mean of the firing rate $\varphi $ (i.e. the baseline of the crosscorrelogram) and the value of the crosscorrelogram at delay 0 ms, $\u27e8CC\u27e9$ (see Materials and methods and right insets in Figure 4A).
We plotted for each pair with high similarity the dip estimation against the geometrical mean of their firing rates. If there is a strong dip in the crosscorrelogram (quantified by $\varphi \u27e8CC\u27e9$), the dip quantification and the geometrical mean, $\varphi $, should be almost equal, and the corresponding pair should thus be close to the diagonal in the plot.
In one example, where we artificially split synthetic spike trains (Figure 4A; see Materials and methods), we could clearly isolate a cluster of pairs lying near this diagonal, corresponding to the pairs that needed to be merged (Figure 4A right panels). We have designed a GUI such the user can automatically select this cluster and merge all the pairs at once. Thanks to this, with a single manipulation by the user, all the pairs are merged.
We then tested this method on our in vitro ground truth data. In these recordings, the cell recorded with loose patch might be split by the algorithm between different spike trains. We can determine the units that correspond to the patchrecorded cell. For one particular neuron taken from our database, we can visualize all the units that need to be merged together (blue points in Figure 4B), and that should not be merged with units corresponding to other cells (green pairs in Figure 4B). For all the other cells, we do not have access to a ground truth, and thus cannot decide if the pairs should be merged or not (orange pairs in Figure 4B).
To automate the merging process entirely, we defined two simple rules to merge two units: first, their similarity should be above a threshold (similarity threshold, 0.8 in our experiments). Second, the dip estimation for this pair should be close to the geometrical mean of firing rates, that is their difference should be below a threshold (dip threshold). In practice, the corresponding point in Figure 4B should be above a line parallel to the diagonal. We used these rules to perform a fully automatic merging of all units. We then estimated the error rate for the ground truth cell, in the same way as the previous section. We also estimated the lowest error rate possible error rate by merging the units using the ground truth data for guidance (Best Merges, see aterials and m). We found that the error rate obtained with our automated method was close to this best error rate (Figure 4C). We have therefore automated the process of merging spike trains while keeping a low error rate. The performance did not vary much with the values of the two parameters (similarity threshold and dip threshold), and we used the same parameters for all the different datasets. This shows that the sorting can be fully automated while limiting the error rate to a small value. We thus have a solution to fully automate the sorting, including all the decisions that need to be taken during the manual curation step. However, because we used crosscorrelograms in order to help automate the merging process, it is worth noticing that one can no longer use crosscorrelograms as a validation metric.
Discussion
We have shown that our method, based on densitybased clustering and template matching, allows sorting spikes from largescale extracellular recordings both in vitro and in vivo. We tested the performance of our algorithm on ‘ground truth' datasets, where one neuron is recorded both with extracellular recordings and with a patch electrode. We showed that our performance was close to an optimal nonlinear classifier, trained using the true spike trains. Our algorithm has also been tested on purely synthetic datasets (Hagen et al., 2015) and similar results were obtained (data not shown). Note that tests were performed by different groups on our algorithm and show its high performance on various datasets (see http://spikesortingtest.com/ and http://phy.cortexlab.net/data/sortingComparison/). Our algorithm is entirely parallelized and could therefore handle long datasets recorded with thousands of electrodes. Our code has already been used by other groups (Denman et al., 2017; Mena et al., 2017; Chung et al., 2017; Wilson et al., 2017) and is available as a complete multiplatform, open source software for download (http://spykingcircus.rtfd.org) with a complete documentation. Note that all the parameters mentioned in the description of the algorithm can be modified easily to work with different kinds of data. We have made all the ground truth data available for download (see Source Code section in Materials and methods), so that improvements in our algorithm as well as alternative solutions could be benchmarked easily in the future.
Classical approaches to the spike sorting problem involve extracting some features from each detected spike (Hubel, 1957; Meister et al., 1994; Lewicki, 1994; Einevoll et al., 2012; Quiroga et al., 2004; Hill et al., 2011; Pouzat et al., 2002; Litke et al., 2004; Chung et al., 2017) and clustering the spikes in the feature space. In this approach, the spike sorting problem is reduced to a clustering problem and this introduces several major problems. First, to assign the spikes to the correct cell, the different cellsmust be separated in the feature space. Finding the exact borders of each cell in the feature space is a hard task that cannot be easily automated (but see [Chung et al., 2017]). Second, running a clustering algorithm on data with thousands of electrodes is very challenging. Finally, overlapping spikes will appear as strong deviations in the feature space and will therefore be missed in this approach. These three issues preclude the use of this approach for largescale recordings with dense arrays of electrodes. In comparison, here we have parallelized the clustering step efficiently, using a template matching approach, so that we only needed to infer the centroid of each cluster and not their precise borders. The template matching approach also allowed us to deconvolve overlapping spikes in a fast, efficient and automated manner. Some template matching approaches have been previously tested, mostly on in vitro data (Marre et al., 2012; Pillow et al., 2013; Franke et al., 2015b) but were not validated on ground truth datasets like the ones we acquired here. Also, they only had one waveform for each template, which did not allow any variation in the shape of the spike, while we have designed our template matching method to take into account not only variation in the amplitude of the spike waveform, but also in shape. Finally, several solutions did not scale up to thousands of electrodes. All GPUbased algorithms (Pachitariu et al., 2016; Lee et al., 2017; Jun et al., 2017) only scale for a few hundreds channels, and face severe memory issues for larger probes.
Finally, a common issue when sorting spikes from hundreds or thousands of electrodes is the time spent on manual curation of the data. Here, we have designed a tool to automate this step by merging units corresponding to the same cell all at once, based onthe crosscorrelogram between cells and the similarity between their templates. Having an objective criterion for merging spike trains not only speeds up the manual curation time, it also makes the results less sensitive to human errors and variability in decisions. In some cases, it might be necessary to take into account additional variables that are specific to the experiment, but even then our tool will still significantly reduce the time spent on manual curation.
Our method is entirely parallel and can therefore be run in ‘real time’ (i.e. 1 hr of recording processed in 1 hr) with enough computer power. This paves the way towards online spike sorting for largescale recordings. Several applications, likebrain machine interfaces, or closedloop experiments (Franke et al., 2012; Hamilton et al., 2015; Benda et al., 2007), will require an accurate online spike sorting. This will require adapting our method to process data ‘on the fly’, processing new data blocks when they become available and adapting the shape of the templates over time.
Materials and methods
Experimental recordings
In vitro recordings with 252 or 4225 electrodes
Retinal tissue was obtained from adult (8 weeks old) male LongEvans rat (Rattus norvegicus) or mouse (mus musculus, 4–9 weeks old) and continuously perfused with Ames Solution (SigmaAldrich) and maintained at 32 C. Ganglion cell spikes were recorded extracellularly from a multielectrode array with 252 electrodes spaced 30 or 60 $\mu \text{m}$ apart (MultiChannel Systems) or with 4225 electrodes arranged in a 2D grid and spacedby 16 $\mu \text{m}$ (Wilson et al., 2017),4] at a sampling rate of 20 kHz. Experiments were performed in accordance with institutional animal care standards.
For the ground truth recordings, electrophysiological recordings were obtained from ex vivo isolated retinae of rd1 mice (4/5 weeks old). The retinal tissue was placed in AMES medium (SigmaAldrich, St Louis, MO; A1420) bubbled with 95% ${\text{O}}_{2}$ and 5% ${{\displaystyle \text{CO}}}_{2}$ at room temperature, on a MEA (10 $\mu \text{m}$ electrodes spaced by 30 $\mu \text{m}$; Multichannel Systems, Reutlingen, Germany) with ganglion cells layer facing the electrodes. Borosilicate glass electrodes (BF10050, Sutter instruments) were filled with AMES with a final impedance of 6–9 M$\Omega $. Cells were imaged with a customized inverted DIC microscope (Olympus BX 71) mounted with a high sensitivity CCD Camera (Hamamatsu ORCA −03G) and recorded with an Axon Multiclamp 700B patch clamp amplifier set in current zero mode. We used rd1 mice because going through the photoreceptor layer with the patch pipette was easier than for a wildtype mouse.
For the data shown in Figures 1 and 3, we used a recording of 130 min. For the data shown in Figure 2A, 16 neurones were recorded over 14 intact retinas. Recording durations all lasted 5 min. The thresholds for the detection of juxtacellular spikes were manually adjusted for all the recordings (Spampinato et al., 2018)
In vivo recordings with 128 electrodes
We use the freely available datasets provided by (Neto et al., 2016). Those are 32 or 128 dense silicon probes recordings (20 $\mu \text{m}$ spacing) at 30 kHz performed in rat visual cortex, combined with juxtacellular recordings. The dataset gave us a total of 13 neurons for Figure 2.C with recordings between 4 and 10 min each. Similarly to the in vitro case, the detection thresholds for the juxtacellular spikes were manually adjusted based on the data provided by Neto et al. (2016) and on spiketriggered waveforms. For the validation with ‘hybrid’ dataset, shown in Figure 3, we used the longest dataset recorded with 128 electrodes.
Details of the algorithm
In the following, we consider that we have ${N}_{\text{elec}}$ electrodes, acquired at a sampling rate ${f}_{\text{rate}}$. Every electrode $k$ is located at a physical position ${p}_{k}=({x}_{k},{y}_{k})$ in a 2D space (extension to 3D probes would be straightforward). The aims of our algorithm is to decompose the signal as a linear sum of spatiotemporal kernels or ‘templates’ (see equation 1).
where $s\left(t\right)$ is the signal recorded over ${N}_{\text{elec}}$ electrodes and over multiple time points. ${w}_{j}(t{t}_{i})$ and ${v}_{j}(t{t}_{i})$ are the two components of the template associated to each cell. They represent the waveform triggered on the electrodes by cell $j$. Times $\left\{{t}_{i}\right\}$ are the putative spike times over all the electrodes. ${a}_{ij}$ and ${b}_{ij}$ are the amplitude factors for spike time ${t}_{i}$ for the template $j$, and $e\left(t\right)$ is the background noise. reNote that at a given spike time ${t}_{i}$, it is likely that only a couple of cells fire a spike. Only these cells will therefore have ${a}_{ij}$ and ${b}_{ij}$ different from zero. For all the other ones, these coefficients are zero simply because the cell does not fire at this time.
The algorithm can be divided into two main steps, described below. After a preprocessing stage, we first run a clustering algorithm to extract a dictionary of ‘templates’ from the recording. Second, we use these templates to decompose the signal with a templatematching algorithm. We assume that a spike will only influence the extracellular signal over a time window of size ${N}_{t}$ (typically 2 ms for in vivo and 5 ms for in vitro data) and only electrodes whose distance to the soma is below ${r}_{\text{max}}$ (typically for in vivo and for in vitro data). For every electrode $k$ centered on ${p}_{k}$, we define ${G}^{k}$ as the ensemble of nearby electrodes $l$ such that ${p}_{k}{p}_{l}{}_{2}\le {r}_{\text{max}}$. The key parameters of the algorithmare summarized in Table 1.
Preprocessing
Filtering
In a preprocessing stage, all the signals were individually highpass filtered with a Butterworth filter of order three and a cutoff frequency of to remove any lowfrequency components of the signals.We then subtracted, for every electrode $k$, the median such that.
Spike detection
Once signals have been filtered, we computed a spike threshold ${\theta}_{k}$ for every electrode $k$: ${\theta}_{k}=\lambda \text{MAD}\left({s}_{k}\right(t\left)\right)$, where MAD is the Median Absolute Deviation, and $\lambda $ is a free parameter. For all the datasets shown in this paper, we set $\lambda =6$. We detected the putative spike times ${t}_{i}$ as times where there was at least one electrode $k$ where ${s}_{k}\left({t}_{i}\right)$ was below the threshold ${\theta}_{k}$ and a local minimum of the voltage $vect{s}_{k}\left(t\right)$.
Whitening
To remove spurious spatial correlations between nearby recordings electrodes, we performed a spatial whitening on the data. To do so, we searched for a maximum of 20 s of recordings where there were no spikes (i.e no threshold crossings). We then computed the Covariance Matrix of the noise ${C}_{\text{spatial}}$ and estimated its eigenvalues $\left\{{d}_{m}\right\}$ and associated eigenvector matrix $V$. From the diagonal matrix $D=\text{diag}\left({\scriptscriptstyle \frac{1}{\sqrt{d+\u03f5}}}\right)$, where $\u03f5={10}^{18}$ is a regularization factor to ensure stability, we computed the whitening matrix $F=VD{V}^{T}$. In the following, each time blocks of data are loaded, they are multiplied by $F$. After whitening,we recomputed the spike detection threshold ${\theta}_{k}$ of each electrode $k$ in the whitened space.
Basis estimation (PCA)
Our first goal was to reduce the dimensionality of the temporal waveforms. We collected up to ${N}_{\text{p}}$ spikes on each electrode. We thus obtained a maximum of ${N}_{\text{p}}\times {N}_{\text{elec}}$ spikes and took the waveform only on the peaking electrode for each of them. This is a collection of a large number of temporal waveforms and we then aimed at finding the best basis to project them. In order to compensate for sampling rate artifacts, we first upsampled all the collected singleelectrode waveforms by bicubic spline interpolation to five times the sampling rate ${f}_{\text{rate}}$, aligned on their local minima, and then resampled at ${f}_{\text{rate}}$. We then performed a Principal Component Analysis (PCA) on these centered and aligned waveforms and kept only the first ${N}_{\text{PCA}}$ principal components. In all the calculations we used default values of ${N}_{\text{p}}=10000$ and ${N}_{\text{PCA}}=5$. These principal components were used during the clustering step.
Clustering
The goal of the clustering step is to construct a dictionary of templates. As opposed to former clustering approaches of spike sorting (Quiroga et al., 2004; Harris et al., 2000; Kadir et al., 2014), because this clustering step is followed by a template matching, we do not need to perform the clustering on all the spikes.
Masking
We first randomly collected a subset of many spikes ${t}_{i}$ to perform the clustering. To minimize redundancy between collected spikes, we prevented the algorithm to have two spikes peaking on the same electrode separated by less than ${N}_{t}/2$.
Preclustering of the spikes
Trying to cluster all the spikes from all the electrodes at once is very challenging, because they are numerous and live in a high dimensional space. We used a divide and conquer approach to parallelize this problem (Marre et al., 2012; Swindale and Spacek, 2014). Each time a spike was detected at time ${t}_{i}$, we searched for the electrode $k$ where the voltage $s\left({t}_{i}\right)$ has the lowest value, that is such that. For every electrode $k$ we collected a maximum of ${N}_{\text{spikes}}$ spikes (set to 10,000 by default) peaking on this electrode. Each of these spikes is represented by a spatiotemporal waveform of size ${N}_{t}\times {N}_{\text{neigh}}^{k}$, where ${N}_{\text{neigh}}^{k}$ is the number of electrodes in the vicinity of electrode $k$, that is the number of elements in ${G}^{k}$. Note that, in the following we did not assume that spikes are only detected on a single electrode. We used the information available on all the neighboring electrodes.
We projected each temporal waveform on the PCA basis, estimated earlier, to reduce the dimensionality to ${N}_{\text{PCA}}\times {N}_{\text{neigh}}^{k}$. During this projection, the same upsampling technique described in the Preprocessing was used. Each spike was then represented in a space with ${N}_{\text{PCA}}\times {N}_{\text{neigh}}^{i}$ dimensions. To reduce dimensionality even further before the clustering stage, for every electrode $k$ we performed a PCA on the collected spikes and kept only the first ${N}_{{\text{PCA}}_{2}}$ principal components (in all the paper, ${N}_{{\text{PCA}}_{2}}=5$). Therefore, we performed a clustering in parallel for every electrode on at max ${N}_{\text{spikes}}$ described in a space of ${N}_{{\text{PCA}}_{2}}$dimension.
Clustering by search of local density peaks
To perform the clustering, we used a modified version of the algorithm published in (Rodriguez and Laio, 2014). We note the spikes $\{1,\mathrm{..},l\}$ associated with electrode $k$ (and projected on the second PCA basis) as vectors ${x}_{\{1,\mathrm{..},l\}}^{k}$ in a ${N}_{{\text{PCA}}_{2}}$ dimensional space. For each of these vectors, we estimated ${\rho}_{l}^{k}$ as the mean distance to the $S$ nearest neighbors of ${x}_{l}^{k}$. Note that $1/{\rho}_{l}^{k}$ can be considered as a proxy for the density of points. $S$ is chosen such that $S=\u03f5{N}_{\text{spikes}}$, with $\u03f5=0.01$. In our settings, since ${N}_{\text{spikes}}=10000$ then $S=100$. This density measure turned out to be more robust than the one given in the original paper and rather insensitive to changes in $\u03f5$. To avoid a potentially inaccurate estimation of the ${\rho}_{l}^{k}$ values, we collected iteratively additional spikes to refine this estimate. Keeping in memory the spikes ${x}_{l}^{k}$, we searched again in the data ${N}_{\text{spikes}}^{k}$ different spikes andused them only to refine the estimation of ${\rho}_{l}^{k}$ of our selected points ${x}_{l}^{k}$. This refinement gave more robust results for the clustering and we performed 3 rounds of this new search. Then, for every point ${x}_{l}^{k}$, we computed ${\delta}_{l}^{k}$ as the minimal distance to any other point ${x}_{m,m\ne l}^{k}$ such that ${\rho}_{m}^{k}\le {\rho}_{l}^{k}$. This corresponds to the distance to the nearest point with a higher density. The intuition of the algorithm is that the centroids should be points with a high density (i.e. low $\rho $) and far apart from each others (high $\delta $).
Centroids and cluster definition
To define the centroids we ranked the points as a function of the ratios $\delta /\rho $ and we detected the best ${N}_{\text{clusters}}^{\text{max}}$ points as putative centroids. By default ${N}_{a}^{\text{max}}thrmclusters=10$. Intuitively, this parameter corresponds to the maximal number of cells that will peak on any given electrode. It can be seen as an upper bound of the ratio between the number of cells and the number of electrodes. ructure recorded, the density of cells and the density of the electrodes, this number can be varied. Clusters were formed by assigning each point to one of the selected centroids following an iterative rule, going from the points of lowest $\rho $ to the points of highest $\rho $: each point was assigned to the same cluster as the closest point with a lower $\rho $ (Rodriguez and Laio, 2014). Thanks to this ordering, we started by assigning the points of highest density to the nearest centroid, and then assigned the next points to the nearest point with higher density, which has been already assigned to a cluster. We created ${N}_{\text{clusters}}^{\text{max}}$ clusters. Once this is done, we iteratively merged pairs of clusters that were too similar to each others.
Merging similar clusters
We computed a normalized distance $\zeta $ between each pair of clusters. The center ${\alpha}_{m}$ of each cluster was defined as the median of all the points composing this cluster. For each pair of cluster $(m,n)$, we then projected all the points foreach of them onto the axis joining the two centroids ${\alpha}_{m}{\alpha}_{n}$. We defined the dispersions around the centroids ${\alpha}_{m}$ as ${\gamma}_{m}=\text{MAD}({x}_{m}\cdot ({\alpha}_{m}{\alpha}_{n}\left)\right)$, where $\cdot $ is the scalar product of two vectors. The normalized distance is:
We then iteratively merged all clusters $(m,n)$ such that $\zeta (m,n)\le {\sigma}_{\text{similar}}$. At the end of the clustering, every cluster with less than $\eta {N}_{\text{spikes}}^{i}$ was discarded. In all the manuscript we used ${\sigma}_{\text{similar}}=3$, ${N}_{\text{clusters}}^{\text{max}}=10$, and $\eta =0.005$. We chose ${N}_{\text{clusters}}^{\text{max}}=10$ because we never see more than 10 neurons peaking on the same electrode, and this approximately corresponds to the ratio between density of observable cells and density of electrodes.
Template estimation
At the end of the clustering phase, pooling the clusters obtained from every electrode, we obtained a list of clusters. Each cluster $m$ is defined by a list of spike times ${t}_{\{1,\mathrm{..},l\}}^{m}$. During this phase we limited the number of spike times per template to a maximal value of 500 to avoid memory saturation, because we had to keep in memory these 500 snippets.
We computed the first component from the raw data as the pointwise median of all the waveforms belonging to the cluster: ${\mathbf{w}}_{m}\left(t\right)={\text{med}}_{l}\mathbf{s}\left({t}_{l}^{m}+t\right)$. Note that ${w}_{m}$ is only different from zero on the electrodes close to its peak (see Figure 1C). This information is used internally by the algorithm to save templates as sparse structures. We set to 0 all the electrodes $k$ where $\Vert {w}_{m}^{k}\left(t\right)\Vert <{\theta}_{k}$, where ${\theta}_{k}$ is the detection threshold on electrode $k$. This allowed us to remove electrodes without discriminant information and to increase the sparsity of the templates.
We then computed the projection of all snippets in the space orthogonal to the first component: $\mathrm{\forall}l,{\mathbf{q}}_{l}=\mathbf{s}\left({t}_{l}^{m}\right){\beta}_{l}{\mathbf{w}}_{m}$, with ${\beta}_{l}={\scriptscriptstyle \frac{s\left({t}_{l}^{m}\right).{w}_{m}}{\Vert {w}_{m}\Vert}}$. The $q$ are the projections of the waveforms in a space orthogonal to ${w}_{m}$. We estimated the second component of the template ${v}_{m}\left(t\right)$ as the direction of largest variance in this orthogonal space (i.e. the first principal component of ${q}_{l}$).
From the first components ${w}_{m}$, we also computed its minimal and maximal amplitudes ${a}_{m}^{\text{min}/\text{max}}$. If ${{\displaystyle \widehat{w}}}_{m}$ is the normalized template, such that ${{\displaystyle \widehat{w}}}_{m}={w}_{m}/\Vert {w}_{m}\Vert $, we computed
Those boundaries are used during the template matching step (see below). The factor five allows most of the points to have their amplitude between the two limits.
Removing redundant templates
To remove redundant templates that may be present in the dictionary because of the divide and conquer approach (for example a neuron between two electrodes would give rise to two very similar templates on two electrodes), we computed for all pairs of templates in the dictionary $C{C}_{max}\left(m,n\right)=\underset{t}{max}CC\left({\mathbf{w}}_{m},{\mathbf{w}}_{n}\right)$, where $CC$ stands for normalized crosscorrelation. If $C{C}_{\text{max}}(m,n)\ge c{c}_{\text{similar}}$, we considered these templates to be equivalent and they were merged. In all the following, we used $c{c}_{\text{similar}}=0.975$. Note that we computed the crosscorrelations between normalized templates such that two templates that have the same shape but different amplitudes are merged. Similarly, we searched if any template ${w}_{p}$ could be explained as a linear combination of two templates in the dictionary. If we could find ${w}_{m}$ and ${w}_{n}$ such that $CC({w}_{p},{w}_{m}+{w}_{n})\ge c{c}_{\text{similar}}$, ${w}_{p}$ was considered to be a mixture of two cells and was removed from the dictionary.
Template matching
At the end of this ‘templatefinding’ phase we have found a dictionary of templates ($w$, $v$). We now need to reconstruct the signal $s$ by finding the amplitudes coefficients ${a}_{ij}$ and ${b}_{ij}$ described in Equation 1. Because at a given spike time ${t}_{i}$ it is likely that only a couple of cells will fire a spike, most ${a}_{ij}$ and ${b}_{ij}$ in this equation are equal to 0. For the other ones most ${a}_{ij}$ values are around one because a spike usually appears on electrodes with an amplitude close to the average first component $w$. In this template matching step, all the other parameters have been determined by template extraction and spike detection, so the purpose is only to find the values of these amplitudes. Note that the spike times ${t}_{i}$ were detected using the method described above and include all the threshold crossing voltages that are local minima. Each true spike can be detected over several electrodes at slightly different times such that there are many more ${t}_{i}$ than actual spikes. With this approach, we found that there was no need to shift templates before matching them to the raw data.
To match the templates to the data we used an iterative greedy approach to estimate the ${a}_{ij}$ for each ${t}_{i}$, which bears some similarity to the matching pursuit algorithm (Mallat and Zhifeng Zhang, 1993). The fitting was performed in blocks of putative spike times,{${t}_{i}$}, that were successively loaded in memory. The size of one block was typically one second, which includes a lot of spike times, and is much larger than a single snippet. The snippets were thus not fitted independently from each other. The successive blocks were always overlapping by two times the size of a snippet and we discarded the results obtained on the borders to avoid any error of the template matching that would be due to a spike split between two blocks. Such an approach allowed us to easily split the workload linearly among several processors.
Each block of raw data $s$ was loaded and templatematching was performed according to the following steps:
Estimate the normalized scalar products $s\left(t\right)\cdot {w}_{j}(t{t}_{i})$ for each template $j$ and putative spike time ${t}_{i}$, for all the $i$ and $j$ in the block of raw data.
Choose the $(i,j)$ pair with the highest scalar product, excluding the pairs $(i,j)$ which have already been tried and the ${t}_{i}$’s already explored (see below).
Set ${a}_{ij}$ equal to the amplitude value that best fits the raw data: ${a}_{ij}={\scriptscriptstyle \frac{s\left(t\right).{w}_{j}(t{t}_{i})}{\Vert {w}_{j}(t{t}_{i})\Vert}}$.
Check if the ${a}_{ij}$ amplitude value is between ${a}_{j}^{\text{min}}$ and ${a}_{j}^{\text{max}}$.
If yes, accept this value, subtract the scaled template from the raw data: $s\left(t\right)\leftarrow s\left(t\right){a}_{ij}{w}_{j}(t{t}_{i})$. Then set ${b}_{ij}$ equal to the amplitude value that best fits the raw data with ${v}_{j}$, and subtractit too. Then return to step one to reestimate the scalar products on the residual.
Otherwise increase by one ${n}_{i}$, which counts the number of times any template has been rejected for spike time ${t}_{i}$.
If ${n}_{i}$ reaches ${n}_{\text{failures}}=3$, label this ${t}_{i}$ as ‘explored’. If all ${t}_{i}$ have been explored, quit the loop.
Otherwise return to step one and iterate.
The parameters of the algorithm were the amplitude thresholds ${a}_{j}^{\text{min}}$ and ${a}_{j}^{\text{max}}$, computed as described in the section Template Estimation.
Automated merging
For the template similarity, we computed, for every pair of templates $m$ and $n$, $C{C}_{max}\left(m,n\right)=\underset{t}{max}CC\left({\mathbf{w}}_{m},{\mathbf{w}}_{n}\right)$ (where $CC$ is the normalized crosscorrelation between the two templates  see above forthe definition). To quantify the dip in the crosscorrelogram, we binned the spike trains obtained for templates $m$ and $n$ with 2 ms bin size, and estimated the cross correlogram ${r}_{m,n}\left(\tau \right)$ between unit $m$ and unit $n$, defined as $\u27e8{\sigma}_{m}\left(t\right){\sigma}_{n}\left(t+\tau \right)\u27e9}_{t$. ${\sigma}_{m}\left(t\right)$ is the number of spikes of unit $m$ in time bin $t$. We then estimated the dip as the difference between the value of the crosscorrelogram at time 0 ms and the geometrical mean of the firing rates, that is $\varphi (m,n)={\u27e8{\sigma}_{m}\left(t\right)\u27e9}_{t}{\u27e8{\sigma}_{n}\left(t\right)\u27e9}_{t}$. This geometrical mean would be the value of the crosscorrelogram if the two spike trains were independent. The dip is therefore estimated as
We plotted the values of the estimated dip, the template similarity and the geometrical mean of the firing rates for each pair in a Graphical User Interface (GUI). The user can quickly define at once a whole set of pairs that need to be merged. After merging a subset of the pairs, quantities $C{C}_{\text{max}}$ and $\varphi $ are recomputed, until the user decides to stop merging (see Figure 4).
If the two spike trains from templates $m$ and $n$ correspond to the same cell, there should be no refractory spike trains. The crosscorrelogram value should be close to 0 and the dip estimation should therefore be close to the geometrical mean of the firing rates. To formalize this intuition and fully automate the merging, we decided to merge all the pairs $(m,n)$ such that:
with $c{c}_{\text{merge}}=0.8$ and ${\varphi}_{\text{merge}}=0.1$. This corresponds to merging all the highly similar pairs above a line parallel to the diagonal (see Figure 4A,B, gray area). With these two parameters we could automate the merging process.
Simulated ground truth tests
In order to assess the performance of the algorithm, we injected new templates in real datasets (see Figure 3D). To do so, we ran the algorithm on a given dataset and obtain a list of putative templates ${w}_{j\in \{1,\dots N\}}$. Then, we randomly selected some of those templates ${w}_{j}$ and shuffled the list of their electrodes before injecting them elsewhere in the datasets at controlled firing rates (Harris et al., 2000; Rossant et al., 2016; Kadir et al., 2014; Segev et al., 2004; Marre et al., 2012; Chung et al., 2017). This enabled us to properly quantify the performance of the algorithm. In order not to bias the clustering, when a template ${w}_{j}$ was selected and shuffled as a new template ${{\displaystyle \overline{w}}}_{k}$ centered on a newelectrode $k$, we ensured that the injected template was not too similar to one that would already be in the data: $\mathrm{\forall}h\in \left\{1,\dots N\right\},\underset{t}{max}CC\left({\mathbf{w}}_{h},{\overline{\mathbf{w}}}_{k}\right)\le 0.8$. Before being injected, ${{\displaystyle \overline{w}}}_{k}$ was normalized such that $\underset{t}{min}{\overline{\mathbf{w}}}_{k}={\alpha}_{k}{\theta}_{k}$. ${\alpha}_{k}$ is the relative amplitude, expressed as function of ${\theta}_{k}$, the detection threshold on the electrode where the template is peaking. If ${\alpha}_{k}\le 1$ the template is smaller than spike threshold, and its spikes should not be detected; if ${\alpha}_{k}\ge 1$ the spikes should be detected. In Figure 3G, we injected the artificial templates into the data such that they were all firing at 10 Hz, but with a controlled correlation coefficient $c$ that could be varied (using a Multiple Interaction Process [Kuhn et al., 2003]). This parameter $c$ allowed us to quantify the percentage of pairwise correlations recovered by the algorithm for overlapping spatiotemporal templates.
Performance estimation
Estimation of false positives and false negatives
To quantify the performance of the algorithm we matched the spikes recovered by the algorithm to the real groundtruth spikes (either synthetic or obtained with juxtacellular recordings). A spike was considered to be a match if it had a corresponding spike in the ground truth at less than 2 ms. Spikes in the groundtruth datasets that had no matches in the spike sorting results in a 2 ms window were labeled as ‘false negatives’, while those that are not present while the algorithm detected a spike were ‘false positives’. The falsenegative rate was defined as the number of false negatives divided by the number of spikes in the ground truth recording. The falsepositive rate was defined as the number of false positives divided by the number of spikes in the spike train extracted by the algorithm. In the paper, the error is defined as mean of the false negative and the false positive rates (see Figures 2 and 3). Note that to take into account the fact that a groundtruth neuron could be split into several templates at the end of the algorithm, we always compared the groundtruth cells with the combination of templates that minimized the error.
Theoretical estimate
To quantify the performance of the software with real groundtruth recordings (see Figure 2) we computed the Best Ellipsoidal Error Rate (BEER), as described in (Harris et al., 2000). This BEER estimate gave an upper bound on the performance of any clusteringbased spike sorting method using elliptical cluster boundaries. After thresholding and feature extraction, snippets were labeled according to whether or not they contained a true spike. Half of this labeled data set was then used to train a perceptron whose decision rule is a linear combination of all pairwise products of the features of each snippet. If ${x}_{i}$ is the $i$th snippet, projected in the feature space, then the optimized function $f\left(x\right)$ is:
We trained this function $f$ by varying $A$, $b$ and $c$ with the objective that $f\left(x\right)$ should be +1 for the ground truth spikes, and −1 otherwise. These parameters were optimized by a stochastic gradient descent with a regularization constraint. The resulting classifier was then used to predict the occurrence of spikes in the snippets in the remaining half of the labeled data. Only the snippets where $f\left(x\right)>0$ were predicted as true spikes. This prediction provided an estimate of the falsenegative and falsepositive rates for the BEER estimate. The mean between the two was considered to be the BEER error rate, or ‘Optimal Classifier Error’.
Decimation of the electrodes
In order to increase the number of data points for the comparison between our sorting algorithm and the nonlinear classifiers defined by the BEER metric (see Figure 2), we ran the analysis several times on the same neurons, but removing some electrodes, to create recordings at a lower electrode density. We divided by a factor 2 or 4 the number of electrodes in the 252 in vitro Multielectrode Array or the 128 in vivo silicon probe.
Hardware specifications
The comparison between Kilosort (Pachitariu et al., 2016) and SpyKING CIRCUS was performed on a desktop machine with 32 Gb RAM and eight cores (proc Intel Xeon(R) CPU E51630 v3 @ 3.70 GHz). The GPU used was a NVIDIA Quadro K4200 with 4 Gb of dedicated memory.
Implementation and source code
SpyKING CIRCUS is a pure Python package, based on the python wrapper for the Message Passing Interface (MPI) library (Dalcin et al., 2011) to allow parallelization over distributed computers, and is available with its full documentation at http://spykingcircus.rtfd.org. Results can easily be exported to the kwik or phy format (Rossant and Harris, 2013). All the datasets used in this manuscript are available online, for testing and comparison with other algorithms (Spampinato et al., 2018 ).
References

1
Cell type and activitydependent extracellular correlates of intracellular spikingJournal of Neurophysiology 114:608–623.https://doi.org/10.1152/jn.00628.2014

2
From response to stimulus: adaptive sampling in sensory physiologyCurrent Opinion in Neurobiology 17:430–436.https://doi.org/10.1016/j.conb.2007.07.009

3
Highdensity electrode array for imaging in vitro electrophysiological activityBiosensors and Bioelectronics 21:167–174.https://doi.org/10.1016/j.bios.2004.08.011

4
A cmosbased sensor array for invitro neural tissue interfacing with 4225 recording sites and 1024 stimulation sitesBiomedical Circuits and Systems Conference (BioCAS), 2014 IEEE. pp. 304–307.https://doi.org/10.1109/BioCAS.2014.6981723

5
Largescale recording of neuronal ensemblesNature Neuroscience 7:446–451.https://doi.org/10.1038/nn1233
 6
 7

8
Parallel distributed computing using PythonAdvances in Water Resources 34:1124–1139.https://doi.org/10.1016/j.advwatres.2011.04.013

9
Spatial organization of chromatic pathways in the mouse dorsal lateral geniculate nucleusThe Journal of Neuroscience 37:1102–1116.https://doi.org/10.1523/JNEUROSCI.174216.2016

10
Towards reliable spiketrain recordings from thousands of neurons with multielectrodesCurrent Opinion in Neurobiology 22:11–17.https://doi.org/10.1016/j.conb.2011.10.001
 11
 12

13
Spike sorting of synchronous spikes from local neuron ensemblesJournal of Neurophysiology 114:2535–2549.https://doi.org/10.1152/jn.00993.2014

14
Bayes optimal template matching for spike sorting  combining fisher discriminant analysis with optimal filteringJournal of Computational Neuroscience 38:439–459.https://doi.org/10.1007/s1082701505477

15
ViSAPy: a Python tool for biophysicsbased generation of virtual spiking activity for evaluation of spikesorting algorithmsJournal of Neuroscience Methods 245:182–204.https://doi.org/10.1016/j.jneumeth.2015.01.029

16
Neural Signal Processing and ClosedLoop Control Algorithm Design for an Implanted Neural Recording and Stimulation SystemEngineering in Medicine and Biology Society (EMBC), 2015 37th Annual International Conference of the IEEE. pp. 7831–7836.https://doi.org/10.1109/EMBC.2015.7320207
 17

18
Intracellular features predicted by extracellular recordings in the hippocampus in vivoJournal of Neurophysiology 84:390–400.https://doi.org/10.1152/jn.2000.84.1.390
 19

20
Quality metrics to accompany spike sorting of extracellular signalsJournal of Neuroscience 31:8699–8705.https://doi.org/10.1523/JNEUROSCI.097111.2011
 21
 22

23
Highdimensional cluster analysis with the masked EM algorithmNeural Computation 26:2379–2394.https://doi.org/10.1162/NECO_a_00661
 24
 25
 26
 27

28
Bayesian modeling and classification of neural signalsNeural Computation 6:1005–1030.https://doi.org/10.1162/neco.1994.6.5.1005

29
What does the eye tell the brain?: Development of a system for the largescale recording of retinal output activityIEEE Transactions on Nuclear Science 51:1434–1440.https://doi.org/10.1109/TNS.2004.832706

30
Matching pursuits with timefrequency dictionariesIEEE Transactions on Signal Processing 41:3397–3415.https://doi.org/10.1109/78.258082

31
Mapping a complete neural population in the retinaJournal of Neuroscience 32:14859–14873.https://doi.org/10.1523/JNEUROSCI.072312.2012

32
Multineuronal signals from the retina: acquisition and analysisJournal of Neuroscience Methods 51:95–106.https://doi.org/10.1016/01650270(94)900302

33
Electrical stimulus artifact cancellation and neural spike detection on large multielectrode arraysPLoS Computational Biology 13:e1005842.https://doi.org/10.1371/journal.pcbi.1005842

34
Validating silicon polytrodes with paired juxtacellular recordings: method and datasetJournal of Neurophysiology 116:892–903.https://doi.org/10.1152/jn.00103.2016

35
Fast and accurate spike sorting of highchannel count probes with kilosortAdvances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems. pp. 4448–4456.
 36

37
Using noise signature to optimize spikesorting and to assess neuronal classification qualityJournal of Neuroscience Methods 122:43–57.https://doi.org/10.1016/S01650270(02)002765
 38
 39

40
Hardwareaccelerated interactive data visualization for neuroscience in PythonFrontiers in Neuroinformatics, 7, 10.3389/fninf.2013.00036, 24391582.

41
Spike sorting for large, dense electrode arraysNature Neuroscience 19:634–641.https://doi.org/10.1038/nn.4268

42
Recording spikes from a large fraction of the ganglion cells in a retinal patchNature Neuroscience 7:1155–1162.https://doi.org/10.1038/nn1323

43
Ground truth recordings for validation of spike sorting algorithmsGround truth recordings for validation of spike sorting algorithms, https://doi.org/10.5281/zenodo.1205233.

44
Spike sorting for polytrodes: a divide and conquer approachFrontiers in Systems Neuroscience, 8, 10.3389/fnsys.2014.00006, 24574979.

45
A primacy code for odor identityNature Communications 8:1477.https://doi.org/10.1038/s41467017014324
 46
Decision letter

David KleinfeldReviewing Editor; University of California, San Diego, United States
In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.
[Editors’ note: a previous version of this study was rejected after peer review, but the authors submitted for reconsideration. The first decision letter after peer review is shown below.]
Thank you for submitting your work entitled "A spike sorting toolbox for up to thousands of electrodes validated with ground truth recordings in vitro and in vivo" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and a Senior Editor.
Your work is of particular interest based on the rarity of ground truth data for spike sorting. Unfortunately, the reviewers had mixed feeling about the depth and clarity of the data and voice some significant concerns that speak against publication. The chief concern is that the ground truth data for the MEA is limited to the 7 cells for sorting errors less than 20% (Figure 2C; see comment by reviewer 1). In fact, the example (presumably one of the seven points) in Figure 2B suggests that a given cell goes not contribute a waveform to many electrodes (as implied in Figure 1; see additional comment by reviewer 1). Thus the necessity as well as utility of the algorithm seems unmotivated as it stands. Another weakness, discussed by reviewers 1 and 2, is that the merging of clusters is "by hand", although there is precedent in the literature (Fee et al. 1997) for automatic merging based on avoiding events at equal time in the autocorrelation of the merged clusters. Finally, all reviewers and the Reviewing Editor find the presentation confusing or lacking. Material is not presented in a logical order, details of the algorithm as glossed over, etc.
All reviewers felt that the topic of the manuscript is timely. We understand that you may wish to take this manuscript elsewhere at this time. On the other hand, if you agree with the reviewers, we would be willing to consider a revised manuscript with a larger useful data set and with a means for automation of the merging of clusters, without assurance that the revision would be successful.
Reviewer #1:
These authors report their method to do classic spike sorting, creation of single neuron spike time records, from multisite recording. They report results on both multielectrode arrays, used to record in vitro from retinal tissue, and Si probes, used to record in vivo, most frequently from rodents. Their method has been previewed on bioRxiv for some time so there is some public information comparing performance of this method to existing and other recently developed packages. While the technical treatment is thorough, I find the paper confusing, in large part because of the bifurcation of the Materials and methods and the remainder of the paper. To begin, the authors say "those snippets are detected, they are projected in a feature space of lower dimension". What feature space. Ten pages later I find out PCA. How many dimensions, either 5 or 6 depending on the parts of the algorithm. Without comment, the templates are 57 sites. This seems exceptionally large, perhaps usual for MEA recording. The choice of size deserves a clear comment.
I commend the authors on their contribution of "ground truth" MEA data. The field is seriously deficient in reference data and this deficiency impairs progress in the computational problem at hand. Figure 2A and B are commendably clear. I find 2C to be puzzling. I would assume only the inset of retina in vitro is of any interest to the practitioners of neural recording. Error rates above 20% are of no practical utility but occupy 95% of the plot. Alternatively only 1 data point in the in vivo plat is useful. This result is not surprising given the traces in 2B. Peak to peak noise is ~1012 uV. The usual peak threshold is 3.55 x RMS or 1215 uV. As such, one would not expect to sort signal below that amplitude. With the authors template matching routine, it can be argued that smaller signals can be found but they are buried in noise, again as Figure 2C seems to illustrate. I am left wondering what 2C teaches, except, that sorting has failed in most of the conditions plotted. Does this not call into question the success of the strategy of finding template pools at high amplitude then matching those templates to the raw traces? If that is not the lesson, then the authors need to clarify the value of plotting so many failed sorting exercises. I fail to understand the value of a 50% classification error that matches the optimal classification rate. I would conclude the experiment has failed to produce adequate usable data. Further, the authors choose to compare their sorting accuracy to a previously defined metric, Best Ellipsoidal Error Rate. Harris, 2000, the original source defines this measure as "This measure was designed to estimate the optimal clustering performance for a given set of feature vectors by searching over all possible ellipsoidal cluster boundaries." Is not the principal weakness of the methods of Harris reported there and later an assumption that a cluster is well described by ellipsoidal boundaries? The authors should justify this assumption in the reference data sets. I see no a priori reason that their data should be so constrained, although the "ground spikes" may be.
In the Discussion, especially Figure 3, trends are difficult to understand. For 3A, the plot seems to argue a superlinear dependence on processor count, 6 processors taking 1/30th the time of 2 processors. This seems implausible. Would this plot be more informative on a log time scale without a trendline. The current plot implies that 10 processors would digest the data in negative time! 3B is much more informative, showing 400 channels would take ~20 hours of computation for 1 hour of recording. While an improvement over many current methods, is not the method of Pachatariu et al., reference in this manuscript as a bioRxiv but now also published in Advances In Neural Information Processing Systems 44484456 (2016). The GPU based system reported their quotes "First, we timed the algorithm on several large scale datasets. The average run times for 32, 128 and 384 channel recordings were 10, 29 and 140 minutes respectively, on a single GPUequipped workstation". While they did not quote the duration of the recording or the total number of neuronal event (a better metric of the computation burden) it seems far faster than the ~30 hours in the present manuscript. Both of the packages are available in the public domain so a direct comparison is a reasonable request. In Figure 3E, the authors plot data ranging from 0 to 15% (including error bars) to 50% full scale. Why not use the full plot height for 15%. Same comment for 3F.
Finally, the authors freely admit their strategy over segments but offer a combination of an automated merging tool as well as a "dedicated GUI" for manual merging. This reviewer did not implement that interface so I cannot comment directly on performance or ease of use. In the Materials and methods section, they describe in detail the method for correlation, but they do no show the error rate as a function of the parameters used for that merging.
In summary, I am undecided. The authors use an admirable amount to "ground truth" data, both their own and that available publicly. Unfortunately, all of these data are inadequate to the problem since the known cells are so few and in the experimental data of such low amplitude yield a residual question of performance remains. This manuscript reports a novel combination of strategies but does not appear to exceed in performance or speed other recently published packages. More than one option is probably justified based on different needs of data sets, but that optimum path is difficult to discern from this manuscript of other published data. Reporting the method is a service to the community and justified in journal space, especially for an online journal such as eLife.
Reviewer #2 (General assessment and major comments (Required)):
In this report, the authors present a new approach to spike sorting that they claim is scalable to thousands of electrodes. This a very timely report as there is renewed interest in spike sorting thanks to the new electrode technologies. However, I find this report to be mixed in terms of strength of the proposed solution and clarity with which the underlying claims are made.
The strength of the proposed solution is significantly undercut by the need for manual cluster merging. This is a really major limitation. Clustering of thousands of electrodes has to be automatic if it is to be genuinely scalable. I can imagine it would take a long time for someone to check thousands of recordings and merge, especially if they are not an expert.
The authors cite several algorithms that are designed to scale and says they are untested. This is a fair point, but why don't they test them with their ground truth data and compare the results? This would seem to be an important issue for the reader to assess the value of the proposed work in the context of the cited literature.
The Introduction implies that the spike sorting problem has been addressed for small numbers of contacts, but does not include any citations to support the point. What forms of the classical approach that the authors are referring to? I would claim that despite a great deal of attention it is not the case that lowdensity sorting has been solved. The high density electrode case that the authors focus on here presents new challenges but the authors need to acknowledge the known limitations of lowdensity approaches.
A novelty of the proposed approach is that it claims to address the superposition problem but how well this works is not demonstrated or discussed. I think the work would be strengthened if this aspect received more attention.
Reviewer #3:
This manuscript presents a novel resource for automated spike sorting on data from largescale multielectrode recordings. The authors present new strategies for tackling large recordings with hundreds to thousands of electrodes while being able to integrate information from multiple electrodes and to disentangle overlapping spikes from multiple neurons. Thus, the resource addresses fundamental issues of spike sorting. Given the current development of recording systems with hundreds to thousands of electrodes and the paucity of methods targeted to such data sets, the presented method should therefore make a timely and very strong contribution to the field of multielectrode recordings.
The approach appears to be well thoughtthrough and apparently guided by much practical experience with spikesorting issues. The manuscript is accompanied by freely available software that can be downloaded from the authors' website, and the software appears to be well documented and versatile in that it can be run on different operating systems, using different hardware configurations, with straightforward parallelization. Moreover, the authors test their spikesorting approach on groundtruth data by combining their extracellular multielectrode recordings with the intracellular recording of individual neurons, a laudable effort. They generally find that their automated spike sorting performs nearly as well as an "optimal classifier", obtained by supervised learning of an elliptical decision boundary in the space of raw electrode signals. The authors also make this groundtruth data available, which will be a great asset available to future developments of spike sorting approaches.
My only concerns therefore relate to the presentation of the material. Given that this is a presentation of a spike sorting resource, it seems mandatory that the computational approach and the technical aspects of the algorithm be explained as thoroughly and clearly as possible so that readers and users can fully understand the approach and adjust it to their own needs by tuning the appropriate parameters. In the Results part, the explanation of the algorithm is rather brief, and it would here be useful to provide a bit more of an intuitive description of the approach (see detailed points below). The Materials and methods section, on the other hand, appears to contain some inaccuracies and small omissions of detail that make it hard to follow some of the steps.
1) The Results part is a bit sparse on explaining the approach, although this would be a nice opportunity to lay out concepts without the need for technical details. For example, it might be useful to provide some brief explanations of why limiting the clustering to the search for centroids is less demanding and what the approach in densitybased clustering is. Furthermore, some explanation of what is meant by "clustering each group in parallel" would be helpful, including the strategy for finding the right number of clusters and merging templates and units when necessary.
2) One aspect that doesn't become clear in the main part is to what extent merging of units generally needs to be performed and how it was done for the results of Figures 2 and 3. The Materials and methods section states that several templates were used for comparison with groundtruth data. It would be useful to obtain some more information about this. Was this merging done with the GUI described in the Discussion? Was the merging done "blindly", that is, before comparison with the groundtruth data, or retrospectively? Was merging the norm or the exception?
3) A potential issue with automated spike sorting is that it might be difficult to judge the sorting quality of the obtained units, that is, how reliable the units are and to what degree one may expect contamination from other cells or missed spikes. Does the resource provide for any quality control? Useful simple statistics might be the normalized spike amplitude, the distance from other templates, the quality of the fits in the template matching phase, and the variability in the fitting parameters, but maybe the authors could also point out other ways of assessing sorting quality from their own experience of using their software.
[Editors’ note: what now follows is the decision letter after the authors submitted for further consideration.]
Thank you for resubmitting your article "A spike sorting toolbox for up to thousands of electrodes validated with ground truth recordings in vitro and in vivo" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and Eve Marder as the Senior Editor. The reviewers have opted to remain anonymous.
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
The manuscript has been improved but there are some remaining issues that must be addressed before acceptance. This is an important paper that will gain from clarity in the exposition and the correction of numerous typographical errors. Of note:
1) A table of symbols and their definitions will be of great help for the reader.
2) Confusion remains in the implementation of the original clustering process as detailed by the first reviewer. Simply, a clear delineation of the process to create a template and use the template should be completely defined pictorially as well as mathematically.
3) The reviewers identify other confusing issues, missing references, and the need to avoid "double counting" by using correlation of the same data set for both analysis and verification.
Please address each of the reviewer criticisms in a written reply that addresses each reviewer comment and notes all changes to the text. The revised manuscript will be assessed by the Reviewing editor.
Reviewer #1:
The authors have improved the clarity of the manuscript substantially. My prior concerns have been adequately addressed. In addition to the specific comments below I would ask for additions/clarifications on three subjects.
I) First, on the path of the computation: Figure 1 is clearly the axis for most readers to understand the spike sorting process. Having carefully read the revised text as well as the original, I am still not sure how the original clustering is done. For the step of template collection, the text seems to imply that there are N clustering steps for the N electrodes. If true, one expects that each neuron would be detected on 38 electrodes and thus there are 38 times more clusters than neurons at least. The content of 1 C implies that these very many clusters are reduced to a template with 57 site "snipets" that make up the template. However, the example of 1C is centered on a maximum amplitude transient. How those maximum amplitude "centers" for templates are found is not clear to me, nor is how the expected variation around that center accommodated. Perhaps it does not need to be, since the template field is large so even is the maximum amplitude site for a given spike is off center, there is still ample information to cluster. A clear delineation of this template creation and use process would be useful, especially if it can be done pictorially as well as mathematically. How are all these redundant units merged?
II) While reading the Materials and methods section especially, but also reading the manuscript, I was constantly looking back through the text for the definition of variables and symbols. Would it be possible for the authors to make a variable definition table so that a reader can have access in one place to be reminded of variable definition for all variables in the paper?
III) The use of BEER is much better explained, but it would be useful to discuss when this assumption of elliptical boundaries is unlikely true, such as bursting cells or electrodetissue drift.
Reviewer #3:
Let me begin by stating that it is very worthwhile to have a publication that describes SpyKING CIRCUS, since it is one of a handful of tools aimed at the automated analysis of multielectrode data.
Some comments follow:
1) This paper needs significant proofreading/editing to make it more readable.
As an example, the second sentence in the second section (Results) is:
“First, spikes are being detected as threshold crossings (Figure 1A) and isolated the extracellular waveforms for a number of randomly chosen spike times.”
I know what the authors intend, but it is garbled.
2) The Results section contains a description of the algorithm. Is that the format eLife requires? It would be clearer to have sections on the Algorithm, experimental results, validation experiments, etc.
3) A distinguishing feature of the method is the ability to resolve overlapping spikes in large systems in a reasonably automatic fashion and makes this an interesting contribution.
4) Other groups have used artificial template/spike insertion as a way to validate the algorithm and some reference could be given (e.g. Rossant 2016, Chung 2017).
5) The authors use crosscorrelograms in order to help automate the merging process. It should be noted that, as a side effect, one can no longer use crosscorrelograms as a validation metric.
6) Equation 1 is very confusing. The authors should explain the double sum notation (why index over events and units)? I'm sure they have something in mind but I don't know what it is.
https://doi.org/10.7554/eLife.34518.014Author response
[Editors’ note: the author responses to the first round of peer review follow.]
Reviewer #1:
These authors report their method to do classic spike sorting, creation of single neuron spike time records, from multisite recording. They report results on both multielectrode arrays, used to record in vitro from retinal tissue, and Si probes, used to record in vivo, most frequently from rodents. Their method has been previewed on bioRxiv for some time so there is some public information comparing performance of this method to existing and other recently developed packages. While the technical treatment is thorough, I find the paper confusing, in large part because of the bifurcation of the Materials and methods and the remainder of the paper. To begin, the authors say "those snippets are detected, they are projected in a feature space of lower dimension". What feature space. Ten pages later I find out PCA. How many dimensions, either 5 or 6 depending on the parts of the algorithm. Without comment, the templates are 57 sites. This seems exceptionally large, perhaps usual for MEA recording. The choice of size deserves a clear comment.
We would like to thank the reviewer for the comment. The manuscript has been intensively rewritten, and we hope that the Materials and methods and the paper are easier to understand for the reader.
I commend the authors on their contribution of "ground truth" MEA data. The field is seriously deficient in reference data and this deficiency impairs progress in the computational problem at hand. Figure 2A and B are commendably clear. I find 2C to be puzzling. I would assume only the inset of retina in vitro is of any interest to the practitioners of neural recording. Error rates above 20% are of no practical utility but occupy 95% of the plot. Alternatively only 1 data point in the in vivo plat is useful. This result is not surprising given the traces in 2B. Peak to peak noise is ~1012 uV. The usual peak threshold is 3.55 x RMS or 1215 uV. As such, one would not expect to sort signal below that amplitude. With the authors template matching routine, it can be argued that smaller signals can be found but they are buried in noise, again as Figure 2C seems to illustrate. I am left wondering what 2C teaches, except, that sorting has failed in most of the conditions plotted. Does this not call into question the success of the strategy of finding template pools at high amplitude then matching those templates to the raw traces? If that is not the lesson, then the authors need to clarify the value of plotting so many failed sorting exercises. I fail to understand the value of a 50% classification error that matches the optimal classification rate. I would conclude the experiment has failed to produce adequate usable data.
We thank the reviewer for appreciating the value of ground truth MEA data. We agree that, in the major part of our previous recordings, the spike recorded with the loose patch electrode was too low on the MEA to be properly isolated. We have therefore collected more data where the spike height measured on the multielectrode array is large enough so that the error rate is very low (18 cells below 10% ). We found that our spike sorting algorithm was successful for all these cells. Our algorithm is therefore able to sort accurately cells where a very low error rate as expected.
Further, the authors choose to compare their sorting accuracy to a previously defined metric, Best Ellipsoidal Error Rate. Harris, 2000, the original source defines this measure as "This measure was designed to estimate the optimal clustering performance for a given set of feature vectors by searching over all possible ellipsoidal cluster boundaries." Is not the principal weakness of the methods of Harris reported there and later an assumption that a cluster is well described by ellipsoidal boundaries? The authors should justify this assumption in the reference data sets. I see no a priori reason that their data should be so constrained, although the "ground spikes" may be.
We agree that the comparison with the BEER estimate was not properly justified. Our motivation is that all cells cannot be sorted with equal performance: if a cell evoked small spikes on the array, or if the extracellular waveforms are very similar to the ones triggered by another neighbour cell, then it will be difficult to sort this cell, and even the best algorithm will produce a large error rate. We thus need to compare the performance of our algorithm with an estimate of how well the “best” algorithm should perform. Unfortunately, there is no known method to provide a perfect estimate of this best performance. One simple option is to plot the error of our algorithm against the height of the spike. We did this and show that the error rate decreases when the spike height increases, as expected (see Figure 2C). However, this does not tell us if this is the best performance reachable. The BEER estimate is just one attempt at getting close to this best performance. We agree with the reviewer about the limitation of the BEER estimate: it assumes that spikes can be found in a cluster with ellipsoidal boundaries, which is probably wrong. In particular, spikes in overlap with spikes from another cell will probably be missed. The BEER estimate cannot be considered as estimating the optimal error rate. We used it because it has been used in several papers describing spike sorting methods (Harris et al., 2000, Rossant et al., 2016, Kadir et al., 2014) and has been established as a commonly accepted benchmark. To our knowledge, there is no alternative method that circumvents the issues of the BEER. We have mentioned the limitations of the BEER in the text. When compared to the error rate of our algorithm, we found similar results to the BEER. While this cannot be taken as a proof that our algorithm is optimally sorting the spikes, the fact that the error rate is close to a supervised classifier learned on the ground truth data suggests that it does very well.
In the Discussion, especially Figure 3, trends are difficult to understand. For 3A, the plot seems to argue a superlinear dependence on processor count, 6 processors taking 1/30th the time of 2 processors. This seems implausible. Would this plot be more informative on a log time scale without a trendline. The current plot implies that 10 processors would digest the data in negative time! 3B is much more informative, showing 400 channels would take ~20 hours of computation for 1 hour of recording. While an improvement over many current methods, is not the method of Pachatariu et al., reference in this manuscript as a bioRxiv but now also published in Advances In Neural Information Processing Systems 44484456 (2016). The GPU based system reported their quotes "First, we timed the algorithm on several large scale datasets. The average run times for 32, 128 and 384 channel recordings were 10, 29 and 140 minutes respectively, on a single GPUequipped workstation". While they did not quote the duration of the recording or the total number of neuronal event (a better metric of the computation burden) it seems far faster than the ~30 hours in the present manuscript. Both of the packages are available in the public domain so a direct comparison is a reasonable request. In Figure 3E, the authors plot data ranging from 0 to 15% (including error bars) to 50% full scale. Why not use the full plot height for 15%. Same comment for 3F.
We apologize for the confusion. The time to run the algorithm is inversely proportional to the number of CPUs. We have modified the plot to make this clearer. We have also performed a direct comparison with KiloSort (Pachitariu et al., 2016). For a small enough number of electrodes, Kilosort is faster than our algorithm. However, this comes at the cost of a much larger memory usage. As a consequence, KiloSort cannot handle large recordings with more than 500 channels, while our algorithm could handle up to 4225 channels.
Finally, the authors freely admit their strategy over segments but offer a combination of an automated merging tool as well as a "dedicated GUI" for manual merging. This reviewer did not implement that interface so I cannot comment directly on performance or ease of use. In the Materials and methods section, they describe in detail the method for correlation, but they do no show the error rate as a function of the parameters used for that merging.
We have now extensively described this tool for automated merging, and shown a solution where merging can be automated with only two parameters. We have tested this fully automated strategy using the ground truth data and found that it allows sorting neurons with a good accuracy.
In summary, I am undecided. The authors use an admirable amount to "ground truth" data, both their own and that available publicly. Unfortunately, all of these data are inadequate to the problem since the known cells are so few and in the experimental data of such low amplitude yield a residual question of performance remains. This manuscript reports a novel combination of strategies but does not appear to exceed in performance or speed other recently published packages. More than one option is probably justified based on different needs of data sets, but that optimum path is difficult to discern from this manuscript of other published data. Reporting the method is a service to the community and justified in journal space, especially for an online journal such as eLife.
We hope that the novel data and improvements in the algorithm will now convince the reviewer. The new datasets have spikes with high amplitude. Our algorithm for spike sorting also allows scaling up thousands of electrodes, which is not the case of other published software.
Reviewer #2 (General assessment and major comments (Required)):
In this report, the authors present a new approach to spike sorting that they claim is scalable to thousands of electrodes. This a very timely report as there is renewed interest in spike sorting thanks to the new electrode technologies. However, I find this report to be mixed in terms of strength of the proposed solution and clarity with which the underlying claims are made.
The strength of the proposed solution is significantly undercut by the need for manual cluster merging. This is a really major limitation. Clustering of thousands of electrodes has to be automatic if it is to be genuinely scalable. I can imagine it would take a long time for someone to check thousands of recordings and merge, especially if they are not an expert.
The authors cite several algorithms that are designed to scale and says they are untested. This is a fair point, but why don't they test them with their ground truth data and compare the results? This would seem to be an important issue for the reader to assess the value of the proposed work in the context of the cited literature.
We thank the reviewer for this suggestion and have now designed our own strategy to automate fully the spike sorting, with a final step of automated merging. We have tested this automated merging step with our ground truth data and found that we could reach very good performance.
The Introduction implies that the spike sorting problem has been addressed for small numbers of contacts, but does not include any citations to support the point. What forms of the classical approach that the authors are referring to? I would claim that despite a great deal of attention it is not the case that lowdensity sorting has been solved. The high density electrode case that the authors focus on here presents new challenges but the authors need to acknowledge the known limitations of lowdensity approaches.
We apologize for the confusion. We did not imply that spike sorting was solved for small numbers of contacts. However, it has received a lot of attention and several solutions have been proposed, although we agree it is not clear if these solutions are optimal or not. The common point of most of these solutions is to use clustering and we have now emphasized the limitations of a “pure” clustering approach in the Introduction and Results.
A novelty of the proposed approach is that it claims to address the superposition problem but how well this works is not demonstrated or discussed. I think the work would be strengthened if this aspect received more attention.
The superposition problem is a salient issue when trying to estimate pairwise correlations between cells. If two nearby cells spike synchronously, their spikes will temporally and spatially overlap. A clusteringbased approach will miss these synchronous spikes and, as a result, the correlation between the two cells will be underestimated (see Pillow et al., 2013). To demonstrate that our algorithm could solve this problem we simulated two correlated spike trains and added these artificial cells to real extracellular recordings. We therefore generated ground truth data where we know the true correlation value for these two points. We then showed that our algorithm was able to recover the correct value for correlation. Our algorithm is therefore able to solve the superposition problem. We have added these explanations in the text.
Reviewer #3:
[…] My only concerns therefore relate to the presentation of the material. Given that this is a presentation of a spike sorting resource, it seems mandatory that the computational approach and the technical aspects of the algorithm be explained as thoroughly and clearly as possible so that readers and users can fully understand the approach and adjust it to their own needs by tuning the appropriate parameters. In the Results part, the explanation of the algorithm is rather brief, and it would here be useful to provide a bit more of an intuitive description of the approach (see detailed points below). The Materials and methods section, on the other hand, appears to contain some inaccuracies and small omissions of detail that make it hard to follow some of the steps.
We have completely rewritten the Results and Materials and methods section. The results are now more complete, and the methods more accurate.
1) The Results part is a bit sparse on explaining the approach, although this would be a nice opportunity to lay out concepts without the need for technical details. For example, it might be useful to provide some brief explanations of why limiting the clustering to the search for centroids is less demanding and what the approach in densitybased clustering is. Furthermore, some explanation of what is meant by "clustering each group in parallel" would be helpful, including the strategy for finding the right number of clusters and merging templates and units when necessary.
We have done our best to explain our clustering strategy here. When we search for cluster centroids, we can afford errors in the cluster borders. This is why this task can be considered as less demanding than a full sorting by clustering, where borders have to be correct. We have also detailed more what we meant by clustering in parallel: before clustering, we group spikes according the electrode where they peak. This creates as many groups as electrodes and we cluster each group independently.
2) One aspect that doesn't become clear in the main part is to what extent merging of units generally needs to be performed and how it was done for the results of Figures 2 and 3. The Materials and methods section states that several templates were used for comparison with groundtruth data. It would be useful to obtain some more information about this. Was this merging done with the GUI described in the Discussion? Was the merging done "blindly", that is, before comparison with the groundtruth data, or retrospectively? Was merging the norm or the exception?
This part has been extensively rewritten and modified following the comments of the other reviewers. We have now designed a fully automated method for the merging step, and tested it on the ground truth data.
3) A potential issue with automated spike sorting is that it might be difficult to judge the sorting quality of the obtained units, that is, how reliable the units are and to what degree one may expect contamination from other cells or missed spikes. Does the resource provide for any quality control? Useful simple statistics might be the normalized spike amplitude, the distance from other templates, the quality of the fits in the template matching phase, and the variability in the fitting parameters, but maybe the authors could also point out other ways of assessing sorting quality from their own experience of using their software.
From our ground truth data one can see that the quality of the sorted units is directly related to the size of the spike waveform. We have not found any other parameter that played a significant role in determining the quality of the sorted unit. We have added a panel showing the relation between sorting quality and spike size.
[Editors' note: the author responses to the rereview follow.]
The manuscript has been improved but there are some remaining issues that must be addressed before acceptance. This is an important paper that will gain from clarity in the exposition and the correction of numerous typographical errors. Of note:
1) A table of symbols and their definitions will be of great help for the reader.
2) Confusion remains in the implementation of the original clustering process as detailed by the first reviewer. Simply, a clear delineation of the process to create a template and use the template should be completely defined pictorially as well as mathematically.
3) The reviewers identify other confusing issues, missing references, and the need to avoid "double counting" by using correlation of the same data set for both analysis and verification.
Please address each of the reviewer criticisms in a written reply that addresses each reviewer comment and notes all changes to the text. The revised manuscript will be assessed by the Reviewing editor.
This revised version aims at addressing all of the reviewers’ criticisms. The reviewers’ concerns are addressed in detail below. We thank the reviewers for their helpful comments and criticisms, which have helped to clarify this manuscript. We have answered to all their points:
 The text has been proofread to improve clarity;
 Figure 4 has been redrawn with new colors, to improve visibility (see comment from reviewer 1);
 We added a table at the end of the main manuscript to explain all the notations used in the paper;
 We added a supplementary figure to better explain pictorially how the clustering is parallelized (see first comment of reviewer 1).
Reviewer #1:
The authors have improved the clarity of the manuscript substantially. My prior concerns have been adequately addressed. In addition to the specific comments below I would ask for additions/clarifications on three subjects.
I) First, on the path of the computation: Figure 1 is clearly the axis for most readers to understand the spike sorting process. Having carefully read the revised text as well as the original, I am still not sure how the original clustering is done. For the step of template collection, the text seems to imply that there are N clustering steps for the N electrodes. If true, one expects that each neuron would be detected on 38 electrodes and thus there are 38 times more clusters than neurons at least. The content of 1 C implies that these very many clusters are reduced to a template with 57 site "snipets" that make up the template. However, the example of 1C is centered on a maximum amplitude transient. How those maximum amplitude "centers" for templates are found is not clear to me, nor is how the expected variation around that center accommodated. Perhaps it does not need to be, since the template field is large so even is the maximum amplitude site for a given spike is off center, there is still ample information to cluster. A clear delineation of this template creation and use process would be useful, especially if it can be done pictorially as well as mathematically. How are all these redundant units merged?
We thank the reviewer for this comment. We would like to clarify how we do this clustering step. Even if a spike is detected on several electrodes, it will only be assigned to the electrode where the voltage peak is the largest. Thanks to this method, if a spike has its largest peak on electrode 1, but is also detected on electrode 2, it will only be assigned to electrode 1. This means that, in general, the spikes of one cell will be assigned to only one electrode and will therefore correspond to a single cluster, not to 38 ones as predicted by the reviewer. We have explained this better in the text, and a supplementary figure to explain this procedure has been added to the manuscript.
II) While reading the Materials and methods section especially, but also reading the manuscript, I was constantly looking back through the text for the definition of variables and symbols. Would it be possible for the authors to make a variable definition table so that a reader can have access in one place to be reminded of variable definition for all variables in the paper?
We would like to thank the reviewer for this suggestion. A table has been added at the end of the paper.
III) The use of BEER is much better explained, but it would be useful to discuss when this assumption of elliptical boundaries is unlikely true, such as bursting cells or electrodetissue drift.
The reviewer is right, and this is now better discussed in the manuscript. Indeed, the BEER is only an approximation of the expected best performance, as this nonlinear classifier may not cope perfectly with temporal overlaps, drifts, or bursting behavior. Nevertheless, in our case, since we used only few minutes long recording, the drifts appear to be negligible.
Reviewer #3:
Let me begin by stating that it is very worthwhile to have a publication that describes SpyKING CIRCUS, since it is one of a handful of tools aimed at the automated analysis of multielectrode data.
Some comments follow:
1) This paper needs significant proofreading/editing to make it more readable.
As an example, the second sentence in the second section (Results) is:
“First, spikes are being detected as threshold crossings (Figure 1A) and isolated the extracellular waveforms for a number of randomly chosen spike times.”
I know what the authors intend, but it is garbled.
We would like to thank the reviewer for this comment on SpyKING CIRCUS. We have proofread the manuscript, in order to make it easier to understand. We hope that this is now suitable for publication.
2) The Results section contains a description of the algorithm. Is that the format eLife requires? It would be clearer to have sections on the Algorithm, experimental results, validation experiments, etc.
Many articles in eLife in the Tools and Resources category include in the results a brief description of the methods, which is expanded in the Materials and methods section. For example, see
 https://elifesciences.org/articles/28728
 https://elifesciences.org/articles/32671
 https://elifesciences.org/articles/24656
We have followed this organisation. This allows to grasp the main steps of the algorithm quickly for the standard user of the algorithm, while giving all the necessary details in the Materials and methods section.
3) A distinguishing feature of the method is the ability to resolve overlapping spikes in large systems in a reasonably automatic fashion and makes this an interesting contribution.
We would like to thank the reviewer for the comment. We have emphasized more this point in the text.
4) Other groups have used artificial template/spike insertion as a way to validate the algorithm and some reference could be given (e.g. Rossant 2016, Chung 2017).
We thank the reviewer for the comment. In fact, in the original manuscript, the reference to (Rossant et al., 2016) was already there in the Materials and methods section (Simulated ground truth tests). Indeed, we are fully aware that many groups have used this strategy of designing hybrid" datasets as a way to validate spike sorting with artificial data. The reference to (Chung et al., 2017) has been added. We have also mentioned other papers where this was done previously (Segev et al., 2004, Marre et al., 2012)
5) The authors use crosscorrelograms in order to help automate the merging process. It should be noted that, as a side effect, one can no longer use crosscorrelograms as a validation metric.
The reviewer is right, and this is something that has now been explicitly stated in the manuscript.
6) Equation 1 is very confusing. The authors should explain the double sum notation (why index over events and units)? I'm sure they have something in mind but I don't know what it is.
We apologize for the lack of clarity. The manuscript has been modified in order to better explain this key equation. The core idea of this equation is that for every spike detected (indexed by i), we assume that several templates (indexed by j) could re and participate to this spiking event. This is why there is a double sum here. That said, since only a few cells will fire at the same spike time, most of the coefficients should be 0. This notation is nevertheless convenient to explain how spikes can be superimposed. We have clarified this in the text.
https://doi.org/10.7554/eLife.34518.015Article and author information
Author details
Funding
Agence Nationale de la Recherche (TRAJECTORY)
 Olivier Marre
European Commission (ERC StG 309776)
 Jens Duebel
National Institutes of Health (U01NS090501)
 Olivier Marre
Foundation Fighting Blindness
 Serge Picaud
Agence Nationale de la Recherche (ANR14CE130003)
 Pierre Yger
Agence Nationale de la Recherche (ANR10LABX65)
 Serge Picaud
European Commission (FP7720270)
 Olivier Marre
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We would like to thank Charlotte Deleuze for her help with the in vitro juxtacellular recordings, and Steve Baccus and Sami El Boustani for insightful discussions. We also would like to thank Kenneth Harris, Cyrille Rossant and Nick Steimetz for feedbacks and the help with the interface to the phy software. This work was supported by ANR14CE130003 to PY, ANR TRAJECTORY, ANR OPTIMA, the French State program Investissements d’Avenir managed by the Agence Nationale de la Recherche [LIFESENSES: ANR10LABX65], an EC grant from the Human Brain Project (FP7720270), and NIH grant U01NS090501 to OM, ERC Starting Grant (309776) to JD and Foundation Figthing Blindness to SP.
Ethics
Animal experimentation: Experiments were performed in accordance with institutional animal care standards, using protocol (#00847.02) of the Institut de la Vision (Agreement number A751202). The protocol was approved by the Charles Darwin ethic committee (CEEACD/N°5)
Reviewing Editor
 David Kleinfeld, University of California, San Diego, United States
Publication history
 Received: December 20, 2017
 Accepted: March 19, 2018
 Accepted Manuscript published: March 20, 2018 (version 1)
 Version of Record published: April 9, 2018 (version 2)
Copyright
© 2018, Yger et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 2,462
 Page views

 467
 Downloads

 4
 Citations
Article citation count generated by polling the highest count across the following sources: Crossref, Scopus, PubMed Central.