1. Neuroscience
Download icon

SpikeForest, reproducible web-facing ground-truth validation of automated neural spike sorters

  1. Jeremy Magland  Is a corresponding author
  2. James J Jun
  3. Elizabeth Lovero
  4. Alexander J Morley
  5. Cole Lincoln Hurwitz
  6. Alessio Paolo Buccino
  7. Samuel Garcia
  8. Alex H Barnett
  1. Center for Computational Mathematics, Flatiron Institute, United States
  2. Scientific Computing Core, Flatiron Institute, United States
  3. Medical Research Council Brain Network Dynamics Unit, University of Oxford, United Kingdom
  4. Institute for Adaptive and Neural Computation Informatics, University of Edinburgh, United Kingdom
  5. Centre for IntegrativeNeuroplasticity (CINPLA), University of Oslo, Norway
  6. Centre de Recherche en Neuroscience de Lyon, Université de Lyon, France
Tools and Resources
  • Cited 0
  • Views 531
  • Annotations
Cite this article as: eLife 2020;9:e55167 doi: 10.7554/eLife.55167

Abstract

Spike sorting is a crucial step in electrophysiological studies of neuronal activity. While many spike sorting packages are available, there is little consensus about which are most accurate under different experimental conditions. SpikeForest is an open-source and reproducible software suite that benchmarks the performance of automated spike sorting algorithms across an extensive, curated database of ground-truth electrophysiological recordings, displaying results interactively on a continuously-updating website. With contributions from eleven laboratories, our database currently comprises 650 recordings (1.3 TB total size) with around 35,000 ground-truth units. These data include paired intracellular/extracellular recordings and state-of-the-art simulated recordings. Ten of the most popular spike sorting codes are wrapped in a Python package and evaluated on a compute cluster using an automated pipeline. SpikeForest documents community progress in automated spike sorting, and guides neuroscientists to an optimal choice of sorter and parameters for a wide range of probes and brain regions.

Introduction

Background

Direct electrical recording of extracellular potentials (Buzsáki, 2004; Seymour et al., 2017) is one of the most popular modalities for studying neural activity since it is possible to determine, with sub-millisecond time resolution, individual firing events from hundreds (potentially thousands) of cells, and to track the activity of individual neurons over hours or days. Recordings are acquired either from within the living animal (in vivo) or from extracted tissue (ex vivo), at electrodes separated by typically 5–25 µm, with baseline noise on the order of 10 µV RMS and 10–30 kHz sampling rate. Probes for in vivo use—which are usually needle-like to minimize tissue damage during insertion—include microwire monotrodes (Hubel, 1957; Nicolelis et al., 1997), tetrodes (Gray et al., 1995; Harris et al., 2000; Dhawale et al., 2017), and multi-shank probes (with typically 1–4 columns of electrodes per shank) on silicon (Csicsvari et al., 2003; Buzsáki, 2004; Jun et al., 2017b) or polymer (Kuo et al., 2013; Chung et al., 2019) substrates. Multiple such probes are often combined into arrays to cover a larger volume in tandem. For ex vivo use (e.g., explanted retina), planar, two-dimensional multi-electrode arrays (MEAs) are common, allowing channel counts of up to tens of thousands (Eversmann et al., 2003; Litke et al., 2004; Berdondini et al., 2005; Yuan et al., 2016; Tsai et al., 2017).

Spike sorting is an essential computational step needed to isolate the activity of individual neurons, or units, within extracellular recordings which combine noisy signals from many neurons. Historically, this procedure has relied on manual steps (Hazan et al., 2006; Prentice et al., 2011; Rossant et al., 2016): putative waveforms crossing an amplitude threshold are visualized in a low-dimensional space (either using peak amplitudes or dimensionality reduction techniques), then clusters are separated by eye. While manual spike sorting is manageable with small numbers of recording channels, the rapid growth in channel counts and data volume in recent years as well as the requirement for reproducibility and objectivity demand automated approaches.

Most automated algorithms apply a sequence of steps that include filtering, detection, dimension reduction, and clustering, although these may be combined with (or replaced by) many other approaches such as template matching (Prentice et al., 2011; Pillow et al., 2013; Pachitariu et al., 2016), dictionary learning or basis pursuit (Carlson et al., 2014; Ekanadham et al., 2014), and independent component analysis (Takahashi et al., 2002; Buccino et al., 2018). The past 20 years have seen major efforts to improve these algorithms, with recent work focusing on the challenges arising from probe drift (changing spike waveform shapes), spatiotemporally overlapping spikes, and massive data volumes. We will not attempt a full review here, but instead refer the reader to, for example Fee et al., 1996; Lewicki, 1998; Quiroga, 2012; Einevoll et al., 2012; Rey et al., 2015; Lefebvre et al., 2016; Hennig et al., 2019; Carlson and Carin, 2019.

In the last few years, many automated spike sorters have been released and are in wide use. Yet, there is little consensus about which is the best choice for a given probe, brain region and experiment type. Often, decisions are based not on evidence of accuracy or performance but rather on the ease of installation or usage, or historical precedent. Thus, the goals of extracting the highest quality results from experiments and of improving reproducibility across laboratories (Denker et al., 2018; Harris et al., 2016) make objective comparison of the available automated spike sorters a pressing concern.

Prior work

One approach to assessing spike sorter accuracy is to devise intrinsic quality metrics that are applied to each sorted unit, quantifying, for instance, the feature-space isolation of a cluster of firing events (Pouzat et al., 2002; Schmitzer-Torbert et al., 2005; Hill et al., 2011; Neymotin et al., 2011; Barnett et al., 2016; Chung et al., 2017). Another approach is to use biophysical validation methods such as examining cross-correlograms or discovered place fields (Li et al., 2015; Chung et al., 2017). However, the gold standard, when possible, is to evaluate the sorter by comparing with ground-truth data, that is using recordings where the spike train for one or more units is known a priori. Laboratory acquisition of such recordings is difficult and time-consuming, demanding simultaneous paired extracellular and intra-/juxta-cellular probes (Harris et al., 2000; Franke et al., 2015; Neto et al., 2016; Yger et al., 2018; Allen et al., 2018; Marques-Smith et al., 2018a). Since the number of ground-truth units collected in this way is currently small (one per recording), hybrid recordings (where known synthetic firing events are added to experimental data) (Marre et al., 2012; Steinmetz, 2015; Rossant et al., 2016; Pachitariu et al., 2016; Wouters et al., 2019), and biophysically detailed simulated recordings (Camuñas-Mesa and Quiroga, 2013; Hagen et al., 2015; Gratiy et al., 2018; Buccino and Einevoll, 2019), which can contain 1–2 orders of magnitude more ground-truth units, have also been made available for the purpose of method validation.

Recently, such ground-truth data have been used to compare new spike sorting algorithms against preexisting ones (Einevoll et al., 2012; Pachitariu et al., 2016; Chung et al., 2017; Jun et al., 2017a; Lee et al., 2017; Yger et al., 2018). However, the choice of accuracy metrics, sorters, data sets, parameters, and code versions varies among studies, making few of the results reproducible, transparent, or comprehensive enough to be of long-term use for the community. To alleviate these issues, a small number of groups initiated web-facing projects to benchmark spike sorter accuracy, notably G-Node (Franke et al., 2012), a phy hybrid study (Steinmetz, 2015) and spikesortingtest (Mitelut, 2016). To our knowledge, these unmaintained projects are either small-scale snapshots or are only partially realized. Yet, in the related area of calcium imaging, leaderboard-style comparison efforts have been more useful for establishing community benchmarks (Freeman, 2015; Berens et al., 2018).

SpikeForest

We have addressed the above needs by creating and deploying the SpikeForest software suite. SpikeForest comprises a large database of electrophysiological recordings with ground truth (collected from the community), a parallel processing pipeline that benchmarks the performance of many automated spike sorters, and an interactive website that allows for in-depth exploration of the results. At present, the database includes hundreds of recordings, of the types specified above (paired and state-of-the-art biophysical simulation), contributed by eleven laboratories and containing more than 30,000 ground-truth units. Our pipeline runs the various sorters on the recordings, then finds, for each ground-truth unit, the sorted unit whose firing train is the best match, and finally computes metrics involving the numbers of correct, missing, and false positive spikes. A set of accuracy evaluation metrics are then derived per ground-truth unit for each sorter. By averaging results from many units of a similar recording type, we provide high-level accuracy summaries for each sorter in various experimental settings. In order to understand the failure modes of each sorter, SpikeForest further provides various interactive plots.

A central aim of this project is to maximize the transparency and reproducibility of the analyses. To this end, all data—the set of recordings, their ground-truth firings, and firing outputs from all sorters—are available for public download via our Python API. SpikeForest itself is open-source, as are the wrappers to all sorters, the Docker (Merkel, 2014) containers, and all of the parameter settings used in the current study results. In fact, code to rerun any sorting task may be requested via the web interface, and is auto-generated on the fly. Both the code and the formulae (for accuracy, SNR, and other metrics) are documented on the site, with links to the source code repositories.

Contribution

Our work has three main objectives. The primary goal is to aid neuroscientists in selecting the optimal spike sorting software (and algorithm parameters) for their particular probe, brain region, or application. A second goal is to spur improvements in current and future spike sorting software by providing standardized evaluation criteria. This has already begun to happen as developers of some spike sorting algorithms have already made improvements in direct response to this project. As a byproduct, and in collaboration with the SpikeInterface project (Buccino et al., 2019), we achieve a third objective of providing a software package which enables laboratories to run a suite of many popular, open-source, automatic spike sorters, on their own recordings via a unified Python interface.

Results

In conjunction with the SpikeInterface project (Buccino et al., 2019), the SpikeForest Python package provides standardized wrappers for the following popular spike sorters: HerdingSpikes2 (Hilgen et al., 2017), IronClust (Jun et al., in preparation), JRCLUST (Jun et al., 2017a), KiloSort (Pachitariu et al., 2016), KiloSort2 (Pachitariu et al., 2019), Klusta (Rossant et al., 2016), MountainSort4 (Chung et al., 2017), SpyKING CIRCUS (Yger et al., 2018), Tridesclous (Garcia and Pouzat, 2019), and WaveClus (Chaure et al., 2018; Quiroga et al., 2004). Details of each of these algorithms are provided in Table 1. Since each of these spike sorters operates within a unique computing environment, we utilize Docker (Merkel, 2014) and Singularity (Kurtzer et al., 2017) containers to rigorously encapsulate the versions and prerequisites for each algorithm, ensuring independent verifiability of results, and circumventing software library conflicts.

The electrophysiology recordings (together with ground-truth information) registered in SpikeForest are organized into studies, and studies are then grouped into study sets. Table 1 details all study sets presently in the system. Recordings within a study set share a common origin (e.g., laboratory) and type (e.g., paired), whereas recordings within the same study are associated with very similar simulation parameters or experimental conditions.

Table 1
Table of spike sorting algorithms currently included in the SpikeForest analysis.

Each algorithm is registered into the system via a Python wrapper. A Docker recipe defines the operating system and environment where the sorter is run. Algorithms with asterisks were updated and optimized using SpikeForest data. For the other algorithms, we used the default or recommended parameters.

Sorting algorithmLanguageNotes
HerdingSpikes2*PythonDesigned for large-scale, high-density multielectrode arrays. See Hilgen et al., 2017.
IronClust*MATLAB and CUDADerived from JRCLUST. See Jun et al., in preparation.
JRCLUSTMATLAB and CUDADesigned for high-density silicon probes. See Jun et al., 2017a.
KiloSortMATLAB and CUDATemplate matching. See Pachitariu et al., 2016.
KiloSort2MATLAB and CUDADerived from KiloSort. See Pachitariu et al., 2019.
KlustaPythonExpectation-Maximization masked clustering. See Rossant et al., 2016.
MountainSort4Python and C++Density-based clustering via ISO-SPLIT. See Chung et al., 2017.
SpyKING CIRCUS*Python and MPIDensity-based clustering and template matching. See Yger et al., 2018.
Tridesclous*Python and OpenCLSee Garcia and Pouzat, 2019.
WaveClusMATLABSuperparamagnetic clustering. See Chaure et al., 2018; Quiroga et al., 2004.
Table 2
Table of study sets currently included in the SpikeForest analysis.

Study sets fall into three categories: paired, synthetic, and curated. Each study set comprises one or more studies, which in turn comprise multiple recordings acquired or generated under the same conditions.

Study set# Rec. / # Elec. / Dur.Source lab.Description
Paired intra/extracellular
PAIRED_BOYDEN19 / 32ch / 6-10minE. BoydenSubselected from 64, 128, or 256-ch. probes, mouse cortex
PAIRED_CRCNS_HC193 / 4-6ch / 6-12minG. BuzsakiTetrodes or silicon probe (one shank) in rat hippocampus
PAIRED_ENGLISH29 / 4-32ch / 1-36minD. EnglishHybrid juxtacellular-Si probe, behaving mouse, various regions
PAIRED_KAMPFF15 / 32ch / 9-20minA. KampffSubselected from 374, 127, or 32-ch. probes, mouse cortex
PAIRED_MEA64C_YGER18 / 64ch / 5minO. MarreSubselected from 252-ch. MEA, mouse retina
PAIRED_MONOTRODE100 / 1ch / 5-20minBoyden, Kampff, Marre, BuzsakiSubselected from paired recordings from four labs
Simulation
SYNTH_BIONET36 / 60ch / 15minAIBSBioNet simulation containing no drift, monotonic drift, and random jumps; used by JRCLUST, IronClust
SYNTH_JANELIA60 / 4-64ch / 5-20minM. PachitariuDistributed with KiloSort2, with and without simulated drift
SYNTH_MAGLAND80 / 8ch / 10minFlatiron Inst.Synthetic waveforms, Gaussian noise, varying SNR, channel count and unit count
SYNTH_MEAREC_NEURONEX60 / 32ch / 10minA. BuccinoSimulated using MEAREC, varying SNR and unit count
SYNTH_MEAREC_TETRODE40 / 4ch / 10minA. BuccinoSimulated using MEAREC, varying SNR and unit count
SYNTH_MONOTRODE111 / 1ch / 10minQ. QuirogaSimulated by Quiroga lab by mixing averaged real spike waveforms
SYNTH_VISAPY6 / 30ch / 5minG. EinevollGenerated using VISAPy simulator
Human curated
MANUAL_FRANKLAB21 / 4ch / 10-40minL. FrankThree manual curations of the same recordings

Each time the collection of spike sorting algorithms and ground-truth datasets are updated, our pipeline, depicted in Figure 1, reruns the ten sorters on the recordings. It then finds, for each ground-truth unit, the sorted unit whose firing train is the best match, and finally computes metrics involving the numbers of correct, missing, and false positive spikes. A set of accuracy evaluation metrics are then derived per ground-truth unit for each sorter and displayed on the website.

Simplified flow diagram of the SpikeForest analysis pipeline.

Each in a collection of spike sorting codes (top) are run on each recording with ground truth (left side) to yield a large matrix of sorting results and accuracy metrics (right). See the section on comparison with ground truth for mathematical notations. Recordings are grouped into ‘studies’, and those into ‘study sets’; these share features such as probe type and laboratory of origin. The web interface summarizes the results table by grouping them into study sets (as in Figure 2), but also allows drilling down to the single study and recording level. Aspects such as extraction of mean waveforms, representative firing events, and computation of per-unit SNR are not shown, for simplicity.

Main results table from the SpikeForest website showing aggregated results for 10 algorithms applied to 13 registered study sets.

The left columns of the table show the average accuracy (see (5)) obtained from averaging over all ground-truth units with SNR above an adjustable threshold, here set to 8. The right columns show the number of ground-truth units with accuracy above an adjustable threshold, here set to 0.8. The first five study sets contain paired recordings with simultaneous extracellular and juxta- or intra-cellular ground truth acquisitions. The next six contain simulations from various software packages. The SYNTH_JANELIA, obtained from Pachitariu et al., 2019, is simulated noise with realistic spike waveforms superimposed at known times. The last study set is a collection of human-curated tetrode data. An asterisk indicates an incomplete (timed out) or failed sorting on a subset of results; in these cases, missing accuracies are imputed using linear regression as described in the Materials and methods. Empty cells correspond to excluded sorter/study set pairs. These results reflect the analysis run of March 23rd, 2020.

Web interface

The results of the latest SpikeForest analysis may be found at https://spikeforest.flatironinstitute.org and are updated on a regular basis as the ground-truth recordings, sorting algorithms, and sorting parameters are adjusted based on community input. The central element of this web page is the main results matrix (Figure 2) which summarizes results for each sorter listed in Table 1 (using formulae defined later by Equation 5). The average accuracies are mapped to a color scale (heat map), with darker blue indicating higher accuracy, using a nonlinear mapping designed to highlight differences at the upper end. For the average accuracy table on the left, only ground-truth units with SNR above a user-adjustable threshold are included in the average accuracy calculations; the user may then explore interactively the effect of unit amplitude on the sorting accuracies of all sorters. If a sorter either crashes or times out (>1 hr run time) on any recording in a study set, an asterisk is appended to that accuracy result, and the missing values are imputed using linear regression as described in the Materials and methods section (there is also an option to simply exclude the missing data from the calculation).

The right table of Figure 2 displays the number of ground truth units with accuracy above a user-adjustable threshold (0.8 by default), regardless of SNR. This latter table may be useful for determining which sorters should be used for applications that benefit from a high yield of accurately sorted units and where the acceptable accuracy threshold is known. The website also allows easy switching between three evaluation metrics (accuracy, precision, and recall) as described in the section on comparison with ground truth.

Clicking on any result expands the row into its breakdown across studies. Further breakdowns are possible by clicking on the study names to reveal individual recordings. Clicking on any result brings up a scatter plot of accuracy vs. SNR for each ground-truth unit for that study/sorter pair (e.g., Figure 3, left side). Additional information can then be obtained by clicking on the markers for individual units, revealing individual spike waveforms (e.g., Figure 3, right side).

Screenshots from the SpikeForest website.

(left) Scatter plot of accuracy vs. SNR for each ground-truth unit, for a particular sorter (KiloSort2) and study (a simulated drift dataset from the SYNTH_JANELIA study set). The SNR threshold for the main table calculation is shown as a dashed line, and the user-selected unit is highlighted. Marker area is proportional to the number of events, and the color indicates the particular recording within the study. (right) A subset of spike waveforms (overlaid) corresponding to the selected ground truth unit, in four categories: ground truth, sorted, false negative, and false positive.

Parallel operation and run times

Since neuroscientist users also need to compare the efficiencies (speeds) of algorithms, we measure total computation time for each algorithm on each study, and provide this as an option for display on the website via a heat map. Run times are measured using our cluster pipeline, which allocates a single core to each sorting job on shared-memory multi-core machines (with GPU resources as needed). Since many jobs thus share I/O and RAM bandwidth on a given node, these cannot be taken as accurate indicators of speeds in ideal, or even typical, laboratory settings. We emphasize that our pipeline has been optimized for generation and updating of the accuracy results, not for speed benchmarking. For these reasons, we will not present run time comparisons in this paper, referring the interested reader to the website. Here we only note that older sorters such as Klusta can be over 30 times slower than more recent GPU-enabled sorters such as KiloSort and IronClust.

At present, the total compute time for the 650 recordings and 10 sorters is 380 core hours, yet it takes only 3–4 hr (excluding failing jobs) to complete this analysis when run in parallel on our compute cluster with up to 100–200 jobs running simultaneously (typically 14 jobs per node). Since the system automatically detects which results require updating, the pipeline may be run on a daily basis utilizing minimal compute resources for the usual situation where few (if any) updates are needed.

Sorter accuracy comparison results

We now draw some initial conclusions about the relative performances of the spike sorters based on the threshold choices in Figure 2. No single spike sorter emerged as the top performer in all study sets, with IronClust, KiloSort2, MountainSort4, and SpyKING CIRCUS each appearing among the most accurate in at least six of the study sets.

The higher average accuracy of KiloSort2 over its predecessor KiloSort is evident, especially for paired recordings. However, in synthetic studies, particularly tetrodes, KiloSort finds more units above accuracy 0.8 than KiloSort2. Scatter plots (e.g., Figure 3, left side) show that KiloSort2 can retain high accuracy down to lower SNR than other sorters, but not for all such low-SNR units. While KiloSort2 was among the best performers for six of the study sets, KiloSort and KiloSort2 had higher numbers of crashes than any of the other sorters, including crashing on every one of the SYNTH_VISAPY recordings. It is likely that modifications to sorting parameters could reduce the number of crashes, but attempts so far, including contacting the author, have not yet fixed this problem. In the synthetic datasets, KiloSort2 had the largest number of false positive units (distinct from the false positive rate of a single unit), but this is not currently reported by SpikeForest (see Discussion).

IronClust appears among the top average accuracies for eight of the study sets, and is especially strong for the simulated and drifting recordings. For most study sets, IronClust has improved accuracy over its predecessor JRCLUST, and is also improved in terms of speed and reliability (no crashes observed). Although a substantial portion of the development of the IronClust software took place while it had access to the SpikeForest ground truth datasets, the same sorting parameters are used across all studies, limiting the potential for overfitting (see Discussion).

MountainSort4 is among the top performers for six of the study sets (based on the average accuracy table) and does particularly well for the low-channel-count datasets (monotrodes and tetrodes). It is not surprising that MountainSort4 is the top performer for MANUAL_FRANKLAB because that data source was used in development of the algorithm (Chung et al., 2017).

When considering the left table (average accuracy), SpyKING CIRCUS is among the best sorters for ten study sets. However, it ranks a lot lower in the unit count table on the right of Figure 2. This was an example of a sorter that improved over a period of months as a result of using SpikeForest for benchmarking.

HerdingSpikes2 was developed for high-density MEA probes and uses a 2D estimate of the spike location, hence was applied only for recordings with a sufficiently planar electrode array structure (this excluded tetrodes and linear probes). For PAIRED_MEA64C_YGER its performance was similar to other top sorters, but in the other study sets, it was somewhat less accurate. One advantage of HerdingSpikes2 not highlighted in the results table is that it is computationally efficient for large arrays, even without using a GPU.

Tridesclous is among the top performers for both MEAREC study sets and for PAIRED_MEA64C_YGER, but had a substantially lower accuracy for most of the other datasets. This algorithm appears to struggle with lower-SNR units.

Klusta is substantially less accurate than other sorters in most of the study sets, apart from MANUAL_FRANKLAB where, surprisingly, it found the most units above accuracy 0.8 of any sorter. It also has one of the highest crash/timeout rates.

The version of WaveClus used in SpikeForest is only suited for (and only run on) monotrodes; a new version of WaveClus now supports polytrodes, but we have not yet integrated it. We included both paired and synthetic monotrode study sets with studies taken from selected single electrodes of other recordings. Four sorters (HerdingSpikes2, JRCLUST, KiloSort, and KiloSort2) were unable to sort this type of data. Of those that could, MountainSort4 was the most accurate, with accuracies slightly higher than WaveClus.

An eleventh algorithm, Yet Another Spike Sorter (YASS) (Lee et al., 2017), was not included in the comparison because, even after considerable effort and reaching out to the authors, its performance was too poor, leading us to suspect an installation or configuration problem. We plan to include YASS in a future version of the analysis.

Precision and recall results

Depending on the scientific question being asked, researchers may want to place a greater importance on maximizing either precision or recall. Precision is the complement of the false positives rate which corresponds to spikes incorrectly labeled as coming from some true neuron. A low precision (high number of false positives) may result in illusory correlations between units and a potentially false conclusion that the neurons are interacting, or may result in a false correlation between a unit firing and some stimulus or task. A low recall, on the other hand, means a large fraction of the true firing events are missed, causing a general reduction in putative firing rates, and also possibly introducing false correlations.

Figure 4 shows aggregated precision and recall scores for the results in the main SpikeForest table, again using the SNR threshold of 8 (keep in mind that conclusions can depend strongly on this threshold). We will not attempt to summarize the entire set of results, only to make two observations. For the paired studies, the sorters that have the highest precisions are IronClust, KiloSort2, MountainSort4, and SpyKING CIRCUS. For the paired and manual studies, precisions are generally a lot lower than recalls, across most sorters. Interestingly, this is not generally true for the synthetic studies (where often the precision is higher than recall), indicating that, despite the sophistication of many of these simulations, they may not yet be duplicating the firing and noise statistics of real-world electrophysiology recordings.

Results table from the SpikeForest website, similar to the left side of Figure 2 except showing aggregated precision and recall scores rather than accuracy.

Precision measures how well the algorithm avoids false positives, whereas recall is the complement of the false negative rate. An asterisk indicates an incomplete (timed out) or failed sorting on a subset of results; in these cases, missing accuracies are imputed using linear regression as described in the Materials and methods. Empty cells correspond to excluded sorter/study set pairs. These results reflect the analysis run of March 23rd, 2020.

How well can quality metrics predict accuracy?

In addition to informing the selection of a spike sorter, our SpikeForest analysis provides an unprecedented opportunity to compare various quality metrics that can be used to accept or reject sorted units when ground truth is not available (i.e., in a laboratory setting). For each sorter, what is the quality metric (or combination thereof) most predictive of actual accuracy? Figure 5 is based on the SYNTH_JANELIA tetrode study and shows the relationships between ground-truth accuracy and three metrics of the sorted units: SNR, mean firing rate, and inter-spike interval violation rate (ISI-vr) (Hill et al., 2011). The latter is the ratio between the number of refractory period violations (2.5 ms threshold) and the expected number of violations under a Poisson spiking assumption. We observe that these relationships are highly dependent on the spike sorter. For IronClust, the SNR and log ISI-vr are predictive of accuracy, whereas firing rate is much less predictive. For KiloSort and SpyKING CIRCUS, firing rate and SNR are both predictive, but log ISI-vr does not appear to correlate. For KiloSort2 and MountainSort4, firing rate is the only predictive metric of the three. The final column in this plot shows that a linear combination of metrics is a better predictor than any metric alone. We note that this predictive ability will also depend on the recording type, and, in this case, fidelity of the simulation. For these reasons it is an important future task to extend this type of study across the entire database; see the discussion below.

Relationship between ground-truth accuracy and three quality metrics for all sorted units (with SNR ≥5), for the SYNTH_JANELIA tetrode study and five spike sorting algorithms.

Each marker represents a sorted unit. The x-axis of the plots in the final column is the predicted accuracy via linear regression using all three predictors (SNR, firing rate, and log ISI-vr).

Discussion

We have introduced a Python framework and public website for evaluating and comparing many popular spike sorting algorithms by running them on a large and diverse set of curated electrophysiology datasets with ground truth. We have described the principal features of the website. The Materials and methods section details how we prepared ground truth data of varying types (paired, synthetic, and manually-curated) and apply the algorithms in a uniform, transparent, and unbiased manner. We have summarized initial findings comparing sorter accuracies, and invite the reader to explore the continually updating results on the live site.

One conclusion is that (as of this time) there is no single sorter that is the most accurate across a diversity of probe types, experimental conditions, metric types and SNR cut-offs. Rather, several different sorters are optimal in different settings. For example, MountainSort4 performs especially well for low channel counts (tetrodes and monotrodes), IronClust excels in the simulated and drifting datasets, while KiloSort2 retains higher accuracy (for units that are found) at lower SNR than other sorters.

Our contribution also helps to overcome some practical issues with neurophysiology research in the laboratory. Traditionally, spike sorting codes bring special requirements for installation and custom input/output file formats, and also require expertise in sorting parameter selection and using visualization and curation utilities intimately tied to that code. This problem has been exacerbated by the use of GPU computing to improve performance in recently developed codes. It is therefore challenging, and unusual, for a laboratory to have more than one or two installed sorters for in-house evaluation. Part of this obstacle is overcome by the SpikeInterface project (Buccino et al., 2019), which SpikeForest utilizes, but another part is solved by the Docker and Singularity containers that we have developed to fully capture the operating system environments needed by each sorter.

In the past, comparisons between spike sorting algorithms have been biased or limited. Almost all have been presented in the context of a new or improved method, and so the choice of simulations used for evaluation can often be (unintentionally) biased toward showing that the new method is superior. Given only such reports, it is impractical for readers to verify that the various sorters were used properly and with sensible parameters. The openness and reproducibility of SpikeForest in part remedies this situation.

Validation approach and challenges

There are many ongoing challenges to the validation of spike sorters. Although human curation is still part of the spike sorting analysis pipeline in most labs, the increase in the potential yield from recently developed high-density recording devices will soon make this step infeasible. We therefore have adopted the philosophy that spike sorting algorithms should be evaluated in an automated reproducible fashion on recordings that we make publicly available, and that, when used for benchmarking purposes, the algorithms should be wrapped and run server-side without the possibility of human curation of their output. This follows evaluation efforts for automated clustering algorithms such as ClustEval (Wiwie et al., 2015). We constrast this to ‘competition’-style efforts (Franke et al., 2012; Freeman, 2015; Berens et al., 2018) which allow contributions of (potentially non-reproducible) sorting results, and which report accuracy on held-out data whose ground truth is necessarily private, and thus cannot be interrogated by the community.

Although some effort has been made to choose good parameters for each sorter, optimal choices are not guaranteed. We anticipate that a use case for SpikeForest will be finding the best combination of sorter and parameters for a particular recording type. However, automated parameter optimization (as in ClustEval by Wiwie et al., 2015) has not yet been implemented, since it would multiply the total CPU cost by a large factor. For now, we encourage the community to contact us with improved settings or algorithms. Indeed, we reached out to the developers of all ten tested sorters in mid-2019 with preliminary results, and several developers (including at least those of SpyKING CIRCUS, HerdingSpikes2, IronClust, and Tridesclous; see Table 1) have already used SpikeForest to improve their software. We emphasize that, for reproducibility, sorter versions and parameter choices used for each date-stamped analysis are available in the downloadable analysis archives linked from the website.

This raises the issue of potential overfitting. Because all SpikeForest data are public, the community may continue to optimize sorters using SpikeForest as a benchmark, so one might wonder if this will lead to SpikeForest metrics which overestimate true real-world performance. We believe that such bias will be small, and certainly smaller than the bias of studies conducted in order to propose a single sorting algorithm, for the following reasons. 1) Our database is large and diverse, comprising 13 different types of study sets. It is difficult to imagine overfitting to 650 different recordings by optimizing even dozens of parameters, and, while we allow the potential (as yet unused) for different parameter choices for different studies, multiple recordings would still have to be overfit by one parameter set. 2) All code and parameter settings are available for public inspection, making the above style of parameter and algorithm hacking obvious, were it to occur. We feel that the benefit to the community of improved, validated sorters far outweighs the risk of overfitting. Naturally, SpikeForest results are biased towards the types of neurons and conditions for which ground-truth data exist; by keeping results for different study sets separate we expose such biases as much as possible. In the long run, we appreciate that both held-out and open data benchmarks will play a useful role in comparison and standardization.

Another issue is the paucity of paired ground-truth units in the database, a consequence of the time-consuming nature of their collection. For instance the study set PAIRED_KAMPFF has 15 units, of which only 11 have sufficient SNR to be sortable by any of the sorters. Therefore a single false split or merge by a sorter can lead to variations in reported study-averaged accuracy as large as ±0.05, and dictate the most accurate sorter for that study set. In (small-scale) laboratory pipelines it is possible that such splits or merges would be detected by expert human curation; the point of SpikeForest is to measure the performance of purely automated algorithms. Such variance will be reduced as the size of the ground-truth studies increases.

Future work

The above issues help inform several specific future goals:

  1. We plan to report false positive units. Currently only the one sorted unit which best matches each ground-truth unit is assessed for accuracy. Thus, there is no penalty for a sorter which generates many spurious units that are not present in the data (this is necessarily true for KiloSort in most cases, since the number of returned units is set in advance by the user). Results from such a sorter would of course then be less credible, and, even if examined by an expert, much harder to curate. This failure mode is not revealed by paired recordings; however, in simulations (where every single firing is ground truth) a new metric capturing the fraction of such spurious units will be included (see Buccino et al., 2019).

  2. SpikeForest does not directly address the common laboratory task of deciding, given a new sorting of electrophysiology recordings, which neural units are to be trusted and which discarded. Assessing the credibility of putative neural units output by spike sorters remains a crucial question, and many quality metrics are used in practice, including SNR, firing rate, metrics of cluster isolation or separation from the ‘noise cluster’, stability, and cross-correlograms (Pouzat et al., 2002; Schmitzer-Torbert et al., 2005; Hill et al., 2011; Neymotin et al., 2011; Barnett et al., 2016; Chung et al., 2017). Yet, because of its scale, the SpikeForest database provides a unique opportunity to tabulate such quality metrics for multiple sorter outputs, then correlate each to the actual ground-truth accuracy, with the goal of assessing the predictive power of each metric, or combinations thereof. Figure 5 showed such a metric comparison for just one of the SpikeForest studies. We are now poised to scale this up to the entire database and a wider set of metrics.

  3. Most of the ground truth recordings in our database have a relatively short duration, on the order of 10 min. When waveform drift is not a factor, this duration usually yields enough events per unit to satisfy the requirements of clustering algorithms. However, when waveform drift is present, it is important to assess the performance of spike sorters for a greater range of recording durations. We plan to provide examples of longer duration datasets (on the order of 1–2 hr) in future iterations.

  4. It is currently not as easy as it could be to identify common failure modes such as false merges and false splits. We plan to display confusion matrices (see, e.g., Barnett et al., 2016) between ground-truth and a given sorting; this would also allow comparison between two different sorter outputs.

  5. We plan to add accuracy evaluations taylored to specific tasks, such as the ability to handle probe drift and/or long recording durations (currently only three synthetic studies explicitly test drift). The website could, for instance, visualize successful tracking as a function of time.

  6. The comparison of CPU time for each sorter is currently sub-optimal, because multiple sorters may be running on one node. We plan to include special timing comparisons on dedicated nodes in order to reflect actual laboratory use cases.

  7. The number of sorter job crashes and time-outs needs be reduced, which requires detailed diagnosis on many sorter-recording pairs. For instance, currently KiloSort2 has a larger proportion of crashes or time-outs than many other sorters, yet this is not fully understood and not quantified in accuracy summary results.

  8. Finally, we envision that the machinery of SpikeForest could be used for a web-based spike sorting platform, to which users would upload their data (which becomes public), run possibly more than one sorter, visualize, curate, and download the results. This could render all spike sorting and human curation affecting the downstream science analysis accessible and reproducible.

We anticipate that, as the use and scale of spike sorting as a tool continues to grow, SpikeForest will become a resource for comparison and validation of sorter codes and encourage more rigorous reproducibility in neuroscience. To this end, we seek contributions and input from the electrophysiology community, both to optimize parameter settings for existing algorithms, and to further expand the set of algorithms and ground-truth recordings included in the analysis.

Materials and methods

Ground-truth recordings

The thirteen study sets included in the SpikeForest analysis are detailed in Table 2 and are grouped according to type (paired, simulated, curated).

Paired recordings

Request a detailed protocol

We selected 145 paired recordings from raw extracellular recordings that were publicly released or otherwise provided to us by four laboratories (Henze et al., 2000; Harris et al., 2000; Henze et al., 2009; Neto et al., 2016; Allen et al., 2018; Marques-Smith et al., 2018a; Marques-Smith et al., 2018b; Yger et al., 2018; Spampinato et al., 2018). The intracellular spike times were taken from the author-reported values unless they were not provided (Henze et al., 2009). 93 of these we prepared from raw Buzsáki laboratory recordings (Henze et al., 2000; Harris et al., 2000; Henze et al., 2009) with ground truth based on the intracellular traces, after excluding time segments containing artifacts due to movement and current injection. The other 52 were obtained from tetrodes or conventional silicon probes in rat hippocampus (PAIRED_CRCNS_HC1) (Henze et al., 2000; Harris et al., 2000; Henze et al., 2009), high-density silicon probes in mice cortex (PAIRED_KAMPFF and PAIRED_BOYDEN) (Neto et al., 2016; Allen et al., 2018; Marques-Smith et al., 2018a), and high-density MEAs in mice retina (PAIRED_MEA64C_YGER) (Yger et al., 2018; Spampinato et al., 2018). We also include 29 new paired recordings in awake behaving mice (PAIRED_ENGLISH), from a hybrid probe comprising a juxtacellular electrode glued on top of a Neuronexus silicon probe (similar to English et al., 2017 but with only ∼20 µm separation). These come from the hippocampus, neocortex and thalamus. We generated a monotrode version (PAIRED_MONOTRODE) of various paired recordings by randomly sampling one channel from each recording session to uniformly span the SNR range (25 units per study).

Synthetic recordings

Request a detailed protocol

Table 1 provides details on the ten algorithms included in the SpikeForest analysis. To overcome the limited number of units offered by the paired ground truth, we added simulated ground-truth study sets that were independently generated by five laboratories (Camuñas-Mesa and Quiroga, 2013; Hagen et al., 2015; Chung et al., 2017; Gratiy et al., 2018; Buccino and Einevoll, 2019). The simulators vary in their biophysical details, computational speeds, and configurable parameters. Simulations based on phenomenological models tend to be fast and easily configurable (e.g., SYNTH_MAGLAND, identical to the simulations in Chung et al., 2017 except with iid Gaussian noise), while biophysical simulators such as SYNTH_VISAPY (Hagen et al., 2015) and SYNTH_BIONET (Gratiy et al., 2018) use synaptically connected, morphologically detailed neurons to achieve high fidelity at the expense of computational speed. SYNTH_MONOTRODE (Camuñas-Mesa and Quiroga, 2013) and SYNTH_MEAREC (Buccino and Einevoll, 2019) take an intermediate approach by generating spike waveform templates based on single-neuron simulations and randomly placing the spike waveforms conforming to pre-specified ISI distributions.

SYNTH_BIONET was generated from the Allen Institute’s BioNet simulator (Gratiy et al., 2018; Jun et al., 2017a) running on the computing resources provided by the Neuroscience Gateway (Sivagnanam et al., 2013). We simulated a column of synaptically connected neurons (n=708, 200×200×600μm3) based on the rat cortical NEURON models (Hines and Carnevale, 1997; Ascoli et al., 2007; Mitelut, 2017; Jun et al., 2017a) by capturing the spike waveforms at four vertical columns of densely-spaced electrodes (2 µm vertical, 16 µm horizontal pitch, 600 channels). Linear probe drift was simulated by subsampling the electrodes to match the Neuropixels site layout (20–25 µm pitch) and vertically shifting the electrode positions as a function of time. To achieve smooth linear motion, a 2D-interpolation based on a Gaussian kernel was applied (0.5 µm vertical spacing, 16 µm total displacement for 16 min). Based on the linear probe drift simulation, we generated a ’shuffled’ version to mimic the fast probe displacements during animal movement by subdividing the recordings into 32 time segments and randomly shuffling (30 s per segment).

The simulation entitled SYNTH_JANELIA was developed by J. Colonell (Pachitariu et al., 2019) and uses average unit waveforms from a densely spaced electrode array (5 µm pitch, 15 × 17 layout) collected by the Kampff, 2018 laboratory. The waveform templates were inserted at randomly selected channels and time points after being multiplied by an amplitude-scaling factor drawn from a Gamma distribution. The baseline noise was randomly generated to match the power spectrum observed from a Neuropixels recording, and a spatiotemporal smoothing was applied to induce correlation between nearby channels and time samples. The original simulation generated either static or sinusoidal drift (±10 µm displacement, 600 s period) with 1200s duration (30 KS/s) on the Neuropixels layout (64 channels, 25 µm pitch, staggered four columns). We trimmed the original simulations to generate shorter recording durations (600, 1200s) and channel counts (4, 16, 32, 64 channels) to study the effects of these parameters on sorting accuracy.

Manually sorted recordings

Request a detailed protocol

Based on the requests of the electrophysiology community, we included a single study set of manually curated tetrode sortings (MANUAL_FRANKLAB) (Chung et al., 2017); we emphasize that, for this study set only, there is no ground-truth data. Accuracies are reported relative to three independent manual sortings of the same tetrode recording session in a chronically implanted rat hippocampus. We subdivided the entire recording duration to generate three different durations (600, 1200, 2400 s) to study the effect of duration on sorting accuracy, but did not find a consistent relationship across all sorters.

Registration

Request a detailed protocol

A content-addressable storage database is used to store the file content of each recording, and all data is available for public download via our Python API. Since these files are indexed according to their SHA-1 hashes, their content is guaranteed not to have changed even when the mechanism for retrieving the data evolves over time, ensuring long-term repeatability of the analysis.

Sorting algorithms and wrappers

Request a detailed protocol

Table 2 provides details on the ten algorithms included in the SpikeForest analysis. For each spike sorter, SpikeForest contains a Python wrapper and a Docker (and Singularity) container defining the exact execution environment including all necessary dependencies. For the sorters that are implemented in MATLAB, the MATLAB Compiler tool was used to generate standalone binary files inserted into the containers so that a MATLAB license is not required to run spike sorting. The SpikeForest framework flexibly allows running of each sorter either within the native operating system or within the container. The former method is useful during development or in an environment where the spike sorting software is already installed. The latter is crucial for ensuring reproducibility and for avoiding conflicts between sorters due to incompatible dependencies. The Python wrappers make use of the spikesorters package of the SpikeInterface project (Buccino et al., 2019) for passing the parameters and executing the sorter. All sorters operate on the raw (unfiltered) recordings.

An eleventh algorithm, YASS, was incorporated into the Python package, but was not included in the comparison, as discussed in the Results section.

Comparison with ground truth

Request a detailed protocol

Depending on the experimental context, false negatives (missed events) and false positives may have different relative importance for the researcher. Thus the SpikeForest website allows switching between three evaluation metrics for the comparison with ground truth: precision (which penalizes only false positives), recall (which penalizes only false negatives), and an overall accuracy metric (which balances the two). For each sorter-recording pair these are computed by comparing the output of spike sorting (spike times and labels) with the ground-truth timings associated with the recording, as follows.

We first consider a sorted unit k and a ground-truth unit l, and describe how the events in the spike train for k are matched to ground-truth events of l. Let s1(k),,sMk(k) be the spike train (timestamps) associated with sorted unit k, and let t1(l),,tNl(l) be the spike train for ground-truth unit l (see Figure 1). Let Δ be an acceptable firing time error, which we assume is shorter than half the refractory period of any true neuron. We set Δ to one millisecond; the results are rather insensitive to its exact value. The number of matches of sorted unit k to the ground-truth unit l is

(1) nl,kmatch:=#{i:|ti(l)-sj(k)|Δ for some j}.

Note that even if more than one sorted event falls within ±Δ of a true event, at most one is counted in this matching. The reverse situation—more than one ground-truth event from the same neuron matching to a given sorted event—cannot happen by our assumption about the refractory period. The number of missed events and false positives are then, respectively,

(2) nl,kmiss:=Nl-nl,kmatch,nl,kfp:=Mk-nl,kmatch,

where Nl is the total number of firings of ground-truth unit l, and Mk is the total number of found events of sorted unit k. Following Jun et al., 2017b we define the accuracy for this pair as

(3) al,k:=nl,kmatchnl,kmatch+nl,kmiss+nl,kfp.

Note that this definition of accuracy is a balance between precision and recall, and is similar, but not identical, to the F1 metric (Zaki and Meira Jr, 2014, Eq. (17.1)) used to evaluate clustering methods.

Fixing the ground-truth unit l, we define its best matching sorted unitk^l as the sorted unit k with highest accuracy,

(4) k^l:=argmaxkal,k.

Now restricting to this best match, we define the accuracy for ground-truth unit l by

(5) al:=al,k^l,

and the corresponding precision and recall for this unit by

(6) pl:=nl,k^lmatchnl,k^lmatch+nl,k^lfp rl:=nl,k^lmatchnl,k^lmatch+nl,k^lmiss.

Averages of these metrics are then computed for all units l within each study, without weighting by their numbers of events (i.e., treating all units equally). Note that, in the case of a recording with more than one ground-truth unit, it is possible that more than one such unit could share a common best-matching sorted unit, but this could only happen if these ground-truth units had extremely correlated events or if the sorting was highly inaccurate.

The spike sorted units that are considered in the computation of these metrics are only those that are matched to ground-truth units. Therefore, the results shown in the main table do not account for false positive units, that is units found by the spike sorters that are not present in the recording.

Compensating for missing data

Request a detailed protocol

As described above, when a sorting run fails (either crashes or times out), an asterisk is appended to the corresponding table cell, and the average accuracy is calculated based on imputed data using linear regression. There is also an option to simply exclude the missing values, but the problem with this method (which we have encountered) is that sometimes an algorithm will happen to crash on recordings with relatively difficult units, resulting in artificially elevated scores. Imputing by zero has the problem of yielding deceptively low values. We have thus opted to use linear regression to fill in the missing data using values predicted based on the accuracies of other sorters. Specifically, for a given sorter with missing data and a given study set (or study, depending on the level of aggregation), a linear model is estimated for fitting the non-missing values based on the values of all sorters with no missing data. That model is then applied to estimate the missing data. To give some intuition, if a sorter typically has somewhat lower scores than the other sorters and crashes on one recording, then for the purpose of computing average accuracy, the accuracy for that recording will be filled in with a value that is also somewhat lower than the other sorters.

In constrast, when reporting total numbers of units found above an accuracy cut-off (e.g., Figure 2, right side), we do not impute, but simply sum the number of non-missing units.

Signal-to-noise ratio per unit

Request a detailed protocol

We define SNR as a property of a single neural unit, either a ground-truth unit or a unit as output by a sorter. It is reported on the website and is used as a cut-off for selection of a subset of ground-truth units for computing average accuracy. SNR is computed on the bandpass-filtered timeseries data.

We first describe our filter used to compute SNR (noting that this is distinct from various filters used internally by the spike sorters). This filter is a bandpass from fmin=300 to fmax=6000, in Hz units. It is applied to each channel by taking the FFT, multiplying by the real-valued frequency filter function

(7) A(f)=121+erf((f-fmin)/100)1-erf((f-fmax)/1000),

where erf is the error function, then taking the inverse FFT. Here, the parameters 100 and 1000 control the roll-off widths in Hz at the low and high ends respectively. From this filtered timeseries and the set of firing times of a unit, the average spike waveform is extracted. SNR is then defined as the ratio between the peak absolute amplitude of this average spike waveform and the estimated noise on the channel where this peak amplitude occurs. The noise is estimated as the median absolute deviation of the filtered timeseries data divided by 0.6745, which gives a robust estimate of the standard deviation of the noise (Quiroga et al., 2004, Sec. 3.1).

Analysis pipeline

Request a detailed protocol

The analysis pipeline of SpikeForest depicted in Figure 6 is built using Python utilities developed by the authors for creating shareable and reproducible scientific workflows. This system provides a formal method for creating well-defined Python procedures that operate on input parameters and files and produce output files. These are known as processors. SpikeForest defines processors for running the spike sorters, computing properties of ground-truth units (e.g., SNR), comparing the spike sorting outputs with ground truth, and computing summary data for the plots shown on the website. Once these processors are set up, the framework provides several advantages including: (a) automatic execution of processing inside Singularity or Docker containers; (b) automatic caching of processing results; and (c) job management and queueing mechanisms for running batches of processing on a compute cluster. This allows the analysis pipeline to be defined as a standard Python script, with a simple nested loop iterating through the sorters and ground-truth datasets (as in Figure 1). The script may then be configured to run in a variety of settings: on a standard workstation for development, testing, or reproducing of a subset of results; a shared computer with large memory and many cores; or on a compute cluster.

Interaction of software and hardware components of the SpikeForest system, showing the flow of data from the server-side analysis (left) to the user’s web browser (right).

The processing pipeline automatically detects which sorting jobs need to be updated and runs these in parallel as needed on a compute cluster. Processing results are uploaded to two databases, one for relatively small JSON files and the other for large binary content. A NodeJS application pulls data from these databases in order to show the most up-to-date results on the front-end website.

A crucial and novel feature of our framework is that all files (both input and output) are represented by SHA-1 URIs, for example of the form: sha1://88b62db6fc467b83ba0693453c59c5f538e20d5c/firings_true.mda.

The hexadecimal code embedded in the URI is the SHA-1 hash of the content of the file, and therefore this URI uniquely identifies the desired file (in this case the ground-truth firing data for one of the SpikeForest recordings) without specifying the actual location of the data. In contrast, explicit location references (e.g., a path on the local computer, an IP address, or a web URL) can be problematic because over time data archives may stop being maintained, may change locations, or files may be renamed or updated with new content. The SHA-1 URI system alleviates these difficulties by separating the mechanism for storing archives of files from the representation of these files, via universal hash strings within scripts.

All SpikeForest input recordings, ground-truth data, sorting outputs, and other processing data are stored in a public content-addressable storage (CAS) system called kachery. The API of a kachery database simply allows downloading of files referenced only by their SHA-1 hashes (or URIs). Therefore, as long as the environment is configured to point to a kachery database with the relevant files available, the SpikeForest pipeline may be executed (in whole or in part) on any computer connected to the internet.

The automatic caching capability we developed is also crucial for the SpikeForest system. If updates are made to either the database of ground-truth recordings or to the sorting algorithms and parameters, the system can automatically detect which processing needs to be rerun. In this way, the pipeline can simply be executed in full at regular intervals, and the website will continuously update with the latest changes. This facilitates a conceptually simple method for adding new datasets or sorting algorithms because changes are represented by modifications to pipeline configuration files.

A limitation of the reliance on caching is the implicit assumption that the sorting algorithms are deterministic, that is given the same inputs, parameters, and code, the outputs should be exactly the same. However, this assumption can be violated by unfixed random seeds or even differences between implementations of floating-point arithmetic on different hardware. Controlling these factors is currently out of scope for SpikeForest, but we aim to measure the degree of stability (Barnett et al., 2016) with respect to repeated runs of each sorting task in future versions.

To ensure that all results may be reproduced at a later date and/or independently verified by a third party, the Python wrapper for each spike sorter includes a reference to a Docker image containing the entire operating system and environment where the sorter is installed. Our system handles the automatic download of these images (if not already on the local computer) and the execution of the sorting inside Docker or Singularity containers. For development and testing purposes, it is also possible to configure the pipeline to run outside the container, provided that the sorter software is installed on the operating system.

The output of processing is a single JSON file (only around 15 MB at this time) containing SHA-1 URI references to all recording files, ground-truth data, containers, spike sorting parameters, sorting outputs, comparisons with ground truth, and other information used by the website. It also contains meta-information about the sorting runs such as the execution times and console outputs. This JSON file is itself uploaded to the kachery database and is represented by a SHA-1 URI accessible on the website. The archive section of the website contains references to these files for all past analyses, allowing tracking of sorter performance over time.

Finally, the data from this JSON output file is loaded into a MongoDB database for efficient access by the website’s front end.

Website front end

Request a detailed protocol

The primary user interface for the SpikeForest platform is an isomorphic JavaScript web application with overall structure as shown on the right side of Figure 6. Built from reusable React components, the front end utilizes the D3 library to render the interactive tables and plots. A Node.JS backend, organized using the Redux state container, queries a MongoDB database to retrieve JSON files for each datatype and generate the comparative visualizations. For more detailed plots, like the spike sprays, larger data objects are retrieved via the content-addressable storage database.

We optimized the interface for efficient hierarchical navigation through the results and rapid loading and interactive response (for instance, when a user adjusts the SNR or accuracy cut-off slider bar). When the user clicks on results at the individual recording or unit level, they are taken to a page (with an auto-generated, shareable URL) specific to the study in question. This page allows one-click comparison of the sorters on this study. Clicking on individual units in the scatter plot for a given sorter-study pair brings up spike waveforms for that unit (Figure 3) and a link to a sub-page with details specific to the particular sorter-recording run. This latter page includes sorter parameters used, the console output of the run and a link to an auto-generated Python script with human-readable documentation to reproduce that sorter run within SpikeForest. A permanent top-level menu bar allows access to all meta-data about sorters (as in Table 1), study sets (as in Table 2, also allowing hierarchical drilling down to the individual recording level), historical snapshots, metrics used, and an explanation of the project. All of the website data and results sub-pages are automatically generated from the SpikeForest databases.

Enabling users to run all spike sorters locally on their own data

Request a detailed protocol

The benefit of having wrappers and Docker images for these ten spike sorting algorithms extends beyond usage within the SpikeForest pipeline. Researchers may also utilize these wrappers and images to spike sort their own data without needing to install and configure the individual sorting packages. As mentioned in the previous subsection, the website provides, for each sorting result, a self-contained Python script for reproducing that result offline. This script may easily be modified to operate on datasets that are not registered within the SpikeForest database. Here is an excerpt of the auto-generated script for running SpyKING CIRCUS on a dataset:

# Run the spike sorting 
with hither.config(container=’docker://magland/sf-spykingcircus:0.9.2’, gpu=False): 
  sorting_result=sorter.run( 
       recording_path=recording_path, 
       sorting_out=hither.File(), 
       **params 
  ) 
assert sorting_result.success 
sorting_path=sorting_result.outputs.sorting_out

This is compatible with all three OS platforms: Linux, MacOS and Windows (although, currently, the last of these cannot use GPU-based sorters). Here, hither is a utility within SpikeForest; note that the Docker container and version (located on Docker Hub) is specified in the configuration line. Because we use SpikeInterface, it is easy to switch between a variety of different file formats for the input and output data. The SpikeForest website provides further instructions and examples.

References

  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
    Independent component analysis for FullyAutomated Multi-Electrode array spike sorting. in: 2018
    1. AP Buccino
    2. E Hagen
    3. GT Einevoll
    4. PD Häfliger
    5. G Cauwenbergh
    (2018)
    40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) IEEE. pp. 2018–2040.
    https://doi.org/10.1109/EMBC.2018.8512788
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
  24. 24
  25. 25
  26. 26
    Neurofinder
    1. J Freeman
    (2015)
    Codeneuro.
  27. 27
  28. 28
  29. 29
  30. 30
  31. 31
  32. 32
  33. 33
  34. 34
    In Vitro Neuronal Networks
    1. MH Hennig
    2. C Hurwitz
    3. M Sorbaro
    (2019)
    171–184, Scaling spike detection and sorting for next-generation electrophysiology, In Vitro Neuronal Networks, Springer, 10.1007/978-3-030-11135-9.
  35. 35
  36. 36
  37. 37
  38. 38
  39. 39
  40. 40
  41. 41
  42. 42
  43. 43
  44. 44
  45. 45
  46. 46
    YASS:yet another spike sorter
    1. JH Lee
    2. DE Carlson
    3. HS Razaghi
    4. W Yao
    5. GA Goetz
    6. E Hagen
    7. E Batty
    8. EJ Chichilnisky
    9. GT Einevoll
    10. L Paninski
    (2017)
    Adv. Neural. Inf. Process. Syst. (NeurIPS). pp. 4002–4012.
  47. 47
  48. 48
  49. 49
  50. 50
  51. 51
  52. 52
  53. 53
  54. 54
    Docker: lightweight linux containers for consistent development and deployment
    1. D Merkel
    (2014)
     Linux Journal 2:62479797.
  55. 55
    Spikesortingtest
    1. CC Mitelut
    (2016)
    Internet Archive.
  56. 56
    Characterizing single neuron activity patterns and dynamics using multi-scale spontaneous neuronal activity recordings of cat and mouse cortex
    1. CC Mitelut
    (2017)
    University of British Columbia.
  57. 57
  58. 58
  59. 59
  60. 60
    Fast and accurate spike sorting of high-channel count probes with kilosort
    1. M Pachitariu
    2. NA Steinmetz
    3. SN Kadir
    4. M Carandini
    5. KD Harris
    (2016)
    Advances in Neural Information Processing Systems 29 (NIPS 2016). pp. 4448–4456.
  61. 61
  62. 62
  63. 63
  64. 64
  65. 65
  66. 66
  67. 67
  68. 68
  69. 69
  70. 70
  71. 71
    Introducing the neuroscience gateway
    1. S Sivagnanam
    2. A Majumdar
    3. K Yoshimoto
    4. V Astakhov
    5. A Bandrowski
    6. ME Martone
    7. NT Carnevale
    (2013)
    IWSG, Volume 993 of CEUR Workshop Proceedings.
  72. 72
  73. 73
    Sorting Comparison
    1. N Steinmetz
    (2015)
    Cortexlab.
  74. 74
  75. 75
  76. 76
  77. 77
  78. 78
  79. 79
  80. 80

Decision letter

  1. Markus Meister
    Reviewing Editor; California Institute of Technology, United States
  2. Ronald L Calabrese
    Senior Editor; Emory University, United States
  3. Markus Meister
    Reviewer; California Institute of Technology, United States

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

This paper introduces an online web site focused on evaluating different methods of spike sorting: the computational process that extracts the spiking of single neurons from a raw electrical recording of brain signals. The authors set themselves three goals: to aid scientists in selecting a spike sorter; to spur improvements to sorting algorithms; and to provide easy-to-use software packages. In an area that has been crying out for standardization this web site promises to be an invaluable resource.

Decision letter after peer review:

Thank you for submitting your article "SpikeForest: reproducible web-facing ground-truth validation of automated neural spike sorters" for consideration by eLife. Your article has been reviewed by three peer reviewers including Markus Meister as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by a Reviewing Editor and Ronald Calabrese as the Senior Editor.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

This paper introduces an online resource focused on evaluating different methods of spike sorting. Spike sorting is an indispensable step in analyzing extracellularly recorded neuronal data, and there have been large advances in the software that performs this task over the past few years. The result is a large number of algorithms and no general consensus as to which of these algorithms works best in each circumstance. Add to that the large number of different recording configurations and brain regions, and the result is substantial uncertainty about which sorter to use in any given situation. The SpikeForest website and comparison framework presented here represents a major step toward addressing that problem. The site maintains data from many extracellular recordings that include ground truth spike times of single units. And it implements about a dozen different spike sorting algorithms that can be applied to the same data sets. The results are analyzed by various quality metrics and compared across spike sorters and data sets in tabular form. The site also offers a dive into the results down to the waveforms of individual sorted units. It is live and extensible both with regards to new data sets and new sorters.

The authors set themselves three goals: to aid scientists in selecting a spike sorter; to spur improvements to sorting algorithms; and to provide easy-to-use software packages. Based on experience in our research groups and some polling of some colleagues, these goals have been achieved already. The paper is very well written and offers an indispensable complement to the website. Here is the consensus review:

Essential revisions:

The paper offers little guidance for how to interpret the output of each sorter. Each of these sorters will produce a number of clusters which do not correspond to single units. Some set of quality metrics must be applied when the user decides how to interpret the sorter's output, and the authors have the opportunity to provide some guidance on that critical topic. This is mentioned as the first future direction, and we would encourage the authors to make a first effort at this in the current manuscript.

The Introduction is quite long and reads more like a review article. Please distil this section down to the key facts that a reader needs to know and move other material to Results section or Discussion section. The first figure, for example, would be better placed at the beginning of the Results section.

All of the ground truth datasets are quite short (~10 min), but standard recordings from neuronal populations can span hours. It seems likely that the number of spikes from a given ground truth unit and the number of spikes from other units has an effect on sorting accuracy. Can this impact be quantified for each sorter? Please comment on whether the results will generalize to longer data sets.

Figure 2: Given that there is space for figures, it would be helpful to show not just the overall accuracy metric, but also the false positive and negatives for each sorter. Of course, this is available on the website, but readers may want this information right away. The caption should explain the meaning of empty cells in the table.

Please add some guidance for why one might choose one performance metric versus another. False positives, for example, can be particularly problematic when assessing correlations between units, while false negatives lead to underestimates of firing rates.

The process whereby a user might take advantage of the containerized sorters could be explained more clearly. This is not the main thrust of the manuscript, but creation of these standard software environments greatly furthers the goal of reproducibility.

Results section, Discussion section and elsewhere: Which of the spike sorters have been optimized on these specific test data? This is mentioned in context of IronClust and SpyKING CIRCUS. Maybe spell out in Table 2 for each sorter how the parameters were chosen and whether they were adapted to the test sets.

https://doi.org/10.7554/eLife.55167.sa1

Author response

Essential revisions:

The paper offers little guidance for how to interpret the output of each sorter. Each of these sorters will produce a number of clusters which do not correspond to single units. Some set of quality metrics must be applied when the user decides how to interpret the sorter's output, and the authors have the opportunity to provide some guidance on that critical topic. This is mentioned as the first future direction, and we would encourage the authors to make a first effort at this in the current manuscript.

We agree. This encouragement led to us performing a preliminary offline analysis that uses one of the SpikeForest study sets to predict accuracy on the basis of three quality metrics of sorted units: SNR, firing rate, and inter-spike-interval violation rate (ISI-vr). We chose these metrics as a starting point since they are commonly used, unambiguous, and straightforward to compute. There are other metrics used in the field that require, e.g. access to the clustering feature space, making their extraction purely from the sorted spike trains much more complicated.

A new subsection entitled "How well can quality metrics predict accuracy?" and a new figure (Figure 6) has been added to the Results section showing some of the results of this analysis. We decided to focus on the SYNTH_JANELIA study set because it contains many units with varying degrees of accuracies. As seen in the new Figure 6 in the revised manuscript, the relationship between accuracy and these three metrics is highly dependent on the spike sorting algorithm used (it also depends to some extent on the type of recording). For IronClust, the SNR and ISI-vr are predictive of accuracy, whereas firing rate is much less predictive. For KiloSort and SpykingCircus, firing rate and SNR are both predictive, but ISI-vr does not appear to correlate strongly. For KiloSort2 and MountainSort4, firing rate is the only predictive metric of the three.

There is a subtle but crucial issue that prevents the immediate expansion of such a correlation analysis to other recording types, as follows. The analysis can only be performed using simulated recordings because there all of the true units are known. For the real (paired) recordings the ground truth timing information is available for only one of the units. Say that we wish to show that a certain per-unit metric value is a good predictor of real-world inaccuracy. We can collect this metric for all sorted units, but we cannot prove that any sorted unit is inaccurate. For instance, while there may be no sorted units that match the ground-truth unit well, each may be an accurate sorting of some other legitimate unit. If the reviewers have any solution to this issue, we would appreciate any suggestions as we move to the next phase.

Here is another way to explain the issue: our current website displays accuracies for each ground truth unit by reporting the extent of matching with the most similar sorted unit. In contrast, the proposed metric exploration requires reporting the accuracy for each sorted unit relative to the best matching ground truth. Therefore, while each marker in the website scatterplot corresponds to a ground truth unit, each marker in the new plots (Figure 6) corresponds to a sorted unit.

Adding such metric comparisons to the website interface, addressing the above issue, and expanding the set of metrics, is a larger undertaking that we expect to start shortly, as we indicate in the future work section.

The Introduction is quite long and reads more like a review article. Please distil this section down to the key facts that a reader needs to know and move other material to Results section or Discussion section. The first figure, for example, would be better placed at the beginning of the Results section.

In response to this suggestion, we have improved the readability by breaking the introduction into subsections and shortening it by three paragraphs (one moved to Results section, one moved to Discussion section, and one removed entirely). The first subsection "Background" and comprises four paragraphs that the reader may skim over if they are already familiar with the field. The next subsection "Prior work" containing two paragraphs that we consider crucial for presenting this work within a larger context of validation efforts. The third subsection, "SpikeForest", gives a high-level overview of the approach presented in the paper. We moved two of the more technical paragraphs from this part of the writeup to Results section and Discussion section and removed one other paragraph entirely. Finally, we have subsection "Contribution" that has the paragraph describing the three main objectives of our work. All references to figures have been removed from the Introduction so that, as suggested by the reviewers, the first figure appears as part of the Results section.

All of the ground truth datasets are quite short (~10 min), but standard recordings from neuronal populations can span hours. It seems likely that the number of spikes from a given ground truth unit and the number of spikes from other units has an effect on sorting accuracy. Can this impact be quantified for each sorter? Please comment on whether the results will generalize to longer data sets.

A new 'future direction' item has been added to the Discussion section addressing our plans to add longer duration recordings with drifting spike waveforms.

We selected relatively short datasets partly because of compute time considerations. Multiple algorithms are run on every dataset and some of these sorters are very slow for longer recordings. We also believe that, when waveform drift is not a factor, around 10 minutes usually yields enough events per unit to satisfy typical clustering algorithms (even a 1 Hz average firing rate gives 600 events over 10 minutes).

To support this claim (for the reviewers but not in the paper), we simulated some additional recordings with varying durations (300, 600, 1200, 2400, and 4800 seconds). Technical details of how these were generated are provided below. Author response image 1 shows that for these simulations, accuracy remains steady with increasing duration across all of the sorters.

Author response image 1
Accuracy vs. duration for nine sorters applied to a study set of simulated recordings.

Another issue associated with longer recordings is waveform drift, which varies quite a bit depending on the probe type and experimental setup. To give an indication of how the different sorters perform in the presence of this type of drift over longer recordings, we regenerated the recordings used above, except adding simulated spike waveform drift. In this case (Author response image 2) we see that there is generally a falloff in accuracy with increasing duration, except for three sorters which appear to be more robust with respect to drift: IronClust, KiloSort, and KiloSort2.

Author response image 2
Same as Author response image 1, except that the simulations included synthetic drift.

Technical details on how these simulations were generated: We modified the Kilosort2 eMouse simulator developed by J. Colonell and M. Pachitariu. This simulator uses averaged unit waveforms from the recordings taken from the Kampff laboratory using a densely spaced electrode array (15 x 17 layout spanning 100 x 102 micrometers). In the drift simulation case, a linear probe motion was generated by uniformly translating a 64-channel probe (Neuropixels layout) by 20 micrometers over 80 min. To study the effect of time duration and channel count on the sorting accuracy, we extracted 16 channels from the original output (64 channels, 80 minutes) by taking contiguous neighboring channels at various time durations (5, 10, 20, 40, 80 min) starting at t=0. Ten recordings were sampled from each channel count and time duration by uniformly varying the channel offsets. The simulation inserted waveform templates at random channels and time points after multiplying them by a random scaling factor drawn from a Gamma distribution. The baseline noise was randomly generated to match the power spectrum observed from a Neuropixels recording, and a spatiotemporal smoothing was applied to induce a correlation between nearby channels and time samples.

Figure 2: Given that there is space for figures, it would be helpful to show not just the overall accuracy metric, but also the false positive and negatives for each sorter. Of course, this is available on the website, but readers may want this information right away. The caption should explain the meaning of empty cells in the table.

A new figure (Figure 5) has been added showing aggregated precision and recall for the main results. A new subsection summarizes and motivates this figure, at the end of the Results section. This led to the interesting observation that generally precision is lower than recall for paired recordings, but more often higher than recall for synthetic, showing that the simulations do not yet capture all of the statistics of real-world recordings.

We have also now added an explanation of empty cells to the captions of the two main result figures.

Please add some guidance for why one might choose one performance metric versus another. False positives, for example, can be particularly problematic when assessing correlations between units, while false negatives lead to underestimates of firing rates.

We have added a paragraph to the Results section, which explains when one might prefer to consider precision or recall over accuracy. The paragraph also refers to the new Figure 5 in the previous response.

The process whereby a user might take advantage of the containerized sorters could be explained more clearly. This is not the main thrust of the manuscript, but creation of these standard software environments greatly furthers the goal of reproducibility.

We have added a new subsection "Running spike sorting on a local machine" which explains this.

Results section, Discussion section and elsewhere: Which of the spike sorters have been optimized on these specific test data? This is mentioned in context of IronClust and SpyKING CIRCUS. Maybe spell out in Table 2 for each sorter how the parameters were chosen and whether they were adapted to the test sets.

Asterisks were added to four of the algorithms in the table, and the following was added to the caption:

"Algorithms with asterisks were updated and optimized using SpikeForest data. For the other algorithms, we used the default or recommended parameters."

https://doi.org/10.7554/eLife.55167.sa2

Article and author information

Author details

  1. Jeremy Magland

    Center for Computational Mathematics, Flatiron Institute, New York, United States
    Contribution
    Conceptualization, Data curation, Software, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing
    For correspondence
    jmagland@flatironinstitute.org
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-5286-4375
  2. James J Jun

    Center for Computational Mathematics, Flatiron Institute, New York, United States
    Contribution
    Conceptualization, Data curation, Software, Investigation, Methodology, Writing - original draft, Writing - review and editing
    Competing interests
    No competing interests declared
  3. Elizabeth Lovero

    Scientific Computing Core, Flatiron Institute, New York, United States
    Contribution
    Software, Investigation, Visualization, Methodology, Writing - original draft, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-2642-603X
  4. Alexander J Morley

    Medical Research Council Brain Network Dynamics Unit, University of Oxford, Oxford, United Kingdom
    Contribution
    Software, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-4997-4063
  5. Cole Lincoln Hurwitz

    Institute for Adaptive and Neural Computation Informatics, University of Edinburgh, Edinburgh, United Kingdom
    Contribution
    Software, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-2023-1653
  6. Alessio Paolo Buccino

    Centre for IntegrativeNeuroplasticity (CINPLA), University of Oslo, Oslo, Norway
    Contribution
    Software, Writing - review and editing
    Competing interests
    No competing interests declared
  7. Samuel Garcia

    Centre de Recherche en Neuroscience de Lyon, Université de Lyon, Lyon, France
    Contribution
    Software, Writing - review and editing
    Competing interests
    No competing interests declared
  8. Alex H Barnett

    Center for Computational Mathematics, Flatiron Institute, New York, United States
    Contribution
    Conceptualization, Supervision, Investigation, Methodology, Writing - original draft, Project administration, Writing - review and editing
    Competing interests
    No competing interests declared

Funding

No external funding was received for this work.

Acknowledgements

We have benefited from discussions with Leslie Greengard, Eftychios Pnevmatikakis, Andrea Giovannucci, Karel Svoboda, Loren Frank, and the developers of several of the sorting codes, and from the comments of the anonymous reviewers. The BioNet simulations were made possible by Catalin Mitelut and the Neuroscience Gateway Portal. We thank Dan English and colleagues for making their hybrid paired recordings available. We are grateful to the organizers of the Janelia spike sorting workshop, held in March 2018, which helped to inspire this project. The Flatiron Institute is a division of the Simons Foundation.

Senior Editor

  1. Ronald L Calabrese, Emory University, United States

Reviewing Editor

  1. Markus Meister, California Institute of Technology, United States

Reviewer

  1. Markus Meister, California Institute of Technology, United States

Publication history

  1. Received: January 14, 2020
  2. Accepted: May 5, 2020
  3. Version of Record published: May 19, 2020 (version 1)

Copyright

© 2020, Magland et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 531
    Page views
  • 78
    Downloads
  • 0
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)