1. Neuroscience
Download icon

SpikeInterface, a unified framework for spike sorting

  1. Alessio P Buccino  Is a corresponding author
  2. Cole L Hurwitz
  3. Samuel Garcia
  4. Jeremy Magland
  5. Joshua H Siegle
  6. Roger Hurwitz
  7. Matthias H Hennig
  1. Department of Biosystems Science and Engineering, ETH Zurich, Switzerland
  2. Centre for Integrative Neuroplasticity (CINPLA), University of Oslo, Norway
  3. School of Informatics, University of Edinburgh, United Kingdom
  4. Centre de Recherche en Neuroscience de Lyon, CNRS, France
  5. Flatiron Institute, United States
  6. Allen Institute for Brain Science, United States
  7. Independent Researcher, United States
Tools and Resources
  • Cited 9
  • Views 4,133
  • Annotations
Cite this article as: eLife 2020;9:e61834 doi: 10.7554/eLife.61834

Abstract

Much development has been directed toward improving the performance and automation of spike sorting. This continuous development, while essential, has contributed to an over-saturation of new, incompatible tools that hinders rigorous benchmarking and complicates reproducible analysis. To address these limitations, we developed SpikeInterface, a Python framework designed to unify preexisting spike sorting technologies into a single codebase and to facilitate straightforward comparison and adoption of different approaches. With a few lines of code, researchers can reproducibly run, compare, and benchmark most modern spike sorting algorithms; pre-process, post-process, and visualize extracellular datasets; validate, curate, and export sorting outputs; and more. In this paper, we provide an overview of SpikeInterface and, with applications to real and simulated datasets, demonstrate how it can be utilized to reduce the burden of manual curation and to more comprehensively benchmark automated spike sorters.

Introduction

Extracellular recording is an indispensable tool in neuroscience for probing how single neurons and populations of neurons encode and transmit information. When analyzing extracellular recordings, most researchers are interested in the spiking activity of individual neurons, which must be extracted from the raw voltage traces through a process called spike sorting. Many laboratories perform spike sorting using fully manual techniques (e.g. XClust [Mucha, 1995], SimpleClust [Voigts, 2012], Plexon Offline Sorter [Plexon, 2020]), but such approaches are nearly impossible to standardize due to inherent operator bias (Wood et al., 2004). To alleviate this issue, spike sorting has seen decades of algorithmic and software improvements to increase both the accuracy and automation of the process (Rey et al., 2015). This progress has accelerated in the past few years as high-density devices (Eversmann et al., 2003; Berdondini et al., 2005; Frey et al., 2010; Ballini et al., 2014; Müller et al., 2015; Yuan et al., 2016; Lopez et al., 2016; Jun et al., 2017a; Dimitriadis et al., 2018; Angotzi et al., 2019), capable of recording from hundreds to thousands of neurons simultaneously have made manual intervention impractical, increasing the demand for both accurate and scalable spike sorting algorithms (Rossant et al., 2016; Pachitariu et al., 2016; Lee et al., 2017; Chung et al., 2017; Yger et al., 2018; Hilgen et al., 2017; Jun et al., 2017b; Diggelmann et al., 2018).

Despite the development and widespread use of automatic spike sorters, there still exist no clear standards for how spike sorting should be performed or evaluated (Rey et al., 2015; Barnett et al., 2016; Carlson and Carin, 2019; Magland et al., 2020). Research labs that are beginning to experiment with high-density extracellular recordings have to choose from a multitude of spike sorters, data processing algorithms, file formats, and curation tools just to analyze their first recording. As trying out multiple spike sorting pipelines is time-consuming and technically challenging, many labs choose one and stick to it as their de facto solution (Magland et al., 2020). This has led to a fragmented software ecosystem which challenges reproducibility, benchmarking, and collaboration among different research labs.

Previous work to standardize the field has focused on developing open-source frameworks that make extracellular analysis and spike sorting more accessible (Egert et al., 2002; Bonomini et al., 2005; Hazan et al., 2006; Garcia and Fourcaud-Trocmé, 2009; Goldberg et al., 2009; Bokil et al., 2010; Xq et al., 2011; Bologna et al., 2010; Oostenveld et al., 2011; Kwon et al., 2012; Mahmud et al., 2012; Bongard et al., 2014; Regalia et al., 2016; Zhang et al., 2017; Nasiotis et al., 2019a). While useful tools in their own right, these frameworks only implement a limited suite of spike sorting technologies since their main focus is to provide entire extracellular analysis pipelines (spike trains, LFPs, EEG, and more). Moreover, these tools do little to improve the evaluation and comparison of spike sorting performance which is still a relatively unsolved problem in electrophysiology. An exception to this is SpikeForest (Magland et al., 2020), a recently developed open-source software suite that benchmarks 10 automated spike sorting algorithms against an extensive database of ground-truth recordings (SpikeForest makes use of SpikeInterface in many of its core capabilities [file IO, preprocessing, spike sorting]). Despite these developments, there exists a need for an up-to-date spike sorting framework that can standardize the usage and evaluation of modern algorithms.

In this paper, we introduce SpikeInterface, the first open-source, Python-based framework exclusively designed to encapsulate all steps in the spike sorting pipeline (we utilize Python as it is open-source, free, and increasingly popular in the neuroscience community; Muller et al., 2015; Gleeson et al., 2017). The goals of this software framework are five-fold.

  1. To increase the accessibility and standardization of modern spike sorting technologies by providing users with a simple application programming interface (API) and graphical user interface (GUI) that exist within a continuously integrated code-base.

  2. To make spike sorting pipelines fully reproducible by capturing the entire provenance of the data flow during run time.

  3. To make data access and analysis both memory and computation-efficient by utilizing memory-mapping, parallelization, and high-performance computing platforms.

  4. To encourage the sharing of datasets, results, and analysis pipelines by providing full compatibility with standardized file formats such as Neurodata Without Borders (NWB) (Teeters et al., 2015; Ruebel et al., 2019) and the Neuroscience Information Exchange (NIX) Format (NIX, 2015).

  5. To supply the most comprehensive suite of benchmarking capabilities available for spike sorting in order to guide future usage and development.

In the remainder of this article, we showcase the numerous capabilities of SpikeInterface by performing an in-depth meta-analysis of preexisting spike sorters. This analysis includes quantifying the agreement among six modern spike sorters for dense probe recordings, benchmarking each sorter on ground truth, and introducing a consensus-based technique to potentially improve performance and enable automated curation. Afterwards, we present an overview of the codebase and how its interconnected components can be utilized to build full spike sorting pipelines. Finally, we contrast SpikeInterface with preexisting analysis frameworks and outline future directions.

Results

In this section, we perform a meta-analysis of six modern spike sorters on real and simulated datasets. This meta-analysis includes quantifying agreement among the sorters, benchmarking each sorter on ground truth, and investigating whether it is possible to combine outputs from multiple spike sorters to improve overall performance and to reduce the burden of manual curation. All analyses are done with spikeinterface version 0.10.0 which is available on PyPI (https://pypi.org/project/spikeinterface/). The code to perform this analysis and produce all figures can be found at https://spikeinterface.github.io/ which also showcases other experiments performed using SpikeInterface. The datasets are publicly available in NWB format on the DANDI archive (https://gui.dandiarchive.org/#/dandiset/000034/draft).

Spike sorters show low agreement for the same high-density dataset

The dataset we use in this analysis is a Neuropixels recording from a head-fixed mouse acquired at the Allen Institute for Brain Science (Siegle et al., 2019a; Allen Institute for Brain Science, 2019 dataset ID: 766640955; probe ID: 77359232). The recording has 246 active recording channels (the remaining of the 384 Neuropixels channels were either not inserted in the brain tissue or had a firing rate below 0.1 Hz), and a sampling frequency of 30 kHz. The recording’s duration was trimmed to 15 min. The probe records from part of the cortex (V1), the hippocampus (CA1), the dentate gyrus, and the thalamus (LP). During the experiment, the mouse was presented with a variety of visual stimuli while freely running on a rotating disk (for more details see Siegle et al., 2019a). An activity map of the probe and a 1 s snippet of the traces on 10 channels are shown in Figure 1A. The notebook for reproducing the results for this section and the last section of the Results can be viewed at https://spikeinterface.github.io/blog/ensemble-sorting-of-a-neuropixels-recording.

Figure 1 with 4 supplements see all
Comparison of spike sorters on a real Neuropixels dataset.

(A) A visualization of the activity on the Neuropixels array (top, color indicates spike rate estimated on each channel evaluated with threshold detection) and of traces from the Neuropixels recording (below). (B) The number of detected units for each of the six spike sorters (HS = HerdingSpikes2, KS = Kilosort2, IC = IronClust, TDC = Tridesclous, SC = SpyKING Circus, HDS = HDSort). (C) The total number of units for which k sorters agree (unit agreement is defined as 50% spike match). (D) The number of units (per sorter) for which k sorters agree; most sorters find many units that other sorters do not.

For this analysis, we select six different spike sorters: HerdingSpikes2 (Hilgen et al., 2017), Kilosort2 (Pachitariu et al., 2018), IronClust (Jun et al., 2017b), SpyKING Circus (Yger et al., 2018), Tridesclous (Garcia and Pouzat, 2015), and HDSort (Diggelmann et al., 2018) (the versions for each spike sorter are as follows: SpyKING Circus==0.9.7, Tridesclous==1.6.0, HerdingSpikes2==0.3.7, IronClust==5.9.8, Kilosort2==GitHub commit 48bf2b81d8ad, HDSort==1.0.1). As most of these algorithms have been tuned rigorously on multiple ground-truth datasets (including the recent large-scale evaluation from Magland et al., 2020), we fix their parameters to default values to allow for straightforward comparison. We do not include Klusta (Rossant et al., 2016), WaveClus (Chaure et al., 2018), Kilosort (Pachitariu et al., 2016), or MountainSort4 (Chung et al., 2017) in this analysis as Klusta can only handle up to 64 channels, WaveClus is designed for low channel count probes, Kilosort is superseded by Kilosort2, and MountainSort4’s latest verion is currently not optimized for high channel counts, scaling quadratically with the number of channels.

In Figure 1B, we show the number of units that each of the six sorters output. Immediately, we observe large variability among the sorters, with Tridesclous (TDC) finding the least units (187) and SpyKING Circus (SC) finding the most units (628). HerdingSpikes2 finds 210 units; Kilosort2 finds 446 units; IronClust finds 233 units; and HDSort finds 317 units. From this result, we can see that there is no clear consensus among the sorters on the number of neurons in the recording (without performing extensive manual curation).

Next, we compare the unit spike trains found by each sorter to determine the level of agreement among the different algorithms (see the SpikeComparison Section of the Methods for how this is done). In Figure 1C, we visualize the total number of units for which k sorters agree (unit agreement is defined as a 50% spike train match; the time window to consider spikes as matching is 0.4 ms). Figure 1—figure supplement 1 shows spike trains and templates for two sample matched units (one with a higher - 0.97 - and one with a lower agreement - 0.69). Of the 2031 total detected units, all six sorters agree on just 33 of the units. This is surprisingly low given the relatively undemanding criteria of a 50% spike train match. We also find that two or more sorters agree on just 263 of the total units. To further break down the disagreement between spike sorters, Figure 1D shows the number of units per sorter for which k other sorters agree. For most sorters, over 50% of the units that they find do not match with any other sorter (with the exceptions of Ironclust and Tridesclous). For agreed-upon units, around 80% of the agreement scores are 0.8 or higher, indicating that matched units typically have high spike train agreement (Figure 1—figure supplement 2).

The analysis performed on this dataset suggests that agreement among spike sorters is startlingly low. To corroborate this finding, we repeat the same analysis using different datasets including a Neuropixels recordings from another lab and an in vitro retinal recording from a planar, high-density array. In both cases, we find similar disagreement among the sorters (Figure 1—figure supplements 3 and 4). The notebooks for these analyses can be viewed at https://spikeinterface.github.io/blog/ensemble-sorting-of-a-neuropixels-recording-2/ and https://spikeinterface.github.io/blog/ensemble-sorting-of-a-3brain-biocam-recording-from-a-retina/.

This low agreement raises the following question: how many of the total outputted units actually correspond to real neurons? To explore this question, we turn to simulation where the ground-truth spiking activity is known a priori.

Evaluating spike sorters on a simulated dataset

In this analysis, we simulate a 10 min Neuropixels recording using the MEArec Python package (Buccino and Einevoll, 2020). The recording contains the spiking activity of 250 biophysically detailed neurons (200 excitatory and 50 inhibitory cells from the Neocortical Micro Circuit Portal; Ramaswamy et al., 2015; Markram et al., 2015) that exhibit independent Poisson firing patterns. The recording also has an additive Gaussian noise with 10 μV standard deviation. A visualization of the simulated activity map and extracellular traces from the Neuropixels probe is shown in Figure 2A. A histogram of the signal-to-noise ratios (SNR) for the ground-truth units is shown in Figure 2B. The notebook for reproducing the results for this and the next section can be viewed at https://spikeinterface.github.io/blog/ground-truth-comparison-and-ensemble-sorting-of-a-synthetic-neuropixels-recording/.

Figure 2 with 1 supplement see all
Evaluation of spike sorters on a simulated Neuropixels dataset.

(A) A visualization of the activity on and traces from the simulated Neuropixels recording. (B) The signal-to-noise ratios (SNR) for the ground-truth units. (C) The number of detected units for each of the six spike sorters (HS = HerdingSpikes2, KS = Kilosort2, IC = IronClust, TDC = Tridesclous, SC = SpyKING Circus, HDS = HDSort). (D) The accuracy, precision, and recall of each sorter on the ground-truth units. (E) A breakdown of the detected units for each sorter (precise definitions of each unit type can be found in the SpikeComparison Section of the Methods). The horizontal dashed line indicates the number of ground-truth units (250).

We run the same six spike sorters on the simulated dataset, keeping the parameters the same as those used on the real Neuropixels dataset. We then utilize SpikeInterface to evaluate each spike sorter on the ground-truth dataset. Afterwards, we repeat the agreement analysis from the previous section to diagnose the low agreement among sorters.

The main result of the ground-truth evaluation is summarized in Figure 2. As can be seen in Figure 2C, the sorters, again, have a large discrepancy in the number of detected units. The number of detected units range from the 189 units found by Tridesclous to the 458 units found by HDSort. HerdingSpikes2 finds 233 units; Kilosort2 finds 415 units; IronClust finds 283 units; and SpyKING Circus finds 343 units. We again see that there is no clear consensus among the sorters on the number of neurons in the simulated recording.

In Figure 2D, the accuracy, precision, and recall of all the ground-truth units are plotted for each spike sorter. Some sorters tend to favor precision over recall while others do the opposite (Figure 2—figure supplement 1A). Moreover, the accuracy is modulated by the SNR of the ground-truth units for all spike sorters except Kilosort2 which achieves an almost perfect performance on the low-SNR units (Figure 2—figure supplement 1B). While most spike sorters have a wide range of scores for each metric, Kilosort2 attains significantly higher scores than the rest of the spike sorters for most ground-truth units.

Figure 2E shows the breakdown of detected units for each spike sorter. Each unit is classified as well-detected, false positive, redundant, and/or overmerged by SpikeInterface (the definitions of each unit type can be found in the SpikeComparison Section of the Materials and methods). This plot, interestingly, may shed some light on the remarkable accuracy of Kilosort2. While Kilosort2 has the most well-detected units (245), this comes at the cost of a high percentage of false positive (147) and redundant (21) units (The high-rate of false positive/redundant units persists, but is alleviated, even when using Kilosort2’s automated curation step which removes units that have >20% estimated contamination rate [computed from the refractory period violations]. In that case the number of well-detected units is 241, false positives are 93, and redundant units are 18. In both cases two overmerged units are found). Notably, Tridesclous detects very few false positive/redundant units while still finding many well-detected units. HDSort, on the flip side, finds many more false positive units than any other spike sorter. For a comprehensive comparison of spike sorter performance on both real and simulated datasets, we refer the reader to the related SpikeForest project (https://spikeforest.flatironinstitute.org/) (Magland et al., 2020).

Low-agreement units are mainly false positives

Similarly to the real Neuropixels dataset, we compare the agreement among the different spike sorters on the simulated dataset. Again, we observe a large disagreement among the spike sorting outputs with only 139 units of the 1921 total units (7.24%) being in agreement among all sorters (Figure 3A). We can break down the overall agreement by sorter (Figure 3B), highlighting that some sorters are more prone to finding low agreement units (HDSort, SpyKING Circus, Kilosort2) than other sorters (HerdingSpikes2, Ironclust, Tridesclous).

Figure 3 with 2 supplements see all
Comparison of spike sorters on a simulated Neuropixels dataset.

(A) The total number of units for which k sorters agree (unit agreement is defined as 50% spike match). (B) The number of units (per sorter) for which k sorters agree; Most sorters find many units that other sorters do not. (HS = HerdingSpikes2, KS = Kilosort2, IC = IronClust, TDC = Tridesclous, SC = SpyKING Circus, HDS = HDSort) (C) Number of matched ground-truth units (blue) and false positive units (red) found by each sorter on which k sorters agree upon. Most of the false positive units are only found by a single sorter. Number of false positive units found by k2 sorters: HS = 4, KS = 4, IC = 4, SC = 2, TDC = 1, HDS = 2. (D) Signal-to-noise ratio (SNR) of ground-truth unit with respect to the number of k sorters agreement. Results are split by sorter.

Given that we know the ground-truth spiking activity of the simulated recording, we can now investigate whether low-agreement units actually correspond to ground-truth units or if they are falsely detected (false positive) units. In Figure 3C, bar plots for each sorter show the number of matched ground-truth units (blue) and false positive units (red) in relation to the ensemble agreement (1 - no agreement, 6 - full agreement). The plots show that (almost) all false positive units are ones that are found by only a single sorter (not matched with any other sorters), while most real units are matched by more than one sorter. We also assessed how well false positive units can be identified using fewer sorters (Figure 3—figure supplement 1). This analysis showed that using a pair of sorters is sufficient to isolate almost all false positive units in each sorter, yet when fewer than four sorter outputs are compared, a significant fraction of true positive units found by only one sorter can be wrongly classified as false positives with this approach. For two sorters, the most reliable identification of true positives for this dataset was achieved by combining Kilosort2 and Ironclust (96% and 95% false positive and true positive detection rate, respectively). In Figure 3D, we display the signal-to-noise ratio (SNR) as a function of the ensemble agreement. This shows, as expected, that higher SNR units have higher agreement among sorters. In other words, units with a large amplitude (high SNR) are easier to detect and more consistently found by many sorters. Additionally, we tested if SNR can be used to distinguish between false and true positive units, as noise may be wrongly detected as events with low SNR. We found that for Kilosort2’s output, which is best matched with ground-truth spike trains, SNR is not a good predictor of false positives (Figure 3—figure supplement 2) - many false positives had a high estimated SNR. Taken together, these results suggest that the ensemble agreement among multiple sorters can be used to remove false positive units from each of the sorter outputs or to inform their subsequent manual curation.

Consensus units highly overlap with manually curated ones

We next investigate the ensemble agreement among the sorters on the real Neuropixels recording presented in Figure 1. As there is no ground-truth information in this setting to identify false positives, we turn to manually curated sorting outputs. Two experts (which we will refer to as C1 and C2) manually curate the spike sorting output of Kilosort2 using the Phy software. During this curation step, the two experts label the sorted units as false positives or real units by rejecting, splitting, merging, or accepting units according to spike features (Rossant and Harris, 2013).

Figure 4A shows the agreement between expert 1 (C1) and expert 2 (C2). While there are some discrepancies (as expected when manually curating spike sorting results; Wood et al., 2004), most of the curated units (226 out of 351–64.2%) are agreed upon by both experts. Notably, 174 units found by Kilosort2 are discarded by both experts, indicating a large number of false positive units.

Comparison between consensus and manually curated outputs.

(A) Venn diagram showing the agreement between Curator 1 and 2. 174 units are discarded by both curators from the Kilsort2 output. (B) Percent of matched units between the output of each sorter and C1 (red) and C2 (blue). Ironclust has the highest match with both curated datasets. (C) Similar to C, but using the consensus units (units agreed upon by at least two sorters - k2). The percent of matching with curated datasets is now above 70% for all sorters, with Kilosort2 having the highest match (KSc ∩C1 = 84.55%, KSc∩C2 = 89.55%), slightly higher than Ironclust (ICc ∩ C1 = 82.63%, ICc ∩ C2 = 83.83%). (D) Percent of non-consensus units (k=1) matched to curated datasets. The only significant overlap is between Curator one and Kilosort2, with a percent around 18% (KSnc ∩ C1 = 18.58%, KSnc ∩ C2 = 24.34%).

We then compare the output of each of the spike sorters to C1 and C2 and find that, in general, only a small percentage of units outputted by any single sorter is matched to the curated results (Figure 4). The highest percentage match is actually IronClust which is surprising given that the initial sorting output was curated from Kilosort2’s output (IC ∩ C1 = 59.83%, IC ∩ C2 = 61.1%, KS ∩ C1 = 50.67%, KS ∩ C2 = 56.25%).

Next, for each sorter, we take all the units that are matched by at least one other sorter (consensus units, k2) and all units that are found by only that sorter (non-consensus units, k=1). We refer to the consensus units of a sorter as Sorterc and the non-consensus units of a sorter as Sorternc. In Figure 4C, we show the match percentage between consensus units and curated units. The average match percentage is above 70% for all sorters showing that there is a large agreement between the manually curated outputs and the consensus-based output. Kilosort2 has the highest match (KSc ∩ C1 = 84.55%, KSc ∩ C2 = 89.55%), slightly higher than Ironclust (ICc ∩ C1 = 82.63%, ICc ∩ C2 = 83.83%). Conversely, the percentage of non-consensus units matched to curated units is very small (Figure 4D) for all sorters.

Overall, this analysis suggests that a consensus-based approach to curation could allow for identification of real neurons from spike sorted data. Despite differences among the sorters with respect to the number of detected neurons and the quality of their isolation (as demonstrated by the ground-truth analysis), the consensus-based approach has good agreement with hand-curated data and appears to be less variable as illustrated by the small but significant disagreement between the two curators.

Materials and methods

Overview of SpikeInterface

SpikeInterface consists of five main Python packages designed to handle different steps in the spike sorting pipeline: (i) spikeextractors, for extracellular recording, sorting output, and probe file I/O; (ii) spiketoolkit for low level processing such as pre-processing, post-processing, validation, curation; (iii) spiketoolkit for spike sorting algorithms and job launching functionality; (iv) spikecomparison for sorter comparison, ground-truth comparison, and ground-truth studies; and (v) spikewidgets, for data visualization.

These five packages can be installed and used through the spikeinterface metapackage, which contains stable versions of all five packages as internal modules (see Figure 5). With these five packages (or our meta-package), users can build, run, and evaluate full spike sorting pipelines in a reproducible and standardized way. In the following subsections, we present an overview of, and a code snippet for, each package.

Overview of SpikeInterface’s Python packages, their different functionalities, and how they can be accessed by our meta-package, spikeinterface.

SpikeExtractors

Request a detailed protocol

The spikeextractors package (https://github.com/SpikeInterface/spikeextractors; Buccino et al., 2020a) is designed to alleviate issues of any file format incompatibility within spike sorting without creating additional file formats. To this end, spikeextractors contains two core Python objects that can directly and uniformly access all spike sorting related files: the RecordingExtractor and the SortingExtractor.

The RecordingExtractor directly interfaces with an extracellular recording and can query it for four primary pieces of information: (i) the extracellular recorded traces; (ii) the sampling frequency; (iii) the number of samples, or frames, in the recording; and (iv) the channel indices of the recording electrodes. These data are shared across all extracellular recordings allowing for standardized retrieval functions. In addition, a RecordingExtractor may store extra information about the recording device as 'channel properties’ which are key–value pairs. This includes properties such as 'location’, 'group’, and 'gain’ which are either provided by certain extracellular file formats, loaded manually by the user, or loaded automatically with our built-in probe file (.prb or .csv) reader. Taken together, the RecordingExtractor is an object representation of an extracellular recording and the associated probe configuration.

The SortingExtractor directly interfaces with a sorting output and can query it for two primary pieces of information: (i) the unit indices and (ii) the spike train of each unit. Again, these data are shared across all sorting outputs. A SortingExtractor may also store extra information about the sorting output as either 'unit properties' or 'unit spike features', key–value pairs which store information about the individual units or the individual spikes of each unit, respectively. This extra information is either loaded from the sorting output, loaded manually by the user, or loaded automatically with built-in post-processing tools (discussed in the SpikeToolkit Section). Taken together, the SortingExtractor is an object representation of a sorting output along with any associated post-processing.

Critically, both Extractor types can lazily query the underlying datasets for information as it is required, reducing their memory footprint and allowing their use for long, large-scale recordings. While this is the default operation mode, Extractors can also cache parts of the dataset in temporary binary files to enable faster downstream computations at the cost of higher memory usage. All extracted data is converted into either native Python data structures or into numpy arrays for immediate use in Python. Additionally, each Extractor can be dumped to and loaded from a json file, a pickle file, or a dictionary, ensuring full provenance and allowing for parallel processing.

The following code snippet illustrates how Extractors can be used to retrieve raw traces from an extracellular recording and spike trains from a sorting output:

import spikeinterface.extractors as se
recording = se.MyFormatRecordingExtractor(file_path='myrecording')
sorting = se.MyFormatSortingExtractor(file_path=’mysorting’)
traces = recording.get_traces() # 2D numpy array (channels x time)
spike_train = sorting.get_unit_spike_train(unit_id=1) # 1D numpy array

Along with using Extractors for single files, it is possible to access data from multiple files or portions of files with the MultiExtractors and SubExtractors, respectively. Both have identical functionality to normal Extractors and can be used and treated in the same ways, simplifying, for instance, the combined analysis of a recording split into multiple files.

As of this moment, SpikeInterface supports 19 extracellular recording formats and 18 sorting output formats. The available file formats can be found in Table 1. Although this covers many popular formats in extracellular analysis (including Neurodata Without Borders, Teeters et al., 2015, and NIX, 2015), we expect the number of formats to grow with future versions as adding a new format is as simple as making a new Extractor subclass for it. We also have started to integrate NEO’s (Garcia et al., 2014) I/O system into spikeextractors which allow SpikeInterface to support many more open-source and proprietary file formats without changing any functionality. Already, two recording formats have been added through our NEO integration (Neuralynx, 2020 and Plexon, 2020).

Table 1
Currently available file formats in SpikeInterface and if they are writable.

*The Phy writing method is implemented in spiketoolkit as the export_to_phy function (all other writing methods are implemented in spikeextractors).

Raw formatsWritableReferenceSorted formatsWritableReference
KlustaYesRossant et al., 2016KlustaYesRossant et al., 2016
MountainsortYesJun et al., 2017aMountainsortYesJun et al., 2017a
Phy*YesRossant and Harris, 2013Phy*YesRossant and Harris, 2013
Kilosort/Kilosort2NoPachitariu et al., 2016; Rossant et al., 2014Kilosort/Kilosort2NoPachitariu et al., 2016; Rossant et al., 2014
SpyKING CircusNoYger et al., 2018SpyKING CircusYesYger et al., 2018
ExdirYesDragly et al., 2018ExdirYesDragly et al., 2018
MEArecYesBuccino and Einevoll, 2020MEArecYesBuccino and Einevoll, 2020
Open EphysNoSiegle et al., 2017Open EphysNoSiegle et al., 2017
Neurodata Without BordersYesTeeters et al., 2015Neurodata Without BordersYesTeeters et al., 2015
NIXYesNIX, 2015NIXYesNIX, 2015
PlexonNoPlexon, 2020PlexonNoPlexon, 2020
NeuralynxNoNeuralynx, 2020NeuralynxNoNeuralynx, 2020
SHYBRIDYesWouters et al., 2020SHYBRIDYesWouters et al., 2020
NeuroscopeYesHazan et al., 2006NeuroscopeYesHazan et al., 2006
SpikeGLXNoKarsh, 2016HerdingSpikes2YesHilgen et al., 2017
IntanNoIntan, 2010JRCLUSTNoJun et al., 2017b
MCS H5NoMCS, 2020Wave clusNoChaure et al., 2018
Biocam HDF5YesBiocam, 2018TridesclousNoGarcia and Pouzat, 2015
MEA1kYesMEA1k, 2020NPZ (numpy zip)YesN/A
MaxOneNoMaxWell, 2020
BinaryYesN/A

SpikeToolkit

Request a detailed protocol

The spiketoolkit package (https://github.com/SpikeInterface/spiketoolkit; Buccino et al., 2020b) is designed for efficient pre-processing, post-processing, validation, and curation of extracellular datasets and sorting outputs. It contains four modules that encapsulate each of these functionalities: preprocessing, postprocessing, validation, and curation.

Pre-processing
Request a detailed protocol

The preprocessing module provides functions to process raw extracellular recordings before spike sorting. To pre-process an extracellular recording, the user passes a RecordingExtractor to a pre-processing function which returns a new 'preprocessed' RecordingExtractor. This new RecordingExtractor, which can be used in exactly the same way as the original extractor, implements the preprocessing in a lazy fashion so that the actual computation is performed only when data is requested. As all pre-processing functions take in and return a RecordingExtractor, they can be naturally chained together to perform multiple pre-processing steps on the same recording.

Pre-processing functions range from commonly used operations, such as bandpass filtering, notch filtering, re-referencing signals, and removing channels, to more advanced procedures such as clipping traces depending on the amplitude, or removing artifacts arising, for example, from electrical stimulation. The following code snippet illustrates how to chain together a few common pre-processing functions to process a raw extracellular recording:

import spikeinterface.spiketoolkit as st
recording = st.preprocessing.bandpass_filter(recording, freq_min=300, freq_max=6000)
recording_1 = st.preprocessing.remove_bad_channels(recording, bad_channels=[5])
recording_2 = st.preprocessing.common_reference(recording_1, reference=’median’)
Post-processing
Request a detailed protocol

The postprocessing module provides functions to compute and store information about an extracellular recording given an associated sorting output. As such, post-processing functions are designed to take in both a RecordingExtractor and a SortingExtractor, using them in conjunction to compute the desired information. These functions include, but are not limited to: extracting unit waveforms and templates, computing principle component analysis projections, as well as calculating features from templates (e.g. peak to valley duration, full-width half maximum).

One essential feature of the postprocessing module is that it provides the functionality to export a RecordingExtractor/SortingExtractor pair into the Phy format for manual curation later. Phy (Rossant and Harris, 2013; Rossant et al., 2016) is a popular manual curation GUI that allows users to visualize a sorting output with several views and to curate the results by manually merging or splitting clusters. Phy is already supported by several spike sorters (including klusta, Kilosort, Kilosort2, and SpyKING Circus) so our exporter function extends Phy’s functionality to all SpikeInterface-supported spike sorters. After manual curation is performed in Phy, the curated data can be re-imported into SpikeInterface using the PhySortingExtractor for further analysis. The following code snippet illustrates how to retrieve waveforms for each sorted unit, compute principal component analysis (PCA) features for each spike, and export to Phy using SpikeInterface:

import spikeinterface.toolkit as st
waveforms = st.postprocessing.get_unit_waveforms(recording, sorting)
pca_scores = st.postprocessing.compute_unit_pca_scores(recording, sorting, n_comp=3)
st.postprocessing.export_to_phy(recording, sorting, output_folder=’phy_folder’)
Validation
Request a detailed protocol

The validation module allows users to automatically evaluate spike sorting results in the absence of ground truth with a variety of quality metrics. The quality metrics currently available are a compilation of historical and modern approaches that were re-implemented by researchers at Allen Institute for Brain Science (https://github.com/AllenInstitute/ecephys_spike_sorting; Siegle et al., 2019b) and by the SpikeInterface team (see Table 2).

Table 2
Currently available quality metrics in Spikeinterface.

Re-implemented by researchers at Allen Institute for Brain and by the SpikeInterface team.

MetricDescriptionReference
Signal-to-noise ratioThe signal-to-noise ratio computed on unit templates.N/A
Firing rateThe average firing rate over a time period.N/A
Presence ratioThe fraction of a time period in which spikes are present.N/A
Amplitude CutoffAn estimate of the miss rate based on an amplitude histogram.N/A
Maximum driftThe maximum change in spike position (computed as the center of mass of the energy of the first principal component score) throughout a recording.N/A
Cumulative driftThe cumulative change in spike position throughout a recording.N/A
ISI violationsThe rate of inter-spike-interval (ISI) refractory period violations.Hill et al., 2011
Isolation DistanceRadius of the smallest ellipsoid that contains all the spikes from a cluster and an equal number of spikes from other clusters (centered on the specified cluster).Harris et al., 2001
L-ratioAssuming that the distribution of spike distances from a cluster center is multivariate normal, L-ratio is the average value of the tail distribution for non-member spikes of that cluster.Schmitzer-Torbert and Redish, 2004
D-PrimeThe classification accuracy between two units based on linear discriminant analysis (LDA)Hill et al., 2011
Nearest-neighborsA non-parametric estimate of unit contamination using nearest-neighbor classification.Chung et al., 2017
Silhouette scoreThe ratio between cohesiveness of a cluster (distance between member spikes) and its separation from other clusters (distance to non-member spikes).Rousseeuw, 1987

Each of SpikeInterface’s quality metric functions internally utilize the postprocessing module to generate all data needed to compute the specified metric (amplitudes, principal components, etc.). The following code snippet demonstrates how to compute both a single quality metric (isolation distance) and also all the quality metrics with just two function calls:

import spikeinterface.toolkit as st
iso_metric = st.validation.compute_isolation_distances(sorting, recording)
all_metrics = st.validation.compute_quality_metrics(sorting, recording)
Curation
Request a detailed protocol

The curation module allows users to quickly remove units from a SortingExtractor based on computed quality metrics. To curate a sorted dataset, the user passes a SortingExtractor to a curation function which returns a new 'curated’ SortingExtractor (similar to how pre-processing works). This new SortingExtractor can be used in exactly the same way as the original extractor. As all curation functions take in and return a SortingExtractor, they can be naturally chained together to perform multiple curation steps on the same sorting output.

Currently, all implemented curation functions are based on excluding units with respect to a user-defined threshold on a specified quality metric. These curation functions will compute the associated quality metric and then threshold the dataset accordingly. The following code snippet demonstrates how to chain together two curation functions that are based on different quality metrics and apply a 'less’ threshold to the underlying units (exclude all units below the given threshold):

import spikeinterface.toolkit as st
sorting_1 = st.curation.threshold_firing_rates(sorting, threshold=2.3, threshold_sign=’less’)
sorting_2 = st.curation.threshold_snrs(sorting_1, recording, threshold=10, threshold_sign=’less’)

SpikeSorters

Request a detailed protocol

The spikesorters (https://github.com/SpikeInterface/spikesorters; Buccino et al., 2020c) package provides a straightforward interface for running spike sorting algorithms supported by SpikeInterface. Modern spike sorting algorithms are built and deployed in a variety of programming languages including C, C++, MATLAB, and Python. Along with variability in the underlying program languages, each sorting algorithm may depend on external technologies like CUDA or command line interfaces (CLIs), complicating standardization. To unify these disparate algorithms into a single codebase, spikesorters provides Python-wrappers for each supported spike sorting algorithm. These spike sorting wrappers use a standard API for running the corresponding algorithms, internally handling intrinsic complexities such as automatic code generation for MATLAB- and CLI-based algorithms. Each spike sorting wrapper is implemented as a subclass of a BaseSorter class that contains all shared code for running the spike sorters.

To run a specific spike sorting algorithm, users can pass a RecordingExtractor object to the associated function in spikesorters and overwrite any default parameters with new values (only essential parameters are exposed to the user for modification). Internally, each function initializes a spike sorting wrapper with the user-defined parameters. This wrapper then creates and modifies a new spike sorter configuration and runs the sorter on the dataset encapsulated by the RecordingExtractor. Once the spike sorting algorithm is finished, the sorting output is saved and a corresponding SortingExtractor is returned to the user. For each sorter, all available parameters and their descriptions can be retrieved using the get_default_params() and get_params_description() functions, respectively.

In the following code snippet, Mountainsort4 and Kilosort2 are used to sort an extracellular recording. Running each algorithm (and changing the default parameters) can be done as follows:

import spikeinterface.sorters as ss
sorting_MS4 = ss.run_mountainsort4(recording, adjacency_radius=50)
sorting_KS2 = ss.run_kilosort2(recording, detect_threshold=5)

Our spike sorting functions also allow for users to sort specific 'groups’ of channels in the recording separately (and in parallel, if specified). This can be very useful for multiple tetrode recordings where the data are all stored in one file, but the user wants to sort each tetrode separately. For large-scale analyses where the user wants to run many different spike sorters on many different datasets, spikesorters provides a launcher function which handles any internal complications associated with running multiple sorters and returns a nested dictionary of SortingExtractor objects corresponding to each sorting output. The launcher can be deployed on HPC platforms through the multiprocessing or dask engine (Dask, 2016). Finally, and importantly, when running a spike sorting job the recording information and all the spike sorting parameters are saved in a log file, including the console output of the spike sorting run (which can be used to inspect errors). This provenance mechanism ensures full reproducibility of the spike sorting pipeline.

Currently, SpikeInterface supports 10 semi-automated spike sorters which are listed in Table 3. We encourage developers to contribute to this expanding list in future versions and we provide comprehensive documentation on how to do so (https://spikeinterface.readthedocs.io/en/latest/contribute.html).

Table 3
Currently available spike sorters in Spikeinterface.

TM = Template Matching; SL = Spike Localization; DB = Density-based clustering.

NameMethodNotesReference
 KlustaDBPython-based, semi-automatic, designed for low channel count, dense probes.Rossant et al., 2016
Mountainsort4DBPython-based, fully automatic, unique clustering method (isosplit), designed for low channel count, dense probes and tetrodes.Chung et al., 2017
KilosortTMMATLAB-based, GPU support, semi-automated final curation.Pachitariu et al., 2016
Kilosort2TMMATLAB-based, GPU support, semi-automated final curation, designed to correct for drift.Pachitariu et al., 2018
SpyKING CircusTMPython-based, fast and scalable with CPUs, designed to correct for drift.Yger et al., 2018
HerdingSpikes2DB + SLPython-based, fast and scalable with CPUs, scales up to thousands of channels.Hilgen et al., 2017
TridesclousTMPython-based, graphical user interface, GPU support, multi-platformGarcia and Pouzat, 2015
IronClustDB + SLMATLAB-based, GPU support, designed to correct for drift.Jun et al., 2020
Wave clusTMMatlab-based, fully automatic, designed for single electrodes and tetrodes, multi-platform.Chaure et al., 2018
 HDsortTMMatlab-based, fast and scalable, designed for large-scale, dense arrays.Diggelmann et al., 2018

SpikeComparison

Request a detailed protocol

The spikecomparison package (https://github.com/SpikeInterface/spikecomparison; Buccino et al., 2020d) provides a variety of tools that allow users to compare and benchmark sorting outputs. Along with these comparison tools, spikecomparison also provides the functionality to run systematic performance comparisons of multiple spike sorters on multiple ground-truth recordings.

Within spikecomparison, there exist three core comparison functions:

  1. compare_two_sorters - Compares two spike sorting outputs.

  2. compare_multiple_sorters - Compares multiple spike sorting outputs.

  3. compare_sorter_with_ground_truth - Compares a spike sorting output to ground truth.

Each of these comparison functions takes in multiple SortingExtractor objects and uses them to compute agreement scores among the underlying spike trains. The agreement score between two spike trains is defined as:

(1) score=#nmatches#n1+#n2#nmatches

where #nmatches is the number of 'matched' spikes between the two spike trains and #n1 and #n2 are the number of spikes in the first and second spike train, respectively. Two spikes from two different spike trains are 'matched' when they occur within a certain time window of each other (this window length can be adjusted by the user and is 0.4 ms by default).

When comparing two sorting outputs (compare_two_sorters), a linear assignment based on the Hungarian method (Kuhn, 1955) is used. With this assignment method, each unit from the first sorting output can be matched to at most one other unit in the second sorting output. The final result of this comparison is then the list of matching units (given by the Hungarian method) and the agreement scores of the spike trains.

The multi-sorting comparison function (compare_multiple_sorters) can be used to compute the agreement among the units of many sorting outputs at once. Internally, pair-wise sorter comparisons are run for all of the sorting output pairs. A graph is then built with the sorted units as nodes and the agreement scores among the sorted units as edges. With this graph implementation, it is straightforward to query for units that are in agreement among multiple sorters. For example, if three sorting outputs are being compared, any units that are in agreement among all three sorters will be part of a subgraph with large weights.

For a ground-truth comparison (compare_sorter_with_ground_truth), either the Hungarian or the best-match method can be used. With the Hungarian method, each tested unit from the sorting output is matched to at most a single ground-truth unit. With the best-match method, a tested unit from the sorting output can be matched to multiple ground-truth units (above an adjustable agreement threshold) allowing for more in-depth characterizations of sorting failures. Note that in the SpikeForest benchmarking software suite (Magland et al., 2020), the best-match strategy is used.

Additionally, when comparing a sorting output to a ground-truth sorted result, each spike can be optionally labeled as:

  • True positive (tp): Found both in the ground-truth spike train and tested spike train.

  • False negative (fn): Found in the ground-truth spike train, but not in the tested spike train.

  • False positive (fp): Found in the tested spike train, but not in the ground-truth spike train.

Using these labels, the following performance measures can be computed:

  • Accuracy: #tp(#tp+#fn+#fp)

  • Recall: #tp(#tp+#fn)

  • Precision: #tp(#tp+#fp)

  • Miss rate: #fn(#tp+#fn)

  • False discovery rate: #fp(#tp+#fp)

While previous metrics give a measure of individual spike train quality, we also propose metrics at a unit population level. Based on the matching results and the scores, the units of the sorting output are classified as well-detected, false positive, redundant, and overmerged. Well-detected units are matched units with an agreement score above 0.8. False positive units are unmatched units or units which are matched with an agreement score below 0.2. Redundant units have agreement scores above 0.2 with only one ground-truth unit, but are not the best matched tested units (redundant units can either be oversplit or duplicate units). Overmerged units have an agreement score above 0.2 with two or more ground-truth units. All these agreement score thresholds are adjustable by the user. We highlight to the reader that the unit classification proposed here is currently only based on agreement score (i.e. accuracy). More sophisticated classification rules could involve a combination of accuracy, precision, and recall values, which can be easily computed for each unit with the spikecomparison module.

The following code snippet shows how to perform all three types of spike sorter comparisons:

import spikeinterface.comparison as sc
comp_type_1 = sc.compare_two_sorters(sorting1, sorting2)
comp_type_2 = sc.compare_multiple_sorters([sorting1, sorting2, sorting3])
comp_type_3 = sc.compare_sorter_with_ground_truth(gt_sorting, tested_sorting)

Along with the three comparison functions, spikecomparison also includes a GroundTruthStudy class that allows for the systematic comparison of multiple spike sorters on multiple ground-truth datasets. With this class, users can set up a study folder (in which the recordings to be tested are saved), run several spike sorters and store their results in a compact way, perform systematic ground-truth comparisons, and aggregate the results in pandas dataframes (McKinney, 2010).

SpikeWidgets

Request a detailed protocol

The spikewidgets package (https://github.com/SpikeInterface/spikewidgets; Buccino et al., 2020e) implements a variety of widgets that allow for efficient visualization of different elements in a spike sorting pipeline.

There exist four categories of widgets in spikewidgets. The first category utilizes a RecordingExtractor for its visualization. This category includes widgets for visualizing time series data, electrode geometries, signal spectra, and spectrograms. The second category utilizes a SortingExtractor for its visualization. These widgets include displays for raster plots, auto-correlograms, cross-correlograms, and inter-spike-interval distributions. The third category utilizes both a RecordingExtractor and a SortingExtractor for its visualization. These widgets include visualizations of unit waveforms, amplitude distributions for each unit, amplitudes of each unit over time, and PCA features. The fourth category utlizes comparison objects from the spikecomparison package for its visualization. These widgets allow the user to visualize confusion matrices, agreement scores, spike sorting performance metrics (e.g. accuracy, precision, recall) with respect to a unit property (e.g. SNR), and the agreement between multiple sorting algorithms on the same dataset.

The following code snippet demonstrates how SpikeInterface can be used to visualize ten seconds of both the extracellular traces and the corresponding raster plot:

import spikeinterface.widgets as sw
sw.plot_timeseries(recording, channel_ids=[0,1,2,3], trange=[0,10])
sw.plot_rasters(sorting, unit_ids=[0,1,3], trange=[0,10]).

Building a spike sorting pipeline

So far, we have given an overview of each of the main packages in isolation. In this section, we illustrate how these packages can be combined, using both the Python API and the Spikely GUI, to build a robust spike sorting pipeline. The spike sorting pipeline that we construct using SpikeInterface is depicted in Figure 6A and consists of the following analysis steps:

Sample spike sorting pipeline using SpikeInterface.

(A) A diagram of a sample spike sorting pipeline. Each processing step is colored to represent the SpikeInterface package in which it is implemented and the dashed, colored arrows demonstrate how the Extractors are used in each processing step. (B) How to use the Python API to build the pipeline shown in (A). (C) How to use the GUI to build the pipeline shown in (A).

  1. Loading an Open Ephys recording (Siegle et al., 2017).

  2. Loading a probe file.

  3. Applying a bandpass filter.

  4. Applying common median referencing to reduce the common mode noise.

  5. Spike sorting with Mountainsort4.

  6. Removing clusters with less than 100 events.

  7. Exporting the results to Phy for manual curation.

Traditionally, implementing this pipeline is challenging as the user has to load data from multiple file formats, interface with a probe file, memory-map all the processing functions, prepare the correct inputs for Mountainsort4, and understand how to export the results into Phy. Even if the user manages to implement all of the analysis steps on their own, it is difficult to verify their correctness or reuse them without proper unit testing and code reviewing.

Using the Python API

Request a detailed protocol

Using SpikeInterface’s Python API to build the pipeline shown in Figure 6A is straightforward. Each of the seven steps is implemented with a single line of code (as shown in Figure 6B). Additionally, data visualizations can be added for each step of the pipeline using the appropriate widgets (as described in the SpikeWidgets Section). Unlike handmade scripts, SpikeInterface has a wide range of unit tests, employs continuous integration, and has been carefully developed by a team of researchers. Users, therefore, can have increased confidence that the pipelines they create are correct and reusable. Additionally, SpikeInterface tracks the entire provenance of the performed analysis, allowing other users (or the same user) to reproduce the analysis at a later date.

Using the spikely GUI

Request a detailed protocol

Along with our Python API, we also developed spikely (https://github.com/SpikeInterface/spikely; Hurwitz et al., 2020), a PyQt-based GUI that allows for simple construction of complex spike sorting pipelines. With spikely, users can build workflows that include: (i) loading a recording and a probe file; (ii) performing pre-processing on the underlying recording with multiple processing steps; (iii) running any spike sorter supported by SpikeInterface on the processed recording; (iv) automatically curating the sorter’s output; and (v) exporting the final result to a variety of file formats, including Phy. At its core, spikely utilizes SpikeInterface’s Python API to run any constructed spike sorting workflow. This ensures that the functionality of spikely grows organically with that of SpikeInterface.

Figure 6C shows a screenshot from spikely where the pipeline in Figure 6A is constructed. Each stage of the pipeline is added using drop-down lists, and all the parameters (which were not left at their default values) are set in the right-hand panel. Once a pipeline is constructed in spikely, the user can save it using the built-in save functionality and then load it back into spikely at a later date. Since spikely is cross-platform and user-friendly, we believe it can be utilized to increase the accessibility and reproducibility of spike sorting.

Discussion

In this paper, we introduced SpikeInterface, a Python framework designed to enhance the accessibility, reliability, efficiency, and reproducibility of spike sorting. To illustrate the use-cases and advantages of SpikeInterface, we performed a detailed meta-analysis that included: quantifying the agreement among six modern sorters on a real dataset, benchmarking each sorter on a simulated ground-truth recording, and investigating the performance of a consensus-based spike sorting and how it compares with manually curated results. To highlight the modular design of SpikeInterface, we then provided descriptions and code samples for each of the five main packages and showed how they could be chained together to construct flexible spike sorting workflows.

Ensemble spike sorting

Our analysis demonstrated that spike sorters not only differ in unit isolation quality, but can also return a significant number of false positive units. To identify true neurons and remove poorly sorted and noisy units, we combined the output of several spike sorters and found that although agreement between sorters is generally poor, units that are found by more than one sorter are likely true positives. This strategy, which we term consensus-based or ensemble spike sorting (a terminology borrowed from machine learning; Dietterich, 2000) appears to be a viable alternative to manual curation which suffers from high-variability among different operators (Wood et al., 2004; Rossant et al., 2016). Alternatives to manual curation are especially enticing as the density and number of simultaneously recording channels continue to increase rapidly.

We propose that consensus-based spike sorting (or curation) can be utilized in a number of different ways. A first possibility is to choose a suitable spike sorter (for instance, based on the extensive ground-truth comparison performed by SpikeForest; Magland et al., 2020) and then to curate its output by retaining the units that are in agreement with other sorters. Alternatively, a more conservative approach is to simply record the agreement scores for all sorted units and then hand-curate only those units that have low agreement. A third method, already implemented in SpikeInterface, is to generate a consensus spike sorting by using, for each unit, the union of the two closest matching units from different sorters (matching spikes are only considered once). Although more work is needed to quantitatively assess the advantages and disadvantages of each approach, our analysis indicates that agreement among sorters can be a useful tool for curating sorting results.

Although ensemble spike sorting is an exciting new direction to explore, there are other methods for curation that must be considered. One popular curation method is to accept or reject sorted units based on a variety of quality metrics (this is supported by SpikeInterface). Another method that is gaining more popularity is to use the large amount of available curated datasets to train classifiers that can automatically flag a unit as ‘good’ or ‘noise’ depending on some features, such as waveform shape. Finally, while manual curation is subjective and time consuming, it is the only method that allows for merging and splitting of units and, through powerful software tools such as Phy (Rossant et al., 2014; Rossant et al., 2016), it allows for full control over the curation process. Future research into these different curation methods is required to determine which are appropriate for the new influx of high-density extracellular recording devices.

Comparison to other frameworks

As mentioned in the introduction, many software tools have attempted to improve the accessibility and reproducibility of spike sorting. Here, we review the four most recent tools that are in use (to our knowledge) and compare them to SpikeInterface.

Nev2lkit (Bongard et al., 2014) is a cross-platform, C++-based GUI designed for the analysis of recordings from multi-shank multi-electrode arrays (Utah arrays). In this GUI, the spike sorting step consists of PCA for dimensionality reduction and then klustakwik for automatic clustering (Rossant et al., 2016). As Nev2lkit targets low-density probes where each channel is spike sorted separately, it is not suitable for the analysis of high-density recordings. Also, since it implements only one spike sorter, users cannot utilize any consensus-based curation or exploration of the data. The software is available online (http://nev2lkit.sourceforge.net/), but it lacks version-control and automated testing with continuous integration platforms.

SigMate (Mahmud et al., 2012) is a MATLAB-based toolkit built for the analysis of electrophysiological data. SigMate has a large scope of usage including the analysis of electroencephalograpy (EEG) signals, local field potentials (LFP), and spike trains. Despite its broad scope, or because of it, the spike sorting step in SigMate is limited to Wave clus Chaure et al., 2018, which is mainly designed for spike sorting recordings from a few channels. This means that both major limitations of Nev2lkit (as discussed above) also apply to SigMate. The software is available online (https://sites.google.com/site/muftimahmud/codes), but again, it lacks version-control and automated testing with continuous integration platforms.

Regalia et al., 2016 developed a spike sorting framework with an intuitive MATLAB-based GUI. The spike sorting functionality implemented in this framework includes four feature extraction methods, three clustering methods, and one template matching classifier (O-Sort; Rutishauser et al., 2006). These 'building blocks' can be combined to construct new spike sorting pipelines. As this framework targets low-density probes where signals from separate electrodes are spike sorted separately, its usefulness for newly developed high-density recording technology is limited. Moreover, this framework only runs with a specific file format (MCD format from Multi Channel Systems; MCS, 2020). The software is distributed upon request.

Most recently, Nasiotis et al., 2019a implemented IN-Brainstorm, a MATLAB-based GUI designed for the analysis of invasive neurophysiology data. IN-Brainstorm allows users to run three spike sorting packages (Wave clus [Chaure et al., 2018], UltraMegaSort2000 [Hill et al., 2011], and Kilosort [Pachitariu et al., 2016]). Recordings can be loaded and analyzed from six different file formats: Blackrock, Ripple, Plexon, Intan, NWB, and Tucker Davis Technologies. IN-Brainstorm is available on GitHub (https://github.com/brainstorm-tools/brainstorm3; Nasiotis et al., 2019b) and its functionality is documented (https://neuroimage.usc.edu/brainstorm/e-phys/Introduction). IN-Brainstorm does not include the latest spike sorting software (Rossant et al., 2016; Yger et al., 2018; Chung et al., 2017; Jun et al., 2017b; Pachitariu et al., 2018; Hilgen et al., 2017) (IN-Brainstorm does include instructions on how to import data that has been spike sorted by a non-supported spike sorter), and it does not support any post-sorting analysis such as quality metric calculation, automated curation, or sorting output comparison.

Outlook

As it stands, spike sorting is still an open problem. No step in the spike sorting pipeline is completely solved and no spike sorter can be used for all applications. With SpikeInterface, researchers can quickly build, run, and evaluate many different spike sorting workflows on their specific datasets and applications, allowing them to determine which will work best for them. Once a researcher determines an ideal workflow for their specific problem, it is straightforward to share and re-use that workflow in other laboratories as the full provenance is automatically stored by SpikeInterface. We envision that many laboratories will use SpikeInterface to satisfy their spike sorting needs.

Along with its applications to extracellular analysis, SpikeInterface is also a powerful tool for developers looking to create new spike sorting algorithms and analysis tools. Developers can test their methods using our efficient and comprehensive comparison functions. Once satisfied with their performance, developers can integrate their work into SpikeInterface, allowing them access to a large-community of new users and providing them with automatic file I/O for many popular extracellular dataset formats. For developers who work on projects that utilize spike sorting, SpikeInterface is useful out-of-the-box, providing more reliability and functionality than lab-specific scripts. We envision that many developers will be excited to use and integrate with SpikeInterface.

Already, SpikeInterface is being used in a variety of applications. The file IO, preprocessing, and spike sorting capabilities of SpikeInterface are an integral part of SpikeForest (Magland et al., 2020), which is an interactive website for benchmarking and tracking the accuracy of publicly available spike sorting algorithms. At present, this project includes ten spike sorting algorithms and more than 300 extracellular recordings with ground-truth firing information. SpikeInterface’s ability to read and write to a multitude of extracellular file formats is also being utilized by Neurodata Without Borders (Teeters et al., 2015) in their nwb-conversion-tools package. We hope to continue integrating SpikeInterface into cutting-edge extracellular analysis frameworks.

Data availability

All data generated or analysed during this study are included in the manuscript and supporting files. The datasets are uploaded to the DANDI archive, dataset 000034 (https://gui.dandiarchive.org/#/dandiset/000034). The source code for generating all figures is also publicly available at: https://spikeinterface.github.io/.

The following data sets were generated
    1. Buccino AP
    2. Hurwitz CL
    3. Garcia S
    4. Magland J
    5. Siegle JH
    6. Hurwitz R
    7. Hennig MH
    (2020) DANDI
    ID 000034. SpikeInterface, a unified framework for spike sorting.

References

  1. Software
    1. Dask
    (2016) Dask: Library for Dynamic Task Scheduling
    Dask: Library for Dynamic Task Scheduling.
  2. Conference
    1. Dietterich TG
    (2000)
    Ensemble methods in machine learning
    International Workshop Multiple Classifier Systems. pp. 1–15.
  3. Software
    1. Intan
    (2010) Intan technologies
    Intan technologies.
  4. Software
    1. Jun JJ
    2. Magland JF
    3. Mitelut C
    4. Barnett AH
    (2020)
    IronClust: Scalable and Drift-Resistant Spike Sorting for Long-Duration, High-Channel Count Recordings
    IronClust: Scalable and Drift-Resistant Spike Sorting for Long-Duration, High-Channel Count Recordings.
  5. Book
    1. Lee JH
    2. Carlson DE
    3. Razaghi HS
    4. Yao W
    5. Goetz GA
    6. Hagen E
    7. Batty E
    8. Chichilnisky E
    9. Einevoll GT
    10. Paninski L
    (2017)
    YASS: yet another spike sorter
    In: Becker S, editors. Advances in Neural Information Processing Systems. MIT Press. pp. 4002–4012.
  6. Software
    1. Marques-Smith A
    2. Neto JP
    3. Lopes G
    4. Nogueira J
    5. Calcaterra L
    6. Frazão J
    7. Kim D
    8. Phillips MG
    9. Dimitriadis G
    10. Kampff A
    (2018b)
    Simultaneous Patch-Clamp and Dense Cmos Probe Extracellular Recordings From the Same Cortical Neuron in Anaesthetized Rats, CRCNS
    Simultaneous Patch-Clamp and Dense Cmos Probe Extracellular Recordings From the Same Cortical Neuron in Anaesthetized Rats, CRCNS.
  7. Software
    1. MaxWell
    (2020) MaxWell biosystems
    MaxWell biosystems.
  8. Conference
    1. McKinney W
    (2010)
    Data structures for statistical computing in Python
    Proceedings of the 9th Python in Science Conference. pp. 51–56.
  9. Software
    1. MCS
    (2020) Multi channel systems
    Multi channel systems.
  10. Book
    1. Mucha HJ
    (1995) XClust: clustering in an interactive way
    In: Berwin T, editors. XploRe: An Interactive Statistical Computing Environment. Springer. pp. 141–168.
    https://doi.org/10.1007/978-1-4612-4214-7
  11. Software
    1. Neuralynx
    (2020) Neuralynx
    Neuralynx.
  12. Software
    1. NIX
    (2015) Neuroscience Information Exchange Format - Nix
    Neuroscience Information Exchange Format - Nix.
  13. Book
    1. Pachitariu M
    2. Steinmetz NA
    3. Kadir SN
    (2016)
    Fast and accurate spike sorting of high-channel count probes with kilosort
    In: Dietterich T. G, editors. Advances in Neural Information Processing Systems. MIT press. pp. 4448–4456.
  14. Conference
    1. Xq L
    2. Wu X
    3. Liu C
    (2011)
    SPKtool: an open source toolbox for electrophysiological data processing
    In 2011 4th International Conference on Biomedical Engineering and Informatics. pp. 854–857.
  15. Conference
    1. Yuan X
    2. Kim S
    3. Juyon J
    4. D’Urbino M
    5. Bullmann T
    6. Chen Y
    7. Stettler A
    8. Hierlemann A
    9. Frey U
    (2016)
    A microelectrode array with 8,640 electrodes enabling simultaneous full-frame readout at 6.5 kfps and 112-channel switch-matrix readout at 20 ks/s
    VLSI Circuits (VLSI-Circuits), 2016 IEEE Symposium. pp. 1–2.

Decision letter

  1. Laura L Colgin
    Senior and Reviewing Editor; University of Texas at Austin, United States
  2. Sonja Grün
    Reviewer; Jülich Research Centre, Germany
  3. Fabian Kloosterman
    Reviewer; KU Leuven, Belgium

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Acceptance summary:

SpikeInterface is an integrated set of tools that makes it straightforward for researchers to set up a complete spike sorting workflow. SpikeInterface supports many common data formats and modern spike sorters and provides post-processing tools for characterization of the spike sorting results. This allows for validation and comparison of multiple spike sorting results. Results suggest that combining the results of multiple spike sorters could help to reduce the number of false positive units, which is an interesting future direction that will likely inspire further investigation. This tool is expected to be useful for researchers in many different areas who are studying neuronal responses.

Decision letter after peer review:

[Editors’ note: the authors submitted for reconsideration following the decision after peer review. What follows is the decision letter after the first round of review.]

Thank you for submitting your work entitled "SpikeInterface, a unified framework for spike sorting" for consideration by eLife. Your article has been reviewed by a Senior Editor, a Reviewing Editor, and three reviewers. The following individuals involved in review of your submission have agreed to reveal their identity: Fabian Kloosterman (Reviewer #2).

Our decision has been reached after consultation between the reviewers. Based on these discussions and the individual reviews below, we regret to inform you that your work will not be considered further for publication in eLife.

There was a great deal of discussion about this manuscript among the reviewers and the editor after the individual reviews were received. Ultimately, the consensus was that this work in its present form is too preliminary to be useful to, and to make a major impact on, a broad range of users. At eLife, the standard revision period is approximately two months, and therefore papers are largely assessed "as is" to allow authors to decide when to publish the work at the stage when they feel it is ready. In this case, though, reviewers agreed that the work needs a number of major revisions that constitute a substantial amount of work in order to make a major impact across a broad range of readers (e.g., reviewers were not confident that this tool is ready to be used by anyone who is recording with Neuropixels). If you agree with the reviewers that major changes to the tool are necessary to make a major impact on the field, then we would encourage you to submit a majorly revised manuscript to us in the future, citing this manuscript number and requesting the same editor. We would be willing to re-assess the manuscript at that time. Otherwise, you can just move on to a more specialized journal, keeping your tool in its present form and perhaps improving on it in future publications.

The three original reviews are included in their entirety below. However, due to the extensive and constructive discussion that occurred after reviewers read each other's reviews, we would like to emphasize a number of interrelated major points that were discussed in the consultation:

1) A concern was raised that SpikeInterface limits the flexibility of the spike sorters it contains and makes spike sorting more of a "black box". Given the lack of real ground truth available for results comparison, this was viewed as a major weakness. Reviewers felt that users still need to be able to look carefully at the units, understand the different algorithms, and properly set the parameters. Using the default parameters could lead to suboptimal results, and the authors did not attempt to adjust parameters. Reviewers felt that the SpikeInterface toolbox could also be used to compare results of the same spike sorter using different parameters and that this would be useful to find optimal parameters and would increase the potential impact of this tool.

2) The analyses and presentation of the comparison of spike sorters was viewed as weak. Reviewers agreed that careful manual curation is still the only right way to compare spike sorting results. Reviewers felt that it would be a mistake for readers who are new to the spike sorting process to look at the analyses shown in the paper as the right way to compare spike sorting results. Reviewers felt that a manual curation step is necessary to make the spike sorter evaluation process more useful.

3) It was suggested that the authors should attempt to perform some sort of smart cluster merging strategy that utilizes the output of different sorters.

4) Another suggestion for a potentially useful addition was that users would be able to swap in and out different algorithmic components in the spike sorting pipeline. Combined with manual curation and comparison to "ground truth" data sets, reviewers felt that this could help users to determine the best algorithmic components for particular types of recordings.

Reviewer #1:

In this paper, the authors introduce a Python package for easily running many different spike sorters and exporting to many different formats. The goal is to make it easier for electrophysiologists to run their data through the spike sorters and output these results to Phy and other GUIs for data visualizations. While I agree that spike sorting is a hard problem and users need to be helped as much as possible, I don't think this framework helps a lot and will ultimately not find much use. I think Phy (already published and widely used) does most of the work that the authors suggest SpikeInterface should do, and in fact the main use case for SpikeInterface seems to be as an exporter to Phy. At its core, the code provided here is a set of file converters and code wrappers that further obfuscate the black-boxes that many spike sorters are, and make it more difficult for users to know how to build a successful spike sorting pipeline for their own data.

Reviewer #2:

The work presented in this manuscript is of great interest to both spike sorting users and developers. The unified framework bridges the gap between the plethora of recording file formats and spike sorting packages, which is a major improvement in terms of spike sorting experience. The framework also features many interesting features related to spike sorting for processing recordings and sorting results. The manuscript is clearly written and introduces the functionality that is at hand in the framework in a concise way. Below is a list of major and minor comments that need to be addressed, however.

1) Spikeinterface is portrayed as a general spike sorting framework. Still, the spike sorting workflow supported by spikeinterface appears to be geared towards specific kind of data and sorters, i.e. those that work on high electrode count continuous datasets. The authors should make explicit the assumptions that are made in spikeinterface regarding the data that is accepted (e.g. datasets with only waveform snippets appear not to be supported) and the minimal requirement for spike sorters (e.g. do spike sorters need to include their own spike detection algorithm and spike feature extraction?).

2) The authors have chosen to run the spike sorters with their default parameters and without manual or automated refinement (i.e. noise cluster rejection, cluster merging/splitting). As many spike sorting algorithms explicitly depend on a manual cluster merging/splitting step after they have been applied to the data, it would be interesting to also provide an automated cluster merging (e.g., based on the ground truth as in Wouters, Kloosterman and Bertrand, 2019). This will improve the understanding of the true potential of a spike sorting algorithm, when comparing it to others in a ground-truth study. As a bare minimum, the authors should discuss the need of a post-sorting split/merge curation step and discuss the effect of leaving the step out on their results. Without such discussion, it would be premature to talk about a "consensus-based strategy" to select clusters (subsection “Application 1: Comparing Spike Sorters on Neuropixels Data”).

3) The authors define an agreement score to match clusters from different sorters and use the score to classify clusters (as compared to ground truth) as "well-detected", "false positive", "redundant" and "over-merged". However, a low agreement score could result from a high number of false positive detections or a high number of false negative detections (or both), and the interpretation would be different in these cases. In the extremes of no false positives or false negatives, an agreement score of 0.2 could either mean all spikes in a cluster represent 20% of the ground truth spikes (i.e. a clean partial cluster) or it could mean that all ground truth spikes represent 20% of the spikes in a cluster (i.e. a "dirty" over-merged cluster). Thus, the agreement score is not a good metric for the classification of the clusters. Instead, the authors should consider a classification based on different metric(s), e.g. both precision and recall.

4) We do not find the swarm plot in Figure 4 that compares the accuracy, precision and recall for multiple sorters very informative. First, the number of non-matched clusters is not obvious in this plot (we assume point with zero score are non-matched?). More importantly, there is often a trade-off between the number of false positive and false negatives, and each sorter may make a different trade-off, depending on the parameters. The swarm plot does not show the relation between precision and recall for each sorter, and a precision-recall scatter plot would be more informative.

Reviewer #3:

This submission describes a software toolbox aimed to facilitate the comparison of spike sorting algorithms. It is targeted for a broad user base, who may not have the time or technical ability to make such comparisons on their own. This tool addresses a need of the neuroscience community. Outlined below are number of suggested corrections.

Introduction: Not all the listed sorters are truly fully-manual, ie Mclust is semiautomatic.

Subsection “Overview of SpikeInterface”: Roman numerals swapped for spikecomparison and spikewidgets.

Subsection “SpikeExtractors”: It is unclear how recordingextractor, a visualization tool, provides functionality required to excess data to evaluate the spike sorting pipeling. This becomes more clear later, but could be made more clear sooner.

Subsection “SpikeExtractors” and subsection “SpikeToolkit”: The code snippets could be expanded to give more context and be more relevant.

Subsection “Curation”: Instead of holding of for the future, this functionality would be nice to implement here, if it is not an unreasonable amount of work.

Subsection “Using the Python API”: It could be said that spikeinterface is also handmade, maybe clarify the point.

Figure 3B is hard to read.

Figure 3D, what are the color code agreement levels exactly, this is unclear.

In Figure 4 it would be nice to see plotted SNR vs agreement score.

[Editors’ note: further revisions were suggested prior to acceptance, as described below.]

Thank you for submitting your article "SpikeInterface, a unified framework for spike sorting" for consideration by eLife. Your article has been reviewed by Laura Colgin as the Senior Editor and Reviewing Editor and three reviewers. The following individuals involved in review of your submission have agreed to reveal their identity: Sonja Grün (Reviewer #2); Fabian Kloosterman (Reviewer #3).

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

We would like to draw your attention to changes in our revision policy that we have made in response to COVID-19 (https://elifesciences.org/articles/57162). Specifically, we are asking editors to accept without delay manuscripts, like yours, that they judge can stand as eLife papers without additional data, even if they feel that they would make the manuscript stronger. Thus the revisions requested below only address clarity and presentation.

Summary:

The authors describe SpikeInterface, which is an integrated set of tools that makes it straightforward for researchers to set up a complete spike sorting workflow. SpikeInterface is modular (and extendible) and supports many common data formats and modern spike sorters. It provides postprocessing tools for characterization of the spike sorting results and for validation and comparison of multiple spike sorting results (e.g. against ground truth). SpikeInterface allows users to focus on the spike sorting results and curation, rather than having to glue together or (re)implement disparate tools themselves.

Compared to the previous version, the authors have now more clearly outlined the goals of SpikeInterface in the introduction. In addition to comparing multiple sorters, they have added new results that indicate that combining the results of multiple spike sorters ("ensemble spike sorting") could help to reduce the number of false positive units, which is an interesting future direction that needs further investigation.

Essential revisions:

1) Regarding the ensemble spike sorting approach:

– From the results shown in Figure 3C, it seems that one won't need to run all 6 sorters to eliminate the false positives. Could the authors quantify the relative benefit of combining 2, 3 or more sorters over the use of a single sorter?

– One could imagine that combining the results of spike sorters that use a different class of algorithm would provide more benefit than combining two sorters that use the same algorithm. Do the authors observe this?

– If reviewers understand correctly, many spike sorters will return a slightly different output when run a second time on the same data set. To what extent does running the same sorter twice give the same benefit (i.e. low agreement on false positive clusters) as running two different sorters?

– Would using the ensemble spike sorting approach give similar results as using a more stringent selection of units found by a single sorter (e.g. based on cluster quality metrics, SNR, spike amplitude, etc.)?

2) The authors used an arbitrary agreement score threshold of 0.5, which they acknowledge is a pragmatic but not necessarily the best choice to match units found by multiple sorters. Reviewers did not think it is necessary to change the way units are matched, but to provide more insight into the matching process it would be helpful to know what the distribution of unit agreement scores for matching pairs looks like.

3) When evaluating the spike sorting results on the simulated data set, the authors only mention matches and false positives. Reviewers did not see a mention of the number of false negatives (i.e. number of ground units that were missed by the sorters). Could the authors also indicate to what extent spikes of the missed clusters actually show up as part of the false positive units (e.g. because false positive units are actually overly split/merged true units)?

https://doi.org/10.7554/eLife.61834.sa1

Author response

[Editors’ note: the authors resubmitted a revised version of the paper for consideration. What follows is the authors’ response to the first round of review.]

There was a great deal of discussion about this manuscript among the reviewers and the editor after the individual reviews were received. Ultimately, the consensus was that this work in its present form is too preliminary to be useful to, and to make a major impact on, a broad range of users. At eLife, the standard revision period is approximately two months, and therefore papers are largely assessed "as is" to allow authors to decide when to publish the work at the stage when they feel it is ready. In this case, though, reviewers agreed that the work needs a number of major revisions that constitute a substantial amount of work in order to make a major impact across a broad range of readers (e.g., reviewers were not confident that this tool is ready to be used by anyone who is recording with Neuropixels). If you agree with the reviewers that major changes to the tool are necessary to make a major impact on the field, then we would encourage you to submit a majorly revised manuscript to us in the future, citing this manuscript number and requesting the same editor. We would be willing to re-assess the manuscript at that time. Otherwise, you can just move on to a more specialized journal, keeping your tool in its present form and perhaps improving on it in future publications.

The three original reviews are included in their entirety below. However, due to the extensive and constructive discussion that occurred after reviewers read each other's reviews, we would like to emphasize a number of interrelated major points that were discussed in the consultation:

1) A concern was raised that SpikeInterface limits the flexibility of the spike sorters it contains and makes spike sorting more of a "black box". Given the lack of real ground truth available for results comparison, this was viewed as a major weakness. Reviewers felt that users still need to be able to look carefully at the units, understand the different algorithms, and properly set the parameters. Using the default parameters could lead to suboptimal results, and the authors did not attempt to adjust parameters. Reviewers felt that the SpikeInterface toolbox could also be used to compare results of the same spike sorter using different parameters and that this would be useful to find optimal parameters and would increase the potential impact of this tool.

We disagree that SpikeInterface limits the flexibility of the underlying sorter in any meaningful way. A key feature of SpikeInterface is that the Python-based wrappers for each spike sorter expose the underlying parameters such that users can adjust them before running the sorter. In fact, to ease the parameter adjustment process for a new user, SpikeInterface contains two functions, get_default_params() and get_params_description(), which return the developer-suggested value and description of each parameter, respectively. We believe that this functionality provides quick and accessible guidance to the users to make spike sorting less “black box” if anything. We would also like to point the editor and reviewers to a series of open analyses that showcase how SpikeInterface can be used for parameter sweeping and evaluation of spike sorting results

(https://spikeinterface.github.io/blog/example-of-parameters-optimization/). While we acknowledge that these aspects of SpikeInterface are important to highlight, we believe that they would require an extensive study to properly investigate so we decided to leave them out of the current manuscript. Our main point here is that this type of optimisation is very easy to perform with SpikeInterface.

2) The analyses and presentation of the comparison of spike sorters was viewed as weak. Reviewers agreed that careful manual curation is still the only right way to compare spike sorting results. Reviewers felt that it would be a mistake for readers who are new to the spike sorting process to look at the analyses shown in the paper as the right way to compare spike sorting results. Reviewers felt that a manual curation step is necessary to make the spike sorter evaluation process more useful.

In our initial manuscript, the presented analyses were designed to showcase the different ways that SpikeInterface could be used to compare spike sorters. We agree with the reviewers, however, that the shallowness of these analyses was a major weak point that needed to be addressed. Therefore, in our updated manuscript, we completely re-wrote the Results section. In this new Results section, we explore how combining the output of multiple spike sorters can be used to improve overall spike sorting performance, a method we call ensemble spike sorting. To this end, we first demonstrate that there is surprisingly low agreement among six spike sorters designed for high-count probes on a Neuropixels recording (analysis of other recordings are illustrated in the supplement). Next, we repeat this analysis with a simulated Neuropixels dataset (with known ground truth), finding that the sorter agreement level is comparable to that of real data and that almost all units that are found by only one sorter are actually false positive units. Finally, we ask two experts to manually curate the output of the real Neuropixels dataset (using Phy) and compare the manually curated datasets to a consensus sorting result (units agreed upon by at least two out of the six sorters). We find excellent agreement between the “good” units found in the manually curated datasets and the consensus sorting results. We believe that this new analysis is much more detailed and comprehensive and provides evidence that using multiple spike sorters can potentially inform (or even replace) the time consuming and subjective manual curation step in extracellular analysis pipelines.

3) It was suggested that the authors should attempt to perform some sort of smart cluster merging strategy that utilizes the output of different sorters.

As outlined above, our analysis now explores an ensemble-based approach to spike sorting which combines the output of several spike sorters to improve overall spike sorting performance. In addition, SpikeInterface now includes functionality to automatically generate this consensus set by merging the best unit matches among multiple sorters. While we found this can yield good results for ground truth data, we believe further analysis is required to validate this method. We are confident, however, that the agreement scores among multiple spike sorters can, at the very least, inform subsequent manual curation steps.

4) Another suggestion for a potentially useful addition was that users would be able to swap in and out different algorithmic components in the spike sorting pipeline. Combined with manual curation and comparison to "ground truth" data sets, reviewers felt that this could help users to determine the best algorithmic components for particular types of recordings.

We would like to point out (as expressed also by reviewer 1) that different processing steps in a spike sorting algorithm are not decoupled from each other. We therefore believe that swapping components between different algorithms cannot be a viable solution for building better spike sorting pipelines. Despite our reservations about the mix-and-match approach to spike sorting, we have added a module to the spiketoolkit package called sortingcomponents which will contain different algorithmic components for spike sorting.

Currently, this package just contains a decoupled detection algorithm, but we plan to add more functionality in future updates.

Reviewer #1:

In this paper, the authors introduce a Python package for easily running many different spike sorters and exporting to many different formats. The goal is to make it easier for electrophysiologists to run their data through the spike sorters and output these results to Phy and other GUIs for data visualizations. While I agree that spike sorting is a hard problem and users need to be helped as much as possible, I don't think this framework helps a lot and will ultimately not find much use. I think Phy (already published and widely used) does most of the work that the authors suggest SpikeInterface should do, and in fact the main use case for SpikeInterface seems to be as an exporter to Phy. At its core, the code provided here is a set of file converters and code wrappers that further obfuscate the black-boxes that many spike sorters are, and make it more difficult for users to know how to build a successful spike sorting pipeline for their own data.

We thank the reviewer for the comments, however, we strongly disagree with this assessment. Phy is a tool for visualising and manually curating spike sorter outputs. In contrast, SpikeInterface aims to automate many aspects of a sorting pipeline, including file IO, pre and post-processing, running multiple spike sorting jobs, comparing spike sorting outputs, and computing quality metrics. As we show in our updated manuscript, the ability to efficiently run multiple spike sorters and to quickly compare their outputs offers new possibilities for automatic curation and unit annotation to inform manual curation. All analysis that can be performed with SpikeInterface is fully reproducible as well. We hope our revised paper better illustrates the many exciting use cases of SpikeInterface.

Reviewer #2:

The work presented in this manuscript is of great interest to both spike sorting users and developers. The unified framework bridges the gap between the plethora of recording file formats and spike sorting packages, which is a major improvement in terms of spike sorting experience. The framework also features many interesting features related to spike sorting for processing recordings and sorting results. The manuscript is clearly written and introduces the functionality that is at hand in the framework in a concise way. Below is a list of major and minor comments that need to be addressed, however.

1) Spikeinterface is portrayed as a general spike sorting framework. Still, the spike sorting workflow supported by spikeinterface appears to be geared towards specific kind of data and sorters, i.e. those that work on high electrode count continuous datasets. The authors should make explicit the assumptions that are made in spikeinterface regarding the data that is accepted (e.g. datasets with only waveform snippets appear not to be supported) and the minimal requirement for spike sorters (e.g. do spike sorters need to include their own spike detection algorithm and spike feature extraction?).

The reviewer correctly points out that datasets with only waveform snippets are not supported by SpikeInterface. Apart from this requirement, which is satisfied by most of modern acquisition systems, SpikeInterface does not make any other assumptions about the initial dataset. All supported spike sorters are end-to-end, i.e. they take a raw recording as input and output the sorted results. While many supported sorters are designed for high-count devices (e.g. IronClust, HerdingSpikes, Kilosort2), we also include sorters which are better suited for low-channel counts (Klusta, Wave-Clus, Mountainsort4). To help users better understand the different sorters supported by SpikeInterface, we added the get_sorte_description() function to the spikesorters package which provides a description of the algorithm, including its intended use.

2) The authors have chosen to run the spike sorters with their default parameters and without manual or automated refinement (i.e. noise cluster rejection, cluster merging/splitting). As many spike sorting algorithms explicitly depend on a manual cluster merging/splitting step after they have been applied to the data, it would be interesting to also provide an automated cluster merging (e.g., based on the ground truth as in Wouters, Kloosterman and Bertrand, 2019). This will improve the understanding of the true potential of a spike sorting algorithm, when comparing it to others in a ground-truth study. As a bare minimum, the authors should discuss the need of a post-sorting split/merge curation step and discuss the effect of leaving the step out on their results. Without such discussion, it would be premature to talk about a "consensus-based strategy" to select clusters (subsection “Application 1: Comparing Spike Sorters on Neuropixels Data”).

We thank the reviewer for this insightful comment which led us to completely re-write the Results section in our updated manuscript. In our new Results section, we provide evidence that a consensus-based spike sorting strategy is a viable alternative to using a single sorter. We demonstrate that units that are found by only one sorter mainly coincide with false positives (on simulated data). We also added a comparison between our consensus-based method and the manually curated outputs from two experts (on real data), showing that there is a large agreement between units labeled as “good” by the curators and the consensus sorting output. We believe that the new results in our updated manuscript provide initial evidence for the viability of a consensus-based strategy. In the Discussion section, we added a paragraph on Ensemble spike sorting to discuss the potential strengths and limitations of this strategy.

3) The authors define an agreement score to match clusters from different sorters and use the score to classify clusters (as compared to ground truth) as "well-detected", "false positive", "redundant" and "over-merged". However, a low agreement score could result from a high number of false positive detections or a high number of false negative detections (or both), and the interpretation would be different in these cases. In the extremes of no false positives or false negatives, an agreement score of 0.2 could either mean all spikes in a cluster represent 20% of the ground truth spikes (i.e. a clean partial cluster) or it could mean that all ground truth spikes represent 20% of the spikes in a cluster (i.e. a "dirty" over-merged cluster). Thus, the agreement score is not a good metric for the classification of the clusters. Instead, the authors should consider a classification based on different metric(s), e.g. both precision and recall.

We acknowledge that our proposed classification of sorted units is preliminary and cannot differentiate between some spike sorting failure modes. However, we still believe that it can provide insight into the strengths and weaknesses of different sorters despite not being a perfect solution. We agree that a more detailed analysis on multiple ground-truth recordings is needed to provide better classification rules. We address these concerns in the updated manuscript by adding the following sentence in the subsection “SpikeComparison”: "We would like to highlight to the reader that the unit classification proposed here is currently only based on agreement score (i.e. accuracy). More sophisticated classification rules could involve a combination of accuracy, precision, and recall values, which can be easily computed for each unit with the spikecomparison module."

4) We do not find the swarm plot in Figure 4 that compares the accuracy, precision and recall for multiple sorters very informative. First, the number of non-matched clusters is not obvious in this plot (we assume point with zero score are non-matched?). More importantly, there is often a trade-off between the number of false positive and false negatives, and each sorter may make a different trade-off, depending on the parameters. The swarm plot does not show the relation between precision and recall for each sorter, and a precision-recall scatter plot would be more informative.

To address this concern, we added a supplementary figure to the new manuscript that shows a scatter plot with precision vs recall for each sorter (Figure 2—figure supplement 1). As mentioned by reviewer 2, this figure helps reveal relevant differences among the chosen sorters and we thought it important to include in the new draft.

Reviewer #3:

This submission describes a software toolbox aimed to facilitate the comparison of spike sorting algorithms. It is targeted for a broad user base, who may not have the time or technical ability to make such comparisons on their own. This tool addresses a need of the neuroscience community. Outlined below are number of suggested corrections.

Introduction: Not all the listed sorters are truly fully-manual, ie Mclust is semiautomatic.

Thanks for catching this; We removed MClust from the list of fully manual spike sorters.

Subsection “Overview of SpikeInterface”: Roman numerals swapped for spikecomparison and spikewidgets.

We fixed the roman numerals accordingly.

Subsection “SpikeExtractors”: It is unclear how recordingextractor, a visualization tool, provides functionality required to excess data to evaluate the spike sorting pipeling. This becomes more clear later, but could be made more clear sooner.

A RecordingExtractor is not a visualization tool, but an file IO class that interfaces with the data. We are not sure what reviewer 3 means with this comment.

Subsection “SpikeExtractors” and subsection “SpikeToolkit”: The code snippets could be expanded to give more context and be more relevant.

In our updated manuscript, we improved the description before each code snippet to provide more context. We still chose to keep the code snippets minimal just to show the basic aspects of the API. The reader/user can find more detailed and comprehensive examples in the online documentation.

Subsection “Curation”: Instead of holding of for the future, this functionality would be nice to implement here, if it is not an unreasonable amount of work.

The curation module has now been extended to support curation based on all supported quality metrics.

Subsection “Using the Python API”: It could be said that spikeinterface is also handmade, maybe clarify the point.

The main point of that sentence is that, usually, custom scripts are not fully tested on continuous integration platforms, while SpikeInterface is. We believe that the paragraph conveys this message: "Unlike handmade scripts, SpikeInterface has a wide range of unit tests, employs continuous integration, and has been carefully developed by a team of researchers. Users, therefore, can have increased confidence that the pipelines they create are correct and reusable. Additionally, SpikeInterface tracks the entire provenance of the performed analysis, allowing other users (or the same user) to reproduce the analysis at a later date.".

Figure 3B is hard to read.

We agree with the reviewer and we removed the panel from the Figure.

Figure 3D, what are the color code agreement levels exactly, this is unclear.

The legend is shown on the top right and it indicates the number of k sorters that agree on a unit.

In Figure 4 it would be nice to see plotted SNR vs agreement score.

We added the suggested Figure as Figure 2—figure supplement 1B.

[Editors’ note: what follows is the authors’ response to the second round of review.]

Summary:

The authors describe SpikeInterface, which is an integrated set of tools that makes it straightforward for researchers to set up a complete spike sorting workflow. SpikeInterface is modular (and extendible) and supports many common data formats and modern spike sorters. It provides postprocessing tools for characterization of the spike sorting results and for validation and comparison of multiple spike sorting results (e.g. against ground truth). SpikeInterface allows users to focus on the spike sorting results and curation, rather than having to glue together or (re)implement disparate tools themselves.

Compared to the previous version, the authors have now more clearly outlined the goals of SpikeInterface in the introduction. In addition to comparing multiple sorters, they have added new results that indicate that combining the results of multiple spike sorters ("ensemble spike sorting") could help to reduce the number of false positive units, which is an interesting future direction that needs further investigation.

Essential revisions:

1) Regarding the ensemble spike sorting approach:

– From the results shown in Figure 3C, it seems that one won't need to run all 6 sorters to eliminate the false positives. Could the authors quantify the relative benefit of combining 2, 3 or more sorters over the use of a single sorter?

We have added Figure 3—figure supplement 1, and a brief passage of text to show how the detection of false/true positives depends on the number of sorters used, using the simulated ground-truth dataset. To this end, we tested all possible combinations of two to five sorters and quantified the fraction of identified false positives. The results show that already two sorters are sufficient to remove false positives, with little change when more sorters are added. However, the fraction of true positives in the ensemble sorting can increase substantially when more sorters are used. Therefore, the main benefit of combining multiple sorters is a more reliable identification of true positive units.

– One could imagine that combining the results of spike sorters that use a different class of algorithm would provide more benefit than combining two sorters that use the same algorithm. Do the authors observe this?

The sorters are highly variable in terms of performance which may occlude systematic differences between the main algorithms (template-based sorting and density-based clustering). We did observe a slightly better identification of false/true positives when different approaches (e.g. Kilosort2 and Ironclust) are combined, compared to the use of one approach alone (e.g. Kilosort2 and SpykingCircus). However, we found it difficult to judge if these biases were due to the central algorithm or due to other components of the sorting process. Therefore, we feel we could not give a well-qualified answer to this question as this would require a more systematic comparison of algorithms with equal pre/postprocessing and more ground-truth datasets with different generative assumptions.

– If reviewers understand correctly, many spike sorters will return a slightly different output when run a second time on the same data set. To what extent does running the same sorter twice give the same benefit (i.e. low agreement on false positive clusters) as running two different sorters?

We tested this phenomena for Kilosort2 and the simulated ground-truth recording from our manuscript. Each run returned slightly different results with different numbers of units each time. Using the same ensemble method, we found that all unmatched units were indeed false positives, but this removed only a subset of false positives (a certain fraction of false positives was always consistently found). A summary of the effect is shown in the following figure:

Author response image 1
Comparison of five individual runs of Kilosort2 on the simulated Neuropixels recording.

The top shows the proportions of units from each sorting found in k other sortings. Below these units are split according to false and true positive units after comparison to the ground-truth data. While a sizable fraction of false positive units are unique to each run of the sorter, many are identical in all sortings, indicating that variability in multiple sorter outputs cannot be used to reliably separate false and true positive units.

While potentially worth further investigation, we feel it is too preliminary to show these results as we do not understand where these differences originate and how they can be controlled. Therefore, we decided not to mention this result in the manuscript.

– Would using the ensemble spike sorting approach give similar results as using a more stringent selection of units found by a single sorter (e.g. based on cluster quality metrics, SNR, spike amplitude, etc.)?

We tested this idea using the most obvious quality measure, SNR, on a Kilosort2 sorting of the ground-truth data. We chose Kilosort2 since its performance on detecting true positive units is best among all the sorters. Here we found, surprisingly, that false positive units can have a large SNR so there is no obvious way to separate these. This is illustrated in Figure 3—figure supplement 2, and a short passage was added to this effect.

2) The authors used an arbitrary agreement score threshold of 0.5, which they acknowledge is a pragmatic but not necessarily the best choice to match units found by multiple sorters. Reviewers did not think it is necessary to change the way units are matched, but to provide more insight into the matching process it would be helpful to know what the distribution of unit agreement scores for matching pairs looks like.

We added a histogram of agreement scores for the ground-truth data (Figure 1—figure supplement 2), which shows that the majority of matches have scores >0.8.

3) When evaluating the spike sorting results on the simulated data set, the authors only mention matches and false positives. Reviewers did not see a mention of the number of false negatives (i.e. number of ground units that were missed by the sorters). Could the authors also indicate to what extent spikes of the missed clusters actually show up as part of the false positive units (e.g. because false positive units are actually overly split/merged true units)?

There are 250 ground-truth units, which we now indicate as an horizontal dashed line in Figure 2E.

https://doi.org/10.7554/eLife.61834.sa2

Article and author information

Author details

  1. Alessio P Buccino

    1. Department of Biosystems Science and Engineering, ETH Zurich, Zürich, Switzerland
    2. Centre for Integrative Neuroplasticity (CINPLA), University of Oslo, Oslo, Norway
    Contribution
    Conceptualization, Resources, Data curation, Software, Visualization, Methodology, Writing - original draft, Writing - review and editing
    Contributed equally with
    Cole L Hurwitz
    For correspondence
    alessio.buccino@bsse.ethz.ch
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-3661-527X
  2. Cole L Hurwitz

    School of Informatics, University of Edinburgh, Edinburgh, United Kingdom
    Contribution
    Conceptualization, Resources, Software, Visualization, Methodology, Writing - original draft, Writing - review and editing
    Contributed equally with
    Alessio P Buccino
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-2023-1653
  3. Samuel Garcia

    Centre de Recherche en Neuroscience de Lyon, CNRS, Lyon, France
    Contribution
    Software, Visualization, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-6389-9779
  4. Jeremy Magland

    Flatiron Institute, New York, United States
    Contribution
    Conceptualization, Software, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-5286-4375
  5. Joshua H Siegle

    Allen Institute for Brain Science, Seattle, United States
    Contribution
    Data curation, Software, Methodology, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-7736-4844
  6. Roger Hurwitz

    Independent Researcher, Portland, United States
    Contribution
    Software
    Competing interests
    No competing interests declared
  7. Matthias H Hennig

    School of Informatics, University of Edinburgh, Edinburgh, United Kingdom
    Contribution
    Conceptualization, Resources, Software, Supervision, Visualization, Writing - original draft, Writing - review and editing
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-7270-5817

Funding

Wellcome Trust (214431/Z/18/Z)

  • Matthias H Hennig

ETH Zürich (19–2 FEL-17)

  • Alessio P Buccino

University of Oslo (PhD training (SUURPh) program)

  • Alessio P Buccino

Norwegian Ministry of Education, Research and Church Affairs

  • Alessio P Buccino

University of Edinburgh (Thouron Award)

  • Cole L Hurwitz

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

This work was supported by the Wellcome Trust grant 214431/Z/18/Z (MHH). APB is supported by an ETH Zurich Postdoctoral Fellowship 19–2 FEL-17, and by the Simula-UCSD-University of Oslo Research and PhD training (SUURPh) program, funded by the Norwegian Ministry of Education and Research. CLH is supported by the Thouron Award and by the Institute for Adaptive and Neural Computation, University of Edinburgh. JHS wishes to thank the Allen Institute founder, Paul G Allen, for his vision, encouragement and support. We thank Shangmin Guo for his recent contributions to debugging and improving the codebase.

Senior and Reviewing Editor

  1. Laura L Colgin, University of Texas at Austin, United States

Reviewers

  1. Sonja Grün, Jülich Research Centre, Germany
  2. Fabian Kloosterman, KU Leuven, Belgium

Publication history

  1. Received: August 6, 2020
  2. Accepted: November 9, 2020
  3. Accepted Manuscript published: November 10, 2020 (version 1)
  4. Version of Record published: November 30, 2020 (version 2)

Copyright

© 2020, Buccino et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 4,133
    Page views
  • 437
    Downloads
  • 9
    Citations

Article citation count generated by polling the highest count across the following sources: PubMed Central, Crossref, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

Further reading

    1. Neuroscience
    Aurelio Cortese et al.
    Research Article Updated

    The human brain excels at constructing and using abstractions, such as rules, or concepts. Here, in two fMRI experiments, we demonstrate a mechanism of abstraction built upon the valuation of sensory features. Human volunteers learned novel association rules based on simple visual features. Reinforcement-learning algorithms revealed that, with learning, high-value abstract representations increasingly guided participant behaviour, resulting in better choices and higher subjective confidence. We also found that the brain area computing value signals – the ventromedial prefrontal cortex – prioritised and selected latent task elements during abstraction, both locally and through its connection to the visual cortex. Such a coding scheme predicts a causal role for valuation. Hence, in a second experiment, we used multivoxel neural reinforcement to test for the causality of feature valuation in the sensory cortex, as a mechanism of abstraction. Tagging the neural representation of a task feature with rewards evoked abstraction-based decisions. Together, these findings provide a novel interpretation of value as a goal-dependent, key factor in forging abstract representations.

    1. Neuroscience
    Noah C Benson et al.
    Research Article

    Human vision has striking radial asymmetries, with performance on many tasks varying sharply with stimulus polar angle. Performance is generally better on the horizontal than vertical meridian, and on the lower than upper vertical meridian, and these asymmetries decrease gradually with deviation from the vertical meridian. Here we report cortical magnification at a fine angular resolution around the visual field. This precision enables comparisons between cortical magnification and behavior, between cortical magnification and retinal cell densities, and between cortical magnification in twin pairs. We show that cortical magnification in human primary visual cortex, measured in 163 subjects, varies substantially around the visual field, with a pattern similar to behavior. These radial asymmetries in cortex are larger than those found in the retina, and they are correlated between monozygotic twin pairs. These findings indicate a tight link between cortical topography and behavior, and suggest that visual field asymmetries are partly heritable.