Opening the black box: a modular approach to spike sorting

  1. Centre de Recherche en Neuroscience de Lyon, CNRS, Lyon, France
  2. University of Edinburgh, Edinburgh, United Kingdom
  3. Columbia University, New York, United States
  4. Harvard Medical School, Boston, United States
  5. Massachusetts General Hospital, Boston, United States
  6. CatalystNeuro, Casper, United States
  7. Allen Institute for Neural Dynamics, Seattle, United States
  8. Lille Neurosciences & Cognition (LilNCog) – U1172 (INSERM, Lille), Univ Lille, CHU Lille, Lille, France

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.


Editors

  • Reviewing Editor
    Adrien Peyrache
    McGill University, Montreal, Canada
  • Senior Editor
    Panayiota Poirazi
    FORTH Institute of Molecular Biology and Biotechnology, Heraklion, Greece

Reviewer #1 (Public review):

Summary:

This work presents a flexible spike-sorting framework that allows users to run, swap, and benchmark individual modules commonly used in spike sorting. The paper argues and demonstrates that "opening the black box" is essential for understanding which components drive performance differences and for making progress toward more accurate and transparent spike sorting.
Using this modular benchmarking pipeline, the work identifies electrode drift as a primary bottleneck for accurate sorting and introduces an end-to-end sorter ("Lupin") that combines the best-performing modules and is reported to outperform existing spike-sorting packages on their benchmark.

Overall, this is a strong tool/resource contribution with clear potential to accelerate spike-sorting development and enable more rigorous comparisons. However, several claims, particularly around Lupin's or individual modules' superiority, are not yet supported robustly enough for the strength of the conclusions stated.

Strengths:

This work has high community value and practical utility. The effort to make benchmarking and spike sorting modules accessible and standardized is substantial and likely to be broadly useful.
Treating spike sorting as a set of interchangeable modules is, to an extent, a useful approach: it enables targeted improvements rather than the continual appearance of opaque 'new sorters' that are difficult to fully understand.

Implementing this resource within SpikeInterface, an already widely used tool, will facilitate uptake and community contributions.

Overall, I am positive about this manuscript as a resource paper. The core framework is compelling and timely.

Weaknesses:

(1) The main concern is the limited support for the claim that 'Lupin' and individual modules outperform existing spike sorters.

(2) Evidence is primarily from a single benchmark based on an intentionally simplified simulation. While the authors discuss the trade-offs between simulated and real data, the current evaluation does not provide enough diversity to justify claims of superiority.

(3) While improving individual modules that run in a serial fashion could aid overall spike sorting performance, it would be fair to acknowledge that some end-to-end sorters iterate across several of these modules. Perhaps the optimal spike sorter is not a serial set of modules.

(4) There is also a risk of benchmark overfitting. A modular approach makes it easy to select components that excel on specific benchmarks (or a specific project's data characteristics) without generalizing.

Concrete ways to strengthen this work:

(1) Evaluate on multiple simulation regimes, consider adding at least one biophysically detailed simulation, benchmark on multiple probe geometries with neurons clustered in different depth profiles (as this will affect drift solutions), and provide real-data validation. Even without full ground truth, real data can be evaluated with expert curation, functional validation (e.g., refractory violations, quality metrics, unit waveform consistency), agreement across sorters, and consistency across time (a minimal sketch of two such checks is given after this list).

(2) Related to real-data applicability, it is also important to acknowledge that modular approaches can enable overfitting to the needs of individual projects. Without real-data benchmarking (or benchmark diversity), it is unclear how the framework will guide users towards generalizable 'best practices' rather than optimized configurations that work only for their specific conditions.
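As a concrete illustration of point (1), the following is a minimal sketch (pure NumPy; thresholds, tolerances, and function names are illustrative and not taken from the manuscript) of two ground-truth-free checks: the refractory-period violation rate within a unit, and spike-time agreement between two sorters' outputs for a putative matched unit.

    import numpy as np

    def refractory_violation_rate(spike_times_s, refractory_s=1.5e-3):
        """Fraction of inter-spike intervals shorter than an assumed refractory period."""
        isis = np.diff(np.sort(spike_times_s))
        return float(np.mean(isis < refractory_s)) if isis.size else 0.0

    def pairwise_agreement(times_a_s, times_b_s, tol_s=0.4e-3):
        """Fraction of unit A's spikes with a spike from unit B within tol_s,
        as a crude proxy for cross-sorter agreement on one putative unit."""
        times_b_s = np.sort(times_b_s)
        idx = np.clip(np.searchsorted(times_b_s, times_a_s), 1, len(times_b_s) - 1)
        nearest = np.minimum(np.abs(times_b_s[idx] - times_a_s),
                             np.abs(times_b_s[idx - 1] - times_a_s))
        return float(np.mean(nearest < tol_s))

    # Toy example: one ~10 Hz unit and a jittered copy standing in for a second sorter.
    rng = np.random.default_rng(0)
    unit_a = np.cumsum(rng.exponential(0.1, size=1000))
    unit_b = unit_a + rng.normal(0.0, 0.1e-3, size=unit_a.size)
    print(refractory_violation_rate(unit_a), pairwise_agreement(unit_a, unit_b))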

Reviewer #2 (Public review):

Summary:

Spike sorting, that is, assigning events detected in extracellular electrophysiology data to the firing of individual neurons, is an inherently difficult computational problem involving multiple steps. The difficulty arises from low signal-to-noise ratios, signal instability due to the relative motion of the tissue and recording sites, and large volumes of data. Experimental ground truth data - where the correct assignment of spikes is known - is not available in large enough quantities to test algorithms. This paper describes a tool for creating fully synthetic ground truth data and benchmarking the individual steps of spike sorting to dissect the impact of signal-to-noise, firing rate, and motion correction on each step. This information is used to construct an optimized algorithm for sorting the ground truth data. One result of particular interest is the dominant role of motion correction in degrading accuracy. Another important technical result is that motion correction via interpolation of the voltage traces yields similar accuracy to interpolation of the spike templates.

Strengths:

The paper clearly shows the benefits of analyzing the complex process of spike sorting step by step. While this analysis has also been done in papers presenting spike sorters (for example, reference [32]), the tools presented here allow users and developers to do similar studies for their own work. This toolset will be very useful to many labs, especially those working in less studied brain areas or model systems, cases where the tuning of standard spike sorting tools is not a good match to the data.

Weaknesses:

The model ground truth data used in the paper does not need to be a perfect match to experimental data to provide useful benchmarking. However, as with all measurements of spike sorting accuracy, extrapolation to experimental data can be complicated. Users of these tools will need to assess how well the simulated data matches their recordings.

Reviewer #3 (Public review):

Overview:

In this manuscript, the authors describe two additions to an existing toolbox (SpikeInterface, Buccino et al., 2020, eLife). The first addition is an empirical simulator for extracellular recordings, in which spikes from predefined templates are added up with Gaussian noise. The second addition grants user-level access to intermediate processing steps within spike sorting algorithms. The authors demonstrate the toolbox by evaluating functions (e.g., event detection) or sets of functions (e.g., feature extraction + clustering) on their simulated data, and suggest that a specific combination of function implementations provides a performance improvement relative to Kilosort4 (Pachitariu et al., 2024, Nature Methods).

If the authors are interested in making this manuscript a suitable scientific contribution, the entire work has to be revised extensively. In particular, the simulator has to be extended and improved; the implementation of existing spike sorters has to be improved; the feedforward architecture of the modules has to be extended; the reporting of results has to follow standard reporting practices; new algorithms have to be explained in sufficient detail; and the manuscript has to undergo extensive proofreading.

Notably, even assuming perfect implementation and descriptions, it is unclear to me whether the scope of the present work warrants publication in a scientific journal, or whether it is more suitable for an internal technical report or, e.g., a GitHub version release. To go beyond a scientifically sound technical report, the authors may choose to demonstrate the utility of their new proposed sorter ("Lupin") and compare it to existing tools on multiple datasets.

General comments:

(1) The simulator itself has to be improved and extended. Right now, it simply generates, for every unit, a mother waveform from a sum of exponentials, scales that over channels, and then adds up multiple instantiations of every unit on every channel, along with noise (a schematic sketch of this procedure, as I understand it, is given after these comments). This is not a biophysical simulator: it is an ad hoc procedure, and the sentence "we firmly believe that.." (lines 482-483) does not make the procedure convincing. To make the simulator credible, the authors should: (i) use a set of biophysical equations, with multi-compartmental modeling of currents and return currents; (ii) use noise taken from real extracellular recordings; or (iii) some combination thereof.

(2) The simulated dataset has to be extended in time. Maybe I missed something, but 500 units over 10 minutes, with some units having firing rates as low as 0.1 spikes/s, corresponds to some of the units firing an expected 60 spikes. This is clearly too short, and does not replicate the standard situation in extracellular experiments.

(3) The simulated dataset has to be extended in space. The choice of using the Neuropixels 1.0 geometry is a poor one. Many labs use other monolithic electrode arrays (MEAs, silicon probes, other rigid arrays); tetrodes remain a major tool, and flexible probes (polyimide, mesh) are evolving. Assessing algorithms over a single spatial architecture is likely to lead to local maxima in performance and potentially erroneous conclusions.

(4) The existing spike sorters evaluated are not completely described. Some sorters (e.g., SpyKING Circus and KS4) were described in previous publications, but it is unclear whether the implementation that was used for the present tests is exactly the same as those previously published. More importantly, some of the sorters evaluated (e.g., TDC, TDC2, SpyKING Circus 2) were never described in a peer-reviewed paper. This does not mean that they cannot be evaluated - but if they are, they must be described in full. Relying on the fact that the code is open source cannot replace a complete and accurate scientific description.

(5) Related to the above, all relevant code should be made available online in permanent repositories, not only in author-controlled ones.

(6) It is unclear why SpyKING Circus 2 and TDC2 are evaluated - these could potentially be described as straw men. I recommend reorganizing the manuscript so that after every module is evaluated separately based on a limited ground truth dataset, a single "best" sorter would be constructed, and then tested extensively (and compared to the de facto state of the art). Such reorganization would both demonstrate the utility of a modular approach and clarify the general usefulness of the outcome.

(7) The new algorithms developed, for example, clustering and template matching, have to be described in more detail, and demonstrated graphically on simple datasets. This can be done in supplementary material if the authors prefer not to extend the manuscript too much.

(8) This reviewer finds the description and interpretation of the results to be inadequate. As an example, focusing on Figure 5: the results in Figure 5A have to be supplemented and summarized as a scalar point estimate (e.g., median accuracy) and an estimate of dispersion (e.g., using the MAD, IQR, or SD), evaluated over multiple runs, and compared using statistical tests between tools and conditions (e.g., using a multi-dimensional analysis of variance, a mixed-effects model, etc.); a sketch of such a summary is given after these comments. The results in Figure 5D must have an indication of dispersion. Any conclusions based on the numerical experiments must be based on these metrics and statistical evaluations.

(9) The entire MS would benefit from expert proofreading; there are many language errors, mostly in indefinite articles and grammatical numbers.
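To make comment (1) concrete, the following is a minimal NumPy sketch of the simulation procedure as this reviewer understands it from the manuscript: a mother waveform built from exponentials, scaled across channels, pasted at random spike times, and summed with Gaussian noise. All numerical values (amplitudes, time constants, channel decay, firing rate) are illustrative assumptions and are not taken from the manuscript or from SpikeInterface.

    import numpy as np

    fs = 30_000                                   # sampling rate (Hz)
    dur_s, n_ch, n_units = 10.0, 4, 3             # short toy recording
    n_samp = int(fs * dur_s)
    t = np.arange(int(0.002 * fs)) / fs           # 2 ms waveform window

    rng = np.random.default_rng(1)
    traces = rng.normal(0.0, 10.0, size=(n_samp, n_ch))        # Gaussian noise floor (uV)

    for u in range(n_units):
        tau_rise, tau_decay = 0.2e-3, 0.6e-3
        mother = np.exp(-t / tau_decay) - np.exp(-t / tau_rise)  # "mother" waveform from exponentials
        mother *= -80.0 / np.abs(mother).max()                   # ~80 uV negative peak
        chan_gain = np.exp(-np.abs(np.arange(n_ch) - u))          # amplitude decay over channels
        spike_times = rng.integers(0, n_samp - t.size, size=int(5 * dur_s))  # ~5 Hz firing
        for s in spike_times:
            traces[s:s + t.size, :] += mother[:, None] * chan_gain[None, :]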
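Similarly, for comment (8), here is a brief sketch (NumPy/SciPy, run on placeholder data rather than the manuscript's results) of the kind of summary requested: a per-sorter median with a robust dispersion estimate over repeated runs, followed by a nonparametric omnibus comparison across sorters. All names and numbers are illustrative only.

    import numpy as np
    from scipy import stats

    # Placeholder "accuracy per run" for three sorters (10 repeated runs each).
    rng = np.random.default_rng(2)
    runs = {name: rng.beta(a, 2, size=10)
            for name, a in [("sorter_A", 8), ("sorter_B", 7), ("sorter_C", 6)]}

    for name, acc in runs.items():
        print(f"{name}: median={np.median(acc):.3f}, "
              f"MAD={stats.median_abs_deviation(acc):.3f}")

    h, p = stats.kruskal(*runs.values())          # nonparametric omnibus test across sorters
    print(f"Kruskal-Wallis H={h:.2f}, p={p:.4f}")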
