A Most modern spike sorters share the same sequence of algorithmic steps, differing only in particular details and/or implementation choices. Examples are shown for Kilosort 2 and 4 [32, 35], alongside SpyKING-CIRCUS [56]. B For each of these key algorithmic steps, we factorized and re-implemented various algorithms so that each one can be properly benchmarked on a per-step basis, bypassing the need to compute performance on the whole sequence. C Relying on this modular architecture, BenchMarkStudy objects can be created on a per-step basis to test/optimize individual components.
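The per-step factorization described in panels B and C can be sketched as follows. All names below (`benchmark_step`, the toy "implementations", the Jaccard score) are hypothetical illustrations of the idea, not the actual BenchMarkStudy API: each interchangeable implementation of one pipeline step is run and scored against ground truth in isolation.

```python
# Minimal sketch of per-step benchmarking (hypothetical names, not the
# actual SpikeInterface/BenchMarkStudy API).

def benchmark_step(implementations, ground_truth, run_step, score):
    """Run every implementation of one pipeline step and score it
    independently against the ground truth."""
    results = {}
    for name, impl in implementations.items():
        output = run_step(impl)
        results[name] = score(output, ground_truth)
    return results

# Example: compare two toy "peak detection" implementations.
gt_peaks = {10, 20, 30, 40}
impls = {
    "threshold_a": lambda: {10, 20, 30},          # misses one true peak
    "threshold_b": lambda: {10, 20, 30, 40, 55},  # adds one false positive
}
scores = benchmark_step(
    impls, gt_peaks,
    run_step=lambda impl: impl(),
    score=lambda det, gt: len(det & gt) / len(det | gt),  # Jaccard overlap
)
```

Because each step is scored on its own inputs and outputs, a regression in, say, clustering can be localized without re-running (or re-tuning) the detection and matching stages around it.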

Ground truth recording generation.

A Given a probe layout, templates, activity patterns and inhomogeneous motion vectors, we can generate extracellular traces on demand and lazily, in order to extensively benchmark online spike sorting. B When neurons are drifting, templates are dynamically generated as functions of the updated positions of the sources. C The positions of the cells are drawn randomly from a trimodal Gaussian along the depth of a Neuropixels-like probe (see Methods). D Top left: the rigid motion vector affecting all the cells during the course of the recording. Bottom left: the distribution of signal-to-noise ratios for all the templates generated given the cells' positions. Top right: the distribution of firing rates for all the cells in the recording. Bottom right: the distribution of Euclidean distances between templates, for all pairs of cells in the recording, on a log scale.
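The sampling in panel C and the pairwise distances in panel D (bottom right) can be illustrated with a toy 1-D model. All parameters below (mode locations, footprint width, channel count) are invented for the sketch and do not reproduce the paper's generative model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical trimodal Gaussian along a 4 mm probe depth: (mean um, std um).
modes = [(800.0, 150.0), (2000.0, 150.0), (3200.0, 150.0)]
n_cells = 60

# Draw each cell's depth from the mixture (equal weights per mode).
which = rng.integers(0, len(modes), size=n_cells)
depths = np.array([rng.normal(*modes[k]) for k in which])

# Toy templates: Gaussian spatial footprints on a 1-D channel layout.
channels = np.linspace(0.0, 4000.0, 96)
templates = np.exp(-0.5 * ((channels[None, :] - depths[:, None]) / 60.0) ** 2)

# Pairwise Euclidean distances between templates (bottom-right panel).
diffs = templates[:, None, :] - templates[None, :, :]
distances = np.linalg.norm(diffs, axis=2)
```

Cells drawn from the same mode sit close in depth and therefore produce small template distances, which is why the log-scale histogram in panel D concentrates mass at short distances.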

Peak detection benchmark.

A Given the true distribution of peak amplitudes in the ground truth recordings (grey area), we compute the recovered distribution for the peaks detected either via the “locally exclusive” method (blue) or via “matched filtering” (green). B The recall for all the neurons in the ground truth recording as a function of their signal-to-noise ratios, for the two aforementioned methods. C Same as in B, but for precision. D The run times (in seconds) for the two methods.
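The recall and precision in panels B and C follow the usual definition once detected peaks have been matched to ground-truth peak times. The greedy matcher and tolerance below are a simplified sketch, not the paper's exact matching procedure:

```python
def match_peaks(detected, true_times, tolerance=3):
    """Greedily match detected peak times to ground-truth times.

    Each true peak is matched at most once; a detection within `tolerance`
    samples of an unmatched true peak counts as a true positive.
    """
    unmatched = sorted(true_times)
    tp = 0
    for d in sorted(detected):
        for t in unmatched:
            if abs(d - t) <= tolerance:
                unmatched.remove(t)
                tp += 1
                break
    return tp

# Toy spike trains (sample indices).
true_peaks = [100, 250, 400, 900]
detected = [101, 252, 700, 899]

tp = match_peaks(detected, true_peaks)
recall = tp / len(true_peaks)    # fraction of true peaks recovered
precision = tp / len(detected)   # fraction of detections that are real
```

A low-threshold detector drives recall up and precision down; the two panels together show how each method trades these off as a function of SNR.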

Feature extraction and clustering benchmark.

A Accuracy for all the clustering methods when applied to the true peak times from a static recording. Left: as a function of the signal-to-noise ratios of the neurons. Right: as a function of the firing rates of the neurons. B Same as in A, but applied to motion-corrected recordings with the exact same activity/firing as the static one (see Methods). C Sorted accuracy levels for all clustering methods and recording types, as a function of the neurons present in the artificial recordings. D The run times for all the methods and datasets. E For all clustering methods and recording types, the number of Well Detected, False Positive, Redundant and Overmerged units (see Methods).
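Scoring a clustering against ground truth requires pairing each found cluster with a ground-truth unit before computing a per-unit accuracy TP / (TP + FP + FN). The exhaustive pairing below is a toy stand-in for the optimal-assignment step (real pipelines use the Hungarian algorithm), and the label values are invented:

```python
from itertools import permutations

def unit_accuracies(gt_labels, found_labels):
    """Per-unit accuracy TP / (TP + FP + FN) under the best unit pairing."""
    gt_units = sorted(set(gt_labels))
    fd_units = sorted(set(found_labels))

    def acc(g, f):
        tp = sum(a == g and b == f for a, b in zip(gt_labels, found_labels))
        fn = sum(a == g for a in gt_labels) - tp
        fp = sum(b == f for b in found_labels) - tp
        return tp / (tp + fp + fn)

    # Exhaustive assignment: fine for a handful of units, illustrative only.
    best = max(
        permutations(fd_units, len(gt_units)),
        key=lambda perm: sum(acc(g, f) for g, f in zip(gt_units, perm)),
    )
    return {g: acc(g, f) for g, f in zip(gt_units, best)}

# Toy example: one spike of unit 1 is mis-assigned to cluster 5 (overmerge).
gt = [0, 0, 0, 1, 1, 1]
found = [5, 5, 5, 7, 7, 5]
accs = unit_accuracies(gt, found)
```

The same agreement matrix underlies the unit categories in panel E: a ground-truth unit whose best-paired accuracy exceeds a threshold is Well Detected, while extra, duplicated, or merged clusters fall into the False Positive, Redundant, and Overmerged counts (see Methods for the exact criteria).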

Template matching benchmark.

A Accuracy for all the matching methods when launched with a perfect catalog of templates, as a function of the signal-to-noise ratios of the neurons. B Same as in A, but when applied to a motion-corrected recording with the exact same activity/firing as the static one (see Methods). C Sorted accuracy levels for all matching methods and recording types, as a function of the neurons present in the artificial recordings. D The run times for all the methods and datasets. E For all matching methods, collision levels [12] computed on the static recording, as a function of the temporal lag between all pairs of spikes and of the cosine similarity between pairs of templates.
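The two axes of panel E, temporal lag and template cosine similarity, can be built by scanning all near-coincident spike pairs. The spike list, templates, and collision window below are toy values chosen for the sketch:

```python
import numpy as np

def cosine_similarity(t1, t2):
    """Cosine similarity between two flattened templates."""
    t1, t2 = t1.ravel(), t2.ravel()
    return float(t1 @ t2 / (np.linalg.norm(t1) * np.linalg.norm(t2)))

# Toy data: (time in samples, unit id) and one template per unit (rows).
spikes = [(100, 0), (103, 1), (400, 0), (650, 2), (655, 1)]
templates = np.eye(3) + 0.5

collision_window = 10  # samples; closer cross-unit pairs are "collisions"
collisions = []
for i in range(len(spikes)):
    for j in range(i + 1, len(spikes)):
        (t1, u1), (t2, u2) = spikes[i], spikes[j]
        lag = abs(t1 - t2)
        if u1 != u2 and lag <= collision_window:
            sim = cosine_similarity(templates[u1], templates[u2])
            collisions.append((lag, sim))
```

Binning the matched/missed spikes over these two coordinates yields the collision maps of panel E: short lags between similar templates are exactly where greedy template matching is most likely to fail.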

Motion correction strategies benchmark.

A Example of template interpolation for a particular neuron drifting along the depth of the electrode. Left: the real templates generated by the generative mathematical model, taking into account the real position of the neuron. Middle: templates interpolated via cubic splines for a particular displacement. Right: the residuals (i.e. the differences) between estimated and real templates, as functions of the positions. B Accuracy for several matching methods when run on a static recording, when templates or traces are interpolated given the estimated motion, or when the true static or drifting templates (perfect template dictionaries) are used, as a function of the signal-to-noise ratios of the neurons. C Sorted accuracy levels for all matching methods and types of dictionary, as a function of the neurons present in the artificial recordings. D The run times for all the methods and datasets.
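Panel A's interpolation step can be sketched in one spatial dimension: resample the reference template with a cubic spline at channel positions shifted by the estimated motion, then compare against what the generative model would actually produce at the drifted position. The Gaussian footprint, grid spacing, and drift value below are invented for the illustration:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Toy 1-D setting: a template sampled on a regular channel grid along depth.
channel_depths = np.arange(0.0, 400.0, 20.0)  # um

def true_template(cell_depth):
    """Hypothetical generative model: Gaussian footprint at the cell depth."""
    return np.exp(-0.5 * ((channel_depths - cell_depth) / 40.0) ** 2)

# Template known at the reference position; the cell then drifts by `shift`.
reference = true_template(200.0)
shift = 15.0  # um of estimated drift along the probe depth

# Cubic-spline interpolation of the reference template at shifted positions.
spline = CubicSpline(channel_depths, reference)
interpolated = spline(channel_depths - shift)

# Residual against the template the generative model would actually produce
# at the drifted position (the right-hand panel of A).
residual = interpolated - true_template(200.0 + shift)
```

For footprints that are smooth relative to the channel pitch the residual stays small, which is why interpolated dictionaries in panel B can approach the accuracy obtained with the true drifting templates.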

End-to-end spike sorter benchmark.

A Averaged accuracy for several spike sorters when applied to several static recordings (generated with various seeds), as a function of the signal-to-noise ratios of the neurons. B Same as in A, but when applied to motion-corrected recordings with the exact same activity/firing as the static ones (see Methods). C Sorted accuracy levels for all spike sorters and recording types, as a function of the total number of neurons present in all artificial recordings. D The run times for all the methods and datasets. E For all spike sorting methods and recording types, the number of Well Detected, False Positive, Redundant and Overmerged units (see Methods).
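The averaging in panel A works at the level of the per-unit accuracy TP / (TP + FP + FN), computed once per seeded recording and then pooled. The spike times and tolerance below are toy values, and the greedy matcher is a simplification of the full comparison procedure:

```python
def unit_accuracy(gt_times, sorted_times, tolerance=3):
    """Accuracy TP / (TP + FP + FN) for one unit, matching spikes within
    `tolerance` samples (each ground-truth spike matched at most once)."""
    unmatched = sorted(gt_times)
    tp = 0
    for s in sorted(sorted_times):
        for t in unmatched:
            if abs(s - t) <= tolerance:
                unmatched.remove(t)
                tp += 1
                break
    fn = len(gt_times) - tp
    fp = len(sorted_times) - tp
    return tp / (tp + fp + fn)

# Pooling across recordings generated with different seeds (panel A):
accuracies_per_seed = [
    unit_accuracy([100, 200, 300], [101, 199, 305]),  # seed 0: 305 too far
    unit_accuracy([100, 200, 300], [100, 200, 300]),  # seed 1: perfect
]
mean_accuracy = sum(accuracies_per_seed) / len(accuracies_per_seed)
```

Averaging over seeds separates a sorter's systematic failure modes (e.g. consistently losing low-SNR units) from run-to-run variability in any single simulated recording.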