Figures and data

Challenges of scaling electrophysiological recordings.
a, Multi-shank Neuropixels probe overlaid on a mouse brain, with a zoomed-in region showing the spatial footprint of a typical spike waveform (from Steinmetz & Ye, 2022). Waveforms from individual electrodes and the spatiotemporal footprint (sampled by a hypothetical Neuropixels 2.0 probe) are shown on the right. The approximate scale of the spike (50 µm × 2 ms) necessitates dense sampling in both space and time. Scaling up the number of recorded neurons requires increasing the number of electrodes in close physical proximity to neurons. b, Times required to run preprocessing, spike sorting, and automated curation on two-hour recordings with different probe configurations. Assuming no parallelization across machines, a recording with six Neuropixels 1.0 probes (384 channels each) would take more than two days to process. A recording with six Neuropixels 2.0 Quad Base probes (1,536 channels each), which recently became commercially available, would take over one week. Parallelization is essential to complete processing within 24 hours of data collection.
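The scaling argument in panel b can be sketched with back-of-the-envelope arithmetic, assuming serial run time grows linearly with channel count. The ~9 h per two-hour 384-channel recording used below is an assumed figure chosen to be consistent with the caption's ">2 days" and ">1 week" estimates, not a measured value.

```python
# Assumed serial processing time for a two-hour, 384-channel recording (hours).
# This is an illustrative figure, not a measurement.
HOURS_PER_384CH_PROBE = 9

def serial_hours(n_probes, channels_per_probe):
    """Total serial processing time, assuming linear scaling with channels."""
    return n_probes * HOURS_PER_384CH_PROBE * (channels_per_probe / 384)

np1 = serial_hours(6, 384)    # six Neuropixels 1.0 probes
np2 = serial_hours(6, 1536)   # six Neuropixels 2.0 Quad Base probes
print(f"NP1.0: {np1 / 24:.1f} days, NP2.0 Quad Base: {np2 / 24:.1f} days")
```

Under these assumptions, the six-probe NP1.0 session already exceeds two days and the Quad Base session exceeds a week, motivating the parallelized pipeline.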

Spike sorting pipeline overview.
Raw electrophysiology data from multiple probes (top) are ingested by the Job Dispatch step, which coordinates parallelization of downstream processing. All steps are run in parallel until Result Collection. Pipeline outputs (including Figurl interactive visualizations, metrics stored in JSON format, PNG-formatted images, and a Neurodata Without Borders file) are shown at the bottom. Each step includes an estimate of run time (per hour of recording) and required computing resources (CPU, GPU, and RAM). The pipeline is encapsulated in a Nextflow workflow (green background), and individual steps are implemented in SpikeInterface (action potential logo). Interlocking brick icons indicate processing algorithms that can be easily substituted. For the spike sorting step, run time is calculated for Kilosort4. See Table 1 for detailed run times and cost estimates.

Resources, run time, and cost for all steps in the spike sorting pipeline.
Values are based on a one-hour recording with six Neuropixels probes. Values for non-parallel steps (Job Dispatch, NWB Packaging, Result Collection) are normalized by the number of probes. For steps that run in parallel, run times and estimated cost are reported as the mean ± standard deviation for probes in the same session. *Run Time per hour of recording with one 384-channel probe. **Cost per hour of recording with one 384-channel probe. Costs are estimated using the prices of the reported instance from https://aws.amazon.com/ec2/pricing/on-demand/ as of April 30, 2025.
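The per-probe normalization described above can be sketched as follows: non-parallel steps run once per session, so their cost is shared across probes, whereas parallel steps already run once per probe. The dollar amounts below are hypothetical, not values from the table.

```python
def per_probe_cost(total_cost_usd, n_probes, parallel):
    """Per-probe cost: parallel steps are already per-probe; non-parallel
    steps (Job Dispatch, NWB Packaging, Result Collection) are shared
    across all probes in the session."""
    return total_cost_usd if parallel else total_cost_usd / n_probes

# hypothetical example: a $3.00 non-parallel step in a six-probe session
print(per_probe_cost(3.0, 6, parallel=False))  # -> 0.5
```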

Outputs of the visualization and quality control step.
a, Screenshot of interactive visualization of raw and preprocessed data (view on web). b, Screenshot of SortingView GUI, used for inspecting and curating spike sorting outputs (view on web). c–e, Examples of static quality control plots generated by the pipeline: power spectral density (c), drift (d), and unit yield (e).

Spike sorting pipeline usage at the Allen Institute for Neural Dynamics.
a, Sessions processed by the pipeline each week between February 2024 and August 2025. Color indicates the spike sorter used for each pipeline run. b, Cumulative sessions processed by the pipeline, split by spike sorter. c, Cumulative probes processed by the pipeline (many sessions include data from multiple probes). d, Cumulative units detected by the pipeline, with the subset of neural units indicated in yellow and units passing default quality metric thresholds in green. Units are considered neural if the UnitRefine classifier does not label them as “noise”.

Benchmarking pipeline run times.
The number of jobs and effective run time in hours for the two benchmarking applications (Kilosort2.5 vs Kilosort4, Lossless vs Lossy) on data from two types of Neuropixels probes (NP1, NP2). The “distributed” execution refers to our Nextflow implementation, while the “serial” execution runs all jobs sequentially on an individual workstation. The run time for serial execution is extrapolated from the distributed run times rather than measured directly.
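The extrapolation above can be sketched simply: with no parallelism, total run time is the sum of all job durations, whereas with enough parallel workers the longest job dominates. The job durations below are illustrative, not measured values.

```python
def extrapolate_serial_hours(job_runtimes_hours):
    """Serial execution runs every job back to back."""
    return sum(job_runtimes_hours)

def effective_distributed_hours(job_runtimes_hours):
    """With enough parallel workers, the longest job sets the wall time."""
    return max(job_runtimes_hours)

jobs = [1.5, 2.0, 1.8, 2.2]  # hypothetical per-job run times (hours)
print(extrapolate_serial_hours(jobs), effective_distributed_hours(jobs))
```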

Benchmarking pipeline overview.
The Job Dispatch step accepts N sessions as input, each of which may contain data from multiple probes. Next, the Hybrid Generation step injects T ground truth spike templates for each session and probe (default = 10), with IT randomized iterations (default = 5). Hybrid data are then processed in parallel through M Spike Sorting Cases, each of which sorts data using a different set of algorithms or parameters. The sorting results are then collected and compared by the Hybrid Evaluation step, which outputs figures and metrics for analyzing the performance of each case.
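The fan-out of benchmarking jobs implied by the caption can be sketched as the product of the session, probe, iteration, and case counts. Variable names follow the caption (N, IT, M); the per-session probe count and the values of N and M below are assumed for illustration.

```python
def n_sorting_jobs(n_sessions, probes_per_session, n_iterations, n_cases):
    """Number of Spike Sorting Case jobs: each session/probe pair yields
    IT hybrid iterations, each sorted under M cases."""
    return n_sessions * probes_per_session * n_iterations * n_cases

# e.g., 4 sessions x 2 probes, default IT = 5 iterations, M = 2 sorting cases
print(n_sorting_jobs(4, 2, 5, 2))  # -> 80 spike sorting jobs
```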

Example hybrid evaluation.
a, 100 ms snippet of traces, showing ground-truth hybrid spikes and their respective waveforms: blue if found by both sorters, red if found by Kilosort2.5 only, and purple if found by Kilosort4 only. b, Spike rasters of all hybrid spikes in the recording, with the same color scheme as panel a. Gray spikes are those from the original recording. c, Fraction of false positive spikes found by either sorter (a.u. = arbitrary units). d, Fraction of false negative spikes. False positive and false negative fractions are calculated using 10 s time bins.
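The binned error fractions in panels c and d can be sketched as follows, assuming spikes have already been matched to ground truth and reduced to labeled arrays of spike times. The spike times below are hypothetical.

```python
import numpy as np

def binned_fraction(error_times, all_times, t_stop, bin_s=10.0):
    """Fraction of error spikes (false positives or false negatives)
    among all spikes, computed in fixed-width time bins."""
    edges = np.arange(0.0, t_stop + bin_s, bin_s)
    errors, _ = np.histogram(error_times, bins=edges)
    totals, _ = np.histogram(all_times, bins=edges)
    # avoid division by zero in bins with no spikes
    return np.divide(errors, totals,
                     out=np.zeros_like(errors, dtype=float),
                     where=totals > 0)

gt = np.array([1.0, 2.0, 11.0, 12.0, 21.0, 22.0])  # ground-truth spike times (s)
missed = np.array([1.0, 11.0])                     # hypothetical false negatives
print(binned_fraction(missed, gt, t_stop=30.0))    # per-bin miss fraction
```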

Benchmarking spike sorting algorithms.
a, Accuracy, precision, and recall for all ground-truth units for Kilosort2.5 (red) and Kilosort4 (purple) for Neuropixels 1.0 (top panels) and 2.0 (bottom panels) probes. b, Scatter plots for accuracy, precision, and recall for ground-truth units matched by both Kilosort2.5 (x axis) and Kilosort4 (y axis) for Neuropixels 1.0 (top panels) and 2.0 (bottom panels) probes. c, Distributions of refractory period contamination for all units found by both sorters. d, Histograms of presence ratio values for all units found by both sorters. In panels c and d, data from both probes are combined and include all units that were matched to a ground-truth hybrid unit (accuracy ≥ 0.2; Kilosort2.5: 3,046 matches; Kilosort4: 3,652 matches).
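The three agreement metrics can be sketched from spike counts after sorted spikes are matched to ground-truth spikes within a small time tolerance. These are the standard definitions used in ground-truth comparisons of this kind; the counts below are hypothetical.

```python
def agreement_metrics(n_matched, n_sorted, n_ground_truth):
    """Accuracy, precision, and recall from matched spike counts."""
    tp = n_matched
    fp = n_sorted - n_matched        # sorted spikes with no ground-truth match
    fn = n_ground_truth - n_matched  # ground-truth spikes that were missed
    accuracy = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# hypothetical unit: 90 matched spikes, 100 sorted, 95 ground-truth
print(agreement_metrics(90, 100, 95))
```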

Lossy compression benchmarks.
a, Accuracy, precision, and recall for all ground-truth units for lossless and lossy (dark green) WavPack compression (lighter greens as BPS decreases, i.e., as compression ratio increases) for Neuropixels 1.0 (top panels) and 2.0 (bottom panels) probes. The dashed red line shows the curves for Kilosort2.5 for reference. b, Median accuracy, precision, and recall as a function of compression ratio, for both probe types (top row: Neuropixels 1.0; bottom row: Neuropixels 2.0). c, Scatter plots for accuracy, precision, and recall for ground-truth units matched by both the lossless (x axis) and BPS=3 (y axis) conditions, for both probe types. d, Distributions of refractory period contamination for all units found at each compression ratio. e, Histograms of presence ratio values for all units found at each compression ratio. In panels d and e, data from both probes are combined and include all units that were matched with accuracy ≥ 0.2 to a ground-truth hybrid unit (Lossless: 3,612 matches; BPS=3: 3,513 matches; BPS=2.5: 3,475 matches; BPS=2.25: 3,405 matches).
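The correspondence between bits per sample (BPS) and compression ratio can be sketched under the assumption that the raw samples are 16-bit integers (the native Neuropixels format); the ratio is then simply 16 divided by the target BPS.

```python
def compression_ratio(bits_per_sample, raw_bits=16):
    """Compression ratio implied by a lossy bits-per-sample target,
    assuming 16-bit raw samples."""
    return raw_bits / bits_per_sample

for bps in [3.0, 2.5, 2.25]:
    print(f"BPS={bps}: ratio {compression_ratio(bps):.2f}x")
```

Under this assumption, the BPS=3, 2.5, and 2.25 conditions correspond to compression ratios of roughly 5.3x, 6.4x, and 7.1x.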

Spike sorter performance as a function of unit signal-to-noise ratio.
a, Accuracy, precision, and recall for every hybrid unit added to the Neuropixels 1.0 datasets. Lines indicate the average value at each signal-to-noise level. b, Same as a, but for Neuropixels 2.0 datasets.

Kilosort run times.
a, Run times of Kilosort2.5 (red) and Kilosort4 (purple) on Neuropixels 1.0 datasets. Run times are expressed relative to real time (xRT). b, Same as a, but for Neuropixels 2.0 datasets. Note that for Neuropixels 1.0, spike sorting was run on the full 384-channel recording, while for Neuropixels 2.0 it was split by shank (96 channels each), which is why the latter probe has faster execution times.
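The xRT normalization used in both panels can be sketched directly: run time divided by recording duration, so values below 1 mean the sorter runs faster than real time.

```python
def xrt(run_time_s, recording_duration_s):
    """Run time relative to real time: <1 means faster than real time."""
    return run_time_s / recording_duration_s

# hypothetical example: a 30-minute sort of a 60-minute recording
print(xrt(30 * 60, 60 * 60))  # -> 0.5
```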