Pynapple: a toolbox for data analysis in neuroscience

  1. Montreal Neurological Institute and Hospital, McGill University, Montreal, QC, Canada
  2. Mila – Quebec AI Institute
  3. Departments of Psychiatry and Neuroscience, Albert Einstein College of Medicine, Bronx, NY
  4. Donders Institute for Brain, Cognition and Behaviour, Radboud University, 6525AJ Nijmegen, The Netherlands

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.


Editors

  • Reviewing Editor
    Caleb Kemere
    Rice University, Houston, United States of America
  • Senior Editor
    Timothy Behrens
    University of Oxford, Oxford, United Kingdom

Reviewer #1 (Public Review):

A typical path from preprocessed data to findings in systems neuroscience includes a set of analyses that often share common components. For example, an investigator might want to generate plots that relate one time series (e.g., a set of spike times) to another (measurements of a behavioral parameter such as pupil diameter or running speed). In most cases, each individual scientist writes their own code to carry out these analyses, and thus the same basic analysis is coded repeatedly. This is problematic for several reasons, including wasted time, the potential for errors, and the greater difficulty inherent in sharing highly customized code.

This paper presents Pynapple, a python package that aims to address those problems.

Strengths:

The authors have identified a key need in the community - well-written analysis routines that carry out a core set of functions and can import data from multiple formats. In addition, they recognized that there are some common elements of many analyses, particularly those involving time series, and their object-oriented architecture takes advantage of those commonalities to simplify the overall analysis process.

The package is separated into a core set of applications and another with more advanced applications, with the goal of both providing a streamlined base for analyses and allowing for the implementation and inclusion of more experimental approaches.

Weaknesses:

There are two main weaknesses of the paper in its present form.

First, the claims relating to the value of the library in everyday use are not demonstrated clearly. There are no comparisons of, for example, the number of lines of code required to carry out a specific analysis with and without Pynapple or Pynacollada. Similarly, the paper does not give the reader a good sense of how analyses are carried out and how the object-oriented architecture provides a simplified user interaction experience. This contrasts with their GitHub page and associated notebooks, which do a better job of showing the package in action.

Second, the paper makes several claims about the values of object-oriented programming and the overall design strategy that are not entirely accurate. For example, object-oriented programming does not inherently reduce coding errors, although it can be part of good software engineering. Similarly, there is a claim that the design strategy "ensures stability" when it would be much more accurate to say that these strategies make it easier to maintain the stability of the code. And the authors state that the package has no dependencies, which is not true of the codebase. These and other claims are made without a clear definition of the properties that good scientific analysis software should have (e.g., stability, extensibility, testing infrastructure, etc.).

There is also a minor issue - these packages address an important need for high-level analysis tools but do not provide associated tools for preprocessing (e.g., spike sorting) or for creating reproducible pipelines for these analyses. This is entirely reasonable, in that no one package can be expected to do everything, but a bit deeper account of the process that takes raw data and produces scientific results would be helpful. In addition, some discussion of how this package could be combined with other tools (e.g., DataJoint, Code Ocean) would help provide context for where Pynapple and Pynacollada could fit into a robust and reliable data analysis ecosystem.

Reviewer #2 (Public Review):

Pynapple and Pynacollada have the potential to become very valuable and foundational tools for the analysis of neurophysiological data. NWB still has a steep learning curve and Pynapple offers a user-friendly toolset that can also serve as a wrapper for NWB.

The scope of the manuscript is not clear to me. The authors could clarify whether Pynacollada and other toolsets in the making will become a future aspect of this paper (and of Pynapple), or whether they plan to present these in separate publications.

The authors write that Pynapple can be used without the I/O layer, but they should clarify how, or whether, Pynapple may work outside NWB.

This brings us to an important fundamental question. What are the advantages of the current approach, where data is imported into the Ts objects, compared to importing the data into NWB files directly and then creating the Pynapple objects as secondary objects loaded from the NWB file? Does NWB natively have the ability to store the 5 object types, or are they initialized on every load call?

Many of these functions and objects have a long history in MATLAB, which documents their usefulness, and I believe it would be fitting to put further stress on this aspect: which features already existed in MATLAB and which are completely novel. A widely used MATLAB toolset, the FMA Toolbox (the Freely Moving Animal toolbox), has not been cited, which I believe is a mistake.

A limitation of NWB files is their standardization, with limited built-in options for derived data and additional metadata. How are derived data stored in the NWB files?

How is Pynapple handling an existing NWB dataset, where spikes, behavioral traces, and other data types have already been imported?

Author Response:

We would like to thank the reviewers and editor for their insightful comments and suggestions. We will update the manuscript accordingly. We are particularly glad to read that our software package constitutes a set of “well-written analysis routines” which have “the potential to become very valuable and foundational tools for the analysis of neurophysiological data”. Both reviewers have identified a number of weaknesses in the manuscript, and we would like to take this opportunity to respond to some of the remarks and clarify the objectives of our work. We would like to stress that this kind of toolkit is in continual development, and the manuscript offered a snapshot of the package at one point during this process. Since the initial submission several months ago, several improvements have been implemented and further improvements are in development by our group and a growing community of contributors. The manuscript will be updated to reflect these more recent changes, some of which directly address the reviewers’ remarks.

It was first suggested that the manuscript should better showcase the value of the analysis pipeline. As noted by the first reviewer, the online repository (i.e. the GitHub page) conveys a better sense of how the toolbox can be used than the present manuscript. Our original intention was to illustrate some examples of data analysis in Figure 4 by adding the corresponding Pynapple command above each processing step. Each step takes a single line of code, meaning that, for example, one only needs to write three lines of code to decode a feature from population activity using a Bayesian decoder (Fig. 4a), to compute the cross-correlogram of two neurons during a specific stimulus presentation (Fig. 4b), or to compute the average firing rate of two neurons around a specific time of the experimental task (Fig. 4c). In our revision, we will include code snippets that clearly show the required steps for each of these analyses. In addition, we will more clearly point the reader to the online tools (e.g. Jupyter notebooks), which offer an easier and clearer way to demonstrate the use of the toolbox.
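As a purely illustrative sketch of what such snippets could look like (the data variables below are synthetic stand-ins, and the function signatures follow the Pynapple API at the time of writing, so they may differ across versions):

```python
import numpy as np
import pynapple as nap

# Synthetic stand-ins for real data (purely illustrative).
spikes = nap.TsGroup({i: nap.Ts(t=np.sort(np.random.uniform(0, 100, 500))) for i in range(2)})
feature = nap.Tsd(t=np.arange(0, 100, 0.1), d=np.cos(np.arange(0, 100, 0.1)))
wake_ep = nap.IntervalSet(start=0, end=100)
stim_ep = nap.IntervalSet(start=20, end=60)
event_times = nap.Ts(t=np.arange(10, 90, 10))

# Fig. 4a: Bayesian decoding of a feature from population activity.
tuning_curves = nap.compute_1d_tuning_curves(spikes, feature, nb_bins=30, ep=wake_ep)
decoded, proba = nap.decode_1d(tuning_curves, spikes, ep=wake_ep, bin_size=0.25)

# Fig. 4b: cross-correlogram of neuron pairs restricted to a stimulus epoch.
cc = nap.compute_crosscorrelogram(spikes, binsize=0.01, windowsize=0.5, ep=stim_ep)

# Fig. 4c: peri-event alignment of one neuron's spikes around task events.
peth = nap.compute_perievent(spikes[0], event_times, minmax=(-1.0, 2.0))
```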

Another remark concerns our claim that the package does not have dependencies. We agree that this claim was not well worded. Our intention was to say that the package excludes dependencies such as scikit-learn, TensorFlow, or PyTorch, which are often used in signal processing and can be tedious to install. Pynapple still depends on a few packages, including the most common ones: NumPy, SciPy, and Pandas. We will rephrase this statement in the manuscript and emphasize the importance of minimal dependencies for long-term backwards compatibility in scientific computing.

We will complete the bibliography to make sure we properly reference all the packages designed for similar purposes. Of note, some are not citable per se (i.e., they have no associated paper) but will be discussed.

It was suggested that the manuscript should better describe the integration of Pynapple into a full experimental data pipeline. This is an interesting point, which was briefly mentioned in the third paragraph of the discussion. Pynapple was not originally designed to pre-process data. However, it can load any type of data stream after the necessary pre-processing steps. Overall, this modularity is a key aspect of the Pynapple framework, and this also applies to the integration with data pre-processing pipelines, for example spike sorting in electrophysiology and detection of regions of interest in calcium imaging. We do not think there should be a single integrated solution to this problem; instead, our aim is to make it possible for any piece of code to be used with the data, irrespective of how the dataset was acquired. This is why we focused on making data loading straightforward and easy to adapt to any situation. This feature enables any user, with any data modality and any long-established (often in-house) pre-processing scripts or software, to use Pynapple in the analysis phase of their pipeline. Overall, not imposing a particular format compatibility from the data acquisition phase onward is a strength of any analysis package.
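For instance, a minimal sketch of how the output of an in-house pre-processing pipeline could be wrapped into Pynapple core objects without going through NWB (the variable names and array contents below are hypothetical placeholders):

```python
import numpy as np
import pynapple as nap

# Hypothetical outputs of an in-house pre-processing pipeline:
# spike times per unit (in seconds) and a tracked position trace.
spike_times = {0: np.array([0.1, 0.5, 2.3]), 1: np.array([0.2, 1.7, 3.1])}
position_t = np.arange(0, 4, 0.01)
position_x = np.random.randn(position_t.size)

# Wrap the raw arrays in Pynapple core objects; no NWB file is involved.
spikes = nap.TsGroup({u: nap.Ts(t=t) for u, t in spike_times.items()})
position = nap.Tsd(t=position_t, d=position_x)
task_ep = nap.IntervalSet(start=0.0, end=2.0)

# From here, all analyses apply as usual, e.g. restricting both streams to the task epoch.
spikes_task = spikes.restrict(task_ep)
position_task = position.restrict(task_ep)
```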

Finally, the reviews raised the issue of data and intermediate result storage. We agree that this is a critical issue. In the long term, we do not believe that the current implementation of NWB is the right answer for data involved in active analysis, as it is not possible to overwrite an NWB file. This would require the creation of a new NWB file each time an intermediate result is saved, which would be computationally intensive and time-consuming, further increasing the odds of writing errors. Theoretically, users who need to store intermediate results in a flexible way could use any method they prefer, writing their own data files and wrappers to reload these data into Pynapple objects. However, it is desirable for the Pynapple ecosystem to have a standardized format for storing data. We are currently improving this feature by developing save and load methods for each Pynapple core object. We aim to provide an output format that is very simple to read in future Pynapple releases. This feature will be available in the coming weeks and will be described in the revised manuscript.
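In the meantime, one possible user-side workaround (distinct from the forthcoming built-in save/load methods) is to store the underlying arrays in a generic format and wrap them back into Pynapple objects on reload; a minimal sketch with a hypothetical file name:

```python
import numpy as np
import pynapple as nap

# Hypothetical intermediate result: a processed trace stored as a Tsd.
processed = nap.Tsd(t=np.arange(0, 10, 0.1), d=np.random.randn(100))

# Save the underlying arrays with any generic format, e.g. a NumPy .npz archive.
np.savez("processed_trace.npz", t=processed.times(), d=processed.values)

# Later, reload the arrays and wrap them back into a Pynapple object.
arrays = np.load("processed_trace.npz")
restored = nap.Tsd(t=arrays["t"], d=arrays["d"])
```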
