The OpenNeuro resource for sharing of neuroscience data

  1. Christopher J Markiewicz
  2. Krzysztof J Gorgolewski
  3. Franklin Feingold
  4. Ross Blair
  5. Yaroslav O Halchenko
  6. Eric Miller
  7. Nell Hardcastle
  8. Joe Wexler
  9. Oscar Esteban
  10. Mathias Goncalves
  11. Anita Jwa
  12. Russell Poldrack (corresponding author)
  1. Department of Psychology, Stanford University, United States
  2. Department of Psychological & Brain Sciences, Dartmouth College, United States
  3. Squishymedia, United States
  4. Lausanne University Hospital and University of Lausanne, Switzerland

Abstract

The sharing of research data is essential to ensure reproducibility and maximize the impact of public investments in scientific research. Here, we describe OpenNeuro, a BRAIN Initiative data archive that provides the ability to openly share data from a broad range of brain imaging data types following the FAIR principles for data sharing. We highlight the importance of the Brain Imaging Data Structure standard for enabling effective curation, sharing, and reuse of data. The archive presently shares more than 600 datasets including data from more than 20,000 participants, comprising multiple species and measurement modalities and a broad range of phenotypes. The impact of the shared data is evident in a growing number of published reuses, currently totalling more than 150 publications. We conclude by describing plans for future development and integration with other ongoing open science efforts.

Introduction

There is growing recognition of the importance of data sharing for scientific progress (National Academies of Sciences, Engineering, and Medicine, Policy and Global Affairs, Board on Research Data and Information, Committee on Toward an Open Science Enterprise, 2018). However, not all shared data are equally useful. The FAIR principles (Wilkinson et al., 2016) have formalized the notion that in order for shared data to be maximally useful, they need to be findable, accessible, interoperable, and reusable. An essential requirement for achieving these goals is that the data and associated metadata follow a common standard for organization, so that data users can easily understand and reuse the shared data. Here, we describe the OpenNeuro data archive (RRID:SCR_005031), accessible at https://openneuro.org, which enables FAIR-compliant data sharing for a growing range of neuroscience data types (currently including magnetic resonance imaging [MRI], electroencephalography [EEG], magnetoencephalography [MEG], and positron emission tomography [PET]) through the use of a common community standard, the Brain Imaging Data Structure (BIDS) (RRID:SCR_016124; Gorgolewski et al., 2016).

Starting with early pioneering efforts by Gazzaniga and Van Horn to establish an fMRI Data Center in 1999 (Van Horn and Gazzaniga, 2013), data sharing has become well established in the domain of neuroimaging (Milham et al., 2018; Poldrack and Gorgolewski, 2014; Poline et al., 2012). A major impetus for the growth of data sharing was the International Neuroimaging Data Sharing Initiative (Mennes et al., 2013), which published a landmark paper in 2010 (Biswal et al., 2010) demonstrating the scientific utility of a large shared resting fMRI dataset. The most prominent recent examples have been large-scale prospective data sharing projects, including the Human Connectome Project (HCP) (Van Essen et al., 2013), the NKI-Rockland sample (Nooner et al., 2012), the Adolescent Brain Cognitive Development study (Casey et al., 2018), and the UK Biobank (Littlejohns et al., 2020). These datasets have provided immense value to the field and have strongly demonstrated the utility of shared data. However, their scientific scope is necessarily limited, given that each dataset includes only a limited number of imaging tasks and measurement types. Beyond these large focused data sharing projects, there is a ‘long tail’ of smaller neuroimaging datasets that have been collected in service of specific research questions. Making these available is essential to ensure reproducibility as well as to allow aggregation across many different types of measurements in service of novel scientific questions. The OpenNeuro archive addresses this challenge by providing researchers with the ability to easily share a broad range of neuroimaging data types in a way that adheres to the FAIR principles.

Goals and principles

The OpenNeuro archive evolved from the OpenfMRI archive (Poldrack et al., 2013), which was focused solely on the sharing of task-based human fMRI data. Some of the principles behind OpenNeuro were inherited from OpenfMRI, whereas others grew out of our experiences in that project as well as from new developments in the domain of open science.

Minimal restrictions on sharing

There is a range of restrictiveness across data archives with regard to their data use agreements (Jwa and Poldrack, 2021). At one end of the spectrum are highly restricted databases such as the Alzheimer’s Disease Neuroimaging Initiative, which requires researchers to submit their scientific question for review and requires the consortium to be included as a corporate author on any publications. OpenNeuro represents the other pole of restrictiveness, by releasing data (by default) under a Creative Commons Zero (CC0) Public Domain Dedication which places no restrictions on who can use the data or what can be done with them. While not legally required, researchers using the data are expected to abide by community norms and cite the data following the guidelines included within each dataset. The primary motivation for this policy is that it makes the data maximally accessible to the largest possible number of researchers and citizen-scientists.

Standards-focused data sharing

To ensure the utility of shared data for the purposes of efficient discovery, reuse, and reproducibility, standards are required for data and metadata organization. These standards make the structure of the data clear to users and thus reduce the need for support by data owners and curation by repository owners, as well as enabling automated QA, preprocessing, and analytics. Unfortunately, most prior data sharing projects have relied upon custom organizational schemes, which can lead to misunderstanding and can also require substantial reorganization to adapt to common analysis workflows. The need for a clearly defined standard for neuroimaging data emerged from our experiences in the OpenfMRI project; while the repository had developed a custom scheme for data organization and file naming, this scheme was ad hoc and limited in its coverage, and datasets often required substantial manual curation (involving laborious interaction with data owners). In addition, there was no way to directly validate whether a particular dataset met the standard.

For these reasons, we focused at the outset of the OpenNeuro project on developing a more robust data organization standard that could be implemented in an automated validator. We engaged a large group of researchers from the neuroimaging community to establish a standard that ultimately became the BIDS (Gorgolewski et al., 2016), which is now a highly successful community standard for a broad and growing range of neuroimaging data types. BIDS defines a set of schemas for file and folder organization and naming, along with a schema for metadata organization. The framework was inspired by the existing data organization frameworks used in many research laboratories, so that transitioning to the standard is relatively easy for most researchers. One of the important features of BIDS is its extensibility; using a scheme inspired by open-source software projects, community members can propose extensions to BIDS that encompass new data types. To date, modality extensions include MEG (Niso et al., 2018), scalp EEG (Pernet et al., 2019), intracranial EEG (Holdgraf et al., 2019), PET (Norgaard et al., 2021), and arterial spin labeling MRI. In addition to standards for raw data, the BIDS community has also developed a standard for the organization of the outputs of processing operations (known as ‘BIDS Derivatives’), providing a framework for sharing processed as well as raw data.
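To make the naming scheme concrete, the following is a minimal sketch of how a BIDS-style file name decomposes into key-value entities, a suffix, and an extension. It is an illustration only: the function name `parse_bids_filename` is hypothetical, the pattern is far more permissive than the BIDS specification, and it does not stand in for the BIDS validator.

```python
import re

# Simplified pattern for one key-value entity such as "sub-01" or "task-rest".
# The BIDS specification constrains entity names, order, and suffixes much more
# strictly than this sketch does.
_ENTITY = re.compile(r"^(?P<key>[a-zA-Z]+)-(?P<value>[a-zA-Z0-9]+)$")


def parse_bids_filename(filename: str) -> dict:
    """Split a BIDS-style file name into entities, suffix, and extension."""
    # Everything after the first dot is treated as the extension (e.g. ".nii.gz").
    stem, dot, extension = filename.partition(".")
    parts = stem.split("_")
    if len(parts) < 2:
        raise ValueError(f"expected at least one entity and a suffix: {filename!r}")
    *entity_parts, suffix = parts
    entities = {}
    for part in entity_parts:
        match = _ENTITY.match(part)
        if match is None:
            raise ValueError(f"not a key-value entity: {part!r}")
        entities[match["key"]] = match["value"]
    # This sketch only handles subject-level files, which carry a "sub" entity.
    if "sub" not in entities:
        raise ValueError(f"no 'sub-<label>' entity found in {filename!r}")
    return {"entities": entities, "suffix": suffix, "extension": dot + extension}


parsed = parse_bids_filename("sub-01_ses-02_task-rest_run-1_bold.nii.gz")
print(parsed["entities"], parsed["suffix"], parsed["extension"])
```

Because the structure is carried by the file name itself, a tool can recover the subject, session, and task of a file without consulting any dataset-specific documentation, which is the property that makes automated validation feasible.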

While BIDS and OpenNeuro are now independent projects, there is a strongly synergistic relationship. All data uploaded to OpenNeuro must first pass a BIDS validation step, such that all data in OpenNeuro are compliant with the BIDS specifications at upload time. Conversely, the OpenNeuro team has made substantial contributions to the BIDS standard and validator. The BIDS standard has been remarkably successful, with tens of thousands of datasets now available in the format, including but not limited to those contained in the OpenNeuro database. As a consequence, this model maximizes compatibility with processing and analysis tools (Gorgolewski et al., 2017), but more importantly, it effectively minimizes the potential for data misinterpretation (e.g., when owner and reuser have slightly different definitions of a critical acquisition parameter). Through the adoption of BIDS, OpenNeuro has moved away from project- or database-specific data structures designed by the owner or the distributor (as used in earlier projects such as OpenfMRI and HCP) and toward a uniform and unambiguous representation model agreed upon by the research community prior to sharing and reuse.

FAIR sharing

The FAIR principles (Wilkinson et al., 2016) have provided an important framework to guide the development and assessment of open data resources. OpenNeuro implements each of these principles.

Findable: Each dataset within OpenNeuro is associated with metadata, drawn both directly from the BIDS dataset and from additional dataset-level metadata provided by the submitter at the time of submission. Both data and metadata are assigned a persistent unique identifier (Digital Object Identifier [DOI]). Within the repository, a machine-readable summary of BIDS metadata is collected by the BIDS validator and indexed with an ElasticSearch mapping. In addition, dataset-level metadata are exposed according to the schema.org standard, which allows indexing by external resources such as Google Dataset Search.

Accessible: Data and metadata can be retrieved using a number of access methods (directly from Amazon S3, using the OpenNeuro command line tool, or using DataLad) via standard protocols (http/https). Metadata are also accessible programmatically via a web API. Metadata remain available even in the case that data must be removed (e.g., in cases of human subjects concerns). No authentication is necessary to access the data.

Interoperable: The data and metadata use the BIDS standard to ensure accessible representation and interoperation with analysis workflows, such as BIDS Apps (Gorgolewski et al., 2017). Ongoing work is extending the metadata representation to use richer formats and to link to relevant FAIR ontologies or vocabularies.

Reusable: The data are released with a clear data use agreement (currently defaulting to a CC0 public domain dedication). Through use of the BIDS standard, the data and metadata are consistent with community standards in the field.
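As an illustration of the schema.org exposure mentioned under 'Findable', the sketch below serializes a minimal schema.org `Dataset` record as JSON-LD. The property names (`name`, `description`, `identifier`, `url`, `license`) are standard schema.org terms, but the helper `dataset_jsonld`, the selection of fields, and all values are assumptions for illustration; this is not the markup that OpenNeuro actually emits.

```python
import json


def dataset_jsonld(name: str, doi: str, url: str, description: str) -> str:
    """Serialize minimal schema.org 'Dataset' markup as a JSON-LD string."""
    record = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        # The DOI is expressed as a resolvable identifier.
        "identifier": f"https://doi.org/{doi}",
        "url": url,
        # CC0 is the default dedication described in the text.
        "license": "https://creativecommons.org/publicdomain/zero/1.0/",
    }
    return json.dumps(record, indent=2, sort_keys=True)


# Hypothetical placeholder values, not a real dataset or DOI.
example = dataset_jsonld(
    name="Example task fMRI dataset",
    doi="10.0000/example.ds000000.v1.0.0",
    url="https://example.org/datasets/ds000000",
    description="Placeholder description of an example dataset.",
)
print(example)
```

Embedding a record of this kind in a dataset landing page is what allows generic crawlers to index dataset-level metadata without any knowledge of BIDS.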

Data versioning and preservation

OpenNeuro keeps track of all changes in stored datasets and allows researchers to unambiguously report the exact version of the data used for any analysis. OpenNeuro preserves all versions of the data through the creation of ‘snapshots’ that unequivocally point to one specific point in the lifetime of a dataset. Data management and snapshots are supported by DataLad (RRID:SCR_003931; Halchenko et al., 2021), a free and open-source distributed data management system (Hanke et al., 2021).
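The following sketch shows how a specific snapshot might be pinned in practice, given that snapshots are implemented as git tags in DataLad datasets (as described under data access in this paper). It only builds the command lines and does not run them. The GitHub organization in the URL, the accession number, the tag, and the file path are assumptions chosen for illustration, and the helper `snapshot_commands` is hypothetical.

```python
import shlex
from typing import List


def snapshot_commands(accession: str, snapshot: str, path: str) -> List[List[str]]:
    """Build (but do not run) commands that fetch one file from one snapshot."""
    # Assumption: a mirror of the dataset is hosted under this GitHub organization.
    source = f"https://github.com/OpenNeuroDatasets/{accession}.git"
    return [
        # Clone the dataset; annexed file contents are not downloaded yet.
        ["datalad", "clone", source, accession],
        # Check out the git tag that marks the requested snapshot.
        ["git", "-C", accession, "checkout", snapshot],
        # Retrieve the content of a single file on demand.
        ["datalad", "get", f"{accession}/{path}"],
    ]


# Illustrative values only; substitute the dataset, version, and file of interest.
for argv in snapshot_commands("ds000001", "1.0.0", "sub-01/anat/sub-01_T1w.nii.gz"):
    print(shlex.join(argv))
```

Recording the accession number together with the snapshot tag in a methods section is enough for another researcher to reconstruct exactly the same files.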

Protecting privacy and confidentiality of data

There is a direct relationship in data sharing between the openness of the data and their reuse potential; all else being equal, data that are more easily or openly available will be more easily and readily reused. However, all else is not equal, as openness raises concern regarding risks to subject privacy and confidentiality of data in human subjects research. Researchers are ethically bound to both minimize the risks to their research participants (including risks to confidentiality) and to maximize the benefits of their participation (United States. National Commission for the States, 1978). Because sharing of data will necessarily increase the potential utility of the data, researchers are ethically bound to share human subject data unless the benefits of sharing are outweighed by risks to the participant (Brakewood and Poldrack, 2013).

In general, risks to data privacy and confidentiality are addressed through deidentification of the data to be shared. For example, under the Health Insurance Portability and Accountability Act of 1996 (HIPAA) in the United States, deidentification can be achieved through the removal of any of 18 personal identifiers, unless the researcher has knowledge that the remaining data could be re-identified (known as the ‘safe harbor’ method). With regard to neuroimaging data, a particularly challenging feature is the facial structure that is present in some forms of imaging data, such as structural MRI images. It is often possible to reconstruct facial structures from these images, and there are proofs of concept that such data could be used to re-identify individuals from photographic databases (Schwarz et al., 2019). It is thus essential to remove any image features that could be used to reconstruct facial structure (Bischoff-Grethe et al., 2007). For this reason, all MRI data shared through OpenNeuro must have facial features removed prior to upload, in addition to the 18 personal identifiers outlined by HIPAA. An exception is provided in cases where an investigator has explicit permission to openly share the data without defacing, usually when the data are collected by the investigator themself. At present, data are examined by a human curator to ensure that this requirement has been met. In the future, we plan to deploy an automated face detection tool (Bansal et al., 2020) to detect any uploads that inadvertently contain facial features.

Truly informed consent requires that subjects be made aware that their data may be shared publicly, and that confidentiality cannot be absolutely guaranteed in the future. For this reason, we recommend that researchers planning to share their data via OpenNeuro use a consent form based on the Open Brain Consent (Bannier et al., 2021), which includes language that ensures subject awareness of the intent to share and its potential impact on the risk of participating. Of note, the Open Brain Consent has recently been adapted to include a data usage agreement that accommodates the European Union’s General Data Protection Regulation (GDPR 2016/679); however, data collected in countries covered by GDPR cannot be shared through OpenNeuro at present due to the requirement for restrictive data use agreements that are not currently supported by OpenNeuro.

Open source

The entirety of the code for OpenNeuro is available under a permissive open-source software license (MIT License) at https://github.com/OpenNeuroOrg/openneuro. This enables any researcher who wishes to reuse part or all of the code or to run their own instance of the platform.

Data submission and access

Figure 1 outlines the steps required for sharing a dataset using OpenNeuro. Once shared, data can be accessed by several available mechanisms.

Figure 1 (with 2 supplements). A schematic overview of the data upload process.

Web download: Each snapshot is associated with a link that provides immediate downloading of the dataset.

DataLad: DataLad (Halchenko et al., 2016) is a decentralized data management system built on top of git and git-annex. Through DataLad, researchers may install a complete copy of a dataset, while deferring the retrieval of file contents until needed, permitting lightweight views of large datasets. OpenNeuro’s versioned snapshots are implemented as git tags, which allows specific versions to be easily retrieved or compared. The decentralized protocol also allows mirrors of the datasets to be hosted on GitHub and https://datasets.datalad.org, ensuring access during service interruptions of the OpenNeuro website.

OpenNeuro command line tool: The OpenNeuro command line tool provides access to the latest snapshot of all datasets, and is generally more stable than browser downloads for large datasets.

Amazon S3: The latest snapshot as well as all previous versions of a dataset may be fetched using the Amazon Web Services (AWS) clients or directly via https.
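As a sketch of the direct https route to Amazon S3, the snippet below builds the URL of a single file in the latest snapshot of a dataset. The bucket name and the path-style endpoint are assumptions that should be checked against the OpenNeuro documentation, and the helper `s3_https_url` is hypothetical.

```python
from urllib.parse import quote

# Assumptions: the public bucket is named "openneuro.org" and is reachable
# through the generic path-style S3 endpoint.
S3_ENDPOINT = "https://s3.amazonaws.com"
BUCKET = "openneuro.org"


def s3_https_url(accession: str, relative_path: str) -> str:
    """Return an https URL for one file of the latest snapshot of a dataset."""
    key = f"{accession}/{relative_path.lstrip('/')}"
    # quote() keeps "/" intact and percent-encodes characters such as spaces.
    return f"{S3_ENDPOINT}/{BUCKET}/{quote(key)}"


print(s3_https_url("ds000001", "dataset_description.json"))
```

Because no authentication is required, a URL of this form can be handed to any standard http client, for example `urllib.request.urlopen`, or used from a workflow running on another platform.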

User support

Support for individual datasets: Data users sometimes have questions regarding particular datasets. In order to facilitate discussion of these issues and to make those discussions available to the entire community, a discussion forum is provided on each dataset page. The dataset owner is automatically notified by email of any questions that are posted. In addition, users can ‘follow’ a dataset of interest and receive notifications of any comments posted to the dataset.

Site support: Two mechanisms are provided for users of the OpenNeuro site to obtain help with site issues. First, a helpdesk is available directly from the site, through which users can submit specific help questions. Second, users are encouraged to post general questions to the Neurostars.org question and answer forum, so that the answers will be available to the entire community.

Data processing

Data processing was initially envisioned as an incentive for researchers to share their data, and the OpenNeuro site was launched in 2017 with the ability to perform cloud-based data processing using a limited set of analysis workflows. This feature was disabled in 2018, after an overhaul of the site’s initial storage infrastructure. At that time, we determined that it would be preferable to collaborate with an existing platform dedicated to cloud processing rather than rebuilding our own execution platform. At present, OpenNeuro has partnered with the Brainlife.io platform (RRID:SCR_020940), which provides a large set of cloud-based neuroimaging workflows for data analysis and visualization. Data hosted on OpenNeuro can be easily imported into Brainlife for analysis, and more than 400 OpenNeuro datasets are cached for quick access; in the first 6 months of 2021, more than 700 analyses were performed on these datasets. In the future we plan to partner with additional platforms, including the NEMAR platform for EEG/MEG analysis; the availability of the data via DataLad and Amazon S3 also enables any platform to make the data available to their users without requiring any agreement or effort from OpenNeuro.

Results

Usage and impact

The OpenNeuro site was launched in June 2017, and was originally seeded with all of the datasets previously shared through OpenfMRI, after converting them to the BIDS standard. All data presented below are current as of October 9, 2021. The database contains 604 datasets comprising data from 20,989 individual participants. Figure 2 shows cumulative figures for numbers of datasets and subjects since 2018, demonstrating sustained and continual growth in the archive since its inception.

Figure 2. The volume of data available on OpenNeuro has shown steady growth since it started operations in 2017.

Shown are figures from July 2018, when all data were migrated to a new DataLad storage backend, through the present date. The green line illustrates the cumulative growth in total number of datasets, and the red line shows the aggregate of subjects (in thousands).

The overwhelming majority of datasets are from humans (574 datasets, 95%), with a small but growing number of nonhuman species including mouse (17 datasets), rat (6 datasets), nonhuman primates (2 datasets), dogs (1 dataset), and juvenile pigs (1 dataset). Table 1 presents data for the prevalence of different modalities; while the majority of datasets include some form of MRI data, other supported modalities are present including electrophysiological measures and PET.

Table 1
Number of datasets by imaging modality; additional modalities present in fewer than three datasets are not included here.
Modality                        Number of datasets
Anatomical MRI                  501
Functional MRI                  445
Electroencephalography          81
Diffusion-weighted MRI          53
Magnetoencephalography          23
Positron emission tomography    10
Intracranial EEG                8
Arterial spin labeling MRI      3

OpenNeuro is a recommended data repository for a number of publishers and journals, including Nature Scientific Data, PLOS, eLife, F1000 Research, Gigascience, BioMed Central, American Heart Association, and Wellcome Open Research. The database contains 407 DOIs for publications associated with datasets (including both primary scientific publications and data descriptors).

Multiple dimensions of ‘big data’

Discussions of ‘big data’ in neuroimaging (Poldrack and Gorgolewski, 2014; Smith and Nichols, 2018) have largely focused on datasets including large numbers of individuals. While these analyses are essential for robust population inference, it is also important to recognize that large numbers of subjects are only one dimension over which a neuroimaging dataset can be ‘big’. Here, we will define the number of subjects as the ‘width’ of the dataset, the number of different phenotypes measured for each individual as the ‘breadth’ of the dataset, and the number of measurements per individual as the ‘depth’ of the dataset.
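The three dimensions can be made operational with a short sketch. It assumes that each acquired run is described by a (subject, session, task) record, uses the number of distinct tasks per subject as a proxy for phenotypic breadth and the number of sessions per subject as a proxy for depth, and summarizes both with the median. These choices, along with the names `DatasetSize` and `summarize`, are illustrative assumptions rather than the procedure used for the analyses reported here.

```python
from dataclasses import dataclass
from statistics import median
from typing import Iterable, Tuple

# One record per acquired run: (subject, session, task).
Run = Tuple[str, str, str]


@dataclass(frozen=True)
class DatasetSize:
    width: int      # number of subjects
    breadth: float  # median number of distinct tasks per subject
    depth: float    # median number of sessions per subject


def summarize(runs: Iterable[Run]) -> DatasetSize:
    """Compute width, breadth, and depth from per-run records."""
    tasks, sessions = {}, {}
    for subject, session, task in runs:
        tasks.setdefault(subject, set()).add(task)
        sessions.setdefault(subject, set()).add(session)
    if not tasks:
        raise ValueError("no runs supplied")
    return DatasetSize(
        width=len(tasks),
        breadth=median(len(t) for t in tasks.values()),
        depth=median(len(s) for s in sessions.values()),
    )


# Toy records for illustration only.
toy_runs = [
    ("01", "1", "rest"),
    ("01", "2", "rest"),
    ("01", "2", "nback"),
    ("02", "1", "rest"),
]
print(summarize(toy_runs))
```

Under these definitions, a dense single-subject dataset has a width of 1 but a large depth, whereas a typical group study has a large width and a depth close to 1.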

The OpenNeuro database is distinguished by sharing datasets that are extensive along each of these dimensions (see Figure 3). With regard to width, the median dataset size is 23 subjects, with 31 studies having sample sizes larger than 100, and a maximum sample size of 928. With regard to breadth, notable datasets include the BOLD5000 dataset (Chang et al., 2019), which includes data from subjects viewing a total of 5000 natural images; the Individual Brain Charting dataset (Pinho et al., 2020; Pinho et al., 2018), which includes data from individuals each completing 24 different tasks; and the Multidomain Task Battery dataset (King et al., 2019), which includes data from individuals each completing 26 tasks. With regard to depth, the database currently includes the MyConnectome dataset (Poldrack et al., 2015), which includes extensive task, resting, and diffusion MRI data from more than 100 sessions for a single individual; the Midnight Scan Club dataset (Gordon et al., 2017), which includes extensive task and resting fMRI data from 10 individuals; and a number of other dense scanning datasets (Gonzalez-Castillo et al., 2015; Newbold et al., 2020; Salehi et al., 2020).

Figure 3. OpenNeuro datasets vary substantially in number of participants (X axis), number of sessions per participant (Y axis), and number of tasks per participant (size/color of datapoints); axes are log-scaled for easier visualization.

Results are based on metadata derived directly from the 502 OpenNeuro datasets available via DataLad as of 10/9/2021.

Another unique feature of OpenNeuro is the breadth of phenotypes across datasets. To further characterize this, we searched the text associated with OpenNeuro datasets to identify terms related to psychological concepts and tasks as defined in the Cognitive Atlas ontology (Poldrack et al., 2011). Word clouds showing the top terms identified in this analysis are shown in Figure 1—figure supplement 2. This analysis shows a broad range of tasks and concepts associated with these datasets, highlighting the substantial conceptual and methodological breadth of the archive.

Data reuse

OpenNeuro has distributed a substantial amount of data; from May 2020 through April 2021, a total of 406 terabytes of data were distributed. Because data reuse is not directly measurable, we utilize published reuse of the shared data as a proxy. To identify published reuses of OpenNeuro data, we used Google Scholar and CrossRef to identify potential reuses, and then manually examined them to confirm that they were a legitimate reuse (as opposed to a primary publication of the data or data descriptor); note that this is an underestimate since many papers during this period reported analyses of data downloaded from OpenfMRI, which would not have been identified in our searches. We identified 165 publications that reused OpenNeuro datasets; this showed a sharp increase over time (see Figure 4). Of these publications, 112 were journal or conference papers, 42 were preprints, and 11 were other types of publications (such as theses or project reports). A total of 111 OpenNeuro datasets were reused at least once, with the most popular dataset (Poldrack et al., 2016) appearing in 28 published reuses. A significant number of publications reused multiple datasets; 31 of the 165 papers reused at least two datasets, with a maximum of 40 datasets reused (Esteban et al., 2019). Collecting these data from scratch would have required more than 21,000 individual subject visits; at an estimated scanning cost of $1000/session (based on the conservative cost estimate from Milham et al., 2018), this represents a total data reuse value of approximately 21 million US dollars. These reuses have a total of 1329 citations (according to Google Scholar as of June 15, 2021); the most highly cited reuse (Esteban et al., 2019) has more than 500 citations.

Figure 4. Published reuses of OpenNeuro datasets, split by the type of reuse.

Note that the final bar includes only reuses identified through June 2021.

The published reuses of OpenNeuro data span from basic neuroscience to methodological studies and software development. In particular, several studies demonstrate how OpenNeuro data have enabled new insights into brain function. For example, Martins et al., 2021, used structural MRI data from several OpenNeuro datasets along with other shared data to examine different patient groups suffering from physical pain or depression. Their analyses demonstrated a specific pattern of anatomical change common to patients with pain syndromes but distinct from depression. This kind of analysis highlights the way in which OpenNeuro enables researchers to combine smaller datasets in order to test hypotheses using convergent data, which can help overcome the confounds and biases present in any particular study as well as increasing statistical power. Other basic neuroscience studies have used OpenNeuro data to model the role of temporal context in forgetting (Chien and Honey, 2020), characterize the role of edge communities in brain networks (Faskowitz et al., 2020), understand the relationship between functional connectivity and sustained attention (Rosenberg et al., 2020), and to demonstrate that functional parcellation changes as a function of task (Salehi et al., 2020).

Data from OpenNeuro have been particularly useful for the development of new software tools. Esteban et al., 2019, used the breadth and variety of datasets in the archive to assess the robustness of the fMRIPrep preprocessing workflow to many different fMRI datasets, incorporating a total of 40 datasets from OpenNeuro. Importantly, these datasets were used in an iterative manner to improve the robustness of the tool; thus, the breadth of the data were essential both for assessment and for improvement of the tool. Without OpenNeuro (and BIDS), amassing such a large and diverse group of datasets would have required immense efforts to reach out to many different research groups, request their data, and then format the data for common usage, whereas with OpenNeuro the entirety of these datasets can be downloaded automatically within a number of hours, immediately ready for analysis. Other software development projects have taken advantage of some of the particular unique datasets in OpenNeuro; for example, Takeda et al., 2019, took advantage of a unique dataset that combines EEG, MEG, and MRI data on the same individuals (Wakeman and Henson, 2015) to demonstrate the broad range of functions of their VBMEG toolbox. Other software publications using OpenNeuro data include FastSurfer (Henschel et al., 2020) for structural MRI analysis, and Brainstorm (Tadel et al., 2019) for MEG/EEG analysis.

The data in OpenNeuro have been particularly useful for methodological researchers. One prominent example was published by Bowring et al., 2019, who examined how the use of different analysis software impacted statistical results from fMRI activation analyses. Their study included an in-depth analysis of the publications associated with each of 55 datasets, in order to identify studies with analysis pipelines and activation results that could be easily compared with their multi-platform results. Based on this process, they selected three datasets and processed each using several different analysis pipelines; their results highlighted substantial similarity in unthresholded maps but substantial discordance in thresholded maps, highlighting the need for better understanding of the impact of software packages on statistical results. Another example that would have been challenging to perform without OpenNeuro was published by Dadi et al., 2020, who developed a set of functional atlases using 27 datasets. This breadth allowed them to ensure that the specific features of the atlas were not driven by any particular dataset or task. Other examples include studies that used OpenNeuro data to assess the impact of confound regression on fMRI signals and develop new methods for confound modeling (Aquino et al., 2020), and to develop and benchmark new methods for multiple comparison correction (Spisák et al., 2019).

Discussion

The OpenNeuro data archive plays an important role in advancing neuroscience research and ensuring its reproducibility by enabling the sharing of a broad range of neuroscience data types according to the FAIR principles. Its tight integration with the community-driven BIDS standard enhances the ease of sharing, the reusability of the shared data, and the extensibility of the archive in the future. The shared data have enabled a growing number of publications that provide novel neuroscientific insights, as well as supporting novel methodological advances and software development.

Lessons learned

The experiences of our group in developing the OpenNeuro project have provided a number of lessons that may be useful more generally for researchers interested in establishing a culture of data sharing within their scientific subdomain.

Foremost, we have found that the use of a common community-driven format for data organization is essential to effective sharing. In our case, the BIDS standard has enabled data owners to easily share a growing range of data types (through the use of a client-side validator), and has enabled researchers to easily reuse the data. Because any dataset that passes the validator can be shared, the community’s efforts on extending the standard (which are implemented in the validator) have provided a steady stream of additions to the types of data that OpenNeuro can share. Another important point is that data sharing does not only include sharing with other researchers, but also with one’s own research group in the future; thus, the use of a well-structured data standard can help researchers ensure that data collected by current lab members can be effectively utilized by other lab members in the future, as well as making it easy to share the data beyond one’s own lab. On the flipside, we continue to see that conversion of data into the BIDS standard remains a stumbling block for many researchers; the continued development of conversion tools is necessary to support these researchers.

Second, we have found that ‘it takes an ecosystem’ to make data sharing successful. OpenNeuro is only one of the data sharing projects within the field of neuroimaging, and each of the projects has its own particular features and advantages, but together these projects have increasingly led the field to view data sharing as a net positive. In addition, the availability of these data resources has allowed others to build projects that support new mechanisms for data representation and distribution (such as the DataLad project) and new platforms for analysis (such as Brainlife.io). Together, these tools have provided researchers with additional incentives to share their data via OpenNeuro through its deep integration with those projects. While we believe that sharing is most effective when it is most open, we also realize that some researchers will be unable to share their data on OpenNeuro for ethical or regulatory reasons; for this reason, we believe that a variety of data sharing resources that vary in their sharing policies (Jwa and Poldrack, 2021) will remain essential to support the broadest possible degree of data sharing.

Finally, we would highlight the importance of domain-specific data repositories that support a particular research community. All of the sharing activities accomplished using OpenNeuro could in principle have been accomplished using more general data sharing repositories (such as Figshare or Dryad). A unique benefit of OpenNeuro has been in making a large number of datasets easily findable by researchers, rather than requiring a trawl through a much larger body of datasets to find ones that are relevant. By developing upload and download systems that are tailored for imaging data, OpenNeuro has also greatly lowered the barrier to sharing and reusing data. These benefits argue for the continued need for domain-specific data sharing projects designed in close consultation with researchers in the area. Domain specificity has also allowed OpenNeuro to nurture a community around the resource. Through our social media presence, we have engaged the community with regular blog posts that highlight the most open and sharing labs over the previous 6 months to promote more social incentives to sharing.

Long-term sustainability

A continual challenge for any investigator-initiated data repository is the long-term sustainability of the archive, in order to ensure researchers’ trust in the platform (Lin et al., 2020). The ongoing costs of running a repository are substantial, primarily due to the continuing cost of technological upkeep of a web platform with regard to security and stability, as well as the ongoing costs of storage and bandwidth on cloud platforms or hardware maintenance when using on-premise computing systems. Performant web applications require the use of cutting-edge software tools, which can often become deprecated or unstable over time, leading to substantial technical debt that must be continually addressed to maintain stable and secure operation.

One major challenge for repositories that are reliant upon federal grants is the usual 3-year funding period, in addition to the preference of standard grant mechanisms for funding novel projects rather than ongoing maintenance and operations. One welcome development has been the instigation of longer-term funding for data archives through the US BRAIN Initiative (Koroshetz et al., 2018), which has explicitly dedicated funding to the development and long-term sustainability of data archives for neuroscience data. These renewable 5-year grants (of which OpenNeuro is one of the recipients) provide a much-needed longer-term funding source for data repositories.

Another resource for longer-term sustainability is institutional data repositories, which are increasingly available at many universities. OpenNeuro is working with the Stanford Digital Repository to develop a plan to deposit all raw datasets within the university’s archive, which would provide a digital backstop to the archive’s cloud storage.

OpenNeuro has also been fortunate to be part of the Amazon Public Datasets project (https://registry.opendata.aws/openneuro/), which has provided free data storage and bandwidth for the openly available datasets in the OpenNeuro archive.

Current limitations and future directions

There are a number of additional features planned for future development. These include:

Enhanced metadata: At present, a limited amount of dataset-level metadata is collected beyond that present within the BIDS metadata. Working with the CEDAR Metadata Center (Musen et al., 2015), we plan to add the ability for researchers to enter additional metadata that is linked to standard ontologies, including those being developed for BIDS data in the context of the Neuroimaging Data Model (Maumet et al., 2016). These annotations will provide the basis for more powerful queries of the archive.

Sharing of derivatives: At present, OpenNeuro only shares raw data. However, the availability of a BIDS standard for the outputs of data processing (i.e. ‘derivative’ data) now provides the ability to include derivative data within a BIDS dataset. We plan to enable researchers to share derivatives, for example, allowing the sharing of preprocessed MRI data in addition to raw data. This will greatly enhance the reuse of data by researchers who do not have the resources or expertise to preprocess these complex datasets as well as provide a standard baseline for downstream analyses, reducing the potential effects of analytic flexibility (Botvinik-Nezer et al., 2020; Bowring et al., 2019).

Bringing computing to data: The availability of the OpenNeuro data on AWS allows researchers direct access to computing on the data, but doing so requires a substantial degree of cloud computing expertise. To ease the application of computing to the data, we plan to adapt the DANDI Hub infrastructure developed by the Distributed Archives for Neurophysiology Data Integration (DANDI: https://www.dandiarchive.org/), which will allow direct access to the data via a Jupyter notebook.

Beyond MRI data: Driven by the initial seeding of data from OpenfMRI, and reflecting the fact that BIDS was originally MRI-centric, the data currently available from OpenNeuro are heavily skewed toward MRI, and fMRI in particular (Table 1). However, BIDS is quickly expanding to other modalities that can readily be uploaded to OpenNeuro, and there has been a rapid increase in sharing of other modalities; for example, more than 60 EEG datasets have been deposited since the publication of the BIDS-EEG standard in 2019 (Pernet et al., 2019). This organic expansion beyond MRI will be supported with the necessary adaptations (e.g., online visualization of new modalities) of OpenNeuro’s user interface.

Conclusion

Data sharing ensures the transparency and reproducibility of scientific research, and allows aggregation across datasets that improves statistical power and enables new research questions. The OpenNeuro repository plays a central role in the data sharing ecosystem by promoting maximally open sharing of data, and by enhancing open availability of data from a wide range of datasets. The growth and impact of the repository demonstrate the viability of minimally restrictive sharing, and the importance of common standards such as BIDS for the effective sharing and reuse of data.

Materials and methods

OpenNeuro infrastructure

Code for the OpenNeuro platform is available at https://github.com/OpenNeuroOrg/openneuro. The application utilizes a cloud-based containerized architecture and is built in JavaScript and Python with a MongoDB database for application data storage. OpenNeuro is hosted on AWS using the Kubernetes container orchestration platform. Services are deployed as containers and integrated via a JavaScript GraphQL API gateway and the AWS Application Load Balancer. Several clients access this API: the React website, the OpenNeuro command line interface, and an ElasticSearch indexer. Datasets are stored as DataLad repositories and managed by a Python backend service container. Each DataLad repository is assigned to a ZFS pool backed by AWS Elastic Block Store. This allows DataLad versioning and filesystem-level access to datasets with existing processing and validation tools. Persistent metadata such as user accounts and permissions are maintained in a MongoDB database. Ephemeral caching is provided by Redis. Search indexes, performance monitoring, and logging are implemented with ElasticSearch. CloudFront is used as a global cache and network to provide global presence.
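
As an illustration of how a client might talk to a GraphQL gateway of this kind, the following minimal sketch builds a request body in the conventional JSON form used for GraphQL over HTTP. The query text, its field names, and the accession number are assumptions made for illustration only; they are not taken from the OpenNeuro schema, and no network request is made.

```python
import json

# Hypothetical query text: the field names below are assumptions for
# illustration and are not taken from the OpenNeuro GraphQL schema.
EXAMPLE_QUERY = """
query ($id: ID!) {
  dataset(id: $id) {
    id
  }
}
"""


def build_graphql_payload(query: str, variables: dict) -> bytes:
    """Encode a GraphQL request body as UTF-8 JSON with 'query' and 'variables' keys."""
    return json.dumps({"query": query, "variables": variables}).encode("utf-8")


# "ds000001" is a placeholder accession number used only for this example.
payload = build_graphql_payload(EXAMPLE_QUERY, {"id": "ds000001"})
```

A real client would send this body in an HTTP POST with a JSON content type to the gateway and decode the JSON response; the actual endpoint and schema should be taken from the platform's source code.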

Content analysis

Data regarding OpenNeuro contents and usage were current as of October 9, 2021. Code and data needed to execute all analyses and generate all figures are available from https://doi.org/10.5281/zenodo.5559041.

Reuse analyses: Potential reuses were identified by first searching Google Scholar for the term ‘OpenNeuro’; note that this will exclude any papers that mention ‘OpenfMRI’ instead of OpenNeuro, and thus the reported results underestimate the true impact of the data, given that many of the datasets in OpenNeuro came from OpenfMRI. Papers matching this search were examined manually to confirm that they had reused data; data descriptor papers were excluded from further analysis. Citation counts were obtained from Google Scholar using the Python package ‘scholarly’.

Dataset size analyses: Dataset size analyses were performed using DataLad to obtain the full BIDS metadata for the 502 datasets available as of October 9, 2021, and then using PyBIDS (Yarkoni et al., 2019) to load the metadata for each dataset.
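
To give a sense of how dataset size can be read off a BIDS directory layout, the following sketch counts participants and data files per datatype folder using only the Python standard library. It is not the DataLad and PyBIDS pipeline used for the analyses reported here; the function names and the toy dataset are hypothetical, and the sketch relies only on the BIDS conventions that participant folders are named sub-<label> and that data files sit in datatype folders such as anat or func.

```python
import tempfile
from collections import Counter
from pathlib import Path


def make_toy_dataset(root: Path) -> None:
    """Create a tiny, empty BIDS-like tree for demonstration (hypothetical content)."""
    files = [
        "dataset_description.json",
        "sub-01/anat/sub-01_T1w.nii.gz",
        "sub-01/func/sub-01_task-rest_bold.nii.gz",
        "sub-02/anat/sub-02_T1w.nii.gz",
    ]
    for rel in files:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()


def summarize_bids_tree(root: Path) -> dict:
    """Count participant folders and the files found in each datatype folder."""
    subjects = sorted(p.name for p in root.glob("sub-*") if p.is_dir())
    files_per_datatype = Counter()
    for sub in subjects:
        for f in (root / sub).rglob("*"):
            if f.is_file():
                # The immediate parent folder (e.g. anat, func) names the datatype.
                files_per_datatype[f.parent.name] += 1
    return {
        "n_subjects": len(subjects),
        "files_per_datatype": dict(files_per_datatype),
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        make_toy_dataset(Path(tmp))
        print(summarize_bids_tree(Path(tmp)))
```

A dedicated BIDS library additionally reads sidecar metadata and handles sessions, runs, and entities explicitly, which this directory walk does not attempt.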

Data availability

The OpenNeuro data repository is accessible at http://openneuro.org. The derived data used to generate the analyses and figures reported here are available at https://doi.org/10.5281/zenodo.5559041.

The following data sets were generated
    1. Poldrack R
    2. Esteban O
    (2021) Zenodo
    poldrack/OpenNeuro_analyses: Revision and updates.
    https://doi.org/10.5281/zenodo.5559041

References

    1. Botvinik-Nezer R
    2. Holzmeister F
    3. Camerer CF
    4. Dreber A
    5. Huber J
    6. Johannesson M
    7. Kirchler M
    8. Iwanir R
    9. Mumford JA
    10. Adcock RA
    11. Avesani P
    12. Baczkowski BM
    13. Bajracharya A
    14. Bakst L
    15. Ball S
    16. Barilari M
    17. Bault N
    18. Beaton D
    19. Beitner J
    20. Benoit RG
    21. Berkers RMWJ
    22. Bhanji JP
    23. Biswal BB
    24. Bobadilla-Suarez S
    25. Bortolini T
    26. Bottenhorn KL
    27. Bowring A
    28. Braem S
    29. Brooks HR
    30. Brudner EG
    31. Calderon CB
    32. Camilleri JA
    33. Castrellon JJ
    34. Cecchetti L
    35. Cieslik EC
    36. Cole ZJ
    37. Collignon O
    38. Cox RW
    39. Cunningham WA
    40. Czoschke S
    41. Dadi K
    42. Davis CP
    43. Luca AD
    44. Delgado MR
    45. Demetriou L
    46. Dennison JB
    47. Di X
    48. Dickie EW
    49. Dobryakova E
    50. Donnat CL
    51. Dukart J
    52. Duncan NW
    53. Durnez J
    54. Eed A
    55. Eickhoff SB
    56. Erhart A
    57. Fontanesi L
    58. Fricke GM
    59. Fu S
    60. Galván A
    61. Gau R
    62. Genon S
    63. Glatard T
    64. Glerean E
    65. Goeman JJ
    66. Golowin SAE
    67. González-García C
    68. Gorgolewski KJ
    69. Grady CL
    70. Green MA
    71. Guassi Moreira JF
    72. Guest O
    73. Hakimi S
    74. Hamilton JP
    75. Hancock R
    76. Handjaras G
    77. Harry BB
    78. Hawco C
    79. Herholz P
    80. Herman G
    81. Heunis S
    82. Hoffstaedter F
    83. Hogeveen J
    84. Holmes S
    85. Hu C-P
    86. Huettel SA
    87. Hughes ME
    88. Iacovella V
    89. Iordan AD
    90. Isager PM
    91. Isik AI
    92. Jahn A
    93. Johnson MR
    94. Johnstone T
    95. Joseph MJE
    96. Juliano AC
    97. Kable JW
    98. Kassinopoulos M
    99. Koba C
    100. Kong X-Z
    101. Koscik TR
    102. Kucukboyaci NE
    103. Kuhl BA
    104. Kupek S
    105. Laird AR
    106. Lamm C
    107. Langner R
    108. Lauharatanahirun N
    109. Lee H
    110. Lee S
    111. Leemans A
    112. Leo A
    113. Lesage E
    114. Li F
    115. Li MYC
    116. Lim PC
    117. Lintz EN
    118. Liphardt SW
    119. Losecaat Vermeer AB
    120. Love BC
    121. Mack ML
    122. Malpica N
    123. Marins T
    124. Maumet C
    125. McDonald K
    126. McGuire JT
    127. Melero H
    128. Méndez Leal AS
    129. Meyer B
    130. Meyer KN
    131. Mihai G
    132. Mitsis GD
    133. Moll J
    134. Nielson DM
    135. Nilsonne G
    136. Notter MP
    137. Olivetti E
    138. Onicas AI
    139. Papale P
    140. Patil KR
    141. Peelle JE
    142. Pérez A
    143. Pischedda D
    144. Poline J-B
    145. Prystauka Y
    146. Ray S
    147. Reuter-Lorenz PA
    148. Reynolds RC
    149. Ricciardi E
    150. Rieck JR
    151. Rodriguez-Thompson AM
    152. Romyn A
    153. Salo T
    154. Samanez-Larkin GR
    155. Sanz-Morales E
    156. Schlichting ML
    157. Schultz DH
    158. Shen Q
    159. Sheridan MA
    160. Silvers JA
    161. Skagerlund K
    162. Smith A
    163. Smith DV
    164. Sokol-Hessner P
    165. Steinkamp SR
    166. Tashjian SM
    167. Thirion B
    168. Thorp JN
    169. Tinghög G
    170. Tisdall L
    171. Tompson SH
    172. Toro-Serey C
    173. Torre Tresols JJ
    174. Tozzi L
    175. Truong V
    176. Turella L
    177. van ’t Veer AE
    178. Verguts T
    179. Vettel JM
    180. Vijayarajah S
    181. Vo K
    182. Wall MB
    183. Weeda WD
    184. Weis S
    185. White DJ
    186. Wisniewski D
    187. Xifra-Porxas A
    188. Yearling EA
    189. Yoon S
    190. Yuan R
    191. Yuen KSL
    192. Zhang L
    193. Zhang X
    194. Zosky JE
    195. Nichols TE
    196. Poldrack RA
    197. Schonberg T
    (2020) Variability in the analysis of a single neuroimaging dataset by many teams
    Nature 582:84–88.
    https://doi.org/10.1038/s41586-020-2314-9
    1. Halchenko YO
    2. Poldrack B
    3. Hanke M
    (2016)
    DataLad--Decentralized Data Distribution for Consumption and Sharing of Scientific Datasets
    Organization of Human Brain Mapping Poster
    Geneva, Switzerland: Organization of Human Brain Mapping Annual Meeting.
    1. States U
    (1978)
    The Belmont Report: Ethical Principles and Guidelines for the Protection of Human Subjects of Research
    National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research, The Belmont Report: Ethical Principles and Guidelines for the Protection of Human Subjects of Research, The Commission.

Article and author information

Author details

  1. Christopher J Markiewicz

    Department of Psychology, Stanford University, Stanford, United States
    Contribution
    Conceptualization, Project administration, Writing - original draft
    Competing interests
    No competing interests declared
    ORCID: 0000-0002-6533-164X
  2. Krzysztof J Gorgolewski

    Department of Psychology, Stanford University, Stanford, United States
    Contribution
    Conceptualization, Funding acquisition, Methodology, Project administration, Supervision, Software
    Competing interests
    No competing interests declared
    ORCID: 0000-0003-3321-7583
  3. Franklin Feingold

    Department of Psychology, Stanford University, Stanford, United States
    Contribution
    Project administration, Supervision, Software
    Competing interests
    No competing interests declared
  4. Ross Blair

    Department of Psychology, Stanford University, Stanford, United States
    Contribution
    Data curation, Software
    Competing interests
    No competing interests declared
  5. Yaroslav O Halchenko

    Department of Psychological & Brain Sciences, Dartmouth College, Hanover, United States
    Contribution
    Conceptualization, Data curation, Software
    Competing interests
    No competing interests declared
  6. Eric Miller

    Squishymedia, Portland, United States
    Contribution
    Conceptualization, Project administration, Data curation, Software
    Competing interests
    is owner of Squishymedia which is funded to perform software development work on OpenNeuro.
  7. Nell Hardcastle

    Squishymedia, Portland, United States
    Contribution
    Data curation, Software
    Competing interests
    is an employee of Squishymedia which is funded to perform software development work on OpenNeuro.
    ORCID: 0000-0002-3837-0707
  8. Joe Wexler

    Department of Psychology, Stanford University, Stanford, United States
    Contribution
    Investigation, Software
    Competing interests
    No competing interests declared
  9. Oscar Esteban

    1. Department of Psychology, Stanford University, Stanford, United States
    2. Lausanne University Hospital and University of Lausanne, Lausanne, Switzerland
    Contribution
    Data curation, Software
    Competing interests
    No competing interests declared
    ORCID: 0000-0001-8435-6191
  10. Mathias Goncavles

    Department of Psychology, Stanford University, Stanford, United States
    Contribution
    Data curation, Software
    Competing interests
    No competing interests declared
  11. Anita Jwa

    Department of Psychology, Stanford University, Stanford, United States
    Contribution
    Formal analysis, Software
    Competing interests
    No competing interests declared
  12. Russell Poldrack

    Department of Psychology, Stanford University, Stanford, United States
    Contribution
    Conceptualization, Writing – review and editing, Funding acquisition, Methodology, Project administration, Data curation, Supervision, Writing - original draft, Software
    For correspondence
    russpold@stanford.edu
    Competing interests
    No competing interests declared
    ORCID: 0000-0001-6755-0259

Funding

National Institute of Mental Health (R24MH117179)

  • Russell Poldrack

National Institute of Mental Health (R24MH114705)

  • Russell Poldrack

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

The work described here has been supported by the National Institute of Mental Health of the National Institutes of Health under Award Numbers R24MH117179 and R24MH114705. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. Development of OpenNeuro and OpenfMRI was also supported by a grant from the Laura and John Arnold Foundation, and the National Science Foundation (OAC-1131441). Sharing of OpenNeuro datasets has been enabled by support from AWS. We would like to thank all of the users who have uploaded data to OpenNeuro. Thanks to Franco Pestili for providing usage data on Brainlife.io, and Nico Dosenbach, Michael Hawrylycz, Karel Svoboda, and Armin Thomas for helpful comments on an earlier draft.

Version history

  1. Preprint posted: June 29, 2021
  2. Received: June 29, 2021
  3. Accepted: October 15, 2021
  4. Accepted Manuscript published: October 18, 2021 (version 1)
  5. Version of Record published: October 27, 2021 (version 2)

Copyright

© 2021, Markiewicz et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Cite this article

  1. Christopher J Markiewicz
  2. Krzysztof J Gorgolewski
  3. Franklin Feingold
  4. Ross Blair
  5. Yaroslav O Halchenko
  6. Eric Miller
  7. Nell Hardcastle
  8. Joe Wexler
  9. Oscar Esteban
  10. Mathias Goncavles
  11. Anita Jwa
  12. Russell Poldrack
(2021)
The OpenNeuro resource for sharing of neuroscience data
eLife 10:e71774.
https://doi.org/10.7554/eLife.71774
