A synthetic dataset primer for the biobehavioural sciences to promote reproducibility and hypothesis-generation

  Daniel S Quintana (corresponding author)
  University of Oslo, Norway

Abstract

Open research data provide considerable scientific, societal, and economic benefits. However, disclosure risks can limit data sharing, especially for datasets that include sensitive details or information from individuals with rare disorders. This article introduces synthetic datasets, an emerging method originally developed to permit the sharing of confidential census data. Synthetic datasets mimic real datasets by preserving their statistical properties and the relationships between variables. Importantly, this method also reduces disclosure risk to essentially nil, as no record in the synthetic dataset represents a real individual. This practical guide, with an accompanying R script, enables biobehavioural researchers to create synthetic datasets and assess their utility via the synthpop R package. By sharing synthetic datasets that mimic original datasets that could not otherwise be made open, researchers can ensure the reproducibility of their results and facilitate data exploration while maintaining participant privacy.
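To make the core idea concrete, the following is a minimal, language-agnostic sketch in Python of one simple way a synthetic dataset can preserve a relationship between two variables. It is not the synthpop API and not the article's accompanying R script. It assumes a parametric approach (a fitted normal distribution for the first variable, then a fitted linear regression for the second), and the toy "original" data, the function names `fit_line` and `synthesise`, and all parameter values are invented for illustration.

```python
import random
import statistics


def fit_line(xs, ys):
    """Return least-squares slope, intercept and residual SD of ys on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return slope, intercept, statistics.pstdev(residuals)


def synthesise(xs, ys, rng):
    """Draw a synthetic (x, y) dataset with as many records as the original.

    x is sampled from a normal distribution fitted to the original x;
    y is then sampled conditional on the synthetic x, using the regression
    fitted to the original data plus normally distributed noise.
    """
    slope, intercept, resid_sd = fit_line(xs, ys)
    mx, sx = statistics.fmean(xs), statistics.stdev(xs)
    syn_x = [rng.gauss(mx, sx) for _ in xs]
    syn_y = [intercept + slope * x + rng.gauss(0, resid_sd) for x in syn_x]
    return syn_x, syn_y


rng = random.Random(2020)  # fixed seed so the sketch is repeatable

# Toy "original" data, invented for illustration only (not from the article).
orig_x = [rng.gauss(50, 10) for _ in range(500)]
orig_y = [0.5 * x + rng.gauss(0, 5) for x in orig_x]

syn_x, syn_y = synthesise(orig_x, orig_y, rng)

# A basic utility check: does the same model give a similar estimate?
orig_slope = fit_line(orig_x, orig_y)[0]
syn_slope = fit_line(syn_x, syn_y)[0]
print(f"slope in original data:  {orig_slope:.3f}")
print(f"slope in synthetic data: {syn_slope:.3f}")
```

Every synthetic record here is drawn from a fitted model rather than copied from a participant, and the regression slope can be compared between the two datasets as a rough check of utility. Real tools handle many variables, mixed data types, and non-parametric models, so this sketch only shows the principle.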

Data availability

Data and analysis scripts are available at the article's Open Science Framework webpage https://osf.io/z524n/

The following previously published dataset was used:

    Jones BC, DeBruine L (2019) Sociosexuality and self-rated attractiveness. Open Science Framework. DOI: 10.17605/OSF.IO/6BK3W

Article and author information

Author details

  1. Daniel S Quintana

    Institute of Clinical Medicine, University of Oslo, Oslo, Norway
    For correspondence
    daniel.quintana@medisin.uio.no
    Competing interests
    The authors declare that no competing interests exist.
    ORCID: 0000-0003-2876-0004

Funding

Novo Nordisk Foundation (Excellence grant NNF16OC0019856)

  • Daniel S Quintana

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Reviewing Editor

  1. Mone Zaidi, Icahn School of Medicine at Mount Sinai, United States

Publication history

  1. Received: November 1, 2019
  2. Accepted: March 11, 2020
  3. Accepted Manuscript published: March 11, 2020 (version 1)
  4. Version of Record published: April 1, 2020 (version 2)

Copyright

© 2020, Quintana

This article is distributed under the terms of the Creative Commons Attribution License permitting unrestricted use and redistribution provided that the original author and source are credited.

Cite this article

Daniel S Quintana (2020) A synthetic dataset primer for the biobehavioural sciences to promote reproducibility and hypothesis-generation. eLife 9:e53275. https://doi.org/10.7554/eLife.53275
