How synthetic data can protect participant privacy

Synthetic datasets could make it easier for scientists to share data without jeopardizing the privacy of participants.

Apr 1, 2020

Schematic of data servers for genomic data. Image credit: Ernesto Del Aguila III, NHGRI (CC BY-NC 2.0)

It is becoming increasingly common for scientists to share their data with other researchers. This makes it possible to independently verify reported results, which increases trust in research. However, some datasets cannot be shared because they include sensitive information about individuals. In psychology and medicine, scientists have tried to remove identifying information from datasets before sharing them by, for example, adding minor artificial errors. But even when researchers take these steps, it may still be possible to identify individuals, and the artificial errors can make it harder to verify the original results.

One potential alternative to sharing sensitive data is to create ‘synthetic datasets’. Synthetic datasets mimic original datasets by maintaining the statistical properties of the data but without matching the original recorded values. Synthetic datasets are already being used, for example, to share confidential census data. However, this approach is rarely used in other areas of research. Now, Daniel S. Quintana demonstrates how synthetic datasets can be used in psychology and medicine.
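To make the idea concrete, here is a minimal sketch in Python of one simple way to build a synthetic dataset: estimate the means, standard deviations and correlation of two variables, then draw fresh values from a normal distribution with those properties. This is an illustrative stand-in and not the software evaluated in the study. The function names (`pearson`, `synthesize_pairs`) and the "age" and "score" variables are hypothetical, and the "original" data here are themselves simulated.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def synthesize_pairs(xs, ys, n=None, seed=0):
    """Draw synthetic (x, y) values from a bivariate normal fitted to xs, ys.

    Assumption: the two variables are roughly normal and linearly related.
    Only summary statistics of the originals are used, so no original
    record is copied into the synthetic data.
    """
    rng = random.Random(seed)
    n = len(xs) if n is None else n
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sd_x, sd_y = statistics.stdev(xs), statistics.stdev(ys)
    r = pearson(xs, ys)
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - r * r))

    synth_x, synth_y = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        synth_x.append(mean_x + sd_x * z1)
        # Mixing z1 and z2 gives the synthetic pair correlation r.
        synth_y.append(mean_y + sd_y * (r * z1 + residual * z2))
    return synth_x, synth_y


if __name__ == "__main__":
    # Hypothetical "original" data: age and a score that depends on age.
    rng = random.Random(42)
    age = [rng.gauss(40, 10) for _ in range(500)]
    score = [0.5 * a + rng.gauss(0, 5) for a in age]

    synth_age, synth_score = synthesize_pairs(age, score, seed=1)

    print("original correlation: ", round(pearson(age, score), 3))
    print("synthetic correlation:", round(pearson(synth_age, synth_score), 3))
    print("shared age values:", len(set(age) & set(synth_age)))
```

The two correlations should be close, while the synthetic values themselves are newly drawn. Real datasets usually mix variable types and contain non-linear relationships, so dedicated synthesis software uses more flexible models than this two-variable sketch.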

Quintana studied three different datasets to check that synthetic datasets perform well regardless of the type or size of the data. He evaluated freely available software that can generate synthetic versions of these datasets, a process that essentially removes any identifying information. The results obtained by analysing the synthetic datasets closely mimicked the original results.

These tools could allow researchers to verify each other’s results more easily without jeopardizing the privacy of participants. This could encourage more collaboration, stimulate ideas for future research, and increase data sharing between research groups.

Related to
A synthetic dataset primer for the biobehavioural sciences to promote reproducibility and hypothesis generation

Daniel S Quintana

Tools and Resources Updated Apr 1, 2020