As we roll up our sleeves and start putting the journal together, we’re formulating our thoughts and policies on a host of issues at eLife. In this post I’d like to share where our thoughts are on the issue of supplementary data and data in general. I’d love to hear your feedback (tweet me at @ianmulvany, comment on this post, or email firstname.lastname@example.org).
It’s timely that we think about this, as demonstrated by the recent report by the Royal Society.
The report highlights the promise of digital communication:
“We are now on the brink of an achievable aim: for all science literature to be online, for all of the data to be online and for the two to be interoperable.”
However, the report also highlights that there is plenty of scope for improvement:
“A great deal of data has become detached from the published conclusions that depend upon it, such that the two vital complementary components of the scientific endeavour - the idea and the evidence - are too frequently separated.”
eLife most definitely wants to be part of the solution to this problem. Journals such as Gigascience are pointing in the right direction.
To get started I’d like to outline two of the broad principles that are leading our thinking in the creation of the journal.
- There’s a difference between a representation of a data set and the underlying data. A good example of a representation of a data set is a summary table or a figure. Good examples of the underlying data are a CSV file, a full sequence, or a multi-gigabyte image.
- The Web can enable researchers to interact directly with the underlying data. Although we’re far from being there now, in the long run the static PDF as a primary format for scholarly communication will be superseded by formats that can do better justice to the needs of the researcher – which we feel are to use data collected, methods invented, or analyses performed to justify assertions about the world.
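To make the distinction between underlying data and a representation of it concrete, here is a minimal sketch in Python. The CSV contents, column names and the `summarise` function are all invented for illustration and do not describe any eLife system. The raw rows stand in for the underlying data, and the per-condition means stand in for the kind of summary table that usually ends up in an article.

```python
import csv
import io
from statistics import mean

# Hypothetical underlying data: one row per measurement,
# as it might be deposited in a CSV file.
RAW_CSV = """sample,condition,value
s1,control,1.0
s2,control,3.0
s3,treated,4.0
s4,treated,6.0
"""


def summarise(raw_csv: str) -> dict[str, float]:
    """Collapse the underlying rows into a per-condition mean (a representation)."""
    groups: dict[str, list[float]] = {}
    for row in csv.DictReader(io.StringIO(raw_csv)):
        groups.setdefault(row["condition"], []).append(float(row["value"]))
    return {condition: mean(values) for condition, values in groups.items()}


summary = summarise(RAW_CSV)
print(summary)  # {'control': 2.0, 'treated': 5.0}
```

The summary can always be regenerated from the rows, but the rows cannot be recovered from the summary, which is why access to the underlying file matters for reuse.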
With these two broad principles in mind, eLife is hoping to move towards a situation in which underlying or primary data – of any size and any composition – will be linked appropriately to an article narrative, will be easily searchable, discoverable and citable, and will be made available in the most useful formats for reuse. Such goals are very different from the more typical current situation whereby valuable data, experiments and interpretation are buried in supplementary files – often as a PDF – that are difficult to find and use. As we have no page limits, we don’t need to force researchers to hide critical parts of the story in a supplementary file – authors will have the space to present their work in full. We will also encourage authors to provide important underlying data, but such supplementary information will be presented as individual files in useful formats, with titles, legends and unique identifiers (DOIs).
I’m personally very excited about the prospect of being able to accept source code for submission, and being able to give that source code a stable home under appropriate version control, for example in a location like GitHub or as an accompanying metapaper in a venue such as the Journal of Open Research Software.
For representations of the data, such as figures, some videos, and some interactive molecular structures, we aim to move towards a situation where these should all be submitted as part of the main body of the article. However, we also recognise that some figures are more central to the narrative of the paper than others, and we will therefore be supporting child or secondary figures.
On the Web, it will be straightforward to create a reading experience that allows for those figure supplements to be invoked by the reader at an appropriate moment. For the PDF artefact, we’re still in discussion around the structure and design; it may be possible for us to format those figures to appear in some kind of appendix.
In terms of the underlying data, we will require all authors to make the key data freely available, and we recognize that there are several ways to achieve this goal. We will be inviting authors to submit the data that underlie certain figures or tables, along with appropriate metadata. We also hope to increase the research community’s awareness of tools that can be used to deposit their data, and can provide a persistent identifier associated with that data. We’ll be encouraging authors to deposit their data in repositories such as Dryad, and we hope to be able to work well with services such as figshare, sybase, labarchives and others.
I’ll be coming back to look at various tools for managing data in a later post. The consistent themes are that any data should be searchable, discoverable, usable and citable. In line with recommendation 6 from the Royal Society report, we will collect critical information about any large datasets used (including licensing information).
“As a condition of publication, scientific journals should enforce a requirement that the data on which the argument of the article depends should be accessible, assessable, usable and traceable through information in the article. This should be in line with the practical limits for that field of research. The article should indicate when and under what conditions the data will be available for others to access.”
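As a rough illustration of what collecting this kind of information could look like, here is a minimal Python sketch of a record for one deposited data file. It is an assumption made for illustration only: the `DatasetRecord` class, its field names, the `missing_fields` helper and the example values (including the figure reference) are all hypothetical and are not an eLife or Royal Society schema.

```python
from dataclasses import dataclass, fields


@dataclass
class DatasetRecord:
    """Hypothetical description of one deposited data file (not an eLife schema)."""
    title: str
    legend: str
    identifier: str     # persistent identifier, for example a DOI
    file_format: str    # for example "csv"
    license_name: str   # licensing information for the data
    availability: str   # when and under what conditions others can access the data


def missing_fields(record: DatasetRecord) -> list[str]:
    """Return the names of any fields that were left blank."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]


# Invented example values; the identifier is left blank on purpose.
record = DatasetRecord(
    title="Raw measurements underlying a figure",
    legend="One row per sample; values in arbitrary units.",
    identifier="",
    file_format="csv",
    license_name="CC0",
    availability="Available on publication, no restrictions.",
)
print(missing_fields(record))  # ['identifier']
```

A check like this only confirms that the information was supplied. Whether the data really are accessible and usable still needs a person, or a repository, to verify.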
We want to be open about our design plans at eLife. We’re still in early days with this and I would be very happy to have some feedback. Again, tweet me at @ianmulvany, comment here, or email email@example.com.
Conflicts of interest
I am on the editorial board of the Journal of Open Research Software.