Some researchers already compose manuscripts and document experiments in formats that closely tie the analysis and visualisation code with the outputs, including Jupyter notebooks and R Markdown documents. The benefits this brings to the researcher include better documentation and reproducibility of their analyses, as well as faster updating of results and visualisations as new data is added. For the reader, including this extra level of research documentation enables better comprehension and reuse of computational methodologies. So, how might publishers better support the sharing of research in these enriched forms? Do readers want to be able to interact with, reproduce or modify computational analyses? In which research areas are these technologies most relevant?
In order to better understand the demand for new features to support executable research articles, we conducted a survey including questions about how familiar the community are with current reproducible research tools and which features supporting these tools they would be most likely to use if they were to be provided. The survey questions are available here.
We surveyed eLife authors, reviewers and readers over the course of one week in June, with the survey advertised by email, in our weekly digest of new articles (eTOC) and on social media. We received 152 complete responses: the key results are shared in this blog post.
The results of this survey affirm that there are researchers who already document their workflows and results in a way that could allow readers to understand and build on this work more readily. As may be expected, these tools are currently most widely used in more computationally-intensive research areas - Computational and Systems Biology, Genomics and Evolutionary Biology, and Neuroscience - but researchers in general are interested in learning and adopting more computational practices.
Being able to view and download the data and code associated with a research article directly from the online publication was identified as a key feature of interest, which highlights the importance of raising the visibility of these assets. Respondents were also likely to submit their manuscript in a file format that preserves the data and code alongside the text, particularly if they were already users of such formats (R Markdown, Jupyter notebooks), and to interact online with these assets in a way that allows them to gain additional insight into the research publication they are reading.
Several issues and limitations with implementing more computationally reproducible articles were highlighted. For example, it may not be possible to share research involving large or complex datasets and computational scripts in a format other than the traditional manuscript. Any infrastructure would need to manage dependencies, accommodate versioning, and deliver computational outputs with sufficient speed to be more useful than computing locally. Finally, the imbalance between author burden - including the time taken to produce well-documented assets and the effort to learn these new methods - and third party benefit is a concern, and one which may prevent some researchers from engaging with these technologies without appropriate incentivisation.
The results are described in greater detail below.
We thank all who responded to our survey. The results have identified problem areas to focus on and highlighted potential features to prioritise as we explore how to support researchers to share and consume more computationally reproducible research.
We are grateful to the many researchers who added messages of support and welcomed future interactions regarding our investigations. We will be following up in due course. If you would like to share your experiences regarding the use of reproducible research documents, or to be involved in future discussions, please email Naomi at firstname.lastname@example.org.
Who took the survey
The top three research categories that the respondents worked within were Neuroscience (34%), Biochemistry (28%), and Computational and Systems Biology (28%). Other notable research areas were Cell Biology, Genomics and Evolutionary Biology, and Biophysics and Structural Biology (Figure 1).
Of the 152 respondents, over half (84) were group leaders and 38 were post-docs (Figure 2).
Just under half the respondents had previously used R Markdown and/or Jupyter notebooks in their research (Figure 3); both are popular formats for documenting code with the computational outputs. Of note, 42% of respondents had not used R for their research; they were not asked about R Markdown. Interestingly, the respondents who had previously used R Markdown and/or Jupyter notebooks worked mainly within the computationally-intensive research areas of Computational and Systems Biology, Genomics and Evolutionary Biology, and Neuroscience (Figure 4).
We asked the respondents who had not used these formats, or were not sure if they had, what reason(s) prevented them from using them in their research (Table 1). The major reason why respondents did not use R was not knowing how, whilst a lack of awareness of the tool was the main blocker for using R Markdown and Jupyter notebooks.
| Reason | R | R Markdown | Jupyter notebook |
|---|---|---|---|
| I don’t know what __ is | 13 (20%) | 18 (38%) | 62 (60%) |
| I don’t know how to use __ | 28 (43%) | 14 (30%) | 25 (24%) |
| I don’t use __ | 24 (37%) | 14 (30%) | 24 (23%) |
| My colleagues do not use __ | 8 (12%) | 10 (21%) | 18 (17%) |
| It’s not appropriate for my research | 6 (9%) | 4 (9%) | 9 (9%) |
| I cannot share my research through a journal in this format | Not asked | 1 (2%) | 3 (3%) |
| There’s no incentive to use __ | 12 (18%) | 11 (23%) | 12 (12%) |
| Other | 20 respondents left comments. Some suggested they would like to learn R, or would learn it if they felt able to or were not already comfortable with an alternative. | 14 respondents left comments, noting that they use or prefer alternatives, whilst others noted an interest in using it now that they have been introduced to it. | 17 respondents left comments, noting that they used R-based and other tools instead, or that they would use Jupyter if they had Python code. |
Table 1. The respondents who had not or were not sure if they had used R, R Markdown and Jupyter notebooks were asked ‘What prevents you from using __ for your research?’ for each tool. The number of respondents for each question was 65, 47 and 102, respectively.
Considering how to improve the sharing of data and code and to bring these assets closer to the research narrative, the activities of greatest interest to the respondents in this survey were, in order of greatest likelihood:
- Viewing and downloading the code and data underlying the research directly from the online article
- Interacting online with the data, code and/or plots in the article to gain additional insight
- Submitting the manuscript in a file format that preserves code and data alongside the text
There was more mixed interest in being able to modify the code and run this new version online and being able to run the original code online (Figure 5).
The respondents noted that submitting their research in a more reproducible format would need to be as practical and achievable for authors as possible: “It would help if all publishers would agree to one common format for code and data, so researchers could all work in that format from the start.” Adding extra work was a concern, as was the data and/or code being too complex to share in this way. Researchers who already document their research in more reproducible formats were in favour of the publisher supporting this: “For each manuscript I have a directory with all data, code, manuscript files etc. I think the future is for journals to host this. The current situation (at most journals) that a compressed PDF appears online with most data hidden is outdated.”
Being able to easily locate, access and download the data and code underlying a research article was seen as a useful feature, pointing to the inadequacies of the current system: “I have gotten so frustrated trying to get data and from articles (so I can analyze them myself) and not being able to find the data or code. Numerous times authors have neglected to respond to me or been impossible to contact. In other cases the documentation for the code has been so inadequate that it's difficult to impossible to use.” The practicality of managing dependencies and versions was flagged, whilst some respondents felt that existing data and code sharing mechanisms need not be duplicated by the publisher.
Those in support of being able to interact with the data, code and/or plots online showed interest in using R Shiny and Plot.ly-style graphs, and pointed out benefits including helping less computationally literate users to interact, and that this “could eliminate the need for countless supplementary figures in many cases (e.g. exploring the effect of parameters on results)”. Those against the idea again referred to the lack of time for authors and the impracticality of producing these assets.
Reproducing a computational result was not seen to be particularly useful beyond the first few reproductions, and only for results that were particularly significant or relevant, or “in case of serious doubts about the analysis of the data.” Concerns were raised over the practicality for analyses with high computation requirements and regarding dependency management. It was felt that researchers should be trusted to do the analysis correctly, whilst anyone wishing to explore the data and code would do so on their own machines and have the prerequisite knowledge to do so.
The respondents had mixed opinions over the utility of providing a feature that would enable the reader to modify and re-run computations underlying the research article. One saw this as “a killer feature especially if results can be returned in near real-time” whilst another added caveats: “The level of annotation and documentation required for this would be huge but certainly beneficial to the field. A code badly written by a biologist should not prevent publication of the result as long as it does what it claims it does.” Practicalities such as dependency management, delivering the necessary compute power with enough speed, and variable code quality were flagged as concerns.
We also asked the respondents who had used R Markdown and/or Jupyter notebooks before whether they would be likely to submit a research manuscript in one of these specific formats, or to present their research in these formats as a complement to the research article. The majority of respondents indicated they were very likely to do so (Figure 6).
Some respondents detailed concerns about these formats, including the potential infeasibility due to the complexity and size of datasets, as well as “the potential for this to add significantly to the work required to submit and to review papers”. Respondents noted the need to manage dependencies across computational environments, that these notebook formats are not suitable for protected data (such as patient data), and that “different parts of the analysis protocol may require different languages” - this flexibility would need to be accommodated.
However, the benefits of sharing research in these formats were also highlighted: “Having the code available for analysis workflows would really speed up research. For instance, one could copy and improve a workflow for another data set”, and “it could allow an entirely new kind of interaction between authors and readers”. Furthermore, several respondents revealed that they were already composing research manuscripts using these tools, and highlighted that there could be better support of this workflow.
Supporting greater transparency of the original workings (data and code) underlying research publications could have implications for peer review. The respondents who had conducted peer review before (129; 84%) said they would be likely to make use of a file format that links the text, code and data during peer review, if this were available (Figure 7). Just over a third of reviewers (34.7%) would download this file to review the reproducibility of a result, whilst 36.2% might do so depending on the situation: having the time to do so would be a factor, and this would only be useful for particularly important or suspicious results. The remaining respondents would not (9.5%) or were not sure (19.7%).
If we were to make it possible to explore the code and data online during peer review, the majority of reviewers felt they would be likely to use this (Figure 7), noting that this would be favourable if it saved the reviewer’s time and if it meant reducing the learning curve for computational reproducibility. However, some felt it would be better to download and run locally in order to be thorough and to trust the output, whilst others noted the extra burden on reviewers.
Sharing research in a computationally reproducible format would require authors to be willing to share the underlying data and code. Therefore, we asked the survey respondents about their current data and code sharing practices.
We first asked what types of data were produced in their research and coded the free-text responses. Most respondents produced simple datasets, such as Excel files and time series data (74%), and a substantial number of respondents listed microscopy images (45%), sequencing data (36%), wet lab images (35%), and more complex data types, such as crystal structure data (27%). A small minority of respondents produced personal or clinical data (4%), or other types of data, including mathematical models or computational scripts (7%).
The majority of respondents had shared data (68%) and code (72% of the 115 respondents who used or produced code during their research; Table 2). Data was most frequently shared as supplementary files with a manuscript, whilst code was most often shared via an online open-access repository, such as GitHub.
| Do you share... | ...data? | ...code? |
|---|---|---|
| Yes, of which: | 103 (68%) | 78 (72%) |
| - Via an online open-access repository | 66 | 55 |
| - As supplementary files with the research manuscript | 77 | 44 |
| - Via a website (lab or personal) | 41 | 33 |
| No | 39 (26%) | 29 (27%) |
| I'm not sure | 9 (6%) | 2 (2%) |
Table 2. Respondents were asked ‘Have you ever shared the data underlying your research as an open-access resource?’ (151 responses), and those who use or produce code were asked ‘Have you ever shared the code underlying your research data analysis or visualisation as an open-access resource?’ (115 responses).
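As a quick check on the arithmetic in Table 2, the following Python sketch recomputes the percentages from the reported counts. It assumes that each percentage is taken over the sum of the answers in its column (151 for data, 109 for code). That the code column appears to use 109 rather than the 115 respondents who were asked is inferred from the counts and is not stated in the survey. The names `table2` and `percentages` are illustrative only.

```python
# Counts copied from Table 2. The denominators (column totals) are an
# assumption inferred from the counts, not something the survey states.
table2 = {
    "data": {"Yes": 103, "No": 39, "I'm not sure": 9},
    "code": {"Yes": 78, "No": 29, "I'm not sure": 2},
}


def percentages(counts):
    """Return each answer's share of the column total as a whole percent."""
    total = sum(counts.values())
    return {answer: round(100 * n / total) for answer, n in counts.items()}


for question, counts in table2.items():
    # Print the question, the number of answers counted, and the shares.
    print(question, sum(counts.values()), percentages(counts))
```

Under that assumption, the recomputed shares match the percentages printed in the table for both columns.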
We asked all respondents what prevented them or would prevent them from sharing their research data and code (128 free-text comments for data, 81 for code). The most common reason preventing the sharing of both data and code was the time and effort it takes to make these shareable, including annotation and formatting, with little incentive or perceived benefit. The data file being too large or complex to easily upload or share was frequently identified as a blocker, as was a lack of suitable repositories for certain data. For code, other preventative reasons were the burden of supporting secondary users, fearing that the code is not of good enough quality, not having any code to share yet, and the specific lack of a citation practice for shared code. There were also concerns that some research code is too specialist or environment-dependent to be worth sharing. Other blockers identified were related to licensing, intellectual property, or other legal issues, fear of “scooping”, not knowing how, and not being supported by peers to do so. Despite the identified blockers, a substantial minority (22 respondents; 27%) stated they had no reason not to share code.