This week, researchers from the Wikimedia Foundation announced a major update to the open dataset of citations in Wikipedia: the data now includes the article topics and open-access (OA) status of cited works, enabling analysts to understand the openness of research shared via Wikipedia. The researchers found that 40% of the scholarly publications cited in Wikipedia have a freely available version on the web. For Wikipedia articles in biology and medicine specifically, the figures are 44% and 40%, respectively.
To help interested readers explore this data, a team of developers from Wikimedia and eLife has released WikiCiteVis, a web app for querying the dataset, first designed at the eLife Innovation Sprint in May 2018. Check how a scholarly article or book is cited in Wikipedia using WikiCiteVis, and read on to find out more about the development of this tool, including how to contribute.
The dataset is large: the current release contains nearly 16 million rows. It is well structured, which helps with reusability; however, understanding trends in the data requires data science and programming skills. Given the value of the dataset and the public interest in it, Sam Walton from the Wikimedia team proposed the development of a web-based query and visualisation tool to enable people without these computational skills to explore the dataset. With that in mind, Sam brought the open dataset of citations with identifiers in Wikipedia to the eLife Innovation Sprint 2018, where he teamed up with David Moulton, Chris Wilkinson and Sean Wiseman, eLife, and Ian Mulvany, SAGE Publishing, to design and prototype such a tool: WikiCiteVis.
Since May, the eLife development team and Sam have continued their work on the prototype tool, extending it to include the OA status and article topic as included in the updated dataset, and connecting the front end and back end elements to deliver a working web app in time to accompany Wikimedia’s announcement.
The result is a browser-based tool through which users can search for how a published work has been cited on Wikipedia. The search can be conducted for one of five common publication identifiers (DOI, ISBN, arXiv ID, PMID and PMCID) and with either a complete (‘10.7554/eLife.37001’) or partial (‘10.7554’) query. Additional filtering options for the search include the language of the article on Wikipedia and the date range during which the citation was added. The tool returns a table of results, including columns for all data from the original dataset, hyperlinking to the source for each entry and the Wikipedia edit which added the citation for easy inspection. It also provides information on how many results were returned in total.
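For readers who want a feel for what such a search involves, the following is a minimal Python sketch of the filtering described above. It is not the WikiCiteVis implementation. The field names ("type", "id", "language", "added"), the identifier type labels, the function name and the sample rows are all assumptions made up for illustration, and a partial query is treated here as a case-insensitive substring match, which may differ from how the tool matches.

```python
from datetime import date

# Hypothetical labels for the five identifier types named in the article.
IDENTIFIER_TYPES = {"doi", "isbn", "arxiv", "pmid", "pmcid"}

# Made-up rows for illustration only; the real dataset's columns may differ.
SAMPLE_ROWS = [
    {"type": "doi", "id": "10.7554/eLife.37001", "language": "en", "added": "2018-06-01"},
    {"type": "doi", "id": "10.7554/eLife.00001", "language": "de", "added": "2017-03-15"},
    {"type": "doi", "id": "10.1000/example.1", "language": "en", "added": "2016-01-10"},
    {"type": "pmid", "id": "12345678", "language": "en", "added": "2018-02-02"},
]


def search_citations(rows, id_type, query, language=None, added_from=None, added_to=None):
    """Return rows whose identifier contains `query`, with optional filters."""
    if id_type not in IDENTIFIER_TYPES:
        raise ValueError(f"unsupported identifier type: {id_type!r}")
    needle = query.strip().lower()
    matches = []
    for row in rows:
        if row["type"] != id_type:
            continue
        # Assumption: a complete or partial query matches as a substring.
        if needle not in row["id"].lower():
            continue
        if language is not None and row["language"] != language:
            continue
        added = date.fromisoformat(row["added"])
        if added_from is not None and added < added_from:
            continue
        if added_to is not None and added > added_to:
            continue
        matches.append(row)
    return matches


if __name__ == "__main__":
    results = search_citations(
        SAMPLE_ROWS, "doi", "10.7554", added_from=date(2018, 1, 1)
    )
    print(f"{len(results)} result(s)")
    for row in results:
        print(row["id"], row["language"], row["added"])
```

A loop like this is fine for a handful of rows; at nearly 16 million rows, a real back end would more plausibly rely on an indexed database, which this sketch does not attempt to model.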
At the eLife Innovation Sprint, Sam worked with Ian to consider the core audiences who might be interested in the dataset, such as Wikimedians, publishers and researchers, the value that such a query tool would bring them, and their requirements. David worked on the front end user interface and some initial data visualisations, while Sean and Chris developed the back end that would return subsets of the data according to the user’s search query.
Sam says: “I have some Python skills, and ran into a number of difficulties in searching the dataset due to its size. It occurred to me that anyone interested in this data who didn’t have those data science skills would be almost completely unable to learn anything from it themselves. The eLife Sprint seemed like a great opportunity to work with developers who understood the value of open and accessible data to work on a solution. Without this, finding the right group of people with the time and interest to work on the tool would have been considerably harder.”
David says: “At a time when we need access to reliable facts more than ever, both Wikipedia and anything that makes its data easier to use in all its forms are strong forces for good. It was great working on the front end of WikiCiteVis during the eLife Innovation Sprint, especially with a team of like-minded people who are passionate about developing projects with inherent value.”
In order to deliver a working alpha version of the tool, the team chose not to include the data visualisations originally designed to accompany the search results. They therefore welcome contributions to deliver that feature, as well as to improve web accessibility, suggest additional features and make other improvements. Inspect the code and contribute via GitHub.
With thanks to Miriam Redi, Dario Taraborelli and Sam Walton at the Wikimedia Foundation, and in acknowledgement of David Moulton, Chris Wilkinson and Sean Wiseman, eLife, and Ian Mulvany, SAGE Publishing, for this collaborative effort. During initial development and hosting, WikiCiteVis was powered by Amazon Web Services (AWS) thanks to the AWS Cloud Credits for Research programme.
We welcome comments, questions and feedback. Please annotate publicly on the article or contact us at innovation [at] elifesciences [dot] org.
You can also email us to share your idea for an open source project to improve the way research is shared and evaluated. We offer project funding, mentorship and exposure through the eLife Innovation Initiative.