Meta-Research: Tracking the popularity and outcomes of all bioRxiv preprints

Richard J Abdill, Ran Blekhman (corresponding author)
University of Minnesota, United States
Feature Article
Cite this article as: eLife 2019;8:e45133 doi: 10.7554/eLife.45133

Abstract

The growth of preprints in the life sciences has been reported widely and is driving policy changes for journals and funders, but little quantitative information has been published about preprint usage. Here, we report how we collected and analyzed data on all 37,648 preprints uploaded to bioRxiv.org, the largest biology-focused preprint server, in its first five years. The rate of preprint uploads to bioRxiv continues to grow (exceeding 2,100 in October 2018), as does the number of downloads (1.1 million in October 2018). We also find that two-thirds of preprints posted before 2017 were later published in peer-reviewed journals, and find a relationship between the number of downloads a preprint has received and the impact factor of the journal in which it is published. We also describe Rxivist.org, a web application that provides multiple ways to interact with preprint metadata.

https://doi.org/10.7554/eLife.45133.001

Introduction

In the 30 days of September 2018, four leading biology journals – The Journal of Biochemistry, PLOS Biology, Genetics and Cell – published 85 full-length research articles. The preprint server bioRxiv (pronounced ‘Bio Archive’) had posted the same number of preprints by the end of September 3 (Figure 1—source data 4). Preprints allow researchers to make their results available as quickly and widely as possible, short-circuiting the delays and requests for extra experiments often associated with peer review (Berg et al., 2016; Powell, 2016; Raff et al., 2008; Snyder, 2013; Hartgerink, 2015; Vale, 2015; Royle, 2014).

Physicists have been sharing preprints using the service now called arXiv.org since 1991 (Verma, 2017), but early efforts to facilitate preprints in the life sciences failed to gain traction (Cobb, 2017; Desjardins-Proulx et al., 2013). An early proposal to host preprints on PubMed Central (Varmus, 1999; Smaglik, 1999) was scuttled by the National Academy of Sciences, which successfully negotiated to exclude work that had not been peer-reviewed (Marshall, 1999; Kling et al., 2003). Further attempts to circulate biology preprints, such as NetPrints (Delamothe et al., 1999), Nature Precedings (Kaiser, 2017), and The Lancet Electronic Research Archive (McConnell and Horton, 1999), popped up (and then folded) over time (The Lancet Electronic Research Archive, 2005). The preprint server that would catch on, bioRxiv, was not founded until 2013 (Callaway, 2013). Now, biology publishers are actively trawling preprint servers for submissions (Barsh et al., 2016; Vence, 2017), and more than 100 journals accept submissions directly from the bioRxiv website (BioRxiv, 2018). The National Institutes of Health now allows researchers to cite preprints in grant proposals (National Institutes of Health, 2017), and grants from the Chan Zuckerberg Initiative require researchers to post their manuscripts to preprint servers (Chan Zuckerberg Initiative, 2019; Champieux, 2018).

Preprints are influencing publishing conventions in the life sciences, but many details about the preprint ecosystem remain unclear. We know bioRxiv is the largest of the biology preprint servers (Anaya, 2018), and sporadic updates from bioRxiv leaders show steadily increasing submission numbers (Sever, 2018). Analyses have examined metrics such as total downloads (Serghiou and Ioannidis, 2018) and publication rate (Schloss, 2017), but long-term questions remain open. Which fields have posted the most preprints, and which collections are growing most quickly? How many times have preprints been downloaded, and which categories are most popular with readers? How many preprints are eventually published elsewhere, and in what journals? Is there a relationship between a preprint’s popularity and the journal in which it later appears? Do these conclusions change over time?

Here, we aim to answer these questions by collecting metadata about all 37,648 preprints posted to bioRxiv from its launch through November 2018. As part of this effort we have developed Rxivist (pronounced ‘Archivist’): a website, API and database (available at https://rxivist.org and gopher://origin.rxivist.org) that provide a fully featured system for interacting programmatically with the periodically indexed metadata of all preprints posted to bioRxiv.

Results

We developed a Python-based web crawler to visit every content page on the bioRxiv website and download basic data about each preprint across the site’s 27 subject-specific categories: title, authors, download statistics, submission date, category, DOI, and abstract. The bioRxiv website also provides the email address and institutional affiliation of each author, plus, if the preprint has been published, its new DOI and the journal in which it appeared. For those preprints, we also used information from Crossref to determine the date of publication. We have stored these data in a PostgreSQL database; snapshots of the database are available for download, and users can access data for individual preprints and authors on the Rxivist website and API. Additionally, a repository is available online at https://doi.org/10.5281/zenodo.2465689 that includes the database snapshot used for this manuscript, plus the data files used to create all figures. Code to regenerate all the figures in this paper is included there and on GitHub (https://github.com/blekhmanlab/rxivist/blob/master/paper/figures.md). See Methods and Supplementary Information for a complete description.
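For example, once a database snapshot has been restored locally, it can be queried with short scripts. The sketch below uses the psycopg2 module referenced in the Methods to count preprints per collection; the ‘articles’ table is named in the Methods, but the column names and connection settings here are assumptions for illustration.

```python
import psycopg2

# Connect to a locally restored copy of the Rxivist snapshot; connection
# settings will vary by installation.
conn = psycopg2.connect(dbname="rxivist", user="postgres", host="localhost")

with conn.cursor() as cur:
    # Count preprints in each bioRxiv collection. The 'articles' table is
    # named in the Methods; the 'collection' column name is an assumption.
    cur.execute("""
        SELECT collection, COUNT(*) AS preprints
        FROM articles
        GROUP BY collection
        ORDER BY preprints DESC;
    """)
    for collection, total in cur.fetchall():
        print(f"{collection}: {total}")

conn.close()
```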

Preprint submissions

The most apparent trend in the bioRxiv data is that the site is an increasingly popular venue for authors to share their work, with submissions growing almost every month. There were 37,648 preprints available on bioRxiv at the end of November 2018, and more preprints were posted in the first 11 months of 2018 (18,825) than in all four previous years combined (Figure 1a). The number of bioRxiv preprints doubled in less than a year, and new submissions have been trending upward for five years (Figure 1b). The largest driver of site-wide growth has been the neuroscience collection, which has had more submissions than any other bioRxiv category in every month since September 2016 (Figure 1b). In October 2018, it became the first of bioRxiv’s collections to contain 6,000 preprints (Figure 1a). The second-largest category is bioinformatics (4,249 preprints), followed by evolutionary biology (2,934). October 2018 was also the first month in which bioRxiv posted more than 2,000 preprints, increasing its total preprint count by 6.3% (2,119 preprints) in 31 days.

Figure 1. Total preprints posted to bioRxiv over the 61-month period from November 2013 through November 2018.

(a) The number of preprints (y-axis) at each month (x-axis), with each category depicted as a line in a different color. Inset: The overall number of preprints on bioRxiv in each month. (b) The number of preprints posted (y-axis) in each month (x-axis) by category. The category color key is provided below the figure.

https://doi.org/10.7554/eLife.45133.002

Preprint downloads

Using preprint downloads as a metric for readership, we find that bioRxiv’s usage among readers is also increasing rapidly (Figure 2). The total download count in October 2018 (1,140,296) was an 82% increase over October 2017, which itself was a 115% increase over October 2016 (Figure 2a). BioRxiv preprints were downloaded almost 9.3 million times in the first 11 months of 2018, and in October and November 2018, bioRxiv recorded more downloads (2,248,652) than in the website’s first two and a half years (Figure 2b). The overall median downloads per paper is 279 (Figure 2b, inset), and the genomics category has the highest median downloads per paper, with 496 (Figure 2c). The neuroscience category has the most downloads overall – it overtook bioinformatics in that metric in October 2018, after bioinformatics spent nearly four and a half years as the most downloaded category (Figure 2d). In total, bioRxiv preprints were downloaded 19,699,115 times from November 2013 through November 2018, and the neuroscience category’s 3,184,456 total downloads accounts for 16.2% of these (Figure 2d). However, this is driven mostly by that category’s high volume of preprints: the median downloads per paper in the neuroscience category is 269.5, while the median of preprints in all other categories is 281 (Figure 2c; Mann–Whitney U test p=0.0003).
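The neuroscience comparison above uses a Mann–Whitney U test. As a minimal sketch, the same test can be run with SciPy; the download lists here are placeholders standing in for the real per-preprint data, which would be pulled from the database.

```python
from scipy.stats import mannwhitneyu

# Placeholder per-preprint download counts; in practice these would be
# pulled from the database for neuroscience vs. every other category.
neuro_downloads = [269, 301, 150, 512, 98]
other_downloads = [281, 340, 120, 610, 77, 455]

stat, p = mannwhitneyu(neuro_downloads, other_downloads, alternative="two-sided")
print(f"U = {stat}, p = {p:.4f}")
```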

Figure 2 (with 4 supplements). The distribution of all recorded downloads of bioRxiv preprints.

(a) The downloads recorded in each month, with each line representing a different year. The lines reflect the same totals as the height of the bars in Figure 2b. (b) A stacked bar plot of the downloads in each month. The height of each bar indicates the total downloads in that month. Each stacked bar shows the number of downloads in that month attributable to each category; the colors of the bars are described in the legend in Figure 1. Inset: A histogram showing the site-wide distribution of downloads per preprint, as of the end of November 2018. The median download count for a single preprint is 279, marked by the yellow dashed line. (c) The distribution of downloads per preprint, broken down by category. Each box illustrates that category’s first quartile, median, and third quartile (similar to a boxplot, but whiskers are omitted due to a long right tail in the distribution). The vertical dashed yellow line indicates the overall median downloads for all preprints. (d) Cumulative downloads over time of all preprints in each category. The top seven categories at the end of the plot (November 2018) are labeled using the same category color-coding as above.

https://doi.org/10.7554/eLife.45133.011

We also examined traffic numbers for individual preprints relative to the date that they were posted to bioRxiv, which helped create a picture of the change in a preprint’s downloads by month (Figure 2—figure supplement 1). We can see that preprints typically have the most downloads in their first month, and the download count per month decays most quickly over a preprint’s first year on the site. The most downloads recorded in a preprint’s first month is 96,047, but the median number of downloads a preprint receives in its debut month on bioRxiv is 73. The median downloads in a preprint’s second month falls to 46, and the third month median falls again, to 27. Even so, the average preprint at the end of its first year online is still being downloaded about 12 times per month, and some papers don’t have a ‘big’ month until relatively late, receiving the majority of their downloads in their sixth month or later (Figure 2—figure supplement 2).
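A sketch of how a decay curve like this can be computed from per-month traffic records, using pandas on hypothetical data; the column names are illustrative, not the actual schema.

```python
import pandas as pd

# Hypothetical traffic records: one row per preprint per month online.
# (The real data lives in the article_traffic table described in the Methods.)
traffic = pd.DataFrame({
    "preprint":      ["A", "A", "A", "B", "B", "C"],
    "months_online": [1,   2,   3,   1,   2,   1],
    "downloads":     [90,  50,  30,  60,  40,  73],
})

# Median downloads in each month relative to posting, mirroring the kind
# of decay curve described above.
print(traffic.groupby("months_online")["downloads"].median())
```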

Preprint authors

While data about the authors of individual preprints is easy to organize, associating authors between preprints is difficult due to a lack of consistent unique identifiers (see Methods). We chose to define an author as a unique name in the author list, including middle initials but disregarding letter case and punctuation. Keeping this in mind, we find that there are 170,287 individual authors with content on bioRxiv. Of these, 106,231 (62.4%) posted a preprint in 2018, including 84,339 who posted a preprint for the first time (Table 1), indicating that total authors increased by more than 98% in 2018.

Table 1
Unique authors posting preprints in each year.

‘New authors’ counts authors posting preprints in that year that had never posted before; ‘Total authors’ includes researchers who may have already been counted in a previous year, but are also listed as an author on a preprint posted in that year. These data were pulled directly from the database; an SQL query to generate these numbers is provided in the Methods section, and a sketch of a similar query appears below the table.

https://doi.org/10.7554/eLife.45133.025
Year    New authors    Total authors
2013    608            608
2014    3,873          4,012
2015    7,584          8,411
2016    21,832         24,699
2017    52,051         61,239
2018    84,339         106,231
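As referenced in the table legend, the sketch below shows one way counts like these can be derived from the snapshot. It is not the exact query from the Methods; the table names follow the schema described in the Data preparation section, while the column names are assumptions.

```python
import psycopg2

conn = psycopg2.connect(dbname="rxivist", user="postgres", host="localhost")
with conn.cursor() as cur:
    # 'Total authors': distinct authors listed on any preprint posted in each year.
    cur.execute("""
        SELECT EXTRACT(YEAR FROM a.posted) AS year,
               COUNT(DISTINCT aa.author) AS total_authors
        FROM article_authors aa
        JOIN articles a ON a.id = aa.article
        GROUP BY year
        ORDER BY year;
    """)
    print(cur.fetchall())

    # 'New authors': authors whose earliest preprint was posted in that year.
    cur.execute("""
        SELECT first_year, COUNT(*) AS new_authors
        FROM (
            SELECT aa.author, MIN(EXTRACT(YEAR FROM a.posted)) AS first_year
            FROM article_authors aa
            JOIN articles a ON a.id = aa.article
            GROUP BY aa.author
        ) firsts
        GROUP BY first_year
        ORDER BY first_year;
    """)
    print(cur.fetchall())
conn.close()
```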

Even though 129,419 authors (76.0%) are associated with only a single preprint, the mean preprints per author is 1.52 because of a skewed rate of contributions also found in conventional publishing (Rørstad and Aksnes, 2015): 10% of authors account for 72.8% of all preprints, and the most prolific researcher on bioRxiv, George Davey Smith, is listed on 97 preprints across seven categories (Figure 1—source data 5). A total of 1,473 authors list their most recent affiliation as Stanford University, the most represented institution on bioRxiv (Figure 1—source data 7). Though the majority of the top 100 universities (by author count) are based in the United States, five of the top 11 are from Great Britain. These results rely on data provided by authors, however, and are confounded by varying levels of specificity: while 530 authors report their affiliation as ‘Harvard University,’ for example, there are 528 different institutions that include the phrase ‘Harvard,’ and the four preprints from the ‘Wyss Institute for Biologically Inspired Engineering at Harvard University’ do not count toward the ‘Harvard University’ total.

Publication outcomes

In addition to monthly download statistics, bioRxiv also records whether a preprint has been published elsewhere, and in what journal. In total, 15,797 bioRxiv preprints have been published, or 42.0% of all preprints on the site (Figure 3a), according to bioRxiv’s records linking preprints to their external publications. Proportionally, evolutionary biology preprints have the highest publication rate of the bioRxiv categories: 51.5% of all bioRxiv evolutionary biology preprints have been published in a journal (Figure 3b). Examining the raw number of publications per category, neuroscience again comes out on top, with 2,608 preprints in that category published elsewhere (Figure 3c). When comparing the publication rates of preprints posted in each month, we see that more recent preprints are published at a rate close to zero, followed by an increase in the rate of publication every month for about 12–18 months (Figure 3a). A similar dynamic was observed in a study of preprints posted to arXiv; after recording lower rates in the most recent time periods, Larivière et al. found publication rates of arXiv preprints leveled out at about 73% (Larivière et al., 2014). Of bioRxiv preprints posted between 2013 and the end of 2016, 67.0% have been published; if 2017 papers are included, that number falls to 64.0%. Of preprints posted in 2018, only 20.0% have been published elsewhere (Figure 3a).

Figure 3 (with 1 supplement). Characteristics of the bioRxiv preprints published in journals, across the 27 subject collections.

(a) The proportion of preprints that have been published (y-axis), broken down by the month in which the preprint was first posted (x-axis). (b) The proportion of preprints in each category that have been published elsewhere. The dashed line marks the overall proportion of bioRxiv preprints that have been published and is at the same position as the dashed line in panel 3a. (c) The number of preprints in each category that have been published in a journal.

https://doi.org/10.7554/eLife.45133.026

These publication statistics are based on data produced by bioRxiv’s internal system that links publications to their preprint versions, a difficult challenge that appears to rely heavily on title-based matching. To better understand the reliability of the linking between preprints and their published versions, we selected a sample of 120 preprints that were not indicated as being published, and manually validated their publication status using Google and Google Scholar (see Methods). Overall, 37.5% of these ‘unpublished’ preprints had actually appeared in a journal. We found earlier years to have a much higher false-negative rate: 53% of the evaluated ‘unpublished’ preprints from 2015 had actually been published, though that number dropped to less than 17% in 2017 (Figure 3—figure supplement 1). While a more robust study would be required to draw more detailed conclusions about the ‘true’ publication rate, this preliminary examination suggests the data from bioRxiv may be an underestimation of the number of preprints that have actually been published.

Overall, 15,797 bioRxiv preprints have appeared in 1,531 different journals (Figure 4). Scientific Reports has published the most, with 828 papers, followed by eLife and PLOS ONE with 750 and 741 papers, respectively. However, considering the proportion of preprints of the total papers published in each journal can lead to a different interpretation. For example, Scientific Reports published 398 bioRxiv preprints in 2018, but this represents 2.36% of the 16,899 articles it published in that year, as indexed by Web of Science (Figure 4—source data 2). In contrast, eLife published almost as many bioRxiv preprints (394), which means more than a third of their 1,172 articles from 2018 first appeared on bioRxiv. GigaScience had the highest proportion of articles from preprints in 2018 (49.4% of 89 articles), followed by Genome Biology (39.9% of 183 articles) and Genome Research (36.7% of 169 articles). Incorporating all years in which bioRxiv preprints have been published (2014–2018), these are also the three top journals.

Figure 4. A stacked bar graph showing the 30 journals that have published the most bioRxiv preprints.

The bars indicate the number of preprints published by each journal, broken down by the bioRxiv categories to which the preprints were originally posted.

https://doi.org/10.7554/eLife.45133.031

Some journals have accepted a broad range of preprints, though none has published preprints from all 27 of bioRxiv’s categories – PLOS ONE has the most diverse category list, with 26. (It has yet to publish a preprint from the clinical trials collection, bioRxiv's second-smallest.) Other journals are much more specialized, though in expected ways. Of the 172 bioRxiv preprints published by The Journal of Neuroscience, 169 were in the neuroscience category, and three were from animal behavior and cognition. Similarly, NeuroImage has published 211 neuroscience papers, two in bioinformatics, and one in bioengineering. It should be noted that these counts are based on the publications detected by bioRxiv and linked to their preprints, so some journals – for example, those that more frequently rewrite the titles of articles – may be underrepresented here.

When evaluating the downloads of preprints published in individual journals (Figure 5), there is a significant positive correlation between the median downloads per paper and journal impact factor (JIF): in general, journals with higher impact factors (Clarivate Analytics, 2018) publish preprints that have more downloads. For example, Nature Methods (2017 JIF 26.919) has published 119 bioRxiv preprints; the median download count of these preprints is 2,266. By comparison, PLOS ONE (2017 JIF 2.766) has published 719 preprints with a median download count of 279 (Figure 5). In this analysis, each data point in the regression represented a journal, indicating its JIF and the median downloads per paper for the preprints it had published. We found a significant positive correlation between these two measurements (Kendall’s τb=0.5862, p=1.364e-06). We also found a similar, albeit weaker, correlation when we performed another analysis in which each data point represented a single preprint (n=7,445; Kendall’s τb=0.2053, p=9.311e-152; see Methods).
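A minimal sketch of the journal-level correlation test using SciPy, with placeholder values standing in for the real per-journal medians and impact factors; scipy.stats.kendalltau computes the tau-b variant by default.

```python
from scipy.stats import kendalltau

# One (median downloads, impact factor) pair per journal; placeholder values.
median_downloads = [2266, 279, 496, 850, 1200]
impact_factors   = [26.919, 2.766, 5.1, 9.2, 14.0]

tau, p = kendalltau(median_downloads, impact_factors)  # tau-b by default
print(f"Kendall tau-b = {tau:.4f}, p = {p:.3e}")
```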

Figure 5. A modified box plot (without whiskers) illustrating the median downloads of the bioRxiv preprints published in each journal.

Each box illustrates the journal’s first quartile, median, and third quartile, as in Figure 2c. Colors correspond to journal access policy as described in the legend. Inset: A scatterplot in which each point represents an academic journal, showing the relationship between the median downloads of the bioRxiv preprints published in the journal (x-axis) and its 2017 journal impact factor (y-axis). The size of each point is scaled to reflect the total number of bioRxiv preprints published by that journal. The regression line in this plot was calculated using the ‘lm’ function in the R ‘stats’ package, but all reported statistics use the Kendall rank correlation coefficient, which does not make as many assumptions about normality or homoscedasticity.

https://doi.org/10.7554/eLife.45133.036

It is important to note that we did not evaluate when these downloads occurred, relative to a preprint's publication. While it looks like accruing more downloads makes it more likely that a preprint will appear in a higher impact journal, it is also possible that appearance in particular journals drives bioRxiv downloads after publication. The Rxivist dataset has already been used to begin evaluating questions like this (Kramer, 2019), and further study may be able to unravel the links, if any, between downloads and journals.

If journals are driving post-publication downloads on bioRxiv, however, their efforts are curiously consistent: preprints that have been published elsewhere have almost twice as many downloads as preprints that have not (Table 2; Mann–Whitney U test, p<2.2e-16). Among papers that have not been published, the median number of downloads per preprint is 208; for preprints that have been published, the median is 394. When preprints published in 2018 are excluded from this calculation, the difference between published and unpublished preprints shrinks, but remains significant (Table 2; Mann–Whitney U test, p<2.2e-16). Though preprints posted in 2018 received more downloads in 2018 than preprints posted in previous years did (Figure 2—figure supplement 3), it appears they have not yet had time to accumulate as many downloads as papers from previous years (Figure 2—figure supplement 4).

Table 2
A comparison of the median downloads per preprint for bioRxiv preprints that have been published elsewhere to those that have not.

See Methods section for description of tests used.

https://doi.org/10.7554/eLife.45133.039
Posted              Published    Unpublished
2017 and earlier    465          414
Through 2018        394          208

We also retrieved the publication date for all published preprints using the Crossref ‘Metadata Delivery’ API (Crossref, 2018). This, combined with the bioRxiv data, gives us a comprehensive picture of the interval between the date a preprint is first posted to bioRxiv and the date it is published by a journal. These data show the median interval is 166 days, or about 5.5 months. 75% of preprints are published within 247 days of appearing on bioRxiv, and 90% are published within 346 days (Figure 6a). The median interval we found at the end of November 2018 (166 days) is a 23.9% increase over the 134-day median interval reported by bioRxiv in mid-2016 (Inglis and Sever, 2016).

Figure 6. The interval between the date a preprint is posted to bioRxiv and the date it is first published elsewhere.

(a) A histogram showing the distribution of publication intervals. The x-axis indicates the time between preprint posting and journal publication; the y-axis indicates how many preprints fall within the limits of each bin. The yellow line indicates the median; the same data is also visualized using a boxplot above the histogram. (b) The publication intervals of preprints, broken down by the journal in which each appeared. The journals in this list are the 30 journals that have published the most total bioRxiv preprints; the plot for each journal indicates the density distribution of the preprints published by that journal, excluding any papers that were posted to bioRxiv after publication. Portions of the distributions beyond 1,000 days are not displayed.

https://doi.org/10.7554/eLife.45133.041

We also used these data to further examine patterns in the properties of the preprints that appear in individual journals. The journal with the longest median interval between bioRxiv posting and publication is Nature Genetics, at 272 days (Figure 6b), a significant difference from every journal except Genome Research (Kruskal–Wallis rank sum test, p<2.2e-16; Dunn’s test q<0.05 comparing Nature Genetics to all other journals except Genome Research, after Benjamini–Hochberg correction). Among the 30 journals publishing the most bioRxiv preprints, the journal with the most rapid transition from bioRxiv to publication is G3, whose median of 119 days is significantly different from those of all journals except Genetics, mBio, and The Biophysical Journal (Figure 6b).

It is important to note that this metric does not directly evaluate the production processes at individual journals. Authors submit preprints to bioRxiv at different points in the publication process and may work with multiple journals before publication, so individual data points capture a variety of experiences. For example, 122 preprints were published within a week of being posted to bioRxiv, and the longest period between preprint and publication is 3 years, 7 months and 2 days, for a preprint that was posted in March 2015 and not published until October 2018 (Figure 6a).

Discussion

Biology preprints have a growing presence in scientific communication, and we now have ongoing, detailed data to quantify this process. The ability to better characterize the preprint ecosystem can inform decision-making at multiple levels. For authors, particularly those looking for feedback from the community, our results show bioRxiv preprints are being downloaded more than one million times per month, and that an average paper can receive hundreds of downloads in its first few months online (Figure 2—figure supplement 1). Serghiou and Ioannidis (2018) evaluated download metrics for bioRxiv preprints through 2016 and found an almost identical median for downloads in a preprint’s first month; we have expanded this to include more detailed longitudinal traffic metrics for the entire bioRxiv collection (Figure 2b).

For readers, we show that thousands of new preprints are being posted every month. This tracks closely with a widely referenced summary of submissions to preprint servers (PrePubMed, 2018) generated monthly by PrePubMed (http://www.prepubmed.org) and expands on submission data collected by researchers using custom web scrapers of their own (Stuart, 2016; Stuart, 2017; Holdgraf, 2016). There is also enough data to provide some evidence against the perception that research posted as preprints is less rigorous than papers appearing in journals (Nature Biotechnology, 2017; Vale, 2015). In short, the majority of bioRxiv preprints do appear in journals eventually, and potentially with very few differences: an analysis of published preprints that had first been posted to arXiv.org found that ‘the vast majority of final published papers are largely indistinguishable from their pre-print versions’ (Klein et al., 2016). A 2016 project measured which journals had published the most bioRxiv preprints (Schmid, 2016); despite a six-fold increase in the number of published preprints since then, 23 of the top 30 journals found in their results are also in the top 30 journals we found (Figure 5).

For authors, we also have a clearer picture of the fate of preprints after they are shared online. Among preprints that are eventually published, we found that 75% have appeared in a journal by the time they had spent 247 days (about eight months) on bioRxiv. This interval is similar to results from Larivière et al. showing preprints on arXiv were most frequently published within a year of being posted there (Larivière et al., 2014), and to a later study examining bioRxiv preprints that found ‘the probability of publication in the peer-reviewed literature was 48% within 12 months’ (Serghiou and Ioannidis, 2018). Another study published in spring 2017 found that 33.6% of preprints from 2015 and earlier had been published (Schloss, 2017); our data through November 2018 show that 68.2% of preprints from 2015 and earlier have been published. Multiple studies have examined the interval between submission and publication at individual journals (e.g. Himmelstein, 2016a; Royle, 2015; Powell, 2016), but the incorporation of information about preprints is not as common.

We also found a positive correlation between the impact factor of journals and the number of downloads received by the preprints they have published. This finding in particular should be interpreted with caution. Journal impact factor is broadly intended to be a measurement of how citable a journal’s ‘average’ paper is (Garfield, 2006), though it morphed long ago into an unfounded proxy for scientific quality in individual papers (The PLoS Medicine Editors, 2006). It is referenced here only as an observation about a journal-level metric correlated with preprint downloads; there is no indication that either factor is influencing the other, nor that download numbers play a direct role in publication decisions.

More broadly, our granular data provide a new level of detail for researchers looking to evaluate many remaining questions. What factors may impact the interval between when a preprint is posted to bioRxiv and when it is published elsewhere? Does a paper’s presence on bioRxiv have any relationship to its eventual citation count once it is published in a journal, as has been found with arXiv (e.g. Feldman et al., 2018; Wang et al., 2018; Schwarz and Kennicutt, 2004)? What can we learn from ‘altmetrics’ as they relate to preprints, and is there value in measuring a preprint’s impact using methods rooted in online interactions rather than citation count (Haustein, 2018)? One study, published before bioRxiv launched, found a significant association between Twitter mentions of published papers and their citation counts (Thelwall et al., 2013) – have preprints changed this dynamic?

Researchers have used existing resources and custom scripts to answer questions like these. Himmelstein found that only 17.8% of bioRxiv papers had an ‘open license’ (Himmelstein, 2016b), for example, and another study examined the relationship between Facebook ‘likes’ of preprints and ‘traditional impact indicators’ such as citation count, but found no correlation for papers on bioRxiv (Ringelhan et al., 2015). Since most bioRxiv data is not programmatically accessible, many of these studies had to begin by scraping data from the bioRxiv website itself. The Rxivist API allows users to request the details of any preprint or author on bioRxiv, and the database snapshots enable bulk querying of preprints using SQL, C and several other languages (PostgreSQL Global Development Group, 2018) at a level of complexity currently unavailable using the standard bioRxiv web interface. Using these resources, researchers can now perform detailed and robust bibliometric analysis of the website with the largest collection of preprints in biology, the one that, beginning in September 2018, held more biology preprints than all other major preprint servers combined (Anaya, 2018).

In addition to our analysis here that focuses on big-picture trends related to bioRxiv, the Rxivist website provides many additional features that may interest preprint readers and authors. Its primary feature is sorting and filtering preprints by download count or mentions on Twitter, to help users find preprints in particular categories that are being discussed either in the short term (Twitter) or over the span of months (downloads). Tracking these metrics could also help authors gauge public reaction to their work. While bioRxiv has compensated for a low rate of comments posted on the site itself (Inglis and Sever, 2016) by highlighting external sources such as tweets and blogs, Rxivist provides additional context for how a preprint compares to others on similar topics. Several other sites have attempted to use social interaction data to ‘rank’ preprints, though none incorporate bioRxiv download metrics. The ‘Assert’ web application (https://assert.pub) ranks preprints from multiple repositories based on data from Twitter and GitHub. The ‘PromisingPreprints’ Twitter bot (https://twitter.com/PromPreprint) accomplishes a similar goal, posting links to bioRxiv preprints that receive an exceptionally high social media attention score (Altmetric Support, 2018) from Altmetric (https://www.altmetric.com) in their first week on bioRxiv (De Coster, 2017). Arxiv Sanity Preserver (http://www.arxiv-sanity.com) provides rankings of arXiv.org preprints based on Twitter activity, though its implementation of this scoring (Karpathy, 2018) is more opinionated than that of Rxivist. Other websites perform similar curation, but based on user interactions within the sites themselves: SciRate (https://scirate.com), Paperkast (https://paperkast.com) and upvote.pub allow users to vote on articles that should receive more attention (van der Silk et al., 2018; Özturan, 2018), though upvote.pub is no longer online (upvote.pub, 2018). By comparison, Rxivist doesn't rely on user interaction – by pulling ‘popularity’ metrics from Twitter and bioRxiv, we aim to decouple the quality of our data from the popularity of the website itself.

In summary, our approach provides multiple perspectives on trends in biology preprints: (1) the Rxivist.org website, where readers can prioritize preprints and generate reading lists tailored to specific topics; (2) a dataset that can provide a foundation for developers and bibliometric researchers to build new tools, websites and studies that can further improve the ways we interact with preprints; and (3) an analysis that brings together a comprehensive summary of trends in bioRxiv preprints and an examination of the crossover points between preprints and conventional publishing.

Methods

Data availability

Multiple web resources are associated with this project; they are described below.

The Rxivist website

We attempted to put the Rxivist data to good use in a relatively straightforward web application. Its main offering is a ranked list of all bioRxiv preprints that can be filtered by areas of interest. The rankings are based on two available metrics: either the count of PDF downloads, as reported by bioRxiv, or the number of Twitter messages linking to that preprint, as reported by Crossref (https://crossref.org). Users can also specify a timeframe for the search – for example, one could request the most downloaded preprints in microbiology over the last two months, or view the preprints with the most Twitter activity since yesterday across all categories. Each preprint and each author is given a separate profile page, populated only by Rxivist data available from the API. These include rankings across multiple categories, plus a visualization of where the download totals for each preprint (and author) fall in the overall distribution across all 37,000 preprints and 170,000 authors.

The Rxivist API and dataset

The full data described in this paper is available through Rxivist.org, a website developed for this purpose. BioRxiv data is available from Rxivist in two formats: (1) SQL ‘database dumps’ are currently pulled and published weekly on zenodo.org. (See Supplementary Information for a visualization and description of the schema.) These convert the entire Rxivist database into binary files that can be loaded by the free and open-source PostgreSQL database management system to provide a local copy of all collected data on every article and author on bioRxiv.org. (2) We also provide an API (application programming interface) from which users can request information in JSON format about individual preprints and authors, or search for preprints based on similar criteria available on the Rxivist website. Complete documentation is available at https://www.rxivist.org/docs.
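As an illustration of the second format, the short script below queries the API for preprints matching a search term. The URL, parameter names, and response fields shown here are assumptions for the sake of the example; the documentation at https://www.rxivist.org/docs is authoritative.

```python
import requests

# Illustrative request for the five most-downloaded preprints matching a
# search term. Endpoint shape and field names are assumptions; see
# https://www.rxivist.org/docs for the authoritative definitions.
resp = requests.get(
    "https://api.rxivist.org/v1/papers",
    params={"q": "microbiome", "metric": "downloads", "page_size": 5},
)
resp.raise_for_status()
for paper in resp.json().get("results", []):
    print(paper.get("title"))
```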

While the analysis presented here deals mostly with overall trends on bioRxiv, the primary entity of the Rxivist API is the individual research preprint, for which we have a straightforward collection of metadata: title, abstract, DOI (digital object identifier), the name of any journal that has also published the preprint (and its new DOI), and which collection the preprint was submitted to. We also collected monthly traffic information for each preprint, as reported by bioRxiv. We use the PDF download statistics to generate rankings for each preprint, both site-wide and for each collection, over multiple timeframes (all-time, year to date, etc.). In the API and its underlying database schema, ‘authors’ exist separately from ‘preprints’ because an author can be associated with multiple preprints. They are recorded with three main pieces of data: name, institutional affiliation and a unique identifier issued by ORCID. Like preprints, authors are ranked based on the cumulative downloads of all their preprints, and separately based on downloads within individual bioRxiv collections. Emails are collected for each researcher, but are not necessarily unique (see ‘Consolidation of author identities’ below).

Data acquisition

Web crawler design

To collect information on all bioRxiv preprints, we developed an application that pulled preprint data directly from the bioRxiv website. The primary issue with managing this data is keeping it up to date: Rxivist aims to essentially maintain an accurate copy of a subset of bioRxiv’s production database, which means routinely running a web crawler against the website to find any new or updated content as it is posted. We have tried to find a balance between timely updates and observing courteous web crawler behavior; currently, each preprint is re-crawled once every two to three weeks to refresh its download metrics and publication status. The web crawler itself uses Python 3 and requires two primary modules for interacting with external services: Requests-HTML (Reitz, 2018) is used for fetching individual web pages and pulling out the relevant data, and the psycopg2 module (Di Gregorio and Varrazzo, 2018) is used to communicate with the PostgreSQL database that stores all of the Rxivist data (PostgreSQL Global Development Group, 2017). PostgreSQL was selected over other similar database management systems because of its native support for text search, which, in our implementation, enables users to search for preprints based on the contents of their titles, abstracts and author list. The API, spider and web application are all hosted within separate Docker containers (Docker Inc, 2018), a decision we made to simplify the logistics required for others to deploy the components on their own: Docker is the only dependency, so most workstations and servers should be able to run any of the components.

New preprints are recorded by parsing the section of the bioRxiv website that lists all preprints in reverse-chronological order. At this point, a preprint’s title, URL and DOI are recorded. The bioRxiv webpage for each preprint is then crawled to obtain details only available there: the abstract, the date the preprint was first posted, and monthly download statistics are pulled from here, as well as information about the preprint’s authors – name, email address and institution. These authors are then compared against the list of those already indexed by Rxivist, and any unrecognized authors have profiles created in the database.
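A minimal sketch of this crawling step, using the Requests-HTML module named above; the listing URL and CSS selector are assumptions, since the real crawler's selectors depend on bioRxiv's current markup.

```python
from requests_html import HTMLSession

session = HTMLSession()

def crawl_listing(url):
    """Fetch one bioRxiv listing page and yield preprint titles and links.

    The CSS selector below is an assumption standing in for whatever
    markup bioRxiv currently serves; the real crawler also extracts
    DOIs and walks the reverse-chronological listing page by page."""
    page = session.get(url)
    for link in page.html.find("a.highwire-cite-linked-title"):
        yield link.text, link.absolute_links

# Illustrative usage:
# for title, links in crawl_listing("https://www.biorxiv.org/content/early/recent"):
#     print(title, links)
```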

Consolidation of author identities

Authors are most reliably identified across multiple papers using the bioRxiv feature that allows authors to specify an identifier provided by ORCID (https://orcid.org), a nonprofit that provides a voluntary system to create unique identification numbers for individuals. These ORCID (Open Researcher and Contributor ID) numbers are intended to serve approximately the same role for authors that DOIs do for papers (Haak, 2012), providing a way to identify individuals whose other information may change over time. 29,559 bioRxiv authors, or 17.4%, have an associated ORCID. If an individual included in a preprint’s list of authors does not have an ORCID already recorded in the database, authors are consolidated if they have an identical name to an existing Rxivist author.

There are certainly authors who are duplicated within the Rxivist database, an issue arising mostly from unreliable source data. For example, 68.4% of indexed authors have at least one email address associated with them, including 7,085 (4.40%) authors with more than one. However, of the 118,490 email addresses in the Rxivist database, 6,517 (5.50%) are duplicates that are associated with more than one author. Some of these exist because real-life authors occasionally appear under multiple names, but other duplicates are caused by uploaders to bioRxiv using the same email address for multiple authors on the same preprint, making it far more difficult to use email addresses as unique identifiers. There are also cases like one from 2017, in which 16 of the 17 authors of a preprint were listed with the email address ‘test@test.com.’

Inconsistent naming patterns cause another chronic issue that is harder to detect and account for. For example, at one point thousands of duplicate authors were indexed in the Rxivist database with various versions of the same name – including a full middle name, or a middle initial, or a middle initial with a period, and so on – which would all have been recorded as separate people if they did not all share an ORCID, to say nothing of authors who occasionally skip specifying a middle initial altogether. Accommodations could be made to account for inconsistencies such as these (using institutional affiliation or email address as clues, for example), but these methods also have the potential to increase the opposite problem of incorrectly combining different authors with similar names who intentionally introduce slight modifications such as a middle initial to help differentiate themselves. One allowance was made to normalize author names: when the web crawler searches for name matches in the database, periods are now ignored in string matches, so ‘John Q. Public’ would be a match with ‘John Q Public.’ The other naming problem we encountered was of the opposite variety: multiple authors with identical names (and no ORCID). For example, the Rxivist profile for author ‘Wei Wang’ is associated with 40 preprints and 21 different email addresses but is certainly the conglomeration of multiple researchers. A study of more than 30,000 Norwegian researchers found that when using full names rather than initials, the rate of name collisions was 1.4% (Aksnes, 2008).
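A sketch of the matching rules described in this section; the dictionary fields are illustrative, since the real implementation works against the database.

```python
def normalize_name(name: str) -> str:
    """Ignore case and periods when matching names, so 'John Q. Public'
    matches 'John Q Public' (the allowance described above)."""
    return name.replace(".", "").strip().lower()

def same_author(existing: dict, candidate: dict) -> bool:
    """Consolidation rule sketch: an ORCID match is authoritative;
    otherwise fall back to an exact normalized-name match."""
    if existing.get("orcid") and candidate.get("orcid"):
        return existing["orcid"] == candidate["orcid"]
    return normalize_name(existing["name"]) == normalize_name(candidate["name"])

assert same_author({"name": "John Q. Public"}, {"name": "john q public"})
```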

Retrieval of publication date information

Publication dates were pulled from the Crossref Metadata Delivery API (Crossref, 2018) using the publication DOI numbers provided by bioRxiv. Dates were found for all but 31 (0.2%) of the 15,797 published bioRxiv preprints. Because journals measure publication date in different ways, several metrics were used. If a ‘published—online’ date was available from Crossref with a day, month and year, then that was recorded. If not, ‘published—print’ was used, and the Crossref ‘created’ date was the final option evaluated. Requests for which we received a 404 response were assigned a publication date of 1 Jan 1900, to prevent further attempts to fetch a date for those entries. It appears these articles were published, but with DOIs that were not registered correctly by the destination journal; for consistency, these results were filtered out of the analysis. There was no practical way to validate the nearly 16,000 values retrieved, but anecdotal evaluation reveals some inconsistencies. For example, the preprint with the longest interval before publication (1,371 days) has a publication date reported by Crossref of 1 Jul 2018, when it appeared in IEEE/ACM Transactions on Computational Biology and Bioinformatics 15(4). However, the IEEE website lists a date of 15 Dec 2015, two and a half years earlier, as that paper’s ‘publication date,’ which they define as ‘the very first instance of public dissemination of content.’ Since every publisher is free to make their own unique distinctions, these data are difficult to compare at a granular level.
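A sketch of this fallback logic against the Crossref REST API; the endpoint and response structure are Crossref's, but the function itself is illustrative rather than the project's exact code.

```python
import requests

def publication_date(doi):
    """Return [year, month, day] for a published DOI, preferring
    'published-online', then 'published-print', then 'created',
    as described above; None if Crossref has no record."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}")
    if resp.status_code == 404:
        return None  # the paper assigns these 1 Jan 1900 and filters them out
    work = resp.json()["message"]
    for field in ("published-online", "published-print", "created"):
        date = work.get(field, {}).get("date-parts", [[None]])[0]
        if len(date) == 3:  # require a full day, month, and year
            return date
    return None

print(publication_date("10.7554/eLife.45133"))
```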

Calculation of download rankings

The web crawler’s ‘ranking’ step orders preprints and authors based on download count in two populations (overall and by bioRxiv category) and over several periods: all-time, year-to-date, and since the beginning of the previous month. The last metric was chosen over a ‘month-to-date’ ranking to avoid ordering papers based on the very limited traffic data available in the first days of each month – in addition to a short lag in the time bioRxiv takes to report downloads, an individual preprint’s download metrics may only be updated in the Rxivist database once every two or three weeks, so metrics for a single month will be biased in favor of those that happen to have been crawled most recently. This effect is not eliminated in longer windows, but is diminished. The step recording the rankings takes a more unusual approach to loading the data. Because each article ranking step could require more than 37,000 ‘insert’ or ‘update’ statements, and each author ranking requires more than 170,000 of the same, these modifications are instead written to a text file on the application server and loaded by running an instance of the Postgres command-line client ‘psql,’ which can use the more efficient ‘copy’ command. This change reduced the duration of the ranking process from several hours to less than one minute; a sketch of an equivalent bulk load appears below.
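The sketch below performs an equivalent bulk load from Python using psycopg2's copy_expert rather than the psql client described above; the table and column names are assumptions for illustration.

```python
import io
import psycopg2

conn = psycopg2.connect(dbname="rxivist", user="postgres", host="localhost")

# Write rankings to an in-memory tab-separated "file" instead of issuing
# tens of thousands of individual INSERT/UPDATE statements...
rows = [(14, 1), (7, 2), (92, 3)]  # (article id, rank): placeholder values
buffer = io.StringIO("".join(f"{article}\t{rank}\n" for article, rank in rows))

with conn.cursor() as cur:
    # ...and bulk-load it with COPY. Table and column names are assumptions.
    cur.copy_expert("COPY article_ranks (article, rank) FROM STDIN", buffer)
conn.commit()
conn.close()
```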

Reporting of small p-values

In several locations, p-values are reported as <2.2e-16. It is important to note that this is an inequality, and these p-values are not necessarily identical. The upper limit, 2.2 × 10−16, is not itself a particularly meaningful number and is an artifact of the limitations of the floating-point arithmetic used by R, the software used in the analysis. 2.2 × 10−16 is the ‘machine epsilon,’ or the smallest number that can be added to 1.0 that would generate a result measurably different from 1.0. Though smaller numbers can be represented by the system, those smaller than the machine epsilon are not reported by default; we elected to do the same.
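The same limit can be inspected in Python, which uses the same IEEE 754 double-precision arithmetic as R.

```python
import sys

# Python uses the same IEEE 754 doubles as R, so the machine epsilon matches
# R's .Machine$double.eps: about 2.2e-16.
eps = sys.float_info.epsilon
print(eps)                    # 2.220446049250313e-16
print(1.0 + eps > 1.0)        # True: epsilon is distinguishable here
print(1.0 + eps / 2 == 1.0)   # True: below epsilon, the addition vanishes
```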

Data preparation

Several steps were taken to organize the data that was used for this paper. First, the production data being used for the Rxivist API was copied to a separate ‘schema’ – a PostgreSQL term for a named set of tables. This was identical to the full database, but had a specifically circumscribed set of preprints. Once this was copied, the table containing the associations between authors and each of their papers (‘article_authors’) was pruned to remove references to any articles that were posted after 30 Nov 2018, and any articles that were not associated with a bioRxiv collection. For unknown reasons, 10 preprints (0.03%) could not be associated with a bioRxiv collection; because the bioRxiv profile page for some papers does not specify which collection it belongs to, these papers were ignored. Once these associations were removed, any articles meeting those criteria were removed from the ‘articles’ table. References to these articles were also removed from the table containing monthly bioRxiv download metrics for each paper (‘article_traffic’). We also removed all entries from the ‘article_traffic’ table that recorded downloads after November 2018. Next, the table containing author email addresses (‘author_emails’) was pruned to remove emails associated with any author that had zero preprints in the new set of papers; those authors were then removed from the ‘authors’ table.

Before evaluating data from the table linking published preprints to journals and their post-publication DOI (‘article_publications’), journal names were consolidated to avoid under-counting journals with spelling inconsistencies. First, capitalization was stripped from all journal titles, and inconsistent articles (‘The Journal of…’ vs. ‘Journal of…’; ‘and’ vs. ‘&’ and so on) were removed. Then, the list of journals was reviewed by hand to remove duplication more difficult to capture automatically: ‘PNAS’ and ‘Proceedings of the National Academy of Sciences,’ for example. Misspellings were rare, but one publication in ‘integrrative biology’ did appear. See figures.md in the project's GitHub repository (https://github.com/blekhmanlab/rxivist/blob/master/paper/figures.md) for a full list of corrections made to journal titles. We also evaluated preprints for publication in ‘predatory journals,’ organizations that use irresponsibly low academic standards to bolster income from publication fees (Xia et al., 2015). A search for 1,345 journals based on the list compiled by Stop Predatory Journals (https://predatoryjournals.com) showed that bioRxiv publication data did not include any instances of papers appearing in those journals (Stop Predatory Journals, 2018). It is important to note that the absence of this information does not necessarily indicate that preprints have not appeared in these journals – we performed this search to ensure our analysis of publication rates was not inflated with numbers from illegitimate publications.
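A sketch of the automatic portion of this title consolidation; the hand-curated merges described above (such as ‘PNAS’) would still be applied afterward.

```python
import re

def normalize_journal(title: str) -> str:
    """Strip capitalization, leading articles, and '&'/'and' inconsistencies,
    as described above; hand-curated merges (e.g. 'PNAS') come afterward."""
    t = title.strip().lower()
    t = re.sub(r"^the\s+", "", t)
    t = t.replace("&", "and")
    return re.sub(r"\s+", " ", t)

assert normalize_journal("The Journal of Biochemistry") == normalize_journal("Journal of Biochemistry")
assert normalize_journal("Science & Justice") == normalize_journal("science and justice")
```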

Data analysis

Reproduction of figures

Two files are needed to recreate the figures in this manuscript: a compressed database backup containing a snapshot of the data used in this analysis, and a file called figures.md storing the SQL queries and R code necessary to organize the data and draw the figures. The PostgreSQL documentation for restoring database dumps should provide the necessary steps to ‘inflate’ the database snapshot, and each figure and table is listed in figures.md with the queries to generate comma-separated values files that provide the data underlying each figure. (Those who wish to skip the database reconstruction step will find CSVs for each figure provided along with these other files.) Once the data for each figure are pulled into files, executing the accompanying R code should create figures containing exactly the data displayed here.

Tallying institutional authors and preprints

When reporting the counts of bioRxiv authors associated with individual universities, there are several important caveats. First, these counts only include the most recently observed institution for an author on bioRxiv: if someone submits 15 preprints at Stanford, then moves to the University of Iowa and posts another preprint afterward, that author will be associated with the University of Iowa, which will receive all 16 preprints in the inventory. Second, this count is also confounded by inconsistencies in the way authors report their affiliations: for example, ‘Northwestern University,’ which has 396 preprints, is counted separately from ‘Northwestern University Feinberg School of Medicine,’ which has 76. Overlaps such as these were not filtered, though commas in institution names were omitted when grouping preprints together.

Evaluation of publication rates

Data referenced in this manuscript is limited to preprints posted through the end of November 2018. However, determining which preprints had been published in journals by the end of November required refreshing the entries for all 37,000 preprints after the month ended. Consequently, it is possible that papers published after the end of November (but not after the first weeks of December) are included in the publication statistics.

Estimation of ranges for true publication rates

To evaluate the sensitivity of the system bioRxiv uses to detect published versions of preprints, we pulled a random sample of 120 preprints that had not been marked as published on bioRxiv.org – 30 preprints from each year between 2014 and 2017. We then performed a manual online literature search for each paper to determine whether it had been published. The primary search method was a Google search for the preprint’s title and the senior author’s last name. If this did not return any results that looked like publications, other author names were added to the search in place of the senior author’s. If this still returned no positive results, we checked Google Scholar (https://scholar.google.com) for papers with similar titles. If any of the preprint’s authors, particularly the first and last authors, had Google Scholar profiles, these were reviewed for publications on subject matter similar to the preprint. If a publication looked similar to the preprint, a visual comparison of the abstracts and introductions was used to determine whether they were simply different versions of the same paper. The paper was marked as a true negative if none of these steps returned positive results, or if the suspected published paper described a study different enough that the preprint effectively described a separate research project.

Once all 120 preprints had been evaluated, the results were used to approximate a false-negative rate for each year – the proportion of preprints that had been incorrectly excluded from the list of published papers. The sample size for each year (30) was used to calculate the margin of error using a 95% confidence interval (17.89 percentage points). This margin was then used to generate the minimum and maximum false-negative rates for each year, which were then used to calculate the minimum and maximum number of incorrectly classified preprints from each year. These numbers yielded a range for each year’s actual publication rate; for 2015, for example, bioRxiv identified 1,218 preprints (out of 1,774) that had been published. The false-negative rate and margin of error suggest between 197 and 396 additional preprints have been published but not detected, yielding a final range of 1,415–1,614 preprints published in that year.
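The arithmetic behind the 2015 range quoted above can be reproduced directly; in this sketch, the 53% false-negative rate corresponds to 16 of the 30 sampled preprints.

```python
import math

# Worst-case 95% margin of error for a proportion with n = 30 samples per year:
n = 30
margin = 1.96 * math.sqrt(0.5 * 0.5 / n)
print(round(margin * 100, 2))  # 17.89 percentage points, as quoted above

# For 2015: 16 of 30 sampled 'unpublished' preprints (53%) had been published.
# bioRxiv had flagged 1,218 of 1,774 preprints as published, leaving 556.
fn_rate = 16 / 30
candidates = 1774 - 1218
low = round((fn_rate - margin) * candidates)   # ~197 undetected publications
high = round((fn_rate + margin) * candidates)  # ~396
print(1218 + low, 1218 + high)                 # 1415 1614
```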

To evaluate the specificity of the publication detection system, we pulled 40 samples (10 from each of the years listed above) that bioRxiv had listed as published, and found that all 40 had been accurately classified. Taken together, these results suggest that while bioRxiv’s list of published preprints is accurate, it does not consistently capture all preprint publications; a more precise estimate of the true publication rate would require deeper analysis and sampling.

Calculation of publication intervals

There are 15,797 distinct preprints with an associated date of publication in a journal, a corpus too large to allow detailed manual validation across hundreds of journal websites. Consequently, these dates are only as accurate as the data collected by Crossref from the publishers. We attempted to use the earliest publication date, but researchers have found that some publishers may be intentionally manipulating dates associated with publication timelines (Royle, 2015), particularly the gap between online and print publication, which can inflate journal impact factor (Tort et al., 2012). Intentional or not, these gaps may be inflating the time-to-press measurements of some preprints and journals in our analysis. In addition, there are 66 preprints (0.42%) with a publication date that falls before the date they were posted to bioRxiv; these were excluded from analyses of publication interval.

Counting authors with middle initials

To obtain the comparatively large counts of authors using one or two middle initials, results from a SQL query were used without any curation. For the counts of authors with three or four middle initials, the results of the database call were reviewed by hand to remove ‘author’ names that look like initials but are actually the names of consortia (‘International IBD Genetics Consortium’) or of authors who provided non-initialized names in all capital letters.

References

  1. Aksnes DW (2008) When different persons have an identical author name. How frequent are homonyms? Journal of the American Society for Information Science and Technology 59:838–841. https://doi.org/10.1002/asi.20788
  3. Anaya J (2018) PrePubMed: Analyses, version 674d5aa.
  6. BioRxiv (2018) Submission guide. Accessed November 30, 2018.
  8. Champieux R (2018) Gathering steam: Preprints, librarian outreach, and actions for change. The Official PLOS Blog. Accessed December 18, 2018.
  9. Chan Zuckerberg Initiative (2019) Science funding. Accessed April 17, 2019.
  10. Clarivate Analytics (2018) Journal Citation Reports Science Edition. Clarivate Analytics.
  12. Crossref (2018) Crossref metadata delivery REST API. Accessed December 19, 2018.
  13. De Coster W (2017) A Twitter bot to find the most interesting bioRxiv preprints. Gigabase or Gigabyte. Accessed December 11, 2018.
  20. Haak L (2012) The O in ORCID. ORCiD. Accessed November 30, 2018.
  23. Himmelstein D (2016a) The history of publishing delays. Satoshi Village. Accessed December 29, 2018.
  24. Himmelstein D (2016b) The licensing of bioRxiv preprints. Satoshi Village. Accessed December 29, 2018.
  25. Holdgraf CR (2016) The bleeding edge of publishing, scraping publication amounts at biorxiv. Predictably Noisy. Accessed November 30, 2018.
  26. Inglis JR, Sever R (2016) bioRxiv: a progress report. ASAPbio. Accessed December 5, 2018.
  28. Karpathy A (2018) Arxiv Sanity Preserver, "twitter_daemon.py", version 8e52b8b. https://github.com/karpathy/arxiv-sanity-preserver/blob/8e52b8ba59bfb5684f19d485d18faf4b7fba64a6/twitter_daemon.py
  29. Klein M, Broadwell P, Farb SE, Grappone T (2016) Comparing published scientific journal articles to their pre-print versions. Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries, 162.
  30. Kling R, Spector LB, Fortuna J (2003) The real stakes of virtual publishing: The transformation of E-Biomed into PubMed Central. Journal of the Association for Information Science and Technology 55:127–148.
  31. Kramer B (2019) Rxivist analysis. Google Docs. Accessed March 15, 2019.
  33. Marshall E (1999) PNAS to join PubMed Central – on condition. Science 286:655–656.
  39. PostgreSQL Global Development Group (2018) Procedural languages. PostgreSQL Documentation, version 9.4.20. Accessed January 1, 2019.
  41. PrePubMed (2018) Monthly statistics for October 2018. Accessed December 17, 2018.
  43. Reitz K (2018) Requests-HTML, version 0.9.0.
  46. Royle S (2014) What the world is waiting for. Quantixed. Accessed December 29, 2018.
  47. Royle S (2015) Waiting to happen II: publication lag times. Quantixed. Accessed December 29, 2018.
  52. Sever R (2018) Twitter post, 1 Nov 2018, 9:29 AM. Accessed April 12, 2019.
  53. Smaglik P (1999) E-biomed becomes PubMed Central. The Scientist. Accessed December 29, 2018.
  55. Stop Predatory Journals (2018) List of predatory journals. Accessed December 28, 2018.
  56. Stuart T (2016) bioRxiv. Accessed January 2, 2019.
  57. Stuart T (2017) bioRxiv 2017 update. Accessed January 2, 2019.
  58. The Lancet Electronic Research Archive (2005) ERA home. Archive.org snapshots, 22 Apr 2005 and 30 Jul 2005. Accessed January 3, 2019.
  62. upvote.pub (2018) Frontpage. Archive.org snapshot. Accessed December 29, 2018.
  65. Varmus H (1999) E-BIOMED: a proposal for electronic publications in the biomedical sciences. National Institutes of Health. Archive.org snapshot, 18 Oct 2015. Accessed December 28, 2018.
  66. Vence T (2017) Journals seek out preprints. The Scientist. Accessed January 7, 2019.
  69. Xia J, Harmon JL, Connolly KG, Donnelly RM, Anderson MR, Howard HA (2015) Who publishes in “predatory” journals? Journal of the Association for Information Science and Technology 66:1406–1417. https://doi.org/10.1002/asi.23265

Decision letter

  1. Emma Pewsey
    Reviewing Editor; eLife, United Kingdom
  2. Peter Rodgers
    Senior Editor; eLife, United Kingdom
  3. Casey S Greene
    Reviewer; University of Pennsylvania, United States

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "Tracking the popularity and outcomes of all bioRxiv preprints" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by Emma Pewsey (Associate Features Editor) and Peter Rodgers (Features Editor). The following individual involved in review of your submission has agreed to reveal his identity: Casey S Greene (Reviewer #2). The other two reviewers remain anonymous.

The reviewers have discussed the reviews with one another and the Associate Features Editor has drafted this decision to help you prepare a revised submission. In addition, a member of Reviewer #2's lab reviewed the manuscript for the life sciences overlay biOverlay (https://www.bioverlay.org/post/2019-03-tracking-the-popularity-and-outcomes-of-all-biorxiv-preprints/). The most important comments in that review are included in the points below. If there are additional comments from biOverlay that you wish to address in the revision please highlight these in your author response.

Summary:

This is the most comprehensive analysis to date of preprints in the life sciences. Although comparable analyses of the preprint ecosystem have been performed for arXiv, the preprint server that serves the physics, mathematics and computer science communities, no such analysis had been done for preprint servers that post life sciences research. Therefore, this study is both welcome and important, providing very useful information for those considering whether to post preprints and those wondering how to properly evaluate them. The reviewers also praise the web application developed by the authors that allows users to access preprint metadata, which is of broad interest in the community.

Essential revisions:

1) The analysis of publications outcomes provides a lower bound to the proportion of preprints that are eventually published. The bioRxiv-journal linking is somewhat noisy. For example, Reviewer #2 states: "On some occasions I have had to send an email to bioRxiv to establish the link for my own preprints, and have also found that other 'unpublished' preprints have occasionally already been published." To check the veracity of the reported numbers, the authors should randomly select ~100 manuscripts (perhaps stratified by year for 2014, 2015, 2016, and 2017) that are in the not-yet-published category and use Google and Google Scholar to measure the fraction that are published but missed by the bioRxiv process.

2) The authors interpret downloads as the number of people reading the paper: "We find preprints are being read more than ever before (1.1 million downloads in October 2018 alone)." These are not strictly the same and the authors should use downloads as the correct metric in the Abstract. In their Results, they should then clarify that they use downloads as a metric for readers. Instead of "Considering the number of downloads for each preprint, we find that bioRxiv's usage among readers is also increasing rapidly:" they could write: "Using downloads as a metric for readers' usage of preprints, we find that bioRxiv's usage among readers is also increasing rapidly."

3) To maximize the value of the resource and to reduce confusion around re-use attempts, the authors should apply the CC0 license to the work. The justification from Daniel Himmelstein at biOverlay provides a good rationale as to why this is likely to be the right choice:

"The Zenodo archive with the data for the study and the Zenodo archive with the current database are currently released under a CC BY-NC 4.0 License. This license forbids commercial reuse and hence is not considered an open license. Furthermore, it is a poor choice for data. First, it is unclear whether copyright applies to any aspects of the created database in the United States. Therefore, some users may decide that either no copyright applies to the data or that their reuse is fair use. For these users, the CC license is irrelevant and can be ignored. However, more cautious users or those in other jurisdictions may feel restricted by the license and hence not use the data. The NC stipulation makes the data difficult to integrate with other data. For example, if copyright does apply, then the data would be incompatible with data licensed under CC BY-SA (share alike). Finally, attribution on a dataset level is often challenging, and social norms rather than legal recourse are generally sufficient. The authors should look into placing their datasets into the public domain via a CC0 waiver/license, which is becoming common and enables the information to be ingested by other data commons such as Wikidata. Finally, it is possible that users could rerun the open source code to regenerate the database, thereby creating a parallel version that is unencumbered by potential copyright issues."

4) While the correlation between download and journal impact factor may be informative, using this metric is also problematic to assess a paper's true impact. The Discussion should address this issue.

5) In the Discussion, the authors write that "the Rxivist website provides many additional features that may interest preprint readers." The authors should also mention that this web application may interest preprint authors, as a method to assess community interest in their work. This could be especially important given the low rate of comments on preprints posted on bioRxiv (Inglis and Sever, 2016).

Minor points:

6) The authors show that published preprints are downloaded more often than unpublished preprints: "Site-wide, the median number of downloads per preprint is 208, among papers that have not been published. For preprints that have been published, the median download count is 394 (Mann-Whitney U test, p < 2.2e-16). When preprints published in 2018 are excluded from this calculation, the difference between published and unpublished preprints shrinks, but is still significant (Table 2; Mann-Whitney U test, p < 2.2e-16)." Despite the difference between published and unpublished shrinking, the p value is similar to that of the data that includes preprints posted in 2018. Is this accurate?

7) Please clarify what the dashed lines represent in Figure 2–figure supplement 1 and Figure 2–figure supplement 4.

Optional suggestions:

8) Figure 1 shows the number of preprints per field; however, different fields produce different numbers of papers. It would be helpful to provide an estimate of the relative size of each field to help understand the proportion of papers that are also submitted as preprints. The ideal analysis would include the number of bioRxiv postings by subject category vs. the number of papers that appear on PubMed for the subject category. A commenter on biOverlay mentions that they may be able to assist with this analysis: see https://hyp.is/AchpIjxgEemwB89_ndsvaw/www.bioverlay.org/post/2019-03-tracking-the-popularity-and-outcomes-of-all-biorxiv-preprints/

9) The question of whether or not paywalled articles get more preprint downloads following journal publication came up in our journal club. It may be interesting to mention the existing work to date on this topic in the discussion section: see the following hypothes.is comment on biOverlay: https://hyp.is/ThpfijxeEemiId-NPKT3nA/www.bioverlay.org/post/2019-03-tracking-the-popularity-and-outcomes-of-all-biorxiv-preprints/

10) An automatically generated visualization of the database schema in the methods would be helpful to reusers.

https://doi.org/10.7554/eLife.45133.053

Author response

We are grateful to the reviewers for their thoughtful and detailed comments, and we have made changes to address the concerns raised both in the official reviews and in the biOverlay assessment that was referenced in your March 5 decision letter.

Our revised manuscript, which we believe is much improved, includes several edits detailed below in a point-by-point response to the reviewers’ comments. Major changes include augmenting three sections with additional analysis:

1) A new analysis measuring preprint publications per journal as a proportion of their total articles (new Figure 4–figure supplement 1 and new paragraph in the Results section).

2) A new evaluation of the accuracy of bioRxiv data on the publication of preprints (new Figure 3–figure supplement 1, new paragraph in the Results section, and new subsection in the Methods).

3) A more detailed regression analysis for the relationship between Journal Impact Factor and the downloads of preprints (new text in the Results section) and a clarification on the use of Journal Impact Factor in our analysis (new paragraph in the Discussion section).

In addition, we have added several new points to the Discussion section, and have clarified wording and terminology where requested.

Essential revisions:

1) The analysis of publications outcomes provides a lower bound to the proportion of preprints that are eventually published. The bioRxiv-journal linking is somewhat noisy. For example, Reviewer #2 states: "On some occasions I have had to send an email to bioRxiv to establish the link for my own preprints, and have also found that other 'unpublished' preprints have occasionally already been published." To check the veracity of the reported numbers, the authors should randomly select ~100 manuscripts (perhaps stratified by year for 2014, 2015, 2016, and 2017) that are in the not-yet-published category and use Google and Google Scholar to measure the fraction that are published but missed by the bioRxiv process.

We thank the reviewers for this important suggestion, which we believe provides valuable context to the "Publication outcomes" section of the manuscript. To better understand the reliability of the linking between preprints and their published versions, we selected 30 preprints for each year between 2014 and 2017 that were not indicated as being published, and manually validated their publication status using Google and Google Scholar. Overall, 37.5% of the 120 “unpublished” preprints that we evaluated had actually been published. The following description of these findings was added to the Results section:

“These publication statistics are based on data produced by bioRxiv’s internal system that links publications to their preprint versions, a difficult challenge that appears to rely heavily on title-based matching. To better understand the reliability of the linking between preprints and their published versions, we selected a sample of 120 preprints that were not indicated as being published, and manually validated their publication status using Google and Google Scholar (see Methods). Overall, 37.5% of these “unpublished” preprints had actually appeared in a journal. We found earlier years to have a much higher false-negative rate: 53 percent of the evaluated "unpublished" preprints from 2015 had actually been published, though that number dropped to less than 17 percent in 2017 (Figure 3–figure supplement 1). While a more robust study would be required to draw more detailed conclusions about the “true” publication rate, this preliminary examination suggests the data from bioRxiv may be an underestimation of the number of preprints that have actually been published.”

These results are summarized in a new supplementary figure (Figure 3–figure supplement 1), illustrating the observed publication rate in bioRxiv data (dots) and the range (error bars) of the actual publication rate, as suggested by the samples.

The ranges were calculated by determining the sampled false-negative rate for each year (the proportion of sampled preprints that had been published but were not listed as published on bioRxiv), and using the margin of error for our survey (17.89 percentage points) to find the minimum and maximum false-negative rates at a 95% confidence level. These percentages were then applied to the number of "unpublished" papers in each year and added to the "published" count provided by bioRxiv.

In addition to the explanation that observed publication rates are likely an underestimation, we added another caveat to the Results section describing the journals that have published the most preprints, to clarify that some journals or fields may have conventions that impair automatic linking:

“Some journals have accepted a broad range of preprints, though none have hit all 27 of bioRxiv’s categories – PLOS ONE has published the most diverse category list, with 26. (It has yet to publish a preprint from the clinical trials collection, bioRxiv's second-smallest.) Other journals are much more specialized, though in expected ways: Of the 172 bioRxiv preprints published by The Journal of Neuroscience, 169 were in neuroscience, and 3 were from animal behavior and cognition. Similarly, NeuroImage has published 211 neuroscience papers, 2 in bioinformatics, and 1 in bioengineering. It should be noted that these counts are based on the publications detected by bioRxiv and linked to their preprint, so some journals – for example, those that more frequently rewrite the titles of articles – may be underrepresented here.”

We added a subsection to the Materials and methods section explaining the details of the analysis: see the section “Estimation of ranges for true publication rates”.

2) The authors interpret downloads as the number of people reading the paper: "We find preprints are being read more than ever before (1.1 million downloads in October 2018 alone)." These are not strictly the same and the authors should use downloads as the correct metric in the Abstract. In their Results, they should then clarify that they use downloads as a metric for readers. Instead of "Considering the number of downloads for each preprint, we find that bioRxiv's usage among readers is also increasing rapidly:" they could write: "Using downloads as a metric for readers' usage of preprints, we find that bioRxiv's usage among readers is also increasing rapidly."

We agree with the reviewers’ point that PDF download counts do not necessarily represent a one-to-one indication of readership. To clarify that this proxy was being used, we revised the sentence in question in the Results section:

“Using preprint downloads as a metric for readership, we find that bioRxiv’s usage among readers is also increasing rapidly (Figure 2). The total download count in October 2018 (1,140,296) was an 82 percent increase over October 2017, which itself was a 115 percent increase over October 2016 (Figure 2A).”

We have also changed the referenced sentence in the Abstract to remove the inference between a paper being downloaded and a paper being read:

“We find preprints are being downloaded more than ever before (1.1 million tallied in October 2018 alone) and that the rate of preprints being posted has increased to a recent high of 2,100 per month.”

3) To maximize the value of the resource and to reduce confusion around re-use attempts, the authors should apply the CC0 license to the work. The justification from Daniel Himmelstein at biOverlay provides a good rationale as to why this is likely to be the right choice:

"The Zenodo archive with the data for the study and the Zenodo archive with the current database are currently released under a CC BY-NC 4.0 License. This license forbids commercial reuse and hence is not considered an open license. Furthermore, it is a poor choice for data. First, it is unclear whether copyright applies to any aspects of the created database in the United States. Therefore, some users may decide that either no copyright applies to the data or that their reuse is fair use. For these users, the CC license is irrelevant and can be ignored. However, more cautious users or those in other jurisdictions may feel restricted by the license and hence not use the data. The NC stipulation makes the data difficult to integrate with other data. For example, if copyright does apply, then the data would be incompatible with data licensed under CC BY-SA (share alike). Finally, attribution on a dataset level is often challenging, and social norms rather than legal recourse are generally sufficient. The authors should look into placing their datasets into the public domain via a CC0 waiver/license, which is becoming common and enables the information to be ingested by other data commons such as Wikidata. Finally, it is possible that users could rerun the open source code to regenerate the database, thereby creating a parallel version that is unencumbered by potential copyright issues."

We fully agree that open licensing is important, both for transparency and to make it easier for future researchers to use the information. The repository holding the raw data from our manuscript (https://doi.org/10.5281/zenodo.2603083) has been switched to the CC0 license.

4) While the correlation between download and journal impact factor may be informative, using this metric is also problematic to assess a paper's true impact. The Discussion should address this issue.

We agree that JIF is a poor measure of impact for an individual article—indeed, JIF is a problematic metric even for what it claims to represent. We have added a paragraph to the Discussion section that clarifies the restricted applicability of this finding:

“We also found a positive correlation between the impact factor of journals and the number of downloads received by the preprints they have published. This finding in particular should be interpreted with caution. Journal impact factor is broadly intended to be a measurement of how citable a journal’s “average” paper is (Garfield, 2006), though it morphed long ago into an unfounded proxy for scientific quality in individual papers (The Impact Factor Game, 2006). It is referenced here only as an observation about a journal-level metric correlated with preprint downloads: There is no indication that either factor is influencing the other, nor that download numbers play a direct role in publication decisions.”

We also received feedback from the community about the use of journal impact factor in this analysis, including questions regarding the strength of the correlation observed between preprint downloads and JIF. The initial analysis, of the 30 journals that had published the most bioRxiv preprints, revealed a correlation between a journal's JIF and the median download count of the preprints it published. To strengthen this evaluation, we performed another regression analysis in which the data points were each individual preprint published by those journals, referenced against the impact factor of the journal in which it appeared. Both analyses found positive correlations between downloads and JIF, and are explained in the Results section:

“In our analysis, each data point in the regression represented a journal, indicating its JIF and the median downloads per paper for the preprints it had published. We found a significant positive correlation between these two measurements (Kendall’s τb=0.5862, p=1.364e-06). We also found a similar, albeit weaker, correlation when we performed another analysis in which each data point represented a single preprint (n=7,445; Kendall’s τb=0.2053, p=9.311e-152; see Materials and methods).”
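
For readers who wish to reproduce this kind of measurement, the τb statistic is available in standard libraries. Below is a minimal sketch with invented numbers (not our data); scipy's kendalltau computes the tau-b variant, which adjusts for ties, by default.

    from scipy.stats import kendalltau

    # Invented example values: journal impact factors and the median
    # download count of the preprints each journal published
    jif = [2.8, 4.1, 9.6, 11.9, 14.8, 41.6]
    median_downloads = [180, 240, 510, 620, 700, 1450]

    tau, p_value = kendalltau(jif, median_downloads)
    print(tau, p_value)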

5) In the Discussion, the authors write that "the Rxivist website provides many additional features that may interest preprint readers." The authors should also mention that this web application may interest preprint authors, as a method to assess community interest in their work. This could be especially important given the low rate of comments on preprints posted on bioRxiv (Inglis and Sever, 2016).

We have revised the sentence in question to make it clearer that the Rxivist website may have features that are helpful not just to readers but also to preprint authors looking to place their preprint's performance in context. The paragraph now reads:

“In addition to our analysis here focused on big-picture trends related to bioRxiv, the Rxivist website provides many additional features that may interest preprint readers and authors. Its primary feature is sorting and filtering preprints by download count or mentions on Twitter, to help users find preprints in particular categories that are being discussed either in the short term (Twitter) or over the span of months (downloads). Tracking these metrics could also help authors gauge public reaction to their work: While bioRxiv has compensated for a low rate of comments posted on the site itself (Inglis and Sever, 2016) by highlighting external sources such as tweets and blogs, Rxivist provides additional context for how a preprint compares to others on similar topics.”

Minor points:

6) The authors show that published preprints are downloaded more often than unpublished preprints: "Site-wide, the median number of downloads per preprint is 208, among papers that have not been published. For preprints that have been published, the median download count is 394 (Mann-Whitney U test, p < 2.2e-16). When preprints published in 2018 are excluded from this calculation, the difference between published and unpublished preprints shrinks, but is still significant (Table 2; Mann-Whitney U test, p < 2.2e-16)." Despite the difference between published and unpublished shrinking, the p value is similar to that of the data that includes preprints posted in 2018. Is this accurate?

We appreciate the reviewers' attention to detail in what could have been an unfortunate error. It is important to clarify that the p-values are not necessarily similar in this case: we state that these p-values are lower than 2.2e-16, but not that they are otherwise similar. Both inequalities in question (and another elsewhere in the text) are reported that way because of the limitations of the floating point arithmetic used by R, which is the software used in our analyses. We have added a paragraph to the Methods section to clarify this: see section “Reporting of small p-values”.
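
The threshold itself is machine epsilon for IEEE 754 double-precision numbers, which R exposes as .Machine$double.eps and uses as the floor when printing p-values; a quick illustration (in Python, purely as an aside):

    import sys

    # Machine epsilon for double precision: ~2.220446e-16, the same value
    # R reports as .Machine$double.eps and uses in "p-value < 2.2e-16"
    eps = sys.float_info.epsilon
    print(eps)                   # 2.220446049250313e-16
    print(1.0 + eps / 2 == 1.0)  # True: differences this small vanish next to 1.0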

7) Please clarify what the dashed lines represent in Figure 2–figure supplement 1 and Figure 2–figure supplement 4.

A sentence has been added to the end of the legend for Figure 2–figure supplement 1 to clarify that the dashed lines indicate medians: “The dashed line represents the median downloads per month over each paper’s first 12 months.”

Optional suggestions:

8) Figure 1 shows the number of preprints per field; however, different fields produce different numbers of papers. It would be helpful to provide an estimate of the relative size of each field to help understand the proportion of papers that are also submitted as preprints. The ideal analysis would include the number of bioRxiv postings by subject category vs. the number of papers that appear on PubMed for the subject category. A commenter on biOverlay mentions that they may be able to assist with this analysis: see https://hyp.is/AchpIjxgEemwB89_ndsvaw/www.bioverlay.org/post/2019-03-tracking-the-popularity-and-outcomes-of-all-biorxiv-preprints/

We agree that the analysis of absolute numbers of preprints in each bioRxiv category would benefit from field-specific context. However, it would be a large undertaking, as this would require evaluating every article in PubMed and determining the bioRxiv category to which each would theoretically have been posted. We attempted to use bioRxiv abstracts to train a Random Forest classifier to assign putative categories, but validation using other bioRxiv abstracts suggested that our initial attempt did not pull enough distinctive information from the abstracts to make confident predictions about categorization. However, the linked comment appears to be referring to a slightly different analysis recommended by Dr Himmelstein, which we have added to our submission: the proportion of papers from each journal that first appeared on bioRxiv. While this does not measure field-specific enthusiasm for preprints, using individual journals as samples for their fields of focus does provide some insight into the prevalence of preprints. We found that journals that published large numbers of preprints were not necessarily "enthusiastic" about preprints in particular—for example, Scientific Reports has published more bioRxiv preprints than any other journal, but these represent a tiny fraction of their published articles. Others, such as Genome Biology and GigaScience, publish fewer articles that are more consistently "preprinted" first. This information has been included as a table labelled Figure 4–source data 2. We have also added this information to the section of the results that describes Figure 4.
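
For context, the general shape of the classifier we attempted can be sketched with scikit-learn (a toy illustration with invented abstracts and labels, not our actual pipeline):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    # Invented training data: abstract text and bioRxiv category labels
    abstracts = [
        "cortical neurons respond to visual stimuli",
        "genome assembly of a bacterial isolate",
        "synaptic plasticity in hippocampal circuits",
        "variant calling pipeline for sequencing data",
    ]
    categories = ["neuroscience", "genomics", "neuroscience", "genomics"]

    X_train, X_test, y_train, y_test = train_test_split(
        abstracts, categories, test_size=0.25, random_state=42
    )

    # TF-IDF features from abstract text, fed to a random forest
    model = make_pipeline(
        TfidfVectorizer(stop_words="english"),
        RandomForestClassifier(n_estimators=500, random_state=42),
    )
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))  # held-out accuracy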

9) The question of whether or not paywalled articles get more preprint downloads following journal publication came up in our journal club. It may be interesting to mention the existing work to date on this topic in the discussion section: see the following hypothes.is comment on biOverlay: https://hyp.is/ThpfijxeEemiId-NPKT3nA/www.bioverlay.org/post/2019-03-tracking-the-popularity-and-outcomes-of-all-biorxiv-preprints/

We agree that the impact of publication on preprint downloads is an interesting line of inquiry—we have added a citation to Dr Kramer's analysis (https://tinyurl.com/rxivist-further-analysis), which is referenced at the end of the Results section that deals with Figure 5. The paragraph now reads (changes in italics):

“It is important to note that we did not evaluate when these downloads occurred, relative to a preprint’s publication: While it looks like accruing more downloads makes it more likely that a preprint will appear in a higher-impact journal, it is also possible that appearance in particular journals drives bioRxiv downloads after publication. The Rxivist dataset has already been used to begin evaluating questions like this (Kramer, 2019), and further study may be able to unravel the links, if any, between downloads and journals.”

10) An automatically generated visualization of the database schema in the methods would be helpful to reusers.

An illustration of the schema (and its relevant inter-table references) has been added to Supplementary File 1, which already includes detailed descriptions of the fields in each table. We also added a legend for the figure, for those unfamiliar with the visualization.

Other feedback from the biOverlay.org reviews:

Reviewer 1 (Dr Daniel Himmelstein):

Another interesting finding reported in the Methods — which also points to the satisfactory quality & rigor of existing preprints — was that: "A search for 1,345 journals based on the list compiled by Stop Predatory Journals showed that bioRxiv lists zero papers appearing in those publications." I think one explanation is that early preprint adopters were generally forward-thinking, competent, perceptive, and attuned researchers. Such researchers are unlikely to publish in predatory journals. However, another possibility is that bioRxiv is not detecting publications in predatory journals, as described in my following comment.

Response: We agree that there is more than one possible explanation for the absence of noted predatory journals in the bioRxiv data linking preprints to the publications in which they appeared. A sentence has been added to the Methods section (page 40) to clarify our intentions in performing this check—we did not mean to imply that preprints have never appeared in a predatory journal, only that those journals were not factored into our analysis of publication rates (new text in italics):

“A search for 1,345 journals based on the list compiled by Stop Predatory Journals (https://predatoryjournals.com) showed that bioRxiv publication data did not include any instances of papers appearing in those journals (Stop Predatory Journals, 2018). It is important to note that the absence of this information does not necessarily indicate that preprints have not appeared in these journals – we performed this search to ensure our analysis of publication rates was not inflated with numbers from illegitimate publications.”

The authors might also consider using “Impact Factor” rather than “impact score”, since the JIF is just one of many scores that can be used to assess a journal (and is known to be a methodologically deficient measure of mean citations).

Response: To clarify the specific journal impact metric we used in our analysis, references to "impact score" have been changed throughout the manuscript to refer instead to "Journal Impact Factor" or "JIF."

I could not find the manuscript figures (or text) anywhere besides the PDF. Therefore, I had to take screenshots to extract the figures for my journal club presentation. It would be useful to add the figures to one of the existing repositories or data depositions.

Response: PDF files for each figure have been added to the data deposition associated with the manuscript, available at https://doi.org/10.5281/zenodo.2603083

Reviewer 2 (Dr Devang Mehta):

… I do however think a live tally of the more global data that the authors present (number of downloads by field, submissions etc.) updated daily on the site would’ve been a useful feature to have on the website.

Response: A feature to display summary statistics (overall downloads, submissions, etc.) has been added to the website, and is now available at https://www.rxivist.org/stats.

https://doi.org/10.7554/eLife.45133.054

Article and author information

Author details

  1. Richard J Abdill

    Richard J Abdill is in the Department of Genetics, Cell Biology, and Development, University of Minnesota, Minneapolis, United States

    Contribution
    Data curation, Software, Formal analysis, Visualization, Writing—original draft
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-9565-5832
  2. Ran Blekhman

    Ran Blekhman is in the Department of Genetics, Cell Biology, and Development, and the Department of Ecology, Evolution, and Behavior, University of Minnesota, Minneapolis, United States

    Contribution
    Conceptualization, Methodology, Writing—review and editing
    For correspondence
    blekhman@umn.edu
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-3218-613X

Funding

College of Biological Sciences, University of Minnesota

  • Ran Blekhman

National Institutes of Health (R35-GM128716)

  • Ran Blekhman

University of Minnesota (McKnight Land-Grant Professorship)

  • Ran Blekhman

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank the members of the Blekhman lab, Kevin M Hemer, and Kevin LaCherra for helpful discussions. We also thank the bioRxiv staff at Cold Spring Harbor Laboratory for building a valuable tool for scientific communication, and also for not blocking our web crawler even when it was trying to read every web page they have. We are grateful to Crossref for maintaining an extensive, freely available database of publication data.

Senior Editor

  1. Peter Rodgers, eLife, United Kingdom

Reviewing Editor

  1. Emma Pewsey, eLife, United Kingdom

Reviewer

  1. Casey S Greene, University of Pennsylvania, United States

Publication history

  1. Received: January 28, 2019
  2. Accepted: April 23, 2019
  3. Accepted Manuscript published: April 24, 2019 (version 1)
  4. Version of Record published: May 10, 2019 (version 2)

Copyright

© 2019, Abdill and Blekhman

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

