Longitudinal trends for the eLife articles.

For the training set, we summarized our findings per article. This graph shows article counts for the “Definitely okay” and “Definitely problematic” categories for each year evaluated.

Trends by biology subdiscipline for the eLife articles.

For the training set, this graph shows the percentage of articles categorized as “Definitely problematic” for each subdiscipline, as indicated in the article metadata. In many cases, a single article was associated with multiple subdisciplines; these articles are shown as “Multidisciplinary.” We used a χ² goodness-of-fit test to calculate the p-value, with the overall proportion of articles in each subdiscipline as the expected probability.
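As an illustration of the test described in this caption, the sketch below computes a χ² goodness-of-fit statistic and p-value in plain Python. It assumes that the observed values are the counts of “Definitely problematic” articles per subdiscipline and that the expected probabilities are each subdiscipline's share of all articles; that reading of the caption is an assumption. The counts and the function names `chi2_sf` and `goodness_of_fit` are hypothetical and do not come from the study.

```python
import math

def chi2_sf(stat, df, max_terms=10_000):
    """Upper-tail probability of a chi-square distribution.

    Uses the power series for the regularized lower incomplete gamma
    function; adequate for a sketch, not tuned for extreme statistics.
    """
    if stat <= 0:
        return 1.0
    a, x = df / 2.0, stat / 2.0
    term = 1.0 / a
    total = term
    for n in range(1, max_terms):
        term *= x / (a + n)
        total += term
        if term < 1e-16 * total:
            break
    lower = total * math.exp(-x + a * math.log(x) - math.lgamma(a))
    return min(1.0, max(0.0, 1.0 - lower))

def goodness_of_fit(observed, expected_props):
    """Return (statistic, degrees of freedom, p-value)."""
    n = sum(observed)
    expected = [n * p for p in expected_props]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return stat, df, chi2_sf(stat, df)

# Hypothetical counts for three subdisciplines (not data from the study).
articles_per_group = [100, 100, 100]      # all evaluated articles
problematic_per_group = [10, 20, 30]      # "Definitely problematic" articles

total_articles = sum(articles_per_group)
expected_props = [c / total_articles for c in articles_per_group]

stat, df, p_value = goodness_of_fit(problematic_per_group, expected_props)
print(f"chi2 = {stat:.2f}, df = {df}, p = {p_value:.4f}")
```

A small p-value here would suggest that problematic articles are not spread across subdisciplines in proportion to the number of articles each one contributes.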

Receiver operating characteristic curve for the Convolutional Neural Network predictions for the images in the eLife hold-out test set.

This curve illustrates tradeoffs between sensitivity and specificity. The area under the curve is 0.89.
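To make the quantities in this caption concrete, the following minimal sketch computes sensitivity and specificity at a given threshold and the area under the receiver operating characteristic curve from labels and predicted scores. It uses the rank-based definition of the area (the probability that a randomly chosen positive image scores higher than a randomly chosen negative one, with ties counted as one half). The labels, scores, and function names are hypothetical, and this is not necessarily how the reported value was calculated.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = problematic) and model scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(labels, scores)
strict = sensitivity_specificity(labels, scores, threshold=0.5)
lenient = sensitivity_specificity(labels, scores, threshold=0.3)
print(auc, strict, lenient)
```

Lowering the threshold in this toy example raises sensitivity and lowers specificity, which is the tradeoff the curve traces across all thresholds.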

Precision-recall curve for the Convolutional Neural Network predictions for the images in the eLife hold-out test set.

This curve illustrates tradeoffs between precision and recall. The area under the curve is 0.75.
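One common way to summarize a precision-recall curve as a single area is average precision, sketched below in plain Python. This is an assumption about the summary method: the caption does not say whether the reported area was computed this way or by another rule such as trapezoidal interpolation. The labels, scores, and the function name `average_precision` are hypothetical. Unlike the receiver operating characteristic curve, the chance level for this area is roughly the fraction of positive images rather than 0.5, so the sketch also prints that prevalence for comparison.

```python
def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive example.

    Examples are ranked by descending score; tied scores are not
    treated specially in this sketch.
    """
    n_pos = sum(1 for y in labels if y == 1)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos

# Hypothetical labels (1 = problematic) and model scores.
pr_labels = [0, 0, 1, 1]
pr_scores = [0.1, 0.4, 0.35, 0.8]

ap = average_precision(pr_labels, pr_scores)
prevalence = sum(pr_labels) / len(pr_labels)
print(f"average precision = {ap:.3f}, prevalence baseline = {prevalence:.2f}")
```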

Receiver operating characteristic curve for the Convolutional Neural Network predictions for the images in the PubMed Central hold-out test set.

This curve illustrates tradeoffs between sensitivity and specificity. The area under the curve is 0.78.

Precision-recall curve for the Convolutional Neural Network predictions for the images in the PubMed Central hold-out test set.

This curve illustrates tradeoffs between precision and recall. The area under the curve is 0.39.