Longitudinal trends.

For the 4,964 images in the training set, we identified images that were “Definitely okay” or “Definitely problematic”. This graph shows, for each year, the percentage of these images that were “Definitely problematic”. We used linear regression to calculate the P-value.
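As a minimal sketch of how a trend P-value of this kind could be obtained, the following example regresses yearly percentages on the year using `scipy.stats.linregress`. The years and percentages are hypothetical values made up for illustration, and regressing on per-year percentages is an assumption; the legend does not state how the regression was set up.

```python
from scipy.stats import linregress

# Hypothetical values, made up for illustration; NOT data from the study.
years = [2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021]
percent_problematic = [20.0, 19.0, 17.0, 18.0, 15.0, 14.0, 15.0, 12.0, 11.0, 10.0]

# Ordinary least-squares fit of percentage against year.
# result.pvalue is the two-sided P-value for the null hypothesis
# that the slope is zero (no change over time).
result = linregress(years, percent_problematic)

print(f"slope = {result.slope:.3f} percentage points per year")
print(f"P-value = {result.pvalue:.3g}")
```

A negative slope with a small P-value would suggest that the percentage of “Definitely problematic” images declined over the period.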

Trends by biology subdiscipline.

For the 4,964 images in the training set, we identified images that were “Definitely okay” or “Definitely problematic”. This graph shows the percentage of these images that were “Definitely problematic” for each subdiscipline, as indicated in the article metadata. In many cases, a single image was associated with multiple subdisciplines; these images were counted once for each associated subdiscipline. We used a chi-squared goodness-of-fit test to calculate the P-value.
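The counting rule described above can be sketched as follows: an image tagged with several subdisciplines contributes once to each. The records below are hypothetical and made up for illustration. The expected counts (every subdiscipline sharing the pooled proportion of problematic images) are also an assumption, because the legend does not state which expected frequencies were used.

```python
from collections import Counter

# Hypothetical records (label, subdisciplines); NOT data from the study.
images = [
    ("Definitely problematic", ["Neuroscience", "Cell Biology"]),
    ("Definitely okay", ["Neuroscience"]),
    ("Definitely okay", ["Cell Biology"]),
    ("Definitely problematic", ["Ecology"]),
    ("Definitely okay", ["Ecology"]),
    ("Definitely okay", ["Ecology", "Neuroscience"]),
]

totals = Counter()
problematic = Counter()
for label, subdisciplines in images:
    # An image with several subdisciplines is counted once for each of them.
    for sub in subdisciplines:
        totals[sub] += 1
        if label == "Definitely problematic":
            problematic[sub] += 1

percent = {sub: 100.0 * problematic[sub] / totals[sub] for sub in totals}

# Assumed null hypothesis: every subdiscipline has the pooled proportion.
pooled = sum(problematic.values()) / sum(totals.values())
chi2_stat = sum(
    (problematic[sub] - totals[sub] * pooled) ** 2 / (totals[sub] * pooled)
    for sub in totals
)

print(percent)
print(f"chi-squared statistic = {chi2_stat:.4f}")
```

Under this setup, the P-value would come from a chi-squared distribution with one fewer degree of freedom than the number of subdisciplines, for example via `scipy.stats.chisquare`.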

Rank-based metric score for images categorized as “Definitely okay” or “Definitely problematic”.

Convolutional Neural Network predictions for the hold-out test set.

Each point represents the prediction for an image from the hold-out test set. Higher confidence scores indicate that the model was more confident that a given image was “Definitely problematic” for a person with deuteranopia.

Receiver operating characteristic curve for Convolutional Neural Network predictions on the hold-out test set.

This curve illustrates tradeoffs between sensitivity and specificity for the Convolutional Neural Network on the hold-out test set. The area under the curve is 0.89.
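To make the quantities behind such a curve concrete, the following sketch computes sensitivity and specificity at one threshold, and the area under the curve as the probability that a randomly chosen positive image scores higher than a randomly chosen negative one (ties count as one half). The labels and scores are hypothetical values made up for illustration, and the function names are not taken from the study's code.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical values; 1 = "Definitely problematic", 0 = "Definitely okay".
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]

sens, spec = sensitivity_specificity(labels, scores, threshold=0.5)
auc = roc_auc(labels, scores)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}, AUC = {auc:.2f}")
```

Sweeping the threshold from high to low traces the curve: sensitivity rises while specificity falls, which is the tradeoff the figure illustrates.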