Breakdown of quotes at major processing steps

Breakdown of citations at major processing steps

Breakdown of all Springer Nature papers at major processing steps

Breakdown of all Nature papers at major processing steps

Data and Processing Pipeline Overview

Panel A, left, depicts an example news article and the type of data extracted from the text. Green and blue highlighted text depicts all quotes and associated speakers identified by the coreNLP pipeline. A custom script, described in the Methods section, identifies all citations. Panel A, right, charts the analyses done on the names and locations extracted from news articles and from papers published by Nature. Panel B shows the types and numbers of articles used in the analyses.

Predicted male speakers are overrepresented in quotes, but this depends on the article type.

Panel A, left, depicts an example of the names extracted from quoted speakers in news articles and authors in papers. Panel A, right, highlights the data types and processes used to analyze the predicted gender of extracted names. Panel B shows an overview of the number of quotes extracted for each article type. Panel C depicts three trend lines: Purple: Proportion of quotes from an estimated male speaker; Light Blue: Proportion of papers with an estimated male first author; Dark Blue: Proportion of papers with a predicted male last author. We observe that the proportion of estimated male quotes is steadily decreasing, most notably from 2017 onward. This decreasing trend is not due to a change in quotes from the first or last authors, as observed in Panel D. Panel D shows a consistent but slight shift towards quoting the last author of a cited article rather than the first author. Instead, the observed downward trend of male quotes coincides with additional article types introduced in 2017. Panel E depicts the frequency of quotes by article type, highlighting an increase in quotes from “Career Feature” articles, and shows that the quotes obtained in this article type have reached parity. In all plots, the colored bands represent the 5th and 95th bootstrap quantiles and the point is the mean calculated from 1,000 bootstrap samples.
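The bands and points described in this caption can be illustrated with a minimal sketch. It assumes each quote carries a boolean "predicted male" label and that quotes are resampled with replacement. The function name, the toy data, and the nearest-rank quantile rule are illustrative assumptions, not the pipeline's actual code, which may resample at a different level (for example, by article).

```python
import random

def bootstrap_proportion(labels, n_boot=1000, seed=0):
    """Return (mean, q05, q95) of the bootstrapped proportion of True labels.

    Assumption: resampling is done over individual quotes, with replacement.
    """
    rng = random.Random(seed)
    n = len(labels)
    props = []
    for _ in range(n_boot):
        # Draw a resample of the same size as the original data.
        sample = rng.choices(labels, k=n)
        props.append(sum(sample) / n)
    props.sort()
    mean = sum(props) / n_boot
    # Nearest-rank approximation of the 5th and 95th quantiles.
    q05 = props[int(0.05 * (n_boot - 1))]
    q95 = props[int(0.95 * (n_boot - 1))]
    return mean, q05, q95

# Hypothetical toy data: True = quote attributed to a predicted male speaker.
quotes_is_male = [True] * 70 + [False] * 30
mean, q05, q95 = bootstrap_proportion(quotes_is_male)
print(f"mean={mean:.3f}, band=({q05:.3f}, {q95:.3f})")
```

In a per-year plot, the same routine would be applied separately to the quotes of each year, giving one point and one band per year.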

Analysis of quotes and citations found over-representation of Celtic/English and under-representation of East Asian predicted name origins.

Panel A, left, depicts an example of the names extracted from quoted speakers and citations found within news articles and authors in papers. Panel A, right, highlights the data types and processes used to analyze the predicted origin of extracted names. Panels B and C depict a comparison between the predicted name origins of last authors in Nature and cited papers in the news. Panels B and C differ in the news article types used. Panel B calculates the predicted name origin proportion using only journalist-written articles, whereas Panel C uses only scientist-written articles. Scientist-written articles are those appearing in the “Career Column” or “News and Views” sections; articles appearing in any other section are treated as journalist-written. Similarly, Panels D and E depict two possible trend lines, comparing predicted name origins of either quoted or mentioned people against name origins of last authors of Nature research papers. For more precise numerical comparisons, the mean yearly fold-change for each comparison is provided in Table 5.
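As a rough illustration of the section-based split and the fold-change comparison mentioned in this caption, the sketch below assumes that fold change is the ratio of a name origin's proportion in the news to its proportion in the background set for the same year, averaged over years. That definition, the function names, and the toy proportions are assumptions made for illustration; the exact computation behind Table 5 may differ.

```python
# Sections treated as scientist-written, as stated in the caption.
SCIENTIST_SECTIONS = {"Career Column", "News and Views"}

def writer_type(section):
    """Classify an article as scientist- or journalist-written by its section."""
    return "scientist" if section in SCIENTIST_SECTIONS else "journalist"

def yearly_fold_change(news_prop, background_prop):
    """Per-year ratio of news proportion to background proportion.

    Assumption: fold change = news proportion / background proportion.
    Years missing from either input, or with a zero background, are skipped.
    """
    return {
        year: news_prop[year] / background_prop[year]
        for year in news_prop
        if background_prop.get(year, 0) > 0
    }

def mean_fold_change(news_prop, background_prop):
    """Average the per-year fold changes over all comparable years."""
    fc = yearly_fold_change(news_prop, background_prop)
    return sum(fc.values()) / len(fc)

# Hypothetical proportions for one name origin (not data from the study).
quoted = {2018: 0.10, 2019: 0.12}
nature_last_authors = {2018: 0.20, 2019: 0.24}

print(writer_type("News and Views"))                  # scientist
print(writer_type("News"))                            # journalist
print(mean_fold_change(quoted, nature_last_authors))  # below 1: under-represented
```

Under this definition, a value below 1 indicates under-representation relative to the background set and a value above 1 indicates over-representation.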

Mean fold change comparison with Nature from bootstrap samples with 95% CI

Mean fold change comparison with Springer Nature from bootstrap samples with 95% CI

Quoted speaker name origin, by journalist name origin

Quoted + cited speaker name origin, by journalist name origin

Quoted speakers (with US affiliated citation) name origin, by journalist name origin

Benchmark Data

Panel A depicts the performance of gender prediction for pipeline-identified quoted speakers. Panel B is a histogram of the number of articles that were falsely identified as mentioning a country by our processing pipeline. Panel C shows the estimated versus true frequency of country mentions within our benchmark dataset. The red line denotes the x = y line.
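The quantities behind Panels B and C can be sketched as follows, assuming the benchmark provides, for each article, a hand-annotated set of mentioned countries alongside the set returned by the pipeline. The data layout, function name, and toy articles are hypothetical.

```python
from collections import Counter

def benchmark_country_mentions(truth, predicted):
    """Compare predicted and annotated country mentions per article.

    truth, predicted: dicts mapping article id -> set of country names.
    Returns (true_freq, est_freq, false_pos), each a Counter keyed by country,
    counting the number of articles.
    """
    true_freq, est_freq, false_pos = Counter(), Counter(), Counter()
    for article_id in set(truth) | set(predicted):
        gold = truth.get(article_id, set())
        pred = predicted.get(article_id, set())
        true_freq.update(gold)        # articles truly mentioning each country
        est_freq.update(pred)         # articles the pipeline flags per country
        false_pos.update(pred - gold) # flagged, but not in the annotation
    return true_freq, est_freq, false_pos

# Hypothetical benchmark with three articles.
truth = {"a1": {"US", "China"}, "a2": {"US"}, "a3": set()}
predicted = {"a1": {"US"}, "a2": {"US", "India"}, "a3": {"India"}}

true_freq, est_freq, false_pos = benchmark_country_mentions(truth, predicted)
for country in sorted(set(true_freq) | set(est_freq)):
    print(country, true_freq[country], est_freq[country], false_pos[country])
```

Plotting `est_freq` against `true_freq` per country gives a scatter like Panel C, where points on the x = y line indicate exact agreement; the values of `false_pos` are the kind of counts a histogram like Panel B would summarize.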

Predicted male speakers are overrepresented in news quotes regardless of predicted journalist gender

Panel A depicts two trend lines: Yellow: Proportion of Nature news articles written by a predicted female journalist; Blue: Proportion of Nature news articles written by a predicted male journalist. We observe a moderate gender difference in the number of articles written by male and female journalists. Panel B depicts two trend lines: Yellow: Proportion of predicted male quotes in an article written by a predicted female journalist; Blue: Proportion of predicted male quotes in an article written by a predicted male journalist. In all plots, the colored bands represent the 5th and 95th bootstrap quantiles and the point is the mean calculated from 1,000 bootstrap samples.

Predicted male speakers are overrepresented in news quotes when compared against Springer Nature authorship

Panel A depicts four trend lines: Purple: Proportion of Nature quotes from an estimated male speaker; Light Grey: Proportion of The Guardian quotes from an estimated male speaker; Yellow: Proportion of Springer Nature articles with an estimated male first author; Dark Mustard: Proportion of Springer Nature articles with an estimated male last author. We observe a larger gender difference between first and last authors in Springer Nature articles; however, the proportion of predicted male speakers is lower than that observed in Nature research articles. Panel B depicts the proportion of male quotes broken down by article type. In all plots, the colored bands represent the 5th and 95th bootstrap quantiles and the point is the mean calculated from 1,000 bootstrap samples.

Predicted Celtic/English and European name origins are the most cited, quoted, and mentioned

Panel A depicts the number of quotes, mentions, citations, or research articles considered in the name origin analysis. Panels B-G depict the proportion of each name origin in a given dataset: citations in articles written by journalists or writers, quoted speakers, or mentions. In all plots, the colored bands represent the 5th and 95th bootstrap quantiles and the point is the mean calculated from 1,000 bootstrap samples.

Distribution of name origins in Nature and Springer Nature articles

Panels A-D depict the predicted name origins of first and last authors in our background sets. Panels A and B show the predicted name origins of Nature first and last authors, respectively. Panels C and D show the predicted name origins of Springer Nature first and last authors, respectively.

Over-representation of predicted Celtic/English and under-representation of East Asian name origins is also found in comparison to Nature and Springer Nature articles

Panels A-F depict ten plots, each for a possible name origin comparison against a background set. Panels A, C, and E compare the citation (A), quote (C), or mention (E) rate against Nature first and last author name origins. Panels B, D, and F compare the citation (B), quote (D), or mention (F) rate against Springer Nature first and last author name origins. Panels A and B additionally partition the citation rates by journalist-written articles and scientist-written articles, each further divided into first or last author position. For Panels C-F, only journalist-written articles are considered.

Over-representation of predicted Celtic/English and under-representation of East Asian quotes and mentions are reduced when additionally considering citation

Panels A-D depict twelve plots, each for a possible name origin comparison against a background set. Panels A and B compare name origin proportions of quotes from people who were also cited in the same article. Panels C and D compare name origin proportions of mentions of people who were also cited in the same article. In all plots, the colored bands represent the 5th and 95th bootstrap quantiles and the point is the mean calculated from 1,000 bootstrap samples.