Data and Processing Pipeline Overview

Panel a, left, depicts an example news article and the types of data extracted from the text. Green and blue highlighted text marks all quotes and associated speakers identified by the CoreNLP pipeline. A custom script, described in the Methods section, identifies all citations. Panel a, right, charts the analyses performed on the names and locations extracted from news articles and from papers published by Nature. Panel b shows the types and numbers of articles used in the analyses.

Breakdown of quotes at major processing steps

Breakdown of citations at major processing steps

Breakdown of all Springer Nature papers at major processing steps

Breakdown of all Nature papers at major processing steps

Speakers predicted to be men are sometimes overrepresented in quotes, but this depends on the year and article type.

Panel a, left, depicts an example of the names extracted from quoted speakers in news articles and authors in papers. Panel a, right, highlights the data types and processes used to analyze the predicted gender of extracted names. Panel b shows an overview of the number of quotes extracted for each article type. Panel c depicts three trend lines: Purple: Proportion of quotes from a speaker estimated to be a man; Light Blue: Proportion of papers whose first author is estimated to be a man; Dark Blue: Proportion of papers whose last author is estimated to be a man. We observe that the proportion of quotes estimated to come from a man is steadily decreasing, most notably from 2017 onward. This decreasing trend is not due to a change in quotes from the first or last authors, as observed in Panel d. Panel d shows a consistent but slight bias over time towards quoting the last author of a cited article rather than the first author. Panel e depicts the frequency of quotes by article type, highlighting an increase in quotes from “Career Feature” articles, and shows that the quotes obtained in this article type have reached parity. In all plots, the colored bands represent the 5th and 95th bootstrap quantiles and the point is the mean calculated from 1,000 bootstrap samples.
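The bands and points described in this legend can be illustrated with a minimal sketch of a bootstrap over a proportion. This is not the authors' code: the function name `bootstrap_proportion`, the toy data, the fixed seed, and the choice to resample individual quotes (rather than, say, whole articles) are all assumptions made for illustration. Only the 1,000 resamples and the 5th and 95th quantiles come from the legend.

```python
import random

def bootstrap_proportion(labels, n_boot=1000, seed=0):
    """Bootstrap the proportion of True values in `labels`.

    Returns (mean, 5th quantile, 95th quantile) over `n_boot` resamples.
    Quantiles use a simple nearest-rank rule on the sorted estimates.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(labels)
    estimates = []
    for _ in range(n_boot):
        sample = rng.choices(labels, k=n)  # resample with replacement
        estimates.append(sum(sample) / n)
    estimates.sort()
    mean = sum(estimates) / n_boot
    lower = estimates[int(0.05 * (n_boot - 1))]
    upper = estimates[int(0.95 * (n_boot - 1))]
    return mean, lower, upper

# Hypothetical data for one year: 70 quotes predicted "man" (True), 30 not (False).
quotes = [True] * 70 + [False] * 30
mean, lower, upper = bootstrap_proportion(quotes)
print(f"point (mean) = {mean:.3f}, band = ({lower:.3f}, {upper:.3f})")
```

Under these assumptions, the plotted point corresponds to `mean` and the colored band to the interval from `lower` to `upper`, computed separately for each year.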

Quoted speaker gender by name origin

Analysis of quotes and citations found over-representation of Celtic/English and under-representation of East Asian predicted name origins.

Panel a, left, depicts an example of the names extracted from quoted speakers and citations found within news articles and authors in papers. Panel a, right, highlights the data types and processes used to analyze the predicted origin of extracted names. Panels b and c depict a comparison between the predicted name origins of last authors in Nature and cited papers in the news. Panels b and c differ in the news article types considered: Panel b calculates the predicted name origin proportion using only journalist-written articles, whereas Panel c uses only scientist-written articles. An article is defined as scientist-written if it appears in the “Career Column” or “News and Views” sections, and as journalist-written if it appears in any other section. Similarly, Panels d and e depict two possible trend lines, comparing predicted name origins of either quoted or mentioned people against name origins of last authors of Nature research papers. For more precise numerical comparisons, the mean yearly fold-change for each comparison is provided in Table 6.
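As an illustration of the fold-change comparison mentioned above, the following sketch computes the ratio of a name-origin proportion in news articles to the same proportion in a background set for a single year, with a bootstrap interval. It is an assumption-laden example rather than the authors' method: the function name `bootstrap_fold_change`, the toy counts, the independent resampling of the two sets, and the 2.5th and 97.5th percentiles as a 95% interval are all hypothetical choices. A yearly mean would additionally average such ratios across years.

```python
import random

def bootstrap_fold_change(news, background, n_boot=1000, seed=0):
    """Bootstrap the ratio (news proportion) / (background proportion).

    `news` and `background` are lists of booleans marking whether each name
    was predicted to have the name origin of interest.
    Returns (mean, 2.5th percentile, 97.5th percentile) of the ratio.
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_boot):
        news_rate = sum(rng.choices(news, k=len(news))) / len(news)
        bg_rate = sum(rng.choices(background, k=len(background))) / len(background)
        if bg_rate > 0:  # skip resamples where the ratio is undefined
            ratios.append(news_rate / bg_rate)
    ratios.sort()
    mean = sum(ratios) / len(ratios)
    low = ratios[int(0.025 * (len(ratios) - 1))]
    high = ratios[int(0.975 * (len(ratios) - 1))]
    return mean, low, high

# Hypothetical counts: 80 of 200 quoted names vs. 40 of 200 background last authors.
news_names = [True] * 80 + [False] * 120
background_names = [True] * 40 + [False] * 160
fc_mean, fc_low, fc_high = bootstrap_fold_change(news_names, background_names)
print(f"fold change = {fc_mean:.2f} (95% CI {fc_low:.2f} to {fc_high:.2f})")
```

A fold change above 1 would indicate over-representation relative to the background set, and a value below 1 under-representation.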

Mean fold change comparison with Nature from bootstrap samples with 95% CI

Mean fold change comparison with Springer Nature from bootstrap samples with 95% CI

Quoted speaker name origin, by journalist name origin

Quoted + cited speaker name origin, by journalist name origin

Quoted speakers (with US-affiliated citation) name origin, by journalist name origin

Benchmark Data

The performance of gender prediction for pipeline-identified quoted speakers.

Speakers predicted to be men are overrepresented in news quotes regardless of predicted journalist gender

Panel a depicts two trend lines: Yellow: Proportion of Nature news articles written by a journalist predicted to be a woman; Blue: Proportion of Nature news articles written by a journalist predicted to be a man. We observe a moderate gender difference in the number of articles written by men and women journalists. Panel b depicts two trend lines: Yellow: Proportion of quotes predicted to be from men in an article written by a journalist predicted to be a woman; Blue: Proportion of quotes predicted to be from men in an article written by a journalist predicted to be a man. In all plots, the colored bands represent the 5th and 95th bootstrap quantiles and the point is the mean calculated from 1,000 bootstrap samples.

Speakers predicted to be men are overrepresented in news quotes when compared against Springer Nature authorship

Panel a depicts four trend lines: Purple: Proportion of Nature quotes from a speaker estimated to be a man; Light Grey: Proportion of The Guardian quotes from a speaker estimated to be a man; Yellow: Proportion of Springer Nature articles whose first author is estimated to be a man; Dark Mustard: Proportion of Springer Nature articles whose last author is estimated to be a man. We observe a larger gender difference between first and last authors in Springer Nature articles; however, the proportion of speakers estimated to be men is lower than that observed in Nature research articles. Panel b depicts the proportion of quotes from predicted men broken down by article type. In all plots, the colored bands represent the 5th and 95th bootstrap quantiles and the point is the mean calculated from 1,000 bootstrap samples.

Predicted Celtic/English and European name origins are the most cited, quoted, and mentioned

Panel a depicts the number of quotes, mentions, citations, or research articles considered in the name origin analysis. Panels b-g depict the proportion of a name origin in a given dataset: citations in articles written by journalists or writers, quoted speakers, or mentions. In all plots, the colored bands represent the 5th and 95th bootstrap quantiles and the point is the mean calculated from 1,000 bootstrap samples.

Distribution of name origins in Nature and Springer Nature articles

Panels a-d depict the predicted name origins of first and last authors in our background sets. Panels a and b show the predicted name origins of Nature first and last authors, respectively. Panels c and d show the predicted name origins of Springer Nature first and last authors, respectively.

Over-representation of predicted Celtic/English and under-representation of East Asian name origins is also found in comparison to Nature and Springer Nature articles

Panels a-f depict ten plots, each for a possible name origin comparison against a background set. Panels a, c, and e compare the citation (a), quote (c), or mention (e) rate against Nature first and last author name origins. Panels b, d, and f compare the citation (b), quote (d), or mention (f) rate against Springer Nature first and last author name origins. Panels a and b additionally partition the citation rates by journalist-written articles and scientist-written articles, each further divided into first or last author position. For c-f, only journalist-written articles are considered.

Over-representation of predicted Celtic/English and under-representation of East Asian quotes and mentions are reduced when additionally considering citation

Panels a-d depict twelve plots, each for a possible name origin comparison against a background set. Panels a and b compare name origin proportions of quotes from people who were also cited in the same article. Panels c and d compare name origin proportions of mentions of people who were also cited in the same article. In all plots, the colored bands represent the 5th and 95th bootstrap quantiles and the point is the mean calculated from 1,000 bootstrap samples.