Abstract
The peer review process is a critical step in ensuring the quality of scientific research. However, its subjectivity has raised concerns. To investigate this issue, I examined over 500 publicly available peer review reports from 200 published neuroscience papers in 2022-2023. OpenAI’s generative artificial intelligence ChatGPT was used to analyze language use in these reports. This analysis found high variability in how different reviewers scored the same paper, indicating the presence of subjectivity in the peer review process. The results also revealed that female first authors received less polite reviews than their male peers, indicating a gender bias in reviewing. Furthermore, published papers with a female senior author received more favorable reviews than papers with a male senior author, suggesting a gender disparity in academic publishing. This study highlights the potential of generative artificial intelligence in identifying areas of concern in scientific peer review and underscores the need to enhance transparency and objectivity in the scientific publishing process.
Introduction
The peer review process is a crucial step in the publication of scientific research, where manuscripts are evaluated by independent experts in the field before being accepted for publication. This process helps ensure the quality and validity of scientific research and is a cornerstone of scientific integrity. Despite its importance, concerns have been raised regarding subjectivity in this process that may affect the fairness and accuracy of evaluations1-5. Indeed, most journals engage in single-blind peer review, in which the reviewers have information about the authors of the paper, but not vice versa. While some studies have found evidence of disparities in peer review as a result of gender bias, the scope and methodology of these studies are often limited6-8. For example, in one case study, a biology journal observed an increase in papers from female first authors after the introduction of double-blind peer review6. Additionally, other factors, such as the seniority and institutional affiliation of authors, may influence the evaluation process and lead to biased assessments of research quality7. As such, papers from more prestigious research institutions may receive better peer reviews9. It is crucial to identify potential sources of disparity in the reviewing process to maintain scientific integrity and find areas for improvement within the scientific pipeline.
Natural language processing tools have shown promise in analyzing large amounts of textual data and extracting meaningful insights from evaluations10-12. However, applying these tools to scientific peer review has been challenging due to the specialized construction and language use in such reports. Past attempts to use natural language processing to analyze peer review reports have often been limited by small dataset sizes or the complexity of the text13-15. Recent advances in generative artificial intelligence, such as OpenAI’s ChatGPT, offer new possibilities for studying scientific peer review. These models can process vast amounts of text and provide accurate sentiment scores and language use metrics for individual sentences and documents. As such, using generative artificial intelligence to study scientific peer review may ultimately help improve the overall quality and fairness of scientific publications and identify areas of concern on the path towards equitable academic research.
This study had three main objectives. The first was to test whether the latest advances in generative artificial intelligence, such as OpenAI’s ChatGPT, can be used to analyze language use in scientific peer reviews. The second aim was to explore subjectivity in peer review by looking at consistency in favorability across reviews for the same paper. The last aim was to test whether the identity of the authors, such as institutional affiliation and gender, affects the favorability and language use of the reviews they receive.
Results
An analysis of scientific peer review
Nature Communications has engaged in transparent peer review since January 2016, giving authors the option to publish the peer review history of their paper16. To explore language use in these reports, I downloaded the primary (i.e., first-round) reviews from the last 200 papers in the neuroscience field published in this journal. This yielded a total of 572 reviews from 200 papers, with publication dates ranging from August 2022 to February 2023. Additional metrics of these papers were manually collected (Fig. 1a and 1b), including the total review time of the paper, the subfield of neuroscience, the geographical location and QS World Ranking score of the senior author’s institutional affiliation, the gender of the senior author, and whether the first author had a male or female name (see Methods for more information on classifications and a rationale for the chosen metrics). These metrics were collected to test whether they influenced the favorability and language use of the reviews that a paper received.
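These metrics lend themselves to a simple per-paper record. The sketch below shows one way such records could be organized for analysis in Python; the field names are illustrative assumptions rather than the actual data schema used in this study (the underlying data are provided as a Supplementary file).

```python
# A minimal sketch of a per-paper record; field names are hypothetical.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PaperRecord:
    doi: str                      # identifier of the published paper
    n_reviews: int                # number of first-round reviewer reports
    review_time_days: int         # days from submission to acceptance
    subfield: str                 # manually assigned neuroscience subfield
    senior_continent: str         # continent of senior author's affiliation
    qs_score: Optional[float]     # 2023 QS World Ranking score (None if unranked)
    senior_gender: Optional[str]  # 'male' / 'female' / None if undetermined
    first_gender: Optional[str]   # inferred from the first author's name only
```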
Sentiment analysis
To assess the sentiment and language use of each of the peer review reports, I asked OpenAI’s generative artificial intelligence ChatGPT to extract two scores from each of the reviews (Fig. 2a). The first was a sentiment score, which measures how favorable the review is. This metric ranges from -100 (negative) to 0 (neutral) to +100 (positive). Sentiment reflects the reviewer’s opinion about the paper and is what presumably drives the decision for a paper to be accepted or rejected. The second was a politeness score, which evaluates how polite a review’s language is, measured on a scale from -100 (rude) to 0 (neutral) to +100 (polite). ChatGPT was able to extract sentiment and politeness scores for all of the 572 reviews, and usually included a reasoning of how it established the score (Supplementary Fig. 1).
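Because ChatGPT reports its scores in free text, the two numbers can also be recovered programmatically. The snippet below is a minimal sketch of such a parsing step, assuming replies phrased like “Sentiment score: 85. Politeness score: 90.”; it is an illustration, not necessarily the procedure used in this study.

```python
# Sketch: extract the two numeric scores from a free-text reply.
# The assumed phrasing ("Sentiment score: 85") is hypothetical.
import re

def parse_scores(reply: str) -> tuple[int, int]:
    sentiment = re.search(r"sentiment[^-\d]*(-?\d+)", reply, re.IGNORECASE)
    politeness = re.search(r"politeness[^-\d]*(-?\d+)", reply, re.IGNORECASE)
    if not (sentiment and politeness):
        raise ValueError("could not find both scores in the reply")
    return int(sentiment.group(1)), int(politeness.group(1))

print(parse_scores("Sentiment score: 85. Politeness score: 90."))  # -> (85, 90)
```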
The accuracy and consistency of the generated scores were validated in four different ways. First, for a representative sample of the reviews, I read both the review and ChatGPT’s reasoning of how it arrived at the scores (for examples see Supplementary Fig. 1). I established that the algorithm was able to extract the most important sentences from each of the reviews and to provide a plausible score. Second, since generative artificial intelligence can provide different answers every time it is prompted, the algorithm was asked to provide scores for each review twice. This yielded a significant correlation between the first and second iteration of scoring (P < 0.0001 for both sentiment and politeness scores; Supplementary Fig. 2); the average of the two scores was used for all subsequent analyses in this paper. Third, manipulated reviews (in which I manually re-wrote a ‘neutral’ review in a more rude, polite, negative or positive manner) were input into ChatGPT, which confirmed that these manipulations changed the review’s politeness and sentiment scores, respectively (Supplementary Fig. 3). Finally, for a subset of reviews, ChatGPT’s scores were compared to those of seven human scorers who were blinded to the algorithm’s scores (Supplementary Fig. 4). Interestingly, there was high variability across human scorers, but their average score correlated strongly with that of ChatGPT (linear regression for sentiment score: R2 = 0.91, P = 0.0010; for politeness score: R2 = 0.70, P = 0.018). Together, these validations indicate that ChatGPT can accurately score the sentiment and politeness of scientific peer reviews.
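For readers who wish to reproduce the two quantitative checks (test-retest consistency and agreement with the mean human score), a sketch using scipy is shown below; the score lists are hypothetical stand-ins for the real data.

```python
# Sketch of the validation statistics on toy data.
from scipy.stats import pearsonr, linregress

chatgpt_run1 = [80, 65, -20, 90, 40, 70]   # hypothetical first-iteration scores
chatgpt_run2 = [75, 70, -10, 85, 45, 65]   # hypothetical second-iteration scores
human_mean   = [78, 60, -15, 88, 50, 68]   # hypothetical mean of seven human scorers

# test-retest consistency of ChatGPT across two scoring iterations
r, p = pearsonr(chatgpt_run1, chatgpt_run2)
print(f"iteration 1 vs 2: r = {r:.2f}, P = {p:.4g}")

# agreement between ChatGPT (average of both runs) and the mean human score
chatgpt_mean = [(a + b) / 2 for a, b in zip(chatgpt_run1, chatgpt_run2)]
fit = linregress(human_mean, chatgpt_mean)
print(f"human vs ChatGPT: R2 = {fit.rvalue**2:.2f}, P = {fit.pvalue:.4g}")
```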
The majority of the 558 peer reviews (90.1%) were of positive sentiment; 7.9% were negative; 2% were neutral (i.e., a sentiment score of 0) (Fig. 2b). 99.8% of reviews were deemed polite by the algorithm (i.e., a positive politeness score); only one review was scored as rude (i.e., a negative politeness score; Fig. 2c, bottom left inset). A regression analysis indicated a strong relation between the reviews’ sentiment and politeness scores (60% of variance explained in a third-degree polynomial regression) (Fig. 2c). Thus, the more positive a review, the more polite the reviewer’s language generally is. It is important to note that the papers included in this analysis were ultimately accepted for publication in Nature Communications, which has a low acceptance rate of 7.7%. As a result of this selection, positive scores are over-represented in this analysis (Fig. 2c, bottom right inset).
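A third-degree polynomial fit of politeness against sentiment, and the variance it explains, can be computed as sketched below; the arrays are toy values standing in for the actual review scores.

```python
# Sketch of the third-degree polynomial regression on toy data.
import numpy as np

sentiment = np.array([-60, -20, 0, 30, 55, 80, 95], dtype=float)
politeness = np.array([10, 25, 35, 50, 60, 75, 85], dtype=float)

coeffs = np.polyfit(sentiment, politeness, deg=3)   # third-degree polynomial fit
predicted = np.polyval(coeffs, sentiment)
ss_res = np.sum((politeness - predicted) ** 2)
ss_tot = np.sum((politeness - politeness.mean()) ** 2)
print(f"variance explained: {1 - ss_res / ss_tot:.2f}")
```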
Consistency across reviewers
If a research paper meets certain objective standards of quality, one can reasonably expect that reviewers evaluating that paper would share a common view on its overall sentiment. To investigate whether this is the case, I analyzed the consistency of review scores for the same paper (Fig. 3). As expected, the overall distributions of sentiment and politeness scores did not differ between the first three reviewers (Fig. 3a). Interestingly, a cross-correlational analysis of sentiment scores across reviewers indicated very low, if any, correlation between the sentiment scores of reviews for the same paper (Fig. 3b). The maximum variance explained in sentiment scores between reviewers was 5.5% (between reviewers 1 and 3; the only comparison that reached statistical significance). This surprising result indicates high levels of disagreement between reviewers in their favorability towards a paper, suggesting that the peer review process is subjective.
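Conceptually, this cross-correlational analysis pivots the reviews into one row per paper and correlates reviewer columns pairwise. The sketch below illustrates the idea on a hypothetical table; the column names and values are assumptions for demonstration only.

```python
# Sketch: pairwise correlation of sentiment scores between reviewer positions.
import pandas as pd
from itertools import combinations
from scipy.stats import linregress

# one row per review: paper id, reviewer position (1-3), sentiment score (toy data)
reviews = pd.DataFrame({
    "paper":     [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "reviewer":  [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "sentiment": [80, 20, 65, 50, 90, 35, -10, 40, 70],
})
wide = reviews.pivot(index="paper", columns="reviewer", values="sentiment")

for a, b in combinations(wide.columns, 2):
    pair = wide[[a, b]].dropna()      # keep papers scored by both reviewer positions
    fit = linregress(pair[a], pair[b])
    print(f"reviewer {a} vs {b}: R2 = {fit.rvalue**2:.3f}, P = {fit.pvalue:.3f}")
```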
I then looked at the relation between a paper’s review scores and its review time (i.e., the time from paper submission to acceptance). For this analysis, review scores were first classified as the lowest, median (only for papers with an odd number of reviewers), or highest for a paper (Fig. 3c). A linear regression analysis indicated that the lowest sentiment score was the best predictor of a paper’s review time (R2 = 0.1404, P < 0.0001), followed by the median sentiment score (R2 = 0.0670, P = 0.0002) (Fig. 3c, bottom left panels). Interestingly, a paper’s highest sentiment score did not significantly predict review time (R2 = 0.0088, P = 0.1874).
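The lowest/median/highest classification and the subsequent regression against review time could be implemented as sketched below; the per-paper score lists and review times are hypothetical.

```python
# Sketch: classify scores per paper and regress the lowest score on review time.
from scipy.stats import linregress

def split_scores(scores):
    """Return (lowest, median, highest); the median is only defined for an odd number of reviews."""
    ordered = sorted(scores)
    median = ordered[len(ordered) // 2] if len(ordered) % 2 == 1 else None
    return ordered[0], median, ordered[-1]

# hypothetical (sentiment scores, review time in days) per paper
papers = [([80, 20, 65], 210), ([50, 90, 35], 150), ([-10, 40, 70], 320), ([60, 30], 180)]

lowest_scores = [split_scores(scores)[0] for scores, _ in papers]
review_times = [days for _, days in papers]
fit = linregress(lowest_scores, review_times)
print(f"lowest score vs review time: R2 = {fit.rvalue**2:.3f}, P = {fit.pvalue:.3f}")
```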
Exploring disparities in peer review
To explore potential sources of disparities in scientific publishing, I correlated the review scores, pooled across all papers, with the different paper and author metrics collected earlier (Fig. 1b). No significant differences in sentiment or politeness scores were observed across the different subfields of neuroscience (Fig. 4a). With respect to the institutional affiliation of the senior author, scores did not differ by the continent in which the senior author was based (Fig. 4b). Additionally, no correlation was observed between the institute’s score in the QS World Ranking (an imperfect metric of the institute’s perceived prestige) and the paper’s sentiment and politeness scores (Fig. 4c).
Finally, I looked at how the gender of the first and senior authors may affect a paper’s review scores. First authors with a female name received significantly less polite reviews, but no effect was observed on sentiment (Fig. 4d). When analyzing these same reviews split by low/median/high politeness score, I observed that the lower politeness scores for first authors with a female name were driven by significantly lower lowest and median scores (Fig. 4d, bottom panel). Conversely, female senior authors received significantly higher sentiment scores, indicating more favorable reviews, but these reviews did not differ in terms of politeness (Fig. 4e). An analysis of reviews split by low/median/high sentiment score indicated that the most favorable review received by female senior authors had a significantly higher score than that received by male senior authors (Fig. 4e, bottom panel). No interactions between the genders of the first and senior authors were observed on scores (Supplementary Fig. 5).
Discussion
Peer review is a crucial component of scientific publishing. It helps ensure that research papers are of high quality and have been scrutinized by experts in the field. However, the potential for subjectivity in the peer review process has been an ongoing concern. For example, implicit or explicit bias of reviewers may lead to disparities in peer review scores on the basis of gender or institutional affiliation. In this study, I used natural language processing tools embedded in OpenAI’s ChatGPT to analyze over 500 peer review reports from 200 papers that were accepted for publication in Nature Communications within the past year. I found that this approach was able to provide consistent and accurate scores, indicating that generative artificial intelligence can be an easy and useful tool in studying scientific peer review. The findings reveal several key insights into the peer review process and highlight potential areas of concern within academic publishing.
First, this study found that evaluations of the same manuscript varied considerably among different reviewers. This finding suggests that the peer review process may be subjective, with different reviewers having different opinions on the quality and validity of the research. This subjectivity may be due to the different backgrounds, experiences, and biases of the reviewers. The inconsistency in the evaluations emphasizes the need for greater standardization in the peer review process, with clear guidelines and protocols that can minimize such discrepancies17.
I also investigated disparities in peer review based on the institutional affiliation of the senior author of a paper. Specifically, I looked at the geographic location (continent) as well as the score of the institute in the 2023 QS World University Rankings. This analysis revealed no relation between these two metrics and the sentiment and politeness of the reviews, suggesting that evaluations were not influenced by the geographical location and perceived prestige of the senior author’s research institution. This finding is encouraging and suggests that peer review may be based on the quality and merit of the research rather than the authors’ research institute. That said, the identity of the peer reviewers is not known, so it cannot be tested whether reviewers have a bias with respect to authors from a more closely related country, culture or institution (i.e., in-group favoritism).
This study further found that first authors with a female name received less polite reviews than first authors with a male name, although this did not affect the favorability of their reviews. Regardless, this disparity is worrisome as it may indicate an unconscious gender bias in review writing that may ultimately impact the confidence and motivation of (especially early-stage) female researchers. One may argue that the effect size of gender on politeness scores is small, but given the selection bias in this dataset (Fig. 2c, bottom right inset), this effect may be larger in the entire pool of reviewed manuscripts (i.e., rejected + accepted). To address this issue, double-blind peer review, in which the authors’ names are anonymized, could be implemented. Double-blind peer review has been debated before and has come under scrutiny for various reasons8,18,19. Another, arguably simpler, solution could be for reviewers to be more mindful of their language use. Indeed, even negative reviews can be written in a polite manner (Fig. 2c), and reviewers may want to use ChatGPT to extract a politeness score for their review before submitting it.
Additionally, female senior authors received more favorable reviews than male senior authors in this pool of accepted papers. This disparity in sentiment score in favor of women may be surprising given the wealth of data showing unconscious bias against women, including in scientific research20,21. It is therefore likely that the observed effect is due to selection bias elsewhere in the publishing process, and there may be two potential sources of this bias. The first is that female senior authors may submit better papers to this journal than their male peers, such that the observed gender effect on sentiment is representative of the entire pool of submitted manuscripts (i.e., rejected + accepted). This could be the result of institutional barriers that lead to a small, but highly talented, pool of female principal investigators22 that submits better papers than their male peers23. Alternatively, women may exercise a higher level of self-imposed quality control24, such that men submit papers of more variable quality to high-impact journals like Nature Communications. In the imperfect process that is editorial decision making, this may lead to the publication of some lower-quality papers from male senior authors. The second explanation may be an (unconscious) selection bias in the editorial process25, requiring female senior authors to have better papers before being sent out for peer review, or better scores before being invited for a revise-and-resubmit. As such, paper acceptance may serve as a collider variable26,27, inducing a spurious association between the gender of the senior author and the sentiment score. Further research is required to investigate the reasons behind this effect and to identify at what level of the academic system these differences emerge.
Notably, there are several limitations to this study. The peer review reports analyzed here all belong to papers that were ultimately accepted for publication in Nature Communications, meaning that there is a selection bias in the reviews that were included. As such, papers that received less favorable reviews, or papers that were not sent out for peer review at all, were not included in this analysis. It is unclear what the gender and institutional affiliation distributions are for the papers that ultimately remained unpublished. Additionally, this study only focused on the neuroscience field, and the findings may not generalize to other fields. Similarly, it is not clear whether the results from this study apply to journals beyond Nature Communications.
Together, this study serves as a proof of concept for the use of natural language processing to analyze scientific peer review. Using this approach, areas of concern were identified within the academic publishing system that require immediate attention. One such area is the inconsistency between reviews of the same paper, highlighting the need for greater standardization in the peer review process. Additionally, I uncovered possible gender disparities in academic publishing and reviewing. This research underscores the potential of generative artificial intelligence to evaluate and enhance scientific peer review, which may ultimately lead to a more equitable and just academic system.
Methods
Downloading reviews
Reviewer reports were downloaded from the website of Nature Communications in February 2023. Only papers that were categorized under Biological sciences > Neuroscience were included in this analysis. Not all papers had their primary reviewer reports published; to reach the total of 200 papers with primary review reports, the most recently published 283 papers were considered (published between August 16, 2022 and February 17, 2023).
Additional paper metrics were subsequently collected. Paper submission and acceptance dates were downloaded from the ‘About this article’ section on the paper website. Review time was calculated by counting the number of days between these two dates. Research field was manually categorized into five different subfields on the basis of the title and abstract of the paper. The affiliation of the senior author was downloaded from the paper website and manually categorized by continent; if the senior author had affiliations across multiple continents, the paper was categorized as ‘multiple’ and not used for further analyses (this was the case for 5 papers). The affiliated institution’s score in the 2023 QS World Ranking was downloaded from the QS World Ranking website (TopUniversities.com) in March 2023; the maximum score an institution could receive was 100 (in 2023 this was Massachusetts Institute of Technology). Not all institutions were listed in the QS World Ranking, usually because they were not considered organizations of higher education. If a senior author had multiple affiliations, the affiliation with the highest score was used. The gender categorization of the first author was based on name only, to reflect that reviewers generally do not know the first author of a paper they review personally (and thus may themselves infer the first author’s gender from the name only). This name-based categorization was performed using ChatGPT (query: “Of the following list of international full (first+last) names, can you guess, based on name only, if these people are male, female, or unknown (i.e., name is not gender specific)?”). As a confirmation, all names that were assigned a gender by ChatGPT were verified using the Genderize database (http://genderize.io; probability > 0.5). The gender of the senior author was categorized in a similar manner, except that the categorization for gender-unspecific names was manually completed, usually by looking up the senior author on the research institution’s website or the author’s Google Scholar or Twitter profile. In this manual look-up, I tried to find the senior author’s preferred pronouns. If these were not available, I inferred the senior author’s gender on the basis of a photograph. I did not find evidence that any of the senior authors included in this analysis identified as non-binary; for 4 senior authors I was not able to find or infer their gender. Note that this gender look-up was performed for the senior author, but not for the first author, because reviewers are generally familiar with the senior author of papers they review and thus are likely aware of their gender identity.
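The Genderize verification step queries the public genderize.io endpoint with a first name and returns a predicted gender with a probability. A minimal sketch of such a query is shown below; the probability > 0.5 criterion follows the description above, and note that the free tier of this API is rate-limited.

```python
# Sketch of a genderize.io query; only the first name is sent.
from typing import Optional
import requests

def genderize(first_name: str) -> tuple[Optional[str], float]:
    """Query the public genderize.io endpoint for a name-based gender guess."""
    resp = requests.get("https://api.genderize.io", params={"name": first_name}, timeout=10)
    data = resp.json()
    return data.get("gender"), data.get("probability") or 0.0

gender, probability = genderize("Maria")
if gender is not None and probability > 0.5:      # criterion used in this study
    print(f"assigned gender: {gender} (probability {probability})")
```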
Sentiment analysis
Scoring of the sentiment and politeness of language use in each peer review report was performed using OpenAI’s ChatGPT (Version Feb 13, 2023). The prompt consisted of the following question (see Fig. 2a):
“Below you will find a scientific peer review. Such reviews generally contain the reviewer’s sentiment in the first paragraph(s) of the review, followed by a list of specific recommendations to the authors. Can you score this peer review on [1] the sentiment, on a scale from -100 (negative) to 0 (neutral) to 100 (positive), and [2] politeness of language use, on a scale of -100 (rude) to 0 (neutral) to 100 (polite)?”, followed by the full text of the peer review. This question was entered into ChatGPT twice, and the average of the two scores was used for further analyses; for the correlation between the two iterations, see Supplementary Fig. 2.
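The scoring described above used ChatGPT itself rather than a scripted pipeline. For readers who want to automate a similar analysis, the sketch below shows how the same prompt could be submitted twice per review through the OpenAI Python API; the model name, client usage and return format are assumptions and were not part of the original workflow.

```python
# Sketch of a scripted reproduction of the scoring step (assumed model name).
from openai import OpenAI

PROMPT = (
    "Below you will find a scientific peer review. Such reviews generally contain the "
    "reviewer's sentiment in the first paragraph(s) of the review, followed by a list of "
    "specific recommendations to the authors. Can you score this peer review on [1] the "
    "sentiment, on a scale from -100 (negative) to 0 (neutral) to 100 (positive), and [2] "
    "politeness of language use, on a scale of -100 (rude) to 0 (neutral) to 100 (polite)?\n\n"
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def score_review(review_text: str, n_iterations: int = 2) -> list[str]:
    """Ask the model to score the same review twice; the numeric scores can then be
    extracted from each reply and averaged, as described in the Methods."""
    replies = []
    for _ in range(n_iterations):
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": PROMPT + review_text}],
        )
        replies.append(response.choices[0].message.content)
    return replies
```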
Statistics
Linear regression and cross-correlational analyses in Fig. 3b and 3c were performed using JASP 0.16 (University of Amsterdam). For the polynomial regression in Fig. 2c, data were centered by z-scoring the individual sentiment and politeness scores. Statistical tests in Fig. 3a and Fig. 4 were performed in Prism 9 (GraphPad Inc.). For Fig. 3a, a mixed effects model was used to compute statistical significance between the reviewers, because repeated-measures data were not always available (i.e., not all papers received a third review). To compute statistical significance in Fig. 4a and 4b, a Kruskal-Wallis ANOVA was used. For Fig. 4c, significance was calculated using linear regression. For Fig. 4d and 4e, Mann-Whitney tests were used to compute significance between male and female authors. Significant effects were further studied by splitting the reviews per score (i.e., splitting into lowest, median and highest scores per paper). To calculate statistical significance between male and female authors for the lowest/median/highest scores in Fig. 4d and 4e, Mann-Whitney tests were used. Statistical tests were always two-tailed. Note that review scores were not always normally distributed, so non-parametric tests were mostly used. Significance was defined as P < 0.05 and denoted with asterisks; * P < 0.05, ** P < 0.01, *** P < 0.001.
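The group comparisons were run in Prism; equivalent non-parametric tests are available in scipy, as sketched below on hypothetical score lists.

```python
# Sketch of the Kruskal-Wallis and Mann-Whitney comparisons on toy data.
from scipy.stats import kruskal, mannwhitneyu

# hypothetical sentiment scores grouped by subfield (Fig. 4a/4b-style comparison)
scores_by_group = [[60, 75, 80], [55, 70, 90, 40], [65, 85, 72]]
h, p_kw = kruskal(*scores_by_group)
print(f"Kruskal-Wallis: H = {h:.2f}, P = {p_kw:.3f}")

# hypothetical scores split by author gender (Fig. 4d/4e-style comparison)
female_scores = [70, 85, 60, 90]
male_scores = [65, 55, 75, 80, 50]
u, p_mw = mannwhitneyu(female_scores, male_scores, alternative="two-sided")
print(f"Mann-Whitney: U = {u:.1f}, P = {p_mw:.3f}")
```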
Data availability
All data are available as a Supplementary file to this paper.
Acknowledgements
J.P.H.V. was supported by a Rubicon postdoctoral fellowship from the Netherlands Organization of Scientific Research. I thank Amanda Tose, Han de Jong and Stephan Lammel for helpful comments on this manuscript.
Conflict of interest statement
The author declares no competing interests.
Supplementary Figures
References
- 1. Modelling the effects of subjective and objective decision making in scientific peer review. Nature 506:93–96
- 2. Journal peer review in context: A qualitative study of the social and subjective dimensions of manuscript review in biomedical publishing. Social Science & Medicine 72:1056–1063
- 3. Systematic Subjectivity: How Subtle Biases Infect the Scholarship Review Process. Journal of Management 44:843–853
- 4. Bias in peer review. Journal of the American Society for Information Science and Technology 64:2–17
- 5. Publish or Politic: Referee Bias in Manuscript Review. Journal of Applied Social Psychology 5:187–200
- 6. Double-blind review favours increased representation of female authors. Trends in Ecology and Evolution 23:4–6
- 7. The effects of double-blind versus single-blind reviewing: experimental evidence from The American Economic Review. The American Economic Review 81:1041–1067
- 8. “I don’t see gender”: Conceptualizing a gendered system of academic publishing. Social Science & Medicine 235
- 9. Reviewer bias in single- versus double-blind peer review. PNAS 114:12708–12713
- 10. Natural Language Processing. In: Fundamentals of Artificial Intelligence. New Delhi: Springer
- 11. Advances in natural language processing. Science 349:261–266
- 12. Sentiment analysis using deep learning architectures: a review. Artificial Intelligence Review 53:4335–4385
- 13. Sentiment Analysis of Peer Review Texts for Scholarly Papers. SIGIR ’18: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval: 175–184
- 14. Aspect-based Sentiment Analysis of Scientific Reviews. JCDL ’20: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020: 207–216
- 15. Acceptance Decision Prediction in Peer-Review Through Sentiment Analysis. Progress in Artificial Intelligence
- 16. Transparent peer review at Nature Communications. Nature Communications 6:10277
- 17. The limitations to our understanding of peer review. Research Integrity and Peer Review 5
- 18. Blinded vs. unblinded peer review of manuscripts submitted to a dermatology journal: a randomized multi-rater study. Clinical and Laboratory Investigations 165:563–567
- 19. Single- versus double-blind reviewing: an analysis of the literature. ACM SIGMOD Record 35:8–21
- 20. Women and science careers: leaky pipeline or gender filter? Gender and Education 17:369–386
- 21. Fixing the leaky pipeline: women scientists in academia. Journal of Animal Science 74:2843–2848
- 22. Elite male faculty in the life sciences employ fewer women. PNAS 111:10107–10112
- 23. Publishing while female: Are women held to higher standards? Evidence from peer review. The Economic Journal 132:2951–2991
- 24. Women and leadership in higher education in Australia. Tertiary Education and Management 9:45–60
- 25. Editorial bias in scientific publications. Neurología 26:1–5
- 26. Collider Bias. JAMA 327:1282–1283
- 27. Collider bias undermines our understanding of COVID-19 disease risk and severity. Nature Communications 11
Copyright
© 2023, Jeroen P. H. Verharen
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.