An experimental test of the effects of redacting grant applicant identifiers on peer review outcomes

  1. Richard K Nakamura
  2. Lee S Mann
  3. Mark D Lindner
  4. Jeremy Braithwaite
  5. Mei-Ching Chen
  6. Adrian Vancea
  7. Noni Byrnes
  8. Valerie Durrant
  9. Bruce Reed (corresponding author)
  1. Retired, formerly Center for Scientific Review, National Institutes of Health, United States
  2. Center for Scientific Review, National Institutes of Health, United States
  3. Social Solutions International, United States

Abstract

Background:

Blinding reviewers to applicant identity has been proposed to reduce bias in peer review.

Methods:

This experimental test used 1200 NIH grant applications, 400 from Black investigators, 400 matched applications from White investigators, and 400 randomly selected applications from White investigators. Applications were reviewed by mail in standard and redacted formats.

Results:

Redaction reduced, but did not eliminate, reviewers’ ability to correctly guess features of identity. The primary, preregistered analysis hypothesized a differential effect of redaction according to investigator race in the matched applications. A set of secondary analyses (not preregistered) used the randomly selected applications from White scientists and tested the same interaction. Both analyses revealed similar effects: standard format applications from White investigators scored better than those from Black investigators. Redaction cut the size of the difference by about half (e.g., from a Cohen’s d of 0.20 to 0.10 in the matched applications); redaction caused applications from White scientists to score worse but had no effect on scores for Black applications.

Conclusions:

Grant-writing considerations and halo effects are discussed as competing explanations for this pattern. The findings support further evaluation of peer review models that diminish the influence of applicant identity.

Funding:

Funding was provided by the NIH.

Introduction

The National Institutes of Health (NIH) distributes over $34 billion per year in research grants to support biomedical research at research institutions and small businesses across the United States. NIH funding is important not only to the future of scientific discovery but also to the careers of individual scientists: grant funding enables them to pursue their scientific studies, and success in obtaining NIH funding often factors into tenure and promotion decisions. In 2011, the National Academy of Sciences (NAS) issued a report arguing that increased diversity in the scientific workforce is critical to ensure that the United States maintains its global leadership and competitive edge in science and technology (National Academy of Sciences, N.A.o.E, 2011). The same year, Ginther et al. reported that the likelihood of Black PIs being awarded NIH research funding between 2000 and 2006 was 55% of that of White investigators (Ginther et al., 2011). This funding gap persists (Ginther et al., 2011; Ginther et al., 2018; Hoppe et al., 2019; Erosheva et al., 2020), and the proportion of NIH research grant applicants who are Black has increased only slightly in the ensuing years, from 1.4% in Ginther’s data to 2.3% in 2020.

The largest single factor determining the probability of funding in the highly competitive NIH system is the outcome of peer review. Peer review panels meet to discuss and score applications according to scientific merit; the possibility that peer review is biased is of great concern to applicants, funding agencies, and the American public (Fox and Paine, 2019; Gropp et al., 2017; Hengel, 2017; Lerback and Hanson, 2017; Wennerås and Wold, 1997; Taffe and Gilpin, 2021). Disparities in success rates in scientific publishing (Bendels et al., 2018; Hopkins et al., 2012; Ouyang et al., 2018) and in obtaining grant awards (Ginther et al., 2011; Pohlhaus et al., 2011; Ginther et al., 2016) raise the question of whether reviewer bias based on applicant demographics (race, gender, age, career stage, etc.) or institutional reputation unfairly influences review outcomes (Wahls, 2019; Witteman et al., 2019; Murray et al., 2016). These concerns are particularly salient for NIH because the criteria for evaluating the scientific merit of research projects include ‘investigators’ and ‘environment’, thus explicitly directing reviewers to take these factors into account.

NIH funding is not determined by peer review alone; it is additionally shaped by scientific priorities and budgets at the funding institutes. Funding rates for major research grants vary approximately threefold across the institutes, from about 10% to nearly 30%, meaning that applications in some areas of science enjoy greater success than others. A recent paper focused attention on the finding that funding success varied substantially depending on scientific topic, and that the topics most often studied by Black investigators tend to have low funding rates (Hoppe et al., 2019). An important follow-up paper showed that this association was primarily attributable to the disparate funding rates across the 24 NIH institutes, rather than topical bias in peer review (Lauer et al., 2021). Nonetheless, peer review outcomes are a fundamental determinant of success across the NIH.

Various approaches to reducing demographic bias in review have been proposed. Blinding reviewers (in ‘double blind’ or ‘dual anonymous’ reviews) to applicant identity and institutional affiliations is one such approach (Cox and Montgomerie, 2019; Fisher et al., 1994; Haffar et al., 2019; Okike et al., 2016; Snodgrass, 2006). The literature examining the impact of blinding on review outcomes is mixed. With respect to gender, for example, blinding has been reported to reduce disparities (Budden et al., 2008; Terrell et al., 2017; Aloisi and Reid, 2019) but has also been ineffectual (Primack, 2009; Whittaker, 2008; Blank, 1991; Ledin et al., 2007). To our knowledge, there are no studies evaluating real review of scientific grants blinded with respect to race. Even so, the importance of diversifying science, the strong correlation between review and funding outcomes, and the perceived tractability of review make review interventions attractive. Strong concerns about the potential of demographic bias, and especially anti-Black racial bias, in peer review remain (Stevens et al., 2021).

The present study was part of the NIH response to Ginther’s 2011 report of the Black-White funding gap and the NAS report (National Academy of Sciences, N.A.o.E, 2011) on the lack of diversity in the U.S. scientific workforce. An NIH Advisory Committee to the Director (ACD) Working Group on Diversity in the Biomedical Research Workforce (WGDBRW) recommended that ‘NIH should design an experiment to determine the effects of anonymizing applications…’ in peer review (Working Group on Diversity in the Biomedical Research Workforce (WGDBRW) and T.A.C.t.t.D. (ACD), 2012). An NIH Subcommittee on Peer Review (National Institutes of Health, 2013) and an internal NIH group led by the then-CSR Director (RN) designed the present study as a large-sample experimental investigation of the potential effects of racial bias on review outcomes, specifically bias against Black or African American investigators (see Figure 1).

Figure 1. Study background and timeline.

The decision to restrict the focus to Black scientists stemmed from three considerations. First, although funding gaps have been reported for other disadvantaged groups, none has been persistently as large as that experienced by Blacks. For example, in 2013 NIH award rates for major research projects for Hispanic scientists were 81% that of White investigators, and award rates for Asian scientists were 83% that of Whites. Both rates improved to 88% that of Whites in 2020. Past differences in funding rates for male versus female applicants for major NIH research awards have disappeared (https://report.nih.gov/nihdatabook/report/131). Second, the funding disparity for Blacks occurred in the context of the U.S. history of centuries-long, pernicious, and pleiotropic anti-Black racism. Lastly, the scale of a properly powered experiment to investigate multiple forms of demographic bias was prohibitive from a practical perspective.

The experimental intervention was to anonymize applications by post-submission redaction of applicant identity and institutional affiliation. Using real, previously reviewed NIH R01 applications, the experiment compared scores for applications from Black vs. White applicants as re-reviewed for this study in their standard (original) vs. redacted formats. The primary research question was ‘Does concealing the race and identity of the applicant affect reviewers’ overall impact scores of applications from Black and White applicants differently?’.

Materials and methods

Design

The study was conducted by a contract organization, Social Solutions International (SSI). The study design and analytic plan were preregistered at the Center for Open Science in October 2017 (https://osf.io/3vmfz). The experiment obtained reviews of applications from Black PIs vs. White PIs in either the standard (original) format or the redacted format. Applications were real NIH R01s (NIH’s major research project awards) that had been previously reviewed in 2014–2015 by CSR. There were three sets of applications: 400 R01 applications submitted by Black PIs, and two comparator sets from White PIs, one selected to match the Black PIs’ applications on review-relevant features, the other selected randomly. All applications were redacted to obscure the PI’s identity, race, and institutional affiliation. The original and redacted versions of each application were re-reviewed independently for this study by new reviewers. For each application, each reviewer provided (1) a preliminary overall impact score (the primary outcome measure), (2) a written critique, (3) guesses of the race, ethnicity, gender, institutional affiliation, career stage, and name of the investigator, along with confidence ratings regarding those guesses, and (4) ratings of grantsmanship. Grantsmanship was measured with two items intended to capture aspects of the application evaluation not reflected in the overall impact score or the five individual criterion scores: (1) Is the application organized, well-written, and easy to follow? (‘Grant 1’), and (2) Did the application enable the reviewer to generate informed conclusions about the proposed project? (‘Grant 2’). The major hypothesis was that redaction would differentially affect the scores given to the Black and White application sets; that is, that when applications were redacted, applications from Black PIs would score better, applications from White PIs would score worse, or both.

Sample

The preregistered plan called for a sample size of 400 per group based on power calculations for a two-sample t-test with alpha = 0.05, an effect size of 0.25, and power of 94%. As documented in the Transparent Change Summary (available on request), linear mixed models were used instead of the originally registered t-tests, and the central hypothesis was tested with an interaction term. A revised power analysis focused on detecting interactions in a mixed-effects linear regression showed that with an N of 400 per cell, the study had 70% power to detect an effect size (d) of 0.2, 80% power to detect an effect size of 0.25, and greater than 90% power to detect an effect size of 0.3 (Leon and Heo, 2009; Supplementary file 1A).
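
Power for an interaction term in this kind of design can also be approximated by simulation. The sketch below is illustrative only: the variance components, the data layout (two format conditions within each application), and the way the effect size d is mapped onto the interaction term are our assumptions, not the calculation actually used (which followed Leon and Heo, 2009).

```python
# Illustrative power simulation for a race x format interaction in a
# mixed model with a random intercept per application.  All parameter
# values (SDs, effect size, sample sizes) are assumptions for the sketch.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

def simulate_once(n_per_group=400, d=0.25, sd_app=0.8, sd_res=0.6):
    total_sd = np.hypot(sd_app, sd_res)          # SD of a single observation
    rows = []
    for race in (0, 1):                          # 0 = Black, 1 = White
        app_effects = rng.normal(0, sd_app, n_per_group)
        for i, u in enumerate(app_effects):
            for fmt in (0, 1):                   # 0 = redacted, 1 = standard
                y = 4.0 + u + d * total_sd * race * fmt + rng.normal(0, sd_res)
                rows.append({"app": f"{race}_{i}", "race": race,
                             "fmt": fmt, "score": y})
    return pd.DataFrame(rows)

def interaction_p(df):
    m = smf.mixedlm("score ~ race * fmt", df, groups=df["app"]).fit(reml=False)
    return m.pvalues["race:fmt"]

n_sim = 200                                      # kept small; each fit is slow
power = np.mean([interaction_p(simulate_once()) < 0.05 for _ in range(n_sim)])
print(f"Estimated power: {power:.2f}")
```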

The 400 R01 applications from Black contact PIs comprised nearly 80% of such applications reviewed in 2014–2015. The plan specified that a sample of 400 applications from White PIs, matched to the Black PI applications on preliminary overall impact score and on review-relevant characteristics, would be the comparison group for the primary test of the hypothesis. A secondary comparison set of 400 applications from White PIs, randomly selected from the approximately 26,000 reviewed in 2014–2015, was also drawn (‘random White sample’). The random White sample was intended to provide a ‘real world’ comparator and an alternative test of the hypothesis (Rubin, 2006; Campbell and Stanley, 1963).

NIH applications may have more than one principal investigator, that is, they may be ‘multiple PI’ (MPI) applications. MPI applications were assigned to groups based on the demographics of the contact PI only. Overall, 21% of applications were MPI. See Table 1 for sample characteristics.

Table 1
PI demographics and application characteristics by sample.
Match criteria | Black (n = 400) | White matched (n = 400) | White random (n = 400)
Gender
 Male | 232 | 233 | 276
 Female | 166 | 167 | 120
 Unknown | 2 | 0 | 4
Institution NIH awarded dollars, mean (SD), in $ millions | 182.88 (172.02) | 171.12 (159.85) | 176.92 (157.13)
Type of application
 Type 1 (New) | 370 | 369 | 334
 Type 2 (Renewal) | 30 | 31 | 66
Revision or resubmission
 A0 (original submission) | 290 | 290 | 263
 A1 (resubmission) | 110 | 110 | 137
Early stage investigator
 Yes | 102 | 102 | 47
 No | 298 | 298 | 353
Investigator age, mean (SD) | 48.66 (9.31) | 50.27 (10.20) | 51.96 (9.96)
Behavioral/social science IRG
 Yes | 174 | 173 | 75
 No | 226 | 227 | 325
Degree held
 MD | 80 | 72 | 54
 PhD | 237 | 267 | 289
 MD/PhD | 37 | 33 | 40
 Others | 24 | 16 | 8
 Unknown | 22 | 12 | 9
Original preliminary overall impact scores, mean (SD) | 4.35 (1.46) | 4.34 (1.36) | 3.94 (1.26)
% with multiple PIs | 24 | 18 | 21

Matching and redaction

The intent of using matched sets of applications from Black and White PIs was to isolate the effect of PI race and redaction on review outcomes. Applications were matched on actual preliminary overall impact scores and seven additional variables: (1) area of science (behavioral/social science vs. other), (2) application type (new/renewal), (3) application resubmission or not, (4) gender, (5) early-stage investigator (ESI) or not, (6) degree of PI (PhD, MD, etc.), and (7) institutional research funding (quintiles of NIH funding) (Supplementary file 1B).
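
As an illustration of how such matching can be implemented, the sketch below pairs each Black-PI application with the closest-scoring White-PI application among those that agree exactly on the categorical matching variables. The column names, data layout, and greedy nearest-neighbor strategy are assumptions for the example; they are not a description of the contractor's actual procedure.

```python
# Hypothetical sketch of 1:1 matching: exact match on categorical
# review-relevant variables, nearest available preliminary impact score.
import pandas as pd

CATEGORICAL = ["behavioral_irg", "app_type", "resubmission",
               "gender", "esi", "degree", "funding_quintile"]

def match_applications(black: pd.DataFrame, white_pool: pd.DataFrame) -> pd.DataFrame:
    """Return one matched White application per Black application."""
    pool = white_pool.copy()
    matches = []
    for _, app in black.iterrows():
        # candidates that agree on every categorical matching variable
        mask = (pool[CATEGORICAL] == app[CATEGORICAL]).all(axis=1)
        candidates = pool[mask]
        if candidates.empty:
            continue                            # unmatched; handled separately
        # nearest preliminary overall impact score among the candidates
        best = (candidates["impact_score"] - app["impact_score"]).abs().idxmin()
        matches.append({"black_app": app["app_id"],
                        "white_app": pool.loc[best, "app_id"]})
        pool = pool.drop(index=best)            # sample without replacement
    return pd.DataFrame(matches)
```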

Redaction was performed by a team of 25 SSI research staff and checked by a quality assurance team member. Redaction took between 2 and 8 hr per application to accomplish, and quality assurance took 2–4 hr more (redacted fields listed in Supplementary file 1C).

Review procedures

Reviews were overseen by nine PhD-level scientists contracted by SSI, who functioned as scientific review officers (SROs). Three had retired from CSR, and one had previous contract SRO experience at CSR. The other five had no prior SRO experience with NIH. All SROs were provided with 6 hr of group training along with individual coaching from NIH-experienced SROs. Reviewers were recruited by the SROs from more than 19,000 scientists who had served on the study sections where the 1200 applications were originally reviewed. Reviewers were recruited using a standardized email invitation that stated that this was a study being conducted ‘to examine the impact of anonymization on NIH peer review outcomes in support of CSR’s mission to ensure that grant review processes are conducive to funding the most promising research’. Reviewers were told nothing about the racial composition of the application sample.

Reviewers were assigned applications based on expertise: SROs reviewed the application project narrative, abstract, specific aims, and research strategy and tagged each application with key words to indicate the topic and methods. SROs then matched applications to potential reviewers’ Research, Condition, and Disease Categorization (RCDC) terms and scores. RCDC terms are system-generated tags applied by NIH to all incoming applications, designed to characterize their scientific content. Weighted combinations of scores can be used to characterize the content of an application or to characterize a scientist’s expertise.
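
As a generic illustration of matching on weighted term profiles (not NIH's actual RCDC weighting algorithm, which is not described here), expertise overlap can be scored with something as simple as a cosine similarity between an application's term weights and a reviewer's term weights. The term names and weights below are made up.

```python
# Generic illustration of expertise matching with weighted term vectors.
# This is not NIH's RCDC algorithm; term weights are invented.
import numpy as np

def cosine_similarity(app_terms: dict, reviewer_terms: dict) -> float:
    """Cosine similarity between two {term: weight} dictionaries."""
    terms = sorted(set(app_terms) | set(reviewer_terms))
    a = np.array([app_terms.get(t, 0.0) for t in terms])
    r = np.array([reviewer_terms.get(t, 0.0) for t in terms])
    denom = np.linalg.norm(a) * np.linalg.norm(r)
    return float(a @ r / denom) if denom else 0.0

application = {"Neurosciences": 0.6, "Behavioral and Social Science": 0.3,
               "Clinical Research": 0.1}
reviewer = {"Neurosciences": 0.5, "Clinical Research": 0.4, "Genetics": 0.1}
print(f"match score: {cosine_similarity(application, reviewer):.2f}")
```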

Six reviewers were recruited for each application; three were randomly assigned to review the standard application, three to review the redacted version. Most reviewers reviewed some standard format and some redacted applications. The goal was for each reviewer to receive ~6 applications to review, but problems with reviewer recruitment and attrition resulted in an average of 3.4 applications per reviewer (median = 3, interquartile range = 1–5, maximum = 29).

In standard NIH peer review, each reviewer scores the application on its overall level of scientific merit before reading other reviewers’ critiques (the ‘preliminary impact score’, the outcome for this study); the reviewer may then read others’ critiques and adjust that score, presents the preliminary score to the panel and explains the basis for it, the panel discusses the application, reviewers revise their scores, and each panelist votes a final score. This procedure was considered not feasible for this study. Instead, review was done entirely through non-interactive written reviews. Reviewers were given a chart of the NIH scoring system (1 = best, 9 = worst) and standard R01 critique templates. In addition to providing an overall impact score, reviewers rated applications on grantsmanship and on whether redacted applications provided enough information to enable a fair review. Reviewers reviewed each application as a package, beginning with the writing of the critique and scoring of the application, and ending with the questions on grantsmanship and guesses about applicant/institutional identity. The review template and additional rating items are in Supplementary file 2. Nearly all applications received the desired six critiques (7155 of 7200).

Statistical analysis

The preregistered protocol defined three primary questions of interest: (1) Effectiveness of redaction in achieving anonymization: Are reviewers less accurate in their assessment of the applicants’ actual race in the anonymized version of the applications? (2) Effectiveness of the matching procedure: Did the matching produce equivalent preliminary overall impact scores in the current study on the standard application format? (3) Primary test of the study hypothesis: Does concealing the race and identity of the applicant affect reviewers’ preliminary overall impact scores of applications from Black and White applicants differently?

Question 1 was evaluated using chi-square analyses comparing rates of correct identification of Black and White PIs using standard format and redacted applications. Questions 2 and 3 were examined using linear mixed models (multi-level models) to account for the intra-class correlation of impact scores within individual applications. The dependent variable is the average of the three reviewers’ preliminary overall impact scores for each application. The model has two binary main effects, PI race and application format; the study hypothesis is therefore tested with the race × application format interaction term. A significant interaction would indicate that the effect of redaction on scores was different for Black and White applications.
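
In outline, the model is: average score = intercept + race + format + race × format + application random intercept + residual. A minimal sketch of how such a model could be fit is shown below; the data layout (one row per application per format condition, holding the three-reviewer average) and all file and column names are assumptions, not the study's actual code.

```python
# Sketch of the primary analysis: linear mixed model on three-reviewer
# average scores, with a random intercept per application and the
# race x format interaction as the test of the hypothesis.
import pandas as pd
import statsmodels.formula.api as smf

# Assumed layout: one row per application per format condition, with columns
# app_id, race ("Black"/"White"), fmt ("standard"/"redacted"), avg_score.
scores = pd.read_csv("matched_sample_scores.csv")           # hypothetical file

model = smf.mixedlm(
    "avg_score ~ C(race, Treatment('Black')) * C(fmt, Treatment('redacted'))",
    data=scores,
    groups=scores["app_id"],                     # random intercept per application
)
result = model.fit(reml=True)
print(result.summary())          # the race x format term tests the hypothesis
```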

The preregistered plan specified that the hypothesis be tested using the matched samples of applications from Black and White PIs (in order to maximize statistical power), and that the randomly selected set of applications from White investigators would be used for secondary analyses. For clarity of presentation, methods for the secondary models are described with the results of those models.

Results

Preregistered question 1

Question 1 concerns the effectiveness of redaction in achieving anonymization. Table 2 shows that redaction reduced the rate at which reviewers could correctly guess the race of Black PIs by over half, from 58% to 28%. The effect on the rate of correctly guessing the race of White PIs was much smaller (from 93% to 87%). Reviewers mistakenly guessed that Black PIs were White 36% of the time with standard format applications and 61% of the time with redacted applications. (Data for the two White samples were combined for simplicity, because their distributions were highly similar.) Reviewer confidence in their guesses of race using redacted applications was just over 2 on a scale from 1 (‘not at all confident’) to 5 (‘fully confident’) for all PI samples. Using standard format applications, confidence ratings for guesses of race were about one point higher; ratings did not vary appreciably by applicant race (see Table 3). Guesses of PI race based on redacted applications were significantly less likely to be correct than were guesses based on standard applications; χ2(1) = 160.2, p < 0.001.

Table 2
Reviewer’s guesses of applicant race in relation to actual race by application format.
Reviewer guess of PI race | Standard format: Black PIs | Standard format: White PIs | Redacted format: Black PIs | Redacted format: White PIs
Black | 683 (58%) | 49 (2%) | 336 (28%) | 48 (2%)
White | 432 (36%) | 2234 (93%) | 723 (61%) | 2081 (87%)
Other | 45 (4%) | 66 (3%) | 78 (7%) | 172 (7%)
No guess | 25 (2%) | 41 (2%) | 52 (4%) | 90 (4%)
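
For illustration, the format effect on guess accuracy can be tested with a chi-square on counts like those in Table 2. The sketch below uses the Black-PI columns and groups guesses into correct versus not correct (counting 'Other' and 'No guess' as not correct); because the exact tabulation behind the published χ2(1) = 160.2 is not specified here, the value this produces will not necessarily match.

```python
# Sketch: does application format affect the accuracy of race guesses?
# Uses the Black-PI counts from Table 2; the correct/not-correct grouping
# is an assumption made for this illustration.
from scipy.stats import chi2_contingency

#              correct ("Black")   not correct (White / Other / No guess)
table = [[683, 432 + 45 + 25],     # standard format
         [336, 723 + 78 + 52]]     # redacted format

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2({dof}) = {chi2:.1f}, p = {p:.2g}")
```
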
Table 3
Reviewer confidence regarding their guesses of investigator demographics.
Applicant characteristic | Black investigators: standard reviews | Black investigators: anonymized reviews | White matched investigators: standard reviews | White matched investigators: anonymized reviews | White random investigators: standard reviews | White random investigators: anonymized reviews
Race | 3.2 | 2.1 | 3.2 | 2.2 | 3.4 | 2.2
Gender | 4.3 | 2.3 | 4.4 | 2.3 | 4.5 | 2.3
Institution | 4.2 | 3.2 | 4.3 | 3.3 | 4.4 | 3.3
Career stage | 4.2 | 3.1 | 4.2 | 3.2 | 4.4 | 3.2
Note: 5-point scale, 1 = low confidence, 5 = high confidence.

Reviewers of redacted applications were asked to guess the PI/research group. Most of the time they did not venture a guess, but 21% of the time a reviewer was able to make an exact identification. Table 4 details these data according to application set. Guesses for MPI applications were counted as correct if the reviewer named any one of the PIs.

Table 4
Rates of reviewer identification of name/research group in redacted applications.
PI race (n) | Correct | Incorrect | No guess
Overall (3580) | 21.6% (775) | 6.1% (217) | 72.3% (2588)
Black (1189) | 18.9% (225) | 5.6% (67) | 75.4% (897)
White, matched sample (1194) | 19.4% (232) | 7.0% (84) | 73.5% (878)
White, random sample (1197) | 26.6% (318) | 5.5% (66) | 67.9% (813)

Thus, in answer to question 1, redaction diminished but did not eliminate reviewer knowledge of applicant race and identity. Reviewers were about half as likely to identify applicants as Black when viewing redacted applications compared to standard applications.

Preregistered question 2

Question 2 asks whether the matching produced equivalent preliminary overall impact scores for standard format applications. Although the application sets were matched on the preliminary overall impact scores received in NIH review, simple contrasts show that when reviewed for this study, applications from White PIs scored better (M = 3.9 White, 4.1 Black). The effect size was small, d = 0.20. Figure 2 shows the distributions of average preliminary overall impact scores for Black, White matched, and White random PI applications in standard and redacted formats.

Figure 2. Distributions of preliminary overall impact scores according to race of PI and format in which the applications were reviewed.

Boxes delineate the central 50% of scores, those falling between the 25th and 75th percentiles (interquartile range, IQR). Whiskers extend 1.5× the IQR. Dots mark outliers. Horizontal lines within boxes indicate the median, and “x” marks the mean value. Lower scores are better.

Preregistered question 3

Question 3 tests the study hypothesis: Does concealing the race and identity of the applicant affect reviewers’ preliminary overall impact scores of applications from Black and White applicants differently? Table 5 summarizes the analysis, which found significant main effects of both PI race and application format. On average, applications from White PIs received better scores than those from Black PIs. Redacted format applications scored worse than standard format applications. Both effect sizes were small. The prespecified statistical test of the study hypothesis, the race × application format interaction, was not statistically significant (p = 0.17). Removing from the analyses the scores for those cases in which the reviewer correctly identified the PI did not appreciably change the parameter estimates or significance levels.

Table 5
Primary analysis.

Effects of race and application format on overall impact scores in matched White and Black application sets.

 | Estimate | p-Value | 95% Confidence interval (CI)
Fixed effects
 Race | –0.17 | 0.01 | (−0.31, –0.04)
 Application format | –0.10 | 0.02 | (−0.19, –0.02)
 Race × application format | –0.12 | 0.17 | (–0.29, 0.05)
 Intercept | 4.06 | < 0.001 | (3.99, 4.13)
Random effects
 Application intercept | 0.61 | | (0.51, 0.72)
Note: The reference category for race is the Black group. The reference category for application format is the redacted format.

Table 6 shows the observed data and simple contrasts. Redaction had a significant effect on White PIs’ applications (scores became worse). Redaction had no effect on scores for Black PIs’ applications. Distributions of change scores (three-reviewer average score, redacted format minus standard format score) for the two samples were similar: for the Black and matched White samples, respectively, the means were 0.04 and 0.16; medians 0 and 0; 1st quartiles –0.67 and –0.67; 3rd quartiles 1 and 1.

Table 6
Simple contrasts of average preliminary impact scores for redacted vs. standard format applications by PI race.

Matched White application set.

Race | Standard format | Anonymized format | Simple contrast (SE) | Effect size
Black | 4.13 | 4.17 | 0.04 (0.06) | 0.04
White matched | 3.89 | 4.05 | 0.16* (0.06) | 0.14
Simple contrast (SE) | –0.23* (0.08) | –0.12 (0.08) | |
Effect size for race | 0.20 | 0.10 | |
* p < .05 (Bonferroni-adjusted).

Secondary analyses

Using the White random application set as a comparator provided a secondary test of the study hypothesis in applications representative of those received at NIH (from Black or White PIs).

It also allowed exploratory analyses of additional factors that may influence review outcomes. The dependent variable was the preliminary overall impact score entered by each reviewer. Cases with missing data were deleted. Covariates of interest were categorized as follows: investigator demographics, application characteristics, reviewer perceptions, and grantsmanship indicators. Effects of covariates on overall impact scores were tested in a set of linear mixed models. The base model included race of the PI as the only predictor. Models 2–4 add blocks of covariates. For each model, appropriate random effects were specified. To determine which random effects were appropriate, we began by including random slopes for all predictors in the model, then used backward elimination to determine which random effects significantly contributed to the given model. Table 7 displays the fixed and random effects for the nested models.

Table 7
Parameter estimates and standard errors from nested models predicting overall impact scores in the Black and random White application sets.
Fixed effects: Coef. (SE) | Model 1 (n = 4764; 800 applications) | Model 2 (n = 4728; 794 applications) | Model 3 (n = 4728; 794 applications) | Model 4 (n = 4315; 794 applications)
Demographics
 Race (White = 1) | 0.266a (0.069) | 0.132c (0.065) | 0.132c (0.065) | –0.124 (0.068)
 Type 2 application | | 0.492a (0.101) | 0.491a (0.101) | 0.484a (0.104)
 A1 application | | 0.420a (0.069) | 0.420a (0.069) | 0.415a (0.072)
 Gender | | –0.005 (0.067) | –0.005 (0.067) | 0.013 (0.069)
 Early-stage investigator | | 0.178c (0.084) | 0.178c (0.084) | 0.186c (0.087)
 Low NIH institutional funding | | 0.618a (0.094) | 0.618a (0.094) | 0.612a (0.097)
Experimental covariates
 Format (standard = 1) | | | 0.144a (0.042) | –0.022 (0.041)
 Format × race | | | 0.186b (0.083) | 0.237b (0.080)
Perceptions
 PI race guess Black | | | | 0.155b (0.069)
 PI gender guess female | | | | –0.069 (0.061)
 PI career stage guess early-stage investigator | | | | 0.091 (0.063)
 Institutional funding guess ‘low’ | | | | 0.447a (0.134)
Grantsmanship indicators
 Grant 1 | | | | 0.519a (0.027)
 Grant 2 | | | | 0.204a (0.029)
Random effects
 Grant 1 slope | | | | 0.052
 Institution slope | | 0.489 | 0.489 | 0.477
 Application intercept | 0.614 | 0.400 | 0.402 | 0.511
 Residual | 2.044 | 2.041 | 2.032 | 1.561
Note: Superscripts mark statistically significant parameter estimates: a p ≤ 0.001, b p ≤ 0.025, c p < 0.05.
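
As a rough illustration of the modeling approach behind Table 7 (reviewer-level scores, nested blocks of fixed effects, and random effects selected by comparing nested specifications), a sketch is shown below. The file name, column names, and the particular random-effects structure (an intercept plus a format slope per application) are simplified assumptions for the example; they are not the study's actual code, which also considered random slopes for other covariates.

```python
# Sketch of a Model 3-style specification: reviewer-level preliminary impact
# scores, fixed effects for PI race, application covariates, format, and the
# format x race interaction, with application-level random effects.
import pandas as pd
import statsmodels.formula.api as smf

reviews = pd.read_csv("review_level_scores.csv")   # hypothetical file

full = smf.mixedlm(
    "score ~ fmt * race + type2 + a1 + gender + esi + low_funding",
    data=reviews,
    groups=reviews["app_id"],
    re_formula="~ fmt",                 # random intercept + format slope
).fit(reml=False)

reduced = smf.mixedlm(
    "score ~ fmt * race + type2 + a1 + gender + esi + low_funding",
    data=reviews,
    groups=reviews["app_id"],           # random intercept only
).fit(reml=False)

# Likelihood-ratio comparison: does the extra random slope improve the model?
lr_stat = 2 * (full.llf - reduced.llf)
print(full.summary())
print(f"LR statistic for dropping the random slope: {lr_stat:.2f}")
```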

Model 1 tested the unadjusted effect of race of the PI on overall impact scores across both application formats. Applications from White PIs scored better. The effect was small, explaining less than 2% of variance in overall impact scores.

Model 2 added application characteristics and additional characteristics of the PI. All covariates except PI gender had significant effects; resubmissions and competing renewals scored better while applications from ESIs and institutions in the lowest quintile of NIH grant funding scored worse. Including these effects reduced the effect of PI race by half, but PI race remained a significant predictor.

Model 3 provides a secondary test of the study hypothesis by adding terms for application format and the PI race by format interaction. Application format was significant, with redacted applications scoring worse, and the application format × race interaction was significant. Redaction did not significantly change scores for Black PIs but significantly worsened scores for White PIs. Table 8 shows the effects of PI race and application format in the raw (unadjusted) data.

Table 8
Simple contrasts of average preliminary impact scores for redacted vs. standard format applications by PI race.

Randomly selected White application set.

Race | Standard format | Anonymized format | Difference (SE) | Effect size
Black | 4.13 | 4.17 | 0.04 (0.06) | 0.04
White random | 3.76 | 4.01 | 0.25* (0.06) | 0.21
Difference (SE) | –0.37* (0.08) | –0.16 (0.08) | |
Effect size for race | 0.31 | 0.15 | |
* p < .05 (Bonferroni-adjusted).

Model 4 added reviewer guesses of applicant race, gender, ESI status, and institutional funding, and ratings of grantsmanship. Reviewer guesses that the PI was an ESI or was from an institution with low NIH funding were both associated with worse scores, with institutional status having the larger effect. Controlling for all other variables in the model, including actual PI race, a reviewer’s guess that the PI was Black was associated with slightly better scores. Better ratings on grantsmanship indicators were associated with better overall impact scores. In the final model the following indicators were associated with better scores: competitive renewal, resubmission, reviewer ratings of better grantsmanship, and reviewer guess that the PI was Black. The following indicators were associated with worse scores: ESI status, low-funded institution, and reviewer guess that the institution was in the low-funded group. With this set of covariates, neither PI race nor application format was significant, but the format × PI race interaction was significant. PI gender was not a significant predictor of scores in any model.

Discussion

Although this study was designed as a test of whether blinded review reduces Black-White disparities in peer review (Ginther et al., 2011; Hoppe et al., 2019; Erosheva et al., 2020), the data are also pertinent to understanding the basis of the advantage that applications from White PIs enjoy in review. The experimental intervention, post-submission administrative redaction of identifying information, reduced but did not entirely eliminate reviewer knowledge of applicant identity. Applications submitted by Black PIs were compared to two samples of applications from White PIs, one matched on review-relevant variables, the other selected randomly. The preregistered analysis defined the primary question to be whether redaction differentially affected scores according to race and specified that it be tested using the matched White set of applications. That interaction term was statistically nonsignificant. A secondary test, using the randomly selected set of White applications and a different modeling approach, was statistically significant. We suggest that it is more useful to focus on the concordant patterns of observed data and the overall similarity of the modeled results in the two analyses, rather than on small differences in significance levels.

The following effects were consistent across both samples and both modeling approaches: (1) applications from White PIs scored better than those from Black PIs; (2) standard format applications scored better than redacted; (3) redaction produced worse scores for applications from White PIs but did not change scores for applications from Black PIs. In both the primary and secondary comparisons, redaction reduced the difference in mean scores of Black and White application sets by about half (Table 6, Table 8). Thus, the data suggest that redaction, on average, does not improve review outcomes for Black PIs but does reduce the advantage that applications from White PIs have in review of standard format applications. Why?

Applications from White PIs tended to score better than applications from Black PIs. This was unexpected when comparing the matched application sets because the samples were closely matched on the scores they had received in actual NIH review. However, it was not surprising that applications from Black PIs scored worse, on average, than randomly selected applications from White PIs. A persistent gap on the order of 50% between award rates for NIH R01 grants to Black versus White PIs has been previously reported (e.g., Ginther et al., 2011; Hoppe et al., 2019; Erosheva et al., 2020), and the correlation between overall impact scores and probability of funding, across NIH, is the same for Black and White investigators (Hoppe et al., 2019). The secondary models identified several factors that partially account for the racial difference in scores. Competitive renewals of previously funded projects and resubmissions of previously reviewed applications scored better. These effects are well known and tend to favor applications from White PIs (because the representation of White PIs among established investigators is higher). Conversely, ESI status and being associated with an institution that is at the lowest end of the NIH funding distribution were both associated with worse review outcomes. Together these factors tend to disadvantage Black PIs but do not entirely account for the gap in scores.

Other studies have identified additional contributors to differential racial outcomes, including cumulative disadvantage (Ginther et al., 2011), and differences in publication citation metrics (Ginther et al., 2018). Citation metrics are associated with many factors (Tahamtan et al., 2016), some of which are not linked to the quality of the paper, factors such as differences in publication practices between areas of research (Piro et al., 2013), scientific networks (Li et al., 2019), coauthors’ reputations (Petersen et al., 2014), the Matthew effect (Wang, 2014), and race of the authors (Ginther et al., 2018).

The data reveal little evidence of systematic bias based on knowledge of, or impressions of, PI race per se. NIH applications do not have fields for PI race (or gender). In this study reviewers were asked to guess PI race after reading and scoring the application, not before, and reviewers were generally not very confident of their guesses (see Table 3). Thus, how much ‘knowledge’ reviewers had about PI race, and at what point in the process they formed their impression of race, is unclear and likely varies across applications and reviewers. With standard applications, reviewers were more likely to guess that Black PIs were Black than White (58% vs. 36%), but with redacted applications they were more likely to guess the PI was White than Black (61% vs. 28%); even so, redaction did not change scores for Black PIs. Conversely, redaction did not change the frequency of reviewer guesses that White PIs were White (93% standard, 87% redacted), but redaction did change scores for applications from White PIs. A reviewer’s guess that the PI was Black had a very small effect on scores (improving them), controlling for multiple other factors including actual PI race. Interpretation of this effect is statistically and substantively complex. It does not necessarily represent positive racial bias toward Black applicants. Reviewers had reason to presume that they were participating in a study examining the effects of PI race on review outcomes; in this context some might have tried to avoid appearing prejudiced and thus scored PIs they believed to be Black more favorably. It could also be that, in certain scientific contexts, reviewers judged the science to be better because the PI was perceived to be Black; for example, for an implementation study that hinged on engaging minority communities.

Redacted applications scored worse on average. This is not surprising given that redaction was done administratively, post-submission. The application the reviewers read was not the one the applicant wrote, and information lost in redaction may have been important. Retrospectively redacting applications does not simply remove information but also changes the context of the information that remains. Applicants wrote their applications believing that reviewers would be given their names and institutional affiliations and that the other information in the application would be judged in that context. They also, presumably, took into account the fact that NIH grant review criteria include ‘investigators’ and ‘environment’, and that these criteria are supposed to be factored into reviewers’ final scores. Approximately 28% of reviewers of anonymized applications disagreed with the statement ‘I believe that reviewers can provide a fair, thorough, and competent review if applications are anonymized’. Alternatively, or in addition, redacted applications might have done worse because redaction reduced halo effects (Kaatz et al., 2014; Crane, 1967) that had inflated the scores of standard applications. Halo effects refer to the tendency to rate something better based on a general good impression of the applicant as opposed to specific relevant information; for example, scoring on the basis of positive reputation rather than the application itself. Absent halo effects, the applications may have scored worse when judged on their merits.

Thus, we believe there are two plausible explanations for why applications from White PIs did worse when redacted. One is that redaction reduced positive halo effects of PI and institution. The interconnected factors of PI reputation, scientific networks, scientific pedigree, and institutional prestige certainly can influence review. They are deeply intertwined with race and tend to favor White PIs over others. If redaction reduced halo effects, it would suggest that blinded review models might improve fairness (Ross et al., 2006; Nielsen et al., 2021). On the other hand, it may have been that when PI identity was deleted, the scientific narrative lost context and was consequently degraded. This would presumably predominantly affect more senior, established PIs, who are disproportionately White. If the effect of redaction represents a mismatch between writing and review conditions, blinding would not likely have a lasting effect because scientists will adjust their grant writing to conform to new review conditions.

There are practical problems with administrative redaction as an anonymization strategy. Each application took 2–8 hr to redact and quality assurance an additional 2–4 hr. Despite careful removal of names and other information, redaction was only partially successful in blinding reviewers to PI characteristics. Reviewers of redacted applications correctly identified PI race 70% of the time overall and were able to name the PI or research group in 22% of cases. This result, which is consistent with prior attempts at redaction, suggests that it is not possible to administratively obscure identity in all cases.

What accounts for the unexpected finding that the ‘matched’ samples of applications from Black and White PIs (selected to match on overall impact scores from their actual NIH reviews) scored differently when those same applications were reviewed as part of this study? We think it likely that the change in White matched scores represents regression to the mean, a problem that is more likely when, as is true here, the groups differ on multiple factors that differ in the same direction (Campbell and Stanley, 1963). The set of applications from Black PIs represented almost the entire population of such applications. The matched applications from White PIs were drawn from a population of 26,000 applications which, on average, scored better than the Black applications. Each observed score has a true score component and an error component which is presumed to be random. Selecting applications to match a score worse than the population mean risks selecting applications where the error terms are not random but rather are skewed negatively (making the observed scores worse). When the selected applications were re-reviewed, the error terms would be expected to be random rather than predominantly negative, and thus the mean of the observed scores would improve.
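
A small simulation can illustrate this argument. The parameter values below (true-score means loosely echoing Table 1, arbitrary true-score and error SDs) are made up; the point is only the direction of the effect: White applications selected because their observed scores matched the worse-than-average Black scores tend to improve, on average, when re-scored with fresh error.

```python
# Toy illustration of regression to the mean under score-based matching.
# All parameter values are arbitrary; lower scores are better, so a drop
# in the mean on re-review corresponds to the observed improvement.
import numpy as np

rng = np.random.default_rng(1)
n_black, n_white_pool = 400, 26_000

true_black = rng.normal(4.3, 1.0, n_black)        # worse (higher) true scores
true_white = rng.normal(3.9, 1.0, n_white_pool)   # better population mean

obs_black = true_black + rng.normal(0, 0.7, n_black)
obs_white = true_white + rng.normal(0, 0.7, n_white_pool)

# Match each Black application to the unused White application with the
# closest observed score (greedy nearest neighbor, without replacement).
available = np.ones(n_white_pool, dtype=bool)
matched = []
for score in obs_black:
    idx = np.argmin(np.where(available, np.abs(obs_white - score), np.inf))
    matched.append(idx)
    available[idx] = False
matched = np.array(matched)

# Re-review: same true scores, new random error.
rescored_white = true_white[matched] + rng.normal(0, 0.7, n_black)

print(f"matched White, original observed mean: {obs_white[matched].mean():.2f}")
print(f"matched White, re-reviewed mean:       {rescored_white.mean():.2f}")
```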

Another possibility is that differences in the conduct of the actual and study reviews account for the difference. The SROs for the experiment included several who were naïve to NIH review, and the procedures for characterizing reviewer expertise and matching reviewers to applications were much simplified compared to the general practices of CSR SROs. Study reviewers saw many fewer applications than is typical for study section reviewers (~3 for most study reviewers, ~8 for CSR study sections). Because of this, any bias that affects how reviewers rank the applications in ‘their pile’ would not likely be seen in this study. Also, reviewers knew that they were participating in an anonymization study, and many likely suspected their scores would be used to assess bias in peer review. A complete listing of the deviations of the study procedures from the actual NIH peer review process is provided in Supplementary file 1D. Despite these differences in review practices, the overall distribution of scores obtained in the experiment closely approximates the distribution of preliminary overall impact scores seen in the actual NIH reviews of these applications.

The study was designed as a test of whether blinding reviewers to applicant demographics reduces racial disparities in review outcomes; its strengths include a large sample (1200 applications, 2116 reviewers, 7155 reviews), the use of real NIH grant applications, and experienced NIH reviewers. Redaction of institutional and individual identity elements from applications changed scores for White PIs’ applications for the worse but did not, on average, affect scores of Black PIs’ applications. Although the effect was statistically small, both samples of White PIs’ applications scored better than the Black PIs’ applications; in each case redaction reduced the size of that difference by about half. It is possible that redaction highlighted gaps in applications written with the assumption that reviewers would know the PI’s identity; for example, a field leader in the use of a technique might have intended their name to substitute for methodological details. If that sort of grantsmanship accounts for the redaction effect, implementing partial blinding in review is unlikely to have any lasting benefit. If, however, blinding reduces halo effects and that accounts for the reduction in the White advantage, then changing review models could perhaps result in a fairer peer review process.

Post-submission administrative redaction is too labor-intensive to implement on the scale that NIH requires, and even this careful redaction was quite imperfect in concealing elements of identity. However, there are other methods of blinding reviewers to applicant demographics and institutional affiliations. For example, self-redaction of applications might be more effective, and two-stage models of review that require judging the merit of an application’s science while blinded to applicant identity are of interest. Development of strategies to ensure a level playing field in the peer review of scientific grant applications is an urgent need.

Data availability

All data analyzed for the findings presented in this manuscript are included in the supporting files.

References

Blank RM (1991). The effects of double-blind versus single-blind reviewing: experimental evidence from The American Economic Review. The American Economic Review 81:1041–1067.

Campbell DT, Stanley JC (1963). Experimental and Quasi-Experimental Designs for Research. Houghton Mifflin Company.

Crane D (1967). Gatekeepers of science: some factors affecting selection of articles for scientific journals. American Sociologist 2:195–201.

Fisher M, Friedman SB, Strauss B (1994). The effects of blinding on acceptance of research papers by peer review. JAMA 272:143–146.

Hengel E (2017). Publishing While Female: Are Women Held to Higher Standards? Evidence from Peer Review. Cambridge Working Papers in Economics.

National Academy of Sciences, N.A.o.E (2011). Expanding Underrepresented Minority Participation: America’s Science and Technology Talent at the Crossroads. National Academy Press.

Working Group on Diversity in the Biomedical Research Workforce (WGDBRW) and T.A.C.t.t.D. (ACD) (2012). Draft Report of the Advisory Committee to the Director Working Group on Diversity in the Biomedical Research Workforce. National Institutes of Health.

Decision letter

  1. Mone Zaidi
    Senior and Reviewing Editor; Icahn School of Medicine at Mount Sinai, United States
  2. Carlos Isales
    Reviewer; Medical College of Georgia at Augusta University, United States

Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Acceptance summary:

The authors, a group of scientists and peer review administrators from NIH's Center for Scientific Review (CSR), have attempted to study the effect of redaction of applicant identifiers on review outcomes using a selection of grant applications from White and Black investigators. The most remarkable finding was that redaction reduced the score difference between White and Black investigators by half, by affecting the scores of White but not Black investigators. Such unconscious bias, evident in this well-crafted study, not only re-emphasizes the need for targeted interventions by the NIH and CSR leadership to prevent such bias, but also reiterates the value of diversification of the reviewer pool. Furthermore, the analysis should be extended to investigators of Latin American and Asian descent.

Decision letter after peer review:

Thank you for submitting your article "An Experimental Test of the Effects of Redacting Grant Applicant Identifiers on Peer Review Outcomes" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by me. The following individual involved in review of your submission has agreed to reveal their identity: Carlos Isales (Reviewer #1).

The reviewers have discussed their reviews with one another, and I have drafted this letter to help you prepare a revised submission.

Essential revisions:

Please respond to the recommendations from all reviewers, which will help improve the clarity and veracity of your conclusions. Please also attach a point-by-point response.

Reviewer #1 (Recommendations for the authors):

This is a new study but additional information would be helpful:

(1) Little information is provided about the reviewers; what were their characteristics (age, gender, race, degree, etc.)?

(2) My understanding is that the NIH shortened the R01 application from 25 to 12 pages in 2011 with the idea that the applicant's track record (e.g., biosketch) would weigh more heavily in the reviewers' assessment of their ability to complete the proposed studies. In the current study the biosketch is redacted, eliminating that metric. Would they have expected to see less of a difference in scores if the application were 25 pages long?

(3) What was the race breakdown of the MPIs? Did the presence of a White PI on an application from a Black PI impact the score?

(4) The authors state that reviewers looked at 3.4 applications/reviewer. However, the range was quite broad (1–29). This could introduce bias if some reviewers had a large number of reviews. What was the median number of grants reviewed per reviewer?

(5) Established investigators work in specific areas, and an experienced reviewer will recognize the topic of the grant as most likely coming from one of these specific investigators. Redaction is not likely to help for this group of investigators. Would caps on the number of awards or dollar amount to established investigators help? What percentage of the grants reviewed had PIs with 3 or more NIH awards?

(6) An issue is that the group of reviewers from 2014–2015 is not completely comparable to a reviewer from 2020–2021. Thus, the data presented may really only apply to one specific point in time in the review process.

(7) As the authors acknowledge, the dynamics of scoring and review are very different when the process involves meeting and defending your review in front of a group of your peers. Is there a way to assess whether the "quality" of the reviews was different? For example, was a subset of reviews sampled by a third party and assessed as high quality?

Reviewer #2 (Recommendations for the authors):

1) Please clarify the areas of expertise for the reviewers versus the areas of applications.

2) Please include a general breakdown of applications from varying races to the NIH.

3) Please include funding rates (%'s) for each race class. This really should be included in the beginning to help the readers understand the overall composition of grants versus race and their funding percentages.

4) For increased clarity about early stages, please spell out ESI in the table legends.

Reviewer #3 (Recommendations for the authors):

Introduction: The opening lines jump straight into "the process" of peer review rather abruptly. This paragraph could be re-arranged slightly for a better lead-in. NIH needs to be spelled out, and maybe a good entry would be a description of how much in public funds the NIH distributes for research purposes. Then the sentence at lines 43-44 should be put before the launch into the bias in peer review part. This would then permit the end of the sentence on Line 38 to include mention of the interests of the US public who pay for this research, for completeness. This is not merely a matter of idle academic fluffery and the individual interests of scientists, and this should not be lost with inside-baseball framing.

Introduction, Line 54-56. There are so many obvious flaws with the work of Forscher et al. that it is inappropriate to cite it unquestioned as a negative outcome. They used laughably cartoonish stimuli, did not account for reviewer penetration of their intent, and even discovered that many of the reviewers caught on to them. The present study is a considerable improvement, and the only place for mentioning Forscher et al. is in drawing a contrast in how this study has improved on their glaring shortcomings.

Introduction, Line 57: Somewhere in here the authors should point out that not all peer review is the same and highlight the degree to which NIH grant review focuses explicitly, as one of five co-equal criteria, on the Investigator. This is one of the contributing reasons why available prior work may not be sufficient and further underlines the importance of their investigation. There are important differences between NIH grant review and, say, evaluation of legal summaries or resume-based job call-backs, familiar to some of the most frequently mentioned literature on this topic. This should be revisited in the Discussion around the comments being made in Line 338-341.

Introduction, Lines 60-64: This should be considered as an alternate opening to the Introduction. Much catchier than either the comparatively narrow academic issue of peer review, or even the suggestion made above to open more broadly with the mission of the NIH.

Introduction: While it is understood why the NIH has focused so exclusively on the disparity for Black investigators after the 2011 report, there should be some explicit recognition in the Introduction that the original Ginther found various disparities for Hispanic and Asian PIs that have not been as rigorously examined in follow-up studies such as the present one.

Design: The description of reviewer tasks needs to better clarify the procedure. Were reviewers asked to provide their scores and only afterward asked all of the probing questions likely to tip them off about the true purpose of the study? Or did they know, before doing their scoring, that they would be asked whether they could guess PI demographics, who the PI was, and to rate grant writing/preparation issues?

Design: It is disappointing in the extreme that five of the nine SROs recruited for the study had no prior experience as an SRO. SRO reviewer selection and assignment is a hugely influential and almost entirely opaque aspect of grant review. It would seem essential in a study of this nature to include only experienced SROs. Obviously this cannot be fixed, but it should be mentioned as a significant caveat in the Discussion. The points being made at lines 297-298 would be a good place: one possible reason for the unexpected outcome is that the experimental reviewers were somehow systematically different from real panel members.

Review Procedure: The range of reviews per reviewer was 1 to 29. It would be useful to provide better summary statistics, i.e., include the median and interquartile range for this. It is an important issue, and it may be best to trim the sample a bit so that the loads are more balanced and more reflective of typical reviewer loads. These authors know perfectly well that real CSR reviewers have pressures that lead to ranking within their piles, putting closely-meritorious proposals in micro-competition. This is very likely a place where very small and implicit biases could act, and this should definitely be addressed in the Discussion; the paragraph from Line 369-381 would be a good place. But this study/data set would appear to be a key opportunity to evaluate if, for example, mean scores of reviewers that only have 1-3 apps to review differ from those from reviewers who have 8-10 to review (a more typical load).

The results of Pre-registered Question 1 describe the reviewer confidence on guessing race for redacted applications as "low" and as "modestly higher" for standard applications. Yet the rather dramatic and study-relevant decrease in identification of Black applicants from 59% to 30% (highly relevant when success rates overall at the NIH are only about 11% for Black applicants, going by the published data) is not characterized in such interpretive terms. Suggest being consistent in Results terminology. Either leave all such shading to the Discussion or include accurate shading for all Results sections.

One important addition to the study under Q1 would be to assess scores for misses and false alarms relative to accurate identifications of PI race. This would depend on there being applications on which some reviewers inaccurately identified race and other reviewers guessed correctly, but this should be summarized in any case. It would seemingly be important if, for a given application, reviewers were either all likely to mis/identify or were more randomly distributed.

The results of Pre-registered Question 2 describe the scoring distributions for the three groups under both standard and blinded conditions. They note the difference between standard scores for the Black and white "matched" groups despite being matched on original scores from the initial review process. This outcome indicates perhaps that there is inconsistency in the outcome of review based on race, which would seem to be critical for the NIH to understand better. Although any given application is reviewed once, given amendments and the need for multiple similar proposals to be submitted, it is relevant if an applicant's race dictates higher or lower variability in scoring. It is critical to include two additional plots. First, a correlation, by group, between the original scores and the ones generated in this test under each of the standard and blinded conditions. Second, the authors should include a repeated-measures type plot where the individual applications' standard and redacted scores within this study can be visualized. This may require breaking the 400 proposals into score range bins for clarity, but it would be a crucial contribution to understanding what is driving effects.

The results of Pre-registered Question 3 address the study's main hypothesis. This is the main point, and the data should be provided in a graphical format that better emphasizes the finding. This would be addressed by taking up the suggestions about individual data provided above. The comment "On average, applications from White PIs received slightly better scores than those from Black PIs" should be reconsidered. As the authors are well aware, the 9-point rating system introduces discontinuities around whole-digit scores because post-discussion scores coalesce at the same number and panels vote within the range. Thus, "slightly" different scores, say, anything below a 20 or 30 versus those round numbers, can have a dramatic impact on percentile and the probability of funding. It is best to keep such shading of the outcome to a minimum and just report that there was a difference.

Discussion, Line 275-276. It is unclear if this is referring to the difference in this study or the Ginther/Hoppe findings. The "also pertinent" suggests the latter so please clarify and cite if relevant.

Discussion, Line 324. It seems very strange to say that a 59% hit rate, or 70% miss rate is "usually". Just report the percentages without this sort of shading of the results.

Discussion 329-336: The authors are to be congratulated for including this Discussion.

Discussion Line 366: I may have missed this but it would seem imperative to re-run the analyses with the correctly identified 22% removed from the sample.

Discussion, Line 369-381: Structurally, this should be occurring right after the issue is introduced at Line 297.

Discussion, Line 385-386: This is a statement directly discordant with the primary finding that applications from White investigators score more poorly when anonymized. It should be removed.

Discussion: It is very peculiar that the manuscript has no consideration whatsoever of a recent publication by Hoppe and colleagues describing the impact of, essentially, scientific key words and how they are used by applicants of different races, on review outcome. That paper is cited only for the Black/white disparity but not for the critical new information on topic and methods.

https://doi.org/10.7554/eLife.71368.sa1

Author response

Reviewer #1 (Recommendations for the authors):

This is a new study but additional information would be helpful:

(1) Little information is provided about the reviewers; what were their characteristics (age, gender, race, degree, etc.)?

We did not collect any demographic information on reviewers and recorded only their prior service in the IRGs and study sections. There was concern that collecting information would have made recruitment of reviewers more difficult (i.e. that reviewers might be concerned that we would evaluate them) and we needed to recruit a large number of experienced reviewers. The demographics of these reviewers should be roughly consistent with what is published online now at CSR (see https://public.csr.nih.gov/AboutCSR/Evaluations).

(2) My understanding is that the NIH shortened the R01 application from 25 to 12 pages in 2011 with the idea that the applicant's track record (e.g., biosketch) would weigh more heavily in the reviewers' assessment of their ability to complete the proposed studies. In the current study the biosketch is redacted, eliminating that metric. Would they have expected to see less of a difference in scores if the application were 25 pages long?

Would a longer scientific narrative lead reviewers to attend more to the science proposed and pay less attention to the biosketch? Perhaps so, although having served as an NIH reviewer reading both 25-page and 12-page applications, it is clear to me that the biosketches carried a lot of weight for many reviewers even when applications were longer. Reviewer 3 raises a related point about the structure of NIH review criteria (item #17, below), which we’ve addressed. We think our discussion of the effects of redacting information (e.g., lines 398-408) and halo effects (lines 408-413) bears on this, at least tangentially.

(3) What was the race breakdown of the MPIs? Did the presence of a White PI on an application from a Black PI impact the score?

Distribution of MPIs according to race: the set of 400 applications from Black contact PIs included 98 MPI applications. Of these, 66 had at least one White MPI (sometimes one, sometimes two); the remainder had a mix of non-White PIs. For the matched White sample of applications, 69 of the 71 MPI applications had only White MPIs. For the randomly selected applications from White PIs, 81 of 83 MPI applications had only White MPIs.

We did not do an analysis comparing Black MPI applications that included White MPIs with other applications. Because the number of such applications is small, power would be low. We know that applications from White PIs scored better overall, and it is not clear how this analysis would help us understand the main questions of this study.

(4) The authors state that reviewers looked at 3.4 applications/reviewer. However, the range was quite broad (1-29). This could introduce bias if some reviewers had a large number of reviews. What was the median number of grants reviewed per reviewer?

Median number of reviews = 3. Mean number of reviews = 3.4, range 1-29. Only 60 reviewers (less than 3%) reviewed more than eight applications. When we removed these reviewers' data, we saw small changes in parameter estimates but no change in the patterns of significance. We now report these data (lines 188-189).

(5) Established investigators work in specific areas; an experienced reviewer will recognize the topic of the grant as most likely coming from one of these specific investigators. Redaction is not likely to help for this group of investigators. Would caps on the number of awards or dollar amount to established investigators help? What percentage of the grants reviewed had PIs with 3 or more NIH awards?

We agree that highly funded scientists are likely to be better known and consequently more likely to be identified even when applications are redacted. We did not compile characteristics of the investigators who were identified. A few years ago, NIH ran analyses suggesting that the return on investment diminished with high levels of NIH funding. Policies implemented to ensure that awards to highly funded investigators receive special scrutiny before they are approved have not been strikingly successful.

(6) An issue is that the group of reviewers from 2014-2015 is not completely comparable to reviewers from 2020-2021. Thus, the data presented may really only apply to one specific point in time in the review process.

Caution in generalizing findings is always warranted. CSR guidance on evaluating the qualifications of reviewers has evolved over the last few years and review panels now include more assistant and associate professors and are modestly more diverse with respect to race, ethnicity, and gender than six years ago. Changes in the reviewer pool are only one reason the results obtained from this sample might not replicate now. Review policies and practices have evolved; NIH has increased attention to rigor and reproducibility in science, OER has taken more enforcement actions regarding integrity in review. However, the funding disparity that Ginther described in 2011 with 2006 data persisted, unchanged at least until last year, and recent papers (e.g. Hoppe, 2019) continue to report a differential in average scores for Black and White PIs. Therefore, we believe these findings to be highly relevant to current NIH peer review.

(7) As the authors acknowledge, the dynamics of scoring and review are very different when the process involves meeting and defending your review in front of a group of your peers. Is there a way to assess whether the "quality" of the reviews was different? For example, was a subset of reviews sampled by a third party and assessed as high quality?

We did not formally compare the critiques obtained for this study to those typical in CSR review. In the manuscript we emphasize differences between the review procedures of this study and true NIH review, particularly in the discussion lines 446-458 and acknowledge this is a study limitation. See our response to item 21.

Reviewer #2 (Recommendations for the authors):

1) Please clarify the areas of expertise for the reviewers versus the areas of applications.

Reviewers were matched to applications by the SROs. SROs received clusters of scientifically related applications. These very roughly corresponded to Integrated Review Groups at CSR, and SROs were given lists of reviewers who had reviewed those topics between October 2013 and December 2016 in the same study sections where the 1,200 applications had been originally reviewed. SROs were given, for each reviewer, the names of the study sections where they had served, their areas of expertise as self-described in the NIH system or as provided by a CSR SRO who had worked with them previously, and Research, Condition, and Disease Categorization (RCDC) terms that were weighted with computed scores based on automated analysis of applications they had submitted to NIH. RCDC terms are system-generated tags applied by NIH to all incoming applications, designed to characterize their scientific content and to facilitate reporting of funding patterns. Reviewers were assigned applications to review as follows: the contract SROs first reviewed the application project narrative, abstract, specific aims, and research strategy and characterized them using key words to tag the scientific topic and critical aspects of the scientific approach. Once the key words were identified for a specific application, SROs matched them to potential reviewers’ RCDC terms and scores. In no case were reviewers assigned applications they had seen in the original reviews. A total of 6 reviewers were recruited for each application: 2 reviewers designated as Reviewer Role 1 (best match), 2 reviewers as Reviewer Role 2 (next best), and 2 reviewers as Reviewer Role 3 (still a reasonable match). We summarize these procedures in Methods, lines 169-175.
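
For concreteness, the following is a minimal, purely illustrative sketch of the weighted keyword-to-RCDC-term matching idea described above. The SROs performed this matching by hand; the function, data structures, and example values below are our own assumptions, not CSR systems or data.

```python
# Purely illustrative sketch of weighted keyword-to-RCDC-term matching.
# The SROs did this matching by hand; all names and values here are hypothetical.

def rank_reviewers(application_keywords, reviewer_rcdc_profiles, top_n=6):
    """Rank candidate reviewers by the summed weights of their RCDC terms
    that overlap with the application's key words."""
    app_terms = {k.lower() for k in application_keywords}
    scores = {}
    for reviewer, profile in reviewer_rcdc_profiles.items():
        # profile maps an RCDC term to the weight computed from that reviewer's
        # own NIH applications
        scores[reviewer] = sum(
            weight for term, weight in profile.items() if term.lower() in app_terms
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]

# Example: select the six best-matching reviewers (two per reviewer role)
best_matches = rank_reviewers(
    ["insulin signaling", "beta cell", "type 2 diabetes"],
    {
        "reviewer_A": {"Insulin Signaling": 0.9, "Diabetes": 0.7},
        "reviewer_B": {"Neuroscience": 0.8, "Synaptic Plasticity": 0.6},
        "reviewer_C": {"Beta Cell": 0.6, "Type 2 Diabetes": 0.5},
    },
)
print(best_matches)
```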

2) Please include a general breakdown of applications from varying races to the NIH.

See our response below.

3) Please include funding rates (%'s) for each race class. This really should be included in the beginning to help the readers understand the overall composition of grants versus race and their funding percentages.

We rewrote the introduction, adding information about funding rates for Hispanic and Asian PIs (the 2 largest groups of minority applicants), and provided a stronger explanation for why this study focused on Black-White differences only (lines 93-102). Our aim was to provide a broader context while keeping the intro reasonably focused. Demographic differences in patterns of application numbers, review outcomes, and funding success are a complex topic, not easily presented concisely. More importantly, we think that this information, while no doubt of interest to some, is not relevant background to the experiment at hand. We tried to strike a balance between context and focus.

4) For increased clarity about early stages, please spell out ESI in the table legends.

Done.

Reviewer #3 (Recommendations for the authors):

Introduction: The opening lines jump straight into "the process" of peer review rather abruptly. This paragraph could be re-arranged slightly for a better lead-in. NIH needs to be spelled out, and maybe a good entry would be a description of how much in public funds the NIH distributes for research purposes. Then the sentence at lines 43-44 should be put before the launch into the bias-in-peer-review part. This would then permit the end of the sentence on Line 38 to include mention of the interests of the US public who pay for this research, for completeness. This is not merely a matter of idle academic fluffery and the individual interests of scientists, and this should not be lost with inside-baseball framing.

We rewrote the introduction to incorporate most of the reviewer’s points (which were thoughtful and quite helpful). The intro now frames the issue more broadly, noting the importance of peer review in US funding of biomedical research, provides better justification for focusing on Black-White differences (while acknowledging other demographic disparities have been observed), eliminates the Forscher reference, and points out that NIH grant review, unlike other peer reviews, specifically calls for evaluation of the investigators and environment.

Introduction, Line 54-56. There are so many obvious flaws with the work of Forscher et al. that it is inappropriate to cite it unquestioned as a negative outcome. They used laughably cartoonish stimuli, did not account for reviewer penetration of their intent, and even discovered that many of the reviewers caught on to them. The present study is a considerable improvement, and the only place for mentioning Forscher et al. is in drawing a contrast in how this study has improved on their glaring shortcomings.

Fair points. We’ve removed the reference.

Introduction, Line 57: Somewhere in here the authors should point out that not all peer review is the same and highlight the degree to which NIH grant review focuses explicitly, as one of five co-equal criteria, on the Investigator. This is one of the contributing reasons why available prior work may not be sufficient and further underlines the importance of their investigation. There are important differences between NIH grant review and, say, evaluation of legal summaries or resume-based job call-backs, familiar to some of the most frequently mentioned literature on this topic. This should be revisited in the Discussion around the comments being made in Line 338-341.

That’s a good point, and we’ve added it to the intro, lines 57-60, and the Discussion, lines 404-406.

Introduction, Lines 60-64: This should be considered as an alternate opening to the Introduction. Much catchier than either the comparatively narrow academic issue of peer review, or even the suggestion made above to open more broadly with the mission of the NIH.

The intro was re-written, as described above.

Introduction: While it is understood why the NIH has focused so exclusively on the disparity for Black investigators after the 2011 report, there should be some explicit recognition in the Introduction that the original Ginther report found various disparities for Hispanic and Asian PIs that have not been as rigorously examined in follow-up studies such as the present one.

The intro now provides data on Asian and Hispanic funding disparities and a better explanation of why the study focuses on Black-White differences (lines 93-102).

Design: The description of reviewer tasks needs to better clarify the procedure. Were reviewers asked to provide their scores and only afterward asked all of the probing questions likely to tip them off about the true purpose of the study? Or did they know before doing their scoring that they would be asked whether they could guess PI demographics and who the PI was, and for ratings of grant writing/preparation issues?

We explain the sequence of data collection in lines 199-201. The question of what the reviewers “knew” about applicants’ race (and other demographics), and when they knew it in relation to judging the scientific merit of the application, matters but is hard to answer with certainty. This was not an experiment in which perceptions of race were manipulated before the applications were scored. Rather, reviewers formed an impression of PI race based on the application materials at some point in the process. We do not know when in the process the impression was formed or what it was based on. We do know many guesses were very uncertain (rated as 1 or 2 on a 5-point scale from “1 = not at all certain” to “5 = completely certain”). These cautions complicate interpretation of the effects of reviewer guesses of race, and we point this out in the Discussion, lines 379-383.

Design: It is disappointing in the extreme that five of the nine SROs recruited for the study had no prior experience as an SRO. SRO reviewer selection and assignment is a hugely influential and almost entirely opaque aspect of grant review. It would seem essential in a study of this nature to include only experienced SROs. Obviously this cannot be fixed but it should be mentioned as a significant caveat in the Discussion. The points being made at lines 297-298 would be a good place; one possible reason for the unexpected outcome is that the experimental reviewers were somehow systematically different from real panel members.

The general issue raised is how differences between the experimental and actual NIH review affected the results, and whether the results can be interpreted as likely to reflect patterns in real NIH review. The experiment replicated actual NIH review to the greatest extent feasible given the scale of the study. We used standard NIH criteria, critique templates, and real NIH grant applications, and all reviewers were experienced NIH reviewers who had served on the same study sections that had originally reviewed the applications. However, there were numerous differences between the reviews done for the study and actual NIH review, which we point out in the Discussion (lines 446-458). We acknowledge that not all SROs had experience as NIH SROs and that the procedures for matching reviewers to applicants did not fully replicate those used in NIH review. Whether these differences had any effect on the degree of potential bias expressed in the reviews is unknown.

Certain findings diminish concerns that the experimental procedures greatly altered review outcomes. The overall distribution of scores obtained in the experiment closely approximates the distribution of preliminary overall impact scores seen in the actual NIH reviews of these applications. Model 2 in Table 7 replicates multiple findings previously reported using data sets of scores from real NIH review, for example, that type 2 applications and A1s do better and that applications from early-stage investigators do worse. We agree that it is important that readers understand the differences between the experimental procedures and standard NIH review, and we point this out in the Methods (lines 190-196) and Discussion (lines 446-458).

Review Procedure: The range of reviews per reviewer was 1 to 29. It would be useful to provide better summary statistics, i.e., include the median and interquartile range for this. It is an important issue and it may be best to trim the sample a bit so that the loads are more balanced and more reflective of typical reviewer loads. These authors know perfectly well that real CSR reviewers have pressures that lead to ranking within their piles, putting closely meritorious proposals in micro-competition. This is very likely a place where very small and implicit biases could act and this should definitely be addressed in the Discussion; the paragraph from Line 369-381 would be a good place. But this study/data set would appear to be a key opportunity to evaluate if, for example, mean scores of reviewers who only have 1-3 apps to review differ from those of reviewers who have 8-10 to review (a more typical load).

We now report that the number of reviews completed was 3.4 applications per reviewer on average (median = 3, interquartile range = 1-5, maximum = 29) (lines 188-189). Only 60 reviewers (less than 3%) reviewed more than eight applications. When we removed these reviewers' data, we saw small changes in parameter estimates but no change in the patterns of significance. Because the distribution is so highly skewed, we do not have sufficient sample size to perform the suggested comparison of typical load vs. study load. The point that implicit biases might emerge as reviewers rank applications within “their pile” is a good one, and we’ve added it to the discussion (lines 449-452).
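
As a rough illustration of the load summary and the sensitivity check described above, the sketch below assumes a long-format, review-level data set with hypothetical file and column names; it is not the study's actual analysis code.

```python
import pandas as pd

# Rough illustration only: the file and column names (reviewer_id, pi_race,
# application_format, impact_score) are hypothetical, not the study's variables.
reviews = pd.read_csv("study_reviews.csv")

# Per-reviewer load: median, interquartile range, and maximum
loads = reviews.groupby("reviewer_id").size()
print(loads.median(), loads.quantile(0.25), loads.quantile(0.75), loads.max())

# Sensitivity check: drop the small number of reviewers with unusually heavy loads
heavy_reviewers = loads[loads > 8].index
trimmed = reviews[~reviews["reviewer_id"].isin(heavy_reviewers)]

# Compare group means before and after trimming
print(reviews.groupby(["pi_race", "application_format"])["impact_score"].mean())
print(trimmed.groupby(["pi_race", "application_format"])["impact_score"].mean())
```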

The results of Pre-registered Question 1 describe the reviewer confidence in guessing race for redacted applications as "low" and as "modestly higher" for standard applications. Yet the rather dramatic and study-relevant decrease in identification of Black applicants from 59% to 30% (highly relevant when success rates overall at the NIH are only about 11% for Black applicants, going by the published data) is not characterized in such interpretive terms. Suggest being consistent in Results terminology. Either leave all such shading to the Discussion or include accurate shading for all Results sections.

We eliminated interpretive terms reporting that result (lines 227-243).

One important addition to the study under Q1 would be to assess scores for misses and false alarms relative to accurate identifications of PI race. This would depend on there being applications on which some reviewers inaccurately identified race and other reviewers guessed correctly, but this should be summarized in any case. It would seemingly be important if, for a given application, reviewers were either all likely to mis/identify or were more randomly distributed.

The distribution of guessed race in relation to actual race, showing hits and misses, is presented in Table 2, broken down according to application format. The new table gives more information, and we revised the accompanying text accordingly (lines 227-243). Considering the small number of guesses per application, it is difficult to confidently identify applications that are more or less likely to have the PI misidentified. However, the multivariate models presented in Table 7 address the question of how judgements of race, some accurate, some not, affect the critical test of the study hypothesis. Model 3 tests the study hypothesis using the randomly selected White applications and found a significant Format × PI Race interaction. Model 4 adds to that variable set reviewer guesses of PI race, gender, institution, career stage, and grantsmanship indicators. The parameter estimates for the interaction term change a little, but the significance levels and pattern of findings do not. We believe it important that readers do not over-interpret the data on reviewer guesses of race, and we added explanatory text in the discussion (lines 379-383).
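
To make the Model 3 versus Model 4 comparison concrete, here is a minimal sketch of that kind of mixed-model contrast. The variable names are assumptions for illustration only; the published models in Table 7 may be specified differently.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Minimal sketch of a Model 3 vs. Model 4 style comparison; the file and
# column names below are hypothetical, not the study's actual variables.
reviews = pd.read_csv("study_reviews.csv")

# "Model 3": Format x PI Race interaction with a random intercept per application
m3 = smf.mixedlm("impact_score ~ application_format * pi_race",
                 data=reviews, groups=reviews["application_id"]).fit()

# "Model 4": add reviewer guesses of PI race, gender, institution, career stage,
# and a grantsmanship rating as covariates
m4 = smf.mixedlm("impact_score ~ application_format * pi_race + guessed_race"
                 " + guessed_gender + guessed_institution + guessed_career_stage"
                 " + grantsmanship",
                 data=reviews, groups=reviews["application_id"]).fit()

# Inspect whether the interaction estimate changes appreciably between the models
print(m3.summary())
print(m4.summary())
```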

The results of Pre-registered Question 2 describe the scoring distributions for the three groups under both standard and blinded conditions. They note the difference between standard scores for the Black and White "matched" groups despite their being matched on original scores from the initial review process. This outcome perhaps indicates that there is inconsistency in the outcome of review based on race, which would seem to be critical for the NIH to understand better. Although any given application is reviewed once, given amendments and the need for multiple similar proposals to be submitted, it is relevant whether an applicant's race dictates higher or lower variability in scoring. It is critical to include two additional plots. First, a correlation, by group, between the original scores and the ones generated in this test under each of the standard and blinded conditions. Second, the authors should include a repeated-measures type plot where the individual applications' standard and redacted scores within this study can be visualized. This may require breaking the 400 proposals into score range bins for clarity, but it would be a crucial contribution to understanding what is driving the effects.

The reviewer asks whether the findings indicate differential reliability of review according to race, given that scores for the White matched group improved on re-review but scores for the Black group did not. We believe a more likely explanation is that the change in White matched scores represents regression to the mean. The set of applications from Black PIs represented almost the entire population of such applications. The matched applications from White PIs were drawn from a population of 26,000 applications which, on average, scored better than the Black applications. Each observed score has a true score component and an error component, presumed random. Selecting applications to match a score worse than the population mean risks selecting applications whose error terms are not zero-centered but rather are skewed in the unfavorable direction (making the observed scores worse than the true scores). When they are re-reviewed, the error terms are expected to be centered at zero, and thus the scores overall would improve.
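
A small simulation can make the regression-to-the-mean argument concrete. All numbers below (population size, score spread, error variance, selection target) are arbitrary assumptions chosen only to illustrate the mechanism; as in NIH scoring, lower scores are better.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration of regression to the mean under selection on observed scores.
# All numbers are arbitrary assumptions; lower scores are better.
n_population = 26_000
true_merit = rng.normal(loc=30, scale=10, size=n_population)        # latent "true" score
first_review = true_merit + rng.normal(scale=6, size=n_population)  # observed score with error

# Select a "matched" sample whose first observed scores sit near a worse-than-average value
target_score = 40
matched_idx = np.argsort(np.abs(first_review - target_score))[:400]

# Re-review: same true merit, fresh draw of review error
second_review = true_merit[matched_idx] + rng.normal(scale=6, size=matched_idx.size)

print(first_review[matched_idx].mean())  # ~40 by construction
print(second_review.mean())              # regresses toward the mean, i.e., improves
```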

The issue of what factors are associated with differences in the reliability of review is interesting and important. It is a substantial issue in its own right, one that is complicated by differences between the review procedures of this study and those of standard NIH review. We think it tangential to the central point of this paper and difficult to address properly in brief form, and so would rather exclude it. Our analyses to date found higher ICCs for applications from Black PIs than for White PIs, but further analyses to inform interpretation of that observation are needed.

To more specifically address the request for “a correlation, by group, between the original scores and the ones generated in this test under each of the standard and blinded conditions”, we are concerned that these correlations would not be informative. The original scores and study scores were obtained under different review conditions, as discussed in Methods and Discussion and further detailed in the supplemental materials. Given different review environments, what information is gleaned from a correlation between original and study scores? As an alternative to spaghetti plots, we report the distribution of change scores between standard and redacted conditions according to race (lines 283-285). These distributions do not show evidence of higher variability in change scores according to PI race.

Author response table 1
Distribution of change (redacted score – standard score) according to PI race.
Group            Min      1st quartile   Median   Mean   3rd quartile   Max
Black            -3.67    -0.67          0.00     0.04   1.00           3.33
White matched    -3.00    -0.67          0.00     0.16   1.00           3.33
White random     -3.83    -0.33          0.33     0.24   1.00           4.33
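
For reference, the summary in Author response table 1 could be produced by something like the following sketch, which assumes a hypothetical wide table with one mean standard-format score and one mean redacted-format score per application; the file and column names are illustrative only.

```python
import pandas as pd

# Sketch of how the change-score summary above could be computed.
# Assumes a hypothetical wide table (one row per application) with illustrative
# column names: pi_race, standard_score, redacted_score.
apps = pd.read_csv("application_mean_scores.csv")

apps["change"] = apps["redacted_score"] - apps["standard_score"]
summary = apps.groupby("pi_race")["change"].describe()[
    ["min", "25%", "50%", "mean", "75%", "max"]
]
print(summary)
```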

The results of Pre-registered Question 3 address the study's main hypothesis. This is the main point, and the data should be provided in a graphical format that better emphasizes the finding. This would be addressed by taking up the suggestions about individual data provided above. The comment "On average, applications from White PIs received slightly better scores than those from Black PIs" should be reconsidered. As the authors are well aware, the 9-point rating system introduces discontinuities around whole-digit scores because post-discussion scores coalesce at the same number and panels vote within the range. Thus, "slightly" different scores, say, anything below a 20 or 30 versus those round numbers, can have a dramatic impact on percentile and the probability of funding. It is best to keep such shading of the outcome to a minimum and just report that there was a difference.

To present the results more clearly, we reformatted Figure 2 so that the effect of format on scores can be more clearly seen, and better compared between groups.

We appreciate the reviewer’s concern about shading the description of differences in a way that might be misleading. However, we believe it important to provide some sort of characterization of the difference. Effect sizes are more informative than p values, and it is effect size that we are trying to convey. We reworded the text to make it clear we are discussing effect size in a statistical sense (lines 271-272).

Although not impossible, we think it unlikely that small effect size differences are having large effects on funding outcomes. For that to be true there would need to be many applications in close proximity to a standard NIH funding line. As shown in Figure 2, scores for these applications are widely distributed, as is typical for NIH applications. In addition, funding lines vary substantially from institute to institute (see Lauer et al., 2021), so there is no one score that determines funding across NIH. Further, all institutes skip some highly scored applications and instead pay others that scored less well; these “select pay” decisions reflect institute scientific priorities and portfolio considerations (see Mike Lauer’s blog). Thus, we believe that statistically small effects are in this case also likely to have relatively small real-world effects.

Discussion, Line 275-276. It is unclear if this is referring to the difference in this study or the Ginther/Hoppe findings. The "also pertinent" suggests the latter so please clarify and cite if relevant.

Done.

Discussion, Line 324. It seems very strange to say that a 59% hit rate, or 70% miss rate is "usually". Just report the percentages without this sort of shading of the results.

Entire section rewritten, shadings eliminated (lines 378-390).

Discussion 329-336: The authors are to be congratulated for including this Discussion.

Thank you. The section remains in the discussion (lines 388-397).

Discussion Line 366: I may have missed this but it would seem imperative to re-run the analyses with the correctly identified 22% removed from the sample.

We did so, and it did not change the parameter estimates or significance levels appreciably.

Discussion, Line 369-381: Structurally, this should be occurring right after the issue is introduced at Line 297.

We agree, but decided to place it later because putting it up front gets in the way of discussing more important effects.

Discussion, Line 385-386: This is a statement directly discordant with the primary finding that applications from White investigators score more poorly when anonymized. It should be removed.

Done.

Discussion: It is very peculiar that the manuscript has no consideration whatsoever of a recent publication by Hoppe and colleagues describing the impact of, essentially, scientific key words and how they are used by applicants of different races, on review outcome. That paper is cited only for the Black/white disparity but not for the critical new information on topic and methods.

This is now discussed in the intro, lines 65-69.

https://doi.org/10.7554/eLife.71368.sa2

Article and author information

Author details

  1. Richard K Nakamura

    Retired, formerly Center for Scientific Review, National Institutes of Health, Bethesda, United States
    Contribution
    Conceptualization, Data curation, Funding acquisition, Project administration, Resources, Writing – review and editing
    Competing interests
    now retired, was Director of the NIH Center for Scientific Review (CSR) while the study was designed and implemented.
  2. Lee S Mann

    Retired, formerly Center for Scientific Review, National Institutes of Health, Bethesda, United States
    Contribution
    Conceptualization, Writing – review and editing
    Competing interests
    now retired, was employed by CSR.
  3. Mark D Lindner

    Center for Scientific Review, National Institutes of Health, Bethesda, United States
    Contribution
    Conceptualization, Formal analysis, Writing – review and editing
    Competing interests
    is employed by NIH/CSR
ORCID iD: 0000-0002-8646-2980
  4. Jeremy Braithwaite

    Social Solutions International, Rockville, United States
    Present address
    EvaluACT, Inc, Playa Vista, United States
    Contribution
    Conceptualization, Data curation, Formal analysis, Writing – review and editing
    Competing interests
    was employed by the contract research organization that conducted the data collection and initial analysis.
  5. Mei-Ching Chen

    Center for Scientific Review, National Institutes of Health, Bethesda, United States
    Contribution
    Data curation, Formal analysis, Conceptualization
    Competing interests
    MC is employed by NIH/CSR.
  6. Adrian Vancea

    Center for Scientific Review, National Institutes of Health, Bethesda, United States
    Contribution
    Formal analysis, Conceptualization
    Competing interests
    is employed by NIH/Center for Scientific Review.
  7. Noni Byrnes

    Center for Scientific Review, National Institutes of Health, Bethesda, United States
    Contribution
    Conceptualization, Resources, Writing – review and editing
    Competing interests
    is employed by NIH/Center for Scientific Review. She is the Director of CSR.
  8. Valerie Durrant

    Center for Scientific Review, National Institutes of Health, Bethesda, United States
    Contribution
    Conceptualization, Writing – review and editing
    Competing interests
    is employed by NIH/CSR
  9. Bruce Reed

    Center for Scientific Review, National Institutes of Health, Bethesda, United States
    Contribution
    Conceptualization, Supervision, Writing – original draft
    For correspondence
    bruce.reed@nih.gov
    Competing interests
is employed by NIH; he is the Deputy Director of CSR.
ORCID iD: 0000-0002-1606-8646

Funding

National Institutes of Health

  • Richard Nakamura

Employees of the NIH were involved in study design, data analysis, data interpretation, and manuscript writing. Data were collected, and the major data analysis completed, by a contract research organization.

Acknowledgements

We wish to acknowledge the invaluable assistance of the Electronic Records Administration (eRA) team at OER, with particular recognition to Inna Faenson and Aaron Czaplicki. We thank Katrina Pearson and the Division of Statistical Reporting in OER for the power analyses; we thank the support staff at CSR who made this study possible: Amanda Manning, Denise McGarrell, and Charles Dumais; and we thank the SSI contract SROs for their dedicated service.

Ethics

Human subjects: All participants gave informed consent to participate in this study in accordance with a protocol that was approved on March 27, 2017, by the Social Solutions, Inc. IRB (FWA 00008632), protocol #47.

Senior and Reviewing Editor

  1. Mone Zaidi, Icahn School of Medicine at Mount Sinai, United States

Reviewer

  1. Carlos Isales, Medical College of Georgia at Augusta University, United States

Publication history

  1. Received: June 17, 2021
  2. Preprint posted: June 28, 2021 (view preprint)
  3. Accepted: October 8, 2021
  4. Accepted Manuscript published: October 19, 2021 (version 1)
  5. Version of Record published: November 24, 2021 (version 2)

Copyright

This is an open-access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.


