Meta-Research: COVID-19 research risks ignoring important host genes due to pre-established research patterns

Successful Clinical Response in Pneumonia Therapy (SCRIPT) Systems Biology Center, Northwestern University, United States
Department of Chemical and Biological Engineering, Northwestern University, United States
Center for Genetic Medicine, Northwestern University School of Medicine, United States
Northwestern Institute on Complex Systems (NICO), Northwestern University, United States
Department of Molecular Biosciences, Northwestern University, United States
Department of Physics and Astronomy, Northwestern University, United States
Department of Medicine, Northwestern University School of Medicine, United States

Nov 24, 2020

https://doi.org/10.7554/eLife.61981

Open access

Abstract

It is known that research into human genes is heavily skewed towards genes that have been widely studied for decades, including many genes that were being studied before the productive phase of the Human Genome Project. This means that the genes most frequently investigated by the research community tend to be only marginally more important to human physiology and disease than a random selection of genes. Based on an analysis of 10,395 research publications about SARS-CoV-2 that mention at least one human gene, we report here that the COVID-19 literature up to mid-October 2020 follows a similar pattern. This means that a large number of host genes that have been implicated in SARS-CoV-2 infection by four genome-wide studies remain unstudied. While quantifying the consequences of this neglect is not possible, they could be significant.

Introduction

Shortly after SARS-CoV-2, the coronavirus that causes COVID-19, emerged as a global threat to human health in January 2020, researchers identified the host proteins required for viral entry into cells (Hoffmann et al., 2020; Monteil et al., 2020; Wrapp et al., 2020), repurposed drugs for treating COVID-19 patients (Grein et al., 2020; Recovery Collaborative Group, 2020), and initiated vaccine development (Folegatti et al., 2020; Jackson et al., 2020). A common feature of these advances was that they drew upon previous lines of research. A major question, however, is whether research into COVID-19 is pursuing all important host genes implicated in COVID-19.

To answer this question, we used LitCOVID, a literature hub curated by the National Library of Medicine that tracks publications on COVID-19 (Chen et al., 2020). LitCOVID tags genes within the publicly accessible text of individual publications through PubTator (Wei et al., 2019), which first applies an ensemble of automated approaches to tag genes, and then allows for a revision of these tags by biocurators. We consider genes tagged within the title, abstract or results sections of individual publications, and use MEDLINE to exclude reviews and other non-research publications (see Methods). This yields 10,395 research publications featuring 3733 human protein-coding genes that have been tagged at least once. This enables us to ask whether the choices by scientists to investigate these genes can be understood in terms of current biological knowledge on COVID-19.

Results

The most prominently tagged genes up to this point are: Angiotensin-converting enzyme 2, which serves as receptor for SARS-CoV-2 to enter cells (Hoffmann et al., 2020); C-reactive protein, a serum marker for inflammation (Sproston and Ashworth, 2018); and Interleukin 6, a mediator of systemic inflammatory responses (Kang et al., 2019). They account for 10.8%, 9.7%, and 4.5% of the total research on human protein-coding genes within the COVID-19 literature, respectively (see Methods). Gene Ontology Enrichment analysis of the human protein-coding genes tagged in the COVID-19 literature finds them enriched for annotations on immune response (false-discovery rate <10⁻⁶⁶), inflammatory response (false-discovery rate <10⁻⁶⁵), and defense response to virus (false-discovery rate <10⁻³¹) (Supplementary file 1). These two observations would thus suggest that the choice of host genes tagged in the COVID-19 literature is biologically grounded and in accord with current knowledge about respiratory viruses.

Most host genes identified by genome-wide studies have not been pursued

Genome-wide datasets provide another window on SARS-CoV-2 infection. As genome-wide approaches circumvent research patterns that may have been pre-established within the scientific literature (Haynes et al., 2018; Nelson et al., 2015; Stoeger et al., 2018), they might identify additional genes implicated in COVID-19. RNA-sequencing (RNA-seq) was used recently to identify 1726 host genes that change the expression of their transcripts in the lungs of COVID-19 patients at an adjusted p-value < 0.05 (Blanco-Melo et al., 2020). Affinity-purification mass spectrometry (Aff-MS) was used to identify 293 host proteins following the pulldown of exogenously expressed SARS-CoV-2 proteins (Gordon et al., 2020). Using genome-wide association studies (GWAS), the Host Genetics Initiative identified 52 genes through their association at a P-value of 10⁻⁵ or lower in at least one of three comparisons: COVID-19 vs lab or self-reported negative; hospitalized COVID-19 patients vs population; or very severe respiratory COVID-19 vs population (COVID-19 Host Genetics Initiative, 2020a; Ellinghaus et al., 2020). Fifteen genes were identified in two comparisons and one gene, Leucine zipper transcription factor-like protein 1 (LZTFL1), was identified in all three comparisons (Supplementary file 2). Using a pooled CRISPR screen for genes that affect SARS-CoV-2-induced cell death in African green monkey cells, Wei et al. identified 41 genes, which we mapped to their human homologs using BioMart (Wei et al., 2020; see Methods). Forty-eight genes are identified in two of the four different genome-wide datasets (Supplementary file 3), but no gene is identified in more than two.

However, an analysis of the COVID-19 literature reveals that most (56%–71%) of the genes identified in these four datasets have not yet been tagged in the COVID-19 literature (Figure 1A). Even so, the genes identified by the four genome-wide datasets are 10–25 percentage points more likely to have been tagged than a randomly chosen gene, because only 19% of all human protein-coding genes have been tagged at least once in the COVID-19 literature. Similarly, the fraction of tagged genes only increases by 0–7% if we include preprints (Figure 1—figure supplement 1). We conclude that many genes identified by genome-wide datasets on COVID-19 have not yet been investigated in more detail in the context of COVID-19.
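A minimal arithmetic sketch of this comparison, using only the percentages quoted above and treating the excess over a randomly chosen gene as a difference in percentage points (variable names are illustrative):

```python
# Percentages quoted in the text above.
untagged_range = (56, 71)  # % of identified genes never tagged (lowest, highest dataset)
baseline_tagged = 19       # % of all human protein-coding genes tagged at least once

# Share of identified genes that have been tagged, from lowest to highest dataset.
tagged_range = (100 - untagged_range[1], 100 - untagged_range[0])

# Excess over the baseline, expressed in percentage points.
excess = tuple(t - baseline_tagged for t in tagged_range)

print(tagged_range)  # (29, 44)
print(excess)        # (10, 25)
```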

Figure 1 (with 1 supplement)

Most host genes implicated in COVID-19 identified by genome-wide approaches are not being investigated.

(A) Share of identified genes that are ignored (never tagged, blue) or tagged (at least once) within the COVID-19 literature. (B) Share of tagged genes identified by a single (orange) or multiple (maroon) genome-wide datasets. P-values are calculated via Fisher’s exact test. n is the number of genes. (C) Share of tagged genes identified by a single (orange) or multiple (maroon) GWAS comparisons. P-values are calculated via Fisher’s exact test. n is the number of genes. (D) Non-COVID-19 publications measured for any human protein-coding gene (ocher, any) and those occurring in the COVID-19 literature (ocher, COV19) and genes identified in A (colors as in A). Notches indicate 95% confidence interval of the median. P-values are calculated via Mann-Whitney U test. ‘Exceeded percentiles’ indicates the percentile of all genes exceeded by the median gene of the genes in an individual boxplot. n.c. marks non-computable P-values that approximate 0. (E) As D, but for year of initial publication on the gene. Dashed lines indicate limit of visualized values. Some genes had their first publication before or after this range.

At the same time, we observe that genes that have been identified by more than one of the four distinct genome-wide datasets (Figure 1B), or by multiple GWAS comparisons (Figure 1C), are more likely to have been tagged in the COVID-19 literature. This, reassuringly, demonstrates that research into COVID-19 host genes enriches for host genes identified by multiple different lines of support, particularly if there exists support from human genome-wide association studies.

Yet, overall, genes identified by multiple genome-wide datasets remain only a minority of all identified genes (2%), and many of them are still ignored in the COVID-19 literature (52%) (Supplementary file 2), suggesting that research into SARS-CoV-2 host genes might be missing important pieces of the puzzle.

Tagged host genes follow pre-established research patterns

A possible explanation for the relative lack of interest in the additional genes implicated in SARS-CoV-2 infection by these genome-wide datasets is that research on COVID-19 is constrained by pre-established research patterns (Chu and Evans, 2018). Briefly, we know that knowledge on human genes is heavily skewed toward a subset of genes (Gans et al., 2008; Gillis and Pavlidis, 2013; Hoffmann and Valencia, 2003; Oprea et al., 2018; Su and Hogenesch, 2007) that were being investigated prior to the Human Genome Project (Edwards et al., 2011; Grueneberg et al., 2008; Stoeger et al., 2018). As a result, if assessing their importance through genetic loss-of-function intolerance or findings of GWAS (Haynes et al., 2018; Stoeger et al., 2018), the most frequently investigated protein-coding genes tend to be only marginally more important to human physiology and disease than a random selection of genes.

To test the hypothesis that COVID-19 research is constrained by patterns similar to those seen in non-COVID-19 research, we take advantage of the ability of gene2pubmed (a service provided by the National Center for Biotechnology Information) to link human protein-coding genes to individual publications, and compare 465,770 non-COVID-19 papers published until December 2015 with 10,395 COVID-19 research publications indexed by LitCOVID until October 16th, 2020. For the non-COVID-19 research we exclude publications that contain any viral gene (irrespective of whether the virus in question is a coronavirus) and publications tagging 100 or more genes.

We find that genes that are tagged in the COVID-19 literature are also frequently investigated in the non-COVID-19 literature. To assess how frequently individual genes have been investigated in the non-COVID-19 literature relative to other genes, we rank all genes according to the number of publications in the non-COVID-19 literature. The median rank of genes tagged in the COVID-19 literature exceeds the rank of 80% of human protein-coding genes (Figure 1D). This demonstrates that the majority of protein-coding human genes tagged in the COVID-19 literature were already heavily investigated in the context of research unrelated to COVID-19.
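The ranking comparison described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name, the gene labels and the publication counts are hypothetical, and the handling of ties (genes with equal counts are not counted as exceeded) is an assumption.

```python
from statistics import median

def percentile_exceeded(all_counts, subset_genes):
    """Percentage of all genes whose publication count is lower than the
    median publication count of the genes in `subset_genes`."""
    subset_median = median(all_counts[g] for g in subset_genes)
    below = sum(1 for c in all_counts.values() if c < subset_median)
    return 100 * below / len(all_counts)

# Hypothetical publication counts in the non-COVID-19 literature.
counts = {"G1": 500, "G2": 120, "G3": 40, "G4": 9, "G5": 3,
          "G6": 2, "G7": 1, "G8": 1, "G9": 0, "G10": 0}
tagged = ["G1", "G2", "G3"]  # hypothetical genes tagged in the COVID-19 literature

print(percentile_exceeded(counts, tagged))  # 80.0
```

In this toy example the median tagged gene has 120 publications, which exceeds 8 of the 10 genes.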

Next we return to our earlier observation on the majority of the implicated host genes reported by the four different genome-wide datasets being ignored within the COVID-19 literature. As anticipated, we observe that for each of the four distinct datasets investigated, ignored genes also occur less in this non-COVID-19 literature (Figure 1D). When we compare the number of publications on implicated but ignored host genes to the number of publications on any protein-coding gene encoded in the human genome, this difference is modest, and only reaches statistical significance for RNA-seq (RNA-Seq: p<10⁻²; Interactomics: p=0.20; GWAS: p=0.93; CRISPR: p=0.31), where ignored genes had occurred slightly more in the non-COVID-19 literature (median percentile: 52). In contrast, implicated and tagged host genes have occurred significantly more frequently in the non-COVID-19 literature (RNA-Seq: p<10⁻⁹⁸; Interactomics: p<10⁻¹⁸; GWAS: p<10⁻⁶; CRISPR: p<10⁻⁶). We conclude that implicated host genes that are ignored in the COVID-19 literature have in the past been studied as much as randomly chosen human protein-coding genes, whereas implicated host genes that are tagged in the COVID-19 literature have in the past already been investigated much more frequently than randomly chosen human protein-coding genes.

Before the COVID-19 pandemic it had been shown that the literature is skewed toward a subset of genes that were being investigated prior to the productive phase of the Human Genome Project. Features associated with this skew include the fraction of organs with detectable transcript expression, the length of the genes, the hydrophobicity of the coded proteins, their loss-of-function insensitivity, and studies on orthologous genes in model organisms (Stoeger et al., 2018). We decided to explore if the genes tagged in the COVID-19 literature had been studied before the pandemic, and found that they had occurred earlier (Figure 1E), with many also first being studied before the productive phase of the Human Genome Project (NHGRI, 2003). Similarly, the host genes identified by the four genome-wide datasets that are ignored in the COVID-19 literature first appeared in the non-COVID-19 literature after the host genes that are tagged in the COVID-19 literature (Figure 1E).

Trends over time

The COVID-19 pandemic has lasted less than a year, which is a short period of time compared to most research projects. Thus, we might not yet be observing research addressing poorly studied implicated host genes because insufficient time has passed for research to catch up to the new information.

To anticipate the near future, we follow the occurrence of genes in the COVID-19 literature over time. Based on our insight that ignored host genes have not been studied more than other genes in the non-COVID-19 literature (Figure 1D), we separate genes into two classes: genes that are among the 50% top-studied human protein-coding genes in the non-COVID-19 literature, and genes that are among the 50% least-studied human protein-coding genes. The second class holds 35% of the genes identified by RNA-seq, 33% of the genes identified by Aff-MS, 29% of the genes identified by GWAS, and 24% of the genes identified by CRISPR. If research is catching up to the new knowledge, we would expect to see the fraction of the COVID-19 literature addressing the 50% least-studied human protein-coding genes to increase over time.

When focusing on the genes that are among the 50% top-studied human protein-coding genes, we observe their occurrence in the COVID-19 literature to increase steadily. Extrapolating from the observed trends, we anticipate that it will take around one year until nearly all genes of this class have been tagged at least once within the COVID-19 literature (Figure 2A). When focusing on the genes that are among the 50% least-studied human protein-coding genes, we too observe their occurrence in the COVID-19 literature to increase steadily over time (Figure 2B). However, because for each of the four genome-wide datasets the increase is slower than for the 50% most-studied protein-coding genes, we project that multiple years could pass until every one of the 50% least-studied human protein-coding genes has been tagged at least once within the COVID-19 literature.

Figure 2 (with 1 supplement)

What does the future hold?

Percentage of genes with indicated levels of support by the four genome-wide studies which have been tagged at least once in the COVID-19 literature. (A) Analysis restricted to the 50% of genes with the highest number of publications in the non-COVID-19 literature. (B) Analysis restricted to the 50% of genes with the lowest number of publications in the non-COVID-19 literature. (C) Cumulative share of literature on human protein-coding genes tagged in the COVID-19 literature. Top 20% indicates the 20% of genes that occur the most in the non-COVID-19 literature. Gene rank refers to the order of human protein-coding genes. The gene with the most publication equivalents would have rank 1. Yellow area indicates share of literature accounted for by the top 20% genes. (D) Share of COVID-19 literature accounted for by the 20% of genes that had occurred the most in the COVID-19 literature by a given date. (E) Number of distinct human protein-coding genes that have been tagged in the literature by a given date. (F) Share of COVID-19 literature accounted for by first 100 genes to be tagged in the COVID-19 literature by a given date.

Pursuing this observation further, we turn to the entire COVID-19 literature. Notably, 83% of all human protein-coding genes tagged in the COVID-19 literature have not been identified by any of the four genome-wide datasets. Further, the different genome-wide datasets together only account for 26% of the COVID-19 literature (RNA-seq: 11.7%, Aff-MS: 2.4%, GWAS: 0.5%, CRISPR: 11.1%) (see Methods).

We ask whether the COVID-19 literature might become dominated by a few genes that are tagged more commonly than other genes that are also tagged in the COVID-19 literature. If we consider the current literature, we do indeed observe support for our hypothesis that the COVID-19 literature is becoming dominated by a few genes as currently the 20% top-tagged human protein-coding genes (747 of 3,733) in the COVID-19 literature account for 90% of the literature (Figure 2C). This share exceeds the 80% anticipated for scientific processes subjected to anthropogenic biases (Jia et al., 2019). We conclude that a surprisingly small fraction of genes dominates the COVID-19 literature.
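The share of the literature held by the top-tagged genes can be computed along the following lines. This is a hedged sketch: the function name and the toy publication equivalents are hypothetical, and rounding the number of top genes to the nearest integer is an assumption.

```python
def top_share(publication_equivalents, top_fraction=0.2):
    """Share of all publication equivalents held by the most-tagged genes."""
    values = sorted(publication_equivalents.values(), reverse=True)
    n_top = max(1, round(len(values) * top_fraction))
    total = sum(values)
    return sum(values[:n_top]) / total if total else 0.0

# Hypothetical publication equivalents for ten genes.
pe = {"A": 60, "B": 30, "C": 3, "D": 2, "E": 1,
      "F": 1, "G": 1, "H": 1, "I": 0.5, "J": 0.5}

print(top_share(pe))  # 0.9
```

Applied to 3,733 tagged genes, this rounding selects 747 genes, which matches the count given in the text.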

Finally, we inspect whether the extent to which the COVID-19 literature tags each tagged gene is becoming more or less expansive over time. We observe that the COVID-19 literature has become less expansive, whether we quantify expansiveness through the share of the literature that is accounted for by the 20% top-tagged genes or the Gini coefficient over the share of the COVID-19 literature attributable to individual genes (Gini, 1912; Figure 2D, Figure 2—figure supplement 1A,B). However, if assessing expansiveness by the total number of genes that have been tagged at least once, then this literature did become more expansive after the first months (Figure 2E).
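For readers unfamiliar with the Gini coefficient, the sketch below shows one common way to compute it from per-gene literature shares. The text does not state which estimator was used, so the formula chosen here is an assumption and the input values are hypothetical.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = all equal; values near 1
    mean that a few entries hold almost everything)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Formula on ascending-sorted values with 1-based ranks i:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n

print(gini([1, 1, 1, 1]))   # 0.0  (every gene has the same share)
print(gini([0, 0, 0, 10]))  # 0.75 (one of four genes holds everything)
```

A rising value over time would correspond to a literature that concentrates on fewer genes.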

Interestingly, we also observe that the share of the COVID-19 literature that is accounted for by the 100 genes that were tagged first within the COVID-19 literature has been decreasing, though it has stabilized at an astonishingly high share of roughly 45% since June (Figure 2F, Figure 2—figure supplement 1C). We conclude that, overall, the literature on COVID-19 became less expansive during the first months of the pandemic and has since stayed focused on a restricted subset of genes.

One possible reason why some genes are tagged more than others in the COVID-19 literature could be that they are more important than other genes in the context of COVID-19. To probe this hypothesis, we consider groups of genes and the four different genome-wide datasets. When contrasting the 100 initially tagged genes against the other genes tagged in the COVID-19 literature, we reassuringly find that the 100 initially tagged genes are 29% more likely to have been identified by one of these four datasets (Supplementary file 4). However, they are on average tagged 2993% more (Supplementary file 4). If we contrast the 20% top-tagged genes against the other tagged genes, we find them to be 3% less likely to have been identified by one of the four genome-wide datasets, while they are on average tagged 3512% more (Supplementary file 4). Cumulatively, this suggests that the present focus of the COVID-19 literature on a restricted subset of genes cannot be explained by those genes having been identified by genome-wide datasets reporting on transcriptomic changes, protein interactions, genetic associations and loss-of-function perturbations.

Study limitations

Our study has several important limitations. First, we cannot say whether a gene tagged in the COVID-19 literature is truly investigated for its potential role in COVID-19. Second, we cannot yet assess how important individual genes are in COVID-19. Third, and despite our projections, it remains formally unclear whether the findings reported in this manuscript will hold in the upcoming months as more genome-wide datasets become available and researchers have had sufficient time for follow-up studies. For instance, there might already be research initiatives specifically targeted toward the ignored COVID-19 host genes.

Nonetheless, in the past, genome-wide experiments have rarely guided subsequent studies in the non-COVID-19 literature (Haynes et al., 2018; Stoeger et al., 2018). Thus, there is a significant risk that the COVID-19 literature will continue to ignore host genes that have not already been extensively studied independently of COVID-19.

Discussion

Our study reiterates prior observations that research into human protein-coding genes is disproportionately skewed towards a comparably small set of genes (Haynes et al., 2018; Nelson et al., 2015; Stoeger et al., 2018). Likewise, our current analysis on COVID-19 already allows us to conclude that genes that are identified by genome-wide datasets, and hence are likely to have biological significance in the context of COVID-19, have hitherto remained ignored if they had not already been investigated more than other genes prior to COVID-19.

We realize that there is an exploration-exploitation trade-off at play and that focusing research on genes that have already been heavily investigated yields significant advantages to investigators: applicability of existing research tools, the ability to place findings in a broader context, and the identification of drugs and other reagents that could be repurposed. Supporting a focus on exploitation, we find that: interventions in clinical trials on COVID-19 are biased toward pharmaceutical targets that occurred frequently in the non-COVID-19 literature (Figure 3A); and that antibodies – a class of reagent that cannot be produced for arbitrary genes within a few days – are less available for those genes identified by RNA-seq or Aff-MS or GWAS which have been ignored in the COVID-19 literature (Figure 3B).

Figure 3

Availability of reagents.

(A) Drugs studied in COVID-19 related clinical trials are frequently studied within the non-COVID-19 literature. We compare non-COVID-19 publications measured for human protein-coding genes that are not listed as pharmaceutical targets in DrugBank (ocher, No drug), against those that are listed as pharmaceutical targets but have not occurred as an intervention in a clinical trial on COVID-19 (orange, Drug no trial), and against those that are listed as pharmaceutical targets and have occurred as an intervention in a clinical trial on COVID-19 (green, Drug and trial). Notches indicate 95% confidence interval of the median. P-values are calculated via Mann-Whitney U test. (B) Fraction of genes with reported usage of an antibody to detect the encoded protein as a prey in BioGRID. Bars are genes identified by the four different genome-wide studies that have either been tagged in the COVID-19 literature (red) or ignored (blue). Error bars indicate 95% confidence interval. P-values are calculated via Fisher’s exact test.

Further, additional factors might affect the exploration of studies on ignored host genes. First, the number of laboratories working on ignored genes was quite small prior to COVID-19 (Supplementary file 5), and plausibly only a small fraction of the laboratories studied host responses toward respiratory viruses. Second, the risk of being outcompeted by other laboratories might discourage individual laboratories from pursuing publicly acknowledged research targets (Bergstrom et al., 2016). Third, scientists rarely switch topics (Zeng et al., 2019). Likewise, laboratories already working on COVID-19 might have little incentive to move toward distinct host genes as it is possible to contribute to the COVID-19 literature irrespective of whether the genes had been identified by genome-wide datasets (Figure 2A,B and Supplementary file 4). Moreover, it might be beneficial overall if research into COVID-19 is mainly driven by researchers with a background on pathogens (Kwon, 2020). Lastly, concerns have been expressed about the possibility that fraudulent gene knockdown studies that target under-studied human genes may be corrupting the literature and impeding research into biomarkers (Byrne et al., 2019).

We believe that a more complete understanding of host biology could open novel directions for interventions against SARS-CoV-2 and other viruses. However, the challenge remains of how to promote research on ignored host genes. For example, we cannot speculate whether researchers who turn their attention toward ignored genes in the context of COVID-19 will face a similar disadvantage to their careers as did those who studied less-studied genes prior to COVID-19 (Stoeger et al., 2018).

In the hopes of prompting greater investigation into implicated host genes, we list genes occurring in multiple of the four datasets described earlier in the supplemental material of this manuscript (Supplementary file 3). Most of the genes identified by multiple datasets appear multiple times because of the large volume of genes identified by the RNA-seq dataset. For this reason, we highlight four genes that were identified by multiple smaller datasets: (1) Mitochondrial import inner membrane translocase subunit Timm10 (TIMM10) has been identified through Aff-MS and CRISPR, and (2) FYVE And Coiled-Coil Domain Autophagy Adaptor 1 (FYCO1) and (3) Procollagen-Lysine,2-Oxoglutarate 5-Dioxygenase 2 (PLOD2) and (4) Ras GTPase-activating protein-binding protein 2 (G3BP2) have been identified through Aff-MS and GWAS. Of these four genes only G3BP2 has been tagged in the COVID-19 literature; and TIMM10 and FYCO1 have both occurred in nine publications in the non-COVID-19 literature, matching the expectation for a randomly selected gene. Of additional interest in the context of COVID-19, FYCO1 is associated with the levels of the monocyte chemoattractant protein-1 (Ahola-Olli et al., 2017; Buniello et al., 2019), which contributes to COVID-19 through hyperinflammation (Mehta et al., 2020).

Methods

COVID-19 literature

We downloaded LitCOVID from https://ftp.ncbi.nlm.nih.gov/pub/lu/LitCovid/ on 2020-10-16 and parsed the contained JSON file for the presence of concepts annotated as genes. For studies annotated with proteins, we used their PubMed identifiers to query MEDLINE on 2020-10-16 via the efetch API. Subsequently we parsed the MEDLINE entries via pubmed_parser 2.2 (https://github.com/titipata/pubmed_parser; Achakulvisut et al., 2020). We then excluded publications carrying at least one of the following publication types: Review, Comment, Editorial, Meta-Analysis, Systematic Review, News, Published Erratum, Historical Article, Interview, Retracted Publication, Retraction of Publication, Webcast, Expression of Concern or Portrait. Further, we excluded publications whose abstracts contain either the phrase ‘this review’ or the phrase ‘this perspective’. We considered genes tagged within LitCOVID in the annotated TITLE, INTRO, ABSTRACT or RESULTS sections.
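The exclusion rules above can be sketched as a small filter. This is an illustration under stated assumptions rather than the code used for the study: the function names and the `(section, gene)` input layout are hypothetical, and matching the two phrases case-insensitively is an assumption that the text does not confirm.

```python
# Publication types and phrases listed in the text above.
EXCLUDED_TYPES = {
    "Review", "Comment", "Editorial", "Meta-Analysis", "Systematic Review",
    "News", "Published Erratum", "Historical Article", "Interview",
    "Retracted Publication", "Retraction of Publication", "Webcast",
    "Expression of Concern", "Portrait",
}
EXCLUDED_PHRASES = ("this review", "this perspective")
KEPT_SECTIONS = {"TITLE", "INTRO", "ABSTRACT", "RESULTS"}

def is_research_publication(publication_types, abstract):
    """True if no excluded publication type or phrase is present."""
    if EXCLUDED_TYPES.intersection(publication_types):
        return False
    text = (abstract or "").lower()  # assumption: phrase match ignores case
    return not any(phrase in text for phrase in EXCLUDED_PHRASES)

def genes_in_kept_sections(annotations):
    """annotations: iterable of (section, gene) pairs (hypothetical layout)."""
    return [gene for section, gene in annotations if section in KEPT_SECTIONS]

print(is_research_publication(["Journal Article"], "We measured IL6 levels."))  # True
print(is_research_publication(["Journal Article", "Review"], ""))               # False
```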

Research intensity within COVID-19 literature

We measured the research intensity directed toward individual implicated host genes in units of publication equivalents. Each gene tagged within a publication accrues publication equivalents equal to the number of tags to the gene in that publication divided by the total number of tags to any gene in that publication. For example, if a study tags two different genes, and the first gene is tagged three times, whereas the second gene is only tagged once, the first gene would accrue 0.75 publication equivalents, and the second gene would accrue 0.25 publication equivalents. We expressed the share of literature covered by an individual human protein-coding gene as the sum of its publication equivalents over the sum of all publication equivalents of human protein-coding genes. We excluded the studies of Blanco-Melo et al., 2020 and Gordon et al., 2020 which report the RNA-seq and Aff-MS datasets, respectively.
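The definition of publication equivalents can be written out as a short sketch. The function names and the input layout (one list of gene tags per publication) are hypothetical, and the gene symbols in the example are placeholders for the two unnamed genes of the worked example above.

```python
from collections import Counter, defaultdict

def publication_equivalents(tags_per_publication):
    """tags_per_publication: list of lists of gene tags, one list per publication.
    Each publication contributes one publication equivalent in total, split
    across genes in proportion to how often each gene is tagged."""
    totals = defaultdict(float)
    for tags in tags_per_publication:
        if not tags:
            continue  # publications without gene tags contribute nothing
        for gene, count in Counter(tags).items():
            totals[gene] += count / len(tags)
    return dict(totals)

def literature_share(equivalents):
    """Share of the literature per gene: its equivalents over the sum of all."""
    total = sum(equivalents.values())
    return {g: v / total for g, v in equivalents.items()} if total else {}

# Worked example from the text: one gene tagged three times, another once.
example = publication_equivalents([["ACE2", "ACE2", "ACE2", "IL6"]])
print(example)  # {'ACE2': 0.75, 'IL6': 0.25}
```

Because every publication with at least one tag sums to exactly one equivalent, a publication that tags many genes does not weigh more than one that tags a single gene.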

Gene ontology enrichment analysis

We used the Database for Annotation, Visualization and Integrated Discovery, version 6.8 (Huang et al., 2009).

Data processing and filtering

For CRISPR we considered the African green monkey genes reported in Figure 1D of Jin Wei et al., 2020. To map African green monkey to human genes, we used BioMart’s (Haider et al., 2009) April 2020 release. We used the genetic polymorphisms reported in the 2020-09-30 release of the Host Genetics Initiative (COVID-19 Host Genetics Initiative, 2020b) and mapped them to human genes through the Ensembl Variant Effect Predictor (McLaren et al., 2016), using the Ensembl release 101. For RNA-seq we only considered comparisons flagged with ‘ok’ by the authors (Blanco-Melo et al., 2020). For Aff-MS we used the data as provided by BioGRID (Chatr-Aryamontri et al., 2017), version 3.5.186 (https://downloads.thebiogrid.org/BioGRID).

We obtained the list of human protein-coding genes from https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz in June 2020.

Occurrence of genes in preprints

We obtained manuscript abstracts from dimension.ai’s collection of COVID-19 related publications, release 34 (https://dimensions.figshare.com/articles/dataset/Dimensions_COVID-19_publications_datasets_and_clinical_trials/11961063/34; dimension.ai, 2020), and subsequently selected manuscripts listing medRxiv, bioRxiv or arXiv as their source. Next, we matched each word against the gene symbols as downloaded from https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz in June 2020.

We excluded the following gene symbols as within the abstracts they would match abbreviations that did not refer to genes: AFM, AIR, AN, APC, APP, AR, ARC, ATM, BCR, BED, BID, CCNC, CFD, CHM, COPD, COPE, CP, CPE, CS, DBI, DCT, ENG, GAN, GC, HP, HPA, HPD, HPO, HR, IDS, IMPACT, IV, KIT, MCC, MET, MICE, MMD, MS, MS2, NHS, NM, NPS, NSF, NTS, PIP, POLL, REST, SEA, SET, SHE, SI, SPR, STS, TAT, TRAP, WAS.
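A minimal sketch of this word-level matching is given below. The exclusion list is copied from the text; the tokenization rule, the case-sensitive comparison, the function name and the small symbol set in the example are assumptions made for illustration.

```python
import re

# Gene symbols excluded in the text because they collide with abbreviations.
EXCLUDED_SYMBOLS = set("""
AFM AIR AN APC APP AR ARC ATM BCR BED BID CCNC CFD CHM COPD COPE CP CPE CS
DBI DCT ENG GAN GC HP HPA HPD HPO HR IDS IMPACT IV KIT MCC MET MICE MMD MS
MS2 NHS NM NPS NSF NTS PIP POLL REST SEA SET SHE SI SPR STS TAT TRAP WAS
""".split())

def genes_in_abstract(abstract, gene_symbols):
    """Gene symbols that occur as whole words in the abstract, after removing
    the excluded symbols. Assumption: matching is case-sensitive."""
    words = set(re.findall(r"[A-Za-z0-9-]+", abstract))
    return sorted((words & set(gene_symbols)) - EXCLUDED_SYMBOLS)

# Hypothetical symbol set and abstract.
symbols = {"ACE2", "IL6", "COPD", "MET", "TP53"}
text = "ACE2 and IL6 were elevated in COPD patients; MET was not."
print(genes_in_abstract(text, symbols))  # ['ACE2', 'IL6']
```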

Non-COVID-19 literature

We downloaded gene2pubmed from https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz in early 2017. MEDLINE, containing publication dates and publication types, was downloaded from https://www.nlm.nih.gov/databases/download/pubmed_medline.html and maintained in a local copy of the database in early 2017. We restricted the analysis to research publications published prior to 2016.

Temporal profiles

We obtained publication dates from dimension.ai’s collection of COVID-19 publications, release 34 (https://dimensions.figshare.com/articles/dataset/Dimensions_COVID-19_publications_datasets_and_clinical_trials/11961063/34). We excluded publications dating to January 1st, 2020 – the day linked to the most publications. Manual inspection revealed that the date of January 1st, 2020 was assigned to publications lacking a concrete 2020 publication date.

Occurrence within clinical trials

We obtained interventions within clinical trials from dimension.ai’s collection of COVID-19 related clinical trials, release 34 (https://dimensions.figshare.com/articles/dataset/Dimensions_COVID-19_publications_datasets_and_clinical_trials/11961063/34; dimension.ai, 2020). We performed a case-insensitive match against drug names and drug synonyms contained within DrugBank, version 5.1.5 (https://www.drugbank.ca). Next we used DrugBank’s mapping between drugs and the targets of their pharmaceutical action and used the accompanying gene symbol to identify genes.
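The mapping from trial interventions to target genes can be sketched as follows. This is a hedged illustration: the function name and the dictionary layout are hypothetical, the example entries are hand-written rather than taken from DrugBank, and matching the whole intervention string (rather than substrings) is an assumption.

```python
def genes_targeted_by_interventions(interventions, drug_synonyms, drug_targets):
    """Map trial interventions to the gene symbols of their drug targets.

    drug_synonyms: {drug name: [synonyms]}
    drug_targets:  {drug name: [gene symbols of targets]}
    """
    # Lower-cased lookup from every name or synonym to its drug entry.
    lookup = {}
    for drug, names in drug_synonyms.items():
        for name in [drug, *names]:
            lookup[name.lower()] = drug
    genes = set()
    for intervention in interventions:
        drug = lookup.get(intervention.strip().lower())  # case-insensitive match
        if drug is not None:
            genes.update(drug_targets.get(drug, ()))
    return genes

# Hand-written illustrative entries (not copied from DrugBank).
synonyms = {"Tocilizumab": ["Actemra"]}
targets = {"Tocilizumab": ["IL6R"]}
trial_interventions = ["tocilizumab", "Standard of care", "ACTEMRA "]
print(genes_targeted_by_interventions(trial_interventions, synonyms, targets))  # {'IL6R'}
```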

Identification of research laboratories

We used disambiguated authorship identifiers from Web of Science and considered the last author of each publication as the laboratory.

Data availability

No data were generated for this study. Data underlying this study can be downloaded from the sources indicated in the Methods section and used under the respective licenses. The data can be preprocessed with the source code accompanying this manuscript and, for literature until 2015, with the public source code provided in a former publication of ours, https://github.com/tstoeger/plos_biology_2018_ignored_genes.

References

(2020) Pubmed parser: a Python parser for PubMed Open-Access XML subset and MEDLINE XML dataset XML dataset
Journal of Open Source Software 5:1979.

https://doi.org/10.21105/joss.01979
1. Ahola-Olli AV
2. Würtz P
3. Havulinna AS
4. Aalto K
5. Pitkänen N
6. Lehtimäki T
7. Kähönen M
8. Lyytikäinen LP
9. Raitoharju E
10. Seppälä I
11. Sarin AP
12. Ripatti S
13. Palotie A
14. Perola M
15. Viikari JS
16. Jalkanen S
17. Maksimow M
18. Salomaa V
19. Salmi M
20. Kettunen J
21. Raitakari OT
(2017) Genome-wide association study identifies 27 loci influencing concentrations of circulating cytokines and growth factors
American Journal of Human Genetics 100:40–50.

https://doi.org/10.1016/j.ajhg.2016.11.007
Preprint
(2016) Why scientists chase big problems: individual strategy and social optimality
arXiv.

https://arxiv.org/abs/1605.05822v2
1. Blanco-Melo D
2. Nilsson-Payant BE
3. Liu WC
4. Uhl S
5. Hoagland D
6. Møller R
7. Jordan TX
8. Oishi K
9. Panis M
10. Sachs D
11. Wang TT
12. Schwartz RE
13. Lim JK
14. Albrecht RA
15. tenOever BR
(2020) Imbalanced host response to SARS-CoV-2 drives development of COVID-19
Cell 181:1036–1045.

https://doi.org/10.1016/j.cell.2020.04.026
1. Buniello A
2. MacArthur JAL
3. Cerezo M
4. Harris LW
5. Hayhurst J
6. Malangone C
7. McMahon A
8. Morales J
9. Mountjoy E
10. Sollis E
11. Suveges D
12. Vrousgou O
13. Whetzel PL
14. Amode R
15. Guillen JA
16. Riat HS
17. Trevanion SJ
18. Hall P
19. Junkins H
20. Flicek P
21. Burdett T
22. Hindorff LA
23. Cunningham F
24. Parkinson H
(2019) The NHGRI-EBI GWAS catalog of published genome-wide association studies, targeted arrays and summary statistics 2019
Nucleic Acids Research 47:D1005–D1012.

https://doi.org/10.1093/nar/gky1120
(2019) The possibility of systematic research fraud targeting Under-Studied human genes: causes, consequences, and potential solutions
Biomarker Insights 14:1–12.

https://doi.org/10.1177/1177271919829162
1. Chatr-Aryamontri A
2. Oughtred R
3. Boucher L
4. Rust J
5. Chang C
6. Kolas NK
7. O'Donnell L
8. Oster S
9. Theesfeld C
10. Sellam A
11. Stark C
12. Breitkreutz BJ
13. Dolinski K
14. Tyers M
(2017) The BioGRID interaction database: 2017 update
Nucleic Acids Research 45:D369–D379.

https://doi.org/10.1093/nar/gkw1102
1. Chen Q
2. Allot A
3. Lu Z
(2020) Keep up with the latest coronavirus research
Nature 579:193.

https://doi.org/10.1038/d41586-020-00694-1
1. Chu JSG
2. Evans JA
(2018) Too many papers? Slowed canonical progress in large fields of science
SocArXiv.
1. COVID-19 Host Genetics Initiative
(2020a) The COVID-19 host genetics initiative, a global initiative to elucidate the role of host genetic factors in susceptibility and severity of the SARS-CoV-2 virus pandemic
European Journal of Human Genetics 28:715–718.

https://doi.org/10.1038/s41431-020-0636-6
Data
1. COVID-19 Host Genetics Initiative
(2020b) COVID19-hg GWAS meta-analyses round 3
COVID-19 Hg. ID round 3.

https://www.covid19hg.org/results/
Data
1. dimension.ai
(2020) Dimensions COVID-19 publications, datasets and clinical trials
figshare.

https://doi.org/10.6084/m9.figshare.11961063.v34
1. Edwards AM
2. Isserlin R
3. Bader GD
4. Frye SV
5. Willson TM
6. Yu FH
(2011) Too many roads not taken
Nature 470:163–165.

https://doi.org/10.1038/470163a
1. Ellinghaus D
2. Degenhardt F
3. Bujanda L
4. Buti M
5. Albillos A
6. Invernizzi P
7. Fernández J
8. Prati D
9. Baselli G
10. Asselta R
11. Grimsrud MM
12. Milani C
13. Aziz F
14. Kässens J
15. May S
16. Wendorff M
17. Wienbrandt L
18. Uellendahl-Werth F
19. Zheng T
20. Yi X
21. de Pablo R
22. Chercoles AG
23. Palom A
24. Garcia-Fernandez AE
25. Rodriguez-Frias F
26. Zanella A
27. Bandera A
28. Protti A
29. Aghemo A
30. Lleo A
31. Biondi A
32. Caballero-Garralda A
33. Gori A
34. Tanck A
35. Carreras Nolla A
36. Latiano A
37. Fracanzani AL
38. Peschuck A
39. Julià A
40. Pesenti A
41. Voza A
42. Jiménez D
43. Mateos B
44. Nafria Jimenez B
45. Quereda C
46. Paccapelo C
47. Gassner C
48. Angelini C
49. Cea C
50. Solier A
51. Pestaña D
52. Muñiz-Diaz E
53. Sandoval E
54. Paraboschi EM
55. Navas E
56. García Sánchez F
57. Ceriotti F
58. Martinelli-Boneschi F
59. Peyvandi F
60. Blasi F
61. Téllez L
62. Blanco-Grau A
63. Hemmrich-Stanisak G
64. Grasselli G
65. Costantino G
66. Cardamone G
67. Foti G
68. Aneli S
69. Kurihara H
70. ElAbd H
71. My I
72. Galván-Femenia I
73. Martín J
74. Erdmann J
75. Ferrusquía-Acosta J
76. Garcia-Etxebarria K
77. Izquierdo-Sanchez L
78. Bettini LR
79. Sumoy L
80. Terranova L
81. Moreira L
82. Santoro L
83. Scudeller L
84. Mesonero F
85. Roade L
86. Rühlemann MC
87. Schaefer M
88. Carrabba M
89. Riveiro-Barciela M
90. Figuera Basso ME
91. Valsecchi MG
92. Hernandez-Tejero M
93. Acosta-Herrera M
94. D'Angiò M
95. Baldini M
96. Cazzaniga M
97. Schulzky M
98. Cecconi M
99. Wittig M
100. Ciccarelli M
101. Rodríguez-Gandía M
102. Bocciolone M
103. Miozzo M
104. Montano N
105. Braun N
106. Sacchi N
107. Martínez N
108. Özer O
109. Palmieri O
110. Faverio P
111. Preatoni P
112. Bonfanti P
113. Omodei P
114. Tentorio P
115. Castro P
116. Rodrigues PM
117. Blandino Ortiz A
118. de Cid R
119. Ferrer R
120. Gualtierotti R
121. Nieto R
122. Goerg S
123. Badalamenti S
124. Marsal S
125. Matullo G
126. Pelusi S
127. Juzenas S
128. Aliberti S
129. Monzani V
130. Moreno V
131. Wesse T
132. Lenz TL
133. Pumarola T
134. Rimoldi V
135. Bosari S
136. Albrecht W
137. Peter W
138. Romero-Gómez M
139. D'Amato M
140. Duga S
141. Banales JM
142. Hov JR
143. Folseraas T
144. Valenti L
145. Franke A
146. Karlsen TH
147. Severe Covid-19 GWAS Group
(2020) Genome-wide association study of severe Covid-19 with respiratory failure
New England Journal of Medicine 383:1522–1534.

https://doi.org/10.1056/NEJMoa2020283
1. Folegatti PM
2. Ewer KJ
3. Aley PK
4. Angus B
5. Becker S
6. Belij-Rammerstorfer S
7. Bellamy D
8. Bibi S
9. Bittaye M
10. Clutterbuck EA
11. Dold C
12. Faust SN
13. Finn A
14. Flaxman AL
15. Hallis B
16. Heath P
17. Jenkin D
18. Lazarus R
19. Makinson R
20. Minassian AM
21. Pollock KM
22. Ramasamy M
23. Robinson H
24. Snape M
25. Tarrant R
26. Voysey M
27. Green C
28. Douglas AD
29. Hill AVS
30. Lambe T
31. Gilbert SC
32. Pollard AJ
33. Oxford COVID Vaccine Trial Group
(2020) Safety and immunogenicity of the ChAdOx1 nCoV-19 vaccine against SARS-CoV-2: a preliminary report of a phase 1/2, single-blind, randomised controlled trial
The Lancet 396:467–478.

https://doi.org/10.1016/S0140-6736(20)31604-4
Website
(2008) Patents, papers, pairs and secrets: contracting over the disclosure of scientific knowledge (Preliminary & incomplete)
Accessed October 23, 2020.

http://fmurray.scripts.mit.edu/docs/Gans.Murray.Stern%20_KnowledgeDisclosure_DRAFT_09.30.2008.pdf
1. Gillis J
2. Pavlidis P
(2013) Assessing identity, redundancy and confounds in gene ontology annotations over time
Bioinformatics 29:476–482.

https://doi.org/10.1093/bioinformatics/bts727
Book
1. Gini C
(1912) Variabilità e Mutabilità
Tipogr. di P. Cuppini.
1. Gordon DE
2. Jang GM
3. Bouhaddou M
4. Xu J
5. Obernier K
6. White KM
7. O'Meara MJ
8. Rezelj VV
9. Guo JZ
10. Swaney DL
11. Tummino TA
12. Hüttenhain R
13. Kaake RM
14. Richards AL
15. Tutuncuoglu B
16. Foussard H
17. Batra J
18. Haas K
19. Modak M
20. Kim M
21. Haas P
22. Polacco BJ
23. Braberg H
24. Fabius JM
25. Eckhardt M
26. Soucheray M
27. Bennett MJ
28. Cakir M
29. McGregor MJ
30. Li Q
31. Meyer B
32. Roesch F
33. Vallet T
34. Mac Kain A
35. Miorin L
36. Moreno E
37. Naing ZZC
38. Zhou Y
39. Peng S
40. Shi Y
41. Zhang Z
42. Shen W
43. Kirby IT
44. Melnyk JE
45. Chorba JS
46. Lou K
47. Dai SA
48. Barrio-Hernandez I
49. Memon D
50. Hernandez-Armenta C
51. Lyu J
52. Mathy CJP
53. Perica T
54. Pilla KB
55. Ganesan SJ
56. Saltzberg DJ
57. Rakesh R
58. Liu X
59. Rosenthal SB
60. Calviello L
61. Venkataramanan S
62. Liboy-Lugo J
63. Lin Y
64. Huang XP
65. Liu Y
66. Wankowicz SA
67. Bohn M
68. Safari M
69. Ugur FS
70. Koh C
71. Savar NS
72. Tran QD
73. Shengjuler D
74. Fletcher SJ
75. O'Neal MC
76. Cai Y
77. Chang JCJ
78. Broadhurst DJ
79. Klippsten S
80. Sharp PP
81. Wenzell NA
82. Kuzuoglu-Ozturk D
83. Wang HY
84. Trenker R
85. Young JM
86. Cavero DA
87. Hiatt J
88. Roth TL
89. Rathore U
90. Subramanian A
91. Noack J
92. Hubert M
93. Stroud RM
94. Frankel AD
95. Rosenberg OS
96. Verba KA
97. Agard DA
98. Ott M
99. Emerman M
100. Jura N
101. von Zastrow M
102. Verdin E
103. Ashworth A
104. Schwartz O
105. d'Enfert C
106. Mukherjee S
107. Jacobson M
108. Malik HS
109. Fujimori DG
110. Ideker T
111. Craik CS
112. Floor SN
113. Fraser JS
114. Gross JD
115. Sali A
116. Roth BL
117. Ruggero D
118. Taunton J
119. Kortemme T
120. Beltrao P
121. Vignuzzi M
122. García-Sastre A
123. Shokat KM
124. Shoichet BK
125. Krogan NJ
(2020) A SARS-CoV-2 protein interaction map reveals targets for drug repurposing
Nature 583:459–468.

https://doi.org/10.1038/s41586-020-2286-9
1. Grein J
2. Ohmagari N
3. Shin D
4. Diaz G
5. Asperges E
6. Castagna A
7. Feldt T
8. Green G
9. Green ML
10. Lescure F-X
11. Nicastri E
12. Oda R
13. Yo K
14. Quiros-Roldan E
15. Studemeister A
16. Redinski J
17. Ahmed S
18. Bernett J
19. Chelliah D
20. Chen D
21. Chihara S
22. Cohen SH
23. Cunningham J
24. D’Arminio Monforte A
25. Ismail S
26. Kato H
27. Lapadula G
28. L’Her E
29. Maeno T
30. Majumder S
31. Massari M
32. Mora-Rillo M
33. Mutoh Y
34. Nguyen D
35. Verweij E
36. Zoufaly A
37. Osinusi AO
38. DeZure A
39. Zhao Y
40. Zhong L
41. Chokkalingam A
42. Elboudwarej E
43. Telep L
44. Timbs L
45. Henne I
46. Sellers S
47. Cao H
48. Tan SK
49. Winterbourne L
50. Desai P
51. Mera R
52. Gaggar A
53. Myers RP
54. Brainard DM
55. Childs R
56. Flanigan T
(2020) Compassionate use of remdesivir for patients with severe Covid-19
New England Journal of Medicine 382:2327–2336.

https://doi.org/10.1056/NEJMoa2007016
1. Grueneberg DA
2. Degot S
3. Pearlberg J
4. Li W
5. Davies JE
6. Baldwin A
7. Endege W
8. Doench J
9. Sawyer J
10. Hu Y
11. Boyce F
12. Xian J
13. Munger K
14. Harlow E
(2008) Kinase requirements in human cells: I. Comparing kinase requirements across various cell types
PNAS 105:16472–16477.

https://doi.org/10.1073/pnas.0808019105
1. Haider S
2. Ballester B
3. Smedley D
4. Zhang J
5. Rice P
6. Kasprzyk A
(2009) BioMart central portal--unified access to biological data
Nucleic Acids Research 37:W23–W27.

https://doi.org/10.1093/nar/gkp265
(2018) Gene annotation bias impedes biomedical research
Scientific Reports 8:1362.

https://doi.org/10.1038/s41598-018-19333-x
1. Hoffmann M
2. Kleine-Weber H
3. Schroeder S
4. Krüger N
5. Herrler T
6. Erichsen S
7. Schiergens TS
8. Herrler G
9. Wu NH
10. Nitsche A
11. Müller MA
12. Drosten C
13. Pöhlmann S
(2020) SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor
Cell 181:271–280.

https://doi.org/10.1016/j.cell.2020.02.052
1. Hoffmann R
2. Valencia A
(2003) Life cycles of successful genes
Trends in Genetics 19:79–81.

https://doi.org/10.1016/S0168-9525(02)00014-8
(2009) Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources
Nature Protocols 4:44–57.

https://doi.org/10.1038/nprot.2008.211
1. Jackson LA
2. Anderson EJ
3. Rouphael NG
4. Roberts PC
5. Makhene M
6. Coler RN
7. McCullough MP
8. Chappell JD
9. Denison MR
10. Stevens LJ
11. Pruijssers AJ
12. McDermott A
13. Flach B
14. Doria-Rose NA
15. Corbett KS
16. Morabito KM
17. O’Dell S
18. Schmidt SD
19. Swanson PA
20. Padilla M
21. Mascola JR
22. Neuzil KM
23. Bennett H
24. Sun W
25. Peters E
26. Makowski M
27. Albert J
28. Cross K
29. Buchanan W
30. Pikaart-Tautges R
31. Ledgerwood JE
32. Graham BS
33. Beigel JH
(2020) An mRNA vaccine against SARS-CoV-2 — Preliminary Report
New England Journal of Medicine 383:1920–1931.

https://doi.org/10.1056/NEJMoa2022483
1. Jia X
2. Lynch A
3. Huang Y
4. Danielson M
5. Lang'at I
6. Milder A
7. Ruby AE
8. Wang H
9. Friedler SA
10. Norquist AJ
11. Schrier J
(2019) Anthropogenic biases in chemical reaction data hinder exploratory inorganic synthesis
Nature 573:251–255.

https://doi.org/10.1038/s41586-019-1540-5
(2020) Genome-wide CRISPR screen reveals host genes that regulate SARS-CoV-2 infection
bioRxiv: The Preprint Server for Biology.
(2019) Targeting Interleukin-6 signaling in clinic
Immunity 50:1007–1023.

https://doi.org/10.1016/j.immuni.2019.03.026
1. Kwon D
(2020) How swamped preprint servers are blocking bad coronavirus research
Nature 581:130–131.

https://doi.org/10.1038/d41586-020-01394-6
1. McLaren W
2. Gil L
3. Hunt SE
4. Riat HS
5. Ritchie GR
6. Thormann A
7. Flicek P
8. Cunningham F
(2016) The ensembl variant effect predictor
Genome Biology 17:122.

https://doi.org/10.1186/s13059-016-0974-4
1. Mehta P
2. McAuley DF
3. Brown M
4. Sanchez E
5. Tattersall RS
6. Manson JJ
(2020) COVID-19: consider cytokine storm syndromes and immunosuppression
The Lancet 395:1033–1034.

https://doi.org/10.1016/S0140-6736(20)30628-0
1. Monteil V
2. Kwon H
3. Prado P
4. Hagelkrüys A
5. Wimmer RA
6. Stahl M
7. Leopoldi A
8. Garreta E
9. Hurtado Del Pozo C
10. Prosper F
11. Romero JP
12. Wirnsberger G
13. Zhang H
14. Slutsky AS
15. Conder R
16. Montserrat N
17. Mirazimi A
18. Penninger JM
(2020) Inhibition of SARS-CoV-2 infections in engineered human tissues using clinical-grade soluble human ACE2
Cell 181:905–913.

https://doi.org/10.1016/j.cell.2020.04.004
1. Nelson MR
2. Tipney H
3. Painter JL
4. Shen J
5. Nicoletti P
6. Shen Y
7. Floratos A
8. Sham PC
9. Li MJ
10. Wang J
11. Cardon LR
12. Whittaker JC
13. Sanseau P
(2015) The support of human genetic evidence for approved drug indications
Nature Genetics 47:856–860.

https://doi.org/10.1038/ng.3314
Website
1. NHGRI
(2003) Press release: International consortium completes human genome project
Accessed October 23, 2020.

https://www.genome.gov/11006929/2003-release-international-consortium-completes-hgp
1. Oprea TI
2. Bologa CG
3. Brunak S
4. Campbell A
5. Gan GN
6. Gaulton A
7. Gomez SM
8. Guha R
9. Hersey A
10. Holmes J
11. Jadhav A
12. Jensen LJ
13. Johnson GL
14. Karlson A
15. Leach AR
16. Ma'ayan A
17. Malovannaya A
18. Mani S
19. Mathias SL
20. McManus MT
21. Meehan TF
22. von Mering C
23. Muthas D
24. Nguyen DT
25. Overington JP
26. Papadatos G
27. Qin J
28. Reich C
29. Roth BL
30. Schürer SC
31. Simeonov A
32. Sklar LA
33. Southall N
34. Tomita S
35. Tudose I
36. Ursu O
37. Vidovic D
38. Waller A
39. Westergaard D
40. Yang JJ
41. Zahoránszky-Köhalmi G
(2018) Unexplored therapeutic opportunities in the human genome
Nature Reviews Drug Discovery 17:317–332.

https://doi.org/10.1038/nrd.2018.14
1. Recovery Collaborative Group
(2020) Dexamethasone in hospitalized patients with Covid-19 - Preliminary report
New England Journal of Medicine NEJMoa2021436.

https://doi.org/10.1056/NEJMoa2021436
1. Sproston NR
2. Ashworth JJ
(2018) Role of C-reactive protein at sites of inflammation and infection
Frontiers in Immunology 9:754.

https://doi.org/10.3389/fimmu.2018.00754
(2018) Large-scale investigation of the reasons why potentially important genes are ignored
PLOS Biology 16:e2006643.

https://doi.org/10.1371/journal.pbio.2006643
1. Su AI
2. Hogenesch JB
(2007) Power-law-like distributions in biomedical publications and research funding
Genome Biology 8:404.

https://doi.org/10.1186/gb-2007-8-4-404
1. Wei C-H
2. Allot A
3. Leaman R
4. Lu Z
(2019) PubTator central: automated concept annotation for biomedical full text articles
Nucleic Acids Research 47:W587–W593.

https://doi.org/10.1093/nar/gkz389
Preprint
1. Wei J
2. Alfajaro MM
3. Hanna RE
4. DeWeirdt PC
5. Strine MS
6. Lu-Culligan WJ
7. Zhang SM
8. Graziano VR
9. Schmitz CO
10. Chen JS
11. Mankowski MC
12. Filler RB
13. Gasque V
14. de Miguel F
15. Chen H
16. Oguntuyo K
17. Abriola L
18. Surovtseva YV
19. Orchard RC
20. Lee B
21. Lindenbach B
22. Politi K
23. van Dijk D
24. Simon MD
25. Yan Q
26. Doench JG
27. Wilen CB
(2020) Genome-wide CRISPR screen reveals host genes that regulate SARS-CoV-2 infection
Cell.

https://doi.org/10.1016/j.cell.2020.10.028
- Google Scholar
1. Wrapp D
2. Wang N
3. Corbett KS
4. Goldsmith JA
5. Hsieh CL
6. Abiona O
7. Graham BS
8. McLellan JS
(2020) Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation
Science 367:1260–1263.

https://doi.org/10.1126/science.abb2507
1. Zeng A
2. Shen Z
3. Zhou J
4. Fan Y
5. Di Z
6. Wang Y
7. Stanley HE
8. Havlin S
(2019) Increasing trend of scientists to switch between topics
Nature Communications 10:3439.

https://doi.org/10.1038/s41467-019-11401-8

Decision letter

Peter Rodgers

Senior and Reviewing Editor; eLife, United Kingdom
Valentin Danchev

Reviewer; Stanford University, United States
Hong Zheng

Reviewer; Stanford University, United States
Steve Brown

Reviewer

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

Thank you for submitting your article "Meta-Research: COVID-19 research risks ignoring important host genes due to pre-established gene-specific biases" to eLife for consideration as a Feature Article. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by the eLife Features Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Valentin Danchev (Reviewer #1); Hong Zheng (Reviewer #2); Steve Brown (Reviewer #3).

The reviewers and editors have discussed the reviews and we have drafted this decision letter to help you prepare a revised submission.

Summary:

There is now a very considerable body of evidence, much of it provided and substantiated through the work of Stoeger and Nunes, that the majority of published gene to phenotype studies continue to focus on genes that are well-annotated or for which knowledge of biological function and pathological consequences of mutations already exist. As a result, much of the human genome remains unexplored and considered "dark". This is not just a major problem for the dynamic development of the biological and biomedical sciences and the provision of novel insights into biological and disease mechanisms. It has the potential to impact significantly on the assessment of mechanisms and interventions in scenarios such as pandemics arising from novel pathogens which potentially elicit and require investigation and understanding of novel host-pathogen interactions involving unknown genetic pathways. Stoeger and Nunes now show, based on an analysis of the COVID-19 literature up to 30 July 2020, that COVID-19 studies display similar biases and critically ignore recent genome-wide datasets. However, a number of points need to be addressed to make the article suitable for publication.

Essential revisions:

1) Please update the analysis to include COVID-19 papers published until the end of August 2020 (or later).

2) The comparison of published literature to large datasets from GWAS experiments is interesting but a comparison of the GWAS datasets themselves would add some interesting detail as well. What is the percentage of overlapping genes across all four GWAS? Is the literature more likely to select for study those genes that appear in many or all four GWAS? If it is not possible to answer these questions, please discuss this issue in the text.

3) Published papers as of July 30 are considered, many of them likely written in May or early June, i.e., a couple of months after the WHO characterised COVID-19 as pandemic. This appears to be bias away from the null. Realistically, it would take some time for the literature to diversify. Preprints could have provided a more nuanced assessment but admittedly, extracting gene keywords from preprints would be computationally hard.

A longitudinal analysis (e.g., two week slices) could be informative – is it the case that earlier in time labs started from diverse sets of genes and then converged; or, the other way around, they started from a few candidates, and then diverged. Or, there was no significant change over time, with incremental increase of the number of genes as the number of papers increased.

If it is not possible to conduct such a longitudinal analysis, please discuss this issue in the text.

4) The framing of the problem in terms of bias might be too strong – it's just too early to tell. An alternative, and less charged, way of framing the problem would be in terms of the exploration-exploitation trade-off.

Essentially, the manuscript tells us that labs have continued to study the genes they studied before COVID-19. The manuscript views this as a bias. But is it? A more charitable interpretation would be that labs continued to study what they know rather than jumping on the new fashionable possibility. A major concern since the pandemic has been the tendency for many to assert expertise in new areas relevant to Covid-19, and it seems that genomic labs are less involved in this. Could that path-dependency bias be seen as a good thing in the broader context?

5) It would be helpful to have a discussion on how realistic it is for a lab studying a set of genes to start studying another set of genes at short notice. Studies of countries' manufacturing and other fields suggest that big "jumps" in the space of possibilities are limited, meaning that one could jump from B to D but hardly to X, Y, or Z. The manuscript should clarify how possible it is in the context of those gene experiments for a lab to study one set of genes and then "jump" to another set. In other words, if a lab would like to avoid the bias the manuscript outlines, how easy would it be to study those understudied genes in terms of expertise, lab equipment, cell lines?

6) Regarding the "Pareto principle" that 80% of research falls onto only 20% of the genes, have the authors considered the possibility that the "Pareto principle" might also apply to the importance of genes? The coding sequence is only 2% of the human genome, which is more extreme than the 80/20 ratio. Within the protein coding genes, the transcription factors or the hubs of networks may just be functionally more important than other genes. Thus, is it possible that the observed bias in research reflects the innate features of human genes?

7) Another factor that might contribute to the bias is how easy it is to study the gene, especially in early days when the first step is to clone a gene, and the longer the gene is, the harder the process is. Could the authors look into whether the bias is partly introduced by the length of the genes?

If it is not possible to conduct such an analysis, please discuss this possibility in the text.

https://doi.org/10.7554/eLife.61981.sa1

Author response

Essential revisions:

1) Please update the analysis to include COVID-19 papers published until the end of August 2020 (or later).

We updated the manuscript for studies identified until October 16th.

2) The comparison of published literature to large datasets from GWAS experiments is interesting but a comparison of the GWAS datasets themselves would add some interesting detail as well. What is the percentage of overlapping genes across all four GWAS? Is the literature more likely to select for study those genes that appear in many or all four GWAS? If it is not possible to answer these questions, please discuss this issue in the text.

While at the time of the initial submission there was no overlap, the newest release of the GWAS catalog finds 15 genes in two comparisons, and one gene, LZTFL1, in all three comparisons. We now report this in the text and in the new Supplementary File 2.

We can now also confirm the reviewer’s hypothesis that genes occurring in multiple GWAS comparisons are more likely to have been mentioned in the COVID-19 literature (Figure 1C). Motivated by this finding, we further inspected whether genes identified by multiple genome-wide datasets are also more likely to have been mentioned in the COVID-19 literature, and found that they are (Figure 1B). However, these genes still only account for a minority of the genes in the COVID-19 literature, and several genes that have been identified by multiple distinct GWAS comparisons or genome-wide datasets remain to be mentioned in the COVID-19 literature.

3) Published papers as of July 30 are considered, many of them likely written in May or early June, i.e., a couple of months after the WHO characterised COVID-19 as pandemic. This appears to be bias away from the null. Realistically, it would take some time for the literature to diversify.

To emphasize this important point, we added a new section: “What the future holds?”

Preprints could have provided a more nuanced assessment but admittedly, extracting gene keywords from preprints would be computationally hard.

We now additionally analyze preprints. Our findings are not altered.

A longitudinal analysis (e.g., two week slices) could be informative – is it the case that earlier in time labs started from diverse sets of genes and then converged; or, the other way around, they started from a few candidates, and then diverged. Or, there was no significant change over time, with incremental increase of the number of genes as the number of papers increased.

We now include a temporal analysis of distinct measures of diversity (Gini coefficient, number of genes, share of literature accounted for by 20% most mentioned genes, share of literature accounted for by the 100 genes first mentioned in the COVID-19 literature) and inspect them twice (Figure 2D-F, Figure 2—figure supplement 1). First, cumulatively until individual days. Second, separately for each month (instead of two-week slices as some applied measures of diversity would look wavy in the latter due to the higher number of publications published at the beginning of each month). Briefly, we find that since June the literature has reached what appears to be a stationary state.

To further strengthen our analysis, we now also include a longitudinal analysis of the number of genes that had been identified by the four genome-wide datasets (Figure 2A, B). It shows that genes that had not been commonly studied in the non-COVID-19 literature are introduced into the COVID-19 literature at a slower pace than those genes that had been commonly studied in the non-COVID-19 literature.

If it is not possible to conduct such a longitudinal analysis, please discuss this issue in the text.

We thank the reviewer for having suggested a longitudinal analysis and believe that it has strengthened our manuscript.

4) The framing of the problem in terms of bias might be too strong – it's just too early to tell. An alternative, and less charged, way of framing the problem would be in terms of the exploration-exploitation trade-off.

We reframed the manuscript. First, we avoided the term “bias”. Second, we now explicitly discuss the trade-off between exploration and exploitation and extended the analysis (Figure 3B) to demonstrate that antibodies, one class of gene-specific reagents that cannot be manufactured within a few days, are more likely to exist for identified genes that were also mentioned in the COVID-19 literature (a finding that is significant for RNA-seq, Aff-MS and GWAS but not CRISPR).

Essentially, the manuscript tells us that labs have continued to study the genes they studied before COVID-19. The manuscript views this as a bias. But is it? A more charitable interpretation would be that labs continued to study what they know rather than jumping on the new fashionable possibility.

We have now extended the Discussion section to list reasons why laboratories might not jump to new possibilities.

A major concern since the pandemic has been the tendency for many to assert expertise in new areas relevant to Covid-19, and it seems that genomic labs are less involved in this. Could that path-dependency bias be seen as a good thing in the broader context?

We now mention this possibility. We further explicitly mention a commentary by another group, which points out that the influx of scientists from unrelated fields could risk compromising the integrity of the literature and make it more difficult to identify valuable publications.

5) It would be helpful to have a discussion on how realistic it is for a lab studying a set of genes to start studying another set of genes at short notice. Studies of countries' manufacturing and other fields suggest that big "jumps" in the space of possibilities are limited, meaning that one could jump from B to D but hardly to X, Y, or Z. The manuscript should clarify how possible it is in the context of those gene experiments for a lab to study one set of genes and then "jump" to another set. In other words, if a lab would like to avoid the bias the manuscript outlines, how easy would it be to study those understudied genes in terms of expertise, lab equipment, cell lines?

We have now extended the Discussion to indicate that we consider switching to another set of genes unlikely, due to the risk of being outcompeted by other laboratories, the low availability of reagents for ignored genes, and the low baseline probability of scientists switching topics. While these arguments are based on preceding studies of science, we anticipate that they also hold true for COVID-19.

Moreover, we now provide the number of laboratories working on ignored genes pre-COVID-19 (Supplementary file 5). This demonstrates that, without big jumps toward these genes, only a limited number of laboratories would be available to study them (likewise, a jump toward COVID-19 might be a big jump for many of these laboratories).

6) Regarding the "Pareto principle" that 80% of research falls onto only 20% of the genes, have the authors considered the possibility that the "Pareto principle" might also apply to the importance of genes? The coding sequence is only 2% of the human genome, which is more extreme than the 80/20 ratio. Within the protein coding genes, the transcription factors or the hubs of networks may just be functionally more important than other genes. Thus, is it possible that the observed bias in research reflects the innate features of human genes?

We now directly acknowledge this limitation within the section on study limitations. We believe that currently it is not possible to quantify the extent of the importance of individual genes toward COVID-19. Indeed, there may be multiple dimensions of importance relating to fraction of cells affected, severity of disease outcomes, number of patients affected, and the relation between these dimensions and the molecular findings is still unclear.

At the same time, we now introduce a novel analysis (Supplementary file 4), which demonstrates that, as a group, the genes that have been studied the most or the earliest account for a share of the COVID-19 literature that is ~30 times larger than one would anticipate based on how often they were identified by the four genome-wide datasets.
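To make the meaning of such a ratio concrete, the following is a minimal sketch of one way to compare a gene group's share of the literature with its share of genome-wide hits. It is not the computation used in Supplementary file 4; the function name `share_ratio`, the gene labels, and all counts are hypothetical and chosen only for illustration.

```python
def share_ratio(lit_mentions, hit_counts, group):
    """Observed literature share of `group` divided by its share of genome-wide hits.

    lit_mentions: dict mapping gene -> number of publications mentioning it
    hit_counts:   dict mapping gene -> number of genome-wide datasets identifying it
    group:        set of genes of interest (e.g. the most-studied genes)
    """
    lit_total = sum(lit_mentions.values())
    hit_total = sum(hit_counts.values())
    # Share of the literature that falls on the group
    observed = sum(lit_mentions.get(g, 0) for g in group) / lit_total
    # Share one would anticipate if literature tracked genome-wide hits
    expected = sum(hit_counts.get(g, 0) for g in group) / hit_total
    return observed / expected


# Hypothetical toy data, not taken from the study
lit_mentions = {"GENE_A": 90, "GENE_B": 6, "GENE_C": 4}
hit_counts = {"GENE_A": 1, "GENE_B": 4, "GENE_C": 5}

ratio = share_ratio(lit_mentions, hit_counts, {"GENE_A"})
print(f"literature share is {ratio:.1f} times the anticipated share")
```

In this toy example the group receives 90% of the mentions but accounts for only 10% of the hits, so a ratio well above 1 indicates that the group is over-represented in the literature relative to the genome-wide evidence.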

Further, we extended the discussion section to add a statement that the identification of genes in genome-wide datasets suggests that they have some importance.

7) Another factor that might contribute to the bias is how easy it is to study the gene, especially in the early days, when the first step was to clone a gene and the longer the gene, the harder the process. Could the authors look into whether the bias is partly introduced by the length of the genes?

If it is not possible to conduct such an analysis, please discuss this possibility in the text.

We now add an analysis that demonstrates that genes studied in the COVID-19 literature had already been studied prior to the productive phase of the Human Genome Project (Figure 1E). Given our earlier findings (Stoeger et al., 2018), this could mainly be interpreted as these genes having been easier to study in the early days.

For gene length as a specific hypothesis, there is additional complexity, and we would thus prefer not to include it. First, length might be less important than other factors that facilitate experimentation (such as protein abundance in HeLa cells or orthologs in model organisms). Regretfully, we found that a statistical approach that considers multiple alternative possibilities (which we used in Stoeger et al., 2018) is not yet applicable to the COVID-19 literature, as it has no predictive power. We believe that this mainly stems from the low number of COVID-19 publications compared to the number of publications in the non-COVID-19 literature. Second (and we did not report this explicitly before), length has a non-linear relationship to publications within the non-COVID-19 literature. Though the earliest discovered genes tend to be short, the longest genes are also studied more than genes with an “average” length (but still less than the shortest genes).

Below we add an analysis of gene lengths, analogous to Figure 1D,E of the main manuscript.

Author response image 1

https://doi.org/10.7554/eLife.61981.sa2

Article and author information

Author details

Thomas Stoeger

Thomas Stoeger is in the Successful Clinical Response in Pneumonia Therapy (SCRIPT) Systems Biology Center, the Department of Chemical and Biological Engineering and the Northwestern Institute on Complex Systems (NICO), Northwestern University, Evanston, United States, and the Center for Genetic Medicine, Northwestern University School of Medicine, Chicago, United States

Contribution
Conceptualization, Resources, Data curation, Software, Formal analysis, Funding acquisition, Investigation, Visualization, Methodology, Writing - original draft, Project administration, Writing - review and editing

For correspondence
thomas.stoeger@northwestern.edu

Competing interests
No competing interests declared

ORCID iD: 0000-0002-5540-4278

Luís A Nunes Amaral

Luís A Nunes Amaral is in the Successful Clinical Response in Pneumonia Therapy (SCRIPT) Systems Biology Center, the Northwestern Institute on Complex Systems (NICO), the Department of Molecular Biosciences, and the Department of Physics and Astronomy, Northwestern University, Evanston, United States, and the Department of Medicine, Northwestern University School of Medicine, Chicago, United States

Contribution
Conceptualization, Resources, Formal analysis, Supervision, Funding acquisition, Investigation, Visualization, Methodology, Writing - original draft, Project administration, Writing - review and editing

For correspondence
amaral@northwestern.edu

Competing interests
No competing interests declared

ORCID iD: 0000-0002-3762-789X

Funding

National Institute of Allergy and Infectious Diseases (U19AI135964)

Luís A Nunes Amaral

National Science Foundation (1956338)

Luís A Nunes Amaral

Simons Foundation (DMS-1764421)

Luís A Nunes Amaral

Air Force Office of Scientific Research (FA9550-19-1-0354)

Luís A Nunes Amaral

National Institute on Aging (K99AG068544)

Thomas Stoeger

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank all members of the Amaral lab for feedback, particularly Jennifer Liu, Kedi Cao, Meagan Bechel, Reese Richardson and Sarah Ben Maamar. We thank Richard Wunderink for feedback and suggesting the analysis of clinical trials. We also thank Rick Morimoto for feedback.

Publication history

Received: August 11, 2020
Accepted: November 7, 2020
Version of Record published: November 24, 2020

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.