Abstract
Present-day publications on human genes primarily feature genes that already appeared in many publications prior to completion of the Human Genome Project in 2003. These patterns persist despite the subsequent adoption of high-throughput technologies, which routinely identify novel genes associated with biological processes and disease. Although several hypotheses for bias in the selection of genes as research targets have been proposed, their explanatory powers have not yet been compared. Our analysis suggests that understudied genes are systematically abandoned in favor of better-studied genes between the completion of -omics experiments and the reporting of results. Understudied genes are similarly abandoned by studies that cite these -omics experiments. Conversely, we find that publications on understudied genes may even accrue a greater number of citations. Among 45 biological and experimental factors previously proposed to affect which genes are being studied, we find that 35 are significantly associated with the choice of hit genes presented in titles and abstracts of -omics studies. To promote the investigation of understudied genes we condense our insights into a tool, find my understudied genes (FMUG), that allows scientists to engage with potential bias during the selection of hits. We demonstrate the utility of FMUG through the identification of genes that remain understudied in vertebrate aging. FMUG is developed in Flutter and is available for download at fmug.amaral.northwestern.edu as a MacOS/Windows app.
Introduction
Research into human genes concentrates on a subset of genes that were already frequently investigated prior to the completion of the Human Genome Project in 20031-5. This concentration stems from historically acquired research patterns rather than present-day experimental possibilities6,7. For most human diseases, these patterns lead to little correlation between the volume of literature published on individual genes and the strength of supporting evidence from genome-wide approaches8-13. For instance, we found that 44% of the genes identified as promising Alzheimer’s disease targets by the U.S. National Institutes of Health (NIH) Accelerating Medicines Partnership for Alzheimer’s Disease (AMP-AD) initiative have never appeared in the title or abstract of any publication on Alzheimer’s disease13. Furthermore, when comparing gene-disease pairs, there is no correlation between the ranks of support by transcriptomics and occurrence in annotation databases9.
Although -omics technologies can provide insights on numerous genes across the genome at a time and thus offer the promise to counter historically acquired research patterns14-17, this discrepancy has persisted9,18-22 even as the popularity of -omics technologies has risen5,23,24. We therefore sought to use bibliometric data to delineate where and why understudied human protein-coding genes are abandoned as research targets following -omics experiments. In the absence of any prior quantitative testing of existing hypotheses, it remains unclear whether policies to promote the exploration of a greater set of disease-related genes should focus on how experiments are conducted, how results are reported, or how these results are subsequently received by other scientists.
Data
We considered 450 genome-wide association studies (GWAS, from studies indexed by the NHGRI-EBI GWAS catalog25), 296 studies using affinity capture mass spectrometry (Aff-MS, indexed by BioGRID26), 148 transcriptomic studies (indexed by the EBI Gene Expression Atlas, EBI-GXA27), and 15 genome-wide screens using CRISPR (indexed by BioGRID Open Repository of CRISPR Screens, BioGRID ORCS26) (see PRISMA diagrams in Figures S1-S4). We denote genes that are found to have statistically significant changes in expression or associations with a phenotype as ‘hit’ genes.
As a surrogate for a given gene having been investigated more closely, we consider whether it was reported in the title or abstract of a research article. We determined which genes were mentioned in the title or abstract of articles using annotations from gene2pubmed28 and PubTator29. We used NIH iCite v32 for citations30. For determining which gene properties were associated with selection as research targets, we synthesized quantitative measures from a variety of authoritative sources (see Methods).
Results
Understudied genes are abandoned at synthesis/writing stage
We sought to identify at which point in the scientific process understudied genes are ignored as research targets in investigations using -omics experiments (Figure 1A). To receive scholarly attention, a gene must travel through a pipeline from biological reality to experimental results to write-up of those results. These results must be extended by subsequent research by other scholars. Understudied genes do not progress all the way through the pipeline, but it is unclear where this leak primarily occurs. The first possibility is that seemingly understudied genes are, in fact, not understudied as they would rarely be identified through experiments. Prior studies have, however, shown that understudied genes are frequent hits in high-throughput experiments8,9,31, suggesting that this is not the case. The second possibility is that understudied genes are frequently found as hits in high-throughput experiments but are not investigated further by the authors. The final possibility is that subsequent studies do not continue work on understudied genes revealed by the initial study.
Evaluating the first possibility, we found that understudied genes were frequently found as hits in high-throughput experiments (Figure 1B, Figure S5, and Figure S6). This demonstrates, in line with earlier studies8-13, that the lack of publications on some genes is not explained by a lack of underlying biological or experimental evidence.
Evaluating the second possibility, we found that hit genes that are highlighted in the title or abstract are strongly over-represented among the 20% highest-studied genes in all biomedical literature (Figure 1B). These trends are independent of significance threshold (Figure S5) and (except for CRISPR screens) whether we considered the current scientific literature or literature published before 2003, before any of these articles had been published (Figure S6).
Understudied genes are least frequently elevated to the title/abstract in transcriptomics experiments and most frequently elevated to the title/abstract in CRISPR screens. GWAS studies tend to return better-studied genes as hits; the median hit gene in GWAS studies was more popular than 75% of genes. Hit genes promoted to the title/abstract in GWAS studies had a median popularity greater than that of 85% of all protein-coding genes. This may explain the prior observation that the total number of articles on individual genes partially correlates with the total number of occurrences as a hit in GWAS studies32.
Evaluating the final possibility, we found that the reception of -omics studies in later scientific literature either reproduced authors’ initial selection of highly studied genes or slightly mitigated it. Jointly, the above findings reinforce that understudied genes become abandoned between the completion of -omics experiments and the reporting of results, rather than being abandoned by later research.
Subsequent reception by other scientists does not penalize studies on understudied genes
The abandonment of understudied genes could be driven by the valid concern of biomedical researchers that focusing on less-investigated genes will yield articles with lower impact17, as observed around the turn of the millennium33. If this were the case, preemptively avoiding understudied hits would be the rational decision for authors of -omics studies.
We thus complemented our preceding analysis with an analysis explicitly focused on citation impact. Notably, we found that the concern that publications on understudied genes receive fewer citations does not hold for present-day research on human genes; in the biomedical literature at large, articles focusing on less-investigated genes accumulate more citations, an effect that has held consistently since 2001 (Figure 2). Importantly for human health, this also holds when considering only disease-related fields (Figure S7).
To rule out that these macroscopic observations stem from us having aggregated over different diseases, we separately analyzed 602 disease-related MeSH terms. We found 29 MeSH terms with a statistically significant Spearman correlation using Benjamini-Hochberg FDR < 0.01 (Table S4), of which 27 showed a negative association and only 2 a positive association. This result suggests that it may actually be rational for most scientists to pursue studies focusing on understudied genes, although most scientists specializing in a disease may also not receive more citations when focusing on understudied genes.
Returning to our observation that understudied hits from high-throughput assays are not promoted to the title and abstract of the resulting publication, we next tested whether different experimental approaches demonstrated distinct associations between gene popularity and citations (Figure S8). Among 264 technique-related MeSH terms tested, there were 20 MeSH terms with a statistically significant Spearman correlation using Benjamini-Hochberg FDR < 0.01 (Table S5), of which 16 showed a negative association and only 4 a positive association. Notably, MeSH terms representing high-throughput techniques (e.g. D055106:Genome-Wide Association Study and D020869:Gene Expression Profiling) showed no significant association. This finding suggests that authors of high-throughput studies have little to gain or lose in citations by highlighting understudied genes.
To summarize, our investigations detail the previously described separation between “large-scale” and “small-scale” biological research34-36. Authors of high-throughput studies do not highlight understudied genes in the title or abstract of their publications, the sections of the publication most accessible to other scientists. While, overall, understudied genes (and high-throughput assays themselves5) correlate with increased citation impact, for high-throughput studies any potential gain in citations is either absent or too small to be significant. Thus, there may not be any incentive for authors of high-throughput studies to highlight understudied genes.
Identification of biological and experimental factors associated with selection of highlighted genes
To illuminate why understudied genes are abandoned between experimental results and the write-up of results, we performed a literature review to identify factors that have been proposed to limit studies of understudied genes (Table S1). These factors range from evolutionary factors (e.g., whether a gene only has homologs in primates), to chemical factors (e.g., gene length or hydrophobicity of protein product), to historical factors (e.g., whether a gene’s sequence has previously been patented) to materialistic factors affecting experimental design (e.g., whether designed antibodies are robust for immunohistochemistry).
As any of these factors could plausibly affect gene selection within individual domains of biomedical research, we returned to the -omics data described above (Figure 1) and measured how much these factors align with the selective highlighting of hit genes in the title or abstract of GWAS, Aff-MS, transcriptomics, and CRISPR studies.
We identified 45 factors that relate to genes and found 35 (14 out of 23 binary factors and 21 out of 22 continuous factors) associated with selection in at least one assay type at p < 0.001 (Figure 3, Table S2, and Table S3). Across the four assay types, the most informative binary factor describes whether there is a plasmid available for a gene in the AddGene plasmid catalog. This might reflect that many different research groups produce reagents surrounding the genes that they actively study. The most informative continuous factor is the number of research articles about a gene, supporting the conclusion that gene popularity drives whether it is highlighted or not (Figure 1).
To better understand how all 45 factors are related, we performed a cluster analysis of the collected factors (Figure 4, Figure S9, and Figure S10). This clustering reflects the fact that many suggested factors influencing the abandonment of understudied genes are not independent. For instance, we find that the number of articles about a gene is heavily correlated with the number of annotations for that gene in all surveyed databases. In another case, gene length is heavily correlated with the number of GWAS annotations for a gene, as described before in terms of transcript length and single-nucleotide polymorphisms37.
Study limitations
Our study has several limitations. First, all analyses are subject to annotation errors in the various databases we employ. While these should be rare and not affect our overall findings, they may affect users who are interested in genes with discordant annotations. Second, we focus only on human genes. Different patterns of selection may exist for research on genes in other organisms. Third, our literature review also identified further factors that we could not test more directly because we lacked access to suitable data. These are: experts’ tendency to deepen their expertise3, a perceived lack of accuracy of -omics studies17,38, -omics serving research purposes beyond target gene identification22, the absence of good protocols for mass spectrometry39, the electronic distribution and reading of research articles40, rates of reproducibility17, career prospects of investigators7,41, and the human tendency to fall back to simplifying heuristics when making decisions under conditions of uncertainty42. Fourth, we cannot resolve further which specific step between the conduct of an experiment and the writing of a research article leads to the abandonment of hit genes. Finally, we interpret the results of high-throughput experiments based on their representation in the NHGRI-EBI GWAS, BioGRID, EBI-GXA and BioGRID ORCS databases. The authors of the original studies may have processed their data differently, obtaining different results.
As our present analysis is correlative, it also is tempting to propose controlled trials where published manuscripts on high-throughput studies randomly report hit genes in the abstract even if not investigated further by the authors.
Discussion
Efforts to address the gaps in detailed knowledge about most genes have crystallized as initiatives promoting the investigation of understudied sets of genes16,43-47, an approach to gene scholarship recently termed ‘unknomics’48. We believe that enabling scientists to consciously engage with bias in research target selection will enable more biomedical researchers to participate in unknomics, to the potential benefit of their own research impact and towards the advancement of our collective understanding of the entire human genome.
To achieve this goal, we combined all the above insights to create a tool we denoted find my understudied genes (FMUG). Our literature review revealed several tools and resources aiming to promote research of understudied genes by publicizing understudied genes49-56 or by providing information about hit genes7,57-61. However, we noted the absence of tools enabling scientists to actively engage with factors that align with gene selection. Although such factors are largely correlated when considering all genes (Figure 4, Figure S9, and Figure S10), some cluster together and the influence of specific factors could vary across laboratories. For instance, scientists could vary in their ability to perform proteomics, or ability to explore orthologous genes in C. elegans, or ability to leverage human population data, or perform standardized mouse assays.
Our tool makes selection bias explicit, while acknowledging that different laboratories vary in their techniques and capabilities for follow-up research. Rather than telling scientists about the existence of biases, FMUG aims to prompt scientists to make bias-aware informed decisions to identify and potentially tackle important gaps in knowledge that they are well-suited to address. For this reason, we believe that FMUG will be of value not only to scientists engaging in high-throughput studies, but also to scientists wishing to mine existing datasets for hit genes that they would be well-positioned to investigate further.
FMUG takes a list of genes from the user (ostensibly a hit list from a high-throughput -omics experiment) and provides the kind of information that will allow a user to select genes for further study.
Users can employ filters that reflect the factors identified in our literature review and supported by our analysis. The default information provided to users consists of factors that are representative of the identified clusters (Figure 4) and strongly associated with gene selection in high-throughput experiments (Figure 3, Table S2, and Table S3). In extended options, users can select any factor that demonstrated a significant association with the selection of genes. For instance, a user may need to decide whether loss-of-function intolerant genes should be considered for further research or not, or whether there should be robust evidence that a gene is protein-coding. Some of these filters are context aware. For instance, a user may select genes that have already been studied in the general biomedical literature but not yet within the literature of their disease of interest.
To provide real-time feedback, users are, in parallel, presented the number of articles about genes in their initial input list and the number of articles about genes that passed their filters. Users can then export their filtered list of genes. In the interest of researcher privacy, FMUG keeps all information local to the user’s machine. Usage of FMUG is illustrated in Figure 5A and demonstrated in Movie S1. FMUG is developed in Flutter and is available for download at fmug.amaral.northwestern.edu as a MacOS/Windows app. For the development of custom software and analytical code, we provide the data underlying FMUG at github.com/amarallab/fmug_analysis.
To determine the practical usefulness of FMUG to scientists, we used an early prototype of FMUG to identify understudied genes associated with aging. One of these genes was Splicing factor, proline- and glutamine-rich (Sfpq), which had not yet been investigated for its role in biological aging. We found Sfpq to be transcriptionally downregulated during murine aging. Others had shown Sfpq to be required for the transcriptional elongation of long genes62. This led us to hypothesize that during vertebrate aging, the transcripts of long genes become downregulated in most tissues (Figure 5B). We found this hypothesis to be supported through a multi-species analysis, which we published in December 2022 in Nature Aging63; a second group published similar findings in January 2023 in Nature Genetics64, and a third group in iScience in March 202365.
Materials and Methods
Genes information
Homo sapiens gene information was downloaded from NCBI Gene on Aug 16, 2022 [ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/All_Data.gene_info.gz]. Only genes with an unambiguous mapping of Entrez ID to Ensembl ID were used (n = 36,035). Number of gene synonyms, protein-coding status, and official gene symbol were derived from this dataset. A gene symbol was considered undefined if the gene’s entry for HGNC gene symbol was “-”.
Genes in title/abstract of primary research articles
Homo sapiens gene information was downloaded from NCBI Gene on Aug 16, 2022 [ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz]. gene2pubmed was downloaded from NCBI Gene on Aug 16, 2022 [ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz]28. PubTator gene annotations were downloaded from NIH-NLM on July 12, 2022 [https://ftp.ncbi.nlm.nih.gov/pub/lu/PubTatorCentral/] 28,66. PubMed was downloaded on Dec 17, 2021 [https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/].
Only using PMIDs annotated as primary research articles, a human gene was considered as mentioned in the title/abstract of the publication if the gene was annotated as being in the title/abstract by PubTator and the article appeared in gene2pubmed.
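The rule above can be illustrated with a minimal Python sketch. It assumes that each source has already been parsed into a set of (PMID, Entrez gene ID) pairs and that "the article appeared in gene2pubmed" is read as the same PMID-gene pair being present in gene2pubmed; both the function name and this pair-level reading are our assumptions for illustration, not a description of the released analysis code.

```python
def title_abstract_mentions(pubtator_pairs, gene2pubmed_pairs, primary_pmids):
    """Return (pmid, gene_id) pairs supported by both sources.

    pubtator_pairs: pairs annotated by PubTator in the title/abstract.
    gene2pubmed_pairs: pairs listed in gene2pubmed.
    primary_pmids: PMIDs annotated as primary research articles.
    Assumption: a mention requires the same pair in both sources.
    """
    supported = set(pubtator_pairs) & set(gene2pubmed_pairs)
    return {(pmid, gene) for pmid, gene in supported if pmid in primary_pmids}


# Toy data; PMIDs are made up, gene IDs are only placeholders.
pubtator = {(101, 7157), (101, 672), (102, 7157), (103, 1956)}
gene2pubmed = {(101, 7157), (102, 7157), (103, 1956), (104, 672)}
primary = {101, 103}

mentions = title_abstract_mentions(pubtator, gene2pubmed, primary)
print(sorted(mentions))
```

In this toy example, PMID 102 is dropped because it is not a primary research article, and the pair (101, 672) is dropped because it is supported by PubTator only.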
CRISPR articles
BioGRID ORCS26 v1.1.6 was downloaded on April 25, 2022 [https://downloads.thebiogrid.org/BioGRID-ORCS/Release-Archive/BIOGRID-ORCS-1.1.6/]. All genome-wide CRISPR knockout screens in human with an associated PubMed ID in which hit genes were mentioned in the title or abstract were considered (n = 15). 9,268 unique genes were found as hits. Of these, 18 (0.19%) were elevated to titles/abstracts in the reporting articles and 19 (0.21%) were elevated to titles/abstracts in citing articles. A full list of PubMed IDs is available in Supplementary File 2.
Transcriptomics articles
EBI-GXA27 release 36 was downloaded on Sep 15, 2020 [https://web.archive.org/web/20201022184159/ https://www.ebi.ac.uk/gxa/download]. This is the most recent release of EBI-GXA available as a bulk download. All transcriptomics comparisons with an associated PubMed ID in which hit genes were mentioned in the title or abstract were considered (n = 148). Analysis was restricted to protein-coding genes (some screens featured non-protein-coding genes, but this was not common to all analyses). Differential expression (DE) was called at Benjamini-Hochberg FDR q < 0.05. 18,295 unique genes were found as hits. Of these, 161 (0.88%) were elevated to titles/abstracts in the reporting articles and 692 (3.78%) were elevated to titles/abstracts in citing articles. A full list of PubMed IDs is available in Supplementary File 2.
Affinity capture – mass spectrometry articles
BioGRID26 v3.5.186 was downloaded on April 25, 2022 [https://downloads.thebiogrid.org/BioGRID/Release-Archive/BIOGRID-3.5.186/]. All interactions involving a human gene as the prey protein with an experimental evidence code of ‘Affinity Capture-MS’ labeled as ‘High-Throughput’ that had an associated PubMed ID in which hit genes were mentioned in the title or abstract were considered (n = 296). Prey proteins in these interactions were considered hits. 7,919 unique genes were found as hits. Of these, 311 (3.93%) were elevated to titles/abstracts in reporting articles and 407 (5.14%) were elevated to titles/abstracts in citing articles. A full list of PubMed IDs is available in Supplementary File 2.
GWAS articles
The NHGRI-EBI GWAS catalog25 (associations and studies) was downloaded on Aug 17, 2022 [https://www.ebi.ac.uk/gwas/docs/file-downloads]. All GWAS screens with an associated PubMed ID in which hit genes were mentioned in the title or abstract were considered (n = 450). Only SNPs occurring within a gene were considered hits. 1,043 unique genes were found as hits. Of these, 413 (39.6%) were elevated to titles/abstracts in reporting articles and 319 (30.6%) were elevated to titles/abstracts in citing articles. A full list of PubMed IDs is available in Supplementary File 2.
Citing articles
NIH iCite v32 was downloaded on Aug 25, 202230 [https://nih.figshare.com/collections/iCite_Database_Snapshots_NIH_Open_Citation_Collection_/4586573/32].
Functional annotations
Mapping of genes to Gene Ontology / Protein Interaction Database / WikiPathways / Reactome / Kyoto Encyclopedia of Genes and Genomes / Human Phenotype Ontology / BioCarta categories was derived from MSigDB v7.5 Entrez ID .gmt files, downloaded on Apr 12, 2022 [http://www.gsea-msigdb.org/gsea/downloads_archive.jsp].
Between-species homology
Homologene Build 68 was used to determine interspecies homology [ftp.ncbi.nih.gov/pub/HomoloGene/build68/]. Human = taxid:9606, mouse = taxid:10090, rat = taxid:10116, C. elegans = taxid:6239, D. melanogaster = taxid:7227, yeast = taxid:559292, zebrafish = taxid:7955.
Primate specificity
Human genes were considered primate-specific if the only other members of their homology group belonged to primate genomes. Primate taxonomy ids were downloaded from NCBI Taxonomy on Sep 20, 2022 [https://www.ncbi.nlm.nih.gov/taxonomy/?term=txid9443[Subtree]].
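As a hedged illustration of this definition, the following Python sketch checks a single homology group. It assumes that a group is represented as a list of (taxonomy ID, gene ID) pairs and that a human gene without any other group member is not counted as primate-specific; the function name, the data layout, and the handling of that edge case are our assumptions.

```python
def is_primate_specific(human_gene, group, primate_taxids, human_taxid=9606):
    """Check whether all other members of a homology group are primate genes.

    group: list of (taxid, gene_id) pairs forming one homology group.
    primate_taxids: set of taxonomy IDs in the primate subtree (includes 9606).
    Assumption: a group with no other member does not count as primate-specific.
    """
    me = (human_taxid, human_gene)
    if me not in group:
        return False
    other_taxids = [taxid for taxid, gene in group if (taxid, gene) != me]
    return bool(other_taxids) and all(t in primate_taxids for t in other_taxids)


# Toy data: human (9606), chimpanzee (9598), rhesus macaque (9544), mouse (10090).
primates = {9606, 9598, 9544}
group_a = [(9606, "hsa1"), (9598, "ptr1"), (9544, "mml1")]
group_b = [(9606, "hsa2"), (9598, "ptr2"), (10090, "mmu2")]

print(is_primate_specific("hsa1", group_a, primates))  # primate-only group
print(is_primate_specific("hsa2", group_b, primates))  # group contains mouse
```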
Number of publications in model organisms
Gene information was downloaded from NCBI Gene on Aug 16, 2022 [ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/All_Data.gene_info.gz].
Only using PMIDs annotated as primary research articles, a gene was considered as mentioned in the title/abstract of the publication if the gene was annotated as being in the title/abstract by PubTator and the article appeared in gene2pubmed.
Genes in model organisms were mapped to human genes and the number of articles on those mapping to human genes were counted. If a model organism’s gene had homology to human but no associated publications, the number of publications was resolved to zero. Otherwise, counts were listed as NA.
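The distinction between zero and NA can be sketched as follows. This minimal Python example assumes a precomputed mapping from human genes to homologous model-organism genes and represents NA as None; summing over several homologs of the same human gene is our assumption, as are all names used here.

```python
def model_organism_publications(human_genes, human_to_model, model_pub_counts):
    """Count model-organism articles per human gene.

    human_to_model: dict mapping a human gene to a list of homologous
        model-organism genes (missing or empty means no homolog).
    model_pub_counts: dict mapping a model-organism gene to its article count.
    Returns None (NA) when no homolog exists, and 0 when a homolog exists
    but has no associated publications.
    """
    counts = {}
    for gene in human_genes:
        homologs = human_to_model.get(gene)
        if not homologs:
            counts[gene] = None  # NA: no homology to this organism
        else:
            # Assumption: counts are summed over all mapped homologs.
            counts[gene] = sum(model_pub_counts.get(h, 0) for h in homologs)
    return counts


human_to_mouse = {"GENE1": ["Gene1"], "GENE2": ["Gene2"]}
mouse_pubs = {"Gene1": 12}
counts = model_organism_publications(
    ["GENE1", "GENE2", "GENE3"], human_to_mouse, mouse_pubs
)
print(counts)
```

Here GENE2 has a mouse homolog without publications and resolves to zero, whereas GENE3 has no homolog and is listed as NA.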
Mouse phenotype hits
International Mouse Phenotyping Consortium data release 17.0 was downloaded on Aug 18, 2022 [https://www.mousephenotype.org/data/release]. Mouse genes were matched to human genes with Homologene.
Gene Expression Atlas (EBI-GXA)
EBI-GXA release 36 was downloaded on Sep 15, 2020 [https://web.archive.org/web/20201022184159/ https://www.ebi.ac.uk/gxa/download]. This is the most recent release of EBI-GXA available as a bulk download. For probability of DE, only RNA-seq comparisons were considered and DE was called at Benjamini-Hochberg q < 0.05.
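For readers unfamiliar with the Benjamini-Hochberg adjustment used for calling DE, a minimal pure-Python sketch is given below. It assumes that raw p-values for one comparison are already available as a list; the function name is hypothetical, and established statistics libraries offer equivalent routines.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


pvals = [0.001, 0.02, 0.04, 0.30]
qvals = benjamini_hochberg(pvals)
de_calls = [q < 0.05 for q in qvals]  # DE called at q < 0.05
print(qvals, de_calls)
```

In this toy example the third gene has a raw p-value below 0.05 but a q-value of about 0.053, so it is not called as differentially expressed.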
Global RNA expression
RNA consensus tissue gene data from HPA release 21.1 was downloaded on Sep 20, 2022 [https://www.proteinatlas.org/about/download]. Global RNA expression was estimated by taking the median expression (nTPM) across tissues for each gene and the proportion of tissues with detectable (≥1 nTPM) expression for each gene.
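The two summary measures can be computed per gene as in the following minimal Python sketch, which assumes that the consensus data have been reshaped into a mapping from tissue name to nTPM for one gene; the function name and the toy values are illustrative only.

```python
from statistics import median


def global_rna_expression(ntpm_by_tissue, detection_threshold=1.0):
    """Summarize one gene's expression across tissues.

    Returns (median nTPM across tissues,
             proportion of tissues with nTPM >= detection_threshold).
    """
    values = list(ntpm_by_tissue.values())
    detected = sum(1 for v in values if v >= detection_threshold)
    return median(values), detected / len(values)


# Toy values for a single gene; not taken from HPA.
toy_gene = {"liver": 0.2, "lung": 3.0, "brain": 12.0, "heart": 1.0}
med, prop = global_rna_expression(toy_gene)
print(med, prop)
```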
Expression in HeLa cells
RNA cell line gene data from HPA release 21.1 was downloaded on Sep 20, 2022 [https://www.proteinatlas.org/about/download]. Expression is in nTPM.
Previous patent activity
Genes with patent activity were defined from Table S1 of Rosenfeld and Mason, 201367. Genes were mapped with their HGNC symbol. This analysis aligned sequences in patents to the human genome to estimate patent coverage of human coding sequences. Although this does not necessarily reflect whether the mapped genes were claimed directly by the patent holder, as noted by others68, this analysis remains the most comprehensive available for determining patent coverage of the human genome.
Druggability
Druggable genes were identified from Table S1 of Finan et al., 201769. Genes were mapped with their Ensembl identifier.
Gene length
GenBank was downloaded in spring 2017 (genome version GRCh38.p10). Gene length is defined here as the span of the longest transcript on the chromosome. This aligns with the model of gene length used in Stoeger et al., 20187.
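A minimal sketch of this definition, assuming that each transcript is given by 1-based inclusive start and end coordinates on the chromosome (the coordinate convention and the function name are our assumptions):

```python
def gene_length(transcripts):
    """Return the chromosomal span of the longest transcript.

    transcripts: list of (start, end) tuples, 1-based and inclusive (assumed).
    """
    return max(end - start + 1 for start, end in transcripts)


# Three toy transcripts of one gene; the second spans the longest region.
length = gene_length([(1000, 5000), (1200, 9000), (1000, 3000)])
print(length)
```

Note that under this definition the span includes introns, so it differs from the length of the processed mRNA.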
Solubility
SwissProt protein sequences and mapping tables to Entrez GeneIDs were downloaded from Uniprot in spring 2017. Protein GRAVY score (ignoring Pyrrolysine and Selenocysteine) was estimated with BioPython70.
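The GRAVY score is the mean Kyte-Doolittle hydropathy value over all residues of a protein. The sketch below is a pure-Python stand-in rather than the BioPython call used in the analysis; it assumes one-letter amino acid codes and drops selenocysteine (U) and pyrrolysine (O) before averaging. The function name is hypothetical.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence, ignore="UO"):
    """Mean hydropathy of a protein, skipping residues listed in `ignore`."""
    residues = [aa for aa in sequence.upper() if aa not in ignore]
    if not residues:
        raise ValueError("no scorable residues in sequence")
    return sum(KYTE_DOOLITTLE[aa] for aa in residues) / len(residues)


score = gravy("MKVLU")  # the trailing U (selenocysteine) is ignored
print(round(score, 3))
```

Positive scores indicate a more hydrophobic protein and negative scores a more hydrophilic one.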
Loss of function intolerance
Data was obtained from Karczewski et al.71. pLI scores > 0.9 on main transcripts, as flagged by authors, were considered as highly loss-of-function intolerant as described by Lek et al.72.
Number of GWAS hits
EBI GWAS catalog25 (associations and studies) was downloaded on Aug 17, 2022 [https://www.ebi.ac.uk/gwas/docs/file-downloads]. Loci were mapped to the nearest gene.
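One way to map a locus to its nearest gene is sketched below. The example assumes that all genes lie on the same chromosome as the locus, that distance is measured to the closest gene boundary (zero if the locus falls inside the gene), and that ties resolve to the gene listed first; these choices and all names are assumptions for illustration, since the mapping may also be taken directly from the catalog.

```python
def nearest_gene(position, genes):
    """Return the symbol of the gene closest to a locus.

    position: coordinate of the locus on the chromosome.
    genes: list of (symbol, start, end) tuples on the same chromosome.
    """
    def distance(gene):
        _, start, end = gene
        if start <= position <= end:
            return 0  # locus lies within the gene
        return min(abs(position - start), abs(position - end))

    return min(genes, key=distance)[0]


toy_genes = [("GENE_A", 100, 200), ("GENE_B", 500, 900)]
print(nearest_gene(150, toy_genes))  # inside GENE_A
print(nearest_gene(420, toy_genes))  # intergenic, closer to GENE_B
```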
Status as understudied protein
The Illuminating the Druggable Genome understudied protein list was downloaded on Sep 20, 2022 [https://github.com/druggablegenome/IDGTargets/blob/master/IDG_TargetList_CurrentVersion.json].
Human Protein Atlas
HPA release 21.1 was downloaded on Sep 20, 2022 [https://www.proteinatlas.org/search]. Evidence for a protein’s existence, as determined by NeXtProt, HPA, or UniProt was resolved as True if the respective evidence entry was annotated as “Evidence at protein level”. Status as a membrane protein was determined by whether the ‘Protein class’ column contained the string ‘membrane protein’. Antibodies were considered available for each protein if the protein’s entry in the ‘Antibody’ column was not null.
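These three rules can be expressed as simple checks on one row of the HPA table. The sketch below assumes that a row has been read into a dictionary; the ‘Protein class’ and ‘Antibody’ column names follow the text above, whereas the ‘Evidence’ key and the example values are placeholders that we introduce for illustration.

```python
def hpa_flags(row, evidence_column="Evidence"):
    """Derive boolean factors from one HPA row (dict of column -> value).

    evidence_column is a placeholder name; the analysis evaluates the
    NeXtProt, HPA, and UniProt evidence entries separately.
    """
    protein_class = row.get("Protein class") or ""
    antibody = row.get("Antibody")
    return {
        "protein_evidence": row.get(evidence_column) == "Evidence at protein level",
        "membrane_protein": "membrane protein" in protein_class,
        "antibody_available": antibody is not None and antibody != "",
    }


toy_row = {
    "Evidence": "Evidence at protein level",
    "Protein class": "Predicted membrane proteins, Enzymes",
    "Antibody": None,
}
flags = hpa_flags(toy_row)
print(flags)
```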
Availability of plasmids
The AddGene plasmid catalog was downloaded on Aug 12, 2022 [https://www.addgene.org/browse/gene/gene-list-data/?_=1666368044314].
Availability of compounds
The catalog of gene targets was downloaded from ChEMBL on Sep 20, 2022 [https://www.ebi.ac.uk/chembl/g/#browse/targets]. UniProt IDs were converted to Entrez IDs to identify which human genes were affected by any compound.
Mendelian inheritance
Autosomal dominant [https://hpo.jax.org/app/browse/term/HP:0000006] and autosomal recessive [https://hpo.jax.org/app/browse/term/HP:0000007] inherited disease-gene associations were downloaded from the Human Phenotype Ontology on Sep 20, 2022. Genes were considered to have evidence of Mendelian inheritance if they appeared in these lists of associations.
Code
Code for analysis is available at github.com/amarallab/fmug_analysis. Code for FMUG is available at github.com/amarallab/fmug.
Data Availability
All underlying data for figures are available at github.com/amarallab/fmug_analysis.
Acknowledgements
We thank Xiaojing Sui for testing FMUG and Northwestern Information Technology for technical assistance. RAKR was supported in part by the National Institutes of Health Training Grant (T32GM008449) through Northwestern University’s Biotechnology Training Program. RAKR also acknowledges support from the Dr. John N. Nicholson fellowship from Northwestern University and Moderna Inc., “Identifying bias and improving reproducibility in RNA-seq computational pipelines”. LANA was supported by NSF 1956338, NIH U19AI135964 and Simons Foundation DMS-1764421. TS was supported by NIH K99AG068544. We thank Alexander Misharin, Richard Morimoto, and Scott Budinger for feedback on an early prototype of FMUG which we used as part of our shared research into the biology of aging.
Copyright
© 2023, Richardson et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
- views: 3,396
- downloads: 182
- citations: 7
Views, downloads and citations are aggregated across all versions of this paper published by eLife.