Figure 1: A shift in focus towards well-studied genes occurs during the summarization and write-up of results and remains in subsequent studies. a, Conceptual diagram depicting possible points of abandonment for understudied genes in studies using high-throughput -omics experiments. b, We identified articles reporting on genome-wide CRISPR screens (CRISPR, 15 focus articles and 18 citing articles), transcriptomics (T-omics, 148 focus articles and 1,678 citing articles), affinity purification–mass spectrometry (AP-MS, 296 focus articles and 1,320 citing articles), and GWAS (450 focus articles and 3,524 citing articles). Focusing only on protein-coding genes (white box plot), we retrieved data uploaded to repositories describing which genes came up as “hits” in each experiment (first colored box plot). We then retrieved the hits mentioned in the titles and abstracts of those articles (second colored box plot) and the hits mentioned in the titles and abstracts of articles citing those articles (third colored box plot). Unique hit genes are only counted once. Bibliometric data reveal that understudied genes are frequently hits in -omics experiments but are not typically highlighted in the title/abstract of reporting articles, nor in the title/abstract of articles citing reporting articles. ** denotes p < 0.01 and *** denotes p < 0.001 by two-sided Mann-Whitney U test, comparing genes highlighted in title/abstract to genes present in hit lists.

Figure 2: Articles focusing on less popular genes tend to accrue more citations. a, Density plot shows correlation between articles per gene before 2015 and median citations to articles published in 2015. Contours correspond to deciles in density. Solid red line shows locally weighted scatterplot smoothing (LOWESS) regression. ρ is the Spearman rank correlation and p is its significance value, computed as described by Kendall and Stuart.
We forgo depicting years more recent than 2015 to allow citations to accumulate over multiple years, providing a more sensitive and robust readout of long-term impact. b, Spearman correlation of previous gene popularity (i.e. number of articles) to median citations per year since 1990. Solid blue line indicates nominal Spearman correlation; shaded region indicates bootstrapped 95% confidence interval (n=1,000). Only articles with a single gene in the title/abstract are considered, excluding the 30.4% of gene-focused studies that feature more than one gene in the title/abstract. For more recent years, where articles have had less time to accumulate citations, insufficient signal may cause the correlation to converge toward zero.

Figure 3: We evaluated which gene-related factors are associated with elevation to the title/abstract of an article featuring a high-throughput experiment. a, Association between factors with binary (True/False) identities and highlighting hits in title/abstract of reporting articles. Values represent the odds ratio between hits in the collected articles and hits mentioned in the title or abstract of collected articles (e.g. hits with a compound known to affect gene activity are 4.262 times as likely to be mentioned in the title/abstract in an article using transcriptomics, corresponding to an odds ratio of 4.331). Collected articles are described in Figure 1B and Figures S5, S6 and S7. The 95% confidence interval of the odds ratio is shown in parentheses. * = Benjamini-Hochberg FDR < 0.05, ** = FDR < 0.01, and *** = FDR < 0.001 by two-sided Fisher exact test. Results are shown numerically in Table S2. For consistency between studies, hits were restricted to protein-coding genes. Thus, status as a protein-coding gene could not be tested. †No genes without a defined HUGO symbol were found as hits in GWAS or transcriptomics studies. b, Association between factors with continuous identities and highlighting hits in title/abstract of reporting articles.
Values represent F, the common-language effect size (equivalent to AUROC, where ∼0.5 indicates little effect, >0.5 indicates positive effect and <0.5 indicates negative effect) of being mentioned in the titles/abstracts of the collected articles described in Figure 1B and Figures S5, S6 and S7. * = Benjamini-Hochberg FDR < 0.05, ** = FDR < 0.01, and *** = FDR < 0.001 by two-sided Mann-Whitney U test. Results are shown numerically in Table S3.

Figure 4: We created FMUG to help researchers identify understudied genes among their genes of interest and characterize their tractability for future research. a, Diagram describing use of FMUG. b, An early prototype of FMUG led us to the hypothesis that transcript length negatively correlates with up-regulation during aging. First, we identified genes that strongly associate with age-dependent transcriptional change across multiple cohorts. We then performed a literature review for each of these genes to identify the most direct way the genes (or evolutionarily closely related genes or functionally closely related partner proteins) had been studied in aging. 64% had been functionally investigated in aging, 15% had been shown to change a measure of gene expression, 3% had been functionally investigated in a biological domain close to aging (such as senescence), and 5% had been shown to change a measure of gene expression in a biological domain close to aging. For genes reported by others to change expression with age, we identified tissues in which transcripts of the genes change during aging. We computed ‘feasibility scores’ for scientific strategies (GEM: G, strong genetic support; E, experimental potential; M, homolog in invertebrate model organism) as described by Stoeger et al.7 and the total number of publications in MEDLINE. Splicing factor, proline- and glutamine-rich (Sfpq) had previously been demonstrated by Takeuchi et al. to be required for the transcriptional elongation of long genes51.
When performing a data-driven analysis of factors that could possibly explain age-dependent changes of the entire transcriptome, we thus included gene and transcript lengths, and subsequently found them to be more informative than transcription factors or microRNAs52.
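
The statistics named in the Figure 3 caption (odds ratio with a two-sided Fisher exact test, the common-language effect size F, and Benjamini-Hochberg FDR adjustment) can be illustrated with a minimal Python sketch that uses only the standard library. This is not the authors' analysis code: the layout of the 2x2 table, all function names, and the toy inputs are assumptions made for illustration. The risk ratio is included only as one possible reading of the caption's "times as likely" value, which is likewise an assumption.

```python
from math import comb

# Assumed 2x2 table layout (hypothetical, not taken from the paper):
#                      highlighted   not highlighted
#   factor present          a              b
#   factor absent           c              d

def odds_ratio(a, b, c, d):
    """Odds ratio (a/b) / (c/d); undefined when b or c is zero."""
    return (a * d) / (b * c)

def risk_ratio(a, b, c, d):
    """Ratio of highlighting probabilities, a/(a+b) over c/(c+d)."""
    return (a / (a + b)) / (c / (c + d))

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value: total probability of all tables with
    the observed margins that are no more likely than the observed table."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)

def common_language_effect_size(highlighted, not_highlighted):
    """F = P(x > y) + 0.5 * P(x == y) over all pairs, which equals the
    Mann-Whitney U statistic divided by n1 * n2 (the AUROC)."""
    wins = 0.0
    for x in highlighted:
        for y in not_highlighted:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(highlighted) * len(not_highlighted))

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so monotonicity can be enforced.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

if __name__ == "__main__":
    # Toy counts and values, invented for illustration only.
    print(odds_ratio(30, 70, 10, 90), risk_ratio(30, 70, 10, 90))
    print(fisher_exact_two_sided(30, 70, 10, 90))
    print(common_language_effect_size([3, 5, 7], [1, 5, 6]))
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```

In this sketch an F above 0.5 means that highlighted hits tend to have larger values of the factor than non-highlighted hits, matching the interpretation given in the Figure 3b caption. The pairwise loop is quadratic in the number of genes, so a rank-based implementation would be preferable for real hit lists.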