While understudied genes often appear as hits in high-throughput -omics experiments, they are seldom highlighted by authors.

a, Conceptual diagram depicting possible points of abandonment for understudied genes in studies using high-throughput -omics experiments. b, Bibliometric data reveal that understudied genes are frequently hits in -omics experiments but are not typically highlighted in the title/abstract of reporting articles, nor in the title/abstract of articles citing reporting articles. We considered articles reporting on genome-wide CRISPR screens (CRISPR, n=15 articles), transcriptomics (T-omics, n=148 articles), affinity capture – mass spectrometry (Aff-MS, n=296 articles), and GWAS (n=450 articles). Numbers to the right of each box plot indicate the percentile (in terms of number of articles about that gene) of all genes exceeded by the median gene in each box plot. ** denotes p < 0.01 and *** denotes p < 0.001 by two-sided Mann-Whitney U test, comparing genes highlighted in title/abstract to genes present in hit lists.
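The percentile annotation in panel b can be illustrated with a minimal sketch. It assumes that "exceeded" means strictly fewer articles than the median gene of the box plot; the function name and the toy counts are hypothetical and are not taken from the study.

```python
from statistics import median

def percentile_exceeded(group_article_counts, all_article_counts):
    """Percent of all genes with strictly fewer articles than the group's median gene.

    Assumption: ties with the median are not counted as exceeded.
    """
    group_median = median(group_article_counts)
    below = sum(1 for count in all_article_counts if count < group_median)
    return 100.0 * below / len(all_article_counts)

# Hypothetical toy data: article counts for ten genes, three of which are highlighted.
all_genes = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
highlighted = [6, 7, 8]
print(percentile_exceeded(highlighted, all_genes))
```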

Articles focusing on less popular genes tend to accrue more citations.

a, Density plot shows correlation between articles per gene before 2015 and median citations to articles published in 2015. Contours correspond to deciles in density. Solid red line shows locally weighted scatterplot smoothing (LOWESS) regression. ρ is the Spearman rank correlation and p is its significance value, as described by Kendall and Stuart73. b, Spearman correlation of previous gene popularity (i.e., number of articles) to median citations per year since 1990. Solid blue line indicates nominal Spearman correlation; shaded region indicates bootstrapped 95% confidence interval (n=1,000). Only articles with a single gene in the title/abstract are considered, excluding the 30.4% of gene-focused studies which feature more than one gene in the title/abstract.
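The Spearman correlation and its bootstrapped confidence interval can be sketched as follows. This is an illustration only: it assumes that n=1,000 refers to the number of bootstrap resamples, that genes are resampled with replacement, and that a simple percentile interval is used. None of these details is confirmed by the legend, and all names and toy data are hypothetical.

```python
import random
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rho: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / (var_x * var_y) ** 0.5

def bootstrap_ci(x, y, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for rho (assumed procedure, see lead-in)."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman([x[i] for i in idx], [y[i] for i in idx])
        if rho == rho:  # drop NaN from degenerate resamples
            stats.append(rho)
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lower, upper

# Hypothetical toy data: prior articles per gene and median citations.
prior_articles = [1, 3, 10, 30, 100, 300, 1000, 3000, 50, 7]
median_citations = [40, 35, 30, 28, 22, 20, 15, 10, 31, 25]
print(spearman(prior_articles, median_citations))
print(bootstrap_ci(prior_articles, median_citations))
```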

We evaluated which gene-related factors are associated with elevation to the title/abstract of an article featuring a high-throughput experiment.

a, Association between factors with binary (True/False) identities and highlighting hits in title/abstract of reporting articles. Values represent the odds ratio between hits in the collected articles and hits mentioned in the title or abstract of collected articles (e.g. hits with a compound known to affect gene activity are 5.114 times as likely to be mentioned in the title/abstract in an article using transcriptomics). Collected articles are described in Figure 1B and Figures S5 and S6. 95% confidence interval of odds ratio is shown in parentheses. * = p < 0.05, ** = p < 0.01, and *** = p < 0.001 by two-sided Fisher exact test. Results are shown numerically in Table S2. For consistency between studies, hits were restricted to protein-coding genes. Thus, status as a protein-coding gene could not be tested. †No genes without a defined HUGO symbol were found as hits in GWAS or transcriptomics studies. b, Association between factors with continuous identities and highlighting hits in title/abstract of reporting articles. Values represent F, the common-language effect size (equivalent to AUROC, where ∼0.5 indicates little effect, >0.5 indicates positive effect and <0.5 indicates negative effect) of being mentioned in the titles/abstracts of the collected articles described in Figure 1B and Figures S5 and S6. * = p < 0.05, ** = p < 0.01, and *** = p < 0.001 by two-sided Mann-Whitney U test. Results are shown numerically in Table S3.
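The two effect measures in this legend can be sketched in a few lines. The example assumes a 2x2 table with rows for factor True/False and columns for highlighted/not highlighted, and it computes the common-language effect size as U / (n1 × n2) with ties counted as one half. The table layout, function names, and counts are hypothetical, and the confidence-interval method for the odds ratio is not stated in the legend, so it is not reproduced here.

```python
from math import comb

def odds_ratio(a, b, c, d):
    """Sample odds ratio for the table [[a, b], [c, d]].

    Assumed layout: rows = factor True / False,
    columns = highlighted / not highlighted. Requires b * c > 0.
    """
    return (a * d) / (b * c)

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value.

    Sums hypergeometric probabilities of all tables with the same margins
    that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(k) for k in range(k_min, k_max + 1)
               if prob(k) <= p_obs * (1 + 1e-7))

def common_language_effect_size(highlighted, not_highlighted):
    """Probability that a random highlighted value exceeds a random
    non-highlighted value (ties count 0.5); equals AUROC."""
    wins = 0.0
    for h in highlighted:
        for o in not_highlighted:
            if h > o:
                wins += 1.0
            elif h == o:
                wins += 0.5
    return wins / (len(highlighted) * len(not_highlighted))

# Hypothetical counts: 10 of 100 hits with the factor are highlighted,
# versus 5 of 200 hits without it.
print(odds_ratio(10, 90, 5, 195))
print(fisher_exact_two_sided(10, 90, 5, 195))
print(common_language_effect_size([3, 4, 5], [1, 2, 3]))
```

A value of F above 0.5 in this sketch means that highlighted genes tend to have larger values of the factor than hits that were not highlighted.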

Clustermap showing collected factors across all human protein-coding genes.

Factors are shown along the x axis, with genes along the y axis. Eight factors, representing the default factors we selected for FMUG, are shown (all factors are shown in Figure S7 and Figure S8). Binary factors are coded to 0 (purple) and 1 (white), while continuous factors are ranked from 0 to 1 with ties resolved to minimum rank. Clustering was performed with Ward’s method for hierarchical clustering74.
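The rank scaling of continuous factors can be sketched as below. The legend does not say how ranks are mapped onto the 0 to 1 range, so this example assumes (minimum rank − 1) / (n − 1); the function name and the toy values are hypothetical. The Ward clustering step is not shown.

```python
def min_rank_scaled(values):
    """Scale values to [0, 1] by rank, with tied values sharing the minimum rank.

    Assumption: scaled rank = (min_rank - 1) / (n - 1), so the smallest
    value maps to 0 and a unique largest value maps to 1.
    """
    n = len(values)
    if n < 2:
        return [0.0] * n
    first_position = {}
    for position, value in enumerate(sorted(values)):
        # Keep only the first (lowest) sorted position for each distinct value.
        first_position.setdefault(value, position)
    return [first_position[value] / (n - 1) for value in values]

# Hypothetical factor values for four genes; the two tied genes share a rank.
print(min_rank_scaled([10, 20, 20, 40]))
```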

We created FMUG to help researchers identify understudied genes among their genes of interest and characterize their tractability for future research.

a, Diagram describing use of FMUG. b, An early prototype of FMUG led us to the hypothesis that transcript length negatively correlates with up-regulation during aging. First, we identified genes that strongly associate with age-dependent transcriptional change across multiple cohorts. We then performed a literature review for each of these genes to identify the most direct way the genes (or evolutionarily closely related genes or functionally closely related partner proteins) had been studied in aging. 64% had been functionally investigated in aging, 15% shown to change a measure of gene expression, 3% functionally investigated in a biological domain close to aging (such as senescence), and 5% shown to change a measure of gene expression in a biological domain close to aging. For genes reported by others to change expression with age, we identified tissues in which transcripts of the genes change during aging. We computed ‘feasibility scores’ for scientific strategies (GEM: G, strong genetic support; E, experimental potential; M, homolog in invertebrate model organism) as described by Stoeger et al.7, and the total number of publications in MEDLINE. Splicing factor, proline- and glutamine-rich (Sfpq) had previously been demonstrated by Takeuchi et al. to be required for the transcriptional elongation of long genes62. When performing a data-driven analysis of factors that could possibly explain age-dependent changes of the entire transcriptome, we thus included gene and transcript lengths, and subsequently found them to be more informative than transcription factors or microRNAs63.