Annotation of ncORFs in humans and mice.

(A) Overview of the integrative pipeline for ncORF annotation in human and mouse genomes. (B-C) RPF coverage plots for a representative uORF (B) and a representative lncORF (C), visualized using the Ribo-Seq signal track from GWIPS-viz68. ncORFs are shown in yellow and CDSs in cyan. Dashed lines indicate identical genomic regions shared across transcript isoforms. (D-E) Proportion of GENCODE ncORFs rediscovered in this study. GENCODE ncORFs were stratified either by reproducibility across independent studies (D) or by tiered translation evidence strength (E). (F) Known ncORFs that were experimentally characterized in earlier studies and independently rediscovered in this study in humans and mice. CDS, coding sequence; FLOSS, fragment length organization similarity score; MANE, matched annotation from NCBI and EBI; ncORF, non-canonical open reading frame; ORF, open reading frame; RPF, ribosome-protected fragment.

Sequence features of ncORFs and putative ncEPs in humans and mice.

(A) Number of ncORFs belonging to different categories. (B) Distribution of ncORF lengths. (C) Proportion of ncEPs that contain known Pfam domains. (D) Number of ncORFs with Pfam domains among lncORFs and other ncORF categories. (E) Distribution of the proportion of intrinsically disordered residues in CDS- and ncORF-encoded proteins. (F) Most frequently identified Pfam domains among human and mouse ncORFs. (G) Proportion of ncORFs overlapping with TEs, stratified by presence or absence of Pfam domains. Differences were assessed using Wilcoxon rank-sum tests. CDS, coding sequence; ncORF, non-canonical open reading frame; ncEP, ncORF-encoded protein; TE, transposable element.

Evolutionary constraints of ncORFs.

(A) Mean PhyloP scores of the 30 base pairs upstream and downstream of ORF start or stop codons in mammals. Codons are delineated with bars of alternating colors for different frames, and nucleotides in untranslated regions are shown in grey. The significance of three-nucleotide periodicity was assessed by autocorrelation with a lag of three. (B) Cladograms of vertebrates with the number of ncORFs originating at each ancestral branch leading to humans. Triangles indicate species merged into larger clades for visual simplicity. (C) Relationship between ORF origination rates and node ages measured as node-to-tip distances. Origination rate was defined as the number of ncORFs that originated on a branch divided by the branch length. Blue lines show linear regression fits, and grey bands represent 95% prediction confidence intervals. Spearman’s correlation is indicated. (D) Distribution of ncORF PhyloCSF scores normalized by the number of codons per ncORF. The dashed line denotes zero. CDS, coding sequence; ncORF, non-canonical open reading frame; ORF, open reading frame.
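The lag-three autocorrelation mentioned in (A) can be illustrated with a minimal sketch. It assumes a plain sample autocorrelation; the function name and the toy scores are hypothetical, and the legend does not specify how significance was derived from the statistic, so that step is omitted.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation of a numeric sequence at the given lag."""
    n = len(values)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(values) - 1")
    mean = sum(values) / n
    centered = [v - mean for v in values]
    variance = sum(c * c for c in centered)
    if variance == 0:
        raise ValueError("autocorrelation is undefined for a constant sequence")
    covariance = sum(centered[i] * centered[i + lag] for i in range(n - lag))
    return covariance / variance


# Hypothetical per-nucleotide conservation scores for ten codons, with the
# third codon position less conserved than the first two (toy data, not
# values from the study).
toy_scores = [2.0, 1.5, 0.2] * 10

lag3 = autocorrelation(toy_scores, 3)  # high for a codon-periodic signal
lag1 = autocorrelation(toy_scores, 1)  # negative for this toy signal
```

A strongly positive value at lag three, relative to other lags, is what a codon-periodic conservation profile would be expected to produce.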

Lineage-specifically conserved ncORFs.

(A) Schematic illustration of the BLS calculation. (B) Cladograms of vertebrates showing the number of lineage-specifically conserved ncORFs (local BLS > 0.9) at each ancestral node for humans and mice. (C) Distribution of PhyloCSF scores per codon for lineage-specifically conserved ncORFs compared to all other ncORFs. BLS, branch length score; ncORF, non-canonical open reading frame; ORF, open reading frame.

Evolutionary dynamics of ncORF expression.

(A) Distribution of mean translation levels of human ncORFs grouped by their origin nodes and further stratified by local BLS. Statistical significance was assessed using Wilcoxon rank-sum tests. ***, P < 0.001; **, P < 0.01; *, P < 0.05. (B) Similar to (A) but ncORFs are further stratified by PhyloCSF scores. (C) Relationship between ncORF tissue specificity at translation and transcription levels. Differences were determined with Wilcoxon signed-rank tests. (D-E) Similar to (A) and (B) but showing the distribution of tissue specificity at the translation level. BLS, branch length score; CDS, coding sequence; ncORF, non-canonical open reading frame; ORF, open reading frame.

ncORF-CDS co-translation network.

(A) Bipartite network of ncORF-CDS co-translation in humans. (B) Bipartite network of ncORF-CDS co-translation in mice. ncORFs and CDSs are represented by different shapes and colored according to their cluster membership. Only the largest clusters are highlighted (top two in humans and top five in mice). (C) Top five enriched gene ontology terms for each of the two largest clusters in the human network shown in (A). (D) Top five enriched gene ontology terms for each of the three largest clusters in the mouse network shown in (B). (E) Proportion of ncORFs co-translating with CDSs among ancient (pre-mammalian origin) versus younger (mammalian-specific) ncORFs. Differences were tested with Fisher’s exact tests. CDS, coding sequence; ncORF, non-canonical open reading frame; ORF, open reading frame.

Number of high-quality Ribo-Seq libraries passing quality control for each sample in humans (A) and mice (B).

EC, endothelial cell; ESC, embryonic stem cell; LCL, lymphoblastoid cell line; Ribo-Seq, ribosome profiling.

Rediscovery rate of non-canonical ORFs supported by MS evidence, as compiled in our previous study20.

MS, mass spectrometry; ORF, open reading frame.

Translation-related signatures of ncORF sequences.

(A) Start codon usage among ncORFs. (B) Distribution of relative start positions of ncORFs within transcripts. The x-axis represents the position of ncORF start codon normalized by transcript length. (C) Nucleotide pairing probabilities around the start codons of ncORFs and CDSs. (D) Proportion of ncORFs with different Kozak sequence contexts. Statistical significance was assessed using Fisher’s exact test. CDS, coding sequence; ncORF, non-canonical open reading frame; ORF, open reading frame.

Seqlogo plots of Kozak sequence context surrounding ncORF start codons in humans (A) and mice (B).

The first nucleotide of the start codon is designated as position +1. The height of each nucleotide at each position reflects its frequency. ncORF, non-canonical open reading frame.

Differences in codon and amino acid usage between ncORFs and CDSs.

(A-B) Differences in relative usage of codons (A) and amino acids (B). Significance was determined using Fisher’s exact tests and corrected for multiple testing. Codons or amino acids with adjusted P value < 0.05 are shown in dark grey. (C-D) Distribution of tRNA adaptation index values (C) and effective number of codons (D) for CDSs and ncORFs. Statistical significance was determined by Wilcoxon rank-sum tests. CDS, coding sequence; ncORF, non-canonical open reading frame.

Length distribution of ncORFs with (blue) or without (red) known Pfam domains.

Statistical differences were assessed using Wilcoxon rank-sum tests. AA, amino acid; ncORF, non-canonical open reading frame; ORF, open reading frame.

Distribution of Gnocchi scores for CDSs and ncORFs.

Statistical significance of deviation from zero was assessed using Wilcoxon signed-rank tests. CDS, coding sequence; Gnocchi, Genomic Non-Coding Constraint of Haplo-Insufficient variation; ncORF, non-canonical open reading frame.

The mean PhyloP score of 30 base pairs upstream and downstream of the ORF (CDS or ncORF) start or stop codons in primates.

Codons were delineated with bars of alternating colors for different frames, and nucleotides in untranslated regions were shown in grey. Statistical significance of the three-nucleotide periodicity was determined by autocorrelation with a lag of three. CDS, coding sequence; ncORF, non-canonical open reading frame.

Cladograms of vertebrates showing the number of ncORFs originating at each ancestral branch leading to mice.

(A) All ncORFs. (B) Lineage-specific ncORFs with local branch length score > 0.9. Triangles indicate species merged into larger clades for visual simplicity.

Proportion of translated ncORFs and CDSs detected in each human and mouse sample.

ORFs with RPKM ≥ 1 were considered translated. BMDC, bone marrow-derived dendritic cell; CDS, coding sequence; DC, dendritic cell; EC, endothelial cell; ESC, embryonic stem cell; LCL, lymphoblastoid cell line; ncORF, non-canonical open reading frame; RPKM, reads per kilobase per million mapped reads; T-ALL, T-cell acute lymphoblastic leukemia.
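The RPKM ≥ 1 criterion can be made concrete with a short sketch that follows the standard RPKM definition (reads per kilobase of ORF per million mapped reads). Function names, the dictionary layout, and the counts are hypothetical; how reads were assigned to ORFs is not described in this legend and is not modeled here.

```python
def rpkm(read_count, orf_length_nt, total_mapped_reads):
    """Reads per kilobase of ORF per million mapped reads."""
    return read_count * 1e9 / (orf_length_nt * total_mapped_reads)


def translated_fraction(read_counts, orf_lengths, total_mapped_reads, threshold=1.0):
    """Fraction of ORFs in one sample whose RPKM reaches the threshold."""
    translated = sum(
        1
        for orf_id, count in read_counts.items()
        if rpkm(count, orf_lengths[orf_id], total_mapped_reads) >= threshold
    )
    return translated / len(read_counts)


# Toy sample with three ORFs and 10 million mapped reads (hypothetical values).
toy_counts = {"orf_a": 30, "orf_b": 0, "orf_c": 1}
toy_lengths = {"orf_a": 300, "orf_b": 150, "orf_c": 3000}
toy_fraction = translated_fraction(toy_counts, toy_lengths, 10_000_000)
```

In this toy sample only `orf_a` passes the threshold, so one third of the ORFs count as translated.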

Distribution of average translation levels of ncORFs stratified by local BLS (A) or PhyloCSF (B) in humans and mice.

Statistical differences were assessed using Wilcoxon rank-sum tests. BLS, branch length score; ncORF, non-canonical open reading frame; RPF, ribosome-protected fragment; RPKM, reads per kilobase per million mapped reads.

Expression patterns of ncORFs in mice.

(A) Distribution of mean translation levels of mouse ncORFs grouped by origin nodes and further stratified by local BLS. Statistical significance was assessed with Wilcoxon rank-sum tests. ***, P < 0.001; **, P < 0.01; *, P < 0.05. (B) Similar to (A) but ncORFs are further stratified by PhyloCSF scores. (C-D) Similar to (A) and (B), but showing the distribution of tissue specificity at the translation level.

Heatmap showing the translation levels of ncORFs in human (A) and mouse (B) samples.

Only ncORFs with translation level (RPKM) ≥ 1 in at least one sample are displayed. Translation levels were normalized across samples by Z-score transformation before visualization. BMDC, bone marrow-derived dendritic cell; DC, dendritic cell; EC, endothelial cell; ESC, embryonic stem cell; LCL, lymphoblastoid cell line; ncORF, non-canonical open reading frame; Ribo-Seq, ribosome profiling; RPKM, reads per kilobase per million mapped reads; T-ALL, T-cell acute lymphoblastic leukemia.
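The Z-score transformation across samples can be sketched as follows, with each row holding one ncORF and each column one sample. The use of the population standard deviation and the handling of constant rows are assumptions; the legend does not specify either.

```python
from statistics import mean, pstdev


def zscore_rows(matrix):
    """Standardize each row (one ncORF across samples) to mean 0 and SD 1."""
    scaled = []
    for row in matrix:
        mu = mean(row)
        sd = pstdev(row)  # population SD; the sample SD is an equally plausible choice
        if sd == 0:
            # A row that is constant across samples carries no contrast.
            scaled.append([0.0] * len(row))
        else:
            scaled.append([(x - mu) / sd for x in row])
    return scaled


# Hypothetical RPKM values for two ncORFs across three samples.
toy_z = zscore_rows([[1.0, 2.0, 3.0], [5.0, 5.0, 5.0]])
```

Scaling per row lets ncORFs with very different absolute translation levels share one color scale in a heatmap.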

Heatmap showing the transcription levels of ncORF-containing genes in humans (A) and mice (B).

Only genes expressed with TPM ≥ 1 in at least one sample are shown. TPM values were normalized across samples by Z-score transformation prior to visualization. ncORF, non-canonical open reading frame; TPM, transcripts per million.

Permutation tests assessing co-translation between ncORFs and main CDSs within the same genes.

(A-B) Distribution of the number of ncORFs that co-translate with the corresponding main CDSs of the same genes in humans (A) and mice (B), based on the permutation analysis. Red dashed lines mark the observed counts. Empirical P values are indicated. (C-D) Same analyses as in (A) and (B), but excluding ncORFs overlapping with any CDS regions. CDS, coding sequence; ncORF, non-canonical open reading frame.
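An empirical P value of this kind compares the observed count with the counts obtained under permutation. The sketch below assumes a one-sided test with the common add-one correction; the permutation scheme itself is not described in this legend and is not reproduced, and the function name and numbers are hypothetical.

```python
def empirical_p_value(observed, permuted_counts):
    """One-sided empirical P value: share of permutations at least as extreme.

    Uses (r + 1) / (n + 1), where r is the number of permuted counts greater
    than or equal to the observed count and n is the number of permutations,
    so the estimate is never exactly zero.
    """
    n = len(permuted_counts)
    if n == 0:
        raise ValueError("at least one permutation is required")
    r = sum(1 for count in permuted_counts if count >= observed)
    return (r + 1) / (n + 1)


# Hypothetical null distribution from five permutations and an observed count of 10.
toy_p = empirical_p_value(10, [1, 2, 3, 10, 11])
```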