Long non-coding RNAs as a source of new peptides

Abstract
eLife digest
Introduction
Results
Discussion
Materials and methods
Data availability
References
Article and author information
Metrics

Abstract

Deep transcriptome sequencing has revealed the existence of many transcripts that lack long or conserved open reading frames (ORFs) and which have been termed long non-coding RNAs (lncRNAs). The vast majority of lncRNAs are lineage-specific and do not yet have a known function. In this study, we test the hypothesis that they may act as a repository for the synthesis of new peptides. We find that a large fraction of the lncRNAs expressed in cells from six different species is associated with ribosomes. The patterns of ribosome protection are consistent with the translation of short peptides. lncRNAs show similar coding potential and sequence constraints than evolutionary young protein coding sequences, indicating that they play an important role in de novo protein evolution.

https://doi.org/10.7554/eLife.03523.001

eLife digest

Despite the terms being largely interchangeable in modern language, ‘DNA’ and ‘gene’ do not mean the same thing. A gene is made of DNA and contains the instructions to make a protein, and it is the protein that performs the function of the gene. However, cells in the body also contain DNA that does not form genes. Far from being ‘junk’ DNA with no biological purpose; this DNA has a variety of roles, including affecting how other genes are used.

To produce a protein, the DNA sequence of a gene is transcribed into an intermediate molecule called RNA, which is then translated to produce a protein. So-called long non-coding RNA (lncRNA) molecules are also transcribed from DNA, but whether these are translated to make proteins has been a subject of much debate. Indeed, the function of the vast majority of lncRNA molecules is unknown.

Ruiz-Orera et al. analyzed RNA sequences collected from earlier experiments on six different species—humans, mice, fish, flies, yeast, and a plant—and found nearly 2500 as yet unstudied lncRNAs in addition to those previously identified. Many of the lncRNAs that Ruiz-Orera et al. investigated could be found lodged inside the cellular machinery used to translate RNA into proteins. Furthermore, these lncRNA molecules are oriented in the machinery as if they are primed and ready for translation, suggesting that many lncRNAs do produce proteins. However, it is unclear how many of these proteins have a useful function.

Very few lncRNAs were found in more than one species, suggesting that they have evolved recently. The properties of lncRNA molecules also show many similarities with the properties of ‘young’—recently evolved—genes that are known to produce proteins. The combined findings of Ruiz-Orera et al. therefore suggest that lncRNAs are important for developing new proteins. The emergence of proteins with new functions has been an important driving force in evolution, and this work provides important clues into the first steps of this process.

https://doi.org/10.7554/eLife.03523.002

Introduction

Studies performed over the past decade have unveiled a richer and more complex transcriptome than was previously appreciated (Okazaki et al., 2002; Carninci et al., 2005; Kapranov et al., 2007; Ponjavic et al., 2007). Thousands of long RNA molecules (>200 nucleotides) that do not display the typical properties of well-characterized protein-coding RNAs, and which have been named intergenic or long non-coding RNAs (lncRNAs), have been discovered in several eukaryotic genomes (Okazaki et al., 2002; Ponting et al., 2009; Cabili et al., 2011; Liu et al., 2012; Pauli et al., 2012; Ulitsky and Bartel, 2013). There are several lncRNAs that have regulatory functions (Guttman and Rinn, 2012; Ulitsky and Bartel, 2013). For example the X-inactive-specific transcript Xist regulates X chromosome inactivation in eutherian mammals (Brockdorff et al., 1992). However, the vast majority of lncRNAs do not have a known function.

Intriguingly, several recent studies have noted that a large fraction of lncRNAs associate with ribosomes (Ingolia et al., 2011; Bazzini et al., 2014; Juntawong et al., 2014; van Heesch et al., 2014). Deep sequencing of ribosome-protected fragments, or ribosome profiling, provides detailed information on the regions that are translated in a transcript (Ingolia, 2014). According to some studies, the patterns of ribosome protection indicate that lncRNAs are capable of translating short peptides (Ingolia et al., 2011; Bazzini et al., 2014; Juntawong et al., 2014) although others have reached different conclusions (Guttman et al., 2013). Many lncRNAs have the same structure as classical mRNAs: they are transcribed by polymerase II, capped and polyadenylated, and accumulate in the cytoplasm (van Heesch et al., 2014). However, in contrast to typical protein-coding genes, they tend to contain few introns, are expressed at low levels, exhibit weak sequence constraints, and show limited phylogenetic conservation (Cabili et al., 2011; Derrien et al., 2012; Kutter et al., 2012; Necsulea et al., 2014).

The association of lncRNAs with ribosomes, and the fact that many of them appear to have arisen relatively recently in evolution, indicate that they could be an important source of new peptides. Levine et al., who described the first examples of de novo originated genes in Drosophila melanogaster, already noted that non-coding RNAs expressed at low levels could contribute to the birth of novel protein coding genes (Levine et al., 2006). Cai et al. found a new protein coding gene in Saccharomyces cerevisiae likely to have been formed from a previously transcribed non-coding sequence (Cai et al., 2008). Wilson and Masel observed that ribosome profiling reads from a yeast experiment often mapped to intergenic transcripts (Wilson and Masel, 2011), and they proposed that this could help provide the raw material for the birth of new protein-coding genes. Another study in yeast found evidence of translation of short species-specific ORFs located in non-genic regions (Carvunis et al., 2012). More generally, it is important to consider that de novo protein-coding gene evolution, which was once thought to be a very rare event, is now believed to be relatively common (Khalturin et al., 2009; Toll-Riera et al., 2009; Tautz and Domazet-Lošo, 2011; Long et al., 2013; Reinhardt et al., 2013). Recently emerged proteins tend to be very short and evolve under weak evolutionary constraints (Albà and Castresana, 2005; Levine et al., 2006; Cai et al., 2009; Liu et al., 2010; Xie et al., 2012; Palmieri et al., 2014), properties that we also expect to find in the putative ORFs of lncRNAs.

The idea that lncRNAs serve as a repository for the evolution of new peptides is appealing but the evidence is still fragmented. In this study, we have analyzed ribosome profiling experiments performed in six different species and measured the sequence coding potential and selective constraints of the putatively translated ORFs in lncRNAs and codRNAs. We have discovered that lncRNAs show very similar characteristics to evolutionary young protein coding genes (lineage-specific proteins). The results strongly support a role for lncRNAs in the production of new peptides.

Results

Characterization of coding and long non-coding transcripts

We obtained polyA+ RNA and ribosome profiling sequencing data from six different published experiments performed in diverse eukaryotic species, mouse (Mus musculus), human (Homo sapiens, HeLa cells), zebrafish (Danio rerio), fruit fly (D. melanogaster), Arabidopsis (A. thaliana), and yeast (S. cerevisiae) (Table 1). After read mapping and transcript assembly, we classified the expressed transcripts longer than 200 nucleotides into coding and long non-coding classes (codRNAs and lncRNAs, respectively, Table 2).

Table 1

Data sets used in the study

https://doi.org/10.7554/eLife.03523.003

Species		GEO Accession	Mapped reads (millions)	Max read length (bp)	Description	Reference
Mouse M. musculus	RNA-seq	GSE30839	226.0	43	ES cells, E14	Ingolia et al., 2011
Mouse M. musculus	Ribosome profiling	GSE30839	39.2	47	ES cells, E14	Ingolia et al., 2011
Human H. sapiens	RNA-seq	GSE22004	29.8	36	HeLa cells	Guo et al., 2010
Human H. sapiens	Ribosome profiling	GSE22004	78.3	36	HeLa cells	Guo et al., 2010
Zebrafish D. rerio	RNA-seq	GSE32900	1382.2	2 × 75	Series of developmental stages	Chew et al., 2013
Zebrafish D. rerio	Ribosome profiling	GSE46512	1040.0	44	Series of developmental stages	Chew et al., 2013
Fruit fly D. melanogaster	RNA-seq	GSE49197	1317.9	50	0–2hr embryos, wild type	Dunn et al., 2013
Fruit fly D. melanogaster	Ribosome profiling	GSE49197	105.7	50	0–2hr embryos, wild type	Dunn et al., 2013
Arabidopsis A. thaliana	RNA-seq	GSE50597	79.8	51	No stress conditions, TRAP purification	Juntawong et al., 2014
Arabidopsis A. thaliana	Ribosome profiling	GSE50597	140.3	51	No stress conditions, TRAP purification	Juntawong et al., 2014
Yeast S. cerevisiae	RNA-seq	GSE52119	20.54	50	GSY83, diploid	McManus et al., 2014
Yeast S. cerevisiae	Ribosome profiling	GSE52119	6.83	50	GSY83, diploid	McManus et al., 2014

Table 2

Fraction of transcripts associated with ribosomes

https://doi.org/10.7554/eLife.03523.004

	codRNA			lncRNA
	Expressed	Associated with ribosomes (RP)		Expressed	Associated with ribosomes (RP)
		Total	Stringent		Total	Stringent
Mouse	14,245	14,196 (99.7%)	13,918 (97.7%)	476	390 (81.9%)	367 (77.1%)
Human	17,011	16,630 (97.8%)	16,617 (97.7%)	934	403 (43.1%)	343 (36.7%)
Zebrafish	12,595	11,643 (92.4%)	11,637 (92.4%)	2392	726 (30.4%)	684 (28.6%)
Fruit fly	8041	8031 (99.9%)	7623 (94.8%)	28	22 (78.6%)	10 (35.7%)
Arabidopsis	19,162	18,879 (98.5%)	10,329 (53.9%)	139	93 (66.9%)	68 (48.9%)
Yeast	4740	4547 (95.9%)	4335 (91.5%)	21	6 (28.6%)	6 (28.6%)

Stringent: number of transcripts significant at p < 0.05 using 3′UTRs as a null model (see ‘Materials and methods’ for more details).

We detected hundreds of annotated lncRNAs in the vertebrate species (mouse, human and zebrafish), the number being lower (<150) in the other species (fruit fly, Arabidopsis and yeast). In addition, we identified a large number of novel lncRNAs not annotated in the databases, 2488 taking all species together (Supplementary file 1A). The inclusion of such lncRNAs resulted in a sixfold increase in the number of lncRNAs amenable for study in zebrafish and a twofold increase in mouse. In yeast, we only found two annotated lncRNAs, but there were 19 novel ones. In the majority of the analyses, we merged the annotated and the novel lncRNAs.

As expected, lncRNAs tended to be much shorter than codRNAs in all the species studied (Figure 1A). We found that most lncRNAs contained at least one short ORF (≥24 amino acids) and often several ORFs. The average ORF size in lncRNAs was between 43 and 68 amino acids depending on the species (Supplementary file 1B). Consistent with previous studies, lncRNAs were expressed at significantly lower levels than codRNAs (Figure 1B, Wilcoxon test, p < 10⁻⁵).

Figure 1

Download asset Open asset

General characteristics of codRNA and lncRNA transcripts.

(A) Density plots of transcript length. (B) Box-plots of transcript expression level in log2(FPKM) units. lncRNA_ribo: lncRNAs associated with ribosomes; lncRNA_noribo: lncRNAs for which association with ribosomes was not detected. codRNA: coding transcripts encoding experimentally validated proteins except for zebrafish in which all transcripts annotated as coding were considered. The area within the box-plot comprises 50% of the data and the line represents the median value. In all studied species, codRNAs were expressed at higher levels than lncRNAs (Wilcoxon test, p < 10⁻⁵), and lncRNA_ribo at higher levels than lncRNA_noribo (Wilcoxon test, p < 0.005).

https://doi.org/10.7554/eLife.03523.005

Efficient detection of translation events by ribosome profiling

The analysis of ribosome profiling sequencing data showed that the percentage of expressed coding transcripts associated with ribosomes was >90% in all species, with the highest values (>99%) in mouse and fruit fly (Table 2). Pseudogenes had a lower rate of association with ribosomes than coding RNAs, but surprisingly, in species with many annotated pseudogenes, such as human, mouse, and Arabidopsis, the majority of them showed association with ribosomes (Supplementary file 1A). This appeared to be a true signal; while pseudogenes will typically show sequence similarity to other functional copies in the genome, we only considered uniquely mapped reads with no mismatches.

Ribosome profiling is based on deep sequencing, and thus provides an unmatched level of resolution of the translated peptides when compared with current proteomics techniques. This is especially important for short proteins, which are difficult to detect by standard mass spectrometry methods (Slavoff et al., 2013). We used the ribosome-associated protein-coding RNA data to investigate the relationship between peptide detection by proteomics and protein length. We found that human and mouse translated proteins between 24 and 80 amino acids long were more difficult to identify in proteomics databases than longer proteins (Table 3).

Table 3

Fraction of translated proteins of different size detected in proteomics databases

https://doi.org/10.7554/eLife.03523.006

	Protein size (amino acids)
Species	24–80	81–130	131–180	>180
Mouse	27/58 (46.6%)	222/286 (77.6%)	256/330 (77.6%)	3716/4786 (77.7%)
Human	116/272 (42.6%)	536/748 (71.7%)	669/875 (76.5%)	6757/8964 (75.4%)
Yeast	27/30 (90.0%)	168/207 (81.1%)	234/265 (88.3%)	2934/3224 (91.0%)

Only transcripts encoding experimentally validated proteins (codRNAe) were considered.

Long non-coding RNA transcripts frequently associate with ribosomes

The percentage of lncRNAs scanned by ribosomes (lncRNA_ribo) was surprisingly high in all the species studied (Table 2). The values ranged from 28.6% in yeast to 81.9% in mouse. This affected the main lncRNA classes described in Ensembl v. 70, including long intervening non-coding RNAs (lincRNAs) or antisense transcripts (Supplementary file 1C). Short transcript size may hinder ribosome association detection (Aspden et al., 2014). We also found that the ribosome profiling signal was more difficult to detect in poorly expressed transcripts than in highly expressed ones, both for lncRNAs and codRNAs (Figure 2). As lncRNAs tend to be expressed at low levels and are short when compared to codRNAs (Figure 1), we might be underestimating their association with ribosomes.

Figure 2

Download asset Open asset

Effect of transcript expression level on the detection of ribosome association.

The percentage of transcripts associated with ribosomes is shown for several transcript expression intervals. codRNA: annotated coding transcripts encoding experimentally verified proteins (except in zebrafish for which all coding transcripts were considered). lncRNA: annotated and novel long non-coding RNAs. Only species with at least 20 transcripts in each expression bin were plotted. In the rest of species, the data were consistent with the trends shown.

https://doi.org/10.7554/eLife.03523.007

In order to determine if the ribosome profiling signal in lncRNAs was different from noise, we compared ribosome density in the transcripts it to that in 3′untranslated regions (3′UTRs). More specifically, the null model consisted in a size-matched set of sequences containing randomly taken 3′UTR from annotated coding transcripts. Ribosome density was calculated as the number of ribosome profiling reads divided by RNA-seq reads, a ratio defined as translational efficiency (TE) (Ingolia et al., 2011). Both codRNAs and lncRNAS displayed much higher TE values than 3′UTRs in all species studied (Wilcoxon test p < 10⁻⁵, Figure 3). We could reject the null model for 90.12% of the lncRNAs and 87.19% of the codRNAs associated with ribosomes (p < 0.05) (see details by species in Table 2, Stringent set). Therefore, we concluded that the density of ribosomes in lncRNAs is much higher than expected by spurious ribosome binding.

Figure 3

Download asset Open asset

TE distribution in human transcripts and 3′UTRs (null-model).

Cumulative distribution of TE values in human codRNAs, lncRNAs, and 3′UTR sequences. We randomly selected 3′UTRs with a minimum length of 30 nucleotides to build a set of 3′UTR sequences with the same size distribution as the complete transcripts.

https://doi.org/10.7554/eLife.03523.008

Next, we compared ribosome density in lncRNAs and codRNAs in each of the species focusing on regions covered by ribosome profiling reads to accommodate for any differences in the length of the putatively translated regions. In human, fruit fly, and yeast, TE was higher in codRNAs than in lncRNAs (Wilcoxon test, p < 0.005), but in mouse and zebrafish the opposite trend was observed (Wilcoxon test, p < 0.05) (Figure 4). Despite the differences between the species, which may be due to technical issues, it is clear that lncRNAs can show TE values that are similar or even higher than codRNAs. The results were similar when we restricted the analysis to genes encoding a single transcript to avoid any possible biases due to multiple read mapping or when we employed the maximum TE in 90 nucleotide windows (Figure 4—figure supplement 1).

Figure 4 with 1 supplement see all

Download asset Open asset

Ribosome association profiles for codRNAs and lncRNAs.

Box-plots of transcript translational efficiency (TE) in log2(TE) units. The area within the box-plot comprises 50% of the data, and the line represents the median value. lncRNA: lncRNAs for which association with ribosomes was detected. codRNA: coding RNAs transcripts encoding experimentally validated proteins except for zebrafish in which all transcripts annotated as coding were considered.

https://doi.org/10.7554/eLife.03523.009

For comparison, we collected a set of 29 human genes with non-coding functions described in several recent reviews (Supplementary file 2A; Ponting et al., 2009; Ulitsky and Bartel, 2013; Fatica and Bozzoni, 2014). Many of these genes play roles in the regulation of gene expression in the nucleus and are thus unlikely to be translated. We only detected expression for five of these genes: Malat1, Pvt1, Neat1, Meg8, and Cyrano. Transcripts encoded by the first three genes showed ribosome association. In the case of Malat1, this was also consistently observed in mouse and zebrafish (in the latter species Malat1 was identified as a novel transcript) and in the case of Pvt1 in mouse. Given the small number of expressed transcripts, we could not draw any general conclusions for this set.

lncRNAs show similar ribosome protection profiles to codRNAs

The exact positions of ribosome profiling reads on the RNA can be used to delineate the regions that are being actively translated or to discover new functional ORFs (Chew et al., 2013; Guttman et al., 2013; Ingolia, 2014). Because the ribosome is released after encountering a stop codon, this technique can also be employed to identify novel C-terminal protein extensions (Dunn et al., 2013) or to evaluate if a predicted ORF is likely to correspond to a translated peptide (Guttman et al., 2013). We next aimed at comparing the TE values in different transcript regions, including open reading frames (ORFs), putative 5′ and 3′ untranslated regions (UTRs), and the regions between ORFs.

In order to obtain an unbiased picture, it was important to define the different regions in the same way in lncRNAs and codRNAs. In typical codRNAs there is a main translated ORF that covers a large fraction of the transcript, sometimes accompanied by short upstream ORFs in the 5′UTR (Chew et al., 2013). However, lncRNAs may potentially encode several short peptides (Ingolia et al., 2011). The minimum size of ORFs was set at 24 amino acids (75 nucleotides counting the STOP codon), as peptides of this size have been identified in genetic screen studies in humans (Hashimoto et al., 2001). To simplify the comparisons, we employed the same ORF size cut-off in all species. We also considered both a primary ORF, defined as the ORF with the largest number of ribosome profiling reads, as well as any additional non-overlapping ORFs that mapped to ribosome profiling reads (rest of ORFs).

In codRNAs, the primary ORF showed a nearly perfect degree of agreement with the annotated protein, indicating that it was an appropriate metric for the main translated product. Primary ORFs in lncRNAs typically occupied a shorter fraction of the transcript than in codRNAs (Figure 5A). The relative length of the ORF with respect to transcript length did not seem to be a strong predictor of ribosome association, as it did not help distinguish lncRNAs associated with ribosomes (lncRNA_ribo) to those not associated with ribosomes (lncRNA_noribo). In lncRNAs, most of the primary ORFs corresponded to proteins less than 100 amino acids long (Figure 5—figure supplement 1).

Figure 5 with 4 supplements see all

Download asset Open asset

Ribosome association in different transcript regions.

(A) Density plot of the relative length of the primary ORF in lncRNA_ribo and codRNA with respect to transcript length. For comparison data for the longest ORF in lncRNA_noribo is also shown (except for fruit fly due to insufficient data). (B) Box-plots of TE distribution in primary ORF, 5′UTR, and 3′UTR regions. The area within the box-plot comprises 50% of the data, and the line represents the median value. The analysis considered all transcripts with 5′UTR and 3′UTR longer than 30 nucleotides and >0.2 FPKM in all three regions. The number of transcripts was 1956 codRNA and 159 lncRNA_ribo in mouse, 3558 codRNA and 139 lncRNA_ribo in human, 5216 codRNA and 252 lncRNA_ribo in zebrafish, and 2019 codRNA and 33 lncRNA_ribo in Arabidopsis. (C) Box-plots of TE distribution in primary ORFs, rest of ORFs with ribosome profiling reads and non-ORF regions (interORF). The analysis considered all transcripts with at least two ORFs and more than 30 nucleotides interORF. The number of transcripts was 3264 codRNA and 204 lncRNA_ribo in mouse, 3104 codRNA and 168 lncRNA_ribo in human, 1646 codRNA and 212 lncRNA_ribo in zebrafish, and 1098 codRNA and 25 lncRNA_ribo in Arabidopsis. Fruit fly and yeast were not included in the last two analyses due to insufficient data (<8 lncRNA_ribo meeting the conditions).

https://doi.org/10.7554/eLife.03523.011

Next, we focused our attention on the differences between the primary ORF and the 5′UTR and 3′UTR regions in codRNAs and lncRNAs. We defined the 3′ untranslated region (3′UTR) as the sequence located immediately after the STOP codon of the primary ORF or the most downstream ORF associated with ribosomes. We used the same criteria to define the 5′UTR upstream from the initiation codon. In this analysis, we included all transcripts containing at least one ORF associated with ribosomes (the primary ORF) and sufficiently long UTR regions as to detect ribosome profiling reads (>30 nucleotides); insufficient data for fruit fly and yeast precluded the analysis for these species. In both codRNAs and lncRNAs, the 5′UTR showed a ribosome density (translational efficiency, TE) comparable to that of the primary ORF (Figure 5B). In contrast, the 3′UTR showed very little ribosome association and often we could not find a single read mapping to this region (31–91% of cases in codRNAs and 46–68% in lncRNAs). Using genes with a single isoform or considering only annotated transcripts produced similar results (Figure 5—figure supplements 2 and 3). We also controlled for expression level by dividing the data set in transcripts with low (0.5–2 FPKM) and high expression (>2 FPKM), and by sampling the codRNAs in such a way as to have a similar expression distribution as lncRNAs. The results were very similar to those obtained with the complete data set (Figure 5—figure supplement 4), indicating that the analysis is robust to transcript expression differences.

As transcripts may contain several ORFs, we performed a separate analysis in which we compared the translational efficiency of the primary ORF, any additional ORFs with mapped ribosome profiling reads, and the regions between ribosome-protected ORFs (interORF) (Figure 5C). InterORF regions showed little signal when compared to the primary ORF, both in codRNAs and lncRNAs (Wilcoxon test, p < 10⁻⁹ in human, mouse, and zebrafish, p < 0.05 in Arabidopsis, insufficient data for fruit fly and yeast precluded the analysis for these species). The data also indicated that ribosome binding is not always restricted to the primary ORF, especially in lncRNAs, as ribosome protection could sometimes be observed for additional ORFs.

Taken together, these results indicate that lncRNAs have ribosome profiling signatures consistent with translation, with a strong decrease of ribosome density in the 3′UTR but not the 5′UTR region, and preferential binding of ribosomes to the primary ORF. There exists the possibility that the translated peptides are degraded soon after being produced. However, we estimate that the percentage of cases that may undergo nonsense-mediated decay (NMD, see ‘Materials and methods’ for more details) is low, between 4.47 and 14.11% depending on the species. For comparison, the percentage for protein-coding transcripts showing the same patterns (including transcripts annotated as NMD in Ensembl) is between 0.34 and 13.33%.

lncRNAs are less conserved than codRNAs

Are the putatively translated ORF in lncRNAs conserved? We performed sequence similarity searches using BLASTP (E-value < 10⁻⁴) against all annotated coding transcripts in Ensembl, as well as against the primary ORFs in lncRNAs, for the six species studied here (Supplementary files 1D and 2B). The number of lncRNA_ribo with homologues in other species was remarkably low (0–15.6%) except for zebrafish (49.4%). In contrast, the majority of codRNAs had homologues in other species (>95% for vertebrates and fruit fly and 70–73% for Arabidopsis and yeast). After we discarded lncRNAs that showed cross-species conservation, association with ribosomes was still very prevalent (80.4% of mouse, 40.3% of human, and 22.1% of zebrafish lncRNAs were associated with ribosomes).

We also investigated whether the ribosome-associated ORFs in lncRNAs showed homology to annotated proteins in the same species. The values were very low for all the species (0–12.4%) except for zebrafish (47.5%). Therefore, in general lncRNAs are not truncated duplicated copies (pseudogenes). The case of zebrafish is an exception probably because of missing protein-coding annotations in this species.

Coding properties of ribosome-protected ORFs in lncRNAs

Subsequently, we compared the sequence coding properties of the primary ORF in lncRNAs with those in bona fide coding and non-coding sequences using a hexamer-based coding score (see ‘Materials and methods’). In all species the coding scores of the primary ORF in lncRNAs, while lower than that of codRNAs, were significantly higher than the coding score of ORFs in introns (Figure 6, Wilcoxon test lncRNA_ribo vs intron, human, mouse, zebrafish, and Arabidopsis p < 10⁻¹⁶; fruit fly and yeast p < 10⁻⁵). This clearly shows that ORFs in lncRNAs are more coding-like than random ORFs. We repeated the same comparison using 100 different randomly sampled intronic sequence sets, and in >95% of the cases, we obtained the same result. lncRNAs associated with ribosomes (lncRNA_ribo) showed higher coding scores than those not associated with ribosomes (lncRNA_noribo), even when we did not use the ribosome profiling information and compared the longest ORF in both types of transcripts (Figure 6—figure supplement 1). We reached similar conclusions when we restricted the analysis to annotated lncRNA transcripts (Figure 6—figure supplement 2), when we used ORFs from gene deserts as an alternative non-coding sequence set (differences with lncRNAs significant by Wilcoxon test, p < 10⁻¹⁶, see ‘Materials and methods’ for more details), and when we restricted the analysis to lncRNAs for which we did not find protein coding homologues in the other species studied (Figure 6—figure supplement 3). Because a high proportion of lncRNAs contained small ORFs, we repeated the comparison only considering transcripts with ORFs shorter than 100 amino acids to avoid any length biases, again obtaining similar results (Figure 6—figure supplement 4). The use of other coding scores, for example based on codon frequencies instead of hexamer frequencies or related metrics such as GC content produced consistent results (Figure 6—figure supplement 5; Supplementary file 1E).

Figure 6 with 6 supplements see all

Download asset Open asset

Coding scores in ORFs from different types of transcripts.

Intron: randomly selected intronic regions; lncRNA_noribo: lncRNAs not associated with ribosomes; lncRNA_ribo: lncRNAs associated with ribosomes; pseudogene: pseudogenes associated with ribosomes; codRNAne: coding transcripts encoding non-validated proteins associated with ribosomes; codRNAe: coding transcripts encoding experimentally validated proteins. The coding score was calculated as the log ratio of hexamer frequencies in coding vs intronic sequences. In lncRNA_noribo and introns, we considered the longest ORF and in the rest of transcripts the primary ORF. The Class ‘pseudogene’ was only included in species with more than 20 expressed pseudogenes with mapped ribosome profiling reads. The coding score of the primary ORF in lncRNAs (lncRNA_ribo) was significantly higher than the coding score in ORFs defined in introns (Wilcoxon test, human, mouse, zebrafish, and Arabidopsis p < 10⁻¹⁶; fruit fly and yeast p < 10⁻⁴, Wilcoxon test) and in lncRNA_ribo it was significantly higher than in lncRNA_noribo in four species (Wilcoxon test, human, mouse and zebrafish p < 10⁻⁵, and Arabidopsis p < 0.05). Transcripts from genes of different evolutionary age were taken from the literature (see manuscript text). The number of transcripts was 68 for rodent, 127/123 for mammalian (mouse/human as reference species), 11,203/13,423/9812 for metazoan (mouse/human/zebrafish), 162 for fish, 208 for Crucifera, 28 for *S. cerevisiae* and 84 for Saccharomyces. The youngest class of codRNAs displayed similar scores than lncRNA_ribo in mouse, zebrafish, and yeast (classes rodent, fish and *S. cerevisiae*, respectively), being only significantly higher in human and Arabidopsis (Wilcoxon test, p < 0.005; classes primate and Cruciferae). We did not analyze young genes in fruit fly due to lack of a suitable young set of codRNAs in this species.

https://doi.org/10.7554/eLife.03523.016

At the individual transcript level, a sizeable fraction of lncRNAs associated with ribosomes displayed significantly higher coding scores than expected for non-coding sequences (p < 0.05 in all 100 intronic random sets; data in Supplementary file 2C; examples in Figure 6—figure supplement 6). These transcripts are comprised of 143 human lncRNAs (35.5% of the lncRNAs, score > 0.0189), 137 mouse lncRNAs (35.1%, score > 0.0377), 379 zebrafish lncRNAs (52.1% score > 0.0095), 7 fruit fly lncRNAs (31.8%, score > −0.0483), 43 Arabidopsis lncRNAs (46.2%, score > −0.0202), and 5 yeast lncRNAs (83.3%, score > 0.03387). Annotated and novel lncRNAs were present in similar proportions in these sets, supporting the validity of our strategy of merging the two types of transcripts from the beginning. We also noted that the fraction of lncRNAs with coding homologues in other species increased in these sets. For example, whereas the proportion of total human lncRNA_ribo with homologues in other species was 15.6%, in the set with significant coding scores it was 29.3%. This number increased to 57.3% when we performed searches against the NCBI non-redundant peptide database ‘nr’, as some of the ORFs in lncRNAs are annotated as predicted peptides in this database.

If ORFs in lncRNAs are being translated this is likely to be a relatively recent evolutionary event, as many lncRNAs are lineage-specific (Pauli et al., 2012; Necsulea et al., 2014; our data). It is well established that proteins of different evolutionary age display distinct sequence properties, including different codon usage (Toll-Riera et al., 2009; Carvunis et al., 2012; Palmieri et al., 2014). We retrieved sets of annotated protein-coding transcripts of different evolutionary age from human, mouse, zebrafish, Arabidopsis, and yeast available from various studies (Ekman and Elofsson, 2010; Donoghue et al., 2011; Neme and Tautz, 2013) and expressed in the systems studied here. We found that the coding score was always lower in the youngest group than in older groups (Figure 6, Wilcoxon test, p < 0.05). Remarkably, the youngest codRNAs showed a very similar coding score distribution to lncRNAs (Figure 6). We obtained similar results when we discarded lncRNAs that had homologues in any of the other species (Figure 6—figure supplement 3).

We also collected information from young protein coding genes encoding experimentally verified proteins according to Swiss-Prot (Supplementary file 2D). We observed that these proteins were short and the ORF occupied a relatively small fraction of the transcript, features typically observed in lncRNAs. For example, the average size of proteins encoded by primate-specific transcripts was 148 amino acids and the average transcript coverage 47%. The coding score was remarkably low and again similar to that of lncRNAs (median 0.008 for primate-specific human transcripts, 0.046 for rodent-specific mouse transcripts, and 0.089 for yeast-specific coding transcripts).

Selection pressure signatures in ORFs associated with ribosomes

An important measure of the strength of purifying selection acting on a coding sequence is the ratio between the number of non-synonymous and synonymous single nucleotide polymorphisms (PN/PS). Given the nature of the genetic code, there are more possible non-synonymous mutations than synonymous mutations. Under neutrality (no purifying selection), the PN/PS ratio is expected to be approximately 2.89 (Nei and Gojobori, 1986).

Here, we applied the large amount of available polymorphism data for human, mouse, and zebrafish to compare the level of purifying selection in primary ORFs from codRNAs and lncRNAs (Figure 7; Supplementary file 1F). In general, human sequences showed higher PN/PS ratios than sequences from the other analyzed species, probably due to the presence of many slightly deleterious mutations segregating in the population (Eyre-Walker, 2002). However, despite the intrinsic differences between organisms, we observed the same general trends. First, the PN/PS was significantly lower in codRNAs than in lncRNAs (proportion test, p < 10⁻⁵), denoting stronger purifying selection in the former. Second, there was a very clear inverse relationship between the strength of purifying selection and the age of the gene (p < 10⁻¹⁵ between the youngest and rest of codRNAs in mouse and zebrafish), in agreement with previous studies (Liu et al., 2008; Cai et al., 2009). High PN/PS values were also observed in the subset of young genes encoding experimentally validated proteins in human (primate-specific transcripts median PN/PS of 3.10) and mouse (rodent-specific transcripts median PN/PS 1.42), confirming this tendency. Third, the distribution of PN/PS values in lncRNAs was very similar to that of young protein-coding genes. In human and mouse, there were no significant differences, and in the case of zebrafish the lncRNAs had even slightly lower PN/PS values than the fish-specific protein coding genes (p < 0.01).

Figure 7

Download asset Open asset

Selective pressure in ORFs from different types of transcripts.

PN/PS: ratio between the number of non-synonymous and synonymous single nucleotide polymorphisms (SNPs) in the complete set of primary ORFs for a given class of transcripts (in lncRNA_noribo the longest ORF was considered). In blue, data for different coding and non-coding transcript classes. In brown, data for different age codRNA classes. The bars represent the 95% confidence interval for the PN/PS value. For the species not shown there was not sufficient data to perform this analysis.

https://doi.org/10.7554/eLife.03523.023

Discussion

Here, we analyzed the patterns of ribosome protection in polyA+ transcripts from cells belonging to six different eukaryotic species. Among the expressed transcripts, we identified many lncRNAs in the different species. The vast majority of transcripts annotated as coding showed association with ribosomes (>92% in all species). Remarkably, a very large number of transcripts annotated as long non-coding RNA (lncRNAs) also showed such association (30–82% depending on the data set). Considering that lncRNAs are typically much shorter and expressed at lower levels than codRNAs, which may hinder the identification of ribosome association, this is a very significant fraction. In addition, the patterns of ribosome protection along the transcript are similar to those of protein-coding genes. Therefore, many lncRNAs appear to be scanned by ribosomes and are likely to translate short peptides.

Long non-coding RNAs are classified as such in databases because, according to a number of criteria, they are unlikely to encode functional proteins. These criteria include the lack of a long ORF, the absence of amino acid sequence conservation, and the lack of known protein domains (Harrow et al., 2012). Moreover, we expect lncRNAs not to have matches to proteomics databases, as this should classify them as coding. Annotated lncRNAs are typically longer than 200 nucleotides because this is the cutoff size normally implemented to differentiate them from other RNA classes such as microRNAs and small nuclear RNAs. In practice, it is difficult to classify a transcript as coding or non-coding on the basis of the ORF size (Dinger et al., 2008). Some true coding sequences may be quite small, and by chance alone non-coding transcripts may have relatively long ORFs. The majority of lncRNAs contain ORFs longer than 24 amino acids, which can potentially correspond to real proteins. Short proteins are more difficult to detect than longer ones and consequently they are probably underestimated in databases. In recent years, the use of comparative genomics (Frith et al., 2006; Ladoukakis et al., 2011; Hanada et al., 2013), proteomics (Slavoff et al., 2013; Vanderperre et al., 2013; Ma et al., 2014), and a combination of evolutionary conservation and ribosome profiling data (Crappé et al., 2013; Bazzini et al., 2014) have shown that the number of short proteins is probably much higher than previously suspected (Andrews and Rothnagel, 2014). In yeast, gene deletion experiments have provided evidence of functionality for short open reading frames (sORFs < 100 amino acids) (Kastenmayer et al., 2006); in zebrafish, several newly discovered sORFs appear to be involved in embryonic development (Pauli et al., 2014) and other examples exist in insects (Magny et al., 2013) and humans (Lee et al., 2013; Slavoff et al., 2014). In many cases, the transcripts containing sORFs will be classified as non-coding, especially if the ORF is not well conserved across different species.

One approach to identify potential coding transcripts is ribosome profiling (Ingolia et al., 2009), which has been used to study translation of proteins in a wide range of organisms (Guo et al., 2010; Ingolia et al., 2011; Brar et al., 2012; Michel et al., 2012; Chew et al., 2013; Dunn et al., 2013; Huang et al., 2013; Artieri and Fraser, 2014; Bazzini et al., 2014; Juntawong et al., 2014; McManus et al., 2014; Vasquez et al., 2014). In several of these studies it has been noted that lncRNAs can be protected by ribosomes (Ingolia et al., 2011; Chew et al., 2013; Bazzini et al., 2014; Juntawong et al., 2014). However, there is no consensus on whether the observed patterns are consistent with translation. For example in the original analysis of mouse stem cells, which we reanalyzed here, it was reported that many lncRNAs were polycistronic transcripts encoding short proteins (Ingolia et al., 2011), but in another paper where the same data were processed in a different way, they concluded that lncRNAs were unlikely to be protein-coding (Guttman et al., 2013). A zebrafish ribosome profiling study reported resemblance between lncRNAs and 5′leaders of coding RNAs; the authors suggested that translation may play a role in lncRNA regulation (Chew et al., 2013). Nevertheless, in the same study dozens of lncRNAs were proposed to be bona fide protein-coding transcripts. In Arabidopsis, the translational efficiency values of highly expressed lncRNAs (>5 FPKM) were similar to those of coding RNAs and some lncRNAs had profiles consistent with initiation and termination of translation (Juntawong et al., 2014). Finally, using yeast data, Wilson and Masel. (2011) found many cases of non-coding transcripts bound to ribosomes and suggested that this facilitates the evolution of novel protein-coding genes from non-coding sequences.

The disparity of results obtained in different systems motivated us to retrieve the original data and perform exactly the same analyses for six different species. As lncRNA catalogues are still very incomplete for most species, we also defined sets of novel lncRNAs using the RNA-seq sequencing reads for de novo transcript assembly. We discovered many novel, non-annotated, lncRNAs, especially in zebrafish, mouse, and fruit fly (Table 2). After the analysis of the ribosome profiling data, the same general picture emerged for the different biological systems, indicating that we are detecting very fundamental properties. In transcripts classified as lncRNAs, the ribosome profiling reads tend to cover a smaller fraction of the transcript than in typical codRNAs, in agreement with a shorter relative size of the ORF accumulating the largest number of ribosome profiling reads (primary ORF). We also find that the translational efficiency of regions corresponding to the primary ORF is much higher than that of 3′UTRs, both in codRNAs and lncRNAs, consistent with translation of the transcripts. Furthermore, the primary ORF of lncRNAs showed significantly higher coding score than the longest ORF extracted from randomly selected non-coding regions.

lncRNAs often contain several potentially translated ORFs (Ingolia et al., 2011). Transcripts encoding multiple short proteins have been reported in insects (Savard et al., 2006) and could be common in other species as well (Tautz, 2009). One such candidate is AT1G34418.1 in Arabidopsis, an annotated lncRNA which contains a primary ORF followed by two instances of a 12 amino acid ORF also covered by ribosome profiling reads (Figure 6—figure supplement 6). This case is reminiscent of the gene pri in fruit fly, which regulates tarsal development (Galindo et al., 2007) and translates several small redundant ORFs (Kondo et al., 2007).

lncRNAs are poorly conserved across species and so, if translated, they will produce species- or lineage-specific proteins. Recently evolved proteins are markedly different from widely distributed ancient proteins; they are shorter, subject to weaker selective constraints and expressed at lower levels (Albà and Castresana, 2005; Cai et al., 2009; Liu et al., 2010; Donoghue et al., 2011; Carvunis et al., 2012; Xie et al., 2012; Wissler et al., 2013; Neme and Tautz, 2014). Here for the first time, we have compared the properties of the ORFs in lncRNAs associated with ribosomes with the properties of annotated, and in some cases experimentally validated, young protein-coding genes. lncRNAs and young protein-coding transcripts are virtually indistinguishable regarding their coding score and ORF selective constraints (Figures 6 and 7), which is consistent with the idea that many lncRNAs encode new peptides.

Although it is unclear how many of these peptides are functional, the data indicate that at least a fraction of them may be functional. Sequences that translate functional proteins are expected to display signs of selection related to preferential usage of certain amino acids and codons. This can be used to differentiate between coding and non-coding entities, especially in the absence of cross-species conservation, as is the case of many lncRNAs. About 35–40% of primary ORFs in human and mouse lncRNAs displayed coding scores that were significantly higher than those expected for non-coding sequences, making them excellent candidates for translating functional proteins. In fact, five human lncRNAs associated with ribosomes that exhibited high coding scores in our study were re-annotated as protein-coding transcripts in a subsequent Ensembl gene annotation release (version 75, Supplementary file 2C). Gene knock-out experiments in fly have discovered that young proteins, even if rapidly evolving, are often essential for the organism and can cause important defects when deleted (Chen et al., 2010; Reinhardt et al., 2013). Similarly, some peptides translated from lncRNAs may have important cellular functions yet to be discovered.

lncRNAs tend to be expressed at much lower levels than typical codRNAs, so, everything else being equal, the amount of translated peptide is also expected to be smaller. It may be that some of these peptides are not functional, but their translation does not produce a large enough deleterious effect for them to be eliminated via selection. Pseudogenes also showed extensive association with ribosomes in our study, indicating that the translation machinery is probably not very selective or that some pseudogenes produce functional proteins. This question may be worth revisiting, as a recent proteomics study has also found that dozens of human pseudogenes produce peptides (Kim et al., 2014).

The data also indicate that a fraction of lncRNAs have not acquired the capacity to be translated. Depending on the experiment analyzed, a number of lncRNAs did not show any significant association with ribosomes. As previously discussed, this is probably affected by a lack of sensitivity; it is also true that the lncRNAs not associated with ribosomes tended to show lower coding scores than lncRNAs associated with ribosomes, even when we did not use the ribosome profiling data and simply compared the longest ORF in both kinds of transcripts.

Recently, it has been reported that human-specific protein-coding genes are often related to non-coding transcripts in macaque, pointing to a non-coding origin for many newly evolved proteins (Xie et al., 2012). More generally, one may view de novo protein-coding gene evolution as a continuum from non-functional genomic sequences to fully-fledged protein-coding genes (Albà and Castresana, 2005; Toll-Riera et al., 2009; Carvunis et al., 2012). Therefore, many lncRNAs could be in intermediate states in this process, their pervasive translation serving as the building material for the evolution of new proteins. It may be difficult to obtain functional proteins from completely random ORFs (Jacob, 1977), but the effect of natural selection preventing the production of toxic peptides (Wilson and Masel, 2011), and the high number of transcripts expressed in the genome, may facilitate this process.

Materials and methods

Sequencing and mapping of reads

Request a detailed protocol

We downloaded the original data from Gene Expression Omnibus (GEO) for six different ribosome profiling experiments that had both ribosome footprinting and polyA+ RNA-seq sequencing reads: mouse (M. musculus) (Ingolia et al., 2011), human (H. sapiens, HeLa cells) (Guo et al., 2010), zebrafish (D. rerio) (Chew et al., 2013), fruit fly (D. melanogaster) (Dunn et al., 2013), Arabidopsis (A. thaliana) (Juntawong et al., 2014), and yeast (S. cerevisiae) (McManus et al., 2014). We retrieved genome sequences and gene annotations from Ensembl v.70 and Ensembl Plants v.21 (Flicek et al., 2012).

Raw ribosome and RNA-seq sequencing reads underwent quality filtering using Condentri (v.2.2) (Smeds and Künstner, 2011) with the following settings (-hq=30 –lq=10). Adaptors described in the original publications were trimmed from filtered reads if at least five nucleotides of the adaptor sequence matched the end of each read. In zebrafish, reads from different developmental stages were pooled to improve read coverage. In all experiments, reads below 25 nucleotides were not considered. Clean ribosome short reads were filtered by mapping them to the corresponding species reference RNA (rRNA) using the Bowtie2 short-read alignment program (v. 2.1.0) (Langmead et al., 2009). Unaligned reads were aligned to a genomic reference genome with Bowtie2 allowing one mismatch in the first 'seed' region (the length of this region was selected according to the descriptions provided in each individual experiment). RNA-seq short reads were mapped with Tophat (v. 2.0.8) (Kim et al., 2013) to the corresponding reference genome. We allowed two mismatches in the alignment with the exception of zebrafish, for which we allowed three mismatches since the reads were significantly longer. Multiple mapping was allowed unless specifically stated.

Defining a set of expressed transcripts

Request a detailed protocol

Expressed transcripts were assembled using Cufflinks (v 2.2.0) (Trapnell et al., 2010). We initially considered a transcript as expressed if it was covered by at least four reads and its abundance was higher than 1% of the most abundant isoform of the gene. We also discarded assembled transcripts in which >20% of reads were mapped to several locations in the genome. Gene annotation files from Ensembl (gtf format, v.70) were provided to Cufflinks to guide the reconstruction of already annotated transcripts. Annotated transcripts were divided into coding RNAs and long non-coding RNAs (lncRNAs), we only considered lncRNAs that were not part of genes with coding transcripts. Novel isoforms corresponding to annotated loci were not analyzed. Transcripts that did not match or overlapped annotated genes were labeled 'novel’ lncRNAs. We used a length threshold of 200 nucleotides to select novel long non-coding RNAs, as in ENCODE annotations (Djebali et al., 2012).

Strand directionality of multiexonic transcripts was inferred using the splice site consensus sequence. We only considered monoexonic transcripts in the case of Arabidopsis and yeast, provided the transcripts were intergenic.

The inclusion of novel lncRNAs made it possible to perform analyses of species for which there are very few annotated lncRNAs. Annotations of UTR regions in yeast genes were missing from Ensembl because of the variability observed in transcription start sites (TSS). However, we downloaded a set of available 5′ and 3′UTRs obtained by deep transcriptomics (Nagalakshmi et al., 2008) and added them to the existing yeast Ensembl annotations before assembling the transcriptome.

Coding transcripts were classified into different subclasses depending on the existing annotations: (a) Annotated protein-coding transcripts (codRNA), (b) Annotated transcripts with surveillance mechanisms (nonsense mediated decay, nonstop mediated decay, and no-go decay), (c) Annotated pseudogenes. We removed protein-coding transcripts in which annotated coding sequences (CDS) are still incomplete.

Subsequently, we defined an additional subset of annotated protein-coding transcripts with well-established coding properties based on the existence of an experimentally verified protein in Swiss-Prot for the gene (‘evidence at protein level’, downloaded 29 October 2013, UniProt Consortium, 2014). These transcripts were labeled codRNAe. The rest of annotated protein-coding transcripts were abbreviated codRNAne. In zebrafish, most proteins are not yet experimentally validated; and therefore, we generated a single group.

We built a data set of human lncRNAs with described non-coding functions using data obtained from several recent reviews (Ponting et al., 2009; Ulitsky and Bartel, 2013; Fatica and Bozzoni, 2014). This data set included 29 different genes (Supplementary file 2A).

We used cufflinks to estimate the expression level of a transcript in FPKM units (Fragments Per Kilobase per total Million mapped reads). We used a threshold of >0.5 FPKM except in yeast, in which the average read coverage per transcript was much higher than in the other species and the threshold was set up at >5 FPKM. These thresholds guaranteed detection of ribosome association for the majority of expressed coding transcripts (>92%), while yielding proportions of transcripts comparable to those reported in the original papers.

Definition of potential open reading frames (ORFs) and other transcript regions

Request a detailed protocol

We predicted all possible open reading frames (ORFs) in the expressed transcripts. We defined an ORF as any sequence starting with an AUG codon and finishing with a stop codon (TAA, TAG, or TGA), and at least 75 nucleotides long. This would correspond to a 24 amino acid protein, which is the size of the smallest complete human polypeptide found in genetic screen studies (Hashimoto et al., 2001). This ORF definition will not detect non-canonical ORFs with different start or stop codons, although these ORFs often correspond to regulatory ORFs (uORFs) in the 5′UTR region. In monoexonic transcripts (Arabidopsis and yeast), we considered all six possible different frames.

We also defined each transcript 5′UTR as the region between the transcription start site and the AUG codon from the left-most predicted ORF, and the 3′UTR the region from the stop codon in the right-most predicted ORF to the transcript end. UTRs with lengths below 30 nucleotides were not analyzed since ribosome reads could not be properly aligned to these regions due to their small size. Regions between two consecutive putatively translated ORFs (with ribosome profiling reads) were termed interORF. We only analyzed this region when the length of the interORF sequence in a transcript was 30 nucleotides or longer.

We defined a set of bona fide non-coding sequences sampled from intronic fragments. We used the introns of the genes expressed in each experiment, provided they did not overlap to any exons from other overlapping genes. We randomly selected fragments in such a way as to simulate the same size distribution as in the complete set of expressed transcripts. We performed 100 simulations of intron sampling to ensure the results were robust to the randomization process. We selected the longest ORF in each intronic fragment for the calculation of coding scores and GC content.

Association with ribosomes and translational efficiency (TE)

Request a detailed protocol

We computed the number of reads overlapping each feature of interest (transcript, UTR, ORF, and interORF) using the BEDTools package (v. 2.16.2) (Quinlan and Hall, 2010). We only considered ribosome reads in which more than half of their length spanned the considered region. This was considered appropriate because the ribosome P-site is usually detected at the central region of the read, with only slight variations depending on the experimental setting. We set up a minimum ribosome profiling coverage of 75 nucleotides per transcript to define the transcript or transcript region (e.g., ORF) as associated with ribosomes. This is significantly longer than the length of the ribosome profiling sequencing reads (36–51 nucleotides) and is consistent with the minimum ORF length threshold.

The translational efficiency (TE) of a sequence has been previously defined as the density of ribosome profiling (RPF) reads normalized by transcript abundance (Ingolia et al., 2009). We calculated it by dividing the FPKM of the ribosome profiling experiment by the FPKM of the RNA-seq experiment. In transcripts, we also obtained the maximum TE by dividing the sequence in 90 nucleotide windows and selecting the window with the highest TE value.

In order to have a null model of ribosome binding against which to compare the ribosome profiling signal in codRNA and lncRNA transcripts, we extracted annotated 3′ untranslated regions (3′UTRs) from codRNAs in genes in which UTRs did not overlap with coding sequences from other transcripts, and by randomly selecting 3′UTRs with a minimum length of 30 nucleotides, we built a set of 3′UTR sequences with the same size distribution as the complete transcripts. For each species, we calculated the TE values for codRNAs, lncRNA, and 3′UTR sequences. We used the empirical distribution of TE values in the 3′UTRs to calculate the number of codRNAs and lncRNAs that showed significantly higher TE value than expected under the null model at a p < 0.05. These corresponded to TE values higher than 0.1043 in mouse, 0.2556 in human, 0.0004 in zebrafish, 0.7164 in fruit fly, 0.1800 in Arabidopsis, and 0.0527 in yeast.

We defined the primary ORF in a transcript as the ORF with the largest number of RPF reads with respect to the total RPF reads covering the transcript. The rest of ORFs ≥24 amino acids associated with ribosomes were considered as well; when two or more ORFs overlapped, we selected the longest one. In ORFs, interORFs, and UTRs, we computed the TE along the whole region. For comparing the TE in different regions, we only considered transcripts in which all regions had >0.2 FPKM.

Peptide evidence in existing proteomics databases

Request a detailed protocol

We downloaded all peptide sequences from the PeptideAtlas database: 338,013 human peptides (August 2013), 101,695 mouse peptides (June 2013), and 86,836 yeast peptides (March 2013). We investigated if the number of ribosome-associated protein-coding transcripts that matched the peptides in these databases varied with protein length. We omitted this analysis in zebrafish and Arabidopsis due to the lack of sufficiently large peptide databases. The matches were identified using BLASTP searches (v. 2.2.28+) (Altschul et al., 1997). We selected perfect matches only.

Evidence of nonsense mediated decay in ORFs

Request a detailed protocol

We investigated how many primary ORFs may be candidates for being regulated via non-sense mediated decay (NMD) surveillance pathways, whose main function is to eliminate transcripts containing premature stop codons. We defined NMD candidates as all cases in which the stop-codon from a predicted ORF was located ≥55 nucleotides upstream of a splice junction site, provided the stop-codon was not in the terminal exon (Scofield et al., 2007). This mechanism is well characterized in protein-coding genes and it has been proposed as a way to degrade non-functional peptides translated in lncRNAs (Tani et al., 2013). Other surveillance mechanisms, such as non-stop-mediated decay or no-go decay, were not considered since all predicted ORFs finished at a stop codon, and we did not analyze RNA secondary structures.

Defining ages of protein-coding transcripts

Request a detailed protocol

We utilized existing gene age classifications in human, mouse, and zebrafish (Neme and Tautz, 2013) to identify young gene classes: human primate-specific (∼55.8 My), mouse rodent-specific (∼61.7 My), human and mouse mammalian-specific (∼225 My), zebrafish actinopterygii-specific (∼420 My) (abbreviated fish) and metazoan (∼800 My). In yeast, we used predefined genes specific to S. cerevisiae (1–3 My)(abbreviated S. cerevisiae) and the Saccharomyces group (∼100 My) (Ekman et al., 2007). In Arabidopsis, we retrieved Cruciferae(Brassicaceae)-specific genes (20–40 My) (Donoghue et al., 2011). These genes are believed to have arisen primarily by de novo mechanisms, as no homologies in other species have been detected despite the fact that many closely related genomes have now been sequenced.

Defining gene desert sequences

Request a detailed protocol

In humans, we obtained a set of gene desert sequences as defined in Ovcharenko et al. (2005). We selected two stable and two flexible gene deserts (the definition depends on the degree of conservation in other species). They belonged to chromosome 4 (flexible located in coordinates 136,000,001–138,000,000; stable located in coordinates 180,000,001–182,000,010) that has a high number of gene deserts; and chromosome 17 (flexible located in coordinates 51,100,001–51,900,000; stable located in coordinates 69,300,001–70,000,000) that has a high gene density. We ensured that no protein-coding genes were annotated in subsequent Ensembl versions in these regions. We predicted all possible ORFs in these regions and evaluated their coding score and GC content.

ORF coding score

Request a detailed protocol

The examination of nucleotide hexamer frequencies has been shown to be a powerful way to distinguish between coding and non-coding sequences (Sun et al., 2013; Wang et al., 2013). We computed one coding score (CS) per hexamer:

{C S}_{h e x a m e r (i)} = \log (\frac{{f r e q}_{c o d i n g} (h e x a m e r (i))}{{f r e q}_{n o n - c o d i n g} (h e x a m e r (i))}) .

The coding hexamer frequencies were obtained from the open reading frame of all transcripts in a species encoding experimentally validated proteins (except for zebrafish in which all protein-coding transcripts were considered). The non-coding hexamer frequencies were calculated using the longest ORF in intronic regions, which were selected randomly from expressed protein-coding genes. Next, we used the following statistic to measure the coding score of an ORF:

{C S}_{O R F} = \frac{\sum_{i = 1}^{i = n} {C S}_{h e x a m e r (i)}}{n},

where i is each sequence hexamer in the ORF, and n the number of hexamers considered.

The hexamers were calculated in steps of three nucleotides in frame (dicodons). We did not consider the initial hexamers containing a Methionine or the last hexamers containing a STOP codon, since they are not informative. Given that all ORFs were at least 75 nucleotides long the minimum value for n was 22.

We calculated other related statistics in a similar way. This included using an equiprobable hexamer distribution instead of the distribution obtained from non-coding sequences, or using codon frequencies instead of hexamer frequencies. These statistics showed somewhat lower power to distinguish between coding and non-coding sequences. As a complementary measure, we quantified the GC content in different coding and non-coding transcripts and ORFs.

Sequence similarity searches

Request a detailed protocol

We employed BLASTP with an E-value cutoff of 10⁻⁴ to compare the amino acid sequences encoded by ORFs in different kinds of transcripts. We enabled SEG to mask low complexity regions in protein sequences before doing the homology searches. We also searched for homologues in the NCBI non-redundant (nr) protein database (Pruitt et al., 2014). BLAST sequence similarity search programs are based on gapped local alignments (Altschul et al., 1997).

Analysis of single nucleotide polymorphisms

Request a detailed protocol

We downloaded all available single-nucleotide polymorphisms (SNPs) from dbSNP (Sherry et al., 2001) for human (∼50 million), mouse (∼64.2 million), and zebrafish (∼1.3 million). We did not consider other species due to insufficient data for the analysis. We classified SNPs in ORFs as non-synonymous (PN, amino acid altering) and synonymous (PS, not amino acid altering). We computed the PN/PS ratio in each sequence data set by using the sum of PN and PS in all sequences. The estimation of PN/PS ratios of individual sequences was in general not reliable due to lack of sufficient SNP data. We obtained confidence intervals using the proportion test in R (see below).

Statistical data analyses

Request a detailed protocol

The analysis of the data, including generation of plots and statistical tests, was done with R (R Development Core Team, 2010).

Additional files

Request a detailed protocol

Supplementary file 1 contains additional Tables and Supplementary file 2 data subsets. The genomic coordinates of all transcripts used in this study (GTF files) and the amino acid sequences corresponding to primary ORFs in lncRNA with coding scores significant at p < 0.05 (FASTA files) are available at figshare (http://dx.doi.org/10.6084/m9.figshare.1114969).

Data availability

The following previously published data sets were used

(2011) Ribosome Profiling of Mouse Embryonic Stem Cells Reveals the Complexity of Mammalian Proteomes
Publicly available at NCBI Gene Expression Omnibus.

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE30839
(2010) Mammalian microRNAs predominantly act to decrease target mRNA levels
Publicly available at NCBI Gene Expression Omnibus.

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE22004
1. Pauli A
2. Valen E
3. Lin MF
4. Garber M
5. Vastenhouw NL
6. Levin JZ
7. Sandelin A
8. Rinn JL
9. Regev A
10. Schier AF
(2011) Comprehensive identification of long non-coding RNAs expressed during zebrafish embryogenesis
Publicly available at NCBI Gene Expression Omnibus.

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE32900
1. Chew G
2. Pauli A
3. Valen E
4. Schier A
(2013) Ribosome Profiling over a Zebrafish Developmental Timecourse
Publicly available at NCBI Gene Expression Omnibus.

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE46512
1. Dunn JG
2. Weissman JS
(2013) Ribosome profiling reveals pervasive and regulated stop codon readthrough in Drosophila melanogaster
Publicly available at NCBI Gene Expression Omnibus.

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE49197
(2013) High-resolution mapping of ribosome footprints from Arabidopsis thaliana
Publicly available at NCBI Gene Expression Omnibus.

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE50597
(2014) Ribosome profiling revelas post-transcriptional buffering of divergent gene expression in yeast
Publicly available at NCBI Gene Expression Omnibus.

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52119

References

1. Albà MM
2. Castresana J
(2005) Inverse relationship between evolutionary rate and age of mammalian genes
Molecular Biology and Evolution 22:598–606.

https://doi.org/10.1093/molbev/msi045
- Google Scholar
1. Altschul SF
2. Madden TL
3. Schäffer AA
4. Zhang J
5. Zhang Z
6. Miller W
7. Lipman DJ
(1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
Nucleic Acids Research 25:3389–3402.

https://doi.org/10.1093/nar/25.17.3389
- Google Scholar
1. Andrews SJ
2. Rothnagel JA
(2014) Emerging evidence for functional peptides encoded by short open reading frames
Nature Reviews Genetics 15:193–204.

https://doi.org/10.1038/nrg3520
- Google Scholar
1. Artieri CG
2. Fraser HB
(2014) Evolution at two levels of gene expression in yeast
Genome Research 24:411–421.

https://doi.org/10.1101/gr.165522.113
- Google Scholar
1. Aspden JL
2. Eyre-Walker YC
3. Philips RJ
4. Amin U
5. Mumtaz MA
6. Brocard M
7. Couso JP
(2014) Extensive translation of small ORFs revealed by Poly-Ribo-Seq
eLife e03528.

https://doi.org/10.7554/eLife.03528
- Google Scholar
1. Bazzini AA
2. Johnstone TG
3. Christiano R
4. Mackowiak SD
5. Obermayer B
6. Fleming ES
7. Vejnar CE
8. Lee MT
9. Rajewsky N
10. Walther TC
11. Giraldez AJ
(2014) Identification of small ORFs in vertebrates using ribosome footprinting and evolutionary conservation
The EMBO Journal 33:981–993.

https://doi.org/10.1002/embj.201488411
- Google Scholar
1. Brar GA
2. Yassour M
3. Friedman N
4. Regev A
5. Ingolia NT
6. Weissman JS
(2012) High-resolution view of the yeast meiotic program revealed by ribosome profiling
Science 335:552–557.

https://doi.org/10.1126/science.1215110
- Google Scholar
1. Brockdorff N
2. Ashworth A
3. Kay GF
4. McCabe VM
5. Norris DP
6. Cooper PJ
7. Swift S
8. Rastan S
(1992) The product of the mouse Xist gene Is a 15 Kb inactive X-specific transcript containing no conserved ORF and located in the nucleus
Cell 71:515–526.

https://doi.org/10.1016/0092-8674(92)90519-I
- Google Scholar
1. Cabili MN
2. Trapnell C
3. Goff L
4. Koziol M
5. Tazon-Vega B
6. Regev A
7. Rinn JL
(2011) Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses
Genes & Development 25:1915–1927.

https://doi.org/10.1101/gad.17446611
- Google Scholar
1. Cai J
2. Zhao R
3. Jiang H
4. Wang W
(2008) De novo origination of a new protein-coding gene in Saccharomyces cerevisiae
Genetics 179:487–496.

https://doi.org/10.1534/genetics.107.084491
- Google Scholar
1. Cai JJ
2. Borenstein E
3. Chen R
4. Petrov DA
(2009) Similarly strong purifying selection acts on human disease genes of all evolutionary ages
Genome Biology and Evolution 1:131–144.

https://doi.org/10.1093/gbe/evp013
- Google Scholar
1. Carninci P
2. Kasukawa T
3. Katayama S
4. Gough J
5. Frith MC
6. Maeda N
7. Oyama R
8. Ravasi T
9. Lenhard B
10. Wells C
11. Kodzius R
12. Shimokawa K
13. Bajic VB
14. Brenner SE
15. Batalov S
16. Forrest AR
17. Zavolan M
18. Davis MJ
19. Wilming LG
20. Aidinis V
21. Allen JE
22. Ambesi-Impiombato A
23. Apweiler R
24. Aturaliya RN
25. Bailey TL
26. Bansal M
27. Baxter L
28. Beisel KW
29. Bersano T
30. Bono H
31. Chalk AM
32. Chiu KP
33. Choudhary V
34. Christoffels A
35. Clutterbuck DR
36. Crowe ML
37. Dalla E
38. Dalrymple BP
39. de Bono B
40. Della Gatta G
41. di Bernardo D
42. Down T
43. Engstrom P
44. Fagiolini M
45. Faulkner G
46. Fletcher CF
47. Fukushima T
48. Furuno M
49. Futaki S
50. Gariboldi M
51. Georgii-Hemming P
52. Gingeras TR
53. Gojobori T
54. Green RE
55. Gustincich S
56. Harbers M
57. Hayashi Y
58. Hensch TK
59. Hirokawa N
60. Hill D
61. Huminiecki L
62. Iacono M
63. Ikeo K
64. Iwama A
65. Ishikawa T
66. Jakt M
67. Kanapin A
68. Katoh M
69. Kawasawa Y
70. Kelso J
71. Kitamura H
72. Kitano H
73. Kollias G
74. Krishnan SP
75. Kruger A
76. Kummerfeld SK
77. Kurochkin IV
78. Lareau LF
79. Lazarevic D
80. Lipovich L
81. Liu J
82. Liuni S
83. McWilliam S
84. Madan Babu M
85. Madera M
86. Marchionni L
87. Matsuda H
88. Matsuzawa S
89. Miki H
90. Mignone F
91. Miyake S
92. Morris K
93. Mottagui-Tabar S
94. Mulder N
95. Nakano N
96. Nakauchi H
97. Ng P
98. Nilsson R
99. Nishiguchi S
100. Nishikawa S
101. Nori F
102. Ohara O
103. Okazaki Y
104. Orlando V
105. Pang KC
106. Pavan WJ
107. Pavesi G
108. Pesole G
109. Petrovsky N
110. Piazza S
111. Reed J
112. Reid JF
113. Ring BZ
114. Ringwald M
115. Rost B
116. Ruan Y
117. Salzberg SL
118. Sandelin A
119. Schneider C
120. Schönbach C
121. Sekiguchi K
122. Semple CA
123. Seno S
124. Sessa L
125. Sheng Y
126. Shibata Y
127. Shimada H
128. Shimada K
129. Silva D
130. Sinclair B
131. Sperling S
132. Stupka E
133. Sugiura K
134. Sultana R
135. Takenaka Y
136. Taki K
137. Tammoja K
138. Tan SL
139. Tang S
140. Taylor MS
141. Tegner J
142. Teichmann SA
143. Ueda HR
144. van Nimwegen E
145. Verardo R
146. Wei CL
147. Yagi K
148. Yamanishi H
149. Zabarovsky E
150. Zhu S
151. Zimmer A
152. Hide W
153. Bult C
154. Grimmond SM
155. Teasdale RD
156. Liu ET
157. Brusic V
158. Quackenbush J
159. Wahlestedt C
160. Mattick JS
161. Hume DA
162. Kai C
163. Sasaki D
164. Tomaru Y
165. Fukuda S
166. Kanamori-Katayama M
167. Suzuki M
168. Aoki J
169. Arakawa T
170. Iida J
171. Imamura K
172. Itoh M
173. Kato T
174. Kawaji H
175. Kawagashira N
176. Kawashima T
177. Kojima M
178. Kondo S
179. Konno H
180. Nakano K
181. Ninomiya N
182. Nishio T
183. Okada M
184. Plessy C
185. Shibata K
186. Shiraki T
187. Suzuki S
188. Tagami M
189. Waki K
190. Watahiki A
191. Okamura-Oho Y
192. Suzuki H
193. Kawai J
194. Hayashizaki Y
195. FANTOM Consortium
196. RIKEN Genome Exploration Research Group and Genome Science Group (Genome Network Project Core Group)
(2005) The transcriptional landscape of the mammalian genome
Science 309:1559–1563.

https://doi.org/10.1126/science.1112014
- Google Scholar
1. Carvunis AR
2. Rolland T
3. Wapinski I
4. Calderwood MA
5. Yildirim MA
6. Simonis N
7. Charloteaux B
8. Hidalgo CA
9. Barbette J
10. Santhanam B
11. Brar GA
12. Weissman JS
13. Regev A
14. Thierry-Mieg N
15. Cusick ME
16. Vidal M
(2012) Proto-genes and de novo gene birth
Nature 487:370–374.

https://doi.org/10.1038/nature11184
- Google Scholar
1. Chen S
2. Zhang YE
3. Long M
(2010) New genes in Drosophila quickly become essential
Science 330:1682–1685.

https://doi.org/10.1126/science.1196380
- Google Scholar
1. Chew GL
2. Pauli A
3. Rinn JL
4. Regev A
5. Schier AF
6. Valen E
(2013) Ribosome profiling reveals resemblance between long non-coding RNAs and 5’ leaders of coding RNAs
Development 140:2828–2834.

https://doi.org/10.1242/dev.098343
- Google Scholar
(2013) Combining in silico prediction and ribosome profiling in a genome-wide search for novel putatively coding sORFs
BMC Genomics 14:648.

https://doi.org/10.1186/1471-2164-14-648
- Google Scholar
1. Derrien T
2. Johnson R
3. Bussotti G
4. Tanzer A
5. Djebali S
6. Tilgner H
7. Guernec G
8. Martin D
9. Merkel A
10. Knowles DG
11. Lagarde J
12. Veeravalli L
13. Ruan X
14. Ruan Y
15. Lassmann T
16. Carninci P
17. Brown JB
18. Lipovich L
19. Gonzalez JM
20. Thomas M
21. Davis CA
22. Shiekhattar R
23. Gingeras TR
24. Hubbard TJ
25. Notredame C
26. Harrow J
27. Guigó R
(2012) The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression
Genome Research 22:1775–1789.

https://doi.org/10.1101/gr.132159.111
- Google Scholar
(2008) Differentiating protein-coding and noncoding RNA: challenges and ambiguities
PLOS Computational Biology 4:e1000176.

https://doi.org/10.1371/journal.pcbi.1000176
- Google Scholar
1. Djebali S
2. Davis CA
3. Merkel A
4. Dobin A
5. Lassmann T
6. Mortazavi A
7. Tanzer A
8. Lagarde J
9. Lin W
10. Schlesinger F
11. Xue C
12. Marinov GK
13. Khatun J
14. Williams BA
15. Zaleski C
16. Rozowsky J
17. Röder M
18. Kokocinski F
19. Abdelhamid RF
20. Alioto T
21. Antoshechkin I
22. Baer MT
23. Bar NS
24. Batut P
25. Bell K
26. Bell I
27. Chakrabortty S
28. Chen X
29. Chrast J
30. Curado J
31. Derrien T
32. Drenkow J
33. Dumais E
34. Dumais J
35. Duttagupta R
36. Falconnet E
37. Fastuca M
38. Fejes-Toth K
39. Ferreira P
40. Foissac S
41. Fullwood MJ
42. Gao H
43. Gonzalez D
44. Gordon A
45. Gunawardena H
46. Howald C
47. Jha S
48. Johnson R
49. Kapranov P
50. King B
51. Kingswood C
52. Luo OJ
53. Park E
54. Persaud K
55. Preall JB
56. Ribeca P
57. Risk B
58. Robyr D
59. Sammeth M
60. Schaffer L
61. See LH
62. Shahab A
63. Skancke J
64. Suzuki AM
65. Takahashi H
66. Tilgner H
67. Trout D
68. Walters N
69. Wang H
70. Wrobel J
71. Yu Y
72. Ruan X
73. Hayashizaki Y
74. Harrow J
75. Gerstein M
76. Hubbard T
77. Reymond A
78. Antonarakis SE
79. Hannon G
80. Giddings MC
81. Ruan Y
82. Wold B
83. Carninci P
84. Guigó R
85. Gingeras TR
(2012) Landscape of transcription in human cells
Nature 489:101–108.

https://doi.org/10.1038/nature11233
- Google Scholar
(2011) Evolutionary origins of Brassicaceae specific genes in Arabidopsis thaliana
BMC Evolutionary Biology 11:47.

https://doi.org/10.1186/1471-2148-11-47
- Google Scholar
1. Dunn JG
2. Foo CK
3. Belletier NG
4. Gavis ER
5. Weissman JS
(2013) Ribosome profiling reveals pervasive and regulated stop codon readthrough in Drosophila melanogaster
eLife 2:e01179.

https://doi.org/10.7554/eLife.01179
- Google Scholar
(2007) Quantification of the elevated rate of domain rearrangements in metazoa
Journal of Molecular Biology 372:1337–1348.

https://doi.org/10.1016/j.jmb.2007.06.022
- Google Scholar
1. Ekman D
2. Elofsson A
(2010) Identifying and quantifying orphan protein sequences in fungi
Journal of Molecular Biology 396:396–405.

https://doi.org/10.1016/j.jmb.2009.11.053
- Google Scholar
1. Eyre-Walker A
(2002)
Changing effective population size and the McDonald-Kreitman test

Genetics 162:2017–2024.
- Google Scholar
1. Fatica A
2. Bozzoni I
(2014) Long non-coding RNAs: new players in cell differentiation and development
Nature Reviews Genetics 15:7–21.

https://doi.org/10.1038/nrg3606
- Google Scholar
1. Flicek P
2. Amode M
3. Barrell D
4. Beal K
5. Brent S
6. Carvalho-Silva D
7. Clapham P
8. Coates G
9. Fairley S
10. Fitzgerald S
11. Gil L
12. Gordon L
13. Hendrix M
14. Hourlier T
15. Johnson N
16. Kähäri AK
17. Keefe D
18. Keenan S
19. Kinsella R
20. Komorowska M
21. Koscielny G
22. Kulesha E
23. Larsson P
24. Longden I
25. McLaren W
26. Muffato M
27. Overduin B
28. Pignatelli M
29. Pritchard B
30. Riat HS
31. Ritchie GR
32. Ruffier M
33. Schuster M
34. Sobral D
35. Tang YA
36. Taylor K
37. Trevanion S
38. Vandrovcova J
39. White S
40. Wilson M
41. Wilder SP
42. Aken BL
43. Birney E
44. Cunningham F
45. Dunham I
46. Durbin R
47. Fernández-Suarez XM
48. Harrow J
49. Herrero J
50. Hubbard TJ
51. Parker A
52. Proctor G
53. Spudich G
54. Vogel J
55. Yates A
56. Zadissa A
57. Searle SM
(2012) Ensembl 2012
Nucleic Acids Research 40:D84–D90.

https://doi.org/10.1093/nar/gkr991
- Google Scholar
1. Frith MC
2. Forrest AR
3. Nourbakhsh E
4. Pang KC
5. Kai C
6. Kawai J
7. Carninci P
8. Hayashizaki Y
9. Bailey TL
10. Grimmond SM
(2006) The abundance of short proteins in the mammalian proteome
PLOS Genetics 2:e52.

https://doi.org/10.1371/journal.pgen.0020052
- Google Scholar
1. Galindo MI
2. Pueyo JI
3. Fouix S
4. Bishop SA
5. Couso JP
(2007) Peptides encoded by short ORFs control development and define a new eukaryotic gene family
PLOS Biology 5:e106.

https://doi.org/10.1371/journal.pbio.0050106
- Google Scholar
(2010) Mammalian microRNAs predominantly act to decrease target mRNA levels
Nature 466:835–840.

https://doi.org/10.1038/nature09267
- Google Scholar
1. Guttman M
2. Rinn JL
(2012) Modular regulatory principles of large non-coding RNAs
Nature 482:339–346.

https://doi.org/10.1038/nature10887
- Google Scholar
(2013) Ribosome profiling provides evidence that large noncoding RNAs do not encode proteins
Cell 154:240–251.

https://doi.org/10.1016/j.cell.2013.06.009
- Google Scholar
1. Hanada K
2. Higuchi-Takeuchi M
3. Okamoto M
4. Yoshizumi T
5. Shimizu M
6. Nakaminami K
7. Nishi R
8. Ohashi C
9. Iida K
10. Tanaka M
11. Horii Y
12. Kawashima M
13. Matsui K
14. Toyoda T
15. Shinozaki K
16. Seki M
17. Matsui M
(2013) Small open reading frames associated with morphogenesis are hidden in plant genomes
Proceedings of the National Academy of Sciences of USA 110:2395–2400.

https://doi.org/10.1073/pnas.1213958110
- Google Scholar
1. Harrow J
2. Frankish A
3. Gonzalez JM
4. Tapanari E
5. Diekhans M
6. Kokocinski F
7. Aken BL
8. Barrell D
9. Zadissa A
10. Searle S
11. Barnes I
12. Bignell A
13. Boychenko V
14. Hunt T
15. Kay M
16. Mukherjee G
17. Rajan J
18. Despacio-Reyes G
19. Saunders G
20. Steward C
21. Harte R
22. Lin M
23. Howald C
24. Tanzer A
25. Derrien T
26. Chrast J
27. Walters N
28. Balasubramanian S
29. Pei B
30. Tress M
31. Rodriguez JM
32. Ezkurdia I
33. van Baren J
34. Brent M
35. Haussler D
36. Kellis M
37. Valencia A
38. Reymond A
39. Gerstein M
40. Guigó R
41. Hubbard TJ
(2012) GENCODE: the reference human genome annotation for The ENCODE Project
Genome Research 22:1760–1774.

https://doi.org/10.1101/gr.135350.111
- Google Scholar
1. Hashimoto Y
2. Niikura T
3. Tajima H
4. Yasukawa T
5. Sudo H
6. Ito Y
7. Kita Y
8. Kawasumi M
9. Kouyama K
10. Doyu M
11. Sobue G
12. Koide T
13. Tsuji S
14. Lang J
15. Kurokawa K
16. Nishimoto I
(2001) A rescue factor abolishing neuronal cell death by a wide spectrum of familial Alzheimer’s disease genes and abeta
Proceedings of the National Academy of Sciences of USA 98:6336–6341.

https://doi.org/10.1073/pnas.101133498
- Google Scholar
(2013) Translational profiling of clock cells reveals circadianly synchronized protein synthesis
PLOS Biology 11:e1001703.

https://doi.org/10.1371/journal.pbio.1001703
- Google Scholar
1. Ingolia NT
(2014) Ribosome profiling: new views of translation, from single codons to genome scale
Nature Reviews Genetics 15:205–213.

https://doi.org/10.1038/nrg3645
- Google Scholar
(2009) Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling
Science 324:218–223.

https://doi.org/10.1126/science.1168978
- Google Scholar
(2011) Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes
Cell 147:789–802.

https://doi.org/10.1016/j.cell.2011.10.002
- Google Scholar
1. Jacob F
(1977) Evolution and tinkering
Science 196:1161–1166.

https://doi.org/10.1126/science.860134
- Google Scholar
(2014) Translational dynamics revealed by genome-wide profiling of ribosome footprints in Arabidopsis
Proceedings of the National Academy of Sciences of USA 111:E203–E212.

https://doi.org/10.1073/pnas.1317811111
- Google Scholar
1. Kapranov P
2. Cheng J
3. Dike S
4. Nix DA
5. Duttagupta R
6. Willingham AT
7. Stadler PF
8. Hertel J
9. Hackermüller J
10. Hofacker IL
11. Bell I
12. Cheung E
13. Drenkow J
14. Dumais E
15. Patel S
16. Helt G
17. Ganesh M
18. Ghosh S
19. Piccolboni A
20. Sementchenko V
21. Tammana H
22. Gingeras TR
(2007) RNA maps reveal new RNA classes and a possible function for pervasive transcription
Science 316:1484–1488.

https://doi.org/10.1126/science.1138341
- Google Scholar
1. Kastenmayer JP
2. Ni L
3. Chu A
4. Kitchen LE
5. Au WC
6. Yang H
7. Carter CD
8. Wheeler D
9. Davis RW
10. Boeke JD
11. Snyder MA
12. Basrai MA
(2006) Functional genomics of genes with small open reading frames ( sORFs ) in S. Cerevisiae
Genome Research 16:365–373.

https://doi.org/10.1101/gr.4355406.7
- Google Scholar
(2009) More than just orphans: are taxonomically-restricted genes important in evolution?
Trends in Genetics 25:404–413.

https://doi.org/10.1016/j.tig.2009.07.006
- Google Scholar
1. Kim D
2. Pertea G
3. Trapnell C
4. Pimentel H
5. Kelley R
6. Salzberg SL
(2013) TopHat2: Accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions
Genome Biology 14:R36.

https://doi.org/10.1186/gb-2013-14-4-r36
- Google Scholar
1. Kim MS
2. Pinto SM
3. Getnet D
4. Nirujogi RS
5. Manda SS
6. Chaerkady R
7. Madugundu AK
8. Kelkar DS
9. Isserlin R
10. Jain S
11. Thomas JK
12. Muthusamy B
13. Leal-Rojas P
14. Kumar P
15. Sahasrabuddhe NA
16. Balakrishnan L
17. Advani J
18. George B
19. Renuse S
20. Selvan LD
21. Patil AH
22. Nanjappa V
23. Radhakrishnan A
24. Prasad S
25. Subbannayya T
26. Raju R
27. Kumar M
28. Sreenivasamurthy SK
29. Marimuthu A
30. Sathe GJ
31. Chavan S
32. Datta KK
33. Subbannayya Y
34. Sahu A
35. Yelamanchi SD
36. Jayaram S
37. Rajagopalan P
38. Sharma J
39. Murthy KR
40. Syed N
41. Goel R
42. Khan AA
43. Ahmad S
44. Dey G
45. Mudgal K
46. Chatterjee A
47. Huang TC
48. Zhong J
49. Wu X
50. Shaw PG
51. Freed D
52. Zahari MS
53. Mukherjee KK
54. Shankar S
55. Mahadevan A
56. Lam H
57. Mitchell CJ
58. Shankar SK
59. Satishchandra P
60. Schroeder JT
61. Sirdeshmukh R
62. Maitra A
63. Leach SD
64. Drake CG
65. Halushka MK
66. Prasad TS
67. Hruban RH
68. Kerr CL
69. Bader GD
70. Iacobuzio-Donahue CA
71. Gowda H
72. Pandey A
(2014) A draft map of the human proteome
Nature 509:575–581.

https://doi.org/10.1038/nature13302
- Google Scholar
1. Kondo T
2. Hashimoto Y
3. Kato K
4. Inagaki S
5. Hayashi S
6. Kageyama Y
(2007) Small peptide regulators of actin-based cell morphogenesis encoded by a polycistronic mRNA
Nature Cell Biology 9:660–665.

https://doi.org/10.1038/ncb1595
- Google Scholar
1. Kutter C
2. Watt S
3. Stefflova K
4. Wilson MD
5. Goncalves A
6. Ponting CP
7. Odom DT
8. Marques AC
(2012) Rapid turnover of long noncoding RNAs and the evolution of gene expression
PLOS Genetics 8:e1002841.

https://doi.org/10.1371/journal.pgen.1002841
- Google Scholar
(2011) Hundreds of putatively functional small open reading frames in Drosophila
Genome Biology 12:R118.

https://doi.org/10.1186/gb-2011-12-11-r118
- Google Scholar
(2009) Ultrafast and memory-efficient alignment of short DNA sequences to the human genome
Genome Biology 10:R25.

https://doi.org/10.1186/gb-2009-10-3-r25
- Google Scholar
1. Lee C
2. Yen K
3. Cohen P
(2013) Humanin: a harbinger of mitochondrial-derived peptides?
Trends in Endocrinology and Metabolism 24:222–228.

https://doi.org/10.1016/j.tem.2013.01.005
- Google Scholar
1. Levine MT
2. Jones CD
3. Kern AD
4. Lindfors HA
5. Begun DJ
(2006) Novel genes derived from noncoding DNA in Drosophila melanogaster are frequently X-linked and exhibit testis-biased expression
Proceedings of the National Academy of Sciences of USA 103:9935–9939.

https://doi.org/10.1073/pnas.0509809103
- Google Scholar
(2010) Identification of genetic and epigenetic marks involved in population structure
PLOS ONE 5:e13209.

https://doi.org/10.1371/journal.pone.0013209
- Google Scholar
1. Liu J
2. Jung C
3. Xu J
4. Wang H
5. Deng S
6. Bernad L
7. Arenas-Huertero C
8. Chua NH
(2012) Genome-wide analysis uncovers regulation of long intergenic noncoding RNAs in Arabidopsis
The Plant Cell 24:4333–4345.

https://doi.org/10.1105/tpc.112.102855
- Google Scholar
1. Liu J
2. Zhang Y
3. Lei X
4. Zhang Z
(2008) Natural selection of protein structural and functional properties: a single nucleotide polymorphism perspective
Genome Biology 9:R69.

https://doi.org/10.1186/gb-2008-9-4-r69
- Google Scholar
(2013) New gene evolution: little did we know
Annual Review of Genetics 47:307–333.

https://doi.org/10.1146/annurev-genet-111212-133301
- Google Scholar
1. Ma J
2. Ward CC
3. Jungreis I
4. Slavoff SA
5. Schwaid AG
6. Neveu J
7. Budnik BA
8. Kellis M
9. Saghatelian A
(2014) Discovery of human sORF-encoded polypeptides (SEPs) in cell lines and tissue
Journal of Proteome Research 13:1757–1765.

https://doi.org/10.1021/pr401280w
- Google Scholar
1. Magny EG
2. Pueyo JI
3. Pearl FM
4. Cespedes MA
5. Niven JE
6. Bishop SA
7. Couso JP
(2013) Conserved regulation of cardiac calcium uptake by peptides encoded in small open reading frames
Science 341:1116–1120.

https://doi.org/10.1126/science.1238802
- Google Scholar
(2014) Ribosome profiling reveals post-transcriptional buffering of divergent gene expression in yeast
Genome Research 24:422–430.

https://doi.org/10.1101/gr.164996.113.Freely
- Google Scholar
(2012) Observation of dually decoded regions of the human genome using ribosome profiling data
Genome Research 22:2219–2229.

https://doi.org/10.1101/gr.133249.111
- Google Scholar
1. Nagalakshmi U
2. Wang Z
3. Waern K
4. Shou C
5. Raha D
6. Gerstein M
7. Snyder M
(2008) The transcriptional landscape of the yeast genome defined by RNA sequencing
Science 320:1344–1349.

https://doi.org/10.1126/science.1158441
- Google Scholar
1. Necsulea A
2. Soumillon M
3. Warnefors M
4. Liechti A
5. Daish T
6. Zeller U
7. Baker JC
8. Grützner F
9. Kaessmann H
(2014) The evolution of lncRNA repertoires and expression patterns in tetrapods
Nature 505:635–640.

https://doi.org/10.1038/nature12943
- Google Scholar
1. Nei M
2. Gojobori T
(1986)
Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions

Molecular Biology and Evolution 3:418–426.
- Google Scholar
1. Neme R
2. Tautz D
(2013) Phylogenetic patterns of emergence of new genes support a model of frequent de novo evolution
BMC Genomics 14:117.

https://doi.org/10.1186/1471-2164-14-117
- Google Scholar
1. Neme R
2. Tautz D
(2014) Evolution: dynamics of de novo gene emergence
Current Biology 24:R238–R240.

https://doi.org/10.1016/j.cub.2014.02.016
- Google Scholar
1. Okazaki Y
2. Furuno M
3. Kasukawa T
4. Adachi J
5. Bono H
6. Kondo S
7. Nikaido I
8. Osato N
9. Saito R
10. Suzuki H
11. Yamanaka I
12. Kiyosawa H
13. Yagi K
14. Tomaru Y
15. Hasegawa Y
16. Nogami A
17. Schönbach C
18. Gojobori T
19. Baldarelli R
20. Hill DP
21. Bult C
22. Hume DA
23. Quackenbush J
24. Schriml LM
25. Kanapin A
26. Matsuda H
27. Batalov S
28. Beisel KW
29. Blake JA
30. Bradt D
31. Brusic V
32. Chothia C
33. Corbani LE
34. Cousins S
35. Dalla E
36. Dragani TA
37. Fletcher CF
38. Forrest A
39. Frazer KS
40. Gaasterland T
41. Gariboldi M
42. Gissi C
43. Godzik A
44. Gough J
45. Grimmond S
46. Gustincich S
47. Hirokawa N
48. Jackson IJ
49. Jarvis ED
50. Kanai A
51. Kawaji H
52. Kawasawa Y
53. Kedzierski RM
54. King BL
55. Konagaya A
56. Kurochkin IV
57. Lee Y
58. Lenhard B
59. Lyons PA
60. Maglott DR
61. Maltais L
62. Marchionni L
63. McKenzie L
64. Miki H
65. Nagashima T
66. Numata K
67. Okido T
68. Pavan WJ
69. Pertea G
70. Pesole G
71. Petrovsky N
72. Pillai R
73. Pontius JU
74. Qi D
75. Ramachandran S
76. Ravasi T
77. Reed JC
78. Reed DJ
79. Reid J
80. Ring BZ
81. Ringwald M
82. Sandelin A
83. Schneider C
84. Semple CA
85. Setou M
86. Shimada K
87. Sultana R
88. Takenaka Y
89. Taylor MS
90. Teasdale RD
91. Tomita M
92. Verardo R
93. Wagner L
94. Wahlestedt C
95. Wang Y
96. Watanabe Y
97. Wells C
98. Wilming LG
99. Wynshaw-Boris A
100. Yanagisawa M
101. Yang I
102. Yang L
103. Yuan Z
104. Zavolan M
105. Zhu Y
106. Zimmer A
107. Carninci P
108. Hayatsu N
109. Hirozane-Kishikawa T
110. Konno H
111. Nakamura M
112. Sakazume N
113. Sato K
114. Shiraki T
115. Waki K
116. Kawai J
117. Aizawa K
118. Arakawa T
119. Fukuda S
120. Hara A
121. Hashizume W
122. Imotani K
123. Ishii Y
124. Itoh M
125. Kagawa I
126. Miyazaki A
127. Sakai K
128. Sasaki D
129. Shibata K
130. Shinagawa A
131. Yasunishi A
132. Yoshino M
133. Waterston R
134. Lander ES
135. Rogers J
136. Birney E
137. Hayashizaki Y
138. FANTOM Consortium
139. RIKEN Genome Exploration Research Group Phase I & II Team
(2002) Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs
Nature 420:563–573.

https://doi.org/10.1038/nature01266
- Google Scholar
(2005) Evolution and functional classification of vertebrate gene deserts
Genome Research 15:137–145.

https://doi.org/10.1101/gr.3015505
- Google Scholar
(2014) The life cycle of Drosophila orphan genes
eLife 3:e01311.

https://doi.org/10.7554/eLife.01311
- Google Scholar
1. Pauli A
2. Norris ML
3. Valen E
4. Chew G
5. Gagnon JA
6. Zimmerman S
7. Mitchell A
8. Ma J
9. Dubrulle J
10. Reyon D
11. Tsai SQ
12. Joung JK
13. Saghatelian A
14. Schier AF
(2014) Toddler: an embryonic signal that promotes cell movement via Apelin receptors
Science 343:1248636.

https://doi.org/10.1126/science.1248636
- Google Scholar
1. Pauli A
2. Valen E
3. Lin MF
4. Garber M
5. Vastenhouw NL
6. Levin JZ
7. Fan L
8. Sandelin A
9. Rinn JL
10. Regev A
11. Schier AF
(2012) Systematic identification of long noncoding RNAs expressed during zebrafish embryogenesis
Genome Research 22:577–591.

https://doi.org/10.1101/gr.133009.111.2011
- Google Scholar
(2007) Functionality or transcriptional noise? Evidence for selection within long noncoding RNAs
Genome Research 17:556–565.

https://doi.org/10.1101/gr.6036807
- Google Scholar
(2009) Evolution and functions of long noncoding RNAs
Cell 136:629–641.

https://doi.org/10.1016/j.cell.2009.02.006
- Google Scholar
1. Pruitt KD
2. Brown GR
3. Hiatt SM
4. Thibaud-Nissen F
5. Astashyn A
6. Ermolaeva O
7. Farrell CM
8. Hart J
9. Landrum MJ
10. McGarvey KM
11. Murphy MR
12. O'Leary NA
13. Pujar S
14. Rajput B
15. Rangwala SH
16. Riddick LD
17. Shkeda A
18. Sun H
19. Tamez P
20. Tully RE
21. Wallin C
22. Webb D
23. Weber J
24. Wu W
25. DiCuccio M
26. Kitts P
27. Maglott DR
28. Murphy TD
29. Ostell JM
(2014) RefSeq: an update on mammalian reference sequences
Nucleic Acids Research 42:D756–D763.

https://doi.org/10.1093/nar/gkt1114
- Google Scholar
1. Quinlan AR
2. Hall IM
(2010) BEDTools: a flexible suite of utilities for comparing genomic features
Bioinformatics 26:841–842.

https://doi.org/10.1093/bioinformatics/btq033
- Google Scholar
1. R Development Core Team
(2010)
R Foundation for statistical computing

R Foundation for statistical computing, Vienna Austria, R Foundation for statistical computing.
- Google Scholar
1. Reinhardt JA
2. Wanjiru BM
3. Brant AT
4. Saelao P
5. Begun DJ
6. Jones CD
(2013) De novo ORFs in Drosophila are important to organismal fitness and evolved rapidly from previously non-coding sequences
PLOS Genetics 9:e1003860.

https://doi.org/10.1371/journal.pgen.1003860
- Google Scholar
(2006) A segmentation gene in tribolium produces a polycistronic mRNA that codes for multiple conserved peptides
Cell 126:559–569.

https://doi.org/10.1016/j.cell.2006.05.053
- Google Scholar
(2007) Position of the final intron in full-length transcripts: determined by NMD?
Molecular Biology and Evolution 24:896–899.

https://doi.org/10.1093/molbev/msm010
- Google Scholar
1. Sherry ST
2. Ward MH
3. Kholodov M
4. Baker J
5. Phan L
6. Smigielski EM
7. Sirotkin K
(2001) dbSNP: the NCBI database of genetic variation
Nucleic Acids Research 29:308–311.

https://doi.org/10.1093/nar/29.1.308
- Google Scholar
(2014) A human short open reading frame (sORF)-encoded polypeptide that stimulates DNA end joining
The Journal of Biological Chemistry 289:10950–10957.

https://doi.org/10.1074/jbc.C113.533968
- Google Scholar
1. Slavoff SA
2. Mitchell AJ
3. Schwaid AG
4. Cabili MN
5. Ma J
6. Levin JZ
7. Karger AD
8. Budnik BA
9. Rinn JL
10. Saghatelian A
(2013) Peptidomic discovery of short open reading frame-encoded peptides in human cells
Nature Chemical Biology 9:59–64.

https://doi.org/10.1038/nchembio.1120
- Google Scholar
1. Smeds L
2. Künstner A
(2011) ConDe Tri - a content dependent read trimmer for illumina data
PLOS ONE 6:e26314.

https://doi.org/10.1371/journal.pone.0026314
- Google Scholar
1. Sun L
2. Luo H
3. Bu D
4. Zhao G
5. Yu K
6. Zhang C
7. Liu Y
8. Chen R
9. Zhao Y
(2013) Utilizing sequence intrinsic composition to classify protein-coding and long non-coding transcripts
Nucleic Acids Research 41:e166.

https://doi.org/10.1093/nar/gkt646
- Google Scholar
(2013) The RNA degradation pathway regulates the function of GAS5 a non-coding RNA in mammalian cells
PLOS ONE 8:e55684.

https://doi.org/10.1371/journal.pone.0055684
- Google Scholar
1. Tautz D
(2009) Polycistronic peptide coding genes in eukaryotes–how widespread are they?
Briefings in Functional Genomics & Proteomics 8:68–74.

https://doi.org/10.1093/bfgp/eln054
- Google Scholar
1. Tautz D
2. Domazet-Lošo T
(2011) The evolutionary origin of orphan genes
Nature Reviews Genetics 12:692–702.

https://doi.org/10.1038/nrg3053
- Google Scholar
1. Toll-Riera M
2. Bosch N
3. Bellora N
4. Castelo R
5. Armengol L
6. Estivill X
7. Albà MM
(2009) Origin of primate orphan genes: a comparative genomics approach
Molecular Biology and Evolution 26:603–612.

https://doi.org/10.1093/molbev/msn281
- Google Scholar
1. Trapnell C
2. Williams BA
3. Pertea G
4. Mortazavi A
5. Kwan G
6. van Baren MJ
7. Salzberg SL
8. Wold BJ
9. Pachter L
(2010) Transcript assembly and quantification by RNA-seq reveals unannotated transcripts and isoform switching during cell differentiation
Nature Biotechnology 28:511–515.

https://doi.org/10.1038/nbt.1621
- Google Scholar
1. Ulitsky I
2. Bartel DP
(2013) lincRNAs: genomics, evolution, and mechanisms
Cell 154:26–46.

https://doi.org/10.1016/j.cell.2013.06.020
- Google Scholar
1. UniProt Consortium
(2014) Activities at the Universal Protein Resource (UniProt)
Nucleic Acids Research 42:D191–D198.

https://doi.org/10.1093/nar/gkt1140
- Google Scholar
1. van Heesch S
2. van Iterson M
3. Jacobi J
4. Boymans S
5. Essers PB
6. de Bruijn E
7. Hao W
8. Macinnes AW
9. Cuppen E
10. Simonis M
(2014) Extensive localization of long noncoding RNAs to the cytosol and mono- and polyribosomal complexes
Genome Biology 15:R6.

https://doi.org/10.1186/gb-2014-15-1-r6
- Google Scholar
(2013) Direct detection of alternative open reading frames translation products in human significantly expands the proteome
PLOS ONE 8:e70698.

https://doi.org/10.1371/journal.pone.0070698
- Google Scholar
(2014) Comparative ribosome profiling reveals extensive translational complexity in different Trypanosoma brucei life cycle stages
Nucleic Acids Research 42:3623–3637.

https://doi.org/10.1093/nar/gkt1386
- Google Scholar
1. Wang L
2. Park HJ
3. Dasari S
4. Wang S
5. Kocher JP
6. Li W
(2013) CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model
Nucleic Acids Research 41:e74.

https://doi.org/10.1093/nar/gkt006
- Google Scholar
1. Wilson BA
2. Masel J
(2011) Putatively noncoding transcripts show extensive association with ribosomes
Genome Biology and Evolution 3:1245–1252.

https://doi.org/10.1093/gbe/evr099
- Google Scholar
(2013) Mechanisms and dynamics of orphan gene emergence in insect genomes
Genome Biology and Evolution 5:439–455.

https://doi.org/10.1093/gbe/evt009
- Google Scholar
1. Xie C
2. Zhang YE
3. Chen JY
4. Liu CJ
5. Zhou WZ
6. Li Y
7. Zhang M
8. Zhang R
9. Wei L
10. Li CY
(2012) Hominoid-specific de novo protein-coding genes originating from long non-coding RNAs
PLOS Genetics 8:e1002942.

https://doi.org/10.1371/journal.pgen.1002942
- Google Scholar

Article and author information

Author details

Jorge Ruiz-Orera

Evolutionary Genomics Group, Research Programme on Biomedical Informatics, Hospital del Mar Research Institute, Universitat Pompeu Fabra, Barcelona, Spain

Contribution
JR-O, Conception and design, Acquisition of data, Analysis and interpretation of data, Drafting or revising the article

Competing interests
The authors declare that no competing interests exist.
Xavier Messeguer

Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya, Barcelona, Spain

Contribution
XM, Acquisition of data, Analysis and interpretation of data, Drafting or revising the article

Competing interests
The authors declare that no competing interests exist.
Juan Antonio Subirana
1. Evolutionary Genomics Group, Research Programme on Biomedical Informatics, Hospital del Mar Research Institute, Universitat Pompeu Fabra, Barcelona, Spain
2. Real Academia de Ciències i Arts de Barcelona, Barcelona, Spain
Contribution
JAS, Acquisition of data, Analysis and interpretation of data, Drafting or revising the article

Competing interests
The authors declare that no competing interests exist.
M Mar Alba
1. Evolutionary Genomics Group, Research Programme on Biomedical Informatics, Hospital del Mar Research Institute, Universitat Pompeu Fabra, Barcelona, Spain
2. Catalan Institution for Research and Advanced Studies, Barcelona, Spain
Contribution
MMA, Conception and design, Analysis and interpretation of data, Drafting or revising the article

For correspondence
malba@imim.es

Competing interests
The authors declare that no competing interests exist.

Funding

Ministerio de Economía y Competitividad (BFU2012-36820)

M Mar Alba

Ministerio de Economía y Competitividad (TIN2013-45732-C4-3-P)

Xavier Messeguer

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We acknowledge José Luis Villanueva-Cañas and Will Blevins for critical revision of the manuscript. We are grateful to Ivan Ovcharenko for advise on gene deserts. This work was funded by Ministerio de Economía y Competitividad (BFU2012-36820 and TIN2013-45732-C4-3-P) and Fundació ICREA (MMA).

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.