1. Genetics and Genomics
Download icon

Fast turnover of genome transcription across evolutionary time exposes entire non-coding DNA to de novo gene emergence

  1. Rafik Neme  Is a corresponding author
  2. Diethard Tautz  Is a corresponding author
  1. Max-Planck Institute for Evolutionary Biology, Germany
Research Article
  • Cited 19
  • Views 2,630
  • Annotations
Cite this article as: eLife 2016;5:e09977 doi: 10.7554/eLife.09977

Abstract

Deep sequencing analyses have shown that a large fraction of genomes is transcribed, but the significance of this transcription is much debated. Here, we characterize the phylogenetic turnover of poly-adenylated transcripts in a comprehensive sampling of taxa of the mouse (genus Mus), spanning a phylogenetic distance of 10 Myr. Using deep RNA sequencing we find that at a given sequencing depth transcriptome coverage becomes saturated within a taxon, but keeps extending when compared between taxa, even at this very shallow phylogenetic level. Our data show a high turnover of transcriptional states between taxa and that no major transcript-free islands exist across evolutionary time. This suggests that the entire genome can be transcribed into poly-adenylated RNA when viewed at an evolutionary time scale. We conclude that any part of the non-coding genome can potentially become subject to evolutionary functionalization via de novo gene evolution within relatively short evolutionary time spans.

https://doi.org/10.7554/eLife.09977.001

eLife digest

Traditionally, the genome – the sum total of DNA within a cell – was thought to be divided into genes and ‘non-coding’ regions. Genes are copied, or “transcribed”, into molecules called RNA that perform essential tasks in the cell. The roles of the non-coding regions were often less clear, although it has since become apparent that some are also transcribed and generate low levels of RNA molecules. However, many debate how significant this transcription is to living organisms.

Neme and Tautz have now used a technique called deep RNA sequencing to study the RNA molecules produced in several different species and types of mice whose last common ancestor lived 10 million years ago. Different species produced RNA molecules from different portions – both genes and non-coding regions – of their genomes. Comparing these RNA sequences suggests that changes to the regions that are transcribed occur relatively quickly for a large portion of the genome. Furthermore, there have been no significant areas of the common ancestor’s genome that have not been transcribed at some point in at least one of its descendent species.

This therefore suggests that over a relatively short evolutionary period, any part of the genome can acquire the ability to be transcribed and potentially form a new gene. The next challenge is to find out how often these transcribed non-coding parts of the genome show important biochemical activities, and how they find their way into becoming new genes.

https://doi.org/10.7554/eLife.09977.002

Introduction

Genome-wide surveys have provided evidence for 'pervasive transcription', i.e., much larger portions of the genome are transcribed than would have been predicted from annotated exons (Clark et al., 2011; Hangauer et al., 2013; Kellis et al., 2014). Most are expected to be non-coding RNAs (lncRNAs) of which some have been shown to be functional. However, the general conservation level of these additional transcripts tends to be low, which raises the question of their evolutionary turnover dynamics (Kutter et al., 2012; Kapusta and Feschotte, 2014). They are currently receiving additional attention, since they could be a source for de novo gene formation via a proto-gene stage (Carvunis et al., 2012; Ruiz-Orera et al., 2014; Neme and Tautz, 2014). It has been shown that de novo gene emergence shows particularly high rates in the youngest lineages (Tautz and Domazet-Loso, 2011), indicating that there is high turnover of such transcripts and genes between closely related species. Indeed, comparative studies of de novo genes between Drosophila species (Palmieri et al., 2014) and within Drosophila populations (Zhao et al., 2014) have confirmed this.

A number of possibilities have been discussed by which new transcripts are generated in previously non-coding regions, including single mutational events, stabilization of bi-directional transcription and insertion of transposable elements with promotor activity (Brosius, 2005; Gotea et al., 2013; Neme and Tautz, 2013; Wu and Sharp, 2013; Sundaram et al., 2014; Ruiz-Orera et al., 2015). Detailed analyses of specific cases of emergence of a de novo gene have shown that single step mutations can be sufficient to generate a stable transcript in a region that was previously not transcribed and translated (Heinen et al., 2009; Knowles and McLysaght, 2009). The unequivocal identification of de novo transcript emergence can only be made in a comparison between very closely related evolutionary lineages, where orthologous genomic regions can be fully aligned, even for the neutrally evolving parts of the genome (Tautz et al., 2013). While the available genome and transcriptome data for mammals and insects are sufficient to screen for specific cases of de novo transcript emergence, they are still too far apart of each other to allow a comprehensive genome-wide assessment. Our analysis here is therefore based on a new dataset that reflects a very shallow divergence time-frame for relatives of the house mouse (Mus musculus).

Results

We selected populations, subspecies and species with increasing phylogenetic distance to the Mus musculus reference sequence (Keane et al., 2011). This reference was derived from an inbred strain of the subspecies Mus musculus domesticus and we use samples from three wild type populations of M. m. domesticus as the most closely related taxa, separated from each other by about 3,000–10,000 years. Further, we use samples from the related subspecies M. m. musculus and M. m. castaneus, which are separated since 0.3–0.5 million years. The other samples are recognized separate species with increasing evolutionary distances (Figure 1). We call this set of populations, subspecies and species collectively 'taxa' in the following. Altogether they span 10 million years of divergence, which corresponds to an average of 6% nucleotide difference for the most distant comparisons.

Figure 1 with 2 supplements see all
Phylogenetic relationships and time estimates for the taxa used in the study.

New genome sequences were generated for taxa with *. A common genome was constructed across all taxa (Figure 1—figure supplement 1) based on a mapping algorithm that is not affected by the sequence divergence between the samples (Appendix 1). Figure 1—figure supplement 2 shows the intersection of genome coverage between the named species.

https://doi.org/10.7554/eLife.09977.003

We obtained genome sequence reads for all taxa and mapped them to the mouse reference genome, using an algorithm that was specifically designed to deal efficiently with problems that occur in cross-mapping between diverged genomes (Sedlazeck et al., 2013; see Appendix 1 for validation). All regions that could be unequivocally mapped for all taxa were then used for further analysis. We refer to this as the 'common genome' which allows comparisons on those regions of the genomes which have not been gained or lost along the phylogeny, i.e., are common across all taxa (Figure 1—figure supplement 1). It represents 71.7% of the total reference genome length (Figure 1—figure supplement 2). Hence, we are nominally not analyzing about a third of the total genome length, but this corresponds to the highly repetitive parts for which unique and reliable mapping of transcriptomic reads would not be possible. Also, changes in transcription derived from gain or loss of genomic regions do not contribute to the patterns described below.

We chose three tissues for transcriptome sequencing, including testis, brain and liver. Previous studies had shown that testis and brain harbor the largest diversity of transcripts (Necsulea and Kaessmann, 2014). We sequenced only the poly-A+ fraction of the RNA, i.e., our focus is on coding and non-coding exons in processed RNA.

We use non-overlapping sliding windows of 200nt to assay for presence or absence of reads within the windows and express overall coverage as the fraction of windows showing transcription (see methods for details). We use only uniquely mapping reads, implying that we neglect the contributions and dynamics at repetitive loci. We display three thresholds of window coverage, the minimum being coverage by at least a single read, while the higher ones represent at least 10 and 100 reads respectively. The first serves as a very inclusive metric of low-level transcription, with the drawback of potentially including noise into the analysis, due to stochasticity in sampling, while the others represent thresholds for more abundant transcripts that are unlikely to be affected by sampling noise.

Among the three tissues analyzed, liver has the lowest overall read coverage while brain and testis have similar overall levels (Figure 2A–C). Combining the data from all three tissues or triplicating the read depth for one tissue (brain) increases the overall coverage in a similar way (Figure 2D,E).

Transcriptome coverage of the common genome per taxon.

(A–C) Liver, brain and testis, respectively, sequenced at approximately the same depth. (D) Combination of samples from AD. (E) Additional sequencing of brain samples at 3x depth, compared to B. (F) Combination of all samples, including additional brain sequencing. Three coverage levels are represented by colors from light blue to dark blue: window coverage with at least 1, 10 and 100 reads. Taxon abbreviations as summarized in Figure 1, with closest to the reference genome to the left of each panel and most divergent one to the right. Note that the slight rise in low read coverage for the distant taxa could partially be due to slightly more mismapping of reads at this phylogenetic distance (see Appendix 1 for simulation of mapping efficiency), but is also affected by a larger fraction of singleton reads (compare Figure 4—figure supplement 1).

https://doi.org/10.7554/eLife.09977.006

Figure 2F shows the total coverage across all tissues and all sequencing runs, which amounts to an average of 50.0 ± 2.5% per taxon. Hence, for each tissue, as well as in this combined set, we observe a very similar coverage in all taxa, with only a slight increase in the low expressed fraction for the most distant comparisons (see also legend Figure 2). This more or less stable pattern across phylogenetic time could either be due to the same regions being transcribed in all taxa, or a more or less constant rate of turnover of gain and loss of transcription between taxa.

To test these alternatives, we have asked which part of the transcribed window coverage is shared between the taxa and which is specific to the taxa. For this, we consider three classes: i) windows that are found in a single taxon only, ii) windows that are found in 2–9 taxa, i.e. more than one but not in all and iii) windows shared among all taxa (Figure 3; Figure 3—figure supplement 1 shows an extended version where class ii) is separated into each individual group). However, such an analysis could potentially be subject to a sampling problem, i.e. not finding a transcript in a taxon does not necessarily imply true absence, but could also be due to failure of sampling. This would be particularly problematic for singleton reads, since the probability of falsely not detecting one in a second sample that expresses it at the same level is about 37%. However, given that we ask whether it is detected in any of the other 9 taxa, the probability of falsely not detecting it if it exists across all of them becomes small (0.01%) (see also further analysis on singletons below).

Figure 3 with 2 supplements see all
Distribution of shared and non-shared windows with transcripts for each taxon, based on the aggregate dataset across all three tissues.

Three classes are represented: i) windows that are found in a single taxon only, ii) windows found in 2–9 taxa and iii) windows shared among all 10 taxa (from left to right in each panel). Windows with transcripts were first classified as belonging to one of the three classes, independent of their coverage, and were then assigned to the coverage classes represented by the blue shading (from light blue to dark blue: window coverage with at least 1, 10 and 100 reads). Taxon names as summarized in Figure 1. Figure 3—figure supplement 1 shows an extended version where class ii) is separated into each individual group. Relative enrichment of annotated genes in the conserved class is shown in Figure 3—figure supplement 2.

https://doi.org/10.7554/eLife.09977.007

Between 1 and 7% of transcribed windows are unique to one taxon only, with the more distant taxa showing the higher percentages (Figure 3). Most of these taxon-specific transcripts are lowly expressed (<10 reads per window), but the more distant taxa (MAT and APO in Figure 3I,J) show also some more highly expressed ones. We find a total of 6566 windows with read counts >50 that occur in a single taxon only, mostly in the long branches leading to MAT (1638 windows) and APO (4485 windows), but some also between the most closely related taxa (43 windows for DOM, including populations; 38 windows for MUS, including populations).

Approximately 18% of windows show transcripts shared across all taxa. These include most of the very highly expressed ones (>100 reads per window), but also a fraction of the low expressed ones (Figure 3). They are also enriched in annotated genes, especially in exons of protein coding genes, but also in non-coding genes (Figure 3—figure supplement 2). The class ii) windows (sharing between 2 and 9 taxa in Figure 3) represents the genes showing more or less turnover between taxa, with more turnover the more distant they are of each other (Figure 3—figure supplement 1). This class constitutes cumulatively the largest fraction (between 26 and 33% of whole genome coverage - Figure 3), supporting the notion of a fast turnover of most of the transcribed regions between taxa.

The taxon-specific turnover of transcripts is also reflected in a distance tree of shared coverage. Taxa that are phylogenetically closer to each other share more transcripts, i.e. the tree topology mimics that of a phylogenetic tree based on molecular sequence divergence (Figure 4A,B). This implies that the turnover of the transcripts is not random, but time dependent. However, the relative branch lengths are much extended for the more closely related taxa compared to the molecular distances, implying that there is a particularly high turnover between them.

Figure 4 with 3 supplements see all
Distance tree comparisons based on molecular and transcriptome sharing data.

(A) Molecular phylogeny based on whole mitochondrial genome sequences as a measure of molecular divergence (black lines represent the branch lengths, dashed lines serve to highlight short branches). (B) Tree based on shared transcriptome coverage of the genome, using correlations of presence and absence of transcription of the common genome. All nodes have bootstrap support values of 70% or more (n = 1000). (C) Tree based on shared transcriptome coverage of singleton reads only from subsampling of the extended brain transcriptomes. Left is the consensus tree with the variance component between samples depicted as triangles, right is the same tree, but only for the branch fraction that is robust to sampling variance. Taxon names as summarized in Figure 1. Figure 4—figure supplement 1 shows the fraction of singletons in dependence of each sample in each taxon, Figure 4—figure supplement 2 in dependence of read depth. Figure 4—figure supplement 3 shows an extended version of the analysis shown in 4C for higher coverage levels.

https://doi.org/10.7554/eLife.09977.010

To assess in how much this could be due a sampling variance problem at low expression levels, we have separately analyzed the transcripts that are represented by single reads only, since these should be most sensitive towards sampling problems. Depending on read depth and tissue, they constitute about 2–12% of the common windows when assessed on a per sample basis (Figure 4—figure supplement 1). However, most of these singletons in a given sample were re-detected in another tissue or another taxon (Figure 4—figure supplement 1), such that less than 2% are present in a given taxon (Figure 4—figure supplement 1) and less than 7% cumulatively throughout the whole dataset (Figure 4—figure supplement 2). We used the extended brain sample reads, split them into three non-overlapping sets of about 100 Mill reads for each taxon and constructed trees out of these sets using only the singleton reads. This is the equivalent of repeating the same experiment three times. We find indeed differences in the resulting trees, i.e. there is a measureable sampling variance. By constructing a consensus tree, we can partition the data into a variable and a common component. We find that 88% of the branch length is influenced by sampling variance, while the remaining 12% still recover the expected topology (Figure 4C). When we use a read coverage of 1–5 for the same analysis, we find that 52% of the branch length are subject to sampling variance and for all reads combined it is 35% (Figure 4—figure supplement 3). Hence, at the 100 Mill read level, we have a noticeable effect of sampling variance, but this does not erase the underlying signal. Also, the analysis in Figure 4B is based on 600 Mill reads per taxon, where sampling variance is expected to be further lowered.

The high dynamics of transcriptional turnover between taxa raises the question whether all parts of the genome might be accessible to transcription at some point in evolutionary time. To explore this possibility, we used a rarefaction approach to simulate the addition of one taxon at a time and used the curve to predict the behavior of adding more taxa than the ones in the present study. We compared this approach to a curve of increasing depth of sequencing, by taking subsets at 10% intervals to understand whether depth or taxonomic diversity have different behavior in this respect. We assume that in each species only a subset of the genome is transcribed, therefore the increase in depth of sequencing would saturate at some point below 100%. Conversely, if each taxon is transcribing slightly different portions of the genome due to a steady turnover, increasing the total number of sampled taxa should increase the saturation more than the increase that could be achieved by sequencing depth. This is indeed what we find. The addition of taxa indeed leads to a further increase in transcriptomic coverage, with a generalized linear model best describing the data as increasing in a logarithmic fashion (Figure 5A). In contrast, we observe an asymptotic behavior of the curve for increasing depth of sequencing, with apparent saturation reached at 84.1%, close to the 83.2% that we have already achieved (Figure 5B).

Rarefaction, subsampling and saturation patterns using all available samples and reads.

(A) Sequencing depth saturation as estimated from an increase in the number of taxa. (B) Sequencing depth saturation as estimated from increasing read number. Blue dots indicate increases per sub-sampled sequence fraction or taxon added from our dataset. Gray dotted line indicates the predicted behavior from the indicated regression, and gray area shows the prediction after doubling the current sampling either by additional taxa (A) or in sequencing effort (B). Each analysis was tested for logarithmic and asymptotic models. Best fit was selected from ΔBIC, with Bayes factor shown and qualitative degree of support shown. Standard deviations are shown as black lines in A, and are too small to display in B (note that due to the sampling scheme for this analysis, the values above 50% are not statistically independent and that the 100% value constitutes a single data point without variance measure).

https://doi.org/10.7554/eLife.09977.014

Combined with the previous results, this allows two major conclusions. First, random transcriptional noise (technical or biological) or deficiencies in sampling low level transcripts should not be major factors in our analysis, since saturation with sequencing depth would not be possible under a singleton dominated regime. Furthermore, low level transcripts (including singletons) have detectable biological signal (Figure 4C). Second, the data are consistent with the above outlined ideas that the evolutionary turnover leads to steady – and almost unlimited – transcriptional exploration of the genome, when summed over multiple parallel evolutionary lineages and taxa.

The above overall statistical consideration would still allow for the possibility of the existence of a few scattered genomic islands that are not accessible to transcription because of structural reasons (so-called transcriptional deserts – Montavon and Duboule, 2012) or heterochromatically packed because they are not encoding genes required in the respective tissues. Hence, we analyzed also the size distribution of transcript-free genomic regions in our dataset. We find that the maximum observed length of non-transcribed regions is 6 kb (Figure 6), suggesting that apparent transcriptional deserts in one taxon are readily accessible to transcription in other taxa, at least for the non-repetitive windows of the genome that are analyzed here.

Comparative analysis of lengths of regions transcribed or not transcribed across all data (including deeper brain sequencing) in all samples.

Size distribution of regions not covered in any transcript (green) versus size distribution of regions with at least one transcript (blue).

https://doi.org/10.7554/eLife.09977.015

Discussion

Various studies have shown that many more regions of the genome are transcribed than are annotated as exons (Ponting and Belgard, 2010; Kapranov and St. Laurent, 2012). The significance of this additional transcription has been largely unclear and it has even been considered as noise, either biological or technical. Here we were able to trace the turnover of these extra transcripts. Our data suggest that many have sufficient stability to reflect a phylogenetic distance distribution that mimics the phylogeny of the taxa. Hence, they should not simply be considered as noise. Rather, their lifetime should be sufficient to expose them to evolutionary testing and in this way they become a substrate for de novo evolution of genes. On the other hand, they appear to have only a limited lifetime in case they do not acquire a function, i.e. there is also high turnover of the transcribed regions between taxa. This turnover has as a consequence that within a timespan of a few million years practically the whole genome is covered by transcription at some point in time, i.e. no major transcript-free islands exist.

We have here sampled only three tissues. If more tissues and more life stages were sampled, one would expect an even higher coverage of the genome within a given taxon. Such deep analyses have been done in the ENCODE projects (http://www.genome.gov/10005107) and they have confirmed pervasive transcription (Clark et al., 2011; Hangauer et al., 2013; Kellis et al., 2014) at the single-taxon level. Still, we expect that the turnover of transcribed regions between taxa would also apply to the other tissues and stages, i.e. evolutionary testing of new transcripts would relate to all tissues and stages. This turnover is contrasted by the set of conserved genes across taxa, for which even the expression levels may be maintained across larger evolutionary distances (Pervouchine et al., 2015).

We see a particularly large number of lineage-specific transcripts among the most closely related taxa. This becomes most evident in the distance tree in Figure 4B where the branch length of the three populations of M. m. domesticus, which have separated only a few thousand years ago, are almost as long as those of the sister species M. spretus that has separated almost 2 Mill. years ago. Although this is partially influenced by sampling variance of low expressed transcripts (Figure 4C), this suggests that at the very short evolutionary distances (thousands of years) there is an even higher turnover of transcripts than at the longer time frames (millions of years). Such a pattern of unequal rates suggests that weak selection could act against many newly arising transcripts, such that they can exist for a short time at a population scale, but not over an extended time. Hence, we expect that the presence of such transcripts will be polymorphic at the population level, similar as it has been shown in Drosophila (Zhao et al., 2014). We have done a preliminary analysis of transcriptional variance between four individuals of each of the taxa and find this expectation fulfilled, but a more extensive study is required to obtain reliable data at this level.

We expect that a fraction of new transcripts interacts with other genes and cellular processes, either via providing a positive function or via being slightly deleterious. Our data do not allow at present to speculate on how large this 'functional' fraction would be, but this could become subject to future experimental studies. It is also as yet open whether the transcripts exert their functions as RNAs or via translation products. The analysis of ribosome profiling data has shown that many RNAs that were initially classified as non-coding can be associated to ribosomes, i.e. are likely translated (Wilson and Masel, 2011Carvunis et al., 2012; Ruiz-Orera et al., 2014). On the other hand, when tracing the origin of de novo genes, one finds frequently that they act first as RNA and acquire open reading frames only at a later stage (Cai et al., 2008; Kapranov and St. Laurent, 2012; Reinhardt et al., 2013 - see discussion in Schlötterer, 2015). For some of the de novo evolved genes in Drosophila it has been shown that they have assumed essential functions for the organism, such that knockouts of them are lethal (Chen et al., 2010). Global analyses of new gene emergence trends suggest that the de novo evolution process has been active throughout the evolutionary history (Neme and Tautz, 2013). Hence, the possibility of a transition from new transcript emergence over acquisition of reading frames towards assuming essential genetic functions is well documented.

The idea that many de novo transcripts are slightly deleterious is concordant with the fact that various cellular processes maintain a balance between RNA transcription and degradation (Houseley and Tollervey, 2009; Jensen et al., 2013). In yeast and mammals it has been shown that several molecular pathways exist that degrade excess transcripts, in particular the ones from bidirectional promoter activity (Jensen et al., 2013; Wu and Sharp, 2013). Hence, the fact that many of the transcripts found by deep sequencing occur only at low levels does not necessarily imply a low level of transcription, but could alternatively be due to fast targeting by a degradation machinery.

Table 1

Genome sequencing and read mapping information relative to the C57Bl/6 reference strain (GRCm38.3/mm10).

https://doi.org/10.7554/eLife.09977.016
SpeciesUniquely mapping
reads (MAPQ >25)
Mean coverage depth
(window based)
Reference
coverage
(% windows)
Total sequence
divergence*
Accession
Reads
Accession
BAMs
Apodemus uralensis4.46E+0840x78.23%5.60%ERS942341ERS946059
Mus mattheyi5.58E+0852x77.19%4.50%ERS942343ERS946060
Mus spretus7.71E+0852x93.91%1.70%ERS946096**
Mus spicilegus6.16E+0857x84.39%1.60%ERS942342ERS946061
  1. * The percentage of divergence was estimated from mappings using NextGenMap (Sedlazeck et al., 2013). Only uniquely mapping reads were considered and mapping quality greater than 25. Variation was estimated from the alignments using samtools mpileup (Li et al., 2009). Divergence was calculated as number of changes divided by the genome size.

  2. ** Corresponds to study accession PRJEB11535. All other accessions deposited under studies PRJEB11513 and PRJEB11533.

Table 2

Transcriptome reads from each sample sequenced, mapped and normalized.

https://doi.org/10.7554/eLife.09977.017
Taxon
Code
TissueLanesQC-passed
reads
Mapped
reads
(% total)Normalized
subset
(%total)(% mapped)Accession
Reads*
Accession
BAMs**
DOMCBBrain0.33x1.30E+081.26E+0896%9.15E+0770%73%ERS946023ERS942305
DOMCBLiver0.33x1.41E+081.17E+0883%9.07E+0764%77%ERS946025ERS942306
DOMCBTestis0.33x1.26E+081.22E+0896%1.19E+0894%98%ERS946026ERS942307
DOMMCBrain0.33x1.17E+081.13E+0896%9.15E+0778%81%ERS946027ERS942309
DOMMCLiver0.33x1.34E+081.09E+0881%9.07E+0768%84%ERS946029ERS942310
DOMMCTestis0.33x1.42E+081.37E+0896%1.19E+0883%87%ERS946030ERS942311
DOMAHBrain0.33x9.49E+079.15E+0796%9.15E+0796%100%ERS946019ERS942301
DOMAHLiver0.33x1.16E+081.02E+0888%9.07E+0778%89%ERS946021ERS942302
DOMAHTestis0.33x1.61E+081.55E+0896%1.19E+0874%77%ERS946022ERS942303
MUSKHBrain0.33x1.33E+081.28E+0896%9.15E+0769%72%ERS946035ERS942313
MUSKHLiver0.33x1.03E+089.07E+0788%9.07E+0788%100%ERS946037ERS942314
MUSKHTestis0.33x1.36E+081.31E+0896%1.19E+0887%91%ERS946038ERS942315
MUSVIBrain0.33x1.23E+081.19E+0896%9.15E+0774%77%ERS946031ERS942317
MUSVILiver0.33x1.23E+089.47E+0777%9.07E+0774%96%ERS946033ERS942318
MUSVITestis0.33x1.32E+081.27E+0896%1.19E+0890%93%ERS946034ERS942319
CASBrain0.33x1.21E+081.16E+0896%9.15E+0776%79%ERS946039ERS942321
CASLiver0.33x1.23E+081.01E+0882%9.07E+0774%90%ERS946041ERS942322
CASTestis0.33x1.23E+081.19E+0896%1.19E+0896%100%ERS946042ERS942323
SPIBrain0.33x1.34E+081.29E+0896%9.15E+0768%71%ERS946043ERS942325
SPILiver0.33x1.05E+089.82E+0793%9.07E+0786%92%ERS946045ERS942326
SPITestis0.33x1.44E+081.38E+0896%1.19E+0883%86%ERS946046ERS942327
SPRBrain0.33x1.09E+081.05E+0896%9.15E+0784%87%ERS946047ERS942329
SPRLiver0.33x1.35E+081.20E+0889%9.07E+0767%76%ERS946049ERS942330
SPRTestis0.33x1.34E+081.29E+0896%1.19E+0888%92%ERS946050ERS942331
MATBrain0.33x1.12E+081.04E+0893%9.15E+0782%88%ERS946051ERS942333
MATLiver0.33x1.23E+081.12E+0891%9.07E+0774%81%ERS946053ERS942334
MATTestis0.33x1.32E+081.23E+0893%1.19E+0890%97%ERS946054ERS942335
APOBrain0.33x1.36E+081.18E+0887%9.15E+0767%78%ERS946055ERS942337
APOLiver0.33x1.13E+081.00E+0889%9.07E+0780%91%ERS946057ERS942338
APOTestis0.33x1.38E+081.20E+0887%1.19E+0886%99%ERS946058ERS942339
  1. All accessions deposited under studies PRJEB11533* and PRJEB11513**.

Our results provide an evolutionary dynamics perspective where emergence, functionalization and decay of gene functions should be seen as an evolutionary life cycle of genes (Neme and Tautz, 2014). De novo gene birth should no longer be considered as the result of unlikely circumstances, but rather as an inherent property of the transcriptional apparatus and thus a mechanism for testing and revealing hidden adaptive potential in genomes (Brosius, 2005; Masel and Siegal, 2009). Within this evolutionary perspective, any non-genic part of the genome has the possibility to become useful at some time.

Table 3

Additional sequencing effort, focused only on brain samples. Reads sequenced, mapped and normalized.

https://doi.org/10.7554/eLife.09977.018
Taxon
Code
TissueLanesQC-passed
reads
Mapped
reads
(% total)Normalized
subset
(% total)(% mapped)Accession
Reads
Accession
BAMs
DOMCBBrain1x3.89E+083.76E+0897%3.19E+0882%85%ERS946024ERS942308
DOMMCBrain1x3.76E+083.64E+0897%3.19E+0885%88%ERS946028ERS942312
DOMAHBrain1x3.46E+083.35E+0897%3.19E+0892%95%ERS946020ERS942304
MUSKHBrain1x4.64E+084.49E+0897%3.19E+0869%71%ERS946036ERS942316
MUSVIBrain1x4.13E+084.00E+0897%3.19E+0877%80%ERS946032ERS942320
CASBrain1x4.35E+084.21E+0897%3.19E+0873%76%ERS946040ERS942324
SPIBrain1x4.31E+084.16E+0897%3.19E+0874%77%ERS946044ERS942328
SPRBrain1x3.87E+083.73E+0896%3.19E+0882%85%ERS946048ERS942332
MATBrain1x3.62E+083.40E+0894%3.19E+0888%94%ERS946052ERS942336
APOBrain1x4.33E+083.77E+0887%3.19E+0874%84%ERS946056ERS942340
  1. All accessions deposited under studies PRJEB11533* and PRJEB11513**.

Material and methods

Sampled taxa

The youngest divergence point sampled, at about 3,000 years, corresponds to the split between two European populations of Mus musculus domesticus(Cucchi et al., 2005) one from France (Massif Central = DOMMC) and one from Germany (Cologne-Bonn area = DOMCB) (Ihle et al., 2006). These European populations in turn have diverged from an ancestral M. m. domesticus population in Iran (Ahvaz = DOMAH) about 12,000 years ago (Hardouin et al., 2015). The European M. m. domesticus are also the closest relatives of the reference genome, the C57BL/6J strain Didion and de Villena, 2013).

We included two populations of Mus musculus musculus; one from Austria (Vienna = MUSVI) and one from Kazakhstan (Almaty = MUSKH). These two populations are supposed to have a longer divergence between then the European M. m. domesticus populations, but a more accurate estimate is currently not available. We set the divergence for analyses at around 10,000 years as an approximate estimate. M. m. domesticus has diverged from M. m. musculus and Mus musculus castaneus about 0.4 to 0.5 million years ago, with a subsequent divergence, not long after, between M. m. musculus and M. m. castaneus (Suzuki et al., 2013). We included M. m. castaneus (CAS) from Taiwan as a representative of the subspecies.

To account for longer divergence times, we included Mus spicilegus (SPI; estimated divergence of 1.2 million years); Mus spretus (SPR; estimated divergence of 1.7 million years)(Suzuki et al., 2013); Mus matteyii (MAT; subgenus Nannomys), the North African miniature mouse (estimated divergence of 6.6 million years) (Catzeflis and Denys, 1992; Lecompte et al., 2008), and Apodemus uralensis, the Ural field mouse (APO; estimated divergence of 10.6 million years) (Lecompte et al., 2008).

The population-level samples (M. m. domesticus and M. m. musculus) included are maintained under outbreeding schemes, which allows for natural polymorphisms to be present in the samples. All other non-population samples are kept as more or less inbred stock, and therefore fewer polymorphisms are expected. All mice were obtained from the mouse collection at the Max Planck Institute for Evolutionary Biology, following standard rearing techniques which ensure a homogeneous environment for all animals. Mice were maintained and handled in accordance to FELASA guidelines and German animal welfare law (Tierschutzgesetz § 11, permit from Veterinäramt Kreis Plön: 1401–144/PLÖ-004697).

A total of 60 mice were sampled, as follows: Eight male individuals from each population-level sample (outbreds), Iran (DOMAH), France (DOMMC), and Germany (DOMCB) of Mus musculus domesticus, and Austria (MUSVI) and Kazakhstan (MUSKH) of Mus musculus musculus. Four male individuals from the remaining taxa (partially inbred): Mus musculus castaneus (CAS), Mus spretus (SPR), Mus spicilegus (SPI), Mus mattheyi (MAT) and Apodemus uralensis (APO). Mice were sacrificed by CO2 asphyxiation followed immediately by cervical dislocation. Mice were dissected and tissues were snap-frozen within 5 min post-mortem. The tissues collected were liver (ventral view: front right lobe), both testis and whole brain including brain stem.

Genome sequencing

One individual from each of M. spicilegus, M. spretus, M. mattheyi, and Apodemus uralensis were selected for genome sequencing. DNA was extracted from liver samples. DNA extraction was performed using a standard salt extraction protocol. Tagged libraries were prepared using the Genomic DNA Sample preparation kit from Illumina, following the manufacturers’ instructions. After library preparation, the samples were run in IlluminaHiSeq 2000 at a depth of approximately 2.6 lanes per genome. Library insert size is ~190bases and paired-end reads were 100 bases long. Library preparation and sequencing was performed at the Cologne Center for Genomics. Sequencing read statistics are provided in Table 1. Data are available under the study accessions PRJEB11513, PRJEB11533 and PRJEB11535, from the European Nucleotide Archive (http://www.ebi.ac.uk/ena/).

Transcriptome sequencing

The sampled tissues of each taxon were used for RNA extraction with the RNAeasy Mini Kit (QIAGEN) and RNA was pooled at equimolar concentrations. RNA quality was measured with the Agilent RNA Nano Kit, for the individual samples and pools. Samples with RIN values above 7.5 were used for sequencing. Library preparation was done using the Illumina TruSeq library preparation, with mRNA purification (poly-A+ selection), following manufacturers’ instructions. Sequencing was done in Illumina HiSeq, 2000 sequencer. Libraries for each group were tagged, pooled and sequenced in a single lane, corresponding to approximately one third of a HiSeq2000 lane. Library insert size is ~190bases and paired-end reads were 100 bases long. Additional sequencing of the brain samples was performed to identify potential limitations in depth of sequencing. For this, each brain library was sequenced on a full lane of a HiSeq2000. All library preparation and sequencing was done at the Cologne Center for Genomics (CCG). Sequencing read statistics are provided in Tables 2 and 3. Data are available under the study accessions PRJEB11513 and PRJEB11533, from the European Nucleotide Archive (http://www.ebi.ac.uk/ena).

Raw data processing

All raw data files were trimmed for adaptors and quality using Trimmomatic (Lohse et al., 2012). The quality trimming was performed basewise, removing bases below quality score of 20 (Q20), and keeping reads whose average quality was of at least Q30. Reads whose trimmed length was shorter than 60 bases were excluded from further analyses, and pairs missing one member because of poor quality were also removed from any further analyses.

Mapping

The reconstruction of transcriptomes using high-throughput sequencing data is not trivial when comparing information across different species to a single reference genome. This is due to the fact that most of the tools designed for such tasks do not work in a phylogenetically aware context. For this reason, any approximation which deals with fractional data (i.e. any high-throughput sequencing setup available to this date) is limited by the detection abilities of the software of choice and by the quality of the reference (transcriptome and genome).

Given the high quality state of the mouse genome repositories, we decided to take a reference-based approach, in which all analyses are centered in the reference genome of the C57BL/6 laboratory strain of Mus musculus domesticus, which enables direct comparisons across all species based on the annotations of the C57BL/6 laboratory strain.

Transcriptome and genome sequencing reads were aligned against the mm10 version of the mouse reference genome (Waterston et al., 2002) from UCSC (Karolchik et al., 2014) using NextGenMap which performs extremely well with divergences of over 10% compared to other standard mapping software (Sedlazeck et al., 2013), as confirmed by our own simulations (Appendix 1). The program was run under default settings, except for --strata 1 and --silent-clip. The first option enforces uniquely mapping reads and the second drops the unmapped portion of the reads, to avoid inflating coverage statistics. This is particularly relevant around exon-intron boundaries, where exonic reads are forced into intronic regions unless this option is set.

We produced normalized versions of the alignments per tissue. This was achieved by counting the total amount of uniquely mapped reads in each taxon for each tissue, and sampling without replacement a fraction of each file which would result in the roughly the same absolute number of uniquely mapped reads for all samples of the same tissue (summarized in Table 2 and Table 3).

Coverage statistics

We performed coverage statistics on 200 bp windows, to minimize problems derived from the fractional nature of the data, in which a few nucleotides could be absent from a sequenced fragment due to the preparation of the samples, low quality towards read ends, or a few mismatches during mapping. Coverage statistics were computed from normalized alignment files with the featureCounts program from the Subreads suite (Liao et al., 2014). In order to avoid counting reads twice if they would span two windows (which would be the case for most reads), we assigned reads to the window where more than half of the read was present.

Genomic reads were used as a metric of empiric mapability for the coverage statistics, i.e. to identify which regions can be reliably detected. For this, we removed from the mapping results against the reference genome (see above) all regions that were not mapped across the phylogeny based on the genomic reads from the taxa more than 1 Mill years apart. The remaining portion we call the ‘common genome’ in all analyses. It is important to highlight that this is not the same as synteny, since we did not perform any co-linearity analyses between fragments, but rather represent the mere presence in the species, in any possible order. The common genome serves to limit mapping artifacts, since the reads observed in each window must not only be uniquely mapping, but also be present and detectable in all the genomes considered.

We report coverage only from windows in the common genome for several reasons. First, we want to compare changes in transcription in regions which are present across all taxa, so the region must be present at the genome level. Second, the observation of transcriptome coverage on a region of the reference genome without genomic coverage from the respective taxon could represent mapping artifacts. Thus by enforcing coverage on both levels, and in all taxa at the genomic level, we reduce mapping artifacts and errors. Third, we assume that the transcriptional properties of the common genome should be general enough that they represent the properties of each of the genomes of the taxa under study. Summary data for coverage of all genomes and transcriptomes are available under the Dryad accession associated with this manuscript (doi:10.5061/dryad.8jb83).

Reconstruction of phylogenetic relationships

We performed genome-wide correlations of coverage to infer distance between the taxa under study. Correlations of two types were initially used: rank-based (spearman correlation) and binary (phi correlation). From correlation matrices, we constructed Manhattan distance matrices and from those we further constructed neighbor-joining trees to describe the proximity between any two taxa based on shared transcriptome information. We focus mostly on the presence or absence of transcriptional coverage. For this reason, we used only the binary correlations in the figures. In this representation, closely related organisms have more shared transcriptomic coverage than distantly related organisms. Analyses were performed in R, using the function dist() from the stats package and nj() from the ape package (Paradis et al., 2004).

Additionally, whole mitochondrial genomes were obtained for each taxon as consensus sequences from mapped reads using samtools mpileup (Li et al., 2009). The sequences were aligned with MUSCLE (Edgar, 2004), and a NJ tree was constructed with the dist.dna() and nj() functions from the ape package Paradis et al., 2004). All trees were tested with 1000 bootstraps with the boot.phylo() function from the ape package. Reported nodes have a support of 70% or greater.

Estimation of sampling variance from brain samples

The extensive sequencing of brain samples were used to obtain estimates of how sampling might affect the terminal branch lengths of trees based on low coverage regions. For this, we split the alignments into three non-overlapping sets of 100 million reads per taxon, such that each set would contain sets of independent observations. Paired-read relationships were maintained, so that pairs of the same fragments would be in the same set. From this, we obtained trees as mentioned before, and the portions of the branches of each taxon which were shared across sets were considered as robust to sampling biases, while the discordant portions between samples were considered to be due to sampling variance. Summary data from subsampled sets are available under the Dryad accession associated with this manuscript (doi:10.5061/dryad.8jb83).

Rarefaction and subsampling

Transcriptome experiments tend to be limited by the depth of sequencing, with highly expressed genes being relatively easy to sample, and rare transcripts becoming increasingly difficult to find. Given the large amount of data generated, we investigated whether our data show signals of coverage saturation from subsets of the data of different sizes. The total experiment, comprising ten taxa, corresponds to 6.4 x 109 reads (or 6.4 billion reads). We subsampled (samtools view -s) portions of mapped reads for each taxon, ranging between 10% to 100%, at 10% intervals. The observation of coverage saturation in this case would indicate that our sequencing efforts likely cover most of the transcribed regions of the common genome. Summary data are available under the Dryad accession associated with this manuscript (doi:10.5061/dryad.8jb83).

In parallel, we estimated the individual and combined contribution of each taxon to the transcriptomic coverage of the common genome. Not all samples have the same phylogenetic distance to each other (some species have more representatives than others). To account for this we generated one hundred arrays of the ten taxa with random order, and recorded the coverage after the addition of each taxon in each array. The observation of coverage saturation in this setup would indicate that taxonomic sampling is sufficient to cover most of the potentially transcribed regions of the common genome.

In order to estimate whether our data continued to increase or approached saturation, we tested two alternative models: a generalized linear model with logarithmic behavior (ever increasing) or a self-starting nonlinear regression model (saturating). The best fit was decided based on the minimum BIC value between the two models, and an estimate of the Bayes factor was computed from the difference of BIC values and support was obtained from standard criteria (Kass and Raftery, 1995). Analyses were performed in R, using the functions glm(), nls(), SSasymp(), and BIC() from the stats package (R Core Team, 2014).

Analysis of transcribed and non-transcribed regions across the genome

Transcribed and non-transcribed windows of the common genome were defined by the continuous presence or absence of transcriptomic coverage from mapping information of each taxon and tissue. Neighboring transcribed regions across species were combined to obtain stretches of transcriptionally active common genome.

Enrichment of annotations from the mouse reference

Annotations of Mus musculus from Ensembl v81 (Cunningham et al., 2015) were used to infer the relative contribution of known genes to the observed transcription across species. We partitioned the sets between genes, exons, and introns, and those were further partitioned between protein- coding and non-coding genes. To determine if the overlaps are significantly different from a random distribution of the features along the genome, we randomized 1000 times each of the annotated intervals (genes, exons, introns, and subsets of coding and non-coding) along the genome using shuffleBed from the bedtools suite (Quinlan and Hall, 2010), and compared the overlap to various transcribed regions (single taxa, less than 9 taxa, more than 8 taxa, 10 taxa, and transcribed in any taxon). Multiple testing corrections were performed and significant comparisons are reported at 5% FDR. Furthermore, since we assume that most annotations fall within transcribed regions in any species, we used the total transcriptomic coverage across all taxa to calculate potential discrepancies in the shuffling method. The ratios of expected and observed coverage of total transcription across taxa for a given feature were calculated to define the range of ratios for which comparisons were also non-significant, i.e., where we could not rule out method bias.

Appendix 1

Simulation of mapping efficiency depending on sequence variation

We performed simulations of the mapping efficiency of two mappers NextGenMap (NGM) and Bowtie2 (standard mapper) across a range of divergences based on the chromosome 19 of the mouse reference genome (mm10 from UCSC). Mutated versions of chromosome 19 were generated with a python script by choosing to randomly substitute a given fraction of the nucleotides in the sequence in random positions along the genome. From each mutated version we simulated sequencing reads with ART (Huang et al., 2012), with a mean fold coverage of 5x (1x standard deviation) and using the same conditions as in our main sequencing experiment (100 bp paired end reads, 190bp fragments) and the options for empirical read quality of the Illumina HiSeq2000 sequencer.

Reads were subsequently mapped to the chromosome 19 reference sequence with NextGenMap using the default parameters except for --strata 1 --silent-clip to obtain uniquely mapping reads and to remove the non-mapping regions from reads. Reads were also mapped with Bowtie2, following default parameters except for --very-sensitive. Information about uniquely mapping reads from NGM was derived directly from the bam files and from Bowtie2 was derived from the standard error log files. From Appendix 1—table 1 and Appendix 1—figure 1, we observe that NextGenMap performs extremely well with increasing divergences, and greatly outperforms the standard mapper. While the average difference between the most distant genomes analyzed is about 6%, it must be noted that fast evolving regions of the genome can quickly exceed the mean. NextGenMap is able to capture most of the regions of the genome to allow comparisons across very divergent taxa.

In addition to this, we used the set of reads simulated from the chromosome 19 reference sequence and mapped them with NextGenMap to each mutated version of the reference chromosome 19 using the same parameters mentioned above (Appendix 1—figure 2; Appendix 1—tables 2 and 3). This allowed the control of accuracy in read placement across divergent sequences by testing the position of each read in each mapping exercise (Appendix 1—figure 2A; Appendix 1—table 2). This was done with the bedtools suite, intersecting reads from each divergent genome to the original, and counting the reads which were in the same location. Reads were allowed to be offset by 20% (80% overlap), for example in in cases where ends would not map successfully. From this we also derived comparable statistics for total uniquely mapped reads, proper paired reads, paired reads regardless of location and single reads mapped where the pair failed to map (Appendix 1—figure 2B; Appendix 1—table 3).

Appendix 1—table 1

Simulations comparing bowtie2 to NextGenMap. Divergent reads were mapped to a common reference.

https://doi.org/10.7554/eLife.09977.019
Total simulated
reads
% simulated
divergence (reads)
Uniquely mapped reads
Bowtie2
Uniquely mapped reads NGMPercentage unique from total reads
Bowtie2
Percentage unique from total reads
NGM
29103700%2621200287348190.1%98.7%
29109822%2650274286827991.0%98.5%
29113124%2674738286358191.9%98.4%
29102866%2583320285606088.8%98.1%
29109788%2124958283611973.0%97.4%
291044610%1321494277983745.4%95.5%
291061012%587862267501120.2%91.9%
291019614%18682825108406.4%86.3%
291009016%4298622969171.5%78.9%
290999218%748820414370.3%70.2%
291002220%93617599240.0%60.5%
Appendix 1—table 2

Accuracy of NextGenMap. The same set of reads was mapped to divergent genome versions of the reference. We are assuming that the reads coming from the same reference are correctly mapped, and used that as a standard for the divergent genomes, so the estimates should be slightly inflated.

https://doi.org/10.7554/eLife.09977.020
% divergenceAccurately mapped reads%
0%2910370100.0%
2%284207697.7%
4%281662896.8%
6%279893696.2%
8%277860895.5%
10%275619494.7%
12%271742093.4%
14%264847291.0%
16%253172887.0%
18%235896481.1%
20%212092272.9%
Appendix 1—figure 1
Performance of NextGenMap compared to Bowtie2.
https://doi.org/10.7554/eLife.09977.021
Appendix 1—figure 2
Performance of NextGenMap in terms accuracy of mapping using the same set of reads and increasingly divergent versions of the reference genome (A), and paired-end mapping statistics (B).
https://doi.org/10.7554/eLife.09977.022
Appendix 1—table 3

Performance of NextGenMap. Same set of reads was mapped to divergent genomes. Mapped indicates uniquely mapped reads; proper indicates read with both pairs mapped one next to the other; mate mapped indicates that both reads in a pair are mapped, although not necessarily as pairs; singletons indicates the amount of pairs in which only one of both mates was mapped.

https://doi.org/10.7554/eLife.09977.023
% simulated
divergence
(reference)
Total readsMapped (%)Proper (%)Mate mapped (%)Singletons (%)
0%29103702873481 (99%)2869482 (99%)2872432 (99%)1049 (0.1%)
2%29103702883094 (99%)2860794 (98%)2878634 (99%)4460 (0.1%)
4%29103702885714 (99%)2844842 (98%)2877808 (99%)7906 (1%)
6%29103702882035 (99%)2810920 (97%)2866362 (98%)15673 (1%)
8%29103702859215 (98%)2722782 (94%)2817502 (97%)41713 (3%)
10%29103702810639 (97%)2575954 (89%)2722242 (94%)88397 (6%)
12%29103702712723 (93%)2305232 (79%)2536014 (87%)176709 (12%)
14%29103702562495 (88%)1961916 (67%)2266582 (78%)295913 (20%)
16%29103702369165 (81%)1571078 (54%)1945446 (67%)423719 (29%)
18%29103702144444 (74%)1193318 (41%)1609114 (55%)535330 (37%)
20%29103701882993 (65%)844628 (29%)1265102 (43%)617891 (42%)

References

  1. 1
  2. 2
  3. 3
  4. 4
    The african nannomys (muridae) - an early offshoot from the mus lineage - evidence from scDNA hybridization experiments and compared morphology
    1. FM Catzeflis
    2. C Denys
    (1992)
    Israel Journal of Zoology 38:219–231.
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
    Bidirectional promoters as important drivers for the emergence of species-specific transcripts
    1. V Gotea
    2. HM Petrykowska
    3. L Elnitski
    (2013)
    PloS One, 8, 10.1371/journal.pone.0057323.
  12. 12
    Pervasive transcription of the human genome produces thousands of previously unidentified long intergenic noncoding RNAs
    1. MJ Hangauer
    2. IW Vaughn
    3. MT McManus
    (2013)
    PLoS Genetics, 9, 10.1371/journal.pgen.1003569.
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
    Bayes factors
    1. RE Kass
    2. AE Raftery
    (1995)
    Journal of the American Statistical Association 90:773–795.
    https://doi.org/10.1080/01621459.1995.10476572
  23. 23
  24. 24
  25. 25
  26. 26
    Rapid turnover of long noncoding RNAs and the evolution of gene expression
    1. C Kutter
    2. S Watt
    3. K Stefflova
    4. MD Wilson
    5. A Goncalves
    6. CP Ponting
    7. DT Odom
    8. AC Marques
    (2012)
    PLoS Genetics, 8, 10.1371/journal.pgen.1002841.
  27. 27
  28. 28
  29. 29
  30. 30
  31. 31
  32. 32
  33. 33
  34. 34
    Phylogenetic patterns of emergence of new genes support a model of frequent de novo evolution
    1. R Neme
    2. D Tautz
    (2013)
    BMC Genomics, 14, 10.1186/1471-2164-14-117.
  35. 35
  36. 36
    The life cycle of drosophila orphan genes
    1. N Palmieri
    2. C Kosiol
    3. C Schlötterer
    (2014)
    eLife, 3, 10.7554/eLife.01311.
  37. 37
  38. 38
  39. 39
  40. 40
  41. 41
  42. 42
  43. 43
  44. 44
    Long non-coding RNAs as a source of new peptides
    1. J Ruiz-Orera
    2. X Messeguer
    3. JA Subirana
    4. MM Alba
    (2014)
    eLife, 3, 10.7554/eLife.03523.
  45. 45
  46. 46
  47. 47
  48. 48
  49. 49
  50. 50
    Evolutionary Origin of Orphan Genes. In: eLS
    1. D Tautz
    2. R Neme
    3. T Domazet-Loso
    (2013)
    John Wiley & Sons, Ltd: Chichester.
  51. 51
    Initial sequencing and comparative analysis of the mouse genome
    1. RH Waterston
    2. K Lindblad-Toh
    3. E Birney
    4. J Rogers
    5. JF Abril
    6. P Agarwal
    7. R Agarwala
    8. R Ainscough
    9. M Alexandersson
    10. P An
    11. SE Antonarakis
    12. J Attwood
    13. R Baertsch
    14. J Bailey
    15. K Barlow
    16. S Beck
    17. E Berry
    18. B Birren
    19. T Bloom
    20. P Bork
    21. M Botcherby
    22. N Bray
    23. MR Brent
    24. DG Brown
    25. SD Brown
    26. C Bult
    27. J Burton
    28. J Butler
    29. RD Campbell
    30. P Carninci
    31. S Cawley
    32. F Chiaromonte
    33. AT Chinwalla
    34. DM Church
    35. M Clamp
    36. C Clee
    37. FS Collins
    38. LL Cook
    39. RR Copley
    40. A Coulson
    41. O Couronne
    42. J Cuff
    43. V Curwen
    44. T Cutts
    45. M Daly
    46. R David
    47. J Davies
    48. KD Delehaunty
    49. J Deri
    50. ET Dermitzakis
    51. C Dewey
    52. NJ Dickens
    53. M Diekhans
    54. S Dodge
    55. I Dubchak
    56. DM Dunn
    57. SR Eddy
    58. L Elnitski
    59. RD Emes
    60. P Eswara
    61. E Eyras
    62. A Felsenfeld
    63. GA Fewell
    64. P Flicek
    65. K Foley
    66. WN Frankel
    67. LA Fulton
    68. RS Fulton
    69. TS Furey
    70. D Gage
    71. RA Gibbs
    72. G Glusman
    73. S Gnerre
    74. N Goldman
    75. L Goodstadt
    76. D Grafham
    77. TA Graves
    78. ED Green
    79. S Gregory
    80. R Guigó
    81. M Guyer
    82. RC Hardison
    83. D Haussler
    84. Y Hayashizaki
    85. LW Hillier
    86. A Hinrichs
    87. W Hlavina
    88. T Holzer
    89. F Hsu
    90. A Hua
    91. T Hubbard
    92. A Hunt
    93. I Jackson
    94. DB Jaffe
    95. LS Johnson
    96. M Jones
    97. TA Jones
    98. A Joy
    99. M Kamal
    100. EK Karlsson
    101. D Karolchik
    102. A Kasprzyk
    103. J Kawai
    104. E Keibler
    105. C Kells
    106. WJ Kent
    107. A Kirby
    108. DL Kolbe
    109. I Korf
    110. RS Kucherlapati
    111. EJ Kulbokas
    112. D Kulp
    113. T Landers
    114. JP Leger
    115. S Leonard
    116. I Letunic
    117. R Levine
    118. J Li
    119. M Li
    120. C Lloyd
    121. S Lucas
    122. B Ma
    123. DR Maglott
    124. ER Mardis
    125. L Matthews
    126. E Mauceli
    127. JH Mayer
    128. M McCarthy
    129. WR McCombie
    130. S McLaren
    131. K McLay
    132. JD McPherson
    133. J Meldrim
    134. B Meredith
    135. JP Mesirov
    136. W Miller
    137. TL Miner
    138. E Mongin
    139. KT Montgomery
    140. M Morgan
    141. R Mott
    142. JC Mullikin
    143. DM Muzny
    144. WE Nash
    145. JO Nelson
    146. MN Nhan
    147. R Nicol
    148. Z Ning
    149. C Nusbaum
    150. MJ O'Connor
    151. Y Okazaki
    152. K Oliver
    153. E Overton-Larty
    154. L Pachter
    155. G Parra
    156. KH Pepin
    157. J Peterson
    158. P Pevzner
    159. R Plumb
    160. CS Pohl
    161. A Poliakov
    162. TC Ponce
    163. CP Ponting
    164. S Potter
    165. M Quail
    166. A Reymond
    167. BA Roe
    168. KM Roskin
    169. EM Rubin
    170. AG Rust
    171. R Santos
    172. V Sapojnikov
    173. B Schultz
    174. J Schultz
    175. MS Schwartz
    176. S Schwartz
    177. C Scott
    178. S Seaman
    179. S Searle
    180. T Sharpe
    181. A Sheridan
    182. R Shownkeen
    183. S Sims
    184. JB Singer
    185. G Slater
    186. A Smit
    187. DR Smith
    188. B Spencer
    189. A Stabenau
    190. N Stange-Thomann
    191. C Sugnet
    192. M Suyama
    193. G Tesler
    194. J Thompson
    195. D Torrents
    196. E Trevaskis
    197. J Tromp
    198. C Ucla
    199. A Ureta-Vidal
    200. JP Vinson
    201. AC Von Niederhausern
    202. CM Wade
    203. M Wall
    204. RJ Weber
    205. RB Weiss
    206. MC Wendl
    207. AP West
    208. K Wetterstrand
    209. R Wheeler
    210. S Whelan
    211. J Wierzbowski
    212. D Willey
    213. S Williams
    214. RK Wilson
    215. E Winter
    216. KC Worley
    217. D Wyman
    218. S Yang
    219. SP Yang
    220. EM Zdobnov
    221. MC Zody
    222. ES Lander
    223. Mouse Genome Sequencing Consortium
    (2002)
    Nature 420:520–562.
    https://doi.org/10.1038/nature01262
  52. 52
  53. 53
  54. 54
  55. 55

Decision letter

  1. Thomas R Gingeras
    Reviewing Editor; Cold Spring Harbor Laboratory, United States

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your work entitled "Fast turnover of genome-level transcriptional activity across short evolutionary time" for peer review at eLife. Your submission has been favorably evaluated by Mark McCarthy (Senior editor) and two reviewers, one of whom is a member of our Board of Reviewing Editors.

The reviewers have discussed the reviews with one another and the Reviewing editor has drafted this decision to help you prepare a revised submission.

Although the work is of interest and timely, we regret to inform you that the findings at this stage contain several problematic issues that require attention before a final decision can be rendered.

The first of these issues is the possibility that the analysis may be compromised by a serious artifact that would need to be corrected or excluded. In the authors' experimental design, it seems possible to call a region "transcribed" in taxon #1 and "not transcribed" in taxon #2 even if the region is in fact transcribed at exactly the same expression level in both taxa. This is because statistical fluctuation can cause even exactly the same region (even in the same taxon) to pass the simple coverage threshold (of coverage >= 1 or >= 5) in one sample but not another. For example, suppose a region is actually expressed at a level equivalent to a coverage of 1 read per base. With a Poisson probability of about 60%, one or more reads will fall over the position, and satisfy the "transcribed" threshold; with a probability of about 40%, zero reads will fall over the position, and then it will be called "not transcribed". The authors call the >=5 read threshold "more stringent and therefore more reliable" but this is not correct, because it is just as prone to the problem; for a region with an expression level of 5, in about half of samples you would observe a coverage of < 5 and call the region not transcribed, and in the other half you would observe a coverage of >= 5 and call the same region transcribed. Of course only low-expressed regions (close to the >=1 or >=5 thresholds) will suffer the problem of fluctuating above and below the thresholds, but many regions are lowly expressed, and indeed this is the point in a study of "pervasive transcription".

Because the probability of calling even the same low-expressed region "transcribed" in multiple replicate samples can be significantly less than one, the probability of seeing a region called "transcribed" in multiple samples will be an artifactually decreasing function, and indeed could look just like the results in Figure 3, depending on details.

A control experiment that the authors could do: what fraction of regions within the same taxon are detected in common in biological replicates – i.e. what does Figure 3 look like if you do 10 replicates of the same taxon, not 10 different taxa? For the experiment to work as planned, that plot must show a constant level. But if this artifact is in play, you'll see a decreasing function much like Figure 3.

One effect of the artifact would be that closely related taxa would show unexpectedly large "distances" in the analysis of Figure 4, because some unknown amount of "distance" is being generated by stochastic fluctuation on mapped read counts, not by evolutionary divergence. This is an alternative explanation for the authors' observation that the closely related taxa have disproportionately large distances, undermining one of the paper's main conclusions that there is larger turnover at shorter evolutionary divergence. (The fact that the tree is reasonably concordant with the expected mouse tree does suggest that there is signal in the data, though, so my expectation is that there's a mixture of artifact and signal going on.)

Thus I do not think the authors can safely conclude that transcription is not conserved in these taxa. I don't think a simple threshold on coverage is the right way to analyze the data. I think they must do a more statistically rigorous experimental analysis that takes into account the distributions P(c1 e1) and P(c2 e2) for how much read coverage c1, c2 you observe in taxa 1,2 given unknown expression levels e1, e2 (differential RNA-seq methods typically estimate negative binomial distributions, e.g. over dispersed Poissons, for these). For example you might seek to reject the hypothesis that e1=e2 given the observed c1, c2 (though this would only test for significant changes in expression level, not gain/loss of txn), or that e1>threshold for "on" and e2=epsilon where epsilon is some chosen background "off" level. Any such analysis would also leave you with regions of "don't know", where you were unable to tell from the observed counts if the region were differentially transcribed or not.

There is an issue concerning the distribution of the detected transcripts between annotated coding or non-coding transcripts (i.e. a large proportion of which will be coding regions) and novel transcribed regions (a large proportion of which are likely to have little coding potential). The authors indicate that cumulatively (Figure 2F) ~36% present in the "common genome" (~70% of total genome) for each taxa is transcribed, resulting in a total of ~76% of the common genome being transcribed considering all 3 tissues of all 10 taxa. What proportion of these uniquely mapped reads correspond to annotated vs. unannotated (i.e. novel transcribed regions found in these experiments)? Are the turnover rates observed different between these two categories of genomic regions? If they are different is the turnover rate observed driven by one or the other regions? This seems important since the underlying assumption seems to be that the novel regions (i.e. as yet unannotated regions) are driving the rapid turning. Additionally, do the de novo transcripts that survive this turnover possess more coding or non-coding characteristics? If they are non-coding mostly, are there "observations" that can be made about the evolution of genomes as being driven by the ongoing emergence of non-coding genes?

The sequence data must be deposited in appropriate public database (NCBI GEO, for example) and accessioned. The "Datasets generated" point in the cover page is set to "N/A" but should not be. The authors should provide not just the reads, but also some sort of summary data files (such as.bed files) that enable reviewers and readers to reproduce the important features of the analyses without having to remap reads. Both summary.bed files and read data also need to be provided to reviewers.

To do the analysis in Figure 5, you can't just select the best hypothesis of the two (your GLM for continuously increasing, vs. your nonlinear regression for saturation). For all we know, the two hypotheses explain the data essentially equally well. You can't conclude that Figure 5A shows continuous increase and Figure 5B shows saturation unless you show that one hypothesis is significantly better (i.e. statistically significantly). If you're using BIC (Bayesian information criterion) it seems you should be able to directly calculate and report the Bayes factor supporting one hypothesis versus the other.

The simulation is unclear. How does a string of 10,000 states approximate a 3Gb genome? Why bother with making it a string, since there's no dependence (that I see) between positions in your model (even though in a real genome there would be strong dependency between sites, because transcripts are contiguous). Why do you model 0->0 and 1->1 as "events"? (I think a time-dependent Markovian birth/death state change model would usually just have 0->1 with a birth rate, and 1->0 with a death rate.) Why do you call fitting to observed parameters by RMSD "maximum parsimony"? (Parsimony minimizes some counted events, possibly weighted ones, not a RMSD of continuous-valued features.) What exactly are the observed features that you fit the simulation to (as opposed to how you list example "summary values such as…", I want to understand exactly what you used), and how did you calculate them?

It is stated that NextGenMap is designed for mapping to divergent genomes, but your results are so dependent on the quality of that mapping that I think you're obligated to do and show a control experiment that demonstrates that it is in fact working as expected on these genomes, rather than just asserting that it "performs extremely well" (subsection “Mapping”, third paragraph).

[Editors’ note: this article was rejected after discussions between the reviewers, but the authors were invited to resubmit after an appeal against the decision.]

Thank you for submitting your work entitled "Fast turnover of genome transcription across evolutionary time exposes entire non-coding DNA to de novo gene emergence" for consideration by eLife. Your article has been reviewed by one of the original reviewers (Sean Eddy), and the evaluation has been overseen by a Reviewing Editor and Mark McCarthy as the Senior Editor.

Our decision has been reached after consultation between the editors and the reviewer. Based on these discussions and the individual reviews, we regret to inform you that your work will not be considered further for publication in eLife.

After review of your revised manuscript, it has been determined that there are still a number of concerns on the analyses and conclusions described. Details of these concerns are cited below. Given that the overall evaluation is still not positive, regretfully, we are unable to accept this manuscript for publication. We are sorry that on this occasion we cannot be more encouraging, but thank you for this submission.

Reviewer #2:

In this revised manuscript, the authors' central assumption remains: that it is meaningful to distinguish "transcribed" from "non-transcribed" loci (qualitatively), and that regions of the genome with 0 mapped RNA-seq reads can be inferred to be "non-transcribed", as opposed to just undersampled.

The analysis now completely depends on a new conclusion, reached by mathematically erroneous means in Appendix 2, that the minimum expression level for any locus is >=1-2 reads per 100M reads, with no loci more lowly expressed. The authors now switch to an analysis of aggregated data (600M reads), conceding that the 100M sample size in the previous manuscript is too small. They argue that at 600M reads, they have saturated the detected transcriptional coverage (because if the minimum expression level is 1/100M, and we sample 600M reads, we expect to map ~6 reads per window; thus the probability of seeing 0 mapped reads on a transcribed window is small). The analysis then proceeds to count singleton loci with only 1 mapped read as "expressed" and 0 reads as "not expressed" (Figure 2), for large numbers of singletons observed in the data, conflicting with the argument they are at saturation, because saturation requires that the number of singletons has been driven to a negligible number.

It remains the case that we cannot count 1 versus 0 mapped reads as different expression levels, for the reasons already pointed out for the previous manuscript version. This remains a fundamental flaw in the analysis, which in my opinion needs to be completely rethought.

Detail on why I call Appendix 2 mathematically erroneous:

Apologies for the length but I think it's important to explain.

In the previous review, I had worried that the analysis was subject to a small sample number artifact on low-expression regions. Supposing a low-expressed locus is transcribed at the equivalent of 1 expected read: then 60% of the time it will be detected with >=1 mapped read, and 40% of the time it will have 0 mapped reads. So when low-expressed loci exist, where small numbers of reads are mapping, the same locus will appear to be "non-transcribed" in one sample and "transcribed" in another, simply by statistical fluctuation. This would be an extremely troubling and serious error in a data analysis design, but I wasn't entirely sure I understood the manuscript, so I left it open to the authors to explain any misunderstanding. The authors' response confirms that I understood their analysis correctly.

To address the problem, the authors have inserted a new analysis in Appendix 2. The purpose of Appendix 2 is to argue that this is not a problem in the aggregated data (600M reads), although the authors concede the problem in the individual tissue data sets (100M reads). The claim is that the data are "compatible with" a minimum expression level of 1-2 mapped reads per 100M reads (i.e. essentially no locus has a lower expression level than this) and thus at 600M reads, all transcribed loci have been saturated.

Aside from my skepticism that any fixed minimum expression level can exist (how would that work, thermodynamically and biologically?), the analysis in Appendix 2 appears to be mathematically incorrect in several respects. First, most seriously, it confuses a cumulative distribution, P(counts >= 1), with a probability distribution, P(counts = x). The equation P = 1-(1-x)^n gives (as the authors correctly say) "the probability for finding a read at least once among n total reads" – this is a cumulative distribution, P(counts >= 1). The authors then compare this to the number of singletons: which is the number of windows with exactly one mapped read, P(counts = 1). The number of windows with exactly one mapped read is not calculated by the authors' equation; it would instead be calculated by the binomial distribution for P(counts=1), n * x * (1-x)^(n-1).

Second, you cannot extrapolate this calculation to the total number of expected singletons in the 100M or 300M read sample unless you assume that all loci are expressed at the same low level t. This is obviously not the case. For starters, known coding genes are expressed at far higher levels. If less well expressed loci exist, they will also contribute to the singleton counts, each detected with lower probability. The authors' analysis assumes that all loci have the same expression level t and therefore no lower expressed loci exist. Since the analysis is intended to test whether lower expressed loci exist, it is a problem that the analysis assumes that conclusion.

Third, it seems obvious from the singleton counts that the authors' conclusions can't be right. The authors' main argument in Appendix 2 is that the "data are compatible" with loci having a minimum concentration of 1-2 reads per 100M reads, so by the time they have sampled 600M reads, the probability of detecting any given locus is essentially one and "no major sampling variance is expected". But the reason that the probability of detecting a locus transcribed at 1 read/100M reads in a sample of 600M reads is high is that the expected number of reads is 6. Under the authors' assumptions, the Poisson probability of detecting a singleton in a 100M read sample is 40% (40% probability of 0 counts, 40% of 1 count, 20% of >1 count). The Poisson probability of detecting a singleton in a 600M read sample is 15% (5% probability of detecting 0 counts; 15% of 1 count; 80% of >1 count; mean expected number of counts = 6). Thus if the authors' assumptions are correct, the number of singletons should decrease in a larger sample, not increase. Indeed, this is what saturation means! Instead, the data in Appendix 2 consistently show more singletons in the 300M sample than the 100M sample. That's what you expect if there's a continuous tail of lower and lower expressed loci, which seems to me likely.

The authors inappropriately fit their observed singleton numbers for P(counts=1) (for example, 427585 for AH 100M sample, 569564 for AH 300M sample) to the cumulative binomial probability for P(counts >= 1); this is mathematically nonsensical. Oddly, though they get very good "fits", with "expected" counts almost exactly equal to observed counts – how could this be? It took me a while to figure this out. First, note that if these were really "expected" vs. "observed" counts, the observed counts c would have sampling errors of about sqrt(c), about +-700 here; it's a red flag that the agreement between "expected" and "observed" is far too good. What's happening is that the authors' "expected" counts aren't really expected counts under a model. Instead what the authors are doing (I believe) is a numerical fit to the observed data: given observed counts a1 and a2, from sample sizes n1 and n2, they ask if they can fit an unknown "predicted number in sample", N, that satisfies both a1/N = 1 – (1-x)^n1 and a2/N = 1-(1-x)^n2. Because they can fit N, they conclude that the data are "compatible" with their conclusion that there is a "minimum" (actually, the analysis requires it to be constant, not just minimal; see above) expression level of ~1 read/100M reads. The trouble with this analysis (which, again, is set up wrong in the first place), and the reason that it appears to work, is that if n2/n1 > a2/a1, I believe you can always find a "predicted number in sample" N that satisfies the two (inappropriate) cumulative binomial distribution equations. That is, the fit shows nothing other than the fact that the number of singletons is increasing slower than the increase in sample size, which is expected under any possible model. It does not say anything about the minimum concentration t. Indeed, again, the fact that the number of singletons is increasing, not decreasing, as we go from 100M to 300M sample size says that we are not at saturation, and the 300M sample is starting to detect singletons at even lower-expressed loci.

The authors point to Figure 5B as additional evidence that they are at saturation. Figure 5B is indeed puzzling, because the appearance of an asymptotic saturation at about 85% coverage conflicts with the increasing numbers of singletons with increasing read number in Appendix 2. But there are two things about Figure 5B that reduce my confidence in the authors' conclusions about it. One is that if we extrapolate the other direction (to lower coverage), the authors' model fit predicts that with no sequencing reads at all (the y intercept of the graph at x=0), 40% of the genome is detected by (nonexistent) sequencing reads with coverage of >= 1 read! If the model fit can't extrapolate safely to low coverage, I'm not sure why I would trust its extrapolations to high coverage. Second, the authors comment that the standard deviations on the data points in Figure 5B are "too small to display". To obtain a standard deviation, you have to make measurements on multiple independent samples. The top data point is measured on all the data – I believe there is only one measurement, which is why the std deviation is too small to display (it's zero). The other data points are measured (I understand) on subsets of the data. For any depth >0.5, the measurements cannot be independent (because to get measurements at depth 0.5 of the total, you can only split the data in half, to get two independent measurements). This will result in underestimating the variance. Small shifts in those data points could lead to a different appearance of whether these data are saturating or not, and I suspect something like that is going on.

(Similarly the model fit in Figure 5A, if taken seriously, predicts that once we have >25 taxa sequenced, we will cover >100% of the genome; sequencing 100 taxa will cover 120% of the genome. The model fits in Figure 5 seem inappropriate and overly precise to me, given that they both produce nonsensical predictions.)

Transcription levels are likely to be a continuum, reaching down to lower and lower expression levels, never truly zero. (Even if the authors disagree with this, I'd point out that all their data are on heterogeneous tissue samples like brain, which are a complex mixture of cell types. A locus expressed at some amount in one neuronal cell type can show any degree of smaller expression level in the overall tissue sample, simply because there can be any degree of dilution of that cell type in the sample.) I think the authors need to rethink how they're doing their data analysis. I think the only way to do it is to look for loci that have statistically significantly increased expression in one taxon relative to another. Differential analysis of RNA-seq gene expression data is a well understood area, with good existing methods, and I think one of them could be used in a redesigned analytic strategy here.

The data are still not available, though the authors say they are. The authors cite EBI ENA and Dryad accession numbers. None of these accessions exist at EBI ENA or Dryad. Perhaps the authors have them deposited under some sort of hold? Referees must have access to these data to perform peer review.

An important control that the authors should do to validate their analysis is to do replicates on the same taxon. You should not see "novel transcription" in replicates, under the conditions of this analysis (i.e. with individual variation and environment held fixed). If the data analysis strategy is valid (if they are at saturation), they will see a negligible number of "novel transcription" windows in replicates of the same taxon. If instead there is a small sampling artifact, they will see "novel transcription" for low-level expressed regions, created merely by sampling variation between replicates. I suggested this control in the previous review; I do not understand the authors' response. They say they can't do it because they "cannot use different individuals". The idea of a replicate control (in this case) is to take multiple samples from the same individuals/same taxon. (Maybe it's better to think in terms of technical replicates, rather than biological replicates; the question is only about variation cause by sampling depth.) The authors say that "the analogous analysis is done in our use of the brain samples at 100M vs. 300M read depth in Appendix 2". I don't understand that response either, because nothing in Appendix 2 measures the false positive rate of "detected transcription" between different sequence read samples of the same size from the same source. They could, for example, given an even deeper sample (1200M reads, for example), compare 600M to 600M subsets to see how much novel transcription they see, to try to validate their analysis at a 600M sample size. It might also be possible to use bootstrap resampling statistics, resampling the 600M with replacement, which could be done without needing more sequencing depth.

The simulation experiments on the NGM mapper performance may need to be interpreted more carefully. The plots in Appendix 1 show a ~5% mismapping rate at ~6% divergence, increasing with higher divergence. The conclusions in the paper depend in part on observing 1-7% unique transcription coverage per taxon, which is within the mismapping rate of NGM. Note that the observed coverage in Figure 2 increases with increasing phylogenetic distance, which is not expected biologically, but is the prediction of an increasing mismapping rate. Moreover, the actual analysis (as opposed to the simulations in Appendix 1) leaves out ~30% of the genome and only maps to the "common genome"; leaving genome out increases mismapping rates (because a read that maps better to the missing part of the genome that the mapper doesn't see can be mapped to its best but incorrect match in the genome that the mapper does see.)

[Editors’ note: what now follows is the decision letter after the authors submitted for further consideration.]

Thank you for choosing to send your work entitled "Fast turnover of genome transcription across evolutionary time exposes entire non-coding DNA to de novo gene emergence" for consideration at eLife. Your letter of appeal has been considered by Mark McCarthy (Senior editor) and the original Reviewing editor.

In the light of your comments, we have sent your manuscript out for an additional perspective and set of reviews. Enclosed you will find the reviews provided by this reviewer. As you will see, while this reviewer agrees with some of the issues raised by the earlier set of reviews, it is the judgement of the new reviewer that these issues do not affect the major conclusions of your manuscript. This review does point out that there are remaining issues associated with the saturation plots. However, the reviewer makes some concrete suggestions in the final paragraph of the review that should address these issues.

We would welcome a revised manuscript that addresses these remaining issues along the lines suggested by the most recent reviewer. We look forward to your response and receiving a revised manuscript.

Reviewer #3:

In the manuscript "Fast turnover of genome transcription across evolutionary time exposes entire non-coding DNA to de novo gene emergence" the authors, Rafik Neme and Diethard Tautz, analyze transcriptome data (RNA-seq) from different mouse taxa, for each of three tissues, to evaluate the transcription of coding and non-coding regions from an evolutionary perspective. In agreement to previous published results, the authors find that majority of the genome is transcriptionally active. By focusing on the portions of the genomes that are shared across the taxa studied they find that many of these regions are commonly transcribed across multiple taxa. They are capable of recapitulating the phylogenic tree, counteracting the common view that most of these lowly transcribed regions are mostly biological and technical noise. The evidence that pervasive transcription might be a resource to promote non-functional regions to selection is a valuable finding.

The criticisms raised by the reviewers are mostly reasonable. Indeed, some of the analyses presented by the paper, especially the saturation of sequencing depth, are not as robust as they could be. The authors claim that the fraction of the genome transcribed should saturate with increasing sequencing depth at approximately 85%. However, as one of the reviewers points out, 7% of the 200bp windows are singletons (only contain a single read supporting transcription), suggesting many windows are right at the threshold of detectability. Nonetheless, the authors show that the lowly transcribed regions are able to reconstruct the phylogenetic tree of the taxa, indicating that the biological signal is significantly higher than any technical noise due to sequencing depth and thresholding. We believe that despite the limitations of the saturation analysis with respect to sequencing depth being important, in the revision of the manuscript this area has been modified, but some technical details remain that do not help the authors’ arguments (maybe remnants of the first submission).

In order to reconcile the interesting findings and the saturation analysis we suggest that authors revise the text associated with Figure 5 and only include what is necessary for the central conclusions of the paper. Also, they should clearly state the amount of putatively noisy (singleton) windows for each taxon in the main text. We also suggest an additional analysis that would pragmatically answer to the question whether low coverage windows are noise or not. The authors could build phylogenetic trees using different thresholds. If different thresholds result in phylogenetic trees similar to the nucleotide divergence tree, the authors could indirectly infer that the sequencing depth noise is significantly smaller than the biological signal. While we recognize the potential deficiencies of the saturation/thresholding analyses, we don't believe it should preclude publication.

https://doi.org/10.7554/eLife.09977.028

Author response

The reviewers have discussed the reviews with one another and the Reviewing editor has drafted this decision to help you prepare a revised submission.

Although the work is of interest and timely, we regret to inform you that the findings at this stage contain several problematic issues that require attention before a final decision can be rendered. The first of these issues is the possibility that the analysis may be compromised by a serious artifact that would need to be corrected or excluded. In the authors' experimental design, it seems possible to call a region "transcribed" in taxon #1 and "not transcribed" in taxon #2 even if the region is in fact transcribed at exactly the same expression level in both taxa. This is because statistical fluctuation can cause even exactly the same region (even in the same taxon) to pass the simple coverage threshold (of coverage >= 1 or >= 5) in one sample but not another. For example, suppose a region is actually expressed at a level equivalent to a coverage of 1 read per base. With a Poisson probability of about 60%, one or more reads will fall over the position, and satisfy the "transcribed" threshold; with a probability of about 40%, zero reads will fall over the position, and then it will be called "not transcribed". The authors call the >=5 read threshold "more stringent and therefore more reliable" but this is not correct, because it is just as prone to the problem; for a region with an expression level of 5, in about half of samples you would observe a coverage of < 5 and call the region not transcribed, and in the other half you would observe a coverage of >= 5 and call the same region transcribed. Of course only low-expressed regions (close to the >=1 or >=5 thresholds) will suffer the problem of fluctuating above and below the thresholds, but many regions are lowly expressed, and indeed this is the point in a study of "pervasive transcription".

We agree that this issue was not properly addressed in the first version of the manuscript. We have tried various possibilities to deepen the insight into this sampling issue in our data. We have come up with using the extra transcriptome sequencing from the brain samples and binomial sampling distribution formulas to estimate the probabilities to detect the rare transcripts in our data. The detailed explanation for this approach is summarized in Appendix 2. The conclusion from this analysis is that a problem would exist when we do the comparisons at the single tissue level, while we have very high detection probability (>0.997) for the aggregate dataset. Accordingly, most of the further analysis is focused on the aggregate data. Apart of this, the turnover argument rests also on the analysis in Figures 5 and 6, which are not sensitive to sampling variation.

Concerning the threshold argument: we have now changed this into assigning first the presence/absence of a transcript and assigning it then to abundance classes.

Because the probability of calling even the same low-expressed region "transcribed" in multiple replicate samples can be significantly less than one, the probability of seeing a region called "transcribed" in multiple samples will be an artifactually decreasing function, and indeed could look just like the results in Figure 3, depending on details.

As pointed out above, this argument applies to some degree at a 100Mill read level, but not at the aggregate 600Mill read level that we use throughout most of the analysis. Note that this aggregate level is also dominated by transcripts from one tissue, the brain. Hence, there may still be some noise from the liver and testis-specific transcripts, but none that should affect the overall pattern.

A control experiment that the authors could do: what fraction of regions within the same taxon are detected in common in biological replicates – i.e. what does Figure 3 look like if you do 10 replicates of the same taxon, not 10 different taxa? For the experiment to work as planned, that plot must show a constant level. But if this artifact is in play, you'll see a decreasing function much like Figure 3.

We cannot use different individuals to do this control, since they are themselves polymorphic for many of the regions (as it has also been shown in Drosophila). But the analogous analysis is done in our use of the brain samples at 100 vs 300Mill read depth in Appendix 2 and by resampling of trees in Figure 4—figure supplement 3.

One effect of the artifact would be that closely related taxa would show unexpectedly large "distances" in the analysis of Figure 4, because some unknown amount of "distance" is being generated by stochastic fluctuation on mapped read counts, not by evolutionary divergence. This is an alternative explanation for the authors' observation that the closely related taxa have disproportionately large distances, undermining one of the paper's main conclusions that there is larger turnover at shorter evolutionary divergence. (The fact that the tree is reasonably concordant with the expected mouse tree does suggest that there is signal in the data, though, so my expectation is that there's a mixture of artifact and signal going on.)

We have now much expanded this tree section, doing analyses at several levels (see supplemental files for Figure 4). We find indeed that at 100M reads there is a mixture of signal and artifact, but after estimating and correcting the amount of variance in terminal branches, we are able to confirm our original conclusion at essentially all threshold levels, including the potentially most unreliable source (regions covered only once per sample), where the variance is highest, but still yields the expected phylogenetic signal.

Thus I do not think the authors can safely conclude that transcription is not conserved in these taxa. I don't think a simple threshold on coverage is the right way to analyze the data. I think they must do a more statistically rigorous experimental analysis that takes into account the distributions P(c1 e1) and P(c2 e2) for how much read coverage c1, c2 you observe in taxa 1,2 given unknown expression levels e1, e2 (differential RNA-seq methods typically estimate negative binomial distributions, e.g. over dispersed Poissons, for these). For example you might seek to reject the hypothesis that e1=e2 given the observed c1, c2 (though this would only test for significant changes in expression level, not gain/loss of txn), or that e1>threshold for "on" and e2=epsilon where epsilon is some chosen background "off" level. Any such analysis would also leave you with regions of "don't know", where you were unable to tell from the observed counts if the region were differentially transcribed or not.

This proposed statistic would indeed be the best if we wanted to find significant differences in expression levels. However, we focus everything on binary decisions (presence/absence) and assign expression levels only afterwards. For this goal, we consider it as more appropriate to use the sampling statistics approach detailed in Appendix 2.

There is an issue concerning the distribution of the detected transcripts between annotated coding or non-coding transcripts (i.e. a large proportion of which will be coding regions) and novel transcribed regions (a large proportion of which are likely to have little coding potential). The authors indicate that cumulatively (Figure 2F) ~36% present in the "common genome" (~70% of total genome) for each taxa is transcribed, resulting in a total of ~76% of the common genome being transcribed considering all 3 tissues of all 10 taxa. What proportion of these uniquely mapped reads correspond to annotated vs. unannotated (i.e. novel transcribed regions found in these experiments)? Are the turnover rates observed different between these two categories of genomic regions? If they are different, is the turnover rate observed driven by one or the other regions? This seems important since the underlying assumption seems to be that the novel regions (i.e. as yet unannotated regions) are driving the rapid turning. Additionally, do the de novo transcripts that survive this turnover possess more coding or non-coding characteristics? If they are non-coding mostly, are there "observations" that can be made about the evolution of genomes as being driven by the ongoing emergence of non-coding genes?

We now provide a more detailed analysis of this question in Appendix 2. Most of the turnover can indeed be ascribed to non-annotated transcripts, although a portion of the annotated ones are affected as well. We refrain from making statements about the translation status of the non-annotated transcripts. The reason is that there is always at least a short open reading frame to be found and one would have to set a more or less arbitrary cutoff to call something coding or non-coding. There is also increasing evidence for transcripts coding for very short peptides and/or classified "non-coding" transcripts being associated with ribosomes. Further, the concept of de novo gene evolution implies the possibility of a transition from a non-coding RNA to a coding one and we discuss this. However, we agree that a further in depth analysis of the identified de novo genes may reveal further interesting insights, but these would be case studies that go beyond the current scope of the paper. Just as a side note from an ongoing preliminary analysis: we can indeed find signs of translation of de novo transcripts in the non-curated peptide databases.

The sequence data must be deposited in appropriate public database (NCBI GEO, for example) and accessioned. The "Datasets generated" point in the cover page is set to "N/A" but should not be. The authors should provide not just the reads, but also some sort of summary data files (such as.bed files) that enable reviewers and readers to reproduce the important features of the analyses without having to remap reads. Both summary.bed files and read data also need to be provided to reviewers.

We have submitted raw reads and bam files to ENA, as well as the tables of coverage (mixture of bed file with coverage information) across all windows, taxa and tissues to Dryad. Accession numbers are provided.

To do the analysis in Figure 5, you can't just select the best hypothesis of the two (your GLM for continuously increasing, vs. your nonlinear regression for saturation). For all we know, the two hypotheses explain the data essentially equally well. You can't conclude that Figure 5A shows continuous increase and Figure 5B shows saturation unless you show that one hypothesis is significantly better (i.e. statistically significantly). If you're using BIC (Bayesian information criterion) it seems you should be able to directly calculate and report the Bayes factor supporting one hypothesis versus the other.

We now report the statistical support as requested.

The simulation is unclear. How does a string of 10,000 states approximate a 3Gb genome? Why bother with making it a string, since there's no dependence (that I see) between positions in your model (even though in a real genome there would be strong dependency between sites, because transcripts are contiguous). Why do you model 0->0 and 1->1 as "events"? (I think a time-dependent Markovian birth/death state change model would usually just have 0->1 with a birth rate, and 1->0 with a death rate.) Why do you call fitting to observed parameters by RMSD "maximum parsimony"? (Parsimony minimizes some counted events, possibly weighted ones, not a RMSD of continuous-valued features.) What exactly are the observed features that you fit the simulation to (as opposed to how you list example "summary values such as

", I want to understand exactly what you used), and how did you calculate them?

We have removed this simulation, since it was indeed only a preliminary attempt to get some rate estimates. Further, the internal branch resolution is particularly sensitive to sampling variance. We will continue to work on this issue but have decided to drop this part in the present manuscript.

It is stated that NextGenMap is designed for mapping to divergent genomes, but your results are so dependent on the quality of that mapping that I think you're obligated to do and show a control experiment that demonstrates that it is in fact working as expected on these genomes, rather than just asserting that it "performs extremely well" (subsection “Mapping”, third paragraph).

We are now providing such a test in Appendix 1.

[Editors’ note: the author responses to the second round of peer review follow.]

After review of your revised manuscript, it has been determined that there are still a number of concerns on the analyses and conclusions described. Details of these concerns are cited below. Given that the overall evaluation is still not positive, regretfully, we are unable to accept this manuscript for publication. We are sorry that on this occasion we cannot be more encouraging, but thank you for this submission.

Reviewer #2: In this revised manuscript, the authors' central assumption remains: that it is meaningful to distinguish "transcribed" from "non-transcribed" loci (qualitatively), and that regions of the genome with 0 mapped RNA-seq reads can be inferred to be "non-transcribed", as opposed to just undersampled.

The largest part of the data is actually about positively identifying transcripts and most of the conclusions are based on this. Only the statement about the turnover of transcripts depends on "not finding" a transcript. We had addressed this with some very basic probability calculations in Appendix 2. The referee does not agree with these (see further discussion below), but does not seem to appreciate that there are two other major analysis strategies in the paper (represented in Figures 3 and 4 and their supplements) that support our conclusions on the turnover.

The analysis now completely depends on a new conclusion, reached by mathematically erroneous means in Appendix 2, that the minimum expression level for any locus is >=1-2 reads per 100M reads, with no loci more lowly expressed. The authors now switch to an analysis of aggregated data (600M reads), conceding that the 100M sample size in the previous manuscript is too small.

First of all, the analysis does not "completely depend" on the estimates in Appendix 2 and second, we disagree that our approach was "mathematically erroneous" (see below).

They argue that at 600M reads, they have saturated the detected transcriptional coverage (because if the minimum expression level is 1/100M, and we sample 600M reads, we expect to map ~6 reads per window; thus the probability of seeing 0 mapped reads on a transcribed window is small). The analysis then proceeds to count singleton loci with only 1 mapped read as "expressed" and 0 reads as "not expressed" (Figure 2), for large numbers of singletons observed in the data, conflicting with the argument they are at saturation, because saturation requires that the number of singletons has been driven to a negligible number.

The fraction of singleton reads in the data is stated in the table in Appendix 2. At the 100Mill read level, they account for about 0.37-0.84% of the reads, at the 300Mill read level for 0.17-0.32% of the reads. This is not a "large fraction". Even if one assesses this at the level of singleton windows, the fraction is not higher than 7% (depending on read depth for a taxon) of all windows. This introduces some sampling variance (which we analyze explicitly in Figure 4—-figure supplement 3), but this does not invalidate our main conclusions.

It remains the case that we cannot count 1 versus 0 mapped reads as different expression levels, for the reasons already pointed out for the previous manuscript version. This remains a fundamental flaw in the analysis, which in my opinion needs to be completely rethought.

While we agree that this statement would be correct in the context of studying expression differences, we need to point out that our paper is neither designed to do this, nor can it be viewed in this context. We use in some threshold levels the 1-vs-0 mapped reads distinction, but we also use other more conservative thresholds (10-vs-0, and 100-vs-0). Further, we show that even the 1-vs-0 comparison has actually meaningful biological information in Figure 4—figure supplement 3. But more importantly, the transcripts with higher read coverage show also a high turnover, which cannot be statistically disputed. Hence, we do not understand how the conclusion of a "fundamental flaw" comes about, when the large majority of the data are not under discussion.

Detail on why I call Appendix 2 mathematically erroneous: Apologies for the length but I think it's important to explain. In the previous review, I had worried that the analysis was subject to a small sample number artifact on low-expression regions. Supposing a low-expressed locus is transcribed at the equivalent of 1 expected read: then 60% of the time it will be detected with >=1 mapped read, and 40% of the time it will have 0 mapped reads. So when low-expressed loci exist, where small numbers of reads are mapping, the same locus will appear to be "non-transcribed" in one sample and "transcribed" in another, simply by statistical fluctuation. This would be an extremely troubling and serious error in a data analysis design, but I wasn't entirely sure I understood the manuscript, so I left it open to the authors to explain any misunderstanding. The authors' response confirms that I understood their analysis correctly.

We did indeed appreciate that the referee had alerted us to this problem and to think in more depth about it. We have therefore carefully addressed this issue in the new version in multiple ways – not only the one that is detailed in Appendix 2. As a major change, we had dropped the simulation analysis, since this was sensitive to a reasonably accurate estimate at the very low transcription level. All our current conclusions are based on much deeper sampling and revised analysis schemes, where this problem becomes negligible (e.g. the new analysis scheme in Figure 3). Hence, we feel that we have fully responded to the criticism of the first round of reviews.

To address the problem, the authors have inserted a new analysis in Appendix 2. The purpose of Appendix 2 is to argue that this is not a problem in the aggregated data (600M reads), although the authors concede the problem in the individual tissue data sets (100M reads). The claim is that the data are "compatible with" a minimum expression level of 1-2 mapped reads per 100M reads (i.e. essentially no locus has a lower expression level than this) and thus at 600M reads, all transcribed loci have been saturated.

We do not propose a "fixed minimum expression level". Rather we focus the analysis on the transcriptional level that could potentially create the largest impact on the statistics, namely singleton reads that are seen at the 100Mill read level. It appears that the referee has overlooked the following statement in the Appendix "Note that rarer transcripts would have only a small chance to be sampled in the first set, i.e. should also not impact the second sampling so much (see lines for 0.1, 0.25 and 0.5 in the plot).", i.e. transcripts that are not detected at the 100 Mill read level will also have a correspondingly smaller impact at deep sequencing levels (keeping in mind that the overall number of singletons is anyway only at a small percentage).

Aside from my skepticism that any fixed minimum expression level can exist (how would that work, thermodynamically and biologically?), the analysis in Appendix 2 appears to be mathematically incorrect in several respects. First, most seriously, it confuses a cumulative distribution, P(counts >= 1), with a probability distribution, P(counts = x). The equation P = 1-(1-x)^n gives (as the authors correctly say) "the probability for finding a read at least once among n total reads" – this is a cumulative distribution, P(counts >= 1). The authors then compare this to the number of singletons: which is the number of windows with exactly one mapped read, P(counts = 1). The number of windows with exactly one mapped read is not calculated by the authors' equation; it would instead be calculated by the binomial distribution for P(counts=1), n * x * (1-x)^(n-1). Second, you cannot extrapolate this calculation to the total number of expected singletons in the 100M or 300M read sample unless you assume that all loci are expressed at the same low level t. This is obviously not the case. For starters, known coding genes are expressed at far higher levels. If less well expressed loci exist, they will also contribute to the singleton counts, each detected with lower probability. The authors' analysis assumes that all loci have the same expression level t and therefore no lower expressed loci exist. Since the analysis is intended to test whether lower expressed loci exist, it is a problem that the analysis assumes that conclusion.

We agree that our analysis constitutes only a highly simplified first approximation. We focus on only one borderline case, but we think this is actually the most relevant one (see above). We do not think that our analysis is "mathematically incorrect" it simply does not reflect the true distribution of transcript classes. However, this would be much more difficult to model (probably subject to a different paper) and is of little relevance for the overall conclusions in the present paper.

Third, it seems obvious from the singleton counts that the authors' conclusions can't be right. The authors' main argument in Appendix 2 is that the "data are compatible" with loci having a minimum concentration of 1-2 reads per 100M reads, so by the time they have sampled 600M reads, the probability of detecting any given locus is essentially one and "no major sampling variance is expected". But the reason that the probability of detecting a locus transcribed at 1 read/100M reads in a sample of 600M reads is high is that the expected number of reads is 6. Under the authors' assumptions, the Poisson probability of detecting a singleton in a 100M read sample is 40% (40% probability of 0 counts, 40% of 1 count, 20% of >1 count). The Poisson probability of detecting a singleton in a 600M read sample is 15% (5% probability of detecting 0 counts; 15% of 1 count; 80% of >1 count; mean expected number of counts = 6). Thus if the authors' assumptions are correct, the number of singletons should decrease in a larger sample, not increase. Indeed, this is what saturation means! Instead, the data in Appendix 2 consistently show more singletons in the 300M sample than the 100M sample. That's what you expect if there's a continuous tail of lower and lower expressed loci, which seems to me likely.

While the argument of the referee is correct in principle, we need to point out that the comparison of 100M to 300M is not yet in the saturation range (see Figure 5B) – and this is actually not a requirement for our calculation anyway. Further, the fraction of singletons does actually decrease in the 300Mill read sample (as stated above), it is unclear why the referee assumes that they increase (only the absolute number increases, but the relative fraction goes down as predicted by the referee).

The authors inappropriately fit their observed singleton numbers for P(counts=1) (for example, 427585 for AH 100M sample, 569564 for AH 300M sample) to the cumulative binomial probability for P(counts >= 1); this is mathematically nonsensical. Oddly, though they get very good "fits", with "expected" counts almost exactly equal to observed counts – how could this be? It took me a while to figure this out. First, note that if these were really "expected" vs. "observed" counts, the observed counts c would have sampling errors of about sqrt(c), about +-700 here; it's a red flag that the agreement between "expected" and "observed" is far too good. What's happening is that the authors' "expected" counts aren't really expected counts under a model. Instead what the authors are doing (I believe) is a numerical fit to the observed data: given observed counts a1 and a2, from sample sizes n1 and n2, they ask if they can fit an unknown "predicted number in sample", N, that satisfies both a1/N = 1

(1-x)^n1 and a2/N = 1-(1-x)^n2. Because they can fit N, they conclude that the data are "compatible" with their conclusion that there is a "minimum" (actually, the analysis requires it to be constant, not just minimal; see above) expression level of ~1 read/100M reads. The trouble with this analysis (which, again, is set up wrong in the first place), and the reason that it appears to work, is that if n2/n1 > a2/a1, I believe you can always find a "predicted number in sample" N that satisfies the two (inappropriate) cumulative binomial distribution equations. That is, the fit shows nothing other than the fact that the number of singletons is increasing slower than the increase in sample size, which is expected under any possible model. It does not say anything about the minimum concentration t. Indeed, again, the fact that the number of singletons is increasing, not decreasing, as we go from 100M to 300M sample size says that we are not at saturation, and the 300M sample is starting to detect singletons at even lower-expressed loci.

We agree that this is an issue for more in depth discussion and analysis. We believe that we propose in principle a possible way of how to address this, but the execution is still too superficial. However, as pointed out above, the resolution of this issue is only of minor significance for the conclusions of the current paper.We have therefore decided to drop Appendix 2. In the text we point out that the analysis done in Figure 3 was actually another approach to address the issue of possible non-detection. We have now added the sentence: "This would be particularly problematic for singleton reads, since the probability of falsely not detecting one in a second sample is about 37%. However, given that we ask whether it is detected in any of the other 9 taxa, the probability of falsely not detecting it if it exists across all of them becomes very small (0.01%)."

The authors point to Figure 5B as additional evidence that they are at saturation. Figure 5B is indeed puzzling, because the appearance of an asymptotic saturation at about 85% coverage conflicts with the increasing numbers of singletons with increasing read number in Appendix 2.

We do not see a conflict – the numbers used in Appendix 2 correspond to the ascending curve part in Figure 5B (the first point in this figure represents 600 Mill reads, i.e. 10% of the total, the numbers in Appendix 5 represent 100 vs. 300 Mill reads). The point is that going into further sequencing depth leads to saturation and this is well reflected in this figure.

But there are two things about Figure 5B that reduce my confidence in the authors' conclusions about it. One is that if we extrapolate the other direction (to lower coverage), the authors' model fit predicts that with no sequencing reads at all (the y intercept of the graph at x=0), 40% of the genome is detected by (nonexistent) sequencing reads with coverage of >= 1 read! If the model fit can't extrapolate safely to low coverage, I'm not sure why I would trust its extrapolations to high coverage.

It is true that we were not interested to obtain a fit for lower coverage behavior, which would require a different fit. But this is common practice in such rarefaction studies. The actual prediction range that we aimed at is marked as grey area in the figure and there is no reason to assume this could be far off.

Second, the authors comment that the standard deviations on the data points in Figure 5B are "too small to display". To obtain a standard deviation, you have to make measurements on multiple independent samples. The top data point is measured on all the data – I believe there is only one measurement, which is why the std deviation is too small to display (it's zero). The other data points are measured (I understand) on subsets of the data. For any depth >0.5, the measurements cannot be independent (because to get measurements at depth 0.5 of the total, you can only split the data in half, to get two independent measurements). This will result in underestimating the variance. Small shifts in those data points could lead to a different appearance of whether these data are saturating or not, and I suspect something like that is going on.

We agree with the reviewer that the subsets cannot be fully considered independent, and that we would eventually need a much larger set to properly estimate the variances. However, it is clear (both visually and statistically) that we are dealing with two different curves. Note that the first version of the manuscript used reads rather than windows to estimate this and obtained the same overall result. The sequencing depth curve is much faster in reaching the saturation compared to the taxon addition curve, which is in agreement with the other data on turnover. But we have actually also done a quick simulation to assess whether small shifts in data points could lead to a major difference in curve fitting. We find that one would need a standard deviation of 15% for such an effect, i.e. it is negligible for our data points.

(Similarly the model fit in Figure 5A, if taken seriously, predicts that once we have >25 taxa sequenced, we will cover >100% of the genome; sequencing 100 taxa will cover 120% of the genome. The model fits in Figure 5 seem inappropriate and overly precise to me, given that they both produce nonsensical predictions.)

Again we point out that fitting to measured data is the canonical approach and that this can of course not be extended to a non-existing data range. We believe that it is perfectly valid to report a theoretical intersection point at 100% and that it is understood that there is no further extension of the curve.

Transcription levels are likely to be a continuum, reaching down to lower and lower expression levels, never truly zero. (Even if the authors disagree with this, I'd point out that all their data are on heterogeneous tissue samples like brain, which are a complex mixture of cell types. A locus expressed at some amount in one neuronal cell type can show any degree of smaller expression level in the overall tissue sample, simply because there can be any degree of dilution of that cell type in the sample.)

The latter would only be true if possible tissue types would approach the number of cells in the tissue, but this of course not the case. Even rare cell types do not exist as single cells. It is actually a common observation in all large datasets that one can approximate saturation at high sequencing depth – our data are no exception. But we agree that there will be transcripts at lower level than the ones we consider – the important point is that they will not show up at a level where they would compromise the overall conclusions.

I think the authors need to rethink how they're doing their data analysis. I think the only way to do it is to look for loci that have statistically significantly increased expression in one taxon relative to another. Differential analysis of RNA-seq gene expression data is a well understood area, with good existing methods, and I think one of them could be used in a redesigned analytic strategy here.

We have actually done such an analysis during the revision process to evaluate its suitability. However, wet did not find it to be so useful and did not include it in the paper. It would actually address a different question that would only confuse the flow of the arguments. We are aiming to understand the coverage by actually identified transcripts, not the detailed statistics of not found transcripts. Evidently, to call a "significant" expression difference, one has to throw out all data of low expressed transcripts since one needs a certain minimal number (between 8-10) to gain some statistical confidence. While this is relevant when one is interested in actually measuring expression differences, it is not so useful when one is interested in understanding the dynamics of positive coverage.

The data are still not available, though the authors say they are. The authors cite EBI ENA and Dryad accession numbers. None of these accessions exist at EBI ENA or Dryad. Perhaps the authors have them deposited under some sort of hold? Referees must have access to these data to perform peer review.

We apologize that this has not worked properly. Since this is a registered manuscript, the referee should have received a link from eLife to access the Dryad entry. We have now asked ourselves for this link and can provide it here: http://datadryad.org/review?doi=doi:10.5061/dryad.8jb83

The data were also submitted to the nucleotide archive as indicated in our reply, but there was unfortunately an embargo set on them until publication of the manuscript. This means that the referee had no access, but this could easily have been corrected during the review process, if it would have been brought to our attention. We have now removed the embargo, i.e. they are fully accessible.

An important control that the authors should do to validate their analysis is to do replicates on the same taxon. You should not see "novel transcription" in replicates, under the conditions of this analysis (i.e. with individual variation and environment held fixed). If the data analysis strategy is valid (if they are at saturation), they will see a negligible number of "novel transcription" windows in replicates of the same taxon. If instead there is a small sampling artifact, they will see "novel transcription" for low-level expressed regions, created merely by sampling variation between replicates. I suggested this control in the previous review; I do not understand the authors' response. They say they can't do it because they "cannot use different individuals". The idea of a replicate control (in this case) is to take multiple samples from the same individuals/same taxon. (Maybe it's better to think in terms of technical replicates, rather than biological replicates; the question is only about variation cause by sampling depth.)

The referee had asked for biological replicates in the first round of comments and we had responded to this point. Concerning the technical replicates it appears that the referee had missed that we did actually do this for the brain samples. We had added a new paragraph in the Methods section to describe this and we show the results in the supplements of Figure 4.

The authors say that "the analogous analysis is done in our use of the brain samples at 100M vs. 300M read depth in Appendix 2". I don't understand that response either, because nothing in Appendix 2 measures the false positive rate of "detected transcription" between different sequence read samples of the same size from the same source.

There was no request to calculate a false positive rate in the first review, i.e. we could not respond to this. But apart of this, calculating such a rate would remain somewhat artificial. We consider our analysis of phylogenetic information for the re-sampled data actually more useful (see below).

They could, for example, given an even deeper sample (1200M reads, for example), compare 600M to 600M subsets to see how much novel transcription they see, to try to validate their analysis at a 600M sample size. It might also be possible to use bootstrap resampling statistics, resampling the 600M with replacement, which could be done without needing more sequencing depth.

In Figure 4—figure supplement 3 we did a partitioning of the brain data into three completely non-intersecting datasets. When focusing on the trees based on singletons only, we find indeed that 88.1% of the signal could be due to sampling variance, but there is still information in the data, i.e. even at this very noisy level, a phylogenetic turnover signal is recovered. Of course, this level of error seems still impressively high, but it affects only a small percentage of whole data set (about 0.1% of the reads equivalent to 7% of windows are singletons at the 600Mill read level).

The simulation experiments on the NGM mapper performance may need to be interpreted more carefully. The plots in Appendix 1 show a ~5% mismapping rate at ~6% divergence, increasing with higher divergence. The conclusions in the paper depend in part on observing 1-7% unique transcription coverage per taxon, which is within the mismapping rate of NGM. Note that the observed coverage in Figure 2 increases with increasing phylogenetic distance, which is not expected biologically, but is the prediction of an increasing mismapping rate. Moreover, the actual analysis (as opposed to the simulations in Appendix 1) leaves out ~30% of the genome and only maps to the "common genome"; leaving genome out increases mismapping rates (because a read that maps better to the missing part of the genome that the mapper doesn't see can be mapped to its best but incorrect match in the genome that the mapper does see.)

We agree that there may be some "grey area" for the mapping in the two most distant taxa and we have added a corresponding note to the legend of Figure 2. But most of the comparisons relate to taxa that are much more closely related and where mapping fidelity is close to 100%. We would also like to point out that we do not leave out any part of the genome for our initial mapping. We stated in the Methods that both genomic and transcriptomic reads are mapped to the mm10 genome. Only later, after coverage had been called, have we discarded those regions that could not be found in at least one of the four most distant genomes. We consider this drastically reduces mapping artifacts, as we deal with uniquely mapping reads. We have now expanded this explanation in the Methods section.

[Editors’ note: the author responses to the re-review follow.]

Reviewer #3:

In the manuscript "Fast turnover of genome transcription across evolutionary time exposes entire non-coding DNA to de novo gene emergence" the authors, Rafik Neme and Diethard Tautz, analyze transcriptome data (RNA-seq) from different mouse taxa, for each of three tissues, to evaluate the transcription of coding and non-coding regions from an evolutionary perspective. In agreement to previous published results, the authors find that majority of the genome is transcriptionally active. By focusing on the portions of the genomes that are shared across the taxa studied they find that many of these regions are commonly transcribed across multiple taxa. They are capable of recapitulating the phylogenic tree, counteracting the common view that most of these lowly transcribed regions are mostly biological and technical noise. The evidence that pervasive transcription might be a resource to promote non-functional regions to selection is a valuable finding. The criticisms raised by the reviewers are mostly reasonable. Indeed, some of the analyses presented by the paper, especially the saturation of sequencing depth, are not as robust as they could be. The authors claim that the fraction of the genome transcribed should saturate with increasing sequencing depth at approximately 85%. However, as one of the reviewers points out, 7% of the 200bp windows are singletons (only contain a single read supporting transcription), suggesting many windows are right at the threshold of detectability. Nonetheless, the authors show that the lowly transcribed regions are able to reconstruct the phylogenetic tree of the taxa, indicating that the biological signal is significantly higher than any technical noise due to sequencing depth and thresholding. We believe that despite the limitations of the saturation analysis with respect to sequencing depth being important, in the revision of the manuscript this area has been modified, but some technical details remain that do not help the authors’ arguments (maybe remnants of the first submission).

In order to reconcile the interesting findings and the saturation analysis we suggest that authors revise the text associated with Figure 5 and only include what is necessary for the central conclusions of the paper.

We have corrected the inconsistencies in this figure legend and have shortened it to focus on the main points.

Also, they should clearly state the amount of putatively noisy (singleton) windows for each taxon in the main text.

We have now very specifically addressed this issue in the text (Results, eleventh paragraph) and added corresponding supplementary figures (Figure 4—figure supplement 1 shows the fraction of singletons in dependence of each sample in each taxon, Figure 4—figure supplement 2 in dependence of read depth). These show that the majority of singletons in individual samples are actually redetected in other samples and that singleton numbers go down with increasing sequencing depth.

We also suggest an additional analysis that would pragmatically answer to the question whether low coverage windows are noise or not. The authors could build phylogenetic trees using different thresholds. If different thresholds result in phylogenetic trees similar to the nucleotide divergence tree, the authors could indirectly infer that the sequencing depth noise is significantly smaller than the biological signal.

We have done this and have included the tree based on singletons as Figure 4C and the ones with two other thresholds as Figure 4—figure supplement 3. We find indeed that there is phylogenetic signal even for the singleton tree.

While we recognize the potential deficiencies of the saturation/thresholding analyses, we don't believe it should preclude publication.

We appreciate this statement, since the key findings of the paper are indeed based on the actual detection of transcripts, not on the remaining potential uncertainty of not detecting some low level transcripts.

https://doi.org/10.7554/eLife.09977.029

Article and author information

Author details

  1. Rafik Neme

    Max-Planck Institute for Evolutionary Biology, Plön, Germany
    Contribution
    RN, Conception and design, Acquisition of data, Analysis and interpretation of data, Drafting or revising the article
    For correspondence
    rneme@evolbio.mpg.de
    Competing interests
    No competing interests declared.
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-8462-5291
  2. Diethard Tautz

    Max-Planck Institute for Evolutionary Biology, Plön, Germany
    Contribution
    DT, Conception and design, Analysis and interpretation of data, Drafting or revising the article
    For correspondence
    tautz@evolbio.mpg.de
    Competing interests
    DT: Senior editor, eLife
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-0460-5344

Funding

European Research Council (322564)

  • Rafik Neme
  • Diethard Tautz

Max-Planck-Gesellschaft

  • Diethard Tautz

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank the C Pfeifle and the mouse team for providing the animals, N Thomsen for technical assistance, J Altmüller and C Becker for sequencing, B Harr, A Nolte, C Xie, L Pallares and L Turner for comments on the manuscript and the members of our group for discussions and suggestions. Special thanks to F Sedlazeck for bioinformatic advice and provision of software before publication. RN was supported by a PhD fellowship of the IMPRS for Evolutionary Biology during the initial phase of the project. The project was financed through an ERC advanced grant to DT (NewGenes - 322564).

Ethics

Animal experimentation: All mice were obtained from the mouse collection at the Max Planck Institute for Evolutionary Biology, following standard rearing techniques which ensure a homogeneous environment for all animals. Mice were maintained and handled in accordance to FELASA guidelines and German animal welfare law (Tierschutzgesetz § 11, permit from Veterinäramt Kreis Plön: 1401-144/PLÖ-004697).

Reviewing Editor

  1. Thomas R Gingeras, Cold Spring Harbor Laboratory, United States

Publication history

  1. Received: July 8, 2015
  2. Accepted: February 1, 2016
  3. Accepted Manuscript published: February 2, 2016 (version 1)
  4. Version of Record published: March 7, 2016 (version 2)
  5. Version of Record updated: October 14, 2016 (version 3)

Copyright

© 2016, Neme et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 2,630
    Page views
  • 745
    Downloads
  • 19
    Citations

Article citation count generated by polling the highest count across the following sources: PubMed Central, Scopus, Crossref.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

  1. Further reading

Further reading

    1. Evolutionary Biology
    2. Genetics and Genomics
    Laura A Wetzel et al.
    Research Advance
    1. Chromosomes and Gene Expression
    2. Genetics and Genomics
    Tito Candelli et al.
    Research Article