
Fast turnover of genome transcription across evolutionary time exposes entire non-coding DNA to de novo gene emergence

  1. Rafik Neme (corresponding author)
  2. Diethard Tautz (corresponding author)
  1. Max-Planck Institute for Evolutionary Biology, Germany
Research Article
Cite this article as: eLife 2016;5:e09977 doi: 10.7554/eLife.09977
8 figures, 6 tables and 2 data sets

Figures

Figure 1 with 2 supplements
Phylogenetic relationships and time estimates for the taxa used in the study.

New genome sequences were generated for taxa with *. A common genome was constructed across all taxa (Figure 1—figure supplement 1) based on a mapping algorithm that is not affected by the sequence divergence between the samples (Appendix 1). Figure 1—figure supplement 2 shows the intersection of genome coverage between the named species.

https://doi.org/10.7554/eLife.09977.003
Figure 1—figure supplement 1
Scheme for the establishment of the 'common genome' using genomic reads and the mouse reference genome.

The common genome represents the portion of the reference which is present and detectable across all species. The genome sequencing, processing and sequence analysis were done in the same way as for transcriptomes, effectively removing possible biases derived from sequencing and mapping. Note that the assignment of the common genome fraction was done after mapping all genomic and transcriptomic reads to the reference, i.e. the mapping process was not affected by a reduced mapping target.

https://doi.org/10.7554/eLife.09977.004
Figure 1—figure supplement 2
Venn diagrams of representation of the common genome, derived from 200bp windows covered in genomic reads in species with more than one million years divergence to the reference.

Windows covered by all four species are used as the common genome (shown as the intersection of all species).

https://doi.org/10.7554/eLife.09977.005
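The window intersection described for the common genome can be illustrated with a minimal sketch. It assumes, as a simplification, that a read covers only the 200 bp window containing its start position; the species names and read positions are hypothetical and not taken from the study.

```python
from functools import reduce

WINDOW = 200  # window size in bp, as stated in the caption


def covered_windows(read_starts, window=WINDOW):
    """Return the set of window indices hit by at least one read start."""
    return {pos // window for pos in read_starts}


def common_genome(coverage_by_species):
    """Keep only the windows that are covered in every species."""
    return reduce(set.intersection, coverage_by_species.values())


# Hypothetical mapped read start positions on one reference sequence.
reads = {
    "species_a": [10, 250, 410, 990],
    "species_b": [120, 260, 805],
    "species_c": [5, 399, 450, 1000],
}
coverage = {name: covered_windows(starts) for name, starts in reads.items()}
common = common_genome(coverage)
print(sorted(common))
```

A real pipeline would derive window coverage from the full alignment span of each read rather than from its start coordinate alone.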
Figure 2
Transcriptome coverage of the common genome per taxon.

(A–C) Liver, brain and testis, respectively, sequenced at approximately the same depth. (D) Combination of samples from A–C. (E) Additional sequencing of brain samples at 3x depth, compared to B. (F) Combination of all samples, including additional brain sequencing. Three coverage levels are represented by colors from light blue to dark blue: window coverage with at least 1, 10 and 100 reads. Taxon abbreviations are as summarized in Figure 1, with the taxon closest to the reference genome at the left of each panel and the most divergent one at the right. Note that the slight rise in low read coverage for the distant taxa could partially be due to slightly more mismapping of reads at this phylogenetic distance (see Appendix 1 for simulation of mapping efficiency), but is also affected by a larger fraction of singleton reads (compare Figure 4—figure supplement 1).

https://doi.org/10.7554/eLife.09977.006
Figure 3 with 2 supplements
Distribution of shared and non-shared windows with transcripts for each taxon, based on the aggregate dataset across all three tissues.

Three classes are represented: i) windows that are found in a single taxon only, ii) windows found in 2–9 taxa and iii) windows shared among all 10 taxa (from left to right in each panel). Windows with transcripts were first classified as belonging to one of the three classes, independent of their coverage, and were then assigned to the coverage classes represented by the blue shading (from light blue to dark blue: window coverage with at least 1, 10 and 100 reads). Taxon names as summarized in Figure 1. Figure 3—figure supplement 1 shows an extended version where class ii) is separated into each individual group. Relative enrichment of annotated genes in the conserved class is shown in Figure 3—figure supplement 2.

https://doi.org/10.7554/eLife.09977.007
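The two-step classification in this caption (sharing class first, coverage class second) can be sketched as follows. The function names, class labels and read counts are hypothetical; only the thresholds of 1, 10 and 100 reads come from the caption, and the example uses three taxa instead of ten to stay short.

```python
THRESHOLDS = (1, 10, 100)  # coverage classes named in the caption


def sharing_class(counts_by_taxon):
    """Classify a window by the number of taxa with at least one read."""
    n_with_reads = sum(1 for c in counts_by_taxon.values() if c > 0)
    if n_with_reads == 0:
        return "untranscribed"
    if n_with_reads == 1:
        return "single taxon"
    if n_with_reads == len(counts_by_taxon):
        return "all taxa"
    return "intermediate"


def coverage_level(read_count):
    """Return the highest threshold reached by a read count, or None."""
    reached = [t for t in THRESHOLDS if read_count >= t]
    return reached[-1] if reached else None


# Hypothetical read counts for three windows in three taxa.
windows = {
    "w1": {"taxon_a": 150, "taxon_b": 12, "taxon_c": 3},
    "w2": {"taxon_a": 0, "taxon_b": 1, "taxon_c": 0},
    "w3": {"taxon_a": 25, "taxon_b": 0, "taxon_c": 9},
}
for name, counts in windows.items():
    levels = {taxon: coverage_level(c) for taxon, c in counts.items()}
    print(name, sharing_class(counts), levels)
```

The sharing class is computed without regard to read depth, so a window seen once in every taxon still counts as shared among all taxa.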
Figure 3—figure supplement 1
Distribution of shared transcripts according to the number of taxa shared, based on the aggregate dataset across all three tissues.

Windows with transcripts were first classified as belonging to each of the sharing categories (from 1 to 10), independent of their coverage, and were then assigned to the coverage classes represented by the blue shading (from light blue to dark blue: window coverage with at least 1, 10 and 100 transcripts). Taxon names as summarized in Figure 1.

https://doi.org/10.7554/eLife.09977.008
Figure 3—figure supplement 2
Windows transcribed across most species (9 or more) are strongly enriched in genes known from the reference genome, while windows transcribed in only some taxa (8 or fewer) are strongly depleted of known genes.

The effect is most evident for protein-coding genes, but still present for non-coding genes.

https://doi.org/10.7554/eLife.09977.009
Figure 4 with 3 supplements
Distance tree comparisons based on molecular and transcriptome sharing data.

(A) Molecular phylogeny based on whole mitochondrial genome sequences as a measure of molecular divergence (black lines represent the branch lengths, dashed lines serve to highlight short branches). (B) Tree based on shared transcriptome coverage of the genome, using correlations of presence and absence of transcription of the common genome. All nodes have bootstrap support values of 70% or more (n = 1000). (C) Tree based on shared transcriptome coverage of singleton reads only, from subsampling of the extended brain transcriptomes. Left: the consensus tree, with the variance component between samples depicted as triangles. Right: the same tree, but only for the branch fraction that is robust to sampling variance. Taxon names are as summarized in Figure 1. Figure 4—figure supplement 1 shows the fraction of singletons for each sample in each taxon, and Figure 4—figure supplement 2 shows it as a function of read depth. Figure 4—figure supplement 3 shows an extended version of the analysis in 4C for higher coverage levels.

https://doi.org/10.7554/eLife.09977.010
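A distance based on correlations of presence and absence of transcription can be illustrated with a short sketch. The caption does not name the coefficient, so the phi coefficient (Pearson correlation for binary data) and the conversion to a distance as 1 minus the correlation are assumptions, as are the three hypothetical presence/absence vectors. Tree construction and bootstrapping are not shown.

```python
from math import sqrt


def phi(a, b):
    """Phi coefficient between two equal-length binary vectors."""
    n11 = sum(1 for x, y in zip(a, b) if x and y)
    n10 = sum(1 for x, y in zip(a, b) if x and not y)
    n01 = sum(1 for x, y in zip(a, b) if not x and y)
    n00 = sum(1 for x, y in zip(a, b) if not x and not y)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # A vector that is constant has no defined correlation; return 0.0 here.
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0


def distance(a, b):
    """Assumed conversion of a correlation into a distance."""
    return 1.0 - phi(a, b)


# Hypothetical transcription presence (1) or absence (0) in six windows.
taxon_a = [1, 1, 0, 0, 1, 0]
taxon_b = [1, 1, 0, 0, 0, 0]
taxon_c = [0, 0, 1, 1, 0, 1]

print(distance(taxon_a, taxon_b), distance(taxon_a, taxon_c))
```

Taxa with similar transcription patterns end up close together in such a matrix, which any distance-based tree method could then take as input.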
Figure 4—figure supplement 1
Fraction of windows with singletons (one paired read) of the common genome per taxon.

(A–C) Liver, brain and testis, respectively, sequenced at approximately the same depth. (D) Combination of samples from A–C. (E) Additional sequencing of brain samples at 3x depth, compared to B. (F) Combination of all samples, including additional brain sequencing. Light gray indicates singletons observed in each individual sample/taxon combination. Dark gray indicates singletons across the whole experiment, i.e. not re-detected in any other tissue or taxon. Taxon abbreviations are as summarized in Figure 1, with the taxon closest to the reference genome at the left of each panel and the most divergent one at the right. Note that the rise in singleton number for the distant taxa can be ascribed to the longer branch length, i.e. the absence of closely related taxa in which the singleton could have been re-detected.

https://doi.org/10.7554/eLife.09977.011
Figure 4—figure supplement 2
Reduction of singletons as a function of aggregate sequencing depth.
https://doi.org/10.7554/eLife.09977.012
Figure 4—figure supplement 3
Trees based on shared transcriptome coverage of the genome, using binary correlations.

We used the deeply sequenced brain samples to estimate the proportion of sampling artifacts in terminal branches, and subtracted this proportion to obtain reliable phylogenetic signals. Each brain sample was split into three completely independent samples of 100 million reads. Top: Trees constructed using regions covered by only one read in each taxon, regions covered by between 1 and 5 reads (very low expression), regions covered by any reads, regions above 10 reads (mid expression) and regions above 100 reads (high expression). The percentage shown indicates the average level of sampling artifacts for each threshold, derived from the length of the terminal branches not found in all replicates of each taxon, i.e. the uncorrelated portion across samples of the same origin. These numbers are highest for the lowly expressed regions, lowest for the highly expressed regions, and more or less constant within comparisons. Once the artifacts are subtracted, the phylogenetic signal remains robust. Taxon names are as summarized in Figure 1. The figure part with the 1 read fraction corresponds to Figure 4C.

https://doi.org/10.7554/eLife.09977.013
Figure 5
Rarefaction, subsampling and saturation patterns using all available samples and reads.

(A) Sequencing depth saturation as estimated from an increase in the number of taxa. (B) Sequencing depth saturation as estimated from increasing read number. Blue dots indicate increases per sub-sampled sequence fraction or taxon added from our dataset. Gray dotted line indicates the predicted behavior from the indicated regression, and gray area shows the prediction after doubling the current sampling either by additional taxa (A) or in sequencing effort (B). Each analysis was tested for logarithmic and asymptotic models. Best fit was selected from ΔBIC, with Bayes factor shown and qualitative degree of support shown. Standard deviations are shown as black lines in A, and are too small to display in B (note that due to the sampling scheme for this analysis, the values above 50% are not statistically independent and that the 100% value constitutes a single data point without variance measure).

https://doi.org/10.7554/eLife.09977.014
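Model selection by ΔBIC between a logarithmic and an asymptotic curve can be sketched as below. Several parts are assumptions and not details given in the caption: the asymptotic form y = a·x/(b + x), the coarse grid scan over b, the Gaussian-residual form of BIC, and the approximation of the Bayes factor as exp(ΔBIC/2). The data points are synthetic.

```python
from math import exp, log


def rss(ys, preds):
    """Residual sum of squares."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds))


def fit_logarithmic(xs, ys):
    """Least-squares fit of y = a + b*ln(x); returns the RSS of the fit."""
    lx = [log(x) for x in xs]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ys) / n
    sxy = sum((u - mx) * (y - my) for u, y in zip(lx, ys))
    sxx = sum((u - mx) ** 2 for u in lx)
    b = sxy / sxx
    a = my - b * mx
    return rss(ys, [a + b * u for u in lx])


def fit_asymptotic(xs, ys, b_grid):
    """Fit y = a*x/(b + x) by scanning b; a is closed-form for a fixed b."""
    best = float("inf")
    for b in b_grid:
        f = [x / (b + x) for x in xs]
        a = sum(y * v for y, v in zip(ys, f)) / sum(v * v for v in f)
        best = min(best, rss(ys, [a * v for v in f]))
    return best


def bic(rss_value, n, k):
    """BIC for a least-squares fit with k parameters (Gaussian residuals)."""
    return n * log(rss_value / n) + k * log(n)


# Synthetic data following a logarithmic trend with a small perturbation.
xs = list(range(1, 11))
ys = [1 + 2 * log(x) + 0.01 * (-1) ** i for i, x in enumerate(xs)]

bic_log = bic(fit_logarithmic(xs, ys), len(xs), 2)
bic_asym = bic(fit_asymptotic(xs, ys, [0.1 * i for i in range(1, 501)]), len(xs), 2)
delta_bic = bic_asym - bic_log          # positive values favor the logarithmic model
bayes_factor = exp(delta_bic / 2)       # rough approximation only
print(delta_bic, bayes_factor)
```

Both models have two free parameters here, so the comparison is driven entirely by the residual sums of squares.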
Figure 6
Comparative analysis of lengths of regions transcribed or not transcribed across all data (including deeper brain sequencing) in all samples.

Size distribution of regions not covered in any transcript (green) versus size distribution of regions with at least one transcript (blue).

https://doi.org/10.7554/eLife.09977.015
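Region lengths of this kind can be derived by merging consecutive windows that share the same state. The sketch below assumes that the 200 bp windows used elsewhere in the study also apply here, and it uses a hypothetical sequence of window states.

```python
from itertools import groupby

WINDOW = 200  # assumed window size in bp


def region_lengths(window_has_transcript, window=WINDOW):
    """Split an ordered list of window states into run lengths in bp."""
    transcribed, untranscribed = [], []
    for state, run in groupby(window_has_transcript):
        length = sum(1 for _ in run) * window
        (transcribed if state else untranscribed).append(length)
    return transcribed, untranscribed


# Hypothetical states for eight consecutive windows along one chromosome.
flags = [True, True, False, True, False, False, False, True]
covered, uncovered = region_lengths(flags)
print(covered, uncovered)
```

The two resulting lists are what a size-distribution plot of transcribed versus non-transcribed regions would be built from.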
Appendix 1 figure 1
Performance of NextGenMap compared to Bowtie2.
https://doi.org/10.7554/eLife.09977.021
Appendix 1 figure 2
Performance of NextGenMap in terms of mapping accuracy, using the same set of reads and increasingly divergent versions of the reference genome (A), and paired-end mapping statistics (B).
https://doi.org/10.7554/eLife.09977.022

Tables

Table 1

Genome sequencing and read mapping information relative to the C57Bl/6 reference strain (GRCm38.3/mm10).

https://doi.org/10.7554/eLife.09977.016
| Species | Uniquely mapping reads (MAPQ >25) | Mean coverage depth (window based) | Reference coverage (% windows) | Total sequence divergence* | Accessions (reads, BAMs) |
|---|---|---|---|---|---|
| Apodemus uralensis | 4.46E+08 | 40x | 78.23% | 5.60% | ERS942341, ERS946059 |
| Mus mattheyi | 5.58E+08 | 52x | 77.19% | 4.50% | ERS942343, ERS946060 |
| Mus spretus | 7.71E+08 | 52x | 93.91% | 1.70% | ERS946096** |
| Mus spicilegus | 6.16E+08 | 57x | 84.39% | 1.60% | ERS942342, ERS946061 |
* The percentage of divergence was estimated from mappings using NextGenMap (Sedlazeck et al., 2013). Only uniquely mapping reads with a mapping quality greater than 25 were considered. Variation was estimated from the alignments using samtools mpileup (Li et al., 2009). Divergence was calculated as the number of changes divided by the genome size.

** Corresponds to study accession PRJEB11535. All other accessions are deposited under studies PRJEB11513 and PRJEB11533.

Table 2

Transcriptome reads from each sample sequenced, mapped and normalized.

https://doi.org/10.7554/eLife.09977.017
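| Taxon | Code | Tissue | Lanes | QC-passed reads | Mapped reads | Mapped (% total) | Normalized subset | Normalized (% total) | Normalized (% mapped) | Accession Reads* | Accession BAMs** |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DOM | CB | Brain | 0.33x | 1.30E+08 | 1.26E+08 | 96% | 9.15E+07 | 70% | 73% | ERS946023 | ERS942305 |
| DOM | CB | Liver | 0.33x | 1.41E+08 | 1.17E+08 | 83% | 9.07E+07 | 64% | 77% | ERS946025 | ERS942306 |
| DOM | CB | Testis | 0.33x | 1.26E+08 | 1.22E+08 | 96% | 1.19E+08 | 94% | 98% | ERS946026 | ERS942307 |
| DOM | MC | Brain | 0.33x | 1.17E+08 | 1.13E+08 | 96% | 9.15E+07 | 78% | 81% | ERS946027 | ERS942309 |
| DOM | MC | Liver | 0.33x | 1.34E+08 | 1.09E+08 | 81% | 9.07E+07 | 68% | 84% | ERS946029 | ERS942310 |
| DOM | MC | Testis | 0.33x | 1.42E+08 | 1.37E+08 | 96% | 1.19E+08 | 83% | 87% | ERS946030 | ERS942311 |
| DOM | AH | Brain | 0.33x | 9.49E+07 | 9.15E+07 | 96% | 9.15E+07 | 96% | 100% | ERS946019 | ERS942301 |
| DOM | AH | Liver | 0.33x | 1.16E+08 | 1.02E+08 | 88% | 9.07E+07 | 78% | 89% | ERS946021 | ERS942302 |
| DOM | AH | Testis | 0.33x | 1.61E+08 | 1.55E+08 | 96% | 1.19E+08 | 74% | 77% | ERS946022 | ERS942303 |
| MUS | KH | Brain | 0.33x | 1.33E+08 | 1.28E+08 | 96% | 9.15E+07 | 69% | 72% | ERS946035 | ERS942313 |
| MUS | KH | Liver | 0.33x | 1.03E+08 | 9.07E+07 | 88% | 9.07E+07 | 88% | 100% | ERS946037 | ERS942314 |
| MUS | KH | Testis | 0.33x | 1.36E+08 | 1.31E+08 | 96% | 1.19E+08 | 87% | 91% | ERS946038 | ERS942315 |
| MUS | VI | Brain | 0.33x | 1.23E+08 | 1.19E+08 | 96% | 9.15E+07 | 74% | 77% | ERS946031 | ERS942317 |
| MUS | VI | Liver | 0.33x | 1.23E+08 | 9.47E+07 | 77% | 9.07E+07 | 74% | 96% | ERS946033 | ERS942318 |
| MUS | VI | Testis | 0.33x | 1.32E+08 | 1.27E+08 | 96% | 1.19E+08 | 90% | 93% | ERS946034 | ERS942319 |
| CAS | | Brain | 0.33x | 1.21E+08 | 1.16E+08 | 96% | 9.15E+07 | 76% | 79% | ERS946039 | ERS942321 |
| CAS | | Liver | 0.33x | 1.23E+08 | 1.01E+08 | 82% | 9.07E+07 | 74% | 90% | ERS946041 | ERS942322 |
| CAS | | Testis | 0.33x | 1.23E+08 | 1.19E+08 | 96% | 1.19E+08 | 96% | 100% | ERS946042 | ERS942323 |
| SPI | | Brain | 0.33x | 1.34E+08 | 1.29E+08 | 96% | 9.15E+07 | 68% | 71% | ERS946043 | ERS942325 |
| SPI | | Liver | 0.33x | 1.05E+08 | 9.82E+07 | 93% | 9.07E+07 | 86% | 92% | ERS946045 | ERS942326 |
| SPI | | Testis | 0.33x | 1.44E+08 | 1.38E+08 | 96% | 1.19E+08 | 83% | 86% | ERS946046 | ERS942327 |
| SPR | | Brain | 0.33x | 1.09E+08 | 1.05E+08 | 96% | 9.15E+07 | 84% | 87% | ERS946047 | ERS942329 |
| SPR | | Liver | 0.33x | 1.35E+08 | 1.20E+08 | 89% | 9.07E+07 | 67% | 76% | ERS946049 | ERS942330 |
| SPR | | Testis | 0.33x | 1.34E+08 | 1.29E+08 | 96% | 1.19E+08 | 88% | 92% | ERS946050 | ERS942331 |
| MAT | | Brain | 0.33x | 1.12E+08 | 1.04E+08 | 93% | 9.15E+07 | 82% | 88% | ERS946051 | ERS942333 |
| MAT | | Liver | 0.33x | 1.23E+08 | 1.12E+08 | 91% | 9.07E+07 | 74% | 81% | ERS946053 | ERS942334 |
| MAT | | Testis | 0.33x | 1.32E+08 | 1.23E+08 | 93% | 1.19E+08 | 90% | 97% | ERS946054 | ERS942335 |
| APO | | Brain | 0.33x | 1.36E+08 | 1.18E+08 | 87% | 9.15E+07 | 67% | 78% | ERS946055 | ERS942337 |
| APO | | Liver | 0.33x | 1.13E+08 | 1.00E+08 | 89% | 9.07E+07 | 80% | 91% | ERS946057 | ERS942338 |
| APO | | Testis | 0.33x | 1.38E+08 | 1.20E+08 | 87% | 1.19E+08 | 86% | 99% | ERS946058 | ERS942339 |

All accessions deposited under studies PRJEB11533* and PRJEB11513**.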

Table 3

Additional sequencing effort, focused only on brain samples. Reads sequenced, mapped and normalized.

https://doi.org/10.7554/eLife.09977.018
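| Taxon | Code | Tissue | Lanes | QC-passed reads | Mapped reads | Mapped (% total) | Normalized subset | Normalized (% total) | Normalized (% mapped) | Accession Reads | Accession BAMs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DOM | CB | Brain | 1x | 3.89E+08 | 3.76E+08 | 97% | 3.19E+08 | 82% | 85% | ERS946024 | ERS942308 |
| DOM | MC | Brain | 1x | 3.76E+08 | 3.64E+08 | 97% | 3.19E+08 | 85% | 88% | ERS946028 | ERS942312 |
| DOM | AH | Brain | 1x | 3.46E+08 | 3.35E+08 | 97% | 3.19E+08 | 92% | 95% | ERS946020 | ERS942304 |
| MUS | KH | Brain | 1x | 4.64E+08 | 4.49E+08 | 97% | 3.19E+08 | 69% | 71% | ERS946036 | ERS942316 |
| MUS | VI | Brain | 1x | 4.13E+08 | 4.00E+08 | 97% | 3.19E+08 | 77% | 80% | ERS946032 | ERS942320 |
| CAS | | Brain | 1x | 4.35E+08 | 4.21E+08 | 97% | 3.19E+08 | 73% | 76% | ERS946040 | ERS942324 |
| SPI | | Brain | 1x | 4.31E+08 | 4.16E+08 | 97% | 3.19E+08 | 74% | 77% | ERS946044 | ERS942328 |
| SPR | | Brain | 1x | 3.87E+08 | 3.73E+08 | 96% | 3.19E+08 | 82% | 85% | ERS946048 | ERS942332 |
| MAT | | Brain | 1x | 3.62E+08 | 3.40E+08 | 94% | 3.19E+08 | 88% | 94% | ERS946052 | ERS942336 |
| APO | | Brain | 1x | 4.33E+08 | 3.77E+08 | 87% | 3.19E+08 | 74% | 84% | ERS946056 | ERS942340 |

All accessions deposited under studies PRJEB11533* and PRJEB11513**.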

Appendix table 1

Simulations comparing bowtie2 to NextGenMap. Divergent reads were mapped to a common reference.

https://doi.org/10.7554/eLife.09977.019
| Total simulated reads | % simulated divergence (reads) | Uniquely mapped reads, Bowtie2 | Uniquely mapped reads, NGM | Percentage unique from total reads, Bowtie2 | Percentage unique from total reads, NGM |
|---|---|---|---|---|---|
| 2910370 | 0% | 2621200 | 2873481 | 90.1% | 98.7% |
| 2910982 | 2% | 2650274 | 2868279 | 91.0% | 98.5% |
| 2911312 | 4% | 2674738 | 2863581 | 91.9% | 98.4% |
| 2910286 | 6% | 2583320 | 2856060 | 88.8% | 98.1% |
| 2910978 | 8% | 2124958 | 2836119 | 73.0% | 97.4% |
| 2910446 | 10% | 1321494 | 2779837 | 45.4% | 95.5% |
| 2910610 | 12% | 587862 | 2675011 | 20.2% | 91.9% |
| 2910196 | 14% | 186828 | 2510840 | 6.4% | 86.3% |
| 2910090 | 16% | 42986 | 2296917 | 1.5% | 78.9% |
| 2909992 | 18% | 7488 | 2041437 | 0.3% | 70.2% |
| 2910022 | 20% | 936 | 1759924 | 0.0% | 60.5% |
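The percentage columns of Appendix table 1 follow directly from the read counts. As a quick check, the sketch below recomputes them for three rows taken from the table; the function and variable names are illustrative only.

```python
def pct_unique(unique_reads, total_reads):
    """Percentage of uniquely mapped reads, rounded to one decimal place."""
    return round(100 * unique_reads / total_reads, 1)


# (total reads, % divergence, unique Bowtie2, unique NGM) from Appendix table 1.
rows = [
    (2910370, 0, 2621200, 2873481),
    (2910446, 10, 1321494, 2779837),
    (2910022, 20, 936, 1759924),
]

for total, divergence, bowtie2, ngm in rows:
    print(divergence, pct_unique(bowtie2, total), pct_unique(ngm, total))
```

At 10% simulated divergence, Bowtie2 retains fewer than half of the reads as unique mappings while NextGenMap retains more than 95%.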
Appendix table 2

Accuracy of NextGenMap. The same set of reads was mapped to divergent versions of the reference genome. We assume that the reads mapped to the undiverged reference are correctly placed and use those positions as the standard for the divergent genomes, so the estimates should be slightly inflated.

https://doi.org/10.7554/eLife.09977.020
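| % divergence | Accurately mapped reads | % |
|---|---|---|
| 0% | 2910370 | 100.0% |
| 2% | 2842076 | 97.7% |
| 4% | 2816628 | 96.8% |
| 6% | 2798936 | 96.2% |
| 8% | 2778608 | 95.5% |
| 10% | 2756194 | 94.7% |
| 12% | 2717420 | 93.4% |
| 14% | 2648472 | 91.0% |
| 16% | 2531728 | 87.0% |
| 18% | 2358964 | 81.1% |
| 20% | 2120922 | 72.9% |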
Appendix table 3

Performance of NextGenMap. The same set of reads was mapped to divergent genomes. "Mapped" indicates uniquely mapped reads; "proper" indicates reads whose two mates mapped next to each other as a pair; "mate mapped" indicates that both reads in a pair are mapped, although not necessarily as a pair; "singletons" indicates the number of pairs in which only one of the two mates was mapped.

https://doi.org/10.7554/eLife.09977.023
| % simulated divergence (reference) | Total reads | Mapped (%) | Proper (%) | Mate mapped (%) | Singletons (%) |
|---|---|---|---|---|---|
| 0% | 2910370 | 2873481 (99%) | 2869482 (99%) | 2872432 (99%) | 1049 (0.1%) |
| 2% | 2910370 | 2883094 (99%) | 2860794 (98%) | 2878634 (99%) | 4460 (0.1%) |
| 4% | 2910370 | 2885714 (99%) | 2844842 (98%) | 2877808 (99%) | 7906 (1%) |
| 6% | 2910370 | 2882035 (99%) | 2810920 (97%) | 2866362 (98%) | 15673 (1%) |
| 8% | 2910370 | 2859215 (98%) | 2722782 (94%) | 2817502 (97%) | 41713 (3%) |
| 10% | 2910370 | 2810639 (97%) | 2575954 (89%) | 2722242 (94%) | 88397 (6%) |
| 12% | 2910370 | 2712723 (93%) | 2305232 (79%) | 2536014 (87%) | 176709 (12%) |
| 14% | 2910370 | 2562495 (88%) | 1961916 (67%) | 2266582 (78%) | 295913 (20%) |
| 16% | 2910370 | 2369165 (81%) | 1571078 (54%) | 1945446 (67%) | 423719 (29%) |
| 18% | 2910370 | 2144444 (74%) | 1193318 (41%) | 1609114 (55%) | 535330 (37%) |
| 20% | 2910370 | 1882993 (65%) | 844628 (29%) | 1265102 (43%) | 617891 (42%) |

Data availability

The following data sets were generated
  1. Transcriptomes of wild mice
    Max Planck Institute for Evolutionary Biology (2015)
    Publicly available at the EBI European Nucleotide Archive (Accession no: ERA526594).
