A high-resolution mRNA expression time course of embryonic development in zebrafish

  1. Richard J White
  2. John E Collins
  3. Ian M Sealy
  4. Neha Wali
  5. Christopher M Dooley
  6. Zsofia Digby
  7. Derek L Stemple
  8. Daniel N Murphy
  9. Konstantinos Billis
  10. Thibaut Hourlier
  11. Anja Füllgrabe
  12. Matthew P Davis
  13. Anton J Enright
  14. Elisabeth M Busch-Nentwich  Is a corresponding author
  1. Wellcome Trust Sanger Institute, United Kingdom
  2. European Bioinformatics Institute, United Kingdom
  3. University of Cambridge, United Kingdom
12 figures, 3 tables and 20 additional files


Figure 1 with 2 supplements
A transcriptional map of development.

(A) Stages represented in this study. The colour scheme is used throughout the figures. Also shown are the ZFS stage IDs, stage names and approximate hours post-fertilisation (at 28.5°C) for each stage and five developmental categories. (B) Sample correlation matrix of Spearman correlation coefficients of every sample compared to every other sample. (C) Example of the overlap between H3K4me3 signal and genes. Ensembl screenshot of the region occupied by ndr2 and lrcc20 (the direction of transcription is indicated by arrows). Included tracks are: H3K4me3-prim-6-FE (blue) – fold enrichment of H3K4me3 ChIP-seq (H3K4me3/control) at prim-6 stage, H3K4me3-dome-FE (red) – fold enrichment of H3K4me3 ChIP-seq (H3K4me3/control) at dome/30% epiboly, Prim-5 RNAseq and Dome RNAseq (grey) – alignments of RNA-seq reads. The table above shows the genome-wide overlap between H3K4me3 ChIP-seq peaks and the promoters of expressed/not expressed genes as well as the average expression (TPM) of each gene at each stage. ndr2 is expressed only at dome stage and displays a peak only in the dome/30% epiboly H3K4me3 ChIP-seq. lccr20 is expressed at both stages and has a H3K4me3 peak at both stages. (D) 3D plot of Principal Component Analysis (PCA) showing the first three principal components (PCs), which together explain 83.8% of the variance in the data. Samples from the same stage cluster together and there is a smooth progression through developmental time (shown as dashed arrow). Samples are coloured as in (A) and are annotated with the stage categories. The amount of variance explained by each PC is indicated on each axis. (E) Representation of the expression profiles that contribute most to the first 6 PCs. The expression values are centred and scaled for each gene for the 100 genes that contribute most to that PC and then the mean value for each sample is plotted.

Figure 1—figure supplement 1
RNA-seq sample QC.

(A) Boxplot of the distributions of numbers of mapped read-pairs for each stage. (B) Boxplot of the distributions of numbers of genes detected in each sample for each stage (≥1 TPM). (C) Histograms of the distribution of gene expression levels (log2[mean TPM]) at each stage. The total number of genes represented for the stage is shown on each panel (>0 TPM for all samples at that stage). At early stages, the distribution is bimodal and shifts to unimodal as development proceeds.

Figure 1—figure supplement 2
PCA matrix plot.

Plots of the first six principal components (PCs) plotted against each other. The components are arranged in a matrix from PC1 (top and left) to PC6 (bottom and right). The plots on the diagonal show the expression profiles of the 100 genes contributing the most to that component. The plots below the diagonal are the appropriate PCs plotted against each other. For example, the plot in the third row and first column is PC1 (x-axis) plotted against PC3 (y-axis). The percentage of variance explained by each PC is given underneath each column.

Data access.

(A) A screenshot from Expression Atlas release May 2017. Relative gene expression levels (FPKM) across all 18 stages are displayed as a heatmap. Genes can be searched for using Ensembl IDs, gene/allele names and Gene Ontology terms. The data for the selected genes can be downloaded (red ellipse) and the entire dataset can be downloaded from the ‘Downloads’ tab (purple rectangle) as either a .tsv or .Rdata file. (B) A screenshot of the RNA-seq matrix (Ensembl v89) with four intron tracks plus the exon read coverage from the merged (all 18 stages) RNA-seq data chosen for display. (C) A screenshot of the Ensembl v89 browser showing the region surrounding the gene eif4enif1 (chr 6:40.922–40.952 Mb). Exon read coverage from the merged 18 stages is displayed as grey histograms. Four of the 18 intron-defining tracks from the RNA-seq matrix are selected. Introns are coloured in teal. Yellow dots indicate introns identifying a splice variant not found in any of the three gene models shown in red. Note that only the forward strand is shown in full and a gene model overlaps on the reverse strand.

Figure 3 with 1 supplement
Clustering of expression patterns.

(A) Heatmap of the expression profiles across the time course (includes genes expressed at >0 TPM in all the samples for at least one stage). The expression values are scaled to the maximum expression for that gene across the time course. Genes are organised by which stage the maximum expression occurs at and clustered within each stage. (B) Table of the numbers of named and unnamed genes assigned to each stage. (C) Network diagram of the clustering produced by MCL on a network graph generated by linking genes with a Pearson correlation coefficient of >0.94 and removing unassigned genes and clusters with five genes or fewer. The numbers indicate the positions in the network diagram of the clusters shown in D and Figure 4. (D) Example clusters illustrating the progression of expression peaks through development displayed as the mean of mean-centred and variance-scaled TPM for the genes in the clusters.

Figure 3—figure supplement 1
Relationship of named and unnamed genes to other species.

Stacked bar charts of the proportions of genes with orthologues at different taxonomic levels for Named (A) and Unnamed (B) genes. Orthologues for genes were obtained from the Ensembl Compara database and genes were then categorised based on the most basal Last Common Ancestor supplied by Ensembl for the orthologues. Genes with no identifiable orthologue in any of the species examined were classed as Danio specific (top bar, blue). Those with orthologues in the six fish species examined (Astyanax mexicanus, Gasterosteus aculeatus, Oryzias latipes, Takifugu rubripes, Tetraodon nigroviridis, Gadus morhua), but not the mammals examined were labelled Teleosts (middle bar, green). Those with orthologues also present in the mammals surveyed (Mus musculus, Rattus norvegicus and Homo sapiens) were labelled Vertebrates (bottom bar, red).

Figure 4 with 1 supplement
Zinc Finger (ZnF) genes expressed at the onset of zygotic transcription.

(A) Heatmap of the expression profiles (log10[TPM]) of the ZnF genes from clusters 7, 23, 26 and 135. (B) Heatmap of the expression of all genes on chromosome 4 (log10[TPM]) in chromosomal order with cluster assignments shown on the right. (C) Expression profiles of clusters 7, 23, 26 and 135 shown as average expression (mean TPM). (D) Plot of the distribution of ZnF (red) and NLR (blue) genes on chromosome 4. (E) Heatmap of the expression profiles (log10[TPM]) of the NLR genes present in GRCz10.

Figure 4—figure supplement 1
Comparison of ZnF and NLR genes on chromosome 4.

The genes on chromosome 4 from 24 Mb (bottom) to the end of the chromosome (top) are displayed showing the Pearson correlation coefficient (Pearson Correlation Matrix) and expression profiles (Expression Heatmap). The Category strip on the left indicates which type each gene is (ZnF, NLR or other) and the Cluster strip shows which BioLayout cluster the genes are assigned to. The ZnF genes correlate with each other, but not with the other types of genes.

Figure 5 with 1 supplement
Chromosomal expression domains.

(A) Manhattan plot of the calculated similarity measure (mean pairwise Pearson correlation coefficient) in windows of 10 genes across the genome. (B) Heatmaps of the expression profiles (log10[TPM]) of genes in chromosomal order. To the left of each histogram, the similarity measure is plotted as a line graph with a red line marking the 0.5 cut-off for identifying regions of co-expression.

Figure 5—figure supplement 1
Heatmaps showing the expression profiles of example regional expression domains.

(A–D) Three more regions of zinc finger genes on chromosomes 3, 15 and 22. (E) A region on chromosome 12 containing an expansion of the cathepsin L gene.

Figure 6 with 1 supplement
Conservation/divergence of paralogous genes.

(A) Plot of the distribution of Pearson correlation coefficients among one-to-one paralogous gene pairs (red) and a random sample of gene pairs of the same size (blue). (B) Heatmap of the expression of the gene pairs with the top 10 highest correlation coefficients (top half) and the top 10 most negative correlation coefficients (bottom half). (C–D) Plots displaying the expression of ckma/b by stage (C) and plotted against each other (D). (E–F) Plots showing the expression of impdh1a/b by stage (E) and plotted against each other (F).

Figure 6—figure supplement 1
Divergent expression of paralogous pairs.

Images from ZFIN (ZFIN.org) of examples of paralogous gene pairs with diverged expression patterns. The Pearson correlation coefficient between the RNA-seq expression profiles of each pair is shown to the left. (A–B) mfsd4a (ENSDARG00000023768, ZDB-IMAGE-041130–790) and mfsd4b (ENSDARG00000008263, ZDB-IMAGE-060810–3118). (C–D) guk1a (ENSDARG00000030340, ZDB-IMAGE-020816–64) and guk1b (ENSDARG00000005776, ZDB-IMAGE-060216–46). (E–F) pbxip1a (ENSDARG00000071015, ZDB-IMAGE-020919–1927) and pbxip1b (ENSDARG00000011824, ZDB-IMAGE-050208–1397).

Figure 7 with 2 supplements
Novel features of the dataset.

(A) Heatmap of the expression profiles (log10[TPM]) of the 127 new protein-coding gene annotations in the Ensembl v90 gene build produced by the contribution of this RNA-seq dataset. (B) Example of a gene (ENSDARG00000029885, rab41) with differential isoform usage across the time course plotted as TPM (points are individual samples and the line represents the mean). (C) Histogram of the number of isoforms per gene for genes with differentially expressed isoforms. (D) Heatmap of the expression profiles (log10[TPM]) of the 1304 detected new lincRNA annotations. (E) Venn diagram showing the number of lincRNA regions that overlap with previously published lincRNA sets (Pauli et al., 2012; Ulitsky et al., 2011). (F) Histogram of the lincRNA transcript lengths. (G) Histogram of the numbers of exons per lincRNA. The red line represents the median and the mean is shown at the top right.

Figure 7—figure supplement 1
Differential isoform usage.

A screenshot from the Ensembl browser (v89) for the region 14:32,938,000–32,957,250. Three tracks at the top show RNA-seq coverage data (grey) for three different stages (1 k-cell, 20–25 somites and 5dpf). Orange boxes indicate the exons that change between transcripts. The three annotated transcripts are displayed below the Contigs bar. Below that are three tracks depicting intron-spanning reads defining splice events at the same three stages (the dotted arrow points to a splicing event that occurs in the 5 dpf sample, but is not represented in the 5 dpf transcript model because only the model with the most support is displayed for each stage). At the bottom are transcript models (dark blue) for all 18 time points that show the transcript model with the most support at that stage. Light blue boxes highlight the differentially-spliced exons.

Figure 7—figure supplement 2
Properties of lincRNAs.

(A) Density estimates of the distributions of mean expression (mean across all samples) for the protein-coding genes present in Ensembl v89 (red) and the 127 new ones in Ensembl v90 (blue). (B) Distribution of mean expression for all protein-coding genes (red) compared to lincRNAs (blue) in Ensembl v90. (C–D) Frequency distributions around the predicted transcription start site of the lincRNAs from Ensembl v90 for two promoter-associated motifs. (C) TATA-box. (D) CCAAT-box.

Annotation of primary miRNA transcripts.

(A) Table and Venn diagram of the numbers of assembled pri-miRNA transcripts and the overlap of Ensembl miRNAs with an annotated pri-miRNA. (B–D) A selection of miRNA loci for which pri-miRNA annotation is not available in Ensembl and Havana v87. (B–C) IGV screen shots illustrating the assembled pri-miRNA transcripts overlapping mir-363, mir-19c, mir-20b and mir-18c (B), and mir-125 (C). Tracks shown are miRNA annotations from miRBase and Ensembl, newly assembled pri-miRNA transcripts and Ensembl pri-miRNAs. (D) Screen shots from the Ensembl v90 browser demonstrating how to display the pri-miRNAs. The track is selectable from the ‘Genes and transcripts’ section of the ‘Configure this page’ menu. The region around miR-142a (chr5: 2,840,000–2,841,500) is shown with its annotated pri-miRNA.

Figure 9 with 3 supplements
Alternative 3’ end use during development.

(A) Diagrams illustrating the possible outcomes of differential 3’ end usage. Only the 3’ end of the gene is shown, where the exons are boxes, the introns are represented as lines and filled boxes are coding. The dotted line represents the remaining 5’ end of the gene. (B) Flowchart of the DeTCT 3’ end filtering strategy. See Materials and methods for the definition of each filter. The table shows the number of ends removed by each filter when using the strict/stricter settings. (C) Graph of the mean normalised counts by stage for the three alternative 3’ ends (shown in D) of the pdlim5b gene. Error bars represent standard deviations. Underneath is a screenshot from Expression Atlas (Aug 2017) showing the overall TPM for pdlim5b. (D) Screenshot from Ensembl (release 89) showing the 3’ portion of the pdlim5b gene (10:1,187,000–1,215,000). The top two tracks are RNA-seq coverage for 1 cell and prim-15 (30 hpf). Red arrowheads point to the genomic positions of the three transcript ends. Note, the 3’ UTR resulting in end 1 is not annotated in this genebuild. Below is a schematic of the proteins that are predicted to be produced by two of the transcripts. The positions of domains within the protein are also indicated.

Figure 9—figure supplement 1

Sample correlation matrix of the Pearson correlation coefficient between each pairwise comparison of samples.

Figure 9—figure supplement 2

DeTCT Matrix PCA plot. Plot showing the PCA of the DeTCT data arranged in the same way as Figure 1—figure supplement 2.

Figure 9—figure supplement 3
3’ end filtering.

UpSet (Conway et al., 2017) plot showing the overlap of the ends that fail different 3’ end filtering rules (stricter). Underneath the main bar chart are the different filters and the filled circles indicate the particular overlap displayed in the bar above. The height of the bar above shows the size of that overlap set. For example, the first bar in the main bar chart shows that 139,626 ends failed all three of the No Hexamer, No Annotation and PolyA filters. The horizontal bar chart to the left displays the size of the set for individual filters.

Author response image 1
Gene detection with reads downsampled to 2 million read pairs per sample.
Author response image 2
Comparison of genes detected from one stage to the next.
Author response image 3


Table 1
BioLayout clusters with significant chromosomal enrichments (adjusted p-value<0.05 and log2 fold enrichment [log2FE]>1).
ClusterChromosomeCountExpectedlog2[FE]Adjusted p-value
Table 2
Comparison of Ensembl genebuild versions 89 and 90.
 Total Genes32,26635,117
 Total Transcripts58,54962,895
Key resource table
Reagent type (species)
or resource
DesignationSource or referenceIdentifiersAdditional information
genetic reagent
(Danio rerio)
droshasa191this paperZFIN:sa191; RRID:
ENU mutant. C - > T at position
GRCz10:12:13311049. Available from ZIRC
genetic reagent
(D. rerio)
dgcr8sa223this paperZFIN:sa223; ZFIN:
ENU mutant. C - > T at position
strain, strain background
(D. rerio)
HLFthis paperHubrecht Long Fin
software, algorithmTopHat10.1038/nprot.2012.016RRID:SCR_013035
software, algorithmCufflinks10.1038/nprot.2012.016RRID:SCR_014597
software, algorithmQoRTs10.1186/s12859-015-0670-5https://github.com/hartleys/QoRTs
software, algorithmhtseq-count10.1093/bioinformatics/btu638RRID:SCR_011867
software, algorithmRhttps://www.r-project.org/RRID:SCR_001905
software, algorithmPerlhttps://perldoc.perl.org/
software, algorithmzfs-devstagesthis papergithub repository,
software, algorithmOntologizer10.1093/bioinformatics/btn250RRID:SCR_005801
software, algorithmSTAR10.1093/bioinformatics/bts635
software, algorithmSamtools10.1093/bioinformatics/btp352RRID:SCR_002105
software, algorithmPicardhttp://broadinstitute.github.io/picardRRID:SCR_006525
software, algorithmgtfToGenePredUCSC Genome Browserhttp://hgdownload.soe.ucsc.edu/downloads.html#utilities_downloads
software, algorithmgenePredToBedUCSC Genome Browserhttp://hgdownload.soe.ucsc.edu/downloads.html#utilities_downloads

Additional files

Supplementary file 1

Sample Information (.tsv)

Supplementary file 2

RNA-seq count data (.tsv)

Supplementary file 3

RNA-seq TPM (.tsv)

Supplementary file 4

BioLayout input file (.expression)

Supplementary file 5

BioLayout cluster information (.tsv)

Supplementary file 6

BioLayout cluster GO/ZFA enrichment (.zip)

Supplementary file 7

undetected genes GO enrichment (.tsv)

Supplementary file 8

Cluster7, 23, 26, 135 Pfam info (.tsv)

Supplementary file 9

Regional expression information (.zip)

Supplementary file 10

Paralogue Information (.tsv)

Supplementary file 11

Paralogue GO enrichment (.tsv)

Supplementary file 12

new genes (protein-coding and lincRNAs) in Ensembl v90 (.bed)

Supplementary file 13

e90 transcript level TPM (.tsv)

Supplementary file 14

Differential isoform usage plots (.pdf)

Supplementary file 15

Assembled pri-miRNA transcripts (.bed)

Supplementary file 16

Table of overlap of miRNAs (.xlsx)

Supplementary file 17

Genes contributing to Principal Components (RNAseq and DeTCT) (.zip)

Supplementary file 18

Genes with multiple 3’ ends (.tsv)

Supplementary file 19

GO enrichment - Genes with multiple 3’ ends (.tsv)

Transparent reporting form

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Richard J White
  2. John E Collins
  3. Ian M Sealy
  4. Neha Wali
  5. Christopher M Dooley
  6. Zsofia Digby
  7. Derek L Stemple
  8. Daniel N Murphy
  9. Konstantinos Billis
  10. Thibaut Hourlier
  11. Anja Füllgrabe
  12. Matthew P Davis
  13. Anton J Enright
  14. Elisabeth M Busch-Nentwich
A high-resolution mRNA expression time course of embryonic development in zebrafish
eLife 6:e30860.