Supplementary tables.
(A) Samples used in this study. Details on the material derived from individual human embryos (each listed by Carnegie Stage, CS) used in the biological replicates and the sequencing statistics for each sample. A conversion from Carnegie Stage to approximate days post-conception is available in Jennings et al. (2015) (open access). Gene-level read counts are available for download as a TSV file in Supplementary file 2.
(B) Differential gene expression between paired embryonic and fetal RNA-seq data. The R package edgeR (Robinson et al., 2010) was used to test for differential gene expression between embryonic and fetal (Roadmap Epigenomics Consortium, 2015) datasets. Shared tissues were adrenal gland, heart, lung, stomach, kidney, upper limb, lower limb and testis. The table is sorted by FDR (column H) and can be filtered by log fold change (column E) to give embryo-enriched genes (negative values) or fetal-enriched genes (positive values).
(C) Gene Ontology (GO) terms and the genes underlying them for embryonic vs. fetal (Roadmap) up-regulated genes. Genes up-regulated in embryonic tissues versus fetal tissues (edgeR, FDR < 0.05, see Supplementary file 1B) were tested for GO term enrichment using Fisher’s exact test and the elimination algorithm implemented in the R package topGO (Alexa and Rahnenfuhrer, 2010). Separate tests were run for embryo up-regulated and fetal up-regulated genes. The table is sorted by enrichment in embryonic genes.
(D) Tissue-specific genes contributing to metagenes. All genes with a relative basis contribution (across metagenes) greater than 0.8 are listed.
(E) The most extreme 1000 genes (high and low) for all principal components (PC1-31) of the LgPCA. The dataset is derived from genes annotated in GENCODE 18. Raw gene-level loadings for each principal component are available for download as a TSV file in Supplementary file 3.
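The filtering described for table (B) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names `gene`, `logFC` and `FDR` and the toy values are hypothetical stand-ins for the real spreadsheet columns (log fold change in column E, FDR in column H), and the 0.05 cut-off mirrors the threshold quoted for table (C).

```python
import csv
import io


def split_by_direction(rows, fdr_cutoff=0.05):
    """Split significant genes into embryo- and fetal-enriched lists.

    Negative log fold change = embryo-enriched, positive = fetal-enriched,
    following the sign convention stated in the legend for table (B).
    Both lists are returned sorted by ascending FDR.
    """
    significant = [r for r in rows if float(r["FDR"]) < fdr_cutoff]
    significant.sort(key=lambda r: float(r["FDR"]))
    embryo = [r["gene"] for r in significant if float(r["logFC"]) < 0]
    fetal = [r["gene"] for r in significant if float(r["logFC"]) > 0]
    return embryo, fetal


# Toy table with made-up gene names and values (not taken from the study).
toy_tsv = (
    "gene\tlogFC\tFDR\n"
    "GENE_A\t-2.5\t0.001\n"
    "GENE_B\t1.8\t0.01\n"
    "GENE_C\t-0.4\t0.2\n"
    "GENE_D\t3.1\t0.0005\n"
)
rows = list(csv.DictReader(io.StringIO(toy_tsv), delimiter="\t"))
embryo_enriched, fetal_enriched = split_by_direction(rows)
print(embryo_enriched, fetal_enriched)
```

To use it on the real table, the sheet would first need to be exported as tab-separated text and the column names adjusted to match its header row.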
(F) Gene Ontology (GO) terms and the genes underlying them for organ and tissue-specific transcriptomic signatures from the extremes of the LgPCA. GO terms were identified as enriched in extreme scoring genes (annotated in GENCODE 18) in the principal components (PCs) of the LgPCA. Due to the very large number of terms returned at p < 0.0001 by Wilcoxon test (the topGO 'elim' method, see Materials and methods), an illustrative selection is listed, with raw gene-level loadings available for download in Supplementary file 3.
(G) Transcription factors in the extremes of the LgPCA and their links to developmental morbidity. The most extreme 1000 annotated genes (GENCODE 18) of the LgPCA dataset were filtered for transcription factors based on KEGG and FANTOM5 annotations and for read counts > 500. To identify disease associations, each gene was entered as a search term in OMIM (www.ncbi.nlm.nih.gov/omim) and in PubMed. Batch queries were undertaken at Mouse Genome Informatics (MGI, www.informatics.jax.org) with 'Mammalian phenotype' as the output.
(H) LgPCA predictions of causal genes for critical regions in either solved or unsolved developmental disorders. Fifty-three developmental disorders (Column A, 'solved') with causally associated transcription factors identified in the appropriate transcriptomic signature of Supplementary file 1G were originally defined by critical regions (Column C with hyperlink). These critical regions were identified by searching OMIM and were usually derived from mapping data on affected families or chromosomal deletions in affected patients. Larger critical regions were preferentially selected to test more meaningfully whether the LgPCA model could have pinpointed the causal gene based solely on transcriptomic signatures that involved an affected organ(s) or tissue(s) (Column B). The average critical region was 13.7 Mb (Column D) and contained an average of 111 protein-coding genes (Column E; identified by searching BIOMART on ENSEMBL).
In 48/53 instances (91%), LgPCA narrowed the field down to three or fewer transcription factors and in 37 instances (73%) excluded all except the correct transcription factor. Therefore, the same approach was applied to 13 unsolved developmental disorders (mostly deletion syndromes), with predictions made in each case for any type of protein-coding gene (Column H) and transcription factor(s) (Column I). In many instances the transcription factor in Column I possesses an appropriate mutant mouse phenotype.
(I) 6251 unannotated transcripts identified during human organogenesis. These are the 6251 novel and distinct transcripts underlying Figure 4 of the main text, which also describes the transcript classification: Anti-sense (AS), Overlapping (OT), Bidirectional (BI), Long-intergenic non-coding (LINC) and/or Transcripts of uncertain coding potential (TUCP) (based on Mattick and Rinn, 2015). Intergenic transcripts are numbered sequentially within each chromosome. Exon lengths and starts (blocks) are recorded here in UCSC BED12 format. Correlations in expression profile were calculated for annotated genes with transcriptional start sites (TSSs) situated within 1 Mb of the novel transcript TSS; the total number of genes in this window is listed. Columns AF-AT (organs and tissues) represent mean, quantile-normalised read counts across tissue replicates. Correlations (and distance) are shown for the closest, best correlated or best anti-correlated genes and were generated using only embryonic RNA-seq data. The pipeline used to generate transcripts, distinguish them from previous annotations, and name, characterise and filter them is described in the Materials and methods.
(J) NIH Roadmap samples (Kundaje et al., 2015) used in this study.
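For readers unfamiliar with the BED12 block fields used in table (I), the following Python sketch shows how exon coordinates can be recovered from them. It assumes the standard UCSC layout (0-based, half-open coordinates; `blockSizes` in field 11 and `blockStarts` in field 12, the latter relative to `chromStart`). The example record and the transcript name `LINC_example_1` are hypothetical and are not taken from the table.

```python
def bed12_exons(line):
    """Return (name, strand, exons) for one tab-separated BED12 record.

    Each exon is a (chrom, start, end) tuple in 0-based, half-open
    genomic coordinates, computed as chromStart + blockStart and
    chromStart + blockStart + blockSize.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, chrom_start = fields[0], int(fields[1])
    name, strand = fields[3], fields[5]
    block_count = int(fields[9])
    # The comma-separated lists may carry a trailing comma, so strip it first.
    sizes = [int(x) for x in fields[10].rstrip(",").split(",")][:block_count]
    starts = [int(x) for x in fields[11].rstrip(",").split(",")][:block_count]
    exons = [
        (chrom, chrom_start + offset, chrom_start + offset + size)
        for offset, size in zip(starts, sizes)
    ]
    return name, strand, exons


# Hypothetical three-exon transcript spanning chr1:1000-5000.
record = "chr1\t1000\t5000\tLINC_example_1\t0\t+\t1000\t1000\t0\t3\t200,300,500,\t0,1500,3500,"
name, strand, exons = bed12_exons(record)
spliced_length = sum(end - start for _, start, end in exons)
print(name, strand, exons, spliced_length)
```

The summed block sizes give the spliced transcript length, which is a convenient sanity check when reading the exon columns.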