An integrative transcriptomic atlas of organogenesis in human embryos

Version of Record: August 24, 2016

Abstract

Severe developmental abnormalities commonly originate during human organogenesis. However, understanding this critical embryonic phase has relied upon inference from patient phenotypes and assumptions from in vitro stem cell models and non-human vertebrates. We report an integrated transcriptomic atlas of human organogenesis. By lineage-guided principal components analysis, we uncovered novel relatedness of particular developmental genes across different organs and tissues and identified unique transcriptional codes which correctly predicted the cause of many congenital disorders. By inference, our model pinpoints co-enriched genes as new causes of developmental disorders such as cleft palate and congenital heart disease. The data revealed more than 6000 novel transcripts, over 90% of which fulfil criteria as long non-coding RNAs correlated with the protein-coding genome over megabase distances. Taken together, we have uncovered cryptic transcriptional programs used by the human embryo and established a new resource for the molecular understanding of human organogenesis and its associated disorders.

https://doi.org/10.7554/eLife.15657.001

eLife digest

Individual organs and tissues form in human embryos during the first two months of pregnancy. Any errors during this crucial stage of human development can result in miscarriage or serious birth defects. Yet remarkably little is known about how this process works. What is known has been inferred from studies of how other animals develop, human stem cells grown in a laboratory, and babies born with genetic conditions that cause developmental problems.

Genes control the way that organs and tissues form, and are switched on or off in complex patterns during development to ensure that particular cells develop into one type of organ and not another. When genes are switched on, their DNA is copied into molecules called RNA. Many RNA molecules are used as templates to make proteins, which then perform critical roles in cell processes. One way to find out which genes are activated during development is to identify which RNAs are made by cells in the embryo.

Here, Gerrard, Berry et al. used a technique called RNA-sequencing to identify the RNAs that human embryos make while their organs and tissues form. The RNA came from many different tissues including the heart, limbs and the roof of the mouth. Gerrard, Berry et al. developed a new computational model that used the identity of the RNAs to decode the precise patterns of gene activity in the tissues. The model correctly identified many genes that were already known to cause developmental problems when faulty, and identified numerous others that are now predicted to cause developmental defects in humans.

Gerrard, Berry et al. also discovered over 6,000 RNAs in the human embryos that are unlikely to code for proteins. These “non-coding” RNAs may have other roles in cells, such as switching off genes, and many of them appear to be specific to human embryos. Together, these findings have uncovered new patterns of gene activity that drive development in human embryos and provide a resource for studying how organs and tissues form. Future challenges are to understand what controls these patterns of gene activity, and how the patterns change over time.

https://doi.org/10.7554/eLife.15657.002

Introduction

Embryogenesis encompasses the progression from fertilized zygote to blastocyst and through gastrulation to establish the three germ layers of ectoderm, mesoderm and endoderm, from which all organs and tissues subsequently arise during organogenesis. Remarkably little is known about this latter phase of assembling organs and tissues in humans due to the restricted availability of human embryonic tissue and its tiny size. Previous post-implantation transcriptomic studies have either sampled the whole embryo by expression microarray (Fang et al., 2010), thus lacking organ-specific resolution and the vast majority of long non-coding (lnc) transcription, or included lnc expression by massively parallel short-read RNA sequencing (RNA-seq) but focussed on single sites such as limb bud (Cotney et al., 2013) or pancreas (Cebola et al., 2015). RNA-seq from NIH Roadmap and other studies during or after the end of the first trimester of pregnancy falls after the embryonic period (which ends at 56–58 days post-conception, Carnegie Stage 23) and commonly reflects near terminal differentiation within heterogeneous fetal organs and tissues (Jaffe et al., 2015; Roadmap Epigenomics Consortium, 2015; Roost et al., 2015). Given these combined deficiencies, we set about compiling global transcriptomic data during the critical phase of human organogenesis, sampling each germ layer and including sites of mixed origin that are subject to major developmental disorders such as cleft palate and limb abnormalities (Figure 1a).

Figure 1 (with 5 supplements)

Profiling the transcriptomes underlying organogenesis in human embryos.

(a) Human embryo showing the 15 tissues and organs subjected to RNA-seq. (b) High dynamic range of human embryonic RNA-seq. The combined dataset (black) included expression of >90% of annotated protein-coding genes (GENCODE18 [Harrow et al., 2012]). (c) Human embryogenesis possesses a distinctive transcriptome. Human embryonic read counts were compared with equivalent fetal datasets from NIH Roadmap (Roadmap Epigenomics Consortium, 2015) using edgeR (Robinson et al., 2010) and a false discovery rate (FDR) of 0.05 (see Materials and methods, Supplementary file 1B). Negative log10 p-values are shown for selected biological process Gene Ontology (GO) terms with significant enrichment in either the embryonic or fetal gene sets following Fisher's exact test applied using the elimination algorithm (Alexa and Rahnenfuhrer, 2010) (Supplementary file 1C contains the full list of enriched terms). (d) Selected sites illustrate the highly specific expression of *HOX* genes within the human embryo.

https://doi.org/10.7554/eLife.15657.003

Organs and tissues from fifteen human embryonic sites were sequenced in two sets of biological replicates (except pancreas and tongue) to generate 28 strand-specific RNA-seq datasets with 44–90 million uniquely mapped reads per replicate (Figure 1a; Supplementary file 1A, which contains information on embryonic stages). Global transcription rates across all organs and tissues were comparable over a high dynamic range; approximately 70% of protein-coding genes contained 100–10,000 mapped reads (Figure 1b; Supplementary file 2). We assessed whether our human embryo datasets identified earlier developmental processes than currently available fetal data (Roadmap Epigenomics Consortium, 2015). The fetal datasets contained three times as many differentially expressed genes, but the embryonic gene set showed equivalent enrichment of gene ontology (GO) terms, including many early developmental processes such as morphogenesis of an epithelial bud, anterior/posterior pattern specification and embryonic morphogenesis. These were in contrast to homeostatic processes enriched in the fetal dataset (Figure 1c; Supplementary file 1B–C).

Sampling gene expression across multiple sites allowed us to set about deciphering the precise transcriptomic codes responsible for the development of the different human embryonic organs and tissues. While ZNF and ZSCAN family members were broadly expressed, discrete site-specific expression was more apparent for individual members of other transcription factor families (Figure 1—figure supplement 1), exemplified by the HOX gene clusters (Figure 1d). User-defined sets of up to five developmental transcription factors characteristic for a particular organ or tissue displayed very high levels of tissue specificity (Figure 1—figure supplement 2). However, while principal components (PC) analysis (PCA) or clustering grouped biological replicates, relationships between different organs and tissues other than the distinctiveness of brain and liver were not resolved (Figure 1—figure supplements 3–4). Non-negative matrix factorisation (NMF) also allows unbiased clustering of gene expression (Gaujoux and Seoighe, 2010). By setting the parameters such that representative genes were only extracted once, we identified eleven non-overlapping ‘metagenes’ from the complete expression dataset with clear tissue-specific signals for thyroid, liver, RPE, brain, heart and adrenal gland (Figure 1—figure supplement 5; Supplementary file 1D). We hypothesized that these new signals might allow benchmarking to assess the fidelity of in vitro differentiated stem cells, similar to a previous report (Roost et al., 2015). As an exemplar, we chose hepatocyte differentiation, for which RNA-seq data are available including positive (primary adult hepatocytes) and negative (human embryonic fibroblasts) control data (Du et al., 2014). Clear enrichment for the stem cell-derived hepatocytes and the primary hepatocytes (but not the fibroblasts) was apparent in metagene 2, the cluster of 39 genes indicative of human embryonic liver.

From this starting point, we wanted to move beyond the unique organ-specific signatures to study how patterns of gene expression co-varied across tissues. While relaxing NMF parameters would allow non-exclusive gene selection across metagenes, we also wanted to capture differences in gene expression between organs (e.g. aspects of what is not expressed as a contributing factor to an organ’s identity). Moreover, different embryonic organs are related according to developmental lineage. We reasoned that being able to apply a lineage structure would create natural assemblies of co-regulated genes (Figure 2a). Accordingly, we adapted a method from spatial ecology and phylogenetics (Jombart et al., 2010a, 2010b) to constrain PCA by imposing a hierarchical developmental lineage and termed this approach LgPCA. We also included RNA-seq from undifferentiated human embryonic stem cells (Roadmap Epigenomics Consortium, 2015) to represent pre-gastrulation human biology. Of the total thirty-one principal components (PCs) arising from LgPCA, the first fifteen now identified patterns of gene expression across groups of related tissues in addition to unique organ-specific signatures (Figure 2b), while PCs 16–31 sampled heterogeneity within individual organs and tissues (Figure 2—figure supplement 1). In keeping with this transition, PCs 1–4 ordered samples reminiscent of very early developmental events: pluripotency (extreme positive loadings in PC1; ‘PC1 high’), early brain formation (extreme negative loadings in PC2; ‘PC2 low’), foregut endoderm (PC4 low) and intermediate mesoderm (PC4 high). PCs 5–15 resolved the individual organs and tissues; for instance, low PC5 loadings discriminated liver from the other foregut endoderm derivatives. Heatmaps illustrated the composite or tissue-specific signals emanating from the genes with the most extreme PC loadings, which also underlay appropriate developmental gene ontology (GO) terms (Figure 2c–d and Supplementary file 1E–F).

Figure 2 (with 1 supplement)

Lineage-guided PCA discovers unique transcriptional signatures regulating human organogenesis.

(a) Interpreting gene expression profiles is dependent upon the underlying developmental lineage. Similar expression profiles in closely related tissues imply fewer regulatory events. (b) Lineage-guided principal components analysis (LgPCA) constrains PCA by imposing a developmental lineage on the different organs and tissues. The first 15 PCs are shown including biological replicates for the human embryonic organs and tissues integrated with human embryonic stem cell data (Roadmap Epigenomics Consortium, 2015). PC scores for the 15 different dimensions are shown in black (positive/high) or white (negative/low) with scale (extremeness) indicated by circle size (sign/direction is arbitrary). Unique transcriptional signatures were resolved for broad organ groupings (e.g. foregut endoderm derivatives, low scores in PC4), single organs or tissues (e.g. palate, high scores in PC13) or across tissues unrelated by germ layer but connected by multisystem congenital disorders (e.g. heart and limb, low scores in PC13). (c) Heatmaps of quantile normalised expression values of the most extreme 50 genes for selected PCs from the LgPCA. (d) Gene Ontology (GO) terms and their underlying genes illustrate the specific signatures from the LgPCA (further examples in Supplementary file 1F).

https://doi.org/10.7554/eLife.15657.009

Identifying the master regulators that differentially orchestrate organogenesis across the body has not previously been possible directly in human embryos. We undertook this in two different ways, both based on studying the 1000 genes with the most extreme loadings in PCs 1–15 that identified gene co-expression patterns across tissues and within individual organs (Figure 2b; Supplementary file 1E). We interrogated these gene sets for regulatory networks based on the enrichment of transcription factor motifs (Janky et al., 2014). Numerous well known master regulators were recovered alongside previously unappreciated factors for either broad tissue groups (e.g. foregut endoderm derivatives) or individual organs (Figure 3a). As proof-of-principle, this also included proven regulators of human pluripotency, NANOG, OCT4 and MYC, at an extreme of PC1. Remarkably, in several instances approaching half of the 1000 genes with the most extreme PC loadings imputed co-regulation by a single transcription factor, such as HNF4A in the liver or SRF in the heart. Alongside NR5A1, the data predicted RUNX and BAD as novel regulators of human adrenal and gonadal development (Figure 3a). As a second approach to study gene regulation, we extracted the transcription factors (typically <5%) from amongst the 1000 most extreme genes in PCs 1–15 (Supplementary file 1G). We searched the Mouse Genome Informatics database (MGI) and in 309/594 instances there was a relevant mouse mutation phenotype supporting the notion that the transcription factors identified by LgPCA are key regulators of human organogenesis. At the lowest extreme of PC5 (liver) the twenty-two transcription factors contained all three of those required for reprogramming fibroblasts directly to hepatocytes (Huang et al., 2014). 
This suggests novel fate programming roles for transcription factors at the extremes of other PCs (including new potential regulators of pluripotency amongst the sixteen factors containing zinc fingers in PC1). In keeping with these regulatory roles, the extreme PC loadings in the LgPCA data also prioritized those transcription factors responsible for major congenital disorders (Supplementary file 1G). Because LgPCA is not limited to individual organs this included a novel ability to predict multisystem abnormalities such as the combined heart and limb defects of Holt-Oram syndrome (OMIM 142900, TBX5, PC13 low) or the palate and limb abnormalities associated with mutations in TP63 (OMIM 603543, PC3 high).
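
The selection of genes at the extremes of a component, followed by extraction of the transcription factors among them, can be sketched as below. This is a minimal Python illustration rather than the authors' workflow: the function names, the gene labels, the loadings and the transcription factor catalogue are all hypothetical, and taking each end of the loadings vector separately (matching the 'high' and 'low' labels used in the text) is an assumption.

```python
def extreme_genes(loadings, n):
    """Return (low, high): the n genes with the most negative and the
    n genes with the most positive loadings on one component.

    `loadings` maps a gene label to its loading (hypothetical input format).
    """
    ranked = sorted(loadings, key=loadings.get)  # ascending by loading
    low = ranked[:n]                             # most negative first
    high = ranked[::-1][:n]                      # most positive first
    return low, high


def transcription_factors(genes, tf_catalogue):
    """Keep only the genes listed in a transcription factor catalogue."""
    return [g for g in genes if g in tf_catalogue]


# Toy loadings for one component; the values are invented for illustration.
pc_loadings = {
    "geneA": -0.9, "geneB": -0.7, "geneC": -0.02,
    "geneD": 0.01, "geneE": 0.6, "geneF": 0.8,
}
tf_catalogue = {"geneA", "geneF"}  # hypothetical catalogue

low, high = extreme_genes(pc_loadings, 2)
low_tfs = transcription_factors(low, tf_catalogue)
high_tfs = transcription_factors(high, tf_catalogue)
print(low, low_tfs)
print(high, high_tfs)
```

In the study the equivalent step used 1000 genes per component; `n` is kept as a parameter so the cut-off can be varied.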

Figure 3

LgPCA points to master regulators of human organogenesis and the causes of human congenital disorders.

(a) Predicted regulation by iRegulon (Janky et al., 2014) of the most extreme 1000 genes for different PCs identifies known and unexpected transcription factors regulating human organogenesis. In several examples, individual transcription factors (e.g. REST, NR5A1, HNF4A, FOXA1 and SRF) were predicted to regulate nearly half of the most extreme 1000 genes. (b) Transcription factors at the extremes of individual PCs in the LgPCA are responsible for a diverse range of congenital disorders (red names in the ovals for heart and testis; full details in Supplementary file 1G). To validate the utility of these data, we conservatively selected some of the earliest critical regions for these disorders (two ‘Proven’ examples on the left; all 53 listed in Supplementary file 1H). LgPCA frequently isolated the correct transcription factor from an average of 111 genes across >10 Mb, shown for NKX2-5 in congenital heart disease and SOX9 in campomelic dysplasia. Beyond this validation LgPCA similarly predicts causative transcription factors (blue) for many unresolved congenital disorders such as developmental heart abnormalities in Chr1p36 deletion syndrome and sex reversal / disorders of sex differentiation (DSD) (all 13 examples in Supplementary file 1H).

https://doi.org/10.7554/eLife.15657.011

Mutations in genes encoding transcription factors are over-represented causes of congenital disorders, most likely due to their critical function during organogenesis and inadequacy when haploinsufficient. The enrichment of transcription factors with specific disease-associations at the extremes of the LgPCA implicates the co-enriched genes as leading candidates for unanswered clinical syndromes. To test this model, we identified some of the earliest chromosomal mapping or patient deletion data for the known disease-associated transcription factors from Supplementary file 1G. 53 disorders were suitable for assessment, with an average critical region of 13.7 Mb, each containing an average of 111 protein-coding genes (Supplementary file 1H). Strikingly, in 37 instances (70%) LgPCA uniquely selected the correct transcription factor and in 48 instances (91%) narrowed the field down to three or fewer transcription factors. When applied to 13 syndromes (mostly deletion disorders) where the causative gene remains unresolved, clear predictions of causality emerge, for instance in cleft palate (DLX5, DLX6, LHX8 and FOXF2) or cerebellar disorders (ZIC1 and ZIC4) (Supplementary file 1H). Frequently, there is an appropriate mutant mouse phenotype, such as CASZ1 in cardiac malformations, part of Chr1p36 deletion syndrome, or SOX10 in the 46,XX disorder of sex differentiation (DSD) linked to duplication on Chr22 (Figure 3b and Supplementary file 1H).
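
The validation logic described above can be sketched in a few lines of Python. This is an illustrative simplification, not the authors' code: the disorder labels, gene labels and the single pooled set `extreme_tfs` are hypothetical, whereas in the study the transcription factors come from the extremes of the component relevant to the affected tissue.

```python
def shortlist(region_genes, extreme_tfs):
    """Genes in a mapped critical region that are also transcription
    factors found at an LgPCA extreme."""
    return [g for g in region_genes if g in extreme_tfs]


def summarise(regions, causal, extreme_tfs):
    """Count disorders where the known causal gene is (a) the only
    shortlisted gene, or (b) among three or fewer shortlisted genes."""
    unique = 0
    three_or_fewer = 0
    for disorder, genes in regions.items():
        hits = shortlist(genes, extreme_tfs)
        if causal[disorder] in hits:
            if len(hits) == 1:
                unique += 1
            if len(hits) <= 3:
                three_or_fewer += 1
    return unique, three_or_fewer


# Hypothetical critical regions (gene labels are placeholders).
regions = {
    "disorder_1": ["g1", "g2", "tfA", "g3"],
    "disorder_2": ["tfB", "tfC", "g4", "tfD"],
    "disorder_3": ["g5", "g6"],
}
causal = {"disorder_1": "tfA", "disorder_2": "tfC", "disorder_3": "g6"}
extreme_tfs = {"tfA", "tfB", "tfC"}

result = summarise(regions, causal, extreme_tfs)
print(result)
```

A unique hit is also counted under the three-or-fewer tally, which mirrors how the two counts are reported in the text.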

Non-coding transcription has emerged as a critical regulator of cell and developmental biology (Goff and Rinn, 2015). A dedicated programme operating during human organogenesis seemed likely as 81 out of the 1571 genes enriched in embryogenesis compared to the fetal datasets were annotated long intergenic non-coding (LINC) transcripts (Supplementary file 1B). To look beyond this we assembled strand-specific transcripts not recognized by current genome annotation [GENCODE 18 (Harrow et al., 2012)] and systematically named them individually according to recommended criteria (Mattick and Rinn, 2015). 6251 unique loci accounted for in excess of 9 Mb of novel polyadenylated transcription from the human genome (Figure 4a and Supplementary file 1I). The vast majority of transcripts fulfilled criteria as lnc RNAs by assessment of coding potential (CPAT score <0.2) (Figure 4b), length over 200 base pairs (bp) and an absence of reads spanning splice junctions to currently annotated genes (Mattick and Rinn, 2015). These lncRNAs were classified as either bidirectional, antisense or overlapping, or by exclusion intergenic, according to orientation and position in relation to the annotated genome (Mattick and Rinn, 2015). Transcripts were most commonly 500–1,500 bp but could extend to over 600 Kb (Figure 4c) and showed high tissue-specificity with the median Tau value (Yanai et al., 2005) of 0.86, much higher than for protein-coding genes (0.63) but consistent with previously annotated non-coding genes (0.89). We investigated the association between this novel human embryonic transcriptome and the annotated genome. Reduced physical distance to expressed annotated genes markedly increased the likelihood of novel transcript co-expression (Figure 4e), although the best correlations were by no means always with the closest gene (Figure 4f–g). 
The median distance to the closest annotated gene was 7.7 Kb (Figure 4—figure supplement 1) while on average the best correlation was at 188 Kb (random prediction was 476 Kb). Over half (3634) of the lnc transcripts were classified positionally as LINC RNAs. While LINC RNAs can harbour important regulatory function, how to forecast their relationship(s) with the protein-coding genome and prioritize the investigation of thousands of new transcripts is immensely challenging (Goff and Rinn, 2015). As a first step, the multi-tissue nature of our dataset allowed intricate correlative patterns to be deciphered implying putative relationships; for instance over a 2 Mb window and across numerous genes on chromosome 7 between HE-LINC-C7T121 and TBX20, which encodes a developmental cardiac transcription factor mutated in a wide range of congenital heart disease (Figure 4h).
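
The comparison between the closest gene and the best-correlated gene can be illustrated with a small Python sketch. It assumes each feature is described by a transcriptional start site (`tss`) and a vector of expression values across the same ordered set of tissues; the dictionary layout, the function names and the toy numbers are hypothetical, and Pearson correlation is used here as one plausible reading of the correlation (r) reported in the text.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile carries no correlation signal
    return sxy / sqrt(sxx * syy)


def best_correlated_gene(transcript, genes, window_bp=1_000_000):
    """Return (name, r, distance) for the annotated gene within
    `window_bp` of the transcript's TSS with the highest correlation."""
    best = None
    for gene in genes:
        distance = abs(gene["tss"] - transcript["tss"])
        if distance > window_bp:
            continue
        r = pearson(transcript["expr"], gene["expr"])
        if best is None or r > best[1]:
            best = (gene["name"], r, distance)
    return best


# Toy data on one chromosome: the nearest gene is not the best match.
novel = {"tss": 1_000_000, "expr": [0, 0, 5, 9]}
annotated = [
    {"name": "near_gene", "tss": 1_002_000, "expr": [4, 4, 4, 5]},
    {"name": "far_gene", "tss": 1_700_000, "expr": [0, 0, 10, 18]},
    {"name": "outside_gene", "tss": 3_000_000, "expr": [0, 0, 5, 9]},
]

name, r, distance = best_correlated_gene(novel, annotated)
print(name, round(r, 3), distance)
```

Genes beyond the window are ignored even if perfectly correlated, so the window size is the main design choice in this kind of screen.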

Figure 4 (with 1 supplement)

6251 novel transcripts identified during human organogenesis show low coding probability and high tissue-specificity.

(a) Novel transcript models were merged across tissues (n = 9180; Supplementary file 4), assessed for coding potential using CPAT and classified (Mattick and Rinn, 2015) as overlapping (OT), antisense (AS), bidirectional (BI), intergenic noncoding (LINC) and/or transcripts of uncertain coding potential (TUCP, if CPAT >0.2). LINC or TUCP transcripts were numbered sequentially (T number) along each chromosome (C, either X, Y or 1–22) whereas BI, AS and OT transcripts were named by association with the annotated gene (‘Z’). A small proportion of transcripts fulfilled dual criteria as BI/AS/OT and TUCP. 6251 unique, non-overlapping, filtered transcript models were identified (the longest from each locus, >200 bp; Supplementary file 1I). (b) Histogram of coding probability determined using CPAT (Wang et al., 2013). 9% of transcripts were classed as TUCP. The small proportion with clear open reading frames (CPAT score = 1.0) were predominantly OT transcripts. (c) Distribution by size of transcript. 114 transcripts were >10 Kb. (d) Tissue specificity was calculated using Tau (Yanai et al., 2005) based on the mean normalized read counts for each tissue or organ site. 80% of transcripts showed Tau values >0.7 indicating high tissue specificity. Details on exon and read counts, and proximity to surrounding genes are shown in Figure 4—figure supplement 1. (e) Box and whisker plots show the correlation between expression of the novel transcripts and surrounding annotated genes within set chromosomal distances of the novel transcriptional start site. Mean correlation was near zero beyond 1 Mb. (f) Histogram showing the correlation (r) between expression of each novel transcript and its closest annotated gene. One quarter of novel transcripts show a correlation (r > 0.71) with the nearest gene; another quarter shows minimal correlation (r = ±0.14). There was no strong anticorrelation. 
(g–h) Expression of the novel transcript is not always correlated with the immediately adjacent gene, illustrated by heatmaps across the 15 organs and tissues. (g) Expression of the novel transcript, *HE-LINC-C6T24*, located just over 2 Kb from *FOXQ1*, correlates strongly with *FOXF2*, approximately 65 Kb distant. (h) Heatmap demonstrates the poor correlation of expression between *HE-LINC-C7T121* and most of the nine genes within 1 Mb on Chr7 but near perfect correlation with *TBX20* located ~0.7 Mb away beyond two intervening genes.

https://doi.org/10.7554/eLife.15657.012

Taken together, this study reports the first comprehensive transcriptomic atlas during human organogenesis to complement parallel initiatives from later development and adulthood (Jaffe et al., 2015; Roadmap Epigenomics Consortium, 2015; Roost et al., 2015). Subjecting transcription from many sites to a method of analysis that incorporated developmental lineage deciphered novel genetic signatures, predicted causality in many human developmental disorders and associated novel non-coding transcription with expression from the surrounding protein-coding genome. At present, the data arise from a relatively narrow window of embryonic development but set the stage for future longitudinal studies for individual organs over time. The tiny amounts and scarcity of human embryonic tissue also necessitated aspects of pooling across different Carnegie stages for some sites but it is striking that this had no impact on ascertaining organ and tissue-specific transcriptomic signatures by LgPCA. The integrated data are expected to be particularly valuable to stem cell researchers examining the fidelity of PSC differentiation in vitro or searching for transcription factors for direct reprogramming of chosen cell lineages. Finally, the discovery of a major new programme of non-coding transcription adds a fresh layer of detail on the spatiotemporal regulation of the human genome.

Materials and methods

Human material

Human embryonic material was collected under ethical approval, informed consent and according to the Codes of Practice of the Human Tissue Authority and staged by the Carnegie classification as described previously (Jennings et al., 2013). This clinical material was collected on site overseen by our research team with immediate transfer to the laboratory. Individually identified tissues and organs (details in Supplementary file 1A) were immediately dissected. The adrenal gland, whole brain, heart, kidney, liver, entire limb buds, lung, stomach, testis, thyroid and anterior two-thirds of the tongue were readily identifiable as discrete organs and tissues. All visible adherent mesenchyme was removed from organs and tissues under a dissecting microscope. For the adrenal gland, this included the capsule, which allowed separation from the kidney. The ureter, emerging from the renal pelvis, was removed separately from the kidney. For the heart, a window of tissue was removed from the lateral wall of the left ventricle. A segment of the liver was dissected from each embryo that avoided the developing gall bladder. The trachea was removed from the lung at its entry point into the lung parenchyma. The stomach was identified between the gastro-oesophageal and pyloric junctions. The testis was dissected free from the attached mesonephros. While the thyroid was readily visualized as a discrete organ in the neck, it unavoidably contained the developing parathyroids and thus this tissue type was referred to throughout as ‘thyroid/PTH’. The palatal shelves were dissected on either side of the midline. The eye was dissected and the RPE peeled off mechanically from its posterior surface, with validation possible under the dissecting microscope because the RPE is very darkly pigmented compared to the other ocular structures.
All samples were collected into TRIzol (Thermo Fisher) or Tri reagent (Sigma-Aldrich) for total RNA isolation as individual tissue or organ types followed by treatment with DNaseI (Sigma-Aldrich). Once the quality of each RNA sample had been confirmed, samples were pooled in order to obtain sufficient RNA for each biological replicate (Supplementary file 1A). The pancreas dataset derived from the same tissue collection was re-used from a previous study (Cebola et al., 2015).

RNA-seq and transcriptome

Quality and integrity of total RNA samples were assessed using a 2100 Bioanalyzer or a 2200 TapeStation (Agilent Technologies) according to the manufacturer’s instructions. RNA sequencing (RNA-seq) libraries were generated using the TruSeq Stranded mRNA assay (Illumina, Inc.) according to the manufacturer’s protocol. Briefly, total RNA (0.1–4 µg) was used as input material from which polyadenylated mRNA was purified using poly-T, oligo-attached, magnetic beads. The mRNA was then fragmented using divalent cations under elevated temperature and then reverse transcribed into first strand cDNA using random primers. Second strand cDNA was then synthesized using DNA Polymerase I and RNase H. Following a single 'A' base addition, adapters were ligated to the cDNA fragments, and the products purified and enriched by PCR to create the final cDNA library. Adapter indices were used to multiplex libraries, which were pooled prior to cluster generation using a cBot instrument. The loaded flow-cell was then sequenced (paired-end; 101 + 101 cycles, plus indices) on an Illumina HiSeq2000 or HiSeq2500 instrument. Demultiplexing of the output data (allowing one mismatch) and BCL-to-Fastq conversion was performed with CASAVA 1.8.3. The RNA-seq was conducted in three batches at different times as a necessity of how human embryonic tissue was collected over time (Supplementary file 1A). Where organs were sequenced across batches (palate, RPE, kidney, testis, adrenal gland, heart / left ventricle and liver) biological replicates clustered together (Figure 1—figure supplement 4).

RNA-seq reads from the Illumina platform were mapped to the human genome (hg19) strand-specifically using TopHat 2.0.9 (Trapnell et al., 2012) and the GENCODE 18 gene annotation set (Harrow et al., 2012). We also remapped the published pancreas RNA-seq dataset (Cebola et al., 2015) obtained from material isolated previously in our laboratory. Additionally, a dataset of hepatocyte differentiation RNA-seq (Du et al., 2014; GEO: GSE54066) was downloaded, re-mapped and quantified as per our own data. Commonly applied RNA-seq normalisation methods such as TMM assume a small proportion of differentially expressed genes in any one dataset (Dillies et al., 2013). Because the highly distinct tissues surveyed here differed strongly on the scale of thousands of genes (for instance liver versus brain), we used quantile normalisation, which gave a lower median coefficient of variation than either TMM normalisation or no normalisation. Read counts from the different datasets were quantile normalised using the R package preprocessCore (Bolstad, 2007). Tissue-specificity was scored per gene using Tau (Yanai et al., 2005) on normalised read counts across all samples. Initial genome-wide relationships were assessed using PCA (Figure 1—figure supplement 3) and hierarchical clustering (heatmap, Figure 1—figure supplement 4).
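
Quantile normalisation and the Tau score can both be written compactly. The sketch below is a plain Python illustration, not the preprocessCore implementation used in the study: the function names and toy values are hypothetical, and ties are broken by input order here, which may differ from how the R package treats tied values. Tau follows the formula of Yanai et al. (2005), the sum of (1 - x_i / x_max) over tissues divided by (n - 1).

```python
def quantile_normalise(columns):
    """Force every sample onto the same distribution.

    `columns` is a list of samples, each a list of values for the same
    genes in the same order. Returns a new list of normalised samples.
    """
    n_genes = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean across samples at each rank.
    reference = [
        sum(col[i] for col in sorted_cols) / len(columns)
        for i in range(n_genes)
    ]
    normalised = []
    for col in columns:
        order = sorted(range(n_genes), key=lambda i: col[i])
        out = [0.0] * n_genes
        for rank, gene_index in enumerate(order):
            out[gene_index] = reference[rank]
        normalised.append(out)
    return normalised


def tau(values):
    """Tissue-specificity index: 0 for uniform expression across
    tissues, 1 for expression confined to a single tissue."""
    peak = max(values)
    if peak <= 0:
        return 0.0  # gene not expressed anywhere
    return sum(1 - v / peak for v in values) / (len(values) - 1)


samples = [[2, 4, 6], [10, 30, 20]]  # two toy samples, three genes
normalised = quantile_normalise(samples)
print(normalised)
print(tau([10, 0, 0, 0]), tau([5, 5, 5, 5]), tau([10, 5, 0]))
```

After normalisation both samples contain the same set of values and differ only in which gene holds which rank, which is why the method suits datasets where many genes differ between tissues.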

To compare our samples with RNA-seq from the NIH Roadmap project (Roadmap Epigenomics Consortium, 2015) uniquely mapped strand-specific RNA-seq reads were counted into a set of non-redundant exon annotations (custom made from GENCODE 18 annotations) using bedtools intersect (Quinlan and Hall, 2010). Exon level counts were then summed into a single total per gene per sample. Counts were quantile normalized across samples. NIH roadmap samples (Roadmap Epigenomics Consortium, 2015) used in this study are listed in Supplementary file 1J. For the analysis of human embryonic RNA-seq with comparable Roadmap fetal data (adrenal gland, heart, kidney, lung, limbs, stomach and testis) a single pairwise differential expression test was undertaken using the R package edgeR (Robinson et al., 2010) and an FDR < 0.05.

NMF

Non-negative matrix factorisation (NMF) searches complex expression data, comprising thousands of genes, for a small number of characteristic ‘metagenes’ (Gaujoux and Seoighe, 2010). NMF was performed using the nmf R package (version NMF_0.20.5) (Gaujoux and Seoighe, 2010) to extract tissue-specific metagenes. Non-normalised read counts were filtered to remove all Y-linked genes, the X-inactivation gene XIST and genes with fewer than 100 reads across all samples. Initially 50 runs each of ranks 11–18 and using the default ‘Brunet’ algorithm (Brunet et al., 2004) were performed to find an optimal factorisation ‘rank’ (r). The maximal cophenetic distance was used to select the value of r. Subsequently, 200 runs using the optimal rank were performed to assess consistency of sample groupings between runs. Non-overlapping (i.e. tissue-specific) gene sets were extracted from each metagene by filtering on basis contribution >0.8.
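
The filtering before factorisation and the extraction of non-overlapping gene sets afterwards can be sketched as follows. The factorisation itself is not reproduced. This Python illustration makes two assumptions that the text does not confirm: "fewer than 100 reads across all samples" is read as the total summed over samples, and "basis contribution" is read as the share of a gene's total basis weight carried by one metagene. All function names and gene labels other than XIST are hypothetical.

```python
def filter_counts(counts, chromosome, min_total=100):
    """Drop Y-linked genes, XIST, and genes whose reads summed over all
    samples fall below `min_total`.

    `counts` maps gene -> list of read counts per sample;
    `chromosome` maps gene -> chromosome name.
    """
    return {
        gene: row
        for gene, row in counts.items()
        if chromosome.get(gene) != "Y"
        and gene != "XIST"
        and sum(row) >= min_total
    }


def metagene_sets(basis, threshold=0.8):
    """Assign a gene to a metagene when that metagene carries more than
    `threshold` of the gene's total basis weight.

    `basis` maps gene -> list of non-negative weights, one per metagene.
    """
    sets = {}
    for gene, weights in basis.items():
        total = sum(weights)
        if total == 0:
            continue
        share = [w / total for w in weights]
        top = max(range(len(share)), key=share.__getitem__)
        if share[top] > threshold:
            sets.setdefault(top, []).append(gene)
    return sets


counts = {"geneA": [60, 50], "geneB": [10, 20],
          "XIST": [500, 0], "geneY": [300, 300]}
chromosome = {"geneA": "1", "geneB": "2", "XIST": "X", "geneY": "Y"}
kept = filter_counts(counts, chromosome)

basis = {"geneA": [9.0, 0.5, 0.5], "geneB": [4.0, 4.0, 2.0],
         "geneC": [0.0, 0.1, 0.9]}
sets = metagene_sets(basis)
print(sorted(kept), sets)
```

Because the threshold exceeds 0.5, no gene can pass it for two metagenes at once, so the resulting sets cannot overlap.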

LgPCA

The LgPCA approach was adapted from established phylogenetic PCA methodology (Jombart et al., 2010b) and performed using quantile-normalised, gene-level read counts, a high-memory (512 GB) compute node and the ppca function from the adephylo R package (Jombart et al., 2010a). A broad user-defined guide tree (Figure 2b), based on well-established knowledge of mammalian gastrulation and downstream lineage relationships, was imposed on the different organ and tissue types. The adephylo R package then weighted the principal components by the lineage autocorrelation between samples: the weight increased if related samples were similar and decreased if related samples were more different. As in the description from Jombart and colleagues, the resulting components represented ‘global’ structures (where similarity is high between related samples) and ‘local’ structures (where related samples are dissimilar) (Jombart et al., 2010b). We used the LgPCA to extract all the global patterns from the data (PCs 1–15). These patterns were not apparent if lineage relationships were not included, nor did they change if the placement of any one tissue, such as palate, was altered within the broad lineage structure (data not shown). The global patterns in PCs 1–15 imply co-regulatory patterns of gene expression across human organogenesis. The ‘local’ patterns thereafter captured heterogeneity between tissue replicates (Figure 2—figure supplement 1) (while PC7 separated the two PSC populations, these RNA-seq datasets represent separate cell lines from NIH Roadmap). We used the Abouheif distance as implemented in adephylo (Jombart et al., 2010a), which takes into account the topology of the specified tree but does not use branch lengths.
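
The notion of ‘global’ versus ‘local’ structure can be illustrated with Moran's I, a standard autocorrelation index computed from a matrix of proximities between samples. The sketch below is only an illustration: the binary proximity matrix is a hypothetical stand-in for the Abouheif proximities derived from the guide tree, and whether this index matches the exact weighting applied inside ppca is an assumption.

```python
def morans_i(values, weights):
    """Moran's I for one variable given a samples x samples proximity matrix.

    Positive values indicate that related samples are similar ('global'
    structure); negative values indicate that related samples differ
    ('local' structure).
    """
    n = len(values)
    mean = sum(values) / n
    dev = [x - mean for x in values]
    s0 = sum(sum(row) for row in weights)
    numerator = sum(weights[i][j] * dev[i] * dev[j]
                    for i in range(n) for j in range(n))
    denominator = sum(d * d for d in dev)
    return (n / s0) * (numerator / denominator)


# Hypothetical proximities: samples 0 and 1 share a lineage, as do 2 and 3.
proximity = [
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]
global_like = morans_i([1.0, 1.0, -1.0, -1.0], proximity)  # related samples agree
local_like = morans_i([1.0, -1.0, 1.0, -1.0], proximity)   # related samples differ
```

In this toy case the first score vector gives an index of 1 and the second gives -1, which mirrors the distinction between components that follow the lineage structure and components that capture differences between closely related samples.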

Gene set enrichment

For the comparison of the embryonic versus fetal datasets, Gene Ontology term enrichment was performed on upregulated genes (FDR < 0.05) using Fisher’s exact test with the elimination algorithm of the R package topGO (Alexa and Rahnenfuhrer, 2010). For the LgPCA, annotated ontology nodes (>10 genes) were tested for each loadings vector for each PC against background using the Wilcoxon test. Tests were performed sequentially, moving up the separate GO ontologies (Biological Process (BP), Molecular Function (MF) and Cellular Component (CC)) and excluding significantly scoring genes from later tests (the topGO ‘elim’ method).
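
For readers unfamiliar with the enrichment test, the following sketch computes a one-sided Fisher's exact test p-value for a single GO term from the hypergeometric distribution. It is a simplified illustration: it does not reproduce the topGO ‘elim’ procedure, which additionally removes genes from ancestor terms once a descendant term is significant, and the function name and example numbers are hypothetical.

```python
from math import comb


def fisher_enrichment_p(n_background, n_annotated, n_selected, n_overlap):
    """One-sided p-value: P(overlap >= n_overlap) under the hypergeometric null.

    n_background: all genes tested
    n_annotated:  background genes annotated to the GO term
    n_selected:   genes in the list of interest (e.g. upregulated genes)
    n_overlap:    selected genes that carry the annotation
    """
    total = comb(n_background, n_selected)
    upper = min(n_annotated, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 20 genes, 5 annotated, 5 selected, all 5 overlapping.
p_all_overlap = fisher_enrichment_p(20, 5, 5, 5)
p_no_overlap = fisher_enrichment_p(20, 5, 5, 0)
```

A small p-value indicates that the selected genes carry the annotation more often than expected by chance. The resulting p-values across many terms would still need correction for multiple testing.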

iRegulon analysis of regulation in the extremes of the LgPCA

iRegulon is a computational method that tests for enrichment amongst precomputed motif datasets to decipher transcriptional regulatory networks in a set of co-expressed genes. The 1000 genes with the most extreme loadings at either end of each PC (‘high’ and ‘low’) from the LgPCA were loaded into Cytoscape (version 3.2.1) (Shannon et al., 2003) and used as queries to the iRegulon plug-in (version 1.3, build 1024) (Janky et al., 2014). A 20 kb region centred on the transcriptional start site (TSS) was examined under default settings.
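
Selecting the query gene sets amounts to ranking genes by their loading on a component and taking each tail. A minimal sketch follows, assuming the loadings are held in a dictionary keyed by gene name; the gene names and values are hypothetical, and capping the tail size at half the gene count is an added safeguard rather than part of the published method.

```python
def extreme_genes(loadings, n=1000):
    """Return the n genes with the lowest and the n with the highest loadings."""
    ranked = sorted(loadings, key=loadings.get)  # ascending by loading
    n = min(n, len(ranked) // 2)  # assumption: keep the two tails disjoint
    if n == 0:
        return {"low": [], "high": []}
    return {"low": ranked[:n], "high": ranked[-n:][::-1]}


# Hypothetical loadings for one principal component.
pc_loadings = {"G1": -0.9, "G2": 0.8, "G3": 0.1, "G4": -0.4, "G5": 0.5, "G6": 0.0}
queries = extreme_genes(pc_loadings, n=2)
```

Each of the two lists would then be submitted as a separate query, so that regulators associated with the ‘high’ and ‘low’ ends of a component can be distinguished.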

Novel transcripts

Sample-specific transcriptomes were assembled with Cufflinks (version 2.2.0) (Trapnell et al., 2010). Transcriptomes were combined (‘cuffmerge’; --min-isoform-fraction = 0.1) and compared with the original GENCODE 18 reference (‘cuffcompare’). We filtered out known transcripts using the ‘Transfrag class codes’ (http://cole-trapnell-lab.github.io/cufflinks/cuffcompare/#transfrag-class-codes) to retain only wholly intronic (‘i’, of which there were none), unknown (‘u’), antisense (‘x’) and overlapping (‘o’) transcripts. We discarded all other classes, including pre-mRNA (class ‘e’), novel isoforms spliced to known exons (class ‘j’), and 3’ run-ons within 2 kb of the end of the transcript annotation (class ‘p’). In addition, some remaining non-spliced transcripts may theoretically represent first or last exon (UTR) extensions; to delimit these, we calculated the distance on the same strand to the closest downstream transcription start site (to consider potential 5’ UTR extension) and upstream transcription termination site (to consider potential 3’ UTR extension). Names were assigned to these novel transcripts following suggested criteria (Mattick and Rinn, 2015) (Figure 4a), with the sole adaptation that bidirectional (BI) transcripts were defined as having their TSS within 1 kb of the TSS of the associated annotated gene. No transcripts mapped to the same strand within the introns of any annotated gene, excluding the possibility of unspliced transcripts from annotated genes being erroneously defined as novel transcripts. All transcript sequences (annotated and unannotated) were scored for protein-coding potential using CPAT (based on human training data included with CPAT) (Wang et al., 2013). A threshold of >0.2 was used to define ‘Transcripts of Uncertain Coding Potential’ (TUCP). Where there were multiple transcripts from a single locus, the longest transcript was retained in assembling the final dataset of 6251 novel transcripts.
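
The filtering logic described above can be sketched as follows. The record fields (`class_code`, `locus`, `length`, `coding_prob`) are hypothetical names rather than Cufflinks or CPAT output columns, the label given to transcripts at or below the threshold is an assumption, and so is the order in which the longest-per-locus and coding-potential steps are applied.

```python
KEPT_CLASS_CODES = {"i", "u", "x", "o"}  # intronic, unknown, antisense, overlapping
TUCP_THRESHOLD = 0.2                     # CPAT coding probability cut-off


def select_novel(transcripts):
    """Keep candidate novel transcripts, one (the longest) per locus."""
    longest = {}
    for t in transcripts:
        if t["class_code"] not in KEPT_CLASS_CODES:
            continue  # drops classes such as 'e', 'j', 'p' and exact matches '='
        current = longest.get(t["locus"])
        if current is None or t["length"] > current["length"]:
            longest[t["locus"]] = t
    selected = []
    for t in longest.values():
        # Assumption: anything above the threshold is flagged TUCP and the
        # remainder is labelled 'noncoding' for this illustration.
        label = "TUCP" if t["coding_prob"] > TUCP_THRESHOLD else "noncoding"
        selected.append({**t, "label": label})
    return selected


# Hypothetical records for illustration only.
records = [
    {"id": "T1", "class_code": "u", "locus": "L1", "length": 900, "coding_prob": 0.05},
    {"id": "T2", "class_code": "u", "locus": "L1", "length": 1500, "coding_prob": 0.03},
    {"id": "T3", "class_code": "j", "locus": "L2", "length": 2000, "coding_prob": 0.90},
    {"id": "T4", "class_code": "x", "locus": "L3", "length": 700, "coding_prob": 0.40},
    {"id": "T5", "class_code": "p", "locus": "L4", "length": 500, "coding_prob": 0.01},
]
novel = select_novel(records)
```
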
Transcript-level read counts for the embryonic samples and NIH Roadmap samples (Supplementary file 1J) were generated for the merged transcriptome using bedtools multicov (version 2.21.0) (Quinlan and Hall, 2010). The correlations between each of the 6251 transcripts and all annotated genes within 1 Mb were calculated using only the human embryonic data from this study.
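
A minimal sketch of the neighbourhood correlation step is given below. Several details are assumptions made for illustration: the text does not state the type of correlation coefficient (Pearson is used here), distance is measured as the gap between the transcript and gene intervals, and the record fields and toy expression values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns None if either vector has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        return None
    return sum(a * b for a, b in zip(dx, dy)) / math.sqrt(sxx * syy)


def interval_gap(a_start, a_end, b_start, b_end):
    """Distance between two intervals on one chromosome (0 if they overlap)."""
    return max(0, max(a_start, b_start) - min(a_end, b_end))


def correlate_nearby(transcript, genes, window=1_000_000):
    """Correlate a transcript with every annotated gene within `window` bp."""
    result = {}
    for g in genes:
        if g["chrom"] != transcript["chrom"]:
            continue
        gap = interval_gap(transcript["start"], transcript["end"],
                           g["start"], g["end"])
        if gap > window:
            continue
        r = pearson(transcript["expr"], g["expr"])
        if r is not None:
            result[g["name"]] = r
    return result


# Hypothetical coordinates and expression values across four samples.
novel_tx = {"chrom": "chr1", "start": 5_000_000, "end": 5_002_000,
            "expr": [1.0, 2.0, 3.0, 4.0]}
annotated = [
    {"name": "geneA", "chrom": "chr1", "start": 5_500_000, "end": 5_510_000,
     "expr": [2.0, 4.0, 6.0, 8.0]},
    {"name": "geneB", "chrom": "chr1", "start": 7_000_000, "end": 7_020_000,
     "expr": [1.0, 2.0, 3.0, 4.0]},
    {"name": "geneC", "chrom": "chr2", "start": 5_100_000, "end": 5_120_000,
     "expr": [1.0, 2.0, 3.0, 4.0]},
    {"name": "geneD", "chrom": "chr1", "start": 4_200_000, "end": 4_300_000,
     "expr": [4.0, 3.0, 2.0, 1.0]},
]
nearby = correlate_nearby(novel_tx, annotated)
```

In this toy case only geneA and geneD fall inside the 1 Mb window, so the distant geneB and the gene on another chromosome are never tested.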

Data availability

Mapping coordinates against multiple genome versions using a range of common pipelines and summary count data are hosted at www.manchester.ac.uk/human-developmental-biology. To view data in the UCSC genome browser, a trackhub is available: http://www.humandevelopmentalbiology.manchester.ac.uk/data/hub_manchester_hdb/hub.txt.

Data availability

The following data sets were generated

(2016) An integrative transcriptomic atlas of organogenesis in human embryos
Publicly available at EBI ArrayExpress (accession no: E-MTAB-3928).

http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-3928/

The following previously published data sets were used

(2015) TEAD and YAP regulate the enhancer network of human embryonic pancreatic progenitors
Publicly available at EBI ArrayExpress (accession no: E-MTAB-3061).

http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-3061/

References

Software
1. Alexa A
2. Rahnenfuhrer J
(2010) topGO: Enrichment Analysis for Gene Ontology
Bioconductor.

http://bioconductor.org/packages/topGO/
Software
1. Bolstad B
(2007) preprocessCore: A Collection of Pre-Processing Functions
Bioconductor.

http://bioconductor.org/packages/preprocessCore/
(2004) Metagenes and molecular pattern discovery using matrix factorization
PNAS 101:4164–4169.

https://doi.org/10.1073/pnas.0308531101
1. Cebola I
2. Rodríguez-Seguí SA
3. Cho CH
4. Bessa J
5. Rovira M
6. Luengo M
7. Chhatriwala M
8. Berry A
9. Ponsa-Cobas J
10. Maestro MA
11. Jennings RE
12. Pasquali L
13. Morán I
14. Castro N
15. Hanley NA
16. Gomez-Skarmeta JL
17. Vallier L
18. Ferrer J
(2015) TEAD and YAP regulate the enhancer network of human embryonic pancreatic progenitors
Nature Cell Biology 17:615–626.

https://doi.org/10.1038/ncb3160
1. Cotney J
2. Leng J
3. Yin J
4. Reilly SK
5. DeMare LE
6. Emera D
7. Ayoub AE
8. Rakic P
9. Noonan JP
(2013) The evolution of lineage-specific regulatory activities in the human embryonic limb
Cell 154:185–196.

https://doi.org/10.1016/j.cell.2013.05.056
1. Dillies MA
2. Rau A
3. Aubert J
4. Hennequet-Antier C
5. Jeanmougin M
6. Servant N
7. Keime C
8. Marot G
9. Castel D
10. Estelle J
11. Guernec G
12. Jagla B
13. Jouneau L
14. Laloë D
15. Le Gall C
16. Schaëffer B
17. Le Crom S
18. Guedj M
19. Jaffrézic F
20. French StatOmique Consortium
(2013) A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis
Briefings in Bioinformatics 14:671–683.

https://doi.org/10.1093/bib/bbs046
1. Du Y
2. Wang J
3. Jia J
4. Song N
5. Xiang C
6. Xu J
7. Hou Z
8. Su X
9. Liu B
10. Jiang T
11. Zhao D
12. Sun Y
13. Shu J
14. Guo Q
15. Yin M
16. Sun D
17. Lu S
18. Shi Y
19. Deng H
(2014) Human hepatocytes with drug metabolic function induced from fibroblasts by lineage reprogramming
Cell Stem Cell 14:394–403.

https://doi.org/10.1016/j.stem.2014.01.008
1. Fang H
2. Yang Y
3. Li C
4. Fu S
5. Yang Z
6. Jin G
7. Wang K
8. Zhang J
9. Jin Y
(2010) Transcriptome analysis of early organogenesis in human embryos
Developmental Cell 19:174–184.

https://doi.org/10.1016/j.devcel.2010.06.014
1. Gaujoux R
2. Seoighe C
(2010) A flexible R package for nonnegative matrix factorization
BMC Bioinformatics 11:367.

https://doi.org/10.1186/1471-2105-11-367
1. Goff L
2. Rinn J
(2015) Linking RNA Biology to lncRNAs
Genome Research 25:1456–1465.

https://doi.org/10.1101/gr.191122.115
1. Harrow J
2. Frankish A
3. Gonzalez JM
4. Tapanari E
5. Diekhans M
6. Kokocinski F
7. Aken BL
8. Barrell D
9. Zadissa A
10. Searle S
11. Barnes I
12. Bignell A
13. Boychenko V
14. Hunt T
15. Kay M
16. Mukherjee G
17. Rajan J
18. Despacio-Reyes G
19. Saunders G
20. Steward C
21. Harte R
22. Lin M
23. Howald C
24. Tanzer A
25. Derrien T
26. Chrast J
27. Walters N
28. Balasubramanian S
29. Pei B
30. Tress M
31. Rodriguez JM
32. Ezkurdia I
33. van Baren J
34. Brent M
35. Haussler D
36. Kellis M
37. Valencia A
38. Reymond A
39. Gerstein M
40. Guigó R
41. Hubbard TJ
(2012) GENCODE: the reference human genome annotation for The ENCODE project
Genome Research 22:1760–1774.

https://doi.org/10.1101/gr.135350.111
1. Huang P
2. Zhang L
3. Gao Y
4. He Z
5. Yao D
6. Wu Z
7. Cen J
8. Chen X
9. Liu C
10. Hu Y
11. Lai D
12. Hu Z
13. Chen L
14. Zhang Y
15. Cheng X
16. Ma X
17. Pan G
18. Wang X
19. Hui L
(2014) Direct reprogramming of human fibroblasts to functional and expandable hepatocytes
Cell Stem Cell 14:370–384.

https://doi.org/10.1016/j.stem.2014.01.003
1. Jaffe AE
2. Shin J
3. Collado-Torres L
4. Leek JT
5. Tao R
6. Li C
7. Gao Y
8. Jia Y
9. Maher BJ
10. Hyde TM
11. Kleinman JE
12. Weinberger DR
(2015) Developmental regulation of human cortex transcription and its clinical relevance at single base resolution
Nature Neuroscience 18:154–161.

https://doi.org/10.1038/nn.3898
(2014) iRegulon: from a gene list to a gene regulatory network using large motif and track collections
PLoS Computational Biology 10:e1003731.

https://doi.org/10.1371/journal.pcbi.1003731
(2013) Development of the human pancreas from foregut to endocrine commitment
Diabetes 62:3514–3522.

https://doi.org/10.2337/db12-1479
(2015) Human pancreas development
Development 142:3126–3137.

https://doi.org/10.1242/dev.120063
(2010a) Adephylo: new tools for investigating the phylogenetic signal in biological traits
Bioinformatics 26:1907–1909.

https://doi.org/10.1093/bioinformatics/btq292
(2010b) Putting phylogeny into the analysis of biological traits: a methodological approach
Journal of Theoretical Biology 264:693–701.

https://doi.org/10.1016/j.jtbi.2010.03.038
1. Mattick JS
2. Rinn JL
(2015) Discovery and annotation of long noncoding RNAs
Nature Structural & Molecular Biology 22:5–7.

https://doi.org/10.1038/nsmb.2942
1. Quinlan AR
2. Hall IM
(2010) BEDTools: a flexible suite of utilities for comparing genomic features
Bioinformatics 26:841–842.

https://doi.org/10.1093/bioinformatics/btq033
1. Roadmap Epigenomics Consortium
2. Kundaje A
3. Meuleman W
4. Ernst J
5. Bilenky M
6. Yen A
7. Heravi-Moussavi A
8. Kheradpour P
9. Zhang Z
10. Wang J
11. Ziller MJ
12. Amin V
13. Whitaker JW
14. Schultz MD
15. Ward LD
16. Sarkar A
17. Quon G
18. Sandstrom RS
19. Eaton ML
20. Wu YC
21. Pfenning AR
22. Wang X
23. Claussnitzer M
24. Liu Y
25. Coarfa C
26. Harris RA
27. Shoresh N
28. Epstein CB
29. Gjoneska E
30. Leung D
31. Xie W
32. Hawkins RD
33. Lister R
34. Hong C
35. Gascard P
36. Mungall AJ
37. Moore R
38. Chuah E
39. Tam A
40. Canfield TK
41. Hansen RS
42. Kaul R
43. Sabo PJ
44. Bansal MS
45. Carles A
46. Dixon JR
47. Farh KH
48. Feizi S
49. Karlic R
50. Kim AR
51. Kulkarni A
52. Li D
53. Lowdon R
54. Elliott G
55. Mercer TR
56. Neph SJ
57. Onuchic V
58. Polak P
59. Rajagopal N
60. Ray P
61. Sallari RC
62. Siebenthall KT
63. Sinnott-Armstrong NA
64. Stevens M
65. Thurman RE
66. Wu J
67. Zhang B
68. Zhou X
69. Beaudet AE
70. Boyer LA
71. De Jager PL
72. Farnham PJ
73. Fisher SJ
74. Haussler D
75. Jones SJ
76. Li W
77. Marra MA
78. McManus MT
79. Sunyaev S
80. Thomson JA
81. Tlsty TD
82. Tsai LH
83. Wang W
84. Waterland RA
85. Zhang MQ
86. Chadwick LH
87. Bernstein BE
88. Costello JF
89. Ecker JR
90. Hirst M
91. Meissner A
92. Milosavljevic A
93. Ren B
94. Stamatoyannopoulos JA
95. Wang T
96. Kellis M
(2015) Integrative analysis of 111 reference human epigenomes
Nature 518:317–330.

https://doi.org/10.1038/nature14248
(2010) edgeR: a Bioconductor package for differential expression analysis of digital gene expression data
Bioinformatics 26:139–140.

https://doi.org/10.1093/bioinformatics/btp616
(2015) KeyGenes, a Tool to Probe Tissue Differentiation Using a Human Fetal Transcriptional Atlas
Stem Cell Reports 4:1112–1124.

https://doi.org/10.1016/j.stemcr.2015.05.002
1. Shannon P
2. Markiel A
3. Ozier O
4. Baliga NS
5. Wang JT
6. Ramage D
7. Amin N
8. Schwikowski B
9. Ideker T
(2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks
Genome Research 13:2498–2504.

https://doi.org/10.1101/gr.1239303
1. Trapnell C
2. Roberts A
3. Goff L
4. Pertea G
5. Kim D
6. Kelley DR
7. Pimentel H
8. Salzberg SL
9. Rinn JL
10. Pachter L
(2012) Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks
Nature Protocols 7:562–578.

https://doi.org/10.1038/nprot.2012.016
1. Trapnell C
2. Williams BA
3. Pertea G
4. Mortazavi A
5. Kwan G
6. van Baren MJ
7. Salzberg SL
8. Wold BJ
9. Pachter L
(2010) Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation
Nature Biotechnology 28:511–515.

https://doi.org/10.1038/nbt.1621
1. Wang L
2. Park HJ
3. Dasari S
4. Wang S
5. Kocher JP
6. Li W
(2013) CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model
Nucleic Acids Research 41:e74.

https://doi.org/10.1093/nar/gkt006
1. Yanai I
2. Benjamin H
3. Shmoish M
4. Chalifa-Caspi V
5. Shklar M
6. Ophir R
7. Bar-Even A
8. Horn-Saban S
9. Safran M
10. Domany E
11. Lancet D
12. Shmueli O
(2005) Genome-wide midrange transcription profiles reveal expression level relationships in human tissue specification
Bioinformatics 21:650–659.

https://doi.org/10.1093/bioinformatics/bti042

Article and author information

Author details

Dave T Gerrard

Division of Diabetes, Endocrinology & Gastroenterology, School of Medical Sciences, Faculty of Biology, Medicine and Health, Manchester Academic Health Science Centre, University of Manchester, Manchester, United Kingdom

Contribution
DTG, Wrote the manuscript, Conducted the bioinformatics analyses, Devised the study and planned experiments, Conception and design, Analysis and interpretation of data, Drafting or revising the article

Contributed equally with
Andrew A Berry

Competing interests
The authors declare that no competing interests exist.

ORCID iD: 0000-0001-6890-7213

Andrew A Berry

Division of Diabetes, Endocrinology & Gastroenterology, School of Medical Sciences, Faculty of Biology, Medicine and Health, Manchester Academic Health Science Centre, University of Manchester, Manchester, United Kingdom

Contribution
AAB, Collected, dissected and prepared material, Devised the study and planned experiments, Conception and design, Acquisition of data, Drafting or revising the article

Contributed equally with
Dave T Gerrard

Competing interests
The authors declare that no competing interests exist.
Rachel E Jennings
1. Division of Diabetes, Endocrinology & Gastroenterology, School of Medical Sciences, Faculty of Biology, Medicine and Health, Manchester Academic Health Science Centre, University of Manchester, Manchester, United Kingdom
2. Endocrinology Department, Central Manchester University Hospitals NHS Foundation Trust, Manchester, United Kingdom
Contribution
REJ, Collected, dissected and prepared material, Devised the study and planned experiments, Conception and design, Acquisition of data, Drafting or revising the article

Competing interests
The authors declare that no competing interests exist.
Karen Piper Hanley

Division of Diabetes, Endocrinology & Gastroenterology, School of Medical Sciences, Faculty of Biology, Medicine and Health, Manchester Academic Health Science Centre, University of Manchester, Manchester, United Kingdom

Contribution
KPH, Involved in study design, Conception and design, Drafting or revising the article

Competing interests
The authors declare that no competing interests exist.
Nicoletta Bobola

Division of Dentistry, School of Medical Sciences, Faculty of Biology, Medicine and Health, Manchester Academic Health Science Centre, University of Manchester, Manchester, United Kingdom

Contribution
NB, Wrote the manuscript, Conducted the analysis of developmental disorders, Devised the study and planned experiments, Conception and design, Analysis and interpretation of data, Drafting or revising the article

Competing interests
The authors declare that no competing interests exist.
Neil A Hanley
1. Division of Diabetes, Endocrinology & Gastroenterology, School of Medical Sciences, Faculty of Biology, Medicine and Health, Manchester Academic Health Science Centre, University of Manchester, Manchester, United Kingdom
2. Endocrinology Department, Central Manchester University Hospitals NHS Foundation Trust, Manchester, United Kingdom
Contribution
NAH, Wrote the manuscript, Guarantor, Devised the study and planned experiments, Conception and design

For correspondence
neil.hanley@manchester.ac.uk

Competing interests
The authors declare that no competing interests exist.

ORCID iD: 0000-0003-3234-4038

Funding

Wellcome Trust (WT088566MA)

Dave T Gerrard
Andrew A Berry
Neil A Hanley

Medical Research Council (MR/J003352/1)

Karen Piper Hanley

Medical Research Council (MR/L009986/1)

Nicoletta Bobola
Neil A Hanley

Wellcome Trust (WT097820MF)

Neil A Hanley

British Council (14BX15NHBG)

Neil A Hanley

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We are very grateful to all women who consented to take part in our research programme and for the assistance of research nurses and clinical colleagues at Central Manchester University Hospitals NHS Foundation Trust. We thank Ian Donaldson, Peter Briggs and Andy Hayes of the Bioinformatics and Genomic Technologies Core Facilities at the University of Manchester for assistance with RNA-sequencing. REJ is a UK Medical Research Council (MRC) clinical research training fellow. NAH is a Wellcome Trust senior fellow in clinical science (WT088566MA). This project received support from the Wellcome Trust (WT097820), with additional support from MRC project grants MR/L009986/1 to NB and NAH and MR/J003352/1 to KPH, and from the British Council (14BX15NHBG) to NAH.

Ethics

Human subjects: Human embryonic material was collected with ethical approval and informed consent, and in accordance with the Codes of Practice of the Human Tissue Authority (protocol number 13/NW/0205).

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.