An integrative transcriptomic atlas of organogenesis in human embryos
Abstract
Human organogenesis is when severe developmental abnormalities commonly originate. However, understanding this critical embryonic phase has relied upon inference from patient phenotypes and assumptions from in vitro stem cell models and non-human vertebrates. We report an integrated transcriptomic atlas of human organogenesis. By lineage-guided principal components analysis, we uncover novel relatedness of particular developmental genes across different organs and tissues and identified unique transcriptional codes which correctly predicted the cause of many congenital disorders. By inference, our model pinpoints co-enriched genes as new causes of developmental disorders such as cleft palate and congenital heart disease. The data revealed more than 6000 novel transcripts, over 90% of which fulfil criteria as long non-coding RNAs correlated with the protein-coding genome over megabase distances. Taken together, we have uncovered cryptic transcriptional programs used by the human embryo and established a new resource for the molecular understanding of human organogenesis and its associated disorders.
https://doi.org/10.7554/eLife.15657.001eLife digest
Individual organs and tissues form in human embryos during the first two months of pregnancy. Any errors during this crucial stage of human development can result in miscarriage or serious birth defects. Yet remarkably little is known about how this process works. What is known has been inferred from studies of how other animals develop, human stem cells grown in a laboratory, and babies born with genetic conditions that cause developmental problems.
Genes control the way that organs and tissues form, and are switched on or off in complex patterns during development to ensure that particular cells develop into one type of organ and not another. When genes are switched on, their DNA is copied into molecules called RNA. Many RNA molecules are used as templates to make proteins, which then perform critical roles in cell processes. One way to find out which genes are activated during development is to identify which RNAs are made by cells in the embryo.
Here, Gerrard, Berry et al. used a technique called RNA-sequencing to identify the RNAs that human embryos make while their organs and tissues form. The RNA came from many different tissues including the heart, limbs and the roof of the mouth. Gerrard, Berry et al. developed a new computational model that used the identity of the RNAs to decode the precise patterns of gene activity in the tissues. The model correctly identified many genes that were already known to cause developmental problems when faulty, and identified numerous others that are now predicted to cause developmental defects in humans.
Gerrard, Berry et al. also discovered over 6,000 RNAs in the human embryos that are unlikely to code for proteins. These “non-coding” RNAs may have other roles in cells, such as switching off genes, and many of them appear to be specific to human embryos. Together, these findings have uncovered new patterns of gene activity that drive development in human embryos and provide a resource for studying how organs and tissues form. Future challenges are to understand what controls these patterns of gene activity, and how the patterns change over time.
https://doi.org/10.7554/eLife.15657.002Introduction
Embryogenesis encompasses the progression from fertilized zygote to blastocyst and through gastrulation to establish the three germ layers of ectoderm, mesoderm and endoderm, from which all organs and tissues subsequently arise during organogenesis. Remarkably little is known about this latter phase of assembling organs and tissues in human due to the restricted availability of human embryonic tissue and its tiny size. Previous transcriptomics post-implantation have sampled either the whole embryo by expression microarray (Fang et al., 2010), thus lacking organ-specific resolution and the vast majority of long non-coding (lnc) transcription; or included lnc expression by massively parallel short-read RNA sequencing (RNA-seq) but focussed on single sites such as limb bud (Cotney et al., 2013) or pancreas (Cebola et al., 2015). RNA-seq from NIH Roadmap and other studies during or after the end of the first trimester of pregnancy falls after the embryonic period (which ends at 56–58 days post-conception (Carnegie Stage 23)) and commonly reflects near terminal differentiation within heterogeneous fetal organs and tissues (Jaffe et al., 2015; Roadmap Epigenomics Consortium, 2015; Roost et al., 2015). As a consequence of these combined deficiencies, we set about compiling global transcriptomic data during the critical phase of human organogenesis, sampling each germ layer and including sites of mixed origin that are subject to major developmental disorders such as cleft palate and limb abnormalities (Figure 1a).
Organs and tissues from fifteen human embryonic sites were sequenced in two sets of biological replicates (except pancreas and tongue) to generate 28 strand-specific RNA-seq datasets with 44–90 million uniquely mapped reads per replicate (Figure 1a; Supplementary file 1A, which contains information on embryonic stages). Global transcription rates across all organs and tissues were comparable over a high dynamic range; approximately 70% of protein-coding genes contained 100–10,000 mapped reads (Figure 1b; Supplementary file 2). We assessed whether our human embryo datasets identified earlier developmental processes than currently available fetal data (Roadmap Epigenomics Consortium, 2015). There were three-fold the number of differentially expressed genes in the fetal datasets but equivalent enrichment of gene ontology (GO) terms in the embryo, including many early developmental processes such as morphogenesis of an epithelial bud, anterior/posterior pattern specification and embryonic morphogenesis. These were in contrast to homeostatic processes enriched in the fetal dataset (Figure 1c; Supplementary file 1B–C).
Sampling gene expression across multiple sites allowed us to set about deciphering the precise transcriptomic codes responsible for the development of the different human embryonic organs and tissues. While ZNF and ZSCAN family members were broadly expressed discrete site-specific expression was more apparent for individual members of other transcription factor families (Figure 1—figure supplement 1) exemplified by the HOX gene clusters (Figure 1d). User-defined sets of up to five developmental transcription factors characteristic for a particular organ or tissue displayed very high levels of tissue specificity (Figure 1—figure supplement 2). However, while principal components (PC) analysis (PCA) or clustering grouped biological replicates, relationships between different organs and tissues other than the distinctiveness of brain and liver were not resolved (Figure 1—figure supplements 3–4). Non-negative matrix factorisation (NMF) also allows unbiased clustering of gene expression (Gaujoux and Seoighe, 2010). By setting the parameters such that representative genes were only extracted once, we identified eleven non-overlapping ‘metagenes’ from the complete expression dataset with clear tissue-specific signals for thyroid, liver, RPE, brain, heart and adrenal gland (Figure 1—figure supplement 5; Supplementary file 1D). We hypothesized that these new signals might allow benchmarking to assess the fidelity of in vitro differentiated stem cells, similar to a previous report (Roost et al., 2015). As an exemplar, we chose hepatocyte differentiation for which RNA-seq data are available including positive (primary adult hepatocytes) and negative (human embryonic fibroblasts) control data (Du et al., 2014). Clear enrichment for the stem cell-derived hepatocytes and the primary hepatocytes (but not the fibroblasts) was apparent in metagene 2, the cluster of 39 genes indicative of human embryonic liver. From this starting point, we wanted to move beyond the unique organ-specific signatures to study how patterns of gene expression co-varied across tissues. While relaxing NMF parameters would allow non-exclusive gene selection across metagenes, we also wanted to capture differences in gene expression between organs (e.g. aspects of what is not expressed as a contributing factor to an organ’s identity). Moreover, different embryonic organs are related according to developmental lineage. We reasoned that being able to apply a lineage structure would create natural assemblies of co-regulated genes (Figure 2a). Accordingly, we adapted a method from spatial ecology and phylogenetics (Jombart et al., 2010a, 2010b) to constrain PCA by imposing a hierarchical developmental lineage and termed this approach LgPCA. We also included RNA-seq from undifferentiated human embryonic stem cells (Roadmap Epigenomics Consortium, 2015) to represent pre-gastrulation human biology. Of the total thirty-one principal components (PCs) arising from LgPCA the first fifteen now identified patterns of gene expression across groups of related tissues in addition to unique organ-specific signatures (Figure 2b) while PCs 16–31 sampled heterogeneity within individual organs and tissues (Figure 2—figure supplement 1). In keeping with this transition PCs 1–4 ordered samples reminiscent of very early developmental events: pluripotency (extreme positive loadings in PC1; ‘PC1 high’), early brain formation (extreme negative loadings in PC2; ‘PC2 low’), foregut endoderm (PC4 low) and intermediate mesoderm (PC4 high). PCs 5–15 resolved the individual organs and tissues; for instance low PC5 loadings discriminated liver from the other foregut endoderm derivatives. Heatmaps illustrated the composite or tissue-specific signals emanating from the genes with the most extreme PC loadings which also underlay appropriate developmental gene ontology (GO) terms (Figure 2c–d and Supplementary file 1E–F).
Identifying the master regulators that differentially orchestrate organogenesis across the body has not previously been possible directly in human embryos. We undertook this in two different ways, both based on studying the 1000 genes with the most extreme loadings in PCs 1–15 that identified gene co-expression patterns across tissues and within individual organs (Figure 2b; Supplementary file 1E). We interrogated these gene sets for regulatory networks based on the enrichment of transcription factor motifs (Janky et al., 2014). Numerous well known master regulators were recovered alongside previously unappreciated factors for either broad tissue groups (e.g. foregut endoderm derivatives) or individual organs (Figure 3a). As proof-of-principle, this also included proven regulators of human pluripotency, NANOG, OCT4 and MYC, at an extreme of PC1. Remarkably, in several instances approaching half of the 1000 genes with the most extreme PC loadings imputed co-regulation by a single transcription factor, such as HNF4A in the liver or SRF in the heart. Alongside NR5A1, the data predicted RUNX and BAD as novel regulators of human adrenal and gonadal development (Figure 3a). As a second approach to study gene regulation, we extracted the transcription factors (typically <5%) from amongst the 1000 most extreme genes in PCs 1–15 (Supplementary file 1G). We searched the Mouse Genome Informatics database (MGI) and in 309/594 instances there was a relevant mouse mutation phenotype supporting the notion that the transcription factors identified by LgPCA are key regulators of human organogenesis. At the lowest extreme of PC5 (liver) the twenty-two transcription factors contained all three of those required for reprogramming fibroblasts directly to hepatocytes (Huang et al., 2014). This suggests novel fate programming roles for transcription factors at the extremes of other PCs (including new potential regulators of pluripotency amongst the sixteen factors containing zinc fingers in PC1). In keeping with these regulatory roles, the extreme PC loadings in the LgPCA data also prioritized those transcription factors responsible for major congenital disorders (Supplementary file 1G). Because LgPCA is not limited to individual organs this included a novel ability to predict multisystem abnormalities such as the combined heart and limb defects of Holt-Oram syndrome (OMIM 142900, TBX5, PC13 low) or the palate and limb abnormalities associated with mutations in TP63 (OMIM 603543, PC3 high).
Mutations in genes encoding transcription factors are over-represented causes of congenital disorders, most likely due to their critical function during organogenesis and inadequacy when haploinsufficient. The enrichment of transcription factors with specific disease-associations at the extremes of the LgPCA implicates the co-enriched genes as leading candidates for unanswered clinical syndromes. To test this model we identified some of the earliest chromosomal mapping or patient deletion data for the known disease-associated transcription factors from Supplementary file 1G. 53 disorders were suitable for assessment with an average critical region of 13.7 Mb each containing an average of 111 protein-coding genes (Supplementary file 1H). Strikingly, in 37 instances (73%) LgPCA uniquely selected the correct transcription factor and in 48 instances (91%) narrowed the field down to three or fewer transcription factors. When applied to 13 syndromes (mostly deletion disorders) where the causative gene remains unresolved clear predictions of causality emerge, for instance in cleft palate (DLX5, DLX6, LHX8 and FOXF2) or cerebellar disorders (ZIC1 and ZIC4) (Supplementary file 1H). Frequently, there is an appropriate mutant mouse phenotype such as CASZ1 in cardiac malformations, part of Chr1p36 deletion syndrome, or SOX10 in the 46,XX disorder of sex differentiation (DSD) linked to duplication on Chr22 (Figure 3b and Supplementary file 1H).
Non-coding transcription has emerged as a critical regulator of cell and developmental biology (Goff and Rinn, 2015). A dedicated programme operating during human organogenesis seemed likely as 81 out of the 1571 genes enriched in embryogenesis compared to the fetal datasets were annotated long intergenic non-coding (LINC) transcripts (Supplementary file 1B). To look beyond this we assembled strand-specific transcripts not recognized by current genome annotation [GENCODE 18 (Harrow et al., 2012)] and systematically named them individually according to recommended criteria (Mattick and Rinn, 2015). 6251 unique loci accounted for in excess of 9 Mb of novel polyadenylated transcription from the human genome (Figure 4a and Supplementary file 1I). The vast majority of transcripts fulfilled criteria as lnc RNAs by assessment of coding potential (CPAT score <0.2) (Figure 4b), length over 200 base pairs (bp) and an absence of reads spanning splice junctions to currently annotated genes (Mattick and Rinn, 2015). These lncRNAs were classified as either bidirectional, antisense or overlapping, or by exclusion intergenic, according to orientation and position in relation to the annotated genome (Mattick and Rinn, 2015). Transcripts were most commonly 500–1,500 bp but could extend to over 600 Kb (Figure 4c) and showed high tissue-specificity with the median Tau value (Yanai et al., 2005) of 0.86, much higher than for protein-coding genes (0.63) but consistent with previously annotated non-coding genes (0.89). We investigated the association between this novel human embryonic transcriptome and the annotated genome. Reduced physical distance to expressed annotated genes markedly increased the likelihood of novel transcript co-expression (Figure 4e), although the best correlations were by no means always with the closest gene (Figure 4f–g). The median distance to the closest annotated gene was 7.7 Kb (Figure 4—figure supplement 1) while on average the best correlation was at 188 Kb (random prediction was 476 Kb). Over half (3634) of the lnc transcripts were classified positionally as LINC RNAs. While LINC RNAs can harbour important regulatory function, how to forecast their relationship(s) with the protein-coding genome and prioritize the investigation of thousands of new transcripts is immensely challenging (Goff and Rinn, 2015). As a first step, the multi-tissue nature of our dataset allowed intricate correlative patterns to be deciphered implying putative relationships; for instance over a 2 Mb window and across numerous genes on chromosome 7 between HE-LINC-C7T121 and TBX20, which encodes a developmental cardiac transcription factor mutated in a wide range of congenital heart disease (Figure 4h).
Taken together, this study reports the first comprehensive transcriptomic atlas during human organogenesis to complement parallel initiatives from later development and adulthood (Jaffe et al., 2015; Roadmap Epigenomics Consortium, 2015; Roost et al., 2015). Subjecting transcription from many sites to a method of analysis that incorporated developmental lineage deciphered novel genetic signatures, predicted causality in many human developmental disorders and associated novel non-coding transcription with expression from the surrounding protein-coding genome. At present, the data arise from a relatively narrow window of embryonic development but set the stage for future longitudinal studies for individual organs over time. The tiny amounts and scarcity of human embryonic tissue also necessitated aspects of pooling across different Carnegie stages for some sites but it is striking that this had no impact on ascertaining organ and tissue-specific transcriptomic signatures by LgPCA. The integrated data are expected to be particularly valuable to stem cell researchers examining the fidelity of PSC differentiation in vitro or searching for transcription factors for direct reprogramming of chosen cell lineages. Finally, the discovery of a major new programme of non-coding transcription adds a fresh layer of detail on the spatiotemporal regulation of the human genome.
Materials and methods
Human material
Request a detailed protocolHuman embryonic material was collected under ethical approval, informed consent and according to the Codes of Practice of the Human Tissue Authority and staged by the Carnegie classification as described previously (Jennings et al., 2013). This clinical material was collected on site overseen by our research team with immediate transfer to the laboratory. Individually identified tissues and organs (details in Supplementary file 1A) were immediately dissected. The adrenal gland, whole brain, heart, kidney, liver, entire limb buds, lung, stomach, testis, thyroid and anterior two-thirds of the tongue were readily identifiable as discrete organs and tissues. All visible adherent mesenchyme was removed from organs and tissues under a dissecting microscope. For the adrenal gland, this includes the capsule which allowed separation from the kidney. The ureter, emerging from the renal pelvis, was removed separately from the kidney. For the heart, a window of tissue was removed from the lateral wall of the left ventricle. A segment of the liver was dissected from each embryo that avoided the developing gall bladder. The trachea was removed from the lung at its entry point into the lung parenchyma. The stomach was identified between the gastro-oesophageal and pyloric junctions. The testis was dissected free from the attached mesonephros. While the thyroid was readily visualized as a discrete organ in the neck, it unavoidably contained the developing parathyroids and thus this tissue type was referred to throughout as ‘thyroid/PTH’. The palatal shelves were dissected on either side of the midline. The eye was dissected and the RPE peeled off mechanically from its posterior surface with validation possible under the dissecting microscope because the RPE is very darkly pigmented compared to the other ocular structures. All samples were collected into Trizol (Thermofischer) or Tri reagent (Sigma-Aldrich) for total RNA isolation as individual tissue or organ types followed by treatment with DNaseI (Sigma-Aldrich). Once the quality of each RNA sample had been confirmed, samples were pooled in order to obtain sufficient RNA for each biological replicate (Supplementary file 1A). The pancreas dataset derived from the same tissue collection was re-used from a previous study (Cebola et al., 2015).
RNA-seq and transcriptome
Request a detailed protocolQuality and integrity of total RNA samples were assessed using a 2100 Bioanalyzer or a 2200 TapeStation (Agilent Technologies) according to the manufacturer’s instructions. RNA sequencing (RNA-seq) libraries were generated using the TruSeq Stranded mRNA assay (Illumina, Inc.) according to the manufacturer’s protocol. Briefly, total RNA (0.1–4 µg) was used as input material from which polyadenylated mRNA was purified using poly-T, oligo-attached, magnetic beads. The mRNA was then fragmented using divalent cations under elevated temperature and then reverse transcribed into first strand cDNA using random primers. Second strand cDNA was then synthesized using DNA Polymerase I and RNase H. Following a single 'A' base addition, adapters were ligated to the cDNA fragments, and the products purified and enriched by PCR to create the final cDNA library. Adapter indices were used to multiplex libraries, which were pooled prior to cluster generation using a cBot instrument. The loaded flow-cell was then sequenced (paired-end; 101 + 101 cycles, plus indices) on an Illumina HiSeq2000 or HiSeq2500 instrument. Demultiplexing of the output data (allowing one mismatch) and BCL-to-Fastq conversion was performed with CASAVA 1.8.3. The RNA-seq was conducted in three batches at different times as a necessity of how human embryonic tissue was collected over time (Supplementary file 1A). Where organs were sequenced across batches (palate, RPE, kidney, testis, adrenal gland, heart / left ventricle and liver) biological replicates clustered together (Figure 1—figure supplement 4).
RNA-seq reads from the Illumina platform were mapped to the human genome (hg19) strand-specifically using TopHat 2.0.9 (Trapnell et al., 2012) and the GENCODE 18 gene annotation set (Harrow et al., 2012). We also remapped the published pancreas RNA-seq dataset (Cebola et al., 2015) obtained from material isolated previously in our laboratory. Additionally, a dataset of hepatocyte differentiation RNA-seq (Du et al., 2014 GEO: GSE54066) was downloaded, re-mapped and quantified as per our own data. Commonly applied RNA-seq normalisation methods such as TMM assume a small proportion of differentially expressed genes in any one dataset (Dillies et al., 2013). Because the highly distinct tissues surveyed here differed strongly on the scale of thousands of genes (for instance liver versus brain) we used quantile normalisation which gave a lower median coefficient of variation than either no or TMM normalization. Read counts from the different datasets were quantile normalized using the R package preprocessCore (Bolstad, 2007). Tissue-specificity was scored per gene using Tau (Yanai et al., 2005) on normalized read counts across all samples. Initial genome-wide relationships were assessed using PCA (Figure 1—figure supplement 3) and hierarchical clustering (heatmap, Figure 1—figure supplement 4).
To compare our samples with RNA-seq from the NIH Roadmap project (Roadmap Epigenomics Consortium, 2015) uniquely mapped strand-specific RNA-seq reads were counted into a set of non-redundant exon annotations (custom made from GENCODE 18 annotations) using bedtools intersect (Quinlan and Hall, 2010). Exon level counts were then summed into a single total per gene per sample. Counts were quantile normalized across samples. NIH roadmap samples (Roadmap Epigenomics Consortium, 2015) used in this study are listed in Supplementary file 1J. For the analysis of human embryonic RNA-seq with comparable Roadmap fetal data (adrenal gland, heart, kidney, lung, limbs, stomach and testis) a single pairwise differential expression test was undertaken using the R package edgeR (Robinson et al., 2010) and an FDR < 0.05.
NMF
Request a detailed protocolNon-negative matrix factorisation (NMF) searches complex expression data, comprising thousands of genes, for a small number of characteristic ‘metagenes’ (Gaujoux and Seoighe, 2010). NMF was performed using the nmf R package (version NMF_0.20.5) (Gaujoux and Seoighe, 2010) to extract tissue-specific metagenes. Non-normalised read counts were filtered to remove all Y-linked genes, the X-inactivation gene XIST and genes with fewer than 100 reads across all samples. Initially 50 runs each of ranks 11–18 and using the default ‘Brunet’ algorithm (Brunet et al., 2004) were performed to find an optimal factorisation ‘rank’ (r). The maximal cophenetic distance was used to select the value of r. Subsequently, 200 runs using the optimal rank were performed to assess consistency of sample groupings between runs. Non-overlapping (i.e. tissue-specific) gene sets were extracted from each metagene by filtering on basis contribution >0.8.
LgPCA
Request a detailed protocolThe LgPCA approach was adapted from established phylogenetic PCA methodology (Jombart et al., 2010b) and performed using quantile-normalized, gene-level read counts, a high memory (512 Gb) compute node and the ppca function from the adephylo R package (Jombart et al., 2010a). A broad user-defined guide tree (Figure 2b) based on well-established knowledge of mammalian gastrulation and downstream lineage relationships was imposed on the different organ and tissue types following which the adephylo R package weighted the principal components by the lineage auto-correlation between samples; increased if related samples were similar and lessened if related samples were more different. As in the description from Jombart and colleagues the resulting components represented ‘global’ structures (where similarity is high between related samples) and ‘local’ structures (where related samples are dissimilar) (Jombart et al., 2010b). We used the LgPCA to extract all the global patterns from the data (PCs 1–15). These patterns were not apparent if lineage relationships were not included nor were they altered if any one tissue, such as palate, was altered within the broad lineage structure (data not shown). The global patterns in PCs 1–15 infer co-regulatory patterns of gene expression across human organogenesis. The ‘local’ patterns thereafter captured heterogeneity between tissue replicates (Figure 2—figure supplement 1) (while PC7 separated the two PSC populations these RNA-seq datasets represent separate cell lines from NIH Roadmap). We used the Abouheif distance as implemented in adephylo (Jombart et al., 2010a), which takes into account the topology of the specified tree but does not use branch lengths.
Gene set enrichment
Request a detailed protocolFor the comparison of the embryonic versus fetal datasets Gene Ontology term enrichment was performed on upregulated genes (FDR < 0.05) using Fisher’s exact test with the elimination algorithm of the R package topGO (Alexa and Rahnenfuhrer, 2010). For the LgPCA, annotated ontology nodes (>10 genes) were tested for each loadings vector for each PC against background using the Wilcoxon test. Tests were performed sequentially moving up the separate GO ontologies (Biological Process (BP), Molecular Function (MF) and Cellular Component (CC)), excluding significant scoring genes from later tests (the topGO ‘elim’ method).
iRegulon analysis of regulation in the extremes of the LgPCA
Request a detailed protocoliRegulon is a computational method which tests for enrichment amongst precomputed motif datasets to decipher transcriptional regulatory networks in a set of co-expressed genes. The 1000 genes with the most extreme loadings at either end of each PC (‘high’ and ‘low’) from the LgPCA were loaded into Cytoscape (version 3.2.1) (Shannon et al., 2003) and used as queries to the iRegulon plug-in (version 1.3, build 1024) (Janky et al., 2014). 20 Kb was examined centred on the transcriptional start site (TSS) under default settings.
Novel transcripts
Request a detailed protocolSample-specific transcriptomes were assembled with Cufflinks (version 2.2.0) (Trapnell et al., 2010). Transcriptomes were combined (‘cuffmerge’; -min-isoform-fraction = 0.1) and compared with the original GENCODE 18 reference (‘cuffcompare’). We filtered out known transcripts using the ‘Transfrag class codes’ (http://cole-trapnell-lab.github.io/cufflinks/cuffcompare/#transfrag-class-codes) to retain only wholly intronic (‘i’, of which there were none), unknown (‘u’), antisense (x) and overlapping (‘o’) transcripts. We discarded all other classes including pre-mRNA (class ‘e’), novel-isoforms spliced to known exons (class ‘j’), and 3’ run-ons within 2 kb of the end of the transcript annotation (class ‘p’). In addition, some remaining non-spliced transcripts may theoretically represent first or last exon (UTR) extensions; to delimit these, we calculated the distance on the same strand to the closest downstream transcription start site (to consider potential 5’ UTR extension) and upstream transcription termination site (to consider potential 3’ UTR extension). Names were assigned to these novel transcripts following suggested criteria (Mattick and Rinn, 2015) (Figure 4a) with the sole adaptation that bidirectional (BI) transcripts were defined as having their TSS within 1 Kb of the TSS of the associated annotated gene. No transcripts mapped to the same strand within the introns of any annotated gene excluding the possibility of unspliced transcripts from annotated genes being erroneously defined as novel transcripts. All transcript sequences (annotated and unannotated) were scored for protein-coding potential using CPAT (based on human training data included with CPAT) (Wang et al., 2013). A threshold of >0.2 was used to define ‘Transcripts of Uncertain Coding Potential’ (TUCP). Where there were multiple transcripts from a single locus, the longest transcript was retained in assembling the final dataset of 6251 novel transcripts. Transcript level read counts for the embryonic samples and NIH Roadmap samples (Supplementary file 1J) were generated for the merged transcriptome using bedtools multicov (vers 2.21.0) (Quinlan and Hall, 2010). The correlations between each of the 6251 transcripts and all annotated genes within 1 Mb were calculated using only the human embryonic data from this study.
Data availability
Request a detailed protocolMapping coordinates against multiple genome versions using a range of common pipelines and summary count data are hosted at www.manchester.ac.uk/human-developmental-biology. To view data in the UCSC genome browser, a trackhub is available: http://www.humandevelopmentalbiology.manchester.ac.uk/data/hub_manchester_hdb/hub.txt.
Data availability
-
An integrative transcriptomic atlas of organogenesis in human embryosPublicly available at EBI ArrayExpress (accession no: E-MTAB-3928).
-
TEAD and YAP regulate the enhancer network of human embryonic pancreatic progenitorsPublicly available at EBI ArrayExpress (accession no: E-MTAB-3061).
References
-
TEAD and YAP regulate the enhancer network of human embryonic pancreatic progenitorsNature Cell Biology 17:615–626.https://doi.org/10.1038/ncb3160
-
A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysisBriefings in Bioinformatics 14:671–683.https://doi.org/10.1093/bib/bbs046
-
Transcriptome analysis of early organogenesis in human embryosDevelopmental Cell 19:174–184.https://doi.org/10.1016/j.devcel.2010.06.014
-
A flexible R package for nonnegative matrix factorizationBMC Bioinformatics 11:367.https://doi.org/10.1186/1471-2105-11-367
-
GENCODE: the reference human genome annotation for The ENCODE projectGenome Research 22:1760–1774.https://doi.org/10.1101/gr.135350.111
-
iRegulon: from a gene list to a gene regulatory network using large motif and track collectionsPLoS Computational Biology 10:e1003731.https://doi.org/10.1371/journal.pcbi.1003731
-
Putting phylogeny into the analysis of biological traits: a methodological approachJournal of Theoretical Biology 264:693–701.https://doi.org/10.1016/j.jtbi.2010.03.038
-
Discovery and annotation of long noncoding RNAsNature Structural & Molecular Biology 22:5–7.https://doi.org/10.1038/nsmb.2942
Article and author information
Author details
Funding
Wellcome Trust (WT088566MA)
- Dave T Gerrard
- Andrew A Berry
- Neil A Hanley
British Council (MR/J003352/1)
- Karen Piper Hanley
Medical Research Council (MR/L009986/1)
- Nicoletta Bobola
- Neil A Hanley
Wellcome Trust (WT097820MF)
- Neil A Hanley
British Council (14BX15NHBG)
- Neil A Hanley
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We are very grateful to all women who consented to take part in our research programme and for the assistance of research nurses and clinical colleagues at Central Manchester University Hospitals NHS Foundation Trust. We thank Ian Donaldson, Peter Briggs and Andy Hayes of the Bioinformatics and Genomic Technologies Core Facilities at the University of Manchester for assistance with RNA-sequencing. REJ is a UK Medical Research Council (MRC) clinical research training fellow. NAH is a Wellcome Trust senior fellow in clinical science (WT088566MA). This project received support from the Wellcome Trust (WT097820) with additional support from MRC project grants MR/L009986/1 to NB and NAH, the British Council (14BX15NHBG) to NAH and MR/J003352/1 to KPH.
Ethics
Human subjects: Human embryonic material was collected under ethical approval, informed consent and according to the Codes of Practice of the Human Tissue Authority (protocols number 13/NW/0205).
Copyright
© 2016, Gerrard et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
-
- 8,517
- views
-
- 1,292
- downloads
-
- 62
- citations
Views, downloads and citations are aggregated across all versions of this paper published by eLife.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading
-
- Developmental Biology
Premature infants with bronchopulmonary dysplasia (BPD) have impaired alveolar gas exchange due to alveolar simplification and dysmorphic pulmonary vasculature. Advances in clinical care have improved survival for infants with BPD, but the overall incidence of BPD remains unchanged because we lack specific therapies to prevent this disease. Recent work has suggested a role for increased transforming growth factor-beta (TGFβ) signaling and myofibroblast populations in BPD pathogenesis, but the functional significance of each remains unclear. Here, we utilize multiple murine models of alveolar simplification and comparative single-cell RNA sequencing to identify shared mechanisms that could contribute to BPD pathogenesis. Single-cell RNA sequencing reveals a profound loss of myofibroblasts in two models of BPD and identifies gene expression signatures of increased TGFβ signaling, cell cycle arrest, and impaired proliferation in myofibroblasts. Using pharmacologic and genetic approaches, we find no evidence that increased TGFβ signaling in the lung mesenchyme contributes to alveolar simplification. In contrast, this is likely a failed compensatory response, since none of our approaches to inhibit TGFβ signaling protect mice from alveolar simplification due to hyperoxia while several make simplification worse. In contrast, we find that impaired myofibroblast proliferation is a central feature in several murine models of BPD, and we show that inhibiting myofibroblast proliferation is sufficient to cause pathologic alveolar simplification. Our results underscore the importance of impaired myofibroblast proliferation as a central feature of alveolar simplification and suggest that efforts to reverse this process could have therapeutic value in BPD.
-
- Developmental Biology
Cells called alveolar myofibroblasts, which have a central role in the development of the lung after birth, receive an orchestrated input from a range of different signaling pathways.