1. Chromosomes and Gene Expression
Download icon

Chromatin-associated RNA sequencing (ChAR-seq) maps genome-wide RNA-to-DNA contacts

Tools and Resources
  • Cited 2
  • Views 4,587
  • Annotations
Cite as: eLife 2018;7:e27024 doi: 10.7554/eLife.27024

Abstract

RNA is a critical component of chromatin in eukaryotes, both as a product of transcription, and as an essential constituent of ribonucleoprotein complexes that regulate both local and global chromatin states. Here, we present a proximity ligation and sequencing method called Chromatin-Associated RNA sequencing (ChAR-seq) that maps all RNA-to-DNA contacts across the genome. Using Drosophila cells, we show that ChAR-seq provides unbiased, de novo identification of targets of chromatin-bound RNAs including nascent transcripts, chromosome-specific dosage compensation ncRNAs, and genome-wide trans-associated RNAs involved in co-transcriptional RNA processing.

https://doi.org/10.7554/eLife.27024.001

Introduction

Much of the eukaryotic genome is transcribed into non-coding RNA (ncRNA), and several studies have established that a subset of these ncRNAs form ribonucleoprotein complexes that bind and regulate chromatin (Guttman and Rinn, 2012; Meller et al., 2015; Cech and Steitz, 2014). Some of the most well studied ncRNAs are those involved in dosage compensation, which include roX1 and roX2 in Drosophila and Xist in mammals. In Drosophila, roX1 and roX2 are part of the male-specific lethal (MSL) complex that coats the single male X chromosome to acetylate histone H4K16 and increase transcription (Conrad and Akhtar, 2012). In female mammals, Xist is expressed from a single X-chromosomal locus and coats the X chromosome from which it is expressed in order to silence transcription (Augui et al., 2011). Other ncRNAs, such as HOTAIR (Rinn et al., 2007; Chu et al., 2011), HOTTIP (Wang et al., 2011), and enhancer RNAs (Sigova et al., 2015), have been shown to regulate expression of specific genes by localizing to chromatin and recruiting activating or repressing proteins. Finally, repetitive ncRNA transcripts have roles at chromosomal loci essential in maintaining genomic integrity over many cell divisions, including TERRA at telomeres (Bunting et al., 2010) and alpha-satellites near centromeres (Hall et al., 2012). Despite these well-studied examples, the genomic targets of most chromatin-associated ncRNAs are unknown, and the mechanisms by which these ncRNAs regulate the epigenetic and spatial organization of chromatin remain largely unexplored.

Genomic methods for studying the localization of specific RNA transcripts include ChIRP (Chu et al., 2011), CHART (Engreitz et al., 2013), and RAP (Simon et al., 2011). These techniques use hybridization of complementary oligonucleotides to pull down a single target RNA and then next generation sequencing or mass spectrometry to identify its DNA- or protein-binding partners (Simon et al., 2011). However, de novo discovery of chromatin-associated RNAs remains limited to computational predictions (Guttman and Rinn, 2012) or association with previously known factors (Khalil et al., 2009). Nuclear fractionation allows isolation of bulk chromatin and subsequent identification of chromatin bound RNAs via sequencing, but does not provide sequence-resolved maps of RNA binding locations along the genome (Werner and Ruthenburg, 2015). To overcome these limitations, we have developed ChAR-seq, a proximity ligation and sequencing method (Figure 1A) that both identifies chromatin-associated RNAs and maps them to genomic loci (Figure 1B).

Figure 1 with 10 supplements see all
ChAR-seq uses proximity ligation of chromatin-associated RNA and deep sequencing to map RNA-DNA contacts in situ.

(A) Overview of the ChAR-seq method wherein RNA-DNA contacts are preserved by crosslinking, followed by in situ ligation of the 3’ end of RNAs to the adenylated 5’ end of the ssDNA tail of an oligonucleotide ‘bridge’ containing a biotin modification and a DpnII-complementary overhang on the opposite end. After extending the bridge by reverse transcription to generate a strand of cDNA complementary to the RNA, the genomic DNA is then digested with DpnII and then re-ligated, capturing proximally-associated bridge molecules and RNA. The chimeric molecules are reverse-transcribed, purified and sequenced. (B) Chimeric molecules are sequenced and the RNA and DNA ends are distinguished owing to the polarity of the bridge, which preferentially ligates to RNA via the 5'-adenylated tail and to DNA via the DpnII overhang. The RNA and DNA reads are then computationally recombined to produce contact maps for each annotated RNA in the genome. (C) Representative examples of genome-wide RNA coverage plots generated for Total RNA (black), mRNA (red), Hsromega (green), chinmo (green), ten-m (green), snRNA:U2 (cyan), snRNA:7SK (cyan), rox1 (blue) and roX2 (purple). Arrows show the transcription start site for each gene. In chromosome cartoons throughout the paper, light gray represents the primary chromosome scaffolds, darker gray regions are heterochromatic scaffolds, and black circles are centromeres. (D) Zoomed in region for an 850 kilobase region of chromosome 3L (chr3L). ChAR-seq tracks for Total RNA, ten-m, snRNA:U2, and snRNA:7SK are shown in comparison with PRO-seq tracks (Drosophila S2 [Kwak et al., 2013]) and ATAC-seq (this study, CME-W1-cl8+). (E) ChAR-seq contact matrix (RNA-to-RNA, top) plotted and aligned with same 850 kb region as panel D. ChAR-seq was performed without bridge addition (Hi-C/Mock-ChAR), resulting in DNA-DNA proximity ligation as in Hi-C (‘Hi-C, DNA-to-DNA’, bottom).

https://doi.org/10.7554/eLife.27024.002

Results

We developed and performed ChAR-seq using Drosophila melanogaster CME-W1-cl8+ cells, a male wing disc derived cell line with a normal karyotype and well-characterized epigenome and transcriptome (Cherbas et al., 2011; Roy et al., 2010). ChAR-seq is a chromosome conformation capture method that maps genome-wide RNA-to-DNA contacts in crosslinked nuclei (Dekker et al., 2002; Rao et al., 2014). Briefly, cells are cross-linked with formaldehyde and permeabilized, then RNA is partially fragmented and soluble RNA is washed away. The chromatin-cross-linked RNA is then ligated to an oligonucleotide duplex 'bridge' molecule and reverse transcribed. Genomic DNA is then digested and ligated onto the other end of the oligonucleotide 'bridge', creating a link between chromatin-associated RNA and proximal DNA. The ligated RNA is fully converted to cDNA during second strand synthesis. Finally, the DNA is sonicated and the chimeric molecules are purified, processed, and sequenced.

To enable the capture and analysis of RNA-to-DNA contacts, the oligonucleotide bridge (see Figure 1—figure supplement 1) was designed to have several key features: (1) the 5'-adenylated end (5'App) of the bridge ensures that it is the only 5’ end competent for ligation to the 3'-ends of ssRNA by truncated T4 Rnl2tr R55K K227Q mutant ligase (Viollet et al., 2011), which cannot adenylate 5' ends (Figure 1—figure supplement 2), (2) the sequence of the bridge does not exist in the yeast, fly, mouse or human genomes and encodes a defined polarity, (3) the end of the bridge contains a restriction site for specific ligation to digested genomic DNA, and (4) the bridge is biotinylated so that it can be captured and enriched. After the bridge is ligated to RNA in situ, the molecules are stabilized by reverse transcription using Bst3.0 polymerase, which can traverse the DNA-RNA junction. The genomic DNA is then digested using the restriction enzyme DpnII, which has a median spacing of ~200 bp between sites in the fly genome. The digested genomic DNA is then ligated to the bridge adaptor using T4 DNA ligase. Second strand synthesis is completed using limiting RNase H and DNA Polymerase I. Crosslinks are then reversed followed by DNA precipitation, sonication and purification using streptavidin beads. Libraries are constructed by repairing and dA-tailing the DNA fragments, ligating TruSeq adaptors to the ends and PCR amplifying (Figure 1A).

Upon conversion of RNA-DNA contacts to a covalent chimera, the chimeric molecules were sequenced using 152 bp single-end reads. Sequencing across the bridge junction ensures identification of the RNA and DNA portions of the chimeric molecule by reading the polarity of the bridge (Figure 1B). The RNA/cDNA (Figure 1B, red) and the genomic DNA side (Figure 1B, black) of each read were computationally split and aligned to the transcriptome and genome. After post-processing for unique alignments, repeat removal, and removal of blacklisted regions, each RNA molecule was mapped to the genomic location to which it was ligated (see Materials and methods and Figure 1—figure supplement 3), resulting in 22.2 million high-confidence unique mapping events for ~16,800 RNA transcripts. All individual RNA-to-DNA contacts for a given transcript were then combined to produce a genome-wide association map for each individual transcript (Figure 1C). To ensure that ChAR-seq signal was not due to spurious bridge-to-DNA ligation, we performed a control experiment in which we added RNase A and RNase H to lysed cells before the RNA-to-bridge ligation. This RNase-treatment reduced the number of unique bridge molecules identified by six-fold, demonstrating that the vast majority of bridge ligation events are indeed RNA-dependent (Figure 1—figure supplement 4). The ChAR-seq protocol is highly reproducible, with the number normalized RNA-to-DNA contacts observed for each RNA showing high concordance between replicates (Figure 1—figure supplement 5). To estimate the specificity of our observed RNA-to-DNA ligation events and to ensure that the contacts that we observed were not due to diffusion of RNA after fragmentation, we performed a spike-in experiment wherein we added increasing amounts of exogenous RNA to the cells after lysis but before bridge ligation. After lysis and RNA fragmentation, we recovered and quantified the total soluble RNA from the supernatant, then spiked-in purified in vitro transcribed RNA fragments (~200 nt) from commonly used protein expression vectors (MBP, HALO and GFP). The spike-in controls were added at 0.1%, 1% and 10% of the total, recoverable RNA by mass. Though we see a clear concentration-dependent increase in the number of false positives, even in the scenario where we spiked in RNA at 10% of the total soluble RNA we observed fewer than 0.5% false positive RNA-to-DNA contacts (Figure 1—figure supplement 6), which compares favorably to the false positive rates in related RNA-DNA mapping methods (Li et al., 2017; Sridhar et al., 2017).

Only the 3'-hydroxyl of each RNA is available for ligation to the bridge, thus the polarity of each RNA molecule with respect to its transcriptional direction can be determined by its orientation with respect to the bridge. The majority (85% of total) of the RNAs captured in our assay were sense, with the largest single subtype represented by sense-stranded mRNA (32% of total), owing to the capture of nascent transcripts (Figure 1—figure supplement 7). Most of the chromatin-associated antisense transcripts that we identified arose from ncRNA or intronic regions. In fact, 96% of the antisense mRNAs were intronic in origin with 64% of these originating from a single 119 kb gene (CG42339), suggesting the presence of unannotated ncRNAs in this region. The remaining chromatin-associated RNA detected in our assay arose from non-protein coding transcripts (see Figure 1—figure supplement 7), of which 18% were small nucleolar RNA (snoRNA) and 19% were small nuclear RNA (snRNA).

ChAR-seq generated RNA-to-DNA contacts can be aggregated (Figure 1C, see e.g., Total RNA), grouped by RNA class (Figure 1C, see e.g., mRNA) or viewed individually (Figure 1C). Individual RNAs mapped by ChAR-seq generally fell into one of three classes. In the first class, RNAs were found around the locus from which they are transcribed (Figure 1C, see, e.g., Hsromega, chinmo, ten-m). In the second class, RNAs were found bound to chromatin in trans, generally distributed across most or all of the genome, often in addition to a peak around the gene body from which the RNA is transcribed (Figure 1C, see e.g., snRNA:U2, snRNA:7SK). In a third class, RNAs that are part of the dosage compensation complex (Figure 1C, see roX1 and roX2) were enriched on and coat the X chromosome. To investigate this first class of RNAs, we compared aggregated RNA-to-DNA contacts with data from nascent transcription sequencing using PRO-seq (Kwak et al., 2013), and observed qualitative agreement between PRO-seq and ChAR-seq data sets (Figure 1D, see PRO-seq and mRNA). Nevertheless, many RNA-to-DNA contacts in our dataset are associated in trans to genomic regions outside of the gene body from which the RNA is transcribed (Figure 1—figure supplement 8).

To determine if the ChAR-seq protocol disrupts genome organization, we omitted the bridge to produce a mock-treated sample. We next biotinylated the DpnII-digested ends in our mock-sample and ligated them, essentially preparing a Hi-C library (‘Hi-C/Mock-ChAR’) (see Materials and methods for details) from the mock-treated sample (Figure 1E, bottom). Topologically associated domains (TADs) are preserved in the mock-treated sample and the DNA-DNA contacts show high correlation with previously published CME-W1-cl8+ cell Hi-C (Ramírez et al., 2015) (Figure 1—figure supplements 910). Thus, the three-dimensional genome organization is largely preserved in our protocol. Furthermore, the profile of RNA-to-DNA contacts detected by ChAR-seq was distinctly different from that of DNA-to-DNA contacts in our mock-treated Hi-C library (Figure 1E and Figure 1—figure supplement 9), indicating that ChAR-seq signal is not simply a byproduct of DNA-DNA contacts.

ChAR-seq data can also be visualized in a two-dimensional contact plot, where the genomic locus from which the RNA is transcribed is represented on the y-axis in linear genome coordinates, and the x-axis defines the genomic location where each RNA was bound. These plots provide a useful overview visualization for of the entire dataset. When we generated these contact plots for ncRNA (Figure 2A), mRNA (Figure 2B) and snRNA (Figure 2C), we observed strong horizontal lines that represent RNA transcripts that are transcribed from a single locus but are found distributed throughout the genome (class II), or in the special case of roX1 and roX2, specifically along the X chromosome (class III). Furthermore, RNAs found at sites from which they are transcribed clustered tightly along the diagonal, a feature most pronounced for mRNAs (class I) (Figure 2B). Many of the RNAs we found distributed broadly across the genome are transcription associated small nuclear RNAs (snRNAs) (Figure 2C). One of these, snRNA:7SK, is an abundant snRNA that functions as a scaffold for a transcriptional regulatory ribonucleoprotein complex that includes p-TEFb, Hexim and LARP7. Other broadly distributed snRNAs are components of the spliceosome (e.g., snRNA:U2) which largely functions co-transcriptionally (Perales and Bentley, 2009).

Figure 2 with 2 supplements see all
ChAR-seq is an ‘all to all’ RNA-to-DNA proximity ligation method.

(A) Genome-wide plot of RNA to DNA contacts for non-coding RNAs. The y-axis represents the region of the genome from which a given RNA was transcribed and the x-axis represents the region of the genome where each RNA was found to be associated through proximity ligation (i.e., the binding site for each RNA). Genome-wide contact plots generated in the same way for (B) mRNA, and (C) snRNA. (D) Cumulative frequency of length-normalized contacts for 16,812 RNAs identified on the ‘RNA-side’ of chimeric reads. The majority (88%) of RNAs have fewer than 10 contacts per kilobase per million reads (CPKM) in our dataset and were not further analyzed owing to low coverage. The remaining 1952 RNAs account for 18.5 million (83%) of the total RNA-to-DNA contacts. (E) Scatter plot of length normalized chromatin-contacts versus total expression for each RNA. The 138 RNAs that had more than 100 CPKM and were enriched more than ten-fold are highlighted in red.

https://doi.org/10.7554/eLife.27024.013

To identify RNAs that are highly enriched for chromatin interactions, we plotted the normalized cumulative distribution of the number of sense contacts observed for each gene (Figure 2D). The majority of the RNAs in our dataset (14,860 out of 16,812, 88%) had fewer than 10 contacts per kilobase million reads (CPKM) (Figure 2D) and were excluded from further analysis. The remaining 1952 RNAs (12%) accounted for 83% (18.5 million) of all chromatin contacts in our data set. To estimate the contribution of total RNA abundance to this interaction signal, we performed RNA-seq for the CME-W1-cl8+ cell line and compared RNA expression levels with RNA-to-DNA contacts identified by ChAR-seq (Figure 2E, Supplementary file 1). We observed a correlation between RNA expression level and chromatin-RNA contacts; however, a cluster of RNAs clearly generated more chromatin interactions that would be expected from the overall expression levels (Figure 2E). Using both the length and read normalized contacts (CPKM) and the fold-enrichment over RNA expression as measured by RNA-seq, we identified 138 RNAs that had more than 100 CPKM and were enriched more than ten-fold, though many were enriched by 2–5 orders of magnitude (Figure 2E, red symbols; Figure 2—figure supplement 1). Notably, we observe good concordance between the RNAs identified using this methodology between replicates (Figure 2—figure supplement 2).

We developed ChAR-seq using the male WME-cl8+ line, reasoning that the ncRNAs roX1 and roX2 would serve as an internal positive control. Both roX1 and roX2 are part of the MSL2 complex, which binds across the X-chromosome in male flies to recruit chromatin-modifiers that increase transcriptional output (Figure 3A) (Lucchesi and Kuroda, 2015). Indeed, ChAR-seq data showed roX1 and roX2 to be 7.6-fold (p-value<10−10) and 8.1-fold (p-value<10−10) enriched for interactions on the X chromosome, respectively (Figure 3B,C, Figure 3—figure supplement 1). In contrast, female flies express Sex lethal (Sxl), which binds to msl2 mRNA to prevent its translation, blocking assembly of the MSL2 complex (Lucchesi and Kuroda, 2015). Importantly, roX1 and roX2 require MSL2 for X-chromosome specific localization (Lucchesi and Kuroda, 2015), therefore female cells should lack detectable spreading of these ncRNAs along the X-chromosome. When we performed ChAR-seq in a female Drosophila melanogaster cell line, Kc167, we did not detect any significant roX2 localization on the X chromosome (Figure 3D) but observed excellent agreement in interaction signal from other RNAs across both cell lines (Figure 3—figure supplement 2, Figure 3C male, CME-W1-cl8+ and Figure 3D, female, Kc167, see e.g., snRNA:7SK and Hsromega).

Figure 3 with 2 supplements see all
Mapping roX1 and roX2 of the X chromosome dosage compensation complex.

(A) Illustration of the roX1/roX2 spreading across the solitary X chromosome in male flies (CME-W1-cl8+ cell line). In contrast, the female-derived Kc167 cell line expresses significantly lower levels of the MSL2 complex, which mediates the association of roX1 and roX2, which therefore do not coat either of the two X-chromosomes in females. (B) Circos plot showing roX2 spreading from its site of transcription (red arrow) and binding with high density along the X-chromosome but low density binding throughout the genome. (C) Coverage plots of roX1 (blue), roX2 (purple), snRNA:7SK (cyan) and Hsromega (green) in male CME-W1-cl8+ cells. Tracks are DpnII normalized reads. ChAR-seq data were subsampled to match the read depth of the Kc167 sample. (D) Complementary coverage plots generated from female Kc167 cells. (E) Comparison of ChAR-seq (this work) to an alternative RNA-to-chromatin mapping method called ChIRP-seq (data from reference [Quinn et al., 2014]). Tracks for roX1 (upper, blue) and roX2 (lower, purple) were generated from 32308 and 87453 contacts, respectively, from a ChAR-seq dataset containing a total of 22.2 million contacts. For comparison, the roX1 and roX2 tracks derived from ChIRP-seq each represent greater that 20 million reads. To compare tracks at different read depths, the contact number was autoscaled, with the maximum peak height given a value of 1. (F) Comparison of the signal-to-noise ratio (see methods) between ChAR-seq and Chirp-seq for roX genes. ‘Raw reads’ is the number of roX reads present in each data set analyzed. (G) Correlation coefficients were calculated for roX1 and roX2 coverage tracks generated using ChIRP-seq and ChAR-seq and plotted relative to increasing bin size to estimate the resolution of the ChAR-seq assay.

https://doi.org/10.7554/eLife.27024.016

High-resolution maps of roX1 and roX2 localization have previously been generated using ChIRP-Seq, which hybridizes probes against a known RNA and pulls down the associated chromatin for sequencing (Chu et al., 2011; Quinn et al., 2014). Comparing ChIRP-seq to ChAR-seq for both roX1 and roX2 (Figure 3E), we found that DNA contact locations were in surprisingly good agreement despite the fact that ChAR-seq reads are spread across all RNAs while ChIRP-seq reads map the specific RNA target, resulting in a large disparity in the effective sequencing depth between the methods. In ChIRP-seq, virtually all of the signal is attributable to interactions between chromatin and the target RNA. In contrast, ChAR-seq captures all RNA and DNA contacts, so that any given target RNA will comprise a subset of the total RNA-chromatin contacts in the dataset. In the case of roX1 and roX2, we observed 32,308 and 87,453 contacts, representing 0.1% and 0.36% of the ChAR-seq dataset. In contrast, the ChIRP-seq datasets plotted in Figure 3E represent ~24M and ~21M reads for roX1 and roX2, respectively. This indicates that ChAR-seq can identify RNA peaks along chromatin with high sensitivity for a given RNA.

To more quantitatively compare ChAR-seq to ChIRP-seq, we compared the signal-to-noise ratio (SNR) for roX1 and roX2 binding to DNA in each assay (Figure 3F). We defined signal as the most densely bound regions on chrX, whereas we defined noise as binding events distributed randomly on autosomes (see Materials and methods for details). We found that ChAR-seq roX1 SNR was much higher than that of ChIRP-seq (49.8 vs 3.9), and ChAR-seq roX2 SNR was about two-fold higher than that of ChIRP-seq (45.2 vs 22.0), despite having 2 orders of magnitude fewer reads in ChAR-seq. Altogether, these data suggest that ChAR-seq has excellent sensitivity and sufficient signal-to-noise to characterize accurate chromatin-binding events for individual RNAs.

The resolution with which we can measure the localization of an RNA to a given genomic site constrains our ability to assess its potential modes of action. To measure the accuracy of ChAR-seq measurements of RNA interaction with DNA, we compared ChAR-Seq data to ChIRP-seq data to estimate the base-pair resolution of ChAR-Seq. We expected this resolution to be bounded ––in part––by the local DpnII cut frequency and the number of contacts for any given RNA. We divided the X chromosome into evenly sized bins and calculated correlation coefficients between ChIRP-seq and ChAR-seq datasets at increasing bin sizes for both roX1 and roX2 (Figure 3G). Using this method, we noted a bi-phasic increase of the correlation coefficient, corresponding to a minor plateau around 200 bp and a major plateau at ~25 kbp. The minor plateau is likely due to the DpnII distribution bias in the ChAR-seq tracks, while the major plateau is an estimate of the resolution of our assay, which is on the order of other proximity-ligation sequencing assays like Hi-C (van Berkum et al., 2010).

To test if we could identify the functional roles for our most highly enriched RNAs, we clustered the snRNA class of RNAs based on their genomic contacts. These snRNAs collectively comprised 23% of all the RNA-to-DNA contacts in our dataset (Figure 4A) and are a substantial component of the spliceosome, a multi-megadalton ribonucleoprotein complex that catalyzes pre-mRNA splicing (Zhou et al., 2002; Will and Lührmann, 2011). The composition and conformation of the spliceosome is highly dynamic, though two dominant species exist in eukaryotes: the major spliceosome comprised of U1, U2, U4, U5 and U6 snRNAs, and the minor spliceosome comprised of U4:atac, U6:atac, U5, U11, and U12 (Will and Lührmann, 2011). Many members of this class of snRNAs have highly similar gene duplication variants in the Drosophila genome. We therefore first calculated the base sequence similarity of these variants to one another and aggregated signals that were tightly clustered (Figure 4—figure supplement 1). When we then correlated genome-wide binding signal within this class, we found that the distribution patterns of the major spliceosome snRNAs U1, U2, U4, U5, U6 clustered together along with snRNA:7SK (Figure 4B), which is part of the p-TEFb complex that relieves pausing of RNA Polymerase II at promoters (Kwak and Lis, 2013) and may participate in the release of paused polymerase during RNA splicing (Barboric et al., 2009). The components of the minor spliceosome did not cluster together, likely due to their low abundance (Will and Lührmann, 2011) and consequently low representation in our dataset.

Figure 4 with 2 supplements see all
Correlation of chromatin-associated RNAs with genome features.

(A) Relative abundance of snRNAs identified by ChAR-seq. The size of the circles is proportional to the abundance of the snRNAs found by ChAR-seq. RNA components of the major and minor spliceosome are bounded by the gray boxes. (B) Cluster analysis of the pairwise correlation between genome-wide tracks of snRNAs. (C) Meta-analysis plots aggregating the signal of snRNA:7SK, snRNA:U2, snRNA:U5, roX1, roX2 and ATAC-seq over gene bodies (red), putative enhancers (blue dashed line) and random regions (black). (D) Hierarchical clustering based on pairwise Pearson correlation between representative ChAR-seq RNA-to-DNA contact coverage tracks (black) and modENCODE datasets available for the WME-C1-cl8 + cell line. Notable associations for the dosage compensation complex (green) and heterochromatin (‘het’) are indicated in the right margin.

https://doi.org/10.7554/eLife.27024.019

We next reasoned that spliceosome RNAs––as part of the co-transcriptional RNA processing machinery––should also be enriched in regions of active transcription. We therefore aggregated spliceosomal RNA signals over gene bodies (Figure 4C, red lines), putative enhancers (Kvon et al., 2014) (Figure 4C, blue dashed lines) and a random distribution of genomic bins of similar size (Figure 4C, black lines). We observed an enrichment of snRNAs (7SK, U2 and U6), but not roX1 or roX2, over gene bodies (Figure 4C) with a broad peak around transcription start sites, in good agreement with ChIRP data for 7SK in mice (Flynn et al., 2016). Because active transcription is also correlated with topological boundaries in flies (Hou et al., 2012; Ulianov et al., 2016), we examined the relationship between RNA contacts and genome organization. To test whether RNA-DNA contacts are enriched at topological boundaries, we measured the DNA contact frequency for snRNA:7SK and snRNA:U2 in 20 kb windows spanning TAD boundaries and TAD centers. Both 7SK and U2 RNAs were modestly, but significantly (Wilcoxon rank-sum test) enriched at TAD boundaries (Figure 4—figure supplement 2A (top)). We also aggregated 7SK and U2 contact signals across all TAD boundaries (Figure 4—figure supplement 2A (bottom), red lines) and a random distribution of identically sized genomic bins (Figure 4—figure supplement 2A (bottom), black lines), and found that these RNAs were ~1.4 fold enriched over boundaries. We also noted that roX2 was ~2.2 fold enriched over TAD boundaries on the X chromosome (Figure 4—figure supplement 2B), consistent with previous data showing that MSL2 dosage compensation complex High Affinity Sites (HAS) are preferentially found at TAD boundaries (Ramírez et al., 2015). Examination of chromatin accessibility (ATAC-seq) over TAD boundaries (Figure 4—figure supplement 2B) showed that open chromatin is enriched at TAD boundaries, supporting the idea that TAD boundaries are transcriptionally active in flies.

In contrast to the small number of well-defined and well-characterized snRNAs involved in splicing, there are more than 200 snoRNAs in flies (Huang et al., 2005) that are significantly divergent in sequence and, surprisingly, were highly represented in our dataset (Figure 1—figure supplement 7 and Figure 2—figure supplements 12). Most of these snoRNAs have either unknown function or are computationally identified and indirectly implicated in the maturation and modification of ribosomal rRNA (Huang et al., 2005).

To determine if our enriched chromatin-associated RNAs, in particular snRNAs and snoRNAs, might localize to euchromatic or heterochromatic states or with specific transcription factors, we cross-correlated our ChAR-seq signal against modENCODE datasets available for the CME-W1-cl8+ cell line. To normalize the signals for comparison, we first calculated the expected contacts per 2 kb bin for each RNA under a uniform distribution, based on the total number of genome-wide contacts for each RNA and the number of DpnII sites per bin. This null model was then used to calculate the log2 ratio of the observed to the expected contacts per bin for each RNA, which was then transformed into a z-score ((x-μ)/σ) based on the whole-genome mean (μ) and standard deviation (σ). Similarly, we re-binned the modENCODE tracks, removed bins that did not contain a DpnII site, and transformed the log2 mean-shift values to a z-score. We then calculated the pair-wise Pearson correlation coefficients between each signal track, and then clustered the data (Figure 4D). We observed discrete clustering of roX1 and roX2 with known dosage compensation complex factors, MOF, the histone modifications H4K16ac and H3K36me3 (Bell et al., 2008), and JIL-1 kinase (Bai et al., 2004), validating this analytical approach (Figure 4D). Beyond this sub-cluster of dosage compensation factors, the remainder of the chromatin-associated RNAs fell into two distinct and anti-correlated categories: those associated with active chromatin and transcription (e.g., RNAPII, H4K8ac, H3K18ac) or heterochromatin (e.g., HP2, H3K9me3, HP1a) (Figure 4D). In particular, we note that snRNA:U2 and snRNA:7SK cluster tightly with the transcription-associated chromatin marks, while many of the snoRNAs and minor spliceosome snRNAs that we identified are associated with heterochromatin, likely due to co-localization of heterochromatin factors to the nucleolus. Interestingly, snRNA:U5, a component of both the major and minor spliceosome, has variants that clearly cluster with either transcriptionally active chromatin (63BC) or heterochromatin (23D, 38ABa, and 34A). Previous work has shown that the snRNA:U5:38Aba variant (Figure 4D, heterochromatin cluster) exhibits a unique tissue-specific expression profile with the greatest abundance in neural tissue, which led the authors to propose isoform-dependent functions in alternative splicing (Chen et al., 2005). The differential clustering that we observe for snRNA:U5, and between major and minor spliceosome snRNAs, for euchromatin and heterochromatin might reflect such isoform-specific functions of the spliceosome in different chromatin states.

Discussion

ChAR-seq maps the chromosomal binding sites of all chromatin-associated RNAs, independent of whether they are associated as nascent transcripts or bound as part of ribonucleoprotein complexes (RNPs). In this way, ChAR-seq can be thought of as a massively parallelized de novo RNA mapping assay capable of generating hundreds to thousands of RNA-binding maps. ChAR-seq also detects multiple classes of chromatin-associated RNAs. ChAR-seq preserves the three-dimensional structure of the genome and provides an RNA-DNA interaction matrix that complements DNA-DNA proximity measurements using Hi-C. We validated ChAR-seq using chromosome-specific ncRNAs roX1 and roX2 associated with dosage compensation. The comparison between ChAR-seq and ChIRP-seq, which vary dramatically in the sequencing depth needed to analyze a specific RNA, highlights the utility of ChAR-seq as a de novo chromatin-associated RNA discovery tool. ChAR-seq also maps nascent RNAs found at the loci from which they are transcribed.

ChAR-seq is similar to a recently published method (Sridhar et al., 2017), but has two key distinctions. First, proximity ligations are performed in situ in intact nuclei, which reduces nonspecific interactions (Rao et al., 2014). Second, ChAR-seq uses long single-end reads to sequence across the entire junction of the ‘bridge’, ensuring that RNA-to-DNA contacts are mapped with high confidence and reporting on the polarity of the bridge-ligated RNA.

While this work was in revision, a genome-wide RNA-DNA contact method—GRID-seq—was published that also uses proximity ligation of a directional bridge to RNA and DNA (Li et al., 2017). One key difference is that GRID-seq uses a restriction enzyme to cut cDNA and gDNA fragments 19–23 bp distal to the bridge, which allows size selection of molecules that contain both RNA and DNA. This likely reduces the number of uninformative molecules sequenced. A disadvantage, however, is that the resulting reads are 19–23 bp, which have a greater chance of falsely mapping during genome alignment than the 20–100 bp RNA and DNA fragments obtained with ChAR-seq. Although their details differ, ChAR-seq and GRID-seq appear to have similar ability to detect roX2 binding to chrX in Drosophila (see Figure 3—figure supplement 1B).

We used ChAR-seq to discover and map several dozen ncRNAs that are pervasively bound across the genome. Many of these ncRNAs are components of ribonucleoprotein complexes associated with transcription elongation (snRNA:7SK), splicing (snRNA:U2, etc) and RNA processing (snRNAs, snoRNAs and scaRNAs). Interestingly, more than half of the chromatin-associated RNAs identified based on our enrichment criteria are snoRNAs, most of which––but not all––correlate with heterochromatin. Generally, snoRNA ribonucleoproteins (snoRNPs) use intermolecular base pairing to direct chemical modification of the 2'-hydroxyl groups or the isomerization of uridines to pseudouridine (Cech and Steitz, 2014) and snoRNAs are known abundant components of chromatin in both flies (Schubert et al., 2012) and in mice (Meng et al., 2016). Despite their abundance and the their known role in RNA modification, we do not yet understand the functions of these modifications, or the implication of snoRNAs and scaRNAs chromatin association in cells (Cech and Steitz, 2014). We also demonstrate that ChAR-seq can be used with orthogonal genome-wide datasets to identify and classify RNAs that are associated with specific chromatin states (e.g., euchromatin vs heterochromatin). We expect this approach will be particularly useful in organisms that use lncRNAs such as HOTAIR, HOTTIP and BRAVEHEART as scaffolds for ribonucleoproteins that regulate facultative heterochromatin. Finally, we observed enrichment of transcription-associated RNAs with TAD boundaries, reflecting a potential role for active transcription in the topological organization of the genome. This observation is consistent with previous observations that active transcription is a stronger predictor for TAD partitioning in flies (Hou et al., 2012; Ulianov et al., 2016) than CTCF and cohesin, the prototypical TAD boundary markers in mice and humans (Merkenschlager and Nora, 2016).

We anticipate that ChAR-seq will be a powerful new high throughput discovery platform capable of simultaneously identifying new chromatin-associated RNAs and mapping their chromatin binding sites (and associated epigenetic chromatin states), all of which will be particularly useful in comparing ‘epigenomic’ changes that coincide with cellular differentiation or tumorigenesis.

Materials and methods

Brief description of ChAR-Seq protocol

Drosophila melanogaster CME-W1-cl8+ cells (Drosophila Genome Resource Center, Stock #151) were grown in T-75 flasks at 27°C in Shields and Sang M3 media supplemented with 5 µg/mL insulin, 2% FBS, 2% fly extract and 100 µg/mL Pen-Strep (Cherbas et al., 2011). Approximately 100–400 million cells were harvested for each library by centrifugation at 2000 x g for 2–4 min, resuspended in fresh media plus 1% formaldehyde and fixed for 10 min at room temperature. Fixation was quenched by adding 0.2 M glycine and mixing for 5 min at room temperature. Cells were centrifuged at 2000 x g for 2–4 min, resuspended in 1 mL of PBS, and centrifuged again at 2000 x g for 2 min. The supernatant was aspirated and discarded, and the cell pellet was flash frozen in liquid nitrogen and stored at −80C until needed. Cells were thawed in lysis buffer and the cross-linked nuclei and cellular material were isolated by centrifugation for the in situ ligation protocol. Female Kc167 cells (Drosophila Genome Resource Center, Stock #1) for analysis of sex dependent RNA contact map were processed as for CME-W1-cl8+ cells. Both Kc167 and CME-W1-cl8+ cells were authenticated by the Drosophila Genome Resource Center and tested for mycoplasma contamination by cytoplasmic DNA staining.

Briefly, RNA was lightly and partially chemically fragmented by heating in the presence of magnesium. The pellet was isolated and washed, and RNA ends were ligated using truncated T4 Rnl2tr R55K K227Q ligase (hereafter referred to as trT4KQ RNA ligase) to an oligonucleotide 'bridge' molecule containing a 5'-adenylated ssDNA overhang. The RNA ligase was inactivated, the pellet was washed and the RNA strand was stabilized by first strand synthesis of the RNA through extension of the bridge by Bst 3.0 polymerase. The polymerase was inactivated and the pellet was washed. Genomic DNA was then digested with DpnII, followed by ligation of the DpnII digested genomic DNA to the opposite end of the oligonucleotide 'bridge'. Second strand synthesis was then performed using RNaseH and DNA Polymerase I to complete cDNA synthesis of the RNA-encoded side of the new, chimeric molecules. The sample was then deproteinized and crosslinks were reversed by heating overnight in SDS and proteinase K. DNA was then ethanol precipitated and sheared to ~200 bp fragments using a Covaris focused ultra-sonicator. DNA fragments containing the biotinylated bridge were then purified using magnetic streptavidin-coated beads. DNA ends were repaired using the NEBNext End Repair and dA tailing module, and ligated to NEBNext hairpin adaptors for Illumina sequencing. The adaptor hairpin was cleaved using USER, and DNA fragments were amplified by ~8–12 rounds of PCR with NEBNext Indexing Primers for Illumina (TruSeq compatible). The partially amplified library was then purified using AMPure XP beads to remove adaptor dimers, and the optimum number of additional PCR cycles was determined by qPCR to achieve approximately 30% saturation. Library amplification was then completed by the additional rounds of PCR, and the library was purified and size selected to a target range of 100–500 bps using AMPure XP beads. The size distribution of the library was checked by capillary electrophoresis using an Agilent Bioanalyzer, and quantified using qPCR against a phiX Illumina library standard curve. Libraries were sequenced using the Illumina MiSeq platform for quality control, and subsequently sequenced on the Illumina NextSeq platform (Stanford Functional Genomics Facility) using single-end 152 bp reads. Data were processed and analyzed using a custom pipeline.

Detailed ChAR-seq protocol

Shields and Sang M3 media, for 0.5 L

  • 19.6 g S and S M3 powder (Sigma)

  • 0.25 g bicarb

  • 5 µg/mL insulin

  • 2% FBS

  • 2% fly extract

  • 1x Pen/Strep

Sterile filter media using 0.2 um filter. Store media at 4C. Warm to room temperature before splitting cells. Culture cells at 27C in 75 flasks (1–2 flasks per library), splitting or harvesting approximately every 3–4 days. Cells are detached from the flask surface by vigorously pipetting up and down with a serological pipette. Confirm that >95% of cells are healthy and viable using trypan blue assay at each harvest.

All buffers and solutions prepared with DEPC-treated water. All buffers are made with fresh, unopened chemicals to minimize RNase contamination. All enzyme stocks and reagents are dedicated for RNA work and kept RNase-free. Optional steps are included for the three alternative protocols used for (A) the RNA spike-in experiment, (B) the RNase control, and (C) the Hi-C/Mock-ChAR experiment.

Step 1: Harvesting cells and crosslinking (can be done in advance)

  • Pipette cells up and down to detach them from the surface. The cells grow in small strings or clumps that are very weakly attached to the surface.

  • Count the cell using a hemocytometer. Typically, each T75 flask should yield between 100–150 M cells.

  • Spin (~2000 x g) the cells in either one or two 50 µL conical tubes in a swinging bucket centrifuge for 5 min.

  • Resuspend the cells in ~36 mL of fresh media

  • Add formaldehyde to 1% (1 mL of 37% form)

    • incubate at RT for 10 min mixing end over end

  • Quench by adding 3 mL of 2.5 M glycine (final concentration = 0.2 M), mix for 5 min end over end

  • Spin the cells at 2000 x g for 5 min

  • Resuspend in ~1 mL of sterile filtered, ice cold PBS + DEPC

    • count cells and split into an appropriate number of tubes to aliquot 100 M cells per tube

  • Spin again at 2000 x g for 5 min

  • Remove supernatant and flash freeze in liquid nitrogen

NOTE: we find it useful to remove the supernatant in this and all subsequent steps using a mechanical micropipette rather than an aspirator, which risks loss of sample.

Step 2: Lysis (Day 1, start in early afternoon)

Lysis buffer
  • 10 mM TrisHCl pH 8

  • 10 mM NaCl

  • 0.2% igepal

  • 1 mM DTT

  • 0.2 mM EDTA

  • Mix 750 µL lysis buffer +150 µL Sigma protease inhibitor (P8340, comes as a DMSO stock)+45 µL RNaseOut per sample.

  • Add 400 µL lysis buffer plus inhibitors to STILL FROZEN crosslinked cells and thaw with gentle mixing. Incubate for 2 min on ice after thawing is complete. DO NOT let cells thaw on ice before adding lysis buffer!

  • Centrifuge at 2.5 k x g for 2 min, discard supernatant.

  • Wash pellet with 500 µL of lysis buffer +inhibitors, spin again for 1 min.

  • Gently resuspend the pellet in 100 µL 0.5% SDS and incubate at 62C for 5 min

  • Pre-mix 290 µL water +50 µL 10% Triton-X 100 per sample.

  • Quench SDS with diluted Triton-X 100. Mix well by gentle pipetting; incubate at 37C for 15 min.

Step 3: RNA fragmentation (Day 1)

  • Add 11 µL 10x T4 RNA ligase buffer for final concentration 0.25x

  • Incubate for exactly 4 min, 70’C

  • Snap cool on ice

  • Note: Users may want to optimize fragmentation beforehand if using a different cell line or sample.

Step 3A (Optional additional step A): RNA spike in

  • Estimated RNA release is 65 µg / 250 M cells

  • For 180 M cells, spike in 5.6, 0.56 and 0.056 µg

    • 1.86 µL each of 100%, 10% and 1% RNA stocks of MBP RNA

    • 1.55 µL each of 100%, 10% and 1% RNA stocks of GFP RNA

    • 1.55 µL each of 100%, 10% and 1% RNA stocks of Halo RNA

  • These are designated as ‘high’, ‘med’ and ‘low’

  • Incubate on ice for 5 min

Step 4: Wash cells to remove SDS and non-crosslinked RNA fragments (Day 1)

  • Add 1000 µL DEPC-treated PBS and incubate on ice for 2 min

  • Centrifuge at 2.5 k x g for 2 min, discard supernatant

  • Add 1000 µL DEPC-treated PBS, mix gently

  • Centrifuge at 2.5 k x g for 2 min, discard supernatant

  • Immediately proceed to the next step, with pre-mixed reaction buffer already prepared

Step 4B (Optional additional step B): RNase control

  • Pre-mix and resuspend the pellet in the following

    • 160 µL water

    • 20 µL 10x T4 RNA ligase buffer

    • 10 µL 10 mg/mL RNaseA

    • 10 µL RNaseH (NEB)

  • Incubate at 37C for 4 hr

  • Centrifuge at 2.5 k x g for 2 min, discard supernatant

  • Add 1000 µL DEPC-treated PBS, mix gently

  • Centrifuge at 2.5 k x g for 2 min, discard supernatant

  • Immediately proceed to the next step, with pre-mixed reaction buffer already prepared

Step 5: RNA to linker ligation, RNA first, Day 1

  • Pre-mix the following and then add to cells

    • 20 µL 10X T4 trKQ RNA ligase buffer

    • 100 µL 50% PEG

    • 20 µL RNaseOUT

    • 20 µL 50 uM annealed bridge (2.5 nmols)

    • 10 µL T4 trKQ RNA ligase (4000 units)

  • Incubate at 23C overnight, shaking 900 rpm in ThermoMixer

Step 5 (alternative): Hi-C control (PEG mock incubation, no ligase, no bridge)

  • Pre-mix the following and then add to cells

    • 20 µL 10X T4 trKQ RNA ligase buffer

    • 100 µL 50% PEG

    • 20 µL RNaseOUT

    • 30 µL water

  • Incubate at 23C overnight, shaking 900 rpm in ThermoMixer

End day 1

-----------------------------------------

Start day 2

Step 6: Wash to remove PEG (Day 2, morning)

  • Centrifuge at 2.5 k x g for 4 min at RT, remove and discard supernatant

  • Resuspend pellet with 1 mL DEPC-treated PBS

  • Centrifuge at 2.5 k x g for 5 min at room temperature

  • While spinning, make 200 µL 1x T4 trKQ RNA ligase buffer, including:

    • 0.2 µL 1 M DTT

    • 10 µL RNaseOUT

  • Resuspend pellet in 200 µL 1x T4 trKQ RNA ligase buffer +DTT and RNaseOUT

Step 7: First strand synthesis, Bst 3.0 (Day 2, morning)

  • Add 20 µL 10 mM (each) dNTP mix

  • Add 15 µL Bst 3.0 (120 U/ µL) for 1800 Units

  • Incubate for 10 min at RT

  • Increase temp to 37°C for 10 min (shaking 900 rpm in ThermoMixer)

  • Increase temp again to 50°C for 20 min

  • Add 1000 µL DEPC-treated PBS, mix, and cool on ice 1 min

Step 7 (alternative): Hi-C control (mock, no Bst, no NTPs)

  • Add 35 µL PBS (no enzyme, no dNTPs)

  • Incubate for 10 min at RT

  • Increase temp to 37°C for 10 min (shaking 900 rpm in ThermoMixer)

  • Increase temp again to 50°C for 20 min

  • Add 1000 µL DEPC-treated PBS, mix, and cool on ice 1 min

Step 8: Inactivate and wash (Day 2)

  • Centrifuge at 2.5 k x g for 2 min, discard supernatant

  • Repeat wash with another 1 mL of DEPC-treated PBS

  • Centrifuge at 2.5 k x g for 2 min, discard supernatant

  • Resuspend pellet in 100 µL 0.5% SDS +0.1 mM EDTA

  • Heat inactivate for 10 min at 55C

  • Quench SDS with 290 µL water + 50 µL 10% Triton-X 100 (pre-mixed), per sample.

  • Mix well, incubate at 37°C for 15 min.

  • Add 1000 µL DEPC-treated PBS and incubate on ice for 2 min

  • Centrifuge at 2.5 k x g for 2 min, remove and discard supernatant.

  • Wash nuclear pellet by adding 1 mL DEPC-treated PBS, mix, and spin at 2.5 k x g for 2 min

  • Remove and discard supernatant, and proceed IMMEDIATELY to digestion.

Step 9: Genomic DNA digestions (Day 2, afternoon)

  • Resuspend in 200 µL digest buffer: 20 µL 10X T4 RNA ligase buffer + 169.8 µL DEPC-treated water + 0.2 µL 1 M DTT (1 mM DTT final) + 10 µL RNaseOUT (pre-mixed).

  • Add 15 µL DpnII (750 Units)

  • Incubate at 37°C for at least 3 hr, up to overnight if desired.

  • Terminate digestion by adding 7.5 µL 0.5 M EDTA to a final concentration of ~15 mM

Step 10: Pellet and wash cells

  • Spin the cells at 2500 x g for 2 min

  • Remove and discard the supernatant

  • Re-suspend the X-linked cells with 1 mL DEPC-treated PBS,

  • Spin again at 2500 x g for 2 min

  • Remove and discard the supernatant using a P200

Step 11: Inactivate DpnII

  • Re-suspend the washed cells in 100 µL 0.5% SDS +0.1 mM EDTA

  • Incubate at 55°C for 10 min

  • Quench SDS by adding 290 µL water + 50 µL 10% Triton-X 100

  • Incubate at 37°C for 5 min

  • Add 1 mL of PBS and incubate on ice for 5–10 min

Step 12: Pellet and wash cells three times

  • Spin the cells at 2500 x g for 2 min, discard supernatant

  • Re-suspend the pellet with 1 mL DEPC-treated PBS

  • Spin again at 2500 x g for 2 min, discard supernatant

  • Re-suspend the pellet with 1 mL DEPC-treated PBS

  • Spin again at 2500 x g for 2 min, discard supernatant

Step 12 (alternative): Hi-C control (biotinylated the cut ends)

  • Perform washes as in Step 12 above

  • Pre-mix the following:

    • 64 µL DEPC water

    • 10 µL 10X T4 RNA ligase buffer

    • 10 µL 0.4 mM biotin-dATP

    • 4 µL 1 mM dCTP

    • 4 µL 1 mM dGTP

    • 4 µL 1 mM dTTP

    • 4 µL Klenow (exo minus)

  • Add pre-mixed biotinylation solution to samples

  • Incubate at 23°C for 3–4 hr

  • Add 4 µL 500 mM EDTA

  • Heat inactivate at 60°C for 10 min

  • Spin the cells at 2500 x g for 2 min

  • Remove and discard the supernatant

  • Re-suspend the pellet with 1 mL DEPC-treated PBS

  • Spin again at 2500 x g for 2 min

Step 13: Bridge to genomic DNA ligation (Day 2 overnight to day 3)

  • Resuspend pellet in the following pre-mixed solution:

    • 140 µL DEPC-treated water

    • 20 µL RNaseOUT

    • 20 µL T4 DNA ligase buffer

    • 20 µL T4 DNA ligase (8000 Units)

NOTE : this reaction requires ATP and this is different from the buffer used throughout the earlier steps, if you use T4 RNA ligase buffer, you must supplement the final concentration with 1 mM ATP

  • Incubate at 16C overnight

End day 2

-------------------------------------------

Start day 3

Step 14: Pellet and wash cells, to remove T4 ligase

** Allow Hi-C ligation to proceed during second strand synthesis, so skip this step for Hi-C sample **

  • Add 7.5 µL 0.5 M EDTA (~15 mM final)

  • Spin the cells at 2500 x g for 2 min

  • Remove and discard the supernatant

  • Re-suspend the pellet with 1 mL DEPC-treated PBS

  • Spin the cells at 2500 x g for 2 min

  • Remove and discard the supernatant

  • Re-suspend each pellet in the following pre-mixed solution:

    • 80 µL DEPC-treated water

    • 10 µL 10x T4 RNA ligase buffer

    • 1 µL 1 M DTT

    • 10 µL RNaseOUT

Step 15: Second strand synthesis (skip for Hi-C control)

  • 2x cDNA bufffer

    • 10 mM TrisHCl (8.0)

    • 180 mM KCl

    • 100 mM (NH4)2SO4

  • Add 100 µL 2x cDNA buffer

  • Add 20 µL 10 mM dNTP (each) mix

  • Add 100 Units E. coli DNA Polymerase I (NEB)

  • Add 5 Units RNaseH (NEB)

  • Mix and incubate at 16C for at least two hours

Step 16: Crosslink reversal (end of Day 3)

  • Centrifuge at 2.5 k x g for 5 min

  • Remove and discard supernatant

  • Resuspend pellet in 400 µL DEPC-treated PBS

  • To ~450 µL of sample, add

    • 55 µL 10% SDS

    • 55 µL 5 M NaCl

    • 3 µL 20 mg/mL Proteinase K

    • incubate overnight at 70°C

End day 3

-------------------------------------------

Start day 4

Step 17: Precipitate DNA

  • Cool sample to room temperature

NOTE: Do not cool the sample on ice as the SDS will precipitate

  • Add 50 µL 3 M NaOAc

  • Add 2 µL (5 mg/ml) glycogen

  • Add 850 µL ice cold 100% EtOH.

NOTE: Flocculent should be visible after inverting the tube gently 3–4 times. If no flocculent is visible, add 200 µL of additional ice cold 100% EtOH

  • Incubate on ice for 30 min

  • Centrifuge at ~21,000 x g for 5–10 min

  • Wash pellet with 70% ice cold EtOH

  • Centrifuge at ~21,000 x g for 5 min

  • Gently dry pellet and resuspend in 130 µL TE

Step 19: Covaris shear DNA

  • Shear DNA to ~200 bp fragments using the following settings: 10% duty factor, 175 peak incident power, 200 cycles per burst, for 180 s

  • NOTE: shearing of Hi-C sample should target 300–500 bp fragments… make appropriate adjustments for AMPure steps

Step 19: Isolation of biotinylated DNA fragments

1X tween wash buffer (TWB)

  • 5 mM TrisHCl (7.5)

  • 0.5 mM EDTA

  • 1 M NaCl

  • 0.05% Tween 20

2X bead binding buffer (BBB)

  • 10 mM TrisHCl (7.5)

  • 1 mM EDTA

  • 2 M NaCl

  • Wash 150 µL Dynabeads MyOne SA T1 beads (Life Tech) with 400 µL TWB. Pull down beads with a magnetic tube rack and discard supernatant.

  • Resuspend the beads in 130 µL 2X BBB

  • Mix the bead slurry with the DNA (final volume of 260 µL)

  • Incubate at RT for 15 min with rotation to bind biotinylated DNA

  • Pull down beads with a magnetic tube rack and discard supernatant.

NOTE: Do not allow the beads to dry out

  • Wash the beads by adding 750 µL 1x TWB and mixing

  • Warm the tubes to 50°C for 2 min

  • Pull down beads with a magnetic tube rack and discard supernatant

  • Wash the beads by adding 750 µL 1x TWB and mixing

  • Pull down beads with a magnetic tube rack and discard supernatant

Step 20: End repair, and dA tailing

  • Resuspend the beads in 40 µL TE

  • add 7 µL NEB Next End Prep Buffer

  • add 3 µL NEB Next End Prep Enzyme Mix

  • Incubate at RT for 20 min with agitation to keep beads suspended

  • Increase temp to 65°C for 30 min

  • Cool to room temperature

Step 21: Adapter ligation (on bead)

NOTE: The following order of addition is important

  • add 2.5 µL NEBNext Adaptor FIRST

  • add 1 µL ligation enhancer

  • add 30 µL NEBNext UltraII Ligation Master Mix LAST

  • mix vigorously, incubate at Room Temperature for 15–20 min with agitation to keep beads suspended

  • add 3 µL USER enzyme

  • incubate at 37°C for 15 min

Step 22: Wash beads

  • Pull down beads with a magnetic tube rack and discard supernatant

  • Wash the beads by adding 750 µL 1x TWB and mixing

  • Warm the tubes to 50°C for 2 min

  • Pull down beads with a magnetic tube rack and discard supernatant

  • Wash the beads by adding 750 µL 1x TWB and mixing

  • Pull down beads with a magnetic tube rack and discard supernatant

Step 23: On bead amplification (library PCR 1)

  • For each index, pre-mix the following

    • 30 µL 2x NEB Next High Fidelity master mix

    • 3 µL 10 uM Universal Primer (NEBNext Multiplex Oligos for Illumina)

    • 3 µL 10 uM Indexing Primer (NEBNext Multiplex Oligos for Illumina)

    • 24 µL water

  • Resuspend the beads in 50 µL amplification mix (reserve the extra 10 µL as you will need it in step 25)

  • Run the following PCR cycle

    • 98°C for 30 s

    • 63°C for 30 s

    • 72°C for 40 s

    • 98°C for 3 min

    • then 8 cycles of

  • Stop the PCR, pull down beads in a magnetic rack

Step 24: First AMPure XP size selection to remove adaptor dimers (target size 200 to 500 bp)

NOTE: The NEBNext adaptor is a 65 nt hairpin… adaptor dimers would be 65 bp, because each hairpin is ~32 bp long… however, the indexing primers are also 65 nts, so amplification of adaptor dimer results in a 130 bp fragment that must be removed at this step, and adds 130 bp to the ideal target length of 150 bp, so our molecules of interest are now 280–300 bp

  • Allow AMPure bead slurry to warm to room temperature

  • Add an equal volume of AMPure beads to the PCR supernatant, mix by pipetting and incubate for 5 min at room temperature

  • Pull down beads with magnet, give 1–2 min to clear solution. Remove supernatant.

NOTE: This is, in theory, to be discarded, but save until you’ve validated the completed library prep by qPCR and Bioanalyzer in case needed for troubleshooting

  • Wash the beads with 70% EtOH (e.g., 700 µL), pull down, remove sup, and

  • elute with 30 µL of 10 mM Tris pH 8

Step 25: qPCR to determine how many more cycles to amplify

  • Mix 5 µL of each library eluted from the AMPure beads with 10 µL of the reserved PCR mix from step 23

  • Dilute 100x Syber Green 1:3 in water for a 33x working solution

  • Add 0.5 µL to each 15 µL reaction for ~1x

  • Run on qPCR to determine the number of cycles to amplify… target should be the number of cycles required to reach ~1/3 of the saturation

Step 25: Off bead amplification (library PCR 2)

  • For each index, pre-mix the following

  • 30 µL 2x NEB Next High Fidelity master mix

  • 25 µL eluted library from Step 24

  • 2.5 µL 10 uM Universal Primer (NEBNext Multiplex Oligos for Illumina)

  • 2.5 µL 10 uM Indexing Primer (NEBNext Multiplex Oligos for Illumina)

  • Run the following PCR cycle

    • 98°C for 30 s

    • 63°C for 30 s

    • 72C for 40 s

    • 98C for 3 min

    • then N cycles of

NOTE: the number of additional cycles, N, is ascertained by estimating the number of additional cycles required to reach 1/3 saturation from the qPCR analysis in Step 24. The number of cycles will vary between each sample library.

Step 26: High and low size selection using AMPure XP beads (final polishing)

  • Allow AMPure bead slurry to warm to room temperature

  • Add 0.5 volumes of AMPure bead slurry (30 µL) to PCR mix (60 µL). Mix by pipetting and incubate at RT for 5 min.

  • Pull down beads with magnet, give 1–2 min to clear solution, then recover the supernatant. This should contain fragments from 0 to 500 bp. Larger fragments, if you want them, are bound to the beads. Add 70% EtOH to the beads and set aside.

  • To the supernatant (90 µL), add 0.5 volumes of original volume (30 µL) for a final bead to input ratio of 1:1 (total volume should be 120 µL). Mix and incubate at RT for 5 min.

  • Pull down the beads and remove the supernatant, which should contain small, unwanted fragments (adaptor dimers, etc). Nonetheless, save it for now just in case and set aside.

  • Wash the beads with 700 µL 70% EtOH, pull down the beads in a magnetic tube rack until clear (1–2 min), discard supernatant, and repeat for a total of two washes. Do not disrupt the beads during the second wash.

  • Remove the EtOH completely and let the beads passively dry for ~5 min

  • Add 50 µL of elution buffer (20 mM Tris pH 8). Incubate for 2–3 min. Pull down the beads until clear and remove the supernatant. This should contain fragments from ~100–500 bp.

Step 27: Quality control

  • QC library by Bioanalyzer (1:10 dilution, High Sensitivity)

  • Quantify library by phiX qPCR

Sequencing

ChAR-seq libraries were sequenced with single-end 152 bp reads on the Illumina MiSeq and NextSeq according to manufacturer’s instructions.

Data processing and software

Reads were processed using a custom pipeline, which can be accessed at: https://github.com/straightlab/flypipe (Bell, 2017 copy archived at https://github.com/elifesciences-publications/flypipe). PCR duplicates were removed using Super Deduper (Petersen et al., 2015 using the entire read length. Adapters were removed and reads trimmed for low quality with Trimmomatic (Bolger et al., 2014) using a composite set of Illumina adapters, leading/trailing cut length of 3, sliding window of 4:15, and a minimum length of 36. RNA and DNA portions of each read are identified and split by a custom python script that finds the bridge, uses the bridge polarity to identify which side of the read is RNA or DNA, and verifies that there is only a copy of the bridge. All reads lacking a full bridge sequence or that contained two or more bridges were removed. Only reads that contained sequence on both sides of the bridge (RNA and DNA) were processed further. Unique read IDs from the initial read containing the full molecule remained linked to both the RNA and DNA sequences after splitting from the bridge sequence. Additional software used by FlyPipe or during preparation of figures include: bowtie2 (Langmead and Salzberg, 2012), SAMtools (Li et al., 2009), BEDtools (Quinlan and Hall, 2010), MACS2 (Zhang et al., 2008), Circos (Krzywinski et al., 2009), R, GraphPad Prism v7.0, deepTools2 (Ramírez et al., 2016), HOMER (Heinz et al., 2010), and Genieous (Kearse et al., 2012).

ChAR-seq DNA and RNA alignment

Four replicates were processed and merged for analysis in the following way. Reads were aligned to Drosophila melanogaster genome (dm3, release r5.57 using bowtie2 (Langmead and Salzberg, 2012) with the --very-sensitive option and allowing one mismatch. RNA was aligned separately to all dm3 transcriptomes from FlyBase (downloaded January 2016, versions r5.57): with the same bowtie2 parameters, but also first forcing sense strandedness as bridge orientation should preserve the RNA strand information, and then for reads that did not align in the sense direction, permitting antisense alignment. Reads that aligned with equal alignment score to more than one transcriptome were filtered based on a priority rank according to the following order: tRNA, miscRNA, ncRNA, transcript, three_prime_UTR, five_prime_UTR, exon, intron, miRNA, gene, gene_extended2000. For a given read, this rank preserves the annotation information from the top ranked transcripome. For example, a read aligning to the ‘ncRNA’, ‘transcript’, and ‘exon’ transcriptomes would receive ncRNA annotations, and be considered a ncRNA in further analysis. Reads aligning to tRNA and miscRNA groups were removed from further analysis. Reads lacking valid sense alignments to any transcriptome, but that aligned in the antisense orientation were ranked and filtered in the same manner. We note that for some mRNAs, some highly abundant reads emanating from the mRNA locus may potentially be expressed from small RNAs (e.g., unannotated snoRNA, tRNA, snRNA) overlapping the mRNA locus. Although the CME-W1-cl8+ cell line has the most normal karyotype of the commonly used Drosophila cell lines, the CME-W1-cl8+ reference genome has not been assembled and slight variation may exist in the positions of some reads due to potential genomic differences between CME-W1-cl8+ cells and the sequence fly strain used for the reference genome.

ChAR-seq RNA-DNA contact identification

Previously linked RNA and DNA sequences with sense alignments had their information re-associated using the original read ID. These 1-to-1 links are deemed ‘RNA-DNA contacts’. Any RNA-DNA contact that contained Ribosomal RNAs (rRNA) were removed from further analysis. Additional processing was then performed to remove mapped DNA contacts that overlapped regions of poor mappabilty using the modEncode Drosophila blacklist (ENCODE Project Consortium, 2012) (https://sites.google.com/site/anshulkundaje/projects/blacklists), repetitive regions downloaded from the UCSC Table Browser (https://genome.ucsc.edu/cgi-bin/hgTables) using the following selection parameters: clade: Insect, genome: D. melanogaster, assembly: Apr. 2006 (BDGP R5/dm3), group: Variation and Repeats, track: Repeatmasker. Several RNAs that were re-annotated as rRNA in later Drosophila genome annotation versions were also removed.

RNA coverage tracks

Individual RNA tracks were generated by extracting contact information from the BED-formatted table generated by FlyPipe (described above). Coverage was calculated using BEDtools for 200 bp windows, and the number of contacts was then normalized by dividing by the number of DpnII sites for each window throughout the genome. Coverage tracks for the metagene profile analysis in Figure 4C were similarly normalized, but were extracted and normalized using a sliding window method with 200 bp bins and 20 bp steps. For the correlation clustering in Figure 4D, we normalized the X chromosome signal to the autosomes by doubling the raw contacts on chrX and chrXHet before DpnII normalization to account for the presence of only a single X chromosome in the CME-W1-cl8+ cell line. Where indicated, coverage tracks were calculated for 2 kb bins, then converted to the log2 ratio of the contacts and DpnII frequency, which was subsequently used to calculate a z-score for each bin based on the whole genome mean and standard deviation for each RNA signal track.

Hi-C and TAD analysis

HiC libraries (‘Hi-C/Mock-ChAR’) were performed as indicated in the methods (blue text indicates branch points in the protocol). Hi-C libraries were sequenced using 2 × 75 paired end reads and were aligned to blacklisted and repeatmasked genome using the ‘local’ and ‘reorder’ flags. The subsequent bam files were used to build the contact matrices using HiCExplorer (https://github.com/deeptools/HiCExplorer/blob/master/docs/index.rst). Matrices were built at 10 kb resolution. Data from GEO:GSE58821 were downloaded, re-aligned and processed using HiCExplorer in parallel. Plotting, TAD calling, and matrix correlation were all performed in HiCExplorer. Plots showing RNA-DNA and DNA-DNA contacts were also generated using HiCExplorer, where RNA-DNA matrices were also built at 10 kb resolution.

TAD boundary analysis

Boundary regions were chosen as the 10 kb on either side around TAD edges (i.e., 20 kb total). TAD midpoints were calculated and rounded down to the nearest multiple of 10 kb, and the midpoint region was set as 10 kb to either side of the midpoint. Both boundary and midpoint region edges therefore align with 10 kb bins used for TAD calling. Regions overlapping the modENCODE blacklist or our manually curated blacklist were removed from analysis. ChAR-seq contacts were filtered to remove cis contacts, defined here as the RNA parent gene locus ±2 kb. Counts of DpnII sites, ATAC-seq insertion sites (5’ end of sequenced fragments, shifted +4 bp on the +strand and −5 bp on the - strand of the reference genome) and ChAR-seq DNA contacts within regions were calculated using the coverage or map commands in bedtools (version 2.20.1). The number of base pairs in each region (midpoint or boundary) masked by our repeat masking blacklist was calculated using bedtools coverage and groupby commands. Subsequent analysis was done with R version 3.3.1. To match the distributions of masked base pairs per region between TAD midpoints and boundaries, 20 kb region masked lengths were divided into bins of 50 bp between 0 and 1000 bp, and 500 bp between 1000 bp and 20000 bp. The boundaries and midpoints were divided into these masked length bins; within each bin, boundaries and midpoints were subsampled to the minimum number of regions for the two region types (e.g., if in a masked length bin there three boundaries and five midpoints, the midpoints will be randomly subsampled to select three midpoints.) The bin matching was verified with a Wilcoxon rank sum test of the total length of repeat masked regions in each region, with an expected non-significant p-value of 0.76. ChAR-seq DNA contact counts for each RNA type within each region were normalized by dividing the number of counts in the region by the number of DpnII sites in the region. roX2 analysis of trans-ChAR contacts at boundary and midpoint regions was only performed for TADs on the X chromosome. All other RNA contact categories were analyzed for all TAD midpoint and boundary regions that fell outside of blacklisted regions.

Correlation clustering and metaplots

We used deepTools (https://github.com/fidelram/deepTools) to calculate the correlation between coverage tracks using the multiBigWigSummary tool excluding the following chromosome regions: chrU, chrUextra, ChrYHet and chrM. Coverage was calculated using 100 kb bins unless indicated. The Pearson correlation coefficients were then clustered and plotted using the either the PlotCorrelation function in deepTools (snRNA clustering) or the heatplot function in R using the ward.D2 method and Euclidian distance function (modENCODE vs ChAR-seq correlation clustering). modENCODE datasets (M-values, wig-formated) for CME-W1-cl8+ were downloaded from data.modencode.org, and then re-binned into 2 kb windows, and filtered to remove modENCODE blacklist sites and repeat-regions, chrYHet, chrM, chrU and any bins lacking DpnII sites. modENCODE tracks were then transformed into z-scores by dividing the mean shifted (log2 M-values) by the genome-wide standard deviation of the bin depth. Metaplots were generated using the computeMatrix and plotProfile tools with a 2 kb window up and downstream of each region. Signal was then normalized to fold-change relative to mean of the random signal and re-ploted in GraphPad Prism.

roX1 and roX2 chromosome X enrichment

The enrichment of roX RNAs bound to chrX was calculated by obtaining the number of ChAR-seq roX1 or roX2 RNA reads that contact chrX, and comparing this to the expected number on chrX if roX1 or roX2 reads were distributed randomly over total chromosomal sequence space. A one-tailed cumulative binomial distribution test was used to generate each p-value, which is the probability of obtaining the same or greater number of ChAR-seq roX1 or roX2 reads on chrX given a random genomic distribution.

ChIRP-seq and ChAR-seq correlation analysis

ChIRP-seq binding profiles for roX1 and roX2 were obtained from published data (GSE53020_roX1_merge.bw, GSE53020_roX2_merge.bw) (Quinn et al., 2014). The Drosophila melanogaster genome (dm3) was divided into equally sized windows for a range of distinct window sizes (50 bp to 1 MB). Read counts per window were obtained using BEDtools. ChIRP-seq and ChAR-seq data were filtered identically to remove modENCODE blacklist sites and repeat-regions, chrYHet, chrM, and any bins lacking DpnII sites. roX1 and roX2 origins were excluded due to extreme read pileups in ChIRP-seq data (Chu et al., 2011). Spearman’s rank correlations of read-counts-per-bin across the genome between ChAR-seq and ChIRP-seq were performed in R and plots graphed and fitted with Prism.

ChAR-seq and ChIRP-seq signal-to-noise comparison

Signal-to-noise (SNR) analysis for ChAR-seq and ChIRP-seq data was performed similar to the approach in Figure 4B of Quinn et al. (2014), with minor modifications. We treat roX RNA contacts on the autosomes as ‘noise’, and dense chrX contacts as ‘signal’. Due to the limited number of reads for individual RNAs and the insufficient resolution afforded by DpnII cut-site locations, we were unable call narrow (<1 kb) peaks for roX1 and roX2 ChAR-seq reads using standard methods (e.g., MACS2). Therefore, to calculate signal, we divided the genome into 2 kb bins and counted the number of roX1 or roX2 RNA-DNA contacts per bin for the 300 chrX bins containing the most contacts. ChAR-seq reads were normalized to DpnII cut sites within each bin. To define noise, we randomly selected 300 equal sized bins from autosomes. The mean RNA-DNA contact count per bin on chrX peaks (‘signal’) was divided by the mean RNA-DNA contact count per bin on autosomes (‘noise’) to produce the SNR value. To avoid complications due to extreme bin values or repetitive regions, we filtered for chr4, chrM, chrY, chr2Het, chr3Het, chrXHet, chrU, and modEncode blacklist regions (REF:modencode). The roX1 and roX2 loci on chrX were removed due to extreme values for ChIRP-seq data (Chu et al., 2011). For consistency, we re-calculated the SNR for ChIRP-seq roX1 and roX2 data using the above approach and filtering, and obtained values similar to those published in Quinn et al. (2014).

ChAR-seq, ChIRP-seq, GRID-seq correlations

ChAR-seq and ChIRP-seq roX2 DNA contact signal was processed as in the SNR analysis. GRID-seq RNA and DNA raw reads were download from the NCBI SRA database (accessions: SRX1824449 and SRX1824449) aligned to the genome using bowtie2 (version 2.3.4.1) with the parameter --local, and minimum scoring threshold set to G,1,10. GRID-seq S2 cell Replicates 1 and 2 were merged after mapping. Uniquely mapping RNA reads with uniquely mapping DNA mates (samtools flag -q2) were preserved. RNA reads aligning to the roX2 locus were designated as roX2 RNAs, and the associated DNA coordinate as the roX2 DNA-contact. For consistency with the other data sets, GRID-seq reads were filtered for repeats and small chromosome scaffolds identically to the ChAR- and ChIRP-seq filtering applied in the SNR analysis. We divided the genome into 2 kb bins and counted the number of roX2 DNA-contacts in each bin, as above for ChAR- and ChIRP-seq. Restriction enzyme density can bias the number of reads within a given bin, so we normalized the raw GRID-seq read count by dividing the reads-per-bin by the number of AluI RE sites-per-bin (AluI cut map generated with hicup(version 0.5.2). This signal was then calculated for 20 kb bins across chrX, and Spearman’s rank correlation performed pairwise among the three methods.

RNA-seq library preparation

For total RNA-seq analysis, total RNA from 10 to 20 million CME-W1-cl8+ cells was purified using TriPure, then treated with 10 units of TURBO DNase (Life Technologies) at 37°C for 30 min according to the manufacturer’s instructions. RNA was re-purified with TriPure, resuspended in DEPC-treated water, and quality checked by Bioanalyzer (Agilent). Ribosomal RNAs were depleted using the Ribo-Zero rRNA removal kit (Illumina), RNA was purified using Agencourt AMPure XP beads (Beckman Coulter), then cDNAs were generated, amplified, and indexed with the ScriptSeq v2 RNA-Seq Library Preparation Kit (Epicentre) according to manufacturer’s instructions. Indexed libraries were quantified by Bioanalyzer and qPCR, pooled, and sequenced on a NextSeq 500 (Illumina).

ATAC-seq library preparation

ATAC-seq using 250 K-500K CME-W1-cl8+ cells per reaction was performed with the Nextera DNA Library Prep Kit (Illumina, FC-121–1030). The reaction protocol was as previously described (Buenrostro et al., 2013), but without detergent or lysis incubation steps. Transposition reactions were performed at 37°C for 30 min shaking at 400 r.p.m. Libraries were purified with the QIAGEN MinElute Reaction Cleanup Kit and PCR amplified with barcoded primers. Amplification cycle number for each sample was monitored by qPCR to minimize PCR bias. PCR amplified libraries were purified with the QIAquick PCR Purification kit and excess primers removed by AMPure XP bead selection (Beckman Coulter). Final library concentrations were determined by qPCR using custom primers and PhiX sequence (Illumina) as a standard.

ATAC-seq data processing

Six technical replicates were sequenced with 75 bp paired-end reads on the Illumina Next-seq. Illumina Nextera Adapters were removed using a custom Python script. Reads were aligned to the Drosophila melanogaster genome (dm3) using bowtie2 (version 2.2.5) with the parameter -X 2000. Duplicates were removed with Picard; mitochondrial reads or reads with bowtie2 MAPQ score <30 were removed using SAMtools. All replicates had high similarity so their alignment files were merged to increase library complexity before analysis with ChAR-seq data.

References

  1. 1
  2. 2
  3. 3
  4. 4
    Pipeline to process single end ChARseq reads to a combined, filtered and annotated matrix of RNA to DNA contacts, version 662e4bb
    1. J Bell
    (2017)
    Github.
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
  24. 24
  25. 25
  26. 26
  27. 27
  28. 28
  29. 29
  30. 30
  31. 31
  32. 32
  33. 33
  34. 34
  35. 35
  36. 36
  37. 37
    Super deduper, fast PCR duplicate detection in fastq files
    1. KR Petersen
    2. DA Streett
    3. AT Gerritsen
    4. SS Hunter
    5. ML Settles
    (2015)
    BCB '15 ACM International Conference on Bioinformatics, Computational Biology and Biomedicine. pp. 491–492.
  38. 38
  39. 39
  40. 40
  41. 41
  42. 42
  43. 43
  44. 44
  45. 45
  46. 46
  47. 47
  48. 48
  49. 49
  50. 50
    Hi-C: a method to study the three-dimensional architecture of genomes
    1. NL van Berkum
    2. E Lieberman-Aiden
    3. L Williams
    4. M Imakaev
    5. A Gnirke
    6. LA Mirny
    7. J Dekker
    8. ES Lander
    (2010)
    Journal of Visualized Experiments, 10.3791/1869, 20461051.
  51. 51
  52. 52
  53. 53
  54. 54
  55. 55
  56. 56

Decision letter

  1. Job Dekker
    Reviewing Editor; University of Massachusetts Medical School, United States

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "Chromatin-associated RNA sequencing (ChAR-seq) maps genome-wide RNA-to-DNA contacts" for consideration by eLife. Your article has been favorably evaluated by Kevin Struhl (Senior Editor) and three reviewers, one of whom, Job Dekker (Reviewer #1), is a member of our Board of Reviewing Editors.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

Summary:

This is an interesting manuscript describing a new approach for mapping RNA-chromatin interactions genome-wide. It has the potential to be a major improvement over existing methods. Three patterns of RNAs linked to chromatin are identified in flies: 1) those of nascent transcript next to their genes, 2) RNAs in trans across all genome, and 3) those on X chromosome for male cells. The method could be an important approach for the study of how RNA modulates the genome and is therefore of interest to many. Several issues related to reproducibility, sensitivity and specificity need to be addressed, as outlined below.

Essential revisions:

The authors need to address the following main points:

1) Compare CharSeq performance in more detail to previously published MARGI and ChIRP-Seq methods: what are the key difference, improvements? Can you quantify this?

2) Can you add metrics describing sensitivity and specificity of the method?

3) Please clarify concerns about how many times experiments were performed and add details on statistical analysis of reproducibility of data and analyses.

4) Can you make the results available in a more user-friendly way, i.e. make the results available as a list of RNAs and loci they associate with.

5) Explore RNAs interacting with heterochromatin, rDNA etc. These are repetitive DNA sequences, but a meta-analysis of RNAs associating with defined repeats is of considerable interest and would enhance the paper.

Reviewer #1:

This is an interesting manuscript describing a new approach for mapping RNA-chromatin interactions genome-wide. It has the potential to be a major improvement over existing methods. I missed a more deep quantification of background, sensitivity and signal to noise ratios and how these relate to expression levels.

Reviewer #2:

The manuscript by Bell et al., presents a sequencing method to map all RNA-chromatin/DNA contacts in a genome: Chromatin-Associated RNA sequencing (ChAR-seq) a proximity ligation and sequencing method. The approach is utilized mainly to examine what RNAs are interacting with chromatin in Drosophila melanogaster CME-Wl-cl8+ (male) wing disc cells. Some work is also done using female Kc167 cells. Three patterns of RNAs linked to chromatin are identified: 1) those of nascent transcript next to their genes, 2) RNAs in trans across all genome, and 3) those on X chromosome for its inactivation. This last group, as expected, does not show up in the female cell line. The data obtained via ChAR-seq are in agreement with those obtained by Quinn et al., 2014 using ChIRP-seq. The ChAR-seq appears as a more sensitive method because it captures all RNA-DNA contacts. Authors also use their data to provide novel insights about the function of the various RNAs that they found to have abundant association with chromatin. The application of ChRA-seq in this study have revealed many snoRNA molecules interacting with chromatin in the heterochromatin form.

1) Overall, the authors show that ChRA-seq works well to identify both RNAs that are nascent and thus still connected with DNA in cis, and RNAs that do not directly interact with DNA, but rather with chromatin factors in trans. The ChRA-seq approach is validated with the RNAs that are known to be functioning in chr. X dosage compensation, in fact these RNAs (roX1 and roX2) are found to bind chr. X chromatin in the Drosophila male cells but not in the female cells. In addition to the similarity of the ChRA-seq approach to the ChIRP-seq by Quinn et al., Nat Biotech 2014, which although is limited to RNA-chromatin interactions, as pointed out by the Authors, ChRA-seq is almost identical to the method used in Sridhar et al., 2017in vivo. Thus, the novelty of the study by Bell et al. is in part reduced because of this recent publication, which exploits a very similar ligation method to capture sites of RNA interaction with chromatin and DNA.

Other points:

2) What is described in the first paragraph of Results does not match perfectly with Figure 1A, because the RT step in the figure is shown after the bridge is ligated to DNA, while in the text it is described before the bridge is ligated to DNA. The text should match the figure description.

3) It would be good to add the polarity of the RNA molecule in red in the Figure 1A. It would be good also to explain that the approach captures the 3' RNA ends of RNA molecules. Also Figure 1A should be bigger to clearly show which RNA end is ligated to the bridge, to clearly indicate the polarity of ends and the structure at the junction. If the App has a free 3' end, and this is ligated to 3' end RNA, is this a 3'-3' ligation? Or is the App removed in the ligation process? This should be explained well because it is not evident.

4) The statement "the 5'-adenylated end (5'App) enables increased ligation specificity for 3'-terminated ssRNA" is not clear: 'increased specificity' relative to what?

5) Is the sequence of RNA and DNA oligos presented in Figure 1—figure supplement 2 provided? It is not clear what the molar excess of DNA over RNA is, please clarify. Also the legend of this figure should explain what is shown in every lane. In lane 1 there is a high mw band, what is this, what is its size? The ladder sizes are provided only for the bottom part of the gel, they should be provided also for the top part, considering that there is a high mw band. How many times was this experiment repeated? Is there any statistical analysis of these data? This, possibly, should be explained in the legend.

6) The statement: "This RNase-treatment dramatically reduced in the number of bridge molecules identified, demonstrating that bridge ligation is indeed RNA-dependent (Figure 1—figure supplement 5)" is not strongly supported by the data shown in this figure. In addition to a typo in this statement, looking at the figure, there seems to be a factor of six reduction, which does not represent a 'dramatic' reduction. Importantly, there appears to be no repeats of this experiment and no statistical analysis of the data. Was this experiment reproducible, are there repeats that can be shown?

7) Legend to Figure 1E has a typo: "Zoomed in region of shown".

8) Typo in Results: "we performed RNA-seq to for".

9) The correlation between expression level of the RNA and FPKM contacts is not very clear, because in Figure 2—figure supplement 1 and 2 this does not seem to be the case.

Reviewer #3:

This manuscript presents a very important new technique (CHAR-seq) developed to identify chromatin-associated RNAs. The major benefit of this method in contrast to previously used techniques is the use of proximity dependent ligation to generate sequences from ssRNAs linked to nearby DNA, which allows relatively unbiased assessments of associations across much of the genome. The authors demonstrate very nicely that application of ChAR-seq to Drosophila cultured cells identifies a spectrum of chromatin-associated RNAs, including chromosome-specific, nascent mRNAs, and sn/sno RNAs. The utility and novelty of ChAR-seq makes it completely appropriate for publication in eLife; I am very enthusiastic about acceptance once some issues about the presentation and analyses are addressed / clarified.

1) Data completeness and transparency:

To provide a more complete resource for the community, all potentially interesting RNAs should be named/identified in a table, including those that were initially included but didn't match criteria for 'chromatin associated'.

a) There should be a table listing the 1797 RNAs (Figure 2D), and relevant information, such as% and numbers of cis and trans contacts, RNA expression level, RNA-DNA contacts, and a way to access information about their genomic distributions, etc. – essentially anything that may be of use to scientists interested in whether or not their favorite RNA is present, or in performing their own analyses of the data.

b) I assume Figure 2E shows all 1797 as 'included', but the 'chromatin associated' RNAs in Figure 2E should be listed in a table. My impression is that some (73) but not all are listed in Figure 2—figure supplement 1 – are these the red dots in 2E? Even if all red dots are listed in the supplement, readers should be able to access identity and other data for all 1797, and especially the grey dots that are RNA-DNA contact outliers with high expression (I suspect that normalization to expression levels eliminates interesting candidates, see statistics discussion below). Similarly, there appear to be RNAs with low expression that may be chromatin enriched, yet are not highlighted in red.

2) Genomics, Analysis and Statistics:a) Figure 1D, 2A and elsewhere. I could not find a description of what the black circles and darker grey areas on the chromosome graphics indicate. I assume black=centromere, but the rest is unclear.

b) Why was release 5 used, given that the more complete release 6 has been available for at least a year? In addition, genomes from the cell lines used have not been assembled, and are likely rearranged (plus other variation, including DNA copy number) relative to the reference sequence, so the positions shown in the figures are likely to be incorrect, though we don't know where. No, the authors do not need to assemble the clone8 or Kc genomes, but they should acknowledge this issue.

c) It is unclear what parts of the genome assembly were included, for example were heterochromatic regions included, and if so where is that data presented? This information would help with the interpretations, especially in Figure 4D.

d) A related point, for Figure 4D it would be helpful to report the genomic locations (middle of euchromatin? Pericentromeric? X vs. autosomes?) for RNAs whose chromatin contacts show strong correlation with specific chromatin marks. Also, please explain how the correlation signals were aggregated in Figure 4D – the methods described using 2kb bins while the figure showing 100kb bins.

e) Can the authors discuss how the use of PCR to amplify ligated RNA/DNA contacts, combined with normalization to expression levels, could exclude RNAs with lower transcription levels, even if they have many (potentially important) DNA contacts? I think they are being appropriately conservative, but as with all such screens, and to help others who will certainly try it themselves, it would be good to have a brief discussion about the tradeoffs.

f) Some of the statements need quantitative support, such as:

- Figure 2E – what is the threshold for identifying RNAs with more than expected chromatin interaction (how was the expectation calculated?

- How were p-values for rox1 and rox2 (7.6 fold and 8.1 fold enrichment) calculated?

- Figure 3E: correlation between the DNA contact locations from ChIRP and ChAR- seq –.is this simply the Pearson correlation coefficient?

- Figure 4—figure supplement 1: how was the base sequence similarity calculated and used to aggregate signals for different snRNAs?

g) Using the correlations between ChIRP and ChAR-seq to assess the resolution of ChAR-seq does not seem to be appropriate, since the preparation of sequencing libraries may be very different between these two datasets. (Also, the curve actually goes down as window size increases?!) The authors should provide more reasoning for this approach.

How were the replicates analyzed?.Were they merged? What are the correlations between replicates?

h) How was the z-score was calculated? It seems like it is calculated as comparing between the values in each bin vs. the rest of the genome. If this is the case, there may be no need to calculate a z-score as every window will have the same denominator, and it may be just as informative as using the read counts. Also, it may be informative to use the chromosome-specific mean (instead of whole genome mean) to further identify interesting local events (the current analysis may be mainly identifying between chromosome differences).

https://doi.org/10.7554/eLife.27024.036

Author response

Essential revisions:

The authors need to address the following main points:

1) Compare CharSeq performance in more detail to previously published MARGI and ChIRP-Seq methods: what are the key difference, improvements? Can you quantify this?

We have now quantitatively compared our data to ChIRP-seq and MARGI with new analysis in the main text and figures. We have also included a comparison to another recently published method called GRID-seq (described below).

Comparison to ChIRP-seq:

ChAR-seq produces multiplexed RNA-chromatin contact data for every RNA in the cell without any requirement for the identity of the RNA, whereas ChIRP-seq requires production of oligos against a single target RNA of known sequence. They are complementary technologies, in that ChAR-seq is suited for discovery of interactions and interaction patterns in an unbiased assay, while ChIRP-seq provides targeted, higher-coverage data to explore the contact profiles of RNAs that have already been identified as interesting targets.

To address this point, we added a signal-to-noise calculation between ChIRP-seq and ChAR-seq for both roX1 and roX2 (Figure 3F). Briefly (see subsection “ChAR-seq and ChIRP-seq signal-to-noise comparison”), we used the same signal-to-noise (SNR) metric employed by Howard Chang’s Lab in a related ChIRP-seq paper (Quinn et al., 2014). This metric considers the chrX bins or peaks with the most roX RNA binding events to be ‘signal’ and treats roX RNA binding in randomly chosen autosomal bins to be ‘noise’. To quantitatively compare these two very different methods (genome-wide vs single-RNA), we were careful to re-analyze published ChIRP-seq signal from the original paper (Chu et al., 2011) with identical filtering and random bin selection. This resulted in higher SNR values for ChAR-seq than ChIRP-seq (roX1: 49.8 to 3.9, roX2: 45.2 to 22.0), despite having ~100-fold fewer roX reads per data set. We believe this quantitative analysis demonstrates the high sensitivity of our genome-wide method, in addition to validating it as identifying specifically bound RNAs.

In addition, we previously included several direct comparisons between ChIRP- and ChAR-seq, including genome-wide correlation plot for roX1 and roX2 across different bin sizes ranging over 5 orders of magnitude (Figure 3G) that revealed strong correlation over appropriate bin sizes (Spearman rho = 0.7-0.85). We also display roX1 and roX2 RNA binding signal across a 250kb region of chrX for both methods (Figure 3E).

Comparison to MARGI:

MARGI and ChAR-seq both ligate a 5-App oligonucleotide ‘bridge’ or ‘linker’ to the 3’-end of RNA. In MARGI, the linker molecule is then ligated to either restriction digested genomic DNA or sonicated DNA. Beyond this, the methods differ greatly. In MARGI, the proximal RNA and DNA are attached via circularization, and the linker molecule is cut with a restriction enzyme to form templates for sequencing primers at the end of the molecule. We list the two primary advantages of our method over MARGI in the main text discussion: first, ChAR-seq is performed in situ in intact nuclei, whereas the MARGI proximity ligations are performed on beads. As with Hi-C, it is expected that in vitro based proximity ligation will have more spurious ligation events than those from intact nuclei. Second, in ChAR-seq the final RNA-bridge-DNA molecule is sequenced with long single-end reads, which preserves the original ligation junctions and allows us to detect multiple bridges and other aberrant ligation events.

For many types of analyses, we cannot compare ChAR-seq directly to MARGI because MARGI analyzed mammalian cells, whereas we analyzed Drosophila cells. However, we did quantify signal-to-noise in MARGI by calculating the fold-enrichment of Xist on the X-chr as compared to autosomes:

We used MARGI data from HEK293 cells, which contain two stable Barr bodies (Gilbert et al., 2000) and thus avoid uncertainties about incomplete X inactivation associated with stem cells. Because pxMARGI (proximal) data had insufficient association reads (only counting distal and inter-chromosomal contacts), we analyzed distal RNA (diMARGI) contacts on chr X vs. autosomes. We assumed that autosomal bound Xist constitutes noise because FISH in HEK293 cells does not show widespread localization of Xist RNA (Chow et al., 2007). Because the data were also too sparse to call high-confidence X association peaks within the X chromosome, we divided the genome into 1 Mb bins, keeping only bins that (a) contained at least 1 diMARGI Xist read, (b) did not contain any ENCODE blacklisted regions, and (c) were the full 1 Mb length. The X chromosome and autosomes contain similar fractions of bins with zero Xist reads: 31% and 26%, respectively. To account for chromatin accessibility or HaeIII site bias, autosomal bins were chosen to have similar accessibility (measured by ENCODE H3K27ac ChIP) and similar HaeIII site density to the chosen bins on the X chromosome.

Author response image 1
Control bins on autosomes were chosen to match bins on the X chromosome in terms of HaeIII restriction sites and accessible chromatin, here measured using the proxy of H3K27ac data because direct measurements of accessibility such as ATAC-seq or DNase-seq data were not publicly available for HEK293 cells.
https://doi.org/10.7554/eLife.27024.026

Xist diMARGI contact counts in these matched bins are remarkably similar across autosomes and the X chromosome, indicating a signal to noise ratio of ~1. The ratio of the medians is 1 (2 counts per 1 Mb bin). The ratio of the means is 3.4 (X:autosome) which is largely due to a single outlier bin on the X chromosome with a count of 555, as opposed to the maximum count of 13 counts in sampled autosomal bins. This X chromosome bin contains the Xist locus and although proximal diMARGI contacts are excluded, the area around the Xist gene nevertheless contains a high density of contacts. This is not surprising, but the dramatic drop in Xist contact density from the locus of transcription to other sites on the X chromosome is not in agreement with FISH data which show Xist coating the entire Barr bodies (Chow et al., 2007). We conclude that, by this metric, the signal-to-noise in MARGI is low, and ChAR-seq may have distinct advantages as an approach to determine genome-wide RNA-DNA contacts (see Discussion).

We have attached the MARGI signal-to-noise figures for reviewers, but decided not to put this analysis in the text because we did not directly analyze mammalian data.

Author response image 2
diMARGI signal to noise estimate via X vs. autosome comparison in HEK293 cells.

Plots of log2(diMARGI counts per megabase) for bins spanning the X chromosome and random, accessibility- and restriction site-matched bins on autosomes (left: box pot, right: scatter plot of same data).

https://doi.org/10.7554/eLife.27024.027

Comparison to GRID-seq:

While this paper was in revision, another genome-wide RNA-DNA method was published (“GRID-seq”; Li et al., 2017) that uses a methodology similar to ChAR-seq. While many details differ, GRID-seq also involves proximity ligation of a directional bridge to RNA and DNA to construct an RNA-DNA contact map. One key difference is that GRID-seq uses the restriction enzyme MMeI to cut the cDNA and gDNA fragments distal to the linker/bridge, which produces identically sized fragments. This has the advantage over ChAR-seq of being able to size select and sequence only molecules with RNA and DNA information. A critical disadvantage, however, is that reads are 19-23bp, which have a greater chance of falsely mapping to the wrong place in the genome than the 20-100bp RNA and DNA fragments obtained via ChAR-seq. We now mention the above difference in the Discussion.

The available GRID-seq data (from NCBI GEO:GSE82312) is heavily processed and some contacts were removed via a background model subtraction. We attempted to perform a signal-to-noise analysis after aligning the raw reads from NCBI SRA and filtering repeats but were unable to determine an appropriate SNR metric as with ChIRP-seq. This is because GRID-seq has lower sequencing depth for roX2 reads, and therefore the vast majority of genomic bins on the autosomes have zero reads, precluding the proper calculation. However, we analyzed the correlation of signal density between GRID and ChAR, GRID and ChIRP, or ChAR and ChIRP for roX2 reads over 20kb chrX bins. GRID and ChAR-seq match well with ChIRP (GRID-ChIRP = 0.84, ChAR-ChIRP = 0.80) and with each other (GRID-ChAR = 0.79), suggesting that – at a coarse level – ChAR-seq and GRID-seq have similar ability to detect true RNA-DNA contacts. This analysis is now in Figure 3—figure supplement 1, along with signal tracks along a 2Mb region of the genome.

False RNA-DNA contact% calculation, and comparison to MARGI and GRID-seq:

In addition to the above comparisons, we spiked in exogenous RNA species (GFP, HALO tag, and MBP) at varying levels relative to total RNA and identified their DNA contacts. This provides us with a measure of false positive contacts, which was always equal to or less than 0.5% in the condition where we spike in a contaminant up to 10% of the total RNA by mass (Figure 1—figure supplement 6). This value compares favorably to the false positive values published by MARGI (~2.2%) and GRID-seq (human cells: 3.3%, fly S2 cells: 6.9%, mESCs: 4.7%), with the caveat that a cross-species analysis was used instead of exogenous RNAs.

2) Can you add metrics describing sensitivity and specificity of the method?

As mentioned above, we have performed a spike-in experiment to address this question about specificity where we added exogenous RNA to our lysed cells in order to mimic the RNA that might be released during our fragmentation step, and which could theoretically cross-ligate to distant sites owing to non-specific binding. In this experiment, we isolated the total soluble RNA after lysis and fragmentation, added back a percentage of spike in RNA by mass at 0.1, 1 and 10%, and then proceeded with our library prep and quantified the number of false positives by alignment to the reference sequences used to generate the in vitro transcribed RNAs. Even in the condition where 10% of the total RNA was a spiked in contaminant, our false positive rate was less than 0.5%, indicating that the subsequent washes and steps after fragmentation but before ligation are sufficient to efficiently remove the contaminating RNA.

We have added signal-to-noise metrics and false-positive estimates, as described above, to the text. These reveal that ChAR-seq can detect RNAs with both high sensitivity (e.g., roX1 and roX2) and high specificity (e.g., roX1 and roX2 enrichment on chrX, and low percentage of DNA contacts by exogenous spike-in RNAs). In addition, we showed that roX1 and roX2 bind near previously identified dosage compensation peaks from ChIRP-seq data.

Other evidence for high sensitivity and specificity are presented in the following figures:

- Figure 1—figure supplement 4: 80% of ligation reads are specific to RNA.

- Figure 2E: A large subset of RNAs, including roX1 and roX2, are found enriched on chromatin after controlling for expression levels (off-diagonal RNAs in red on plot).

- Figure 3—figure supplement 2, and Figure 4C, D: When comparing female Kc167 cells to male CME-W1-cl8+ cells, roX1 and roX2 are clearly found enriched in male cells, as expected.

- Figure 3—figure supplement 1: roX enrichment on chrX.

3) Please clarify concerns about how many times experiments were performed and add details on statistical analysis of reproducibility of data and analyses.

We have now explicitly stated the number of biological and technical replicates used. Critical information relevant to each ChAR-Seq library is now presented in Supplementary file 1. For the primary data set, 4 biological replicates were performed and sequenced, along with one technical replicate sequenced at much higher depth. Additional datasets generated during revision provide three new technical replicates.

We also have added several graphs that show reproducibility of independent replicates, including a comparison between ChAR-seq replicates with respect to number of RNA-DNA contacts (Figure 1—figure supplement 5), reproducibility of chromatin enriched RNAs across 3 different replicates, at 2 different expression levels (Figure 2—figure supplement 2), and reproducibility across different cell lines (Figure 3—figure supplement 2).

Finnaly, we have included all details for statistical analysis in the main text, figure legends, or Materials and methods. These include the statistical tests, p-value, null hypothesis (where appropriate), as well as a text description of how the data inputs were generated or processed.

4) Can you make the results available in a more user-friendly way, i.e. make the results available as a list of RNAs and loci they associate with.

We have now deposited a user-friendly file into NCBI GEO (accession GSE97131) that contains all RNA-DNA contacts in BED format (RNA-chr, start, stop; and DNA-chr, start, stop). This file contains 39 additional columns with information regarding:

- RNA transcript details (length, genome coordinates, transcript coordinates, flybase ID, transcription start site, etc.).

- DNA contact details.

- RNA alignment info (mapping quality, read length, read start position, CIGAR string, etc.).

- DNA alignment information.

This text file can be easily manipulated using bedtools or shell commands to extract answers to various questions related to the dataset as a whole (what are the 100 RNAs with the most DNA contacts?) or to specific RNA or DNA locations (where does my favorite RNA bind in the genome?). We expect such a file to be valuable to biologists with or without genomics analysis experience.

In addition, we now include a supplementary table (Supplementary file 2) that lists 8822 unique RNAs, as well as: their total number of DNA contacts, RNA length, common name, Flybase ID, RNA-seq expression level, normalized RNA-DNA contact number, and chromatin enrichment (over expression level).

Finally, we will provide the code for our of analysis pipeline (from raw reads to the above user-friendly file) in a publicly accessible format for download on Gitlab (https://gitlab.com/charseq/flypipe).

5) Explore RNAs interacting with heterochromatin, rDNA etc. These are repetitive DNA sequences, but a meta-analysis of RNAs associating with defined repeats is of considerable interest and would enhance the paper.

We agree that an analysis of heterochromatin and repeats would be of great interest, and a great use of the ChAR-seq method. We attempted to analyze which RNAs were bound to repetitive DNA sequences, but found that it was difficult to make any conclusions about data since: 1) much of the repetitive portion of the Drosophila genome is incomplete, including some centromeric sequence; 2) given 1, the sequence space covered by a given short repeat element is unknown, making it difficult to make any meaningful conclusion regarding DNA abundance after alignment to these RNAs; 3) repeats are not easily mapped, or mapped to, with current alignment tools. For example, we removed all non-uniquely mapping reads here, but would need to set different multi-mapping threshold parameters for different repeat classes. We also attempted to analyze where RNAs that emanate from DNA repeats (such as transposons) bind; specifically, do RNAs expressed by DNA repeats bind nearby, or to other repetitive DNA loci at a greater frequency? However, this analysis was also inconclusive, primarily due to the multi-mapping issue described above.

My laboratory has a strong interest in repeat regions of the genome from our studies of vertebrate centromeres Thus, we are particularly interested in pursuing this analysis in human cells where high quality models for centromere repeats have been generated by Karen Miga, Jim Kent, David Haussler and Hunt Willard and long-read sequence information has enabled the first complete assembly of the human Y centromere. However, we feel that these experiments are outside the scope of this current manuscript.

Reviewer #2:

[…] 1) Overall, the authors show that ChRA-seq works well to identify both RNAs that are nascent and thus still connected with DNA in cis, and RNAs that do not directly interact with DNA, but rather with chromatin factors in trans. The ChRA-seq approach is validated with the RNAs that are known to be functioning in chr. X dosage compensation, in fact these RNAs (roX1 and roX2) are found to bind chr. X chromatin in the Drosophila male cells but not in the female cells. In addition to the similarity of the ChRA-seq approach to the ChIRP-seq by Quinn et al., Nat Biotech 2014, which although is limited to RNA-chromatin interactions, as pointed out by the Authors, ChRA-seq is almost identical to the method used in Sridhar et al., 2017. Thus, the novelty of the study by Bell et al. is in part reduced because of this recent publication, which exploits a very similar ligation method to capture sites of RNA interaction with chromatin and DNA.

Other points:

2) What is described in the first paragraph of Results does not match perfectly with Figure 1A, because the RT step in the figure is shown after the bridge is ligated to DNA, while in the text it is described before the bridge is ligated to DNA. The text should match the figure description.

We thank the reviewer for pointing out this discrepancy in the figure. Figure 1A has been changed to reflect the protocol as described in the text.

3) It would be good to add the polarity of the RNA molecule in red in the Figure 1A. It would be good also to explain that the approach captures the 3' RNA ends of RNA molecules. Also Figure 1A should be bigger to clearly show which RNA end is ligated to the bridge, to clearly indicate the polarity of ends and the structure at the junction. If the App has a free 3' end, and this is ligated to 3' end RNA, is this a 3'-3' ligation? Or is the App removed in the ligation process? This should be explained well because it is not evident.

We have added a polarity indicator arrow to the illustration in Figure 1B, showing how the first-strand cDNA sequence (representing the RNA) is always 5’ of the adaptor sequence top strand (which does not contain the biotin modification) and the DNA sequence is always 3’ of the adaptor sequence top strand.

4) The statement "the 5'-adenylated end (5'App) enables increased ligation specificity for 3'-terminated ssRNA" is not clear: 'increased specificity' relative to what?

We have clarified the language in the manuscript to indicate that the ligation is performed without free ATP, making 5’ adenylation a requirement for ligating the 5’ ends. This adenylation and ligation scheme ensures that only bridge 5’ ends are ligated to RNA, preventing non-specific ligation between non-adenylated RNA 5’ ends and RNA 3’ ends.

5) Is the sequence of RNA and DNA oligos presented in Figure 1—figure supplement 2 provided? It is not clear what the molar excess of DNA over RNA is, please clarify. Also the legend of this figure should explain what is shown in every lane. In lane 1 there is a high mw band, what is this, what is its size? The ladder sizes are provided only for the bottom part of the gel, they should be provided also for the top part, considering that there is a high mw band. How many times was this experiment repeated? Is there any statistical analysis of these data? This, possibly, should be explained in the legend.

We added the missing label to Lane 3 to indicate that it represents the reaction using R55K K227Q mutant T4 Rnl2tr ligase and added additional labels for the ladder used (a combination of NEB microRNA and low range ssRNA ladders). The high molecular weight band is most likely long concatemers of the ssDNA substrate created by the Thermostable 5’App DNA/RNA ligase. It is possible that this enzyme is able to remove the 3’ block from the ssDNA oligo used in this optimization experiment, or that the block was incomplete. This enzyme is known to have higher activity in ssDNA-ssDNA ligation than the other two. We have modified the figure legend to reflect this explanation and to provide more detailed information on the enzymes used. We only use these results to make a qualitative judgment about which enzyme performed best. Because the conclusion of the experiment would not be altered by further quantification, we proceeded with the approach based on this qualitative assessment. We have also updated the legend to state that the experiment was performed once.

The DNA sequence used in Figure 1—figure supplement 2 is the universal miRNA cloning linker: CTGTAGGCACCATCAAT and the RNA sequence was that of the CENP-B box: TTTCGTTGGAAGCGGGA. These details have been added to the Figure 1—figure supplement 2 legend.

6) The statement: "This RNase-treatment dramatically reduced in the number of bridge molecules identified, demonstrating that bridge ligation is indeed RNA-dependent (Figure 1—figure supplement 5)" is not strongly supported by the data shown in this figure. In addition to a typo in this statement, looking at the figure, there seems to be a factor of six reduction, which does not represent a 'dramatic' reduction. Importantly, there appears to be no repeats of this experiment and no statistical analysis of the data. Was this experiment reproducible, are there repeats that can be shown?

We have changed the manuscript text to be more specific, indicating the fold-reduction in bridge ligation events. The experiment was performed once for this control. In response to reviewer 1, question 3, we discuss why an RNase treatment could still result in the low presence of bridge-containing sequences.

7) Legend to Figure 1E has a typo: "Zoomed in region of shown".

The figure legend has been updated to fix this typo.

8) Typo in Results: "we performed RNA-seq to for".

Thank you for pointing it out. This typo has been corrected in the manuscript.

9) The correlation between expression level of the RNA and FPKM contacts is not very clear, because in Figure 2—figure supplement 1 and 2 this does not seem to be the case.

The correlation the reviewer describes can be observed in Figure 2E, in which we plot the total RNA expression (measured by RNA-seq) vs. RNA-to-DNA contacts (measured by ChAR-seq). For most RNAs, there is a strong correlation between expression level and ChAR-seq contacts observed. The red dots represent RNAs that are enriched by at least 10-fold in the ChAR-seq dataset over the average level of contacts for a given expression level. This correlation is already normalized in the bar plot on the right side of Figure 2—figure supplement 1 and the quantity being displayed is fold-enrichment of ChAR-seq contacts over the expected contacts for a given expression level. The left bar plot is the raw expression level. What this figure illustrates is that the level of chromatin enrichment for a given RNA is not correlated with the total number of contacts we observe – enrichment is not merely a property of being above some limit of detection, it is a property of the RNA itself.

Reviewer #3:

[…] 1) Data completeness and transparency:

To provide a more complete resource for the community, all potentially interesting RNAs should be named/identified in a table, including those that were initially included but didn't match criteria for 'chromatin associated'.

a) There should be a table listing the 1797 RNAs (Figure 2D), and relevant information, such as% and numbers of cis and trans contacts, RNA expression level, RNA-DNA contacts, and a way to access information about their genomic distributions, etc. – essentially anything that may be of use to scientists interested in whether or not their favorite RNA is present, or in performing their own analyses of the data.

We agree that the 1797 (now 1952 after additional sequencing and analysis see revised Figure 2D) most abundant transcripts are of interest. We have now generated a 39 column ‘masterfile’ that contains all RNA-DNA contacts and related information, as described in our response to the editor’s Essential revisions. These data are publicly available on NCBI GEO (GSE97131), In addition, we provide Supplementary file 1 that includes the top 8822 unique RNAs, their RNA-DNA contact number (CPKM), RNA expression level, and chromatin enrichment values, which allows any scientist to search for a particular RNA of interest.

b) I assume Figure 2E shows all 1797 as 'included', but the 'chromatin associated' RNAs in Figure 2E should be listed in a table. My impression is that some (73) but not all are listed in Figure 2—figure supplement 1 – are these the red dots in 2E? Even if all red dots are listed in the supplement, readers should be able to access identity and other data for all 1797, and especially the grey dots that are RNA-DNA contact outliers with high expression (I suspect that normalization to expression levels eliminates interesting candidates, see statistics discussion below). Similarly, there appear to be RNAs with low expression that may be chromatin enriched, yet are not highlighted in red.

The red dots in revised Figure 2E are the 138 RNAs with contacts enriched 10-fold over the expression levels, and that have more than 100 CPKM (DNA contacts). As stated above, the most abundant RNAs can be extracted from Table 1, along with all RNAs above any desired fold enrichment threshold.

2) Genomics, Analysis and Statistics:a) Figure 1D, 2A and elsewhere. I could not find a description of what the black circles and darker grey areas on the chromosome graphics indicate. I assume black=centromere, but the rest is unclear.

We have clarified these graphics in the figure legends. The cartoon chromosomes depicted in Figure 1D, 2A, 3C, and 3D are scaled visual representations of the chromosomes. The black circles are the centromeres, the light gray regions are the primary chromosome scaffolds, and the dark gray regions are the heterochromatic scaffolds.

b) Why was release 5 used, given that the more complete release 6 has been available for at least a year?

Release 5 (dm3) was used for several reasons. First and foremost, the ChIRP-seq roX1 and roX2 datasets that we used to for validation and the modENCODE data that we used for deeper analysis were all generated with the dm3 build. Many available datasets in NCBI-GEO lack either the raw read files (e.g., fastq) or lack the necessary analysis pipelines to proceed from raw reads to the available processed genome coverage files. For the processed files that are not at single bp resolution, it would be difficult to convert genome coordinates from release 5 to release 6.

Second, while major improvements on dm6 over release 5 exist in the heterochromatic scaffolds, the rest of the genome has only minor changes in dm6 compared to dm5 (http://flybase.org/static_pages/feature/previous/articles/2014_07/FB2014_04.html). Finally, coordinate conversion is generally easier going from an older build to a newer one, so for the small number of affected regions or genes, researchers for whom this is necessary can use readily available conversion tools to re-map the ChAR-seq dataset, provided in NCBI GEO.

In addition, genomes from the cell lines used have not been assembled, and are likely rearranged (plus other variation, including DNA copy number) relative to the reference sequence, so the positions shown in the figures are likely to be incorrect, though we don't know where. No, the authors do not need to assemble the clone8 or Kc genomes, but they should acknowledge this issue.

We agree that the genome assemblies are lacking for some of these cell lines and have added a sentence to the Materials and methods section to indicate that there may exist variation in the positions of some datasets due to genomic differences between cell lines. We took great care in selecting a suitable cell line for a genome wide RNA-DNA mapping method: The clone 8 cell line (CME-W1-cl8+) used for the majority of the data collection in this work has a normal and stable karyotype, and is male so we could analyze the roX genes for validation. In a recent study, 19 Drosophila cell lines had their DNA sequenced to determine ploidy across the genome (Lee et al., 2014). Of all 19 commonly used cell lines, the CME-W1-cl8+ line had, by far, the most optimum ploidy across the genome, which was very close to 1X throughout, whereas the commonly used S2 lines were largely 4X with variable regions up to 8X ploidy. The CME-W1-cl8+ line is also one of the modENCODE lines and maintained by the Drosophila Genomics Resource Center.

c) It is unclear what parts of the genome assembly were included, for example were heterochromatic regions included, and if so where is that data presented? This information would help with the interpretations, especially in Figure 4D.

We have now included which regions of the genome are included in each figure legend or extended Materials and methods, including Figure 4D. For the majority of analyses, chr2Het, chr3Het, and chrXHet were included, with exceptions noted in the figure legends or Materials and methods.

d) A related point, for Figure 4D it would be helpful to report the genomic locations (middle of euchromatin? Pericentromeric? X vs. autosomes?) for RNAs whose chromatin contacts show strong correlation with specific chromatin marks. Also, please explain how the correlation signals were aggregated in Figure 4D – the methods described using 2kb bins while the figure showing 100kb bins.

The whole reference genome was used, including the heterochromatin scaffolds; however, we filtered the regions of the build determined to have high frequency mapping artifacts (i.e., the so-called blacklisted regions) and repeats, which also produce mapping artifacts. The cluster analysis in Figure 4D is genome-wide, excluding only the chrU and chrUhet scaffolds, because those are of low quality and do not represent a real biological chromosome. The DpnII normalization (i.e., z-score calculation) of ChAR-seq tracks was performed using 2-kb windows, while the Pearson correlation coefficient was calculated pairwise between each ChAR-seq track (i.e., columns in the clustergram) and each modENCODE track (i.e., rows). The Pearson correlation coefficient was calculated using 100 kb bins because we sought to use a window size slightly larger than our estimated resolution.

e) Can the authors discuss how the use of PCR to amplify ligated RNA/DNA contacts, combined with normalization to expression levels, could exclude RNAs with lower transcription levels, even if they have many (potentially important) DNA contacts? I think they are being appropriately conservative, but as with all such screens, and to help others who will certainly try it themselves, it would be good to have a brief discussion about the tradeoffs.

It is true that using PCR to amplify our molecules during the library prep could introduce biases. However, we minimized the number of PCR cycles performed by using a side PCR to estimate the minimum number of cycles required to reach a stable exponential phase. We note that by normalizing to expression levels, RNAs with lower transcription levels but many DNA contacts may actually be more likely to be called as “chromatin-bound” after our enrichment calculations. Although our current dataset may have missed some RNAs expressed at very low levels, sequencing at greater depth with sufficient library complexity can mitigate this issue, as it does with RNA-seq.

f) Some of the statements need quantitative support, such as:

- Figure 2E: what is the threshold for identifying RNAs with more than expected chromatin interaction (how was the expectation calculated?

We used two cut-offs to call RNAs chromatin enriched. First, we excluded all RNAs with fewer than 10 DNA contacts based on the cumulative frequency distribution curve as well as the fact that 10 or fewer contacts is insufficient information to make biological conclusions about each RNA. Second, we required a 10-fold enrichment of ChAR-seq contacts over expression level (FPCM). We also required more than 100 ChAR-seq contacts, which we empirically found to be a rough minimum for which to explore various properties of the RNA-DNA contact (e.g., cis/trans% , broad chromosome distribution, etc.).

- How were p-values for rox1 and rox2 (7.6 fold and 8.1 fold enrichment) calculated?

This enrichment was calculated using a one-tailed cumulative binomial test. A full description of this calculation is included in the Materials and methods.

- Figure 3E: correlation between the DNA contact locations from ChIRP and ChAR- seq – is this simply the Pearson correlation coefficient?

This is Spearman’s rank correlation, and this is now noted in the Materials and methods and new Figure 3 legend.

- Figure 4—figure supplement 1: how was the base sequence similarity calculated and used to aggregate signals for different snRNAs?

This is described in the Figure 4—figure supplement 1 legend. In short, we used GeniousR7 to calculate a pairwise distance score, then clustered the data using R.

g) Using the correlations between ChIRP and ChAR-seq to assess the resolution of ChAR-seq does not seem to be appropriate, since the preparation of sequencing libraries may be very different between these two datasets. (Also, the curve actually goes down as window size increases?!) The authors should provide more reasoning for this approach.

ChIRP, RAP, and CHART are all well established RNA hybridization and chromatin pull down methods designed to map individual chromatin associated RNAs. ChIRP, in particular, has been applied in the CME-W1-cl8+ cell line, which is why we used this specific dataset to estimate our resolution. ChIRP data have a resolution of ~1-5kb, which provides a useful lower bound, and ChIRP peaks match well with 0.5-1kb ChIP-seq peaks for the dosage compensation complex protein MSL3. The reviewer is correct that the library preparation is very different, but this is advantageous for benchmarking our method: ChIRP was established as a method for mapping chromatin interaction patterns of single RNAs, and therefore comparing the performance of the highly multiplexed ChAR-seq protocol against ChIRP-seq by selecting the same target RNA from both data sets is the most direct comparison we can make.

For both roX1 and roX2, the correlation curve increases, then plateaus. The small dip in correlation between a 250 kb bin size and a 1 Mb bin size after the plateau is likely due to the bin size being so large that differences in filtering, blacklisting, etc. between ChIRP and ChAR-seq data may cause different edge effects at ends of chromosomes or in large domains flanked by repetitive regions or heterochromatin.

How were the replicates analyzed? Were they merged? What are the correlations between replicates?

The replicates show excellent reproducibility and are now presented in Figure 1—figure supplement 5. The replicates were merged for most of the analysis, as described in the Materials and methods and the statistics included in new Supplementary file 1.

h) How was the z-score was calculated? It seems like it is calculated as comparing between the values in each bin vs. the rest of the genome. If this is the case, there may be no need to calculate a z-score as every window will have the same denominator, and it may be just as informative as using the read counts. Also, it may be informative to use the chromosome-specific mean (instead of whole genome mean) to further identify interesting local events (the current analysis may be mainly identifying between chromosome differences).

We clarified this calculation in the extended Materials and methods section. The z-score is calculated to normalize ChAR-seq mapping events relative to the DpnII frequency for a given bin size.

References:

Chow JC, Hall LL, Baldry SEL, Thorogood NP, Lawrence JB, Brown CJ. Inducible XIST-dependent X-chromosome inactivation in human somatic cells is reversible. Proceedings of the National Academy of Sciences of the United States of America. 2007;104(24):10104-10109. doi:10.1073/pnas.0610946104.

Gilbert SL, Pehrson JR, Sharp PA. XIST RNA associates with specific regions of the inactive X chromatin. J Biol Chem. 2000 Nov 24;275(47):36491-4. PubMed PMID:

ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012 Sep 6;489(7414):57-74. doi: 10.1038/nature11247. Updated blacklist for hg38 downloaded from: http://mitra.stanford.edu/kundaje/akundaje/release/blacklists/hg38-human/hg38.blacklist.bed.gz.

https://doi.org/10.7554/eLife.27024.037

Article and author information

Author details

  1. Jason C Bell

    Department of Biochemistry, Stanford University, Stanford, United States
    Present address
    Molecular Biology, 10x Genomics, Pleasanton, United States
    Contribution
    Conceptualization, Resources, Software, Formal analysis, Funding acquisition, Investigation, Methodology, Writing—original draft, Writing—review and editing
    For correspondence
    jason.bell@10xgenomics.com
    Competing interests
    No competing interests declared
    ORCID icon 0000-0001-5480-7975
  2. David Jukam

    Department of Biology, Stanford University, Stanford, United States
    Contribution
    Conceptualization, Resources, Software, Formal analysis, Funding acquisition, Investigation, Methodology, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID icon 0000-0003-4167-2754
  3. Nicole A Teran

    1. Department of Biochemistry, Stanford University, Stanford, United States
    2. Department of Genetics, Stanford University, Stanford, United States
    Contribution
    Software, Formal analysis, Investigation, Methodology, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID icon 0000-0002-9625-5010
  4. Viviana I Risca

    Department of Genetics, Stanford University, Stanford, United States
    Contribution
    Conceptualization, Resources, Software, Formal analysis, Funding acquisition, Investigation, Methodology, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID icon 0000-0003-2670-8704
  5. Owen K Smith

    1. Department of Biochemistry, Stanford University, Stanford, United States
    2. Department of Chemical and Systems Biology, Stanford University, Stanford, United States
    Contribution
    Formal analysis, Investigation, Methodology, Writing—review and editing
    Competing interests
    No competing interests declared
    ORCID icon 0000-0003-0880-2801
  6. Whitney L Johnson

    Department of Biochemistry, Stanford University, Stanford, United States
    Contribution
    Conceptualization, Resources, Funding acquisition, Investigation, Methodology, Writing—review and editing
    Competing interests
    No competing interests declared
  7. Jan M Skotheim

    Department of Biology, Stanford University, Stanford, United States
    Contribution
    Resources, Funding acquisition, Writing—review and editing
    Competing interests
    No competing interests declared
  8. William James Greenleaf

    1. Department of Genetics, Stanford University, Stanford, United States
    2. Department of Applied Physics, Stanford University, Stanford, United States
    Contribution
    Resources, Formal analysis, Funding acquisition, Methodology, Writing—review and editing
    Competing interests
    No competing interests declared
  9. Aaron F Straight

    1. Department of Biochemistry, Stanford University, Stanford, United States
    2. Department of Chemical and Systems Biology, Stanford University, Stanford, United States
    Contribution
    Conceptualization, Resources, Formal analysis, Supervision, Funding acquisition, Methodology, Project administration, Writing—review and editing
    For correspondence
    astraigh@stanford.edu
    Competing interests
    No competing interests declared
    ORCID icon 0000-0001-5885-7881

Funding

National Institutes of Health (Stanford Center for Systems Biology (NIH P50 GM107615) Seed Grant)

  • Jason C Bell
  • David Jukam
  • Viviana I Risca
  • Whitney L Johnson

National Institutes of Health (NIH Ruth Kirchstein National Research Service Award (F32GM116338))

  • Jason C Bell

Stanford University School of Medicine (Dean's Fellowship)

  • Jason C Bell

National Institutes of Health (NIH Ruth Kirchstein National Research Service Award (F32GM108295 ))

  • David Jukam

National Institutes of Health (Stanford Genetics Training Program (5T32HG000044-19))

  • Nicole A Teran

Stanford University (Walter V. and Idun Berry Fellowship)

  • Viviana I Risca

Stanford University (Grant Reference: Katharine McCormick Advanced Postdoctoral Fellowship)

  • Viviana I Risca

National Institutes of Health (Molecular Pharmacology Training Grant (NIH T32-GM113854-02))

  • Owen K Smith

National Institutes of Health (NIH T32 Training Fellowship (GM007276))

  • Whitney L Johnson

National Science Foundation (Graduate Research Fellowship (DGE-114747))

  • Whitney L Johnson

National Institutes of Health (RO1 HD085135)

  • Jan M Skotheim
  • Aaron F Straight

Howard Hughes Medical Institute (HHMI-Simons Faculty Scholar Award)

  • Jan M Skotheim

National Institutes of Health (P50HG00773501)

  • William James Greenleaf

National Institutes of Health (R21HG007726)

  • William James Greenleaf

National Institutes of Health (R01HG009909)

  • Aaron F Straight
  • William James Greenleaf

National Institutes of Health (R01GM106005)

  • Aaron F Straight

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

This project was funded by a Stanford Center for Systems Biology (NIH P50 GM107615) Seed Grant to JCB, DJ, VIR and WLJ. JCB and DJ were also supported by NIH Ruth Kirchstein National Research Service Awards (F32GM116338 to JCB) and (F32GM108295 to DJ). JCB was also supported by the Stanford School of Medicine Dean’s Fellowship. VIR was supported by the Walter V and Idun Berry Fellowship. NAT was supported by the Stanford Genetics Training Program (5T32HG000044-19). OKS was supported by the Molecular Pharmacology Training Grant (NIH T32-GM113854-02). WLJ was supported by a NIH T32 Training Fellowship (GM007276) and the National Science Foundation Graduate Research Fellowship (DGE-114747). We acknowledge support from NIH RO1 HD085135 to JMS and AFS, HHMI-Simons Faculty Scholar Award to JMS, NIH grants P50HG00773501 and R21HG007726 to WJG, R01GM106005 to AFS and R01 HG009909 to WJG and AFS. We would like to thank Julia Salzman, Kyle Eagen, and members of the Straight, Greenleaf and Skotheim labs for thoughtful discussions. We thank Lucy O’Brien for sharing equipment. We acknowledge the Drosophila Genomics Resource Center (NIH grant 2P40OD010949-10A1) for providing cell lines and the Stanford Functional Genomics Facility for providing sequencing services.

Reviewing Editor

  1. Job Dekker, University of Massachusetts Medical School, United States

Publication history

  1. Received: March 21, 2017
  2. Accepted: April 11, 2018
  3. Accepted Manuscript published: April 12, 2018 (version 1)
  4. Version of Record published: May 21, 2018 (version 2)

Copyright

© 2018, Bell et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 4,587
    Page views
  • 1,014
    Downloads
  • 2
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, Scopus, PubMed Central.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

Further reading

    1. Chromosomes and Gene Expression
    2. Plant Biology
    Corinna Speth et al.
    Research Article Updated
    1. Chromosomes and Gene Expression
    Julien Bischerour et al.
    Research Article