Defining the location of promoter-associated R-loops at near-nucleotide resolution using bisDRIP-seq

Abstract
eLife digest
Introduction
Results
Discussion
Materials and methods
Data availability
References
Article and author information
Metrics

Abstract

R-loops are features of chromatin consisting of a strand of DNA hybridized to RNA, as well as the expelled complementary DNA strand. R-loops are enriched at promoters where they have recently been shown to have important roles in modifying gene expression. However, the location of promoter-associated R-loops and the genomic domains they perturb to modify gene expression remain unclear. To resolve this issue, we developed a bisulfite-based approach, bisDRIP-seq, to map R-loops across the genome at near-nucleotide resolution in MCF-7 cells. We found the location of promoter-associated R-loops is dependent on the presence of introns. In intron-containing genes, R-loops are bounded between the transcription start site and the first exon-intron junction. In intronless genes, the 3' boundary displays gene-specific heterogeneity. Moreover, intronless genes are often associated with promoter-associated R-loop formation. Together, these studies provide a high-resolution map of R-loops and identify gene structure as a critical determinant of R-loop formation.

https://doi.org/10.7554/eLife.28306.001

eLife digest

Genes contain coded instructions for making proteins. When the cell needs to use a gene, molecular machinery assembles near the start of the gene in regions called promoters. Part of this machinery then reads along the gene, making a copy of the code in the form of a DNA-like molecule called RNA. These RNAs typically contain regions called exons, which carry the instructions, interspersed with spacer regions called introns. As RNAs are made they are 'spliced' to chop out the introns, leaving behind the final instructions.

Most DNA exists in a double helix shape with two connected DNA strands, but the regions near the start of genes often contain structures called R-loops. In these structures, one strand of the DNA partners up with a single strand of RNA, forcing the other strand to bulge out on its own. Their location at gene promoters indicates that R-loops could change the cell's use of genes by impacting the machines that assemble near the start of genes. However, R-loops are not well understood. A major barrier to understanding the role of R-loops is that we do not know exactly where they are with respect to the start of genes.

Dumelie and Jaffrey now report a new method to map R-loops almost to the resolution of single letters of the DNA code – a method which they called bisDRIP-seq. The approach extends an existing technique called DRIP-seq, which uses antibodies to capture DNA sequences stuck to strands of RNA. It can find R-loops, but it cannot tell the difference between the loop itself and the DNA surrounding it. The new technique uses a chemical called bisulfite to alter the DNA letters. It only affects the loop of the R-loop because the RNA shields the other strand. Sequencing then pinpoints the modified letters, revealing the exact location of the loop.

For human cells grown in the laboratory, the technique found that R-loops form between the start of the gene and its first intron. Some genes do not have any introns, and in these cases, the R-loops extended deep into the code. Most human genes have only a small amount of DNA between the start site and the first intron, which may act to limit the effect of R-loops in these genes.

This new technique allows the high-resolution study of R-loops, and could help to reveal their role in regulating genes. Abnormal R-loops have already been linked to a small set of human diseases like fragile-X syndrome. As the tools to study R-loops improve, it is possible that scientists will make connections to other diseases. In time, improved understanding of these structures could lead to better diagnosis, and eventually treatment, for these conditions.

https://doi.org/10.7554/eLife.28306.002

Introduction

R-loops are nucleic acid structures in which a strand of RNA is hybridized to a strand of DNA, while the other strand of DNA is looped out. Recent techniques for genome-wide mapping of R-loops revealed that promoter regions are enriched in R-loops (Ginno et al., 2012). The presence of R-loops in promoter regions raises the possibility that they may regulate gene expression. Indeed, more recent studies provided evidence that R-loops in these critical regions can alter histone modifications and are associated with changes in gene transcription (Chen et al., 2015; Colak et al., 2014; Sun et al., 2013).

A major unanswered question is the precise location of these R-loops. The location of an R-loop in a gene is likely to dictate how that R-loop can impact promoter function. This is because eukaryotic promoter regions contain multiple functional domains that have distinct roles in transcription, including transcription start sites, transcription factor-binding sites, exon-intron junctions, CpG islands, and nucleosome-associated DNA (Lenhard et al., 2012). The location of an R-loop within the promoter could influence transcription by disrupting or enhancing protein recruitment to any of these sites. Thus, understanding the precise location of R-loops can provide insight into how R-loops affect gene transcription.

A major barrier to discovering the exact location of R-loops is the low resolution of current genome-wide R-loop mapping methods like DRIP-seq (DNA-RNA immunoprecipitation sequencing) (Ginno et al., 2012). This method uses the S9.6 antibody which binds RNA-DNA hybrids (Boguslawski et al., 1986). With this antibody, genome-wide R-loop maps are created by immunoprecipitating and sequencing the genomic fragments containing RNA-DNA hybrids (Ginno et al., 2012; Sanz et al., 2016; Stork et al., 2016). However, DRIP-seq does not discriminate between the R-loop sequence and the surrounding non-R-loop sequence. Therefore, the exact boundaries of R-loops cannot be resolved using DRIP-seq.

Based on DRIP-seq and similar R-loop mapping methods, promoter-associated R-loops are thought to form within a few kilobases downstream of the transcription start site (Chédin, 2016). From these low-resolution experiments, it is not clear if R-loops have specific boundaries or if they are relatively amorphous structures that lack well-defined boundaries.

To understand where R-loops are positioned in genomic promoter regions, we developed bisDRIP-seq (bisulfite-DNA-RNA immunoprecipitation sequencing). bisDRIP-seq is an approach to map R-loops at near-nucleotide resolution throughout the genome. In this approach, we use bisulfite to selectively convert cytosine residues into uracil residues within genomic DNA regions that contain single-stranded DNA. We then identify single-stranded regions likely to be in R-loops based on preferential labeling of one strand of DNA and the requirement that the labeling be transcription dependent. Remarkably, we find that promoter-associated R-loops are typically bounded by the transcription start site and the first exon-intron junction in intron-containing genes. Thus, we find that the maximum size of promoter-associated R-loops is controlled by the location of the first exon-intron junction in intron-containing genes. We also identify prominent promoter-associated R-loop forming regions in intronless genes, including MALAT1, NEAT1, and the replication-dependent histone genes. In some of these genes, the R-loops are associated with well-defined 3’ boundaries that are located within the gene body. Thus our high-resolution map of promoter-associated R-loops defines the boundaries of R-loops and suggests a role for first exon length in regulating the formation of R-loops.

Results

bisDRIP-seq concept

It is not yet possible to define the exact size and location of R-loop-forming regions on a genome-wide scale. In contrast, it is possible to determine the exact location of a specific R-loop-forming region in a specific gene using a previously developed bisulfite mapping approach (Yu et al., 2003). Essentially, this approach involved treating genomic DNA with bisulfite under non-denaturing conditions. Bisulfite specifically causes cytosine-to-uracil conversions in the single-stranded DNA portion of R-loops. On the other hand, cytosines in the RNA-DNA hybrid portion of the R-loop are protected from bisulfite conversion. Sequencing of both strands then revealed the location and strand in which cytosines were converted to uracils. The presence of an R-loop was identified by showing that the converted cytosines occurred primarily on one of the two strands of DNA (Yu et al., 2003). The location of bisulfite-induced conversions on a single strand of DNA was then used to define the boundaries of the R-loop at near-nucleotide resolution.

This use of bisulfite to detect single-stranded regions of DNA contrasts with the use of bisulfite in mapping 5-methylcytosine in DNA. For 5-methylcytosine mapping, DNA is denatured into single strands and then all cytosines are converted to uracils, while 5-methylcytosine is poorly converted (Frommer et al., 1992). Thus, while mapping 5-methylcytosine involves searching for unconverted cytosines, mapping single-stranded DNA involves detection of converted cytosines.

The use of bisulfite to map R-loops on a genome-wide scale poses several significant challenges. First, non-physiological R-loop formation, removal, expansion, or contraction can occur between the time of lysis and the time when bisulfite is used to mark single-stranded DNA (Kaback et al., 1979; Landgraf et al., 1996). Second, high sequencing depth is needed to search for converted cytosines across the entire genome (Sims et al., 2014). Finally, bisulfite causes infrequent, but measurable, background conversions in double-stranded DNA (Yu et al., 2003), which can generate a significant amount of noise on a genome-wide scale.

To map R-loops at near-nucleotide resolution, we developed a genome-wide bisulfite-based approach called bisDRIP-seq (Figure 1A). This method incorporates steps to overcome each of the challenges listed above. In this approach, cells are lysed in the presence of bisulfite and SDS. By including bisulfite during lysis, genomic DNA structures have as little time as possible to change conformation prior to the bisulfite modification of cytosines in single-stranded DNA regions. To overcome the problem of needing high sequence coverage, the S9.6 antibody (Boguslawski et al., 1986) is used to enrich for R-loops. The S9.6 antibody has high affinity for RNA-DNA hybrids and lower affinity for other double-stranded RNA sequences (Phillips et al., 2013). Thus, after genomic DNA is sheared using restriction digestion, the S9.6 antibody enriches the bisulfite-modified R-loops for sequencing analysis.

Figure 1 with 3 supplements see all

Download asset Open asset

Near-nucleotide resolution mapping of R-loops.

(A) Diagram illustrating the work-flow of the bisDRIP-seq protocol. For DNA molecules, one DNA strand is shown in blue, while the other strand is shown in orange. Red dots indicate the location of bisulfite-induced cytosine-to-uracil conversions. (B) A high number of cytosine conversions and high bisDRIP-seq scores are observed at a specific gene locus. bisDRIP-seq cytosine conversions and high bisDRIP-seq scores are specifically observed on one strand of the *RPPH1* gene. In these plots, the template strand refers to the strand used for *RPPH1* transcription, rather than the template strand used for *PAPR2* transcription. In the top plot, cytosine-to-uracil conversions were mapped to the genomic region surrounding *RPPH1*. The number of conversions on the template strand and non-template strand were plotted below the x-axis (orange) or above the x-axis (blue), respectively. Shown are the total number of conversions observed in all bisDRIP-seq samples (n = 13). In the lower plot, the bisDRIP-seq scores were mapped to the genomic region surrounding *RPPH1* (mean bisDRIP-seq score from n = 13 samples). The bisDRIP-seq score on the template strand and non-template strand were plotted below the x-axis (orange) or above the x-axis (blue), respectively. As can be seen, the genomic region containing *RPPH1* had both a high number of cytosine-to-uracil conversions and high bisDRIP-seq scores specifically on the non-template strand. By contrast, the *PARP2* gene had minimal bisDRIP-seq conversions and low bisDRIP-seq scores. Source code for bisDRIP-seq score calculation can be found in Source code 1.

https://doi.org/10.7554/eLife.28306.003

Finally, to overcome the problem of stochastic cytosine conversions, a computational pipeline was developed to identify single-stranded regions using bisDRIP-seq data. This pipeline was developed to identify regions with high concentrations of cytosine-to-uracil conversions. This computational pipeline also identifies single-stranded regions that are likely to contain R-loops as opposed to other single-stranded DNA structures in the genome. Additionally, the pipeline was designed to reveal the specific strand orientation of R-loops.

Thus, bisDRIP-seq provides an approach to map R-loops at near-nucleotide resolution on a genome-wide scale.

Near-Nucleotide resolution mapping of R-loops on a Genome-Wide scale

We used bisDRIP-seq to map single-stranded DNA throughout the genome. Thirteen bisDRIP-seq experiments were performed on different samples of MCF-7 cells. After performing bisDRIP-seq on these samples, the DNA fragments were sequenced using a traditional post-bisulfite library preparation method (see Materials and methods).

We next aligned the sequenced reads to the genome using Bismark, an alignment approach typically used for 5-methylcytosine mapping (Krueger and Andrews, 2011). This was necessary since the conversion of cytosines to uracils would confound traditional read alignment programs. Bismark was used to map conversions associated with single-stranded DNA as follows: first, reads were aligned to the genome. Then the cytosines that had been converted to uracils were identified (Figure 1B). As expected, reads were detected that contained only a single conversion, consistent with noise due to low-level double-stranded DNA cytosine conversions. However, reads and regions were also observed that contained consecutive cytosine conversions (Figure 1—figure supplement 1A). These reads are suggestive of single-stranded DNA.

We next applied our bioinformatic pipeline to distinguish the multiple conversions seen in single-stranded DNA regions from the stochastic conversions due to background noise.

In the first part of this pipeline, the stochastic rate of conversions was estimated. This was estimated based on the overall cytosine-to-uracil conversion rate in a given sample. Reads with a high percentage of conversions were excluded from this calculation since it was assumed that those conversions were not stochastic.

Next, we generated a ‘bisDRIP-seq score’ for each read. This score was calculated based on the number of cytosines converted in the read, relative to the likelihood of observing that number of conversions by stochastic noise (see Materials and methods for more details). Next, the score was normalized so that the sum of bisDRIP-seq scores within each sample was the same.

Next, we calculated bisDRIP-seq scores for individual nucleotide positions in the genome. The bisDRIP-seq score for each nucleotide position was calculated as the sum of the bisDRIP-seq scores of all of the reads that overlapped with that nucleotide position. These scores provide a read-length resolution map of single-stranded DNA that filters out stochastic conversions and that is more comparable between genomic regions (Figure 1B).

Importantly, bisDRIP-seq scores were substantially reduced at specific gene loci in samples treated with RNase H, which degrades RNA in RNA-DNA hybrids (Figure 1—figure supplement 2). This supports the idea that elevated bisDRIP-seq scores reflect R-loops.

We next performed two basic tests of the quality of data produced by bisDRIP-seq:

First, we asked whether reads with high bisDRIP-seq scores are randomly distributed across the genome or whether they are clustered in specific regions. To simulate a random distribution, Monte Carlo simulations were applied to our data. Relative to these Monte Carlo simulations, bisDRIP-seq scores were found to be clustered in specific regions (Figure 1—figure supplement 1B–D).

Second, we performed correlation tests between the bisDRIP-seq scores obtained in each of our samples. In all cases, there was significant correlation between bisDRIP-seq samples (p<10⁻¹⁶, Spearman's rank-correlation test, Figure 1—figure supplement 3).

bisDRIP-seq scores show transcription-dependent enrichment in promoter regions

We next wanted to know if bisDRIP-seq scores are associated with promoter regions that contain R-loops. R-loops were previously mapped to promoter regions (Ginno et al., 2012), where they are thought to play important roles in gene expression (Chen et al., 2015). These R-loops, as mapped using DRIP-seq, were found to be transcription dependent and correlated with gene expression (Sanz et al., 2016). We therefore wanted to determine if bisDRIP-seq scores have a similar enrichment and transcription dependence in active promoter regions.

In order to investigate promoter regions, we compiled a list of transcription start sites using the GENCODE database (Harrow et al., 2012) (see Materials and methods). We then defined promoter regions, for the purpose of our analyses, as the region one kilobase on either side of each of these transcription start sites.

We next performed several simple analyses to ensure that the results from bisDRIP-seq experiments recapitulate the promoter-region enrichment observed in DRIP-seq studies. First, bisDRIP-seq scores, like DRIP-seq reads, were found to be enriched in promoter regions relative to downstream exon-containing regions (Figure 2A and Figure 2—figure supplement 1A). Next, we found that the number of DRIP-seq reads correlates with bisDRIP-seq scores in individual promoter regions (p<2.2×10⁻¹⁶, Spearman's rank-correlation test, Figure 2—figure supplement 1B). Thus, bisDRIP-seq scores, like DRIP-seq reads, are enriched in promoter regions.

Figure 2 with 1 supplement see all

Download asset Open asset

Transcription-dependent R-loops form in active promoter regions.

(A) bisDRIP-seq scores in promoter regions tend to be higher than in matched exon-containing regions. R-loops were previously mapped to promoter regions (Ginno et al., 2012). To determine if bisDRIP-seq scores also map to promoter regions, we compared bisDRIP-seq scores in promoter regions with matched exonic regions. Promoter regions (blue) were defined as the region within a thousand base pairs of each transcription start site. For each promoter region, a matched region (orange) was selected downstream of the promoter region in the same gene centered on an exonic site chosen at random. The distribution of bisDRIP-seq scores (y-axis, mean bisDRIP-seq score from n = 13 samples) was plotted for promoter regions (blue, n = 60016) and matched regions (orange, n = 60016) using a violin plot. Within each violin plot, the fraction of genes with a given bisDRIP-seq score are represented by the width of the overlapping violin plot. Individual lines in the violin plot represent quartiles. bisDRIP-seq scores were significantly higher in promoter regions relative to matched exon-centered regions. ‘TSS’ refers to the transcription start site. The y-axis in the plot was log2 transformed. ***p<2.2×10⁻¹⁶, Wilcoxon signed-rank test. (B) R-loop formation correlates with promoter activity. Based on previous studies (Sanz et al., 2016), R-loops are expected to form in active promoter regions, rather than in inactive promoter regions. R-loops can be identified by bisDRIP-seq based on preferential labeling of the non-template strand. Therefore, we compared the bisDRIP-seq scores on the non-template strand to the scores on the template strand and determined whether this correlated with promoter activity. Promoter activity and bisDRIP-seq scores were assessed in the region between the transcription start site and + 1000 bp. bisDRIP-seq scores were assessed separately for the non-template (blue) and template (orange) strands. For each strand, a LOESS smoothed curve was plotted of the bisDRIP-seq scores (y-axis) at different levels of promoter activity (x-axis). This was repeated for control-treated samples (solid, mean bisDRIP-seq score from n = 13 samples) and samples treated with the transcription-inhibitor triptolide (dashed, mean bisDRIP-seq score from n = 2 samples). bisDRIP-seq scores on both strands are correlated with promoter activity. Notably, with increasing promoter activity, the non-template strand is preferentially labeled. This suggests that sense-strand R-loop form in these promoters. Promoter activity was assessed using a MCF-7 GRO-seq dataset from Hah et al., 2013. Both promoter activity and bisDRIP-seq scores were plotted on log2-transformed axes. Source data for figure included in Figure 2—source data 1. (C) Transcription-dependent R-loops form in active promoters. The presence of R-loops is suggested by bisDRIP-seq strand asymmetry as illustrated in Figure 2D. Here, strand asymmetry was calculated as the log2-fold ratio of the bisDRIP-seq score of the non-template strand relative to the bisDRIP-seq score of the template strand (y-axis). The distribution of strand asymmetry for promoter regions with high promoter activity (right, GRO-seq > 24 reads, n = 4895 promoter regions), as well as an equivalent number of inactive promoter regions (left, GRO-seq = 0 reads, n = 4895 promoter regions) was plotted using a violin plots. This was repeated for control-treated samples (green, mean bisDRIP-seq score from n = 13 samples) and triptolide-treated samples (pink, mean bisDRIP-seq score from n = 2 samples). Active promoter regions typically had higher non-template bisDRIP-seq scores than template-strand bisDRIP-seq scores in control-treated samples. This strand asymmetry was significantly reduced in triptolide-treated samples. These results suggest that there were transcription-dependent R-loops in active promoter regions. Promoter activity was assessed using a GRO-seq dataset from Hah et al., 2013. The width of violin plots represents the fraction of genes with the strand asymmetry plotted on the y-axis. The individual lines in violin plots represented quartiles. ***p<2.2×10⁻¹⁶, Wilcoxon signed-rank test. (D) Simple models of the structures that may explain the high bisDRIP-seq scores observed 3' of the transcription start site in Figure 2B. As illustrated, the sense-strand R-loops (top row) logically explain the strand ‘positive’ asymmetry observed in bisDRIP-seq scores. Additionally, the high bisDRIP-seq scores observed on both DNA strands of active promoters are likely explained by some combination of all three types of structure (all three rows). In these models, the vertical black hash marks between nucleic acid strands indicate that two strands are hybridized. Red circles refer to the location of bisulfite induced cytosine-to-uracil conversions. ‘asTSS’ refers to transcription start sites for antisense transcription. Source code for calculating bisDRIP-seq region scores can be found in Source code 2.

https://doi.org/10.7554/eLife.28306.007

Figure 2—source data 1 bisDRIP-seq scores of promoter regions.: https://doi.org/10.7554/eLife.28306.009
Download elife-28306-fig2-data1-v2.xls

We next asked if bisDRIP-seq enrichment in promoter regions depends on active transcription. To test if bisDRIP-seq enrichment requires transcription, bisDRIP-seq was repeated using MCF-7 cells treated with the transcription-inhibitor triptolide (Kupchan et al., 1972; Titov et al., 2011; Vispé et al., 2009). bisDRIP-seq enrichment was reduced in these samples (Figure 2—figure supplement 1C). This suggests that bisDRIP-seq enrichment in promoter regions depends on ongoing transcription.

We next asked if the bisDRIP-seq scores in promoter regions are correlated with promoter activity. Promoter activity was assessed using existing GRO-seq datasets from MCF-7 cells cultured in a similar manner to our cells (Hah et al., 2013; Hah et al., 2011). GRO-seq measures the presence of active polymerases on a genome-wide scale (Core et al., 2008). Thus, these GRO-seq datasets allow us to distinguish between promoter regions with low and high promoter activity. We therefore compared the number of GRO-seq reads with bisDRIP-seq scores in promoter regions (Figure 2—figure supplement 1D). In this analysis, GRO-seq-measured promoter activity correlates with bisDRIP-seq scores (p<2.2×10⁻¹⁶, Spearman's rank-correlation test). Thus, bisDRIP-seq scores are enriched in active promoter regions.

Notably, the enrichment of bisDRIP-seq scores in active promoter regions was reduced when RNA synthesis was blocked by triptolide treatment (Figure 2—figure supplement 1D).

Together, these results confirm that bisDRIP-seq scores, like DRIP-seq reads, are enriched in active promoter regions.

Transcription-dependent R-loops form downstream of transcription start sites

We next wanted to determine if the bisDRIP-seq score enrichment in promoter regions is due to co-transcriptional R-loops. Conceivably, cytosine conversions could occur if single-stranded DNA is exposed as a result of other single-stranded DNA structures near transcription start sites, including unwound DNA due to supercoiling (Hsieh and Wang, 1975), G-quadruplexes (Sen and Gilbert, 1988), or genomic regions that contain paused polymerases (Core et al., 2008) that become more accessible to bisulfite after SDS treatment.

Relative to other single-stranded DNA structures, R-loops are known to produce a specific cytosine conversion signature upon bisulfite treatment. Specifically, cytosine conversions are limited to one strand of DNA in an R-loop. In other types of genomic structures that expose single-stranded DNA, cytosines on both strands of DNA can be converted. Thus efforts to map R-loops with bisulfite must demonstrate preferential labeling of one strand of DNA (Yu et al., 2003).

Cytosine conversions can also occur as a result of two types of R-loops: ‘sense-strand R-loops’ and ‘antisense-strand R-loops.’ Sense-strand R-loops refer to R-loops in which the RNA component of the R-loop is transcribed from the annotated promoter in the expected direction. In these R-loops, the non-template strand of DNA is exposed for bisulfite-mediated conversion. Antisense-strand R-loops, on the other hand, contain RNA that is transcribed from the opposite DNA strand. In this case, the template strand of DNA for the canonical gene transcript would be modified by bisulfite. Thus, the specific strand of DNA that exhibits bisulfite-mediated cytosine conversions indicates if a sense-strand R-loop or antisense-strand R-loop is present.

In order to distinguish between R-loops and other single-stranded DNA structures, we took advantage of the cytosine conversion strand asymmetry that is expected to result from R-loops. In other types of genomic structures that expose single-stranded DNA, cytosines on both strands of DNA can be converted.

To determine if the observed bisDRIP-seq enrichment downstream of the transcription start site in promoter regions is caused by sense-strand R-loops, we repeated our comparison of bisDRIP-seq score with promoter activity. However, here bisDRIP-seq scores were plotted separately for the template and non-template strands of DNA (Figure 2B). With increasing promoter activity we observed increasing bisDRIP-seq scores on both strands, with preferential labeling of the non-template strand of DNA. Thus, although both strands of DNA had bisulfite-induced cytosine conversions, we also observed asymmetric labeling of the non-template strand (Figure 2C). This is consistent with a mixture of single-stranded DNA structure and sense-strand R-loops in active promoter regions (Figure 2D).

Notably, the higher bisDRIP-seq scores on the non-template strand than the template strand were largely eliminated in triptolide-treated samples (Figure 2B and C). This suggests that the sense-strand R-loops in the promoter region are transcription dependent.

The transcription start site is the 5' boundary of promoter-associated R-loops

The exact starting position and ending positions of promoter-associated R-loops remains unclear. This is due to the low resolution of conventional R-loop mapping methods (Chen et al., 2015; Ginno et al., 2012). We wanted to take advantage of the high resolution of bisDRIP-seq to map the exact boundaries of R-loops in promoter regions.

We first asked where R-loops are located in relation to transcription start sites. In many promoters, transcription initiates from multiple nearby transcription start sites (Carninci et al., 2006). This creates a practical limit to the precision that we can achieve in mapping the location of R-loops relative to transcription start sites.

We first mapped R-loops relative to all transcription start sites using metaplots of bisDRIP-seq scores. First, bisDRIP-seq scores were calculated for each nucleotide position surrounding the transcription start site in all individual promoter regions. Promoter regions were defined using the GENCODE database described above. Then, the bisDRIP-seq score at a given nucleotide position relative to the transcription start site was summed across all promoter regions. These scores were then plotted separately for the non-template strand and the template strand (Figure 3A).

Figure 3 with 4 supplements see all

Download asset Open asset

The 5' boundary of promoter-associated R-loops is located at the transcription start site.

(A) Metaplot analysis of bisDRIP-seq scores reveals the location of promoter-associated R-loops. To determine the location of R-loops within promoter regions at read-length resolution, we used bisDRIP-seq score metaplot analysis. Metaplots were created by summing the bisDRIP-seq scores across all promoter regions (n = 78218) at each nucleotide position relative to the transcription start site. The score was calculated separately for the nucleotide position on the non-template strand (blue) and template strand (orange). Metaplots were then plotted for control-treated samples (solid, mean bisDRIP-seq score from n = 13 samples) and triptolide-treated samples (dashed, mean bisDRIP-seq score from n = 2 samples). In control-treated samples, bisDRIP-seq scores increase near the transcription start site for both strands. However, bisDRIP-seq scores were greater on the non-template strand than on the template strand immediately 3' of the transcription start site. This suggests that R-loops formed immediately 3' of the transcription start site. ‘TSS’ indicates the location of the transcription start site. **p<0.005, Wilcoxon signed-rank test. (B) Model of the sense-strand R-loops forming 3' of the transcription start site in Figure 3A. The location where strand asymmetry is observed in the bisDRIP-seq metaplot suggests that R-loops form 3' of the transcription start site. Additionally, the triptolide sensitivity of the bisDRIP-seq score asymmetry suggests that R-loops contain newly transcribed RNA. In this model, the black lines between nucleic acid strands indicate that two strands are hybridized. (C) Background subtraction more clearly reveals the location of R-loops. Triptolide-resistant bisDRIP-seq scores do not appear to reflect the presence of R-loops in Figures 2B and 3A. As such, we repeated our metaplot analysis using triptolide-sensitive bisDRIP-seq scores. Metaplots of the template (orange) and non-template (blue) triptolide-sensitive scores (y-axis) in all promoter regions (n = 78218) were plotted relative to the transcription start site. The preferential labeling of the non-template strand immediately 3' of the transcription start site is more apparent in this plot. Triptolide-sensitive non-template bisDRIP-seq scores were generated by subtracting the triptolide-treated sample scores (mean bisDRIP-seq score from n = 2 samples) from the control-treated sample scores (mean bisDRIP-seq score from n = 13 samples). Triptolide-sensitive template bisDRIP-seq scores were generated in the same manner. (D) Metaplot of ‘R-loop signal’ reveals that R-loops form at the transcription start site. To better visualize the location of R-loops, we directly examined the difference in the bisDRIP-seq scores of the two strands by generating a metaplot of R-loop signal. R-loop signal was defined as the triptolide-sensitive template-strand bisDRIP-seq score subtracted from the triptolide-sensitive non-template bisDRIP-seq score. A metaplot of R-loop signal (y-axis) was plotted for all promoter regions (n = 78218) relative to the transcription start site. R-loop signal (blue) was highest in the region immediately 3' of the transcription start site and decreased 200–250 bp downstream of the transcription start site. R-loop signal at each nucleotide position was derived using the mean bisDRIP-seq score from n = 13 control-treated samples and mean bisDRIP-seq score from n = 2 triptolide-treated samples. (E) Metaplot of bisDRIP-seq scores in active promoter regions. In Figure 2C,R-loops appeared to predominantly form in active promoter regions. Thus metaplot analysis was repeated using only active promoter regions. A metaplot of bisDRIP-seq scores was created for the non-template strand (blue) and template strand (orange) across only active promoter regions (n = 15644). bisDRIP-seq scores were plotted on separate lines for control-treated samples (solid, mean bisDRIP-seq score from n = 13 samples) and triptolide-treated samples (dashed, mean bisDRIP-seq score from n = 2 samples). In control-treated samples, the location in the metaplot where the non-template bisDRIP-seq scores are higher than the template bisDRIP-seq scores is the same as in the metaplot for all promoters (Figure 3A). However, the difference between the strands is more pronounced in this metaplot of just active promoter regions. In this plot, active promoters refers to promoters with activity in the top twenty percentile as calculated using the GRO-seq dataset from Hah et al., 2013. (F) R-loops are only clearly observed in active promoter regions. A metaplot of R-loop signal was plotted for promoter regions in the top twenty percentile of promoter activity (green, n = 15644) and for promoter regions in the bottom eighty percentile of promoter activity (purple, n = 62574). There is no clear positive R-loop signal observed in promoter regions with lower promoter activity. By contrast, in active promoters the R-loop signal 3' of the transcription start site appears to be as high as the R-loop signal observed in all promoter regions (Figure 3D). R-loop signal at each nucleotide position was derived using the mean bisDRIP-seq score from n = 13 control-treated samples and mean bisDRIP-seq score from n = 2 triptolide-treated samples. Source code for metaplots can be found in Source code 5.

https://doi.org/10.7554/eLife.28306.010

The resulting metaplot suggests that a mix of R-loops and other single-stranded structures surround the transcription start site. The presence of single-stranded DNA at transcription start sites is suggested by the peak of bisDRIP-seq scores near the transcription start site (Figure 3B). The location of these single-stranded structures is consistent with previous maps of single-stranded DNA (Kouzine et al., 2013). However, the presence of R-loops is specifically suggested by asymmetric, preferential labeling of the non-template strand (Figure 3B). Indeed, the non-template strand bisDRIP-seq scores are significantly higher than the template strand bisDRIP-seq scores immediately 3’ of the transcription start site (p<0.005, Wilcoxon signed-rank test) (Figure 3A, Figure 3—figure supplement 1B). This suggests that the transcription start site is the 5’ boundary of promoter-associated R-loops.

We next asked whether the R-loops bounded by the transcription start site are dependent on transcription. We repeated our metaplot analysis using the samples treated with triptolide. In these samples, there was minimal difference between the template and non-template strand bisDRIP-seq scores (Figure 3A). The small remaining difference in scores upon triptolide-treatment may reflect a real difference, but it may also reflect noise in our measurements. Overall, the loss of bisDRIP-seq score strand asymmetry upon triptolide treatment demonstrates that the enrichment of R-loops 3' of the transcription start site requires transcription.

The transcription start site boundary was even more apparent after applying a background correction. To do this, we defined an ‘R-loop signal,’ which reflects the difference in bisDRIP-seq labeling of the non-template strand from the template strand after using the triptolide-treated samples for background correction. Thus, the template strand metaplot from the triptolide-treated sample was subtracted from the template strand metaplot from the control sample. The same background correction was used for the non-template strand metaplot. This background correction further enhances the demarcation of the transcription start site as the 5’ boundary of promoter-associated R-loops (Figure 3C). Next, we subtracted the template strand metaplot from the non-template metaplot to generate a metaplot of the R-loop signal (Figure 3D). In this plot, the 5’ boundary of R-loop signal at the transcription start site is very pronounced.

We repeated this analysis on promoters that have either high promoter activity or low promoter activity (Figure 3E and F, Figure 3—figure supplement 1C, Figure 3—figure supplement 2). A peak of R-loop signal was only observed immediately 3' of the transcription start site in the analysis of active promoters. This is consistent with our previous analysis showing that R-loops specifically form 3' of the transcription start site in active promoters.

Notably, the observed difference in strand bisDRIP-seq scores 3' of the transcription start site in active promoters is lost after RNase H treatment (Figure 3—figure supplement 3). This confirms that the R-loop signal observed in active promoter regions is caused by R-loops.

Taken together, these data indicate that the transcription start site demarcates the 5’ boundary of promoter-associated R-loops. Additionally, since only noise was observed from promoter regions with low promoter activity, these regions were removed from future analysis unless otherwise noted.

The first exon-intron junction acts as a 3' boundary to promoter-associated R-loops

We next wanted to know if there is a 3’ boundary to R-loops. Based on the metaplot analysis in Figure 3D,R-loop signal drops approximately 200–250 bp downstream of the transcription start site. This is further from the transcription start site than the typical first post-transcription start site nucleosome (Schones et al., 2008) and the location of promoter-proximal RNA polymerase II pausing (Core et al., 2008), suggesting that these features probably do not impede R-loop expansion. On the other hand, 200–250 bp is reasonably close to the median distance between the transcription start site and the first exon-intron junction, which is 181 bp in our dataset (see Materials and methods). Also, previous studies found that knockdown of the 5' splice site-binding factor SRSF1 induces the formation of R-loops (Li and Manley, 2005), which could be explained if splicing is involved in bounding R-loop expansion. These pieces of evidence suggested that the first exon-intron junction might act as the 3' boundary to R-loop expansion in promoter regions.

We therefore asked where R-loops are located relative to the first exon-intron junction. We used the 5' end of the first intron as the reference point for a metaplot of bisDRIP-seq scores. Intronless genes were not considered in this analysis. In these metaplots, we observed that the bisDRIP-seq scores on the non-template strand are significantly higher than on the template strand immediately 5' of the first exon-intron junction (Figure 4A and Figure 4—figure supplement 1A). This strand asymmetry in bisDRIP-seq scores drops 3' of the exon-intron junction. This suggests that the 3’ end of R-loops are bounded by the first exon-intron junction.

Figure 4 with 1 supplement see all

Download asset Open asset

The first exon-intron junction appears to act as a 3' boundary to R-loops.

(A) Strand asymmetry in bisDRIP-seq scores ends at the first exon-intron junction. To determine where R-loops are located relative to the first exon-intron junction, a metaplot of bisDRIP-seq scores was generated relative to the first exon-intron junction. bisDRIP-seq scores were calculated for control-treated samples (solid, mean bisDRIP-seq score from n = 13 samples) and triptolide-treated samples (dashed, mean bisDRIP-seq score from n = 2 samples). These mean bisDRIP-seq scores were then summed at each position relative to the first exon-intron junction for all intron-containing gene promoter regions (n = 14538). These values were plotted separately for the template strand (orange) and for the non-template strand (blue). Immediately 5' of the exon-intron junction, the non-template bisDRIP-seq scores were greater than the template bisDRIP-seq scores in control-treated samples. This difference in bisDRIP-seq scores between the two strands was eliminated almost immediately 3' of the exon-intron junction. This suggests that the first exon-intron junction acted as a 3' boundary to promoter-associated R-loops. (B) R-loop-forming regions expand further from both the transcription start site and the exon-intron junction in promoter regions with larger first exons. To test if the first exon-intron junction was acting as a boundary to R-loops, we created metaplots of R-loop signal in bins of promoter regions with different first-exon sizes. Promoter regions were binned into five groups based on the size of their first exon (n = 2907 or 2908 per bin). On the left side of the panel, metaplots of R-loop signal (blue) centered on the transcription start site were plotted for each bin of promoter regions. A vertical, dashed line (red) indicates the 3’-most location where R-loop signal was at half of the maximum signal in the metaplot. The arrow pointing to the dashed line indicates the distance from the line to the transcription start site. In the middle of the panel are schematics indicating the range of first-exon sizes observed in the bin of promoter regions that is examined in the adjacent metaplots. As illustrated, the smallest first exons are examined in the bin displayed in the top row and each subsequent row examines a bin containing larger first exons. On the right side of the panel, metaplots of R-loop signal (blue) centered on the first exon-intron junction were plotted for each bin of promoter regions. A vertical, dashed line (red) indicates the 5'-most location where R-loop signal was at half of the maximum signal in the metaplot. The arrow pointing to the dashed line indicates the distance from the line to the transcription start site. In metaplots representing genes with longer first exons, R-loop signal extended further 3' from the transcription start site and further 5' of the first exon-intron junction. These results suggest that R-loops were typically bounded between the transcription start site and the first exon-intron junction. ‘TSS’ refers to the transcription start site. In each case, R-loop signal at each nucleotide position was derived using the mean bisDRIP-seq score from n = 13 control-treated samples and mean bisDRIP-seq score from n = 2 triptolide-treated samples. (C) The 3' boundary of promoter-associated R-loops is within a few base pairs of the first exon-intron junction. To determine, at near-nucleotide resolution, the 3' R-loop boundary relative to the first exon-intron junction, we generated a metaplot of bisDRIP-seq-conversion asymmetry. To generate this metaplot, the total number of conversions was summed at each position relative to the first exon-intron junction for all intron-containing gene promoter regions (n = 14538). This was performed separately for each strand of the control-treated sample (mean of n = 13 samples) and then these values were background-corrected by subtracting the same values from the triptolide-treated samples (mean of n = 2 samples). Finally, the strand asymmetry of conversions was calculated as the log ratio of the number of conversions on the non-template strand relative to the template strand. This strand asymmetry of conversions (y-axis) was plotted for each position relative to the exon-intron junction (x-axis). In this metaplot, there is asymmetry in the strand orientation of conversions immediately 5' of the exon-intron junction, with more conversions on the non-template strand. However, within a few base pairs 3' of the exon-intron junction, conversions appear to be equally distributed on both the template and non-template strand. The consensus splice site confounds this analysis to some extent at the exact splice site. Nevertheless, this analysis suggests that the 3' R-loop boundary is located within base pairs of the first exon-intron junction. See Figure 4—source data 1 for source data regarding the set of exon-intron junctions studied in Figure 4A and B.

https://doi.org/10.7554/eLife.28306.015

Figure 4—source data 1 Location of exon-intron junctions analyzed in Figure 4.: https://doi.org/10.7554/eLife.28306.017
Download elife-28306-fig4-data1-v2.txt

We further tested the idea that the first exon-intron junction is the 3’ boundary of R-loops. To do this, we asked if the size of the R-loop-forming region in promoter regions increases with the length of the first exon. To test this question, metaplot analysis was repeated on groups of promoter regions with different sized first exons. First, promoter regions were binned into five groups based on the annotated size of the first exon. Then, metaplots were created of the R-loop signal centered around either the transcription start site or the first exon-intron junction (Figure 4B). Strikingly, in groups of promoter regions with longer first exons, the R-loop signals are also longer. This supports the idea that R-loops are bounded at both the transcription start site and the first exon-intron junction.

We next asked if we could map the location of the 3' R-loop boundary at near-nucleotide resolution. bisDRIP-seq scores are shared across entire reads, which limits the resolution of using bisDRIP-seq scores to read length resolution. On the other hand, cytosine-to-uracil conversions should map R-loops at near-nucleotide resolution. We therefore generated a metaplot of the strand asymmetry of cytosine conversions relative to the first exon-intron junction. First, the number of conversions on either strand were background corrected by subtracting the number of conversions observed in our triptolide bisDRIP-seq data. Next, the log ratio of conversions on the non-template strand relative to the template strand was plotted at each site relative to the exon-intron junction (Figure 4C). In this metaplot there is a decrease in the relative number of conversions on the non-template strand within base pairs of the exon-intron junction (Figure 4—figure supplement 1B). Thus, it appears that the R-loop boundary is within a few base pairs of the first exon-intron junction.

We next asked whether other exon-intron junctions also act as R-loop boundaries. In particular, we focused on the junctions between the first intron and the second exon, as well as the junction between the second exon and the second intron. We repeated our metaplot analysis by plotting R-loop signal at each position relative to the given exon-intron or intron-exon junction (Figure 4—figure supplement 1A). No clear peak in R-loop signal is observed near these downstream intron-exon junctions. This suggests that only the first exon-intron junction acts as a boundary to R-loop formation.

Together, these results suggest that there is a boundary to R-loop formation located within base pairs of the first exon-intron junction.

bisDRIP-seq reveals evidence for antisense-strand R-loops

We noticed that there is negative R-loop signal upstream of the transcription start site (Figure 3D and F). This could be caused by ‘antisense-strand R-loops,’ i.e., with antisense RNA transcripts hybridized to the annotated non-template strand of DNA (See Figure 3—figure supplement 4C for this structure). Antisense-strand R-loops would result in more prominent bisulfite conversions on the annotated template strand. This type of labeling is opposite from the non-template strand labeling that is caused by the predominant type of R-loop that forms from sense transcription.

We considered that antisense transcription could lead to antisense-strand R-loops that generate these negative R-loop signals. To test this, we calculated the antisense-transcription activity of each promoter region upstream of the transcription start site using the previously described GRO-seq dataset (Hah et al., 2013). In promoter regions with high antisense-transcription promoter activity, the template-strand bisDRIP-seq scores upstream of the transcription start site were significantly higher than on the non-template strand (Figure 3—figure supplement 4A and B). These data indicate a correlation between antisense transcription and antisense R-loops.

We next used the promoters that showed the highest level of antisense-transcription promoter activity to generate a metaplot of R-loop signal. In this metaplot, the negative R-loop signal was prominent upstream of the transcription start site (Figure 3—figure supplement 4D). Taken together, these data suggest that some promoter regions contain antisense-strand R-loops and that this is linked to antisense transcription in these promoter regions.

R-loops are observed in the promoter regions of intronless genes

We next wanted to identify the promoters that show the strongest association with transcription-dependent R-loops. Any unique features associated with these promoters may be directly related to the R-loops forming at these promoter regions. We searched for promoter regions with two major features: First, we searched for promoter regions that showed disproportionately high bisDRIP-seq score on the non-template strand compared to the template strand. Second, we searched for promoter regions where the majority of the bisDRIP-seq score on the non-template strand was lost upon triptolide treatment. In this analysis, we noticed a set of promoter regions that exhibited both of these features (Figure 5A). We therefore ranked genes based on the sum of these two features to identify the genes that show the strongest association with R-loop structures (Table 1).

Figure 5 with 3 supplements see all

Download asset Open asset

Replication-dependent histone genes frequently form R-loops.

(A) Genes that are strongly associated with promoter-associated R-loops were identified using two criteria. First, transcription-dependent R-loop formation was identified based on the triptolide-sensitivity of the promoter region bisDRIP-seq score (x-axis). Second, R-loop formation was identified by the strand asymmetry of the promoter region bisDRIP-seq score (y-axis). Promoter regions for all genes except those encoding replication-dependent histones (n = 2064, blue) were plotted on these two axes alongside a linear regression model (blue line). Promoter regions for replication-dependent histones (n = 13, orange) were plotted alongside a separate linear regression model (orange line). In the resulting plot, there appears to be a group of genes with relatively high scores on both axes. This suggests that these genes are strongly associated with transcription-dependent R-loops formation. This set of genes included the replication-dependent histones genes and the indicated lncRNA genes *MALAT1* and *NEAT1.* In this plot, all bisDRIP-seq score measurements were made in the region between the transcription start site and + 250 bp. Triptolide-sensitivity was calculated as the non-template bisDRIP-seq score in triptolide treated-samples (mean bisDRIP-seq score from n = 2 samples) subtracted from the non-template bisDRIP-seq score from control-treated samples (mean bisDRIP-seq score from n = 13 samples). Strand asymmetry was calculated as the log2 ratio of the non-template bisDRIP-seq score to the template bisDRIP-seq score in control-treated samples (mean bisDRIP-seq score from n = 13 samples). The shaded areas around both linear regression models represent 95% confidence intervals. (B) Promoter activity does not explain the association between histone genes and R-loop formation. R-loop formation was calculated for each promoter region by taking the sum of the x-axis and y-axis values from Figure 5A. R-loop formation was then plotted against promoter activity for each promoter region. Promoter regions for non-histone genes (n = 2064, blue) were plotted alongside a linear regression model (blue) for these genes. Promoter regions for replication-dependent histone genes (n = 13, orange) were plotted alongside a separate linear regression model (orange). Promoter activity was correlated with R-loop formation. Nonetheless, all of the examined replication-dependent histone genes have higher R-loop formation scores than most of the genes with similar promoter activity. Promoter activity (x-axis) was calculated between the transcription start site and +250 bp using the GRO-seq dataset from Hah et al., 2013. R-loop formation was calculated as the sum of bisDRIP-seq score strand asymmetry and triptolide-sensitivity. These values were derived from the mean bisDRIP-seq scores of n = 13 control-treated samples and mean bisDRIP-seq scores of n = 2 triptolide-treated samples. The shaded areas around both linear regression models represent 95% confidence intervals. (C) R-loops are prevalent in the promoter regions of replication-dependent histone genes. The location of R-loops in the promoter regions of replication-dependent histone genes was examined using metaplot analysis. A metaplot was generated of bisDRIP-seq scores relative to the transcription start sites of all replication-dependent histone genes (n = 69). bisDRIP-seq scores were plotted separately for the template (orange) and non-template (blue) strands. This was repeated for control-treated samples (solid, mean bisDRIP-seq score from n = 13 samples) and triptolide-treated samples (dashed, mean bisDRIP-seq score from n = 2 samples). The bisDRIP-seq scores on the non-template strand were higher than the scores on the template strand throughout the promoter region 3' of the transcription start site. Additionally, almost no signal was observed in the triptolide-treated samples. These results suggest that R-loops are the predominant structure observed by bisDRIP-seq in the promoter regions of this class of genes. ‘TSS’ indicates the location of the transcription start site. (**D,E**) Gene-specific heterogeneity in the 3' R-loop boundaries of histone genes. To examine whether promoter-associated R-loops have 3' boundaries in intronless genes, R-loop signal was examined in select histone genes. The observed heterogeneity in 3' R-loop boundaries is represented by (E) *HIST1H2BG* and (F) *HIST1H1E*. At the top of each panel are the gene loci and the sense-strand RNA-seq reads that map to the loci containing either gene. In the middle of each panel is the R-loop signal (y-axis) plotted across the genomic loci containing each gene (blue line). In the lower plot, cytosine-to-uracil conversions were mapped to the genomic loci containing each gene. The number of conversions on the template strand and non-template strand were plotted below the x-axis (orange) or above the x-axis (blue), respectively. Shown are the total number of conversions observed in all bisDRIP-seq samples (n = 13). The sharp drop in R-loop signal near the start of the *HIST1H2BG* gene contrasts with the relatively stable R-loop signal observed in *HIST1H1E*. Transcription start sites are indicated by both ‘TSS’ and by dashed vertical lines. R-loop signal at each nucleotide position was derived using the mean bisDRIP-seq score from n = 13 control-treated samples and mean bisDRIP-seq score from n = 2 triptolide-treated samples. See Figure 5—source data 1 for source data for Figure 5A and B.

https://doi.org/10.7554/eLife.28306.018

Figure 5—source data 1 R-loop scoring of promoters studied in Figure 5A, B.: https://doi.org/10.7554/eLife.28306.022
Download elife-28306-fig5-data1-v2.txt

Table 1

The 25 genes that were most strongly associated with transcription-dependent sense-strand R-loop structures in the region between the transcription start site and + 250 bp.

https://doi.org/10.7554/eLife.28306.023

Gene	Histone	Single exon	Genetype
MALAT1	no	yes	lncRNA
CTB-58E17.1	no	yes	lncRNA
RHOB	no	yes	protein coding
NEAT1	no	yes	lncRNA
HIST1H2BC	yes	yes	protein coding
STX16	no	no	protein coding
ARFIP2	no	no	protein coding
HIST1H2BG	yes	yes	protein coding
XBP1	no	no	protein coding
TM7SF2	no	no	protein coding
DNAJB1	no	no	protein coding
HIST1H2BK	yes	yes	protein coding
MIEN1	no	no	protein coding
NFKBIA	no	no	protein coding
RP11-166B2.1	no	no	protein coding
SRSF3	no	no	protein coding
LASP1	no	no	protein coding
RPPH1	no	yes	ribozyme
HIST4H4	yes	yes	protein coding
RPL23	no	no	protein coding
HIST1H2BD	yes	yes	protein coding
GSS	no	no	protein coding
HIST1H1E	yes	yes	protein coding
SRSF7	no	no	protein coding
TPM3	no	no	protein coding

Two important classes of genes were identified in this analysis and both lack introns. First, six of the top 25 genes (24%) and nine of the top 50 genes (18%) are replication-dependent histone genes. The second class of genes encode intronless noncoding RNAs, including MALAT1, NEAT1, RPPH1, and CTB-58E17.1. We noticed that two other top hits, the protein-coding genes RHOB and JUNB, are also intronless genes.

In general, intronless genes appear to be enriched among the promoters that show strong association with R-loop structures. In total, 44% of the top 25 genes in our list are intronless, compared to approximately 2% of long non-coding RNAs (lncRNAs) and 3% of protein-coding genes (Derrien et al., 2012; Louhichi et al., 2011). The presence of R-loops in the promoter regions of these intronless genes suggests that the first exon-intron junction does not promote R-loop formation.

Replication-dependent histone genes are strongly associated with R-loops

The presence of R-loops in the replication-dependent histone genes is potentially interesting given how these genes are regulated. As their name suggests, these histone genes are regulated in a cell-cycle dependent manner (Robbins and Borun, 1967). They also lack poly-A tails and are processed in special histone bodies in the nucleus (Dominski and Marzluff, 1999). Thus these genes are co-regulated using special processing pathways.

We first considered the possibility that the prominent R-loop signal detected in histone promoter regions simply reflects high promoter activity in these genes. However, histone genes consistently had higher R-loop-associated bisDRIP-seq scores than the majority of genes with similar promoter activity (Figure 5B). This suggests that promoter activity does not explain the strong R-loop signal observed in histone genes.

Similar analysis indicated that nuclear RNA levels and recruitment of RNA polymerase II also do not explain the R-loop signal observed in histone genes (Figure 5—figure supplement 1A and B).

We next examined the prevalence of R-loops and other single-stranded DNA structures in the promoter regions of histone genes. We repeated our metaplot analysis using the bisDRIP-seq scores from the entire class of replication-dependent histone genes (Figure 5C). In the resulting metaplot, bisDRIP-seq scores are low on the template strand and in triptolide-treated samples. This suggests that there is little single-stranded structure outside of R-loops in the promoter regions of these genes. In contrast, bisDRIP-seq scores are high downstream of the transcription start site on the non-template strand. Moreover, these high bisDRIP-seq scores are not observed after RNase H treatment (Figure 5—figure supplement 1C–E). Together, these results indicate that histone genes, as a group, have a high level of R-loop formation relative to the formation of other single-stranded DNA structures.

We next asked whether R-loops are bounded in replication-dependent histone genes. It was not clear if R-loops would be bounded in these genes since they lack introns and therefore they lack exon-intron junctions (Marzluff et al., 2008). Conceivably, the R-loops could extend to the entire length of the transcript. We therefore identified the boundaries of the R-loop signal in each of the nine histone genes that had the highest propensity to form R-loops (Figure 5A). As expected, the 5’ boundary of the R-loops appear to be near the transcription start site in all nine genes. In five of the nine histone genes, the entire R-loop appeared to be restricted to the initial portion of the gene (Figure 5D and Figure 5—figure supplement 2A–E). Sequence analysis of these boundaries does not reveal a clear sequence enrichment or motif (Figure 5—figure supplement 1D and E), making it currently unclear how this boundary is determined. In other cases, like HIST1H1E, R-loops seemed to cover nearly the entire gene (Figure 5E and Figure 5—figure supplement 2F–I). Together these results suggest that additional factors may establish 3’ R-loop boundaries in a subset of the replication-dependent histone genes.

Large R-loops form immediately downstream of the transcription start site in MALAT1 and NEAT1

Another set of genes which preferentially exhibit R-loops in their promoter regions are MALAT1 and NEAT1. MALAT1 and NEAT1 are adjacent genes that encode abundant, intronless lncRNAs (Hutchinson et al., 2007). These lncRNAs remain in the nucleus where they are involved in the regulation of transcription (Hirose et al., 2014) and splicing (Tripathi et al., 2010), respectively. Both MALAT1 and NEAT1 are longer than 3 kb, which is longer than the replication-dependent histone genes studied above. We were therefore interested in whether there are boundaries to R-loop expansion in these much longer intronless genes.

We first asked where R-loops are located in MALAT1 and NEAT1. The R-loop forming region in MALAT1 extends from the transcription start site to a position approximately 1700 bp downstream, with a sharp decrease in R-loop signal downstream of this position (Figure 6A). Similarly, the R-loop in NEAT1 extended approximately 1400 bp from the transcription start site (Figure 6B). Beyond this site, there was minimal detectable R-loop signal. As with the R-loops in the replication-dependent histone genes, these R-loops showed nearly complete loss of bisDRIP-seq signal on the non-template strand after triptolide treatment (Figure 6—figure supplement 1A and B). Moreover, the high bisDRIP-seq scores in this region are not observed after RNase H treatment (Figure 6—figure supplement 2A and B). This suggests that relatively long R-loops form in MALAT1 and NEAT1 and that these R-loops are bounded to the 5' end of each gene.

Figure 6 with 4 supplements see all

Download asset Open asset

*MALAT1* and *NEAT1* contain large, bounded promoter-associated R-loops.

(A) The promoter-associated R-loop forming region in *MALAT1* is large, but bounded. *MALAT1* had the strongest association with R-loops in Figure 5A and it is a longer intronless gene than the previously examined replication-dependent histone genes. To determine how far the R-loop-forming region in *MALAT1* extends into the gene body, the bisDRIP-seq signal at the *MALAT1* locus was examined. At the top of the panel is the gene model of *MALAT1* and the sense-strand RNA-seq reads that mapped to this region from an ENCODE MCF-7 RNA-seq dataset (blue, plotted on a log axis). Under the gene model of *MALAT1*, bisDRIP-seq scores were mapped to the genomic region containing *MALAT1* (mean bisDRIP-seq score from n = 13 samples). The bisDRIP-seq score on the template strand (orange) and non-template strand (blue) were plotted separately. In the lower plot, cytosine-to-uracil conversions were mapped to the genomic region surrounding the *MALAT1* R-loop forming region. The number of conversions on the template strand and non-template strand were plotted below the x-axis (orange) or above the x-axis (blue), respectively. Shown are the total number of conversions observed in all bisDRIP-seq samples (n = 13). The *MALAT1* R-loop region extended from the transcription start site to a site approximately 1750 base pairs downstream of the transcription start site. The transcription start site, indicated by ‘TSS’ and a dashed vertical line, was determined based on the MCF-7 ENCODE CAGE-seq dataset ENCFF207DXM and was located at ch11:65,499,042. (B) The promoter-associated R-loop forming region in *NEAT1* appears to have reduced R-loop further into the *NEAT1* gene body. R-loop formation in the *NEAT1* locus was examined since *NEAT1* is adjacent to *MALAT1* and was also strongly associated with R-loop formation. At the top of the panel is a gene model of *NEAT1* and the sense-strand RNA-seq reads that mapped to this region from an ENCODE MCF-7 RNA-seq dataset (blue, plotted on a log axis). Below the gene model of *NEAT1*, the bisDRIP-seq scores were mapped to the genomic region containing *NEAT1* (mean bisDRIP-seq score from n = 13 samples). The bisDRIP-seq score on the template strand (orange) and non-template strand (blue) were plotted separately. In the lower plot, cytosine-to-uracil conversions were mapped to the genomic region surrounding the *NEAT1* R-loop forming region. The number of conversions on the template strand and non-template strand were plotted below the x-axis (orange) or above the x-axis (blue), respectively. Shown are the total number of conversions observed in all bisDRIP-seq samples (n = 13). The *NEAT1* R-loop forming region extended almost 1500 base pairs from the transcription start site. However, R-loop signal showed periodicity and appeared to decrease gradually from the transcription start site to its final 3' boundary. The gene model of *NEAT1* represents the 23 kb *NEAT1* isoform.

https://doi.org/10.7554/eLife.28306.024

Although our mapping reveals the location of the R-loops in MALAT1 and NEAT1 at near-nucleotide resolution, analysis of previous DRIP-seq datasets (Sanz et al., 2016) reveal signals in the same overall regions (Figure 6—figure supplement 1A and B).

Interestingly, there appears to also be periodicity in the bisDRIP-seq scores on the non-template strand of NEAT1 with peaks and valleys every 300 base pairs (Figure 6B). This phenomenon is also detectable, but less prominent in MALAT1 (Figure 6A).

These valleys may indicate the existence of smaller R-loops in some MALAT1 or NEAT1 genes in some cells. This idea is supported by examining individual reads within the R-loop forming region in MALAT1. In most cases, we observed that cytosines were almost completely converted in individual reads (Figure 6—figure supplement 3A). However, we also observed a subset of reads with long stretches of cytosine conversions on one end and long stretches of unconverted cytosines on the other end (Figure 6—figure supplement 3B). The region where the conversions stop occurring in individual reads might reflect an internal border of a R-loop. It should be noted that we cannot exclude the possibility that this may reflect the location of a structured region in the single-stranded DNA that prevents bisulfite reactivity. Nevertheless, the presence of peaks and valleys within the R-loop-forming region of NEAT1 and MALAT1 raises the possibility of heterogeneity in the size and location of the individual R-loops within the larger R-loop forming region identified by bisDRIP-seq.

Comparison of bisDRIP-seq to existing R-loop mapping methods

Previous efforts to map R-loops have primarily relied on immunoprecipitation of RNA-DNA hybrids (Chédin, 2016). Traditionally, DNA fragments containing an R-loop are recovered and sequenced. The sequenced fragments contain both the DNA involved in the R-loop and regions of DNA that are not in the R-loop. Newer approaches, like DRIPc-seq (Chen et al., 2015; Sanz et al., 2016), can provide higher resolution by sequencing the RNA component of the R-loop. We therefore next wanted to determine if the specific R-loop boundaries detected by bisDRIP-seq could also be detected in the human DRIPc-seq datasets.

We first compared the location of R-loop signal in metaplots of DRIPc-seq and bisDRIP-seq signal in active promoters regions. As we demonstrated earlier, the R-loop signal in bisDRIP-seq is bounded between the transcription start site and the first exon-intron junction (Figure 7A). Thus, we observe a tight peak of R-loop signal within a few hundred base pairs of the transcription start site in active promoter regions. On the other hand, DRIPc-seq shows a marked enrichment of reads following the transcription start site; however, there is no clear boundary of sense-strand DRIPc-seq reads peaks within 5 kb of the transcription start site (Figure 7A). This highlights the improvement in resolution obtained by bisDRIP-seq.

Figure 7

Download asset Open asset

bisDRIP-seq provides improved mapping of R-loops relative to DRIPc-seq.

(A) The well-defined R-loop boundaries observed by bisDRIP-seq are less well demarcated in DRIPc-seq. To determine if the resolution of bisDRIP-seq R-loop mapping improves upon the resolution of previous R-loop mapping methods, we compared the data from bisDRIP-seq and DRIPc-seq. DRIPc-seq maps the location of RNAs that are incorporated into R-loops and, apart from bisDRIP-seq, it is the highest resolution method to map R-loops (Sanz et al., 2016). A metaplot of bisDRIP-seq R-loop signal (y-axis, solid blue) was plotted for active promoter regions (n = 15644) relative to the transcription start site. Overlapped onto the bisDRIP-seq metaplot are metaplots of sense-strand and antisense-strand DRIPc-seq reads. R-loop signal (blue) was highest in the region immediately 3' of the transcription start site and decreased 200–250 bp downstream of the transcription start site. Although there is also a peak of sense-strand DRIPc-seq reads 3' of the transcrption start site, there is no sharp 3' boundary to these reads within 5000 bp of the transcription start site. This suggests that bisDRIP-seq can map R-loop boundaries at a higher resolution than existing R-loop mapping methods. R-loop signal was defined as the triptolide-sensitive template-strand bisDRIP-seq score subtracted from the triptolide-sensitive non-template bisDRIP-seq score. This R-loop signal at each nucleotide position was derived using the mean bisDRIP-seq score from n = 13 control-treated samples and mean bisDRIP-seq score from n = 2 triptolide-treated samples. DRIPc-seq reads reflect the mean number of reads at a given site from n = 2 DRIPc-seq experiments calculated using the datasets from Sanz et al., 2016. The location of the transcription start site is demarcated by ‘TSS’ and a dashed vertical line. (**B–E**) A subset of R-loops are unstable ex vivo. The structure of R-loops has previously been shown to change in vitro (Kaback et al., 1979; Landgraf et al., 1996). To determine whether the structure of R-loops isolated from cells can change ex vivo, we compared R-loop maps from experiments in which R-loop mapping occurred either during cell lysis (bisDRIP-seq) or hours after cell lysis (delayed bisDRIP-seq and DRIPc-seq). In the delayed bisDRIP-seq experiment (lower plot), R-loops were labeled with bisulfite 16 hr after cell lysis (black, y-axis, n = 1 sample) rather than during cell lysis. R-loops were mapped to the gene loci of (B) *MALAT1,* (C) *NEAT1,* (D) *RPPH1* and (E) *HIST1H2BG* using DRIPc-seq (top plot, green, mean of n = 2 samples), bisDRIP-seq (middle plot, blue, n = 13 samples) and delayed bisDRIP-seq (lower plot, black, n = 1 sample). For (A) *MALAT1* and (B) *NEAT1*, R-loops map to similar regions in all three experiments. However, the R-loop signal appears to migrate towards the 3' end of the gene in both genes. On the other hand, the R-loop signal observed for (C) *RPPH1* and (D) *HIST1H2BG* in control-treated bisDRIP-seq samples is not observed in either the delayed bisDRIP-seq or DRIPc-seq datasets. Together, these results suggest that a subset of R-loops are unstable ex vivo. DRIPc-seq data was obtained from Sanz et al., 2016.

https://doi.org/10.7554/eLife.28306.029

We next compared the R-loop maps generated using DRIPc-seq or bisDRIP-seq at individual gene loci. In some genes, such as MALAT1 and NEAT1, we observed tight concordance between the R-loop maps generated from DRIPc-seq and bisDRIP-seq (Figure 7B and C). However, the boundaries demarcated by bisDRIP-seq appear more clear than the boundaries demarcated by DRIPc-seq. On the other hand, there were few DRIPc-seq reads mapped to genes such as RPPH1 and HIST1H2BG, which displayed strong R-loop signal in bisDRIP-seq (Figure 7D and E). This suggests the possibility that bisDRIP-seq might identify a set of R-loops that are not observable by DRIP-seq.

We next asked if the size or location of the R-loop might change ex vivo. bisDRIP-seq labels R-loops upon cell lysis, while other mapping methods recover the R-loop hours or days after cell lysis. R-loops could conceivably, expand, contract, or disappear during this time period. This seemed plausible, since the structure of some R-loops has previously been observed to change in solution (Kaback et al., 1979; Landgraf et al., 1996). To test this possibility, we performed a ‘delayed-bisulfite’ bisDRIP-seq experiment. In this experiment, we delayed the bisulfite step of our protocol so that samples were only treated with bisulfite 16 hr after cell lysis. In some genes, such as MALAT1 and NEAT1, the bisDRIP-seq scores were similar when using the delayed-bisulfite protocol. In other cases, the delayed-bisulfite treatment was associated with a marked depletion of the R-loop signal in RPPH1 and HIST1H2BG (Figure 7B–E). This suggests the possibility that some R-loops may not be stable for prolonged periods of time ex vivo and therefore may not be observed in approaches that do not label or recover R-loops during lysis.

Discussion

The precise location of R-loops in promoter regions has been obscure due to the low resolution of conventional R-loop mapping approaches (Chen et al., 2015; Ginno et al., 2012). These approaches rely on the S9.6 antibody to recover and sequence genomic fragments containing R-loops. Here we describe bisDRIP-seq, a near-nucleotide resolution method for mapping R-loops. In this approach, single-stranded DNA regions are identified based on their reactivity with bisulfite. These regions were then mapped throughout the genome based on the location of bisulfite-induced cytosine-to-uracil conversions. In some regions, bisulfite-induced cytosine conversions are enriched on one strand of DNA, show a requirement for RNA transcription and are removed following RNase H treatment, supporting the idea that these regions contain R-loops. R-loops were previously thought to expand thousands of base pairs into gene bodies without any clear boundary to their expansion or formation (Chédin, 2016). Here, we discover boundaries to R-loop formation at the transcription start site and the first exon-intron junction. The discovery of these boundaries suggest that the maximum length of most promoter-associated R-loops is predetermined by the exon-intron structure of genes.

Identification of R-loops by bisDRIP-seq depends primarily on bisulfite labeling of the single-stranded component of R-loops. In addition to bisulfite labeling, bisDRIP-seq takes advantage of the S9.6 antibody to enrich for R-loop containing fragments of DNA, thus leading to enhanced read coverage in R-loop-containing regions. However, S9.6 is known to react to other nucleic acid structures, including structured RNA (Phillips et al., 2013), and its full specificity relative to genomic DNA has not been examined. In our analysis, non-R-loop promoter single-stranded structures were immunoprecipitated by S9.6 antibody. This may reflect the poor specificity of the S9.6 antibody. Therefore, simply sequencing DNA recovered by S9.6 may not provide sufficient specificity for R-loop mapping.

bisDRIP-seq overcomes problems with the S9.6 antibody by adding multiple criteria to selectively identify which recovered fragments contain R-loops. These criteria included preferential labeling of a single strand of DNA and the requirement that the labeling be transcription-dependent. S9.6 antibody enrichment in protocols like DRIP-seq and bisDRIP-seq have been shown to be biased towards promoter regions (Halász et al., 2017). The requirement for preferential bisulfite labeling of one strand of DNA should largely mitigate this bias. Still, it is worth nothing that these criteria may have resulted in us excluding specific genomic regions that form RNA-DNA hybrids on both DNA strands or that contain extremely stable R-loops. Nevertheless, these criteria allow us to exclude non-R-loop single-stranded structures that might otherwise be mistaken for R-loops.

The location of the 5' boundary of R-loops, identified here to be located at the transcription start site, suggests that R-loops are primarily formed using the canonical gene transcript. Conceivably, other transcripts could form R-loops in promoter regions, such as promoter upstream transcripts (Preker et al., 2011) or extracoding RNAs (Savell et al., 2016). Our data suggests that other transcripts initiating upstream or downstream of the canonical transcription start site are less likely to be the major source of RNA in promoter-associated R-loops.

Why would the exon-intron junction serve as the 3’ R-loop boundary? The simplest explanation is that RNA splicing limits the length of the hybrid that can form between the spliced RNA and the template strand of DNA. If the RNA that forms the R-loop is spliced, then the RNA would not be able to hybridize to intron-encoding DNA (Figure 8A). Alternatively, R-loop expansion might be blocked by proteins that are bound to the 5' splice site in the RNA (Figure 8B). Both of these mechanisms could potentially explain the exon-intron junction R-loop boundary.

Figure 8

Download asset Open asset

Two possible models to explain the R-loop boundaries observed in the promoter regions of intron-containing genes.

(A) Spliced RNA R-loop model. Promoter-associated R-loops may be bounded by the exon-intron junction because the RNA component of R-loops is spliced. If the first intron is spliced out of the RNA that is incorporated into promoter-associated R-loops, then the RNA cannot hybridize to the region of DNA encoding the first intron. This would prevent R-loops from expanding past exons. In this model, the template strand for canonical transcription is represented in orange, while the opposite, non-template strand is represented in blue. Black lines between two nucleic acid strands indicate that the nucleic acids are hybridized. ‘TSS’ refers to the transcription start site. (B) Protein-recruitment model. Alternatively, the first exon-intron junction may recruit proteins that prevent R-loop formation in the intron region. For example, the first exon-intron junction recruits splicesome proteins, which have been implicated in R-loop processing (Li and Manley, 2005). These or other proteins may bind the exon-intron junction and then block R-loop expansion. In this model, the template strand for canonical transcription is represented in orange, while the opposite, non-template strand is represented in blue. The red, green and blue circles represent proteins associated with first exon-intron junction. Black lines between two nucleic acid strands indicate that the nucleic acids are hybridized.

https://doi.org/10.7554/eLife.28306.030

Previous studies found that the relative location of the first exon-intron junction impacts the rate of transcription (Bieberstein et al., 2012; Brinster et al., 1988; Fong and Zhou, 2001). Smaller first exons are associated with higher rates of transcription and enhanced recruitment of transcription initiation factors (Bieberstein et al., 2012). Conceivably, if smaller first exons are associated with less efficient or less stable R-loops, this may account for the higher transcription associated with these genes.

Although intronless genes lack exon-intron junctions, we identified R-loops and R-loop boundaries in these genes. Intronless genes were among the genes with the highest R-loop signal immediately downstream of the transcription start site. In particular, two classes of intronless genes were identified: intronless lncRNAs, including MALAT1 and NEAT1, and replication-dependent histones.

Histone genes are notable for their unusual replication-dependent transcription regulation and for their non-canonical end termination (Marzluff et al., 2008). During DNA synthesis, histone genes are transcribed and processed more efficiently than at other parts of the cell-cycle (Hereford et al., 1981). Additionally, the histone genes are the only mRNAs that lack poly-A tails. Like MALAT1 and NEAT1, the histone genes are terminated with a special stem-loop structure (Marzluff et al., 2008). This special processing has been previously linked to transcription and the absence of histone gene splicing. It is intriguing to speculate that either of these processes may relate to the robust R-loop structures that are seen at the 5’ end of these genes.

MALAT1 and NEAT1, which are both intronless genes, also had two of the strongest R-loop signals. These genes clearly show that promoter-associated R-loops can be bounded to specific regions of genes in the absence of an exon-intron junction. RNA-binding proteins might determine R-loop boundaries in these types of intronless genes. Indeed, we observed prominent binding of SRSF1 at a position that corresponds to the 3’ boundary of the R-loop in MALAT1 in publicly-available datasets (Figure 6—figure supplement 4). Thus, RNA-protein complexes may also limit the expansion of R-loops, especially in intronless genes.

Interestingly, all of the intronless genes that we studied are components of nuclear bodies. The histone RNAs, MALAT1 and NEAT1 associate with histone locus bodies, nuclear speckles and paraspeckles, respectively (Clemson et al., 2009; Hutchinson et al., 2007; Liu et al., 2006). The strong R-loop signal in so many genes with roles in well-established nuclear bodies raises the possibility that R-loops may play a role in the formation or regulation of these bodies. In the case of at least the histone genes and NEAT1, formation of the nuclear bodies is known to occur near the DNA encoding these RNA transcripts (Clemson et al., 2009; Mao et al., 2011). This suggests the possibility that R-loop formation may be involved in tethering RNA to chromatin to facilitate ribonucleoprotein assembly. Future research will be needed to test this and other possibilities for why these nuclear body-associated genes are prone to R-loop formation.

Although bisDRIP-seq was used here to study steady-state promoter-associated R-loops, the method can be used for studying other questions regarding R-loops. For example, this method could be used to study the dynamic changes in R-loop size and location in response to various signals like estrogen. Moreover, as transcription termination sites become better annotated, high-resolution R-loop could guide the study of R-loops involved in transcription termination.

Materials and methods

Key resources table

Reagent type (species) or resource	Designation	Source or reference	Identifiers	Additional information

cell line (human)	MCF-7 cells	ATCC	HTB-22
antibody	S9.6 antibody	Kerafast	ENH001	Used at a concentration of 20 mg/L
commercial assay or kit	Pico-Methyl Seq Library Prep Kit	Zymo Research	D5455
chemical compound, drug	Ammonium sulfite monohydrate	Sigma-Aldrich	358983–500G
chemical compound, drug	45% ammonium bisulfite	Pfaltz and Bauer Inc	A29946250g
chemical compound, drug	sodium bisulfite	Sigma-Aldrich	243973–100G
chemical compound, drug	triptolide	R and D Systems	3253
software, algorithm	Bismark	Babraham Bioinformatics	version 0.14.3
software, algorithm	Flexbar	Dodt et al., 2012 PMID: 24832523	version 2.5

Share this article

Cite this article

Near-nucleotide resolution mapping of R-loops.

Transcription-dependent R-loops form in active promoter regions.

Figure 2—source data 1

The 5' boundary of promoter-associated R-loops is located at the transcription start site.

The first exon-intron junction appears to act as a 3' boundary to R-loops.

Figure 4—source data 1

Replication-dependent histone genes frequently form R-loops.

Figure 5—source data 1

The 25 genes that were most strongly associated with transcription-dependent sense-strand R-loop structures in the region between the transcription start site and + 250 bp.

MALAT1 and NEAT1 contain large, bounded promoter-associated R-loops.

bisDRIP-seq provides improved mapping of R-loops relative to DRIPc-seq.

Two possible models to explain the R-loop boundaries observed in the promoter regions of intron-containing genes.

Author details

Jason G Dumelie

Contribution

Competing interests

Samie R Jaffrey

Contribution

For correspondence

Competing interests

Citations by DOI

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

Categories and tags

Research organism