Transcription-coupled genetic instability marks acute lymphoblastic leukemia structural variation hotspots

Abstract

Progression of malignancy to overt disease requires multiple genetic hits. Activation-induced deaminase (AID) can drive lymphomagenesis by generating off-target DNA breaks at loci that harbor highly active enhancers and display convergent transcription. The first active transcriptional profiles from acute lymphoblastic leukemia (ALL) patients, acquired here, reveal striking similarity at structural variation (SV) sites. Specific transcriptional features, namely convergent transcription and Pol2 stalling, were detected at breakpoints. The overlap was most prominent at SV with recognition motifs for the recombination activating genes (RAG). We present signal feature analysis to detect vulnerable regions and quantify, in human cells, how convergent transcription contributes to R-loop generation and RNA polymerase stalling. Wide stalling regions were characterized by high DNAse hypersensitivity and unusually broad H3K4me3 signal. Among 1382 pre-B-ALL patients, ETV6-RUNX1 fusion-positive patients had an over ten-fold elevation in RAG1 expression, while high expression of AID marked pre-B-ALL lacking common cytogenetic changes.

https://doi.org/10.7554/eLife.13087.001

eLife digest

Some of the most common cancers found in children are called precursor leukemias, which may start to develop before birth. Cancerous cells often contain alterations to the genetic information in their DNA. In precursor leukemias, the most common genetic changes involve deleting, adding or rearranging segments of the DNA sequence.

Several researchers have sequenced the entire DNA of childhood leukemia cells, with the result that almost all of the genetic alterations linked to these conditions have been catalogued. These efforts have shown that certain DNA regions are particularly affected by mutations, but no one knows why errors occur so frequently in these regions.

Recent evidence also suggests that transcription – the process of producing useful molecules from a stretch of DNA – can play a role in generating genetic alterations. Heinäniemi et al. have now used a technique called global run-on sequencing to measure the extent of transcription in many different types of leukemia cells. This revealed that in the error-prone DNA regions, two processes – called convergent transcription and transcriptional stalling – interfere with transcription. Both processes temporarily leave the normally double-stranded DNA unzipped as two single strands and free of nucleosomes, which makes DNA more vulnerable to breaking. This would explain how pieces of DNA might be lost, added, or moved to cause the genetic errors that lead to leukemia.

Further investigation revealed that two protein complexes called RAG and AID, which rearrange segments of DNA in immune cells, are likely to cause the errors in the vulnerable DNA regions. Different amounts of RAG and AID were present in different subtypes of leukemia cells, and these amounts also varied with the risk classification of the disease. Further studies are now needed to investigate the exact roles of these protein complexes. This could eventually help scientists devise strategies to protect the DNA of people with leukemia from these errors, which could reduce the risk of the cancer reoccurring.

https://doi.org/10.7554/eLife.13087.002

Introduction

In precursor lymphoblastic leukemia, primary genetic lesions often arise in utero (Wiemels et al., 1999; Mori et al., 2002; Maia et al., 2003; Bateman et al., 2015), while the onset of overt disease requires additional genetic alterations. Whole-genome sequencing (WGS) of ETV6-RUNX1 (also known as TEL-AML1) positive acute leukemias suggested that the secondary lesions are predominantly caused by off-target activity of the RAG complex (Papaemmanuil et al., 2014). In a similar fashion, the expression of the AID complex in more mature B cells is implicated in genomic instability and development of lymphomas (Meng et al., 2014; Qian et al., 2014; Robbiani et al., 2015). To date, WGS studies in leukemia have been reported for several pre-B-ALL subtypes (Andersson et al., 2015; Holmfeldt et al., 2013; Paulsson et al., 2015; Zhang et al., 2012), resulting in a comprehensive characterization of the underlying genetic alterations. Therefore, the research focus in leukemia genetics is moving toward characterization of the mechanisms by which these lesions occur and the consequences of the resulting clonal heterogeneity.

Antigen receptor genes are assembled from discrete gene segments by RAG-mediated V(D)J recombination at sites of recombination signal sequences (RSS) during early lymphocyte development (Gellert, 2002; Schatz and Swanson, 2011). Cells use multiple strategies to restrict the action of the RAG complex to appropriate genomic loci: the expression of RAG1 and RAG2 is limited to precursor stages of lymphocytes, the activity of the complex is attenuated during S-phase of the cell cycle, and RAG cleavage is directed towards sequences containing an RSS pair (Schatz and Swanson, 2011). The engagement of RAG2 is further limited by the histone modification H3K4me3, which is typically found at transcription start sites (TSS) (Matthews et al., 2007; Teng et al., 2015). However, RSS and RSS-like motifs are found only at around 7–40% of breakpoints at SV (genomic imbalance, translocation or inversion) sites (Andersson et al., 2015; Papaemmanuil et al., 2014). Furthermore, the RSS motifs and H3K4me3 occur frequently in the genome, suggesting that additional features, possibly even additional complexes including AID (Swaminathan et al., 2015), are relevant for the genetic instability underlying leukemia SV.

In lymphomas, AID off-target effects localize to intragenic super-enhancer (SE) and promoter areas characterized by transcription from both strands, i.e. convergent transcription (convT) (Meng et al., 2014). Notably, VH gene segment recombination by RAG at the IgH locus coincides with sense and antisense transcription (Bolland et al., 2004), which could also be relevant at off-target sites. Secondly, stalled polymerases, which are found at exons and R-loops and are actively paused at TSS regions (Jonkers and Lis, 2015), expose single-stranded DNA, recruiting AID via Spt5 binding (Pavri et al., 2010). Furthermore, the polymerase complex displaces nucleosomes completely or partially (the H2A/H2B moiety), which in vitro promotes cleavage by RAGs (Bevington and Boyes, 2013). Despite these intriguing findings, the relevance of transcription-coupled processes has not been systematically characterized, and the clinical relevance of RAG and AID expression in the different leukemia subtypes remains unclear. RNA polymerases engaged in primary transcription across the genome can be measured using Global-Run-On sequencing (GRO-seq) (Kaikkonen et al., 2013). Therefore, this method is ideally suited to distinguish features of transcription at SV sites, including convT and RNA polymerase stalling. To this end, we acquired the first patient profiles of nascent transcriptional activity in leukemic blasts representing seven cytogenetic subgroups and performed integrative analysis of various genome-wide profiles and patient transcriptomes.

Results

Integrative analysis of transcription and genomic instability in leukemic cells

Transcriptional activity from ALL cells representing seven different pre-B-ALL cytogenetic subtypes was assayed using GRO-seq (both primary patient and cell line samples, see Supplementary file 1 and Materials and methods), and jointly analyzed with WGS data from the ETV6-RUNX1 (51 cases; Papaemmanuil et al., 2014), high hyperdiploid (HeH, 16 cases; Paulsson et al., 2015), hypodiploid (20 cases; Holmfeldt et al., 2013) and MLL-rearranged (22 cases at diagnosis and 2 relapses; Andersson et al., 2015) subtypes of precursor B-ALL. GRO-seq signals and breakpoint data are shown in Figure 1—figure supplement 1 at the CDKN2A locus, a significant SV site in childhood ALL (Sulong et al., 2009).

To systematically identify regions with high frequency of SV across the genome, topologically-associated domains (TADs) were retrieved based on HiC data from B-lymphoid lineage cells (Rao et al., 2014). TADs reflect the three-dimensional structure of chromatin. These natural boundaries to transcriptional activity were used to divide the chromosomes into subregions for analysis (see Figure 1—source data 1 and Materials and methods). To link typical transcriptional activity patterns and hotspots of genomic instability, we related the breakpoint frequency with chromatin domains, as illustrated in Figure 1A (see also Figure 1—figure supplement 2).
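Relating breakpoint frequency to chromatin domains can be pictured with a short sketch. This is only an illustration under stated assumptions, not the authors' pipeline: the function name `breakpoints_per_tad`, the tuple layout and the toy coordinates are hypothetical, and TADs are assumed to be sorted, non-overlapping intervals on a single chromosome.

```python
from bisect import bisect_right

def breakpoints_per_tad(tads, breakpoints):
    """Count breakpoints in each TAD and normalize by TAD length.

    tads: sorted, non-overlapping (start, end) half-open intervals
          on one chromosome (assumed input layout).
    breakpoints: iterable of positions on the same chromosome.
    Returns a list of (start, end, count, count_per_bp).
    """
    starts = [s for s, _ in tads]
    counts = [0] * len(tads)
    for pos in breakpoints:
        i = bisect_right(starts, pos) - 1  # last TAD starting at or before pos
        if i >= 0 and pos < tads[i][1]:    # skip positions that fall between TADs
            counts[i] += 1
    return [(s, e, c, c / (e - s)) for (s, e), c in zip(tads, counts)]

# Toy coordinates (hypothetical, for illustration only)
tads = [(0, 1000), (1000, 3000), (5000, 6000)]
result = breakpoints_per_tad(tads, [10, 999, 1000, 2500, 4000, 5999])
```

Normalizing the count by domain length gives a per-bp frequency, which is the kind of quantity that can then be ranked or split into quartiles across TADs.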

Figure 1 (with 5 supplements)

Integrative analysis of transcription and high-recurrence SV sites highlights novel transcribed regions.

(A) WGS data from the ETV6-RUNX1 (51 cases; Papaemmanuil et al., 2014), high hyperdiploid (16 cases; Paulsson et al., 2015), hypodiploid (20 cases; Holmfeldt et al., 2013) and MLL-rearranged (22 cases; Andersson et al., 2015) subtypes of precursor B-ALL was integrated with profiles of transcriptional activity assayed using GRO-seq from ALL patient and cell line samples (see also Figure 1—figure supplement 1 and Supplementary file 1). HiC data from B-lymphoid cells (Rao et al., 2014) was used to define TADs based on the HiC interaction frequency, shown as grey scale heatmap, in order to distinguish TADs with highest frequency of SV. (B) The *PAX5* and *ZCCHC7* loci are located in the TAD shown, which has high SV frequency in hyperdiploid, ETV6-RUNX1- and MLL-fusion positive patients (4, 20 and 6 breakpoints, respectively, Figure 1—source data 1). The GRO-seq signal profiles from three pre-B-ALL cytogenetic subtypes and normal B-lymphoblastoid cells are displayed as indicated in the figure (see also Figure 1—figure supplement 4 and Figure 2—figure supplement 2). The y-axis shows the normalized read density (plus strand in red, minus strand in blue). convT regions are indicated in purple and leukemia breakpoints in red. The TSS region of *PAX5* overlaps convT that co-localized with an intragenic SE (B-lymphoblastoid H3K27ac track is shown at the bottom). (C) A TAD with the same number of breakpoints (20) in ETV6-RUNX1 patients is shown with signal from REH cells (see also Figure 1—figure supplement 4). Genomic annotations include the location of GENCODE transcripts (in green). A strong transcription signal is visible that spans approximately 500 kb near the TAD boundary, lacking annotated transcripts. A zoom-in panel shows the most recurrent SV site. (D) The TAD visualized represents a genomic region that harbors most SV in HeH (see Figure 1—figure supplement 5 for the hypodiploid SV hotspot). The GRO-seq signal (track from patient 1) indicates a novel locus with abundant transcription in leukemic samples (refer to Figure 1—figure supplement 4 for all GRO-seq profiles). The highest recurrence of SV occurs at the convT overlapping mid-region (zoom-in panel), which also has two ETV6-RUNX1 breakpoints.

https://doi.org/10.7554/eLife.13087.003

Figure 1—source data 1 Identified topologically associated domains. The chromosomal coordinates (hg19) of TADs with breakpoints and all TADs as identified based on HiC data are reported. The number and frequency of breakpoints, together with the percentage of each domain spanned by convT, are given for TADs with breakpoints. The highlighted rows correspond to TADs shown in Figure 1 and Figure 2—figure supplement 1. Notice the separate worksheets. https://doi.org/10.7554/eLife.13087.004
Download elife-13087-fig1-data1-v1.xlsx

The most frequent SV regions encompass novel transcribed regions

An increasing trend of transcriptional activity was observed when TADs were compared based on breakpoint frequency quartiles (see Materials and methods, Figure 1—figure supplement 3). TADs with the highest SV count are shown in Figure 1 (see also Figure 1—source data 1 and Figure 1—figure supplement 4). The PAX5 and ZCCHC7 genes are located within a TAD region with 20 breakpoints in the ETV6-RUNX1, 4 in the HeH and 6 in the MLL subtype (excluding the MLL-fusion itself) (Figure 1B). Frequent SV were also found in TADs with no annotated coding genes (Figure 1C, 20 breakpoints in ETV6-RUNX1; Figure 1D, 4 breakpoints in HeH), yet GRO-seq exhibited transcription signal spanning several hundred thousand base pairs in both regions, typical of long non-coding transcripts (Sun et al., 2015). There was evidence of non-coding transcripts, based on RefSeq and GENCODE, but none matched the same location (refer to Supplementary file 2 for all genomic coordinates shown; a TAD with frequent SV in hypodiploid subtype is shown in Figure 1—figure supplement 5). The nascent ALL transcriptomes thus reveal novel transcribed regions as recurrent SV-associated hotspots in the two most common ALL subtypes.

Convergent transcription and RNA polymerase stalling are prevalent at genomic regions with frequent breakpoint events

The prevailing notion is that active transcription start sites (TSS) in pre-B cells are susceptible to RAG off-targeting due to the H3K4me3 chromatin mark (Matthews et al., 2007; Teng et al., 2015). However, we noticed that the recurrent breakpoints often lay several kb downstream of TSS, as highlighted in Figure 1B and D (see inserts), and coincided with simultaneous transcription on both strands, i.e. convT spanning a minimum of 100 bp. On closer examination of the signal data from leukemia SV hotspots, many of these regions likely correspond to transcription from intragenic enhancers that generate enhancer RNAs (eRNA), which are typically a few kb in size (Kaikkonen et al., 2013). In agreement, a significant enrichment of breakpoints in enhancers overlapping with convT was observed (hypergeometric test P=0.00012 for intergenic and P=4.6e-08 for all enhancers identified based on eRNA signal, see Materials and methods and Figure 2—source data 1). An overlapping eRNA transcript at the TSS region of PAX5, confirmed by the active enhancer chromatin marker H3K27ac, led to convT extending nearly 20 kb, with SV sites located 3.7–9.7 kb downstream of the TSS (Figure 1B, see insert).
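The working definition of convT used above (simultaneous transcription on both strands over at least 100 bp) can be expressed as an interval intersection. The sketch below assumes that transcribed intervals have already been called separately for each strand; how the authors called them is not described in this section, and the function name `convergent_regions` and the toy intervals are hypothetical.

```python
def convergent_regions(plus, minus, min_len=100):
    """Return regions transcribed on both strands for at least min_len bp.

    plus, minus: sorted, non-overlapping (start, end) half-open intervals
                 of transcribed signal on each strand (assumed input).
    """
    out = []
    i = j = 0
    while i < len(plus) and j < len(minus):
        start = max(plus[i][0], minus[j][0])
        end = min(plus[i][1], minus[j][1])
        if end - start >= min_len:      # keep overlaps meeting the length cut-off
            out.append((start, end))
        # advance whichever interval ends first
        if plus[i][1] < minus[j][1]:
            i += 1
        else:
            j += 1
    return out

# Toy intervals (hypothetical)
plus_strand = [(0, 500), (1000, 2000)]
minus_strand = [(450, 1150), (1900, 2500)]
conv = convergent_regions(plus_strand, minus_strand)
```

In this toy input the first overlap is only 50 bp long and is discarded, while the 150 bp and 100 bp overlaps pass the cut-off.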

Secondly, convT in the vicinity of intragenic breakpoints was often associated with localized elevation in the GRO-seq signal, as exemplified at the ZCCHC7 and RAG loci (Figure 2A, see also Figure 2—figure supplement 1). The observed signal features were highly reproducible between biological replicates and shared among a subset of cytogenetic groups (Figure 2—figure supplement 2). We hypothesized that they represent RNA polymerase II (Pol2) stalling events. Previous analyses of Pol2 stalling have focused on promoter proximal regions (Adelman and Lis, 2012). To examine such events genome-wide and across gene bodies, we developed a general analysis approach that identifies change points within gene regions and reports those with high elevation in the signal level (see Materials and methods and Figure 2—source data 1 for the identified regions) (Killick et al., 2012). As additional confirmation, we analyzed stalling from Pol2 ChIP-seq in the REH and Nalm6 cell lines (Figure 2A). To distinguish between different Pol2 complexes (Zhou et al., 2012), antibodies against the serine 2 or serine 5 phosphorylated Pol2 were used (see Materials and methods).
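The idea of locating a change point in a signal and reporting the elevation across it can be illustrated with a deliberately simple stand-in. The sketch below finds the single split that minimizes the squared error of a two-segment piecewise-constant fit; it is not claimed to be the algorithm of the cited reference or the authors' implementation, and the function name, the `min_seg` parameter and the toy signal are assumptions.

```python
def best_change_point(signal, min_seg=2):
    """Find the split that minimizes the summed squared error of a
    two-segment piecewise-constant fit.

    Returns (index, left_mean, right_mean), where index is the first
    position of the right-hand segment.
    """
    n = len(signal)
    if n < 2 * min_seg:
        raise ValueError("signal too short for the requested segment size")
    # prefix sums of values and of squared values
    s, s2 = [0.0], [0.0]
    for x in signal:
        s.append(s[-1] + x)
        s2.append(s2[-1] + x * x)

    def sse(a, b):
        # squared error of signal[a:b] around its own mean
        total = s[b] - s[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    best = min(range(min_seg, n - min_seg + 1),
               key=lambda k: sse(0, k) + sse(k, n))
    return best, s[best] / best, (s[n] - s[best]) / (n - best)

# Toy read-density profile (hypothetical)
signal = [1, 2, 1, 2, 1, 9, 10, 9, 11, 10]
idx, left_mean, right_mean = best_change_point(signal)
fold_elevation = right_mean / left_mean
```

A segment would then be reported as a candidate stalling site only if the elevation across the change point is high; the threshold for "high" is left open here because it is not given in this section.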

Figure 2 (with 6 supplements)

Convergent transcription and Pol2 stalling characterize genomic regions with high number of breakpoint events.

(A) The GRO-seq signal in the ETV6-RUNX1 positive REH cell line is shown to exemplify the co-occurrence of convT (in purple) and local elevation in GRO-seq signal (Pol2 stalling, in light blue) at both R- and NR-breakp (in red and brown, respectively) that reside within intronic (*ZCCHC7*), TSS (*RAG2*) or putative enhancer regions (*RAG2*). The elevated signal is also visible in Pol2 ChIP-seq signal (Pol2 S2P in green, Pol2 S5P in orange, input in grey). See also Figure 2—figure supplement 1. The percentage of TAD spanned by convT (in B) or Pol2 stalling (in C) in pre-B/B-lymphoid cells is summarized as boxplots from TADs divided into quartiles based on number of breakpoints per bp (see also Figure 1—figure supplement 3, Figure 2—figure supplement 3–6). The quartile ranges are for exclusive lower and inclusive upper value in the range, as indicated. Refer to Figure 2—source data 1 for statistical analysis.

https://doi.org/10.7554/eLife.13087.010

Figure 2—source data 1 Identified convT and Pol2 stalling regions. Regions with transcription from both strands (convT) or change points in the intragenic signal (Pol2 stalling sites) identified from B-lineage cells (REH, Nalm6 and Lymphoblastoid) or ES cells are listed with coordinates in the hg19 human genome. Coordinates of putative enhancers in B-lineage cells based on eRNA analysis (hg19 genome version) are shown in the second worksheet. The statistical analysis for genome-wide enrichment and the TAD analysis are included as separate worksheets, including also the results obtained based on ENCODE chromatin segmentation. Related to Figures 2–4. https://doi.org/10.7554/eLife.13087.011
Download elife-13087-fig2-data1-v1.xlsx

Genome-wide analysis of convT and Pol2 stalling (see Materials and methods and Figure 2—source data 1) substantiated the relevance of these observations: considering the breakpoint frequency per TAD size, the top ranked TADs in each ALL subtype represented genomic regions with abundant convT and Pol2 stalling (Figure 2B). Significant enrichment was confirmed for the upper quartiles (hypergeometric test P=0.00038 in ETV6-RUNX1, P=0.00018 in hyperdiploid, P=0.028 in hypodiploid and P=0.00004 in MLL-rearranged). The increased overlap was found for breakpoints with and without RSS motifs (denoted as R-breakp and NR-breakp, see Figure 2—figure supplement 3 and Materials and methods) and it was preserved when total transcriptional activity was considered (Figure 2—figure supplement 4). Furthermore, the distinct transcriptional profile of embryonic stem cells (ES) had lower overlap (Figure 2—figure supplement 5).
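For readers unfamiliar with the hypergeometric test quoted above, the upper-tail probability can be computed directly from binomial coefficients. The sketch below is generic: how the authors defined the population and the marked set is not stated in this section, so the function name, argument names and the toy numbers are assumptions.

```python
from math import comb

def hypergeom_upper_tail(k, M, n, N):
    """P(X >= k) for a hypergeometric variable X.

    M: population size (e.g. all TADs considered)
    n: marked items in the population (e.g. TADs in the top quartile)
    N: number of draws (e.g. TADs rich in a transcriptional feature)
    k: observed number of marked items among the draws
    """
    upper = min(n, N)
    total = comb(M, N)
    # comb() returns 0 for impossible terms, so no extra bounds check is needed
    tail = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return tail / total

# Toy numbers (hypothetical): 20 TADs, 5 in the top quartile,
# 4 feature-rich TADs of which 3 fall in the top quartile.
p_value = hypergeom_upper_tail(3, 20, 5, 4)
```

With these toy numbers the tail probability is 155/4845, roughly 0.032.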

For comparison, chromatin segmentation of B-lymphoid cells was similarly analyzed (see Figure 1—source data 1 and Figure 2—source data 1). TADs with high number of breakpoints consistently had significant overlap with chromatin segments representing active transcription (refer to Figure 2—source data 1), supporting a transcription-coupled mechanism for the observed genetic instability. We then distinguished regions with overlap to the transcriptional features defined here within active promoters and enhancers. Comparing these against the TAD SV frequency quartiles (Figure 2—figure supplement 6), as before, revealed the most pronounced enrichment in convT/Pol2 stall overlapping regions.

Next, we set out to define what may link convT and Pol2 stalling regions with AID and RAG recruitment. The signal feature detection for convT (as in Meng et al., 2014) and Pol2 stalling (as defined here) enables this on a genome-wide level.

R-loop formation and convergent transcription co-occur with Pol2 stalling

RNA polymerases are expected to stall at regions harboring R-loop forming sequences (RLFS) (Skourti-Stathaki et al., 2014a; Jenjaroenpun et al., 2015). The propensity of a DNA sequence to form R-loops can be computationally predicted (Jenjaroenpun et al., 2015) (see Materials and methods). These RLFS motif containing regions exhibited a significantly higher overlap with Pol2 stalling sites when compared to random intragenic regions (Figure 3A, empirical P<0.001 in B-lineage and ES cells). A highly concordant local RLFS motif density and GRO-seq signal profile was observed across gene regions (Figure 3—figure supplement 1A and B). The profiles peaked near TSS, where the presence of RLFS motifs led to a significant elevation in the median GRO-seq signal level (Figure 3—figure supplement 1, 2.1-fold increase in B-lineage cells, Wilcoxon rank sum test P<2.2e-16, 95% CI 2.1–2.3). As a second mechanism, collisions due to convT may halt transcription (Prescott and Proudfoot, 2002) in a dynamic and cell-specific manner. Accordingly, higher antisense signal at convT regions (see Materials and methods) increased the overlap with Pol2 stalling sites on the sense strand (Figure 3B), intriguingly exceeding that observed for RLFS motifs (Figure 3A).
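An empirical P value of the kind quoted for the comparison against random intragenic regions can be obtained by repeatedly relocating the test regions and recomputing the overlap. The sketch below is a generic illustration under assumptions: regions are placed uniformly within a single span (the authors restricted sampling to intragenic regions, which is not modeled here), and the function names, the `seed` parameter and the toy coordinates are hypothetical.

```python
import random

def overlap_fraction(regions, features):
    """Fraction of regions overlapping at least one feature (half-open intervals)."""
    hits = sum(any(s < fe and fs < e for fs, fe in features) for s, e in regions)
    return hits / len(regions)

def empirical_p(regions, features, span, n_iter=1000, seed=0):
    """Compare the observed overlap with overlaps of randomly placed regions.

    Each region keeps its length and is placed uniformly inside span.
    Returns (observed_fraction, empirical_p) with the usual +1 correction.
    """
    rng = random.Random(seed)
    observed = overlap_fraction(regions, features)
    as_extreme = 0
    for _ in range(n_iter):
        shuffled = []
        for s, e in regions:
            length = e - s
            new_start = rng.randrange(span[0], span[1] - length + 1)
            shuffled.append((new_start, new_start + length))
        if overlap_fraction(shuffled, features) >= observed:
            as_extreme += 1
    return observed, (as_extreme + 1) / (n_iter + 1)

# Toy data (hypothetical): three of four regions sit on a feature
features = [(1000, 1100), (5000, 5100), (9000, 9100)]
regions = [(1050, 1060), (5050, 5060), (9050, 9060), (20000, 20010)]
observed, p_emp = empirical_p(regions, features, span=(0, 1_000_000))
```

With 1000 iterations, the smallest P value this estimator can return is 1/1001, which is why results of this kind are typically reported as P<0.001.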

Figure 3 (with 3 supplements)

Indication of transcription-coupled genetic instability at leukemia SV hotspots lacking RSS motifs.

(A) Overlap between RLFS motif harboring intragenic regions and detected Pol2 stalling sites in B-lineage and ES cells. The high overlap of RLFS-positive regions is statistically significant compared to random regions (empirical P is indicated for 30% and 28% overlaps, respectively). (B) Overlap of detected Pol2 stalling sites also increases based on the strength of antisense signal level for B-lineage and ES cell convT regions divided into quartiles. (C) The influence of RLFS at TSS on ES cell DRIP-seq signal level is shown (Wilcoxon rank sum test P is indicated). Input signal levels are shown as control. (D) ES cell DRIP-seq signal is plotted similarly as in C, from convT-positive and -negative TSS regions. The DRIP-signal is higher in convT-positive TSS (Wilcoxon rank sum test P is indicated, TSS with convT N = 11774, TSS without convT N = 12092, refer to Figure 3—source data 2 for statistical analysis based on separate DRIP-seq replicates). (E) The percentages of breakpoint regions with no RSS motifs overlapping intragenic Pol2 stalling sites found in B-lineage cells are shown as barplots. The mean overlap observed in random sampling is indicated in grey bars (further statistical analysis is presented in Supplementary file 3). Categories with increasing cut-off for recurrence (1: non-recurrent in dim color, >1 and above: recurrent in darker color) were tested. (F) Overlap with RLFS, convT and annotated TSS is shown, as in E, for ETV6-RUNX1 NR-breakp (see also Supplementary file 3). (G) A schematic model illustrating how transcription from both strands (convT) or RLFS can locally arrest the Pol2 complex, leading to recruitment of DNA damage-sensing complexes to R-loops, such as AID or BRCA (Alt et al., 2013; Hatchi et al., 2015), in an RSS-independent manner. (H) NR-breakp hotspot with the highest recurrence (*TPI1* locus) is shown. DRIP-seq signal (shown in tones of red overlaid with input control signal in blue), and RLFS motifs indicated as a magenta bar track represent two levels of independent data that were integrated with GRO-seq data (signal from REH and ES cells is shown) to characterize properties of convT and Pol2 stalling regions. The breakpoint data (NR-breakp in brown) and detected convT (in purple) and Pol2 stalling in B-lineage cells (in blue) are shown. At the recurrent breakpoint sites, antisense transcription of the neighboring gene (*SPSB2* primary transcript) leads to a broad convT region, as indicated in the figure. Elevated DRIP-signal indicates formation of DNA-RNA hybrids (see also Figure 3—figure supplement 3).

https://doi.org/10.7554/eLife.13087.018

Figure 3—source data 1 Breakpoint clustering to regions. ETV6-RUNX1 breakpoint data used in the analysis was divided into three categories based on evidence for RSS-guided RAG targeting to the region (RSS-motifs). To analyze recurrence, breakpoint events within 1-kb distance were stitched together. The resulting genomic region coordinates (hg19) and the number of breakpoints contained within them are reported sorted by breakpoint count. Statistical analysis of feature overlap based on binomial and hypergeometric distribution is summarized in the following worksheet. Coordinates and statistics for all pre-B-ALL breakpoint regions are listed in the last worksheet. Notice the separate worksheets. https://doi.org/10.7554/eLife.13087.019
Download elife-13087-fig3-data1-v1.xlsx
Figure 3—source data 2 Statistical analysis of separate DRIP-seq and DNAse-seq replicates. Statistical analysis is presented for the independent experiments used in the Wilcoxon rank sum tests. Related to Figures 3 and 4. https://doi.org/10.7554/eLife.13087.020
Download elife-13087-fig3-data2-v1.xls

As an additional experimental validation of R-loops, we used DNA-RNA-immunoprecipitation sequencing (DRIP-seq) results from ES cells (see Materials and methods) that correspond to detection of DNA-RNA hybrids (Ginno et al., 2013). The 2.1-fold elevation in median DRIP-seq signal confirmed that RLFS motifs favor DNA-RNA hybrid formation (Figure 3C, Wilcoxon rank sum test P<2.2e-16, 95% CI 2.0–2.1, see Figure 3—source data 2 for each replicate). Moreover, DRIP-seq quantification showed 1.7-fold higher median signal at convT-positive TSS regions (Figure 3D, Wilcoxon rank sum test P<2.2e-16, 95% CI 1.6–1.7). These results demonstrate that transcription stalling occurs at RLFS and convT regions in mammalian cells and, based on evidence from ES cells, is associated with R-loop formation.

Transcription-coupled instability at RSS-independent SV hotspots

A mechanistic link between R-loops and AID off-targeting has been established in lymphomas (Alt et al., 2013). With this in mind, we investigated regions where off-targeting could occur via R-loops by focusing on breakpoints without RSS-motifs (data shown in figures represent the 416 ETV6-RUNX1 NR-breakp, refer to Figure 3—source data 1 and Supplementary file 3 for all statistical results). We observed significant genome-wide enrichment of breakpoints with the investigated transcriptional features (Figure 3E and F, 29% overlap with Pol2 stalling within gene regions, binomial test P=4.088e-07; 9% genome-wide overlap with convT, P=5.16e-07). This enrichment of breakpoints to convT and Pol2 stalling regions was significant across a wide range of transcriptional activity (refer to Supplementary file 3). Co-occurrence of breakpoints within a 1-kb window was used to distinguish non-recurrent (one breakpoint) and recurrent (more than one breakpoint) events (Figure 3—source data 1). Breakpoint recurrence was found to increase the overlap with both Pol2 stalling (Figure 3E) and convT (Figure 3F). The mean overlap observed in 1000-fold random sampling (grey bars) confirmed the specificity of the overlap (note that Pol2 stalling is analyzed from intragenic regions only). The breakpoints in Pol2 stalling sites were concordant with analysis using Pol2 ChIP-seq (by 78%) and they co-localized with both Ser2 and Ser5 phosphorylated forms of Pol2 complex (Figure 3—figure supplement 2). A schematic model summarizing the possible underlying mechanisms based on these results is shown in Figure 3G. The distinct integrated genomic profiles are collectively depicted at the TPI1 locus, representing an SV hotspot with the highest number of NR-breakp in ETV6-RUNX1 cases (Figure 3H, see also Figure 3—figure supplement 3 and Figure 2A). At the breakpoint region, both RLFS and convT are visible and overlap the elevated DRIP-seq signal measured from ES cells.
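The distinction between non-recurrent and recurrent events rests on grouping breakpoints that lie within 1 kb of each other. A minimal sketch of such grouping follows. It assumes a chained (single-linkage) rule in which each breakpoint is compared with the previous one in sorted order; whether the authors chained breakpoints in this way or used fixed windows is not specified here, and the function name and toy positions are hypothetical.

```python
def stitch_breakpoints(positions, max_gap=1000):
    """Group breakpoints on one chromosome into regions.

    Consecutive breakpoints (after sorting) that lie within max_gap bp
    of each other are merged into the same region.
    Returns a list of (start, end, count).
    """
    regions = []
    for pos in sorted(positions):
        if regions and pos - regions[-1][1] <= max_gap:
            start, _, count = regions[-1]
            regions[-1] = (start, pos, count + 1)   # extend the current region
        else:
            regions.append((pos, pos, 1))           # open a new region
    return regions

# Toy positions (hypothetical)
stitched = stitch_breakpoints([9500, 100, 900, 1800, 5000, 9000])
recurrent = [r for r in stitched if r[2] > 1]
```

Regions with a count of one correspond to non-recurrent events, and higher cut-offs on the count give the recurrence categories used in the overlap comparisons.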

Access to RAG cleavage sites increases at Pol2 stalling regions

Next, we focused on deciphering whether the transcriptional features associate with RAG off-targeting. We hypothesized that locally depleted nucleosomes around the Pol2 complex (Bevington and Boyes, 2013) may enhance access to RSS/RSS-like sequences. To this end, we retrieved DNAse hypersensitivity data from ENCODE (The ENCODE Project Consortium, 2012; see Materials and methods). DNAse-seq signal peaks were significantly wider when overlapping with Pol2 stalling sites (Figure 4A). An 876 bp (95% CI, 855–896) increase was observed in B-lymphoblastoid cells and a 412 bp (95% CI, 395–429) increase in ES cells (Wilcoxon rank sum test P<2.2e-16 in both cell types, see also Figure 3—source data 2). This was reproducibly observed using peaks located within gene TSS, body or end regions (Figure 4A). We selected TSS regions with RSS motifs for closer examination and found that Pol2 stalling sites at these TSS were significantly wider than at other TSS (Figure 4B), with a difference of 259 bp (95% CI, 79–475 bp, Wilcoxon rank sum test P=0.0024). Thus, wide Pol2 stalling increases the likelihood of RSS motif occurrence in accessible chromatin. The width of stalling did not correlate positively (Pearson’s correlation −0.11; 95% CI, −0.13 to −0.09) with the transcription level of the corresponding gene, indicating that stalling events, and not just active transcription, are important. We further analyzed the top 5% of widest Pol2 stalling regions by comparing them to the widest peaks from DNAse hypersensitivity and ChIP for histone marks (see Materials and methods). The odds ratios for the overlap are visualized as a heatmap (see Figure 4C, OR>10 is shown in darkest color tone, refer to Figure 4—source data 1 for more statistics). In addition to DNAse-seq and Pol2 ChIP peaks, H3K4me3 was found among the top category, confirmed also by ChIP-seq data acquired from REH and Nalm6 cells (Figure 4—source data 1).
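The comparison of the widest 5% of Pol2 stalling regions with the widest peaks of another feature reduces to an odds ratio on a 2x2 table. The sketch below simplifies the setting by assuming one stalling width and one peak width per locus (a paired layout that is an assumption, not the authors' overlap procedure); the function names, the `percent` parameter and the toy widths are hypothetical.

```python
def top_percent(widths, percent=5):
    """Indices of the widest `percent` of items (at least one item)."""
    k = max(1, len(widths) * percent // 100)
    order = sorted(range(len(widths)), key=lambda i: widths[i], reverse=True)
    return set(order[:k])

def wide_overlap_odds_ratio(stall_widths, peak_widths, percent=5):
    """Odds ratio for a locus being in the widest set of both features.

    Assumes stall_widths[i] and peak_widths[i] describe the same locus.
    """
    a = top_percent(stall_widths, percent)
    b = top_percent(peak_widths, percent)
    both = len(a & b)
    only_a = len(a - b)
    only_b = len(b - a)
    neither = len(stall_widths) - both - only_a - only_b
    if only_a == 0 or only_b == 0:
        return float("inf")          # perfect overlap in this toy setting
    return (both * neither) / (only_a * only_b)

# Toy widths for 40 loci (hypothetical)
stall = [100] * 40
peaks = [200] * 40
stall[0], stall[1] = 900, 800      # two widest stalling regions
peaks[0], peaks[2] = 2000, 1500    # two widest peaks, one shared locus
odds = wide_overlap_odds_ratio(stall, peaks)
```

In the toy data one locus is wide for both features, one for each feature alone and 37 for neither, giving an odds ratio of 37. A significance test such as the Fisher test mentioned in the source data description would be applied to the same table.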

Figure 4 (with 2 supplements)

SV with RSS motifs localize to Pol2 stalling regions with broad open chromatin regions.

(A) DNA access based on DNAse-seq peak width (GM12878 or H1 ES from ENCODE) is compared between regions with no Pol2 stalling (no color) and overlapping Pol2 stalling (light blue, cell-specific Pol2 stalling coordinates are listed in Figure 2—source data 1) at TSS, body and end region of transcripts (refer to Figure 3—source data 2 for statistical analysis based on separate DNAse-seq replicates). (B) The TSS stalling width is compared between TSS harboring R-breakp and TSS with no breakpoints (Wilcoxon rank sum test P is indicated, TSS with R-breakp N = 38, TSS without breakpoints N = 11957, 95% CI for size difference 67–491 bp). (C) The 5% widest Pol2 stalling regions were overlapped with similarly defined widest peaks in different ChIP- and DNAse-seq data (refer to Figure 4—source data 1 for details and all statistics). The odds-ratio (OR) for the overlap is visualized in color from discrete categories (<5; 5–10; >10, with darker color tones indicating higher OR). Pol2 S5P, DNAse-seq and H3K4me3 peaks had highest OR based on both B-lineage and ES cell data. (D, E) The percentages of R-breakp overlapping Pol2 stalling (as in Figure 3E) or RLFS, convT and annotated TSS (as in Figure 3F) are shown as barplots, respectively. Overall, the recurrence was higher compared to NR-breakp and therefore two categories for recurrent R-breakp are shown (>2; >4). The overlap with convT reaches 91% at highly recurrent R-breakp hotspots (source data can be found in Figure 2—source data 1, S6 and statistics for genes binned by their transcription level in Supplementary file 3). (F) A schematic model illustrating how the transcriptional features may lead to the recruitment of RAG1 and RAG2 based on RSS-motif recognition and chromatin. Pol2 stalling is associated with DNA accessibility and wide deposition of the H3K4me3 mark. (G) R-breakp hotspot with the highest recurrence (*BTG1* locus) is shown. B-lymphoblastoid and ES cell tracks from DNAse-seq and H3K4me3 from pre-B-ALL cells (Nalm6) represent signals with highest overlap to wide Pol2 stalling (other tracks as in Figure 3H, see also Figure 4—figure supplement 1).

https://doi.org/10.7554/eLife.13087.024

Figure 4—source data 1 Overlap of wide Pol2 stalling regions with unusually wide peaks representing other chromatin features. The table summarizes the highest observed odds ratios in the Fisher test for the overlap between the top 5% widest chromatin features and the 5% widest Pol2 stalling regions. Empirical p-values are reported together with the Fisher test values separately for ES and B-lineage cells. Data for the different replicate experiments are shown as a separate worksheet. https://doi.org/10.7554/eLife.13087.025

Next, the ETV6-RUNX1 R-breakp (335; 156 intragenic) were analyzed for their genome-wide overlap with the transcriptional features. A 66% overlap was found with Pol2 stalling at intragenic regions (binomial test P<2.2e-16) and a 44% genome-wide overlap with convT (binomial test P<2.2e-16, see also Figure 3—source data 1 for joint analysis across pre-B-ALL subtypes). The overlap with Pol2 stalling showed high agreement between GRO-seq and ChIP-seq (Figure 3—figure supplement 2) and it increased at recurrent R-breakp (Figure 4D). In addition, overlap with convT (Figure 4E) was considerable (91%) at regions with 4 or more breakpoints. In comparison, regions with RLFS motifs or annotated TSSs showed less marked enrichment (up to 36%) (Figure 4E). As for NR-breakp, the significant overlap with transcriptional features was preserved across a wide range of expression levels (Supplementary file 3). A schematic model that links the obtained results with vulnerability to RAG cleavage is shown in Figure 4F. As in Figure 3I, the different profiles are depicted at the SV hotspot with the highest number of R-breakp (Figure 4G, BTG1 locus, see also Figure 4—figure supplement 1). Further examples in Figure 4—figure supplement 2 show RSS-dependent clustered deletions as defined in Papaemmanuil et al. (2014). Overall, the presence of both convT and Pol2 stalling best characterized the recurrent ETV6-RUNX1 breakpoints with RSS motifs (101/148; compared to 20/70 without motif), with 90% (43/48, empirical P=0.002) co-occurrence at intragenic sites (see also Supplementary file 4).
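To make the comparison of 101/148 versus 20/70 concrete, the counts can be arranged as a 2x2 table and summarized with an odds ratio and a one-sided Fisher exact test. The sketch below is illustrative only: this particular test on these counts is not reported in the text, and the function name `fisher_exact_greater` is a hypothetical helper written with the Python standard library.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Returns (sample odds ratio, P(X >= a)) under the hypergeometric null,
    where X is the count in the top-left cell with all margins fixed.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(
        comb(col1, k) * comb(total - col1, row1 - k)
        for k in range(a, upper + 1)
    )
    p_value = tail / comb(total, row1)
    odds_ratio = (a * d) / (b * c)
    return odds_ratio, p_value

# Counts quoted in the text: 101 of 148 breakpoints with an RSS motif and
# 20 of 70 without a motif showed both convT and Pol2 stalling.
with_motif_both, with_motif_total = 101, 148
no_motif_both, no_motif_total = 20, 70

odds_ratio, p_value = fisher_exact_greater(
    with_motif_both, with_motif_total - with_motif_both,
    no_motif_both, no_motif_total - no_motif_both,
)
print(f"odds ratio = {odds_ratio:.2f}, one-sided P = {p_value:.2e}")
```

The empirical p-values reported in the study come from a different procedure; this table-based calculation only shows how strongly the two proportions differ.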

AID expression marks pre-B-ALL lacking common cytogenetic changes

To elucidate the potential for RAG- and AID-mediated genetic instability in leukemia blasts, we compared the expression of the genes RAG1, RAG2 and AICDA across a transcriptome data set of 1382 pre-B-ALL patients (Figure 5—source data 1, Figure 5). Among samples with annotation of cytogenetic subtype (N = 1008), the ETV6-RUNX1 cases (N = 153) exhibited a 10.8-fold higher median level of RAG1 expression relative to other cases with annotated cytogenetic type (Wilcoxon rank sum test P<2.2e-16, 95% CI 8.6–13.6-fold, Figure 5A) and also high RAG2 expression (Figure 5B). Moreover, AICDA expression was detected in a specific subset of patients. It was highest in the 'other' group (N = 267) that does not carry recurrent fusion genes or karyotypic changes (Figure 5C; no statistical evaluation was performed as the majority of signal values were below the detection level of 4.2 on the log2 scale). For comparison, we carried out an unsupervised analysis of sample similarities based on the global gene expression profiles. To visualize these molecular subtypes in two dimensions, we utilized the t-Distributed Stochastic Neighbor Embedding (t-SNE) method (van der Maaten and Hinton, 2008) (see Materials and methods, refer to Figure 5—source data 1 for coordinates). The t-SNE map places highly similar samples in close proximity. The discrete expression states (high; low; not detected) of RAG1, RAG2 and AICDA were evident in distinct groups (Figure 5D–F, respectively; the annotated ALL subtypes are colored in Figure 5G). Upon further examination, high levels of AICDA expression were particularly prevalent in sample clusters that corresponded to high-risk cases from two independent ALL datasets (hypergeometric test P=7.19e-47, Figure 5H, see Supplementary file 5 for patient characteristics). The highest level of AICDA expression was observed in a relapsed ALL case, and the RAG1 and RAG2 expression levels were 3.09- and 1.93-fold increased at relapse, respectively. Based on the integrated patient profiles, the expression of AID and RAG is distinct across leukemia subtypes and clinical prognosis groups.
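Because the expression signals are on a log2 scale, a fold change between groups corresponds to a difference of log2 medians. The sketch below illustrates that arithmetic only; the sample values are hypothetical and are not patient data, and `median_fold_change` is an illustrative helper rather than the estimator used for the reported confidence interval.

```python
from math import log2
from statistics import median

def median_fold_change(log2_group, log2_reference):
    """Fold change between group medians for signals stored on a log2 scale."""
    return 2 ** (median(log2_group) - median(log2_reference))

# Hypothetical log2 signals, for illustration only (not patient data).
etv6_runx1_like = [8.9, 9.4, 9.1, 9.6, 9.2]
other_subtypes = [5.6, 5.9, 5.8, 6.1, 5.7]

fold = median_fold_change(etv6_runx1_like, other_subtypes)

# A reported 10.8-fold difference corresponds to this many log2 units.
log2_gap_for_reported_fold = log2(10.8)

print(f"fold change = {fold:.1f}")
print(f"10.8-fold = {log2_gap_for_reported_fold:.2f} log2 units")
```

On this scale, the reported 10.8-fold elevation of RAG1 amounts to a shift of roughly 3.4 log2 units between the group medians.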

Figure 5

Expression of AID and RAG across molecular subtypes of leukemia.

The log2 expression signal is summarized as boxplots for (A) *RAG1*, (B) *RAG2* and (C) *AICDA* across the pre-B-ALL subtypes (N = 153 BCR-ABL1, N = 153 ETV6-RUNX1, N = 151 hyperdiploid, N = 198 MLL rearrangement, N = 267 other, N = 82 TCF3-PBX1). The Wilcoxon rank sum test p-value is indicated in (A) for differential *RAG1* expression in the ETV6-RUNX1 subtype (N = 153, patients with cytogenetic subtype information N = 1008). (D–F) Alternative representation of discrete expression states for *RAG1*, *RAG2* and *AICDA*, respectively (red: high, pink: low, grey: not detected). The data points shown as a t-SNE map correspond to the full set of pre-B-ALL patient samples (N = 1382) (see also Figure 5—source data 1). Their relative positions are defined by the transcriptome similarity. The sample groups can be compared to annotated cytogenetic types, as colored on the same map in (G). (H) The location of high-risk samples (N = 295) from two independent studies is indicated in color on the same map (COG studies GSE7740 in red and GSE11877 in black, see also Supplementary file 5). The hypergeometric test p-value is indicated for enrichment of detected *AICDA* expression in the high-risk studies (N = 112, refer to Supplementary file 5 for population statistics).

https://doi.org/10.7554/eLife.13087.028

Figure 5—source data 1 pre-B-ALL transcriptome samples. Sample identifiers of pre-B-ALL transcriptomes analyzed and their coordinates on the dimensionality reduction plot. https://doi.org/10.7554/eLife.13087.029

Discussion

Next generation sequencing technologies have enabled the elucidation of mechanisms regulating transcription and the analysis of genetic alterations across different cancer genomes. Precursor leukemias are unique in that they often harbor SV and have relatively few mutations (Roberts and Mullighan, 2015). Recently, a functional role of transcription in genomic instability has begun to emerge (Hatchi et al., 2015; Sollier et al., 2014). The maturing lymphoid cells are vulnerable to off-target effects downstream of the RAG and AID activity that is required for immune gene rearrangement (Meng et al., 2014; Qian et al., 2014; Papaemmanuil et al., 2014; Swaminathan et al., 2015). The present study represents a systematic investigation of SVs detected in acute pre-B-cell leukemia using WGS in the context of global transcriptional activity in leukemic cells. We identified specific transcriptional features, namely convergence of transcription and Pol2 stalling, as key factors underlying secondary genetic lesions frequently seen in precursor B leukemias.

Pol2 stalling and convT strongly associate with recurrent breakpoint sites across the genome and at gene loci implicated in leukemia such as CDKN2A and PAX5 (Sulong et al., 2009). While protein-coding secondary hits required in disease progression have been recognized for some time, our integrative analysis identified several putative long non-coding RNAs and eRNAs, which merit further investigation. Earlier work has linked eRNAs generated from intragenic superenhancers with AID-mediated instability in lymphomas (Meng et al., 2014), proposing that convT leads to arrested transcription, in agreement with experimental evidence from yeast cells (Prescott and Proudfoot, 2002). Similarly, it has been shown that Pol2 stalling and R-loops expose ssDNA for AID targeting (Huang et al., 2007; Pavri et al., 2010; Alt et al., 2013). We show for the first time that leukemia breakpoints similarly display significant enrichment at enhancers overlapping convT. We further demonstrate a link between convT and elevated R-loop levels and Pol2 stalling on a genome-wide level, with evidence from normal and leukemic human cells. These mechanisms of transcription-coupled genetic instability, earlier implicated in lymphomas (Pavri et al., 2010; Meng et al., 2014; Pefanis et al., 2014) and breast cancer (Hatchi et al., 2015), therefore have relevance in multiple cancer types.

Breakpoints carrying RSS-like recognition motifs for RAG1 showed high overlap with the vulnerable regions as defined by convT and Pol2 stalling. Therefore, we propose that RAG1 access to its target sites is also related to the fidelity of elongation. Previous studies investigating motif recognition and genome-wide binding profiles of RAGs have shed light on the mechanisms by which this complex is recruited to DNA (Bevington and Boyes, 2013; Teng et al., 2015); however, these studies have been carried out using normal cells or mouse models, which limits their integration with patient WGS data. The chromatin mark H3K4me3 typically found at active promoters serves as a docking site for RAG2 (Matthews et al., 2007; Teng et al., 2015). RAG-mediated cleavage further requires recognition of RSS motifs by RAG1 (Schatz and Swanson, 2011). Our results revealed that TSS that carry breakpoints with an RSS motif differ from unaffected TSS by the presence of unusually wide Pol2 stalling. We show that Pol2 stalling sites, in general, have increased DNA accessibility. Further, the top 5% of widest stalling regions are characterized by unusually broad DNAse hypersensitive regions and H3K4me3 signal. Unique regulation of Pol2 pausing and elongation has been recognized to be related to broad H3K4me3 domains across a wide variety of cell types (Benayoun et al., 2014; Scheidegger and Nechaev, 2016). Together, these properties of Pol2 stalling sites may favor both the recognition and cleavage by the RAG complex.

In this study, we developed a genome-wide approach to capture Pol2 stalling events across gene bodies using change point analysis. This extends previous approaches for detecting promoter-proximal pausing events (reviewed in Adelman and Lis, 2012) to the analysis of Pol2 slowing within the full transcribed region. The feasibility of our approach was confirmed by the high overlap of detected regions with RLFS-rich regions that represent known structural obstacles to the progression of transcription (Skourti-Stathaki et al., 2014a; Skourti-Stathaki et al., 2014b). Furthermore, analysis of Pol2 stalling from Pol2 ChIP-seq profiles acquired in pre-B-ALL cells showed high agreement with the GRO-seq profiles. The slowing of Pol2 upon transition from initiation to elongation, measured by Pol2 Ser5 phosphorylation, occurs at AID hypermutation sites within the IgH-V region (Wang et al., 2014a). We show that this type of Pol2 stalling had high overlap with leukemia breakpoints.
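The idea of locating a shift in signal level along a transcript can be illustrated with a generic least-squares change point search. This is a minimal sketch under stated assumptions, not the procedure used in the study (which is described in Materials and methods): it finds a single split of a one-dimensional signal into two constant-mean segments, and both `best_change_point` and the coverage values are hypothetical.

```python
def best_change_point(signal):
    """Return the index that best splits `signal` into two constant-mean segments.

    The chosen split minimizes the summed squared deviation of each segment
    from its own mean; signal[:k] and signal[k:] are the two segments.
    """
    n = len(signal)
    if n < 2:
        raise ValueError("need at least two values")

    # Prefix sums allow the squared error of any segment in constant time.
    prefix_sum = [0.0]
    prefix_sq = [0.0]
    for x in signal:
        prefix_sum.append(prefix_sum[-1] + x)
        prefix_sq.append(prefix_sq[-1] + x * x)

    def sse(i, j):
        """Sum of squared deviations from the mean for signal[i:j]."""
        s = prefix_sum[j] - prefix_sum[i]
        q = prefix_sq[j] - prefix_sq[i]
        return q - s * s / (j - i)

    return min(range(1, n), key=lambda k: sse(0, k) + sse(k, n))

# Hypothetical binned read coverage along a transcript: a high-signal
# stretch followed by a lower, steadier level.
coverage = [9, 10, 11, 10, 9, 3, 2, 3, 2, 3]
split = best_change_point(coverage)
print(f"signal level changes at bin {split}")
```

A genome-wide analysis would need to handle multiple change points per transcript and noise in sparse regions, which this single-split sketch does not attempt.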

While RAG has a well-established role in pre-B cells, expression of AID represents a recently discovered threat to lymphoid precursor genome integrity. Swaminathan et al. (2015) showed that infection-triggered attenuation of IL-7 receptor signaling led to strong AID expression, thus exposing pre-leukemic cells to additional off-targeting events. Moreover, a negative effect on patient survival and increased relapse frequency were observed in leukemia patients with high AICDA expression (Swaminathan et al., 2015). We found that high expression of RAG1/2 or AICDA is markedly distinct between different subtypes of pre-B-ALL at the leukemia state. Prevalent AICDA expression was a distinguishing feature of high-risk pre-B-ALL cases, in line with the previous data (Swaminathan et al., 2015). Furthermore, the molecular profiles of patients belonging to the cytogenetic subtype designated as 'other' had high similarity, placing them in close proximity on the t-SNE map. This genetically heterogeneous category of rare cytogenetic types had a distinct elevation in AICDA expression. Further investigation of the WGS profiles focusing on this patient category may shed light on whether AICDA expression could be an underlying factor that spurs the diversity of DNA lesions in these patients. Similarly, the over ten-fold higher RAG1 expression could also be relevant for the prevalent development of leukemia carrying the ETV6-RUNX1 initiating fusion. The RAG locus is under complex regulation of local chromatin looping by SATB1 (Hao et al., 2015), which controls silencing and activating regulatory elements and was shown to directly control the elevated RAG1 expression in mice. The enhancer activity in patient blast cells, as captured here in the nascent transcriptomes, will help in understanding the regulation of such key loci in detail.

As more data on SV becomes available across cancers, further efforts should be made to elucidate the contribution of different complexes in transcription-coupled genomic instability and to develop strategies for dampening their levels and activity. Translation of these measures into clinical practice could impact treatment efficacy by decreasing clonal heterogeneity and relapse risk.

Author details

Merja Heinäniemi (for correspondence)

Tapio Vuorenmaa (contributed equally)

Susanna Teppo (contributed equally)

Minna U Kaikkonen (contributed equally)

Maria Bouvy-Liivrand

Juha Mehtonen

Henri Niskanen

Vasilios Zachariadis

Saara Laukkanen

Thomas Liuksiala

Kaisa Teittinen

Olli Lohi (for correspondence)
