1. Computational and Systems Biology
  2. Genetics and Genomics
Download icon

Global donor and acceptor splicing site kinetics in human cells

  1. Leonhard Wachutka
  2. Livia Caizzi
  3. Julien Gagneur  Is a corresponding author
  4. Patrick Cramer  Is a corresponding author
  1. Technical University of Munich, Germany
  2. Max-Planck-Institute for Biophysical Chemistry, Germany
Tools and Resources
  • Cited 1
  • Views 1,877
  • Annotations
Cite this article as: eLife 2019;8:e45056 doi: 10.7554/eLife.45056

Abstract

RNA splicing is an essential part of eukaryotic gene expression. Although the mechanism of splicing has been extensively studied in vitro, in vivo kinetics for the two-step splicing reaction remain poorly understood. Here, we combine transient transcriptome sequencing (TT-seq) and mathematical modeling to quantify RNA metabolic rates at donor and acceptor splice sites across the human genome. Splicing occurs in the range of minutes and is limited by the speed of RNA polymerase elongation. Splicing kinetics strongly depends on the position and nature of nucleotides flanking splice sites, and on structural interactions between unspliced RNA and small nuclear RNAs in spliceosomal intermediates. Finally, we introduce the ‘yield’ of splicing as the efficiency of converting unspliced to spliced RNA and show that it is highest for mRNAs and independent of splicing kinetics. These results lead to quantitative models describing how splicing rates and yield are encoded in the human genome.

https://doi.org/10.7554/eLife.45056.001

eLife digest

Genes are portions of DNA that carry the instructions to build proteins. In particular, they are formed of segments called exons, which contain the protein-building information, and of non-coding segments known as introns. Exons and introns alternate within a gene.

To create a given protein, the cell first uses an enzyme, Polymerase II, to copy the entire related gene – including introns and exons – into a molecule of ribonucleic acid, or RNA. As the gene is copied, a machine called the spliceosome comes onto the RNA molecule to remove the introns and create the final RNA template used to produce proteins.

The spliceosome works by recognizing specific sequences that signal the border between introns and exons. Once the machine is bound to these ‘splice sites’ on each side of an intron, it brings the two neighboring exons close together and cuts out the intron. The two ends of the exons are then attached together. Previous studies have measured how fast introns are removed, but it remained unclear how long it takes to cut individual splice sites genome-wide.

To address this question, Wachutka, Caizzi et al. combined a mathematical approach with a biochemical method that purifies newly made RNA in human cells. The experiments showed that it only took a few minutes to cut most splice sites. Cutting splice sites that bordered very long introns was slower, presumably because the Polymerase II took longer to produce these introns. In addition, the genetic sequences of the splice sites affected the time it took to remove the introns: some made it harder for the spliceosome to recognize where to cut, but others made it easier.

Mistakes in removing introns from RNA can lead to producing abnormal proteins, and many diseases such as cystic fibrosis and Duchenne muscular dystrophy can be caused by such errors. In particular, small changes in the sequences at the splice sites or in the surrounding areas can create problems when it comes to eliminating introns. Decrypting the dynamics of intron cutting and removal may give scientists new insight into the molecular causes of cystic fibrosis and many other genetic disorders.

https://doi.org/10.7554/eLife.45056.002

Introduction

Transcription of eukaryotic genes produces precursor RNA molecules that are processed by splicing. Splicing is a two-step reaction that results in the removal of introns from precursor RNA and the formation of mature RNA with joined exons. Splicing is catalyzed by the spliceosome, a dynamic ribonucleoprotein machine assembled from small nuclear ribonucleoprotein complexes (snRNPs) and several non-RNP factors (Herzel et al., 2017; Wahl et al., 2009). In metazoa, the majority of the introns are excised by the major spliceosome, whereas a minority is removed by the minor spliceosome (Will and Lührmann, 2011). Spliceosomes are recruited through conserved RNA elements, the 5´-splice site at the exon-intron border (donor site), the 3´-splice site at the intron-exon border (acceptor site), and the branch point, which is followed by a polypyrimidine track (Coolidge et al., 1997; Turunen et al., 2013) and located ~18–40 nucleotides (nt) upstream of the acceptor site (Ruskin and Green, 1985; Ruskin et al., 1985; Taggart et al., 2012). Additional RNA sequences such as splicing enhancers and silencers are found in both introns and exons and can influence the choice between splice sites (Zhu et al., 2001).

In recent years, the intricate mechanisms of splicing were investigated with a combination of structural and functional studies (Mayerle and Guthrie, 2017; Shi, 2017; Wilkinson et al., 2018; Will and Lührmann, 2011). The spliceosome assembles in a stepwise manner, adopting different intermediates with varying composition, conformation, and interactions between RNAs and proteins (Will and Lührmann, 2011). Assembly of the major spliceosome begins with recognition of the donor site by U1 snRNP (Kondo et al., 2015; Lerner et al., 1980; Seraphin and Rosbash, 1989; Zhuang and Weiner, 1986). The U2 auxiliary factor then binds to the poly-pyrimidine track and the acceptor site (Valcárcel et al., 1996; Zamore and Green, 1989; Zamore et al., 1992) generating the E complex. U2 snRNP then binds the branchpoint, resulting in the A complex (Konarska and Sharp, 1986; Perriman and Ares, 2010). In the A complex, U1 snRNA binds the donor site and U2 snRNA binds the branchpoint, rendering the branchpoint adenosine base available for interaction with the acceptor site (Berglund et al., 2001; Query et al., 1994). Binding of the U4/U5/U6 tri-snRNP leads to the B complex (Bertram et al., 2017a), which is first activated (Bact) and then converted to the catalytically active B* complex. In the B* complex, the donor site is positioned close to the branchpoint in a RNA network formed between the precursor RNA and U2, U5 and U6 snRNAs (Zhang et al., 2018).

The activated spliceosome catalyzes intron removal in two steps, which are both transesterification reactions. In the first step, the 2´-hydroxyl group of the branchpoint adenosine serves as a nucleophile to attack the donor site and to generate a cleaved 5´-exon and the lariat intermediate. This leads to the C complex (Zhan et al., 2018) that is then rearranged to form the C* complex, which is catalytically active to carry out the second step. In the C* complex, the ends of exons to be joined are juxtaposed (Bertram et al., 2017b; Zhang et al., 2017), and this enables the 3´-hydroxyl group of the last nucleotide of the 5´-exon to attack the acceptor site, leading to exon ligation and excision of the intron lariat. The resulting P complex contains the ligated exons, which are subsequently released, completing the splicing process (Bai et al., 2017; Wilkinson et al., 2017).

Taken together, extensive in vitro studies have strongly advanced our understanding of the splicing process, but the kinetics and mechanisms of splicing in vivo remain far less understood. Although biochemical assays show that splicing can occur in the absence of transcription, in vivo splicing happens mainly co-transcriptionally, when newly transcribed RNA is still attached to RNA polymerase II (Pol II) and chromatin (for review see Alexander and Beggs, 2010; Bentley, 2014; Saldi et al., 2016). Furthermore, compromising Pol II transcription elongation increases alternative splicing (de la Mata et al., 2003; Dujardin et al., 2014; Ip et al., 2011; Pagani et al., 2003), providing evidence that an optimal elongation rate is essential for a co-transcriptional splicing (Davis-Turak et al., 2018; Fong et al., 2014). Native elongating transcript sequencing in human cells (NET-seq) indicates that splicing occurs soon after introns are synthesized (Mayer et al., 2015; Nojima et al., 2015). Further, the combination of single molecule intron tracking (SMIT) and long read sequencing in yeast shows that splicing is 50% complete when Pol II is 45 nt downstream the acceptor spice site (Oesterreich et al., 2016).

Despite these advances, the in vivo kinetics of splicing remain poorly understood. In particular, different estimates for splicing rates have been reported. Splicing rates have been measured for selected endogenous human genes with the use of live cell imaging (Coulon et al., 2014; Martin et al., 2013; Rino et al., 2014; Schmidt et al., 2011) or with a combination of cellular RNA extraction and quantitative PCR (Pandya-Jones and Black, 2009; Singh and Padgett, 2009). Such studies led to very different splicing rate estimates, ranging from 15 to 30 s (Huranová et al., 2010; Martin et al., 2013; Rino et al., 2014) to 4.3-10 min (Coulon et al., 2014; Schmidt et al., 2011; Singh and Padgett, 2009) per splicing event. These discrepancies may stem from the difference in methods used, which can introduce perturbations, from the selection of genes studied, and from the great variance in intron lengths between human genes.

Other studies have estimated in vivo splicing rates globally with the use of RNA sequencing technologies. The short sequence reads collected from steady-state in vivo samples reflect RNA synthesis, splicing and degradation, which are entangled (Wachutka and Gagneur, 2017). In particular, the ‘splicing efficiency’ is typically defined by the ratio of spliced over unspliced RNAs at steady state (Braberg et al., 2013; Wilhelm et al., 2008). However, the same ratio has been successfully employed to study RNA stability, with the assumption that unspliced RNA levels reflect RNA synthesis and spliced RNA levels reflect the ratio of RNA synthesis over degradation (Gaidatzis et al., 2015; Zeisel et al., 2011). This ambiguity in definitions and interpretations questions the use of splicing efficiency and calls for alternative concepts and metrics. To overcome the limitations of steady-state transcriptome sequencing, one approach is to sequence new transcripts from chromatin-associated RNA fractions and compare them to cytoplasmic fractions, which led to splicing rate estimates from 43 s (Davis-Turak et al., 2015) to 15–120 min (Bhatt et al., 2012; Pandya-Jones et al., 2013) per splicing event.

An alternative method to investigate splicing kinetics in vivo is the use of metabolic RNA labeling with 4-thiouracil (4sU) (Dölken et al., 2008; Rabani et al., 2011; Rabani et al., 2014; Windhager et al., 2012) coupled to sequencing of the labeled RNA (4sU-seq). We previously combined 4sU-seq with kinetic modeling to obtain RNA synthesis, splicing, and degradation rates in the fission yeast S. pombe (Eser et al., 2016). Others have used 4sU-seq and similar approaches to obtain median splicing rate estimates in human cells of 6.7 min (Mukherjee et al., 2017) or 14 min (Rabani et al., 2014) per splicing event. However, 4sU-seq also introduces biases when applied to the human system because the obtained data are artificially biased toward pre-existing 5’-regions of the RNA due to the length of human genes (Schwalb et al., 2016). These RNA 5’-regions predate the labeling time and are generally already observed to be spliced by 4sU-seq, leading to potential errors in the rate estimates. As a result of these difficulties, in vivo splicing kinetics remain unclear, and individual rate estimates for the two steps of splicing are lacking. However, such information is highly desirable because it may be interpreted alongside the mechanistic information obtained in vitro to provide a better understanding of the splicing process.

To study kinetics of splicing in vivo, we performed TT-seq (Schwalb et al., 2016) after different 4sU-labeling time points in human K562 cells. In contrast to 4sU-seq, the TT-seq protocol includes RNA fragmentation before 4sU-labeled RNA is purified and sequenced. This is a crucial step that eliminates 5’ regions of nascent RNAs that were already transcribed, and spliced, prior to incorporation of the label, and thus removes the 5’-bias. We developed a computational approach that estimates the metabolic rates of single phosphodiester bonds. This approach enabled uncoupled quantification of donor- and acceptor-specific kinetics and to relate these to the two transesterification reactions and to the contribution of single nucleotides in spliceosome intermediates defined by structural studies. Moreover, to calculate the amount of precursor RNA successfully spliced into mature RNA, we introduced the ‘splicing yield’ as the conversion efficiency of unspliced to spliced RNA. As a result, our analysis provides genome-wide metabolic rates for donor and acceptor splice sites and identifies RNA-RNA interactions in the spliceosome that could contribute to in vivo splicing kinetics. Furthermore, we provide genome-wide estimates of the splicing yield that is not biased by splicing kinetics. From this work emerges a comprehensive global view of splicing kinetics and yield in human cells.

Results

RNA labeling monitors intron removal

To monitor global RNA metabolism in human cells, we performed TT-seq analysis in K562 cells after different times of RNA labeling with 4-thiouracil (4sU) (Figure 1A). We previously showed that such a labeling time series can estimate splicing rates in the yeast S. pombe (Eser et al., 2016). We exposed K562 cells to 500 μM of 4sU for a labeling time of 2, 5, 10, 15, 20, 30, or 60 min, isolated RNA, and conducted both TT-seq and total RNA-seq (Figure 1A). On average, we obtained 250 million and 55 million 150-nucleotide (nt) paired-end reads for each of the TT-seq and RNA-seq samples, respectively. We next mapped TT-seq and RNA-seq data to the human genome (Materials and methods). The experiments were highly reproducible (Figure 1—figure supplement 1A). Visual inspection of the mapped reads revealed strong TT-seq signals in transcribed regions covering both introns and exons (Figure 1B), whereas RNA-seq data covered mainly exons (Figure 1B, Figure 1—figure supplement 1B). The relative number of intronic reads in TT-seq data decreased with 4sU-labeling time, whereas the signal for exons increased. These observations were consistent with capture of newly synthesized precursor RNA because spliced introns are more rapidly degraded than exons that are maintained in mature, stable RNA. Thus, our time series data contained information about the kinetics of precursor RNA splicing that we exploited further.

Figure 1 with 1 supplement see all
TT-seq monitors human RNA splicing.

(A) Experimental design. TT-seq and RNA-seq were performed after 2, 5, 10, 15, 20, 30 and 60 min of 4sU-labeling in human K562 cells. The number of reads spanning exon-intron/intron-exon splice-sites (non-split reads) gradually decrease with longer labeling duration, while exonic reads and reads spanning exon-exon junctions (split reads) increase during the time course, converging to levels similar to RNA-seq. (B) Coverage track of MYNN gene for 2, 5, 10, 15, 20, 30 and 60 min of 4sU-labeling followed by TT-seq and 2 min of 4sU-labeling followed by total RNA-seq. (C) Distribution of non-split reads (left) and split-reads (right) in previously published 4sU-seq (orange) and TT-seq (blue) (Schwalb et al., 2016) split by quartiles of the transcription start site to donor splice-site distance. Black bars represent the median values for each group. Lower and upper boxes are the first and third quartile, respectively. (D) Fractions of novel splicing events detected in intragenic and intergenic genomic regions. Solid boxes represent known exons, dashed boxes represent novel exons. (D based on Supplementary file 1).

https://doi.org/10.7554/eLife.45056.003

New and alternative splice sites

We first analyzed our TT-seq data for the occurrence of reads that are informative of precursor RNA splicing, that is reads spanning exon-exon boundaries (‘split reads’) and reads spanning exon-intron (‘donor’) and intron-exon (‘acceptor’) boundaries (‘non-split reads’). Using a threshold of at least 10 split reads for an exon-exon boundary, we found 341,855 putative introns (Materials and methods). The relative number of these non-split reads compared to split reads was highest after 2 min of 4sU-labeling and progressively decreased with longer labeling times, eventually converging to a similar coverage as in the RNA-seq samples (Figure 1—figure supplement 1C). Coverage with split reads decreased with the distance to the transcription start site (TSS) in 4sU-seq data but not in TT-seq data (Figure 1C, data from Schwalb et al., 2016), suggesting that the previously reported estimation of splicing rates with 4sU-seq (Mukherjee et al., 2017; Rabani et al., 2014), overestimated splicing rates for 5’ introns compared to 3’ introns. With the use of TT-seq we could avoid a 5’-bias in splicing rate estimations.

About one half of the putative introns (177,322) mapped to splice sites that had been annotated in the database of transcribed regions GENCODE (Materials and methods), whereas the other half did not (164,533). Of these putative introns that had not been previously annotated as splice sites in GENCODE, more than 99% represented introns ending with the GT|AG canonical dinucleotides. Moreover, 21% represented new combinations of already annotated donors and acceptors (Figure 1D). Furthermore, 18% map to a non-annotated donor site and to a previously annotated acceptor site, 22% to a non-annotated acceptor site and to a previously annotated donor site. Interestingly, another 38% mapped to both non-annotated donor sites and non-annotated acceptor sites. Of these, about one half was located within an annotated GENCODE gene (intronic), whereas the other half was located in regions of the genome not annotated in GENCODE (intergenic). Overall, this analysis indicates that the number of splice sites has previously been underestimated, in agreement with recent studies that integrated very large datasets of the public RNA-seq repository (Nellore et al., 2016) or studies that used full-length mRNA sequencing (Anvar et al., 2018).

Kinetic modeling

Defining the kinetics of RNA synthesis, splicing, and degradation from short-read-based protocols is inherently ambiguous due to the many RNA species overlapping any genomic position, including precursor RNAs and multiple splice isoforms. In the future, quantitative and high-throughput full-length transcriptome sequencing may become available to improve the situation; however, co-transcriptional alternative splicing would still cause ambiguities. We have therefore shown it is accurate to analyze the metabolism of phosphodiester bonds rather than RNA species themselves (Wachutka and Gagneur, 2017). Following this idea, we modeled the steady-state rates of synthesis and degradation (or equivalently cleavage) of each of three different phosphodiester bonds individually: the exon-intron bond at the donor site, the intron-exon bond at the acceptor site, and the bond between the two joined exons after successful exon ligation to yield product RNA (Figure 2A left, Appendix). We refer to these definitions throughout when we use the terms ‘splicing kinetics’ or ‘splice site kinetics’.

Figure 2 with 1 supplement see all
Estimation of RNA metabolic rates at individual phosphodiester bonds.

(A) Definition of kinetic parameters at the level of individual phosphodiester bonds enables independent estimation of the rates of donor (red) and acceptor (blue) bond half-life and of junction (product, green) formation (left). Reads covering the introns were extracted and divided into three classes: reads starting at the upstream exon and extending into the intron (red), reads starting within the intron and extending to the downstream exon (blue) and reads mapped to the upstream and downstream exon but gaping the intron (green). Theoretical analytical curves as well as the curves corrected for cross-contamination for TT-seq and RNA-seq are shown. Sequencing depth-normalized counts are provided for both replicates (circle, triangle). Estimated standard errors are depicted in grey. A dip in expectation value of the junction count was observed at 30 min 4sU-labeling time due to variation of cross-contamination across samples (unlabeled RNA in TT-seq samples), (right). (B–D) Violin plots representing the distribution of synthesis rate (B), average donor and acceptor bond half-life (C) and junction bond half-life (D) for mRNAs (n = 8,770), lincRNAs (n = 204), antisense RNAs (asRNA, n = 162) and other ncRNAs (n = 290). Black bars represent the median values for each group. Lower and upper boxes are the first and third quartile, respectively. (B–D based on Supplementary file 3).

https://doi.org/10.7554/eLife.45056.005

We then considered the metabolism of these three phosphodiester bond types at steady-state. Synthesis balances out degradation at steady-state for any molecular species, independently of the kinetics. The steady-state synthesis rate (amount produced per unit of time) and the steady-state degradation rate (ratio of steady state amount by the steady-state synthesis rate) are defined quantities without any assumption on the kinetics (Appendix). Synthesis of the donor and acceptor bonds reflects precursor RNA synthesis. Cleavage of the donor bond is caused by either splicing or by precursor RNA degradation. The half-life of those donor bonds that get spliced depends on intronic transcription at least up to the branchpoint and on the first transesterification step. Cleavage of the acceptor bond is caused by either splicing or by precursor RNA degradation. The half-life of those acceptor bonds that get spliced is determined by the first and the second transesterification step but not by intronic transcription. Synthesis of the junction bond is the outcome of completed splicing. Cleavage of the junction bond indicates RNA degradation (Materials and methods, Appendix).

Our experimental design includes the injection of labeled and unlabeled spike-ins at constant concentrations in all samples, prior to the purification step. These spike-ins allowed for estimating the variations in sequencing depth as well as the overall newly synthesized RNA fraction of every sample (Materials and methods). The unlabeled spike-ins also allowed estimating the amount of cross-contamination, that is of unlabeled RNAs that are purified and which can represent a large fraction of all RNA-seq reads at short labeling durations (Materials and methods). These technical parameters estimated from the spike-ins read counts were then used as covariates to model expected read counts of all three types of bonds in each sample.

Application and testing of the kinetic model

Using these considerations, we fitted the abundance of each of these three types of bonds with a first-order kinetic model for a total of 162,134 donor, 177,543 acceptor and 156,825 junction bonds that showed at least 100 supporting reads across the full dataset (Figure 2A right, Materials and methods, Appendix). Overall, TT-seq read counts agreed with the expected counts of our kinetic model (Figure 2A central column, Figure 2—figure supplement 1A). The synthesis rates for donors and acceptors, and the product half-life inferred from distinct splice junctions (Materials and methods) agreed well, demonstrating the robustness of our approach (Spearman rank correlation >0.33 for synthesis time downstream of the first exon, p<2 × 10−16 and Spearman rank correlation >0.76 for half-life, p<2 × 10−16, variation of 180% fold for synthesis rate, and 32% fold for half-life, Materials and methods, Figure 2—figure supplement 1B). Variations were larger for synthesis rates because these estimates are in a first approximation proportional to the coverage in the short-labeled TT-seq samples and are therefore more sensitive to sequencing biases. In contrast, half-lives, which are in a first approximation proportional to the ratio of coverages in short-labeled TT-seq samples and in RNA-seq, better control for sequencing biases.

We furthermore conducted extensive simulations to assess the performance and limitations of the fitting procedure to estimate the rates when the ground truth is known. We simulated counts based on the estimated distributions of synthesis rates, splicing half-times and half-life times based on the experimental data. Based on simulated data, our method leads to unbiased estimates of ground truth synthesis rates (Appendix 1—figure 12), splicing half-time and half-life time (Appendix 1—figure 13) with high precision compared to the dynamic range. Also, we used simulations to explore how estimation accuracy is affected when using data with much lower read coverage or for extremely slow or fast rates. These simulations showed that lowering the total read coverage cut-off below 100 reads would lead to relative errors typically surpassing 100% (median, Appendix 1—figure 23). These simulations also showed that estimations of half-lives shorter or much longer than our labeling durations (shorter than 1 min or longer than 3 days) would lead to median error surpassing 100% (Appendix 1—figure 24).

First-order kinetic models are simple models that grossly model the underlying biochemical processes. We also investigated two alternative models that potentially capture more complex kinetics. The first one is a delay differential equation model for donor bond kinetics that modeled the time to transcribe the intron up to the branchpoint with a delay, followed by first-order kinetics for the first transesterification step (Appendix). Simulations indicated that identifying the parameters of this delay differential equation model is difficult (Appendix 1—figures 22022) because the data do not support distinguishing the contribution of transcription from the one of the first transesterification step. However, fitting a first order kinetic model on data simulated according to the delay differential equation model showed that the estimated donor bond half-life approximately equated the sum of the intronic transcription delay and the half-time of the first transesterification step (Appendix 1—figure 5, yet usually underestimating with a median relative level of 0.89). The second alternative model is a coupled differential equation model for the junction bonds that modeled junction formation as the outcome of a first-order kinetics splicing process rather than as a constant. Simulations showed that the data did not allow to easily distinguish this coupled kinetics from first-order kinetics (Appendix 1—figure 3). Moreover, the junction bond half-life estimated by the first order kinetics model approximately equated the sum of the splicing half-time and of the half-life of the processed RNA (Appendix 1—figure 7, yet usually overestimating with a median relative level of 1.2). Unless specifically mentioned, we used the first order kinetics model in the remaining analyses because of its robustness and its approximate equivalence with alternative models (Appendix).

Splicing times are in the range of minutes

Based on GENCODE annotation, the analyzed bonds mapped to 8,770 mRNAs, 162 RNAs antisense to protein-coding genes (asRNA), 204 long intergenic non-coding RNAs (lincRNA), and 290 other non-coding RNAs (Figure 2—figure supplement 1C). When averaged within major isoforms (Materials and methods), synthesis rates and half-lives of donors and acceptors ranged over two orders of magnitude (42-fold and 48-fold change, 90% equi-tailed range), whereas the junction bond half-life spanned only slightly over one order of magnitude (8.1-fold 90% equi-tailed range, Figure 2B–D). These major isoform aggregated rates generally agreed with previously reported splicing rates and RNA half-lives (Figure 2—figure supplement 1D; Mukherjee et al., 2017). Moreover, mRNAs were spliced significantly faster (median of 7.2 min) than lincRNAs (median of 11 min, p<2 × 10−16) and other non-coding RNAs (Figure 2C). Also, mRNAs had junction bonds with the longest half-lives (median of 316 min) (Figure 2D), consistent with previous studies (Mukherjee et al., 2017; Schlackow et al., 2017; Schwalb et al., 2016). Similar conclusions can be reached using site-specific rates (Figure 2—figure supplement 1E–J). The obtained apparent splicing times in the range of minutes agree with many previous estimates that were obtained using different methods, but argue against fast splicing, within less than a minute, that was suggested by some studies (Carmo-Fonseca and Kirchhausen, 2014).

Intron length constrains splicing times

Intron length has been suggested to affect splicing kinetics (Hicks et al., 2010; Khodor et al., 2011; Pai et al., 2017; Proudfoot, 2003; Windhager et al., 2012), and we therefore investigated this further. First, our de novo annotation of introns is in agreement with a minimal intron length of about 80 nt (Figure 3A), as expected from the spatial needs within the spliceosome (Ruskin et al., 1985; Wieringa et al., 1984). Second, we find that among introns shorter than 2,000 nt, acceptor and donor bond half-life showed similar distributions and decreased with increasing intron length (Figure 3B). The reasons for why short introns are spliced more slowly than long ones remain to be investigated. It is possible that for longer introns the splice site definition by the following exon facilitates splicing and that there are less restraints for splicing for longer introns. This observation also strongly argues for pre-ordering of the spliceosome on the transcribing polymerase.

Figure 3 with 1 supplement see all
Intron length influences splicing kinetics.

(A) Number of introns against intron length. Dashed line represents the minimum estimated intron length. (B) Distribution of donor and acceptor bond half-life for introns split by intron length septiles. Black bars represent the median values for each group. Lower and upper boxes are the first and third quartile, respectively. (C) Donor bond half-life against intron length for all observed introns, minimum estimated intron length (black vertical line), and theoretical elongation boundary for a intronic polymerase elongation rate of 4 kb min−1 (black lines) and 95% confidence interval estimates (dashed lines, Materials and method). Colors encode data point density. (D) Distribution of half-life of donor (red) and acceptor (blue) bonds for first and last introns in major isoforms compared to other introns. (A–D based on Supplementary file 2).

https://doi.org/10.7554/eLife.45056.007

Our analysis also reveals that donor and acceptor bond half-lives differ for long introns. Among introns longer than 2,000 nt, acceptor bond half-life plateaued at a median value of about 4 min, whereas donor bond half-life increased with intron length up to a median value of about 8 min for introns larger than 7,700 nt (last septile). A possible explanation for this significant difference is that donor sites of long introns are transcribed much earlier than acceptor sites and splicing can only start when the intron is transcribed. Indeed, the donor bond half-life is determined by the elongation time needed to transcribe at least the branchpoint and by the first transesterification step, whereas the acceptor bond half-life is determined by both the time for the first transesterification step and for the second transesterification step to be completed. Assuming a maximum polymerase elongation velocity of 4 kb/min (Fuchs et al., 2014; Gressel et al., 2017; Jonkers and Lis, 2015; Saponaro et al., 2014; Veloso et al., 2014), we observed very few introns violating this predicted limit (Figure 3C). This limit for donor bond half-life affects only a small proportion of all introns (last septile) so that, overall, there is no positive correlation between donor bond half-life and intron length (Figure 3C). For shorter introns, the donor bond half-life and the acceptor bond half-life were similar (Figure 3B), indicating that the second transesterification step is fast compared to the overall splicing kinetics.

Another prediction of this model is that for slowly transcribed introns the donors should take longer to cleave than the acceptors. Consistent with this hypothesis, the median half-life was 2.5-fold (p<2×10−16) longer for donor bonds than for acceptor bonds of first introns (Figure 3D), which are known to be more slowly transcribed (Danko et al., 2013; Fuchs et al., 2014; Jonkers and Lis, 2015; Saponaro et al., 2014; Veloso et al., 2014). A small significant difference was also found for the last intron (1.4-fold, p<2×10−16), which could reflect slower polymerases near the transcript end or different kinetics of splicing of the last intron (Davis-Turak et al., 2015; Rigo and Martinson, 2008). In conclusion, these data show that donor half-life and thus the beginning of splicing is limited by transcription elongation for long introns. Taken together, our results are generally consistent with the co-transcriptional nature of splicing and reveal that the length of the intron influences splicing kinetics in at least two different ways.

Several snRNA interactions are related to donor cleavage kinetics

Whereas overall trends in splicing kinetics can be explained by global features such as intron length and polymerase elongation velocity, the kinetics of splicing also critically depend on the RNA sequence context around the donor, acceptor, and branchpoint. To gain insights into the sequence determinants for splicing, we built a linear model (Materials and methods) that allowed us to estimate changes in donor bond half-life as a function of single nucleotide changes relative to the consensus sequence. The single nucleotide model could explain 19% of the observed variance in log-transformed donor bond half-life and achieved a median relative error for individual sites of 150%, which is small compared to the dynamic range across sites spanning two orders of magnitude (Figure 4A). This analysis showed that nucleotide deviations from the consensus splice-site increase donor bond half-life. These findings are consistent with evolutionary pressure for donor sequences optimized for fast splicing.

Figure 4 with 1 supplement see all
Sequence and structural contributions to donor bond half-life (related to the first catalytic step).

(A) Measured donor bond half-life against single nucleotide model prediction (median relative error of 150%, Spearman's rank correlation ρ = 0.45, p<10−16). (B) Nucleotide frequency (middle track) and prediction of the relative effect on donor bond half-life (upper track) for single nucleotide deviation from consensus sequence around donor, acceptor splice-site and branchpoint. Grey color marks positions predicted to have no effect (Materials and methods). RNA map of base-pairing interactions between four snRNAs (U1, U2, U5, U6) and precursor RNA sequences used in three published spliceosome structures of U1 snRNP binding to 5´splice-site, B and C* complex (bottom track) (exon (blue), intron (red), U1 snRNA (light blue), U2 snRNA (green), U5 snRNA (orange), U6 snRNA (yellow)). Canonical and non-canonical base-pairing interactions are depicted by black solid lines and black dots, respectively. (B based on Supplementary file 4). (C) Structure of U1 snRNA interactions with precursor RNA 5´splice-site (Kondo et al., 2015). (D) Structure of U5 and U6 snRNA interactions with precursor RNA in spliceosome B complex (Bertram et al., 2017a).

https://doi.org/10.7554/eLife.45056.009

In order to elucidate the contribution a single nucleotide change might have on interactions within the spliceosome, we compared our predicted single nucleotide effects with base interactions observed in three different spliceosome structures (Figure 4B, bottom). Recognition of the donor site by U1 snRNP plays a crucial role during early spliceosome assembly. RNA-RNA interaction between precursor RNA and U1 snRNA are mainly stabilized through Watson-Crick interactions (Figure 4C; Kondo et al., 2015). In our model, substitution of the highly frequent G at +1 or −1 from the donor site with a C resulted in an increase in bond half-life (Figure 4B), likely because C cannot form a stable interaction with a C in the U1 snRNA, in agreement with previous in vitro studies (Kondo et al., 2015). In contrast, at position +3 from the donor site, a change from A to G has only a minor effect on the bond half-life (Figure 4B), likely because G can still form a non-canonical base pair with U in the U1 snRNA (Kondo et al., 2015). These results suggest that interactions of the precursor RNA donor region with U1 snRNA contribute to the observed donor bond half-lives.

After donor site recognition by U1 snRNP, the Prp28 RNA helicase mediates the exchange of U1 with U6 snRNP and the U4/U5/U6 tri-snRNP can stably bind the precursor RNA. In the resulting B complex, the U5 stem loop interacts with the three terminal nucleotides of the 5´-exon, whereas the U6 ACAGA helix is formed near the donor site (Figure 4D; Bertram et al., 2017a). Our results suggest that U5 interactions may contribute to the kinetics of donor cleavage. For example, an A in the position −3 relative to the donor site leads to faster donor bond half-life supposedly because this enables base-pairing with U5 snRNA in the extended precursor RNA-U5 snRNA duplex.

Completion of step-one results in the C complex that is then converted to the activated C* complex, which contains the two exon ends in close proximity for the step two reaction. RNA duplexes are formed between the intron region close to the donor site and U6 snRNA and between the branch site region and U2 snRNA (Figure 5A) (Zhang et al., 2017). We also found that interactions in the C* complex were predictive of donor bond half-life (Figure 4D). In particular, changes in the branchpoint adenine and at all positions −4 to +3 of the branchpoint show kinetic effects, except for the positions −1 and +2 that are predicted to not contribute to donor bond half-life (grey highlighting in Figure 4D). In agreement with the structural data, these nucleotides are also the only two nucleotides in the vicinity of the branchpoint that do not interact with U2 in the C* complex (Zhang et al., 2017). When compared to each other, the precursor RNA nucleotides interacting with snRNAs during 5’ splice site recognition showed the strongest effects on donor bond half-life, followed by nucleotides interacting in the B complex, and to a lesser extent the nucleotides interacting in the C* complex (Figure 4—figure supplement 1A). Positions with no predicted contact in these structures showed least effects (Figure 4—figure supplement 1A). These observations support our kinetic modeling, but also argue that the structurally characterized spliceosomal complexes represent functional states.

Sequence and structural contributions to acceptor bond half-life (related to the first and second catalytic steps).

(A) Structure of U2, U5 and U6 snRNAs interactions with precursor RNA in C* spliceosome complex (Zhang et al., 2017). (B) Measured acceptor bond half-life against single nucleotide model prediction (median relative error of 150%, Spearman's rank correlation ρ = 0.45, p<10−16). (C) Nucleotide frequency (middle track) and prediction of the relative effect on acceptor bond half-life (upper track) for single nucleotide deviation from consensus sequence in sequence around donor, acceptor splice-site and branchpoint. Grey color marks positions predicted to have no effect (Materials and methods). RNA map of base-pairing interactions between two snRNAs (U2 and U6) and precursor RNA sequences used in the published spliceosome structures of C* complex (intron (red), U2 snRNA (green), U6 snRNA (yellow)). Canonical and non-canonical base-pairing interactions are depicted by black solid lines and black dots, respectively. (D) Comparison of the effects of single nucleotide changes on donor bond half-life against effects on acceptor bond half-life. Donor (blue), acceptor (red), branchpoint (green) single nucleotide effects are shown. Positions one nt before the donor site and one nt after the acceptor site are encircled in grey. (C, D based on Supplementary file 4).

https://doi.org/10.7554/eLife.45056.011

Taken together, variation in the in vivo kinetics for the donor cleavage can in part be rationalized with early interactions of precursor RNA with U1 snRNA, and with later U5 snRNA interactions observed in structures of the B and Bact complexes. Moreover, the stability of the C* complex appears to also affect donor bond half-life, possibly because it prevents the reverse reaction of donor site cleavage (Tseng and Cheng, 2008). Since several precursor RNA positions are involved in different types of interactions in different splicing intermediates, the observed overall kinetics of donor cleavage reflect a combination of distinct microscopic rates, which cannot be distinguished by our in vivo approach. Furthermore, not all observed effects of nucleotide changes could be explained with available structures. For example, the first nucleotide of the downstream exon (acceptor +1 position) was important for donor cleavage kinetics. Although it remains unclear why, this could be related with co-transcriptional recruitment and recycling of splicing factors, maybe favored by Pol II 3´splice-site pausing, similar to that suggested in Aitken et al. (2011).

Sequence and structural contributions to acceptor bond half-life

We also built a regression model predicting log-transformed acceptor bond half-life from sequence (Figure 5B, 20% of variance explained, median relative error of 150%). Single nucleotide changes around the acceptor site generally had larger effects on acceptor bond half-life, reflecting effects on step two kinetics, whereas changes around the donor site had greater effects on donor bond half-life, reflecting effects on step one kinetics (Figure 5C). The post-catalytic complex (P complex), which is specific to step two splicing reaction, is not yet structurally characterized in human. Nevertheless, nucleotide changes that influence base pair interactions reported for the P-complex in S. cerevisiae (Bai et al., 2017) showed stronger effects on the acceptor bond half-life than for the donor bond half-life (Figure 4—figure supplement 1A). Nucleotide positions in the precursor RNA that are not involved in base pair interaction with snRNAs in B-type and C* complex structures were irrelevant for predicting acceptor bond half-life (grey highlighting in Figures 4B and 5C, feature selection, Materials and methods).

Most nucleotides showed similar effects in the donor and acceptor bond half-life models but some noticeable differences were observed between them (Figure 5D). Our results indicate that a non-canonical G branchpoint does not affect acceptor bond half-life but increases donor bond half-life. We also observed that the predominant G at the donor −1 nucleotide leads to fast donor cleavage kinetics but to slow acceptor cleavage kinetics, maybe because this interferes with positioning of the neighboring +1 donor nucleotide that serves as a nucleophile during step two. Despite this disadvantage in acceptor cleavage kinetics, the donor −1 position is predominantly G, presumably because this improves donor site recognition by base pairing with a C in U1 snRNA as described above. Taken together, available structural information on the spliceosome help to rationalize some of the effects of base changes around splice sites. Even though the contributions of several mechanistic processes to the observed kinetics cannot be disentangled, our results reveal which nucleotide positions around splice sites are critical for fast splicing kinetics.

Regulatory precursor RNA motifs contribute to splicing kinetics

Splicing is modulated by auxiliary factors, including serine/arginine-rich proteins and hnRNPs (heterogeneous nuclear ribonucleoproteins), that bind to regulatory motifs around the splice sites (Fu and Ares, 2014; Matlin et al., 2005; Wang and Burge, 2008). We therefore aimed to identify putative regulatory motifs and to quantify their contribution to splicing kinetics. We derived two extended models for donor and acceptor bond half-life including the single nucleotide effects in the core regions investigated so far and all 65,536 possible RNA octamers in four extended intronic regions, 100 nt downstream of the donor site, 100 nt upstream of the acceptor site, and 100 nt upstream of the branchpoint and the region between branchpoint and acceptor site (Figure 6A, Materials and methods). We also included intron length and GC content, but did not include exonic regions, because this would require isoform annotations. Because of the very large number of octamers, we used a feature selection method (Lasso regression), which yielded 551 octamers jointly predicting donor bond half-life and 2,319 octamers jointly predicting acceptor bond half-life (Materials and methods).

Figure 6 with 1 supplement see all
Precursor RNA sequences contributions to splicing kinetics.

(A) Schematic representation of the regions used for the octamer search (BP = branchpoint, polyY = polyY track). In case of short intron, the regions 5ʹ downstream, 3ʹ upstream, and BP upstream were cropped to not extend into the exonic regions. Poly-Y track was defined as the region from the branchpoint up to the acceptor site. (B–C) Proportion of variance explained by the joint model (purple) for log-transformed donor (B) or acceptor (C) bond half-lives, as well as proportion of variance explained by individual features (orange) and drop of the proportion of variance explained when individual features are removed from the joint model (green). 5´ds = 5´downstream, 3´us = 3´upstream. (D) Effect on donor bond half-life of octamers matching to RNA-binding proteins (RBP, rows) motifs identified using the ATtRACT database. Each column represents one octamer; the color depicts strength and direction of the effect. (E) Distribution of the phylogenetic conservation score (PhastCons 100-way) of random octamers (green), significant octamers matching (red) and not matching (blue) the ATtRACT database of RNA-binding motifs, estimated by region (column) and model (row). Black bars represent the median values for each group. Lower and upper boxes are the first and third quartile, respectively. Stars above boxes depict pairwise significance levels by Wilcoxon signed rank test. (D based on Supplementary file 5).

https://doi.org/10.7554/eLife.45056.012

Compared to the single nucleotide models, the extended models substantially increased the proportion of variance explained from 19% to 26% for the log-transformed donor bond half-life and from 20% to 29% for the log-transformed acceptor bond half-life (Figure 6B and C). These proportions of variance increased when we restricted the analysis to junctions of major isoforms, showing that the results are not over-estimated due to double counting of donors and acceptors belonging to multiple exon junctions (donor bond half-life model by 1.3% and the acceptor bond half-life model by 1.4%). Cumulatively, the proportions of variance explained of these non-overlapping regions largely exceed the proportion of variance explained by the joint model, indicating widespread co-occurrence of splicing-regulatory sequences across introns.

The improved prediction of the extended models over the single nucleotide models is mostly attributable to the octamers of the extended intronic regions. The largest number of predictive octamers identified by the donor bond half-life model was found in the 5’ donor site region (Figure 6—figure supplement 1A). This set of octamers was the most predictive feature for donor bond half-life individually (19% of the variance) and the feature with the largest impact on variance when dropped from the joint model (3% of the variance). In the acceptor bond half-life model, the regions flanking the branchpoint and acceptor site contained most predictive octamers (Figure 6—figure supplement 1A) and associated with largest proportion of variance explained (Figure 6C). Moreover, the effects of octamers on bond half-life were of similar order of magnitude than the effects of single nucleotides in donor, acceptor and branchpoint sites for both categories (Figure 6—figure supplement 1B, median effect for octamer 1.4%, median effect for nucleotide 5.4%). The drop of proportion of variance explained when a feature was removed from the joint models were small (between 0.0 and 3.4, Figure 6B and C) indicating of substantial correlation between the features. These correlations could be technical in the case of overlapping regions, or the result of co-evolution. Altogether, these results show that the octamers in the extended intronic regions contribute to splicing kinetics.

To identify putative regulatory factors that could bind to the predicted RNA octamers, we scored the octamers binding affinities to the 159 human RNA-binding proteins of the ATtRACT database (Giudice et al., 2016). We found 258 octamers predicted by the donor bond half-life model (47% versus 42% of non-selected octamers, p=0.017, Fisher test) associating with 69 RNA-binding proteins and 1,039 octamers identified by the acceptor bond half-life model (45% versus 42% of non-selected octamers, p=0.007, Fisher test) associating with 99 RNA-binding proteins motifs (Figure 6D, Figure 6—figure supplement 1C, relative position weight matrix score >0.9 and selecting for the 5% highest absolute scores, Materials and methods). Our results suggest that several serine/arginine-rich and hnRNP proteins (Supplementary file 5) regulate donor and acceptor bond half-life in both positive and negative fashions, depending on the location of their binding site. Octamers associated with the binding site of the polypyrimidine tract-binding protein Ptpb1 are predictive of short donor bond half-life when present between branchpoint and acceptor site but they prolong the donor bond half-life when located near the donor site (Figure 6D). The remaining octamers may reflect cis-regulatory elements bound by splicing factors that remain to be characterized. To address the evolutionary conservation of the identified octamers, we aligned them to conserved sequences across 99 mammalian and other vertebrate genomes. Except for octamers predicted to affect donor bond half-life in the region 100 nt downstream of the donor site, the remaining ones show significantly higher phylogenetic conservation compared to a negative control of random octamers (Figure 6E), providing evidence of their biological significance.

Splicing yield differs between RNA classes

Cleavage of the phosphodiester bonds at the donor and acceptor sites can lead to ligation of the two exon ends, thus completing splicing. However, cleavage of these bonds may also be non-productive in the sense that exon ligation can fail and RNA may be degraded after cleavage. To account for this, we defined the ‘splicing yield’ as the proportion of precursor RNA successfully converted into spliced RNA (Figure 7A). A splicing yield of 1 means that all precursor RNAs that are synthesized are also successfully spliced, whereas a splicing yield less than one means that only a fraction of the precursor RNA is converted to spliced product. We estimated the splicing yield using the junction bonds modeled with the coupled model and the acceptor bonds modeled with first-order kinetics, because alternative kinetic models or using the donor bonds led to systematic biases (Appendix). Hence, we did not computationally constrain our splicing yield estimates to be bounded by 1. Due to estimation errors of the synthesis rates, yields sometimes turn out to be greater than 1.

Figure 7 with 1 supplement see all
Splicing yield differs between RNA classes.

(A) Definition of synthesis rate at individual phosphodiester bonds enables the estimation of splicing yield for acceptor splice sites (ηacceptor). (B) Distribution of splicing yield of acceptor splice sites for introns of mRNAs (n=64,555), lincRNAs (n=398) antisense RNAs (asRNA, n=265), and other ncRNAs (n=876) of major transcripts (Materials and methods) for GENCODE genes. Black bars represent the median values for each group. Lower and upper boxes are the first and third quartile, respectively. (C–D) Distributions of the observed acceptor spicing yield (C) and donor and acceptor bond half-life (D) of splice-sites split by major (U2-type) and minor (U12-type) spliceosome. (E). Single nucleotide effects for branchpoint and acceptor splice site on ηacceptor. Color depicts effect on yield relative to consensus nucleotide (Materials and methods). Relative frequency of the nucleotides is shown for all modeled introns. (B–E based on Supplementary file 2).

https://doi.org/10.7554/eLife.45056.014

We found that the splicing yield across sites was much higher for mRNAs (median = 1.2, Figure 7B) than for antisense RNAs (median = 0.2), lincRNAs (median = 0.3), and other non-coding RNAs (median = 0.5), suggesting that degradation pathways are competing with splicing more intensively for non-coding RNAs than for coding RNAs. Moreover, the higher yield of mRNAs compared to lincRNAs also held when stratifying by cumulative read coverage across all samples and by half-life (Appendix 1—figures 26 and 27), two possible confounders associating with synthesis rate estimation biases in simulations (Appendix). Furthermore, splicing yield was the same for the 139,344 (99.9%) introns harboring the canonical terminal dinucleotides GU and AG recognized by the major spliceosome (U2-type) than for the 182 (0.1%) introns harboring the terminal dinucleotides AU and AC recognized by the minor spliceosome (U12-type) (Figure 7C). Although introns targeted by the minor spliceosome showed two-fold slower donor and acceptor bond half-lives compared to those targeted by the major spliceosome (Figure 7D, Figure 7—figure supplement 1B), the minor spliceosome nonetheless reached similar splicing yields.

To analyze how the nature of the splice sites contributes to splicing yield, we built a model that allow us to predict splicing yield based on sequence (Figure 7E). Similar to the effects on bond half-lives (Figure 4B, Figure 5C), deviations from the consensus sequence led to lower splicing yield. Furthermore, nucleotides near a splice site showed stronger effects than more distant ones, suggesting that the early recognition of donor and acceptor splice sites is a determinant for splicing yield. Taken together, these results indicate that rate and yield are distinct aspects of splicing that may have evolved independently and that the sequence around splice sites determines both the rate and the yield of splicing.

Discussion

RNA splicing is an essential step of eukaryotic gene expression, but the in vivo kinetics of this two-step process and its dependence on transcription remain poorly understood. Here, we have coupled a metabolic RNA labeling time series to TT-seq analysis of new and total RNA to investigate RNA metabolism in human cells. We have then used kinetic modeling based on a definition of RNA metabolic rates at the level of individual phosphodiester bonds to provide rate estimations for cleavage of phosphodiester bonds at donor and acceptor splice sites in human introns. The obtained splice site cleavage rates, expressed as donor and accept bond half-lives, are free of ambiguities introduced by other methods and are related to the independent contributions of the two splicing steps in vivo.

The donor and acceptor bond half-lives were found to be generally in the range of minutes, although we cannot exclude that we are missing a small population of quickly spliced introns. The donor and acceptor bond half-lives were found to depend on intron length, on the nucleotide sequence surrounding splicing sites, including the branchpoint, and on flanking octamer sequences that may bind regulatory factors. This is consistent with a complex relationship between the splicing machinery and its nuclear environment, in which splicing rates can be influenced not only by RNA sequence but also by gene structure and chromatin landscape (Davis-Turak et al., 2018; Davis-Turak et al., 2015). In addition, we define the yield of successful splicing and show that it differs dramatically between different RNA classes.

Our results also provided insights into the nature and evolution of co-transcriptional splicing. Previous studies of splicing kinetics in mouse (Rabani et al., 2014) and fission yeast S. pombe (Eser et al., 2016) all found that splicing is faster for shorter introns. A recently published study in Drosophila (Pai et al., 2017) demonstrated that splicing is faster for intron-defined, short introns (60–70 nt), whereas for exon-defined introns, splicing was faster for introns longer than 2,944 nt, suggesting a complex relationship between splicing kinetics and intron length. We find here that exon-defined splicing in human cells is fastest for introns with a length of around 2,000 nt, whereas short introns (<140 nt) on average take about twice as long to be spliced. High spatial and temporal resolution kinetics data coupled with focused analyses of these very short introns remain to be performed for other species to understand how universal these observations are.

Another observation we made was that longer human introns (>10,000 nt) show increased donor bond half-life, apparently because donor cleavage requires RNA polymerase II to first transcribe the intron. This effect of transcription-limited splicing is observed at ~10% of human precursor RNA introns, whereas most human introns are short enough so that their splicing kinetics are not limited by transcription. How the polymerase elongation rate depends on the nature of the intron and how this influences splicing remains to be investigated. The observed relationship between transcription and splicing of coding RNAs extends to non-coding RNAs. We found that spliced coding RNAs and spliced non-coding RNAs showed similar RNA synthesis rates (Figure 2C), whereas previous studies reported considerably slower synthesis rates for mainly unspliced, non-coding RNAs (Mukherjee et al., 2017).

Our definition of splice site-specific RNA cleavage rates also allowed for a comparison of kinetic information in vivo with detailed structural knowledge of the spliceosome in different functional states obtained in vitro. Structure-based interpretation of our nucleotide-resolution kinetic results highlights the importance of interactions between snRNAs in the spliceosome and the precursor RNA substrate and provides guidance for interpreting interactions in structures of the spliceosome yet to be obtained. Our results suggest that the predicted interactions between the precursor RNA and snRNAs in the P complex, which have presently only been obtained in yeast (Bai et al., 2017; Wilkinson et al., 2017) may be similar in human. We show that different types of RNA-RNA interactions observed at various stages of the process are related to the splice site cleavage time. In particular, RNA-RNA interactions must be of high enough affinity to allow for sufficiently specific recognition of splice sites, yet the affinities must be in a range that also allows for rapid conversion between subsequent states, which can show strongly altered RNA-RNA interactions. However, many processes and factors contribute to the observed apparent splicing rates, and these must be disentangled in the future.

Taken together, we have analyzed the metabolism of individual donor and acceptor splice sites in vivo and provided quantitative models for how RNA splicing kinetics may be encoded in the human genome. As we looked at a single growth condition, our models are derived from comparisons across genes and essentially reflect the affinity of precursor RNAs to the core splicing machinery. However, the experimental and computational methodology presented here could be applied to different cell types or under dynamic responses to reveal and quantify the role of splicing regulatory factors and of their related binding sites. Another interesting future direction is the modeling of alternative splicing, which is understood to be the outcome of competitions of alternative donor or acceptor sites with various strengths. Our distinct models of donor and acceptor site kinetics may help to build up such quantitative competition models. Eventually, quantifying the contribution of individual bases to splicing rates, backed by structural and functional studies, may explain the numerous contributions of splicing to the genetics of rare (López-Bigas et al., 2005) and common (Li et al., 2016) diseases.

Materials and methods

Key resources table
Reagent type
(species) or resource
DesignationSource or referenceIdentifiersAdditional
information
Cell line (Homo sapiens; female)K562 chronic myeloid leukemia in blast crisisDSMZDSMZ Cat# ACC-10, RRID:CVCL_0004
Commercial assay or kitPlasmo Test Mycoplasma Detection KitInvivoGen, San Diego, CA USArep-pt1
Commercial assay or kitOvation Universal RNA-Seq SystemNuGEN, Leek, The Netherland0343–32
Chemical compound, drug4-thiouracilCarbosynth, UKNT06186CAS 13957-31-8
SoftwareSTARhttps://github.com/alexdobin/STARRRID:SCR_015899
SoftwarePicardhttp://broadinstitute.github.io/picard/RRID:SCR_006525
SoftwareSalmonhttps://combine-lab.github.io/salmon/RRID:SCR_017036
Softwareglmnethttps://cran.r-project.org/web/packages/glmnet/index.htmlRRID:SCR_015505
SoftwarePyMOLhttps://pymol.org/2/RRID:SCR_000305
SoftwareLaBranchoRhttps://kipoi.org/models/labranchor
SoftwareCleTimerhttps://kipoi.org/models/CleTimer
SoftwarerCubehttps://github.com/gagneurlab/rCubeLast commit number: 463119

Cell culture

Request a detailed protocol

K562 cells were obtained from DSMZ (DSMZ no.: ACC-10) and grown in RPMI 1640 medium (Thermo Fisher Scientific, 31870–074) supplemented with 10% heat-inactivated fetal bovine serum (Thermo Fisher Scientific, 10500–064) and 2 mM GlutaMAX (Thermo Fisher Scientific, 35050087) at 37°C and 5% CO2. Cells were routinely verified to be free of mycoplasma contamination using Plasmo Test Mycoplasma Detection Kit (InvivoGen, rep-pt1). K562 cells were authenticated at the DSMZ Identification Service according to standards for STR profiling (ASN-0002).

TT-seq time series

Request a detailed protocol

TT-seq was performed as described (Schwalb et al., 2016), with minor modifications. Specifically, 2.5 × 107 cells from two biological replicates were used for each time point. Cells were exposed to 500 µM of 4-thiouracil (4sU, Carbosynth, NT06186) for 2, 5, 10, 15, 20, 30, 60 min at 37°C and 5% CO2. Cells were harvested by centrifugation at 600 g for 2 min at 37°C. Cell pellets were lysed in 5 mL of QIAzol (Qiagen) and 150 ng of RNA spike-ins mix were added to each sample. RNA spike-ins were produced in house, based on ERCC-RNA sequences (sequences of spike-ins are described in Supplementary file 6). RNA spike-ins were produced as described (Schwalb et al., 2016). RNAs were extracted using QIAzol according to the manufacturer’s instructions. RNAs were sonicated to obtain fragments of <6 kbp using AFAmicro tubes in a S220 Focused-ultrasonicator (Covaris Inc, parameters: 10 s, peak power 100, cycles 200, duty cycle 1%). The quality of RNAs and the size of fragmented RNAs were checked using Fragment Analyzer. 1 μg of each of the sonicated RNAs was stored at −80°C as total RNA (RNA-seq) and later eluted with miRNAeasy Micro Kit (Qiagen, 217084) together with 4sU-labeled purified RNAs.

4sU-labeled RNAs were purified from 300 µg of each of the fragmented RNAs. Biotinylation and purification of 4sU-labeled RNAs was performed as described (Dölken et al., 2008; Schwalb et al., 2016). Biotinylated 4sU-labeled RNAs were separated from unlabeled RNAs with streptavidin beads (Miltenyi Biotec, Bergisch Gladbach, Germany) and eluted in 100 mM DTT as described in Dölken et al. (2008) and Schwalb et al. (2016). 0.3M sodium acetate was added to 4sU-labeled purified RNAs and to total RNAs prior RNA extraction. RNAs were extracted and eluted using miRNAeasy Micro Kit (Qiagen, 217084). The on-column DNAse I treatment (Qiagen, 79254) was performed for 15 min at 25°C. Prior to library preparation, total RNAs and 4sU-labeled purified RNAs were quantified using Qubit. Enrichment of 4sU-labeled versus unlabeled RNAs was analyzed by RT-qPCR using oligonucleotides amplifying selected regions of 4sU-labeled and unlabeled spike-ins (sequences of oligonucleotides are described in Supplementary file 6). Only 4sU-labeled purified samples showing ΔΔCt changes from 4 to 6 were subjected to library preparation (total RNAs were used as a control for normalization). 100 ng of input RNA was used for strand-specific library preparation according to the Ovation Universal RNA-seq System (NuGEN). Libraries were prepared using random hexamer priming only. The size-selected libraries were analyzed on a Fragment Analyzer before sequencing on the Illumina HiSeq 4000.

Read alignment and counting

Request a detailed protocol

Paired-end 150 bp reads with additional 6 bp of barcodes were obtained for each sample. Reads were aligned using STAR version 2.5.0a (Dobin et al., 2013) in single pass mode. The genome Index was built against the full GENCODE version 24 annotation and the hg38 (GRCh38) genome assembly (Human Genome Reference Consortium) using 150 bp overhang size. Additional specified parameters were alignSJDBoverhangMin 2, chimSegmentMin 15, chimScoreMin 15, chimScoreSeparation 10, and chimJunctionOverhangMin 15. The aligned reads were filtered for duplicates using Picard tools version 2.5.0 (https://broadinstitute.github.io/picard/) using the option MarkDuplicates REMOVE_DUPLICATES = true. In average, each TT-seq sample yielded about 250 M reads and each RNA-seq sample about 55 M reads. For each sample, ~90% of the reads could be uniquely mapped to the reference genome. The duplication ratio was estimated to 55% by FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

Using the rCube package (https://github.com/gagneurlab/rCube), all split reads (containing N stretches in Cigar string) were extracted to create a database of potential introns (~341 k). The obtained introns were classified relative to annotated introns and genetic elements from the GENCODE annotation (version 24 obtained from https://www.gencodegenes.org/releases/24.html). For each intron three characteristic counts were calculated: The numbers of reads starting in the upstream exon and extending into the intron (‘donor’), the number of reads starting in the intron and extending into the downstream exon (‘acceptor’), and all split reads matching the introns coordinates (‘junction’). The reads were filtered using a bam quality score of 255. Reads having secondary alignment flag were discarded.

Estimation of sample normalization factors and cross-contamination

Request a detailed protocol

To estimate the sample normalization factors Fj that account for variations in sequencing depth as well as the overall newly synthesized RNA fraction and the fraction of cross-contamination χj of non-labeled reads in the TT-seq data, we modeled the expectation of counts Eij of spike-in i in sample j using a statistical model similar to the one of Schwalb et al. (2016).

(1) Eij=Fjpij(χj-δiχj+δi)

χj is set to 1 for all RNA-seq samples, δi is 0 for labeled spike-ins and 1 for unlabeled spike-ins. The parameter pij is the condition and spike-in specific extraction probability. The difference with (Schwalb et al., 2016) is to allow the parameter pij to be condition-specific (TT-seq or RNA-seq), which turned out to model better cross-contamination of unlabeled RNA in the short duration TT-seq libraries. We set pij=pik for all sets of j and k belonging to either RNA-seq samples, TT-seq samples or if i belongs to a labeled spike-in. We assumed read count data to follow a negative binomial distribution with a common dispersion parameter for all data. The model parameters and the dispersion parameter were fitted as generalized linear model using maximum likelihood.

Kinetic rate modeling and estimation

Request a detailed protocol

For each detected intron i we modeled the concentrations ci,l of each of three characteristic bonds (donor, acceptor, junction) independently following a first order kinetic rate equation. Without loss of generality, we consider in the following just one of the three equations - the other two behave the same.

(2) ddtci(t)=αi-βici(t)

We assume that all newly synthesized RNAs are labeled. The concentration of labeled bonds, assuming an initial concentration of 0, follows:

(3) cit, labeled=αiβi(1-e-tβi)

Also, the old, non-labeled RNA decays exponentially as  ci(t, unlabeled)=αiβietβi. Using the normalization factor Fj of sample j, the labeling time tj and χj the cross-contamination of unlabeled RNAs in the purified fraction, the concentration can be mapped to its expected count Ei,j:

(4) Ei,j=Fjαiβi(1 +etjβi (χj  1))forTT-seq;Ei,j=FjαiβiforRNA-seq

We modeled read counts using the negative binomial distribution, a count distribution often used for RNA-seq data because it captures sampling noise and further sources of variations. The kinetic parameters αi,βi are estimated by maximizing the log likelihood l=i,j log(NBki,jEi,j(αi,βi), θ, where ki,j are the observed counts, using the BFGS numerical optimization algorithm and using the dispersion parameter obtained from the spike-ins analysis. The optimization was initialized 10 times with independent random parameters; the final solution comprises the median of all αi,βi over the different runs to compensate for numerical instabilities. We removed all donors, acceptors, junctions with too few counts (jki,j<100) from the modeling.

Using the table in Figure 2—figure supplement 1A we map the rates α,β to the characteristic kinetic parameters of donor, acceptor, and junction. The whole modeling approach was implemented in R and is available as apackage called rCube (https://github.com/gagneurlab/rCube). Because the donor and acceptor bond half-life models work on a logarithmic scale, we present model errors as multiplicative errors given by the equation median (exp(|log(yy^)|)), with y as the observation and y^ the prediction. More details about kinetic models are provided in the Appendix.

Determination of the major isoforms

Request a detailed protocol

We applied the software Salmon (Patro et al., 2017) (index kmer size = 31) to all RNA-seq samples and mapped them against the full transcriptome of the GENCODE (Ver. 24) annotation. For each gene, we selected the isoform with the maximum mean TPM value across all RNA-seq samples as the major isoform. The major isoform was only used in the analyses in Figures 2B–D, 3D and 7B. Elsewhere, analyses were only relying on individual junction annotations.

Estimation of the relative uncertainty for the kinetic parameters

Request a detailed protocol

To estimate relative uncertainty of the kinetic parameters in a conservative way we assumed that all donors and acceptors of the major isoform of a given GENCODE gene shared the same synthesis rate equal to the transcription rate of the gene. We further assumed that all products (‘junctions’) shared the same half-life equal to the mature RNA half-life. Because noise of these rate estimates is typically multiplicative, we computed the standard errors of the logarithm of these rates and reported relative uncertainties as the exponential of these standard errors.

Comparison of 4sU-seq and TT-seq

Request a detailed protocol

Alignment, counting and estimation of normalization and cross-contamination factors of the RNA-seq data sets of Schwalb et al. (2016) was done as described above for our data. Counts for 4sU-seq and TT-seq was normalized using Ki,j^=Ki,jFjχjKi,RNA-seqFRNA-seq, where i denotes the split / unsplit reads as shown if Figure 2A for each intron of the major transcripts and j is the sample (4sU- / TT-seq and replicate). Both replicates were pooled together.

Branchpoint identification

Request a detailed protocol

Due to the limited availability of experimental branchpoint measurements, the prediction algorithm LaBranchoR (Paggi and Bejerano, 2018) was utilized to predict branchpoint positions within introns. We applied the model within the kipoi framework (http://kipoi.org/) to score the last 100 nucleotides of each intron and took the nucleotide with the maximum score for being the utilized branchpoint. The results were validated using experimental data of Mercer et al. (2015) where available.

Estimation of single nucleotide effects

Request a detailed protocol

We identified nucleotide positions not predictive of donor or acceptor bond half-life and estimated the effect of the remaining single nucleotides on the donor bond half-life and on the acceptor bond half-life by regression. To this end, we modeled log-transformed half-lives of the donor bonds (and with a separate model of the acceptor bonds) as a weighted sum of each of the 20 nucleotides upstream or downstream of the donor, acceptor site and branchpoint, as well as of the GC frequency of the whole intron, the donor site, and the acceptor site. In this linear model, the reference sequence was chosen to be the consensus sequence so that the coefficients can be interpreted as the effects of substituting a consensus nucleotide to an alternative nucleotide. Lasso regression (Tibshirani, 1996) is a regularized linear regression method that can estimate some of the coefficients to be exactly 0 and that is therefore often used to select explanatory variables. We performed Lasso regression as implemented in glmnet (Friedman et al., 2010), choosing the largest shrinkage parameter at which the mean squared error (MSE) was within one standard error of the minimal MSE using 10-fold cross-validation. For donor bond half-life, as well as for acceptor bond half-life, the Lasso regression fit led to several nucleotide coefficients to be exactly 0. We then removed all the nucleotide positions where all single nucleotide effects had a coefficient equal to 0. Next, we estimated the single nucleotide effects of all remaining positions as well as the effect of GC frequency of the whole intron, the donor site, and the acceptor site on log-transformed donor and acceptor bond half-lives using ordinary least squares regression.

Structures modeling

Request a detailed protocol

Images of spliceosome structures (PDB code 4PJO, 5O9Z, 5XJC) were drawn using Pymol (https://pymol.org/).

Estimation of octamer effects and multivariate model

Request a detailed protocol

The number of occurrences of all 65,536 nucleotide octamers in the regions 15–100 nt downstream of the donor site, 100 nt upstream of the branchpoint, all nucleotides between the branchpoint and the five nt upstream of the acceptor site and 5–100 nt upstream of the acceptor site were counted allowing for two mismatches. The 15 nt immediately downstream of the donor site or 5 nt upstream of the acceptor site were excluded from the octamer search space because they were already incorporated in the single nucleotide model. Regions extending in the upstream or downstream exon were cropped to keep them within the intron. The base 2 logarithm of octamer pseudo-counts log2 (count +1) were used as covariates together with the GC frequency of the intron and the GC frequency of each region. The log-transformed donor/acceptor bond half-lives were the response variable. Lasso regression was applied to each region independently with 5-fold cross-validation to choose the optimal shrinkage parameter and select potential significant octamers. In a second step all selected octamers of each region were used together with the single nucleotide model as well as the GC frequency of the different regions, intron length and whether an intron is the first within the major transcript in a joint model to refine the selection of octamers (Lasso 10-fold cross-validation).

Octamer match to ATtRACT database

Request a detailed protocol

We compared each octamer to all reported PWMs with at least 5 nucleotides of the ATtRACT database and calculated the ratio between the probability of the best matching position (PWM-score) and the highest possible probability for any octamer (RPM-score, Cook et al., 2011). Each octamer was padded with an equal number of ‘N’s at both sides if the PWM was longer than the octamer. We ranked all matches based on their RPM-score and kept only the best 5% for each PWM and removed afterwards all matches with a RPM-score less than 0.9. The remaining matches were considered as hits.

Phylogenetic conservation of octamers

Request a detailed protocol

To calculate the phylogenetic conservation score for each octamer, we retrieved the PhastCons 100-way track (http://hgdownload.cse.ucsc.edu/goldenpath/hg38/phastCons100way/), which reports conservation across 99 vertebrates aligned to the human genome, and extracted the mean of all nucleotides for all matching positions. Octamers of the region 100 nt downstream of the donor site found to be predictive for donor site or acceptor bond half-life were also searched in the region 100 nt downstream of the donor site. Octamers of the region 100 nt upstream of the branchpoint or acceptor site or between the branchpoint and the acceptor site found to be predictive for donor or acceptor bond half-life were jointly searched in the region 100 nt upstream of the acceptor site, since these three regions were strongly overlapping. We also included as list of 2000 random octamers to estimate the background distribution in the same regions.

Calculation of splicing yield

Request a detailed protocol

We define the splicing yield of donor ηdonor and acceptor  ηacceptor as follows:

(5) ηdonor={acceptor}αjunction(donor,acceptor)αdonorηacceptor={donor}ajunction(donor,acceptor)aacceptor

where αdonor and αacceptor denote the synthesis rates of the donor site and of the acceptor site phosphodiester bond, respectively, and αjunctiondonor,acceptor denote the synthesis rate of the spliced exon-exon phosphodiester bond utilizing the specified donor and acceptor. Since the first-order kinetic model does systematically underestimate junction synthesis rates and overestimate donor synthesis rates, we switched to the alternative kinetic models to estimate these rates. However, since the first-order kinetic model is more robust and the acceptor kinetics do not include a delay we used the first-order kinetic model for the estimation of the acceptor synthesis rate. We defined the intron splicing yield η as the acceptor site splicing yield because its estimation is more robust compared to the donor site splicing yield.

Code availability

Request a detailed protocol

All the code used for counting donor site, acceptor sites, and junction reads as well as estimating the kinetic rates is available in the R package rCube (https://github.com/gagneurlab/rCubeWachutka et al., 2017). The single nucleotide model is shared in the model repository Kipoi (http://kipoi.org/models/CleTimer/Avsec et al., 2019).

Accession code

Request a detailed protocol

The sequencing data and processed files were deposited in NCBI Gene Expression Omnibus (GEO) database under accession code GSE129635.

Appendix 1

Notations, definitions, and relation to splicing quantities found in literature

1.1 Definitions of parameters used in this study

We consider RNA metabolism in unsynchronized cells at steady state. Synthesis balances out degradation at steady-state for any molecular species, independently of the kinetics. Denoting the steady state concentration c and the steady-state synthesis rate α (amount produced per unit of time) of such a molecular species of interest, we define the steady-state degradation rate constant β, such that c=αβ. We underscore that these constants are defined at steady-state without any assumption on the kinetics.

As molecular species of interest, we considered the donor (exon-intron), acceptor (intron-exon) and junction (exon-exon) phosphodiester bonds. We defined for each of them their corresponding rates (Appendix 1—table 1).

Appendix 1—table 1
Rate definitions of phophodiester bonds.
https://doi.org/10.7554/eLife.45056.024
Phosphodiester bondSteady-state synthesis rateSteady-state degradation rate
Donor

αD

βD=σD+λD

Acceptor

αA

βA=σA+λA

Junction

αJ

βJ=λJ

We assume that the donor bond can only disappear by splicing with rate σD or by precursor RNA degradation with rate λD. Similarly, we assume that the acceptor bond can only disappear by splicing with rate σA or by precursor RNA degradation with rate λA. We assume that the junction bond only disappears by mature RNA degradation with rate λJ.

We define the `splicing yield` of an acceptor bond ηA as the fraction of acceptor bonds that get spliced ηA:=σAσA+λA. For an acceptor that is spliced with a single donor, this is equivalently defined by the relation αJ=ηAαA because of flux balance at steady-state. Generally, as acceptors can splice with multiple donors, we sum up the synthesis rates of the different resulting junctions J, leading to:

(1.1) ηA:=JαJαA

The splicing yield for donor bonds ηD is analogously defined.

1.2 Relation to steady-state quantities used in literature

In this section we are relating our parametrization defined in section 1.1 with other RNA-seq based steady-state quantities typically found in the literature. Many studies used quantities that combine donor and acceptor reads. A reasonable assumption is that RNA polymerase II does not drop off during the transcription of an intron. With this assumption, steady-state synthesis rates of the donor and of the acceptor of one intron are equal to the transcription rate μ(μ:=αD=αA). Moreover, the steady-state amount of donor and acceptor unsplit RNA-seq reads of a same intron are often roughly equal, implying that the relative difference of degradation rates of the donor and acceptor bonds are small compared to their average. We define the average degradation rate of the donor and of the acceptor bond as σ:=βA+βD2 βAβD. Appendix 1—table 2 expresses several splicing quantities often used in literature using these parameters and these assumptions.

Appendix 1—table 2
Relations of different splicing quantities.

Square brackets stand for concentrations. The lariat is a spliced-out intron, known to be quickly degraded. Because lariat degradation rate may not be faster than splicing rate, we therefore consider it in the model of the concentration of intronic reads.

https://doi.org/10.7554/eLife.45056.025
Steady-state quantityValue

[exon-intron bond]

αDβD=αDσD+λDμσ

[intron-exon bond]

αAβA=αAσA+λAμσ

[exon-exon bond]

αJβJ=μηλJ

[exon]

μσ+μηλJ

[intron]μσ+[lariat]
3’SS or 5´SS intron-exon ratio intronexon(Khodor et al., 2011)

1+lariatσμ 1+σηλ

Splicing efficiency exon-exonexon-intron(Převorovský et al., 2016)

σλJη

Splicing Index exon-exonintron-exon
(Schlackow et al., 2017)

σλJη

Note that these steady-state quantities combine effects of splicing, degradation and splicing yield and are therefore not ideal to characterize splicing kinetics. For instance, splicing efficiency can be affected by mature RNA degradation rate.

1.3 Percent spliced-in

One common metric used to quantify alternative splicing is the percent splice-in (psi) (Schafer et al., 2015). Given three successive exons (1,2,3), PSI (Ψ is defined as the ratio of inclusion split reads (n1,2,n2,3) and the sum of inclusion split reads and exclusion split reads (n1,3).

(1.2) Ψ=n1,2+n2,3n1,2+n2,3+2n1,3

We assume that we only have the two isoforms, including and excluding the middle exon (2), such that n1,2=n2,3. Moreover, we assume no polymerase drop off such that all introns are transcribed at the same rate. Further assuming that each junction is spliced at its splicing rate constant σ and degraded at the mature RNA degradation rate constant λ, and that the splicing yield is 1 for all sites, we obtain:

(1.3) Ψ=σ1,2λ1,2σ1,2λ1,2+σ1,3λ1,3

The simple form of Equation 1.3 is a particular case that is based on many assumptions. Nonetheless, it shows that ψ is not only determined by the relative splicing rates but also by the relative stabilities of the isoforms. Steady-state data does not allow untangling these quantities. Moreover, it shows that splicing yield is not a quantity that is redundant with Ψ.

2 Kinetic models

2.1 Models for the donor bonds and for the acceptor bonds

2.1.1 Constant degradation rate

We used a model that assumes first order kinetics, that is with constant parameters αD and βD. Denoting D the concentration of donor bonds, we modelled:

(2.1) ddtD=αD-βD[D]=μD-σDD-λD[D]

This leads to:

(2.2) [D]=αDβD(1-e-βDt)
2.1.2 Constant splicing rate with a fixed delay

We also used a more complex model that takes into account that splicing can only be completed after the transcription of the corresponding acceptor site, which introduces a delay δ. Assuming for the sake of simplicity that degradation exhibits the same time delay we find:

(2.3) βD={0: t<δβD: tδ}

The resulting delayed differential equation model solves as:

(2.4) [D]={αDt : t<δαDδ+αDβD(1eβD(tδ)) : tδ}

We show below that this model is difficult to fit on our data. Although there is no obvious biophysical motivation for the delay to be equal for degradation and splicing, a more complex model would need more parameters and be even more difficult to be fitted on our data.

2.2 Models for the acceptor site

Since the acceptor site can be spliced immediately after its transcription, we did not consider delayed models for the acceptor bonds. Instead, we only considered a first order kinetic model with constant synthesis rate as in Equation (2.2).

2.3 Model for the junction bonds

2.3.1 Constant junction formation rate model

We used a model that assumes first order kinetics, that is with constant parameters αJ and βJ. Denoting J the concentration of junction bonds, we modelled:

(2.5) ddtJ=αJ-βJ[J]

which leads to:

(2.6) J= αJβJ(1-e-βJt)
2.3.2 Coupled model

We also used a more complex model that describes the formation of junction bonds as the outcome of splicing with a coupled first order kinetic model. Assuming that each acceptor bond that disappears through splicing creates a junction, we have αJ=σAA in (2.6).

This leads to:

(2.7) J=αJβJβJ-βA1-e-βAtβJ-1-e-βJtβA

Because of possible alternative splicing, this model was fitted to the split reads data only, that is without considering the unsplit reads of the corresponding donor and acceptor sites. Below we show that the coupled model is difficult to fit and therefore we worked with the constant rate model during our whole analysis except when calculating splicing yield.

To calculate the splicing yield ηA of an acceptor as in Equation (1.1), we (i) estimated αJ by fitting Equation (2.7) to the split reads of all junctions the acceptor is involved in, and (ii) estimated αA using the first order kinetic model on intron-exon reads.

3 Comparison of the models

3.1 Parameter estimation

The constant rate models use only two parameters whereas the more complex models (the delay model and the coupled splicing model) used three parameters. To compare how well we can retrieve the model parameters from observed data, we simulated counts based on typical synthesis, splicing, degradation and delay constants using typical negative binomial noise and parameter ranges. We then tried to retrieve the parameters using a maximum likelihood approach with random initialization. We compared the results to the ground truth and found that the two-parameter models (Appendix 1—figure 1) were more accurately fitted than the three-parameter models (Appendix 1—figures 2,3). All models showed an unbiased estimate for the synthesis rate, which is important to calculate the splicing yield.

Appendix 1—figure 1
Evaluation of first order kinetic model fitting.

Estimated synthesis rate α (left) and phosphodiester bond half-life (log2β, right) on x-axis vs. simulated ground truth synthesis rate and phosphodiester bond half-life. The red line shows the identity line y=x.

https://doi.org/10.7554/eLife.45056.026
Appendix 1—figure 2
Evaluation of delay model fitting.

Estimated splicing half-time (log2β), synthesis rate α and delay time δ on the x-axis vs. simulated ground truth half-life, synthesis rate and delay. The red line shows the identity line. Most estimates are close to the identity line except for the delay.

https://doi.org/10.7554/eLife.45056.027
Appendix 1—figure 3
Evaluation of coupled model fitting.

Estimated synthesis rate α, splicing half-time (log2βA) and junction bond half-life (log2βJ) on the x-axis vs. simulated ground truth synthesis rate, splicing half time and half-life time. The red line shows the identity. Most estimates are unbiased but show a high degree variation.

https://doi.org/10.7554/eLife.45056.028

3.2 Relation between the parameters of the different models

Since the two-parameter models are more accurately fitted than the three-parameter models, we investigated the validity of approximations relating the parameter of the two-parameter models to those of the three-parameter models.

For the donor bonds, we asked how well the sum of the delay δ and the splicing half-time log2βD~ is approximated by the donor bond half-life log2βD estimated by the constant rate model. For the junction bonds, we asked how well the sum of the splicing half-time log2βA and the mature RNA half-life log2βJ is approximated by the junction bond half-life estimated by the constant rate model. To assess how many introns are affected by potential biases of using the constant rate model instead of the more complex models, we created histograms of the estimated error for each model parameter given the observed distribution of synthesis rates, splicing times and half-lives (Appendix 1—figures 47). To get a distribution for the delay, the intron length was divided by an assumed polymerase velocity of 4 kb/min (Jonkers et al., 2014; Gressel et al., 2017). Compared to our estimated measuring precision of 180% fold for synthesis and 32% fold for half-life, the absolute errors are negligible for the synthesis rates and for the splicing half-time.

Appendix 1—figure 4
Estimated bond synthesis rate error for delay model.

Histogram of the ratio of estimated and true synthesis rate of donors simulated with the delay model. The median error is 1.08.

https://doi.org/10.7554/eLife.45056.029
Appendix 1—figure 5
Estimated bond half-life error for delay model.

Histogram of the ratio of donor bond half-life estimated using a first order kinetic model and the sum of the true splicing half-time and delay of donors simulated using a delay model. The median error is 0.89.

https://doi.org/10.7554/eLife.45056.030
Appendix 1—figure 6
Estimated bond synthesis rate error for the coupled model.

Histogram of the ratio of estimated and true synthesis rate of junctions based on the coupled model. The median error is 0.8.

https://doi.org/10.7554/eLife.45056.031
Appendix 1—figure 7
Estimated bond half-life error for coupled model.

Histogram of the ratio of junction bond half-life estimated with a first order kinetic model and the sum of the true processed RNA half-life and of the splicing half-time of junction data simulated with a coupled model. The median error is 1.2.

https://doi.org/10.7554/eLife.45056.032

3.3 Experimental data

To further assess whether all three models give similar estimations on real data, we applied the donor models to all donor sites as well as the junction models to all junctions (Appendix 1—figures 815). The results were similar in all models and showed only marginal differences. However, synthesis rates and half-lives were systematically underestimated or overestimated. This has no relevance if rates are only compared within one model, but needs to be taken into account if acceptor and junction rates are compared, as in the case of splicing yield. Therefore, we based our splicing yield estimates on the more complex models.

Appendix 1—figure 8
Synthesis rate comparison of delay and constant rate model.

Synthesis rate estimated with the constant splicing rate model (y-axis) vs. synthesis rate estimated with the delay model (x-axis) for the donor bond based on the observed experimental data.

https://doi.org/10.7554/eLife.45056.033
Appendix 1—figure 9
Donor bond half-life comparison of delay and constant rate model.

Donor bond half-life estimated with the constant splicing rate model (y-axis) vs. the sum of the splicing half-time and delay estimated with the delay model (x-axis) for the donor bond (right) based on the observed experimental data.

https://doi.org/10.7554/eLife.45056.034
Appendix 1—figure 10
Synthesis rate comparison of coupled and constant rate model.

Estimated synthesis rate of the constant splicing rate model (y-axis) vs estimated synthesis rate of the coupled model based on the observed experimental data.

https://doi.org/10.7554/eLife.45056.035
Appendix 1—figure 11
Junction bond half-life comparison of delay and constant rate model.

Junction bond half-life estimated with the constant splicing rate model (y-axis) vs. the sum of splicing half-time and half-life estimated with the coupled model (x-axis) based on the observed experimental data.

https://doi.org/10.7554/eLife.45056.036
Appendix 1—figure 12
Synthesis rate precision of rate estimation procedure.

Estimated synthesis rate based on simulated counts on the x-axis vs. the ground truth synthesis rate for donor, acceptor and junctions.

https://doi.org/10.7554/eLife.45056.037
Appendix 1—figure 13
Splicing half-time precision of rate estimation procedure.

Estimated half splicing time for donor and acceptor as well as half-life for junctions based on simulated counts on the x-axis vs. the ground truth.

https://doi.org/10.7554/eLife.45056.038
Appendix 1—figure 14
Donor synthesis rate vs. acceptor synthesis rate within the same intron.
https://doi.org/10.7554/eLife.45056.039
Appendix 1—figure 15
Acceptor synthesis rate vs. junction synthesis rate within the same intron.
https://doi.org/10.7554/eLife.45056.040

4 Quality of fits

4.1 Simulated vs. fitted rates

To assess whether our method is able to estimate synthesis and degradation rates from ground truth, we simulated counts based on the estimated distributions of synthesis rates, splicing half times and half-lives based on the experimental data. Based on simulated data, our method is an unbiased estimator of ground truth synthesis rates (Appendix 1—figure 12), splicing half-time and half-life (Appendix 1—figure 13) with high precision compared to the dynamic range.

4.2 Agreement of kinetic parameters donor, acceptor and junction model

Under the assumption that RNA polymerase II does not drop off during the transcription of one intron, the donor and acceptor synthesis rate are equal (Appendix 1—figure 14), whereas the junction synthesis rate reduced by the splicing yield (Appendix 1—figure 15).

4.3 Expected vs. observed counts

Comparisons of the predicted expected counts of the constant rate model with our observed experimental counts are shown in Appendix 1—figures 16 (donor), Appendix 1—figures 17 (acceptor) and Appendix 1—figures 18 (junctions).

Normalization was based on spike-ins. Therefore, errors in spike-ins quantification possibly led to off centred the density plots. Indeed, the non-centred scatterplot showed deviation compatible with the bins they were created, {2 min}, {5 min, 10 min}, {15 min, 20 min}, {30 min, 60 min}.

Appendix 1—figure 16
Comparison of expected counts of the constant rate model with observed experimental counts.

Expected donor counts based on the constant rate model vs. the experimentally observed counts.

https://doi.org/10.7554/eLife.45056.041
Appendix 1—figure 17
Comparison of expected counts of the constant rate model with observed experimental counts.

Expected acceptor counts based on the constant rate model vs. the experimentally observed counts.

https://doi.org/10.7554/eLife.45056.042
Appendix 1—figure 18
Comparison of expected counts of the constant rate model with observed experimental counts.

Expected junction counts based on the constant rate model vs. the experimentally observed counts.

https://doi.org/10.7554/eLife.45056.043

4.4 Poor identifiability of splicing half-time and delay

In order to estimate splicing half-time and delay using our experimental data, we fitted a delay model. Although from theory we should be able estimate half splicing time and delay based on the delay model, the low numbers of data points and the noise of our data is too limiting to distinguish between linear or exponential start of the donor read population. This is supported by the bimodal distribution of the maximum likelihood estimate for the delay (Appendix 1—figure 2). If we fit a delay model to the donor site we would ideally expect that the donor delay correlates with intron length and the splicing half-time of the donor and acceptor sites correlate. Only for this analysis, we fitted the delay model also to the acceptor site. However, we found that the estimated delay or splicing half-time of the donor and acceptor sites of one junction correlates only little (Appendix 1—figures 1921). Their sum however correlates much more, showing that we actually can measure the sum of delay and splicing half-time but not each one alone (Appendix 1—figure 22).

Appendix 1—figure 19
Acceptor vs. donor synthesis rate estimation based on the fixed delay model.
https://doi.org/10.7554/eLife.45056.044
Appendix 1—figure 20
Acceptor vs. donor splicing half-time estimation based on the fixed delay model.
https://doi.org/10.7554/eLife.45056.045
Appendix 1—figure 21
Acceptor vs. donor delay estimation based on the fixed delay model.
https://doi.org/10.7554/eLife.45056.046
Appendix 1—figure 22
Acceptor vs. donor sum of splicing half-time and delay estimation based on the fixed delay model.
https://doi.org/10.7554/eLife.45056.047

5 Limits of the model

We simulated counts based on the constant rate model given a large range of log-uniform distributed bond synthesis rates (10−5 – 100 1/cell/ min) and bond half-lives (10−3 - 7 × 10−6 min). Based on this data we investigated the typical lower bound of reads necessary in all samples to proper model the kinetics a bond (Appendix 1—figure 23). We find that using a lower cut-off of 100 reads (after the 6th ventile) allows us to estimate the kinetics with a typical error below 100%.

Appendix 1—figure 23
Relative synthesis rate and half-life error vs. the accumulative bond read count accumulated in all samples and binned in its ventiles.

The first bin comprises the first two ventiles.

https://doi.org/10.7554/eLife.45056.048

To test the shortest and longest bond half-lives our experimental approach is able to capture and model properly we stratified the results also by bond half-live. We proceeded similarly with the bond synthesis rates. We found that most of our modelled data lies within a range where the median relative error of synthesis rates and half-lives is below 100% or 30% respectively (Appendix 1—figure 24, 25). We note that the errors given for the stratifications by bond half-life and synthesis are inflated for real world data, because the simulation is based on half-lives and synthesis rates that were drawn independently and more extreme than our real world data.

Appendix 1—figure 24
Relative synthesis rate and half-life error vs. the bond half-life binned in its ventiles.
https://doi.org/10.7554/eLife.45056.049
Appendix 1—figure 25
Relative synthesis rate and half-life error vs. the bond synthesis rate binned in its ventiles.
https://doi.org/10.7554/eLife.45056.050

6 4sU incorporation

For our analysis we did not consider the time until the labeled uracil 4sU gets available to the transcription machinery (by diffusion and import). The lag is a constant that is the same for all genes. Note that this time has to be very short since labeled RNA was detected after 2 min labeling.

Assuming that only a fraction of all transcripts incorporates 4sU, this would lead to an underestimation of the synthesis rates (e.g. a labeling efficiency of 90% would results in a 10% underestimation of the rate), given that the labeling efficiency is constant over all samples. However, it does not affect estimation of the half-lives nor of the splicing yields. Indeed, the kinetic models are then modeling the kinetics of the labelled fraction of the RNA species rather than the overall RNA species. These kinetics differ from the kinetics of the overall RNA species only by different synthesis rates.

7 Splicing yield

Splicing yield is computed by independently estimating the synthesis rate of the precursor (using unsplit reads of the acceptor bond) and of the mature RNA (using split reads) and taking the ratios (Materials and methods). Due to estimation errors, these ratios may turn out to be greater than 1. Simulations (section 5) showed that cumulative read coverage across all samples as well half-life can lead to bias estimates of synthesis rates. We therefore investigated whether this could confound our observation that non-coding RNAs show lower yield than mRNAs. However, the higher yield of mRNA versus non-coding RNA was recapitulated when stratifying by these possible confounders (Appendix 1—figures 26, 27).

Appendix 1—figure 26
Splicing yield distribution (boxplot) of lincRNA (red) and mRNA (blue) stratified by bins of cumulative number of reads across all samples.

Although yield correlates negatively with the cumulative number of reads, indicative of potential estimation bias, the mRNA yield remains higher than the lincRNA yield in every stratum.

https://doi.org/10.7554/eLife.45056.051
Appendix 1—figure 27
Splicing yield distribution (boxplot) of lincRNA (red) and mRNA (blue) stratified by bins of acceptor bond half-life.

Although yield correlates negatively with acceptor bond half-life (for half-life larger than 16 min), indicative of potential estimation bias, the mRNA yield remains higher than the lincRNA yield in every stratum.

https://doi.org/10.7554/eLife.45056.052
https://doi.org/10.7554/eLife.45056.023

References

  1. 1
  2. 2
  3. 3
  4. 4
    The Kipoi repository accelerates community exchange and reuse of predictive models for genomics
    1. Ž Avsec
    2. R Kreuzhuber
    3. J Israeli
    4. N Xu
    5. J Cheng
    6. A Shrikumar
    7. A Banerjee
    8. DS Kim
    9. T Beier
    10. L Urban
    11. A Kundaje
    12. O Stegle
    13. J Gagneur
    (2019)
    Nature Biotechnology.
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
  24. 24
  25. 25
    Regularization paths for generalized linear models via coordinate descent
    1. J Friedman
    2. T Hastie
    3. R Tibshirani
    (2010)
    Journal of Statistical Software, 33, 10.18637/jss.v033.i01, 20808728.
  26. 26
  27. 27
  28. 28
  29. 29
  30. 30
  31. 31
  32. 32
  33. 33
  34. 34
  35. 35
  36. 36
  37. 37
  38. 38
  39. 39
  40. 40
  41. 41
  42. 42
  43. 43
  44. 44
  45. 45
  46. 46
  47. 47
  48. 48
  49. 49
  50. 50
  51. 51
  52. 52
  53. 53
  54. 54
  55. 55
  56. 56
  57. 57
  58. 58
  59. 59
  60. 60
  61. 61
  62. 62
  63. 63
  64. 64
  65. 65
  66. 66
  67. 67
  68. 68
  69. 69
  70. 70
  71. 71
  72. 72
  73. 73
  74. 74
  75. 75
  76. 76
  77. 77
  78. 78
  79. 79
  80. 80
  81. 81
  82. 82
  83. 83
  84. 84
  85. 85
    rCube - RNA-Rates in R. rCube
    1. L Wachutka
    2. C Demel
    3. J Gagneur
    (2017)
    rCube - RNA-Rates in R. rCube.
  86. 86
  87. 87
  88. 88
  89. 89
  90. 90
  91. 91
  92. 92
  93. 93
  94. 94
  95. 95
  96. 96
  97. 97
  98. 98
  99. 99
  100. 100
  101. 101
  102. 102

Decision letter

  1. Douglas L Black
    Reviewing Editor; University of California, Los Angeles, United States
  2. James L Manley
    Senior Editor; Columbia University, United States

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

[Editors’ note: a previous version of this study was rejected after peer review, but the authors submitted for reconsideration. The first decision letter after peer review is shown below.]

Thank you for submitting your work entitled "Global two-step RNA splicing kinetics in human cells" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by a Reviewing Editor and a Senior Editor. The reviewers have opted to remain anonymous.

Our decision has been reached after consultation between the reviewers. Based on these discussions and the individual reviews below, we regret to inform you that your work will not be considered further for publication in eLife.

There was some interest from the reviewers and editors in this new approach to kinetic analysis of splicing. However, the reviews point out that the study does not demonstrate how well the new model actually fits the observations, either on a global scale or at the level of individual genes. This would be a key step for a study presenting a new approach to such questions. A number of other major concerns were raised regarding the kinetic modeling.

Reviewer #1:

Wachutka et al. describes a method to assess splicing kinetics independently for each of the two steps of splicing, with downstream analyses to understand what may contribute to variation in the rates of splicing donor and acceptor cleavage during the 1st and 2nd steps of splicing respectively. I found that their approach to splicing kinetics – breaking it up into the individual steps rather than focusing on the entire intron as one piece – to be very interesting and innovative. The time course TT-seq labeling data presented in the manuscript is perfectly suited to answer these questions and their conclusions seem to support the idea that we can use this approach to gain insights on how splice sites are being recognized and efficiently used. Overall, I found the manuscript to be fairly well written, though I had a hard time following exactly what the authors were describing or their rationale in many places, as highlighted below. Specifically, I have the following concerns:

1) This manuscript is both technically and biologically dense, with the mathematical modeling and structurally based interpretation of the resulting kinetic data. I think it is necessary to provide more details about the methods in order to give readers an understanding of (1) the precise normalization, scaling, and modeling steps with regard to how to interpret each value being used for these steps, (2) the assumptions about kinetic relationships that underlie these models, and (3) how to assess the appropriateness and validity of this model (as described below).

2) I think it would be very useful to have an assessment of how well their model is fitting to their data, and the limitations of this model (and the data used). Ideally this would be provided by an orthogonal approach, but I sympathize that the experiments necessary for this may be logistically prohibitive. However, a computational assessment of this – either by simulations (accounting for how well their model picks up the ground truth after accounting for all the assumptions being made) or assessment of their model fit – is crucial. This would relate to why they choose a first order kinetic model, given that they seem to be modeling multiple potential first order processes (synthesis, bond cleavage, and RNA degradation) simultaneously.

3) One assumption that seems to be have been made in their kinetic modeling is that the site (either donor/acceptor) was created exactly at the beginning of labeling – while the experimentation fragmentation inherent to the TT-seq protocol allows us to know that the site being sequenced was indeed transcribed sometime during the labeling period, it may have been at the very beginning or the very end of this period. How can the authors account for this (i.e. should the assumption, given a uniform probability of transcribing during every finite time in the labeling period, be that the site was transcribed at the beginning or the middle of the labeling period on average)?

4) The authors make assumptions about a maximum polymerase elongation velocity for a few of their analyses, however studies have also shown substantial variability in polymerase elongation rates, across genes and potentially across regions within a gene. How might this affect their conclusions about intron length affecting the time to splicing or other observations?

5) The authors seem to suggest that previous studies may have overestimated splicing rates and not accounted for biases in these rates across the gene, however, this study seems to get similar splicing rate values. How do they account for this agreement of their measurements? Furthermore, how might they make these comparisons given that most other studies estimated rates while considered the entire intron entity rather than each individual splicing site? It seems like the junction formation rate may be the best metric for comparison, however, they do not focus very much analysis or time on this particular metric (which is indicative of completed splicing).

6) I really don't understand the splicing yield measurement and, particularly, how it differs or compares to a standard percent-spliced-in value that is commonly used to evaluate splicing patterns in RNA-seq datasets. What happens when only look at read abundance over these sites – is using the rates telling us anything different? Would be useful to put this in context of a metric that readers may be more used to thinking about in order to give it the proper interpretation.

Reviewer #2:

In this manuscript by Wachutka et al., the authors use modeling of 4sU pulse-chase RNA-sequencing data to infer rates of splicing across introns genome-wide. They then use sequence features and comparison to structural data to draw conclusions about first and second step transesterification rates. Experimentally, cells are exposed to a brief pulse of 4 thiouracil, and RNA made during the pulse will incorporate the 4sU and can be purified through biotin pull downs. So a junction (exon-intron, intron-exon, exon-exon) which is still around at some variable chase time provides kinetic information about the rates of removal of that particular junction. These removal processes could be splicing, non-sense mediated decay, or mRNA decay. Usually, the entire RNA is pulled down, giving the complete history of that particular transcript. The innovation from the Cramer lab (previously published TT-seq method) is to fragment this RNA to < 6kb, which removes some of the memory effect, although one is still analyzing RNA which was transcribed before the pulse. The present work is based entirely on analysis of this one data set in K562 cells. Overall, I have deep concerns about the analysis which are further exacerbated by the complete lack of perturbations or any effort at comparing to a ground truth measure of splicing kinetics. For these reasons, I don't think the paper is suited to eLife.

1) The following example illustrates the problem I have with this approach. Take the exon-intron donor junction reads as an example. Assume that the only way for this collection of reads to disappear is through the first-step transesterification step of splicing, which means the polymerase must at least reach a branchpoint, followed by the branchpoint nucleophilic attack. It is this second part which is the rate of the first step of splicing, not the time to reach the branchpoint. In addition, this elongation time to the branchpoint is not exponentially-distributed because RNAP is a processive enzyme. This time simply cannot be modeled as a single-step reaction but is rather a series of many exponential reactions. Moreover, even if one were to somehow include this information, meaning the local context of this donor junction, and then compute an elongation time to the branchpoint, it is impossible to know from this assay which branchpoint was chosen. This omission is a critical, and in my mind, fatal flaw of this analysis. I cannot see how one can possibly determine a rate for the first step of splicing without knowing which branchpoint was used. Going further, it would even be inappropriate in this analysis to use the annotated branchpoints because the authors are purporting to finds all sorts of unannotated introns and splicing events. Indeed, I think it is for this combination of reasons why so many labs are exerting great efforts into branchpoint sequencing and analysis.

2) Therefore, statements about "donor and acceptor cleavage times" are meaningless in the splicing context. It could be that the authors intend to report a 'donor cleavage time' which is indeed a combined measure of elongation, branchpoint selection, and the first chemical step of splicing, but I can't see what biochemical insight is gained from that time. There is an inclination that they are measuring something related to splicing based on the results in Figure 4B: donor cleavage time varies with departures from the consensus sequence. However, these changes don't seem to be consistent with the known biochemistry of 5' ss.

3) The theoretical treatments are in general lacking in rigor.

- Several papers have treated the kinetics of splicing and modeled kinetic data in single cells or even genome-wide (PMIDs: 25271374, 22022255, 25541195), yet the authors seem to be unaware of this work. Although one paper they do cite (Schmidt et al., 2011) indicates the importance of the correct theoretical treatment of elongation.

- There is no measure for the goodness of fits or examination of other models.

- They use a worm-like chain model based on RNA persistence length to compute the energy of looping. With the energy, they apply the Arrhenius equation to get a rate and compare this value to the donor cleavage time (Figure 3E). However, whether there is any trend in this data and whether this model explains that trend is difficult to determine.

4) I found the splicing yield part of the paper (Figure 7) to be the most interpretable of the results presented. Again, however, none of this was validated with any more quantitative measure such as digital droplet PCR.

In summary, I see this analysis (with the underlying parameterization) as a reasonably good way to explain 'donor cleavage time' and 'acceptor cleavage time' as measured in this assay. And there may be some utility in such metrics. My overall difficulty is that I don't think this model parameterized in such a way is telling us much about splicing and transcription in a way that could be validated or interpreted in other assays. So I see this paper as a bioinformatic exercise lacking in biochemical restraint or biophysical insight.

Reviewer #3:

The authors present human genome-wide estimates of cleavage times at a single splice site resolution (donor and acceptor) along with estimates of synthesis and mRNA half-life using transient transcriptome sequencing (TTseq) which uses pulse labeling of 4SU followed by RNA fragmentation prior to purification of the labelled RNA – this reduces the 5' bias. They also assess the effect of different features to splicing time, such as intron position, length and sequence, and then attempt to develop correlations between structural aspects of spliceosomal interactions and kinetics.

Given that the optimization methods used for estimating synthesis/formation and cleavage/degradation of the splice sites/junction is a local algorithm, is 10 random initialization enough to be confident that the global minimum has been reach?

A key assumption for the model fitting is that 100% of the label is incorporated. The authors should explicitly explore how lower levels of labelling (e.g. 90, 50, 10%) affect their results and conclusions.

Some comparison and correlation of synthesis rate estimated has been done separately for different donor sites and different acceptor sites (Figure 2—figure supplement 2B), but as mentioned, these should also be consistent for donor and acceptor of the same gene; however, donor synthesis rate vs acceptor synthesis rate is not shown.

Model fits should be shown for individual genes, highlighting different classes: e.g. single vs multi-intron, highly expressed vs low expressed.

Kinetic language is not precise and causes confusion: For example "Cleavage times" should be referred to as "half cleavage times" or possibly "cleavage half-times". They do not refer to completion times, or characteristic times.

"Cleavage Rates in the range of minutes" is also imprecise. They seem to refer to cleavage half-life times being in in the minute range. More precise would be to report the actual cleavage rate constants, in addition to the cleavage half-life times. ("Rate" = rate constant x abundance).

The part on the single nucleotide model methods would benefit of a more detailed description. It seems to be a statistical model based on the previously derived cleavage time and consensus splice site sequence but the methods is not very clear to me.

[Editors’ note: what now follows is the decision letter after the authors submitted for further consideration.]

Thank you for submitting your article "Global donor and acceptor splicing site kinetics in human cells" for consideration by eLife. Your article has been reviewed by three peer reviewers, and the evaluation has been overseen by Douglas Black as Reviewing Editor and James Manley as the Senior Editor. The reviewers have opted to remain anonymous.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission. We believe the paper is in principle more appropriate as a "Tools and Resources" paper than as a Research Article, and therefore request that a revised manuscript be submitted in that category.

Summary:

Gagneur and colleagues describe a new approach to measuring rates of intron excision in vivo that combines a metabolic labeling method with RNAseq and mathematical modeling. Focusing the analysis on the appearance and disappearance of reads for individual exon-intron, intron-exon, and exon-exon junctions, the authors show that they can extract rate information for the chemical steps at particular splice sites genomewide. They then correlate these rates with a variety of features in the pre-mRNA including consensus sequence match, intron length, and local sequence context. Elucidating the determinants of splicing kinetics is important to understanding gene regulatory programs in metazoans and has been understudied. A new method for measuring splicing rates is thus potentially very valuable.

The reviewers all felt that the paper was much improved from the previously version, which we had declined. There was universal interest in the method of combining metabolic labeling with analyses of single phosphodiester junctions to extract individual rates of decay and formation for particular splice sites and exon-exon junctions. This method presents an original approach to measuring splicing kinetics in vivo, and a potentially valuable advance for the field. As such, the paper could make a good addition to eLife, if it can be appropriately revised. The primary interest of the paper is in the technology and analytical techniques. New biological or mechanistic insights arising from the study are limited, with values for intron excision rates similar to previous studies. It is very important that the method be well described and that the data not be overinterpreted. While improved, the paper is still very dense and difficult to follow. The reviewers raised a number of issues and points of needed clarification that must be addressed.

Essential revisions:

1) The reviewers remain unconvinced that the measured decay and formation rates of phosphodiester linkages can be directly attributed to individual recognition steps and enzymatic rates. By the time a chemical step causes the loss or gain of a junction read, dozens of events have occurred that could contribute to the rate of that step. For a 5' splice site these include initial recognition by U1, transcriptional elongation/pausing in the downstream intron, branchpoint recognition (which cannot be measured here), myriad spliceosome assembly steps, and finally catalysis. While the effects of splice site mutations agree with the known interaction of the 5' splice site by the U1 snRNP, other moieties will be interacting with these residues, and there is no evidence that the U1 interaction is determinative of the rate of cleavage. It is very simplistic to claim this is so. The rate of exon-intron junction loss will be the aggregate function of many different rates that likely vary from site to site and are not being measured. It should be noted that U1 is not involved in catalytic steps, only initial recognition. To say that "we developed a computational approach that models rates of RNA metabolism at the level of single phosphodiester bonds" is misleading. The authors need to examine their interpretations of their data and be careful not to claim that the rates they are observing can be broken down into steps that are not being measured.

2) Subsection “New and alternative splice sites”, last paragraph: The authors reported numbers (177,322 and 164,533) of non-split reads mapping to exon-intron boundaries (5' splice sites) sum to the total number of putative introns reported in the paragraph above (341,855, identified by split reads across an exon-exon boundary). This seems wrong, or did they throw out all examples of alternative splicing? They don't report numbers for intron-exon boundaries (3' splice sites), but farther down in the same paragraph they treat the read numbers as if these sites are included. In that later discussion, the authors state that of the split reads, "18% contained a non-annotated donor site, 22% a non-annotated acceptor site and another 38% represented new putative splice sites". What is the difference between a non-annotated donor or acceptor site and a new putative splice site? Furthermore, if aiming to differentiate between annotated and non-annotated sites, wouldn't it make more sense to start with the percentage of split reads mapping to non-annotated regions, where all the sites are likely unannotated, and then discuss new previously unannotated sites in GENCODE?

3) The authors indicate that they are fitting the model independently to every donor-acceptor combination that they observe. If so, what is causing the differences in the numbers of donor (162,134), acceptor (177,543), and junction (156,825) bonds that they are able to model? Don't they need sufficient read coverage across all three to confidently estimate the synthesis/cleavage rates on a per-junction level? Or does this include some cases of alternative splice site pairings?

4) The use of simulations to estimate the robustness of the model fitting, across both the first order kinetic model and alternative models is a helpful addition to the paper. It should be highlighted more in the main text how this supports the validity of the method. An assessment should also be provided for how the method performs and where it doesn't work well. For example, the authors indicate that "the robustness of our approach… consistent with high accuracy of the fitting procedure on simulated data." However, all models or experimental systems are underpowered in certain cases. The authors should include a short discussion of where their model falls short (ultra-fast rates, length considerations, etc.) and thus more fully inform the reader about the biological interpretations that can be drawn from the data. Simulations are often the best way to push the model and test these extreme cases.

5) In the section entitled "Intron length constrains co-transcriptional splicing times." It is not clear what the authors mean by "co-transcriptional". The authors describe how donor bond half-life increases with intron length, presumably indicating that donor bond half-life is dependent on the time to reach the branchpoint and acceptor site. This would imply that the first step of splicing occurs soon after the acceptor site is transcribed and available. If so, then why is the acceptor bond half-life so long? A median acceptor half-life of 4 min indicates that the Polymerase is on average 16kb away by the time the second step of splicing occurs. Do they propose that the first step is close in time to the second step, with a lag between transcription of the branchpoint and the loss of the donor? Or are the two chemical steps separated by 4 minutes? This seems unlikely in that it implies that spliceosome assembly up to C complex formation is very fast compared to the transition from the first to the second catalytic steps. If the authors think that donor loss occurs closer in time to the second step – well after the appearance of the branchpoint – then the definition of "co-transcriptional" that involves immediate splicing upon transcription/availability of the acceptor site doesn't seem to hold. All of this needs better explanation.

6) It is not clear how the splicing yield calculation can produce a value of > 1, with the median splicing yield reported for mRNAs as 1.2. Does this mean that the measured levels of precursor mRNAs are consistently underestimated? If it's not bounded as described (0-1), the splicing yields are difficult to assess as it seems one or more of the measured values is incorrect.

https://doi.org/10.7554/eLife.45056.066

Author response

[Editors’ note: the author responses to the first round of peer review follow.]

Reviewer #1:

[…] 1) This manuscript is both technically and biologically dense, with the mathematical modeling and structurally based interpretation of the resulting kinetic data. I think it is necessary to provide more details about the methods in order to give readers an understanding of (1) the precise normalization, scaling, and modeling steps with regard to how to interpret each value being used for these steps, (2) the assumptions about kinetic relationships that underlie these models, and (3) how to assess the appropriateness and validity of this model (as described below).

We thank the reviewer for the comment. (1) We have described the spike-in normalization procedure in main text. (2) We provided more details about our kinetic modeling in the Appendix (subsections “Notations, definitions, and relation to splicing quantities found in literature” and “Kinetic models”). (3) To assess the appropriateness and validity of our model, we compared our model with other two alternative, more complex splicing models (Appendix subsection “Comparison of the models”). As a result, our model (two free parameters constant rate) estimates synthesis, splicing, and degradation rate in a more accurate way than the other models by at least one order of magnitude (Appendix subsection “Parameter estimation”, Appendix—figures 13). When we applied the model to our experimental data, the results were similar in all models (Appendix subsection “Experimental data”, Appendix—figures 8-11). To assess the appropriateness and validity of our model we added a subsection in the Appendix (“Quality of fits”) where we tested our quality of fitted rates (see answer below).

2) I think it would be very useful to have an assessment of how well their model is fitting to their data, and the limitations of this model (and the data used). Ideally this would be provided by an orthogonal approach, but I sympathize that the experiments necessary for this may be logistically prohibitive. However, a computational assessment of this – either by simulations (accounting for how well their model picks up the ground truth after accounting for all the assumptions being made) or assessment of their model fit – is crucial. This would relate to why they choose a first order kinetic model, given that they seem to be modeling multiple potential first order processes (synthesis, bond cleavage, and RNA degradation) simultaneously.

We thank the reviewer for the insightful comment. To assess whether our model estimates kinetic rates from ground truth, we have now generated simulated datasets under the assumptions of the first order kinetic models, using realistic ranges of parameters to assess how accurate the estimation of the parameters were. Our estimated rates recovered accurately the simulated ones (Appendix subsection “,Simulated vs. fitted rates”, Appendix—figures 12-13). Comparison of the expected versus observed counts of donor (Appendix—figure 16), acceptor (Appendix—figure 17), junctions (Appendix—figure 18) demonstrates the precision of our model. Moreover, we have also assessed how robust our fitting is to deviations to the first order kinetics assumption by simulated data with different kinetics (delayed process and coupled ODEs). This showed that i) these more complex kinetics models are difficult to fit with our experimental design (Appendix—figures 1-3) and ii) the 1st order ODEs can be robustly fitted to such data yielding parameters that can be interpreted (Appendix—figures 4-11).

3) One assumption that seems to be have been made in their kinetic modeling is that the site (either donor/acceptor) was created exactly at the beginning of labeling – while the experimentation fragmentation inherent to the TT-seq protocol allows us to know that the site being sequenced was indeed transcribed sometime during the labeling period, it may have been at the very beginning or the very end of this period. How can the authors account for this (i.e. should the assumption, given a uniform probability of transcribing during every finite time in the labeling period, be that the site was transcribed at the beginning or the middle of the labeling period on average)?

No, a first order kinetic model does not assume that each molecule is created at the beginning of the time series but that they are continuously synthesized and degraded during the time series. Related to this point, our model did not consider time until 4sU gets available to the transcription machinery (by diffusion and import). The lag is a constant that is the same for all genes. Note that this time has to be very short since labeled RNA was detected after 2 min labeling. We now mention this point in the Appendix.

4) The authors make assumptions about a maximum polymerase elongation velocity for a few of their analyses, however studies have also shown substantial variability in polymerase elongation rates, across genes and potentially across regions within a gene. How might this affect their conclusions about intron length affecting the time to splicing or other observations?

We agree with the reviewer and we are aware that Pol II elongation rate is gene specific and it shows variability across the regions within the genes. However, precise calculation of Pol II elongation rate would require the integration of TT-seq data with local Pol II occupancy profiling which is difficult and would justify a manuscript on its own. It has been shown that Pol II elongation velocity varies from 0.5 to 4kb/min (Gressel et al., 2017, Jonkers et al., 2014, Ardehali and Lis, 2009), with an average elongation velocity of 2.3kb/min (Gressel et al., 2017, Fuchs et al., 2014; Jonkers et al., 2014; Saponaro et al., 2014, Veloso et al., 2014). Pol II accelerates in the gene body up to 4kb/min, indicating a faster transcription of introns (Gressel et al., 2017, Jonkers et al., 2014). Because of this, we reported an overall trend assuming a 4kb/min Pol II elongation rate in introns. Investigation of intron-specific relationship of Pol II elongation rate and splicing could be an interesting future research direction. We now comment on this in the Discussion.

5) The authors seem to suggest that previous studies may have overestimated splicing rates and not accounted for biases in these rates across the gene, however, this study seems to get similar splicing rate values. How do they account for this agreement of their measurements? Furthermore, how might they make these comparisons given that most other studies estimated rates while considered the entire intron entity rather than each individual splicing site? It seems like the junction formation rate may be the best metric for comparison, however, they do not focus very much analysis or time on this particular metric (which is indicative of completed splicing).

The closest dataset to ours is the one of Mukherjee et al., 2016, which uses 4sU-seq in human cells. Mukherjee et al., 2016, did not compute splicing kinetics of individual introns but only classified them as fast, medium or slow, by clustering the time series data. We therefore could not directly compare the splicing rates of individual introns. Mukherjee et al., 2016, provided gene level splicing rates, which were computed by fitting first order kinetics on the complete precursor RNA, using RSEM to estimate mature and precursor RNA abundance. To compare our results with their results, we averaged donor and acceptor cleavage times within major isoforms as these reads are specific to the precursor. While our measures of half-life correlate well across genes, those of synthesis rates and splicing rates were more modest (Figure 2—figure supplement 1D). The orders of magnitudes match. Junction formation rate estimation is indicative of completed splicing but is not the most informative metric to estimate the time needed to splice donor and acceptor splice-site (cleavage time). Assuming perfect yield. i.e. that every precursor RNA leads to a mature RNA, the junction formation rate equals the precursor synthesis rate. It is thus roughly a constant across one transcript. We now explain this in the Appendix (Appendix 1—table 1).

6) I really don't understand the splicing yield measurement and, particularly, how it differs or compares to a standard percent-spliced-in value that is commonly used to evaluate splicing patterns in RNA-seq datasets. What happens when only look at read abundance over these sites – is using the rates telling us anything different? Would be useful to put this in context of a metric that readers may be more used to thinking about in order to give it the proper interpretation.

We apologize for not being clear enough in the text and we thank the reviewer to point it out. In Appendix we now relate our metric to existing ones, including Psi. In Appendix subsection “Notations, definitions, and relation to splicing quantities found in literature”, we added more detailed explanation of how we defined splicing quantities and of the difference between percentage-spliced-in and splicing yield. Splicing yield cannot be computed from steady-state RNA-seq data because synthesis and degradation are entangle in steady-state data. We have also edited the text extensively to make sure this is now easier to understand and we trust the reviewer agrees.

Reviewer #2:

In this manuscript by Wachutka et al., the authors use modeling of 4sU pulse-chase RNA-sequencing data to infer rates of splicing across introns genome-wide. They then use sequence features and comparison to structural data to draw conclusions about first and second step transesterification rates. Experimentally, cells are exposed to a brief pulse of 4 thiouracil, and RNA made during the pulse will incorporate the 4sU and can be purified through biotin pull downs. So a junction (exon-intron, intron-exon, exon-exon) which is still around at some variable chase time provides kinetic information about the rates of removal of that particular junction. These removal processes could be splicing, non-sense mediated decay, or mRNA decay. Usually, the entire RNA is pulled down, giving the complete history of that particular transcript. The innovation from the Cramer lab (previously published TT-seq method) is to fragment this RNA to < 6kb, which removes some of the memory effect, although one is still analyzing RNA which was transcribed before the pulse. The present work is based entirely on analysis of this one data set in K562 cells. Overall, I have deep concerns about the analysis which are further exacerbated by the complete lack of perturbations or any effort at comparing to a ground truth measure of splicing kinetics. For these reasons, I don't think the paper is suited to eLife.

We thank the reviewer for acknowledging the advantages of TT-seq. However, we wish to point out that there is no reason to believe that we do analyze RNA that was transcribed before the pulse. There is no cross-contamination of labeled RNA fragments with non-labeled RNA, and the RNA fragment length is much shorter than the distance of polymerase progression during the labeling time. Thus the assumption of the reviewer is not substantiated. We also feel there may be a misunderstanding: we are not conducting a pulse-chase experiment but rather apply labeling pulses that vary in length. Analysis of in vivo pulse-chase experiments is made difficult due to recycling of labelled uracil after RNA degradation. Our approach does not have this limitation. We think it is the best currently available experimental protocol to infer in vivo rates that are tightly coupled to splicing events. Since we also provide splicing yields and can related our in vivo data to in vitro structural data, we think our manuscript is novel and timely and highly suitable for eLife. We kindly ask the reviewer to reconsider his/her opinion in the light of our additional analysis and modeling.

1) The following example illustrates the problem I have with this approach. Take the exon-intron donor junction reads as an example. Assume that the only way for this collection of reads to disappear is through the first-step transesterification step of splicing, which means the polymerase must at least reach a branchpoint, followed by the branchpoint nucleophilic attack. It is this second part which is the rate of the first step of splicing, not the time to reach the branchpoint.

We apologize if we did not explain it clear enough in the text: what we define as “donor cleavage time” is the sum of two different processes, elongation time to the branchpoint and first transesterification step, which together will determine the time needed to cleave the phosphodiester bond of the donor splice site. We had actually used this extensively in our manuscript to provide insights into the co-transcriptional nature of the process. Because of this, we could investigate the role of intron length, and therefore elongation time, in the paragraph of the paper “Intron length constrains co-transcriptional splicing times” where we explicitly stated that donor splicing kinetics are regulated by the time the polymerase needs to transcribe the intron. To avoid misunderstanding, we now describe the processes determinant of the cleavage time of the donor site and of the acceptor site in the first paragraph of the section “Kinetic modeling”. We have also changed the manuscript title, which may have been too short and thus misleading. In summary, the kinetic insights are relevant to understanding splicing and its co-transcriptional nature in vivo.

In addition, this elongation time to the branchpoint is not exponentially-distributed because RNAP is a processive enzyme. This time simply cannot be modeled as a single-step reaction but is rather a series of many exponential reactions.

More complex models could be devised but would need to be fitted to the data. We have now investigated alternative models (Appendix subsections “Kinetic models” and “Comparison of the models”). One is a delay model for modeling the time of intron transcription. Another one is a coupled model for the junction reads that couple two exponential processes, namely donor-site / acceptor-site cleavage and mature RNA degradation. When we compared our model to the other more complex ones (Appendix subsection “Parameter estimation”), we found that the more complex models typically offer no advantage over our simple model because our experimental data is too limited to substantially differentiate between the models. Based on simulated data we show that the simple constant rate model deviates only slightly from the results if we applied more complex models and the differences are typically below our estimated experimental error. Furthermore, the more complex models are less precise to estimate and therefore the overall precision of our analysis would decrease. Systematic errors between the different models only occur if we compare kinetic rates between different models as done in our splicing yield calculation. Therefore, this is the only case where we apply the more complex models. Taken together, although the process of co-transcriptional splicing is complex and should ideally be described with more detailed models in the future, experimental data are limiting what can be modeled and our model is robust and does reflect the process in a meaningful way that enables us to draw simple conclusions.

Moreover, even if one were to somehow include this information, meaning the local context of this donor junction, and then compute an elongation time to the branchpoint, it is impossible to know from this assay which branchpoint was chosen. This omission is a critical, and in my mind, fatal flaw of this analysis. I cannot see how one can possibly determine a rate for the first step of splicing without knowing which branchpoint was used. Going further, it would even be inappropriate in this analysis to use the annotated branchpoints because the authors are purporting to finds all sorts of unannotated introns and splicing events. Indeed, I think it is for this combination of reasons why so many labs are exerting great efforts into branchpoint sequencing and analysis.

We have now clarified that we are not estimating directly the rates of the individual transesterification step, which would indeed lead to these complications. Also, we have used branchpoint predictions later in the manuscript for predicting the rates from sequence (Figure 4, 5), which is a practical way to move forward as maps of branchpoints are still lacking. These turned out to have very reasonable predictive power. Also, we would like to underscore that many introns contain multiple branchpoints but their distance from the acceptor site is narrowly distributed, typically from 18 to 45 nt from acceptor site (Mercer et al., 2015). Taken together, our conclusions are not compromised by the fact that multiple branchpoints may contribute.

) Therefore, statements about "donor and acceptor cleavage times" are meaningless in the splicing context. It could be that the authors intend to report a 'donor cleavage time' which is indeed a combined measure of elongation, branchpoint selection, and the first chemical step of splicing, but I can't see what biochemical insight is gained from that time.

Please compare our answers above. Indeed, we intentionally report donor cleavage time as we think that this is a quantity that can be defined and directly measured from such genome-wide data. Delineating it into its biochemical components (elongation, branchpoint selection, and the first chemical step of splicing) would require strong modeling assumptions and lead to overfitting of the data. We show with multiple analyses that these times do contain information and are thus meaningful. We agree however that in the future other sequencing protocols may provide information that allows for the fitting of more complex mathematical models. This is however clearly beyond the scope of this work.

There is an inclination that they are measuring something related to splicing based on the results in Figure 4B: donor cleavage time varies with departures from the consensus sequence. However, these changes don't seem to be consistent with the known biochemistry of 5' ss.

We discussed our data with senior colleagues who are in the splicing field and they spotted many observations that explain prior biochemical data. Note we have also performed extensive analyses supporting our findings beyond Figure 4B, including: 2C (lncRNA splice slower than mRNAs), 4A,C,D Figure 5, Figure 6 (including recovery of motifs of know splicing factors) and Figure 4—figure supplement 2 (nucleotide in physical contact with the spliceosome complexes). We are happy to discuss any deviations from known biochemistry in the manuscript but we would need this reviewer to specify such inconsistencies. The statement that changes do not seem to be consistent with known biochemistry must be substantiated and we need to obtain references to check into this before we can address this point.

3) The theoretical treatments are in general lacking in rigor.

We have substantially extended the description of the method, modeling assumptions, model interpretation, and performed simulations (Appendix). We think that this improved very much the manuscript. Again, we would need more specific comments if the reviewer wishes us to improve on a certain aspect. Without specification, we cannot further address this concern.

- Several papers have treated the kinetics of splicing and modeled kinetic data in single cells or even genome-wide (PMIDs: 25271374, 22022255, 25541195), yet the authors seem to be unaware of this work. Although one paper they do cite (Schmidt et al., 2011) indicates the importance of the correct theoretical treatment of elongation.

We have analyzed this published work in detail and made the following changes:

1) PMID 25271374 (Coulon et al., 2014)

We added the citation in the text. The authors used a dual color single-molecule RNA imaging to estimate the kinetic of β-globin reporter gene in human U2OS cells. They concluded that introns are spliced both before and after transcript release, although the majority of them is spliced after transcript release. Note that in the submitted manuscript, we have cited similar papers in which splicing rates have been measured in endogenous β-globin human gene (Martin et al., 2013) and with MS2-GFP tagged MINX intron gene (Schmidt et al., 2011) with single-molecule live imaging.

2) PMID 22022255 (Aikten et al., 2011)

The authors modeled splicing kinetics of a single-intron containing reporter Ribo1 gene in yeast, using RT-qPCR data. They concluded that splicing is predominantly co-transcriptional and point mutations at 3´SS reduce the probability that step I is happening co-transcriptionally (a sort of feedback). We included the citation in our text.

3) PMID 25541195 (Davis-Turak et al., 2015)

This is the only genome-wide analysis. The authors used a mathematical analysis to study cotranscriptional splicing. However, they do not analyze RNA labeling data. Their treatment of elongation relates to post-transcriptional splicing (i.e. to the time for polymerase to reach the polyA site) and not to the transcription of the intron. Moreover, Davis-Turak et al. did also used first order kinetics for modeling splicing of single introns for the very same reason than us, namely that more complex models did not improve the fits. We are now referring to this work when stating why we had to adopt first order kinetics models.

- There is no measure for the goodness of fits or examination of other models.

We are now providing in the Appendix subsection “Quality of fits”, scatterplots of read counts versus expected read counts according to the fits (Appendix—figures 16-18). These show very good agreements (Spearman rho >0.89 for all time points except for the RNA-seq samples with rho >0.8 for donor and acceptor sites reads). The lower correlation of the RNA-seq samples for donor and acceptor sites can be explained by low abundance of those unsplit reads at steady-state. Moreover, the aforementioned subsection of the Appendix provides plots demonstrating how accurate the model parameters are estimated when the ground truth is known (simulations, Appendix—figures 12-13, rho >0.99 for every parameter and model). We have also now investigated the accuracy of fitting a non-first order kinetic model to the donor site (model with delay). This showed that the delay is difficult to estimate accurately (Appendix—figure 2). Moreover, we are also now plotting the estimated synthesis rate of donor against the acceptor site. Assuming no polymerase drop- offs during transcription of the intron, these should agree. These parameters are estimated from distinct data (donor sit and acceptor site unsplit reads). While they agree reasonably well with the first order kinetic model (Appendix—figure 14, rho =0.42), the agreement is much poorer when using a delay model for the donor site (rho = 0.26, Appendix—figure 19).Altogether, these analyses demonstrate that the models fit well to the data and failure of a simple more complex model to do so.

- They use a worm-like chain model based on RNA persistence length to compute the energy of looping. With the energy, they apply the Arrhenius equation to get a rate and compare this value to the donor cleavage time (Figure 3E). However, whether there is any trend in this data and whether this model explains that trend is difficult to determine.

We agree with the reviewer that the trend is hard to read and we removed the figure and its discussion from the manuscript. In any case, this was very peripheral to our analysis.

4) I found the splicing yield part of the paper (Figure 7) to be the most interpretable of the results presented. Again, however, none of this was validated with any more quantitative measure such as digital droplet PCR.

We thank the reviewer for the positive comment about splicing yield. However,splicing yield is not directly accessible without kinetic modeling. Using PCR we cannot validate the model.

In summary, I see this analysis (with the underlying parameterization) as a reasonably good way to explain 'donor cleavage time' and 'acceptor cleavage time' as measured in this assay. And there may be some utility in such metrics. My overall difficulty is that I don't think this model parameterized in such a way is telling us much about splicing and transcription in a way that could be validated or interpreted in other assays. So I see this paper as a bioinformatic exercise lacking in biochemical restraint or biophysical insight.

We strongly disagree that our paper is a “bioinformatic exercise”. As demonstrated at various points throughout the manuscript, we obtain biological insights and can explain structural data. We also hope the revised version will convince this reviewer of the added value and practical necessity of modeling the kinetics of individual phosphodiester bonds, which are biochemical entities in their own right and which are directly measured by the assay. We furthermore hope that we could convince this reviewer of the difficulties of using more sophisticated models because they suffer from poor identifiability of the parameters. A paper should not be rejected by eLife based on notion that it is an “exercise”. Instead, the community should judge over the years to come how useful our data and conclusions are for moving the field forward. We kindly ask the reviewer to reconsider his/her opinion and to hope he/she can acknowledge the high technical quality of our work, the importance of the problem addressed, the advances made beyond the state-of-the-art, and the implications for future work in this interesting field.

Reviewer #3:

[…] Given that the optimization methods used for estimating synthesis/formation and cleavage/degradation of the splice sites/junction is a local algorithm, is 10 random initialization enough to be confident that the global minimum has been reach?

We have used10 random initializations as previously (Eser et al., 2016). To further assess the quality of our data, we now compare simulated counts based on distribution of synthesis, splicing, degradation rate versus ground truth in the Appendix (Appendix—figures 1, 12-13). Based on these results, our model estimates ground truth of kinetic rates with high accuracy compared to the dynamic range of these parameters (Spearman rho >0.99 for all parameters). Moreover, these results empirically show that optimization of the cost function of the first order kinetic model does not suffer from local minima. This is in contrast to the delay model (Appendix—figure 2). We trust this clarifies the reviewers concern.

A key assumption for the model fitting is that 100% of the label is incorporated. The authors should explicitly explore how lower levels of labelling (e.g. 90, 50, 10%) affect their results and conclusions.

A lower level of label incorporation would lead to lower synthesis rate estimates (e.g. a labeling efficiency of 90% would result in a 10% underestimation of the rate) but would not affect estimation of donor and acceptor bond half-lives and of splicing yield. We added a paragraph to the appendix (Appendix subsection “4sU incorporation”) to explain this further.

Some comparison and correlation of synthesis rate estimated has been done separately for different donor sites and different acceptor sites (Figure 2—figure supplement 2B), but as mentioned, these should also be consistent for donor and acceptor of the same gene; however, donor synthesis rate vs acceptor synthesis rate is not shown.

We thank the reviewer to point this out. In the Appendix subsection “Agreement of kinetic parameters donor, acceptor and junction model”, we now show the correlation between donor and acceptor synthesis rate of the same gene. As expected under the assumption that there is no Pol II drop off or transcription termination before the end of the intron, we find that donor and acceptor synthesis rates correlate.

Model fits should be shown for individual genes, highlighting different classes: e.g. single vs multi-intron, highly expressed vs low expressed.

The individual fits are often noisy and hard to read because of the large deviations due to count data. This is in line with our reported accuracies of individual rates, which are relatively large (about 2-fold). Our findings rely on aggregated analyses. Moreover, we did not see a trend between these different classes. We therefore decided to not clutter the manuscript with more individual examples than the one in Figure 2A.

Kinetic language is not precise and causes confusion: For example "Cleavage times" should be referred to as "half cleavage times" or possibly "cleavage half-times". They do not refer to completion times, or characteristic times.

"Cleavage Rates in the range of minutes" is also imprecise. They seem to refer to cleavage half-life times being in in the minute range. More precise would be to report the actual cleavage rate constants, in addition to the cleavage half-life times. ("Rate" = rate constant x abundance).

We have revised the terminology, also in response to reviewer 3. We are now using the terms ‘donor bond half-life’, referring to the half-life of the exon-intron nucleotide bound and ‘acceptor bond half-life’, referring to the half-life of the intron-exon phosphodiester bond. We agree that ‘rate constant’ is more accurate than ‘rate’. We note however that the literature (including the physics literature, e.g. see the usage of decay rate) also often uses the term rate in lieu of rate constant. We thank the reviewer for pointing this out, these changes have improved the language.

The part on the single nucleotide model methods would benefit of a more detailed description. It seems to be a statistical model based on the previously derived cleavage time and consensus splice site sequence but the methods is not very clear to me.

We have re-written this section of the Materials and methods to give more explanations also for readers not versed in statistical modeling (Lasso regression).

[Editors' note: the author responses to the re-review follow.]

Essential revisions:

1) The reviewers remain unconvinced that the measured decay and formation rates of phosphodiester linkages can be directly attributed to individual recognition steps and enzymatic rates. By the time a chemical step causes the loss or gain of a junction read, dozens of events have occurred that could contribute to the rate of that step. For a 5' splice site these include initial recognition by U1, transcriptional elongation/pausing in the downstream intron, branchpoint recognition (which cannot be measured here), myriad spliceosome assembly steps, and finally catalysis. While the effects of splice site mutations agree with the known interaction of the 5' splice site by the U1 snRNP, other moieties will be interacting with these residues, and there is no evidence that the U1 interaction is determinative of the rate of cleavage. It is very simplistic to claim this is so. The rate of exon-intron junction loss will be the aggregate function of many different rates that likely vary from site to site and are not being measured. It should be noted that U1 is not involved in catalytic steps, only initial recognition. To say that "we developed a computational approach that models rates of RNA metabolism at the level of single phosphodiester bonds" is misleading. The authors need to examine their interpretations of their data and be careful not to claim that the rates they are observing can be broken down into steps that are not being measured.

The reviewers are correct. We agree that the rates of single phosphodiester bonds cannot be unambiguously related to individual chemical reactions. We did not mean to suggest so, but obviously there were misunderstandings along these lines. We have therefore carefully revised every sentence related to individual chemical reaction rates so that our claims only relate to the phosphodiester bonds overall kinetics. Also, we changed the sentence “we developed a computational approach that models rates of RNA metabolism at the level of single phosphodiester bonds” to “we developed a computational approach that estimates the metabolic rates of single phosphodiester bonds”.

2) Subsection “New and alternative splice sites”, last paragraph: The authors reported numbers (177,322 and 164,533) of non-split reads mapping to exon-intron boundaries (5' splice sites) sum to the total number of putative introns reported in the paragraph above (341,855, identified by split reads across an exon-exon boundary). This seems wrong, or did they throw out all examples of alternative splicing? They don't report numbers for intron-exon boundaries (3' splice sites), but farther down in the same paragraph they treat the read numbers as if these sites are included. In that later discussion, the authors state that of the split reads, "18% contained a non-annotated donor site, 22% a non-annotated acceptor site and another 38% represented new putative splice sites". What is the difference between a non-annotated donor or acceptor site and a new putative splice site? Furthermore, if aiming to differentiate between annotated and non-annotated sites, wouldn't it make more sense to start with the percentage of split reads mapping to non-annotated regions, where all the sites are likely unannotated, and then discuss new previously unannotated sites in GENCODE?

We were using “split reads” instead of “putative introns” in this part of the text. We have fixed the text and thank the reviewer for spotting this. Also, the 38% correspond to putative introns that map to non-annotated donor sites and acceptor sites. We have made this clear now. Finally, we feel that starting or ending with filtering for unannotated regions does not make a fundamental difference in the way these numbers are presented.

3) The authors indicate that they are fitting the model independently to every donor-acceptor combination that they observe. If so, what is causing the differences in the numbers of donor (162,134), acceptor (177,543), and junction (156,825) bonds that they are able to model? Don't they need sufficient read coverage across all three to confidently estimate the synthesis/cleavage rates on a per-junction level? Or does this include some cases of alternative splice site pairings?

We apologize if we did not explain this clear enough in the text. In order to measure the metabolism of the single bond, we individually model the three bonds (donor, acceptor and junction) for each detected intron. We consider each bond with at least 100 supporting reads across the datasets. Therefore, we do not need to combine their coverage cutoffs. We now stress further that these models are uncoupled with the sentence “we modeled the steady-state rates of synthesis and degradation (or equivalently cleavage) of each of three different phosphodiester bonds individually” in the Results section.

4) The use of simulations to estimate the robustness of the model fitting, across both the first order kinetic model and alternative models is a helpful addition to the paper. It should be highlighted more in the main text how this supports the validity of the method. An assessment should also be provided for how the method performs and where it doesn't work well. For example, the authors indicate that "the robustness of our approach… consistent with high accuracy of the fitting procedure on simulated data." However, all models or experimental systems are underpowered in certain cases. The authors should include a short discussion of where their model falls short (ultra-fast rates, length considerations, etc.) and thus more fully inform the reader about the biological interpretations that can be drawn from the data. Simulations are often the best way to push the model and test these extreme cases.

We thanks the reviewers for the suggestions. We have included a new paragraph that explains the results of these simulations and the new simulations we have conducted in response to this point to explore the extreme cases. We added a new section in the Appendix (subsection “Limits of the model”) in which we investigate the limits of our model. We simulated counts based on the constant rate model given a large range of log-uniform distributed bond synthesis rates (10-5 – 100 1 / cell / min) and bond half-lives (10-3 – 7x10-6 min). Based on this data, we investigated the typical lower bound of reads necessary in all samples to proper model the kinetics of a bond (Appendix—figure 23). We found that a lower cut-off of 100 reads (after the 6th ventile of simulated data) allows us to estimate the kinetics with a typical error below 100%. The cut-off of 100 reads is the one we have applied to the real data. To test the shortest and longest bond half-lives that our experimental approach is able to capture and model, we stratified the results by bond half-live. We then proceeded similarly with the bond synthesis rates. As a result, we found that most of our modelled data lies within a range in which the median relative error of synthesis rates and half-lives is below 100% or 30%, respectively (Appendix—figures 24, 25).

5) In the section entitled "Intron length constrains co-transcriptional splicing times." It is not clear what the authors mean by "co-transcriptional".

As we defined in the Introduction, splicing is “co-transcriptional” when it occurs on newly synthesised RNA still attached to the RNA Pol II machinery. This does not necessarily imply that introns are excised as soon as the acceptor site gets transcribed. Nonetheless, the results described in this section do not depend on the co-transcriptional nature of splicing. We therefore changed the title of the subsection to avoid misunderstandings.

The authors describe how donor bond half-life increases with intron length, presumably indicating that donor bond half-life is dependent on the time to reach the branchpoint and acceptor site. This would imply that the first step of splicing occurs soon after the acceptor site is transcribed and available.

Not necessarily. We interpret these observations provided that the first step of splicing occurs after (but not necessarily soon after) the branch point and the acceptor site are transcribed, which are close to each other – typically less than 40 nt. We do not claim that the first step occurs “soon after the acceptor is transcribed and available”, but rather that the length between donor and branchpoint is a limiting factor for splicing kinetics only in very long introns (>10.000kbs). We have made sure this is properly understood.

If so, then why is the acceptor bond half-life so long? A median acceptor half-life of 4 min indicates that the Polymerase is on average 16kb away by the time the second step of splicing occurs.

This could be true since 16 kb is not excessively long for a human gene (median size 26kb, average size 66kb). Moreover, Pol II is known to pause at exons (Mayer, et al., 2015), and such pausing is in the range of minutes, and hence the polymerase may not have progressed far within 4 min.

Do they propose that the first step is close in time to the second step, with a lag between transcription of the branchpoint and the loss of the donor? Or are the two chemical steps separated by 4 minutes? This seems unlikely in that it implies that spliceosome assembly up to C complex formation is very fast compared to the transition from the first to the second catalytic steps.

As mentioned in the response to the above questions, we are estimating overall rates but cannot untangle the rates of the individual biochemical steps contributing to the overall rates. Except for very long introns, the half-life of the acceptor bond is similar to the half-life of the donor bond (Figure 3B). Hence our data actually suggests (in agreement with the reviewer’s intuition) that spliceosome assembly up to C complex formation is slow compared to the transition from the first to the second catalytic steps. We now provide this interpretation of Figure 3B in the manuscript.

If the authors think that donor loss occurs closer in time to the second step – well after the appearance of the branchpoint – then the definition of "co-transcriptional" that involves immediate splicing upon transcription/availability of the acceptor site doesn't seem to hold. All of this needs better explanation.

We do not wish to provide an extensive discussion of the co-transcriptional nature of splicing because this would require a different type of data that can be related to nascent RNA and transcribing Pol II. We however checked that we do not make statements on the co-transcriptional nature that are definitive and unsupported. At this time, with no additional data in hand, we do not wish to go beyond this, and trust the reviewer understands.

6) It is not clear how the splicing yield calculation can produce a value of > 1, with the median splicing yield reported for mRNAs as 1.2. Does this mean that the measured levels of precursor mRNAs are consistently underestimated? If it's not bounded as described (0-1), the splicing yields are difficult to assess as it seems one or more of the measured values is incorrect.

The yield is computed by independently estimating the synthesis rate of the precursor (using unsplit reads) and of the mature RNA (using split reads) and taking the ratios. Due to estimation errors, these ratios may turn out to be greater than 1. We find that this is acceptable as long as it does not affect our conclusions. We found by simulations that the cumulative read coverage across all samples as well half-life can lead to bias estimates of synthesis rates. However, the higher yield of mRNA versus non-coding RNA was recapitulated when stratifying by these possible confounders. We have now included these analyses (Appendix—figures 26, 27).

https://doi.org/10.7554/eLife.45056.067

Article and author information

Author details

  1. Leonhard Wachutka

    Department of Informatics, Technical University of Munich, Garching, Germany
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing—original draft, Writing—review and editing, Designed and carried out the bioinformatics analysis
    Contributed equally with
    Livia Caizzi
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-5959-040X
  2. Livia Caizzi

    Department of Molecular Biology, Max-Planck-Institute for Biophysical Chemistry, Göttingen, Germany
    Contribution
    Conceptualization, Data curation, Validation, Investigation, Visualization, Methodology, Writing—original draft, Writing—review and editing, Optimized and carried out TT-seq experiments, contributed to the design of the bioinformatics analysis, and used molecular modeling to interpret results
    Contributed equally with
    Leonhard Wachutka
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-9723-6893
  3. Julien Gagneur

    Department of Informatics, Technical University of Munich, Garching, Germany
    Contribution
    Conceptualization, Resources, Software, Supervision, Funding acquisition, Investigation, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing
    For correspondence
    gagneur@in.tum.de
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-8924-8365
  4. Patrick Cramer

    Department of Molecular Biology, Max-Planck-Institute for Biophysical Chemistry, Göttingen, Germany
    Contribution
    Conceptualization, Resources, Supervision, Funding acquisition, Investigation, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing
    For correspondence
    patrick.cramer@mpibpc.mpg.de
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-5454-7755

Funding

European Molecular Biology Organization (ALTF-1261-2014)

  • Livia Caizzi

Horizon 2020 (SOUND 633974)

  • Leonhard Wachutka
  • Julien Gagneur

European Research Council

  • Patrick Cramer

Volkswagen Foundation

  • Patrick Cramer

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Jun Cheng for initial pre-processing of the data and stimulating discussions. We thank Carina Demel for help in developing the rCube package and Carina Demel and Björn Schwalb for support during the pre-processing of the data. We thank Hauke Hillen for help with structural modeling. LC was funded by EMBO Long-Term Postdoctoral Fellowship (ALTF-1261–2014). LW and JG were supported by EU Horizon2020 Collaborative Research Project SOUND (633974). PC was funded by Advanced Grant TRANSREGULON of the European Research Council and the Volkswagen Foundation.

Senior Editor

  1. James L Manley, Columbia University, United States

Reviewing Editor

  1. Douglas L Black, University of California, Los Angeles, United States

Publication history

  1. Received: January 20, 2019
  2. Accepted: April 25, 2019
  3. Accepted Manuscript published: April 26, 2019 (version 1)
  4. Version of Record published: June 4, 2019 (version 2)

Copyright

© 2019, Wachutka et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 1,877
    Page views
  • 361
    Downloads
  • 1
    Citations

Article citation count generated by polling the highest count across the following sources: PubMed Central, Crossref, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

Further reading

    1. Computational and Systems Biology
    Darwin Y Fu, Jacob J Hughey
    Feature Article
    1. Computational and Systems Biology
    2. Plant Biology
    Mary-Ann Blätke, Andrea Bräutigam
    Research Article