A new view of transcriptome complexity and regulation through the lens of local splicing variations

Abstract

Alternative splicing (AS) can critically affect gene function and disease, yet mapping splicing variations remains a challenge. Here, we propose a new approach to define and quantify mRNA splicing in units of local splicing variations (LSVs). LSVs capture previously defined types of alternative splicing as well as more complex transcript variations. Building the first genome wide map of LSVs from twelve mouse tissues, we find complex LSVs constitute over 30% of tissue dependent transcript variations and affect specific protein families. We show the prevalence of complex LSVs is conserved in humans and identify hundreds of LSVs that are specific to brain subregions or altered in Alzheimer's patients. Amongst those are novel isoforms in the Camk2 family and a novel poison exon in Ptbp1, a key splice factor in neurogenesis. We anticipate the approach presented here will advance the ability to relate tissue-specific splice variation to genetic variation, phenotype, and disease.

https://doi.org/10.7554/eLife.11752.001

eLife digest

Genes contain coded instructions to build other molecules that are collectively referred to as gene products. Building these products requires the gene’s instructions to be copied into a molecule of RNA in a process called transcription. Over 90% of human genes undergo a process by which different segments of the transcribed RNA molecule are either removed or retained. This process, termed alternative splicing, results in a single gene encoding different gene products that can perform in different ways.

Alternative splicing can also mean that gene products vary between different cells, tissues and individuals. Some of these variations can be harmful and lead to disease. However, it is difficult with current methods to accurately identify variations in gene products that are due to alternative splicing and see how these products differ between groups of people, such as patients and healthy controls.

Vaquero-Garcia, Barrera, Gazzara et al. have now developed new methods to define, measure and visualize the variations in RNA gene products. First, splicing variations were catalogued across a range of species from lizards to humans, which revealed that some fairly complicated variations were much more common than previously appreciated. These complex variations had not been studied much before, but the new methods showed that they make up a third of the variations in the RNA products copied from human genes.

Vaquero-Garcia, Barrera, Gazzara et al. then showed that the new methods are more accurate and sensitive than previous methods, and can be used to discover splicing variations that were previously unknown. For example, applying the new methods to data collected in other studies revealed variations in genes that are important for brain development and activity. Further analysis then showed that these variations were also altered in brain samples from patients with Alzheimer's disease.

The new methods developed by Vaquero-Garcia, Barrera, Gazzara et al. can now shed new light on gene product variations, especially the more complex ones that have not been studied before. The next challenge is to use these tools to better understand the regulation and purpose of splicing variants and how they can contribute to diseases in humans.

https://doi.org/10.7554/eLife.11752.002

Introduction

Production of distinct mRNA isoforms from the same locus has been shown to be a common phenomenon across metazoans (Barbosa-Morais et al., 2012; Merkin et al., 2012). Different isoforms may arise through the use of alternative transcription start and end sites, or through alternative processing of pre-mRNA. A key process is alternative splicing (AS) of pre-mRNA, where different subsets of pre-mRNA segments are removed while others are joined, or spliced together. The resulting differences between the mature mRNA isoforms can, in turn, encode different protein products, or affect mRNA stability, localization, and translation. Over 95% of human multiexon genes undergo AS, and disease associated genetic variants have been shown to frequently lead to splicing defects (Cooper et al., 2009; Pan et al., 2008; Wang et al., 2008). These observations emphasize the need to accurately map and quantify splice variations.

RNA-Seq technology has advanced the detection and quantitation of splice variants by producing millions of short sequence reads derived from the transcriptome. Despite constant technological advancement, the combination of limited coverage depth, experimental biases, and reads spanning only a small fraction of the variable parts of transcripts has left accurate mapping of transcriptome variations an open challenge (Alamancos et al., 2014).

Transcriptome variations have been traditionally studied either at the level of full gene isoforms or through the specification of alternative splicing 'events'. The latter have been categorized into several common types, such as intron retention, alternative 3’/5’ splice sites, or cassette exons. Importantly, while exact isoforms and their quantifications cannot be directly inferred from the short RNA-Seq reads, AS events can be detected via reads that span across spliced exons (junction reads). Both AS events and full isoforms can be captured by a gene schematic or a splice graph (Heber et al., 2002), where edges (lines) connect pre-mRNA segments spliced together in different transcripts (Figure 1A, top).

Figure 1

LSV formulation and prevalence.

(A) LSVs can be represented as splice graph splits from a single source exon (yellow) or into a single target exon (pink). LSV formulation captures previously defined, 'classical', binary alternative splicing cases (top) as well as other variations (bottom). An asterisk denotes complex variations involving more than two alternative junctions; a dashed line denotes redundant LSVs that are a subset of other LSVs (see Materials and methods). (B) Example of a complex LSV in the *Camk2g* gene. The gene’s splice graph (top) includes known splice junctions from annotated transcripts (red) and novel junctions (green) detected from RNA-Seq data. The splice graph includes a complex LSV involving exons 14–17 (middle). RT-PCR validation of the LSV in brainstem, cerebellum, hypothalamus, muscle, and adrenal is shown at the bottom. Several isoforms are preferentially included in brain and muscle.

https://doi.org/10.7554/eLife.11752.003

While useful, the previously defined AS types fail to capture the full complexity of spliceosome decisions. Specifically, AS types represent spliceosome decisions as strictly binary, involving only two exons or two splice sites in the same exon. The bottom panel in Figure 1A illustrates a few possible splicing variations that do not fit the previously defined AS types and can involve more than two alternative junctions.

Figure 1B serves as a visual summary of both the potential and the challenges in analyzing splicing variations. Combining known transcripts and RNA-Seq data results in the Camk2g splice graph shown (Figure 1B, top). This splice graph includes novel, unannotated splice junctions detected from junction spanning RNA-Seq reads (green), as well as a complex case where exon 14 can be spliced to exons 15, 16, or 17 (Figure 1B, middle). Quantification by RT-PCR in several mouse tissues validates the existence of these variations and also points to isoforms that are predominantly produced in brain subregions and in muscle (Figure 1B, bottom). Achieving such results requires a computational framework that efficiently combines RNA-Seq with existing gene annotation and enables accurate detection, quantification, and visualization of diverse splicing variations across different experimental conditions.

Results

Formulation of local splicing variations (LSVs)

To address the shortcomings of previously defined AS types we suggest the formulation of local splicing variations, or LSVs. LSVs are defined and easily visualized as splits (multiple edges) in a splice graph where several edges either come into or go out from a single exon, termed the reference exon. A single source (SS) LSV (Figure 1, yellow) corresponds to a reference exon spliced to several downstream RNA segments, while a single target (ST) LSV (Figure 1, pink) corresponds to a reference exon spliced to several upstream segments. The full specification of an LSV also includes the relative location of the exons and junctions (see Materials and methods). Figure 1A illustrates how this formulation captures previously defined AS types (top panel) as well as more complex cases (bottom panel). Specifically, previously defined 'classical' AS events appear as special cases of binary graph splits (e.g., include or skip a cassette exon), while LSVs also capture non-classical binary splits and splits involving more than two junctions. Such non-binary splits are termed complex LSVs. LSVs can also involve intron retention (intronic LSVs) or be comprised of only exons (exonic LSVs). Moreover, the transcriptome variability captured by LSVs may be the result of not only spliceosome decisions but also of alternative transcription start or end positions. For example, the gene in the bottom panel of Figure 1A involves two alternative first exons, so a relative change in transcription start site usage can result in changes in the quantification of downstream LSVs. Importantly, LSV formulation allows the probing of transcriptome structure and complexity yet, unlike full transcripts, LSVs can still be quantified directly from junction spanning reads.
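The definition above can be illustrated with a minimal sketch that scans a splice graph for splits. It is not MAJIQ's implementation: the function name `find_lsvs`, the exon labels, and the toy edge list are hypothetical, junctions are reduced to (source exon, target exon) pairs, and alternative splice sites within an exon, intron retention, and the redundancy filter are not modelled.

```python
from collections import defaultdict

def find_lsvs(junctions):
    """Find single source (SS) and single target (ST) splits in a splice graph.

    junctions: iterable of (source_exon, target_exon) pairs, one per edge.
    Returns a list of (kind, reference_exon, connected_exons, is_complex).
    """
    out_edges = defaultdict(set)  # exon -> downstream exons it is spliced to
    in_edges = defaultdict(set)   # exon -> upstream exons spliced into it
    for src, dst in junctions:
        out_edges[src].add(dst)
        in_edges[dst].add(src)

    lsvs = []
    for kind, edges in (("SS", out_edges), ("ST", in_edges)):
        for ref, others in sorted(edges.items()):
            if len(others) >= 2:  # a split needs at least two alternative junctions
                is_complex = len(others) > 2  # more than two junctions: complex LSV
                lsvs.append((kind, ref, tuple(sorted(others)), is_complex))
    return lsvs

# Hypothetical toy graph loosely modelled on the exon 14-17 region of Figure 1B.
toy_edges = [("e14", "e15"), ("e14", "e16"), ("e14", "e17"),
             ("e15", "e16"), ("e16", "e17")]
for lsv in find_lsvs(toy_edges):
    print(lsv)
```

In this toy graph the split out of exon 14 is reported as a complex single source LSV, while the splits into exons 16 and 17 are binary single target LSVs.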

LSV detection, quantification and visualization using MAJIQ

In order to address the challenges involved in detection, quantification and visualization of LSVs we developed a new computational framework that we have termed Modeling Alternative Junction Inclusion Quantification (MAJIQ). MAJIQ’s first step (Figure 2A, top) is to parse a known database of transcripts, given as a GFF3 annotation file, along with a set of mapped and aligned RNA-Seq experiments (indexed BAM files). Unlike many methods that only analyze known isoforms, MAJIQ supplements known transcripts with 'reliable' edges derived from de novo junction spanning reads. Several filters can be applied to define which edges are considered reliable and which LSVs have enough reads to be later quantified (see Materials and methods). Similarly, LSVs whose edges are a subset of other LSVs, such as those denoted with dashed rectangles in Figure 1A, are removed to avoid redundancy (see Materials and methods). Next, MAJIQ can be executed to quantify LSVs either in a specific condition or to compare two experimental conditions, with or without replicates. LSV quantification in a specific condition is based on the marginal percent selected index (PSI, denoted Ψ) for each junction involved in the LSV, while comparison of experimental conditions is based on relative changes in PSI (dPSI, ΔΨ). MAJIQ uses a combination of read rate modeling, Bayesian Ψ modeling, and bootstrapping to report posterior Ψ and ΔΨ distributions for each quantified LSV. The results of MAJIQ’s LSV detection and quantification can be interactively visualized with the package VOILA in a standard web browser (Figure 2A, bottom).
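To make Ψ and ΔΨ concrete, the following is a deliberately simplified sketch and not MAJIQ's model, which combines read rate modeling, Bayesian Ψ modeling, and bootstrapping. Here Ψ for a junction is taken as its share of the junction spanning reads in the LSV, and P(|ΔΨ| > threshold) is estimated by Monte Carlo sampling from independent Beta posteriors. The function names, the Beta(1, 1) prior, and the read counts are all assumptions made for illustration.

```python
import random

def naive_psi(junction_reads):
    """Fraction of an LSV's junction spanning reads that support each junction."""
    total = sum(junction_reads)
    if total == 0:
        raise ValueError("LSV has no junction spanning reads")
    return [reads / total for reads in junction_reads]

def prob_dpsi_exceeds(reads_a, reads_b, junction, threshold=0.2,
                      draws=20000, seed=0):
    """Monte Carlo estimate of P(|dPSI| > threshold) for one junction.

    Assumes a Beta(1, 1) prior on the marginal PSI in each condition and
    treats the two conditions as independent (illustrative assumptions).
    """
    rng = random.Random(seed)
    a_in, a_out = reads_a[junction], sum(reads_a) - reads_a[junction]
    b_in, b_out = reads_b[junction], sum(reads_b) - reads_b[junction]
    hits = 0
    for _ in range(draws):
        psi_a = rng.betavariate(a_in + 1, a_out + 1)
        psi_b = rng.betavariate(b_in + 1, b_out + 1)
        if abs(psi_a - psi_b) > threshold:
            hits += 1
    return hits / draws

# Hypothetical read counts for a three-junction LSV in two conditions.
condition_a = [80, 15, 5]
condition_b = [20, 60, 20]
print(naive_psi(condition_a))
print(prob_dpsi_exceeds(condition_a, condition_b, junction=0))
```

Under this sketch, a junction would be called differentially included when the estimated probability exceeds a chosen confidence level, mirroring the P(|ΔΨ| > 0.2) > 0.95 criterion used in the text.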

Figure 2 (with 2 supplements)

LSV analysis using MAJIQ.

(A) MAJIQ’s analysis pipeline. RNA-Seq reads are combined with an annotated transcriptome to create splice graphs and detect LSVs for each gene, then LSVs are quantified and compared between conditions. The visual output (VOILA) lists LSVs with violin plots representing estimates of percent inclusion index (PSI, Ψ) or changes in inclusion (dPSI, ΔΨ). Two cases are illustrated, for a single source three way LSV (orange), and a single target two way LSV (pink). (B) Correspondence between E[Ψ] by MAJIQ and Ψ by RT-PCR. R is the correlation coefficient. Colors and shapes represent different experimental conditions: mouse cerebellum and liver (dark and light orange diamonds, respectively); human unstimulated and stimulated T-Cells (dark and light purple dots, respectively). Total n = 208. (C) Correspondence between E[ΔΨ] by MAJIQ and ΔΨ by RT-PCR, where |ΔΨ^RT| > 0.2. R is the correlation coefficient. Changes in inclusion were measured between liver and cerebellum mouse tissues (diamonds, n = 45); stimulated and unstimulated T-Cells (dots, n = 9). (D) Reproducibility ratio (RR) of high confidence differentially included LSVs, *i.e.* LSVs for which P(|ΔΨ| > 0.2) > 0.95, when comparing RNA-Seq from two conditions. A differentially included LSV is considered replicated if it maintains a rank at least as high as N in biological replicates, where N is the set size. LSVs are ranked by E[ΔΨ] and filtered for overlap. Twelve replicate pairs from Keane et al. (2011) were used to compute the histogram’s std (light blue). Other lines show MAJIQ’s RR with replicates (thick blue), RR for AS events detected by rMATS w/wo replicates (light and dark green), MISO (red), and RR for LSVs using Naïve Bootstrapping (orange). The inset bar chart shows the number of LSVs or AS events (N) derived by each method and used in the RR plots (see Materials and methods for more details).

https://doi.org/10.7554/eLife.11752.004

We assessed MAJIQ’s quantification accuracy for both Ψ and ΔΨ using a combination of RNA-Seq from biological replicates and an extensive set of 208 RT-PCR validations. These experiments included two mouse tissues (cerebellum and liver [Zhang et al., 2014]), and a human Jurkat T cell line (unstimulated and stimulated, [Cole et al., 2015]). While accuracy depended on the dataset used, MAJIQ achieved an overall correlation with RT-PCR of R = 0.8 and R = 0.95 for PSI and dPSI quantification, respectively, comparing favorably to alternative methods on all datasets (Figure 2B,C, Figure 2—figure supplement 1). Next, we used biological replicates from the Mouse Genome Project (Keane et al., 2011) to assess reproducibility of differential splicing detection from RNA-Seq when comparing two experimental conditions. The reproducibility ratio (RR, see Materials and methods) captures the fraction of top ranked differentially spliced LSVs that maintain their top ranking when analyzing another set of replicate experiments. Figure 2D shows MAJIQ compares favorably to other methods, including MISO (Katz et al., 2010), rMATS (Shen et al., 2014), and a bootstrapping approach (Xiong et al., 2015) adapted for LSVs. While MISO and rMATS achieved a reproducibility ratio of 61–67%, we found the bootstrapping approach (N.B.) suffered from particularly high variance, which degraded the reproducibility of LSV ranking. In comparison, MAJIQ achieved a mean RR = 77% when comparing two pairs of experiments, improving to RR = 86% when the experiments compared had replicates. Notably, detection power was also improved. Defining differentially spliced LSVs as those for which P(|ΔΨ| > 0.2) > 0.95, the number of detected LSVs (N), after removing overlapping LSVs (see Materials and methods), was on average 400 for pairwise and 447 for group comparisons, compared to 240 and 260 respectively by rMATS.
The improvement in both detection and reproducibility of differentially spliced LSVs (N, RR) was robust to the statistical threshold used to define N (Figure 2—figure supplement 2A). When we removed MAJIQ’s de-novo junction detection, the number of LSVs dropped as expected but reproducibility remained high (N = 337, RR = 87%, data not shown). Importantly, this result also indicated that including de-novo junctions increased the number of differentially spliced LSVs that could be detected by over 30% (337 vs. 447), while retaining equivalent reproducibility. Defining differential splicing reproducibility by RT-PCR as LSVs for which |ΔΨ^RT| > 20% resulted in 95% reproducibility. The higher reproducibility by RT-PCR can be expected given the lower experimental variability compared to RNA-Seq. Notably, the LSVs tested by RT-PCR were selected to cover a wide spectrum of read depth. We found that while higher coverage allowed more differential LSVs to be detected and steadily increased reproducibility by RNA-Seq, MAJIQ’s reproducibility by RT-PCR was stable across read coverage depth, pointing to the robustness of the method (Figure 2—figure supplement 2B). Finally, we note that the above RT-PCR evaluation concentrated on binary LSVs to allow comparison to currently available methods, but we observed similar accuracy for the quantification of complex LSVs (Figure 2—figure supplement 2C).
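The reproducibility ratio described above can be sketched as follows. This is one reading of the description, not the authors' code: LSVs are ranked here by |E[ΔΨ]| (an assumption), the top N from one comparison are checked against the top N from a replicate comparison, and the overlap filtering step is omitted. The function names and the example values are hypothetical.

```python
def top_n(expected_dpsi, n):
    """Return the ids of the n LSVs with the largest |E[dPSI]|."""
    ranked = sorted(expected_dpsi, key=lambda lsv: abs(expected_dpsi[lsv]),
                    reverse=True)
    return set(ranked[:n])

def reproducibility_ratio(dpsi_first, dpsi_second, n):
    """Fraction of the top-n LSVs in one comparison that are also top-n
    in a second comparison built from replicate experiments."""
    if n <= 0:
        raise ValueError("n must be positive")
    return len(top_n(dpsi_first, n) & top_n(dpsi_second, n)) / n

# Hypothetical E[dPSI] values for four LSVs in two replicate comparisons.
first = {"lsv_a": 0.60, "lsv_b": -0.50, "lsv_c": 0.30, "lsv_d": 0.05}
second = {"lsv_a": 0.55, "lsv_b": -0.10, "lsv_c": 0.40, "lsv_d": 0.02}
print(reproducibility_ratio(first, second, n=2))  # lsv_a replicates, lsv_b does not
```

In the paper N is the number of LSVs passing the P(|ΔΨ| > 0.2) > 0.95 criterion, so in practice `n` would come from the first comparison rather than being fixed by hand.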

Complex LSVs are prevalent in diverse metazoa

To assess the significance of the LSV formulation we estimated LSV prevalence in several metazoans, ranging from lizard to human (Figure 3). Naturally, this analysis is affected by how well a species' transcriptome is annotated, and how permissive the database used is. In human, for example, complex LSVs constitute 20.6% to 33.7% of the LSVs in annotated transcripts by RefSeq and Ensembl respectively, but only 1.86% in opossum’s Ensembl annotation (Figure 3A,B). Next, we expanded the set of annotated transcripts with novel junctions detected from RNA-Seq junction spanning reads. Limiting our analysis to only 5–6 similar tissues in all species and conservative junction detection still increased the total number of LSVs in human by 11% and the fraction of complex LSVs from 33.7% to 37.1% (Figure 3A). In species that are not as well annotated the effect of adding RNA-Seq data was more dramatic: in opossum, for example, the number of LSVs jumped from 1,610 to 10,228, of which 10% were complex. In summary, while LSV analysis across species was confounded by read coverage and transcriptome annotation, we find that non-classical and complex LSVs make up a substantial fraction of observed transcriptome variations. Such complex LSVs are likely to be removed, undetected, or mislabeled by algorithms that only quantify binary AS events from previously annotated transcripts.

Figure 3

LSV prevalence across diverse metazoans.

(A) Number of LSVs (top) and fraction of complex LSVs (bottom) when using Ensembl annotated transcripts only (grey) or combining it with RNA-Seq from 5–6 similar tissues (red). Mouse* is the dataset from Zhang et al. (2014). (B) Number of LSVs (top) and fraction of complex LSVs (bottom) when using RefSeq (orange) and Ensembl (blue). The RNA-Seq dataset is the same as in (A).

https://doi.org/10.7554/eLife.11752.007

A genome wide view of LSV across 12 mouse tissues

Given the clear impact of the RNA-Seq dataset and transcriptome annotation, we chose to focus our genome wide analysis on a recent mouse dataset. This allowed us to analyze 12 tissues with an average of over 120M reads per sample, produced by a single lab (Zhang et al., 2014). This data included three brain subregions, eight samples per tissue, and matching RNA for RT-PCR validations, leading to a total of 100,512 LSVs detected. First, we used this data to assess the usage of LSVs across tissues. In order to minimize LSVs that result from false junctions identified by the mapper we only included junctions with multiple uniquely mapped staggered reads across multiple biological replicates (see Materials and methods). Next, we tested the maximal inclusion level of the second, third, or the least used junction in an LSV across twelve mouse tissues. We detected a switch behavior where a different junction becomes dominant at 50% inclusion or more in approximately 5% of the classical binary LSVs (Figure 4A, grey), compared to 12% for the second most used junction in complex LSVs (Figure 4A, light green). Setting a conservative threshold of Ψ > 10% to denote splice junctions that are less likely to be splicing noise or database errors, we find that approximately 32% of the classical binary LSVs (9,516) pass that threshold, compared to 57% and 19% of the complex LSVs for the second and third most used junction, respectively. These correspond to a total of 6,338 and 2,112 LSVs in our datasets, pointing to the importance of complex LSVs in transcriptome analysis. Even when testing for the least used junction in complex LSVs (e.g. the ninth in a nine junction LSV), we still find almost 10% pass the 10% inclusion threshold (Figure 4A, dark green).
Finally, for intronic LSVs we find almost 11,000 cases where an intron is retained at least 50% in one tissue, and 3,844 cases where the intron is almost always retained with Ψ > 99% (Figure 4—figure supplement 1D). This observation of widespread intron retention (IR), especially in brain tissues, is in line with a recent study across many more tissues and cell lines (Braunschweig et al., 2014), though our overall estimate of IR prevalence is more conservative.
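The per-LSV statistic behind the inclusion thresholds above can be sketched as follows. The text does not spell out how junctions are ordered, so this sketch assumes junctions are ranked by their maximal Ψ across tissues. The function name, tissue labels, and Ψ values are hypothetical.

```python
def kth_junction_max_psi(psi_by_tissue, k):
    """Maximal PSI across tissues of the k-th most used junction (k is 1-based).

    psi_by_tissue: dict mapping tissue name to a list of PSI values, one per
    junction of a single LSV, in the same junction order for every tissue.
    Junctions are ranked by their maximal PSI across tissues (an assumption).
    """
    profiles = list(psi_by_tissue.values())
    n_junctions = len(profiles[0])
    if not 1 <= k <= n_junctions:
        raise ValueError("k must be between 1 and the number of junctions")
    max_per_junction = [max(profile[j] for profile in profiles)
                        for j in range(n_junctions)]
    return sorted(max_per_junction, reverse=True)[k - 1]

# Hypothetical three-junction LSV quantified in two tissues.
lsv_psi = {"cerebellum": [0.7, 0.2, 0.1], "liver": [0.3, 0.6, 0.1]}
second = kth_junction_max_psi(lsv_psi, 2)
third = kth_junction_max_psi(lsv_psi, 3)
print(second >= 0.5)  # switch-like: a second junction dominates in some tissue
print(third > 0.10)   # does the third junction pass the 10% inclusion threshold?
```

Applying such a function to every LSV and plotting the cumulative distribution of the results would give curves analogous to those in Figure 4A.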

Figure 4 (with 1 supplement)

Genome wide view of exonic LSVs across twelve mouse tissues.

(A) Cumulative distribution (CDF) for maximal junction inclusion (PSI) across tissues. Plot includes the least used junction in binary LSVs (grey), and the second, third and least used junction in complex LSVs (light, medium, dark green). Dashed vertical line denotes 10% inclusion. (B) Histogram of the most common exonic LSV types. (C) Histogram of the number of exons, junctions, 3’ and 5’ splice sites in all identified LSVs. (D) Histogram of which 3’ (left) or 5’ (right) splice sites are found to be dominant across all tissues and all LSVs. X-axis denotes the order of the splice site. Dominance is defined as E[Ψ] > 0.6. Cases with no dominant junction are represented by the bars on the far left. (E) The fraction of complex LSVs (green, top right) from the total number (purple, bottom left) of differentially spliced LSVs (|E[ΔΨ]| > 0.2) between pairs of tissues.

https://doi.org/10.7554/eLife.11752.008

Figure 4—source data 1. dPSI values for all pairs of tissues. https://doi.org/10.7554/eLife.11752.009

Commonly occurring network substructures, or network motifs, have garnered much research attention in diverse fields (Milo et al., 2002). Gene splice graphs can also be thought of as networks with exons as nodes and spliced junctions as edges. In this interpretation, LSVs can be thought of as small network motifs and used to shed light on transcriptome complexity and commonly recurring sub-structures. Comparing the frequency of exonic LSV types (Figure 4B) we find that the more common non-classical LSVs involve 3 to 5 exons, combine exon skipping with an alternative 3’/5’ splice site, or involve alternative transcript start/end at the LSV’s reference exon. In contrast, intronic LSVs are much less diverse, with classical intron retention making up 68% of the cases (Figure 4—figure supplement 1C). Figure 4C shows that 14% of exonic LSVs involve more than 2 exons, and that 30% of the single source and 20% of the single target LSVs involve a reference exon with two or more 5’/3’ splice sites, respectively. Overall, complex (non-binary) LSVs comprise 36.2% of the transcriptome variations detected in the data and 27.5% of the variations deemed quantifiable (see Materials and methods), yet spliceosome decisions still appear localized, with few LSVs involving more than 6 exons or junctions. When analyzing LSV usage, we found that the biochemical 'proximity rule', by which the splice site nearest to the reference exon is preferred (Reed and Maniatis, 1986), is commonly not reflected at the genomic level. Defining 'dominant' junctions as those included at least 60%, we found proximal junctions appear dominant in approximately two thirds of the cases involving binary LSVs (Figure 4D), while more complex LSVs tend to have more evenly distributed inclusion levels with no dominant junction (Figure 4D, left bars). This more evenly distributed usage of exons and junctions in complex LSVs further supports possible functionality of multiple isoforms.

Figure 4E gives a genome wide view of the exonic LSVs that exhibit significant splicing changes (|E[ΔΨ]| > 20%) between mouse tissues. In line with previous reports (Barash et al., 2010; Barbosa-Morais et al., 2012), we find clear clusters for brain and muscle tissues (average of 875 and 657 changing LSVs, respectively), a weaker cluster for digestive tissues (liver, kidney) with an average of 501 changing LSVs, and lung as a unique signal (549 changing LSVs). Brain regions have a higher average of 927 (cerebellum) to 840 (brainstem) changing LSVs compared to non-brain tissues. The number of LSVs changing between brain subregions varies between 36% and 57% of those changing between CNS and non-CNS tissues, with hypothalamus standing out as more similar to the two other CNS tissues (average of 937 and 343 changing LSVs when compared to non-brain tissues and other brain subregions, respectively). Overall, we find that complex LSVs make up almost 47% of the differentially spliced LSVs, a fold enrichment of 1.7 compared to their relative proportion of 27.5% in the quantifiable set (P < 2.3 × 10^-278, binomial test).
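The enrichment arithmetic above can be reproduced with a short sketch: the fold enrichment is the ratio of the two proportions (0.47 / 0.275 ≈ 1.7), and the significance is an upper-tail binomial probability. Because such tails can be far smaller than the smallest representable float, the sketch works in log space. The counts below (470 complex out of 1,000 differentially spliced LSVs) are hypothetical, since the underlying totals are not given here, and the choice of a one-sided test is an assumption.

```python
import math

def log_binom_pmf(k, n, p):
    """Natural log of the Binomial(n, p) probability mass at k, for 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))

def log10_binom_upper_tail(k, n, p):
    """log10 of P(X >= k) for X ~ Binomial(n, p), computed with log-sum-exp."""
    logs = [log_binom_pmf(i, n, p) for i in range(k, n + 1)]
    peak = max(logs)
    total = peak + math.log(sum(math.exp(value - peak) for value in logs))
    return total / math.log(10)

# Hypothetical counts chosen to match the reported proportions.
n_changing = 1000        # differentially spliced LSVs (assumed)
n_complex = 470          # complex LSVs among them (assumed)
background = 0.275       # fraction of complex LSVs in the quantifiable set

fold_enrichment = (n_complex / n_changing) / background
print(round(fold_enrichment, 2))
print(log10_binom_upper_tail(n_complex, n_changing, background))
```

The same function can be reused per dataset in a meta analysis, with a multiple-testing correction applied to the resulting p-values.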

Complex LSVs are enriched in regulated splicing that is associated with higher intronic conservation and specific protein features

Given the above result of complex LSV enrichment in tissue dependent splicing variations we decided to test whether this enrichment holds in other datasets that involve developmental stages, splice factor knockdowns, and disease. We performed a meta analysis of 31 mouse datasets that involve a total of 243 RNA-Seq experiments covering a variety of tissues, cell lines, developmental stages, and knockdowns of key splicing factors. To this set we also added a human dataset comparing Alzheimer’s disease and healthy brain samples (Figure 5A and below). We found the median fraction of complex LSVs in these datasets was 0.309 and their median fold enrichment in differentially spliced LSVs was 1.63, a significant enrichment in 30/32 of the datasets (1.6 × 10^-322 < p-value < 1 × 10^-3, Bonferroni corrected binomial test, see Figure 5A and Figure 5—source data 1). This consistent overrepresentation of complex LSVs among differentially spliced LSVs across a variety of contexts further suggests that complex LSVs are an important aspect of regulated alternative splicing.

Figure 5 (with 1 supplement)

Meta analysis of complex LSVs.

(A) Fold enrichment (green dots) of complex LSVs calculated by comparing the fraction of complex LSVs among differentially spliced LSVs (dark blue bars) to their relative proportion (light blue bars) in 32 datasets. The corrected p-value column on the left measures significance of the fold enrichment (binomial test, Bonferroni corrected p-value). Medians are displayed for fold enrichment (green line, 1.63), fraction of complex LSVs among changing LSVs (orange line, 0.52), and fraction of complex LSVs among all detected LSVs (red line, 0.31). Human AD versus healthy brain data corresponds to the cohort from Bai et al. (2013). See Figure 5—source data 1 for more information. (B) Empirical cumulative distribution function (CDF) of the maximal change of junction inclusion (ΔΨ) across all mouse datasets in Figure 5A. Only the LSVs detected in the twelve mouse tissues (Figure 4) are included. The plot includes junctions in binary LSVs (grey), and the second, third, and least changing junction in complex LSVs (light, medium, dark green). Dashed vertical line denotes ΔΨ of 10%. (C) Per nucleotide average conservation score (phastCons60 track) in regions proximal to single source (top) and single target (bottom) LSVs that were differentially spliced between any pair of tissues shown in Figure 4. The average is plotted for the subsets of complex (green) LSVs and binary (grey) LSVs as well as around a randomly selected set of constitutively spliced junctions (red, see Materials and methods for details).

https://doi.org/10.7554/eLife.11752.011

Figure 5—source data 1. LSV enrichment meta analysis table. https://doi.org/10.7554/eLife.11752.012

Next, we asked how the inclusion of junctions changes across these datasets. For this, we took a conservative approach, monitoring only the LSVs that had already been identified in the normal tissues used to build the genome wide view of LSVs (Figure 4). Figure 5B shows that for over 20% of all complex LSVs detected in more than one sample the third most differentially included junction exhibited |ΔΨ| > 10%, corresponding to 2,236 LSVs. Strikingly, these additional experimental contexts showed that over 39% of all complex LSVs detected in our normal tissue set had their third most included junction with Ψ > 10%, corresponding to 4,201 LSVs (Figure 5—figure supplement 1).

Finally, we plotted the conservation level around constitutive exons and around the differentially spliced LSVs shown in Figure 4, split into binary and complex (Figure 5C). In line with previous reports, we found tissue regulated splicing involves significantly higher conservation in the intron proximal to the variable exonic segments, a region known to include cis elements to which tissue specific splice factors bind. However, we also found that differentially spliced complex LSVs exhibited significantly higher conservation levels in these regions compared to their binary counterparts. This finding may be the result of the more complex splicing changes that need to be controlled, or of tighter control associated with functions specific to complex LSVs. In summary, these different lines of evidence all support the functional relevance and utility of accurately mapping and quantifying complex splicing variations in genome wide studies.

The observed evolutionary pressure to conserve intronic segments around tissue dependent LSVs raises the questions of what the functional consequences of LSVs are and whether complex LSVs are functionally distinct from classical binary ones. To probe possible function we mapped exons in LSVs to their matching protein domains (see Materials and methods). We then grouped LSV junctions based on whether they were part of binary or complex LSVs and whether they were differentially included across tissues. In line with previous works (Ellis et al., 2012), we find that binary LSVs, such as cassette exons, which are also differentially included across tissues, more frequently affect low-complexity, disordered regions when compared to non-changing binary LSVs (p < 1 × 10^-4, corrected Fisher’s exact test). Interestingly, differentially included complex LSVs are similarly enriched for such low-complexity regions (p < 1 × 10^-4), but also show enrichment for specific protein families (e.g. spectrin/filamin) and domains (e.g. RNA recognition motifs) when compared to non-changing complex LSVs. These families and domains are largely distinct from those enriched in binary LSVs (e.g. WW domains or coiled coils). The complete list of enriched protein features can be found in Supplementary file 1. Overall, this analysis suggests that regulated alternative splicing of both binary and complex LSVs can affect protein interactions via unstructured protein regions, or affect the inclusion of distinct protein domains in specific families.
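A feature enrichment test of the kind described above can be sketched with a one-sided Fisher's exact test on a 2×2 table (changing vs. non-changing LSVs, with vs. without a protein feature), followed by a Bonferroni correction. Whether the authors used a one-sided or two-sided test, and which correction they applied to this analysis, is not stated here, so both choices are assumptions, as are the function names and the counts.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null, where a counts
    changing LSVs that overlap the feature.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    favourable = sum(comb(row1, x) * comb(total - row1, col1 - x)
                     for x in range(a, upper + 1))
    return favourable / comb(total, col1)

def bonferroni(p_value, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, p_value * n_tests)

# Hypothetical counts: rows are changing / non-changing LSVs,
# columns are with / without a given protein feature.
p = fisher_exact_greater(30, 70, 40, 360)
print(p, bonferroni(p, n_tests=200))
```

Exact integer arithmetic with `math.comb` keeps the tail sum free of rounding error until the final division, which is convenient for the small tables typical of per-feature tests.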

MAJIQ detects a novel, brain-specific, PTC-introducing, developmentally-regulated exon in Ptbp1

To further demonstrate the power of MAJIQ and our LSV based approach we validated a set of complex LSVs that exhibit tissue and brain region dependent splicing patterns. Surprisingly, this analysis revealed a previously uncharacterized, brain-specific exon in the gene encoding PTBP1, an extremely well studied splicing factor critical to neural development (Keppetipola et al., 2012) (Figure 6A, Figure 6—figure supplement 1A). While this novel exon remained undetected when running Cufflinks (Trapnell et al., 2010) on this dataset (data not shown), expression of this novel exon as part of a complex LSV was supported by RT-PCR from cerebellum and adrenal tissues (Figure 6B, top) with good concordance with MAJIQ’s PSI quantification (Figure 6B, bottom). Products including exon 14 were also weakly detected by RT-PCR of brainstem and hypothalamus-derived RNA, but not from any of the other eight tissues tested (Figure 6—figure supplement 2). Together these data strongly support exon 14 as brain-specific.

Figure 6 with 2 supplements

Identification of a novel, brain-specific, PTC-introducing, developmentally-regulated exon in *Ptbp1*.

(A) Top: Splice graph representation of a complex target LSV containing a previously unannotated, PTC-introducing exon in *Ptbp1* (exon 14, green). Stop signs indicate multiple conserved premature termination codons. Bottom: UCSC Genome Browser tracks of RNA-seq reads from adrenal (red) and cerebellum (blue), and conserved Rbfox binding sites ([U]GCAUG) found within the bounds of this LSV. (B) Top panel: RT-PCR validation of RNA from replicate cerebellar and adrenal tissues with isoforms illustrated on the left. Asterisk denotes a background band that migrates non-specifically. Bottom panel: E[Ψ] violin plots of MAJIQ quantification for the colored junctions in (A). Matching isoforms are indicated on the left. (C) Top: RNA-seq reads from mouse cortices (Yan et al., 2015). Developmental time points indicated on the right with exons colored as in (A). Bottom: Ψ violin plots for the PTC-introducing exon 14 across brain development. (D) Top panel: Top regulatory motifs predicted by AVISPA to influence the neuronal-specific splicing of exon 14. Stacked bars represent the normalized feature effect (NFE) for each motif. Colors indicate the contribution of the corresponding motif in the region indicated in the inset. (E) MAJIQ Ψ quantification of the LSV shown in (A), using RNA-seq from one month old wild type whole brain (left) and nestin-specific *Rbfox1* KO littermates (right).

https://doi.org/10.7554/eLife.11752.014

Interestingly, Ptbp1 exon 14 shows conservation of splice sites between mouse and human and inserts multiple premature termination codons (PTCs) in both species, as well as in other mammals, before RRMs 3–4 of PTBP1 (Figure 6—figure supplement 1A,B), suggesting that mRNAs including this exon are likely targets of nonsense-mediated decay (NMD). Regulated alternative splicing that introduces PTCs is a common theme among numerous splicing factors (Ni et al., 2007), and exclusion of Ptbp1 exon 16 (exon 11 in the literature) has already been identified and shown to induce NMD (Figure 6—figure supplement 1A) (Wollerton et al., 2004). Remarkably, exclusion of exon 16 is barely detectable in the brain regions examined and inclusion of exon 14 is not associated with this event (Figure 6—figure supplement 1C). Together, this suggests that these splicing events are independent mechanisms to control Ptbp1 expression and that inclusion of novel exon 14 plays a larger role in the brain regions examined, with 26% of the Ptbp1 transcripts in the cerebellum containing PTCs.
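To make the notion of a PTC-introducing exon concrete, the sketch below checks whether including an exon places an in-frame stop codon ahead of the normal stop. It is an illustration only: the function names and the toy sequences are hypothetical, the upstream segment is assumed to begin at the start codon, and the downstream segment is assumed to end with the annotated stop codon. It is not the procedure used in the paper.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_in_frame_stop(cds: str):
    """Return the 0-based index of the first in-frame stop codon, or None."""
    cds = cds.upper().replace("U", "T")
    for i in range(0, len(cds) - 2, 3):
        if cds[i:i + 3] in STOP_CODONS:
            return i
    return None


def introduces_ptc(upstream: str, exon: str, downstream: str) -> bool:
    """True if the spliced sequence has a stop codon before its last codon.

    Assumes `upstream` starts at the start codon (frame 0) and that
    `downstream` ends with the annotated stop codon.
    """
    spliced = upstream + exon + downstream
    stop = first_in_frame_stop(spliced)
    normal_stop = len(spliced) - 3
    return stop is not None and stop < normal_stop


# Toy sequences, for illustration only.
upstream, exon, downstream = "ATGGCC", "TAAGGG", "GGATGA"
print(introduces_ptc(upstream, "", downstream))    # exon skipped
print(introduces_ptc(upstream, exon, downstream))  # exon included
```

In this toy case the skipped isoform ends at its normal stop codon, whereas the included exon contributes an earlier in-frame TAA.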

Embryonic down regulation of Ptbp1 by miR-124 is crucial at the onset of neurogenesis (Makeyev et al., 2007) and leads to a change in splicing programs (Boutz et al., 2007; Keppetipola et al., 2012), but cannot account for additional postnatal down regulation of this protein (Boutz et al., 2007; Zheng et al., 2012). Remarkably, MAJIQ analysis of RNA-seq data from mouse cortices across development (Yan et al., 2015) reveals clear developmental regulation of exon 14 with a dramatic increase in inclusion from P15 through adulthood (Figure 6C). Taken together, this complex LSV offers a novel mechanism for postnatal neuronal reduction in Ptbp1.

To identify putative regulators of novel exon 14, we used AVISPA (Barash et al., 2013), a web tool that utilizes splicing code models to suggest motifs important for tissue-specific splicing, and identified the [U]GCAUG binding motif of the Rbfox family as important for neuronal splicing outcome (Figure 6D). AVISPA’s map of regulatory motifs pointed to a number of Rbfox binding sites downstream of exon 14 (Figure 6A). These motifs, perfectly conserved between mouse and human, suggested enhancement of inclusion by the Rbfox family (Lovci et al., 2013). Consistent with this regulatory hypothesis, MAJIQ analysis of RNA-seq data from one month old nestin-specific Rbfox1 KO mice revealed a marked decrease in inclusion of exon 14 from ~16% in wild type mice to nearly undetectable in the KO (Figure 6E; Figure 6—figure supplement 1D) and similar decreased inclusion was observed upon Rbfox2 KO (Lovci et al., 2013) (Figure 6—figure supplement 1E). Together these data demonstrate the power of MAJIQ, in combination with the VOILA and AVISPA analysis tools, in identifying previously uncharacterized isoforms and understanding the regulation of biologically important transcript variation.
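The [U]GCAUG motif discussed above can be located in a sequence with a simple scan. The sketch below is a plain pattern search, not AVISPA's splicing code model; the function name and the example sequence are hypothetical. It reports each GCAUG core and whether it is preceded by the optional U.

```python
import re

# Zero-width lookahead so that cores sharing a G (e.g. GCAUGCAUG) are all found.
RBFOX_CORE = re.compile(r"(?=GCAUG)")


def find_rbfox_sites(seq: str):
    """Return (start, has_upstream_U) for every GCAUG core in `seq`.

    `seq` may be DNA or RNA; T is converted to U before scanning.
    """
    rna = seq.upper().replace("T", "U")
    sites = []
    for match in RBFOX_CORE.finditer(rna):
        start = match.start()
        has_u = start > 0 and rna[start - 1] == "U"
        sites.append((start, has_u))
    return sites


# Hypothetical intronic fragment, for illustration only.
print(find_rbfox_sites("AATGCATGCCGCATGTT"))
```

A real analysis would also need the position of each site relative to the exon, since the text above ties enhancement of inclusion to sites downstream of exon 14.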

MAJIQ detects novel splicing variations in the CAMK2 family which are conserved, developmentally regulated, and dysregulated in AD

Several of the brain specific LSVs we detected were found in genes encoding calcium/calmodulin-dependent protein kinase II (CAMK2) subunits which regulate functions in the brain such as neurotransmitter synthesis and release, cellular transport, neurite extension, synaptic plasticity, learning and memory (Griffith, 2004). We focused on Camk2d and Camk2g as these exhibit complex changes and were expressed in nearly all tissues examined (Figure 4—source data 1). Figure 1B and Figure 7—figure supplement 1B show MAJIQ’s analysis and matching RT-PCR validation of a Camk2g LSV containing three exons across five tissues. Figure 7 shows similar verification for another complex LSV but in Camk2d. In both cases, exon inclusion creates consensus NLS motifs (KKRK), which localize these subunits to the nucleus (Braun and Schulman, 1995). For Camk2g the NLS motif is contained in exon 15 whose inclusion levels are highest in the brain, particularly in the brainstem (Figure 1B, Figure 7—figure supplement 1B).

Figure 7 with 7 supplements

Camk2d LSV exhibits complex developmental dynamics and is misregulated in Alzheimer’s disease.

(A) Representation of complex source LSV in *Camk2d* with matching RT-PCR validation in five tissues (brainstem, cerebellum, hypothalamus, heart, and adrenal). Colored arcs represent the junctions quantified by MAJIQ for this LSV while dashed arcs correspond to junctions in the RNA-seq data that are not part of the quantified LSV. Violin plots on the bottom display Ψ quantifications (x-axis) for each of the colored junctions (y-axis) across the five tissues with appropriate isoforms from the gel on the right. Isoforms with known tissue-specific splicing patterns are labeled as in the literature. (B) Line graphs of MAJIQ E[Ψ] quantification (y-axis) of junctions as in (A) across time points (x-axis) through cortex development (top) and heart development (bottom). Points represent mean Ψ and error bars represent one standard deviation in E[Ψ]. (C) ΔΨ quantification comparing changes between control and Alzheimer’s patient brains of the homologous junctions illustrated in (A).

https://doi.org/10.7554/eLife.11752.017

Several other important aspects of Camk2d splicing are accurately captured by MAJIQ. These include near 100% skipping of exons 21 through 23 in all non brain or muscle tissues (known in the literature as isoform C or Camk2δC, (Xu et al., 2005)), high relative inclusion of NLS containing exon 21 in heart (isoform B or Camk2δB), and high levels of isoform A (Camk2δA), which includes exons 22 and 23, in the brain regions examined (Figure 7A). This result is consistent with previous reports of Camk2d splicing patterns and isoform A being neuronal-specific (Xu et al., 2005). Importantly though, MAJIQ also detects isolated inclusion of exon 23 in the heart (Figure 7A, green junction), which is supported by both the RT-PCR experiment and analysis of an independent dataset across heart development (see below). Previous studies focused on splicing regulation of Camk2d in the heart used junction spanning primers that preclude detection of this highly utilized splicing choice (Xu et al., 2005; Ye et al., 2015).

Because CAMK2 has been implicated in neurodevelopment and is proposed to be critical for postnatal heart development (Xu et al., 2005), we next looked for developmental changes in LSVs by analyzing RNA-seq data derived from mouse cortices (Yan et al., 2015) and hearts (Giudice et al., 2014) at different time points. In the brain there is a switch in the splicing of Camk2d between the C and the A isoforms, reaching over 80% use of the A isoform by postnatal day 15, corresponding to a time of intense synaptogenesis and plasticity (Licatalosi et al., 2012) (Figure 7B, top). In the heart we see a more modest decrease in isoform C and increase in exon 23 only during postnatal heart development (Figure 7B, bottom, compare purple with green), consistent with results from RT-PCR from eight week old mice (Figure 7A). Notably, other CAMK2 subunits also displayed developmental dynamics in both tissues, such as inclusion of NLS containing exons in Camk2g and Camk2a (Figure 7—figure supplement 1C and 2), an unannotated mouse cassette exon in Camk2g regulated by the Rbfox family (Figure 7—figure supplement 1D–G), and a complex LSV in the variable domain of Camk2b that affects autophosphorylation and is regulated by Ptbp2 (Li et al., 2014) (Figure 7—figure supplement 3).

Given the suggested role of calcium signaling in neurodegeneration (Marambaud et al., 2009) and the implication of CAMK2 in Alzheimer’s disease (AD) (Steiner et al., 1990), we also analyzed RNA-seq data from three control brains and compared them to three AD brains (Bai et al., 2013). Strikingly, in CAMK2D we observe a marked decrease of ~38% of the neuronal specific isoform of the complex, developmentally-regulated mouse LSV we validated above, with a reciprocal increase in the all-exclusion isoform C in AD brains (Figure 7C). We also observe changes in a CAMK2G LSV that corresponds to an unannotated mouse exon (Figure 7—figure supplement 1D,E). Importantly, these exons are perfectly conserved between mouse and human at the amino acid level, further suggesting physiologic importance of the novel splicing variations detected by MAJIQ. Finally, we validated that the observed CAMK2 splicing changes in AD brains can be reproduced in a second independent study. We used data from the AMP-AD Target Discovery Consortium (doi:10.7303/syn2580853) involving a larger cohort of 157 samples from AD patients’ brains and 128 control samples, across three different brain subregions (Figure 7—figure supplement 4). Overall, we detected approximately 200 LSVs that are reproducibly differentially spliced between AD and normal brains (see Materials and methods) and enriched in GO terms such as cytoskeleton, GTPase regulator activity, and synapse organization (data not shown). This set constitutes approximately 12% of the changing LSVs detected in the original dataset, a fraction that grows to 21% but only 164 LSVs if stricter filtering is applied to both datasets (data not shown). This relatively low percentage of reproducible changes across the two datasets can be at least partially attributed to the small number of samples in the original study combined with an average of 1.8-fold lower coverage in the second, larger dataset. Notably though, among the reproducible set of differentially spliced LSVs 79 are complex, a significant, 1.2-fold enrichment compared to their relative proportion among all LSVs detected (p=0.04, binomial test). While the validation and experimental follow up on these LSVs is beyond the scope of this paper, these results and the related CAMK2 analysis demonstrate the usefulness of our combined approach for LSV detection, quantification, and visualization for disease studies.
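The binomial enrichment test mentioned above can be sketched as an upper-tail probability. In the example, the total of 200 reproducible LSVs is the approximate figure from the text and the background proportion is back-calculated from the reported 1.2-fold enrichment, so both are assumptions and the result is not expected to reproduce the published p-value exactly. The function name is hypothetical.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p)**(n - i) for i in range(k, n + 1))


# Assumed inputs: ~200 reproducible LSVs, 79 of them complex, and a background
# proportion of complex LSVs implied by a 1.2-fold enrichment.
n_reproducible = 200
n_complex = 79
background = (n_complex / n_reproducible) / 1.2

p_value = binomial_upper_tail(n_complex, n_reproducible, background)
print(f"P(X >= {n_complex}) = {p_value:.3f}")
```

The test asks how often at least 79 complex LSVs would appear among the reproducible set if complex LSVs occurred there at the background rate.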

Overall, our analysis of CAMK2 is in line with previous studies but also detects additional isoforms and exons that are conserved, developmentally regulated, and dysregulated in AD, making for a more accurate picture of CAMK2 splicing patterns. Additional complex LSVs we validated and analyzed include brain specific isoforms of the kinesin light chain Klc1, recently shown to be an amyloid-beta accumulation modifier (Morihara et al., 2014) (Figure 7—figure supplement 5); the clathrin light chain Clta, which displays developmental dynamics and dysregulation in both Alzheimer's disease cohorts (Figure 7—figure supplement 6, Figure 7—figure supplement 4); and the translation initiation factor scaffold Eif4g3, which has high inclusion of a cassette microexon specifically in cerebellum and a novel, muscle-specific exon (Figure 7—figure supplement 7).

Discussion

The work presented here spans a wide spectrum of topics from a new formulation of transcriptome variations in units of local splicing variations (LSVs); through algorithms for detection, quantification and visualization of LSVs; a genome wide map of LSVs; analysis of the prevalence and functional significance of complex LSVs; to validation of several complex LSVs that affect protein domains in developmentally regulated genes with key roles in neurogenesis or other brain functions. For the latter, we also demonstrated dysregulation in Alzheimer’s disease using two independent datasets.

The new formulation of LSVs sheds light on what has thus far been mostly a 'dark side' of the transcriptome and RNA-Seq based studies, i.e. complex splicing variations. Several previous works aimed to address the apparent representational gap between full transcripts and the classical binary AS events. For example, Nagasaki et al. (2006) developed an efficient bit array representation for the various exonic segments that make up different gene isoforms, and Sammeth et al. (2008) suggested an elaborate notational system that allowed them to catalogue all the splicing variations in a given transcriptome, comparing the frequencies of different AS types across 12 metazoa. More recently, Pervouchine et al. (2013) developed bam2ssj, a package implementing a general intron centric approach to estimate AS from RNA-Seq data that can capture non classical AS variations. bam2ssj provides a BAM-file processing pipeline that counts junction reads to compute the ratio of inclusion levels either from the 5’ or the 3’ end of an intron, denoted Ψ₅ and Ψ₃. A different, graph based, approach was taken by Hu et al. (2013), where a splice graph is divided into subunits termed alternative splicing modules (ASMs). ASMs are hierarchically structured, each capturing all the possible paths along a splice graph between specific start (‘single entry’) and end (‘single exit’) points. The matching algorithm, DiffSplice, then aims to identify cases of differential transcription of ASMs between two experimental conditions. All of these works differ substantially in the formulation of splicing variation, the underlying algorithms, and visualization approach, yet all share the effort to capture non classical AS types. In comparison, MAJIQ offers a unique approach that spans formulation, detection, quantification and visualization of splicing variations. Unlike ASMs, LSVs can be inferred directly from junction spanning reads and result in quantitative PSI and dPSI estimates, while MAJIQ’s probabilistic model offers a significant accuracy boost for PSI and dPSI estimates compared to alternative methods.
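As a rough illustration of junction-based quantification, the sketch below computes a naive inclusion ratio for each junction as its read count divided by the total count of all junctions sharing the same donor (or acceptor) coordinate. This is a simple count ratio written for illustration only. It is neither MAJIQ's probabilistic model nor the bam2ssj implementation, and the function name, coordinates and read counts are hypothetical.

```python
from collections import defaultdict


def naive_psi(junction_counts, group_by="donor"):
    """Naive PSI per junction from read counts.

    junction_counts: dict mapping (donor, acceptor) -> read count.
    group_by: "donor" groups junctions sharing a 5' splice site,
              "acceptor" groups junctions sharing a 3' splice site.
    """
    if group_by not in ("donor", "acceptor"):
        raise ValueError("group_by must be 'donor' or 'acceptor'")
    idx = 0 if group_by == "donor" else 1

    # Total reads over all junctions that share the chosen splice site.
    totals = defaultdict(int)
    for junction, count in junction_counts.items():
        totals[junction[idx]] += count

    return {junction: count / totals[junction[idx]]
            for junction, count in junction_counts.items()
            if totals[junction[idx]] > 0}


# Hypothetical coordinates and counts: three junctions leave donor 100.
counts = {(100, 200): 30, (100, 300): 10, (100, 400): 60, (250, 300): 5}
print(naive_psi(counts, group_by="donor"))
print(naive_psi(counts, group_by="acceptor"))
```

With three junctions leaving the same donor, the ratios describe a non-binary variation of the kind the text calls complex. A point estimate like this carries no uncertainty, which is the gap a probabilistic model is meant to address.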

The importance of the LSV formulation is manifested in how common complex LSVs are in diverse metazoans, making up at least a third of observed LSVs in human and mouse. Complex LSVs are also enriched for regulated splicing when analyzing over thirty datasets across different tissues, developmental stages, splice factor knockdowns and neurodegenerative disease. In addition, the LSV formulation can be used to investigate substructures of the transcriptome. We found that the biochemically-based proximity rule is commonly overcome at the genomic level and that complex LSVs are less likely to have a dominant splice junction. As for the possible function of LSVs, our results indicate that tissue dependent binary and complex LSVs both tend to occur in unstructured regions known to affect protein-protein interactions, as well as in specific yet distinct protein domains and families.

To benefit from the new LSV formulation, matching software is needed. The software we developed, MAJIQ, is LSV focused and compares favorably with available tools on AS quantification based both on RNA-Seq from biological replicates and on a compendium of over 200 RT-PCR experiments. Unlike many tools, MAJIQ supplements annotated transcriptomes with novel splice junctions, while VOILA allows the resulting LSVs to be interactively visualized within standard web browsers. Thus, MAJIQ and VOILA offer a compelling LSV centered addition to tools such as MISO (Katz et al., 2010), rMATS (Shen et al., 2014) and cuffdiff (Trapnell et al., 2013) that allow users to quantify the relative abundance of whole isoforms, alternative polyadenylation, or differential expression.

Immediate applications of the novel LSV framework and the MAJIQ software cover a wide spectrum. Examples include improved disease studies where transcriptome variations play a role, enhancing predictive models for splicing and for the effect of genetic variants, studying the regulatory underpinning of complex LSVs, and examining their evolutionary history. At the most basic level, our results illustrate the potential for novel discoveries in reanalyzing previously published data with the new LSV based methods. We anticipate the framework and resources provided here will form the basis of many additional new discoveries in diverse fields.


Author details

Jorge Vaquero-Garcia (contributed equally)
Alejandro Barrera (contributed equally)
Matthew R Gazzara (contributed equally)
Juan González-Vallinas
Nicholas F Lahens
John B Hogenesch
Kristen W Lynch
Yoseph Barash (for correspondence)
