Figures and data

RNA alternative splicing (AS) and its predictive and generative modeling.
(a) Cartoons of several types of AS, involving exon skipping as well as alternative 3’ and 5’ splice sites. (b) Quantification of AS events such as exon skipping from RNA-Seq. PSI represents the relative exon inclusion level, and dPSI represents the change in inclusion across conditions. Quantifying thousands of such AS events across many conditions, represented by the 3D stacks, enables training ML algorithms to predict PSI and dPSI from genomic sequence. (c) A genome browser view of an illustrative exon skipping event. The genomic regions spanned by cassette exons vary from tens to hundreds of thousands of bases. (d) Schematic of TrASPr (left) and BOS (right). TrASPr’s input includes genomic regions proximal to splice sites around AS event e, along with other genomic features and two cellular conditions (c, c′) (bottom left). It outputs a predicted inclusion level 


Comparison of PSI prediction results on the GTEx dataset.
(a) Heatmaps show the distribution of predicted vs. RNA-Seq values for all samples (left) and for samples involving exons that exhibit a change (ΔΨ ≥ 0.15) between at least two tissues (right) for SpliceAI (top), SpliceTransformer (2nd row), Pangolin (3rd row), and TrASPr (bottom). r is the Pearson correlation, and a is the proportion of predictions that are approximately correct (within the dashed lines). n is the number of samples (i.e., a cassette event measured in two tissues) in each setting, including heart-atrial appendage, cerebellum, and liver. (b) AUPRC for predicting events that are differentially included (ΔΨ ≥ 0.15) in one tissue (e.g., heart) compared to another (e.g., liver) and for predicting differentially spliced events (|ΔΨ| ≥ 0.15) between the two tissues. The tissue pair is denoted at the bottom. (c) Same as b above but for AUROC. Evaluations in a-c include samples from tissues Pangolin was originally trained on (heart-atrial appendage, cerebellum, and liver). (d) Ablation study across all pairs of six GTEx tissues (the above three as well as lung, spleen, and EBV-transformed lymphocytes). AUPRC and AUROC are averaged across all tissue pairs for the change vs. no-change prediction task as in b-c, while Pearson correlation is for PSI as in a. Top row (TrASPr) is the full model with pre-trained transformers. noPre: same structure and input as TrASPr but trained from scratch. noFeat: same train/pretrain as TrASPr but without extra features. wLSTM: model with a bidirectional LSTM instead of the Transformer and without the extra features. nodPSI: dPSIs removed from the target function. noConsVal: same train/pretrain as TrASPr but without the conservation value feature. noCDS: without coding region-related indicators and frame-shifting features. noLen: without exon/intron length features.
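The evaluation quantities named in this caption (Pearson r, the proportion a of approximately correct predictions, and the change labels at ΔΨ ≥ 0.15) can be illustrated with a minimal sketch. The function names and the tolerance `tol=0.2` are assumptions made for illustration only: the caption does not state the width of the dashed band, and this is not the authors' evaluation code.

```python
import numpy as np

def pearson_r(pred, truth):
    """Pearson correlation between predicted and RNA-Seq-derived PSI."""
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    return float(np.corrcoef(pred, truth)[0, 1])

def approx_accuracy(pred, truth, tol=0.2):
    """Fraction of predictions within +/- tol of the measured PSI.

    tol is an ASSUMED band width; the caption only says "within the dashed lines".
    """
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    return float(np.mean(np.abs(pred - truth) <= tol))

def change_labels(psi_a, psi_b, thresh=0.15, directional=True):
    """Binary labels for the change vs. no-change task between two tissues.

    directional=True : tissue a is differentially included (PSI_a - PSI_b >= thresh)
    directional=False: differentially spliced in either direction (|dPSI| >= thresh)
    """
    dpsi = np.asarray(psi_a, float) - np.asarray(psi_b, float)
    return dpsi >= thresh if directional else np.abs(dpsi) >= thresh

# Toy values, not data from the paper.
heart = [0.9, 0.5, 0.1, 0.4]
liver = [0.5, 0.5, 0.4, 0.3]
print(change_labels(heart, liver))                      # heart-specific inclusion
print(change_labels(heart, liver, directional=False))   # any change
```

Labels of this kind, paired with a model's predicted dPSI as the score, are what an AUPRC or AUROC routine would consume.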

TrASPr prediction results in unseen conditions and alternative splice sites.
n is the number of samples in each setting. (a) Performance of TrASPr on two test GTEx tissues (cortex, adrenal gland). Top: The tissues were first represented as tokens, and new tissue results were predicted based on the average over conditions during training. Mid: TrASPr used the PCA-learned representation to predict AS in the two test GTEx tissues it was never trained on. Bottom: TrASPr was trained on all eight tissues using token-based tissue representations and tested on the two test GTEx tissues. The left column includes all samples and the right one includes only changing event samples. Changing events have an inclusion level change greater than or equal to 0.15 in at least one tissue pair. (b) AUROC and AUPRC plots for predicting change vs. no-change events in the two GTEx test tissues used in (a), compared to the six original tissues. Blue: TrASPr with a token per tissue, trained on samples from all eight tissues. Red: TrASPr using PCA embedding to represent tissues, where samples from the two test tissues were not included in the training. (c) Prediction accuracy of TrASPr when applied to alternative 3’ (top) and alternative 5’ (bottom) splice sites.

TrASPr prediction results on mutation effect.
(a) Whisker plot for the effect of splice site mutations on predicted PSI when weak splice sites are made strong (blue, left) and when strong splice sites are made weak (brown, right). (b) Distribution of mutation positions in the CD19 dataset (left) and the CDF of the marginal effect for each of those sequences (right). (c) Heatmaps showing the performance of SpliceAI (left column) and TrASPr (right column) in predicting the effect of mutations shown in b, under three settings: random 5-fold cross-validation (top row), random 5-fold cross-validation for changing mutations only (middle row), and single unseen mutation filter (bottom row). n indicates the number of cases in the test set. (d) Predicting the effect (dPSI direction) of RBP knockdown (KD) by mutating the corresponding sequence motifs. Blue, grey, and red correspond to correct, no change, and opposite direction prediction, respectively.

Experimental validations for TrASPr predictions.
(a) Bar plot for the validation rate of low coverage AS events predicted by TrASPr to exhibit tissue-specific splicing between cerebellum, liver, and heart-atrial appendage. The validation rate was between 48.8% and 55.8%, depending on the prediction stringency, discovering a total of 169 new tissue-specific events. (b) Two examples of newly found tissue-specific AS events from (a). For each case, the top graph illustrates the splicing context of the event. Two bar plots show the comparison between LSV-seq experimental results (bottom left) and TrASPr predictions (bottom right). (c, d) Two AS events where specific regions were targeted by dCas13d, including elements predicted by TrASPr to have a significant regulatory effect and negative control regions. The bar plot (top right) shows the inclusion level changes predicted by TrASPr for 6-base-long windows in the tested region. Effects of dCas13d targeting were assessed by RT-PCR (bottom, NT = non-targeting, nc = negative control).

RNA design results by BOS.
(a) Results for the task of improving inclusion of weak cassette exons (n=8 exons). Top: Bar plots for the success rate in achieving the desired design task (increased inclusion). Error bars represent standard deviation over the set of exons tested. Bottom: CDFs over the best designed sequences (top 20%) by the MaxEnt splice site score change between the original sequence and the proposed sequence. GA - Genetic Algorithm, RM - Random. (b) BOS generation results for the CD19 mutation dataset, showing 575 generated sequences across 4846 possible positions. The positions mutated by BOS (bottom) capture regions close to the alternative exon splice sites whose mutations have strong marginal effects on inclusion levels (top). (c) Comparison of BOS, GA, and RM on tissue-specific (cerebellum) sequence generation. Different start sequences (n=10) are randomly chosen from cassette exons exhibiting low inclusion levels. Every algorithm is tasked with adapting the start sequence to achieve cerebellum-specific high inclusion (Ψ ≥ 0.5 for cerebellum, otherwise Ψ ≤ 0.2) within 30 edits. Top: Success rate for this task. Bottom: The achieved improvement (dPSI) for the top 20% of sequences generated by each algorithm. (d) BOS generation results for the neuronal-specific Daam1 exon 16. Bar plots indicate the distribution of 27206 hits mutated by BOS across 2000 sequences. The bottom plot is a zoomed-in region of the top one. Regions that were validated experimentally by mutating them in a mini-gene system are marked either blue (yes) or red (no) depending on whether TrASPr, which teaches BOS, is able to predict the effect of those segments. The green region indicates a region that doesn’t affect the inclusion level and is predicted correctly by TrASPr.
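The success criterion in panel (c) (Ψ ≥ 0.5 for cerebellum, Ψ ≤ 0.2 otherwise, within 30 edits) can be written as a small check. This is a sketch under stated assumptions: the function names are hypothetical, edits are assumed to be single-base substitutions on equal-length sequences (the caption does not say which edit types count), and the PSI values would come from a predictor such as TrASPr, which is not reproduced here.

```python
def count_substitutions(start, designed):
    """Number of differing positions; ASSUMES edits are substitutions only."""
    if len(start) != len(designed):
        raise ValueError("sequences must have equal length under the substitution assumption")
    return sum(a != b for a, b in zip(start, designed))

def is_tissue_specific(psi_by_tissue, target="cerebellum", hi=0.5, lo=0.2):
    """True if PSI >= hi in the target tissue and PSI <= lo in every other tissue."""
    others_low = all(psi <= lo for t, psi in psi_by_tissue.items() if t != target)
    return psi_by_tissue[target] >= hi and others_low

def design_succeeds(start, designed, psi_by_tissue, max_edits=30):
    """Design task from panel (c): tissue-specific inclusion within the edit budget."""
    within_budget = count_substitutions(start, designed) <= max_edits
    return within_budget and is_tissue_specific(psi_by_tissue)

# Toy example; the PSI values are placeholders, not model outputs.
predicted = {"cerebellum": 0.6, "liver": 0.1, "heart": 0.2}
print(design_succeeds("ACGTACGT", "ACCTACGA", predicted))
```

A success rate like the one in the top bar plot would then be the fraction of generated sequences for which such a check returns True.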

GTEx Test set Pearson correlation for each model, binned by the combined length of the cassette exon’s upstream and downstream introns.
Left: All test cases. Right: Only samples involving a splicing change (dPSI ≥ 0.15).

Same as Figure 2 - Supp 1 but here test cases are binned by the alternative exon’s length.

(a) Pearson correlation between gene expression (TPM) and splice site usage as defined by SpliceTransformer using GTEx Brain Cerebellum (r = 0.52, left) and liver (r = 0.53, right). (b) Correlation is further improved when considering the expression of only the isoforms that contain a specific splice site (cerebellum r = 0.71, left; liver r = 0.69, right).

Tissue PSI values in GTEx samples (top - Cerebellum, bottom - liver) vs. SpliceTransformer usage.
Correlation between usage and PSI is weak (Pearson r = 0.076 for cerebellum, r = 0.074 for liver). Shown are only cassette exon PSI values on chromosome 8 that were quantified by MAJIQ with high confidence and used as test data for all algorithms in the main text (Fig 2). When usage is high (left panels), PSI has the typical bimodal distribution such that it can be either very high or very low. Conversely, when PSI is low, coverage can vary greatly between 0.05 (the threshold filter set by the SpliceTransformer authors) and 1, but when PSI is high the events are detected in almost all samples (usage close to 1).

Differential usage (x-axis) vs. differential splicing (dPSI, y-axis) for cassette exons in chromosomes 7 and 8, assessed for the three GTEx tissue pairs used in the main text (Fig 2): heart_BCer, heart_liver, BCer_liver.
dPSI values are the same as those used to train and test all algorithms in the main text. Top: Scatter plot. Bottom: Matching heat map. Note that ~80-90% of the samples with dPSI > 0.1 have dUsage ~ 0 and therefore will not be captured by the SpliceTransformer target function, which weighs samples by their dUsage. A few points exhibit both high dUsage and high dPSI, contributing to some correlation, with Pearson correlation ranging from 0 to 0.14 depending on the tissue pair.

TrASPr was trained on six GTEx tissues or on GTEx + two ENCODE cell lines and then tested on two cell lines in ENCODE (HepG2, K562).
Top: The tissues were first represented as tokens, and new cell line results were predicted based on the average over conditions during training. Mid: TrASPr used the PCA-learned representation to predict AS in the two ENCODE cell lines it was never trained on. Bottom: TrASPr was trained with token-represented tissues on GTEx + two ENCODE cell lines. Left: All samples. Right: Changing events.

Heatmaps showing the performance of Pangolin in predicting the effect of mutations for the CD19 dataset with the same settings as shown in Figure 4c.

CDF for the difference between TrASPr test data predictions (PSI’) and ground truth PSI derived from MAJIQ on the ENCODE dataset.

Predicting the effect of tissue specific splice factors exhibits strong positional biases.
The sequence motifs of RBFOX (TGCATG, left) and QKI (ACTAAC, right) were inserted at each of the modeled positions around 80 cassette exons, 40 of which are highly included (PSI > 0.75) and 40 of which are lowly included (PSI < 0.25). Predictions were compared to 10 random k-mers inserted at the same location (see Methods). Colors represent the indicated percentile, per position, of effects ranging from the 10th percentile, where increased exclusion is observed over the 80 cassette events, to the 95th percentile, where increased inclusion is observed.
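The positional scan described in this caption can be sketched as follows. Everything here is illustrative rather than the authors' procedure (which is in their Methods): `predict_psi` is a hypothetical stand-in for a trained predictor, the motif is assumed to overwrite the bases at each position so that sequence length is preserved, and the effect is taken as the motif prediction minus the mean over random k-mer controls.

```python
import random
import numpy as np

def motif_effect_scan(seq, motif, predict_psi, n_random=10, seed=0):
    """Effect of placing `motif` at each position, relative to random k-mer controls.

    ASSUMPTION: the motif replaces len(motif) bases in place (length-preserving).
    predict_psi is any callable mapping a sequence string to a PSI-like float.
    """
    rng = random.Random(seed)
    k = len(motif)
    effects = []
    for pos in range(len(seq) - k + 1):
        with_motif = seq[:pos] + motif + seq[pos + k:]
        controls = []
        for _ in range(n_random):
            kmer = "".join(rng.choice("ACGT") for _ in range(k))
            controls.append(predict_psi(seq[:pos] + kmer + seq[pos + k:]))
        effects.append(predict_psi(with_motif) - float(np.mean(controls)))
    return np.array(effects)

def per_position_percentiles(effects_matrix, qs=(10, 95)):
    """Percentiles across exons (rows) at each position (columns)."""
    return np.percentile(np.asarray(effects_matrix, float), qs, axis=0)

# Toy predictor: responds only to the RBFOX motif named in the caption.
toy_predict = lambda s: 1.0 if "TGCATG" in s else 0.0
scan = motif_effect_scan("A" * 12, "TGCATG", toy_predict)
print(scan.shape)
```

Stacking one such scan per exon into a matrix and calling `per_position_percentiles` would give per-position percentile bands of the kind the colors in the figure represent.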

BOS sequence edits for neuronal specific splicing.
This figure matches Figure 6C but includes all BOS-generated samples that pass the user-defined criterion of brain-specific dPSI > 0.2, not just the top 20th percentile of the best generated samples. Left: The overall number of edits by BOS per position over all generated sequences (n=20349). Right: A CDF of the TrASPr-based predicted neural-specific dPSI after the BOS edits (blue), and when intronic edits that are further than 8 bp from the nearest splice sites are removed (orange). The shift between the blue and orange curves demonstrates that almost all of the predicted neuronal-specific effect is achieved via intronic edits. We further tested the effect of removing edits only from the downstream (green) and upstream (red) intronic regions flanking the middle exon.

Generated sequence results for Daam1 gene exon 16 (n=4392).
Algorithms were tasked with reducing inclusion in the N2A cell line (dPSI > 0.2) without destroying inclusion in other tissues (PSI > 0.1). Left: Success rate of each algorithm, indicating efficiency. Right: CDF of dPSIs for each algorithm, indicating effectiveness.