Figures and data

Overview of the OpenSpliceAI design.
This toolkit features six primary subcommands: (A) The “create-data” subcommand processes genome annotations in GFF/GTF format and genome sequences in FASTA format to produce one-hot encoded gene sequences (X) and corresponding labels (Y), both stored in HDF5 format. (B) The “train” subcommand utilizes the HDF5 files generated by “create-data” to train the SpliceAI model using PyTorch, resulting in a serialized model in PT format. This process also generates logs for training, testing, and validation. (C) The “calibrate” subcommand takes both training and test datasets along with a pre-trained model in PT format. It randomly allocates 10% of the training data as a validation (calibration) set, which is then used to adjust the model’s output probabilities so that they more accurately reflect the observed empirical probabilities during evaluation on the test set. (D) The “transfer” subcommand allows for model customization using a dataset from a different species, requiring a pre-trained model in PT format and HDF5 files for transfer learning and testing. (E) The “predict” subcommand enables users to predict splice site probabilities for sequences in given FASTA files. (F) The “variant” subcommand assesses the impact of potential SNPs and indels on splice sites using VCF format files, providing predicted cryptic splice sites.

(A) Schematic overview of OpenSpliceAI’s approach. Gene sequences are first extracted from the genome FASTA file and one-hot encoded (X). Splice sites are identified and labeled using the annotation file (Y). The resulting paired data (X, Y) for each gene is then compiled for model training (80% of the sequences) and testing (20% of the sequences). (B) Workflow of the OSAIMANE 10,000 model. Input sequences are one-hot encoded and padded with 5,000 Ns ([0,0,0,0]) on each side, totaling 10,000 Ns. The model processes the input and outputs, for each position, the probability of that position being a donor site, an acceptor site, or neither. (C-D) Performance comparison between OSAIMANE and SpliceAI-Keras on donor and acceptor splice sites, for models trained with 80 nt, 400 nt, 2,000 nt, and 10,000 nt flanking sequences. Evaluation metrics include top-1 accuracy for both donor and acceptor sites. Blue curves represent SpliceAI-Keras, while orange curves represent OSAIMANE. Each dot represents the average score, with error bars indicating ± one standard error. Performance is compared on human test data. (E) Benchmarking results for elapsed time, average memory usage, and GPU peak memory for the prediction submodule.
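The encoding and padding described in panel (B) can be illustrated with a minimal sketch. The function name `one_hot_with_padding` and the A, C, G, T channel order are assumptions made for illustration; they are not taken from the OpenSpliceAI code.

```python
# Hypothetical one-hot table; the A, C, G, T channel order is an assumption.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
    "N": [0, 0, 0, 0],  # N carries no base information
}


def one_hot_with_padding(sequence, flank=5000):
    """Pad a sequence with `flank` Ns on each side, then one-hot encode it."""
    padded = "N" * flank + sequence.upper() + "N" * flank
    # Any unexpected character is treated like N (all zeros).
    return [ONE_HOT.get(base, ONE_HOT["N"]) for base in padded]


encoded = one_hot_with_padding("ACGTN", flank=5000)
print(len(encoded))  # sequence length plus 2 * flank
```

With `flank=5000`, each input gains 10,000 padding positions in total, matching the flanking context of the 10,000 nt model.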

Genome assembly and annotation details for species used for OpenSpliceAI training and transfer-learning in this study.
Note: For each species, the table includes the GenBank accession number, assembly name, FTP sites for assembly and annotation downloads, and annotation release dates.

(A) The number of protein-coding genes in the training and test sets, along with the count of paralogous genes removed for each species: Human-MANE, mouse, zebrafish, honeybee, and Arabidopsis. (B) Scatter plots of DNA sequence alignments between testing and training sets for Human-MANE, mouse, honeybee, zebrafish, and Arabidopsis. Each dot represents an alignment, with the x-axis showing alignment identity and the y-axis showing alignment coverage. Alignments exceeding 80% for both identity and coverage are highlighted in the red-shaded region and excluded from the test sets. (C-F) Performance comparisons of OSAIs trained on species-specific datasets (mouse, zebrafish, honeybee, and Arabidopsis) versus SpliceAI-Keras, the originally published SpliceAI models trained on human data. The orange curves represent OSAI metrics, while the blue curves show SpliceAI-Keras metrics. Each subplot (C-F) reports the F1 score evaluated separately for donor and acceptor sites.
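The exclusion rule in panel (B) can be sketched as a simple filter. The record fields (`test_gene`, `identity`, `coverage`) and the function name are hypothetical; only the 80% thresholds on both identity and coverage come from the caption.

```python
def genes_to_exclude(alignments, threshold=80.0):
    """Return test genes with an alignment above the threshold on BOTH axes."""
    return {
        aln["test_gene"]
        for aln in alignments
        if aln["identity"] > threshold and aln["coverage"] > threshold
    }


# Toy alignment records (percent identity and percent coverage); not real data.
alignments = [
    {"test_gene": "geneA", "identity": 95.0, "coverage": 90.0},  # excluded
    {"test_gene": "geneB", "identity": 95.0, "coverage": 40.0},  # kept
    {"test_gene": "geneC", "identity": 60.0, "coverage": 99.0},  # kept
]

excluded = genes_to_exclude(alignments)
test_set = [g for g in ["geneA", "geneB", "geneC"] if g not in excluded]
print(test_set)
```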

Performance comparison of scratch-trained and transfer-trained OSAIs across species and sequence lengths.
(A-D) Top-1 accuracy for donor and acceptor splice sites of 80 nt, 400 nt, 2,000 nt, and 10,000 nt models, comparing OSAIMouse (scratch-trained) and OSAIMouse-transferred (transfer-trained) models over epochs 1 to 10 on the test dataset. (E-H) Top-1 accuracy after one epoch of training versus after ten epochs for both scratch-trained and transfer-trained models across the same sequence lengths. Each plot represents one species and its corresponding transfer-trained model: (E) OSAIMouse vs. OSAIMouse-transferred, (F) OSAIZebrafish vs. OSAIZebrafish-transferred, (G) OSAIArabidopsis vs. OSAIArabidopsis-transferred, and (H) OSAIHoneybee vs. OSAIHoneybee-transferred.
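For readers unfamiliar with the metric, the following is a minimal sketch of top-1 accuracy, assuming the definition used in the original SpliceAI evaluation: if a test set contains n true sites of a given type, the n highest-scoring positions are taken as predictions, and the metric is the fraction of those predictions that are true sites. The function name and toy values are illustrative only.

```python
def top_1_accuracy(scores, labels):
    """Fraction of true sites recovered among the n top-scoring positions,
    where n is the number of true sites (labels are 0 or 1)."""
    n_true = sum(labels)
    if n_true == 0:
        raise ValueError("top-1 accuracy is undefined without true sites")
    # Rank positions from highest to lowest predicted score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(labels[i] for i in ranked[:n_true])
    return hits / n_true


# Toy example: two true sites; the top two scores are positions 0 and 2.
scores = [0.9, 0.1, 0.8, 0.3, 0.7]
labels = [1, 0, 0, 0, 1]
accuracy = top_1_accuracy(scores, labels)
print(accuracy)
```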

(A) Calibration results for OSAIMANE across non-splice sites, acceptor sites, and donor sites. Models trained with different flanking sequence lengths are represented by color: 80 nt (blue), 400 nt (green), 2,000 nt (orange), and 10,000 nt (red). Dotted curves in lighter colors denote pre-calibration results, while solid curves in darker shades show post-calibration results. (B) Expected calibration error (ECE) on the validation set (top) and test set (bottom), comparing OSAIMANE’s performance before (blue bars) and after (orange bars) calibration. For each OSAIMANE flanking sequence length, five calibration experiments were performed; the mean loss is shown with ± one standard error. (C) Two-dimensional calibration map for OSAIMANE, illustrating how raw predicted probabilities for acceptor (x-axis) and donor (y-axis) sites are transformed after calibration. Arrows indicate the shift from pre- to post-calibration states in two-dimensional probability space, resulting in a smoother probability distribution.
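As a reference point for panel (B), the following is a minimal sketch of the standard binned ECE for a single class treated one-vs-rest. The number of bins and the equal-width binning are assumptions; the caption does not specify how ECE was binned in the study.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: weighted mean gap between the average predicted
    probability and the observed positive fraction in each bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # Equal-width bins on [0, 1]; p == 1.0 falls in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        positive_fraction = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_prob - positive_fraction)
    return ece


# Toy predictions: slightly under-confident at both ends.
ece = expected_calibration_error([0.1, 0.1, 0.9, 0.9], [0, 0, 1, 1])
print(round(ece, 3))
```

A perfectly calibrated model has an ECE of zero, so lower bars after calibration indicate that predicted probabilities track observed frequencies more closely.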

(A) Plot of importance scores for nucleotides near the acceptor site of exon 9 of U2SURP (top) and DST (bottom), for both SpliceAI and OSAIMANE. The importance score is calculated by taking the average decrease in acceptor site score across the three possible point mutations at a given base position. (B) Plot of the impact of each possible point mutation within 80 bp of a donor site (top) or acceptor site (bottom), for both SpliceAI and OSAIMANE. The impact is the raw decrease in predicted splice site score after mutating a given base to a different one. (C) Visualization of predicted cryptic splicing variants for the MYBPC3 gene (top), with an acceptor site gain and loss event from SpliceAI’s original analysis, and for the OPA1 gene (bottom), where a cryptic exon inclusion event was recently reported (Qian et al., 2021). (D) Predicted splice sites for the entire CFTR gene, with the corresponding predicted probability distribution by base position plotted below, for both SpliceAI and OSAIMANE.
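The importance score in panel (A) can be sketched as an in silico mutagenesis loop. Here `score_fn` stands in for a model call that returns the acceptor site score of a sequence, and `toy_acceptor_score` is a hypothetical placeholder, not SpliceAI or OSAIMANE.

```python
def importance_scores(sequence, score_fn):
    """Per-position average decrease in score over the three possible
    point mutations (assumes the sequence contains only A, C, G, T)."""
    reference = score_fn(sequence)
    scores = []
    for pos, ref_base in enumerate(sequence):
        decreases = []
        for alt in "ACGT":
            if alt == ref_base:
                continue
            mutated = sequence[:pos] + alt + sequence[pos + 1:]
            decreases.append(reference - score_fn(mutated))
        scores.append(sum(decreases) / len(decreases))
    return scores


def toy_acceptor_score(sequence):
    # Placeholder scorer: 1.0 only if an AG dinucleotide sits at positions 2-3.
    return 1.0 if sequence[2:4] == "AG" else 0.0


scores = importance_scores("TTAGG", toy_acceptor_score)
print(scores)
```

The per-mutation values collected in `decreases` correspond to the raw impact plotted in panel (B); averaging them gives the per-position importance score.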

Summary of the four OpenSpliceAI model architectures, each trained with a distinct flanking sequence length (80, 400, 2,000, and 10,000 nucleotides).
The table lists the kernel sizes (W), dilation rates (AR), number of residual and skip blocks, and total cropping length (CL).
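The relationship between these quantities can be sketched as follows. In the original SpliceAI training code, the total cropping length is computed as CL = 2 × Σ AR_i × (W_i − 1) over the residual blocks. The W and AR lists below are the values from the published SpliceAI configurations and are assumed, not confirmed by this caption, to match the table.

```python
def cropping_length(kernel_sizes, dilation_rates):
    """CL = 2 * sum(AR_i * (W_i - 1)) over all residual blocks."""
    return 2 * sum(ar * (w - 1) for w, ar in zip(kernel_sizes, dilation_rates))


# Assumed per-block settings (W, AR), following the original SpliceAI models.
ARCHITECTURES = {
    80: ([11] * 4, [1] * 4),
    400: ([11] * 8, [1] * 4 + [4] * 4),
    2000: ([11] * 8 + [21] * 4, [1] * 4 + [4] * 4 + [10] * 4),
    10000: (
        [11] * 8 + [21] * 4 + [41] * 4,
        [1] * 4 + [4] * 4 + [10] * 4 + [25] * 4,
    ),
}

for flank, (w, ar) in ARCHITECTURES.items():
    print(flank, cropping_length(w, ar))
```

Under these assumed settings, the cropping length of each architecture equals its flanking sequence length.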