A unified reference-free approach to transcriptomic diversity analysis.

(A) Current RNA-seq analysis methods, relying on read alignment to a reference genome, introduce biases and blindspots, and are tailored for specific applications. SPLASH offers a unified alignment-free solution for detecting myriad transcriptome diversifying mechanisms. SPLASH parses reads from all samples (cells) to extract constant kmers (anchors) that are followed by a set of diverse kmers (targets). For each anchor, SPLASH creates a contingency table and computes a closed-form p-value. Anchors with significant p-values are indicative of single-cell-dependent target distribution evidencing regulation and can subsequently be used for downstream analysis for a wide range of applications. SPLASH+ integrates SPLASH with a local assembly approach, called compactors, and an automated anchor classification based on compactors to facilitate biological interpretation and categorize anchors into biologically relevant events such as alternative splicing, SNP, and V(D)J recombination. (B) SPLASH-called anchors are extended to longer sequences or “compactors’’ through a statistically valid local assembly approach. The process involves collecting reads containing the anchor, recursively extending the sequence beyond the anchor by comparing nucleotide frequencies at each position against a threshold based on sequencing error, resulting in distinct branches that correspond to compactors for the anchor. (C) The boxplot showing the expression values (TPM) for the detected and undetected fusions by compactors in the fusion benchmarking dataset, suggesting that compactors provided evidence for the majority of sufficiently-expressed fusions. (D) Unmapped compactors can be annotated by association through other compactors linked to the same anchor.

Transcript diversity variation in repetitive RNA loci.

(A) SPLASH+ detects variation in centromeric repeat arrays, illustrated for the anchor ATTCCATTCCATTCCATTCCATTCCAC, containing 5 contiguous repeats of ATTCC (the characteristic sequence for canonical pericentromeric). Among the anchors containing ATTCCATTCC, this anchor has the most diverse compactors (147 compactors) ordered through multiway alignment. The heatmap shows the cell-type-specific compactor expression (in logarithmic scale), displaying read counts per cell type (collapsed across donors and tissues). (B) The 71 detected RNU6 compactors exhibit high conservation with RNU6 reference sequences, which include 1,281 gene and pseudo-gene loci across the human genome. (C) Direct and annotated-by-association RNU6 compactors show similar abundances and sequence similarities to RNU6 reference genes, indicating that the annotated-by-association compactors may be missing from the reference annotation. (D) Heatmap showing differential compactor counts per cell (skin cells from donor 2) for a compactor mapped to the pseudo-gene RNU6-6P and other non-uniquely aligned compactors from the same anchor. (E) Multiway alignments of compactors to annotated RNU6 loci demonstrate alignment both upstream and downstream of annotated boundaries.

SPLASH+ enables de-novo analysis of alternative splicing in single cells.

(A) Dot plot showing the number of distinct splicing anchors and the fraction of them representing annotated alternative splicing per donor-tissue pair, with >55% involving annotated alternative splicing events (based on T2T). Each dot represents a distinct donor corresponding to a specific tissue within the dataset. Muscle has the most number of donors (3 donors). (B) Heatmap (top) showing the concordance of splicing genes (fraction of genes called by a method that are shared in the same tissue from different donors) for SPLASH+, SpliZ (Olivieri, Dehghannasiri, and Salzman 2022), and Leafcutter (Li et al. 2018) in lung, blood, and muscle. This plot suggests that, despite not using cell metadata, SPLASH+ achieves higher concordance in all tissues compared to both Leafcutter and SpliZ. The numbers in parentheses indicate the absolute number of shared genes. Heatmap (bottom) shows the logarithmic binomial p-values (with the numbers in parentheses indicating the absolute number of genes called by both methods) for testing the significance of the number of genes called by both SPLASH+ and Leafcutter, and by both SPLASH+ and SpliZ for each donor tissue. To further emphasize on different cell type composition in tissue replicates, we used different symbols for a tissue from different donors. (C) The reproducible compartment-specific alternative splicing of GAS5 in muscle cells from 3 donors (intervals are 95% binomial confidence interval), with cd8+, alpha-beta t cells consistently showing a higher inclusion rate for the isoform with shorter intron (green isoform). (D) SPLASH+ detects extensive single-cell expression variation of eight CD47 isoforms, including novel isoforms, for a single anchor across 10 donors and 14 tissues. The heatmap, sorted by hierarchical clustering, includes a novel CD47 isoform (yellow) with annotated junctions (isoform coordinates: chr3:110767091:110771730--chr3:110771761:110775311). Cells with >5 reads are included and horizontal bars show the donor, tissue, and compartment identity for each cell. (E) Four alternative splice variants of RPS24 in lung (from donor 2) involve inclusion/exclusion of ultraconserved cassette exons, including a 3-bp microexon. SPLASH+ detects a novel isoform with only the microexon, missed by both STAR and BLAT, confirmed through multiway alignment. The heatmap shows compartment-specific usage of four RPS24 isoforms, with the isoform containing the 3-nt and 21-nt exon predominantly expressed in epithelial cells while the novel isoform predominantly expressed in type ii pneumocytes. (F) Spatial validation of RPS24 alternative splicing using 10x Visium data of lung tissue. Top panels show isoform fractions (left: isoform excluding microexon, right: isoform including microexon). The histology image (bottom) shows bronchial structures (red ellipses), with higher expression of the isoform including the microexon.

Pan-tissue alternative splicing regulation of splicing factors and histone modifications.

(A) 57% of genes (2,118) with significant splicing in at least 2 tissues, including 10 genes found in >= 18 tissues. Notably, 8 splicing factors (green) and histone modifications (brown) are present. (B) GO enrichment analysis of genes found in >15 tissues reveals pathways related to mRNA processing and splicing regulation (Fisher test, FDR corrected p-value< 0.05). (C, D) Pan-tissue intron retention diversity in HSP90AA1, the only SPLASH+ core gene found in all 19 tissues. Twelve anchors detect differential intron retention events in 7 (out of 8) HSP90AA1 introns. The compartment-specific expression of compactors corresponding to intron retention (pink) and splice junctions (green); the barplot on the right shows total read counts for intron retention and splicing isoform for each anchor. Fractional isoform distribution is depicted for the anchor with the most compactors (anchor 1) and the anchor found in the most tissues (anchor 3). (E) Number of SRA studies containing each HSP90AA1 junction reported by Pebblescout supports SPLASH+’s novel splice prediction. Junctions are grouped as annotated, unannotated, and one detected unannotated junction (between exons E1 and E4). (F) Single-cell dependent alternative splicing of gene NCOR1, involving 3 different splicing isoforms for the NCOR1 anchor found in the most tissues. (G) Plot showing the number of tissues and unique splice junctions for each gene with a splicing anchor. IL32, RBM39, and IGKC have the most unique splice junctions. (H) Alternative splicing of PRPF38B involving an intron retention and six splicing isoforms, showing one of the most diverse isoform structures with significant alternative splicing in 11 donors and 17 tissues.

SPLASH+ detects V(D)J recombination with highest levels of transcriptome diversity and achieves higher sensitivity than BraCeR.

(A) Plot showing the number of compactors and donors for each anchor, indicating that anchors with the most compactor diversity extend known biology of V(D)J recombination. The top marginal histogram shows the probability that each compactor category falls into a specific range of number of compactors per anchor. The right marginal histogram shows the probability of each category as a function of the number of donors. Multiple sequence alignment of the anchor mapping to an immunoglobulin gene with the most distinct compactors. (B) Comparison of SPLASH+ and BraCeR for spleen cells (that are analyzed by both methods) from donors 2 and 7 shows higher sensitivity for SPLASH+. Cells are called by BraCeR or SPLASH+ based on the presence of at least one BraCeR contig or IG-compactor, respectively. The x-axis shows the highest count for an IG-compactor in each cell and we also show the percentage of cells called only by SPLASH+ (green) within each cell type having at least one in-frame V(D)J transcript as confirmed by IgBLAST. (C) Cells called by both SPLASH+ and BraCeR increase with the minimum Hamming distance between IG-compactors and BraCeR contigs per cell. (D) Alignment of multiple compactors from the same anchor to IGHV3 shows split mapping to IGHV3-43-201 and IGHV3-15-201.

Single-cell-regulated RNA editing and repeat polymorphism detected de novo by SPLASH+.

(A) Multiple sequence alignment of compactors generated de novo by SPLASH+ reveals extensive cell-type regulated editing in 5’ UTR of ANAPC16. Red indicates A-to-I edits, and compactors matching the reference allele for each donor-tissue are highlighted in orange boxes. Side bar plots show total counts for each compactor in each donor tissue, with top 4 compactors in each donor tissue color-coded by shared sequences across donor tissues. Predicted miRNA binding site (Chen and Wang 2020) in ANAPC16 is disrupted by observed edits in all four donor-tissues. (B) Similar analysis for AGO2 identifies RNA editing within an Alu element including a circular RNA in the third compactor from donor 2 skin. (C) ARPC2 sequence diversity shows 20 distinct variants in a 17-bp region (positions 58-74 in the compactor sequences). Multiple sequence alignment reveals base-pair changes between the 5’ UTR and translational start (chr2: 218,703,167–218,703,183; T2T assembly). All kmer counts from donor bone marrow exceed expected sequencing error rate by orders of magnitude. (D) Sankey plot displays cell counts for primary (repeat with the highest count in cell) and secondary (repeat with the second highest count in cell) repeat lengths in donor 2 lung (donor tissue in which the anchor was called). The reference repeat length is indicated as 0 and repeat contractions are indicated relative to this reference length. Bar plot shows cell counts with reference or one repeat contraction. Each cell is counted for a repeat length if either its primary or secondary repeat matches that length. (E) Similar analysis for repeat polymorphism in VSNL1 shows cell counts for each combination of secondary and primary repeat lengths.

(A) In SPLASH, a single consensus sequence is constructed for each sample (cell) by taking the plurality of bases at each position across targets. This approach loses both within-cell diversity and count information, as it reports only one sequence and may even lead to misassembly. The consensus sequence is then aligned to the genome for gene name and alignment information. Notably, SPLASH lacks automated anchor classification, limiting its application for targeted studies on specific events such as alternative splicing. In contrast, SPLASH+ employs a branching approach to construct long assembled sequences (compactors) with assigned counts for each sample. The most abundant compactors for each anchor are then used to assign biologically relevant events (e.g., alternative splicing, V(D)J recombination) to each anchor. (B) Consensus (used in original SPLASH) fails to capture sequence diversity variation across cells as it produces only one sequence per cell. For example, for the given gene with 4 transcript isoforms (each exon is assumed to have two nucleotides as shown), the consensus sequence for both Cells 1 and 2 is isoform 1 (i1), the dominant isoform in both cells, implying no diversity variation between these cells. However, Cells 1 and 2 have different count distributions of isoforms 2, 3, and 4. For Cell 3, the consensus approach incorrectly represents isoform 4, which has the lowest count. Similarly when sequence diversity is due to SNPs, the consensus approach might fail to show diversity variation as seen with Cells 1 and 2, where both are represented by variant V1 despite their different distributions for other variants. In Cell 3, the consensus incorrectly represents a non-existent variant. We show the details of how the consensus sequence is produced for Cell 3 (for both splicing and SNP diversity).

(A) Overview of compactor benchmarking analysis using the fusion benchmarking dataset. The dataset provides the sequence and breakpoint position for each true positive gene fusion. We constructed seed anchors from the 27 base pairs immediately upstream of each breakpoint and generated compactors using these seed anchors using FASTQ files. Each compactor was assigned to a true positive gene fusion if it matched the 20-mer junctional sequence for that fusion, which was created by taking 10 base pairs on each side of the gene fusion junction. For each dataset, the number of unique gene fusions with associated compactors is counted, representing the number of detected fusions reconstructed by compactors for that dataset. (B) Number of detected fusions per simulated datasets. In total 57.8% (1339/2315) of fusions were detected by compactors.

Post-facto classification assigns distinct biologically relevant RNA events to each anchor based on its compactors’ characteristics.

Six categories are considered: splicing, internal splicing, base pair change, 3’ UTR, centromere, and repeat (ordered by priority). An anchor that cannot be assigned to any of these categories is referred to as unclassified. Reference-free categories include internal splicing and base pair change; and splicing, 3’ UTR, centromere, and repeats are assigned based on compactors’ alignment to the genome.

SPLASH+ was run on 10,326 SmartSeq2 cells from 12 donors and 19 tissues (29 donor-tissue pairs), including replicates for several tissues from multiple donors, enabling reproducibility analysis.

(A) Run time and (B) memory usage per step (extracting and counting anchor and target sequences from input sample in parallel for each input sample, stratification of anchor sequences to allow for parallelization of p-value calculation, p-value calculation in parallel for each set of stratified anchors, and multiple testing correction) in SPLASH across eight donor tissues. SPLASH, being fully parallelized, allows the initial step of parsing input reads and extracting anchor and target sequences to run concurrently for each cell. Thus, the run time for this step is presented for individual cells. Memory requirements are displayed for the two steps consuming the most memory.

(A) Compactor count per anchor in each tissue, categorized by anchor class, with splicing anchors exhibiting the lowest compactor diversity. (B) Fraction of reads assigned to each anchor category in each tissue, with base pair change anchors having the highest fraction of reads.

Single-cell-regulated expression of the anchor containing pericentromeric repeat CCATT in tongue replicates (donors 4 and 7).

Tongue basal cells consistently exhibit the highest expression in both replicates.

The barplot showing the total read count across the entire dataset for each category of noncoding RNAs.

(A, B, C) Upset plots showing the comparison of the SpliZ, Leafcutter, and SPLASH+ for detecting genes with significant alternative splicing in lung, blood, and muscle tissues from each donor. (D) Barplots show the number of splicing genes called by each method and in each donor tissue.

Number of SRA studies reported by Pebblescout for CD47 junctions, providing further evidence for the expression of the two novel splice predictions made by SPLASH+ for CD47.

The two overrepresented cell types in bronchiole epithelium (ciliated and club cells) exhibit the highest expression fraction of the isoform with the microexon (dark blue isoform in Figure 3E) in lung replicates from donors 1 and 2, which is consistent with findings in the Visium data.

Ciliated cells show 93 reads in 5 cells and 38 reads in 2 cells for donors 1 and 2, respectively, while club cells exhibit 22 reads in 2 cells for donor 2.

Single-cell dependent alternative splicing of two hnRNPs: (A) HNRNPDL and (B) HNRNPC illustrated for the anchor expressed in the most tissues for each of these genes.

Query of the editing events detected in ARPC2 using Pebblescout.

(A) Number of studies reported by Pebblescout for the 25-mers with changes in the 6 RNA editing locations, as shown in Figure 6C. (B) Pebblescout reported compactors with editing in significantly more studies than decoy events or kmers constructed with base pair changes not predicted by SPLASH+.