Coveragedependent bias creates the appearance of binary splicing in single cells
Abstract
Singlecell RNA sequencing provides powerful insight into the factors that determine each cell’s unique identity. Previous studies led to the surprising observation that alternative splicing among single cells is highly variable and follows a bimodal pattern: a given cell consistently produces either one or the other isoform for a particular splicing choice, with few cells producing both isoforms. Here, we show that this pattern arises almost entirely from technical limitations. We analyze alternative splicing in human and mouse singlecell RNAseq datasets, and model them with a probabilistic simulator. Our simulations show that low gene expression and low capture efficiency distort the observed distribution of isoforms. This gives the appearance of binary splicing outcomes, even when the underlying reality is consistent with more than one isoform per cell. We show that accounting for the true amount of information recovered can produce biologically meaningful measurements of splicing in single cells.
Introduction
Singlecell RNA sequencing (scRNAseq) has provided impressive temporal resolution to our understanding of continuous biological processes such as cell differentiation (Wagner et al., 2016; Tanay and Regev, 2017). It has uncovered hidden heterogeneity among cells and exposed the factors that determine each cell’s unique identity. One broad source of transcriptomic diversity is alternative splicing, and several studies have uncovered compelling evidence of changes in alternative splicing among single cells during differentiation (Welch et al., 2016; Qiu et al., 2017; Song et al., 2017; Huang and Sanguinetti, 2017).
A particularly surprising conclusion of several scRNAseq studies was the observation that splicing was often bimodal among supposedly homogeneous cells (Shalek et al., 2013; Marinov et al., 2014; Song et al., 2017; Westoby et al., 2018). Individual cells had binary outcomes in splicing: some cells always spliced in a particular cassette exon, and some cells never spliced in the exon, but few cells showed truly intermediate inclusion within one cell. This unexpected result contrasted with previous single molecule imaging studies of several alternative exons that showed that celltocell variability is minimized and tightly regulated by the splicing machinery in single cells (Waks et al., 2011; Maamar et al., 2013). Curiosity about this result led to investigations of mechanisms that might be responsible for stochastic splicing variability among apparently homogeneous cells, such as variation in DNA methylation (Linker et al., 2019).
We propose that the observed bimodality does not generally reflect a binary nature of splicing biology, but rather, that it exposes a technical limitation of the scRNAseq data that have been collected so far. Because alternative isoforms of a gene share much of the same sequence, only the few RNAseq reads mapping to the exact alternative splice junctions, or to the alternative exon itself, reveal its alternative splicing. When combined with the low mRNA capture efficiency of scRNAseq and the PCR amplification of small amounts of starting material into a fulllength sequencing library, these circumstances create the risk of bottlenecks that lose all but a few individual mRNAs of most genes in each cell.
The limitations of scRNAseq are a known obstacle to studying splicing in single cells (ArzalluzLuque and Conesa, 2018). Similar concerns have arisen with the use of scRNAseq to infer allelic expression; a careful analysis showed that stochastic patterns resulted from technical noise (Kim et al., 2015). A recent study observed and modeled the high dropout rate of individual isoforms in scRNAseq and advised that scRNAseq is fundamentally unsuitable for measuring changes in alternative splicing (Westoby et al., 2020). Others have implemented workarounds, for example using sequence features to predict splicing outcomes in lieu of sufficient sequencing coverage (Huang and Sanguinetti, 2017), or attempting to identify excess variance beyond technical noise (Welch et al., 2016; Linker et al., 2019). These studies have identified true examples of differential splicing in single cells, but they fundamentally do not explain how scRNAseq limitations have caused qualitative, not just quantitative, distortions in our understanding of alternative splicing.
Here, we show that scRNAseq splicing data are consistent with a simple model (Figure 1). Consider a particular cassette exon whose true pattern of exclusion follows a unimodal distribution of isoform ratios across cells (i.e. most cells express both isoforms, with a ratio revolving around the same mean). This distribution can be distorted by extreme information loss during library preparation and sequencing, creating the illusion that individual cells only produce one isoform or the other. Our simulations make it clear that the reliability of splicing measurements is a function of the initial amount of mRNA, the efficiency of its recovery, the underlying splicing rate, and further distortions from PCR amplification of cDNA. These effects should be considered when interpreting previous studies that used qualitative changes in the observed distribution of the splicing rates (Song et al., 2017) or changes in their variance (Linker et al., 2019) as evidence for regulation of alternative splicing. Considering the true amount of information available for a cassette exon can allow for accurate observations of alternative splicing. Using a data normalization and filtering method to identify cassette exons with sufficient information, we are able to draw biologically relevant conclusions about alternative splicing in single cells.
Results
Our interest in splicing regulation led us to examine alternative splicing in several single cell differentiation datasets from mice and humans that were generated with methods that recover sequence from along the full length of mRNAs. To investigate the reported high variability of splicing between cells more closely, we began by examining the splicing of cassette exons in a highcoverage mouse scRNAseq dataset (Chen et al., 2016), estimating their percent splicedin as the fraction of splice junction reads that show exon inclusion (out of all reads that cover the junction). We use $\widehat{\mathrm{\Psi}}$ to denote these estimated rates, while $\mathrm{\Psi}$ denotes the actual rate as it is in the cell. For clarity, we define a single $\widehat{\mathrm{\Psi}}$ observation (which pertains to a specific cassette exon in an individual cell) as binary if it is close to 0 or 1 (i.e. the respective cell tends to express transcripts that either include the exon or exclude it, but not both). We then describe the distribution of an exon’s $\widehat{\mathrm{\Psi}}$ across cells as bimodal when its individual values are predominantly binary, where some cells have a $\widehat{\mathrm{\Psi}}$ close to 1 (most observed transcripts include the exon) and others have $\widehat{\mathrm{\Psi}}$ close to 0 (most observed transcripts do not include the exon). Strikingly, when we inspected several exons, we saw that they had more binary outcomes in cells with fewer reads covering their splice junctions, while cells with more reads were more likely to show nonbinary $\widehat{\mathrm{\Psi}}$ values (Figure 2a). We realized that this effect of coverage may reflect a nonbinary reality, since even if both isoforms are expressed in a certain cell, the likelihood of observing both isoforms is reduced as the number of captured mRNAs decreases. In contrast, if the underlying distribution were indeed bimodal with binary modes, as previously proposed (Shalek et al., 2013; Marinov et al., 2014; Song et al., 2017), then the read coverage would have little effect on the proportion of binary $\widehat{\mathrm{\Psi}}$ observations across cells.
To further explore this phenomenon, we extended our analysis to the full scRNAseq datasets. In all cases, we found a strong effect of coverage on the observed binary $\widehat{\mathrm{\Psi}}$ in exons with intermediate splicing (average $\widehat{\mathrm{\Psi}}$ between 0.2 and 0.8). We consistently found that exons with low junction read coverage had more binary $\widehat{\mathrm{\Psi}}$ values and bimodal $\widehat{\mathrm{\Psi}}$ distributions (Figure 2b,c; Figure 2—figure supplement 1a,b). We found that the association between binary values and read coverage was not observed in exons that are binary but not bimodal (i.e. nearly constitutively excluded or included exons, with average $\widehat{\mathrm{\Psi}}$ close to 0 or 1; Figure 2c). Taken together, these observations suggest that the abundance of binary observations in exon inclusion patterns may reflect a distortion of an underlying unimodal splicing distribution (i.e. when cells in fact express both isoforms), rather than a truly bimodal splicing pattern in the analyzed cells.
We then asked if the association of binary splicing outcomes with low read coverage could be due to some biological consequence of low transcription, or if it was a technical consequence of low sequencing coverage. We found that, among the cells in a single experiment, the cells with an overall higher number of splice junction reads also tended to have a smaller fraction of exons with binary values (Figure 2d, Figure 2—figure supplement 1b), suggesting an influence of technical coverage rather than transcription level. We further considered the effect of biological variations in transcription. Transcription at a single locus occurs in intermittent bursts of RNA synthesis (Raj and van Oudenaarden, 2008), suggesting a possible effect of size and frequency of transcriptional bursts on binary splicing. Using transcriptional bursting data from Larsson et al., 2019, we found that genes with either high burst frequency or large bursts did exhibit fewer binary splicing observations (Figure 2—figure supplement 1c), but a linear regression showed that burst frequency and size did not contribute to binary observations beyond the effect of read coverage (Figure 2—figure supplement 1d). This suggests that transcriptional bursting contributes to splicing variability only by determining the overall abundance of an mRNA, consistent with singlemolecule fluorescence observations (Waks et al., 2011).
Simulations of RNA sequencing reveal technical sources of distortion of splicing estimates
A simple probabilistic exercise shows the potential loss of splicing information during sequencing. Single cell RNAseq experiments that capture fulllength transcripts have an estimated capture efficiency of only ∼10%, due to RNA degradation and inefficient reverse transcription (Marinov et al., 2014; Grün et al., 2014; Ziegenhain et al., 2017; Qiu et al., 2017). For instance, a gene that expresses 20 mRNA molecules in a cell might only have two mRNAs recovered, and if that gene is alternatively spliced with a true splicing rate $\mathrm{\Psi}$ of 0.5, there is approximately a 50% chance that those few recovered mRNAs will only represent one of the two isoforms that were originally present in the cell (Figure 3—figure supplement 1a,b). As many genes are expressed at just a few RNA molecules per cell, low recovery might affect many alternative splicing events (Shapiro et al., 2013; Zenklusen et al., 2008). Furthermore, while the empirically observed $\widehat{\mathrm{\Psi}}$ provides a maximum likelihood estimate for the true splicing rate, the uncertainty of this estimate (i.e. the range of alternative values with a nearly similar likelihood) decreases substantially with the number of observed molecules (Figure 3—figure supplement 1c and Materials and methods). As a result, the probability that the observed $\widehat{\mathrm{\Psi}}$ is close to the true underlying $\mathrm{\Psi}$ increases when more mRNA molecules are captured (Figure 3—figure supplement 1d).
Our theoretical reasoning above relied on a simple model where the number of observed mRNA molecules (rather than number of reads) is known and the only distorting factor is a limited capture efficiency. In practice, both of these assumptions are challenged due to additional factors, such as PCR amplification and variability in the capture efficiency across cells. To investigate the pertinent effects on $\widehat{\mathrm{\Psi}}$ distributions in this more complex setting, we designed a probabilistic simulator of alternative splicing in single cells (Figure 3—figure supplement 2). The model has two main components: we begin by simulating the underlying molecular content of each cell, by drawing gene expression levels and cassette exon splicing rates from a probabilistic model of cell state. We then simulate the technical process of extracting data from each cell using singlecell RNA sequencing with full transcript coverage. This part accounts for variability in capture rates, and the effects of PCR amplification, fragmentation and sequencing. It relies on SymSim, a simulation tool for singlecell RNA sequencing data (Zhang et al., 2019). The final product of our simulation is the number of splice junction reads that either span or skip each exon in each cell. These numbers are distorted in a way that reflects real nuisance factors. For instance, two reads could have originated from the same molecule due to amplification effects.
We used our simulator to investigate how the observed inclusion ($\widehat{\mathrm{\Psi}}$) of cassette exons differs from the underlying $\mathrm{\Psi}$, under different average capture rates, and setting the other technical parameters to values that are characteristic of Smartseq2 datasets (see Materials and methods). We considered either a binarybimodal regime of $\mathrm{\Psi}$ (i.e. both isoforms are expressed in the population, but rarely by the same cell; Figure 1a), or a nonbinary regime (cells tend to express both isoforms; Figure 1b). We simulated splicing of cassette exons in 1500 genes, in a population of 300 single cells. In the bimodal simulation, 500 exons were modeled to have alternative splicing with a bimodal distribution across cells, and in the unimodal simulation, these 500 exons were modeled to have a unimodal distribution. To reflect the patterns seen in real data, we also simulated splicing of 500 exons that were nearly constitutively included and 500 that were nearly constitutively skipped.
As expected, in the binarybimodal simulation, the observed $\widehat{\mathrm{\Psi}}$ reflected the underlying process well, independent of the average capture efficiency (Figure 3a). In contrast, when we modeled a nonbinary splicing regime, the observed $\widehat{\mathrm{\Psi}}$ distributions were strikingly similar to the splicing distributions of cassette exons in real singlecell RNAseq datasets (Figure 3b). Specifically, the loss of information due to mRNA recovery and library generation led many of the observed $\widehat{\mathrm{\Psi}}$ to become binary, and their observed distribution across cells to become bimodal. This tendency again correlated with coverage, whereby lowly covered exons showed the strongest effect, while exons with high coverage maintained a nonbinary, unimodal distribution. Consistently, in this nonbinary regime, the average of $\widehat{\mathrm{\Psi}}$ was similar to the true average of $\mathrm{\Psi}$, but the variance of $\widehat{\mathrm{\Psi}}$ increased (Figure 3—figure supplement 1e,f). Furthermore, as in the real data sets (Figure 2c), we also found that the dependency between read coverage and the chance of observing a binary $\widehat{\mathrm{\Psi}}$ is more pronounced in exons with an underlying $\mathrm{\Psi}$ that is far from binary (Figure 3c–f), highlighting again that such an association likely indicates an artifact.
To address the extent of the distortion from mRNA recovery, we ran the simulator with varying capture efficiency and a fixed underlying $\mathrm{\Psi}=0.5$. We found that decreasing the average capture efficiency dramatically increased the number of binary $\widehat{\mathrm{\Psi}}$ observations, particularly for exons with low expression, although even highly expressed genes suffered great distortion in the observed $\widehat{\mathrm{\Psi}}$ when the average capture efficiency was very low (Figure 3g). These results reinforced the hypothesis that capture efficiency is a main technical factor that creates the appearance of bimodality in singlecell splicing.
Finally, we estimated the chance of observing only one type of isoform in any given cell (i.e. a binary $\widehat{\mathrm{\Psi}}$) as a function of the underlying $\mathrm{\Psi}$ and the number of transcripts that are present in the cell (Figure 3h). For this analysis, we set an average mRNA capture rate of 10%. Our results delineate the range of values in which an artifact is less expected. For instance, for an exon with 50% inclusion rate, a nonbinary estimate is more likely if the respective gene has at least 50 transcripts in the cell. Notably, these estimates are more conservative than the theoretical analysis (Figure 3—figure supplement 1a,b), due to the effect of the technical nuisance factors we modeled in this simulation.
Accounting for mRNA recovery improves analysis of alternative splicing
We sought criteria that would identify reliable measurements of splicing in single cells and avoid distortions from low mRNA recovery. While ideally we should rely on the actual number of mRNA molecules recovered, fulllength RNAseq experiments generally do not report an absolute mRNA count. Previous studies assessed the quality of splicing observations based on the number of reads covering alternative splice junctions (Song et al., 2017; Linker et al., 2019), but the number of splice junction reads is influenced by the extent of PCR amplification and sequencing depth, and so it does not directly reflect the number of recovered mRNAs.
We realized that we could estimate the number of mRNA molecules that were captured into cDNA by using the Census normalization approach proposed by Qiu et al., 2017. This method infers a percell scaling factor between the relative abundance of each transcript, inferred from RNAseq, and the actual number of mRNAs recovered. We found that some datasets with many reads per cell, such as the Song dataset (Song et al., 2017), nonetheless had few mRNAs recovered per cell (Figure 4a, Figure 4—figure supplement 1c), which may explain the extensive splicing bimodality in this dataset. The dataset with the highest recovery of mRNAs (Chen et al., 2016) indeed showed less binary splicing (Figure 2b).
Next, we sought a threshold for mRNA recovery that would suppress spurious observations of binary splicing. Our simulations and our analytical calculations suggested that, for a capture efficiency of 10%, recovering 7–10 mRNA molecules would be more likely than not to capture both isoforms for exons with intermediate splicing (Figure 3h), and that the observed $\widehat{\mathrm{\Psi}}$ would most likely be within 0.1 of the real value (Figure 2—figure supplement 1d). In keeping with this, we saw in the real data that exons with an average of at least 10 mRNAs recovered per cell generally had substantially fewer binary observations (Figure 4b, Figure 4—figure supplement 1d; limited to exons with average $\widehat{\mathrm{\Psi}}$ between 0.05 and 0.95). Nonetheless, a subset of these exons still showed binary splicing in most cells. These exons had few splice junction reads relative to the estimated mRNA count. We expect that these represent anomalously low recovery of reads from the specific splice junctions of interest, perhaps due to annotation errors or poor recovery of fragments with particular sequence composition. We reasoned that a datadriven splice junction read criterion would prevent distortions arising from this low read recovery. To find an appropriate read minimum, we determined the number of splice junction reads expected to arise from a cassette exon in each cell in different datasets, based on that cell’s overall mRNA recovery (Figure 4c; see Materials and methods). This provided a second filtering criterion: we excluded exons with fewer splice junction reads than would be expected from 10 mRNAs, given that exon’s $\widehat{\mathrm{\Psi}}$ and the coverage rate in that particular cell. This metric is calculated for each exon separately, driven by the actual information in each cell, and it varies substantially between datasets.
We then selected the exons from each cell that met the combined requirement of 10 mRNAs and the splice junction reads that would arise from 10 mRNAs, and asked what effect this filter had on binary observations of splicing. As an example, our filter removed seemingly spurious binary observations for alternatively spliced exon 8 of Cadm1 (Figure 4d). The observations that remained after filtering showed a clear pattern of increased inclusion along pseudotime, with exon skipping in stem cells and exon inclusion in fully differentiated neurons (Spearman’s ${r}_{s}$ = 0.52, p=1.2 × 10^{7}; Figure 4e). The correlation between splicing observations and differentiation pseudotime highlighted that cell type differences could contribute to observations of bimodality that are not the result of an artificial inflation of binary observations, and this was an important factor to consider in our analysis.
Bimodality in filtered exons is explained by differential splicing between cell types
Earlier papers had reported a surprising amount of bimodal splicing within a cell type, with extensive binary outcomes in individual cells, that is similar cells with different dominant isoforms. This suggested that some aspects of gene expression stochastically locked individual cells into producing primarily molecules of one or the other isoform. However, our results so far suggest that much of this observed variability may be an artifact of low mRNA recovery. In contrast, differences in splicing between cells of different types are consistent with conventional mechanisms such as regulation by celltypespecific splicing factors. Therefore, when considering heterogeneous cell samples, large shifts in splicing between cell types may result in truly bimodal $\mathrm{\Psi}$. In contrast to the first scenario, we would expect the corresponding observations $\widehat{\mathrm{\Psi}}$ to remain bimodal even after lowquality data points were removed.
We therefore sought to distinguish between these two scenarios and estimate the extent of bimodal splicing within a cell type that is strongly supported by the data. Most of the datasets in our analysis came from differentiation time courses, and so to examine homogeneous cell subsets, we stratified the cells into groups indicative of their developmental stage. The groups were identified by clustering the cells using the normalized sum of reads from every gene (i.e. not directly representing their variation in splicing; Figure 4f) and labeled based on expression of known marker genes (Figure 4—figure supplement 1e).
We found that, prior to filtering, hundreds of exons with intermediate splicing (average $\widehat{\mathrm{\Psi}}$ between 0.2 and 0.8) seemed to have at least weakly bimodal splicing distributions (using a heuristic definition of at least 25% of cells with $\widehat{\mathrm{\Psi}}\le $ 0.25% and 25% with $\widehat{\mathrm{\Psi}}\ge $ 0.75), a rate that roughly matched previous descriptions (Song et al., 2017; Table 1). We then discarded the observations that did not meet our filtering criteria, and retained for every cluster only those exons with observations remaining in at least 50% of cells (Figure 4g). With that filtering we found that most remaining exons had fewer binary observations than the discarded exons (Figure 4—figure supplement 1f). Furthermore, cell clusters had no or very few remaining exons with bimodal splicing distributions (Figure 4h, Figure 4—figure supplement 1g,h, Table 1). For instance, in the Chen dataset, there were two remaining cases of bimodality, and both seem to be the result of sexspecific splicing (Figure 4—figure supplement 1i). This trend persisted when reanalyzing all datasets and clusters using a range of cutoffs for qualifying a splicing pattern as binary (Figure 4—figure supplement 1e).
Filtering improves the identification of differentially spliced exons
Finally, we asked if our filtering criteria could help identify biologically relevant changes in splicing between cell types, by selecting exons whose biological variance is not overwhelmed by high technical variance. Focusing on three datasets with multiple cell type clusters (Chen, Trapnell, and Song), we examined the effect of retaining only exons that passed our filter in at least 50% of cells in each cluster. To avoid biases depending on the number of observations available for each remaining exon, we retained all data points for these exons, rather than removing individual cell data points that did not pass the filter. As a benchmark for comparison, we used a simpler criterion, akin to previous studies, of at least 10 splice junction reads in at least 50% of cells (Figure 4i). To focus on exons with substantial alternative splicing, we omitted exons from this analysis that were consistently excluded (average $\widehat{\mathrm{\Psi}}$ < 0.05) or included (average $\widehat{\mathrm{\Psi}}$ > 0.95) across all cells.
Taken together, these results challenge the idea that widespread bimodality among homogeneous cells arises from a binary nature of splicing in single cells, and instead suggest that splicing bimodality reflects biological differences between cell types or subtypes. Indeed, experimental observations of splicing bimodality support the importance of differences between cell types, rather than stochastic differences between homogeneous cells. The two cases of splicing bimodality confirmed with smFISH by Song et al., 2017 both showed bimodality between two characterized cell types. Similarly, Shalek et al., 2013 confirmed one case of splicing bimodality, in a gene whose expression they showed to be a marker of a hidden difference between mature and maturing cells, and so its splicing may reflect cell state difference at a wider context as well.
For each remaining exon in the two filtering schemes, we performed a Kruskal–Wallis oneway analysis of variance, which assigned it a pvalue (Materials and methods). This test determines if the exon’s median $\mathrm{\Psi}$ changed significantly between any clusters. Both filtering schemes enriched substantially for exons that show significant changes across different clusters, compared with the unfiltered set of all exons with average $\widehat{\mathrm{\Psi}}$ between 0.05 and 0.95. Furthermore, the extent of enrichment increased with the strictness of the differential splicing test (Figure 4i–k; Figure 4—figure supplement 2). It is interesting to note that in the Chen dataset, the baseline (ten reads) filter was stricter than the combined filter, retaining 66 vs. 198 exons and thus achieving high precision at the expense of a lower recall. Indeed, due to the high mRNA recovery and low amplification in this dataset, our combined filter required only seven splice junction reads, rather than ten as in the baseline scheme. Conversely, in the Song dataset, which is predicted to have low mRNA recovery yet a large number of reads, the baseline filter retained almost all exons, while the combined filter retained only 104 of them, leading to significant overrepresentation of exons that are differentially spliced between clusters. The results show how differences in mRNA recovery and amplification between datasets may change the interpretation of an absolute number of reads. We corroborated these results using spatial autocorrelation as a way of identifying informative exons, rather than the clustering and differential splicing analysis. The autocorrelation test builds on the work of DeTomaso et al., 2019 to identify exons with a significantly high level of consistency in $\widehat{\mathrm{\Psi}}$ among transcriptionally similar cells (Figure 4—figure supplement 3).
Taken together, these results indicate that careful filtering of exons can help identify observations that are consistent with the gene expression space and the stratification of cells into types, and are thus likely more biologically meaningful (e.g. capturing known forms of regulation during differentiation) (Figure 4l, Figure 4—figure supplement 4; Liu et al., 2018). The concept of ‘meaningful’ is ascribed here to the observations, not the exons themselves, as there may be many additional exons with relevant variation (e.g. stable differences between cell types) that is not reliably measured due to low coverage. Indeed, the effects of the two filtering schemes and the differences between them exemplify how technical factors may distort our view of the data and consequently, our understanding of variation and regulation of splicing.
Discussion
The surprising result that alternatively splicing is bimodal among single cells provoked curiosity and speculation. Bimodal outcomes might reflect hidden cell subtypes, but the bimodality was seen even among apparently homogeneous cells. Did splicing outcomes reflect some unknown, stochastic cell state?
We have shown here that the bimodal patterns could have an entirely different explanation: profound technical limitations of single cell RNA sequencing. A crucial limit on biologically meaningful splicing observations in a single cell is the number of mRNAs available to inform the measurement. This is determined both by the expression of the genes that contain the exons of interest, and by the capture efficiency of the experiment. It is important to note that the depth of a sequencing library does not necessarily reflect its quality. Along with low mRNA numbers, splicing observations are also distorted by uneven amplification efficiency and cDNA overamplification. Increasing PCR amplification cycles in an attempt to compensate for low capture efficiency has the risk of worsening the technical distortion. Indeed, in our analysis, the dataset with the highest read count per cell actually had quite low mRNA recovery and large technical distortion, creating an appearance of bimodal splicing (Song et al., 2017). Moreover, a qualitative change in the observed $\widehat{\mathrm{\Psi}}$ distribution of an exon between single cell subpopulations does not necessarily reflect a change in the underlying splicing rate, as changes in gene expression and mRNA recovery between samples can create the illusion of a splicing change.
Further developments in statistical analysis that carefully account for both missing and redundant information due to low capture efficiency could make splicing observations in single cells more reliable. We set the foundation for such analysis by proposing a probabilistic process that describes the biological and technical steps that generate single cell splicing data. We also introduced a simple approach that builds on the Census normalization (Qiu et al., 2017) to estimate the number of mRNAs recovered and the extent of artificial duplication of splicing information. This metric provides a practical filter for identifying exons with sufficient information to analyze. On the experimental side, improving the capture efficiency of scRNAseq methods while moderating the extent of overamplification is crucial for increasing the subset of exons for which reliable observation can be made.
True biological insight into alternative splicing can indeed be found from highquality scRNAseq data, and we hope that new methods will allow better understanding of splicing regulation, celltocell variation, and the importance of alternative splicing in defining cell fate (HagemannJensen et al., 2020; Gupta et al., 2018). However, some limitations are inherent to the situation. Single cells express a limited number of mRNAs per gene; splicing observations in single cells will always be inherently noisy reflections of the underlying biology.
Materials and methods
Analysis code is available at https://github.com/lareaulab/sc_binary_splicing. (Buen Abad Najar and Lareau, 2020; copy archived at https://github.com/elifesciencespublications/sc_binary_splicing).
Analysis of singlecell RNAseq datasets
Datasets
Request a detailed protocolSix publicly available single cell RNA sequencing datasets were analyzed (Table 2). These datasets are referenced with the first author’s name throughout this paper. For the Chen dataset, we selected the four cell types used to represent developmental stages in Chen et al., 2016: mouse embryonic stem cells cultured in 2i (ES2i) and LIF (ES), mouse EpiStem cells (Epi) and neurons. For the Lescroart dataset, we limited the analysis to the cells derived from mouse wildtype strains. For the Trapnell dataset, we only selected the runs that are annotated to have one cell per well.
Alignment, TPM quantification and $\mathrm{\Psi}$ estimation
We aligned the reads of each dataset using STAR 2.5.3 (Dobin et al., 2013) with twopass mode and index overhang length adapted to the read length of each dataset. We used the hg38 genome annotation for the human RNAseq datasets, and the mm10 annotation for the mouse datasets. Gene expression levels in transcripts per million (TPM) were calculated by running RSEM (Li and Dewey, 2011) on the BAM files produced by the STAR alignment. We ran rMATS 3.2.5 (Shen et al., 2014) on bulk human and mouse RNAseq datasets from cell types matching the scRNAseq datasets (Chen et al., 2016; Hubbard et al., 2013; Busskamp et al., 2014; Trapnell et al., 2014) to find all annotated cassette exon alternative splicing events in each cell type. Then we used the SJ.out.tab files obtained from the scRNAseq STAR alignment to obtain the splice junction reads compatible with the list of cassette exons found by rMATS. For each cell $i$, we calculated the observed $\hat{\mathrm{\Psi}}$ of the cassette exon $j$ as:
where ${\text{SJ}}_{{A}_{ij}}$ correspond to the number of reads that cover the two splice junctions compatible with cassette exon inclusion, and ${\text{SJ}}_{{B}_{ij}}$ are the reads that cover the splice junction compatible with its exclusion. We also determined the coverage of an exon $j$ in $i$ as ${\text{SJ}}_{ij}={\text{SJ}}_{{A}_{ij}}+{\text{SJ}}_{{B}_{ij}}$. We used ${\text{SJ}}_{ij}$ and ${\widehat{\mathrm{\Psi}}}_{ij}$ for the analyses shown in Figure 2.
Gene expression normalization and pseudotime inference
Request a detailed protocolFor the purpose of visualization and clustering of cells, we normalized the gene expression data. First, we selected the genes with TPM > 20 in at least 20 cells. After filtering, we used SCONE 1.6.1 (Cole et al., 2019) to select the best normalization approach for the data. For improving the normalization of the data, we used additional information for each cell, including the annotated cell type and batch, total number of reads, housekeeping genes and genes that are expected to change in the biological process that the dataset covers. We applied principal component analysis (PCA) over the logcounts from the best SCONE normalization, and used the first two principal components to infer pseudotime using Slingshot 1.0.0 (Street et al., 2018). We used the cell type annotation as the cluster input for Slingshot, and manually indicated the direction of the biological process. Single cells can be highly heterogeneous, even within cells labeled to be in the same biological condition. To account for this variation, instead of relying on the annotated cell type provided with each dataset, we used agglomerative clustering over the PCA projection of the matrix of normalized gene expression to identify groups of similar cells in the Chen, Trapnell and Song datasets. For the Trapnell and the Song dataset, we used the number of cell types reported by the authors. In the Chen dataset, we noticed that the cells annotated as neurons could form two welldefined clusters, with some of the cells presenting expression profiles consistent with mature neurons. For this reason, we used five instead of four clusters in the Chen dataset. In the Lescroart dataset, we used the original labels provided by the authors because PCA over the normalized expression data separates the cells into two groups. This analysis was not performed in the Shalek dataset due to its small number of cells. Instead, we treated all the cells in this dataset as one single cluster. The Fletcher dataset was not considered for the analysis that required clustering due to the low number of exon observations that pass the minimum mRNA requirements described below.
mRNA counts estimation with the Census approach
Request a detailed protocolWe performed our own implementation of the mRNA count estimation proposed in Census (Qiu et al., 2017). The total number of transcript mRNAs in cell $i$ is estimated as:
where ${x}_{i}^{*}$ is the mode of the logtransformed distribution of TPM values in cell $i$. As in Census, we found ${x}_{i}^{*}$ by fitting a Gaussian kernel density estimation to each distribution and finding its peak. We also set 0.1 as is the minimum TPM below which it is assumed that no mRNA is present. ${n}_{i}$ is the number of genes in cell $i$ with an estimated TPM in the interval $(0.1,{x}_{i}^{\ast})$. $F}_{{X}_{i}$ is the cumulative distribution function of the TPM values in cell $i$. The original Census implementation also adjusts the mRNA estimation by multiplying by $\frac{1}{\theta}$, where $\theta $ is the capture efficiency of the dataset estimated with RNA spikeins. Since for most datasets, we do not have a reliable way of estimating the capture efficiency, we removed this adjustment from the equation, so that ${M}_{i}$ in our estimation is not an estimation of the amount of mRNAs present in the cell lysate as it is in Census, but an estimate of the mRNAs successfully captured into cDNA.
We found that some datasets contained outlier cells with ${M}_{i}$ much higher than the median estimate (more than tenfold increases). These outliers generally correspond to cells with a multimodal TPM distribution. An inflation in very low TPM values distorts the normalization by shrinking the values of ${F}_{{X}_{i}}({x}_{i}^{*}){F}_{{X}_{i}}(\u03f5)$, thus inflating the ${M}_{i}$ values in these cells. Because the Census method relies on a Gaussian kernel density estimation that performs inaccurately for multimodal distributions, we excluded this handful of outliers from further analysis. In the Trapnell dataset, we found that several cells had an unusually small number of recovered reads and genes observed. We reasoned that this had the potential to skew the TPM estimations on which the Census normalization depends (Figure 2—figure supplement 1b), so we excluded the cells in the bottom 25% quantile of reads from the Census normalization and all downstream analyses that depend on it.
Finally, the number of mRNA transcripts of gene $g$ in cell $i$ is calculated as:
where ${X}_{ig}$ is the expression of gene $g$ in cell $i$ expressed in TPM.
Nucleotide coverage and expected splice junction reads
Request a detailed protocolAmplification in shortread library preparation can lead to multiple reads from the same sequence fragment of a single mRNA molecule. To filter out exons with anomalously low read coverage, we wanted to know the number of splice junction reads expected to originate from one exon junction in a single mRNA, a number which differs in each cell and each experiment. We estimated the splice junction coverage rate of each cell as the expected number of reads covering the splice junction of a mRNA molecule:
where ${C}_{j}$ is the splice junction coverage rate, which is the number of reads expected to cover each splice junction in cell $j$; $r}_{jk$ is the number of reads that map to all constitutive splice junctions of gene $k$ in cell $j$, as reported by STAR; ${j}_{k}$ is the number of constitutive splice junctions in a mRNA molecule of gene $k$; and ${m}_{jk}$ is the Census estimate of captured mRNAs of gene $k$ in cell $j$. To calculate the ${C}_{j}$ of each cell, we only considered the constitutive splice junctions of genes that were estimated to be expressed in at least one mRNA molecule by the Census normalization. This calculation captures the slight undercounting of splice junctions (relative to other positions) because of factors including ambiguous read mapping.
We identified the constitutive splice junctions of the human and mouse transcriptome using the Gencode hg38 and mm10 annotations respectively. For each organism, we identified the splice junctions that occur in all protein coding isoforms of each protein coding gene. This resulted in a total of 59,477 constitutive splice junctions in the human genome, and 108,481 in the mouse genome.
Then, based on the overall recovery of splice junction reads in a cell, the total expected splice junction reads for a particular cassette exon $i$ in cell $j$ is estimated as:
where ${\text{SJ}}_{{E}_{ij}}$ is the expected number of splice junction reads covering the splicing of exon $i$ in cell $j$ (both for mRNAs that splice in or skip the exon). ${m}_{ij}$ is the estimated number of mRNAs from the gene containing the cassette exon $i$ in cell $j$; $\mathrm{\Psi}}_{ij$ is the observed splicing rate of exon $i$ in cell $j$. The expected number of splice junctions per mRNA is $1+{\widehat{\mathrm{\Psi}}}_{ij}$ because one splice junction read is present in mRNA molecules that skip the exon, and two in those that include it.
Filtering to select good quality observations
Request a detailed protocolSimulations of the effect of the initial number of mRNA molecules of a gene and the underlying $\mathrm{\Psi}$ suggest that an average of 44 mRNA molecules are necessary to have a 50% chance of making an intermediate $\mathrm{\Psi}$ observation when the underlying $\mathrm{\Psi}$ is 0.5. This number goes up to 65 if the $\mathrm{\Psi}$ is 0.2 or 0.8, and to 127 if the $\mathrm{\Psi}$ is 0.1 or 0.9 (Figure 4—figure supplement 1f). Assuming a capture efficiency of 10%, we rounded at 10 captured mRNA molecules as the lower threshold for a quality $\mathrm{\Psi}$ observation.
In some cases, the number of observed splice junction reads is discordant with the estimated number of mRNAs recovered. Therefore, we set a additional filter based on the number of reads expected to come from 10 mRNA molecules that are informative about the splicing of a cassette exon:
Therefore, for every observation, we required at least 10 mRNAs of the gene captured, and at least the number of reads that we expect if 10 mRNAs are informative. Notice that this minimum will be unique to each observation (combination of cassette exon and cell), as it depends on the cellspecific coverage rate, and the cell and exon specific observed $\widehat{\mathrm{\Psi}}$.
KruskalWallis test and filter evaluation
Request a detailed protocolWe evaluate the significance of splicing changes between different cell types, using the clusters defined above, for the Chen, Trapnell, and Song datasets. For each exon, we asked if its median $\widehat{\mathrm{\Psi}}$ is different between clusters. In order to do this, we grouped the exon’s $\widehat{\mathrm{\Psi}}$ observations by cluster and ran the Kruskal–Wallis test, which is a nonparametric oneway analysis of variance across all clusters at once. We reported the pvalue from each exon. A significant result indicates that the exon has a significantly different median $\hat{\mathrm{\Psi}}$ in at least one cluster relative to at least one other cluster, indicating cell typeassociated differential splicing. A nonsignificant result from the test suggests that the exon’s median $\widehat{\mathrm{\Psi}}$ is not significantly different between any pairs of clusters. The strictness of the differential splicing test was determined by setting different pvalue thresholds.
Autocorrelation test
Request a detailed protocolWe reasoned that exons with reliable estimation of their splicing rate would tend to have similar observations ($\widehat{\mathrm{\Psi}}$) in cells that are transcriptionally similar to each other. To quantify this, we adapted the autocorrelation test described in DeTomaso et al., 2019 to compute, for every exon, the similarity in $\widehat{\mathrm{\Psi}}$ amongst neighboring cells in the space of the top two principal components of the gene expression space.
To calculate the autocorrelation score of one exon, first we normalized the $\widehat{\mathrm{\Psi}}$ as follows:
where ${\widehat{\mathrm{\Psi}}}_{ij}^{\prime}$ is the normalized ${\widehat{\mathrm{\Psi}}}_{ij}$ of exon $i$ in cell $j$; $\overline{\mathrm{\Psi}}}_{j$ is the average observed $\widehat{\mathrm{\Psi}}$ of all the tested exons in cell $j$, and $\text{Var}(\widehat{{\mathrm{\Psi}}_{j}})$ is the variance of all observed $\widehat{\mathrm{\Psi}}$ in cell $j$.
For each cell $j$, we identified all its $k$nearest neighbors in the PCA projection of the normalized gene expression. For each neighbor $k$, we calculated a similarity score as follows:
where ${d}_{jk}$ is the Euclidean distance in the PCA projection between cell $j$ and its neighbor $k$; $\sigma}_{j$ is the Euclidean distance from cell $j$ to its farthest $K$nearest neighbor. For all cells $k$ that are not neighbors of $j$, we set ${w}_{jk}=0$. For each analyzed dataset, we set $K$ to be half the total number of cells. The smartseq and smartseq2 datasets that we analyzed here contain orders of magnitude fewer cells than the UMIbased datasets analyzed by DeTomaso et al., and we reasoned that the default setting of $K=\sqrt{N}$ might result in neighborhood sizes too small to ensure stability in some datasets. In the datasets with fewer cells, $K=\frac{N}{2}$ agreed better than $K=\sqrt{N}$ with the results of the KruskalWallis test (Jaccard indexes: 0.52 and 0.45 respectively in the Song dataset, 0.43 and 0.28 in the Trapnell dataset, and 0.51 and 0.54 in the Chen dataset).
For each exon $i$, we calculate the autocorrelation score as a variation of the Geary’s C statistic as:
where $N$ is the total number of cells in the dataset, and $W$ is the sum of all ${w}_{ij}$.
For each exon, we calculated an empirical pvalue. To avoid bias from missing data and different splicing rates, we binned all the exons by average $\widehat{\mathrm{\Psi}}$ and percent of missing values. We grouped exons with average $\widehat{\mathrm{\Psi}}$ of: {(0.05−0.1&0.9−0.95), (0.1−0.2&0.8−0.9), (0.2−0.3&0.7−0.8), (0.3−0.4&0.6−0.7), (0.4−0.6)}. We also binned exons with missing observations between: {(50−60%), (60−70%), (70−80%), (80−90%), (90−100%)}.
For each bin, we randomized the values of the exons across the cells 20,000 times. We then calculated the autocorrelation score for all the randomized exons.
For each exon, we calculated its pvalue as $p=\frac{x+1}{20,001}$, where $x$ are all the randomized exons of the same bin with a higher autocorrelation score.
Splicing change analysis and filter evaluation
Request a detailed protocolWe evaluated if a subset of exons is enriched for exons with low pvalues indicating significant change. We repeated this analysis for the pvalues from the KruskalWallis test and from the autocorrelation test. For each $x$ as a threshold of low pvalues, we define:
$M$ as the set of all exons.
$m$ as the set of exons with pvalue $\le x$.
$P$ as the exons in the selected subset.
$p$ as the exons in the subset with pvalue $\le x$.
We then calculate the fold enrichment as follows:
We then use the hypergeometric test to determine if the subset of selected exons is significantly enriched for exons with pvalues below the threshold. In our analysis, we tested $x$ in a range between 0.1 and 0.00001, and corrected the pvalues of the hypergeometric test for multiple testing using the BenjaminiHochberg correction.
In addition, we define:
$p$ as true positives.
$\mathrm{\neg}m$ AND $\mathrm{\neg}P$ as true negatives.
$\mathrm{\neg}m$ AND $P$ as false positives.
$m$ AND $\mathrm{\neg}P$ as false negatives.
Using these definitions, we calculate the precision, recall, specificity, accuracy and F1 score of the filter for every $x$.
Linear regression on transcriptional burst kinetics
Request a detailed protocolTranscriptional burst kinetics parameters (burst frequency and burst size) of mouse embryonic stem cells were obtained from Larsson et al., 2019. The authors modeled these parameters from cells from the Chen dataset (Chen et al., 2016). We selected 619 intermediate exons observed in the ES2i cells that have binary observations (i.e., $\widehat{\mathrm{\Psi}}$ = 0 or 1) in between 1% and 99% of the cells. For each exon, we calculated the logit of the proportion of cells that present binary observations as the target variable. Additionally, we matched each exon to three predictive features: the transcriptional burst size of its gene, the transcriptional burst frequency of its gene, and its expression, represented as the average number of informative splice junction reads that cover the exon. Each predictive feature was transformed by ${\mathrm{log}}_{10}+1$, and all variables were scaled to its standard score. We trained a linear regression model to predict the logit of the proportion of binary exons, using several combinations of the previously described predictive features:
logit$(\mathrm{\Psi})\sim $ burst size
logit$(\mathrm{\Psi})\sim $ burst frequency
logit$(\mathrm{\Psi})\sim $ size + freq + size · freq
logit$(\mathrm{\Psi})\sim $ expression
logit$(\mathrm{\Psi})\sim $ size + freq + exp + size · freq + size · exp + freq · exp
We evaluated each model by calculating the ${R}^{2}$ score between the logit$(\mathrm{\Psi})$ and the regression prediction.
Theoretical analysis of the observed $\widehat{\mathrm{\Psi}}$ with limited capture rate
Request a detailed protocolmRNA molecules are captured at a limited rate, approximated in some instances as 10% of the molecules in the cell. Under the assumption of uniform sampling of transcripts and isoforms, and assuming the only nuisance factor is the limited capture rate, we formalize the probability for observing a splicing ratio $\widehat{\mathrm{\Psi}}$. We start by specifying this probability, assuming that we know the total number of transcripts from the respective gene in the cell ($m$), the real splicing rate $\mathrm{\Psi}$ and the number of captured molecules $r$ (assuming that for any capture molecule we know if it includes the exon or not). In that case:
Note that for this calculation, the capture efficiency ($c$) is not needed, since we assume that we know $m$ and $r$. For a more useful analysis, we will next assume that only one of these variables is not known (starting with $m$ and then $r$).
In a more realistic scenario, $r$ and $c$ can be estimated (e.g. using Census), while $m$ remains unknown. We can therefore marginalize $m$ to calculate:
To estimate $\mathrm{Pr}(m\mid r,c)$ we note the following:
where we model the probability of capturing $r$ mRNA molecules as a binomial sample from $m$ with probability $c$. Note that the third transition is done under the assumption of a uniform prior on $m$.
To compute the denominator, we expand:
Thus,
In the last equation, we estimate the sum going only up to a large value of $m$, since its posterior probability diminishes. In expectation $m\approx r/c$. We use ten times this value as the maximum.
We can use this equation to estimate the expected proportion of binary $\widehat{\mathrm{\Psi}}$ observations ($\widehat{\mathrm{\Psi}}$=0 or 1) that is expected when we observe only $r$ junctions from a splicing event with a given true rate $\mathrm{\Psi}$ (Figure 3—figure supplement 1a). We can also estimate the chance to have an empirical $\widehat{\mathrm{\Psi}}$ that is at least within a certain delta (in absolute terms) from the real $\mathrm{\Psi}$. Namely, we can estimate $\mathrm{Pr}(\mid \widehat{\mathrm{\Psi}}\mathrm{\Psi}\mid <\delta \mid \mathrm{\Psi},r,c)$ (Figure 3—figure supplement 1d).
In another calculation of interest, one can ask how many mRNA molecules should a gene have in a cell in order to correctly estimate the splicing rate, under a limited capture efficiency $c$. To estimate it, we denote by $d$ a binary variable indicating that the gene has been detected (i.e. $r>0$) and marginalize $r$ in the following way:
Where
and
We use this in Figure 3—figure supplement 1b to plot the chances to see only one isoform (binary $\widehat{\mathrm{\Psi}}$) for a fixed $\mathrm{\Psi}$ (set to 0.5) as a function of the number of molecules present in the cell ($m$).
Probabilistic simulator of splicing in single cell data
1) Biological process. We simulate the expression of 1500 hypothetical genes in 300 cells using SymSim, an in silico simulator of gene expression in single cells (Zhang et al., 2019); the expression of gene $g$ in cell $i$ is annotated as ${X}_{i}$. We simulate one cassette exon $j$ for each gene $g$. For each cassette, exon $j$ in cell $i$, we simulate an underlying splicing distribution ${\mathrm{\Psi}}_{ij}$ as a Beta distribution with exonspecific parameters ${\alpha}_{j}$ and ${\beta}_{j}$. The splicing of $j$ in $i$ is simulated as a binomial sampling from ${X}_{ig}$ with probability ${\mathrm{\Psi}}_{ij}$. 2) Technical process. We simulate the capture, fragmentation and sequencing of each transcript using a modified version of SymSim’s True2ObservedCounts function and a random vector of transcript lengths. Finally, we subsample the obtained reads based on the transcript length in order to simulate the coverage of informative splice junctions.
We simulate the splicing of cassette exons in a set of genes $G$ expressed in a population of cells $N$. For each gene $g\in G$, we simulate the splicing of one cassette exon $j$. The inclusion of $j$ forms the isoform ${j}_{A}$, while the exclusion of the exons forms the isoform ${j}_{B}$. The production of mRNA molecules from ${j}_{A}$ in a single cell $i\in N$ is determined by the total expression of $g$, and by the action of the splicing machinery of $i$.
Biological process
Request a detailed protocolSplicing from premRNA transcripts in individual cells
Request a detailed protocolFor each alternatively spliced gene $g$ in a cell $i$, we simulate the expression of $g$ and the splicing of its cassette exon.
${X}_{ig}$ represents the total number of premRNA transcribed from each gene across all cells. We simulate the total counts using SymSim, an insilico approach for simulation of single cell gene expression by accounting for the biological sources of variation (Zhang et al., 2019). ${\mathrm{\Psi}}_{ij}$ referred to as the underlying splicing rate, is the probability of splicing in the cassette exon $j$ of gene $g$ in cell $i$. Notice that in this simulation, each gene $g$ only has one cassette exon $j$. In a biological context, ${\mathrm{\Psi}}_{ij}$ would be determined by intrinsic attributes of the cassette exon inherent of $g$ (e.g. sequence, secondary structure, binding sites), and by the profile of splicing factors expressed in $n$. $X}_{{A}_{ij}$ and ${X}_{{B}_{ij}}$ are respectively the counts of mRNA molecules from isoforms ${g}_{A}$ and ${g}_{B}$ in $i$. Notice that ${X}_{{A}_{ij}}$ is a random binomial sample from the total number of expressed premRNA molecules of $g$ in $i$ with a probability ${\mathrm{\Psi}}_{ij}$, as it has been modeled before (Waks et al., 2011; Xiong et al., 2011; Faigenbloom et al., 2015; Shen et al., 2014). ${\mathrm{\Psi}}_{{T}_{ij}}$ is the true isoform ratio that include cassette exon $j$ in cell $i$, obtained as the proportion of molecules of gene $g$ that include the cassette exon $j$.
The distribution of ${\mathrm{\Psi}}_{ij}$ across all cells $i\in N$ is modeled as a Beta distribution, which has been used in previous studies of single cell splicing (Barash et al., 2010; Song et al., 2017; Linker et al., 2019). In this model, the distribution is determined by the parameters ${\alpha}_{j},{\beta}_{j}\in (0,\mathrm{\infty})$. The values of these parameter determine the distribution of ${\mathrm{\Psi}}_{ij}$ across all cells as follows:
Unimodal with intermediate mode if ${\alpha}_{j},{\beta}_{j}\text{}\text{}1$.
Unimodal with mode 1 if $0\text{}\text{}{\alpha}_{j}\text{}\text{}1\le \text{}{\beta}_{j}$.
Unimodal with mode 0 if $0\text{}\text{}{\beta}_{j}\text{}\text{}1\le \text{}{\alpha}_{j}$.
Bimodal with modes 0 and 1 if $0\text{}\text{}{\alpha}_{j},{\beta}_{j},\text{}\text{}1$.
Uniform if ${\alpha}_{j}\text{}=\text{}{\beta}_{j}\text{}=\text{}1$.
Notice that $\frac{{\alpha}_{j}}{{\alpha}_{j}+{\beta}_{j}}=\mu ({\mathrm{\Psi}}_{ij})$. By controlling the ${\alpha}_{j},{\beta}_{j}$ parameters we can compare the biological underlying distribution of the exon splicing rate with the observed distribution of $\mathrm{\Psi}$ inferred from single cell RNAseq data.
To compare the results of our simulations under the binarybimodal (alternative isoforms are present in the population, but rarely in the same cell; Figure 3a,c,d) and nonbinary unimodal (both isoforms regularly appear in the same cell; Figure 3b,e,f) models of splicing, we simulated 500 alternatively spliced exons for each model. For the first model, we simulated the underlying splicing distribution of each exon as bimodal Beta distributions. For the second model, we simulated the underlying splicing distributions of each exon with unimodal Beta distributions with intermediate mode. For each cassette exon $j$ in cell $i$:
Binarybimodal splicing:
Nonbinary unimodal splicing:
To simulate a realistic scenario in both models, we simulated 500 additional exons that are consistently included, and 500 that are consistently excluded. For the consistently included exons, we sampled the parameters for Unimodal Beta distributions with mode 1 as
For the consistently excluded exons, we sampled the parameters for Unimodal Beta distributions with mode 0 as
Technical process
Request a detailed protocolmRNA capture into cDNA
Request a detailed protocolAfter simulating the production of mRNAs of distinct isoforms in single cells, we simulate the process of capture and sequencing of mRNA molecules from ${X}_{{A}_{ij}}$ and ${X}_{{B}_{ij}}$.
The process of mRNA capture is simulated using SymSim. ${C}_{{I}_{ng}}$ is the number of mRNA molecules of isoform $I\in \{{g}_{A},{g}_{B}\}$ for gene $g$ on cell $n$. This number is sampled from the total number of molecules for isoform that are present in the cell. $c$ is a parameter that determines the expected capture efficiency. ${c}_{{I}_{ng}}$ is the specific probability of capture of isoform $I$ of gene $g$ in cell $n$, which is drawn from a truncated normal distribution with mean $c$. The variance of the truncated normal distribution is set at 0.002, which is the default variance in SymSim. The distribution is truncated at 0 and at 1.
RNA sequencing
Request a detailed protocolSequencing is also simulated using SymSim’s approach, which includes a lengthdependent PCR amplification bias modeled from experimental data.
${l}_{{g}_{B}}$ is the length of the mRNA transcript of isoform ${g}_{B}$ of gene $g$, which represents the isoform that skips the alternative exon. ${l}_{{g}_{A}}$ is the length of the mRNA transcript of the isoform ${g}_{A}$, which includes the exon. We assigned the lengths to the excluded isoform by drawing without replacement from SymSim’s transcript length database. ${l}_{{e}_{A}}$ is the length of the alternative exon included in isoform ${g}_{A}$; it is sampled without replacement from a database of skipped exon lengths from the human genome. ${R}_{{I}_{ng}}$ is the total number of observed reads from isoform $I$ of gene $g$ in cell $n$ obtained from single cell RNA sequencing. These are obtained using the True2ObservedCounts function from SymSim with the modification previously described.
Splice junction coverage and observed $\mathrm{\Psi}$ calculation
Request a detailed protocolWe also simulate the downsampling from observing only reads that overlap the splice junctions that are informative about the splicing of the cassette exon.
${l}_{r}$ corresponds to the constant read length from the sequencing process. ${\text{SJ}}_{{I}_{ng}}$ is the number of reads that cover informative splice junctions for isoform $I\in \{{g}_{A},{g}_{B}\}$ for gene $g$ in cell $n$, which are sampled from the total number of reads covering the isoform. In order to account for variation in read density along transcripts, we sample the splice junction reads of each event from a binomial distribution with probabilities ${j}_{{A}_{g}}$ and ${j}_{{B}_{g}}$, which are respectively the probabilities of a given read to cover the splice junctions informative with isoform ${g}_{A}$ and ${g}_{B}$. We derive these probabilities from the length of the simulated reads, and the length of the transcript. Each read can be mapped to $2({l}_{r}1)$ positions in the transcript that overlap one splice junction. Thus, the probability of covering one given splice junction is defined as the number of possible positions in the transcript that are informative for the splice junction, divided by the length of the transcript. ${j}_{{A}_{g}}$ is the probability to map to any of the two splice junctions that are informative for isoform $g}_{A$. $j}_{{B}_{g}$ is the probability to map to one single splice junction, since there is only one junction informative for isoform ${g}_{B}$. Coverage depth variance might sometimes exceed the variance of the binomial distribution, such that the informative reads for one isoform might be much more likely to be sequenced than the informative reads of the other isoform in a given cell. In these situations, the technical variance of $\widehat{\mathrm{\Psi}}$ and the number of spurious binary observations could be higher than modeled in these simulations.
Finally, the observed $\mathrm{\Psi}$ is calculated as:
Simulator variants for studying sources of variation
Request a detailed protocolGene expression and underlying $\mathrm{\Psi}$
Request a detailed protocolWe tested the effect of the interplay between gene expression and the ratio of isoforms that contain the cassette exon on the observed distribution of $\mathrm{\Psi}$. For this test, we simulated a population of 300 single cells with 500 genes, indexed 1 to 500. For every cell $i$, the expression of gene $g$ is fixed as ${X}_{ig}=g$, where $g\in \{1,2,\mathrm{\dots},500\}$. That is, every gene had a different level of expression, and the expression of every individual gene was constant across all cells. For each simulation, we fixed the underlying splicing rate of all cassette exons across all cells. That is, for each cassette exon $j$ of gene $g$, in every cell, we set ${\mathrm{\Psi}}_{ij}=$ constant. We ran the simulator with different underlying splicing rates, with ${\mathrm{\Psi}}_{ij}\in \{0.01,0.02,\mathrm{\dots},0.5\}$. For every simulation, we used an average capture efficiency $c=0.1$. We ran 30 simulations for every fixed ${\mathrm{\Psi}}_{ij}$ value. For every fixed ${\mathrm{\Psi}}_{ij}$ and for every fixed expression level $g$, we took the average proportion of cells with binary values ($\widehat{\mathrm{\Psi}}$ = 0 or 1) for the observed $\widehat{\mathrm{\Psi}}$. That is, we reported:
Gene expression and capture efficiency
Request a detailed protocolWe tested the effect of capture efficiency in $\mathrm{\Psi}$ observations. To minimize the effect of the underlying $\mathrm{\Psi}$ in the simulations, in this analysis we fixed the true splicing rate of all exons to ${\mathrm{\Psi}}_{Tng}=0.5$ (we achieved this by setting ${X}_{Aij}={X}_{Bij}=\frac{1}{2}{X}_{ig}$). We ran simulations for each possible value for the average capture efficiency in $c\in \{0.01,0.011,\mathrm{\dots},0.1\}$. For each tested average capture efficiency rate, we ranked the alternative splicing events by the number of reads that cover the informative splice junctions. For each alternative event, we observed the proportion of cells that present only one type of isoform (either including the cassette exon or excluding it, but not both).
Data availability
All sequencing data reanalyzed in this study were acquired from GEO.

NCBI Gene Expression OmnibusID GSE74155. Singlecell analysis of allelic gene expression in pluripotency, differentiation and Xchromosome inactivation.

NCBI Gene Expression OmnibusID GSE100471. Defining the early steps of cardiovascularlineage segregation by single cell RNAseq.

NCBI Gene Expression OmnibusID GSE52529. Pseudotemporal ordering of individual cells reveals regulators of differentiation.

NCBI Gene Expression OmnibusID GSE85908. Singlecell alternative splicing analysis with Expedition reveals splicing dynamics during neuron differentiation.

NCBI Gene Expression OmnibusID GSE95601. Olfactory stem cell differentiation: horizontal basal cell (HBC) lineage.
References

Modelbased detection of alternative splicing signalsBioinformatics 26:i325–i333.https://doi.org/10.1093/bioinformatics/btq200

Rapid neurogenesis through transcriptional activation in human stem cellsMolecular Systems Biology 10:760.https://doi.org/10.15252/msb.20145508

Functional interpretation of single cell similarity mapsNature Communications 10:4376.https://doi.org/10.1038/s41467019122350

STAR: ultrafast universal RNAseq alignerBioinformatics 29:15–21.https://doi.org/10.1093/bioinformatics/bts635

Regulation of alternative splicing at the singlecell levelMolecular Systems Biology 11:845.https://doi.org/10.15252/msb.20156278

Validation of noise models for singlecell transcriptomicsNature Methods 11:637–640.https://doi.org/10.1038/nmeth.2930

Singlecell isoform RNA sequencing characterizes isoforms in thousands of cerebellar cellsNature Biotechnology 36:1197–1202.https://doi.org/10.1038/nbt.4259

Singlecell RNA counting at allele and isoform resolution using Smartseq3Nature Biotechnology 38:708–714.https://doi.org/10.1038/s4158702004970

Alternative RNA splicing associated with mammalian neuronal differentiationCerebral Cortex 28:2810–2816.https://doi.org/10.1093/cercor/bhx160

lincHOXA1 is a noncoding RNA that represses Hoxa1 transcription in CisGenes & Development 27:1260–1271.https://doi.org/10.1101/gad.217018.113

Singlecell sequencingbased technologies will revolutionize wholeorganism scienceNature Reviews Genetics 14:618–630.https://doi.org/10.1038/nrg3542

Revealing the vectors of cellular identity with singlecell genomicsNature Biotechnology 34:1145–1160.https://doi.org/10.1038/nbt.3711

Cell‐to‐cell variability of alternative RNA splicingMolecular Systems Biology 7:506.https://doi.org/10.1038/msb.2011.32

Robust detection of alternative splicing in a population of single cellsNucleic Acids Research 44:e73.https://doi.org/10.1093/nar/gkv1525

SingleRNA counting reveals alternative modes of gene expression in yeastNature Structural & Molecular Biology 15:1263–1271.https://doi.org/10.1038/nsmb.1514

Simulating multiple faceted variability in single cell RNA sequencingNature Communications 10:2611–2627.https://doi.org/10.1038/s4146701910500w
Article and author information
Author details
Funding
UC MEXUSConacyt (Doctoral Fellowship)
 Carlos F Buen Abad Najar
Chan Zuckerberg Biohub
 Nir Yosef
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We thank Don Rio for discussions that inspired this analysis, Chenling Xu and Amanda Mok for insightful suggestions, and Nicholas Ingolia for critical comments on the manuscript. C.F.B. was supported by the UC MEXUSCONACYT doctoral fellowship. N.Y. was supported by the Chan Zuckerberg Biohub.
Version history
 Received: December 19, 2019
 Accepted: June 28, 2020
 Accepted Manuscript published: June 29, 2020 (version 1)
 Version of Record published: September 17, 2020 (version 2)
Copyright
© 2020, Buen Abad Najar et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics

 5,424
 views

 557
 downloads

 32
 citations
Views, downloads and citations are aggregated across all versions of this paper published by eLife.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading

 Chromosomes and Gene Expression
 Genetics and Genomics
ASARs are a family of verylong noncoding RNAs that control replication timing on individual human autosomes, and are essential for chromosome stability. The eight known ASAR lncRNAs remain closely associated with their parent chromosomes. Analysis of RNAprotein interaction data (from ENCODE) revealed numerous RBPs with significant interactions with multiple ASAR lncRNAs, with several hnRNPs as abundant interactors. An ~7 kb domain within the ASAR6141 lncRNA shows a striking density of RBP interaction sites. Genetic deletion and ectopic integration assays indicate that this ~7 kb RNA binding protein domain contains functional sequences for controlling replication timing of entire chromosomes in cis. shRNAmediated depletion of 10 different RNA binding proteins, including HNRNPA1, HNRNPC, HNRNPL, HNRNPM, HNRNPU, or HNRNPUL1, results in dissociation of ASAR lncRNAs from their chromosome territories, and disrupts the synchronous replication that occurs on all autosome pairs, recapitulating the effect of individual ASAR knockouts on a genomewide scale. Our results further demonstrate the role that ASARs play during the temporal order of genomewide replication, and we propose that ASARs function as essential RNA scaffolds for the assembly of hnRNP complexes that help maintain the structural integrity of each mammalian chromosome.

 Chromosomes and Gene Expression
Eukaryotic chromatin is organized into functional domains, that are characterized by distinct proteomic compositions and specific nuclear positions. In contrast to cellular organelles surrounded by lipid membranes, the composition of distinct chromatin domains is rather ill described and highly dynamic. To gain molecular insight into these domains and explore their composition, we developed an antibodybased proximity biotinylation method targeting the RNA and proteins constituents. The method that we termed antibodymediated proximity labelling coupled to mass spectrometry (AMPLMS) does not require the expression of fusion proteins and therefore constitutes a versatile and very sensitive method to characterize the composition of chromatin domains based on specific signature proteins or histone modifications. To demonstrate the utility of our approach we used AMPLMS to characterize the molecular features of the chromocenter as well as the chromosome territory containing the hyperactive X chromosome in Drosophila. This analysis identified a number of known RNAbinding proteins in proximity of the hyperactive X and the centromere, supporting the accuracy of our method. In addition, it enabled us to characterize the role of RNA in the formation of these nuclear bodies. Furthermore, our method identified a new set of RNA molecules associated with the Drosophila centromere. Characterization of these novel molecules suggested the formation of Rloops in centromeres, which we validated using a novel probe for Rloops in Drosophila. Taken together, AMPLMS improves the selectivity and specificity of proximity ligation allowing for novel discoveries of weak protein–RNA interactions in biologically diverse domains.