Coverage-dependent bias creates the appearance of binary splicing in single cells

  1. Carlos F Buen Abad Najar
  2. Nir Yosef  Is a corresponding author
  3. Liana F Lareau  Is a corresponding author
  1. Center for Computational Biology, University of California, Berkeley, United States
  2. Department of Electrical Engineering and Computer Science and the Center for Computational Biology, University of California, Berkeley, United States
  3. Ragon Institute of MGH, MIT, and Harvard, United States
  4. Chan Zuckerberg Biohub, San Francisco, United States
  5. Department of Bioengineering, University of California, Berkeley, United States
4 figures, 2 tables and 1 additional file

Figures

Bimodal vs unimodal models of cassette exon splicing.

In the bimodal model, some cells consistently splice in the exon, while others consistently skip it. After mRNA capture and sequencing, observations of Ψ are almost exclusively binary. In the unimodal model, individual cells express some mRNAs that splice in the cassette exon and some that skip it. Low mRNA capture dramatically reduces the number of cells in which both isoforms are observed, artificially inflating binary Ψ values.

Figure 2 with 1 supplement
Splice junction read coverage is correlated with unimodality of splicing distributions.

(a) Comparison of splice junction read coverage and observed Ψ^ for three cassette exons in the Chen dataset, with low (Cadm1 exon 8; chr9: 47829377–47829409), medium (Thyn1 exon 6, chr9: 27006801–27006951), and high coverage (Rbm39 exon 3; chr2: 156178880–156178952). Each dot represents the Ψ^ of that exon in one cell. (b) Ψ^ distribution of the 300 highest coverage cassette exons with intermediate splicing (average Ψ^ between 0.2 and 0.8) in each of the six analyzed datasets. Each row in the heatmap shows the distribution of Ψ^ for one exon across all cells. (c) Relationship between the average read coverage and proportion of binary observations for each cassette exon in the Chen dataset. (d) Correlation between the total number of splice junction reads captured in each cell, and proportion of cassette exons with intermediate splicing that show binary Ψ^ in that cell.

Figure 2—figure supplement 1
Junction read coverage determines the proportion of binary observations in all analyzed datasets.

(a) As in Figure 2c, the splice junction read coverage of an intermediate exon was anti-correlated with the proportion of cells in which it shows binary Ψ^ (only one isoform is observed) in all datasets. Barplot: Pearson correlation between read coverage and proportion of binary observations of each cassette exon. (b) As in Figure 2d, the total number of splice junction reads in a cell was inversely proportional to the fraction of exons that have binary values in the cell in all datasets. Barplot: Pearson correlation between the total number of captured reads in each cell and the proportion of cassette exons with binary Ψ^. (c) Relationship between transcriptional burst frequency and size, and the proportion of binary observations of an exon. (d) R2 of the linear regression prediction of the proportion of binary observations per exon based on burst frequency, size, number of reads, and a combination of these features.

Figure 3 with 2 supplements
Simulations show that gene expression and capture efficiency influence the observed distribution of splicing.

(a) Simulations of alternative splicing and scRNA-seq under the binary-bimodal model, in which each cell produces one isoform or the other, but rarely both. As in Figure 2b, each row of the histogram shows Ψ^ for one intermediate exon across all cells. The observed Ψ^ distribution is similar to the true Ψ distribution, and its shape is largely unaffected by capture efficiency. (b) Simulations with the non-binary, unimodal model, in which most cells present a mixture of the two alternative isoforms. Exons with high expression have a unimodal distribution of true Ψ. Low capture efficiency results in an increase in binary observations (only one isoform observed), leading to a distortion of the observed distribution of Ψ^ to look bimodal. Only a handful of the highest expressed exons maintain a unimodal distribution of Ψ^. Fewer exons show unimodal splicing as the capture efficiency is reduced. (c) Under the binary-bimodal model, exons with high coverage have slightly fewer binary Ψ observations, and (d) simulated cells with a high number of total splice junction reads have slightly fewer exons with binary Ψ^. (e) Under the unimodal model, exons with intermediate splicing show a strong decrease in binary observations as coverage increases, as seen in real data (Figure 2c). (f) Similarly, simulated cells with high read coverage have a decrease of the proportion of binary Ψ^. (g) Effect of capture efficiency on the proportion of binary observations of cassette exons with underlying Ψ = 0.5. (h) Effect of the initial number of mRNA molecules and underlying Ψ on the proportion of binary Ψ^ observations.

Figure 3—figure supplement 1
Theoretical calculations and simulations of the effect of biological and technical factors in splicing observations.

(a) Theoretical likelihood of capturing only mRNAs representing one isoform of an alternatively spliced gene in a single cell, determined by the total number of mRNAs in the cell, the Ψ of the isoforms, and a 10% capture efficiency. (b) Probability of observing a binary Ψ^ (only one isoform observed) given the number of mRNA molecules of that gene present in the cell, when the underlying Ψ is fixed at 0.5. (c) Effect of the number of captured mRNA molecules on the uncertainty of Ψ^ observations. Ψ^ is understood as the maximum likelihood estimate of the underlying Ψ. (d) The posterior probability of having an observed Ψ^ within 0.1 of the underlying Ψ. (e) Distribution of the relative error between the average observed Ψ^, and the average true Ψ, in our simulations under different capture efficiency rates. The relative error is calculated as the difference between the average Ψ^ and the average true Ψ, scaled by the average true Ψ. (f) Distribution of the relative error between the variance in the observed Ψ^, and the variance in the true Ψ, in our simulations under different capture efficiency rates.

Figure 3—figure supplement 2
Schematic of scRNA-seq splicing simulator.

Simulator of scRNA-seq splicing data. Green elements represent biological variables; blue elements represent technical processes. The underlying Ψ is drawn from a Beta distribution with either a bimodal or unimodal shape. Individual gene expression determines the total number of mRNAs, and these mRNAs are spliced stochastically according to Ψ, producing isoforms that splice in or skip the exon. mRNAs are captured with a probability drawn from a normal distribution. Sequencing produces splice junction reads, which determine the observed Ψ^.

Figure 4 with 4 supplements
Accounting for coverage biases reveals unimodal splicing distributions and differential splicing.

(a) Estimated total mRNA molecules captured per cell. (b) Estimated number of recovered mRNAs vs splice junction reads for cassette exons, averaged across cells. Each dot corresponds to an exon, and its color indicates the proportion of cells in which it has a binary observation (only one isoform observed). We analyzed exons with average Ψ^ between 0.05 and 0.95. (c) Per-cell splice junction coverage rate in each dataset. (d) Cadm1 exon 8 alternative splicing appears binary in many cells in the Chen dataset. Correlation with lineage pseudotime: Spearman’s rs = 0.1. (e) Cadm1 exon after removing cells with fewer than 10 recovered Cadm1 mRNA molecules and fewer splice junction reads than expected from 10 mRNAs (grey). Spearman’s rs = 0.52. (f) PCA projection and clustering of single cells in the Chen dataset, showing differentiation of mouse ES cells into neurons. Red line, lineage inferred with Slingshot. (g) Number of cassette exons with observations from at least 10 mRNA reads in at least 50% of cells in any cluster. (h) Stacked histograms showing the distribution of observed Ψ^ of exons as in (g), in each cell cluster of the Chen dataset. Observations with fewer than 10 mRNA molecules were removed. We show exons with average Ψ^ ranging from 0.1 to 0.9 per cluster. (i) QQ-plot comparing the quantiles of a uniform distribution (x-axis) with the quantiles of the distributions of p-values from the Kruskal-Wallis test (y-axis). A diagonal line (gray dotted line) would mean the p-values are uniformly distributed. A lower area under the curve indicates enrichment for low p-values. The point on the x-axis at which each line crosses the dotted red line indicates the proportion of p-values that are below 0.05 in the distribution. (j) Fold enrichment of exons with a Kruskal-Wallis p < x in the set of exons selected with the mRNA-based filter (blue), and exons selected with a flat read minimum filter (red). (k) Significance p-value of the enrichment, estimated with the hypergeometric test and adjusted for FDR. (l) Example exons that pass the overall filter criteria in the Chen dataset and have p<0.05 in the Kruskal-Wallis test.

Figure 4—figure supplement 1
Relationship between read coverage, captured mRNAs, and binarity in single cell datasets.

(a) Average number of splice junction reads required to have a 50% likelihood of observing both isoforms, from simulations in Figure 3j. (b) Total mRNAs required to have a 50% likelihood of observing both isoforms, from simulations in Figure 3k. (c) Total mapped reads per cell for each dataset. (d) Estimated average number of recovered mRNAs vs. average number of splice junction reads for each exon. (e) Expression of pluripotency and neuron differentiation marker genes in the Chen dataset; expression is scaled by each row’s maximum and minimum. Cells are ordered by inferred pseudotime. Top color key, clusters from agglomerative clustering; bottom color key, labels from Chen et al. The red box highlights two sub-groups of neurons identified by clustering. (f) Visualization of the distribution of Ψ^ observations toward extreme values in the first cluster of cells of each dataset. For each x cutoff in the x-axis, the y-axis corresponds to the proportion of cells that have Ψ^x or Ψ^(1-x). Blue lines, exons passing the mRNA-based filter; red lines, exons removed by filter. (g) Stacked histograms, as in Figure 4h, for other datasets. The Fletcher dataset is excluded because only five exons pass the mRNA filter. (h) Stacked histograms of exons that do not pass the filter in the Chen dataset; exons were subsampled to match the number that do pass the filter in each cell type cluster. (i) Sex-dependent distribution of Ψ^ of the two bimodal exons of the Chen dataset after cell-type stratification: Smarcad1 exon 3 (chr6: 65043836–65044108) in early Epi cells, Nsfl1c exon 4 (chr2: 151502455–151502460) in late Epi cells.

Figure 4—figure supplement 2
Analysis of differential splicing among selected exons with the Kruskal-Wallis analysis of variance.

(a) Comparison of the precision of selecting exons with low p-values, for the mRNA-based filter versus selecting exons on a flat minimum of 10 reads. (b) Recall. (c) F1-score. (d) Specificity. (e) Accuracy. (f) Number of exons with px in each dataset.

Figure 4—figure supplement 3
Analysis of differential splicing among selected exons with the autocorrelation test.

(a) Number of exons with p x in each dataset. (b) QQ-plot comparing the quantiles of a uniform distribution, with the quantiles of the distributions of p-values from the autocorrelation test. (c) Fold enrichment of exons with an autocorrelation p < x in the set of exons selected with the mRNA-based filter (blue), and exons selected with a flat read minimum filter (red). (d) Significance p-value of the enrichment shown in (c). For each p-value limit x, we estimate the significance with the hypergeometric test, and correct for multiple testing using the Benjamini-Hochberg FDR adjustment. (e) F1-score. (f) Specificity. (g) Accuracy.

Figure 4—figure supplement 4
Distribution of Ψ^ of example exons.

(a) Skipped Cadm1 exon 8 (chr9: 47829377–47829409). We show the relationship between pseudotime and observed Ψ^; as well as violinplots of the distribution of Ψ^ in each cluster before filtering, and after filtering. (b) Violinplots of the distribution of Ψ^ in each cluster after filtering for the exons in Figure 4k: Nsfl1c exon 4 (chr2: 151502455–151502460), Rpn2 exon 16 (chr2: 157323223–157323270), Tecr exon 4 (chr8: 83573411–83573455), Zfp207 exon 8 (chr11: 80393084–80393176).

Tables

Table 1
Prevalence of bimodal splicing distributions before and after filtering.
DatasetCell type clusterCellsExonsBimodal% bimodalSelectedSelected
bimodal
% bimodal
selected
p-val (adj)
ChenES21744611826%9400%3.3e-14
ChenEpi, early9840210727%9811%9.3e-14
ChenEpi, late10451613626%7611%1.1e-09
ChenNeuron, early4736411732%4300%3.6e-08
ChenNeuron, late2251714628%6100%1.1e-09
LescroartHeart E6.751722867727%3300%2.0e-05
LescroartHeart E7.253412917827%3600%8.0e-06
TrapnellMyoblast 00 hr3540014236%4100%1.2e-08
TrapnellMyoblast 24 hr892519739%2700%1.2e-06
TrapnellMyoblast 48 hr722429740%3213%9.1e-07
TrapnellMyoblast 72 hr3525210140%2414%5.1e-05
SongiPSC6261626944%5500%3.3e-14
SongNPC732129243%2814%1.2e-06
SongMN671688249%19421%9.4e-03
ShalekBMDC131495134%2714%6.6e-05
Table 2
Single-cell RNA-seq datasets.
DatasetOrganismBiological processCellsNo. ASEMean reads/eventProtocolAccessionReference
ChenmousemES neuron differentiation48832763.1Smart-seq2GSE74155Chen et al., 2016
Lescroartmousecardiomyogenesis59830073.2Smart-seq2GSE100471Lescroart et al., 2018
Trapnellhumanskeletal myogenesis314445714.4Smart-seqGSE52529Trapnell et al., 2014
SonghumaniPS motor neuron differentiation206535588.8Smart-seqGSE85908Song et al., 2017
Fletchermouseolfactory neurogenesis8496842.7Smart-seqGSE95601Fletcher et al., 2017
Shalekmousebone-marrow-derived dendritic cells18380213.4Smart-seqGSE41265Shalek et al., 2013

Additional files

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Carlos F Buen Abad Najar
  2. Nir Yosef
  3. Liana F Lareau
(2020)
Coverage-dependent bias creates the appearance of binary splicing in single cells
eLife 9:e54603.
https://doi.org/10.7554/eLife.54603