Robust and annotation-free analysis of alternative splicing across diverse cell types in mice
Abstract
Although alternative splicing is a fundamental and pervasive aspect of gene expression in higher eukaryotes, it is often omitted from single-cell studies due to quantification challenges inherent to commonly used short-read sequencing technologies. Here, we undertake the analysis of alternative splicing across numerous diverse murine cell types from two large-scale single-cell datasets—the Tabula Muris and BRAIN Initiative Cell Census Network—while accounting for understudied technical artifacts and unannotated events. We find strong and general cell-type-specific alternative splicing, complementary to total gene expression but of similar discriminatory value, and identify a large volume of novel splicing events. We specifically highlight splicing variation across different cell types in primary motor cortex neurons, bone marrow B cells, and various epithelial cells, and we show that the implicated transcripts include many genes which do not display total expression differences. To elucidate the regulation of alternative splicing, we build a custom predictive model based on splicing factor activity, recovering several known interactions while generating new hypotheses, including potential regulatory roles for novel alternative splicing events in critical genes like Khdrbs3 and Rbfox1. We make our results available using public interactive browsers to spur further exploration by the community.
Editor's evaluation
This paper presents a new method to study known and novel alternative splicing events at the single-cell level and perform differential analysis across cell types. The method addresses current challenges in the analysis of splicing in single cells related to technical variation and experimental biases. Performing one of the most comprehensive studies to date with data from different mice, this work expands the body of splicing events that potentially define individual cell types.
https://doi.org/10.7554/eLife.73520.sa0eLife digest
Cells are the basic building blocks of all living things. There are numerous types of cells, and each cell has its own machinery to fulfill a specialised role. Despite their different purposes, most cells contain the same instructions, stored as DNA, on how to assemble the proteins needed to perform their intended functions. Cell types often vary in the frequency that each gene is read, leading to different quantities of proteins produced.
Moreover, a process known as alternative splicing enables cells to build multiple proteins from the same gene. It works by joining fragments of a gene’s code in various combinations. The resulting RNA sequences are molecular templates that cells use to assemble proteins.
Analysing these RNA sequences reveals which genes are switched on in different tissues of the body, and what proteins are being made. However, despite recent advancements, alternative splicing is rarely studied in single cells because of some sizeable technical challenges.
Benegas, Fischer and Song developed a computational toolkit designed to handle the unique challenges of analysing alternative splicing events in single cells. The analysis pipeline, called scQuint, was tested on two large datasets that capture cell-to-cell differences in the brain and other tissues of mice.
Nearly all the cell types studied exhibited clear differences in alternative splicing, such that cell types could be distinguished based on their splicing profiles. Intriguing patterns of splicing were highlighted in some immune cells and certain types of neurons. Across cell types, the genes with unique splicing patterns were often not the same as those with unique activity patterns, indicating that gene expression and alternative splicing are two complementary processes. New types of alternative splicing events were also identified. Benegas et al. also developed a statistical model to probe the roles of splicing regulators in different cell types.
In summary, the scQuint toolkit overcomes critical technical challenges typically encountered when analysing alternative splicing in single cells. It also reveals new insights about mechanisms of alternative splicing. The results are open access, made available using public interactive browsers, which should spur on other researchers to interrogate how alternative splicing differs in single cells.
Introduction
The past decade’s advances in single-cell genomics have enabled the data-driven characterization of a wide variety of distinct cell populations. Despite affecting more than 90% of human pre-mRNAs (Wang et al., 2008), isoform-level variation in gene expression has often been ignored because of quantification difficulties when using data from popular short-read sequencing technologies such as 10x Genomics Chromium and Smart-seq2 (Picelli et al., 2014). Long-read single-cell technologies, which greatly simplify isoform quantification, are improving (Byrne et al., 2017; Gupta et al., 2018; Volden and Vollmers, 2020; Lebrigand et al., 2020; Joglekar et al., 2021), but remain more costly and lower-throughput than their short-read counterparts. For these reasons and others, short-read datasets predominate and we must work with short reads to make use of the rich compendium of available data. In response, researchers have developed several computational methods to investigate splicing variation despite the sizable technical challenges inherent to this regime. A selection of these challenges and methods are summarized in the Appendix.
To complement single-cell gene expression atlases, we analyze alternative splicing in large single-cell RNA-seq (scRNA-seq) datasets from the Tabula Muris consortium (Schaum et al., 2018) and BRAIN Initiative Cell Census Network (BICCN) (Yao et al., 2021). These data span a broad range of mouse tissues and cell types, and remain largely unexplored at the level of transcript variation. During our initial analyses, we encountered pervasive coverage biases, a heretofore largely unappreciated mode of technical variation which greatly confounds biological variation across cell types. Unsatisfied with the performance of current methods when confronted by these biases, we implemented our own quantification, visualization, and testing pipeline, named scQuint (single-cell quantification of introns), which allowed us to continue our analyses in a robust, annotation-free, and computationally tractable manner. Parts of the scQuint pipeline are based on adaptations of the bulk RNA-seq alternative splicing analysis method LeafCutter (Li et al., 2018) to handle the unique challenges of scRNA-seq data. As we demonstrate in subsequent sections, our modifications in the quantification, statistical modeling, and optimization procedures lead to improved robustness, scalability, and calibration when working with data from single cells (Figure 2—figure supplement 2, also see Materials and methods).
Applying scQuint to these datasets, we find a strong signal of cell-type-specific alternative splicing and demonstrate that cell type can be accurately predicted given only splicing proportions. Moreover, our annotation-free approach enables us to detect a large quantity of cell-type-specific novel splicing events. In certain cell types, particularly the neuron subclasses, as many as 30% of differential splicing events that we detect are novel. In general, across the many considered cell types and tissues in both datasets, we find only a narrow overlap between the top differentially expressed and the top differentially spliced genes within a given cell type, illustrating the complementarity of splicing to expression. Our examination of neurons in the primary motor cortex suggests that splicing distinguishes neuron classes and subclasses as readily as does expression. We showcase alternative splicing patterns specific to the GABAergic (inhibitory) and Glutamatergic (excitatory) neuron classes as well as the subclasses therein. The implicated transcripts include key synaptic molecules and genes which do not display expression differences across subclasses. In developing marrow B cells, we find alternative splicing and novel transcription start sites (TSS) in critical transcription factors such as Smarca4 and Foxp1, while further investigation reveals dissimilar trajectories for expression and alternative splicing in numerous genes across B cell developmental stages. These findings buttress our belief in the complementary nature of these processes and provide clues to the regulatory architecture controlling the early B cell life cycle. To facilitate easy exploration of these datasets and our results, we make available several interactive browsers as a resource for the genomics community.
Finally, to advance our understanding of alternative splicing regulation, we build a statistical machine learning model to predict splicing events by leveraging both the expression levels and splicing patterns of splicing factors across cell types. This model recovers several known regulatory interactions such as the repression of splice site four exons in neurexins by Khdrbs3, while generating new hypotheses for experimental follow-up. For example, in addition to the regulatory effect of the whole-gene Khdrbs3 expression, the model predicts a regulatory role for a novel alternative TSS in this gene. In aggregate, our results imply that alternative splicing serves as a complementary rather than redundant component of transcriptional regulation and supports the mining of large-scale single-cell transcriptomic data via careful modeling to generate hypothetical regulatory roles for splicing events.
Results
Methods overview
Robust, annotation-free quantification based on alternative introns
Most methods rely on the assumption that coverage depth across a transcript is essentially uniform (e.g., Akr1r1, Figure 1—figure supplement 1a). We instead found that Smart-seq2 data (Picelli et al., 2014) frequently contain sizable fractions of genes with coverage that decays with increasing distance from the 3’ ends of transcripts. For example, in mammary gland basal cells from the Tabula Muris dataset (Schaum et al., 2018), Ctnbb1 shows a gradual drop in coverage (Figure 1—figure supplement 1b) while Pdpn displays an abrupt reduction halfway through the 3’ UTR (Figure 1—figure supplement 1c). That the magnitude of these effects varies across technical replicates (plates) suggests they could be artifacts, possibly related to degradation or interrupted reverse transcription. Similar coverage bias artifacts are also apparent in the BICCN primary motor cortex data (Yao et al., 2021; Figure 1—figure supplement 2).
Such coverage biases affect gene expression quantification, and in some cases these batch effects are sufficient to comprise a significant proportion of the observed variation in expression levels. For the Tabula Muris mammary gland dataset, a low-dimensional embedding of cells based on gene expression reveals that some cell type clusters exhibit internal stratification by plate (Figure 1a). A subsequent test of differential gene expression between plate B002438 and all other plates returns 2870 significant hits after correction for multiple hypothesis testing, and all manually inspected differentially expressed genes exhibit these types of coverage biases. Perhaps unsurprisingly, quantification at the transcript level is apt to be even more sensitive to these artifacts than gene-level quantification, especially if it is based on coverage differences across the whole length of the transcript. The UMAP embeddings of isoform proportions (kallisto by Bray et al., 2016), exon proportions (DEXSeq by Anders et al., 2012), 100 bp bin coverage proportions (ODEGR-NMF by Matsumoto et al., 2020) or junction usage proportions across the whole gene (DESJ by Liu et al., 2021) depict a plate clustering pattern which scrambles the anticipated cell type clusters (Figure 1b–e).
With these considerations in mind, we sought to quantify transcript variation in a fashion that would be more robust to coverage differences along the transcript. Although some bulk RNA-seq methods such as RSEM (Li and Dewey, 2011) can model positional bias, they do so globally rather than in the gene-specific manner we encounter. One potential approach is alternative intron quantification as performed by bulk RNA-seq methods MAJIQ (Vaquero-Garcia et al., 2016), JUM (Wang and Rio, 2018), and LeafCutter (Li et al., 2018). Promisingly, quantification via LeafCutter (Figure 1f) yields an embedding that displays less clustering by plate than the other approaches we tried. We therefore based scQuint’s quantification approach on LeafCutter’s, with the key difference of restricting to alternative introns which share a common 3’ acceptor site (Figure 2). This results in alternative splicing events that are equidistant from the 3’ end of transcripts and which are less affected by the coverage biases we observed in scRNA-seq data. The embedding of cells based on our quantification approach (Figure 1g) shows less clustering by plate than LeafCutter and other methods.
Another advantage of alternative intron quantification is the ability to easily discover novel alternative splicing events. Whereas short reads generally cannot be associated with specific transcript isoforms, nor even exons if they partially overlap, split reads uniquely associate with a particular intron. Consequently, intron-based quantification does not depend on annotated transcriptome references and permits the discovery of novel alternative splicing events. This is important since, as detailed later, we estimate up to 30% of cell-type-specific differential splicing events are novel. Other annotation-free methods have been applied to single-cell short-read full-length data, but they do not provide a statistical test for differential splicing between two groups of cells (Appendix 1—table 1).
We do not recommend using scQuint to analyze alternative splicing in 10x Genomics Chromium data given its strong 3’ transcript bias and evidence suggesting that these data can detect about half the number of junctions detected by Smart-seq2 (Wang et al., 2021). This imposes a fundamental limit on the number of transcripts that can be distinguished, and we expect alternative intron quantification to be sub-optimal in this setting. Nonetheless, several approaches for differential transcript usage in 10x data have been developed: Sierra (Patrick et al., 2020), SpliZ (Olivieri et al., 2020), and a kallisto-based approach which could be adapted for this task (Ntranos et al., 2019).
Dimensionality reduction with Variational Autoencoder
To perform dimensionality reduction using splicing profiles, we developed a novel Variational Autoencoder (VAE) (Kingma and Welling, 2014) with a Dirichlet-Multinomial noise model, a natural distribution for sparse, overdispersed count data (Figure 2b, Materials and methods). For example, the often encountered ‘binary’ splicing (Buen Abad Najar et al., 2020) can be modeled by fitting a concentration parameter close to zero. VAEs are flexible and scalable generative models which have been successfully applied to analyze gene expression (Lopez et al., 2020) but have not yet been employed to investigate alternative splicing. To verify that we prevent leakage of gene expression information into our splicing profiles, we applied our VAE to embed a shuffled dataset obtained by resampling alternative intron counts with a fixed proportion in all cells. This shuffled dataset contained expression variability between cells but no splicing differences, and, as hoped, the resulting splicing latent space did not distinguish among cell types, indicating that it captures differences in splicing proportions rather than changes in absolute gene expression (Figure 2—figure supplement 1). We compared the latent space obtained with the VAE to the one obtained using Principal Component Analysis (PCA), a standard dimensionality reduction technique used in the LeafCutter and BRIE2 software packages. The VAE better distinguishes cell types than PCA (Figure 3), especially in the mammary gland and diaphragm.
Differential splicing hypothesis testing with Generalized Linear Model
To test for differential splicing across cell types or conditions, we adopt a Dirichlet-Multinomial Generalized Linear Model (GLM) coupled with a likelihood-ratio test (Figure 2c, Materials and methods). We do so by adapting one of LeafCutter’s proposed models for bulk RNA-seq to the scRNA-seq setting and apply it to our Smart-seq2 intron quantification. Namely, due to the sparse nature of scRNA-seq splicing data, we implement a more parsimonious statistical model featuring gene-level rather than intron-level parameters. Furthermore, we adjust the model-fitting algorithm at the initialization and optimization stages (see Materials and methods). After our modifications, we obtain well-calibrated p-values whereas those from LeafCutter’s original differential splicing model are anti-conservative (Figure 2—figure supplement 2) and perhaps prone to extra false positives if applied directly to scRNA-seq data. We also find improvements in computational cost, both in runtime and memory usage (Figure 2—figure supplement 2).
As described in Materials and methods, we generated synthetic data in order to benchmark scQuint against three other methods that also offer two-sample tests for differential transcript usage proportions: BRIE2 and DTUrtle, both designed for scRNA-seq, and LeafCutter, designed for bulk RNA-seq (Figure 4). While the choice of an appropriate simulation model for scRNA-seq data is very much an open area of debate, particularly at the transcript level, we attempted to recreate a challenging setting for inference by assuming low coverage (1–2X) and high overdispersion (variance-to-mean ratio of 8). We performed three in silico experiments to assess performance under the differing conditions of even transcript coverage, unannotated events, and coverage decay across the transcript. In the case of even coverage, scQuint, LeafCutter, and BRIE2 perform similarly and do a good job of correctly identifying events, while DTUrtle is slightly behind. scQuint does only slightly worse with low cell counts and low coverage, which is probably a trade-off for the robustness that comes from only using reads from junctions sharing 3’ acceptor sites. Next, we recreated the unannotated setting by masking the reference given to methods. Only scQuint and LeafCutter are able to perform differential transcript usage testing in this setting, and, as expected, they performed nearly identically to the annotated setting with even coverage. Lastly, we created a setting where transcript coverage decays with distance from the 3’ in one of the two groups, mirroring a pattern we often saw in the real data analyzed for this paper. Here, scQuint outperforms the other tested methods by a wide margin with performance improving at higher coverages, unlike other methods. These results validate that scQuint is robust to both incomplete annotations and coverage decay while only paying a modest penalty relative to other methods under ideal conditions (even coverage and annotated events).
Augmenting cell atlases with splicing information
We applied scQuint to two of the largest available Smart-seq2 datasets. The first comprehensively surveys the mouse primary motor cortex (BICCN Cortex) (Yao et al., 2021) while the second contains over 100 cell types distributed across 20 mouse organs (Tabula Muris) (Schaum et al., 2018; Table 1). We detect more alternative introns in BICCN Cortex neurons than in the entire broad range of cell types present in Tabula Muris (which includes neurons but in much smaller number). This observation comports with previous findings that the mammalian brain has exceptionally high levels of alternative splicing (Yeo et al., 2004). Booeshaghi et al., 2021 analyzed BICCN Cortex at the transcript level, but focused on changes in absolute transcript expression rather than proportions. While the authors indirectly find some differences in transcript proportions by inspecting genes with no differential expression, this is not a systematic analysis of differential transcript usage. Meanwhile, only microglial cells in Tabula Muris (Nip et al., 2020) have been analyzed at the transcript level. (Tabula Muris also contains 10x Chromium data analyzed at the transcript level [Patrick et al., 2020]).
As a community resource, we provide complementary ways to interactively explore splicing patterns present in these datasets (Figure 5), available at (https://github.com/songlab-cal/scquint-analysis, Benegas, 2021a) with an accompanying tutorial video. The UCSC Genome Browser (Kent et al., 2002) permits exploration of alternative splicing events within genomic contexts such as amino acid sequence, conservation score, or protein binding sites, while allowing users to select different length scales for examination. We additionally leverage the cell×gene browser (Megill et al., 2021) (designed for gene expression analysis) to visualize alternative intron PSI (percent spliced-in, defined as the proportion of reads supporting an intron relative to the total in the intron group) via cell embeddings. Further, one can generate histograms to compare across different groups defined by cell type, gender, or even manually selected groups of cells. These tools remain under active development by the community, and we hope that both the genome- and cell-centric views will soon be integrated into one browser.
Cell-type-specific splicing signal is strong and complementary to gene expression
Primary motor cortex
We first explored the splicing latent space of BICCN Cortex cells by comparing it to the usual expression latent space (Figure 6a). Cells in the splicing latent space strongly cluster by cell type (annotated by Yao et al., 2021 based on gene expression). A similar analysis was recently performed (Feng et al., 2021) on a different cortex subregion in which most, but not all, neuron subclasses could be distinguished based on splicing profiles (e.g., L6 CT and L6b could not be separated). However, the authors only considered annotated skipped exons, a subset of the events we quantify, and used a different dimensionality reduction technique.
Figure 6b (top left) highlights some differentially spliced genes between Glutamatergic and GABAergic neurons, including the glutamate metabotropic receptor Grm5 as well as Shisa9/Ckamp44, which associates with AMPA ionotropic glutamate receptors (von Engelhardt et al., 2010). The expression pattern of these genes, meanwhile, does not readily distinguish the neuron classes (Figure 6b, top right). In Pgm2, a gene of the glycolysis pathway thought to be regulated in the developing cortex by mTOR (Schüle et al., 2021), we discover a novel exon preferentially included in Glutamatergic neurons (Figure 6c, Figure 6—figure supplement 2).
Our differential splicing test reveals thousands of cell-type-specific splicing events (further discussed below in subsection Comparison of selected tissues), highlighting marker introns that distinguish neuron subclasses, while the expression of their respective genes does not; for example, compare the bottom left and bottom right panels of Figure 6b. Genes that better distinguish cell types at the expression level can be seen in Figure 6—figure supplement 1. As another example of the many novel events we discover, we showcase a novel alternative transcription start site in Rbfox1, a splicing factor known to regulate cell-type-specific alternative splicing in the brain (Wamsley et al., 2018; Figure 6d, Figure 6—figure supplement 3). This novel TSS (exon chr16:5763871–5763913, intron Rbfox1_26172), which lies in a highly-conserved region, is (partially) used by only L6b neurons. We are also able to detect well-known cell-type-specific alternatively spliced genes such as Nrxn1, which encodes a key pre-synaptic molecule (Figure 6e, Figure 6—figure supplement 4; Fuccillo et al., 2015). In this case, we observe an exon (known as splice site 2) exclusively skipped in Vip and Lamp5 neurons.
General patterns in Tabula Muris
We next turned our attention to Tabula Muris, which comprises a wide variety of organs and cell types from across the entire body. As before, we initially compared the expression and splicing latent spaces using UMAP (Figure 7a). This revealed broadly consistent clusters between projections, but a visible shift in the global layout of these clusters. In particular, whereas cell types were better separated in the expression projection, cell classes (e.g., endothelial, epithelial, immune) formed more coherent clusters in the splicing projection.
To supplement our qualitative comparison of UMAP projections with a more rigorous approach, we built dendrograms and a tanglegram using the respective distances between cells in each of the expression and splicing latent spaces (Figure 7b). Despite minor shifts, the dendrograms resemble one another, and most subtree structure is preserved. The low value of their entanglement, a quantitative measure of the discrepancy between hierarchical clusterings, at only 6% indicates a high degree of similarity. (For comparison, the entanglement value between the dendrogram for all expressed genes and that for transcript factors is 11% [Schaum et al., 2018]). As in the UMAP visualization, immune cells group together more closely in the splicing dendrogram. However, unlike the UMAP projection, we observe that several types of pancreatic cells cluster together with neurons, a cell type long believed to share an evolutionary origin (Le Roith et al., 1982). Notably, the left dendrogram in Figure 7b shows that hepatocytes are clear outliers in the expression latent space. We suspect this may be due to technical differences from using 96-well plates rather than the 384-well plates used for other cell types.
B cell development in the marrow
We then focused on developing B cells from the bone marrow in Tabula Muris. In the splicing latent space, we found that immature B cells are harder to distinguish from the other B cell subpopulations (Figure 8a), reflecting less refined splicing programs or limitations in transcript capture efficiency. Immature B cells have also fewer differential splicing events when compared to the other stages of B cell development (Figure 8b). The top differential splicing events we identified throughout development displayed splicing trajectories mostly independent from the trajectories of gene expression (Figure 8c). We highlight alternative TSSs (one of them novel) in two transcription factors essential for B cell development: Smarca4, encoding BRG1 (Bossen et al., 2015; Figure 8d, Figure 8—figure supplement 1); and Foxp1 (Hu et al., 2006; Figure 8e, Figure 8—figure supplement 2). While Foxp1 expression peaks in pre-B cells and does not follow a monotonic trend over developmental stages, the alternative TSS is progressively included throughout B cell development. Combining gene-level expression with TSS usage, which can influence translation rate, provides a more nuanced characterization of the expression patterns of these important transcription factors. Some other differentially spliced genes with well-known roles in B cell development are Syk (Cornall et al., 2000), Dock10 (García-Serna et al., 2016), Selplg/Psgl-1 (González-Tajuelo et al., 2020), and Rps6ka1 (Stein et al., 2017).
Epithelial and endothelial cell types across organs
Having compared different cell types within organs, we analyzed putatively similar cell types which are present in multiple organs to investigate splicing variation associated with tissue environment and function. We find many alternative introns with strong PSI differences across epithelial cell types, including several which are novel (Figure 9a). Conversely, apart from those in the brain, endothelial cell types fail to display such striking differences (Figure 9b). These patterns are consistent with the UMAP projection and dendrogram, both of which suggested less heterogeneity among endothelial than epithelial cells (Figure 7).
Our analysis revealed a novel alternative TSS in Itpr1 (Figure 9c, Figure 9—figure supplement 2), an intracellular calcium channel in the endoplasmic reticulum, which regulates secretory activity in epithelial cells of the gastrointestinal tract (Lemos et al., 2020). This novel TSS yields a shorter protein isoform (full view in Figure 9—figure supplement 1) which preserves the transmembrane domain, though it is unclear whether this isoform is functional. Notably, it is the predominant isoform in large intestine secretory cells, and these cells express Itpr1 at the highest level among all epithelial cell types in the dataset. All nine novel alternative splicing events in Figure 9a are alternative TSSs, with four affecting the 5’ UTR and five affecting the coding sequence.
Figure 9d (PSI distribution in Figure 9—figure supplement 3) illustrates a complex alternative splicing event in Khk involving the well-studied exons 3a and 3c (Hayward and Bonthron, 1998). Khk catalyzes the conversion of fructose into fructose-1-phosphate, and the two protein isoforms corresponding to either exon 3a or 3c inclusion differ in their thermostability and substrate affinity (Asipu et al., 2003). While the literature describes these exons as mutually exclusive, the transcriptome reference includes transcripts where neither or both may be included. Although we did not find cell types with high inclusion rates for both exons, we did see multiple cell types where both exons are predominantly excluded, for example, epithelial cells from the large intestine. Other differentially spliced genes are involved in cellular junctions, which are particularly important in epithelial tissue. These include Gsn, Eps8, Tln2, Fermt3, and Mapre2.
Comparison of selected tissues
Because of the breadth of the Tabula Muris dataset, we can look for general trends across a diverse array of tissues and cell types. Table 2 summarizes differential expression and splicing for some of the cell types and tissues with the largest sample sizes. First, we note the intersection between the top 100 most differentially expressed and top 100 most differentially spliced genes (ranked by p-value) is consistently low. This means that most differentially spliced genes, which might be of critical importance in a biological system, will go unnoticed if a study only considers differential expression. Second, L5 IT neurons have a larger fraction of genes with differential splicing relative to the number of differentially expressed genes.
We found many more cell-type-specific differential splicing events in the cortex than in the marrow, as expected (Yeo et al., 2004), as well as a higher proportion of events involving novel junctions, which can reach 30% (Figure 10a). Differences in proportion of novel junctions should be interpreted with care, however, since they can be affected by sequencing depth and number of cells, both of which vary between the two tissues. Very similar patterns are seen when grouping differential splicing events that occur in the same gene (Figure 10b). Most differential splicing events that we detected with alternative introns fall in the coding portion of the gene, with high proportions in the 5’ UTR (Figure 10c). This is a property of our quantification approach and does not reflect the total number of alternative splicing events in different gene regions; still, the relative proportion can be compared across tissues. We find an increased proportion of differentially spliced non-coding RNA in the cortex, the majority of which are previously unannotated events. To systematically evaluate how well cell types can be distinguished in the expression and splicing latent spaces, we calculated the ROC AUC score for the one-versus-all classification task for each cell type in each tissue using a binary logistic regression model (Figure 10d). Since cell type labels were defined using gene expression values, near-perfect classification is to be expected using the expression latent space. Classification based only on the splicing latent space is very good in general, suggesting that cell-type-specific differential splicing is rather pervasive. A few cell types were more challenging to classify correctly using splicing patterns alone. One such example is immature B cells, a reflection of the lower degree of separation observed in the embedding of Figure 8a.
Finding splicing factors associated with specific alternative splicing events
Several splicing factors have been identified as regulators of specific alternative splicing events, but most regulatory interactions remain unknown (see Vuong et al., 2016 for a review focused on the brain). To complement expensive and laborious knockout experiments, we sought to generate regulatory hypotheses by analyzing the correlation between splicing outcomes and splicing factor variation across cell types. Focusing on a subset of highly expressed genes in BICCN primary motor cortex neurons, we fit a sparse linear model regressing PSI of skipped exons on both expression and splicing patterns of splicing factors (Figure 11a and Figure 11—figure supplement 1). Our model recovers several known regulatory interactions such as Khdrbs3/Slm2/T-Star’s repression of splice site 4 (SS4) in neurexins, modulating their binding with post-synaptic partners (Traunmüller et al., 2016). Additionally, the proportion of a novel alternative TSS (though annotated in the human reference) in Khdrbs3 (Figure 11b, Figure 11—figure supplement 2) is negatively associated with SS4 in Nrxn1 and Nrxn3. This novel isoform lacks the first 30 amino acids of the Qua1 homodimerization domain and could affect dimerization, which modulates RNA affinity (Feracci et al., 2016). The model also recovers the known regulation of a skipped exon in Camta1, a transcription factor required for long-term memory (Bas-Orth et al., 2016), by Rbfox1 (Pedrotti et al., 2015). The skipping of exon 5 (E5) of Grin1, which controls long-term synaptic potentiation and learning (Sengar et al., 2019), is known to be regulated by Mbnl2 and Rbfox1 (Vuong et al., 2016). The model associates Grin1 E5 PSI with the expression of Rbfox1 but not Mbnl2; however, it does suggest an association with the PSI of two skipped exons in Mbnl2 (Figure 11c, Figure 11—figure supplements 3 and 4) and further implicates the inclusion level of the novel alternative TSS in Rbfox1 reported above (Rbfox1_26172, chr16:5763912–6173605, Figure 6d). These results help clarify the disparate impacts of expression and alternative splicing in splicing factors, and encourage the use of regression models to suggest candidate regulators of cell-type-specific alternative splicing. Such computationally generated hypotheses are particularly valuable for splicing events in splicing factors because of the heightened difficulty to experimentally perturb specific exons rather than whole genes.
Discussion
In this study, we introduce scQuint, a toolkit for the quantification, visualization, and statistical inference of alternative splicing in full-length scRNA-seq data without the need for annotations. This allows us to successfully extend the analysis of two single-cell atlases to the level of alternative splicing, overcoming the usual technical challenges as well as coverage artifacts and incomplete annotations. Our results, which we make available for public exploration via interactive browsers, indicate the presence of strong cell-type-specific alternative splicing and previously unannotated splicing events across a broad array of cell types. In most cases, splicing variation is able to differentiate cell types just as well as expression levels. We also note a striking lack of overlap between the most strongly differentially expressed and spliced genes (Table 2), suggesting that expression and splicing are complementary rather than integrated processes. Moreover, this complementarity may also manifest temporally, as we show in developing B cells in the marrow. Another outstanding question is the functional significance of isoforms, and we find that most differential splice sites appear in the coding sequence with a sizeable minority also mapping to 5’ UTRs. The apparent predilection for events to occur in these regions rather than 3’ UTRs poses questions about the role of splicing in protein synthesis from translational regulation to the formation of polypeptide chains. Answering these questions requires a more precise understanding of how variation in UTRs and coding sequences affects final protein output as well as the biophysical characteristics of protein isoforms and their roles in different biological systems. These factors, combined with the large fraction of unannotated events in several cell types, should encourage tissue specialists to more deeply consider the contribution of transcript variation to cell identity and cell and tissue homeostasis.
Despite the clear association between splicing and cell identity, our analyses are yet to produce instances in which clustering in the splicing latent space reveals new cell subpopulations not visible in the expression latent space. This, of course, does not preclude the possibility in other settings where alternative splicing is known to be important, such as in specific developmental transitions or disease conditions. Nevertheless, our current experience leads us to believe that gene expression and splicing proportions provide two different projections of the same underlying cell state. Incidentally, RNA Velocity (La Manno et al., 2018) estimates can be distorted by alternative splicing, and (Bergen et al., 2020) discuss incorporating isoform proportions into the model as a future direction.
To support our understanding of cell-type-specific splicing, we implemented a regularized generalized linear regression model which exploits the natural variation of splicing factors in different cell types. We recovered a number of previously identified (via knockout experiments) regulatory interactions and propose novel regulatory interactions involving genes known to play important regulatory roles. A key component of our analysis is the decision to include both the expression and alternative splicing patterns of splicing factors as features in the model. Consequently, we infer that several alternative splicing events in splicing factors themselves (some previously unannotated) contribute to their regulatory activity. Our model thus provides several opportunities for follow-up and does so with an increased granularity that distinguishes between effects due to expression and splicing differences. To facilitate further exploration of these data, we have uploaded our results to cell and genome browsers (linked at https://github.com/songlab-cal/scquint-analysis, (Benegas, 2021a copy archived at swh:1:rev:97dc31babf2a585666af4a38b1e4aa59a92bbf87)).
Our experience analyzing these large datasets, initially with prior methods and then scQuint, has led to a series of general observations regarding the analysis of splicing in scRNA-seq data. As most analyses use full-length short-read protocols because of the cost of long-read data and the necessary focus on the 3’ end of transcripts in most UMI-based techniques, we restrict our attention to the full-length short-read setting and its incumbent challenges. For example, low transcript capture efficiency introduces additional technical noise into isoform quantification (Arzalluz-Luque and Conesa, 2018; Westoby et al., 2020; Buen Abad Najar et al., 2020), and incomplete transcriptome annotations result in discarded reads and reduced sensitivity to cross-cell differences (Westoby et al., 2020). Nonetheless, we considered several methods (summarized in Appendix 1—table 1) to analyze transcript variation in short-read, full-length scRNA-seq. We found each of the classes of current methods to be problematic in the context of our datasets for varying reasons. Methods which depend on transcript annotations (Bray et al., 2016; Qiu et al., 2017; Huang and Sanguinetti, 2017; Hu et al., 2020; Yan et al., 2015; Wen et al., 2020; Liu et al., 2021; Huang and Sanguinetti, 2021; Tekath and Dugas, 2021) cannot easily identify unannotated alternative splicing events. In large collections of previously unsurveyed cell types, these may comprise a sizable fraction of events. Indeed, we found up to 30% of differential splicing events were unannotated in certain cell types. Annotation-free approaches are also available, but they either do not provide a formal statistical test for differential transcript usage across conditions (Song et al., 2017; Ling et al., 2020; Nip et al., 2020; Welch et al., 2016), or only do so in a specialized manner (Matsumoto et al., 2020), reducing their potential impacts. Finally, methods’ different approaches to quantification are affected by coverage biases to varying degrees. Some methods may thus lead to erroneous inference of cell clusters due to technical rather than biological variation. Until the prevalence and severity of coverage biases are better understood, we advocate quantifying transcript variation in a robust manner.
Recent and future experimental advances will catalyze the study of isoform variation in single cells. For instance, Smart-seq3 (Hagemann-Jensen et al., 2020) allows sequencing of short reads from the entire length of a gene together with unique molecular identifiers, improving mRNA capture and allowing for the filtering of PCR duplicates; however, experiments show that less than 40% of reads can be unambiguously assigned to a single (annotated) isoform. Ultimately, long-read scRNA-seq will provide the definitive picture of isoform variation between cells. Until then, there is much biology to be studied using short-read protocols, and variation at the transcript level should not be disregarded.
Materials and methods
Datasets
Request a detailed protocolTabula Muris data (Schaum et al., 2018) have accession code GSE109774. Cells were filtered to those from 3-month-old mice present in this collection: https://czb-tabula-muris-senis.s3-us-west-2.amazonaws.com/Data-objects/tabula-muris-senis-facs-processed-official-annotations.h5ad (filtering details in Tabula Muris Consortium, 2020). BICCN Cortex data (Yao et al., 2021) were downloaded from https://assets.nemoarchive.org/dat-ch1nqb7 and filtered as in Booeshaghi et al., 2021.
Simulation
Request a detailed protocolA preliminary set of exon skipping events was obtained by running briekit-event from the BRIE2 software package. For each event, one pair of transcripts was selected if they only differed on the skipped exon, resulting in 561 pairs, each from a different gene. Reads were simulated using Polyester (Frazee et al., 2015), which allows us to control overdispersion and induce different kinds of biases. For roughly half of the genes, differential transcript usage (DTU) was induced by overexpressing one transcript 1.5-fold in one of the two conditions. The number of reads was generated using a highly overdispersed negative binomial distribution with variance equal to eight times the mean. To simulate coverage decay in one of the conditions, the option bias="cdnaf" was added. To ensure coverage decays as a function of absolute distance to the 3’ end of the transcript, reads were generated no farther away from the 3’ than the minimum of the lengths of the two alternative transcripts. The Area Under the Receiver Operating Characteristic Curve (ROC AUC) for classifying genes into DTU vs. non-DTU was computed using the p-values from each method, excluding genes that were not tested by a given method (e.g., because of a minimum reads threshold).
Quantification
Request a detailed protocolThe bioinformatic pipeline was implemented using Snakemake (Köster and Rahmann, 2012). Raw reads were trimmed from Smart-seq2 adapters using Cutadapt (Martin, 2011) before mapping to the GRCm38/mm10 genome reference (https://hgdownload.soe.ucsc.edu/goldenPath/mm10/chromosomes/) and the transcriptome reference from Ensembl release 101 (ftp://ftp.ensembl.org/pub/release-101/gtf/mus_musculus/Mus_musculus.GRCm38.101.gtf.gz). Alignment was done using STAR (Dobin et al., 2013) in two-pass mode allowing novel junctions as long as they were supported by reads with at least 20 base pair overhang (30 if they are non-canonical) in at least 30 cells. Also, multimapping and duplicate reads were discarded using the flag --bamRemoveDuplicatesType UniqueIdentical (while this can remove duplicates from the second PCR step of Smart-seq, it will not remove duplicates from the first PCR step). Soft-clipped reads were removed as well. Additionally, reads were discarded if they belonged to the ENCODE region blacklist (Amemiya et al., 2019) (downloaded from https://github.com/Boyle-Lab/Blacklist/raw/master/lists/mm10-blacklist.v2.bed.gz).
Gene expression was quantified using featureCounts (Liao et al., 2014), and total-count normalized such that each cell had 10,000 reads (as in the Scanpy (Wolf et al., 2018) tutorial). Intron usage was quantified using split reads with an overhang of at least six base pairs. Introns were discarded if observed in fewer than 30 cells in BICCN Cortex or 100 cells in Tabula Muris. Introns were grouped into alternative intron groups based on shared 3’ splice acceptor sites. Introns not belonging to any alternative intron group were discarded. Additionally, we decided to subset our analysis to introns with at least one of their donor or acceptor sites annotated, so we could assign a gene to each intron and facilitate interpretation for our specific analyses.
Dimensionality reduction
Request a detailed protocolTo run PCA, we worked with alternative intron proportions (PSI, Percent Spliced In) rather than their absolute counts, as the latter would be confounded by gene expression differences. We first introduce some notation:
: cell identifier
: intron group identifier
: vector of counts of introns in intron group and cell
: function to divide each entry of a vector by the total sum.
Then, PSI can be defined as:
However, given the sparsity of single-cell data, a very high proportion of alternative intron groups will have no reads in a given cell, leaving PSI undefined. More generally, an intron group may contain few reads, resulting in defined but noisy PSI estimates. To navigate this issue, we introduce a form of empirical shrinkage towards a central value. We first define the ‘global PSI’ by aggregating reads from all cells and normalizing. Then, we add this global PSI as a pseudocount vector to each cell before re-normalizing to obtain each cell’s shrunken PSI profile (these are non-uniform pseudocounts adding up to one).
We then run standard PCA on the cell-by-intron-smoothed PSI matrix.
The VAE was implemented using PyTorch (Paszke et al., 2019) and scvi-tools (Gayoso et al., 2021). The following is the generative model, repeated for each cell (we drop the superscript indexing the cell in , , and ):
Sample the latent cell state
For each intron group :
Obtain the underlying intron proportions:
Sample the intron counts conditioning on the total observed ng:
Here fg, known as the decoder, can be any differentiable function, including linear mappings and neural networks. is a scalar controlling the amount of dispersion. We optimize a variational posterior on cell latent variables (Gaussian with diagonal covariance, given by an encoder neural network) as well as point estimates of global parameters fg, . The encoder takes as input the smoothed PSI values, as in PCA, but the likelihood is based on the raw intron counts. The objective to maximize is the evidence lower bound (ELBO), consisting of a reconstruction term and a regularization term:
where KL denotes the Kullback–Leibler divergence. Optimization is performed using Adam (Kingma and Ba, 2015), a stochastic gradient descent method. To avoid overfitting in cases of relatively few cells with respect to the number of features, we considered a linear decoder (Svensson et al., 2020), as well as a prior on the entries of the decoder matrix. Hyperparameters were tuned using reconstruction error on held-out data and are described in Table 3.
Differential splicing test
Request a detailed protocolOur differential splicing test across conditions (such as cell types) is based on a modified version of the Dirichlet-Multinomial Generalized Linear Model proposed in LeafCutter (Li et al., 2018) for bulk RNA-seq. For each intron group with alternative introns:
is a vector of counts for each of the introns;
The independent variable, , equals 0 in one condition and 1 in the other;
are the intercept and coefficients of the linear model;
is a dispersion parameter shared across conditions; and
the function maps from to the -dimensional probability simplex.
The Dirichlet-Multinomial Generalized Linear Model then proceeds as follows:
Obtain the underlying intron proportions:
Sample the intron counts conditioned on the total observed, ng:
We implemented this model in PyTorch and optimized it using L-BFGS (Liu and Nocedal, 1989).
To test for differential splicing across the two conditions, we compare the following two hypotheses:
Null hypothesis H0:
Alternative hypothesis H1:
We use the likelihood-ratio test, the test statistic for which is asymptotically distributed as a random variable with degrees of freedom under H0. Finally, we correct p-values for multiple testing using the Benjamini-Hochberg FDR procedure (Benjamini and Hochberg, 1995).
The differences with LeafCutter are the following:
LeafCutter groups introns that share a 5’ donor or 3’ acceptor site while scQuint groups introns that share a 3’ acceptor site.
LeafCutter has a vector of concentration parameters, one for each intron, while scQuint uses a single concentration parameter per intron group.
The LeafCutter and scQuint optimization procedures were implemented separately and differ in initialization strategies as well as L-BFGS hyperparameters.
Latent space analysis
Request a detailed protocolThe expression latent space was obtained by running PCA with 40 components on log-transformed and normalized gene expression values. The splicing latent space was obtained by running the VAE on the alternative intron count matrix (or equivalent features, e.g., Kallisto transcript counts, DEXSeq exon counts). Both latent spaces were visualized using UMAP (McInnes et al., 2018). In the comparison of Figure 1, we used our own implementation of the quantifications proposed by ODEGR-NMF, DEXSeq, and DESJ for ease of application to large single-cell datasets.
Dendrograms were constructed using hierarchical clustering (R function hclust) based on euclidean distance between the median latent space embedding of cells of each type. Tanglegram and entanglement were calculated using the dendextend R package, with the step2side method, as also described in Schaum et al., 2018.
Reported scores for cell type classification within a tissue were obtained by running a binary logistic regression classifier over different splits of cells into train and test sets. To assess generalization across individuals, we ensured the same individual was not present in both train and test sets.
Cell-type-specific differential splicing
Request a detailed protocolFor differential splicing testing between a given cell type and the rest of the tissue, we only considered introns expressed in at least 50 cells and intron groups with at least 50 cells from both of the conditions. We called an intron group ‘differentially spliced’ if it was both statistically significant using a 5% FDR and if it contained an intron with a PSI change greater than 0.05. We considered a differentially spliced intron group as unannotated if it contained an unannotated intron with a PSI change greater than 0.05. Differential expression was performed using the Mann-Whitney test. A gene was considered differentially expressed if it was statistically significant using a 5% FDR and if the fold change was at least 1.5.
For selection of marker genes or introns, we proceeded in a semi-automated fashion. For each cell type, we first filtered to keep only significant genes or introns and then ranked them by effect size. We picked a certain number of genes or introns from the top of this list for each cell type, while ensuring there were no repetitions.
Splicing factor regression analysis
Request a detailed protocolWe obtained 75 mouse splicing factors using the Gene Ontology term ‘alternative mRNA splicing, via spliceosome’ (http://amigo.geneontology.org/amigo/term/GO:0000380). A skipped exon annotation, processed by BRIE (Huang and Sanguinetti, 2017), was downloaded from https://sourceforge.net/projects/brie-rna/files/annotation/mouse/gencode.vM12/SE.most.gff3/download. Instead of using single cells as replicates, we partitioned the BICCN primary motor cortex dataset into roughly 200 clusters of 30 cells each that were pooled to create pseudobulks, aiming to reduce variance in the expression and splicing of splicing factors used as covariates in the model. We filtered target exon skipping events to those defined in at least 95% of the replicates, and those having a PSI standard deviation of at least 0.2. We used log-transformed normalized expression and PSI of alternative splicing events as input features. We chose to keep the PSI of only one intron per intron group to avoid the presence of highly correlated features and improve clarity, even if some information from non-binary events is lost. Input features were filtered to those having standard deviation of at least 0.05, and then standardized. A lasso Dirichlet-Multinomial GLM was fit to the data (in this instance, the model reduces to a Beta-Binomial because skipped exons are binary events), with the sparsity penalty selected via cross-validation. As a first approach, we fit a regular lasso linear regression model on PSI instead of raw counts, resulting in roughly similar patterns in the coefficients. Figure 11c shows the coefficients of the lasso Dirichlet-Multinomial model for the top 30 targets with the highest variance explained by the regular lasso model, all above 68%.
Code and data availability
Request a detailed protocolscQuint is implemented in Python and is available at https://github.com/songlab-cal/scquint, (Benegas, 2021b copy archived at swh:1:rev:a9db6454e13d42af25f47deee19e201e74d2bdd0). Differential splicing results and access to cell and genome browsers, together with the code to reproduce our results, are available at https://github.com/songlab-cal/scquint-analysis, (Benegas, 2021c copy archived at swh:1:rev:97dc31babf2a585666af4a38b1e4aa59a92bbf87). Processed alternative intron count matrices are provided in the AnnData format (anndata.readthedocs.ioanndata.readthedocs.io) for easy manipulation with Scanpy (Wolf et al., 2018), Seurat (Stuart et al., 2019), and other tools.
Appendix 1
Overview of available methods for alternative splicing analysis in full-length scRNA seq data
Due to experimental considerations, the analysis of transcript variation in 10x Chromium data is mostly restricted to the 3’ end of genes; in contrast, Smart-seq2 and other full-length, short-read protocols theoretically enable characterization of transcript variation along the whole gene. Nevertheless, numerous challenges impede such analyses in practice. For example, low transcript capture efficiency introduces additional technical noise into transcript quantification (Arzalluz-Luque and Conesa, 2018; Westoby et al., 2020; Buen Abad Najar et al., 2020), and incomplete transcriptome annotations result in discarded reads and reduced sensitivity to cross-cell differences (Westoby et al., 2020). Some authors have even recommended avoiding the analysis of alternative splicing in single-cell RNA sequencing (scRNA-seq) data until such obstacles can be suitably overcome (Westoby et al., 2020). Despite these difficulties, several methods (summarized in Appendix 1—table 1) have sought to analyze transcript variation in short-read, full-length scRNA-seq. Many methods, including kallisto (Bray et al., 2016), Census (Qiu et al., 2017), BRIE (Huang and Sanguinetti, 2017), SCATS (Hu et al., 2020), Quantas (Yan et al., 2015), VALERIE (meant only for visualization) (Wen et al., 2020), DESJ (Liu et al., 2021), BRIE2 (Huang and Sanguinetti, 2021) and DTUrtle (Tekath and Dugas, 2021), depend on transcript annotations and consequently cannot easily identify unannotated alternative splicing events, which may comprise a sizable fraction of events. Currently available annotation-free methods, such as ODEGR-NMF (Matsumoto et al., 2020), Expedition (Song et al., 2017), ASCOT (Ling et al., 2020), SingleSplice (Welch et al., 2016) and RNA-Bloom (Nip et al., 2020), do not provide a statistical test for differential transcript usage across conditions. Appendix 1—table 1 summarizes this information and makes the comparison of different methods easier.
Data availability
All data analyzed in this study are publicly available and URL links are provided in the Materials and methods section of our manuscript. Our source code as well as all results represented in figures and tables are publicly available on our lab's GitHub repositories: https://github.com/songlab-cal/scquint, (copy archived at swh:1:rev:a9db6454e13d42af25f47deee19e201e74d2bdd0) and https://github.com/songlab-cal/scquint-analysis, (copy archived at swh:1:rev:97dc31babf2a585666af4a38b1e4aa59a92bbf87).
-
NCBI Gene Expression OmnibusID GSE109774. Tabula Muris.
-
nemoarchiveID dat-ch1nqb7. BRAIN Initiative Cell Census Network Cortex.
References
-
Detecting differential usage of exons from RNA-seq dataGenome Research 22:2008–2017.https://doi.org/10.1101/gr.133744.111
-
The calmodulin-binding transcription activator CAMTA1 is required for long-term memory formation in miceLearning & Memory (Cold Spring Harbor, N.Y.) 23:313–321.https://doi.org/10.1101/lm.041111.115
-
Controlling the false discovery rate: a practical and powerful approach to multiple testingJournal of the Royal Statistical Society 57:289–300.https://doi.org/10.1111/j.2517-6161.1995.tb02031.x
-
Generalizing RNA velocity to transient cell states through dynamical modelingNature Biotechnology 38:1408–1414.https://doi.org/10.1038/s41587-020-0591-3
-
Near-optimal probabilistic RNA-seq quantificationNature Biotechnology 34:525–527.https://doi.org/10.1038/nbt.3519
-
STAR: ultrafast universal RNA-seq alignerBioinformatics (Oxford, England) 29:15–21.https://doi.org/10.1093/bioinformatics/bts635
-
Polyester: simulating RNA-seq datasets with differential transcript expressionBioinformatics (Oxford, England) 31:2778–2784.https://doi.org/10.1093/bioinformatics/btv272
-
ggsashimi: Sashimi plot revised for browser- and annotation-independent splicing visualizationPLOS Computational Biology 14:e1006360.https://doi.org/10.1371/journal.pcbi.1006360
-
Relevance of PSGL-1 expression in B cell development and activationFrontiers in Immunology 11:588212.https://doi.org/10.3389/fimmu.2020.588212
-
Single-cell isoform RNA sequencing characterizes isoforms in thousands of cerebellar cellsNature Biotechnology 36:1197–1202.https://doi.org/10.1038/nbt.4259
-
Single-cell RNA counting at allele and isoform resolution using Smart-seq3Nature Biotechnology 38:708–714.https://doi.org/10.1038/s41587-020-0497-0
-
Structure and alternative splicing of the ketohexokinase geneEuropean Journal of Biochemistry 257:85–91.https://doi.org/10.1046/j.1432-1327.1998.2570085.x
-
Foxp1 is an essential transcriptional regulator of B cell developmentNature Immunology 7:819–826.https://doi.org/10.1038/ni1358
-
Detecting differential alternative splicing events in scRNA-seq with or without Unique Molecular IdentifiersPLOS Computational Biology 16:e1007925.https://doi.org/10.1371/journal.pcbi.1007925
-
ConferenceAdam: A method for stochastic optimizationIn: Proceedings of the 2nd International Conference on Learning Representations (ICLR).
-
ConferenceAuto-encoding variational BayesIn: Proceedings of the 2nd International Conference onLearning Representations (ICLR).
-
Snakemake--a scalable bioinformatics workflow engineBioinformatics (Oxford, England) 28:2520–2522.https://doi.org/10.1093/bioinformatics/bts480
-
Inositol 1,4,5 trisphosphate receptors in secretory epithelial cells of the gastrointestinal tractCurrent Opinion in Physiology 17:169–174.https://doi.org/10.1016/j.cophys.2020.08.003
-
Annotation-free quantification of RNA splicing using LeafCutterNature Genetics 50:151–158.https://doi.org/10.1038/s41588-017-0004-9
-
featureCounts: an efficient general purpose program for assigning sequence reads to genomic featuresBioinformatics (Oxford, England) 30:923–930.https://doi.org/10.1093/bioinformatics/btt656
-
ASCOT identifies key regulators of neuronal subtype-specific splicingNature Communications 11:1–12.https://doi.org/10.1038/s41467-019-14020-5
-
On the limited memory BFGS method for large scale optimizationMathematical Programming 45:503–528.https://doi.org/10.1007/BF01589116
-
Enhancing scientific discoveries in molecular biology with deep generative modelsMolecular Systems Biology 16:20199198.https://doi.org/10.15252/msb.20199198
-
An NMF-based approach to discover overlooked differentially expressed gene regions from single-cell RNA-seq dataNAR Genomics and Bioinformatics 2:lqz020.https://doi.org/10.1093/nargab/lqz020
-
UMAP: Uniform Manifold Approximation and ProjectionJournal of Open Source Software 3:861.https://doi.org/10.21105/joss.00861
-
ConferencePyTorch: An imperative style, high-performance deep learning libraryIn: Advances in Neural Information Processing Systems. pp. 8026–8037.
-
The RNA-binding protein Rbfox1 regulates splicing required for skeletal muscle structure and functionHuman Molecular Genetics 24:2360–2374.https://doi.org/10.1093/hmg/ddv003
-
Full-length RNA-seq from single cells using Smart-seq2Nature Protocols 9:171–181.https://doi.org/10.1038/nprot.2014.006
-
mTOR Driven Gene Transcription Is Required for Cholesterol Production in Neurons of the Developing Cerebral CortexInternational Journal of Molecular Sciences 22:6034.https://doi.org/10.3390/ijms22116034
-
A defined metabolic state in pre B cells governs B-cell development and is counterbalanced by Swiprosin-2/EFhd1Cell Death and Differentiation 24:1239–1252.https://doi.org/10.1038/cdd.2017.52
-
Interpretable factor models of single-cell RNA-seq via variational autoencodersBioinformatics (Oxford, England) 36:3418–3421.https://doi.org/10.1093/bioinformatics/btaa169
-
Differential transcript usage analysis of bulk and single-cell RNA-seq data with DTUrtleBioinformatics (Oxford, England) 37:3781–3787.https://doi.org/10.1093/bioinformatics/btab629
-
Control of neuronal synapse specification by a highly dedicated alternative splicing programScience (New York, N.Y.) 352:982–986.https://doi.org/10.1126/science.aaf2397
-
CKAMP44: a brain-specific protein attenuating short-term synaptic plasticity in the dentate gyrusScience (New York, N.Y.) 327:1518–1522.https://doi.org/10.1126/science.1184178
-
The neurogenetics of alternative splicingNature Reviews. Neuroscience 17:265–281.https://doi.org/10.1038/nrn.2016.27
-
Direct Comparative Analyses of 10X Genomics Chromium and Smart-seq2Genomics, Proteomics & Bioinformatics 19:253–266.https://doi.org/10.1016/j.gpb.2020.02.005
-
Robust detection of alternative splicing in a population of single cellsNucleic Acids Research 44:e73.https://doi.org/10.1093/nar/gkv1525
-
VALERIE: Visual-based inspection of alternative splicing events at single-cell resolutionPLOS Computational Biology 16:e1008195.https://doi.org/10.1371/journal.pcbi.1008195
Article and author information
Author details
Funding
National Institutes of Health (R35-GM134922)
- Gonzalo Benegas
- Yun S Song
Chan Zuckerberg Initiative (CZF2019-002449)
- Gonzalo Benegas
- Yun S Song
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
We would like to thank Angela Oliveira Pisco, Spyros Darmanis, and Kif Liakath-Ali for helpful discussions. We also thank the Chan Zuckerberg Biohub for hosting our cell×gene sessions and Aaron McGeever for assistance. This research is supported in part by grant number R35-GM134922 from NIH and grant number CZF2019-002449 from the Chan Zuckerberg Initiative Foundation. YSS is a Chan Zuckerberg Biohub Investigator.
Copyright
© 2022, Benegas et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
-
- 5,286
- views
-
- 486
- downloads
-
- 9
- citations
Views, downloads and citations are aggregated across all versions of this paper published by eLife.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading
-
- Chromosomes and Gene Expression
To function effectively as an integrated system, the transcriptional and post-transcriptional machineries must communicate through mechanisms that are still poorly understood. Here, we focus on the zinc-finger Sfp1, known to regulate transcription of proliferation-related genes. We show that Sfp1 can regulate transcription either by binding to promoters, like most known transcription activators, or by binding to the transcribed regions (gene bodies), probably via RNA polymerase II (Pol II). We further studied the first mode of Sfp1 activity and found that, following promoter binding, Sfp1 binds to gene bodies and affects Pol II configuration, manifested by dissociation or conformational change of its Rpb4 subunit and increased backtracking. Surprisingly, Sfp1 binds to a subset of mRNAs co-transcriptionally and stabilizes them. The interaction between Sfp1 and its client mRNAs is controlled by their respective promoters and coincides with Sfp1’s dissociation from chromatin. Intriguingly, Sfp1 dissociation from the chromatin correlates with the extent of the backtracked Pol II. We propose that, following promoter recruitment, Sfp1 accompanies Pol II and regulates backtracking. The backtracked Pol II is more compatible with Sfp1’s relocation to the nascent transcripts, whereupon Sfp1 accompanies these mRNAs to the cytoplasm and regulates their stability. Thus, Sfp1’s co-transcriptional binding imprints the mRNA fate, serving as a paradigm for the cross-talk between the synthesis and decay of specific mRNAs, and a paradigm for the dual-role of some zinc-finger proteins. The interplay between Sfp1’s two modes of transcription regulation remains to be examined.
-
- Chromosomes and Gene Expression
- Microbiology and Infectious Disease
Candida glabrata can thrive inside macrophages and tolerate high levels of azole antifungals. These innate abilities render infections by this human pathogen a clinical challenge. How C. glabrata reacts inside macrophages and what is the molecular basis of its drug tolerance are not well understood. Here, we mapped genome-wide RNA polymerase II (RNAPII) occupancy in C. glabrata to delineate its transcriptional responses during macrophage infection in high temporal resolution. RNAPII profiles revealed dynamic C. glabrata responses to macrophages with genes of specialized pathways activated chronologically at different times of infection. We identified an uncharacterized transcription factor (CgXbp1) important for the chronological macrophage response, survival in macrophages, and virulence. Genome-wide mapping of CgXbp1 direct targets further revealed its multi-faceted functions, regulating not only virulence-related genes but also genes associated with drug resistance. Finally, we showed that CgXbp1 indeed also affects fluconazole resistance. Overall, this work presents a powerful approach for examining host-pathogen interaction and uncovers a novel transcription factor important for C. glabrata's survival in macrophages and drug tolerance.