Integrating bulk and single cell RNA-seq refines transcriptomic profiles of individual C. elegans neurons

Alec Barrett
Erdem Varol
Alexis Weinreb
Seth R Taylor
Rebecca M McWhirter
Cyril Cros
Berta Vidal
Manasa Basaravaju
Abigail Poff
John A Tipps
Maryam Majeed
Chen Wang
Emily A Bayer
Molly Reilly
Eviatar Yemini
HaoSheng Sun
Oliver Hobert author has email address
David M Miller III author has email address
Marc Hammarlund author has email address

Department of Genetics, Yale University School of Medicine, New Haven, United States
Department of Statistics, Columbia University, New York, United States
Department of Neuroscience, Yale University School of Medicine, New Haven, United States
Department of Cell and Developmental Biology, Vanderbilt University School of Medicine, Nashville, United States
Department of Biological Sciences, Howard Hughes Medical Institute Columbia University, New York, United States
Department of Neurobiology, University of Massachusetts Chan Medical School, Worcester, United States
Department of Cell, Developmental, and Integrative Biology, University of Alabama Birmingham, Birmingham, United States
Program in Neuroscience, Vanderbilt University School of Medicine, Nashville, United States

https://doi.org/10.7554/eLife.106183.2

Open access
Copyright information

Figures and data

Single neuron bulk RNA-seq via targeted marker expression and FACS isolation.
A) Labeling, tissue dissociation, and FACS-enrichment schemes for capturing individual neuron types. Intersecting flp-22::GFP and unc-47::mCherry markers uniquely label AVL for isolation by FACS from dissociated L4 stage larval cells. RNA from this pool of AVL-enriched cells was used for bulk RNA sequencing (see Methods). B) PCA plot showing all bulk RNA-seq data labeled by cell type and colored according to functional modality; Sensory neurons (blue), motor neurons (green), interneurons (red), and CAN neurons (purple).

Identifying and removing contamination from bulk data with LittleBites.
A) Schematic representation of the LittleBites algorithm. Each iteration of the LittleBites algorithm contains 3 steps: 1) Use an NNLS regression to estimate the cell-type proportions of each bulk sample with aggregated single cell data as a reference. 2) Subtract the estimated contamination profiles, constrained by the proportion estimates, a gene level specificity weight, and a variable learning rate. 3) Compare the result of every tested learning rate against ground truth, and select the learning rate with the highest area under the receiver operator characteristic (AUROC). This loop repeats until further subtraction cannot improve the AUROC. B) Line plots of the receiver operator characteristic (ROC) curve for the unaltered bulk data and corrected datasets, as well as bMIND and ENIGMA analyses, assessed against ubiquitous genes and genes exclusively expressed in non-neuronal cells. C) Line plots of the ROC for the unaltered bulk data and corrected datasets, as well as bMIND and ENIGMA analyses, assessed against 160 genes expressed within the nervous system. D) Violin plot of AUROC values for bulk samples after LittleBites cleaning using a random subset of target cells in the single cell reference. X-axis indicates the number of target cells used in the reference. Each subsampling rate was run 500 times. e.g. for a VB sample, it was modeled using 10, 50, 100, or 500 VB neurons in the reference. E) Violin plot of the average coefficient of variation (CV) per gene after LittleBites cleanup. The Y-axis indicates the average CV across all genes in a single sample, and the the X-axis indicates the number of target cells used in the single-cell reference.

A model for the relationship between cell proportions and gene counts in sc-RNAseq data.
A) Aggregate count and proportion measures of scRNA-seq data. UMAP representations of AFD neurons captured in the scRNA-seq dataset (11). A) Left, log scale representation of ttx-1 counts in all AFD neurons. Right, binarized representation of ttx-1 expression in all AFD neurons. B) ROC for C. elegans neuronal ground-truth genes in scRNA-seq counts and proportions. The teal line indicates the proportion of non-zero cells, and the red line indicates the normalized counts. C) Scatter plot of fitted log-counts for simulated data, with log-counts on the Y-axis, and the sum of the log likelihood ratio of the proportion of non-zero cells and the log of the number of cells in the cluster on the X-axis. Red line indicates Y = X. D) Scatter plot of fitted log-counts for one SIA neuron replicate from the C. elegans neuronal scRNA-seq dataset (11), with log-counts on the Y-axis, and the sum of the log likelihood ratio of the proportion of non-zero cells and the log of the number of cells in the cluster on the X-axis. Red line indicates Y = X.

Integrating bulk RNA-seq and scRNA-seq data sets improves gene detection accuracy.
A) Receiver Operator Characteristic (ROC) curve for bulk, single-cell, and integrated datasets compared to neuronal ground-truth genes. The x-axis shows the False Positive Rate (FPR), and the y-axis shows the true positive rate (TPR). B) Area under the receiver operator characteristic (AUROC). Error bars represent bootstrap 95% confidence intervals of the AUROC value. Pairwise statistical testing was performed using the deLong method for AUROC scores. *** = p < 0.0001. C) Precision-Recall (PR) curve for bulk, single-cell, and integrated datasets compared to neuronal ground-truth genes. The x-axis shows the Precision (1 – False Discovery Rate/FDR), and the y-axis shows the TPR (Recall). D) Boxplot of bootstrapped sensitivity scores for each dataset at a 5% FDR cutoff. E) Bar chart of the number of genes detected at the medium threshold (FDR = 0.14) for the integrated dataset (red) and the dynamic proportions single cell dataset (blue). Bars are ordered by the number of cells in the corresponding single cell cluster from lowest to highest. F,G) Bar plots of significant terms in Tissue enrichment analyses for genes detected only in the single-cell and integrated datasets in ADL neurons. Dynamic thresholding was used for single cell data, and medium thresholds were used to binarize both datasets.

Bulk analysis reveals noncoding RNA expression patterns.
A) Proportions of classes of pan-neuronal noncoding RNAs. B) Proportions of classes of cell type specific noncoding RNAs C) Heatmap of average normalized counts per cell type for ncRNA genes considered cell type specific, rows are grouped by RNA class. Note that genes within individual neuron types are clustered by expression patterns, not similar gene function. D) Number of GPCR pseudogenes with specific expression per neuron type, grouped by neuron function. E) Number of GPCR genes per neuron type, grouped by neuron function.

A) Number of cell types sequenced per functional modality. B) Heatmap of Spearman Correlations between average single cell RNA-seq (row) and Bulk RNA-seq (column) profiles for each neuron type. For each row, correlations were calculated for genes called expressed in that single cell cluster (from single cell thresholding) (11). The single cell data did not separate the DD and VD GABA motor neuron types (11); these rows are identical.

A) Bar graph showing the AUROC values for each dataset assessed against ubiquitous genes and genes exclusively expressed in non-neuronal cells. Error bars represent bootstrap estimated confidence intervals. B) Bar graph showing the AUROC values for each dataset, assessed against 160 genes expressed within the nervous system. Error bars represent bootstrap estimated confidence intervals. C) Scatterplot of the training and testing AUROC improvement after LittleBites cleanup. Each dot represents one sample. Red line indicates x=y. D, E) Scatter plots of pairwise comparisons of sample-level AUROC values for unaltered and corrected bulk datasets. Ubiquitous and non-neuronal AUROCs are represented in panel C, and neuronal AUROCs are represented in panel D. F) Violin plot of unaltered bulk sample estimates for the proportion of the counts that come from neurons, vs the number of target cells used in the single cell reference. G) Violin plot in the coefficient of variation (CV) in the mean value of LittleBites cleaned samples, vs the number of target cells used in the single cell reference. H) Violin plot of the mean gene expression value after LittleBites cleanup, vs the number of target cells used in the single cell reference. Model was fit with slopes calculated for each neuron type separately.

Box plots of edgeR differential expression metrics for single-cell counts and Prop2Count inputs.
1176 neuron-neuron pairs were compared in edgeR, and differentially expressed genes were called as any gene with a p.adj value < 0.05, and a log2 effect size > 2. Metrics were generated by comparing differentially expressed genes with 160 differentially expressed ground truth genes. A) Box plots of TPR values for neuron-neuron edgeR comparisons. B) Box plots of FPR values for neuron-neuron edgeR comparisons. C) Box plots of FDR values for neuron-neuron edgeR comparisons. D) Box plots of MCC values for neuron-neuron edgeR comparisons. Red box plots represent single-cell counts edgeR comparisons, and teal box plots represent Prop2Count edgeR comparisons. All comparisons were made with a two-way permutation test (* = p < 0.05, ** = p < 0.01).

A) Comparison between ‘ground truth’ fluorescent reporters and RNAseq gene expression for all datasets, shown for a set of homeobox genes (34). Fluorescent reporter expression is binarized and is shown as small central squares, with black = expressed and white = not expressed. RNA transcripts shown as natural log transformed TPM counts, color-mapped according to the scale. B) Comparison between ‘ground truth’ fluorescent reporters and RNAseq gene expression for the Integrated dataset, shown for a set of genes encoding neurotransmitter machinery (35). Fluorescent reporter expression is binarized and is shown as small central squares, with black = expressed and white = not expressed. RNA transcripts shown as natural log transformed TPM counts, color-mapped according to the scale. C) Bar plots of significant terms in GO and Tissue enrichment analyses for genes detected only in the single-cell and integrated datasets in ADL and VB neuron types. D) scatterplot of newly detected genes in the Integrated dataset that were previously undetected in all tissues using the single cell reference. Y-axis is the number of genes detected, X-axis is the number of cells in the single cell reference. Red line indicates the line of best fit (log model), and blue lines indicate 95% confidence intervals of the model. E) Boxplot of the number neurons that each ground truth non-neuronal gene was called expressed in at the 14% or 8.4% FDR threshold. Integrated data is filled red, single cell data is filled teal. Notches indicate 95% confidence intervals. Dots represent outliers.

A) Log-scaled expression distribution of protein coding genes (blue) and non-coding genes (pink). B) Density plot showing the distribution of gene level correlation to contaminant estimates (purple); values plotted are the highest correlation per gene. Genes plotted were called expressed in at least one cell type. Blue and black dashed lines represent a gaussian mixture model, used to threshold against contaminant genes. All noncoding genes with a maximum correlation above 0.22 (vertical red line) were removed from analysis. C) Histogram of ncRNA genes binned by number of neuron types in which they are expressed; genes to the right of the red line are expressed in >90% of neuron types in this study. D) Barplot of the number of noncoding RNAs detected in each neuron type, colored by noncoding RNA class.

Sign up for email alerts