Transgenic strains used for isolating individual neuron classes.

Single neuron bulk RNA-seq via targeted marker expression and FACS isolation.

A) Labeling, tissue dissociation, and FACS-enrichment schemes for capturing individual neuron types. Intersecting flp-22::GFP and unc-47::mCherry markers uniquely label AVL for isolation by FACS from dissociated L4 stage larval cells. RNA from this pool of AVL-enriched cells was used for bulk RNA sequencing (see Methods). B) PCA plot showing all bulk RNA-seq data labeled by cell type and colored according to functional modality; Sensory neurons (blue), motor neurons (green), interneurons (red), and CAN neurons (purple).

Identifying and removing contamination from bulk data with LittleBites.

A) Schematic representation of the LittleBites algorithm. Each iteration of the LittleBites algorithm contains 3 steps: 1) Use an NNLS regression to estimate the cell-type proportions of each bulk sample with aggregated single cell data as a reference. 2) Subtract the estimated contamination profiles, constrained by the proportion estimates, a gene level specificity weight (not shown), and a variable learning rate. 3) Compare the result of every tested learning rate against ground truth, and select the learning rate with the highest area under the receiver operator characteristic (AUROC). This loop repeats until further subtraction cannot improve the AUROC. B) Line plots of the receiver operator characteristic (ROC) curve for the unaltered bulk data and corrected datasets, as well as bMIND and ENIGMA analyses, assessed against ubiquitous genes and genes exclusively expressed in non-neuronal cells. C) Line plots of the ROC for the unaltered bulk data and corrected datasets, as well as bMIND and ENIGMA analyses, assessed against 160 genes expressed within the nervous system.

A model for the relationship between cell proportions and gene counts in sc-RNAseq data.

A) Aggregate count and proportion measures of scRNA-seq data. UMAP representations of AFD neurons captured in the scRNA-seq dataset(11). A) Left, log scale representation of ttx-1 counts in all AFD neurons. Right, binarized representation of ttx-1 expression in all AFD neurons. B) ROC for C. elegans neuronal ground-truth genes in scRNA-seq counts and proportions. The teal line indicates the proportion of non-zero cells, and the red line indicates the normalized counts. C) Scatter plot of fitted log-counts for simulated data, with log-counts on the Y-axis, and the sum of the log likelihood ratio of the proportion of non-zero cells and the log of the number of cells in the cluster on the X-axis. Red line indicates Y = X. D) Scatter plot of fitted log-counts for one SIA neuron replicate from the C. elegans neuronal scRNA-seq dataset(11), with log-counts on the Y-axis, and the sum of the log likelihood ratio of the proportion of non-zero cells and the log of the number of cells in the cluster on the X-axis. Red line indicates Y = X.

Integrating bulk RNA-seq and scRNA-seq data sets improves gene detection accuracy.

A) Receiver Operator Characteristic (ROC) curve for bulk, single-cell, and integrated datasets compared to neuronal ground-truth genes. The x-axis shows the False Positive Rate (FPR), and the y-axis shows the true positive rate (TPR). B) Area under the receiver operator characteristic (AUROC). Error bars represent bootstrap 95% confidence intervals of the AUROC value. Pairwise statistical testing was performed using the deLong method for AUROC scores. *** = p < 0.0001. C) Precision-Recall (PR) curve for bulk, single-cell, and integrated datasets compared to neuronal ground-truth genes. The x-axis shows the Precision (1 – False Discovery Rate/FDR), and the y-axis shows the TPR (Recall). D) Boxplot of bootstrapped sensitivity scores for each dataset at a 5% FDR cutoff. E) Bar chart of the number of genes detected at the medium threshold (FDR = 0.14) for the integrated dataset (red) and the dynamic proportions single cell dataset (blue). Bars are ordered by the number of cells in the corresponding single cell cluster from lowest to highest. F,G) Bar plots of significant terms in Tissue enrichment analyses for genes detected only in the single-cell and integrated datasets in ADL neurons. Dynamic thresholding was used for single cell data, and medium thresholds were used to binarize both datasets.

Bulk RNA-seq samples detect protein coding genes that are not detected in scRNA-seq clusters.

A) Scatter plot showing the relationship between the size of a scRNA-seq cluster (i.e., the number of cells in the cluster) and the Spearman correlation between the average bulk RNA-seq profile and the average scRNA-seq for all protein coding genes. Each dot represents one cell type. Red dashed line shows a Michaelis-Menten fit (see Methods), gmax = 0.675, beta = 29.507. Blue dashed lines show the 97.5% confidence interval of the fit. B) Scatter plot showing the relationship between the size of a scRNA-seq cluster and the number of additional protein coding genes detected in the integrated dataset per cell type. Each dot represents one cell type. Red dashed line shows an exponential decay fit (see Methods), M = 105.26, m = 44.44, alpha = 238.38. Blue dashed lines show the 97.5% confidence interval of the fit. C) Bar plot showing the number of protein coding genes detected per cell type in the bulk dataset. Genes plotted are: 1) called unexpressed in the corresponding single cell cluster; 2)) are expressed above the medium threshold (FDR = 0.14, 0.237 normalized counts) in the integrated data profile for that cell type. D) Density plot of the gene level correlation to contaminant estimates for all genes that are detected in single cell data. Correlation coefficients were calculated using all unaltered bulk replicates (TMM normalized). Only the highest correlation per gene is used. Blue and black dashed lines represent a Gaussian mixture model, used to threshold against contaminant genes. All protein coding genes with a maximum correlation above 0.175 (red line) were removed from analysis.

Bulk analysis reveals noncoding RNA expression patterns.

A) Proportions of classes of pan-neuronal noncoding RNAs. B) Proportions of classes of cell type specific noncoding RNAs C) Heatmap of average normalized counts per cell type for ncRNA genes considered cell type specific, rows are grouped by RNA class. Note that genes within individual neuron types are clustered by expression patterns, not similar gene function. D) Number of GPCR pseudogenes with specific expression per neuron type, grouped by neuron function. E) Number of GPCR genes per neuron type, grouped by neuron function.

A) Number of cell types sequenced per functional modality. B) Heatmap of Spearman Correlations between average single cell RNA-seq (row) and Bulk RNA-seq (column) profiles for each neuron type. For each row, correlations were calculated for genes called expressed in that single cell cluster (from single cell thresholding)(11). The single cell data did not separate the DD and VD GABA motor neuron types(11); these rows are identical.

A) Bar graph showing the AUROC values for each dataset assessed against ubiquitous genes and genes exclusively expressed in non-neuronal cells. Error bars represent bootstrap estimated confidence intervals. B) Bar graph showing the AUROC values for each dataset, assessed against 160 genes expressed within the nervous system. Error bars represent bootstrap estimated confidence intervals. C,D) Scatter plots of pairwise comparisons of sample-level AUROC values for unaltered and corrected bulk datasets. Ubiquitous and non-neuronal AUROCs are represented in panel C, and neuronal AUROCs are represented in panel D.

Box plots of edgeR differential expression metrics for single-cell counts and Prop2Count inputs. 1176 neuron-neuron pairs were compared in edgeR, and differentially expressed genes were called as any gene with a p.adj value < 0.05, and a log2 effect size > 2. Metrics were generated by comparing differentially expressed genes with 160 differentially expressed ground truth genes. A) Box plots of TPR values for neuron-neuron edgeR comparisons. B) Box plots of FPR values for neuron-neuron edgeR comparisons. C) Box plots of FDR values for neuron-neuron edgeR comparisons. D) Box plots of MCC values for neuron-neuron edgeR comparisons. Red box plots represent single-cell counts edgeR comparisons, and teal box plots represent Prop2Count edgeR comparisons. All comparisons were made with a two-way permutation test (* = p < 0.05, ** = p < 0.01).

A) Comparison between ‘ground truth’ fluorescent reporters and RNAseq gene expression for all datasets, shown for a set of homeobox genes(34). Fluorescent reporter expression is binarized and is shown as small central squares, with black = expressed and white = not expressed. RNA transcripts shown as natural log transformed TPM counts, color-mapped according to the scale. B) Comparison between ‘ground truth’ fluorescent reporters and RNAseq gene expression for the Integrated dataset, shown for a set of genes encoding neurotransmitter machinery(35). Fluorescent reporter expression is binarized and is shown as small central squares, with black = expressed and white = not expressed. RNA transcripts shown as natural log transformed TPM counts, color-mapped according to the scale. C) Bar plots of significant terms in GO and Tissue enrichment analyses for genes detected only in the single-cell and integrated datasets in ADL and VB neuron types.

A-E) GO enrichment analysis for protein coding genes detected in selected bulk samples that were not detected in the corresponding scRNA-seq cluster. GO enrichment performed using WormBase. F) Density plot gene level correlations to non-neuronal contamination estimates in bulk samples. Maximum correlation per gene is plotted. Blue density curve represents a set of genes identified as putative non-neuronal markers, and red density curve represents all other protein coding genes, only protein coding genes called expressed in at least one neuronal cell type were plotted. G) Tissue Enrichment Analysis of 161 ‘new’ genes found in ADL.

A) Log-scaled expression distribution of protein coding genes (blue) and non-coding genes (pink). B) Density plot showing the distribution of gene level correlation to contaminant estimates (purple); values plotted are the highest correlation per gene. Genes plotted were called expressed in at least one cell type. Blue and black dashed lines represent a gaussian mixture model, used to threshold against contaminant genes. All noncoding genes with a maximum correlation above 0.22 (vertical red line) were removed from analysis. C) Histogram of ncRNA genes binned by number of neuron types in which they are expressed; genes to the right of the red line are expressed in >90% of neuron types in this study. D) Barplot of the number of noncoding RNAs detected in each neuron type, colored by noncoding RNA class.