Peer review process
Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.
Read more about eLife's peer review process.
Editors
- Reviewing Editor: Douglas Portman, University of Rochester, Rochester, United States of America
- Senior Editor: Claude Desplan, New York University, New York, United States of America
Reviewer #1 (Public review):
This is an interesting manuscript aimed at improving the transcriptome characterization of 52 C. elegans neuron classes. Previous single-cell RNA-seq studies already uncovered transcriptomes for these, but the data are incomplete, with a bias against genes with lower expression levels. Here, the authors use cell-specific reporter combinations to FACS-purify neurons and bulk RNA sequencing to obtain better sequencing depth. This reveals more rare transcripts, as well as non-coding RNAs, pseudogenes, etc. The authors develop computational approaches to combine the bulk and scRNA transcriptome results to obtain more definitive gene lists for the neurons examined.
To ultimately understand the features of any cell, from morphology to function, an understanding of the full complement of the genes it expresses is a prerequisite. This paper gets us a step closer to this goal, assembling a current "definitive list" of genes for a large proportion of C. elegans neurons. The computational approaches used to generate the list are based on reasonable assumptions, the data appear to have been treated appropriately statistically, and the conclusions are generally warranted. I have a few issues that the authors may choose to address:
(1) As part of getting rid of cross-contamination in the bulk data, the authors model the scRNA data, extrapolate it to the bulk data, and subtract out "contaminant" cell types. One wonders, however, given that lowly expressed genes are not represented in the scRNA data, whether the assignment of a gene to one or another cell type can really be made definitive. Indeed, it is possible that a gene is expressed at low levels in one cell type and at high levels in another, and would therefore be considered a contaminant. The result would be to throw out genes that actually are expressed in a given cell type. The definitive list would therefore be a conservative estimate, and not necessarily the correct one.
(2) It would be quite useful to have tested some genes with lower expression levels using in vivo gene-fusion reporters to assess whether the expression assignments hold up as predicted, i.e., to provide another, non-computational avenue of experimentation to confirm that the decontamination algorithm works.
(3) In many cases, a cell class is composed of at least two, if not more, neurons. Is it possible that differences between members of a single class would be missed by applying the cleanup algorithms? Such transcripts would be represented in only a fraction of the cells isolated by scRNA-seq, and might then be considered not real.
(4) I didn't quite catch whether the precise staging of animals was matched between the bulk and scRNA-seq datasets. Importantly, there are many genes whose expression is highly stage-specific or age-specific, so even slight temporal differences might yield different sets of expressed genes.
(5) To what extent does FACS sorting affect gene expression? Can the authors provide some controls?
Reviewer #2 (Public review):
Summary:
This study from the CeNGEN consortium addresses several limitations of single-cell RNA (scRNA) and bulk RNA sequencing in C. elegans, with a focus on cells in the nervous system. scRNA datasets can give very specific expression profiles, but detecting rare and non-polyA transcripts is difficult. In contrast, bulk RNA from isolated cells can be sequenced to high depth to identify rare and non-polyA transcripts, but frequently suffers from RNA contamination from other cell types. In this study, the authors generate a comprehensive set of bulk RNA datasets from 53 individual neuron types isolated by fluorescence-activated cell sorting (FACS). The authors combine these datasets with a previously published scRNA dataset (Taylor et al., 2021) to develop a novel method, called LittleBites, to estimate and subtract contamination from the bulk RNA data. The authors validate the method by comparing detected transcripts against gold-standard datasets of neuron-specific and non-neuronal transcripts. The authors generate an "integrated" list of protein-coding expression profiles for the 53 neuron sub-types, with fewer but higher-confidence genes compared to expression profiles based only on scRNA. Also, the authors identify putative novel pan-neuronal and cell-type-specific non-coding RNAs based on the bulk RNA data. LittleBites should be generally useful for extracting higher-confidence data from bulk RNA-seq in organisms where extensive scRNA datasets are available. The additional confidence in neuron-specific expression and non-coding RNA expands the already great utility of the neuronal expression reference atlas generated by the CeNGEN consortium.
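For orientation, the core idea behind this style of correction can be illustrated with a minimal linear-unmixing sketch. This is an assumed simplification for illustration only, not the authors' LittleBites implementation, and all variable and function names below are hypothetical: each bulk sample is modeled as a non-negative mixture of scRNA-derived cell-type profiles, and the estimated non-target contribution is subtracted.

```python
import numpy as np
from scipy.optimize import nnls

# Hypothetical inputs:
#   bulk:      (n_genes,) expression vector (e.g., CPM) for one FACS-sorted sample
#   profiles:  (n_genes, n_cell_types) average scRNA-seq profiles on the same scale
#   target_ix: column index of the neuron type the sample was sorted for
def subtract_contamination(bulk, profiles, target_ix):
    # Estimate non-negative mixing weights: bulk ~= profiles @ w
    w, _ = nnls(profiles, bulk)
    # Reconstruct the signal attributable to non-target ("contaminant") cell types
    other = [i for i in range(profiles.shape[1]) if i != target_ix]
    contamination = profiles[:, other] @ w[other]
    # Subtract and clip at zero so corrected expression stays non-negative
    return np.clip(bulk - contamination, 0.0, None)
```

The authors' LittleBites method estimates and subtracts contamination using the full scRNA reference and is benchmarked against gold-standard gene sets; the sketch above is only meant to convey the basic unmixing intuition behind such corrections.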
Strengths:
The study generates and analyzes a very comprehensive set of bulk RNA datasets from individual fluorescently tagged transgenic strains. These datasets are technically challenging to generate and significantly expand our knowledge of gene expression, particularly in cells that were poorly represented in the initial scRNA-seq datasets. Additionally, all transgenic strains are made available as a resource from the Caenorhabditis Genetics Center (CGC).
The study draws on the authors' extensive experience with neuronal gene expression to benchmark their contamination-reduction method against a set of gold-standard validated neuronal and non-neuronal genes. These gold-standard genes will be helpful for benchmarking any C. elegans gene expression study.
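For readers who want to see how such a benchmark is typically scored, a minimal sketch (with hypothetical gene names and values, not the authors' actual gold-standard table) labels known-expressed genes as positives and known-absent genes as negatives for a given neuron type and computes the AUROC of the expression estimates:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical example for one neuron type: corrected expression estimates
# alongside gold-standard labels (1 = known expressed, 0 = known not expressed).
genes      = ["gene_a", "gene_b", "gene_c", "gene_d", "gene_e"]
expression = np.array([250.0, 120.0, 8.0, 3.0, 0.5])
labels     = np.array([1,     1,     0,   0,   0])

# AUROC = 1.0 means every known-expressed gene outranks every known-absent gene;
# 0.5 is no better than chance.
print(roc_auc_score(labels, expression))  # -> 1.0
```

Computing such an AUROC before and after decontamination gives a compact, threshold-free measure of how much the correction improves the ranking of true positives over true negatives.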
Weaknesses:
The bulk RNA-seq data collected by the authors has high levels of contamination and, in some cases, is based on very few cells. The methodology to remove contamination partly makes up for this shortcoming, but the high background levels of contaminating RNA in the FACS-isolated neurons limit the confidence in cell-specific transcripts.
The study does not experimentally validate any of the refined gene expression predictions, which was one of the main strengths of the initial CeNGEN publication (Taylor et al., 2021). No validation experiments (e.g., fluorescence reporters or single-molecule FISH) were performed for protein-coding or non-coding genes, which makes it difficult for the reader to assess how much the gene expression predictions are improved, other than for the gold-standard set, which may have specific characteristics (e.g., a bias toward high expression, as these genes were primarily identified in fluorescence reporter experiments).
The study notes that bulk RNA-seq data, in contrast to scRNA-seq data, can be used to identify which isoforms are expressed in a given cell. However, no analysis or genome browser tracks were supplied in the study to take advantage of this important information. For the community, isoform-specific expression could guide the design of cell-specific expression constructs or support predictive modeling of gene expression based on machine learning.
Reviewer #3 (Public review):
The manuscript by Barrett et al., "Integrating bulk and single cell RNA-seq refines transcriptomic profiles of individual C. elegans neurons", presents a comprehensive approach to integrating bulk RNA-seq and single-cell RNA-seq (scRNA-seq) data to refine transcriptomic profiles of individual C. elegans neurons. The study addresses the limitations of scRNA-seq, such as the under-detection of lowly expressed and non-polyadenylated transcripts, by leveraging the sensitivity of bulk RNA-seq. The authors deploy a computational method, LittleBites, to remove non-neuronal contamination from the bulk RNA-seq data, aiming to enhance specificity while preserving the sensitivity advantage of bulk sequencing. Using this approach, the authors identify lowly expressed genes and non-coding RNAs (ncRNAs), many of which were previously undetected in scRNA-seq data.
Overall, the study provides high-resolution gene expression data for 53 neuron classes, covering a wide range of functional modalities and neurotransmitter usage. The integrated dataset and computational tools are made publicly available, enabling community-driven testing of the robustness and reproducibility of the study. Nevertheless, while the study represents a relevant contribution to the field, certain aspects of the work require further refinement to ensure the robustness and rigor necessary for peer-reviewed publication. Below, I outline the areas where improvements are needed to strengthen the overall impact and reliability of the findings.
(1) The study relies on thresholding to determine whether a gene is expressed or not. While this is a common practice, the choice of threshold is not thoroughly justified. In particular, the choice of two uniform cutoffs for protein-coding RNAs and of one distinct threshold for non-coding RNAs is somewhat arbitrary and has several limitations. This reviewer recommends the authors attempt to use adaptive thresholding methods that define gene expression cutoffs on a per-gene basis; a minimal sketch of what per-gene thresholding can look like is given after this list of comments. Some of these methods include GiniClust2, Brennecke's variance modeling, HVG in Seurat, BASiCS, and/or the MAST hurdle model for dropout correction.
(2) Most importantly, the study lacks independent experimental validation (e.g., qPCR, smFISH, or in situ hybridization) to confirm the expression of newly detected lowly expressed genes and non-coding RNAs. This is particularly important for validating novel neuronal non-coding RNAs, which are primarily inferred from computational approaches.
(3) The novel biology is somewhat limited. One potential area of exploration would be to look at cell-type specific alternative splicing events.
(4) The integration method disproportionately benefits neuron types with limited representation in scRNA-seq, meaning well-sampled neuron types may not show significant improvement. The authors should quantify the impact of this bias on the final dataset.
(5) The authors employ a logit transformation to map single-cell proportions into count space, but they need to clarify its assumptions and potential pitfalls (e.g., how it handles rare cell types); a minimal illustration of the transform's boundary behavior is sketched after this list of comments.
(6) The LittleBites approach is highly dependent on the accuracy of existing single-cell references. If the scRNA-seq dataset is incomplete or contains classification biases, this could propagate errors into the bulk RNA-seq data. The authors may want to discuss potential limitations and sensitivity to errors in the single-cell dataset, and it is critical to define minimum quality parameters (e.g., via modeling) for the scRNA-seq dataset used as a reference.
(7) Also very important, the LittleBites method could benefit from a more intuitive explanation and schematic to improve accessibility for non-computational readers. A supplementary step-by-step breakdown of the subtraction process would be useful.
(8) In the same vein, the ROC curves and AUROC comparisons should have clearer annotations to make results more interpretable for readers unfamiliar with these metrics.
(9) Finally, after the correlation-based decontamination of the 4,440 'unexpressed' genes, how many were ultimately discarded as non-neuronal?
a) Among these non-neuronal genes, how many were actually known neuronal genes or components of neuronal pathways (e.g., genes involved in serotonin synthesis, synaptic function, or axon guidance)?
b) Conversely, among the "unexpressed" genes classified as neuronal, how many were likely not neuron-specific (e.g., housekeeping genes) or even clearly non-neuronal (e.g., myosin or other muscle-specific markers)?
(10) To increase transparency and allow readers to probe false positives and false negatives, I suggest the inclusion of:
a) The full list of all 4,440 'unexpressed' genes and their classification at each refinement step. In that list, flag the subsets of genes potentially misclassified, including:
- Neuronal genes wrongly discarded as non-neuronal.
- Non-neuronal genes wrongly retained as neuronal.
b) Add a certainty or likelihood ranking that quantifies confidence in each classification decision, helping readers validate neuronal vs. non-neuronal RNA assignments.
This addition would enhance transparency, reproducibility, and community engagement, ensuring that key neuronal genes are not erroneously discarded while minimizing false positives from contaminant-derived transcripts.
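To make point (1) above concrete, one possible adaptive scheme, offered only as a sketch under the assumption that a gene's expression across cell types is roughly bimodal (and distinct from the specific methods cited in that point), fits a two-component mixture to each gene's log expression and places the cutoff between the two components, rather than applying a single uniform threshold:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def per_gene_threshold(values, pseudocount=1.0):
    """Fit a two-component Gaussian mixture to one gene's log2 expression
    across cell types and return the midpoint between the component means
    as a gene-specific 'expressed' cutoff (back on the original scale)."""
    x = np.log2(np.asarray(values, dtype=float) + pseudocount).reshape(-1, 1)
    gm = GaussianMixture(n_components=2, random_state=0).fit(x)
    lo, hi = np.sort(gm.means_.ravel())
    return 2 ** ((lo + hi) / 2) - pseudocount

# Hypothetical expression of one gene across eight cell types: mostly off, a few on.
print(per_gene_threshold([0, 0.3, 0.1, 45, 60, 0.2, 52, 0.05]))
```

The methods named in point (1) make related but more principled per-gene decisions (e.g., variance modeling or hurdle models for dropouts); this sketch is only meant to show what a per-gene cutoff looks like in practice.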
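Similarly, for point (5), the boundary behavior that makes rare cell types tricky can be shown in a few lines. The form of the transform here is an assumption for illustration (the authors' exact parameterization may differ): logit(p) = log(p / (1 - p)) is undefined at p = 0 and p = 1, so genes detected in none or all of the few cells of a rare type require a pseudocount or clipping before the transform.

```python
import numpy as np

def safe_logit(proportions, eps=1e-6):
    """Logit-transform detection proportions, clipped away from 0 and 1.
    Without clipping, a gene detected in none (p = 0) or all (p = 1) of the
    cells of a rare type maps to -inf/+inf and can distort downstream fits."""
    p = np.clip(np.asarray(proportions, dtype=float), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

# Hypothetical: a gene detected in 0 of 12 cells of a rare type
# versus 45 of 300 cells of a well-sampled type.
print(safe_logit([0 / 12, 45 / 300]))
```

How the clipping constant (or pseudocount) is chosen directly affects genes that are borderline in poorly sampled cell types, which is why the request in point (5) for explicit assumptions matters.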