MGPfactXMBD: A Model-Based Factorization Method for scRNA Data Unveils Bifurcating Transcriptional Modules Underlying Cell Fate Determination

  1. School of Informatics, Xiamen University, Xiamen, 361105, China
  2. National Institute for Data Science in Health and Medicine, School of Medicine, Xiamen University, Xiamen, 361102, China
  3. Department of Hematology, The First Affiliated Hospital of Xiamen University and Institute of Hematology, School of Medicine, Xiamen University, Xiamen, 361102, China

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.

Editors

  • Reviewing Editor
    Mohammad Karimi
    King's College London, London, United Kingdom
  • Senior Editor
    Alan Moses
    University of Toronto, Toronto, Canada

Reviewer #1 (Public Review):

Summary:

Ren et al. developed a novel computational method to investigate cell evolutionary trajectories in scRNA-seq samples. This method, MGPfact, estimates pseudotime and potential branches in the evolutionary path by explicitly modeling the bifurcations in a Gaussian process. They benchmarked this method using synthetic as well as real-world samples and showed superior performance for some of the tasks in cell trajectory analysis. They further demonstrated the utility of MGPfact using single-cell RNA-seq samples derived from microglia or T cells and showed that it can accurately identify the differentiation timepoint and uncover biologically relevant gene signatures.

Strengths:

Overall I think this is a useful new tool that could deliver novel insights for the large body of scRNA-seq data generated in the public domain. The manuscript is written in a logical way and most parts of the method are well described.

Weaknesses:

Some parts of the methods are not clear.

The Methods should outline in detail how pseudotime T is updated. This is currently unclear in both the description and Algorithm 1.

There should be a brief description in the main text of how synthetic data were generated, under what hypothesis, and specifically how bifurcation is embedded in the simulation.

Please explain what the abbreviations mean at their first occurrence.

In the benchmark analysis (Figures 2/3), it would be helpful to include a few trajectory plots of the real-world data to visualize the results and to evaluate the accuracy.

It is not clear how this method selects important genes/features at bifurcation. This should be elaborated on in the main text.

It is not clear how survival analysis was performed in Figure 5. Specifically, were critical confounders, such as age, clinical stage, and tumor purity, controlled for?

I recommend that the authors perform some sort of 'robustness' analysis for the consensus tree built from the bifurcation Gaussian process. For example, repeatedly subsample 80% of the cells and check whether the bifurcations are similar across subsamples.
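
The subsampling check suggested above could be sketched roughly as follows. This is a minimal illustration, not MGPfact's procedure: `toy_infer_branches` is a hypothetical stand-in for any trajectory method that assigns each cell to a branch, the synthetic two-cluster matrix replaces real expression data, and agreement is scored with the adjusted Rand index, which is one possible choice of similarity measure.

```python
import numpy as np


def adjusted_rand_index(a, b):
    """Permutation-invariant agreement between two labelings of the same cells."""
    a = np.asarray(a)
    b = np.asarray(b)
    _, ai = np.unique(a, return_inverse=True)
    _, bi = np.unique(b, return_inverse=True)
    # Contingency table: how many cells fall into each pair of labels.
    table = np.zeros((ai.max() + 1, bi.max() + 1), dtype=np.int64)
    np.add.at(table, (ai, bi), 1)

    def comb2(x):
        return x * (x - 1) / 2.0

    sum_cells = comb2(table).sum()
    sum_rows = comb2(table.sum(axis=1)).sum()
    sum_cols = comb2(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / comb2(a.size)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def toy_infer_branches(X):
    """Hypothetical stand-in for a trajectory method.

    Splits cells into two branches by the sign of their score on the
    first principal component of the centered data.
    """
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return (Xc @ vt[0] > 0).astype(int)


def subsample_stability(X, infer_branches, fraction=0.8, n_rounds=20, seed=0):
    """Re-run inference on random cell subsets and compare with the full run."""
    rng = np.random.default_rng(seed)
    n_cells = X.shape[0]
    reference = infer_branches(X)
    scores = []
    for _ in range(n_rounds):
        # Draw a subset of cells without replacement.
        idx = rng.choice(n_cells, size=int(fraction * n_cells), replace=False)
        labels = infer_branches(X[idx])
        # Compare branch labels only on the cells present in this subset.
        scores.append(adjusted_rand_index(reference[idx], labels))
    return np.array(scores)


# Synthetic example: two well-separated groups of cells (assumed data).
data_rng = np.random.default_rng(1)
X = np.vstack([
    data_rng.normal(-3.0, 1.0, size=(100, 5)),
    data_rng.normal(3.0, 1.0, size=(100, 5)),
])
scores = subsample_stability(X, toy_infer_branches)
print(scores.mean(), scores.min())
```

Scores close to 1 across rounds would indicate that the inferred branching is stable under subsampling, while a wide spread would suggest that the bifurcation depends on which cells happen to be included.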

Reviewer #2 (Public Review):

Summary of the manuscript:

The authors present MGPfactXMBD, a novel model-based manifold-learning framework designed to address the challenges of interpreting complex cellular state spaces from single-cell RNA sequences. To overcome current limitations, MGPfactXMBD factorizes complex development trajectories into independent bifurcation processes of gene sets, enabling trajectory inference based on relevant features. As a result, it is expected that the method provides a deeper understanding of the biological processes underlying cellular trajectories and their potential determinants.

MGPfactXMBD was tested across 239 datasets and performed similarly to, or slightly better than, state-of-the-art methods on key quality-control metrics. When applied to case studies, MGPfactXMBD successfully identified critical pathways and cell types in microglia development, validating experimentally identified regulons and markers. Additionally, it uncovered evolutionary trajectories of tumor-associated CD8+ T cells, revealing new subtypes with gene expression signatures that predict responses to immune checkpoint inhibitors in independent cohorts.

Overall, MGPfactXMBD represents a relevant tool in manifold learning for scRNA-seq data, enabling feature selection for specific biological processes and enhancing our understanding of the biological determinants of cell fate.

Summary of the outcome:

The novel method addresses core state-of-the-art questions in biology related to trajectory identification. The design and the case studies are of relevance.

However, in my opinion, the manuscript requires several clarifications and updates.

It also remains unclear how the method compares with existing deep-learning-based approaches such as TIGON. If a comparison is possible, it should be conducted; if not, the authors should explain why.

Strengths:

(1) Relevant methodology for a current field of research.

(2) Relevant case studies with relevant outcomes.

Weaknesses:

(1) In general, the manuscript may be improved by making the text more accessible to the Journal's audience: (i) explaining some concepts more intuitively; (ii) reviewing the flow of some explanations.

(2) Additionally, several parts require more details on how the methods work, especially the case studies.

(3) Finally, there are missing references to published work and possibly some additional comparisons to make.

Author response:

(1) Clarification and Detailed Explanation in the Methods Section:

- Regarding Reviewer 1's comments about the unclear explanation of the update process for pseudotime, T, and the selection of important genes/features at bifurcation points in the methods, we will provide a detailed description of the update process for pseudotime T and how high-weight genes important to the bifurcation process are selected.

- Regarding Reviewer 2's comments concerning the impact of the initial pseudotime prediction method and the insufficient description of various parameters, we will add information about the differences in the initially used pseudotime prediction methods and provide detailed information on the techniques and parameters used in each analysis.

- Regarding Reviewer 2's comments on the choice of kernel functions, we will explain the rationale for selecting rbf and polynomial kernels and why other options were discarded.

(2) Performance Comparison and Data Presentation:

- Regarding Reviewer 1's comments about using a few trajectory plots of the real-world data to visualize the results, we will include 1-2 trajectory plots of real-world datasets in the benchmark analysis to better visualize the results and assess accuracy.

- Regarding Reviewer 2's comments concerning the lack of comparison results and discussion related to trajectory prediction methods based on deep learning, we will include a comparison with deep learning methods such as scTour and TIGON in the revision. Additionally, we will discuss the latest deep learning methods for bifurcation analysis and alternative trajectory inference methods such as CellRank.

- Regarding Reviewer 2's comments on the impact of MURP, we will include an analysis on whether the number of MURPs affects the performance of the method and compare it with the random subsampling approach.

(3) Article Calibration and Refinement:

- Regarding Reviewer 2's comments on the discussion section, we will simplify the first three paragraphs to succinctly convey the background and implications of our contributions. Additionally, we will explain why HVG is considered as the entire feature space in our comparisons and analyses.

- Regarding Reviewer 2's comments concerning the regulons in the microglia analysis, we will review and correct the explanations and revise the article accordingly.

- In response to the issues raised by both reviewers regarding grammatical errors, spelling mistakes, and inconsistencies between text and figures, we will review and correct any errors in the article. This includes providing explanations for all abbreviations upon their first appearance, ensuring the accuracy of text and figure descriptions, correcting equation numbering, improving image quality, and revising descriptions such as "the current manifold learning methods face two major challenges."

(4) Enhancing Descriptions and Readability:

- Regarding Reviewer 1's comments about the synthetic data, we will add a brief description in the main text on how synthetic data were generated.

- Regarding Reviewer 1's comments on the survival analysis, we will provide a more detailed description of the computational steps and clarify whether key confounding factors such as age, clinical stage, and tumor purity were controlled.

- Regarding Reviewer 2's comments on evaluation metrics, we will add detailed descriptions of the evaluation metrics and provide intuitive explanations of how different methods perform across various metrics in the comparison results.

- Regarding Reviewer 2's comments on CD8+ T cells, we plan to compare MGPfact with Monocle3, in addition to Monocle2. This will help clarify the added value of MGPfact and provide a more comprehensive evaluation of its performance.

- Regarding Reviewer 2's comments about consensus trajectories, we will add a detailed description of how consensus trajectories are generated.

- Regarding Reviewer 2's comments on regulons, we will include additional information on the process of downstream trajectory analysis and clarify the roles of SCENIC, GENIE3, RCisTarget, and AUCell in the bifurcation analysis.
