Introduction

Data-mining of single-cell RNA sequencing (scRNA-seq) is often transformed into learning of lower-dimensional embedding (Becht et al., 2019; Haghverdi et al., 2015; Van der Maaten and Hinton, 2008) of the expression vectors, which represents the variation in the cellular space and helps explain the biological background. Previous single-cell studies used various embedding methods to characterize and visualize clustering of cells with unique biological functions (Saelens et al., 2019). Among the existing methods, graph-based embedding can better capture nonlinear biological signals among cells hence yielded more insights of the diversity of cells. More recent studies also use graph-based embedding (Costa et al., 2018) to reveal the dependency among cells and thereby reconstruct the evolutionary trajectory in the cellular space, which helps in understanding the determination of cell fate in development, differentiation and cancer.

To date, more and more manifold-learning methods are developed to infer lower-dimensional graphic embedding (manifolds) of scRNA-seq data, and yielded a number of trajectories corresponding to important cellular processes, such as TSCAN (Ji and Ji, 2016), DPT (Haghverdi et al., 2016), and scShaper (Smolander et al., 2022) belong to linear topological classes and reveal major linear pathways based on embedding spaces or cell clustering. TSCAN employs the construction of minimum spanning trees to discover pathways, while DPT reconstructs cellular trajectories using random walks, and scShaper integrates multiple pseudo-temporal solutions to deduce the shortest trajectory within a linear context. Additionally, there have been many approaches capable of inferring complex tree topological structures, such as the widely used Monocle series of algorithms includes Monocle 2 and 3 (Cao et al., 2019; Qiu et al., 2017b, 2017a). They leverage sophisticated graphing techniques to map intricate cell hierarchies; Monocle 2 creates DDRtree based on reversed graph embedding techniques, while Monocle 3 utilizes UMAP (Becht et al., 2019) for embedding. TinGa (Todorov et al., 2020) and scFates (Faure et al., 2023) represent more recent innovations. TinGa utilizes the Growing Neural Gas (GNG) algorithm (Fritzke, 1994) to construct an adaptive graph structure that effectively captures the density structure of the dataset. scFates, streamlines pseudotime analysis with flexible tree learning options, advanced feature extraction tasks, and specific functionalities for fork analysis.

Despite the recent advancements in reconstructing single-cell trajectory, the current manifold-learning methods are faced with two major challenges. As the inferred graph-based embedding is the result of multiple complex biological processes, the effect of specific process cannot be directly distinguished from each other, which hinders the interpretability of model and discovery of novel, relevant knowledge in biology. Then, the inferred graph embedding is highly dependent on the biases in the feature-selection, whereas conventional linear feature-selection methods are less efficient for the learning of complex topologies, hence adds to the difficulty of suggesting candidate genes for downstream functional study.

Here, we describe MGPfactXMBD (Factorization based on Mixtures of Gaussian Processes), a model-based, unsupervised manifold-learning method which factorize complex cellular trajectories into interpretable bifurcation Gaussian processes of transcription, and thereby enable discovery of specific biological determinants of cell fate. In the validation datasets, MGPfact recapitulated developmental trajectory of microglia and recovered key regulatory factors which have been proved experimentally. Moreover, MGPfact discovered highly specific subtypes of tumor-associated CD8+ T cells which associated with benefit to cancer immunotherapy.

Bring together, MGPfact is a knowledge discovery tool which conducts manifold-learning and factorization simultaneously. MGPfact offers two advantages in future scRNA-seq analyses: first, it provides highly interpretable, factorizable cellular trajectories with matched gene expression modules; then, it provides efficient feature-selection for graph-based embedding, thus enhancing our understanding of the determination of cell fate.

Results

Design of MGPfact

The analytical pipeline of MGPfact consists two major stages (Fig. 1). An algorithmic description is given in Algorithm 1.

Overview of MGPfact workflow.

The complete workflow comprises two major stages: MURP downsampling with preprocessed data and trajectory reconstruction. In the stage of trajectory reconstruction, the scRNA-seq data were first factorized into independent bifurcation processes based on mixtures of Gaussian processes, which were then merged into a consensus trajectory.

First, we performed downsampling of the preprocessed scRNA-seq data Y to yield a M-by-N expression matrix Y based on the “minimum unbiased representative points” (MURP) as described previously (Ren et al., 2022), where M representative points were considered as landmarks of the cellular trajectory and N is the number of genes. Then, we computed L Principal Components (PCs) of the downsampled expression matrix to obtain the matrix (M-by-L),

where vl is projection vector, serve as the l-th initial state of embedding.

Next, we used typical Gaussian Process Regression of on pseudotime T:

where f(T) is a Gaussian Process (GP) with covariance matrix S.

And for all features:

where p(Y | f(T)) is defined as follow:

Specifically, we consider S as a mixture of L independent bifurcation Gaussian processes (Schulz et al., 2018),

To cope with the bifurcation processes in cell fate, each of the Gaussian processes is defined with a bifurcation point at bl, branching labels cl, and the necessary hyperparameters. The branching labels cl {0,1,2}, correspond to different phases and states of cell fate, where cl = 0 corresponds the phase before branching, and cl {1,2}. corresponds to the two cellular states of the bifurcation process, respectively. For any landmark (MURP) x,

The covariance matrix sl for trajectory l can be expressed as follows,

where 𝒦 is a kernel function. We employ radial basis function (rbf) and polynomial kernel function (pl).

And the 𝒦 (tx, ty) is calculated as follows:

Therefore, p(Y| f(T)) is updated as follows:

We infer parameters by maximizing the posterior likelihood using Markov Chain Monte Carlo (MCMC) methods available in Mamba(B J, 2014). T is estimated using the Adaptive Metropolis within Gibbs (AMWG) sampling (Roberts and Rosenthal, 2009; Tierney, 1994), as the T exhibits high autocorrelation in the posterior distribution. Other parameters are estimated using the more efficient SLICE sampling technique (Neal, 2003).

Algorithm 1 MGPfact: infer cell fate trajectory Algorithm 1 MGPfact: infer cell fate trajectory

Then, we introduce a rotation matrix R = {r1, r2,..., rL} to obtain factor score wl for each trajectory l by rotating Y.

For all trajectories,

where el is the error term for the l-th trajectory.

Specifically, the loading matrix for each gene onto the l-th trajectory can be expressed using equations (1) and (13) as follows,

which enables factorization and gene-selection based on the inferred trajectories.

Additionally, we can combine independent bifurcation processes to form a consensus diffusion tree to represent the trajectory of cell fate (Supplementary Methods, Supplementary Table 1).

Performance evaluation of MGPfact

MGPfact predicts cellular trajectories

Before the performance evaluation, we performed a grid search on the number of independent trajectories in 100 training datasets and selected L = 3 for downstream testing (Supplementary Fig. 1, Supplementary Fig. 2, Methods). Then, we assessed the performance of MGPfact for prediction of cellular trajectories in 239 test datasets, including 171 synthetic and 68 real datasets, alongside with another 7 existing algorithms (Supplementary Fig. 1). For overall performance score, MGPfact ( Overallmean = 0.534) ranked second only to TinGa ( Overallmean = 0.563) and outperformed the rest of 6 algorithms (Fig. 2a). Particularly, MGPfact demonstrated the highest accuracy in predicting cell fate in branching trajectory (Fig. 2b, ). As for other three individual metrics, MGPfact ranked the 4th in HIM (HIMmean = 0.606), the 6th in cordist , and the 4th in wcorfeatures (Fig.2c-e).

Trajectory inference (TI) performance of state-of-the-art methods in 239 test datasets.

a. Overall scores;b. F1branches; c. HIM; d. cordist; e. wcorfeatures. All results are color-coded based on the trajectory types, with the black line representing the mean value. The “Overall” assessment is calculated as the geometric mean of all four metrics.

Next, we compared the performance of MGPfact with the other algorithms in 9 different trajectory types, respectively, for predicting differentiated cell fate (F1branches). As a result, MGPfact significantly outperformed more than half of the algorithms tested in the following trajectory types (T-test P<0.1, Table 1): disconnected graph (N=5), linear (N=5), bifurcation (N=4), multifurcation (N=4) and tree (N=4). As for the other three metrics, MGPfact also showed advantages in HIM in linear (N=5), bifurcation (N=3), convergence (N=3); and in wcorfeatures in multifurcation (N=6), bifurcation (N=5). Nevertheless, MGPfact showed limited predictive performance for cordist (Supplementary Table 2).

MGPfact outperformed state-of-the-art methods in F1branches. P-values based one-sided paired t-tests suggest that the F1branches scores of MGPfact were significantly higher than those of the other methods for different trajectory types in the test set.

It is also worth noting that in 68 test datasets of real cell populations, MGPfact ranked the top of all 7 algorithms for “overall score” (Fig. 3a). As for the individual metric, MGPfact ranked to top for predicting trajectory topology (HIMmean = 0.721, Fig. 3c); and the 2nd in trajectory branching (, Fig. 3b) with no significant difference with that of the top predictor (scShaper, T-test, P = 0.829). As for the other metrics in real test datasets, MGPfact was the 5th in the similarity of cell locations , and the 3rd in the similarity of gene significance . The differences between the metrics of MGPfact and those of the top performers were subtle (Fig. 3d-e, ).

Trajectory inference (TI) performance of state-of-the-art methods in 68 test datasets of real cell population.

a. Overall scores;b. F1branches; c. HIM;d. cordist;e. wcorfeatures. All results are color-coded based on the trajectory types, with the black line representing the mean value for ranking all methods. The “Overall” assessment is calculated as the geometric mean of all four metrics.

In summary, our data suggest that MGPfact is highly efficient in predicting cell fate in branching trajectory (F1branches) and topological structure (HIM), which meets the primary goals of the algorithm. In addition, MGPfact performed better in real datasets, suggesting its robustness to noise from real experimental conditions. Nevertheless, trajectory inference by MGPfact is based on factorization of the covariance matrix, hence less performance in wcorfeatures and cordist than methods based on full covariance matrix

Comparative of Time Efficiency and Memory Consumption

We also compared the runtime and memory usage of different algorithms across 239 test datasets (Supplementary Table 3). MGPfact’s average maximum memory consumption is comparable to those of the other algorithms . As a trade-off for its advantages in feature-selection and factorization, MGPfact requires moderately longer execution time than the other algorithms (timemean = 3.42 min).

MGPfact recovers the trajectory of early postnatal microglia development

The main advantage of MGPfact lies in the capability to factorize a complex cellular trajectory into bifurcation processes of selected co-expressed genes. To illustrate how MGPfact elucidates the biological process underlying cell fate determination, we applied MGPfact to a scRNA-seq data of microglia development and validated the results with experimental evidences (Li et al., 2019).

MGPfact recovers the determinants of the microglia development

Utilizing MGPfact, we reconstructed the developmental trajectories of microglia from immature microglia (IM at pseudotime 0) to homeostatic microglia (HM) and proliferative-region-associated microglia (PAM) (Fig. 4a-c, left panel). MGPfact identified three bifurcation processes (Supplementary Table 4), each corresponding to 74-90 highly weighted genes (HWG, absolute gene weight > 0.05) (Fig. 4a-c, right panel).

MGPfact reconstructed the developmental trajectory of microglia, recovering known determinants of microglia fate.

a-c. The inferred independent bifurcation processes with respect to the unique cell types (color-coded) of microglia development, where phase 0 corresponds to the state before bifurcation; and phases 1 and 2 correspond to the states post-bifurcation. The most highly weighted regulons in each trajectory were labeled by the corresponding transcription factors (left panels). The top three HWG of each bifurcation are significantly differently expressed in phase 1, 2 and 3 (right panel). d. The most highly weighted regulons influencing the three developmental trajectories of microglia. e. The expression levels of the transcription factors of highly weighted regulons in each trajectory significantly differ among different phases. f. The consensus developmental trajectory by merging the three bifurcation processes. Point 0 denotes the initial of differentiation, whereas the notion of “n-m” denotes the m-th branch from the branching point n. Each colored circle represents a landmark (MURP) of the trajectory, showing the fraction of cell types. The transcription factors of highly weighted regulons in each bifurcation process were used to label each branching point. Particularly, PAM-T1 and PAM-T2 are the two newly defined subtypes of PAM. g. Selected differently expressed genes between PAM-T1 and PAM-T2 (|logfc| > 0.25, adjusted P-value < 0.1) are shown by colored-dots corresponding to the mean expression levels in either cell type. The IDs validated marker genes for PAM are labeled in green. In all box plots, the horizontal line represents the median value, and the whisker extends to the furthest data point within 1.5 times the interquartile range. Significance is denoted as: ns, not significant, * P < 0.05, ** P < 0.01, *** P < 0.001, **** P < 0.0001, Wilcoxon rank-sum test.

The first bifurcation determines the differentiated cell fates of PAM and HM, which involves a set of notable marker genes of both cell types, such as Spp1, Apoe, Gpnmb (PAM), and Tmem119 (HM). The second bifurcation determines the proliferative status, which is crucial for the development and function of PAM and HM(Guzmán, n.d.; Li et al., 2019). The genes affected by the second bifurcation are associated with cell cycle and proliferation, such as Mki67, Tubb5, Arl6ip1, Top2a. The third bifurcation influences the development and maturity of microglia, of which the highly weighted genes, such as Selplg, P2ry12, Siglech, Cst3, 4632428N05Rik are all previously annotated markers for establishment of the fates of microglia (Anderson et al., 2022; Li et al., 2019) (Supplementary Table 4).

Moreover, MGPfact unveiled highly active regulons in each bifurcation processes, which further traced back to the influential transcription factors as determinants of microglia development, such as Hif1a, E2f5, and Nfkb1 (Fig. 4d-e, Methods). Specifically, Hif1a is crucial for microglial activation and directly linked to neurodegenerative disease progression (Wang et al., 2022). Our data showed an upregulation of Hif1a in the PAM-branch (phase 1) of the first bifurcation, reaffirming the role of Hif1a in PAM differentiation. The other two transcription factors, E2f5 and Nfkb1, were active in phase 2 of the 2nd bifurcation and the 3rd bifurcation, respectively. Both are known for the roles in microglial development (Dresselhaus and Meffert, 2019; Nawal, n.d.).

Using consensus trajectories to delineate the development of microglial cells

We generated a consensus trajectory of microglial development from three independent bifurcation processes (Methods). Of note, the consensus trajectory revealed two distinct subtypes of proliferative-region-associated microglia (PAM), PAM-T1 (Hif1a+/E2f5-/Nfkb1+) and PAM-T2 (Hif1a+/E2f5-/Nfkb1-) (Fig. 4f). Particularly, the highly expressed genes in PAM-T2, including Spp1, Gpnmb, Lgals1, and Cd63, are previously identified in disease-associated microglia (DAM) (Li et al., 2019). Thus, our finding reaffirmed the connection between the two cell types, DAM and PAM, and suggested Nfkb1 as a potential determinant of differentiation of PAM (Fig. 4g, Supplementary Table 5).

In conclusion, MGPfact reconstructed the cellular trajectory of microglial development, identified distinct cell types with marker genes and key regulators which are highly consistent to the experimental evidences (Dresselhaus and Meffert, 2019; Li et al., 2019; Nawal, n.d.; Wang et al., 2022).

Using MGPfact to decipher the evolution of tumor-associated CD8+ T cells

Next, we applied MGPfact to two populations of tumor-associated CD8+ T cells from non-small cell lung cancer (NSCLC) (Guo et al., 2018) and colorectal cancer (CRC) (Zhang et al., 2018), respectively. Using the same analytical pipeline as above, we identified a set of CD8+ T cell gene expression signatures (GES) from MGPfact-inferred trajectories, which are significantly predictive of clinical outcomes and immune treatment responses. Additionally, our data unveiled novel subtypes of tumor-associated CD8+ T cells with strong clinical implications.

MGPfact better explains the fate of tumor-associated CD8+ T cells

We first evaluated the goodness-of-fit of MGPfact-derived consensus trajectory with the CD8+ T cell subtypes characterized by the original studies. Following the original study, we used Monocle 2 as a reference baseline model for trajectory inference (Methods). As a result, MGPfact showed significantly improved explanatory power for all CD8+ T cell subtypes only with the exception for CD8-GZMK in CRC (P < 0.05, Table 2), suggesting that MGPfact’s prediction power of cellular trajectory offers better consistence to the validated cell fate of CD8+ T cells hence improved interpretability.

The consensus trajectory by MGPfact offered better explanatory power of cell fate. R-square and P-values based on F-tests suggest that the consensus trajectory of MGPfact showed significantly better fitness to the subtypes of CD8+ T cells characterized and annotated by experimental evidence than the baseline model (Monocle 2).

MGPfact identifies T-cell gene expression signatures with clinical implications

MGPfact discerned the different cellular fates of tumor-associated CD8+ T cells by distinct bifurcation processes. To reveal the clinical implications of these bifurcation processes, we retrieved the mean expression vectors corresponding to either phase (branch) as Gene Expression Signatures (GES), and stratified cancer cohorts by quantifying the propensities to certain destiny of CD8+ T cells (Methods). Our data demonstrated pronounced disparities in the clinical outcomes associated with different fates of CD8+ T cells among patients (Fig. 5a-b, Supplementary Fig. 5e). In NSCLC, trajectory 1 is associated with cytotoxic T cells (96%, phase 1) and higher overall survival in TCGA-lung adenocarcinoma (LUAD) patients. Trajectory 2 is associated with exhausted T cells (91%, phase 2), and lower overall survival in the same cohort (Fig. 5a, Supplementary Fig. 3a-b). Similarly, in the CRC, trajectory 1 is associated with exhausted T cells (95%, phase 1) and poor overall survival in TCGA-COAD patients (Fig. 5b, Supplementary Fig. 5a-b).

Highly weighted genes (HWG) of the bifurcation processes of CD8+ T cells serve as reliable indicators for clinical outcome and ICI treatment response.

a. Gene expression signatures (GES) corresponding to HWG in CD8+ T cells trajectory 1 and 2 in NSCLC predict overall survival of the TCGA-LUAD cohort. b. Gene expression signatures (GES) corresponding to HWG in CD8+ T cells trajectory 1 in CRC predict overall survival of the TCGA-COAD cohort. c. ROC curve showing the weighted mean of HWG in Trajectories 1 and 2 in NSCLC significantly associated with ICI response across 3 independent studies. d. ROC curve showing the weighted mean of HWG in trajectories 1 and 2 in CRC significantly associated with ICI response across 4 independent studies.

In addition, for each trajectory identified in NSCLC and CRC, we selected a set of HWG (absolute gene weight > 0.05) to characterize the underlying biological processes and key transcription factors determining the bifurcation (Supplementary Table 6-7, Supplementary Fig. 3c-d, Supplementary Fig. 5c-d). In NSCLC, the HWG of trajectory 1 are primarily implicated in immune responses associated with antigen processing and presentation (Supplementary Table 8), while trajectory 2-HWG involve processes of immune cell migration (Supplementary Table 8). For CRC, trajectory 1 is enriched for genes of T cell activation and regulation (Supplementary Table 9).

Notably, our data showed that the weighted mean expression of the HWG of CD8+ T cell trajectories (Methods) are associated with responses to immune checkpoint inhibitors (ICIs) in multiple independent cohorts (Fig. 5c-d, Supplementary Fig. 5f). For instance, the weighted means of HWG of trajectories 1 and 2 in NSCLC which are associated with high activities of cytotoxic T cells, predicted better responses to anti-PD-130 and anti-CTLA-4 (Cho et al., 2020; Liu et al., 2018), as well as their combination therapies (Auslander et al., 2018) (AUC ∈ {0.813, 0.964}, P < 0.1, Supplementary Fig. 4). Similarly, HWG pertaining to trajectory 1 in CRC is associated with high proportion of EMRA (87%, phase 1), hence better responses to immunotherapies in 4 cohorts (Auslander et al., 2018; Cho et al., 2020; Jung et al., 2019; Liu et al., 2018) (AUC ∈ {0.796, 0.964}, P < 0.01, Supplementary Fig. 6).

Taken together, the factorization of scRNA-seq data by MGPfact provides highly relevant gene expression signatures of the fate of tumor associated CD8+ T cells, which advances the understanding of the evolution of tumor immune microenvironment (TIME) and predicts clinical outcomes.

MGPfact identified new subtypes of CD8+ T cells with clinical implications

Furthermore, the consensus trajectories of tumor-associated CD8+ T cells inferred by MGPfact from NSCLC and CRC revealed new subtypes of lymphocytes. In NSCLC, we characterized CD8-ZNF683-T1 (LEF+/TBX21-) and CD8-ZNF683-T2 (LEF+/TBX21+) from CD8-ZNF683 (Fig. 6a, Supplementary Fig. 3c-d). The CD8-ZNF683-T2 cells highly expressed genes associated with “pre-exhausted” state, such as ITGAL, SAMD3, and SLAMF7 (Pritchard et al., 2023), many of which are target genes of TBX21. In contrast, CD8-ZNF683-T1 showed lower expression of these genes, hence repellency to the “pre-exhausted” state (Fig. 6b, Supplementary Table 10). In CRC, we identified two subtypes of effector memory T cells (CD8-GZMK), CD8-GZMK-T1 (EOMES-/BHLHE40+) and CD8-GZMK-T2 (EOMES-/BHLHE40-) (Fig. 6d, Supplementary Fig. 5c-d). CD8-GZMK-T2 strongly resembles CD8-GZMK, and potentially differentiating into CD8-CD6 (resident memory T cells, TRM) and CD8-CD160 (intraepithelial lymphocytes, IEL); whereas CD8-GZMK-T1 cells demonstrated higher expression of GZMB, indicating an active cytotoxic cell-killing capability (Trapani, 2001). Simultaneously, these cells also marked by high expression levels of immune related genes, including OASL, RBPJ, and CTLA4, which are known targets of BHLHE40 (Lutter et al., 2022; Salmon et al., 2022), implying that BHLHE40 is a modulator of the higher effector activity in CD8-GZMK-T1 (Fig. 6e, Supplementary Table 11).

MGPfact serves as an effective approach for characterization of new cellular subtypes.

a. The consensus trajectory of tumor-associated CD8+ T cells in NSCLC identified CD8-ZNF683-T1 and CD8-ZNF683-T2 as two subtypes of CD8-ZNF683, which are influenced by TBX21. b. Selected differently expressed genes between CD8-ZNF683-T1 and CD8-ZNF683-T2 (|logfc| > 0.25, adjusted P-value < 0.1). c. High expression of CD8-ZNF683-T1 signatures predicts good overall survival in the TCGA LUAD cohort (Methods). P-values were calculated through multivariate Cox regression analysis, and HR represents hazard ratio. d. The consensus trajectory of tumor-associated CD8+ T cells in CRC identified CD8-GZMK-T1 and CD8-GZMK-T2 as two subtypes of CD8-GZMK. e. Selected differently expressed genes between CD8-GZMK-T1 and CD8-GZMK-T2 (|logfc| > 0.25, adjusted P-value < 0.1). f. ROC curve showing high expression of CD8-GZMK-T1 signature associated with ICI treatment response in three independent studies. The consensus trajectory is formed by merging three bifurcation processes. Each colored circle represents a landmark (MURP), indicating the of cell type.

We further derived scores based on the differentially expressed genes of CD8-ZNF683-T1 and CD8-GZMK-T1 (Methods), as measures of the fraction of each subtype in cancer cohorts. In the LUAD cohort of TCGA, increased fraction of CD8-ZNF683-T1 in TIME was associated with favorable outcomes (Fig. 6c). And increased fractions of CD8-GZMK-T1 in TIME were associated with better responses to ICI therapy across 3 independent cohorts3133 (Fig. 6f, Supplementary Fig. 7, AUC ∈ {0.782, 0.917}, P < 0.1), which were treated with anti-PD-1, anti-CTLA-4 and their combination therapies.

In conclusion, our data showed the cellular trajectory inferred by MGPfact can be used to elucidate the complex evolutionary processes of tumor-associated CD8+ T cells, and further inform the characterization of new subtypes of T cells with significant clinical implications.

Discussion

Single-cell RNA sequences (scRNA-seq) offer direct, quantitative snapshot of a cell population involved in actual biological activities hence tremendous insight of the variability of cells’ states and functions. So far, most studies use scRNA-seq to characterize new subtypes of cells by various clustering and embedding algorithms. These algorithms are usually optimized for efficiently revealing discrete biological states of cells corresponding to distinct functions. Nevertheless, as the resolution increases, clustering algorithms becomes more sensitive to noises and less efficient in delineating continuous biological process (Ren et al., 2022). The challenges are reflected in the discrepancy of definition of new cell types, over- and under-clustering due to imbalanced sampling.

The introduction of manifold-learning in scRNA-seq analysis provides a new dimension for biological knowledge discovery. The trajectories inferred by manifold-learning methods are more suitable representation of continuous variation in cells, which are usually implicated in many time-dependent biological processes, such as development, differentiation and clonal evolution. Particularly, manifold-learning is capable to reconstruct pseudotime trajectory from cross-sectional scRNA-seq data, which offers the unique advantage to delineating complex evolutionary processes of cells sampled from real clinical biopsies. Thus far, trajectory-inference is used by many recent studies to elucidate the origin and developmental pathways of different cell states in the temporal space, also known as cell fate.

Despite of the latest booming of new methods, existing manifold-learning methods are faced with major limitations. First, many trajectory inference (TI) methods still requires prior information on pseudtime and cell clustering, hence restrictions in the applicability. Thus, the resulted cellular trajectory based on cell clustering represents the admixture of both continuous temporal variation and discrete subclones of cells. Then, most of the current manifold-learning methods are based on either single feature (gene), or the whole feature space, such as all highly variable genes (HVG), hence the results are highly dependent and even biased on the input feature (genes). Besides, the lack of necessary gene-selection in the inference makes it challenging to annotate the resulting trajectory to known ontological terms or pathways, which hinders the interpretability of the results and the follow-up functional study. In some cases, discrepancy can arise between trajectory inferred from all genes and cell type characterized by specific gene sets, which further restrict the use of manifold-learning in scRNA-seq analysis.

We developed MGPfact to overcome the limitations of the existing methods. Inspired by recent studies, MGPfact model the cell fate as mixture of Gaussian processes, which accommodate both continuous evolution pathway and biphasic destiny of cell fate. Thus, MGPfact is capable to distinguish discrete and continuous events in the same trajectory. In addition, by factorizing the mixture Gaussian processes, MGPfact offers the advantage to select genes corresponding to each bifurcation process and thereby enable full biological annotation and interpretation of the trajectory. As a validation, we showed that gene-selection by MGPfact consistently recapitulated the development of microglia and tumor-associated CD8+ T cells; and recovered key regulators of distinct cell fate. So far, MGPfact is the only model-based manifold-learning framework which factorizes complex development trajectories into independent bifurcation processes of gene sets.

We conducted a comprehensive comparison of MGPfact with existing TI methods from various perspectives. This comparison included the correlation of cell sorting, accuracy of branch allocation, similarity of topological structures and differentially expressed features. For the overall TI-performance, MGPfact demonstrated leading performance across 239 datasets, second only to TinGa. For the performance in branch allocation, which directly reflect the fitness to the outcoms of cell-fate, MGPfact outperformed its counterparts, especially in the topology groups of linear and bifurcation. As for wcorfeatures and cordist, MGPfact performed less well mainly for two reasons. First MGPfact is designed for bifurcation topology in the cellular trajectory, hence less efficient in inferring complex topologies. Then, MGPfact inference is based on selected gene sets instead of the whole transcriptome, the resulted trajectory correspond only to the bifurcation processes of interest hence does not necessarily reflect the whole topology of cellular trajectory. Furthermore, MGPfact performed significantly better in trajectory prediction in real cell population compared to the synthetic ones, suggesting the algorithm fit better to the true biological variation and noises.

To reconstruct the trajectory of cell fate, we merged all the bifurcation processes into a consensus trajectory. In the validation by microglia and tumor-associated CD8+ T cells, the consensus trajectory revealed highly consistent findings recovering known biology and the marker genes of specific cell states which further inform the transcription factor (TF) determining the fate of cell. In addition, the consensus trajectory revealed new subtypes of cells demonstrating highly relevant transcriptional characteristics. Particularly, we reported new subtypes of tumor associated CD8+ T cells characterized by different TBX21 and BHLHE40 activity, both of which are known regulators of CD8+ T cell functionality (Lutter et al., 2022; Pritchard et al., 2023; Salmon et al., 2022; Trapani, 2001). These data suggest that MGPfact is capable to discover gene modules with strong and consistent transcriptional background hence better interpretability. Moreover, the results of MGPfact demonstrated strong clinical relevance. Using MGPfact we retrieved gene expression signatures (GES) which quantitatively measure the propensity and fraction of different fates of CD8+ T cell. These signatures correspond to important biological processes of T cell activity; and predict clinical outcome and ICI treatment responses from transcriptome data of bulk tumor biopsies, independent of any endogenous feature of the tumor cells.

Nevertheless, the current method also has some limitations, which shall be addressed in future study. The complex definition of the bifurcation kernel introduces unfavorable singularity to the Gaussian kernel when considering numerous trajectories. Additionally, the current MGPfact-inferred trajectory is based solely on pseudotime, neglecting potential bifurcation processes occurring in space. Future models should incorporate spatial dynamics of transcription and analyze cellular velocity data to provide more comprehensive insights. Furthermore, the reconstruction of cellular trajectories by MGPfact implies independence of each bifurcation processes, which may not reflect real cellular behavior. Therefore, predictions from MGPfact should be interpreted with caution and validated experimentally.

Methods

Benchmarking MGPfact to state-of-the-art methods

We adopted a comprehensive evaluation framework from previous scRNA-seq study to assess the TI performance of MGPfact (Saelens et al., 2019; Smolander et al., 2022; Todorov et al., 2020). The validation dataset comprises 110 real data and 229 synthetic data, encompassing 9 different cellular trajectory topologies. The ground truth of cellular trajectories of each dataset were inferred and validated by the original study (Saelens et al., 2019). The evaluation of TI performance was based on five metrics: HIM for assessing trajectory topology similarity, F1branches for measuring branch allocation similarity, cordist for quantifying inter-cell distance similarity, wcorfeatures for evaluating the consistency of significant features between predicted and actual trajectories, and the overall performance score, which is the geometric mean of the above four metrics.

The dataset is divided into two groups: a training set and a testing set (Supplementary Fig. 1). We use 100 training datasets to determine the optimal parameter settings. Subsequently, in 239 test datasets, we compare MGPfact with 7 state-of-the-art TI methods using the aforementioned metrics, including Monocle DDRTree (Qiu et al., 2017b, 2017a), TSCAN (Ji and Ji, 2016), and DPT (Haghverdi et al., 2016), as well as four new methods from recent studies: Monocle 3 (Cao et al., 2019), scShaper (Smolander et al., 2022), scFates (Faure et al., 2023), and TinGa (Todorov et al., 2020). The experimental comparisons were conducted on a CentOS system equipped with 48 CPU cores running at 2.2GHz and 250GB of memory. To ensure a uniform comparison, all experiments were performed using a single CPU core. For MGPfact, we tested each resulted trajectory and selected the one with the best “overall” score for comparison. For the other 7 methods, default settings are used unless otherwise specified.

Application of MGPfact in a Microglia Single-cell RNA-seq Dataset

We utilized the MGPfact reconstructed the developmental trajectory of microglia from a scRNA-seq dataset, including immature microglia (IM), proliferation-associated microglia (PAM), and homeostatic microglia (HM) (Li et al., 2019). In this analysis, we provided a detailed explanation of the analytical steps of MGPfact and the key results. Firstly, we identified three independent developmental pathways and pinpointed highly weighted genes (HWG) associated with each bifurcation process. Then, we retrieved highly active regulons within each bifurcation process, tracing back to the potential influential determinants (transcription factors) in the development of microglia. Finally, we combined all the bifurcation processes into a consensus trajectory (Supplementary methods), which recovered the known biology of disease-related microglia (PAM), as represented by distinct cellular state and marker genes.

Predicting the evolutionary trajectory of tumor-associated CD8+ T cells

We utilized MGPfact to conducted an exploratory analysis of the evolution of tumor-associated CD8+ T cells of non-small cell lung cancer (NSCLC) and colorectal cancer (CRC). We evaluated the goodness-of-fitness of the consensus trajectories from MGPfact to the CD8+ T cell subtypes identified in the original studies. For comparison, we used Monocle 2 (Qiu et al., 2017b, 2017a) as a baseline model.

For the survival analysis, we extracted gene expression signatures (GES) from each independent bifurcation process to develop classifiers for evolutionary propensity of CD8+ T cells towards specific fates, based on which we stratified TCGA cancer cohorts and verified their association with clinical outcome.

To evaluated the association to ICI responses, we used HWG to retrieve key transcription factors related to each bifurcation process and characterized the underlying biological processes. We then assessed their connection to immunotherapy response using weighted means of the HWG.

Finally, we identified new subtypes based on distinct endpoints of the consensus trajectory and validated their association with clinical outcome and immunotherapy response using mean of the differently expressed gene (DEG, Supplementary Methods).

Single-cell sequencing data processing

We obtained the original mouse developmental microglia single-cell sequencing data from the GEO accession number GSE123025 (Li et al., 2019). Using Seurat (Butler et al., 2018; Stuart et al., 2019), we replicated the processing steps described in the original study: 1) Normalization by dividing gene expression values by total RNA count, followed by log2 transformation; 2) Selection of highly variable genes (HVGs) using Seurat’s mean.var.plot function, with controlled average expression [0.0125,3] and variance [0.5, ∞]; 3) Scaling and centering of the normalized matrix for HVGs, with regression of cell cycle effects. After preprocessing, we grouped cells into IM (P7-C0), PAM (P7-C1 and P7-GPNMB+CLEC7A+), and HM (P60), resulting in a 4,889-gene expression matrix across 1,009 cells.

For analyzing tumor-associated CD8+ T cells, we utilized scRNA-seq data from lung cancer (GSE99254) (Guo et al., 2018) and colorectal cancer (GSE108989) (Zhang et al., 2018) in the GEO database. We extracted preprocessed and centralized gene expression matrices of CD8+ T cells and analyzed them using the same genes and the same method (Monocle 29,10) as in the original papers or MGPfact for trajectory construction for a direct comparison. The NSCLC data yielded an 888-gene expression matrix across 3,700 cells, while the CRC data resulted in a 700-gene expression matrix across 3,177 cells.

Functional enrichment of highly weighted genes (HWG)

For the highly weighted genes (HWG, absolute gene weight > 0.05) obtained from independent bifurcation processes, we utilized the R package clusterprofiler (Yu et al., 2012) to perform functional annotation using GO terms(Consortium, 2004), including biological process (BP), cellular component (CC), and molecular function (MF). The results with a Benjamini–Hochberg-adjusted P value less than 0.05 were retained.

Transcription factor program analysis

To comprehensively assess key regulatory factors within each independent trajectory, we performed SCENIC (Aibar et al., 2017) transcription factor regulatory program estimation for each analysis case. GENIE3 (Huynh-Thu et al., 2010) was used to identify co-expressed modules from the results of MURP (Ren et al., 2022) downsampling. RCisTarget (Aerts et al., 2010; Aibar et al., 2017) was then used to identify regulons before AUCell (Aibar et al., 2017) was used to estimate the activity of each regulon. Each regulon comprises a specific transcription factor and its target genes. Finally, we utilize gene weights obtained from MGPfact analysis to evaluate the distinct impact of top regulons on each trajectory.

Assessing consistency of MGPfact-derived CD8+ T Cell subtype trajectories

In the case study of CD8+ T cells, by combining independent trajectories, we derive a consensus trajectory representing the complex developmental pathway. To assess the goodness-of-fitness to the CD8+ T cell subtypes from the original study, we classified the trajectories into several states based on bifurcation points, each corresponding to a distinct stage of the evolutionary process. Then, we evaluated the interactive effects between the states and pseudotime on the fraction of the cell types using F-test (ANOVA). The resulted R-squared (R2), P-values, and F-statistics were used to evaluate the goodness-of-fitness of the models tested hence the explanatory power. For comparison, we used the Monocle 2 as the baseline model for trajectory inference. The differentiation trajectories of Monocle 2 were replicated following the workflow in the original study (Guo et al., 2018; Zhang et al., 2018).

Survival analysis

We assessed the association of bifurcation processes and specific cell types with the clinical outcomes two cohorts of lung adenocarcinoma (LUAD, N=563) and colon adenocarcinoma (COAD, N=320) data from The Cancer Genome Atlas (TCGA). The gene expression and clinical data were downloaded from UCSC Xena platform (http://xena.ucsc.edu/).

We assessed the survival impacts of NSCLC and CRC bifurcation processes in the TCGA LUAD and COAD cohorts, respectively. For each independent bifurcation process, we defined Gene Expression Signatures (GES) by the mean expression vectors of all trajectories in phase 1 and 2, respectively. Subsequently, we calculated the Peason’s correlation coefficients for each individual expression profile in TCGA LUAD or COAD cohorts to the phase 1 and 2 GES, where higher correlation correspond to stronger propensity to specific cell fate. This allowed us to classify patients into two groups: those exhibiting propensity to phase 1 and those exhibiting propensity to phase 2.

To assess the survival impacts of specific cell states defined by the consensus trajectory, we developed a CD8-ZNF683-T1 score based on the signed average expression level of DEG associated with CD8-ZNF683-T1 (Supplementary methods). Subsequently, we classified the TCGA LUAD cohorts into two groups using the median of CD8-ZNF683-T1 scores, identifying those demonstrating a propensity towards CD8-ZNF683-T1 and those demonstrating a propensity towards CD8-ZNF683-T2.

The Cox regression model was implemented using R-4.2 package “survival”. And we generated Kaplan-Meier survival curves based on different classifiers to illustrate differences in survival time and report the statistical significance based on Log-rank test.

Immune-checkpoint inhibitor treatment response analysis

For the prediction of Immune-checkpoint inhibitor treatment response, we collected four datasets containing ICI treatment responses. These datasets consist of two non-small cell lung cancer related datasets (GSE135222 (Jung et al., 2019), n=27; GSE126044 (Cho et al., 2020), n=16) and two melanoma related datasets (GSE115821 (Auslander et al., 2018), n=14; TCGA-MENO(Liu et al., 2018), n=10). All data were processed with DESeq2 (Love et al., 2014) to fit gene dispersion to a negative binomial distribution, normalize raw counts, and stabilize variance, achieving standardization.

To validate the bifurcation processes identified by MGPfact predicting patients’ response to immune-checkpoint inhibitor (ICI) treatments, we selected highly weighted genes (HWG) with absolute weights greater than 0.05 from each independent bifurcation process. Then, we calculated the weighted mean expression of HWG in each ICI dataset to generate ROC curves for patient drug response.

To validate the specific cell states defined by the consensus trajectory predicting patients’ response to ICI treatments, we used CD8-GZMK-T1 score based on the average expression level differences of upregulated and downregulated genes associated with CD8-GZMK-T1. Then, we calculated the CD8-GZMK-T1 score in each ICI dataset to generate ROC curves for patient drug response.

Code availability

We have developed a comprehensive workflow for MGPfact. Firstly, a Docker container enables one-click program execution (details at: https://github.com/renjun0324/ti_mgpfact). Additionally, to fully harness MGPfact’s capabilities, we have created the R package MGPfactR, accessible at: https://github.com/renjun0324/MGPfactR. Within this workflow, MCMC sampling for model parameter estimation is carried out using the Mamba library in the Julia environment. The related Julia package can be found here: https://github.com/renjun0324/MGPfact.jl. Furthermore, the scFates Tree used in this paper is available as a performance comparison Docker container, constructed using the dendritic trajectory process in scFates (Faure et al., 2023), accessible at: https://github.com/renjun0324/ti_scfates_tree.

Data availability

The datasets used for performance comparison are archived on Zenodo by Saelens et al. (Saelens et al., 2019), with processed real and synthetic datasets available at https://doi.org/10.5281/zenodo.1443566. Data for specific cell instances of microglia and CD8+ T cells can be obtained from the GEO database with the following accession numbers: GSE123025 (Li et al., 2019), GSE99254 (Guo et al., 2018), and GSE108989 (Zhang et al., 2018). Expression matrices related to immune-checkpoint inhibitors (ICI) and clinical response information are downloadable from the CTR-DB (https://ctrdb.cloudna.cn).

Acknowledgements

We would like to thank Qingyun Li for providing the cell labels for the microglial dataset (Li et al., 2019). We would also like to express our gratitude to Nengming Xiao, and Guo Fu for their valuable input and constructive suggestions during the preparation of the manuscript.

Sources of Funding

This work was supported by the Fundamental Research Funds for the National Natural Science Foundation of China [82272944 to QL]; the National Natural Science Foundation of China [82203420 to JG].

Author Contributions

The project was conceived by Qiyuan Li. The model construction, data collection and analytical validation was done by Jun Ren, with the assistance of Yudi Hu and Jintao Guo. The manuscript was written by Jun Ren and Xuejing Lyu, with the assistance of Ying Zhou and Xiaodong Shi. All authors read and approved the final manuscript.