Introduction

Acute myeloid leukemia (AML) is an aggressive blood cancer driven by non-random genomic rearrangements in hematopoietic stem/progenitor cells (HSPCs). Recurrent AML-associated genomic aberrations, which often involve transcriptional or epigenetic regulators, give rise to distinct patterns of gene expression strongly associated with clinical course and chemotherapy response [1, 2]. Single-cell RNA sequencing (scRNA-seq) studies have demonstrated that HSPCs acquire lineage priming at an early stage when still phenotypically immature and disperse down an erythromyeloid or lymphomyeloid differentiation trajectory [3]. In the context of AML, diverse clonal hierarchies include the co-existence of normal hematopoietic clones. Leukemic clones can partially recapitulate myeloid differentiation and have been shown to display functional differences even when defined by the same genotype [47]. Indeed, analysis of AML using scRNA-seq has revealed key clonal hierarchies, defining subtype-associated cell types, and dynamic changes following therapy, and have been critical in characterizing leukemic stem cells (LSCs), which propagate the disease and drive relapse [58].

Most AML scRNA-seq studies are limited by small sample numbers and include a mixture of different AML subtypes which may not be directly comparable to one another. Therefore, it is difficult to make biological conclusions with sufficient robustness to be clinically translatable in these individual datasets. To overcome this, we performed large-scale integration of public scRNA-seq datasets to create a single-cell transcriptomic atlas for AML (AML scAtlas). Due to the range of data sources spanning time, geographical locations, and experimental designs, complex batch effects often arise between scRNA-seq datasets which requires a tailored data integration approach [9, 10]. Thus, we benchmarked some widely used batch correction tools [1113] for our specific data use case.

To begin interrogating the data in the AML scAtlas and demonstrate its utility, given the vast representation of different age groups, we set out to address a developmental question in the biology of AML. The molecular landscape of AML in children is different from adults [2, 1416]. Correspondingly, the clinical course is different with substantially better outcomes in children. The observed differences in molecular profiles in adult and pediatric AML may, in part, be attributed to the developmental origins of the disease. Acquired chromosomal changes in pediatric leukemia are acquired in utero, as leukemia-specific genomic aberrations have been detected in the Gunthrie spots of children who subsequently developed leukemia, sometimes several years later in life [17]. Adult leukemia, in contrast, is believed to develop later in life through acquisition of pre-leukaemic changes and clonal evolution of adult HSPCs [18, 19]. The impact of these developmental stages on leukemia biology are not fully understood, and there is currently no means of quantifying and characterizing differences in the origin of leukemia. However, as childhood AML with presumed in-utero origin has a better outcome, for teenagers and young adults, determination of the pre- or postnatal origin might be important for better treatment stratification and prognostication.

AML with t(8;21) (AML-ETO/RUNX1-RUNX1T1) is one of the most frequent AML subtypes in young people, although it affects all ages [2]. The prenatal origins of the t(8;21) rearrangement, has been confirmed even in older children presenting with AML [17]. The prognosis of AML with t(8;21) is better in children than in teenagers and even more so than in young adults [20]. This outcome difference is not fully explainable by co-morbidities and may be related to the developmental origins of the disease. In the intermediate teenage group, t(8;21) AML may comprise both late childhood and early adult disease entities, which may have prognostic implications and could help to explain disease biology and clinical course.

We used our AML scAtlas data resource to characterize age and developmental stage specific signatures in t(8;21) AML by applying single-cell gene regulatory network (GRN) inference methods [21, 22], as a means of revealing cell state heterogeneity across age groups. We then validate and refine our findings in a larger cohort using bulk RNA sequencing (RNA-seq) data from the TARGET [2] and BeatAML [23] studies, defining age-associated GRN signatures and key regulators of t(8;21) AML, which may reflect the developmental origins of the leukemia.

Profiling together both gene expression and chromatin accessibility can be used to decipher the enhancer-driven GRN (eGRN) and enriched transcriptional regulators. Significant heterogeneity across different patients and time points [24] was recently described by analyzing combined scRNA-seq and single-cell Assay for Transposase Accessible Chromatin sequencing (scATAC-seq). This also identified longitudinal transcriptional changes across key clinical milestones, revealing significant heterogeneity across different patients and timepoints. We used the t(8;21) AML data from this study to apply cutting edge GRN inference methods [25], which encompasses both modalities to provide a cleaner GRN which we could correlate with our age-related signatures.

Results

Large Scale Data Integration to Construct a Single-Cell Transcriptomic Atlas of AML (AML scAtlas)

To create AML scAtlas, we combined published scRNA-seq data of primary AML bone marrow samples, from 16 suitable high-quality studies (see Materials and Methods), which included 159 AML samples (Supplementary Table 1). Where on-treatment time points were available, we selected only diagnostic samples to form a reference atlas of primary AML at diagnosis. If studies had healthy donor bone marrow samples these were included, in addition to data from healthy bone marrow samples from four scRNA-seq studies (Supplementary Table 1) to enable comparisons with healthy bone marrow populations. After cell filtering and quality control the AML scAtlas contains data from 748,679 high quality cells from a total of 20 different scRNA-seq studies [46, 8, 2640] (Supplementary Figure 1A). Each sample was allocated to an AML clinical subtype, based on the recent European Leukemia Net (ELN) clinical guidelines [41], and a corresponding prognostic risk group. This resource includes a range of molecular subtypes of AML and spans different age groups, including both pediatric and adult AML cases (Figure 1A; 1B). Overall, this dataset is the largest data resource to date for exploring AML biology at single-cell resolution.

Large Scale Data Integration Creates an Atlas of AML

(A) Proportion of cells (left panel) and samples (right panel) belonging to each AML subtype as defined by the ELN clinical guideline. (B) Age group and gender distribution of AML scAtlas cohort samples. (C) scVI harmonized UMAP colored by annotated cell types. (D) The expression of key hematopoietic marker genes across annotated cell types shown on a dotplot. Color scale shows mean gene expression, dot size represents the fraction of cells expressing the given gene. (E) Selected hematopoietic stem and progenitor cell (HSPC) clusters from the AML scAtlas, annotated with a curated reference of leukemia stem and progenitor cells (LSPCs). (F) Proportions of HSPC/LSC populations in each AML subtype as defined by the ELN clinical guideline. (G) Proportions of HSPC/LSC populations in each AML risk group as defined by the ELN clinical guideline.

From initial analysis of the combined dataset, batch effects were noted, as study-specific clusters were formed. Even within samples of the same study, sample-wise clustering was noted (Supplementary Figure 1A-1C). To address this problem, we performed a benchmark of some widely used batch correction methods (Supplementary Table 2; Supplementary Figure 2A;2C). This identified scVI as the best method for this dataset (Supplementary Table 2), which we employed to correct for batch effects. Next, we performed clustering and cell type annotation of AML scAtlas, based on manually curating automated cell type annotation tools results and cluster-wise marker gene expression (Figure 1C;1D). Cell type proportions analyses across the clinically relevant subtypes in the dataset (Supplementary Figure 3A) show that the AML subtypes were significantly biased towards myeloid cell types (CMP, MEP, GMP, ProMono, CD14+ Mono, CD16+ Mono, cDC, Erythroid) and had a clear predominant cell type, in keeping with AML clonal expansion. On the other hand, healthy donor samples had more balanced cell type proportions and lymphoid cells (T, B, NK, ProB, pDC, Plasma) were well represented (Supplementary Figure 3A).

Due to the established critical role of HSPCs and LSCs in propagating AML and their importance as therapeutic targets [42], we selected HSPC clusters for further analysis, as defined by CD34 expression (Figure 1E; Supplementary Figure 3B;3C). To identify LSCs, we used a curated reference profile of leukaemic stem and progenitor cells (LSPCs) [7] (Figure 1E) and correlated this with calculated LSC6 [43] and LSC17 [44] scores for each cell (Supplementary Figure 3D). We compared the proportions of HSPC/LSPCs across different AML subtypes and risk groups, as defined by the ELN clinical guidelines [41] (Figure 1F). This shows that higher-risk subtypes have a higher proportion of LSCs compared to favorable risk disease.

Application of AML scAtlas to Identifying Age-Associated Gene Regulatory Networks in t(8;21)

The AML scAtlas enables robust comparison of adult and pediatric AML. We hypothesized that in adolescents and young adults with t(8;21)AML, the potential for either in-utero or postnatal HSPC origin disease might affect disease biology and prognosis. Thus, we sought to explore biological differences between pediatric and adult cases of t(8;21) AML, aiming to explain and potentially improve prognostication in adolescents and young adults. We selected samples with t(8;21) AML from the AML scAtlas. This included 105,663 cells from 13 adult cases (aged 20-67), 7 adolescent cases (aged 12-17), and 3 pediatric cases (aged 6-8) (Figure 2A-2C). Where gender was not available, this was inferred from ChrY/XIST gene expression (Supplementary Figure 4A). Several adult samples underwent CD34 selection in original studies, excluding more differentiated cell types in these samples. Thus, these cell types were excluded from comparative analysis, focusing only on HSPCs (Figure 2B).

AML scAtlas Reveals Age-Associated Heterogeneity in t(8;21) AML

(A) Clinical characteristics of the AML scAtlas t(8;21) samples, which were divided into three age groups. Gender is indicated where available. *Sample sex inferred from XIST/ChrY expression. (B) Using the AML scAtlas t(8;21) sample cells, UMAP was re-computed and shows the different cell types. (C) Bar plots of the absolute cell type numbers (left panel) and the cell type proportions (right panel) stratified by age group. The CD34 enrichment performed on several adult samples is reflected. (D) Using HSPCs and CMPs only, the pySCENIC gene regulatory network (GRN) and regulon AUC scores were calculated. Z-score normalized scores underwent hierarchical clustering to create a clustered heatmap and identify age-associated regulons. Regulons were prioritized using their regulon specificity scores (RSS).

We reconstructed the GRN for the t(8;21) subset of our AML scAtlas using the pySCENIC [22] pipeline and defined critical regulons for this AML subtype. To compare pediatric and adult t(8;21) AML, the top 20 regulons for each of these groups were selected based on the regulon specificity score (RSS) (Supplementary Figure 4B). Unsupervised clustering was performed on the Z score normalized regulon activity score matrix, which highlighted clear differences in the GRN across different age groups (Supplementary Figure 4C). We presume that the differences in the GRN might reflect differences in the pre- or postnatal developmental origins of the disease.

To define signature regulons, we used the clustered dendrogram to select the regulon clusters most associated with the pediatric (below 10 years-old) and adult samples (over 18 years-old) (Supplementary Figure 4C; Figure 2D). The pediatric regulon signature is proposed to represent in-utero origin t(8;21) AML (henceforth termed ‘inferred-prenatal’), and includes 16 regulons defined by a distinct group of hematopoietic transcription factors (TFs) (TRIM28, CTCF, RAD21, SOX4, TAL1, MYB, FOXN3, JUND, BCLAF1, ZBTB7A, IKZF1, MAZ, REST, YY1, CUX1, KDM5A), many of which have clearly defined roles in HSPCs and AML [4547].The adult regulon signature, presumed representative of the postnatally acquired t(8;21) AML (henceforth termed ‘inferred-postnatal), combines 3 discrete clusters of regulons (YBX1, ENO1, and HDAC2; GATA1, POLE3, TFDP1, MYBL2, E2F4, and KLF1; IRF1, STAT1, IRF7, MAFF, AFT4, TAGLN2, SPI1, and KLF2), defined by TFs previously implicated in various hematopoietic, leukemic and inflammatory processes [4850]. Importantly, both signatures contain key components of the AP-1 complex, which is heavily implicated in the biology of t(8;21) AML [51, 52] and undergoes dynamic changes during aging [53].

Samples of 6 individuals aged 12-17 clustered with the pediatric samples and showed enrichment for the inferred-prenatal signature (Figure 2D), which suggests that older adolescents (up to aged 17 in our cohort) more closely resemble pediatric AML with t(8;21) and remain biologically distinct from adult onset disease. This implies that the inferred in utero origin of t(8;21) AML can also be present in AML diagnosed in older children.

Validation of Age-Associated Regulons in Bulk-RNA-Seq Cohorts of t(8;21) AML

We next sought to perform external validation of our age-associated regulon signatures in a larger cohort of patients. We applied the AUCell algorithm from pySCENIC [22], to the t(8;21) AML bulk RNA-seq from the TARGET [2] and BeatAML [23] cohorts (n=103) to calculate the activity of our pediatric inferred-prenatal and adult inferred-postnatal regulons. Using unsupervised clustering of the bulk RNA-seq AUCell results, we identified discrete clusters of samples that were highly enriched for our inferred-prenatal and inferred-postnatal origin-associated regulons (Figure 3A).

Validation of Age-Associated Regulons in Large Bulk RNA-Seq Cohorts

(A) Using previously defined age-associated regulons, pySCENIC AUC scores (Z-score normalized) were clustered to identify samples most enriched for inferred-prenatal and inferred-postnatal origin signatures. (B) Volcano plot of differentially expressed genes when comparing the inferred-prenatal origin and inferred-postnatal origin samples. Adjusted P value threshold 0.01; log2 fold change threshold 0.5. Regulon signature associated TFs are indicated. (C) Enrichment plot of significant gene sets enriched in the inferred-prenatal origin samples. GSEA was performed on the DEGs using MSigDB databases. FDR q-value threshold <0.05. (D) Enrichment plot of drug sensitivity gene sets enriched in the inferred-prenatal samples. GSEA was performed on the DEGs, using drug response signatures from published studies of 4 widely used AML drugs. FDR q-value threshold <0.05. (E) The predicted cell type proportions estimated using AutoGeneS deconvolution, of the inferred-prenatal and inferred-postnatal origin samples were compared using T-Tests. Significant P values <0.05 (*), <0.01 (**), <0.001 (***) and <0.0001 (****) are indicated.

Given the limitations of most scRNA-seq platforms in detecting lowly expressed genes, notably TFs, we used these bulk RNA-seq samples to refine our identified gene regulatory networks by detecting differentially expressed regulon-associated TFs. We used our inferred-prenatal and inferred-postnatal signature clusters and performed differential gene expression analysis between these samples, which are well separated by principal components analysis (PCA) (Supplementary Figure 5A). We then looked for differentially expressed regulon-associated TFs between the two groups; although changes in TF expression are subtle (Supplementary Figure 5B; Figure 3B), we identify significantly differentially expressed TFs which reflect the observed differences in regulon activity and indicate the most critical regulons in our age-related GRN signatures (Figure 3B). This further delineated the inferred-prenatal signature to 5 key TFs (KDM5A, REST, BCLAF1, YY1, and RAD21), and the inferred-postnatal signature to 9 TFs (ENO1, TFDP1, MYBL2, KLF1, TAGLN2, KLF2, IRF7, SPI1, and YXB1).

We next performed gene set enrichment analysis (GSEA) using all significant differentially expressed genes, to investigate pathways which were enriched in the inferred-prenatal samples compared to the inferred-postnatal ones (Figure 3C). Notably, this identified an increase in stemness-associated genes, and in the SMARCA2 target genes, which is a key player in HSC gene expression regulation and chromatin remodeling [54]. SMARCA2 is also known to be upregulated during the fetal-to-adult HSC transition [55], implying that the observed SMARCA2 enrichment may indeed reflect the inferred fetal HSC cell-of-origin. Genes impacted by YY1 depletion were also downregulated compared to the samples of inferred-postnatal leukemia origin, which supports the identification of YY1 as an inferred-prenatal regulon (Figure 3C). We also performed GSEA using drug response signatures from published studies [5659], to deduce whether the inferred-prenatal or inferred-postnatal origin of the leukemia affects chemosensitivity related genes (Figure 3D). This analysis revealed that inferred-prenatal origin t(8;21) AML is enriched for genes associated with increased chemosensitivity to cytarabine, venetoclax, and daunorubicin (Figure 3D).

We hypothesized that the increase in stemness-associated genes in the leukemia with the inferred-prenatal origin could be reflective of potential differences in the leukemic cell-of-origin and its impact on myeloid differentiation. To compare the cellular heterogeneity between inferred-prenatal and inferred-postnatal samples, we performed cell type deconvolution using AutoGeneS [60], with a curated LSPC reference profile [7], to compare the cellular heterogeneity between prenatal and postnatal origin bulk RNA-seq samples (Figure 3E). This revealed a higher proportion of most HSPC cell types (HSC, HSC-like, LSPC-Cycle, LSPC-Primed, LSPC-Quiescent), with a reduction in differentiated myeloid cell types (CMP, MEP, GMP, ProMono, CD14+ Mono, CD16+ Mono, cDC, Erythroid) in the leukemia data with inferred-prenatal origin (Figure 3E). To corroborate this finding, we compared cell type proportions in the original t(8;21) subset of AML scAtlas, confirming that cells with the inferred-prenatal signature comprise more HSC/HSC-like cells than inferred-postnatal signature cells (Supplementary Figure 5C;5D). However, comparison of cell type proportions in this dataset is confounded by differences in sample processing as some studies performed CD34 selection, hence there is more cell type diversity observed in the pediatric samples (Supplementary Figure 5C;5D).

Multiomics Single-Cell Data Reveals a Denoised GRN and Identifies Candidate Perturbations in Prenatal Origin t(8;21) AML

We used the scRNA-seq and scATAC-seq data from a recent cohort of pediatric t(8;21) AML patients [24] at multiple clinical time points to uncover the eGRN (enhancer-driven GRN) in inferred-prenatal and inferred-postnatal origin t(8;21) AML. Initially, we identified the two most representative samples of our inferred-prenatal and inferred-postnatal signatures by using pySCENIC AUCell [22] to measure the activity of our previously defined regulons. Unsupervised clustering of the AUC scores was used to infer whether each sample matched the regulon signatures, identifying one inferred-prenatal sample and one inferred-postnatal sample for downstream analysis (Supplementary Figure 6A).

SCENIC+ [25] analysis uses both scRNA-seq and scATAC-seq to identify candidate enhancer regions and TF-binding motifs, which are then used to link TFs to target genes and identified enhancers. This creates enhancer-driven regulons (eRegulons), forming an eGRN. We applied SCENIC+ [25] to the leukemia samples with the inferred-prenatal and inferred-postnatal origin at diagnosis and relapse, keeping only regulons that showed a correlation between both modalities to retain only the most robust regulons (Supplementary Figure 6B). This revealed several eRegulons across both patients (Supplementary Figure 6C), many of which were patient specific, particularly when comparing HSPC populations (Figure 4A). The sample of inferred-prenatal origin shows a specific HSC eRegulon profile, whereas the inferred-postnatal sample more closely resembles the corresponding Granulocyte-Monocyte Progenitor (GMP) (Figure 4A). Interestingly, at relapse the inferred-prenatal origin patient undergoes a chemotherapy-driven lineage switch to a lymphoid phenotype, which may suggest that the leukemia originated from a less committed progenitor (Figure 4A).

Combining Multiomics Data Interrogates Age-Associated Regulons

(A) SCENIC+ eRegulon dotplot of showing correlation between scRNA-seq target gene activity (indicated by the color scale) and scATAC-seq target region accessibility (depicted by spot size). RSS identified the key activating eRegulons (+/+) between inferred-prenatal and inferred-postnatal origin disease and allows comparison of diagnosis (Dx) and relapse (Rel) time points. (B) Correlation plot of the eRegulon target genes. Clusters reflecting inferred-prenatal and inferred-postnatal origin disease at diagnosis (Dx) and relapse (Rel) reflect co-binding of eRegulon TFs. (C) Over-representation analysis of age-associated eRegulon target genes using GO Biological Processes curated gene sets. Adjusted P value threshold 0.05. (D) Principal components analysis (PCA) of the gene based eRegulon enrichment scores for the inferred-prenatal origin disease at diagnosis and relapse. PC1 axis explains variance occurring between diagnosis and relapse, where this patient underwent a lineage switch. PC2 captures variance related to hematopoietic differentiation. (E) SCENIC+ perturbation simulation shows the predicted effect of knockout of selected TFs on the previously computed PCA embedding. Arrows indicate the predicted shift in cell states relative to the initial PCA embedding.

To identify clusters of closely related eRegulons, we computed the correlations between eRegulons enrichment. We identified 5 main clusters of eRegulons which correspond to different inferred-signature samples (Figure 4B). For each eRegulon cluster, we used the associated target genes as input for gene ontology over representation analysis (ORA), to compare functional differences in the eGRN. This revealed fundamental differences in the underlying biological processes (Figure 4C; Supplementary Figure 7A). The AML sample with inferred-prenatal origin shows critical transcriptional components as the key molecular processes, such as RNA splicing, histone modification, and TF binding. Whereas AML sample with inferred-postnatal origin appears more immune-mediated, with leucocyte proliferation and leucocyte receptor activation (Figure 4C; Supplementary Figure 7A). This further supports the stemness signature in presumed prenatal origin t(8;21) AML, compared to postnatal origin disease.

Previous analysis using the TARGET [2] and BeatAML [23] datasets indicated that inferred-prenatal and inferred-postnatal origin t(8;21) AML may harbor different levels of chemosensitivity based on published drug response signatures (Figure 3D). Therefore, we performed in silico perturbations of eRegulon-associated TFs. PCA of the diagnosis and relapse samples recapitulates expected differentiation trajectories along PC2, while capturing differences between diagnosis and relapse along PC1 (Figure 4D). Using the SCENIC+ [25] perturbation simulation workflow to model the predicted effects of TF perturbation on cell state, we identified TFs estimated to induce differentiation, as defined by a negative shift in PC2 (Supplementary Figure 7B). We prioritized TFs predicted to induce an effect on the HSC compartment and identified 18 TFs predicted to induce a significant effect on HSC differentiation (Supplementary Figure 7C). Several of these TFs are components of the AP-1 complex (JUN, ATF4, FOSL2), which are established targets of the t(8;21) fusion protein and are known to propagate t(8;21) AML [51, 61] (Supplementary Figure 7C; Figure 4E).

EP300 was one of the most impactful hits, which has recently had a role established in driving t(8;21) AML biology through acetylation permitting self-renewal [62]. This suggests that presumed prenatal origin pediatric t(8;21) AML may be particularly sensitive to EP300 inhibition. One of the most striking predictions, for both diagnostic and relapse HSC populations, was BCLAF1 (Supplementary Figure 7C; Figure 4E). BCLAF1 is a regulator of normal HSPCs [63], and its expression level declines during hematopoietic differentiation. While recent studies have identified a role for BCLAF1 in AML [64], this has not been explored in detail in the context of pediatric AML or t(8;21) AML, and may present a therapeutic opportunity.

We also performed SCENIC+ [25] perturbation modelling on the postnatal origin sample (AML12). PCA was more difficult to interpret in this case, as branching differentiation trajectories towards a lymphoid or myeloid fate appear along the PC2 axis, while PC1 distinguishes diagnosis and relapse samples (Supplementary Figure 7D). Therefore, we prioritized TFs based on a predicted effect similar to AP-1 complex components, as it is known that this complex is a critical regulator in t(8;21) AML. We identified several TFs which were identified in our original postnatal origin signature as likely to have an effect (Supplementary Figures 7D-F), indicating our previous analyses identified biologically relevant GRNs.

To investigate the impact EP300 and BCLAF1 in more detail, we used the DepMap [65] database to assess the dependency of t(8;21) AML cell lines to these genes (Supplementary Figure 7G). We found that the two widely-used cell lines of t(8;21) AML, KASUMI-1 (7-year-old donor) [66] and SKNO-1 (22-year-old donor) [67], were among the most sensitive to these perturbations based on their DepMap effect scores. Together this indicates that EP300 inhibition may be particularly effective in t(8;21) AML, and that BCLAF1 may present a new therapeutic target for t(8;21) AML, particularly in children with inferred pre-natal origin of the driver translocation.

Discussion

Here we have generated a new data resource, AML scAtlas, to investigate single-cell biology across a range of subtypes of AML. By including 222 samples comprising 748,679 cells of patients with a wide range of clinical characteristics, we overcome the limitations of many standalone single-cell studies by allowing AML subtype-focused analysis with enough data for robust statistical comparisons. The AML scAtlas dataset is publicly available (https://cellxgene.bmh.manchester.ac.uk/AML/) to allow researchers from all areas of AML, to utilize this dataset to answer their own biological questions and generate new hypotheses.

To further address a clinically relevant question using this data source, we compare differences between pediatric and adult-onset disease based on the potential biological effect of the in-utero origin of pediatric leukemia. Data of our AML scAtlas was used to explore the GRNs in adult and pediatric t(8;21) AML, and revealed a strong age-associated GRN signature. This suggests that while pediatric and adult t(8;21) AML is propagated by the same driver translocation, there are clear biological differences which correlate with age. This may be due to differences in the cell-of-origin, which has been shown to transform from a HSC or a more lineage restricted GMP in t(8;21) AML mouse models [68]. As pediatric t(8;21) can arise in-utero, as evidenced by previous studies [17], and adult t(8;21) is acquired postnatally [18, 19], we propose that the observed age-related differences in AML with t(8;21) reflect these differences in the developmental origins of the disease. Here, we find two distinct groups of regulons which best reflect inferred-prenatal origin and inferred-postnatal origin disease. Our cohort is the largest scRNA-seq dataset to explore t(8;21) AML biology to date, however, the number of patients included remains low (n=22), and many of the studies containing the adult samples used CD34 selection in their experimental protocol creating a bias towards HSPCs in these samples.

To overcome some of these limitations, we used bulk RNA-seq samples from the TARGET [2] and BeatAML [23] studies with t(8;21) AML (n=103) to validate our regulon signatures. This identifies two clusters of samples which closely match these signatures, showing that the regulon patterns identified from our AML scAtlas are also applicable to bulk RNA-seq data which allows exploration of larger patient cohorts. We compared the transcriptomes of the groups with inferred-prenatal and inferred-postnatal origin, to prioritize TFs which were differentially expressed and to compare differences in underlying biology and drug response. We identified 5 signature TFs (KDM5A, REST, BCLAF1, YY1, RAD21) for inferred-prenatal origin disease, some of which have roles in embryonic stem cells [69, 70], and critical functions in the maintenance of HSCs [4547]. Whereas TFs identified in inferred-postnatal origin samples, such as interferon regulatory factors (IRFs), HDAC2, and SPI1, reflect inflammatory and immune processes, many of which have been implicated in leukemia [4850]. We also found that inferred-prenatal origin samples had a higher proportion of HSC/HSC-like cell types compared to inferred-postnatal origin samples, which suggests that prenatal origin t(8;21) AML confers a more primitive state compared to postnatal onset t(8;21) AML cases, supporting age-associated differences in the cell-of-origin playing a role in disease biology.

Given these biological differences, we used bulk RNA-seq to predict chemosensitivity using published drug response signatures [5659], finding that the inferred-prenatal samples were enriched for genes indicative of cytarabine sensitivity, and depleted of genes suggestive of daunorubicin and venetoclax resistance. Our analysis further suggests that the developmental origins of the disease may be important for drug response signatures, which might inform the design of novel therapeutic strategies and provides further biological evidence that pediatric AML might benefit from different clinical management compared with adult-onset AML. Importantly, venetoclax is currently in the AML23 trial (NCT05955261); this data supports further evaluation of venetoclax treatment in pediatric t(8;21) AML.

Using an additional single-cell multiomic dataset, using SCENIC+ we reconstructed the enhancer-driven regulons (eRegulons) in samples matching our inferred-prenatal and inferred-postnatal regulon signatures. Upon comparing eRegulons for each patient at diagnosis and relapse, we identified clusters of highly correlated eRegulons which were defined by different biological processes. This indicated that the samples with inferred-prenatal origin are propagated by transcriptional dysregulation, whereas the samples of inferred-postnatal origin are largely driven by inflammatory signaling. We used SCENIC+ to model the predicted impact of TF perturbations on our prenatal origin sample at diagnosis and relapse. Importantly, we identified several key components of the AP-1 complex, which is critical in t(8;21) AML biology and is also associated with dynamic age-related transcriptional changes [5153].

Through our analysis, we identified EP300 as a candidate target, which has been shown to be critical for t(8;21) AML biology [62]. This has been shown to cause an effect in KASUMI-1 and SKNO1 cell lines. EP300 has been identified as a promising therapeutic target in AML with several molecules in development [71]; our data indicates potential specific therapeutic benefit in prenatal origin t(8;21) AML. One of the most impactful perturbation predictions for the HSC compartment at diagnosis and relapse was BCLAF1. This is consistent with previous evidence of its importance in HSCs [63, 72] and AML [64], but has not been studied specifically in the context of pediatric AML or t(8;21) AML previously. The DepMap data shows that KASUMI-1 is the most sensitive myeloid cell line to BCLAF1 perturbation, and our GRN analyses suggest it is particularly active in pediatric t(8;21) AML of inferred in-utero origin, thus this may represent an additional prognostic indicator.

Conclusions

Overall, we demonstrate that large-scale single-cell data integration provides a powerful tool to explore specific patient groups in detail, allowing for robust comparisons. We present the AML scAtlas as a data resource for researchers to answer their own biological questions. Applying our AML scAtlas to t(8;21) AML, we uncover age-associated gene regulatory networks which we propose relates to differences in the developmental origins, biology and oucome of the disease, and might point to novel candidate therapeutic targets which may be more effective in pediatric t(8;21) AML compared to adult-onset disease.

Methods

For the complete analysis code, including the conda environments used for analysis, see GitHub Repo (https://github.com/jesswhitts/AML-scAtlas).

Data Collection

A literature search was performed for published AML scRNA-seq datasets [46, 8, 2640]. Suitable studies were selected based on the data quality (over 1000 counts and 500 genes detected per cell for most of the data). Diagnostic, primary AML samples were selected from each AML study. Where healthy donor samples were present, these were also included, along with an additional 4 studies with healthy bone marrow samples.

Initial Data Processing

Each scRNA-seq dataset underwent initial quality control individually using Scanpy (v1.9.3) [73] as some studies provided raw data and others provided pre-filtered data. Where raw data was provided, doublets were removed using Scrublet (v0.2.3) [74] and cells were filtered using the median absolute deviation as described in this single-cell best practices handbook [10, 75].

Once filtered, datasets were combined, and quality control was performed using Scanpy (v1.9.3) [73]. The full dataset had quality thresholds applied (percentage mitochondrial counts <10, read counts >1000, gene counts >500), removing any samples which had fewer than 50 cells remaining after filtering. Genes present in <50 cells were removed. MATAL1 was removed as this was highly abundant in many cells and considered artefactual.

Batch Correction

The presence of batch effects was determined through dimensionality reduction and clustering using Scanpy (v1.9.3) [73] and using the kBET algorithm (v0.99.6) [76]. This was repeated on individual studies, to assess whether there were sample-wise batch effects. Batch correction benchmarking was implemented using Harmony (Scanpy v1.9.3 implementation) [11], scVI (v1.0.3) [12], and scANVI (v1.0.3) [13] and quantified using scIB (1.1.4) [9]. Different numbers of highly variable genes were used to select the optimal number for integration. Batch correction was performed using scVI (v1.0.3) [12] with the top 2000 highly variable genes, using sample as the model covariate.

AML scAtlas Annotation

The scVI corrected embedding was used to run UMAP and Leiden clustering using Scanpy functions (v1.9.3) [73]. Cell type annotation was performed using CellTypist (v1.6.0) [77], SingleR (v2.0.0) [78], and scType (v1.0) [79], taking the consensus of the different tools and the expression of key marker genes to manually curate the cell types. To annotate LSCs, a custom SingleR (v2.0.0) [78] reference was created using the Zeng et al [7] revised annotations of the Van Galen et al [5] dataset. This was also correlated with the LSC6 [43] and LSC17 [44] scores for each cell.

AML with t(8;21) Analysis

Samples with the t(8;21) translocation were selected from the full AML scAtlas. The UMAP was re-computed, and genes were filtered to remove those detected in fewer than 50 cells for the revised dataset, leaving 24,866 genes remaining. Gene regulatory network analysis was performed using pySCENIC [21, 22] (v0.12.1) as per the recommended workflow. To facilitate comparisons between age groups, cell types were focused on HSPCs, as many adult samples were originally enriched for CD34. The RSS was calculated for the adult and pediatric samples to select the top 20 differential regulons per age group. Using SciPy hierarchical clustering (v1.12.0), regulons were filtered to identify regulon signature groups used for downstream analysis.

Bulk RNA-Seq Analysis

Bulk RNA-Seq data was downloaded for the TARGET [2] and BeatAML [23] cohorts and samples with t(8;21) were selected. Using the previously defined signature regulons, the AUCell algorithm [21] (v0.12.1) was implemented to measure regulon activity. Hierarchical clustering was performed using SciPy v1.12.0) to identify samples most enriched for each age-related signature. Differential gene expression analysis was implemented using DESeq2 [80] (v1.40.2), using a log2 fold change threshold of 0.5 and an adjusted p value cutoff of 0.01. Candidate differential genes, ranked on log2 fold change, underwent GSEA with GSEApy (v0.10.8) using a significance threshold of 0.05. Cell type deconvolution was performed using AutoGeneS [60] (v1.0.4) using the recommended workflow. Significance when comparing groups was ascertained using a student’s T-test on the predicted cell type proportion values for each sample.

Single-Cell Multi-Omics Analysis

The Lambo et al [24] scRNA-seq and scATAC-seq data was downloaded, and the t(8;21) samples selected. Using our previously defined signature regulons, the AUCell algorithm [21] (v0.12.1) was implemented to measure regulon activity. This identified the samples most enriched for each age-related signature as AML16 and AML12.

The SCENIC+ pipeline [25] (v1.0a1) was implemented as per the recommended Snakemake workflow for creating pseudo-multiome data. Regulons were filtered by correlation between modalities, using a threshold of 0.2 for non-multiome data. The most robust regulons were prioritized based on the SCENIC+ recommendations (direct +/+) [25]. To facilitate comparisons, the eRegulon RSS was calculated for each patient and the top 30 eRegulons selected. The correlation between the gene sets underpinning eRegulons was calculated and sample-associated clusters were selected for over-representation analysis with clusterProfiler [81] (v4.8.3).

To predict the impact of specific TF perturbations on key cell types, SCENIC+ [25] perturbation modelling was implemented using the recommended parameters. TFs were then prioritized on their predicted impact on HSC differentiation and visualised using the PCA embedding. Candidate targets EP300 and BCLAF1 were queried in the DepMap [65] databases to infer their potential importance.

Availability of Data and Materials

The AML scAtlas is hosted online for public use (https://cellxgene.bmh.manchester.ac.uk/AML/). The processed AnnData object is also available to download from figshare (DOI: 10.48420/27269946). Details of all data used in this study can be found in Supplementary Table 1, along with associated links to the original data. All code used to perform the analyses presented here can be accessed in the GitHub repository: (https://github.com/jesswhitts/AML-scAtlas).

Acknowledgements

JW was funded by MRC DTP award (MR/W007428/1), SM by Blood cancer UK (15038) and CCLG (2016 09), while GL by Cancer Research UK (C5759/A20971 & C5759/A27412) and MI by MRC (MR/X014088/1).

We would like to thank all authors of the public data used in this study for their contributions to scientific community. We also acknowledge useful discussions around this work with Magnus Rattray.

Additional information

Author Contributions

MI, SMB, and GL conceived the study and oversaw the research. SM guided sub-analyses and assisted with interpretation. JW did the data collection and implemented downstream analyses. JW wrote the first draft of the manuscript. All authors interpreted the data and edited the manuscript. All authors approved the final manuscript.

Additional files

Supplementary figures and tables