Introduction

Acute myeloid leukemia (AML) is an aggressive blood cancer driven by non-random genomic rearrangements in hematopoietic stem/progenitor cells (HSPCs). Recurrent AML-associated genomic aberrations, which often involve transcriptional or epigenetic regulators, give rise to distinct patterns of gene expression strongly associated with clinical course and chemotherapy response [1, 2]. Single-cell RNA sequencing (scRNA-seq) studies have demonstrated that HSPCs acquire lineage priming at an early stage when still phenotypically immature and disperse down an erythromyeloid or lymphomyeloid differentiation trajectory [3]. In the context of AML, diverse clonal hierarchies include the co-existence of normal hematopoietic clones. Leukemic clones can partially recapitulate myeloid differentiation and have been shown to display functional differences even when defined by the same genotype [4-7]. Indeed, analysis of AML using scRNA-seq has revealed key clonal hierarchies, defining subtype-associated cell types, and dynamic changes following therapy, and have been critical in characterizing leukemic stem cells (LSCs), which propagate the disease and drive relapse [5-8].

Most AML scRNA-seq studies are limited by small sample numbers and include a mixture of different AML subtypes which may not be directly comparable to one another. Therefore, it is difficult to make biological conclusions with sufficient robustness to be clinically translatable in these individual datasets. To overcome this, we performed large-scale integration of public scRNA-seq datasets to create a single-cell transcriptomic atlas for AML (AML scAtlas). Due to the range of data sources spanning time, locations, and experimental designs, complex batch effects often arise between scRNA-seq datasets which requires a tailored data integration approach [9, 10]. Thus, we benchmarked some widely used batch correction tools [11-13] for our specific data use case.

Given the broad representation of age groups in AML scAtlas, we sought to investigate a developmental aspect of AML biology. Pediatric AMLs have substantially better clinical outcomes compared to adult AMLs[14-16]. The molecular landscape of AML differs between children and adults[2, 14-16]; this may, in part, reflect differences in the developmental origins of the disease. Chromosomal changes in pediatric leukemia are acquired in-utero, as evidenced by leukemia-specific genomic aberrations detected in the Gunthrie spots of children who later developed leukemia, sometimes several years after birth [17]. Adult leukemia, in contrast, is thought to develop later in life through acquisition of pre-leukaemic changes and clonal evolution of adult HSPCs [18, 19]. The impact of these developmental stages on leukemia biology remains incompletely understood, and no current methods exist to quantify and characterize differences in the origin of the disease. However, as childhood AML with presumed in-utero origin has a better outcome, for teenagers and young adults, determination of the pre- or postnatal origin might be important for better treatment stratification and prognostication.

AML with t(8;21) (AML-ETO/RUNX1-RUNX1T1) is one of the most frequent AML subtypes in young people, although it affects all ages [2]. The prenatal origins of the t(8;21) rearrangement, has been confirmed even in older children presenting with AML [17]. The prognosis of AML with t(8;21) is better in children than in teenagers and even more so than in young adults [20]. This outcome difference is not fully explainable by co-morbidities and may instead be related to the developmental origins of the disease. In the intermediate teenage group, t(8;21) AML may comprise both late childhood and early adult disease entities, a distinction that could have prognostic implications and could help to explain disease biology and clinical course.

We leveraged our AML scAtlas resource to characterize age and developmental stage specific signatures in t(8;21) AML by applying single-cell gene regulatory network (GRN) inference[21, 22], as a means of revealing cell state heterogeneity across age groups. We then validated and refined our findings in a larger cohort using bulk RNA sequencing (RNA-seq) data from the TARGET[2] and BeatAML[23] studies, defining age-associated GRN signatures and key regulators of t(8;21) AML, that may reflect the developmental origins of the leukemia.

Profiling both gene expression and chromatin accessibility together can decipher the enhancer-driven GRN (eGRN) and enriched transcriptional regulators. Significant heterogeneity across different patients and time points [24] was recently described by analyzing combined scRNA-seq and single-cell Assay for Transposase Accessible Chromatin sequencing (scATAC-seq). We used the t(8;21) AML data from this study to validate our initial findings, by applying cutting edge GRN inference methodology [25]. This encompasses both modalities to provide a denoised eGRN which we could correlate with our age-associated signatures.

Results

Large Scale Data Integration to Construct a Single-Cell Transcriptomic Atlas of AML (AML scAtlas)

To create the AML scAtlas, we integrated published scRNA-seq data of primary AML bone marrow samples, from 16 suitable high-quality studies (see Materials and Methods), comprising 159 AML samples (Figure 1A; Supplementary Table 1). Where on-treatment time points were available, we selected only diagnostic samples to establish a reference atlas of primary AML at diagnosis. If studies had healthy donor bone marrow samples, these were included, alongside data from healthy bone marrow samples from four additional scRNA-seq studies (Supplementary Table 1) to enable comparisons between malignant and healthy bone marrow populations. After cell filtering and quality control, the AML scAtlas contains data from 748,679 high quality cells derived from a total of 20 different scRNA-seq studies [4-6, 8, 26-40] (Supplementary Figure 1A). Each sample was assigned to an AML clinical subtype, based on the recent European Leukemia Net (ELN) clinical guidelines [41], and classified into the corresponding prognostic risk group. This resource captures a broad range of molecular subtypes of AML and spans different age groups, including both pediatric and adult AML cases (Figure 1B;1C). Overall, this is the largest dataset to date for exploring AML biology at single-cell resolution.

Large Scale Data Integration Creates a Single-Cell Atlas of AML

(A) Overview of the analysis steps in creating AML scAtlas. (B) Proportion of cells (left panel) and samples (right panel) belonging to each AML subtype as defined by the ELN clinical guideline. (C) Age group and gender distribution of AML scAtlas cohort samples. (D) scVI harmonized UMAP colored by annotated cell types. (E) The expression of key hematopoietic marker genes across annotated cell types shown on a dotplot. Color scale shows mean gene expression, dot size represents the fraction of cells expressing the given gene.

In the initial analysis of the combined dataset batch effects were noted, with study-specific clustering, which was quantified using several benchmarking metrics (Supplementary Table 2; Supplementary Figure 1A;1B). Even within samples of the same study, sample-wise clustering was noted (Supplementary Figure 1C). To address this, we benchmarked several widely used batch correction methods (Supplementary Table 2; Supplementary Figure 2A;2C), identifying scVI as the best method for this dataset (Supplementary Table 2). We therefore employed scVI to correct for batch effects, before clustering and cell type annotation in the AML scAtlas, by using the consensus of multiple annotation tool results (Supplementary Table 3), verified using cluster-wise marker gene expression (Figure 1D;1E).

Characterizing Cell Type Distributions in AML Subtypes

(A) UMAP highlighting the distribution of cells from different AML subtypes in AML scAtlas. (B) Schematic showing the workflow used to identify leukemic stem cells (LSCs) from the AML scAtlas hematopoietic stem and progenitor cell (HSPC) clusters. (C) Using the AML scAtlas HSPC clusters only, UMAP was regenerated and annotated with an AML-specific reference of leukemia stem and progenitor cells (LSPCs). (D) UMAPs showing the leukemic stem cell scores of each cell, for the LSC17 (left) and LSC6 (right). (E) Proportions of HSPC/LSPC populations in different AML subtypes (left) and AML risk groups (right), as defined by ELN clinical guidelines. (F) Comparison of LSC abundance in favourable and adverse ELN risk groups. Chi-Square test statistic: 8658.98, degrees of freedom: 1, P-value: 0.0.

Cell type proportions analyses across the clinically relevant subtypes in the dataset show that the AML subtypes were significantly biased towards myeloid cell types (CMP, MEP, GMP, ProMono, CD14+ Mono, CD16+ Mono, cDC, Erythroid) with each subtype exhibiting a clear predominant cell type consistent with AML clonal expansion (Figure 2A; Supplementary Figure 3A). In contrast, healthy donor samples had more balanced lineage proportions, with lymphoid cells (T, B, NK, ProB, pDC, Plasma) well represented (Figure 2A; Supplementary Figure 3A). Given the established critical role of HSPCs and LSCs in propagating AML, and their importance as therapeutic targets [42], we focused on HSPC clusters for further analysis (Figure 2B). To identify LSCs, we applied a curated reference profile of leukaemic stem and progenitor cells (LSPCs) [7] (Figure 2B;2C) and correlated this with calculated LSC6 [43] and LSC17[44] scores for each cell (Figure 2D). We then compared the proportions of HSPC/LSPCs across different AML subtypes and risk groups, as defined by the ELN clinical guidelines [41] (Figure 2E). Higher-risk subtypes displayed a higher proportion of LSCs compared to favorable risk disease (Figure 2E;2F).

AML scAtlas Reveals Age-Associated Heterogeneity in t(8;21) AML

(A) Depiction of the workflow to generate and validate the t(8;21) AML gene regulatory network (GRN) from AML scAtlas. (B) Using the AML scAtlas t(8;21) sample cells, UMAP was re-computed and shows the different cell types. (C)Bar plots of the absolute cell type numbers (left panel) and the cell type proportions (right panel) stratified by age group. The CD34 enrichment performed on several adult samples is reflected. (D) Using HSPCs and CMPs only, the pySCENIC gene regulatory network (GRN) and regulon AUC scores were calculated. Z-score normalized scores underwent hierarchical clustering to create a clustered heatmap and identify age-associated regulons. Regulons were prioritized using their regulon specificity scores (RSS).

Application of AML scAtlas to Identifying Age-Associated Gene Regulatory Networks in t(8;21) AML

The AML scAtlas enables robust comparison of adult and pediatric AML. We hypothesized that in adolescents and young adults with t(8;21) AML, the potential for either in-utero or postnatal HSPC origin disease might affect disease biology and prognosis. Thus, we sought to explore biological differences between pediatric and adult cases of t(8;21) AML, aiming to explain and potentially improve prognostication in adolescents and young adults. We selected samples with t(8;21) AML from the AML scAtlas, resulting in 105,663 cells from 13 adult cases (aged 20-67), 7 adolescent cases (aged 12-17), and 3 pediatric cases (aged 6-8) (Figure 3A-3C). Where gender information was not available, this was inferred from ChrY/XIST gene expression (Supplementary Figure 4A). Several adult samples underwent CD34 selection in original studies, excluding more differentiated cell types (mature lymphoid populations, monocytes, granulocytes) in these samples. Thus, these cell types were excluded from comparative analysis, focusing only on HSPCs and myeloid progenitors (CMP, GMP, MEP), which were well represented in all studies (Figure 3C).

Validation of Age-Associated Regulons in Large Bulk RNA-Seq Cohorts

(A) Using previously defined age-associated regulons, pySCENIC AUC scores (Z-score normalized) were clustered to identify samples most enriched for inferred-prenatal and inferred-postnatal origin signatures. (B) Volcano plot of differentially expressed genes when comparing the inferred-prenatal origin and inferred-postnatal origin samples. Adjusted P value threshold 0.01; log2 fold change threshold 0.5. Regulon signature associated TFs are indicated. (C) Enrichment plot of significant gene sets enriched in the inferred-prenatal origin samples. GSEA was performed on the DEGs using MSigDB databases. FDR q-value threshold <0.05. (D) Enrichment plot of drug sensitivity gene sets enriched in the inferred-prenatal samples. GSEA was performed on the DEGs, using drug response signatures from published studies of 4 widely used AML drugs. FDR q-value threshold <0.05. (E) The predicted cell type proportions estimated using AutoGeneS deconvolution, of the inferred-prenatal and inferred-postnatal origin samples were compared using T-Tests. Significant P values <0.05 (*), <0.01 (**), <0.001 (***) and <0.0001 (****) are indicated.

We reconstructed the GRN for the t(8;21) subset using the pySCENIC [22] pipeline, which is a python-based efficient implementation of original SCENIC method [21]. It is a state-of-the-art method for network inference from scRNA data, popular in the community [45-47] and has shown strong performance in a recent benchmarking study [48]. SCENIC’s three major steps are: First, it identifies groups of co-expressed genes as potential targets of a transcription factor (TF). Second, it filters these groups of genes to retain only TF targets with the corresponding binding motif, forming “regulons.” Third, it uses the AUCell method (embedded within SCENIC) to quantify the activity of each regulon in every cell. AUCell calculates the Area Under the Curve (AUC) for the regulon’s genes set in a ranking of all genes by expression for each cell. The top 20 regulons for each age groups were selected based on the regulon specificity score (RSS) (Supplementary Figure 4B). Unsupervised clustering on the Z score normalized regulon activity score matrix revealed clear differences in the GRN across different age groups (Supplementary Figure 4C). We hypothesize that the differences in the GRN might reflect differences in the pre- or postnatal developmental origins of the disease. Additional testing of GRN inference from individual studies shows that the high number of cells refines the overall GRN (see Methods; Supplementary Figure 4D-E).

To define gene regulatory programs (co-occurring gene modules, defined by a transcription factor and its targets) which are specific to different age groups (termed ‘regulon signature’), we used the clustered dendrogram to select the regulon clusters most associated with the pediatric (below 10 years-old) and adult samples (over 18 years-old) (Figure3D; Supplementary Figure 4C). The pediatric regulon signature, proposed to represent in-utero origin t(8;21) AML (henceforth termed ‘inferred-prenatal’), includes 16 regulons defined by a distinct group of hematopoietic transcription factors (TFs) (TRIM28, CTCF, RAD21, SOX4, TAL1, MYB, FOXN3, JUND, BCLAF1, ZBTB7A, IKZF1, MAZ, REST, YY1, CUX1, KDM5A), many of which have clearly defined roles in HSPCs and AML [49-51].The adult regulon signature, presumed representative of the postnatally acquired t(8;21) AML (henceforth termed ‘inferred-postnatal), combines 3 discrete clusters of regulons (YBX1, ENO1, and HDAC2; GATA1, POLE3, TFDP1, MYBL2, E2F4, and KLF1; IRF1, STAT1, IRF7, MAFF, ATF4, TAGLN2, SPI1, and KLF2), defined by TFs previously implicated in various hematopoietic, leukemic and inflammatory processes [52-54]. Importantly, both signatures contain key components of the AP-1 complex, which is heavily implicated in the biology of t(8;21) AML [55, 56] and undergoes dynamic changes during aging [57]. Samples of 6 individuals aged 12-17 clustered with the pediatric samples and showed enrichment for the inferred-prenatal signature (Figure 3D), suggesting that older adolescents (up to aged 17 in our cohort) more closely resemble pediatric AML with t(8;21) and remain biologically distinct from adult-onset disease. This implies that the inferred in-utero origin of t(8;21) AML can also be present in AML diagnosed in older children.

Validation of Age-Associated Regulons in Bulk-RNA-Seq Cohorts of t(8;21) AML

We next sought to externally validate our age-associated regulon signatures in a larger cohort of patients. Bulk RNA-seq samples were obtained from the TARGET [2] and BeatAML [23] cohorts, selecting bone marrow samples taken at diagnosis in line with AML scAtlas data (n=83; Supplementary Table 4). We applied the AUCell algorithm from pySCENIC [22] to calculate the activity of our pediatric inferred-prenatal and adult inferred-postnatal regulons in each sample. Unsupervised clustering of the bulk RNA-seq AUCell results revealed discrete clusters of samples that were highly enriched for our inferred-prenatal and inferred-postnatal origin-associated regulons (Figure 4A).

Given the limitations of most scRNA-seq platforms in detecting lowly expressed genes, notably TFs, we leveraged bulk RNA-seq samples to refine our identified gene regulatory networks by detecting differentially expressed regulon-associated TFs. We used our inferred-prenatal and inferred-postnatal signature clusters and performed differential gene expression analysis between these samples, using two widely used tools (DESeq2[59] and edgeR[58]) to ensure robustness of the results (Figure 4B; Supplementary Figure 5A). We then compared differentially expressed regulon-associated TFs between the two groups and intersected this with the differential genes detected by each method. Although changes in TF expression are subtle (Figure 4B; Supplementary Figure 5A), we identify significantly differentially expressed TFs which reflect the observed differences in regulon activity and indicate the most critical regulons in our age-related GRN signatures (Figure 4A; Supplementary Figure 5B). This further delineated the inferred-prenatal signature to 5 key TFs (KDM5A, REST, BCLAF1, YY1, and RAD21), and the inferred-postnatal signature to 8 TFs (ENO1, TFDP1, MYBL2, TAGLN2, KLF2, IRF7, SPI1, and YBX1).

Combining Multiomics Data Interrogates Age-Associated Regulons

(A) SCENIC+ eRegulon dotplot of showing correlation between scRNA-seq target gene activity (indicated by the color scale) and scATAC-seq target region accessibility (depicted by spot size). RSS identified the key activating eRegulons (+/+) between inferred-prenatal and inferred-postnatal origin disease and allows comparison of diagnosis (Dx) and relapse (Rel) time points. (B) Network showing the inferred-prenatal (blue) and inferred-postnatal (orange) associated eRegulons. Node size represents the number of target genes in each regulon. Edges represent interactions between nodes. (C) Over-representation analysis of age-associated eRegulon target genes using GO Biological Processes curated gene sets. Adjusted P value threshold 0.05. (D) Principal components analysis (PCA) of the gene based eRegulon enrichment scores for the inferred-prenatal origin disease at diagnosis and relapse. PC1 axis explains variance occurring between diagnosis and relapse, where this patient underwent a lineage switch. PC2 captures variance related to hematopoietic differentiation. (E) SCENIC+ perturbation simulation shows the predicted effect of knockout of selected TFs on the previously computed PCA embedding. Arrows indicate the predicted shift in cell states relative to the initial PCA embedding.

We next performed gene set enrichment analysis (GSEA) on significantly differentially expressed genes as determined by edgeR[58], to investigate pathways enriched in the inferred-prenatal samples compared to the inferred-postnatal ones (Figure 4C). Notably, inferred-prenatal samples showed increased expression of stemness-associated genes, and SMARCA2 target genes, a key player in HSC gene expression regulation and chromatin remodeling [60]. SMARCA2 is also known to be upregulated during the fetal-to-adult HSC transition [61], implying that the observed SMARCA2 enrichment may indeed reflect the inferred fetal HSC cell-of-origin. Genes impacted by YY1 depletion were also downregulated compared to the samples of inferred-postnatal leukemia origin, which supports the identification of YY1 as an inferred-prenatal regulon (Figure 4C). To explore therapeutic implications, we performed GSEA using drug response signatures from published studies [62-65] (Figure 4D). This analysis revealed that inferred-prenatal origin t(8;21) AML is enriched for genes associated with increased chemosensitivity to cytarabine, venetoclax, and daunorubicin (Figure 4D).

We hypothesized that the increase in stemness-associated genes in the leukemia with the inferred-prenatal origin could be reflective of potential differences in the leukemic cell-of-origin and its impact on myeloid differentiation. We therefore performed cell type deconvolution using AutoGeneS [66], with a curated LSPC reference profile [7], to compare the cellular heterogeneity between prenatal and postnatal origin bulk RNA-seq samples (Figure 4E). This revealed a higher proportion of HSPC cell types (HSC, Prog), with a reduction in some differentiated myeloid cell types (ProMono-like, cDC-like) in the samples of inferred-prenatal origin (Figure 4E). To corroborate this finding, we examined cell type proportions in the original t(8;21) subset of AML scAtlas, confirming that cells with the inferred-prenatal signature comprise more HSCs than inferred-postnatal signature cells (Supplementary Figure 5D-E). However, comparison of cell type proportions in this dataset is confounded by differences in sample processing as some studies performed CD34 selection, hence there is more cell type diversity observed in the pediatric samples (Supplementary Figure 5D-E).

Multiomics Single-Cell Data Reveals a Denoised GRN and Identifies Candidate Perturbations in Prenatal Origin t(8;21) AML

We next used the scRNA-seq and scATAC-seq data from a recent cohort of pediatric t(8;21) AML patients [24] at multiple clinical time points to uncover the enhancer-driven GRN (eGRN) in inferred-prenatal and inferred-postnatal origin t(8;21) AML (Supplementary Table 4). Initially, we identified two representative samples of our inferred-prenatal and inferred-postnatal signatures by using pySCENIC AUCell [22] to measure the activity of our previously defined regulons. Unsupervised clustering of the AUC scores was used to infer whether each sample matched the regulon signatures, identifying one inferred-prenatal sample and one inferred-postnatal sample for downstream analysis (Supplementary Figure 6A).

We then applied SCENIC+ [25], which integrates scRNA-seq and scATAC-seq to identify candidate enhancer regions and TF-binding motifs, linking TFs to target genes and identified enhancers. This creates enhancer-driven regulons (eRegulons), forming an eGRN. We applied SCENIC+ [25] to the leukemia samples with the inferred-prenatal and inferred-postnatal origin at diagnosis and relapse, keeping only regulons that showed a correlation between both modalities to retain only the most robust regulons (Supplementary Figure 6B). This revealed several eRegulons across both patients (Supplementary Figure 6D), many of which were patient specific, particularly when comparing HSPC populations (Figure 4A). The inferred-prenatal sample displayed a specific HSC eRegulon profile. In contrast, the inferred-postnatal sample more closely resembled the corresponding Granulocyte-Monocyte Progenitor (GMP) (Figure 5A). Interestingly, at relapse the inferred-prenatal origin patient undergoes a chemotherapy-driven lineage switch to a lymphoid phenotype, which may suggest that the leukemia originated from a less committed progenitor (Figure 5A).

To identify clusters of closely related eRegulons, we computed the correlations between eRegulons enrichment. We identified 2 main clusters of eRegulons which correspond to different inferred-signature samples (Figure 4B; Supplementary Figure 6C). For each eRegulon cluster, we used the associated target genes as input for gene ontology over representation analysis (ORA), to assess functional differences in the eGRN. This revealed fundamental differences in the underlying biological processes (Figure 5C; Supplementary Figure 7A). The AML sample with inferred-prenatal origin was enriched for many processes associated with development. In contrast, inferred-postnatal samples appeared more metabolism focused (Figure 5C; Supplementary Figure 7A). This further supports the association of these eRegulons with presumed prenatal origin t(8;21) AML, compared to postnatal origin disease.

Previous analysis using the TARGET [2] and BeatAML [23] datasets indicated that inferred-prenatal and inferred-postnatal origin t(8;21) AML may harbor different levels of chemosensitivity based on published drug response signatures (Figure 4D). Therefore, we performed in silico perturbations of eRegulon-associated TFs. PCA of the diagnosis and relapse samples recapitulated the expected differentiation trajectories along PC2, while separating diagnosis from relapse along PC1 (Figure 5D). Using the SCENIC+ [25] perturbation simulation workflow, we identified TFs estimated to induce differentiation, as defined by a negative shift in PC2 (Supplementary Figure 7B). We prioritized TFs predicted to impact the HSC compartment and identified 18 TFs with predicted significant effects on HSC differentiation (Supplementary Figure 7C). Several of these are components of the AP-1 complex (JUN, ATF4, FOSL2), which are established downstream targets of the t(8;21) fusion protein and are known to propagate t(8;21) AML [55, 67] (Figure 5E; Supplementary Figure 7C).

Using AP-1 complex members as a comparative baseline, we identified EP300 as one of the most impactful hits. EP300 has recently been shown to drive t(8;21) AML self-renewal through acetylation dependent mechanism [68]. This suggests that presumed prenatal origin pediatric t(8;21) AML may be particularly sensitive to EP300 inhibition. One of the most striking predictions, for both diagnostic and relapse HSC populations, was BCLAF1 (Figure 5E; Supplementary Figure 7C). BCLAF1 is a regulator of normal HSPCs [69], and its expression level declines during hematopoietic differentiation. While recent studies have identified a role for BCLAF1 in AML [70], this has not been explored in detail in the context of pediatric AML or t(8;21) AML and may present a therapeutic opportunity.

We also performed SCENIC+ [25] perturbation modelling on the postnatal origin sample (AML12). In this case, the PCA was less straightforward to interpret, as branching differentiation trajectories towards a lymphoid or myeloid fate appear along the PC2 axis, while PC1 distinguishes diagnosis and relapse samples (Supplementary Figure 7D). Therefore, we prioritized TFs based on a predicted effect similar to AP-1 complex components, as it is known that this complex is a critical regulator in t(8;21) AML. We identified several TFs from our original postnatal origin signature were predicted to have an effect (Supplementary Figures 7D-F), supporting the relevance of the GRNs identified in our previous analyses.

To further investigate EP300 and BCLAF1, we queried the DepMap [71] database to assess the dependency of t(8;21) AML cell lines to these genes (Supplementary Figure 7G). We found that the two widely used cell lines of t(8;21) AML, KASUMI-1 (7-year-old donor) [72] and SKNO-1 (22-year-old donor) [73], were among the most sensitive to these perturbations based on their DepMap effect scores (Supplementary Figure 7G). Several other cell lines sensitive to BCLAF1 were derived from pediatric cancers, most notably neuroblastomas, which also arise in-utero [74] (Supplementary Figure 7H). Together, these findings suggest that EP300 inhibition may be particularly effective in t(8;21) AML, and that BCLAF1 may present a new therapeutic target for t(8;21) AML, particularly in pediatric cases with inferred pre-natal origin of the driver translocation.

Discussion

Here we have generated a new data resource, AML scAtlas, to investigate AML biology across a broad range of subtypes at single-cell resolution. By including 222 samples comprising 748,679 cells of patients with a wide range of clinical characteristics, AML scAtlas overcomes the limitations of many standalone single-cell studies enabling AML subtype-focused analysis with enough data for robust statistical comparisons. This dataset is publicly available (https://cellxgene.bmh.manchester.ac.uk/AML/) providing the AML research community with a resource to address diverse biological questions and generate new hypotheses.

To further address a clinically relevant question using this data source, we compared differences between pediatric and adult-onset disease based on the potential biological effect of the in-utero origin of pediatric leukemia. Data of our AML scAtlas was used to explore the GRNs in adult and pediatric t(8;21) AML and revealed a strong age-associated GRN signature. This suggests that while pediatric and adult t(8;21) AMLs are propagated by the same driver translocation, they exhibit clear biological differences correlated with age. This may be due to differences in the cell-of-origin, with mouse models showing that t(8;21) AML can arise from a HSC or a more lineage restricted GMP [75]. As pediatric t(8;21) can arise in-utero, as evidenced by previous studies [17], and adult t(8;21) is acquired postnatally [18, 19], we propose that the observed age-related differences in AML with t(8;21) reflect these differences in the developmental origins of the disease. We identified two distinct groups of regulons corresponding to either inferred-prenatal origin and inferred-postnatal origin disease. These regulons constitute the GRN underlying the cellular state, which can be informative when identifying molecular vulnerabilities to target leukemia.

Our cohort is the largest scRNA-seq dataset to explore t(8;21) AML biology to date, however, the number of patients included remains low (n=22), and many of the studies containing the adult samples used CD34 selection in their experimental protocol creating a bias towards HSPCs in these samples. To overcome some of these limitations, we used bulk RNA-seq samples from the TARGET [2] and BeatAML [23] studies with t(8;21) AML (n=83) to validate our regulon signatures. This identifies two clusters of samples which closely match these signatures, showing that the regulon patterns identified from our AML scAtlas are recapitulated with bulk RNA-seq data enabling exploration of larger patient cohorts. Comparisons between inferred-prenatal and inferred-postnatal origin transcriptomes prioritized TFs which were differentially expressed and highlighted differences in underlying biology and drug response. We identified 5 signature TFs (KDM5A, REST, BCLAF1, YY1, RAD21) for inferred-prenatal origin disease, several of which have roles in embryonic stem cells [76, 77], and critical functions in the maintenance of HSCs [49-51]. In contrast, TFs identified in inferred-postnatal origin samples, such as interferon regulatory factors (IRFs), HDAC2, and SPI1, reflect inflammatory and immune processes, many of which have been implicated in leukemia [52-54]. We also found that inferred-prenatal origin samples had a higher proportion of HSC/Prog cell types compared to inferred-postnatal origin samples, a more primitive state than postnatal onset t(8;21) AML cases, supporting the hypothesis that age-associated differences in the cell-of-origin influence disease biology.

Given these biological differences, we used bulk RNA-seq to predict chemosensitivity using published drug response signatures [62-65]. Inferred-prenatal samples were enriched for genes indicative of cytarabine sensitivity and depleted of genes suggestive of daunorubicin and venetoclax resistance. These findings suggest that the developmental origins of the disease may influence drug responses, with potential implications in the design of novel therapeutic strategies and providing further biological evidence that pediatric AML might benefit from different clinical management compared with adult-onset AML. Importantly, venetoclax is currently in the AML23 trial (NCT05955261); our results support further evaluation of venetoclax treatment in pediatric t(8;21) AML.

Using an additional single-cell multiomic dataset, using SCENIC+, we reconstructed the eGRN in samples matching our inferred-prenatal and inferred-postnatal regulon signatures. Upon comparing eRegulons for each patient at diagnosis and relapse, we identified clusters of highly correlated eRegulons defined by different biological processes. Inferred-prenatal origin samples are characterized by developmental and transcriptional dysregulation, whereas inferred-postnatal origin samples are largely driven by fundamental cellular processes linked to inflammation. We used SCENIC+ to model the predicted impact of TF perturbations on our prenatal origin sample at diagnosis and relapse identified several key components of the AP-1 complex, which are critical in t(8;21) AML biology and are also associated with dynamic age-related transcriptional changes [55-57].

Through our analysis, we identified EP300 as a candidate target, which has been shown to be critical for t(8;21) AML biology [68] with demonstrable effects in KASUMI-1 and SKNO1 cell lines. EP300 has been identified as a promising therapeutic target in AML with several molecules in development [78]; our data indicate potential specific therapeutic benefit in prenatal origin t(8;21) AML. One of the most impactful perturbation predictions for the HSC compartment at diagnosis and relapse was BCLAF1. This is consistent with previous evidence of its importance in HSCs [69, 79] and AML [70], but has not been studied specifically in the context of pediatric AML or t(8;21) AML previously. The DepMap data shows that KASUMI-1 is the most sensitive myeloid cell line to BCLAF1 perturbation, and our GRN analyses suggest it is particularly active in pediatric t(8;21) AML of inferred in-utero origin, thus this may represent an additional prognostic indicator.

Further investigations are required to characterize the roles of both EP300 and BCLAF1 in prenatal origin t(8;21) AML before any clinical realisation. EP300 has already been investigated as a target in AML, so future work should focus on the pediatric AML setting with in-vitro and in-vivo studies using EP300/CBP inhibitors such as inobrodib [78]. In contrast

BCLAF1 is relatively unexplored, and additional work is required to elucidate its molecular function and assess its potential as a therapeutic target. BCLAF1 may ultimately prove most valuable as a biomarker of in-utero t(8;21) AML, enabling distinction between late-onset in-utero and postnatal disease. This would require molecular validation in a large cohort of pediatric patients with Gunthrie spots to confirm whether they had acquired t(8;21) in-utero.

Conclusions

Overall, our study demonstrates that large-scale single-cell data integration is a powerful approach to dissect specific patient groups in detail, and enabling robust comparative analyses. We present the AML scAtlas as a publicly available resource for the research community to address diverse biological questions. By applying AML scAtlas to t(8;21) AML, we identified age-associated gene regulatory networks that likely reflect differences in the developmental origins, biology and outcome of the disease. These findings also highlight novel candidate therapeutic targets which may be more relevant in pediatric t(8;21) AML compared to adult-onset disease, offering opportunities for more tailored treatment strategies.

Methods

For the complete analysis code, including the conda environments used for analysis, see GitHub Repo (https://github.com/jesswhitts/AML-scAtlas).

Data Collection

A literature search was performed for published AML scRNA-seq datasets [4-6, 8, 26-40]. Suitable studies were selected based on the data quality (over 1000 counts and 500 genes detected per cell for most of the data). Diagnostic, primary AML samples were selected from each AML study. Where healthy donor samples were present, these were also included, along with an additional 4 studies with healthy bone marrow samples.

Initial Data Processing

Each scRNA-seq dataset underwent initial quality control individually using Scanpy (v1.9.3) [80] as some studies provided raw data and others provided pre-filtered data. Where raw data was provided, doublets were removed using Scrublet (v0.2.3) [81] and cells were filtered using the median absolute deviation as described in this single-cell best practices handbook [10, 82].

Once filtered, datasets were combined, and quality control was performed using Scanpy (v1.9.3) [80]. The full dataset had quality thresholds applied (percentage mitochondrial counts <10, read counts >1000, gene counts >500), removing any samples which had fewer than 50 cells remaining after filtering. Genes present in <50 cells were removed. MATAL1 was removed as this was highly abundant in many cells and considered artefactual.

Batch Correction

The presence of batch effects was determined through dimensionality reduction and clustering using Scanpy (v1.9.3) [80] and using the kBET algorithm (v0.99.6) [83]. This was repeated on individual studies, to assess whether there were sample-wise batch effects. Batch correction benchmarking was implemented using Harmony (Scanpy v1.9.3 implementation) [11], scVI (v1.0.3) [12], and scANVI (v1.0.3) [13] and quantified using scIB (1.1.4) [9]. Different numbers of highly variable genes were used to select the optimal number for integration. Batch correction was performed using scVI (v1.0.3) [12] with the top 2000 highly variable genes, using sample as the model covariate.

AML scAtlas Cell Type Annotation

The scVI corrected embedding was used to run UMAP and Leiden clustering using Scanpy functions (v1.9.3) [80]. Cell type annotation was performed using CellTypist (v1.6.0) [84] using the ‘Immune_All_Low.pkl’ model, SingleR (v2.0.0) [85] using the Novershtern hematopoietic refreference[86], and scType (v1.0) [87] with the tissue defined as ‘Immune system’. Full automated tool outputs are detailed in Supplementary Table 3; overall we found that the results varied significantly between different tools. We postulate that this is, in part, due to differences in the reference profiles used. Thus, we opted to use the best consensus of these different tools for our cluster identity assignments.

AML scAtlas LSC Annotation

HSPC clusters were selected from AML scAtlas, and the scVI corrected embedding was used to re-compute UMAP using Scanpy functions (v1.9.3) [80]. As our previous cell type annotations used generic reference profiles and were not AML specific, we generated a custom cell type annotation reference to identify LSCs. We created a custom SingleR (v2.0.0) [85] reference using the Zeng et al [7] revised annotations of the Van Galen et al [5] dataset (Supplementary Table 3). This was also correlated with the LSC6 [43] and LSC17 [44] scores for each cell. To compare LSC abundance between ELN risk groups, chi2_contingency was implemented from SciPy (v1.12.0).

AML with t(8;21) Analysis

Samples with the t(8;21) translocation were selected from the full AML scAtlas. The UMAP was re-computed, and genes were filtered to remove those detected in fewer than 50 cells for the revised dataset, leaving 24,866 genes remaining. Gene regulatory network analysis was performed using pySCENIC [21, 22] (v0.12.1) as per the recommended workflow. To facilitate comparisons between age groups, cell types were focused on HSPCs, as many adult samples were originally enriched for CD34. The RSS was calculated for the adult and pediatric samples to select the top 20 differential regulons per age group. Using SciPy hierarchical clustering (v1.12.0), regulons were filtered to identify regulon signature groups used for downstream analysis.

Bulk RNA-Seq Analysis

Bulk RNA-Seq data was downloaded for the TARGET [2] and BeatAML [23] cohorts and samples with t(8;21) were selected. Only bone marrow samples taken at diagnosis were used for downstream analyses (Supplementary Table 4). Using the previously defined signature regulons, the AUCell algorithm [21] (v0.12.1) was implemented to measure regulon activity. Hierarchical clustering was performed using SciPy v1.12.0) to identify samples most enriched for each age-related signature. Differential gene expression analysis was implemented using edgeR [58] (v3.42.4) and DESeq2 [59] (v1.40.2), using a log2 fold change threshold of 0.5 and an adjusted p value cutoff of 0.01. Candidate differential genes, ranked on log2 fold change, underwent GSEA with GSEApy (v0.10.8) using a significance threshold of 0.05. Cell type deconvolution was performed using AutoGeneS [66] (v1.0.4) using the recommended workflow. Significance when comparing groups was ascertained using a student’s T-test on the predicted cell type proportion values for each sample.

Single-Cell Multi-Omics Analysis

The Lambo et al [24] scRNA-seq and scATAC-seq data from pediatric AML bone marrow samples was downloaded, and the t(8;21) samples selected (Supplementary Table 4). Using our previously defined signature regulons, the AUCell algorithm [21] (v0.12.1) was implemented to measure regulon activity. This identified the samples most enriched for each age-related signature as AML16 and AML12.

The SCENIC+ pipeline [25] (v1.0a1) was implemented as per the recommended Snakemake workflow for creating pseudo-multiome data. Regulons were filtered by correlation between modalities, using a threshold of 0.2 for non-multiome data. The most robust regulons were prioritized based on the SCENIC+ recommendations (direct +/+) [25]. To facilitate comparisons, the eRegulon RSS was calculated for each patient and the top 30 eRegulons selected. The correlation between the gene sets underpinning eRegulons was calculated and sample-associated clusters were selected for over-representation analysis with clusterProfiler [88] (v4.8.3).

To predict the impact of specific TF perturbations on key cell types, SCENIC+ [25] perturbation modelling was implemented using the recommended parameters. TFs were then prioritized on their predicted impact on HSC differentiation and visualised using the PCA embedding. Candidate targets EP300 and BCLAF1 were queried in the DepMap [71] databases to infer their potential importance.

Acknowledgements

JW was funded by MRC DTP award (MR/W007428/1), SM by Blood cancer UK (15038) and CCLG (2016 09), while GL by Cancer Research UK (C5759/A20971 & C5759/A27412) and MI by MRC (MR/X014088/1).

We would like to thank all authors of the public data used in this study for their contributions to scientific community. We also acknowledge useful discussions around this work with Magnus Rattray.

Additional information

Data availability

The AML scAtlas is hosted online for public use (https://cellxgene.bmh.manchester.ac.uk/AML/). The processed AnnData object is also available to download from figshare (DOI: 10.48420/27269946). Details of all data used in this study can be found in Supplementary Table 1, along with associated links to the original data. All code used to perform the analyses presented here can be accessed in the GitHub repository: (https://github.com/jesswhitts/AML-scAtlas). The samples used for validation analyses are publicly available and are detailed in Supplementary Table 4. The SCENIC+ eGRN files are provided as Supplementary Table 5.

Author Contributions

MI, SMB, and GL conceived the study and oversaw the research. JW did the data collection and implemented the analyses. SM guided sub-analyses and assisted with interpretation. JW wrote the first draft of the manuscript. All authors interpreted the data and edited the manuscript. All authors approved the final manuscript.

Funding

Medical Research Council (MR/W007428/1)

Medical Research Council (MR/X014088/1)

Cancer Research UK (C5759/A20971)

Cancer Research UK (C5759/A27412)

Blood Cancer UK (15038)

Additional files

Supplementary figures and tables

Supplementary Table 3

Supplementary Table 4

Supplementary Table 5