Single-Cell Atlas of AML Reveals Age-Related Gene Regulatory Networks in t(8;21) AML

  1. Division of Informatics, Imaging and Data Sciences, Faculty of Biology, Medicine and Health, The University of Manchester, Manchester, United Kingdom
  2. Stem Cell Biology Group, Cancer Research UK Manchester Institute, The University of Manchester, Manchester, United Kingdom
  3. Manchester Cancer Research Centre (MCRC), Division of Cancer Sciences, School of Medical Sciences, Faculty of Biology, Medicine and Health, The University of Manchester, Manchester, United Kingdom
  4. Department of Paediatric and Adolescent Oncology, Royal Manchester Children’s Hospital, Manchester, United Kingdom
  5. Department of Adolescent Oncology, The Christie NHS Foundation Trust, Manchester, United Kingdom

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.


Editors

  • Reviewing Editor
    Murim Choi
    Seoul National University, Seoul, Republic of Korea
  • Senior Editor
    Murim Choi
    Seoul National University, Seoul, Republic of Korea

Reviewer #1 (Public review):

Summary:

In this manuscript, the authors integrated 48 public scRNA-seq datasets and created a single-cell transcriptomic atlas for AML (222 samples comprising 748,679 cells). This is important since most AML scRNA-seq studies suffer from small sample size coupled with high heterogeneity. They used this atlas to further dissect AML with t(8;21) (AML1-ETO/RUNX1-RUNX1T1), which is one of the most frequent AML subtypes in young people. In particular, they were able to predict Gene Regulatory Networks in this AML subtype using pySCENIC, which identified the paediatric regulon defined by a distinct group of hematopoietic transcription factors (TFs) and the adult regulon for t(8;21). They further validated this in bulk RNA-seq with the AUCell algorithm and attributed the prenatal signature to 5 key TFs (KDM5A, REST, BCLAF1, YY1, and RAD21), and the postnatal signature to 9 TFs (ENO1, TFDP1, MYBL2, KLF1, TAGLN2, KLF2, IRF7, SPI1, and YBX1). They also used SCENIC+ to identify enhancer-driven regulons (eRegulons), forming an eGRN, and found that prenatal origin shows a specific HSC eRegulon profile, while a postnatal origin shows a GMP profile. They also did an in silico perturbation and found the AP-1 complex (JUN, ATF4, FOSL2), P300, and BCLAF1 as important TFs to induce differentiation. Overall, I found this study very important in creating a comprehensive resource for AML research.

Strengths:

(1) The generation of an AML atlas integrating multiple datasets with almost 750K cells will further support the community working on AML.

(2) Characterisation of t(8;21) AML proposes new interesting leads.

Weaknesses:

Were these t(8;21) TFs/regulons identified from any of the single datasets? For example, if the authors apply pySCENIC to any dataset, would they find the same TFs, or is it the increase in the number of cells that allows identification of these?

Reviewer #2 (Public review):

Summary:

The authors assemble 222 publicly available bone marrow single-cell RNA sequencing samples from healthy donors and primary AML, including pediatric, adolescent, and adult patients at diagnosis. Focusing on one specific subtype, t(8;21), which, despite affecting all age classes, is associated with better prognosis and drug response for younger patients, the authors investigate if this difference is reflected also in the transcriptomic signal. Specifically, they hypothesize that the pediatric and part of the young population acquires leukemic mutations in utero, which leads to a different leukemogenic transformation and ultimately to differently regulated leukemic stem cells with respect to the adult counterpart. The analysis in this work heavily relies on regulatory network inference and clustering (via SCENIC tools), which identifies regulatory modules believed to distinguish the pre-, respectively, post-natal leukemic transformation. Bulk RNA-seq and scATAC-seq datasets displaying the same signatures are subsequently used for extending the pool of putative signature-specific TFs and enhancer elements. Through gene set enrichment, ontology, and perturbation simulation, the authors aim to interpret the regulatory signatures and translate them into potential onset-specific therapeutic targets. The putative pre-natal signature is associated with increased chemosensitivity, RNA splicing, histone modification, stem-ness marker SMARCA2, and potentially maintained by EP300 and BCLAF1.

Strengths:

The main strength of this work is the compilation of a pediatric AML atlas using the efficient Cellxgene interface. Also, the idea of identifying markers for different disease onsets, interpreting them from a developmental angle, and connecting this to the different therapy and relapse observations, is interesting. The results obtained, the set of putative up-regulated TFs, are biologically coherent with the mechanisms and the conclusions drawn. I also appreciate that the analysis code was made available and is well documented.

Weaknesses:

There were fundamental flaws in how methods and samples were applied, a general lack of critical examination of both the results and the appropriateness of the methods for the data at hand, and in how results were presented. In particular:

(1) Cell type annotation:

a) The 2-phase cell type annotation process employed for the scRNA-seq sample collection raised concerns. Initially annotated cells are re-labeled after a second round with the same cell types from the initial label pool (Figure 1E). The automatic annotation tools were used without specifying the database and tissue atlases used as a reference, and no information was shown regarding the consensus across these tools.

b) Expression of the CD34 marker is only reported as a selection method for HSPCs, which is not in line with common practice. The use of CD34 alone is acceptable as a surface marker, whereas robust annotation of HSPCs should be based on the expression of gene sets.

c) During several analyses, the cell types used were either not well defined or contradictory, such as in Figure 2D, where it is not clear if pySCENIC and AUC scores were computed on HSPCs alone or merged with CMPs. In other cases, different cell type populations are compared and used interchangeably: comparing the HSPC-derived regulons with bulk (probably not enriched for CD34+ cells) RNA samples could be an issue if there are no valid assumptions on the cell composition of the bulk sample.

(2) Method selection:

a) The authors should explain why they use pySCENIC and not any other approach. They should briefly explain how pySCENIC works and what they get out in the main text. In addition they should explain the AUCell algorithm and motivate its usage.

b) The obtained GRN signatures were not critically challenged on an external dataset. Therefore, the evidence that supports these signatures to be reliable and significant to the investigated setting is weak.

(3) There are some issues with the analysis & visualization of the data.

(4) Discussion:

a) What exactly is the 'regulon signature' that the authors infer? How can it be useful for insights into disease mechanisms?

b) The authors write 'Together this indicates that EP300 inhibition may be particularly effective in t(8;21) AML, and that BCLAF1 may present a new therapeutic target for t(8;21) AML, particularly in children with inferred pre-natal origin of the driver translocation.' I am missing a critical discussion of what is needed to further test the two targets. Put differently: Would the authors take the risk of a clinical study given the evidence from their analysis?

Author response:

Reviewer #1 (Public review):

Summary:

In this manuscript, the authors integrated 48 public scRNA-seq datasets and created a single-cell transcriptomic atlas for AML (222 samples comprising 748,679 cells). This is important since most AML scRNA-seq studies suffer from small sample size coupled with high heterogeneity. They used this atlas to further dissect AML with t(8;21) (AML1-ETO/RUNX1-RUNX1T1), which is one of the most frequent AML subtypes in young people. In particular, they were able to predict Gene Regulatory Networks in this AML subtype using pySCENIC, which identified the paediatric regulon defined by a distinct group of hematopoietic transcription factors (TFs) and the adult regulon for t(8;21). They further validated this in bulk RNA-seq with the AUCell algorithm and attributed the prenatal signature to 5 key TFs (KDM5A, REST, BCLAF1, YY1, and RAD21), and the postnatal signature to 9 TFs (ENO1, TFDP1, MYBL2, KLF1, TAGLN2, KLF2, IRF7, SPI1, and YBX1). They also used SCENIC+ to identify enhancer-driven regulons (eRegulons), forming an eGRN, and found that prenatal origin shows a specific HSC eRegulon profile, while a postnatal origin shows a GMP profile. They also did an in silico perturbation and found the AP-1 complex (JUN, ATF4, FOSL2), P300, and BCLAF1 as important TFs to induce differentiation. Overall, I found this study very important in creating a comprehensive resource for AML research.

Strengths:

(1) The generation of an AML atlas integrating multiple datasets with almost 750K cells will further support the community working on AML.

(2) Characterisation of t(8;21) AML proposes new interesting leads.

We thank the reviewer for a succinct summary of our work and for highlighting its strengths.

Weaknesses:

Were these t(8;21) TFs/regulons identified from any of the single datasets? For example, if the authors apply pySCENIC to any dataset, would they find the same TFs, or is it the increase in the number of cells that allows identification of these?

The purpose of our study was to gain biological insights by integrating multiple datasets, to overcome limitations from small sample size. We expect that the larger dataset would improve network inference, which is what we implemented in the manuscript, hence we have not looked at individual datasets. However, we will investigate this further in the revised manuscript by running pySCENIC on individual datasets and comparing to the results drawn from the whole atlas.

Reviewer #2 (Public review):

Summary:

The authors assemble 222 publicly available bone marrow single-cell RNA sequencing samples from healthy donors and primary AML, including pediatric, adolescent, and adult patients at diagnosis. Focusing on one specific subtype, t(8;21), which, despite affecting all age classes, is associated with better prognosis and drug response for younger patients, the authors investigate if this difference is reflected also in the transcriptomic signal. Specifically, they hypothesize that the pediatric and part of the young population acquires leukemic mutations in utero, which leads to a different leukemogenic transformation and ultimately to differently regulated leukemic stem cells with respect to the adult counterpart. The analysis in this work heavily relies on regulatory network inference and clustering (via SCENIC tools), which identifies regulatory modules believed to distinguish the pre-, respectively, post-natal leukemic transformation. Bulk RNA-seq and scATAC-seq datasets displaying the same signatures are subsequently used for extending the pool of putative signature-specific TFs and enhancer elements. Through gene set enrichment, ontology, and perturbation simulation, the authors aim to interpret the regulatory signatures and translate them into potential onset-specific therapeutic targets. The putative pre-natal signature is associated with increased chemosensitivity, RNA splicing, histone modification, stem-ness marker SMARCA2, and potentially maintained by EP300 and BCLAF1.

Strengths:

The main strength of this work is the compilation of a pediatric AML atlas using the efficient Cellxgene interface. Also, the idea of identifying markers for different disease onsets, interpreting them from a developmental angle, and connecting this to the different therapy and relapse observations, is interesting. The results obtained, the set of putative up-regulated TFs, are biologically coherent with the mechanisms and the conclusions drawn. I also appreciate that the analysis code was made available and is well documented.

We thank the reviewer for reviewing our work and highlighting its key features, including the creation of the AML atlas and the downstream analysis and interpretation for the t(8;21) subtype.

We also appreciate the useful critique of our paper provided below.

Weaknesses:

There were fundamental flaws in how methods and samples were applied, a general lack of critical examination of both the results and the appropriateness of the methods for the data at hand, and in how results were presented. In particular:

(1) Cell type annotation:

a) The 2-phase cell type annotation process employed for the scRNA-seq sample collection raised concerns. Initially annotated cells are re-labeled after a second round with the same cell types from the initial label pool (Figure 1E). The automatic annotation tools were used without specifying the database and tissue atlases used as a reference, and no information was shown regarding the consensus across these tools.

We believe that most of the reviewer’s criticisms stem from a misunderstanding, and we apologize for not explaining certain aspects of our work more clearly.

The two types of cell type annotation applied were different and served distinct purposes:

• One used general bone marrow/blood reference datasets to annotate blood subtype lineage clusters.

• The other used a CD34-purified, AML-specific reference dataset, which included leukaemia-associated annotations, to identify HSPC subpopulations. We also implemented this at the single-cell level to allow more robust identification of these rare populations in a large dataset.

This is probably not well explained in the methods and figure presentation. We will clearly indicate in the revised manuscript that the different HSPC annotations represent separate analyses and will update the figures to highlight this. We will provide a comprehensive review of the annotation strategies implemented, including the automated tool outputs, which may be useful for the single-cell community.

b) Expression of the CD34 marker is only reported as a selection method for HSPCs, which is not in line with common practice. The use of CD34 alone is acceptable as a surface marker, whereas robust annotation of HSPCs should be based on the expression of gene sets.

We used CD34 expression in conjunction with other cell type annotations and marker sets to identify LSCs, although the results are the same when we use HSPC-annotated cells without conditioning on CD34 expression. In the revised manuscript, we will simplify this analysis to use HSPC clusters as suggested by the reviewer.

c) During several analyses, the cell types used were either not well defined or contradictory, such as in Figure 2D, where it is not clear if pySCENIC and AUC scores were computed on HSPCs alone or merged with CMPs. In other cases, different cell type populations are compared and used interchangeably: comparing the HSPC-derived regulons with bulk (probably not enriched for CD34+ cells) RNA samples could be an issue if there are no valid assumptions on the cell composition of the bulk sample.

As mentioned in the Methods, we only excluded lymphoid cell types from the pySCENIC analysis to overcome the bias that some samples were enriched using CD34 selection when preparing them for scRNA-seq. We will make this clearer in the text and figures of the revised manuscript. It is difficult to overcome this bias when using bulk RNA samples, which may explain why some of our samples do not fit into our defined signature groups. However, as we do not have access to primary samples ourselves, we cannot provide a better matched experimental cohort for validation.

(2) Method selection:

a) The authors should explain why they use pySCENIC and not any other approach. They should briefly explain how pySCENIC works and what they get out in the main text. In addition they should explain the AUCell algorithm and motivate its usage.

pySCENIC is a state-of-the-art method for network inference from scRNA-seq data and is widely used within the single-cell community (over 5000 citations for both versions of the SCENIC pipeline). The pipeline has been benchmarked as one of the top performers for GRN analysis (Nguyen et al, 2021. Briefings in Bioinformatics). AUCell is a module within the pySCENIC pipeline that summarises the activity of a set of genes (a regulon) into a single number, which helps to compare and visualise different regulons. We agree with the reviewer that this could have been more clearly explained within the manuscript. We will update the text in the revised manuscript to add more explanation.
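To illustrate what "summarising the activity of a regulon into a single number" can look like, here is a minimal sketch of an AUCell-style score. It is a simplified stand-in written with NumPy only, not the pySCENIC implementation: the function name `aucell_like_score`, the `top_fraction` parameter, and the toy data are assumptions made for illustration. For each cell, genes are ranked by expression, and the score is the normalised area under the recovery curve of regulon genes within the top-ranked fraction.

```python
import numpy as np


def aucell_like_score(expression, gene_names, regulon_genes, top_fraction=0.05):
    """Simplified AUCell-style activity score (illustrative, not pySCENIC's code).

    expression    : array of shape (n_cells, n_genes)
    gene_names    : sequence of n_genes gene symbols
    regulon_genes : iterable of gene symbols forming the regulon (hypothetical input)
    top_fraction  : fraction of top-ranked genes considered per cell (assumed name)
    Returns one score in [0, 1] per cell.
    """
    expression = np.asarray(expression, dtype=float)
    gene_names = np.asarray(gene_names)
    n_cells, n_genes = expression.shape

    # Number of top-ranked genes over which the recovery curve is evaluated.
    max_rank = max(1, int(round(top_fraction * n_genes)))

    in_regulon = np.isin(gene_names, list(regulon_genes))
    n_regulon = int(in_regulon.sum())
    if n_regulon == 0:
        raise ValueError("None of the regulon genes are present in gene_names.")

    # Rank genes in each cell from highest to lowest expression.
    order = np.argsort(-expression, axis=1, kind="stable")
    hits = in_regulon[order][:, :max_rank]

    # Recovery curve: regulon genes found among the top k genes, k = 1..max_rank.
    recovery = np.cumsum(hits, axis=1)
    auc = recovery.sum(axis=1)

    # Best achievable area: all regulon genes sit at the very top of the ranking.
    k = min(n_regulon, max_rank)
    best = np.cumsum(np.concatenate([np.ones(k), np.zeros(max_rank - k)])).sum()
    return auc / best


# Toy example: two cells, four genes, a regulon made of genes "A" and "B".
genes = ["A", "B", "C", "D"]
toy_expression = [[10, 9, 1, 0],   # regulon genes are the most expressed
                  [0, 1, 9, 10]]   # regulon genes are the least expressed
scores = aucell_like_score(toy_expression, genes, {"A", "B"}, top_fraction=0.5)
```

Because the score depends only on within-cell gene rankings, it is insensitive to per-cell scaling of the expression values, which is one reason a rank-based summary is convenient for comparing regulon activity across samples. Tie handling and thresholding in the real tool may differ from this sketch.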

b) The obtained GRN signatures were not critically challenged on an external dataset. Therefore, the evidence that supports these signatures to be reliable and significant to the investigated setting is weak.

These signatures were inferred from the most suitable AML single-cell RNA datasets available to date, and we used two independent datasets to validate our findings (the TARGET AML bulk RNA sequencing cohort, and the Lambo et al. scRNA-seq dataset). To our knowledge, there are no better suited datasets for validation. Experimental validations on patient samples are beyond the scope of this study.

(3) There are some issues with the analysis & visualization of the data.

We will provide new statistical tests to improve robustness of the analysis as well as presentation and visualization of the data in the revised manuscript.

(4) Discussion:

a) What exactly is the 'regulon signature' that the authors infer? How can it be useful for insights into disease mechanisms?

The 'regulon signature' here refers to a gene regulatory program (multiple gene modules, each defined by a transcription factor and its targets) that is specific to a given age group. Further investigation into this can be useful for understanding why patients of different ages follow a different clinical course. We will add more text on the utility of our discovered 'regulon signature' in the discussion section of the revised manuscript.

b) The authors write 'Together this indicates that EP300 inhibition may be particularly effective in t(8;21) AML, and that BCLAF1 may present a new therapeutic target for t(8;21) AML, particularly in children with inferred pre-natal origin of the driver translocation.' I am missing a critical discussion of what is needed to further test the two targets. Put differently: Would the authors take the risk of a clinical study given the evidence from their analysis?

Of course, many extensive studies would be required before these findings are clinically translatable. We can include some perspectives on what further work is required in terms of further experimental validation and potential subsequent clinical study.
