Robust estimation of cancer and immune cell-type proportions from bulk tumor ATAC-Seq data

Aurélie AG Gabriel; Julien Racle; Maryline Falquet; Camilla Jandus; David Gfeller

doi:10.7554/eLife.94833.1

eLife assessment

This study presents an important computational tool for the quantification of the cellular composition of human tissues profiled with ATAC-seq. The methodology is solid and its application results on breast cancer tumor tissues are convincing. It advances existing methods by utilizing a comprehensive reference profile for major cancer-relevant cell types, compatible with a widely-used cell type deconvolution tool.

https://doi.org/10.7554/eLife.94833.1.sa2

Significance of findings

important: Findings that have theoretical or practical implications beyond a single subfield

landmark
fundamental
important
valuable
useful

Strength of evidence

solid: Methods, data and analyses broadly support the claims with only minor weaknesses

convincing: Appropriate and validated methodology in line with current state-of-the-art

exceptional
compelling
convincing
solid
incomplete
inadequate

During the peer-review process the editor and reviewers write an eLife assessment that summarises the significance of the findings reported in the article (on a scale ranging from landmark to useful) and the strength of the evidence (on a scale ranging from exceptional to inadequate). Learn more about eLife assessments

Abstract

Assay for Transposase-Accessible Chromatin sequencing (ATAC-Seq) is a widely used technique to explore gene regulatory mechanisms. For most ATAC-Seq data from healthy and diseased tissues such as tumors, chromatin accessibility measurement represents a mixed signal from multiple cell types. In this work, we derive reliable chromatin accessibility marker peaks and reference profiles for all major cancer-relevant cell types. We then capitalize on the EPIC deconvolution framework (Racle et al. 2017) previously shown to accurately predict cell-type composition in tumor bulk RNA-Seq data and integrate our markers and reference profiles to EPIC to quantify cell-type heterogeneity in bulk ATAC-Seq data. Our EPIC-ATAC tool accurately predicts non-malignant and malignant cell fractions in tumor samples. When applied to a breast cancer cohort, EPIC-ATAC accurately infers the immune contexture of the main breast cancer subtypes.

Introduction

Gene regulation is a dynamic process largely determined by the physical access of chromatin-binding factors such as transcription factors (TFs) to regulatory regions of the DNA (e.g., enhancers and promoters) (Klemm, Shipony, and Greenleaf 2019). The genome-wide landscape of chromatin accessibility is essential in the control of cellular identity and cell fate and thus varies in different cell types (K. Zhang et al. 2021; Klemm, Shipony, and Greenleaf 2019). Over the last decade, Assay for Transposase-Accessible Chromatin (ATAC-Seq) (Buenrostro et al. 2013) has become a reference epigenomic technique to profile chromatin accessibility and the activity of gene regulatory elements in diverse biological contexts including cancer (Luo, Gribskov, and Wang 2022) and across large cohorts (Corces et al. 2018). Several optimized ATAC-seq protocols have been developed to improve the quality of ATAC-Seq data and expand its usage to different tissue types. These include the OMNI-ATAC protocol, which leads to cleaner signal and is applicable to frozen samples (Corces et al. 2017; Grandi et al. 2022), as well as the formalin-fixed paraffin-embedded (FFPE)-ATAC protocol adapted to FFPE samples. The reasonable cost and technical advantages of these protocols foreshadow an increased usage of ATAC-Seq in cancer studies.

Most biological tissues are composed of multiple cell types. For instance, tumors are complex ecosystems including malignant and stromal cells as well as a large diversity of immune cells. This cellular heterogeneity, in particular the presence of specific immune cell types, impacts tumor progression as well as response to immunotherapy (Fridman et al. 2012; 2017; de Visser and Joyce 2023). Most existing ATAC-Seq data from tumors were performed on bulk samples, thereby including information from both cancer and non-malignant cells. Precisely quantifying the proportions of different cell types in such samples represents therefore a promising way to explore the immune contexture and the composition of the tumor micro-environment (TME) across large cohorts. Carefully assessing cell-type heterogeneity is also important to handle confounding factors in genomic analyses in which samples with different cellular compositions are compared. Recently, single-cell ATAC-Seq (scATAC-Seq) has been developed to explore cellular heterogeneity with high resolution in complex biological systems (Cusanovich et al. 2015; Lareau et al. 2019; Satpathy et al. 2019). However, the resulting data are sensitive to technical noise and such experiments require important resources, which so far limits the use of scATAC-Seq in contrast to bulk sequencing in the context of large cohorts.

In the past decade, computational deconvolution tools have been developed to predict the proportion of diverse cell types from bulk genomic data obtained from tumor samples (Avila Cobos et al. 2018; 2020; Sturm et al. 2019; Racle et al. 2017; Monaco et al. 2019; Newman et al. 2019; H. Li et al. 2020; Finotello et al. 2019; Becht et al. 2016). A large number of these tools model bulk data as a mixture of reference profiles identified in purified cell populations for each cell type. The accuracy of the predictions of cell-type proportions relies on the quality of these reference profiles as well as on the use of cell-type specific markers (Avila Cobos et al. 2018). A limitation of most deconvolution algorithms is that they do not predict the proportion of cell types that are not present in the reference profiles (here referred to as ‘uncharacterized’ cells). In the context of cancer samples, these uncharacterized cell populations include malignant cells whose molecular profiles differ not only from one cancer type to another, but also from one patient to another even within the same tumor type (Corces et al. 2018). A few tools consider uncharacterized cells in their deconvolution framework by using cell-type specific markers not expressed in the uncharacterized cells (Clarke, Seol, and Clarke 2010; Gosink, Petrie, and Tsinoremas 2007; Racle et al. 2017; Finotello et al. 2019). These tools include EPIC (Estimating the Proportion of Immune and Cancer cells) which simultaneously quantifies immune, stromal, vascular as well as uncharacterized cells from bulk tumor samples (Racle et al. 2017; Racle and Gfeller 2020).

Most deconvolution algorithms have been initially developed for transcriptomic data (RNA-Seq data) (Newman et al. 2015; Racle et al. 2017; Finotello et al. 2019; Monaco et al. 2019; Newman et al. 2019; T. Li et al. 2020; Jimenez-Sanchez, Cast, and Miller 2019; Gong and Szustakowski 2013). More recently they have been adapted for other omics layers such as methylation (Chakravarthy et al. 2018; Teschendorff et al. 2020; Arneson, Yang, and Wang 2020; H. Zhang et al. 2021) and proteomics (Feng et al. 2023) or chromatin accessibility. For the latter, a specific framework called DeconPeaker (H. Li et al. 2020) was developed to estimate cell-type proportions from bulk samples. Deconvolution tools developed initially for other omics modalities, such as RNA-Seq, can also be applied on ATAC-Seq if appropriate ATAC-Seq profiles are provided to the tool. For example, the popular deconvolution tool, CIBERSORT (Newman et al. 2015), was used to deconvolve leukemic ATAC-Seq samples (Corces et al. 2016). Other methods have been proposed to decompose ATAC-Seq bulk profiles into subpopulation-specific profiles (Zeng et al. 2019; Burdziak et al. 2019) or compartments (Peng et al. 2019). However, these methods have more requisites: (i) the integration of the ATAC-Seq data with single-cell or bulk RNA-Seq (Zeng et al. 2019; Burdziak et al. 2019) and HIChIP data (Zeng et al. 2019) or, (ii) subsequent feature annotation to associate compartments with cell types or biological processes (Peng et al. 2019).

The application of existing bulk ATAC-Seq data deconvolution tools to solid tumors is limited. First, current computational frameworks do not quantify populations of uncharacterized cell types. Second, ATAC-Seq based markers (i.e., chromatin accessible regions called peaks) and reference profiles generated so far have been derived in the context of hematopoietic cell mixtures (Corces et al. 2016; H. Li et al. 2020). Markers and profiles for major populations of the TME (e.g., stromal and vascular cells) are thus missing. While cell-type specific markers have been identified from scATAC-Seq data (K. Zhang et al. 2021), not all TME-relevant cell types are covered (e.g., lack of scATAC-Seq data from neutrophils due to extracellular traps formation). Also, these markers have not been curated to fulfill the requirements of tools such as EPIC to quantify uncharacterized cells (i.e., markers of a cell-type should not be accessible in other human tissues).

In this study, we collected ATAC-Seq data from pure cell types to identify cell-type specific marker peaks and to build reference profiles from most major non-malignant cell types typically observed in tumors. These data were integrated in the EPIC (Racle et al. 2017) framework to perform bulk ATAC-Seq samples deconvolution (Figure 1). Applied on peripheral blood mononuclear cells (PBMCs) and tumor samples, the EPIC-ATAC framework showed accurate predictions of the proportions of non-malignant and malignant cells with similar or higher performances than other existing tools.

Graphical description of the identification of cell-type specific marker peaks and reference ATAC-Seq profiles included in the EPIC-ATAC framework. 1) 564 pure ATAC-Seq data of sorted cells were collected to build reference profiles for cancer-relevant cell populations. 2) Cell-type specific marker peaks were identified using differential accessibility analysis. 3) Markers with previously observed chromatin accessibility in human healthy tissues were then excluded. 4) For tumor bulk deconvolution, the set of remaining marker peaks was refined by selecting markers with correlated behavior in tumor bulk samples. 5) The cell-type specific marker peaks and reference profiles were finally integrated in the EPIC-ATAC framework to perform bulk ATAC-Seq deconvolution. Parts of this figure were created with BioRender.com.
© 2024, BioRender Inc. Any parts of this image created with BioRender are not made available under the same license as the Reviewed Preprint, and are © 2024, BioRender Inc.

Results

ATAC-Seq data from sorted cell populations reveal cell-type specific marker peaks and reference profiles

A key determinant for accurate predictions of cell-type proportions by most deconvolution tools is the availability of reliable cell-type specific markers and reference profiles. To identify robust chromatin accessibility marker peaks of cancer relevant cell types, we collected 564 samples of sorted cell populations from twelve studies including eight immune cell types (B cells (Calderon et al. 2019; Corces et al. 2016; P. Zhang et al. 2022), CD4+ T cells (Corces et al. 2016; Liu et al. 2020; P. Zhang et al. 2022; Mumbach et al. 2017; Giles et al. 2022), CD8+ T cells (Calderon et al. 2019; Corces et al. 2016; Liu et al. 2020; P. Zhang et al. 2022; Giles et al. 2022), natural killer (NK) cells (Calderon et al. 2019; Corces et al. 2016), dendritic cells (DCs) (Calderon et al. 2019; Leylek et al. 2020; Liu et al. 2020), macrophages (Liu et al. 2020; P. Zhang et al. 2022), monocytes (Calderon et al. 2019; Corces et al. 2016; Leylek et al. 2020; P. Zhang et al. 2022; Trizzino et al. 2021) and neutrophils (Ram-Mohan et al. 2021; Perez et al. 2020), as well as fibroblasts (Ge et al. 2021; Liu et al. 2020) and endothelial (Liu et al. 2020; Xin et al. 2020) cells (Figure 1 box 1, Figure 2A, Supplementary Table 1). To limit batch effects, the collected samples were homogeneously processed from read alignment to peak calling. For each cell type, we derived a set of stable peaks, i.e., peaks observed across samples and studies (see Materials and Methods).

These peaks were then used to perform pairwise differential analysis to identify marker peaks for each cell type (Figure 1, box 2). To ensure that the cell-type specific marker peaks are not accessible in other human tissues, we included in the differential analysis ATAC-Seq samples from diverse human tissues from the ENCODE data (The ENCODE Project Consortium et al. 2020; Rozowsky et al. 2023) (Supplementary Figure 1). To select a sufficient number of peaks prior to peak filtering, the top 200 peaks recurrently differentially accessible across all cell-type pairs were selected as cell-type specific markers (see Materials and Methods). Using the human atlas study (K. Zhang et al. 2021), markers with potential residual accessibility in human tissues were then filtered out (Figure 1, box 3, see Materials and Methods). The resulting marker peaks specific to the immune cell types were considered for the deconvolution of PBMC samples (PBMC markers). For tumor bulk sample deconvolution, the list of markers was further refined based on the correlation patterns of the markers in tumor bulk samples from diverse cancer types from The Cancer Genome Atlas (TCGA) (Corces et al. 2018) (Figure 1, box 4, see the Material and methods). The latter filtering ensures the relevance of the markers in the TME context since cell-type specific TME markers are expected to be correlated in tumor bulk ATAC-Seq measurements (Qiu et al. 2021). 716 markers of immune, fibroblasts and endothelial cell types remained after the later filtering and were considered for the deconvolution of bulk tumor samples (TME markers).

To assess the quality and reproducibility of these markers, we performed principal component analysis (PCA) based on each set of marker peaks. Computing silhouette coefficients based on the cell-type classification and on the study of origin showed that samples clustered by cell type and not by study of origin (averaged silhouette coefficients above 0.45 for cell type and around 0 for study of origin). Two-dimensional UMAP representations of the samples confirmed this observation (Figure 2B). These results indicate limited remaining batch effects after data processing and marker selection.

We then used the collected samples to generate chromatin accessibility profiles by computing the average of the normalized counts for each peak in each cell type as well as peak variability in each cell type (Racle et al. 2017) (see Material and methods). Figure 2C represents the average chromatin accessibility of each marker peak in each cell type of the reference dataset and highlights, as expected, the cell-type specificity of the selected markers (see also Supplementary Tables 2 and 3), which was confirmed in independent ATAC-Seq data from sorted cells and single-cell ATAC-Seq samples from blood and diverse human tissues (Figure 2D and 2E, see Materials and methods).

Annotations of the marker peaks highlight their biological relevance

To characterize the different marker peaks, we annotated them using ChiPSeeker (Yu, Wang, and He 2015). We observed that most of the markers are in distal and intergenic regions (Figure 2F), which is expected considering the large proportion of distal regions in the human genome and the fact that such regions have been previously described as highly cell-type specific (Corces et al. 2016). We also noticed that 7% of the PBMC and TME marker peaks are in promoter regions in contrast to 4% when considering matched genomic regions randomly selected in the set of peaks identified prior to the differential analysis (see Material and methods), which suggest enrichment in our marker peaks for important regulatory regions.

To assess the biological relevance of the marker peaks, we associated each marker peak to its nearest gene using ChIP-Enrich based on the “nearest transcription start site (TSS)” locus definition (Welch et al. 2014) (Supplementary Tables 4 and 5). Nearest genes reported as known marker genes in public databases of gene markers (i.e., PanglaoDB (Franzén, Gan, and Björkegren 2019) and CellMarker (Hu et al. 2023)) are listed in Table 1.

List of nearest genes and enriched CBPs reported in the PanglaoDB or CellMarker databases.

In each set of cell-type specific peaks, we observed an overrepresentation of chromatin binding proteins (CBPs) reported in the JASPAR2022 database (Castro-Mondragon et al. 2022) (using Signac (Stuart et al. 2021) and MonaLisa (Machlab et al. 2022) for assessing the overrepresentation) and the ReMap catalog (Hammal et al. 2022) (using RemapEnrich, see Material and Methods). Overrepresented CBPs also reported as known marker genes in the PanglaoDB and CellMarker databases are listed in Table 1. Detailed peaks annotations are summarized in Supplementary Tables 4 and 5.

Based on the “nearest TSS” annotation, we tested, using ChIP-Enrich (Welch et al. 2014), whether each set of cell-type specific marker peaks was enriched for regions linked to specific biological pathways (GO pathways). Figure 2G highlights a subset of the enriched pathways that are consistent with prior knowledge on each cell type. Some of these pathways are known to be characteristic of immune responses to inflammatory or tumoral environments. The complete list of enriched pathways is listed in the Supplementary Tables 6 and 7. Overall, these analyses demonstrate that the proposed cell-type specific marker peaks capture some of the known biological properties associated to each cell type.

EPIC-ATAC accurately estimates immune cell fractions in PBMC ATAC-Seq samples

The cell-type specific marker peaks and profiles derived from the reference samples were integrated to the EPIC deconvolution tool (Racle et al. 2017; Racle and Gfeller 2020). We will refer to this ATAC-Seq deconvolution framework as EPIC-ATAC. To test the accuracy of EPIC-ATAC predictions, we first collected PBMCs from five healthy donors. In each donor, half of the cells was used to generate a bulk ATAC-Seq dataset and the other half was used to determine the cellular composition of each sample, i.e., the proportions of monocytes, B cells, CD4+ T cells, CD8+ T cells, NK cells and dendritic cells, by multiparametric flow cytometry (Figure 3A, see Materials and methods). We then applied EPIC-ATAC to the bulk ATAC-Seq data. The predicted cell fractions are consistent with the cell fractions obtained by flow cytometry (Figure 3B, Pearson correlation coefficient of 0.78 and root mean squared error (RMSE) of 0.10).

As a second validation, we applied EPIC-ATAC to pseudo-bulk PBMC samples (referred to as the PBMC pseudobulk dataset, generated using three publicly available PBMC scATAC-Seq datasets (Satpathy et al. 2019; Granja et al. 2019; 10x Genomics 2021), see Material and methods). A high correlation (0.91) between EPIC-ATAC predictions and true cell-type proportions and a low RMSE (0.05) were observed for this dataset (Figure 3C).

The accuracy of the predictions obtained with EPIC-ATAC was then compared with the accuracy of other deconvolution approaches which could be used with our reference profiles and marker peaks (Figure 3D-E). To this end, we considered both the DeconPeaker method (H. Li et al. 2020) originally developed for bulk ATAC-Seq as well as several algorithms developed for bulk RNA-Seq (CIBERSORTx (Newman et al. 2019), QuanTIseq (Finotello et al. 2019), ABIS (Monaco et al. 2019), and MCPcounter (Becht et al. 2016)). To enable meaningful comparison across the cell types considered in this work and use the method initially developed for bulk RNA-Seq deconvolution, the marker peaks and profiles derived in this work were used in each of these methods. DeconPeaker and CIBERSORTx include the option to define cell-type specific markers and profiles from a set of reference samples. We thus fed our ATAC-Seq samples collection to both algorithms and used the resulting profiles and marker peaks to perform bulk ATAC-Seq deconvolution. The resulting predictions are referred to as DeconPeaker-Custom and CIBERSORTx-Custom.

Many tools displayed high correlation and low RMSE values, similar to those of EPIC-ATAC, and no single tool consistently outperformed the others (Figure 3D-E, Supplementary Figure 2A-C). The fact that our marker peaks and reference profiles could be used with EPIC-ATAC and other existing tools demonstrates their broad applicability.

Predictions accuracies were also evaluated in each cell type separately. Since the number of samples was low in each dataset, samples from both datasets were combined for this analysis. EPIC-ATAC demonstrated good accuracies across cell types with RMSE values ranging from 0.02 for B cells to 0.13 for NK cells (Supplementary Figure 3). As expected, predictions with all tools were more accurate for frequent cell types with well-characterized markers (e.g., CD8/CD4 T cells, B cells) compared to less frequent cell types (e.g., NK cells, dendritic cells) (Supplementary Figure 2 and 3). Note that MCPcounter is a marker-based method that derives cell-type specific scores which cannot be compared between cell types. This method was thus only included in the benchmark considering each cell type separately.

EPIC-ATAC accurately predicts fractions of cancer and non-malignant cells in tumor samples

We evaluated the ability of the EPIC-ATAC framework to predict not only immune and stromal cells proportions but also the proportion of cells for which reference profiles are not available (i.e., uncharacterized cells). For this purpose, we considered two previously published scATAC-Seq datasets containing basal cell carcinoma and gynecological cancer samples (Satpathy et al. 2019; Regner et al. 2021). We generated two pseudobulk datasets by averaging the chromatin accessibility signal across all cells of each sample (see Material and methods). Applying EPIC-ATAC to both datasets shows that this framework is able to simultaneously predict the proportions of both uncharacterized cells and immune, stromal and vascular cells (Figure 4A). In these cancer samples, the proportion of uncharacterized cells can be seen as a proxy of the proportion of cancer cells.

As for the PBMC datasets, we compared EPIC-ATAC performances to other existing deconvolution tools. For both datasets, EPIC-ATAC led to the highest performances and was the only method to accurately predict the proportion of uncharacterized cells (Figure 4B, Supplementary Figure 4 and 5). Although quanTIseq also allows users to perform such predictions, the method resulted in lower correlation and higher RMSE values when comparing the estimated and true proportions of the uncharacterized cells (Figure 4B, Supplementary Figure 4).

In the EPIC-ATAC and quanTIseq frameworks, predictions correspond to absolute cell-type fraction, i.e., proportions of all cells present in the bulk, while the estimations obtained from the other tools correspond to relative cell fractions, i.e., proportions of cells present in the reference profiles (CIBERSORTx, DeconPeaker) or to scores with arbitrary units (ABIS, MCPcounter). We thus conducted a second benchmark excluding the predictions of uncharacterized cell fractions and rescaling both estimations and true proportions to sum to 1 (see Material and methods). EPIC-ATAC outperformed most of the other methods also when excluding the uncharacterized cells (Figure 4C, Supplementary Figure 4 and 5). Supplementary Figure 6 reports the performances of each tool when considering each cell type separately. Overall, EPIC-ATAC showed comparable or higher correlation and lower RMSE values when compared to the other deconvolution tools.

T cell subtypes quantification reveals the ATAC-Seq deconvolution limits for closely related cell types

To explore the limitations of ATAC-Seq deconvolution, we next evaluated whether EPIC-ATAC could predict the proportions of T-cell subtypes. To this end, we considered naive and non-naive CD8+ as well as naïve, helper/memory and T regulatory CD4+ T cells. We redefined our list of cell-type specific marker peaks and reference profiles including also these five T-cell subtypes (Supplementary Tables 8-9, Supplementary Figure 7A) and observed that the markers were conserved in external data (Supplementary Figure 7B). The annotations of the markers associated to the T-cell subtypes are available in Supplementary Tables 10-13.

We capitalized on the more detailed cell-type annotation of the PBMC datasets as well as the basal cell carcinoma dataset to evaluate the EPIC-ATAC prediction of cell-subtype fractions using these updated markers and profiles. Overall, the correlations observed between the predictions and true proportions of T cells decreased when considering T-cell subtypes rather than CD4+ and CD8+ cell types only (Figure 5A). In particular, low accuracies were obtained for helper/memory CD4+ and naïve T-cell subtypes (Figure 5B). Similar results were obtained using other deconvolution tools (Supplementary Figure 8).

T cell subtypes quantification reveals the ATAC-Seq deconvolution limits for closely related cell types.
A) Comparison of the proportions estimated by EPIC-ATAC and the true proportions for PBMC samples (PBMC experiment and PBMC pseudobulk samples combined) (top) and the basal cell carcinoma pseudobulks (bottom). Predictions of the proportions of CD4+ and CD8+ T-cells were obtained using the reference profiles based on the major cell types and subtype predictions using the reference profiles including the T-cell subtypes. B) Pearson’s correlation values obtained by EPIC-ATAC in each cell type.

EPIC-ATAC accurately infers the immune contexture in a bulk ATAC-Seq breast cancer cohort

We applied EPIC-ATAC to a breast cancer cohort of 42 breast ATAC-Seq samples including samples from two breast cancer subtypes, i.e., 35 oestrogen receptor (ER)-positive human epidermal growth factor receptor 2 (HER2)-negative (ER+/HER2-) samples and 7 triple negative (TN) tumors (Kumegawa et al. 2023). No cell sorting was performed in parallel to the chromatin accessibility sequencing. We thus used EPIC-ATAC to estimate cell-type proportions. We observed a higher proportion of T cells, B cells, NK cells and macrophages in the TN samples in comparison to ER+/HER2-samples (Figure 6A).

We then compared the cellular composition of ER+/HER2-subgroups identified in the original study (clusters CA-A, CA-B and CA-C). A higher infiltration of T and B cells was observed in cluster CA-C and higher proportions of endothelial cells and fibroblasts were observed in cluster CA-B (Figure 6B). These predictions are consistent with the infiltration level estimations reported in the original publication, although no differences in macrophages infiltration was observed between the ER+/HER2-subgroups in our case (Kumegawa et al. 2023).

EPIC-ATAC performs similarly to EPIC RNA-seq based deconvolution and better than gene activity based deconvolution

We finally compared the accuracy of EPIC when applied on ATAC-Seq data and on RNA-Seq data. For this purpose, we used the 10X multiome PBMC dataset (10x Genomics 2021) which provides for each cell both its chromatin accessibility profile and its gene expression profile and simulated 100 pseudobulks with diverse cellular compositions (see Material and methods). We used EPIC-ATAC to perform ATAC-Seq based deconvolution on the chromatin accessibility levels of the peaks and the original EPIC tool to perform standard RNA-seq deconvolution on the gene expression levels. ATAC-Seq peaks can also be aggregated, based on peak distances to each gene, into gene activity (GA) variables as proxy for gene expression. We thus applied the GA transformation to the 10x multiome PBMC dataset and performed GA-based RNA deconvolution using the original EPIC tool (See Material and methods). Figure 7 shows that EPIC-ATAC performs similarly to the EPIC RNA-seq based deconvolution and outperforms the GA-based RNA deconvolution. The lower performances of GA based RNA deconvolution could be explained by the fact that GA features, by construction, do not perfectly match the transcriptomic data.

Discussion

Bulk chromatin accessibility profiling of biological tissues like tumors represents a reliable and affordable technology to map the activity of gene regulatory elements across multiple samples in different conditions. Here, we collected ATAC-Seq data from pure cell populations covering major immune and non-immune cancer-relevant cell types from diseased, stimulated and healthy samples. This enabled us to identify reliable cell-type specific marker peaks and chromatin accessibility profiles for both PBMC and solid tumor sample deconvolution. We integrated these data in the EPIC deconvolution framework to accurately predict the fraction of both malignant and non-malignant cell types from bulk tumor ATAC-Seq samples.

In cases where specific cell types are expected in a sample but are not part of our list of reference profiles (e.g., neuronal cells in brain tumors), custom marker peaks and reference profiles can be provided to EPIC-ATAC to perform cell-type deconvolution and we provide the code to generate such markers and profiles based on ATAC-Seq data from sorted cells, following the approach developed in this work (Figure 1, see Code availability).

Solid tumors contain large and heterogeneous fractions of cancer cells for which it is challenging to build reference profiles. To our knowledge this work provides the first benchmark of deconvolution tools adapted to ATAC-Seq data in the context of solid tumor samples. We show that the EPIC-ATAC framework, in contrast to other existing tools, allows users to accurately predict the proportion of cells not included in the reference profiles (Figure 4 and Supplementary Figure 4). These uncharacterized cells can include cancer cells but also other non-malignant cells. Since the major cell types composing TMEs were included in our reference profiles, the proportion of uncharacterized cells approximates the proportion of the cancer cells in most cases.

The pseudobulk approach provides unique opportunities to design benchmarks with known cell-type proportions but also comes with some limitations. Indeed, pseudobulks are generated from single-cell data which are noisy and whose cell-type annotation is challenging in particular for closely related cell types. These limitations might lead to chromatin accessibility profiles that deviates from true bulk data and errors in the true cell-type proportions. For this reason, we anticipate that the newly generated benchmarking PBMCs dataset with ground truth cell proportions obtained by flow cytometry will nicely complement pseudobulk from scATAC-Seq data in future benchmarks of ATAC-Seq deconvolution. The qualitative evaluation of our method on true bulk ATAC-Seq samples from breast cancer patients and the observation of similar immune compositions in TN and ER+/HER2-samples as the ones identified in the original paper (Figure 6) further support the accuracy of EPIC-ATAC to deconvolve bulk ATAC-Seq data, without requiring additional scATAC-Seq data which are not always available for all cancer types.

Overall the evaluation of the EPIC-ATAC deconvolution resulted in an average absolute error of 7% across cell types. This number is consistent with previous observations in RNA-Seq data deconvolution (Racle et al. 2017). Considering this uncertainty, the quantification of low frequency populations remains challenging (Jin and Liu 2021). While the estimated proportions of these populations by EPIC-ATAC are low (e.g., dendritic cells), comparing such estimations across samples should be performed with care due to the uncertainty of the predictions.

Another limitation of cell-type deconvolution is often reached when closely related cell types are considered. In the reference-based methods used in this study, this limit was reached when considering T-cell subtypes in the reference profiles (Figure 5 and Supplementary figure 8). We thus recommend to use the EPIC-ATAC framework using the markers and reference profiles based on the major cell populations. We additionally provide the marker peaks of the T-cell subtypes which could be used to build cell-type specific chromatin accessibility signatures or perform “peak set enrichment analysis” similarly to gene set enrichment analysis (GSEA, (Subramanian et al. 2005)). Such application could be useful for the annotation of scATAC-Seq data, which often relies on matched RNA-Seq data and for which there is a lack of markers at the peak level (Jiang et al. 2023).

Another possible application of our marker peaks relies on their annotation (Figure 2G, Supplementary Tables 4-5), which could be used to expand the list of genes and CBPs associated to each cell type or subtype. For example, the neutrophils marker peaks were enriched for motifs of TFs such as SPI1 (Supplementary Table 4), which was not listed in the neutrophil genes in the databases used for annotation but has been reported in previous studies as involved in neutrophils development (Watt et al. 2021). The annotations related to the set of major cell types and T-cell subtypes are provided in Supplementary Tables 4-5 and 10-11. Finally, the annotation of marker peaks highlighted pathways involved in immune responses to tumoral environments (Figure 2G). Examples of these pathways are the toll-like receptor signaling pathway involved in pathogen-associated and recognition of damage-associated molecular patterns in diverse cell types including B and T cells (Geng et al. 2010; Javaid and Choi 2020), glucan metabolic processes which are known to be related to trained immunity which can lead to anti-tumor phenotype in neutrophils (Kalafati et al. 2020) or the Fc-receptor signaling observed in NK cells (Sanseviero 2019; Bonnema et al. 1994). These observations suggest that our marker peaks contain regulatory regions not only specific to cell types but also adapted to the biological context of solid tumors.

Conclusion

In this work, we identified biologically relevant cell-type specific chromatin accessibility markers and profiles for all major cancer-relevant cell types. We capitalized on these markers and profiles to predict cell-type proportions from bulk PBMC and solid tumor ATAC-Seq data (https://github.com/GfellerLab/EPIC-ATAC). Evaluated on diverse tissues, EPIC-ATAC shows reliable predictions of immune, stromal, vascular and cancer cell proportions. With the expected increase of ATAC-Seq studies in cancer, the EPIC-ATAC framework will enable researchers to deconvolve bulk ATAC-Seq data from tumor samples to support the analysis of regulatory processes underlying tumor development, and correlate the TME composition with clinical variables.

Materials and methods

Generation of an ATAC-Seq reference dataset of cancer relevant cell types

Pre-processing of the sorted ATAC-Seq datasets

We collected pure ATAC-Seq samples from 12 studies. The data include samples from (i) ten major immune, stromal and vascular cell types (B (Calderon et al. 2019; Corces et al. 2016; P. Zhang et al. 2022), CD4+ (Corces et al. 2016; Liu et al. 2020; P. Zhang et al. 2022; Mumbach et al. 2017; Giles et al. 2022), CD8+ (Calderon et al. 2019; Corces et al. 2016; Liu et al. 2020; P. Zhang et al. 2022; Giles et al. 2022), natural killer (NK) (Calderon et al. 2019; Corces et al. 2016), dendritic (DCs) cells (Calderon et al. 2019; Leylek et al. 2020; Liu et al. 2020), macrophages (Liu et al. 2020; P. Zhang et al. 2022), monocytes (Calderon et al. 2019; Corces et al. 2016; Leylek et al. 2020; P. Zhang et al. 2022; Trizzino et al. 2021) and neutrophils (Ram-Mohan et al. 2021; Perez et al. 2020) as well as fibroblasts (Ge et al. 2021; Liu et al. 2020) and endothelial (Liu et al. 2020; Xin et al. 2020) cells (See Figure 2A), and (ii) eight tissues from distinct organs (i.e bladder, breast, colon, liver, lung, ovary, pancreas and thyroid) from the ENCODE data (The ENCODE Project Consortium et al. 2020; Rozowsky et al. 2023). The list of the samples and their associated metadata (including cell types and accession number of the study of origin) is provided in Supplementary Table 1. To limit batch effects, the samples were reprocessed homogeneously from the raw data (fastq files) processing to the peak calling. For that purpose, raw fastq files were collected from GEO using the SRA toolkit and the PEPATAC framework (Smith et al. 2021) was used to process the raw fastq files based on the following tool: trimmomatic for adapter trimming, bowtie2 (with the PEPATAC default parameters) for reads pre-alignment on human repeats and human mitochondrial reference genome, bowtie2 (with the default PEPATAC parameters: --very-sensitive-X 2000) for alignment on the human genome (hg38), samtools (PEPATAC default parameters: -q 10) for duplicates removal and MACS2 (Y. Zhang et al. 2008) (PEPATAC default parameters: --shift-75--extsize 150 --nomodel --call-summits --nolambda --keep-dup all -p 0.01) for peak calling in each sample. After alignment, reads mapping on chromosome M were excluded. TSS enrichment scores were computed for each sample and used to filter out samples with low quality (criteria of exclusion: TSS score < 5) (See Supplementary Table 1 containing the TSS score of each sample). 789 samples (including 564 from our ten reference cell-types) had a TSS score > 5.

Generation of a consensus set of peaks

Peak calling was performed in each sample individually. Peaks were then iteratively collapsed to generate a set of reproducible peaks. For each cell type, peaks collapse was performed adapting the iterative overlap peak merging approach proposed in the PEPATAC framework. A first peaks collapse was performed at the level of each study of origin, i.e., if peaks identified in distinct samples overlapped (minimum overlap of 1bp between peaks), only the peak with the highest peak calling score was kept. Also, only peaks detected in at least half of the samples of each study were considered for the next step. If a study had only two samples, only peaks detected in both samples were considered. After this first selection, a second round of peaks collapse was performed at the cell-type level to limit batch effects in downstream analyses. For each cell type, only peaks detected in all the studies of origin were considered. The final list of peaks was then generated by merging each set of reproducible peaks. Peaks located on chromosome Y were excluded from the rest of the analyses. ATAC-Seq counts were retrieved for each sample and each peak using featureCounts (Liao, Smyth, and Shi 2014).

Identification of cell-type specific markers

Differential accessibility analysis

To identify cell-type specific markers, we split the samples collection in ten folds (created with the create_folds function from the R package splitTools (Mayer 2023)). For each fold, we performed pairwise differential accessibility analysis across the ten cell types considered in the reference samples as well as the ENCODE samples from diverse organs. The differential analysis was performed using limma ((Ritchie et al. 2015), version 3.56.2). Effective library sizes were computed using the method of trimmed mean of M-values (TMM) from the edgeR package in R ((Robinson, McCarthy, and Smyth 2010), version 3.42.4). Due to differences of library size across all samples collected, we used voom from the limma package (Law et al. 2014) to transform the data and model the mean-variance relationship. Finally, a linear model was fitted to the data to assess the differential accessibility of each peak across each pair of cell types. To identify our marker peaks, all peaks with log2 fold change higher than 0.2 were selected and ranked by their maximum adjusted p-value across all pairwise comparisons. The top 200 features (with the lowest maximum adjusted p-value) were considered as cell-type specific marker peaks. The marker peaks identified in at least three folds were considered in the final list of marker peaks.

Marker peaks filtering

Modules of open chromatin regions accessible in all (universal modules) or in specific human tissues have been identified in the study Zhang et al. (K. Zhang et al. 2021). These regions were used to refine the set of marker peaks and exclude peaks with residual accessibility in other cell types than those considered for deconvolution. More precisely, for immune, endothelial and fibroblasts specific peaks, we filtered out the peaks overlapping the universal modules as well as the tissue specific modules except the immune (modules 8 to 25), endothelial (modules 26 to 35) and stromal related modules (modules 41 to 49 and 139-150) respectively. As a second filtering step, we retained markers exhibiting the highest correlation patterns in tumor bulk samples from different cancer types, i.e., The Cancer Genome Atlas (TCGA) samples (Corces et al. 2018). We used the Cancer Genomics Cloud (CGC) (Lau et al. 2017) to retrieve the ATAC-Seq counts for each marker peaks in each TCGA sample (using featureCounts). For each set of cell-type specific peaks, we identified the most correlated peaks using the findCorrelation function of the caret R package ((Kuhn 2008), version 6.0-94) with a correlation cutoff value corresponding to the 90^th percentile of pairwise Pearson correlation values.

Evaluation of the study of origin batch effect

To identify potential batch effect issues, we run principal component analysis (PCA) based on the cell-type specific peaks after normalizing ATAC-Seq counts using full quantile normalization (FQ-FQ) implemented in the EDASeq R package (Risso et al. 2011) to correct for depth and GC biases. These data were used to visualize the data in two-dimensional space running Uniform Manifold Approximation (UMAP) based on the PBMC and TME markers (Figure 2B). We also run PCA and used the ten first principal components to evaluate distances between samples and compute silhouette coefficients based on the cell type and study of origin classifications.

Building the reference profiles

It has been previously demonstrated in the context of RNA-Seq based deconvolution approaches (Racle et al. 2017; Sturm et al. 2019) that the transcripts per million (TPM) transformation is appropriate to estimate cell fractions from bulk mixtures. We thus normalized the ATAC-Seq counts of the reference samples using a TPM-like transformation, i.e., dividing counts by peak length, correcting samples counts for depth and rescaling counts so that the counts of each sample sum to 10⁶. We then computed for each peak the median of the TPM-like counts across all samples from each cell type to build the reference profiles of the ten cell types considered in the EPIC-ATAC framework (Figure 2C). In the EPIC algorithm, weights reflecting the variability of each feature of the reference profile can be considered in the constrained least square optimization. We thus also computed the inter-quartile range of the TPM-like counts for each feature in each cell type. Two ATAC-Seq reference profiles are available in the EPIC-ATAC framework: (i) a reference profile containing profiles for B cells, CD4+ T cells, CD8+ T cells, NK, monocytes, dendritic cells and neutrophils to deconvolve PBMC samples, and (ii) a reference profile containing profiles for B cells, CD4+ T cells, CD8+ T cells, NK, dendritic cells, macrophages, neutrophils, fibroblasts and endothelial cells to deconvolve tumor samples. The reference profiles are available in the EPICATAC R package and the reference profiles restricted to our cell-type specific marker peaks are available in the Supplementary Tables 2 and 3.

Assessing the reproducibility of the marker peaks signal in independent samples

We evaluated the chromatin accessibility level of the marker peaks in samples that were not included in the peak calling step. Firstly, we considered samples from two independent studies (Ucar et al. 2017; Carvalho et al. 2021) providing pure ATAC-Seq data for five immune cell types (i.e., B, CD4+ T cells, CD8+ T cells, Monocytes, Macrophages) (Figure 2D). To consider the other cell types, samples that were excluded from the reference dataset due to a low TSS enrichment score were also considered in this validation dataset (Supplementary Table 1). Secondly, we collected the data from a single-cell atlas chromatin accessibility from human tissues and considered the cell types included in our reference data (K. Zhang et al. 2021) (Figure 2E). We used the cell-type annotations provided in the original study (GEO accession number: GSE184462). The Signac R package ((Stuart et al. 2021), 1.9.0) was used to extract fragments counts for each cell and each marker peak and the ATAC-Seq signal of each marker peak was averaged across all cells of each cell type.

Annotation of the marker peaks

The cell-type specific markers were annotated using ChIPseeker R package ((Yu, Wang, and He 2015), version 1.34.1) and the annotation from TxDb.Hsapiens.UCSC.hg38.knownGene in R to identify the regions in which the marker peaks are (i.e., promoter, intronic regions, etc.) and ChipEnrich to associate each peak to the nearest gene TSS (Welch et al. 2014). The nearest genes identified were then compared to cell-type marker genes listed in the PanglaoDB (Franzén, Gan, and Björkegren 2019) and CellMarker databases (Hu et al. 2023). PanglaoDB provides an online interface to explore a large collection on single-cell RNA-Seq data as well as a community-curated list of cell-type marker genes. CellMarker is a database providing a large set of curated cell-type markers for more than 400 cell types in human tissues retrieved from a large collection of single-cell studies and flow cytometry, immunostaining or experimental studies. ChipEnrich was also used to perform gene set enrichment and identify for each set of cell-type specific peaks potential biological pathways regulated by the marker peaks. The enrichment analysis was performed using the chipenrich function (genesets = “GOBP”, locusdef = “nearest_tss”) from the chipenrich R package (v2.22.0). Chromatin accessibility peaks can also be annotated for chromatin binding proteins (CBPs) such as transcription factors (TFs), whose potential binding in the peak region is reported in databases. In our study we chose the JASPAR2022 (Castro-Mondragon et al. 2022) database and the ReMap database (Hammal et al. 2022).

Using the JASPAR2022 database, we assessed, for each cell type, whether the cell-type specific marker peaks were enriched in specific TFs motifs using two TFs enrichment analysis frameworks: Signac (Stuart et al. 2021) and MonaLisa (Machlab et al. 2022). For the MonaLisa analysis, the cell-types specific markers peaks were categorized in bins of sequences, one bin per cell type (use of the calcBinnedMotifEnrR function). To test for an enrichment of motifs, the sequences of each bin were compared to a set of background peaks with similar average size and GC composition obtained by randomly sampling regions in all the peaks identified from the reference dataset. The enrichment test was based on a binomial test. For the Signac analysis, we used the FindMotif function to identify over-represented TF motifs in each set of cell-type specific marker peaks (query). This function used a hypergeometric test to compare the number of query peaks containing the motif with the total number of peaks containing the motif in the background regions (matched to have similar GC content, region length and dinucleotide frequencies as the query regions), corresponding in our case to the peaks called in the reference dataset.

The ReMap database associates chromatin binding proteins (CBPs), including TFs, transcriptional coactivators and chromatin-remodeling factors, to their DNA binding regions based on DNA-binding experiments such as chromatin immunoprecipitation followed by sequencing (ChIP-seq). For each association of a CBP to its binding region, the cell type in which the binding has been observed is reported in the ReMap database (biotype). We used the ReMapEnrich R package (version 0.99) to test if the cell-type specific marker peaks are significantly enriched in CBPs-binding regions listed in the Remap 2022 catalog. We considered the non-redundant peaks catalog from Remap 2022, containing non-redundant binding regions for each CBP in each biotype. Similarly to the previously mentioned enrichment methods, we chose the consensus peaks called in the reference samples as universe for the enrichment test. Note that, for each cell type, an enrichment was retained only if the biotype in which the CBP-regions were identified matched the correct cell-type.

Running EPIC-ATAC on bulk ATAC-Seq data

The samples used to generate the reference profiles were aligned using the hg38 reference genome. To assure the compatibility of any input bulk ATAC-Seq dataset with the EPIC-ATAC marker peaks and reference profiles, we provide an option to lift over hg19 datasets to hg38 (use of the liftOver R package). Subsequently, the features of the input bulk matrix are matched to our reference profiles features. To match both sets of features, we determine for each peak of the input bulk matrix the distance to the nearest peak in the reference profiles peaks. Overlapping regions are retained and the feature IDs are matched to their associated nearest peaks. If multiple features are matched to the same reference peak, the counts are summed. In RNA-Seq based deconvolution, EPIC uses an estimation of the amount of mRNA in each reference cell type to derive cell proportions. For the ATAC-Seq based deconvolution these values were set to 1 to give similar weights to all cell-types quantifications.

Datasets used for the evaluation of ATAC-Seq deconvolution

PBMCs ATAC-Seq data from healthy donors

Peripheral blood mononuclear cell (PBMC) isolation

Venous blood from five healthy donors was collected at the local blood transfusion center of Geneva in Switzerland, under the approval of the Geneva University Hospital’s Institute Review Board, upon written informed consent and in accordance with the Declaration of Helsinki. PBMCs were freshly isolated by Lymphoprep (Promega) centrifugation (1800 rpm, 20 minutes, without break, room temperature). Red blood cell lysis was performed using red blood lysis buffer (Qiagen) and platelets were removed by centrifugation (1000 rpm, 10 minutes without break, room temperature). Cells were counted and immediately used.

Flow cytometry

Immune cell populations were identified using multiparameter flow cytometry and the following antibodies: FITC anti-human CD45RA (HI100, Biolegend), PerCP-Cyanine5.5 anti-human CD19 (H1B19, Biolegend), PE anti-human CD3 (SK7, Biolegend), PE-Dazzle anti-human CD14 (M0P9, BD Biosciences), PE-Cyanine7 anti-human CD56 (HCD56, Biolegend), APC anti-human CD4 (RPA-T4, Biolgend), APC-Cyanine7 anti-human CCR7 (G043H7, Biolegend), Brilliant Violet 421 anti-human CD8 (RPA-T8, Biolegend), Brilliant Violet 510 anti-human CD25 (BC96, Biolegend), Brilliant Violet 711 anti-human CD16 (3G8, Biolegend), Brilliant Violet 786 anti-human CD127 (A019D5, Biolegend), Ultra-Brilliant Violet anti-human CD45 (HI30, BD Biosciences), FITC anti-human Celc9a (8F9, Miltenyi), PE anti-human XCR1 (S15046E, Biolegend), PE-Dazzle anti-human BDCA-2 (201A, Biolegend), APC anti-human BDCA-3 (AD5-14H12, Miltenyi), Brilliant Violet 421 anti-human CD3 (UCHT1, Biolegend), Brilliant Violet 421 anti-human CD14 (M5E2, BD Pharmingen), Brilliant Violet 421 anti-human CD19 (SJ25C1, Biolegend), Brilliant Violet 510 anti-human BDCA-1 (L161, Biolegend), Brilliant Violet 650 anti-human CD11c (3.9, Biolegend), Brilliant Violet 711 anti-human CD11c (N418, Biolegend) and Brilliant Violet 711 anti-human HLA-DR (L243, Biolegend). Dead cells were excluded using the Zombie UV™ Fixable Viability Kit (Biolegend). Intracellular staining was performed after fixation and permeabilization of the cells with the FoxP3 Transcription Factor Staining Buffer Set (00-5523-00, Invitrogen) using Alexa 700 anti-human FoxP3 antibody (259D/C7, BD Biosciences). Data were acquired on LSRFortessa flow cytometer and analysed using FlowJo software (v10.7.1).

Cell preparation for ATAC-Sequencing

50000 CD45+ cells were sorted from total PBMCs using anti-human Ultra-Brilliant Violet (BUV395) CD45 (HI30, BD Biosciences) with a FACSAria II (Becton Dickinson) and were collected in PBS with 10% Foetal Bovine Serum (FBS). Cell pellets were resuspended in cold lysis buffer (10mM Tris-Cl pH 7.4, 10mM NaCl, 3mM MgCl2, 0,1% NP40 and water) and immediately centrifuged at 600g for 30min at 4°C. Transposition reaction was performed using the Illumina Tagment DNA Enzyme and Buffer kit (20034210, Illumina) and transposed DNA was eluted using the MinElute PCR Purification Kit (Qiagen). Libraries were generated by PCR amplification using indexing primers and NEBNext High-Fidelity Master Mix (New England BioLabs) and were purified using AMPure XP beads (A63880, Beckman Coulter). Libraries were quantified by a fluorometric method (QubIT, Life Technologies) and their quality assessed on a Fragment Analyzer (Agilent Technologies). Sequencing was performed as a paired end 50 cycles run on an Illumina NovaSeq 6000 (v1.5 reagents) at the Genomic Technologies Facility (GTF) in Lausanne, Switzerland. Raw sequencing data were demultiplexed using the bcl2fastq2 Conversion Software (version 2.20, Illumina).

Data processing

The same steps as for the processing of the reference ATAC-Seq samples were followed. (See Pre-processing of the ATAC-Seq datasets).

ATAC-Seq pseudobulk data from PBMCs and cancer samples

To evaluate the accuracy of our ATAC-Seq deconvolution framework, we generated pseudo-bulk datasets from 5 single-cell datasets:

PBMC pseudobulk dataset: combination of three single-cell datasets for PBMCs.
- Dataset 1 corresponds to a scATAC-Seq dataset obtained from Satpathy et al. (Satpathy et al. 2019) (GEO accession number: GSE129785). This dataset contains FACS-sorted populations of PBMCs. Since the cells of some cell types came from a unique donor, all the cells of this dataset were aggregated to form one pseudobulk. Ground truth cell fractions were obtained by dividing the number of cells in each cell type by the total number of cells.
- Dataset 2 (included in the PBMC pseudobulk dataset) was retrieved from Granja et al. (Granja et al. 2019) (GEO accession number GSE139369). B cells, monocytes, dendritic, CD8+, CD4+ T, NK cells, neutrophils from healthy donors were considered. The neutrophil cells came from a single donor. As for dataset 1, we thus aggregated all the cells to generate one pseudobulk. Ground truth cell fractions were obtained by dividing the number of cells in each cell type by the total number of cells.
- Dataset 3 (included in the PBMC pseudobulk dataset) corresponds to the 10X multiome dataset of PBMC cells (10x Genomics 2021). Since these data come from one donor, one pseudobulk sample was generated for this dataset. The pseudobulk was generated by averaging the ATAC-Seq signal from all cells from the following cell types: B cells, CD4+ T cells, CD8+ T cells, NK cells, Dendritic cells and monocytes.
Basal cell carcinoma dataset: obtained from the study of Satpathy et al. (Satpathy et al. 2019). This dataset is a scATAC-Seq dataset composed of 13 basal cell carcinoma samples composed of immune (B cells, plasma cells, CD4+ T cells, CD8+ T cells, NK cells, myeloid cells), stromal (endothelial and fibroblasts) and cancer cells. Plasma cells and cancer cells were both considered as uncharacterized cells (i.e., cell types not included in the reference profiles). Cell annotations were retrieved from the original study.
Gynecological cancer dataset: obtained from the study of Regner et al. (Regner et al. 2021) (GEO accession number GSE173682). In this study, the authors performed scATAC-Seq on 11 gynecological cancer samples from two tumor sites (i.e endometrium and ovary) and composed of immune (B cells, NK and T cells grouped under the same cell-type annotation, macrophages, mast cells), stromal (fibroblast, endothelial, smooth muscle) and cancer cells. Mast cells, smooth muscle and cancer cells were considered as uncharacterized cells. Cell annotations were retrieved from the original study.

For Basal cell carcinoma and Gynecological cancer datasets, one pseudobulk per sample was generated and ground truth cell fractions were obtained for each sample by dividing the number of cells in each cell type by the total number of cells in the sample.

For each dataset, raw fragments files were downloaded from the respective GEO accession numbers and data were preprocessed using ArchR ((Granja et al. 2021), ArchR R package 1.0.2). Cells with TSS score below four were removed. Doublets removal was performed using the doubletsRemoval function from ArchR. To match as much as possible real bulk ATAC-seq data processing, peak calling was not performed on each cell type or cell cluster as usually done in scATAC-Seq studies but using all cells for each dataset from the PBMC pseudobulk data or grouping cells by sample for the Basal cell carcinoma and Gynecological cancer datasets. Peak calling was performed using MACS2 within the ArchR framework. Fragments counts were extracted using ArchR for each peak called to generate single-cell peak counts matrices. These matrices were normalized using a TPM-like transformation, i.e., dividing counts by peak length and correcting samples counts for depth. Finally, for each peak, the average of the normalized counts was computed across all the cells for each dataset from the PBMC pseudobulk data and across all the cells of each sample for the Basal cell carcinoma and Gynecological cancer datasets. Averaged data were then rescaled so that the sum of counts of each sample sum to 10⁶.

Bulk ATAC-Seq data from a breast cancer cohort

Bulk ATAC-Seq samples from a breast cancer cohort was obtained from Kumegawa et al. (Kumegawa et al. 2023). These data include 42 breast cancer samples which can be classified based on two features: (i) the breast cancer subtype ER+/HER2- or triple negative, and (ii) the molecular classification provided by the original study (CA-A, CA-B and CA-C). The ATAC-Seq raw counts and the samples metadata were retrieved from figshare (Kumegawa 2023). As for the previously mentioned datasets, raw counts were normalized using the TPM-like transformation prior to bulk deconvolution.

Benchmarking of the EPIC-ATAC framework against other existing deconvolution tools

The performances of the EPIC-ATAC framework were benchmarked against the following deconvolution tools:

quanTIseq (Finotello et al. 2019) is a deconvolution tool using constrained least square regression to deconvolve RNA-Seq bulk samples. No reference profiles are available in this framework to perform ATAC-Seq deconvolution and quanTIseq does not provide the option to automatically build reference profiles from pure bulk samples. quanTIseq was thus run using the reference profiles derived in this work for the EPIC-ATAC framework and the quanTIseq function from the quantiseqr R package (parameters: scaling set to 1 for all cell types and method set to “lsei”).
DeconPeaker (H. Li et al. 2020) relies on SIMPLS, a variant of partial least square regression to perform bulk RNA-Seq and bulk ATAC-Seq deconvolution. ATAC-Seq reference profiles are available in this deconvolution framework however not all cell types considered in the EPIC-ATAC framework are included in the DeconPeaker reference profiles. This tool was thus run using different reference profiles: (i) the reference profiles derived in this work for the EPIC-ATAC framework (corresponds to “DeconPeaker” or “DeconPeaker_ourmarkers” in our analyses), and (ii) reference profiles automatically generated by DeconPeaker from the sorted reference samples collected in this work (corresponds to “DeconPeaker_cust.” in our analyses). The results of DeconPeaker obtained using its original markers and profiles are also provided for the cell types in common with the cell types considered in this work in Supplementary Figures 3 and 5. Deconvolution was run using the deconvolution module deconPeaker (using findctsps with the following parameter: --lib-strategy=ATAC-Seq). DeconPeaker outputs cell-type proportions relative to the total amount of cells from the reference cell types.
CIBERSORTx (Newman et al. 2019) is a deconvolution algorithm based on linear support vector regression. CIBERSORTx does not provide ATAC-Seq reference profiles, however it is possible to automatically generate new profiles from a set of pure bulk samples. This tool was thus run using different reference profiles: i) the reference profiles derived in this work for the EPIC-ATAC framework (corresponds to “CIBERSORTx” or “CIBERSORTx_ourMarkers” in our analyses), and ii) reference profiles automatically generated by CIBERSORTx from the sorted reference samples collected in this work (corresponds to “CIBERSORTx_cust.” in our analyses). To run CIBERSORTx, we used the docker container provided by the authors of CIBERSORTx on their website. The algorithm was run using the default options (i.e --absolute FALSE, --rmbatchBmode FALSE and –rmbatchSmode FALSE), which results in cell-type proportions relative to the total amount of cells from the reference cell types.
ABIS (Monaco et al. 2019) uses robust linear modeling to estimate cell-type proportions in bulk RNA-Seq samples. No ATAC-Seq reference profiles are available in the deconvolution framework. ABIS was run using the EPIC-ATAC reference profiles by using the rlm function from the MASS R package (as performed in the deconvolute_abis function from the immunedeconv R package (Sturm et al. 2019) was used to quantify each cell type from the reference profiles. The cell-types quantifications returned by this approach are in arbitrary units. To compare the estimations and the true cell proportions, we scaled the estimations of each sample between 0 and 1 to obtained relative proportions.
MCPcounter (Becht et al. 2016): MCPcounter returns scores instead of cell type proportions.

The scores were obtained using the appendSignatures function from the MCPcounter R package by providing the list of marker peaks specific to each cell type. The cell-type scores are not comparable between cell type, MCPcounter was thus included only in the evaluation of the performances in each cell type separately.

For all the tools, TPM-like data were used as input bulk samples for the deconvolution. Since CIBERSORTx, ABIS and DeconPeaker do not predict proportions of uncharacterized cells, we performed two benchmarking analyses: (i) including all cell types and (ii) excluding the cell types that are absent from the reference profiles (uncharacterized cells) and rescaling the estimated and true proportions of the immune cells, endothelial cells and fibroblasts so that their sum equals 1.

Comparing deconvolution based on RNA-Seq, gene activity or peaks features

100 pseudobulks were generated from the 10X PBMC multiome dataset (10x Genomics 2021) based on 3000 cells for each pseudobulk. Cell fractions were defined using the rdirichlet function from the gtools R package. Three sets of features were extracted from the data, i.e., gene expression features extracted from the RNA-Seq layer, ATAC-Seq peaks and gene activity derived from the ATAC-Seq layer. The same cells sampling was considered for each modality.

Gene activity features were extracted from the single-cell data using ArchR (1.0.2), which considers distal elements and adjusts for large differences in gene size in the gene activity score calculation. Gene activity pseudobulks were built by averaging the gene activity scores across all cells belonging to the pseudobulk. For ATAC-Seq pseudobulk, peaks called using ArchR on all cells form the 10X dataset were considered (see the method section “ATAC-Seq pseudobulk data from PBMCs and cancer samples”) and counts were averaged across all cells of each pseudobulk. For RNA-Seq pseudobulks, counts were also averaged across all cells of each pseudobulk. All aggregated data were depth normalized across each features to 10⁶. Cell-type deconvolution was performed on each pseudobulk using EPIC-ATAC on the peak matrix using our ATAC-Seq marker peaks and reference profiles. The RNA-Seq and gene activity pseudobulks were deconvolved with EPIC.

Supporting information

Supplementary_figures

Supplementary Tables

Sup. Table 1: Metadata of the ATAC-Seq samples used in the study

Sup. Table 2: Averaged chromatin accessibility of the PBMC marker peaks in each cell-type.

Sup. Table 3: Averaged chromatin accessibility of the TME marker peaks in each cell-type.

Sup. Table 4: Annotations of the cell-type specific PBMC marker peaks

Sup. Table 5: Annotations of the cell-type specific TME marker peaks

Sup. Table 6: GO pathways enriched in each set of cell-type specific PBMC marker peaks

Sup. Table 7: GO pathways enriched in each set of cell-type specific TME marker peaks

Sup. Table 8: Averaged chromatin accessibility of the PBMC marker peaks in each cell-type (T cells subtypes included).

Sup. Table 9: Averaged chromatin accessibility of the TME marker peaks in each cell-type (T cells subtypes included).

Sup. Table 10: Annotations of the cell-type specific PBMC marker peaks (T cells subtypes included).

Sup. Table 11: Annotations of the cell-type specific TME marker peaks (T cells subtypes included).

Sup. Table 12: GO pathways enriched in each set of cell-type specific PBMC marker peaks (T cell subtypes).

Sup. Table 13: GO pathways enriched in each set of cell-type specific TME marker peaks (T cell subtypes)

Supplementary Tables

Code availability

The code to download and preprocess publicly available ATAC-Seq samples as well as the code used to identify our cell-type specific marker peaks and generate the reference profiles is available on GitHub (https://github.com/GfellerLab/EPIC-ATAC_Manuscript). A README file is provided on the GitHub repository with more details on how to use the code.

The code to perform ATAC-Seq deconvolution using the EPIC-ATAC framework is available as an R package called EPICATAC and is available on GitHub (https://github.com/GfellerLab/EPIC-ATAC).

Data availability

The newly generated ATAC-Seq data have been deposited on Zenodo (doi: 10.5281/zenodo.8431792). The other data related to this work are available in the supplementary tables and on the Zenodo deposit (doi: 10.5281/zenodo.8431792).

Competing interests

The authors declare that they have no competing interests.

Acknowledgements

We thank the Lausanne Genomic Technologies Facility, University of Lausanne, Switzerland (https://www.unil.ch/gtf/en/home.html) for the sequencing of the PBMC samples as well as Yan Liu, Dana Moreno and Matei Teleman for testing the EPICATAC R package. Some of the illustrations were created with BioRender.com.

Authors contributions

Conceptualization: AAGG, JR, DG; Data curation: AAGG; Software: AAGG, JR, DG; Experiments: MF, CJ; Visualization: AAGG, JR, DG; Methodology: AAGG, JR, DG; Writing—original draft: AAGG, DG; Writing—review and editing: all authors.

List of abbreviations

ATAC: Assay for Transposase-Accessible chromatin

CBP(s): chromatin binding protein(s)

ChIP-seq: chromatin immunoprecipitation followed by sequencing

DC: dendritic cells

NK: natural killer cells

PCA: principal component analysis

RMSE: root mean squared error

TCGA: The Cancer Genome Atlas

TF(s): transcription factor(s)

TSS: transcription start site

UMAP: Uniform Manifold Approximation

References

1. 10x Genomics
2021“PBMC from a Healthy Donor - Granulocytes Removed Through Cell Sorting (10k).”https://www.10xgenomics.com/resources/datasets/pbmc-from-a-healthy-donor-granulocytes-removed-through-cell-sorting-10-k-1-standard-2-0-0
1. Arneson Douglas
2. Yang Xia
3. Wang Kai
2020MethylResolver—a Method for Deconvoluting Bulk DNA Methylation Profiles into Known and Unknown Cell ContentsCommunications Biology 3https://doi.org/10.1038/s42003-020-01146-2 Google Scholar
1. Avila Cobos Francisco
2. Alquicira-Hernandez José
3. Powell Joseph E.
4. Mestdagh Pieter
5. De Preter Katleen
2020Benchmarking of Cell Type Deconvolution Pipelines for Transcriptomics DataNature Communications 11:1–14https://doi.org/10.1038/s41467-020-19015-1 Google Scholar
1. Avila Cobos Francisco
2. Vandesompele Jo
3. Mestdagh Pieter
4. De Preter Katleen
2018Computational Deconvolution of Transcriptomics Data from Mixed Cell PopulationsBioinformatics 34:1969–79https://doi.org/10.1093/BIOINFORMATICS/BTY019 Google Scholar
1. Becht Etienne
2. Giraldo Nicolas A.
3. Lacroix Laetitia
4. Buttard Bénédicte
5. Elarouci Nabila
6. Petitprez Florent
7. Selves Janick
8. et al.
2016Estimating the Population Abundance of Tissue-Infiltrating Immune and Stromal Cell Populations Using Gene ExpressionGenome Biology 17:218https://doi.org/10.1186/S13059-016-1070-5/TABLES/4 Google Scholar
1. Bonnema Joy D.
2. Karnitz Larry M.
3. Schoon Renee A.
4. Abraham Robert T.
5. Leibson Paul J.
1994Fc Receptor Stimulation of Phosphatidylinositol 3-Kinase in Natural Killer Cells Is Associated with Protein Kinase C-Independent Granule Release and Cell-Mediated CytotoxicityJournal of Experimental Medicine 180:1427–35https://doi.org/10.1084/JEM.180.4.1427 Google Scholar
1. Buenrostro Jason D.
2. Giresi Paul G.
3. Zaba Lisa C.
4. Chang Howard Y.
5. Greenleaf William J.
2013Transposition of Native Chromatin for Fast and Sensitive Epigenomic Profiling of Open Chromatin, DNA-Binding Proteins and Nucleosome PositionNature Methods 10:1213–18https://doi.org/10.1038/nmeth.2688 Google Scholar
1. Burdziak Cassandra
2. Azizi Elham
3. Prabhakaran Sandhya
2019A Nonparametric Multi-View Model for Estimating Cell Type-Specific Gene Regulatory NetworksArXiv Google Scholar
1. Calderon Diego
2. Nguyen Michelle L.T.
3. Mezger Anja
4. Kathiria Arwa
5. Müller Fabian
6. Nguyen Vinh
7. Lescano Ninnia
8. et al.
2019Landscape of Stimulation-Responsive Chromatin across Diverse Human Immune CellsNature Genetics 51:1494–1505https://doi.org/10.1038/s41588-019-0505-9 Google Scholar
1. Carvalho Klebea
2. Rebboah Elisabeth
3. Jansen Camden
4. Williams Katherine
5. Dowey Andrew
6. McGill Cassandra
7. Mortazavi Ali
2021“Uncovering the Gene Regulatory Networks Underlying Macrophage Polarization Through Comparative Analysis of Bulk and Single-Cell Data.” BioRxivJanuary https://doi.org/10.1101/2021.01.20.427499 Google Scholar
1. Castro-Mondragon Jaime A.
2. Riudavets-Puig Rafael
3. Rauluseviciute Ieva
4. Lemma Roza Berhanu
5. Turchi Laura
6. Blanc-Mathieu Romain
7. Lucas Jeremy
8. et al.
2022JASPAR 2022: The 9th Release of the Open-Access Database of Transcription Factor Binding ProfilesNucleic Acids Research 50:D165–73https://doi.org/10.1093/NAR/GKAB1113 Google Scholar
1. Chakravarthy Ankur
2. Furness Andrew
3. Joshi Kroopa
4. Ghorani Ehsan
5. Ford Kirsty
6. Ward Matthew J.
7. King Emma V.
8. et al.
2018Pan-Cancer Deconvolution of Tumour Composition Using DNA MethylationNature Communications 9:Augusthttps://doi.org/10.1038/s41467-018-05570-1 Google Scholar
1. Clarke Jennifer
2. Seol Pearl
3. Clarke Bertrand
2010Statistical Expression Deconvolution from Mixed Tissue SamplesBioinformatics 26:1043–49https://doi.org/10.1093/BIOINFORMATICS/BTQ097 Google Scholar
1. Corces M. Ryan
2. Buenrostro Jason D.
3. Wu Beijing
4. Greenside Peyton G.
5. Chan Steven M.
6. Koenig Julie L.
7. Snyder Michael P.
8. et al.
2016Lineage-Specific and Single-Cell Chromatin Accessibility Charts Human Hematopoiesis and Leukemia EvolutionNature Genetics 48:1193–1203https://doi.org/10.1038/ng.3646 Google Scholar
1. Corces M. Ryan
2. Granja Jeffrey M.
3. Shams Shadi
4. Louie Bryan H.
5. Seoane Jose A.
6. Zhou Wanding
7. Silva Tiago C.
8. et al.
2018The Chromatin Accessibility Landscape of Primary Human CancersScience 362https://doi.org/10.1126/science.aav1898 Google Scholar
1. Corces M. Ryan
2. Trevino Alexandro E.
3. Hamilton Emily G.
4. Greenside Peyton G.
5. Sinnott-Armstrong Nicholas A.
6. Vesuna Sam
7. Satpathy Ansuman T.
8. et al.
2017An Improved ATAC-Seq Protocol Reduces Background and Enables Interrogation of Frozen TissuesNature Methods 14:959–62https://doi.org/10.1038/nmeth.4396 Google Scholar
1. Cusanovich Darren A.
2. Daza Riza
3. Adey Andrew
4. Pliner Hannah A.
5. Christiansen Lena
6. Gunderson Kevin L.
7. Steemers Frank J.
8. Trapnell Cole
9. Shendure Jay
2015Multiplex Single-Cell Profiling of Chromatin Accessibility by Combinatorial Cellular IndexingScience 348:910–14https://doi.org/10.1126/SCIENCE.AAB1601/SUPPL_FILE/PAP.PDF Google Scholar
1. Feng Song
2. Calinawan Anna
3. Pugliese Pietro
4. Wang Pei
5. Ceccarelli Michele
6. Petralia Francesca
7. Gosline Sara J C
2023“Decomprolute: A Benchmarking Platform Designed for Multiomics-Based Tumor Deconvolution.”BioRxiv Google Scholar
1. Finotello Francesca
2. Mayer Clemens
3. Plattner Christina
4. Laschober Gerhard
5. et al.
2019Molecular and Pharmacological Modulators of the Tumor Immune Contexture Revealed by Deconvolution of RNA-Seq DataGenome Medicine 11:34https://doi.org/10.1186/s13073-019-0638-6 Google Scholar
1. Franzén Oscar
2. Gan Li Ming
3. Björkegren Johan L.M.
2019PanglaoDB: A Web Server for Exploration of Mouse and Human Single-Cell RNA Sequencing DataDatabase 2019:baz046https://doi.org/10.1093/DATABASE/BAZ046 Google Scholar
1. Fridman Wolf H.
2. Pagès Franck
3. Sautès-Fridman Catherine
2012The Immune Contexture in Human Tumours: Impact on Clinical OutcomeNature Reviews Cancer 12:298–306https://doi.org/10.1038/nrc3245 Google Scholar
1. Fridman Wolf H.
2. Zitvogel Laurence
3. Sautès-Fridman Catherine
4. Kroemer Guido
2017“The Immune Contexture in Cancer Prognosis and Treatment.”Nature Reviews Clinical Oncology. Nature Publishing Group https://doi.org/10.1038/nrclinonc.2017.101 Google Scholar
1. Ge Xiangyu
2. Frank-Bertoncelj Mojca
3. Klein Kerstin
4. McGovern Amanda
5. Kuret Tadeja
6. Houtman Miranda
7. Burja Blaž
8. et al.
2021Functional Genomics Atlas of Synovial Fibroblasts Defining Rheumatoid Arthritis HeritabilityGenome Biology 22:247https://doi.org/10.1186/S13059-021-02460-6/FIGURES/7 Google Scholar
1. Geng Degui
2. Zheng Liqin
3. Srivastava Ratika
4. Asprodites Nicole
5. Velasco-Gonzalez Cruz
6. Davila Eduardo
2010When Toll-like Receptor and T-Cell Receptor Signals Collide: A Mechanism for Enhanced CD8 T-Cell Effector FunctionBlood 116:3494–3504https://doi.org/10.1182/BLOOD-2010-02-268169 Google Scholar
1. Giles Josephine R.
2. Manne Sasikanth
3. Freilich Elizabeth
4. Oldridge Derek A.
5. Baxter Amy E.
6. George Sangeeth
7. Chen Zeyu
8. et al.
2022Human Epigenetic and Transcriptional T Cell Differentiation Atlas for Identifying Functional T Cell-Specific EnhancersImmunity 55:557–574https://doi.org/10.1016/J.IMMUNI.2022.02.004 Google Scholar
1. Gong Ting
2. Szustakowski Joseph D.
2013DeconRNASeq: A Statistical Framework for Deconvolution of Heterogeneous Tissue Samples Based on MRNA-Seq DataBioinformatics 29:1083–85https://doi.org/10.1093/BIOINFORMATICS/BTT090 Google Scholar
1. Gosink Mark M.
2. Petrie Howard T.
3. Tsinoremas Nicholas F.
2007Electronically Subtracting Expression Patterns from a Mixed Cell PopulationBioinformatics 23:3328–34https://doi.org/10.1093/BIOINFORMATICS/BTM508 Google Scholar
1. Grandi Fiorella C.
2. Modi Hailey
3. Kampman Lucas
4. Ryan Corces M.
2022“Chromatin Accessibility Profiling by ATAC-Seq.”Nature Protocols :1518–52https://doi.org/10.1038/s41596-022-00692-9 Google Scholar
1. Granja Jeffrey M.
2. Ryan Corces M.
3. Pierce Sarah E.
4. Tansu Bagdatli S.
5. Choudhry Hani
6. Chang Howard Y.
7. Greenleaf William J.
2021ArchR Is a Scalable Software Package for Integrative Single-Cell Chromatin Accessibility AnalysisNature Genetics 53:403–11https://doi.org/10.1038/s41588-021-00790-6 Google Scholar
1. Granja Jeffrey M.
2. Klemm Sandy
3. McGinnis Lisa M.
4. Kathiria Arwa S.
5. Mezger Anja
6. Ryan Corces M.
7. Parks Benjamin
8. et al.
2019Single-Cell Multiomic Analysis Identifies Regulatory Programs in Mixed-Phenotype Acute LeukemiaNature Biotechnology 37:1458–65https://doi.org/10.1038/s41587-019-0332-7 Google Scholar
1. Hammal Fayrouz
2. De Langen Pierre
3. Bergon Aurelie
4. Lopez Fabrice
5. Ballester Benoit
2022ReMap 2022: A Database of Human, Mouse, Drosophila and Arabidopsis Regulatory Regions from an Integrative Analysis of DNA-Binding Sequencing ExperimentsNucleic Acids Research 50:D316–25https://doi.org/10.1093/NAR/GKAB996 Google Scholar
1. Hu Congxue
2. Li Tengyue
3. Xu Yingqi
4. Zhang Xinxin
5. Li Feng
6. Bai Jing
7. Chen Jing
8. et al.
2023CellMarker 2.0: An Updated Database of Manually Curated Cell Markers in Human/Mouse and Web Tools Based on ScRNA-Seq DataNucleic Acids Research 51:D870–76https://doi.org/10.1093/NAR/GKAC947 Google Scholar
1. Javaid Nasir
2. Choi Sangdun
2020Toll-like Receptors from the Perspective of Cancer TreatmentCancers 12:297https://doi.org/10.3390/CANCERS12020297 Google Scholar
1. Jiang Yijia
2. Hu Zhirui
3. Jiang Junchen
4. Zhu Alexander
5. Zhang Yi
6. Lynch Allen W.
7. Xie Yingtian
8. et al.
2023“ScATAnno: Automated Cell Type Annotation for Single-Cell ATAC Sequencing Data.” BioRxivJune https://doi.org/10.1101/2023.06.01.543296 Google Scholar
1. Jimenez-Sanchez Alejandro
2. Cast Oliver
3. Miller Martin L.
2019Comprehensive Benchmarking and Integration of Tumor Microenvironment Cell Estimation MethodsCancer Research 79:6238–46https://doi.org/10.1158/0008-5472.CAN-18-3560 Google Scholar
1. Jin Haijing
2. Liu Zhandong
2021A Benchmark for RNA-Seq Deconvolution Analysis under Dynamic Testing EnvironmentsGenome Biology 22:102https://doi.org/10.1186/s13059-021-02290-6 Google Scholar
1. Kalafati Lydia
2. Kourtzelis Ioannis
3. Schulte-Schrepping Jonas
4. Li Xiaofei
5. Hatzioannou Aikaterini
6. Grinenko Tatyana
7. Hagag Eman
8. et al.
2020Innate Immune Training of Granulopoiesis Promotes Anti-Tumor ActivityCell 183:771–785https://doi.org/10.1016/J.CELL.2020.09.058 Google Scholar
1. Klemm Sandy L.
2. Shipony Zohar
3. Greenleaf William J.
2019Chromatin Accessibility and the Regulatory EpigenomeNature Reviews Genetics 20:207–20https://doi.org/10.1038/s41576-018-0089-8 Google Scholar
1. Kuhn Max
2008Building Predictive Models in R Using the Caret PackageJournal of Statistical Software 28:1–26https://doi.org/10.18637/JSS.V028.I05 Google Scholar
1. Kumegawa Kohei
2023ATAC-Seq Data of 42 BC Samples as SummarizedExperiment Object with Count Matrix, Normalized Count Matrix, Peak Info, and Clinical Info.”2023https://doi.org/10.6084/m9.figshare.21992609.v1 Google Scholar
1. Kumegawa Kohei
2. Saeki Sumito
3. Takahashi Yoko
4. Yang Liying
5. Osako Tomo
6. Nakadai Tomoyoshi
7. Amino Sayuri
8. et al.
2023Chromatin Profile-Based Identification of a Novel ER-Positive Breast Cancer Subgroup with Reduced ER-Responsive Element AccessibilityBritish Journal of Cancer 128:1208–22https://doi.org/10.1038/s41416-023-02178-1 Google Scholar
1. Lareau Caleb A.
2. Duarte Fabiana M.
3. Chew Jennifer G.
4. Kartha Vinay K.
5. Burkett Zach D.
6. Kohlway Andrew S.
7. Pokholok Dmitry
8. et al.
2019Droplet-Based Combinatorial Indexing for Massive-Scale Single-Cell Chromatin AccessibilityNature Biotechnology 37:916–24https://doi.org/10.1038/s41587-019-0147-6 Google Scholar
1. Lau Jessica W.
2. Lehnert Erik
3. Sethi Anurag
4. Malhotra Raunaq
5. Kaushik Gaurav
6. Onder Zeynep
7. Groves-Kirkby Nick
8. et al.
2017The Cancer Genomics Cloud: Collaborative, Reproducible, and Democratized - A New Paradigm in Large-Scale Computational ResearchCancer Research 77:e3–6https://doi.org/10.1158/0008-5472.CAN-17-0387 Google Scholar
1. Law Charity W.
2. Chen Yunshun
3. Shi Wei
4. Smyth Gordon K.
2014Voom: Precision Weights Unlock Linear Model Analysis Tools for RNA-Seq Read CountsGenome Biology 15:R29https://doi.org/10.1186/GB-2014-15-2-R29/FIGURES/11 Google Scholar
1. Leylek Rebecca
2. Alcántara-Hernández Marcela
3. Granja Jeffrey M.
4. Chavez Michael
5. Perez Kimberly
6. Diaz Oscar R.
7. Li Rui
8. Satpathy Ansuman T.
9. Chang Howard Y.
10. Idoyaga Juliana
2020Chromatin Landscape Underpinning Human Dendritic Cell HeterogeneityCell Reports 32:108180https://doi.org/10.1016/J.CELREP.2020.108180 Google Scholar
1. Li Huamei
2. Sharma Amit
3. Luo Kun
4. Qin Zhaohui S.
5. Sun Xiao
6. Liu Hongde
2020DeconPeaker, a Deconvolution Model to Identify Cell Types Based on Chromatin Accessibility in ATAC-Seq Data of Mixture SamplesFrontiers in Genetics 11:Junehttps://doi.org/10.3389/fgene.2020.00392 Google Scholar
1. Li Taiwen
2. Fu Jingxin
3. Zeng Zexian
4. Cohen David
5. Li Jing
6. Chen Qianming
7. Li Bo
8. Shirley Liu X.
2020TIMER2.0 for Analysis of Tumor-Infiltrating Immune CellsNucleic Acids Research 48:W509–14https://doi.org/10.1093/nar/gkaa407 Google Scholar
1. Liao Yang
2. Smyth Gordon K.
3. Shi Wei
2014FeatureCounts: An Efficient General Purpose Program for Assigning Sequence Reads to Genomic FeaturesBioinformatics 30:923–30https://doi.org/10.1093/BIOINFORMATICS/BTT656 Google Scholar
1. Liu Qian
2. Zaba Lisa
3. Satpathy Ansuman T.
4. Longmire Michelle
5. Zhang Wen
6. Li Kun
7. Granja Jeffrey
8. et al.
2020Chromatin Accessibility Landscapes of Skin Cells in Systemic Sclerosis Nominate Dendritic Cells in Disease PathogenesisNature Communications 11:5843https://doi.org/10.1038/s41467-020-19702-z Google Scholar
1. Luo Liheng
2. Gribskov Michael
3. Wang Sufang
2022“Bibliometric Review of ATAC-Seq and Its Application in Gene Expression.” Briefings in BioinformaticsMarch https://doi.org/10.1093/BIB/BBAC061 Google Scholar
1. Machlab Dania
2. Burger Lukas
3. Soneson Charlotte
4. Rijli Filippo M.
5. Schübeler Dirk
6. Stadler Michael B.
2022MonaLisa: An R/Bioconductor Package for Identifying Regulatory MotifsBioinformatics 38:2624–25https://doi.org/10.1093/BIOINFORMATICS/BTAC102 Google Scholar
1. Mayer Michael
2023“R Package ‘SplitTools’: Tools for Data Splittinghttps://cran.r-project.org/web/packages/splittools/index.html
1. Monaco Gianni
2. Lee Bernett
3. Xu Weili
4. Mustafah Seri
5. Hwang You Yi
6. Carré Christophe
7. Burdin Nicolas
8. et al.
2019RNA-Seq Signatures Normalized by MRNA Abundance Allow Absolute Deconvolution of Human Immune Cell TypesCell Reports 26:1627–1640https://doi.org/10.1016/J.CELREP.2019.01.041 Google Scholar
1. Mumbach Maxwell R.
2. Satpathy Ansuman T.
3. Boyle Evan A.
4. Dai Chao
5. Gowen Benjamin G.
6. Cho Seung Woo
7. Nguyen Michelle L.
8. et al.
2017Enhancer Connectome in Primary Human Cells Identifies Target Genes of Disease-Associated DNA ElementsNature Genetics 49:1602–12https://doi.org/10.1038/ng.3963 Google Scholar
1. Newman Aaron M.
2. Liu Chih Long
3. Green Michael R.
4. Gentles Andrew J.
5. Feng Weiguo
6. Xu Yue
7. Hoang Chuong D.
8. Diehn Maximilian
9. Alizadeh Ash A.
2015Robust Enumeration of Cell Subsets from Tissue Expression ProfilesNature Methods 12:453–57https://doi.org/10.1038/nmeth.3337 Google Scholar
1. Newman Aaron M.
2. Steen Chloé B.
3. Liu Chih Long
4. Gentles Andrew J.
5. Chaudhuri Aadel A.
6. Scherer Florian
7. Khodadoust Michael S.
8. et al.
2019Determining Cell Type Abundance and Expression from Bulk Tissues with Digital CytometryNature Biotechnology 37:773–82https://doi.org/10.1038/s41587-019-0114-2 Google Scholar
1. Peng Xianlu Laura
2. Moffitt Richard A.
3. Torphy Robert J.
4. Volmar Keith E.
5. Yeh Jen Jen
2019“De Novo Compartment Deconvolution and Weight Estimation of Tumor Samples Using DECODER.”Nature Communications 10:4729https://doi.org/10.1038/s41467-019-12517-7 Google Scholar
1. Perez Cristina
2. Botta Cirino
3. Zabaleta Aintzane
4. Puig Noemi
5. Cedena Maria-Teresa
6. Goicoechea Ibai
7. Alameda Daniel
8. et al.
2020Immunogenomic Identification and Characterization of Granulocytic Myeloid-Derived Suppressor Cells in Multiple MyelomaBlood 136:199–209https://doi.org/10.1182/BLOOD.2019004537 Google Scholar
1. Qiu Yixuan
2. Wang Jiebiao
3. Lei Jing
4. Roeder Kathryn
2021Identification of Cell-Type-Specific Marker Genes from Co-Expression Patterns in Tissue SamplesBioinformatics 37:3228–34https://doi.org/10.1093/BIOINFORMATICS/BTAB257 Google Scholar
1. Racle Julien
2. Gfeller David
2020“EPIC: A Tool to Estimate the Proportions of Different Cell Types from Bulk Gene Expression Data.”In Methods in Molecular Biology 2120:233–48https://doi.org/10.1007/978-1-0716-0327-7_17 Google Scholar
2017Simultaneous Enumeration of Cancer and Immune Cell Types from Bulk Tumor Gene Expression DataELife 6:Novemberhttps://doi.org/10.7554/eLife.26476 Google Scholar
1. Ram-Mohan Nikhil
2. Thair Simone A
3. Litzenburger Ulrike M
4. Cogill Steven
5. Andini Nadya
6. Yang Xi
7. Chang Howard Y
8. Yang Samuel
2021Profiling Chromatin Accessibility Responses in Human Neutrophils with Sensitive Pathogen DetectionLife Science Alliance 4:8https://doi.org/10.26508/LSA.202000976 Google Scholar
1. Regner Matthew J.
2. Wisniewska Kamila
3. Garcia-Recio Susana
4. Thennavan Aatish
5. Giraldez Raul Mendez-
6. Malladi Venkat S.
7. Hawkins Gabrielle
8. et al.
2021A Multi-Omic Single-Cell Landscape of Human Gynecologic MalignanciesMolecular Cell 81:4924–4941https://doi.org/10.1016/j.molcel.2021.10.013 Google Scholar
1. Risso Davide
2. Schwartz Katja
3. Sherlock Gavin
4. Dudoit Sandrine
2011GC-Content Normalization for RNA-Seq DataBMC Bioinformatics 12:480https://doi.org/10.1186/1471-2105-12-480/FIGURES/7 Google Scholar
1. Ritchie Matthew E.
2. Belinda Phipson Di Wu
3. Hu Yifang
4. Law Charity W.
5. Shi Wei
6. Smyth Gordon K.
2015Limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray StudiesNucleic Acids Research 43:e47https://doi.org/10.1093/nar/gkv007 Google Scholar
1. Robinson Mark D.
2. McCarthy Davis J.
3. Smyth Gordon K.
2010EdgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression DataBioinformatics 26:139–40https://doi.org/10.1093/BIOINFORMATICS/BTP616 Google Scholar
1. Rozowsky Joel
2. Gao Jiahao
3. Borsari Beatrice
4. Guigó Roderic
5. Gingeras Thomas R
6. Gerstein Mark
7. Yang Yucheng T
8. et al.
2023The EN-TEx Resource of Multi-Tissue Personal Epigenomes & Variant-Impact ModelsCell 186:1493–1511https://doi.org/10.1016/j.cell.2023.02.018 Google Scholar
1. Sanseviero Emilio
2019NK Cell-Fc Receptors Advance Tumor ImmunotherapyJournal of Clinical Medicine 8:1667https://doi.org/10.3390/JCM8101667 Google Scholar
1. Satpathy Ansuman T.
2. Granja Jeffrey M.
3. Yost Kathryn E.
4. Qi Yanyan
5. Meschi Francesca
6. McDermott Geoffrey P.
7. Olsen Brett N.
8. et al.
2019Massively Parallel Single-Cell Chromatin Landscapes of Human Immune Cell Development and Intratumoral T Cell ExhaustionNature Biotechnology 37:925–36https://doi.org/10.1038/s41587-019-0206-z Google Scholar
1. Smith Jason P
2. Ryan Corces M
3. Xu Jin
4. Reuter Vincent P
5. Chang Howard Y
6. Sheffield Nathan C
2021PEPATAC: An Optimized Pipeline for ATAC-Seq Data Analysis with Serial AlignmentsNAR Genomics and Bioinformatics 3:4https://doi.org/10.1093/NARGAB/LQAB101 Google Scholar
1. Stuart Tim
2. Srivastava Avi
3. Madad Shaista
4. Lareau Caleb A.
5. Satija Rahul
2021Single-Cell Chromatin State Analysis with SignacNature Methods 18:1333–41https://doi.org/10.1038/s41592-021-01282-5 Google Scholar
1. Sturm Gregor
2. Finotello Francesca
3. Petitprez Florent
4. Zhang Jitao David
5. Baumbach Jan
6. Fridman Wolf H
7. List Markus
8. Aneichyk Tatsiana
2019Comprehensive Evaluation of Transcriptome-Based Cell-Type Quantification Methods for Immuno-OncologyBioinformatics 35:i436–45https://doi.org/10.1093/bioinformatics/btz363 Google Scholar
1. Subramanian Aravind
2. Tamayo Pablo
3. Mootha Vamsi K
4. Mukherjee Sayan
5. Ebert Benjamin L
6. Gillette Michael A
7. Paulovich Amanda
8. et al.
2005Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression ProfilesPNAS 102:15545–50https://doi.org/10.1073/pnas.0506580102 Google Scholar
1. Teschendorff Andrew E.
2. Zhu Tianyu
3. Breeze Charles E.
4. Beck Stephan
2020EPISCORE: Cell Type Deconvolution of Bulk Tissue DNA Methylomes from Single-Cell RNA-Seq DataGenome Biology 21:Septemberhttps://doi.org/10.1186/s13059-020-02126-9 Google Scholar
1. The ENCODE Project Consortium
2. Moore Jill E.
3. Purcaro Michael J.
4. Pratt Henry E.
5. Epstein Charles B.
6. Shoresh Noam
7. Adrian Jessika
2020“Expanded Encyclopaedias of DNA Elements in the Human and Mouse Genomes.”Nature 583:699–710https://doi.org/10.1038/s41586-020-2493-4 Google Scholar
1. Trizzino Marco
2. Zucco Avery
3. Deliard Sandra
4. Wang Fang
5. Barbieri Elisa
6. Veglia Filippo
7. Gabrilovich Dmitry
8. Gardini Alessandro
2021EGR1 Is a Gatekeeper of Inflammatory Enhancers in Human MacrophagesScience Advances 7:3https://doi.org/10.1126/SCIADV.AAZ8836/SUPPL_FILE/AAZ8836_TABLE_S7.XLSX Google Scholar
1. Ucar Duygu
2. Márquez Eladio J.
3. Chung Cheng Han
4. Marches Radu
5. Rossi Robert J.
6. Uyar Asli
7. Wu Te Chia
8. et al.
2017The Chromatin Accessibility Signature of Human Immune Aging Stems from CD8+ T CellsJournal of Experimental Medicine 214:3123–44https://doi.org/10.1084/jem.20170416 Google Scholar
1. de Visser Karin E
2. Joyce Johanna A.
2023The Evolving Tumor Microenvironment: From Cancer Initiation to Metastatic OutgrowthCancer Cell 41:374–403https://doi.org/10.1016/J.CCELL.2023.02.016 Google Scholar
1. Watt Stephen
2. Vasquez Louella
3. Walter Klaudia
4. Mann Alice L.
5. Kundu Kousik
6. Chen Lu
7. Sims Ying
8. et al.
2021Genetic Perturbation of PU.1 Binding and Chromatin Looping at Neutrophil Enhancers Associates with Autoimmune DiseaseNature Communications 12:2298https://doi.org/10.1038/S41467-021-22548-8 Google Scholar
1. Welch Ryan P.
2. Lee Chee
3. Imbriano Paul M.
4. Patil Snehal
5. Weymouth Terry E.
6. Alex Smith R.
7. Scott Laura J.
8. Sartor Maureen A.
2014ChIP-Enrich: Gene Set Enrichment Testing for ChIP-Seq DataNucleic Acids Research 42:e105https://doi.org/10.1093/NAR/GKU463 Google Scholar
1. Xin Jingxue
2. Zhang Hui
3. He Yaoxi
4. Duren Zhana
5. Bai Caijuan
6. Chen Lang
7. Luo Xin
8. et al.
2020Chromatin Accessibility Landscape and Regulatory Network of High-Altitude Hypoxia AdaptationNature Communications 11:4928https://doi.org/10.1038/S41467-020-18638-8 Google Scholar
1. Yu Guangchuang
2. Wang Li Gen
3. He Qing Yu
2015ChIPseeker: An R/Bioconductor Package for ChIP Peak Annotation, Comparison and VisualizationBioinformatics 31:2382–83https://doi.org/10.1093/BIOINFORMATICS/BTV145 Google Scholar
1. Zeng Wanwen
2. Chen Xi
3. Duren Zhana
4. Wang Yong
5. Jiang Rui
6. Wong Wing Hung
2019DC3 Is a Method for Deconvolution and Coupled Clustering from Bulk and Single-Cell Genomics DataNature Communications 10:4613https://doi.org/10.1038/s41467-019-12547-1 Google Scholar
1. Zhang Hanyu
2. Cai Ruoyi
3. Dai James
4. Sun Wei
2021Scientific Reports:5717https://doi.org/10.1038/s41598-021-84864-9 Google Scholar
1. Zhang Kai
2. Hocker James D
3. Miller Michael
4. Hou Xiaomeng
5. Chiou Joshua
6. Poirion Olivier B
7. Qiu Yunjiang
8. et al.
2021A Single-Cell Atlas of Chromatin Accessibility in the Human GenomeCell 184:5985–6001https://doi.org/10.1016/j.cell.2021.10.024 Google Scholar
1. Zhang Ping
2. Amarasinghe Harindra E.
3. Whalley Justin P.
4. Tay Chwen
5. Fang Hai
6. Migliorini Gabriele
7. Brown Andrew C.
8. et al.
2022Epigenomic Analysis Reveals a Dynamic and Context-Specific Macrophage Enhancer Landscape Associated with Innate Immune Activation and ToleranceGenome Biology 23:136https://doi.org/10.1186/S13059-022-02702-1 Google Scholar
1. Zhang Yong
2. Liu Tao
3. Meyer Clifford A.
4. Eeckhoute Jérôme
5. Johnson David S.
6. Bernstein Bradley E.
7. Nussbaum Chad
8. et al.
2008Model-Based Analysis of ChIP-Seq (MACS)Genome Biology 9:R137https://doi.org/10.1186/GB-2008-9-9-R137/FIGURES/3 Google Scholar

Article and author information

Author information

Aurélie AG Gabriel
Department of Oncology, Ludwig Institute for Cancer Research, University of Lausanne, Lausanne, Switzerland, Agora Cancer Research Centre, Lausanne, Switzerland, Swiss Cancer Center Leman (SCCL), Switzerland, Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
ORCID iD: 0000-0002-0606-3622
Julien Racle
Department of Oncology, Ludwig Institute for Cancer Research, University of Lausanne, Lausanne, Switzerland, Agora Cancer Research Centre, Lausanne, Switzerland, Swiss Cancer Center Leman (SCCL), Switzerland, Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
ORCID iD: 0000-0002-0100-0323
Maryline Falquet
Swiss Cancer Center Leman (SCCL), Switzerland, Ludwig Institute for Cancer Research, Lausanne Branch, Lausanne, Switzerland, Department of Pathology and Immunology Faculty of Medicine, University of Geneva, Geneva, Switzerland, Geneva Center for Inflammation Research, Geneva, Switzerland
ORCID iD: 0000-0002-5940-8583
Camilla Jandus
Swiss Cancer Center Leman (SCCL), Switzerland, Ludwig Institute for Cancer Research, Lausanne Branch, Lausanne, Switzerland, Department of Pathology and Immunology Faculty of Medicine, University of Geneva, Geneva, Switzerland, Geneva Center for Inflammation Research, Geneva, Switzerland
ORCID iD: 0000-0002-7405-5747
David Gfeller
Department of Oncology, Ludwig Institute for Cancer Research, University of Lausanne, Lausanne, Switzerland, Agora Cancer Research Centre, Lausanne, Switzerland, Swiss Cancer Center Leman (SCCL), Switzerland, Swiss Institute of Bioinformatics (SIB), Lausanne, Switzerland
ORCID iD: 0000-0002-3952-0930
- Corresponding author: david.gfeller@unil.ch

Version history

Sent for peer review: January 28, 2024
Preprint posted: January 29, 2024
Reviewed Preprint version 1: April 3, 2024
Reviewed Preprint version 2: August 7, 2024
Reviewed Preprint version 3: September 10, 2024
Version of Record published: October 9, 2024

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.94833. This DOI represents all versions, and will always resolve to the latest one.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

views: 2,985
downloads: 154
citations: 6

Views, downloads and citations are aggregated across all versions of this paper published by eLife.