Graphical description of the identification of cell-type specific marker peaks and reference ATAC-Seq profiles included in the EPIC-ATAC framework.

1) 564 ATAC-Seq samples from pure sorted cell populations were collected to build reference profiles for non-malignant cell types observed in the tumor microenvironment. 2) Cell-type specific marker peaks were identified using differential accessibility analysis. 3) Markers with previously observed chromatin accessibility in healthy human tissues were excluded. 4) For tumor bulk deconvolution, the set of remaining marker peaks was refined by selecting markers with correlated behavior in tumor bulk samples. 5) The cell-type specific marker peaks and reference profiles were integrated in the EPIC-ATAC framework to perform bulk ATAC-Seq deconvolution.

© 2024, BioRender Inc. Any parts of this image created with BioRender are not made available under the same license as the Reviewed Preprint, and are © 2024, BioRender Inc.

ATAC-Seq data from sorted cell populations reveal cell-type specific marker peaks and reference profiles.

A) Number of samples collected for each cell type. The colors correspond to the different studies of origin. B) Representation of the collected samples using the first three components of the PCA based on the PBMC markers (left) and TME markers (right). Colors correspond to cell types. C) Scaled averaged chromatin accessibility of all the cell-type specific marker peaks (PBMC and TME markers) (rows) in each cell type (columns) in the ATAC-Seq reference samples used to identify the marker peaks. D) Scaled averaged chromatin accessibility of all the marker peaks in external ATAC-Seq data from samples of pure cell types excluded from the reference samples (see Material and Methods). E) Scaled averaged chromatin accessibility of all the marker peaks in an external scATAC-Seq dataset (Human Atlas (K. Zhang et al., 2021)). F) Distribution of the marker peak distances to the nearest transcription start site (TSS) (left panel) and the ChIPseeker annotations (right panel). G) Significance (-log10(q-value)) of the pathway enrichment tests (columns) obtained using ChIP-Enrich on each set of cell-type specific marker peaks (rows). A subset of relevant enriched pathways is represented. Colors of the pathway names correspond to the cell types in which the pathways were found to be enriched. When a pathway was significantly enriched in more than one set of peaks, its name is written in bold.
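
The "scaled averaged chromatin accessibility" shown in panels C to E can be illustrated with a minimal sketch. It assumes that the signal of a peak is first averaged over the samples of each cell type and then z-scored across cell types; the caption does not state the exact scaling used, so the z-score and the function names below are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def average_by_cell_type(peak_signal, sample_cell_types):
    """Average the signal of one peak over the samples of each cell type."""
    groups = {}
    for value, cell_type in zip(peak_signal, sample_cell_types):
        groups.setdefault(cell_type, []).append(value)
    return {cell_type: mean(values) for cell_type, values in groups.items()}


def scale_across_cell_types(averages):
    """Z-score one peak across cell types (assumed scaling, not confirmed by the source)."""
    values = list(averages.values())
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        # A peak with identical signal in every cell type carries no contrast.
        return {cell_type: 0.0 for cell_type in averages}
    return {cell_type: (value - mu) / sd for cell_type, value in averages.items()}


# Hypothetical signal of one marker peak in four sorted samples.
signal = [4.0, 6.0, 1.0, 1.0]
cell_types = ["B", "B", "T", "T"]
averaged = average_by_cell_type(signal, cell_types)
scaled = scale_across_cell_types(averaged)
```

Applying the two functions to every marker peak would give one row of the heatmap per peak and one column per cell type.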

List of nearest genes and enriched CBPs reported in the PanglaoDB or CellMarker databases.

EPIC-ATAC accurately estimates immune cell fractions in PBMC ATAC-Seq samples.

A) Schematic description of the experiment designed to validate the ATAC-Seq deconvolution on PBMC samples. B) Comparison between cell-type proportions predicted by EPIC-ATAC and the true proportions in the PBMC bulk dataset. Symbols correspond to donors. C) Comparison between the proportions of cell-types predicted by EPIC-ATAC and the true proportions in the PBMC pseudobulk dataset. Symbols correspond to pseudobulks. D) Pearson correlation (left) and RMSE (right) values obtained by each deconvolution tool on the PBMC bulk dataset. The EPIC-ATAC results are highlighted in red. E) Pearson correlation (left) and RMSE (right) values obtained by each deconvolution tool on the PBMC pseudobulk dataset.
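
The two accuracy metrics reported in panels D and E can be sketched as follows. This is a minimal, dependency-free illustration; the function names and the example proportions are hypothetical and are not taken from the EPIC-ATAC code.

```python
import math


def pearson(pred, true):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(pred)
    mean_pred = sum(pred) / n
    mean_true = sum(true) / n
    cov = sum((p - mean_pred) * (t - mean_true) for p, t in zip(pred, true))
    sd_pred = math.sqrt(sum((p - mean_pred) ** 2 for p in pred))
    sd_true = math.sqrt(sum((t - mean_true) ** 2 for t in true))
    return cov / (sd_pred * sd_true)


def rmse(pred, true):
    """Root mean squared error between predicted and true proportions."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))


# Hypothetical true and predicted proportions for four cell types in one sample.
true_props = [0.10, 0.20, 0.30, 0.40]
pred_props = [0.12, 0.18, 0.33, 0.37]
r = pearson(pred_props, true_props)
error = rmse(pred_props, true_props)
```

A high correlation with a large RMSE would indicate that a tool ranks samples correctly but is biased in absolute terms, which is why both metrics are reported side by side.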

EPIC-ATAC accurately predicts fractions of cancer and non-malignant cells in tumor samples.

A) Comparison between cell-type proportions estimated by EPIC-ATAC and true proportions for the basal cell carcinoma (top), gynecological (middle) and HTAN (bottom) pseudobulk datasets. Symbols correspond to pseudobulks. B) Pearson’s correlation and RMSE values obtained for the deconvolution tools included in the benchmark. EPIC-ATAC is highlighted in red. C) Same analyses as in panel B, with the uncharacterized cell population excluded from the evaluation of the prediction accuracy. The predicted and true proportions of the immune, stromal and vascular cell types were rescaled to sum to 1.
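
The rescaling described for panel C can be sketched as follows: the uncharacterized population is dropped and the remaining proportions are divided by their sum. The function name, the key used for the uncharacterized cells and the example values are hypothetical.

```python
def rescale_without(proportions, excluded="uncharacterized"):
    """Drop one population and rescale the remaining proportions to sum to 1."""
    kept = {name: value for name, value in proportions.items() if name != excluded}
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("remaining proportions must have a positive sum")
    return {name: value / total for name, value in kept.items()}


# Hypothetical predicted proportions for one tumor pseudobulk.
sample = {
    "T cells": 0.2,
    "B cells": 0.1,
    "fibroblasts": 0.1,
    "uncharacterized": 0.6,
}
rescaled = rescale_without(sample)
```

The same function would be applied to both the predicted and the true proportions before computing the correlation and RMSE.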

Accuracy of ATAC-Seq deconvolution is determined by the abundance and specificity of each cell type.

A) Correlations (top) and RMSE (middle) between EPIC-ATAC predictions and true cell-type proportions for each cell type. True proportions are also shown for each cell type (bottom). Colors correspond to different datasets. B) Comparison of the proportions estimated by EPIC-ATAC and the true proportions for PBMC samples (PBMC experiment and PBMC pseudobulk samples combined) (top) and the basal cell carcinoma pseudobulks (bottom). Predictions of the proportions of CD4+ and CD8+ T-cells were obtained using the reference profiles based on the major cell types, and the subtype predictions were obtained using the reference profiles that include the T-cell subtypes. C) Pearson’s correlation values obtained by EPIC-ATAC for each cell type.

EPIC-ATAC accurately infers the immune contexture in a bulk ATAC-Seq breast cancer cohort.

A) Proportions of different cell types predicted by EPIC-ATAC in each sample as a function of the average ATAC signal at the cell-type specific CREs used by Kumegawa et al. (Kumegawa et al., 2023) to infer the level of cell-type infiltration in the tumor samples. B) Proportions of different cell types predicted by EPIC-ATAC in the samples stratified based on two breast cancer subtypes. C) Proportions of different cell types predicted by EPIC-ATAC in the samples stratified based on three ER+/HER2- subgroups. Wilcoxon test p-values are represented at the top of the boxplots.

EPIC-ATAC and EPIC RNA-seq based deconvolution have similar accuracy and can complement each other.

A-B) Pearson’s correlation (left) and RMSE (right) values comparing the proportions predicted by the ATAC-Seq deconvolution, the RNA-Seq deconvolution and the GA-based RNA deconvolution with the true cell-type proportions in 100 pseudobulks simulated from the 10x multiome PBMC dataset (10x Genomics, 2021) (panel A) and in the pseudobulks generated from the HTAN cohorts (panel B). Dots correspond to outlier pseudobulks. C) Left panel: Schematic description of the dataset from Morandini et al. (Morandini et al., 2024) composed of matched bulk ATAC-Seq, RNA-Seq and flow cytometry data from PBMC samples. Right panel: Pearson’s correlation (left) and RMSE values (right) obtained by EPIC-ATAC, EPIC-RNA and EPIC-ATAC/RNA (i.e., averaged predictions of EPIC-ATAC and EPIC-RNA) in each cell type separately or all cell types together (columns).
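
The EPIC-ATAC/RNA predictions in panel C are described as the average of the EPIC-ATAC and EPIC-RNA predictions. A minimal sketch of that averaging is given below. It assumes a simple per-cell-type mean restricted to the cell types predicted by both modalities, with no renormalization; these details, the function name and the example values are assumptions.

```python
def average_predictions(atac, rna):
    """Average two sets of predicted proportions over their shared cell types."""
    shared = [cell_type for cell_type in atac if cell_type in rna]
    return {cell_type: (atac[cell_type] + rna[cell_type]) / 2 for cell_type in shared}


# Hypothetical predictions for one PBMC sample from each modality.
atac_pred = {"B cells": 0.2, "T cells": 0.6, "NK cells": 0.2}
rna_pred = {"B cells": 0.4, "T cells": 0.4, "Monocytes": 0.2}
combined = average_predictions(atac_pred, rna_pred)
```

Cell types predicted by only one modality are left out of the combined result in this sketch, so any comparison with the true proportions would be restricted to the shared cell types.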

A) Chromatin accessibility signal of each marker peak (rows) in each reference sample and in the ENCODE samples from diverse tissues (columns). B) Pairwise scatterplots of the first 3 axes of the PCA run on the PBMC markers (top) and TME markers (bottom).

Left panel: Proportions of each cell type in the PBMC experiment samples. Shapes match the points shapes from Figure 3B in the main text. Right panel: Chromatin accessibility signal (log scale) of the cell-type specific marker peaks (rows) in each sample (columns) of the PBMC experiment dataset.

A) Comparison of the cell-type proportions estimated by the different deconvolution methods included in the benchmark (y axis) and the true proportions (x axis) in the PBMC experiment samples (top panel) and in the PBMC pseudobulk samples (bottom panel). B) Comparison of the cell-type proportions estimated by CIBERSORTx and DeconPeaker using reference profiles built by each tool using our collection of pure ATAC-Seq samples and the true proportions in the PBMC experiment samples (top panel) and in the PBMC pseudobulk samples (bottom panel). C) Pearson’s correlation coefficient and RMSE values associated with the comparison of the cell-type proportions estimated by each deconvolution tool in each cell type of the PBMC datasets. Colors correspond to cell types. Note that for quanTIseq, gray dots correspond to uncharacterized cells since the prediction of an uncharacterized cell type cannot be turned off using the quanTIseq function from the quantiseqr R package. P-values resulting from a paired Wilcoxon test are represented for each pairwise comparison between tools. Significance levels are represented as follows: *: p < 0.05, **: p < 0.01, ***: p < 0.001.
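
The significance levels listed above map a p-value to a star annotation. A minimal sketch of that mapping follows. The thresholds come from the caption, whereas the function name and the empty string returned for non-significant p-values are assumptions.

```python
def significance_stars(p_value):
    """Map a p-value to the star annotation described in the caption."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    # Assumption: non-significant comparisons carry no annotation.
    return ""


# Hypothetical p-values from pairwise comparisons between tools.
labels = [significance_stars(p) for p in (0.2, 0.04, 0.005, 0.0001)]
```

The thresholds are strict inequalities, so a p-value of exactly 0.05 is left without a star.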

Comparison of the cell-type proportions estimated by EPIC-ATAC (y axis) and the true proportions (x axis) in the HTAN dataset.

Comparison of the cell-type proportions estimated by the different deconvolution methods using our reference profiles and the true proportions in the basal cell carcinoma samples (top panel), in the gynecological cancer samples (middle panel) and in the HTAN samples (bottom panel). In each panel, the top row includes the uncharacterized cells while the bottom row excludes them.

Comparison of the cell-type proportions estimated by CIBERSORTx and DeconPeaker using reference profiles built by each tool using our collection of pure ATAC-Seq samples and the true proportions in the basal cell carcinoma samples (panel A), in the gynecological cancer samples (panel B) and in the HTAN samples (panel C). In each panel, the top row includes the uncharacterized cells while the bottom row excludes them.

Top panel: RMSE values associated with the comparison of the true proportions and the cell-type proportions estimated by each deconvolution tool in each cell type. For each cell type, all samples from each cancer pseudobulk dataset were considered, i.e., colors correspond to cell types and shapes to datasets. P-values resulting from a paired Wilcoxon test are represented for each pairwise comparison between tools. Significance levels are represented as follows: *: p < 0.05, **: p < 0.01, ***: p < 0.001. Bottom panel: Since not all deconvolution tools can predict the proportion of uncharacterized cells, the same analysis as in the top panel was performed without considering the predictions for the uncharacterized cells.

CPU time in seconds needed to run each tool on the benchmarking datasets.

Pearson’s correlation (left panel) and RMSE values (right panel) associated with the comparison of the cell-type proportions estimated by each deconvolution tool (rows) included in the benchmark and the true cell-type proportions in the PBMC samples. The comparisons were made in each cell type (columns) separately. Significance levels of the correlations are represented as follows: *: p < 0.05, **: p < 0.01, ***: p < 0.001.

Pearson’s correlation (left panel) and RMSE values (right panel) associated with the comparison of the cell-type proportions estimated by each deconvolution tool (rows) included in the benchmark and the true cell-type proportions. The comparisons were made in each cell type (columns) separately. Panels A, B and C correspond to the benchmarks performed on the basal cell carcinoma dataset (A), the gynecological cancer dataset (B) and the HTAN dataset (C). Significance levels of the correlations are represented as follows: *: p < 0.05, **: p < 0.01, ***: p < 0.001.

A) Scaled averaged chromatin accessibility of each cell-type specific marker peak in each cell type, based on the ATAC-Seq reference samples used to identify the marker peaks. B) Scaled averaged chromatin accessibility of each cell-type specific marker peak in each cell type, based on the sorted samples excluded from the reference samples (see Material and Methods).

Left panels: Scatter plots representing proportions estimated by the different deconvolution tools included in the benchmark (y-axis) as a function of the true proportions (x-axis) for PBMC samples (top row) and the basal cell carcinoma pseudobulks (bottom row). Predictions for CD4+ and CD8+ T-cells were obtained using the reference profiles based on the major cell types, and predictions for the CD4+ and CD8+ T-cell subtypes were obtained using the reference profiles that include the T-cell subtypes. Right panels: Barplots representing Pearson’s correlation values (x-axis) obtained from the comparison of the predictions and true proportions in each cell type (y-axis).

Top panel: Proportions of each myeloid cell type predicted by EPIC-ATAC in each sample as a function of the average ATAC signal at the myeloid specific CREs used by Kumegawa and colleagues to infer the level of infiltration in the tumor samples. Middle panel: Proportions of different cell types predicted by EPIC-ATAC in the samples stratified based on two breast cancer subtypes. Bottom panel: Proportions of different cell types predicted by EPIC-ATAC in the samples stratified based on three ER+/HER2- subgroups. Wilcoxon test p-values are represented at the top of the boxplots.