Figures and data

Schematic of the epigenome editing prediction pipeline.
The pipeline uses epigenetic data to train models to predict endogenous gene expression. These models were used to predict fold-change in gene expression based on perturbed histone PTM input data, and their predictions were validated using CRISPR-Cas9-based epigenome editing data.

Metagene plots show histone PTMs are consistent across cell types and recapitulate established relationships between histone PTMs and gene expression.
Colors represent genes binned into quantiles based on gene expression. Blue 75-100%, Orange 50-75%, Green 25-50%, Red 0-25% of gene expression within a cell type. The y-axis represents − log10(p-value) obtained from ChIP-seq data.

Metagene plots for different cell types for uncorrected ChIP-seq data across gene expression quantiles.
Blue is the highest and red is the lowest gene expression quantile. * represents data from HEK293 and (A) represents Avocado imputed data.

S3norm-based approach for correcting ChIP-seq − log10(p-values).
In the left panel, the p-values of the target cell type’s ChIP-seq data, which are to be corrected, are plotted on the Y-axis, while the ChIP-seq data for the reference cell type, chosen to be IMR-90, are shown on the X-axis. After correction with this procedure, the resulting corrected p-values are shown on the Y-axis of the right panel.

Metagene plots for different cell types for batch effect corrected ChIP-seq data across gene expression quantiles.
Blue is the highest and red is the lowest gene expression quantile. * represents data from HEK293 and (A) represents Avocado imputed data.

Histone PTMs accurately predict endogenous gene expression.
(A) Spearman correlation on genes from held out chromosomes for different input context lengths, with all cell types pooled together. Blue curve is the mean across 10 computational replicates of CNNs and the red is the mean across 10 computational replicates of ridge regression. Shaded area represents standard deviation in the Spearman correlation across the 10 computational replicates. (B) Spearman correlation on genes of cell types held out during training. The bar plots represent the mean across 10 computational replicates and the error bars represent the corresponding standard deviations. (C) Distribution of Spearman correlations across genes, computed for each gene in test chromosomes by comparing predictions across the 13 cell types. The different curves represent 10 computational replicates for each model type.

Spearman correlation distribution across all cell types, for each cell type.
Each panel corresponds to a different assay where the epigenetic data for that assay in chromosome 17 (which is part of the test dataset) is considered.

Endogenous RNA-seq expression levels of HEK293 and HEK293T cell lines are highly concordant.
Spearman correlations between TPM values from RNA-seq datasets of two biological replicates of the HEK293T cell line (with SRA accessions shown in parentheses) are on par with Spearman correlation with RNA-seq TPM values for the HEK293 cell line.

Benchmarking of models predicting gene expression from histone marks.
AUROC values were computed after converting gene expression values into high and low values using the median gene expression in the training dataset as the threshold.
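The thresholding step described above can be illustrated with a minimal sketch. It assumes expression values and model predictions are held in plain NumPy arrays; the function names and the toy numbers are hypothetical and are not taken from the benchmarked models. AUROC is computed here by direct pairwise comparison, which is equivalent to the usual rank-based definition.

```python
import numpy as np

def binarize_by_training_median(train_expr, test_expr):
    """Label test genes high (1) or low (0) using the training-set median."""
    threshold = np.median(train_expr)
    return (np.asarray(test_expr) > threshold).astype(int)

def auroc(labels, scores):
    """Probability that a high gene outscores a low gene (ties count 0.5)."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one gene in each class")
    # Compare every high-expression gene against every low-expression gene.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Toy values (hypothetical): training expression, test expression, predictions.
train_expr = np.array([0.1, 0.5, 1.2, 2.0, 3.3])
test_expr = np.array([0.2, 0.9, 1.5, 2.8])
test_pred = np.array([0.3, 1.4, 1.1, 2.5])

labels = binarize_by_training_median(train_expr, test_expr)
score = auroc(labels, test_pred)
print(labels, score)
```

Taking the threshold from the training data rather than the test data keeps the class definition independent of the genes being evaluated.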

Features learned by gene expression models.
Each point on the x-axis corresponds to in silico perturbation of that assay at that position and the y-axis measures the predicted fold-change in gene expression, averaged across a set of 100 trained models. The fold-changes were averaged across 500 randomly chosen genes.

Features learned by gene expression models for H3K9me3 in K562.
Each point on the X-axis corresponds to in silico perturbation of H3K9me3 at that position and the Y-axis measures the predicted fold-change in gene expression, averaged across a set of 100 trained models. The fold-changes were averaged across 500 randomly chosen genes. This is a zoomed-in version of the subplot in Figure 4 corresponding to H3K9me3 in K562.

dCas9-p300 epigenome editing at eight endogenous genes identifies gene-specific responses.
The genes tested are CYP17A1, SOX11, C2CD4B, CXCR4, CD79A, TGFBR1, MYO1G, and PRSS12.
(A) gRNAs targeting +/− 250 bp of each gene were selected. (B) The selected gRNAs were individually co-transfected with dCas9-p300, and relative mRNA was determined by qPCR. (C) Relative mRNA levels for each selected guide position are displayed, with the highest-activating guide position marked in orange. The Y-axis corresponds to qPCR fold-change.

Transfection efficiency is shared across experiments.
This figure shows consistent transfection efficiency across multiple gene targets. Histograms show the distribution of fluorescent signal intensity, indicating the percentage of cells (right) successfully transfected with the reporter construct containing mCherry-p300. We selected 2 gRNAs (gRNA1 and gRNA2) for 2 gene targets (CYP17A1 and SOX11) and a scramble gRNA to measure the transfection efficiency. An average transfection efficiency of 17% was achieved across the different samples with no transfection in the untreated cells.

In silico model for dCas9-p300-based epigenome editing.
(A) dCas9-p300 is more likely to bind to a position not occupied by the nucleosome. Thicker green arrow represents higher probability of binding for a gRNA targeting that site. (B) The in silico perturbation is modeled as a Gaussian kernel parameterized by a standard deviation, σ, and the amount of H3K27ac deposited, λ. (C) The final perturbed H3K27ac is obtained by point-wise multiplication of the Gaussian kernel with nucleosome occupancy quantified by MNase activity since dCas9-p300 can only acetylate histones within nucleosomes. (D) Ranks for predicted and endogenous expression across 8 genes and 13 cell types. Rank 1 corresponds to the highest numerical value. (E) Ranks for predicted and empirically measured expression fold-changes following perturbation by dCas9-p300 for 8 genes in HEK293T cells. Rank 1 corresponds to the highest numerical value.
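Panels (B) and (C) can be sketched in a few lines, under stated assumptions. The caption specifies only a Gaussian kernel with standard deviation σ and deposited amount λ, multiplied point-wise by nucleosome occupancy. The sketch below additionally assumes that λ sets the peak height of the kernel, that positions are indexed in bins, and that the gated kernel is added to the baseline H3K27ac track. The function name and all numbers are hypothetical.

```python
import numpy as np

def perturb_h3k27ac(h3k27ac, occupancy, grna_index, sigma, lam):
    """Add a Gaussian bump centred on the gRNA, gated by nucleosome occupancy."""
    h3k27ac = np.asarray(h3k27ac, dtype=float)
    occupancy = np.asarray(occupancy, dtype=float)
    positions = np.arange(h3k27ac.size)
    # Gaussian kernel with peak height lam (assumption) and width sigma.
    kernel = lam * np.exp(-0.5 * ((positions - grna_index) / sigma) ** 2)
    # Point-wise gating: no acetylation is deposited where occupancy is zero.
    # Adding the result to the baseline track is an assumption of this sketch.
    return h3k27ac + kernel * occupancy

baseline = np.zeros(5)
occupancy = np.array([1.0, 0.0, 1.0, 1.0, 0.0])  # toy MNase-derived occupancy
perturbed = perturb_h3k27ac(baseline, occupancy, grna_index=2, sigma=1.0, lam=2.0)
print(perturbed)
```

In this toy case the bins with zero occupancy stay at baseline even though they lie within the kernel, which is the behavior the point-wise multiplication is meant to capture.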

Elevation of H3K27ac levels is similar across quantified regions following dCas9-p300 gRNA targeting.
Each colored line corresponds to a gRNA targeting proximal to CXCR4 and TGFBR1 in HEK293T cells. The X-axis represents the distance between gRNA and the CUT&RUN amplicon. The Y-axis represents H3K27ac fold enrichment estimated through CUT&RUN.

Gene-wise predicted vs experimental gene expression TPM ranks.
Each dot corresponds to a cell type and the title of each plot shows the Spearman correlation and the corresponding p-values. Rank 1 corresponds to the highest numerical value.

Gene-wise predicted vs experimental fold-change ranks.
Each dot corresponds to a gRNA targeting a locus near the TSS of the gene (each gRNA corresponds to at least three replicates, and hence the fold-change shown here is the experimental mean). Rank 1 corresponds to the highest numerical value.

Predicted vs experimental fold-change ranks.
Each dot corresponds to a gRNA targeting a locus near the TSS of the gene. Rank 1 corresponds to the highest numerical value.

ChIP-seq − log10(p-values) were obtained from the ENCODE Imputation Challenge where the ground truth data were available (corresponding to entries labeled T in the table).
Avocado imputations were downloaded from the ENCODE data portal, where ground truth data were not available (corresponding to entries labeled A in the table).

Endogenous gene expression of genes for which we generated dCas9-p300 epigenome editing data indicates that genes for which high fold-change was obtained are more likely to have low endogenous gene expression in HEK293T.
Cross-cell type Spearman correlation provides a metric to assess how accurate our CNN model predictions are for any given gene across the 13 cell types.
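As an illustration of the cross-cell type metric, the following minimal sketch computes one Spearman correlation per gene by comparing predicted and measured expression across cell types. It assumes both are stored as arrays of shape (genes, cell types); the function names and toy values are hypothetical. Spearman correlation is computed as the Pearson correlation of average ranks.

```python
import numpy as np

def average_ranks(x):
    """Rank values from 1..n, assigning tied values their mean rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(x.size, dtype=float)
    ranks[order] = np.arange(1, x.size + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def per_gene_spearman(predicted, observed):
    """Spearman correlation across cell types, computed separately per gene."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    result = np.empty(predicted.shape[0])
    for g in range(predicted.shape[0]):
        # A gene with constant values across cell types yields NaN here.
        result[g] = np.corrcoef(average_ranks(predicted[g]),
                                average_ranks(observed[g]))[0, 1]
    return result

# Toy example (hypothetical): 2 genes measured in 4 cell types.
predicted = np.array([[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]])
observed = np.array([[10.0, 20.0, 30.0, 40.0], [1.0, 2.0, 3.0, 4.0]])
rho = per_gene_spearman(predicted, observed)
print(rho)
```

Because the correlation is computed within each gene, it measures whether a model orders cell types correctly for that gene, which is a different question from ordering genes within one cell type.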