Schematic of the epigenome editing prediction pipeline.

The pipeline uses epigenetic data to train models to predict endogenous gene expression. These models were used to predict fold-change in gene expression based on perturbed histone PTM input data, and their predictions were validated using CRISPR-Cas9-based epigenome editing data.

Metagene plots show histone PTMs are consistent across cell types and recapitulate established relationships between histone PTMs and gene expression.

Colors represent genes binned into quartiles by expression within a cell type: blue, 75-100%; orange, 50-75%; green, 25-50%; red, 0-25%. The y-axis represents −log10(p-value) obtained from ChIP-seq data.
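The within-cell-type binning described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `assign_expression_quartiles`, the toy expression values, and the handling of ties and bin boundaries are all hypothetical.

```python
def assign_expression_quartiles(expression):
    """Map each gene to a quartile label by within-cell-type expression rank.

    `expression` is a dict {gene_name: expression_value} for ONE cell type.
    Labels follow the caption: "0-25%", "25-50%", "50-75%", "75-100%".
    """
    labels = ["0-25%", "25-50%", "50-75%", "75-100%"]
    # Sort genes from lowest to highest expression (ties broken by name,
    # which is an arbitrary choice made for this sketch).
    ordered = sorted(expression, key=lambda g: (expression[g], g))
    n = len(ordered)
    bins = {}
    for position, gene in enumerate(ordered):
        # position / n lies in [0, 1); scaling by 4 selects the quartile.
        bins[gene] = labels[min(3, (position * 4) // n)]
    return bins


# Hypothetical expression values for eight genes in a single cell type.
toy = {"g1": 0.1, "g2": 5.0, "g3": 2.0, "g4": 9.5,
       "g5": 0.0, "g6": 3.3, "g7": 7.1, "g8": 1.2}
quartiles = assign_expression_quartiles(toy)
```

Binning is done per cell type, so the same gene can fall into different quartiles in different cell types.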

Histone PTMs accurately predict endogenous gene expression.

(A) Spearman correlation on genes from held-out chromosomes for different input context lengths, with all cell types pooled together. The blue curve is the mean across 10 computational replicates of CNNs and the red curve is the mean across 10 computational replicates of ridge regression. Shaded areas represent the standard deviation in the Spearman correlation across the 10 computational replicates. (B) Spearman correlation on genes of cell types held out during training. The bars represent the mean across 10 computational replicates and the error bars represent the corresponding standard deviations. (C) Distribution of Spearman correlations across genes, computed for each gene in the test chromosomes by comparing predictions with observed expression across the 13 cell types. The different curves represent the 10 computational replicates for each model type.
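As an illustration of the per-gene, cross-cell-type metric in panel C, the sketch below computes a Spearman correlation between observed and predicted expression for one gene. It is a dependency-free sketch, not the authors' implementation; the helper names and the five toy values (standing in for the 13 cell types) are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at sorted position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, i.e. Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when one vector is constant
    return cov / (sx * sy)


# Hypothetical values for one gene across five cell types.
observed = [1.0, 2.5, 0.3, 4.0, 3.1]
predicted = [0.8, 2.0, 0.5, 3.9, 3.5]
rho = spearman(observed, predicted)
```

Because the metric depends only on ranks, a model can score well for a gene even if its predictions are offset or rescaled, as long as it orders the cell types correctly.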

Features learned by gene expression models.

Each point on the x-axis corresponds to an in silico perturbation of that assay at that position, and the y-axis shows the predicted fold-change in gene expression, averaged across a set of 100 trained models and across 500 randomly chosen genes.
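The perturbation scan described above can be sketched roughly as follows. Several details are assumptions made for illustration and are not stated in the caption: the perturbation is modeled as adding a fixed `delta` to one bin, fold-change is taken as the ratio of perturbed to unperturbed predictions, and the average is a plain arithmetic mean. The toy models are hypothetical stand-ins for trained networks.

```python
def perturbation_scan(models, genes, assay, delta=1.0):
    """Mean predicted fold-change for perturbing `assay` at each position.

    models: list of callables mapping an input {assay: [signal per bin]}
            to a positive predicted expression value (stand-ins for
            trained networks).
    genes:  list of such inputs, all with the same number of bins.
    delta:  amount added to the signal at the perturbed bin (an assumption).
    """
    n_bins = len(genes[0][assay])
    scan = []
    for pos in range(n_bins):
        fold_changes = []
        for model in models:
            for gene in genes:
                # Copy the input so the original gene data is not modified.
                perturbed = {a: list(v) for a, v in gene.items()}
                perturbed[assay][pos] += delta
                fold_changes.append(model(perturbed) / model(gene))
        # Average over all (model, gene) pairs for this position.
        scan.append(sum(fold_changes) / len(fold_changes))
    return scan


def make_toy_model(weights):
    """Hypothetical model: 1 + weighted sum of the H3K27ac signal."""
    return lambda x: 1.0 + sum(w * s for w, s in zip(weights, x["H3K27ac"]))


models = [make_toy_model([0.0, 1.0, 0.0]), make_toy_model([0.0, 3.0, 0.0])]
genes = [{"H3K27ac": [0.0, 0.0, 0.0]}]
scan = perturbation_scan(models, genes, "H3K27ac")
```

In this toy setup only the middle bin carries weight, so the scan peaks there and equals 1 (no change) elsewhere.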

dCas9-p300 epigenome editing at eight endogenous genes identifies gene-specific responses.

The genes tested are CYP17A1, SOX11, C2CD4B, CXCR4, CD79A, TGFBR1, MYO1G, and PRSS12.

(A) gRNAs targeting +/− 250 bp of each gene were selected. (B) The selected gRNAs were individually co-transfected with dCas9-p300, and relative mRNA levels were determined by qPCR. (C) Relative mRNA levels for each selected guide position are displayed, with the highest-activating guide position marked in orange. The y-axis corresponds to qPCR fold-change.

In silico model for dCas9-p300-based epigenome editing.

(A) dCas9-p300 is more likely to bind to a position not occupied by a nucleosome. A thicker green arrow represents a higher probability of binding for a gRNA targeting that site. (B) The in silico perturbation is modeled as a Gaussian kernel parameterized by a standard deviation, σ, and the amount of H3K27ac deposited, λ. (C) The final perturbed H3K27ac is obtained by point-wise multiplication of the Gaussian kernel with nucleosome occupancy, quantified by MNase activity, since dCas9-p300 can only acetylate histones within nucleosomes. (D) Ranks for predicted and endogenous expression across 8 genes and 13 cell types. Rank 1 corresponds to the highest numerical value. (E) Ranks for predicted and empirically measured expression fold-changes following perturbation by dCas9-p300 for 8 genes in HEK293T cells. Rank 1 corresponds to the highest numerical value.
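Panels B to E can be illustrated with a short sketch. This is not the authors' implementation: the caption does not specify the kernel normalization, so an unnormalized Gaussian with peak height λ is assumed here, and the function names, bin layout, and occupancy values are hypothetical.

```python
import math


def gaussian_kernel(n_bins, center, sigma, lam):
    """Gaussian bump with peak height `lam` at bin `center` (unnormalized)."""
    return [lam * math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2))
            for i in range(n_bins)]


def perturbed_h3k27ac(occupancy, center, sigma, lam):
    """Point-wise product of the Gaussian kernel and nucleosome occupancy."""
    kernel = gaussian_kernel(len(occupancy), center, sigma, lam)
    return [k * o for k, o in zip(kernel, occupancy)]


def descending_ranks(values):
    """Rank 1 = highest value, as in panels D and E (ties not handled)."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


# Hypothetical MNase-derived occupancy over five bins; bin 2 is nucleosome-free.
occupancy = [1.0, 0.5, 0.0, 0.5, 1.0]
signal = perturbed_h3k27ac(occupancy, center=2, sigma=1.0, lam=2.0)
```

Although the kernel peaks at the targeted bin, the product is zero there because the bin has no nucleosome to acetylate; the deposited signal instead appears on the flanking, occupied bins.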

ChIP-seq −log10(p-values) were obtained from the ENCODE Imputation Challenge where ground truth data were available (entries labeled T in the table). Where ground truth data were not available, Avocado imputations were downloaded from the ENCODE data portal (entries labeled A in the table).

Endogenous expression of the genes for which we generated dCas9-p300 epigenome editing data indicates that genes with high fold-change are more likely to have low endogenous expression in HEK293T cells. The cross-cell-type Spearman correlation provides a metric to assess how accurate our CNN model predictions are for any given gene across the 13 cell types.