Predicting the effect of CRISPR-Cas9-based epigenome editing

  1. Sanjit Singh Batra
  2. Alan Cabrera
  3. Jeffrey P Spence
  4. Jacob Goell
  5. Selvalakshmi S Anand
  6. Isaac B Hilton (corresponding author)
  7. Yun S Song (corresponding author)
  1. Computer Science Division, University of California, United States
  2. Department of Bioengineering, Rice University, United States
  3. Department of Genetics, Stanford University, United States
  4. Systems, Synthetic, and Physical Biology Graduate Program, Rice University, United States
  5. Department of Statistics, University of California, Berkeley, United States
6 figures, 2 tables and 1 additional file

Figures

Figure 1
Schematic of the epigenome editing prediction pipeline.

The pipeline uses epigenetic data to train models to predict endogenous gene expression. These models were used to predict fold-change in gene expression based on perturbed histone PTM input data, and their predictions were validated using CRISPR-Cas9-based epigenome editing data.

Figure 2 with 3 supplements
Metagene plots show histone post-translational modifications (PTMs) are consistent across cell types and recapitulate established relationships between histone PTMs and gene expression.

Colors represent genes binned into quantiles based on gene expression. Blue 75–100%, orange 50–75%, green 25–50%, and red 0–25% of gene expression within a cell type. The y-axis represents log10(p-value) obtained from ChIP-seq data.
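The quantile binning described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the rank-based tie handling, and the per-bin averaging of ChIP-seq signal are assumptions made for the example.

```python
def expression_quartile_bins(expression):
    """Assign each gene to a quartile bin by expression within one cell type.

    Bin 0 is the lowest 25% of genes and bin 3 is the highest 25%.
    Binning by rank (ties keep input order) is an assumption of this sketch.
    """
    n = len(expression)
    order = sorted(range(n), key=lambda i: expression[i])
    bins = [0] * n
    for rank, gene_idx in enumerate(order):
        # rank / n lies in [0, 1); scaling by 4 maps it to bins 0..3.
        bins[gene_idx] = min(3, (rank * 4) // n)
    return bins


def metagene_profile(signals, bins, which_bin):
    """Average the per-position signal over all genes assigned to one bin."""
    rows = [s for s, b in zip(signals, bins) if b == which_bin]
    if not rows:
        raise ValueError("no genes in the requested bin")
    length = len(rows[0])
    return [sum(r[p] for r in rows) / len(rows) for p in range(length)]
```

Plotting `metagene_profile` for bins 3, 2, 1, and 0 would correspond to the blue, orange, green, and red curves, respectively.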

Figure 2—figure supplement 1
Metagene plots for different cell types for uncorrected ChIP-seq data across gene expression quantiles.

Blue is the highest and red is the lowest gene expression quantile. ∗ represents data from HEK293 and (A) represents Avocado imputed data.

Figure 2—figure supplement 2
S3norm-based approach for correcting ChIP-seq log10(p-values).

In the left panel, the p-values of the target cell type’s ChIP-seq data, which are to be corrected, are plotted on the Y-axis, and the ChIP-seq data for the reference cell type, chosen to be IMR-90, are plotted on the X-axis. The right panel shows the corrected p-values on the Y-axis after applying this procedure.

Figure 2—figure supplement 3
Metagene plots for different cell types for batch effect-corrected ChIP-seq data across gene expression quantiles.

Blue is the highest and red is the lowest gene expression quantile. ∗ represents data from HEK293 and (A) represents Avocado imputed data.

Figure 3 with 3 supplements
Histone post-translational modifications (PTMs) accurately predict endogenous gene expression.

(A) Spearman correlation on genes from held-out chromosomes for different input context lengths, with all cell types pooled together. The blue curve is the mean across 10 computational replicates of convolutional neural networks (CNNs), and the red is the mean across 10 computational replicates of ridge regression. Shaded area represents standard deviation in the Spearman correlation across the 10 computational replicates. (B) Spearman correlation on genes of cell types held out during training. The bar plots represent the mean across 10 computational replicates, and the error bars represent the corresponding standard deviations. (C) Distribution of Spearman correlations across genes, computed for each gene in test chromosomes by comparing predictions across the 13 cell types. The different curves represent 10 computational replicates for each model type.

Figure 3—figure supplement 1
Spearman correlation distribution across all cell types, for each cell type.

Each panel corresponds to a different assay where the epigenetic data for that assay in chromosome 17 (which is part of the test dataset) is considered.

Figure 3—figure supplement 2
Endogenous RNA-seq expression levels of HEK293 and HEK293T cell lines are highly concordant.

Spearman correlations between transcripts per million (TPM) values from RNA-seq datasets of two biological replicates of the HEK293T cell line (with SRA accessions shown in parentheses) are on par with Spearman correlation with RNA-seq TPM values for the HEK293 cell line.

Figure 3—figure supplement 3
Benchmarking of models predicting gene expression from histone marks.

AUROC values were computed after converting gene expression values into high and low values using the median gene expression in the training dataset as the threshold.
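The binarization and AUROC computation described here can be sketched as below. The function names are hypothetical, and treating values strictly above the training median as "high" is an assumption; the caption does not say how values equal to the median are handled.

```python
from statistics import median


def binarize_by_training_median(train_expression, test_expression):
    """Label test genes 1 (high) or 0 (low) using the training-set median.

    Assumption: values strictly above the median count as high.
    """
    threshold = median(train_expression)
    return [1 if x > threshold else 0 for x in test_expression]


def auroc(labels, scores):
    """AUROC as the probability that a random high-expression gene scores
    above a random low-expression gene, with ties counted as one half.

    Quadratic in the number of genes, which is fine for a sketch.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one gene in each class")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```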

Figure 4 with 1 supplement
Features learned by gene expression models.

Each point on the X-axis corresponds to in silico perturbation of that assay at that position, and the Y-axis measures the predicted fold-change in gene expression, averaged across a set of 100 trained models. The fold-changes were averaged across 500 randomly chosen genes.
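A position-by-position perturbation scan of this kind could look like the sketch below. Everything about the interface is hypothetical: models are plain callables returning a positive predicted expression, the perturbation adds a fixed `delta` to one bin, and fold-change is taken as the ratio of perturbed to unperturbed prediction, averaged arithmetically over models and genes.

```python
def perturbation_scan(models, genes, assay, delta):
    """Mean predicted fold-change for perturbing `assay` at each position.

    models: callables mapping a gene input (dict: assay name -> list of
            per-bin values) to a positive predicted expression (assumed).
    genes:  list of such input dicts, all with tracks of equal length.
    delta:  amount added to the chosen bin (assumed additive perturbation).
    """
    n_pos = len(genes[0][assay])
    scan = []
    for pos in range(n_pos):
        total = 0.0
        for gene in genes:
            # Copy the tracks so the original input is left untouched.
            perturbed = {name: list(track) for name, track in gene.items()}
            perturbed[assay][pos] += delta
            for model in models:
                total += model(perturbed) / model(gene)
        scan.append(total / (len(models) * len(genes)))
    return scan
```

In the figure, the analogous average runs over 100 trained models and 500 randomly chosen genes.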

Figure 4—figure supplement 1
Features learned by gene expression models for H3K9me3 in K562.

Each point on the X-axis corresponds to in silico perturbation of H3K9me3 at that position and the Y-axis measures the predicted fold-change in gene expression, averaged across a set of 100 trained models. The fold-changes were averaged across 500 randomly chosen genes. This is a zoomed-in version of the subplot in Figure 4 corresponding to H3K9me3 in K562.

Figure 5 with 1 supplement
dCas9-p300 epigenome editing at eight endogenous genes identifies gene-specific responses.

The genes tested are CYP17A1, SOX11, C2CD4B, CXCR4, CD79A, TGFBR1, MYO1G, and PRSS12. (A) gRNAs (n=5) targeting ±250 bp of each gene were selected. (B) The selected gRNAs were individually co-transfected with dCas9-p300, and relative mRNA was determined by qPCR. (C) Relative mRNA for each selected guide position is displayed, with the highest-activating guide position marked in orange. The Y-axis corresponds to qPCR fold-change.

Figure 5—figure supplement 1
Transfection efficiency is shared across experiments.

This figure shows consistent transfection efficiency across multiple gene targets. Histograms show the distribution of fluorescent signal intensity, indicating the percentage of cells (right) successfully transfected with the reporter construct containing mCherry-p300. We selected two gRNAs (gRNA1 and gRNA2) for two gene targets (CYP17A1 and SOX11) and a scramble gRNA to measure the transfection efficiency. An average transfection efficiency of 17% was achieved across the different samples with no transfection in the untreated cells.

Figure 6 with 4 supplements
In silico model for dCas9-p300-based epigenome editing.

(A) dCas9-p300 is more likely to bind to a position not occupied by the nucleosome. Thicker green arrow represents higher probability of binding for a gRNA targeting that site. (B) The in silico perturbation is modeled as a Gaussian kernel parameterized by a standard deviation, σ, and the amount of H3K27ac deposited, λ. (C) The final perturbed H3K27ac is obtained by point-wise multiplication of the Gaussian kernel with nucleosome occupancy quantified by MNase activity since dCas9-p300 can only acetylate histones within nucleosomes. (D) Ranks for predicted and endogenous expression across 8 genes and 13 cell types. Rank 1 corresponds to the highest numerical value. (E) Ranks for predicted and empirically measured expression fold-changes following perturbation by dCas9-p300 for eight genes in HEK293T cells. Rank 1 corresponds to the highest numerical value.
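Panels B and C can be illustrated with the following sketch. It assumes an unnormalized Gaussian whose peak height is λ, positions measured in bins, and a deposit that is added to the baseline H3K27ac track; the caption specifies only the Gaussian kernel, its two parameters, and the point-wise multiplication with nucleosome occupancy, so the rest is an assumption of this example.

```python
import math


def perturb_h3k27ac(h3k27ac, occupancy, guide_pos, sigma, lam):
    """Return a perturbed H3K27ac track for one gRNA.

    h3k27ac:   baseline per-bin H3K27ac signal.
    occupancy: per-bin nucleosome occupancy (e.g. from MNase data).
    guide_pos: index of the bin targeted by the gRNA.
    sigma:     standard deviation of the Gaussian kernel, in bins (assumed unit).
    lam:       amount of H3K27ac deposited at the kernel peak (assumed scaling).
    """
    if len(h3k27ac) != len(occupancy):
        raise ValueError("tracks must have the same length")
    perturbed = []
    for i, (signal, occ) in enumerate(zip(h3k27ac, occupancy)):
        kernel = lam * math.exp(-((i - guide_pos) ** 2) / (2.0 * sigma ** 2))
        # Point-wise product with occupancy: no nucleosome, no deposit.
        # Adding the deposit to the baseline is an assumption of this sketch.
        perturbed.append(signal + kernel * occ)
    return perturbed
```

With this form, a bin with zero occupancy keeps its baseline value no matter how close it is to the gRNA.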

Figure 6—source data 1

Raw qPCR data.

Each row contains an individual measurement, along with the information needed to generate the in silico perturbation and compare the measurement with model predictions. Columns include guide information (gRNA position and coordinates) and gene information (orientation and coordinates).

https://cdn.elifesciences.org/articles/92991/elife-92991-fig6-data1-v1.xlsx
Figure 6—source data 2

Raw CUT&RUN qPCR data.

This table includes measurements with the corresponding sgRNA used and its distance from the TSS. It also includes gene information and the distance from the amplicon centerpoint to the TSS.

https://cdn.elifesciences.org/articles/92991/elife-92991-fig6-data2-v1.xlsx
Figure 6—source data 3

Primer sequences, sources, assay use, and corresponding direction.

For CUT&RUN primers, the genomic coordinates of the regions they amplify are reported.

https://cdn.elifesciences.org/articles/92991/elife-92991-fig6-data3-v1.xlsx
Figure 6—figure supplement 1
Elevation of H3K27ac levels is similar across quantified regions following gRNA-directed dCas9-p300 targeting.

Each colored line corresponds to a gRNA targeting proximal to CXCR4 and TGFBR1 in HEK293T cells. The X-axis represents the distance between gRNA and the CUT&RUN amplicon. The Y-axis represents H3K27ac fold enrichment estimated through CUT&RUN.

Figure 6—figure supplement 2
Gene-wise predicted vs. experimental gene expression transcripts per million (TPM) ranks.

Each dot corresponds to a cell type, and the title of each plot shows the Spearman correlation and the corresponding p-values. Rank 1 corresponds to the highest numerical value.

Figure 6—figure supplement 3
Gene-wise predicted vs. experimental fold-change ranks.

Each dot corresponds to a gRNA targeting a locus near the TSS of the gene (each gRNA corresponds to at least three replicates and hence the fold-change shown here is the experimental mean). Rank 1 corresponds to the highest numerical value.

Figure 6—figure supplement 4
Predicted vs. experimental fold-change ranks.

Each dot corresponds to a gRNA targeting a locus near the TSS of the gene. Rank 1 corresponds to the highest numerical value.

Tables

Appendix 1—table 1
ChIP-seq log10(p-values) were obtained from the ENCODE Imputation Challenge where the ground truth data were available (corresponding to entries labeled T in the table).

Avocado imputations were downloaded from the ENCODE data portal, where ground truth data were not available (corresponding to entries labeled A in the table).

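Cell type | polyA Plus RNA-seq | H3K36me3 | H3K27me3 | H3K27ac | H3K4me1 | H3K4me3 | H3K9me3
IMR-90 | T | T | T | T | T | T | T
H1-hESC | T | T | T | T | T | T | T
Trophoblast cell | T | T | T | T | T | T | T
Neural stem progenitor cell | T | T | T | T | T | T | T
K562 | T | T | T | T | T | T | T
Heart left ventricle | T | T | T | T | T | T | T
Adrenal gland | T | T | T | T | T | T | T
Endocrine pancreas | T | T | T | T | T | T | T
Peripheral blood mononuclear cell | T | T | T | T | T | T | T
Amnion | T | T | T | T | T | T | T
Myoepithelial cell of mammary gland | T | T | T | A | T | T | T
Chorion | T | T | T | T | T | T | A
HEK293 | T | T | A | T | T | T | T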
Appendix 1—table 2
Endogenous gene expression of genes for which we generated dCas9-p300 epigenome editing data indicates that genes for which high fold-change was obtained are more likely to have low endogenous gene expression in HEK293T.

Cross-cell type Spearman provides a metric to assess how accurate our CNN model predictions are, on any given gene, across the 13 cell types.

Gene | HEK293 (SRR3997504) TPM | HEK293T (SRR13341848) TPM | HEK293T (SRR15013784) TPM | Maximum fold-change in dCas9-p300 data | Cross-cell type Spearman
PRSS12 | 12.710 | 8.448 | 6.910 | 2.380 | 0.896
CXCR4 | 11.974 | 2.826 | 8.216 | 5.365 | 0.852
TGFBR1 | 0.725 | 3.254 | 8.029 | 3.675 | 0.689
C2CD4B | 0.306 | 0.000 | 0.000 | 591.312 | 0.726
CD79A | 0.280 | 0.207 | 0.127 | 127.094 | 0.364
SOX11 | 0.051 | 0.131 | 0.209 | 14.245 | 0.846
MYO1G | 0.000 | 0.016 | 0.000 | 37.948 | 0.621
CYP17A1 | 0.000 | 0.000 | 0.000 | 6,549.110 | 0.397
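A per-gene cross-cell type Spearman correlation, as reported in the last column, can be computed along the following lines. This is a generic sketch using only the standard library (ranks with ties averaged, then Pearson correlation of the ranks); it is not taken from the authors' code, and the function names are illustrative.

```python
import math


def average_ranks(values):
    """Rank values in ascending order (1 = smallest); ties share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(predicted, observed):
    """Spearman correlation between one gene's predicted and observed
    expression across cell types; returns NaN if either input is constant."""
    rx, ry = average_ranks(predicted), average_ranks(observed)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0.0 or vy == 0.0:
        return float("nan")
    return cov / math.sqrt(vx * vy)
```

The figures rank so that rank 1 is the highest value, whereas this sketch ranks ascending; reversing the rank direction for both variables leaves the Spearman correlation unchanged.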

Additional files

Cite this article

  1. Sanjit Singh Batra
  2. Alan Cabrera
  3. Jeffrey P Spence
  4. Jacob Goell
  5. Selvalakshmi S Anand
  6. Isaac B Hilton
  7. Yun S Song
(2026)
Predicting the effect of CRISPR-Cas9-based epigenome editing
eLife 12:RP92991.
https://doi.org/10.7554/eLife.92991.4