The computational framework to analyze the ability of transcription factors to bind to nucleosomes by integrating ChIP-seq, MNaseq-seq and DNase-seq data for motif enrichment and binding motif analysis.

Identifying pioneer transcription factors using motif enrichment analysis.

a) TF motif enrichment score is used to distinguish 32 known PTFs (Test set 1) from other other TFs. Receiver operating curve (ROC) and precision-recall (PR) curve analysis of motif enrichment scores are performed. Here nucleosome regions (NRs) were determined as genomic regions (147 bp long) centered at the representative dyad positions and the nucleosome-depleted regions (NDRs) represent genomic regions free of nucleosomes and are located in open chromatin regions. The random retrieval classifier would predict with AUC=0.5 and PR=the fraction of true positives=0.17. b) TF motif enrichment score is used to distinguish 11 known PTFs with essential roles in cell differentiation (Test set 2) from other other TFs. Here NRs in differentially open and NDRs in conserved open chromatin regions are used in enrichment analysis. The random retrieval classifier would predict with AUC=0.5 and PR= 0.04. c) Classification of pioneer transcription factors by binding motif enrichment scores. Known pioneer transcription factors from Test set 2 and Test set 3 are indicated by squares and triangles, while other TFs are shown as circles. Colors corresponds to false discovery rate (FDR) q-values. Mann–Whitney U tests are performed under the null hypothesis that PTF’s mean values of enrichment scores are equal to canonical TFs. * - p-value < 0.05; ** - p-value < 0.005. d) Binding motif profile of TFs with the highest and lowest motif enrichment scores (ranked at the top or bottom 20% among all TFs). The number of motifs for each TF is normalized within the range between 0 and 1 as follows: X(i)normalized) = (X(i) -Xmin)/(Xmax -Xmin), X(i) is the number of sequences which have TF binding sites at the ith base pair from the nucleosomal dyad position; Xmax and Xmin represent the maximal and minimal counts of sequence fragments. e) Comparison of the enrichment score of TFs in different clusters identified from recent EMSA experiments19. Only 13 transcription factors could be found in EMSA and our data set. Cluster 1 and 2 include strong binders to both naked DNA and nucleosomal DNA and weak binders to both naked DNA and nucleosomal DNA (only one TF). Cluster 3: strong binders to naked DNA but weak binders to nucleosomal DNA.

Association between the binding motif profiles and nucleosome occupancy.

a) Motif profiles for ZKSCAN1 in HeLa-S3 cell line is shown as an example. b) Pearson correlation coefficients between motif profiles and nucleosome occupancy values for each TF (n=225) (the median value of correlation coefficient = - 0.46). c) Binding motif profiles of TFs with positive (red, correlation coefficient >= 0.2 and p-value < 0.05) or negative correlation coefficients (black, correlation coefficient <= -0.4 and p-value < 0.05) between motif profile and nucleosome occupancy. Dashed lines correspond to the average of binding motif profiles. d) Comparison of binding motif profiles of CTCF between H1 embryonic stem cell line and somatic cell lines.

Clusters of TF binding motif profiles on nucleosomal DNA. Binding motif profiles centered at nucleosomal dyad locations (+/- 60 base pair from dyad) are clustered using k-medoids clustering with k=6. The entry/exit regions of nucleosomal DNA were excluded as a certain nucleotide bias exists around the ends of nucleosomal DNA reads produced by the MNase-seq experiments. Binding motif profiles between two symmetrical nucleosomal halves are combined for each TF. The black line represents the averaged profiles of all TFs in the same cluster. Cluster members with silhouette width <= 0.25 were considered as outliers and removed.