Detection of new pioneer transcription factors as cell-type specific nucleosome binders

  1. Institute of Biophysics and Department of Physics, Central China Normal University, Wuhan 430079, China
  2. National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
  3. School of Life Sciences, University of Essex, Wivenhoe Park, Colchester, CO4 3SQ, UK
  4. Department of Pathology and Molecular Medicine, Queen’s University, ON, Canada
  5. Department of Biology and Molecular Sciences, Queen’s University, ON, Canada
  6. School of Computing, Queen’s University, ON, Canada
  7. Ontario Institute of Cancer Research, Toronto, ON, Canada


  • Reviewing Editor
    Xiaobing Shi
    Van Andel Institute, United States of America
  • Senior Editor
    David James
    University of Sydney, Australia

Reviewer #1 (Public Review):

Peng et al develop a computational method to predict/rank transcription factors (TFs) according to their likelihood of being pioneer transcription factors (factors that are capable of binding nucleosomes) using ChIP-seq for 225 human transcription factors, MNase-seq and DNase-seq data from five cell lines. The authors developed relatively straightforward, easy-to-interpret computational methods that leverage the potential for MNase-seq to enable relatively precise identification of the nucleosome dyad. Using an established smoothing approach and local peak identification methods to estimate dyad positions, together with identification of ChIP-seq peaks and motifs within those peaks, which they referred to as "ChIP-seq motifs", they were able to quantify "motif profiles" and their density in nucleosome regions (NRs) and nucleosome free regions (NFRs) relative to their estimated nucleosome dyad positions. Using these profiles, they arrived at an odds-ratio-based motif enrichment score along with a Fisher's exact test to assess the odds and significance that a given transcription factor's ChIP-seq motifs are enriched in NRs compared to NFRs, and hence its potential to be a pioneer transcription factor. They showed that known pioneer transcription factors had among the highest enrichment scores, and they could identify 32 relatively novel pioneer TFs with high enrichment scores and relatively high expression in their corresponding cell line. They used multiple validation approaches including (1) calculating the ROC-AUC associated with their enrichment score based on 16 known pioneer TFs among their 225 TFs, which they used as positives, and the remaining TFs (among the 225) as negatives; (2) use of the literature to note that known pioneer TFs that acted as key regulators of embryonic stem cell differentiation had the highest enrichment scores; (3) comparison of their enrichment scores to three classes of TFs defined by protein microarray and electromobility shift assays (1. strong binder to free and nucleosomal DNA, 2. weak binder to free and nucleosomal DNA, 3. strong binder to free but not nucleosomal DNA); and (4) correlation between their calculated TF motif nucleosome end/dyad binding ratio and relevant data from an NCAP-SELEX experiment. They also characterized the spatial distribution of TF motif binding relative to the dyad by (1) correlating TF motif density and nucleosome occupancy and (2) clustering TF motif binding profiles relative to their distance from the dyad and identifying 6 clusters.
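
For readers less familiar with this kind of score, the odds ratio and Fisher's exact test summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the layout of the 2x2 table (motif and non-motif counts in NRs versus NFRs), the function names, and the toy counts are all assumptions made for illustration, and the one-sided "greater" alternative is one possible choice.

```python
from math import comb

def odds_ratio(nr_motif, nr_other, nfr_motif, nfr_other):
    """Odds ratio of the 2x2 table [[nr_motif, nr_other], [nfr_motif, nfr_other]].

    Assumed (hypothetical) layout: rows are NR / NFR, columns are
    motif sites / non-motif background sites.
    """
    return (nr_motif * nfr_other) / (nr_other * nfr_motif)

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with fixed margins,
    i.e. the probability of seeing at least this many motif sites in NRs.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, col1)

# Toy counts only (not taken from the manuscript).
toy_or = odds_ratio(3, 1, 1, 3)
toy_p = fisher_exact_greater(3, 1, 1, 3)
```

An odds ratio above 1 with a small p-value would indicate enrichment of motif sites in NRs relative to NFRs under these assumptions; with genome-scale counts a library routine such as `scipy.stats.fisher_exact` would be the more practical choice.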

The strengths of this paper are the use of MNase-seq data to define relatively precise dyad positions and ChIP-seq data together with motif analysis to arrive at relatively accurate TF binding profiles relative to dyad positions in NRs as well as in NFRs. This allowed them to use a relatively simple odds-ratio-based enrichment score which performs well in identifying known pioneer TFs. Moreover, their validation approaches produced either highly significant or reasonable, trending results.

The weaknesses of the paper are relatively minor. The most significant one is that they used ROC-AUC to assess the prediction accuracy of their enrichment score on a highly imbalanced dataset with 16 positives and 209 negatives. ROC-AUC can be a misleading measure of prediction performance on highly imbalanced data. This is mitigated by the fact that they find an AUC = 0.94 for their best case. Thus, they are likely to find good results using a more appropriate performance measure for imbalanced data. Another minor point is that they did not associate their enrichment score (focus of Figure 2) with their correlation coefficients of TF motif density and nucleosome occupancy (focus of Figure 3). Finally, while the manuscript was clearly written, some parts of the Methods section could have been made clearer so that their approaches could be reproduced. The description of the NCAP-SELEX method could also have been clearer for a reader not familiar with this approach.
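
As one illustration of the imbalance point above, average precision (the area under the precision-recall curve) could be reported alongside ROC-AUC. The sketch below is a plain-Python version under stated assumptions: the function names are hypothetical, ties in the ranking are not specially handled in `average_precision`, at least one positive and one negative label are assumed, and the toy labels and scores are invented rather than taken from the manuscript.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied pairs count as half a correct ranking.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Toy example (invented values): two positives, two negatives.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
toy_ap = average_precision(toy_labels, toy_scores)

# Chance-level average precision is roughly the positive fraction,
# here 16 known pioneer TFs out of 225, versus 0.5 for ROC-AUC.
chance_ap = 16 / 225
```

Because the chance level of average precision scales with the fraction of positives, it makes the difficulty of ranking 16 positives among 225 TFs more visible than ROC-AUC does.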

Reviewer #2 (Public Review):

In this study, the authors utilize a compendium of public genomic data to identify transcription factors (TFs) that can recognize their DNA binding motifs in the presence of nucleosome-wrapped chromatin and convert the chromatin to open chromatin. This class of TFs is termed pioneer TFs (PTFs). A major strength of the study is the concept, whose premise is that motifs bound by PTFs (assessed by ChIP-seq for the respective TFs) should be present in both "closed" nucleosome-wrapped DNA regions (measured by MNase-seq) and open regions (measured by DNaseI-seq) because the PTFs are able to open the chromatin. Use of multiple ENCODE cell lines, including the H1 stem cell line, enabled the authors to assess whether binding at motifs changes from closed to open. Typical, non-PTF TFs are expected to bind motifs only in open chromatin regions (measured by DNaseI-seq) and not in regions closed in any cell type. This study contributes to the field a validation of PTFs that are already known to have pioneering activity and presents an interesting approach to quantify PTF activity.

For this reviewer, there were a few notable limitations. One was the uncertainty regarding whether expression of the respective TFs across cell types was taken into account. This would help inform whether a TF would be able to open chromatin. Another limitation was the cell types used. While it is understandable that these cell types were used, because of their deep epigenetic phenotyping and public availability, they are mostly transformed and do not bear close similarity to lineages in a healthy organism. Next, the methods used to identify PTFs were not made available in an easy-to-use tool for other researchers who may seek to identify PTFs in their cell type(s) of interest. Lastly, some terms used were not defined explicitly (e.g., the meaning of dyads), and the language in the manuscript was often difficult to follow and contained improper English grammar.

Reviewer #3 (Public Review):

Peng et al. designed a computational framework for identifying pioneer factors using epigenomic data from five cell types. The identification of pioneer factors is important for our understanding of the epigenetic and transcriptional regulation of cells. A computational approach toward this goal can significantly reduce the burden of labor-intensive experimental validation. Nevertheless, there are several caveats in the current analysis which may require some modification of the computational methods and additional analysis to maximize the confidence of the pioneer factor prediction results.

A key consideration that arises during this review is that the current analysis anchors on H1 ESC and therefore may have biased the results toward the identification of pioneer factors that are relevant to the four other differentiated cell types. The low ranking of Yamanaka factors and of known pioneer factors such as NFYs and ESRRB may be due to the setup of the computational framework. The analysis should be repeated using each cell type in turn as the anchor, both to validate the reproducibility of the pioneer factors found so far and to investigate whether TFs related to ESC identity (e.g. Yamanaka factors, NFYs and ESRRB) would show significant changes in their ranking. Given the potential cell type specificity of the pioneer factors, the extension to more cell types appears to be important for further demonstrating the utility of the computational framework.
