Interpretable Protein-DNA Interactions Captured by Structure-Sequence Optimization

Yafan Zhang; Irene Silvernail; Zhuyang Lin; Xingcheng Lin

doi:10.7554/eLife.105565.1

eLife Assessment

This valuable work presents an interpretable protein-DNA Energy Associative (IDEA) model for predicting binding sites and affinities of DNA-binding proteins. The study provides a detailed description of the method, making it reproducible. However, the generalizability of the prediction model presents certain concerns, and the supporting evidence appears incomplete. Nonetheless, with a thorough re-examination of the training and testing procedures, this model can be widely applicable for predicting genome-wide protein-DNA binding sites.

https://doi.org/10.7554/eLife.105565.1.sa3

Significance of findings

valuable: Findings that have theoretical or practical implications for a subfield

Strength of evidence

incomplete: Main claims are only partially supported

During the peer-review process the editor and reviewers write an eLife assessment that summarises the significance of the findings reported in the article (on a scale ranging from landmark to useful) and the strength of the evidence (on a scale ranging from exceptional to inadequate).

Abstract

Sequence-specific DNA recognition underlies essential processes in gene regulation, yet methods for simultaneous prediction of genomic DNA recognition sites and their binding affinity remain lacking. Here, we present the Interpretable protein-DNA Energy Associative (IDEA) model, a residue-level, interpretable biophysical model capable of predicting binding sites and affinities of DNA-binding proteins. By fusing structures and sequences of known protein-DNA complexes into an optimized energy model, IDEA enables direct interpretation of physicochemical interactions among individual amino acids and nucleotides. We demonstrate that this energy model can accurately predict DNA recognition sites and their binding strengths across various protein families. Additionally, the IDEA model is integrated into a coarse-grained simulation framework that quantitatively captures the absolute protein-DNA binding free energies. Overall, IDEA provides an integrated computational platform alleviating experimental costs and biases in assessing DNA recognition and can be utilized for mechanistic studies of various DNA-recognition processes.

Introduction

Gene regulation is essential for controlling the timing and extent of gene expression in cells. It guides important processes from development to disease response.¹ Gene expression is tightly controlled by various DNA-binding proteins, including transcription factors (TFs) and epigenetic regulators.^2–4 TFs initiate gene expression by binding to specific DNA sequences, during which time a primary RNA transcript is synthesized from a gene’s DNA.⁵ On the other hand, epigenetic regulators control gene expression by binding to specific chromatin regions, which spread post-translational modifications that modulate 3D genome organization.⁶ Therefore, characterizing the DNA-interaction specificities of DNA-binding proteins is critical for understanding the molecular mechanisms underlying many DNA-templated processes.⁷

To establish a comprehensive understanding of DNA binding processes, various experimental technologies have been developed and performed.^8–16 These methods, such as Chromatin Immunoprecipitation followed by sequencing (ChIP-Seq),^8,9,17 Protein-binding microarray (PBM),^18–20 and Systematic Evolution of Ligands by Exponential Enrichment (SELEX),^15,16,21 have proven invaluable for measuring DNA-specific protein recognition. Nonetheless, due to the need for protein-specific antibodies,¹⁷ as well as the cost and intrinsic bias associated with these experiments,²² high-throughput measurement of large numbers of protein-DNA variants remains challenging.

Computational methods complement experimental efforts by providing the initial filter for assessing protein-DNA binding specificity. Numerous methods have emerged to enable predictions of binding sites and affinities of DNA-binding proteins.^23–30 These methods often use machine-learning-based training to extract sequence preference information from DNA or protein using experimental high-throughput (HT) assays,^23–27 and therefore rely on the availability and quality of experimental binding assays. Additionally, many approaches employ deep neural networks,^28–30 which could obscure the interpretation of interaction patterns governing protein-DNA binding specificities. Understanding these patterns, however, is crucial for elucidating the molecular mechanisms underlying various DNA-recognition processes, such as those seen in TFs.³¹

More than 5000 protein-DNA 3D structures, including TF-DNA complexes, have now been published.^32,33 These data provide invaluable resources for understanding the physicochemical properties of protein-DNA binding patterns, extending beyond mere sequence information. Using these data enables the training of a model for predicting the binding affinities of protein-DNA interactions. The robustness of evolution^34–37 offers an approach to extract the sequence-structure relation embedded in these available complexes and thereby learn an interpretable binding energy landscape that governs the recognition processes of those DNA-binding proteins.

Here, we introduce the Interpretable protein-DNA Energy Associative (IDEA) model, a predictive model that learns protein-DNA physicochemical interactions by fusing available biophysical structures and their associated sequences into an optimized energy model (Figure 1). We show that the model can be used to accurately predict the sequence-specific DNA binding affinities of DNA-binding proteins and is transferable across the same protein superfamily. Moreover, the model can be enhanced by incorporating experimental binding data and can be generalized to enable base-pair resolution predictions of genomic DNA-binding sites. Notably, IDEA learns the interaction matrix between each amino acid and nucleotide, allowing for direct interpretation of the “molecular grammar” governing the binding specificities of DNA-binding proteins. This interpretable energy model is further integrated into a simulation framework, facilitating mechanistic studies of various biomolecular functions involving protein-DNA dynamics.

Results

Residue-Level Protein-DNA Energy Model for Predicting Protein-DNA Recognition Specificities

IDEA is a coarse-grained biophysical model at the residue resolution for investigating the binding interactions between protein and DNA (Fig. 1). It leverages both structures and corresponding sequences of known protein-DNA complexes to learn an interpretable energy model based on the interacting amino acids and nucleotides at the protein-DNA interface. The model was trained using available protein-DNA complexes curated from existing databases.^32,38 Unlike existing deep-learning-based protein-DNA binding prediction models, IDEA aims to learn a physicochemical-based energy model that quantitatively characterizes sequence-specific interactions between amino acids and nucleotides, thereby interpreting the “molecular grammar” driving the binding energetics of protein-DNA interactions. The optimized energy model can be used to predict the binding affinity of given protein-DNA pairs based on their structures and sequences. Additionally, it enables the prediction of genomic DNA binding sites by a given protein, such as a transcription factor. Finally, the learned energy model is integrated into a simulation framework that can be used to investigate the molecular mechanisms underlying various DNA-templated activities, such as initiation of gene transcription,³⁹ epigenetic regulation of gene expression,⁴ and DNA replication.⁴⁰ Further details of the optimization protocol are provided in Methods Section Energy Model Optimization.

Figure 1. Overview of the IDEA protocol.
The protein–DNA complex, represented by the human MAX–DNA complex structure (PDB ID: 1HLO), was used for training the IDEA model. The sequences of the protein and DNA residues that form close contacts (highlighted in blue) in the structure were included in the training dataset. In addition, a series of synthetic decoy sequences were generated by randomizing the contacting residues in both the protein and DNA sequences. The amino acid–nucleotide energy model was then optimized by maximizing the ratio of the binding energy gap (δE) between protein and DNA in the native complex and the decoy complexes relative to the energy variance (ΔE). The optimized energy model can be used for multiple predictive applications, including the evaluation of binding free energies for various protein–DNA sequence pairs, prediction of genomic DNA binding sites by transcription factors or other DNA-binding proteins, and integration into a sequence-specific, residue-resolution simulation framework for dynamic simulations.

IDEA Accurately Predicts Protein-DNA Binding Specificity

We first examined the predictive accuracy of IDEA by comparing its predicted TF-DNA binding affinity with experimental measurements.^13,14 We focused on the MAX TF, a basic helix-loop-helix (bHLH) TF with the most comprehensive experimental binding data. The binding affinity of MAX to various DNA sequences has been quantified by multiple experimental platforms, including different SELEX variants^15,16,21 and the microfluidic-based MITOMI assay.^13,41 Among them, the MITOMI assay quantitatively measured the binding affinities of MAX TF to a comprehensive set of 255 DNA sequences with mutations in the enhancer box (E-box) motif, a consensus MAX DNA binding sequence, and can thus serve as a reference for benchmarking our model predictions. Our de novo prediction based on one MAX crystal structure (PDB ID: 1HLO) correlates well with the experimental values (Pearson Correlation 0.67, Fig. 2A). Including additional human-associated MAX-DNA complex structures and their associated sequences slightly improves the correlation (Pearson Correlation 0.68, Fig. S1), albeit not significantly, likely due to the structural similarity of protein-DNA interfaces across all MAX proteins. Prior work has shown that incorporating experimental binding data is instrumental in improving predicted protein-nucleic acid binding affinities.^27,42 Inspired by these approaches, we developed an enhanced predictive protocol that integrates additional experimental binding data if available (see the Methods Section Enhanced Modeling Prediction with SELEX Data). Encouragingly, when the model was trained with the SELEX-seq data²⁷ of MAX TF, IDEA showed a significant improvement in reproducing the MITOMI measurement (Pearson Correlation 0.79, Fig. S2).
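The benchmark above compares predicted and measured binding free energies through a Pearson correlation. The following is a minimal sketch of that comparison, assuming the predictions are first expressed as ΔΔG relative to a reference (wild-type) sequence; the function names `ddg` and `pearson` and the toy numbers are illustrative and are not taken from the IDEA code.

```python
import numpy as np

def ddg(energies, wild_type_index=0):
    """Binding free energies relative to the wild-type entry (assumed reference)."""
    e = np.asarray(energies, dtype=float)
    return e - e[wild_type_index]

def pearson(x, y):
    """Pearson correlation coefficient between two 1D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

# Toy data (hypothetical): predicted energies in reduced units and
# measured energies in physical units for four DNA variants.
predicted = ddg([-5.0, -4.0, -3.5, -1.0])
measured = np.array([0.0, 0.6, 0.8, 2.1])
r = pearson(predicted, measured)
```

Because the Pearson coefficient is unchanged when either variable is multiplied by a positive constant or shifted, it can be evaluated even when the predicted energies carry an undetermined overall scale.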

Figure 2. Results for MAX-based predictions.
(A) The binding free energy calculated by IDEA, trained using one MAX–DNA complex (PDB ID: 1HLO), correlates well with experimentally measured MAX–DNA binding free energy.¹³ ΔΔG represents the changes in binding free energy relative to that of the wild-type protein–DNA complex. (B) The optimized energy model reveals an amino acid–nucleotide interaction pattern governing MAX–DNA recognition. The predicted binding free energies and optimized energy model are presented in reduced units, as explained in the Methods. (C) The 3D structure of the MAX–DNA complex (zoomed in with different views) highlights important amino acid–nucleotide contacts at the protein–DNA interface, where several DNA cytosines (red spheres) form close contacts with arginine (blue spheres). (D) Probability density distribution of strong and weak binders for the protein ZBTB7A. The mean of each distribution is marked with a dashed line. (E) The AUC score for each protein–DNA pair, calculated based on the predicted probability distributions.

IDEA Decodes the Molecular Grammar Governing Protein-DNA Interactions

The IDEA protocol learns the physicochemical interactions that determine protein-DNA binding by utilizing the sequence-structure relationship embedded in the experimental protein-DNA structures. Such a physicochemical interaction pattern can be interpreted directly from the learned energy model. To illustrate this, we focused on the energy model learned from the MAX-DNA complexes (Fig. 2B). Notably, DNA Cytosine (DC) exhibited strong interactions with Arginine (R), consistent with the fact that the E-box region (CACGTG) frequently attracts the positively charged residues of bHLH TFs.⁴³ Importantly, Arginine was in close contact with Cytosine in the crystal structure (Fig. 2C), consistent with the strong DC-R interactions shown in the learned energy model. Upon integrating the SELEX data into our model training, we found that the improved energy model shows additional unfavorable interactions between glutamic acid (E) and all DNA nucleotides, consistent with their negative charges (Fig. S3). Thus, including more experimental data can boost IDEA’s predictive accuracy by refining the amino-acid-nucleotide interaction energy model to better align with physical principles.

IDEA Generalizes across Various Protein Families

To examine IDEA’s predictive accuracy across different DNA-binding protein families, we applied it to calculate protein-DNA binding affinities using a comprehensive HT-SELEX dataset.²⁶ We focused on evaluating the capability of IDEA to distinguish strong binders from weak binders for each protein with experimentally determined structures. We calculated the probability density distribution of the top and bottom binders identified in the SELEX experiment. A well-separated distribution indicates the successful identification of strong binders by IDEA (Figure 2D and Fig. S4). Receiver Operating Characteristic (ROC) analysis was performed to calculate the Area Under the Curve (AUC) score for these predictions. Further details are provided in Methods Section Evaluation of IDEA prediction using HT-SELEX data. Our analysis shows that for 80% of the proteins across 14 protein families, IDEA successfully differentiates strong binders from weak ones, with an AUC score greater than 0.5. We further performed 10-fold cross-validation on the binding affinities of the protein-DNA pairs in this dataset and found that IDEA outperforms state-of-the-art protein-DNA models for cases with available experimentally determined protein-DNA complex structures (Fig. S5).
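The AUC used to separate strong from weak binders can be illustrated with a short sketch. It assumes that lower predicted energy means stronger binding and computes the ROC AUC through its rank interpretation (the fraction of strong/weak pairs ordered correctly, with ties counted as one half). The function name `auc_strong_vs_weak` and the example energies are hypothetical; the exact evaluation protocol is described in the Methods section cited above.

```python
import numpy as np

def auc_strong_vs_weak(e_strong, e_weak):
    """ROC AUC for separating strong from weak binders by predicted energy.

    Equals the probability that a randomly chosen strong binder has a lower
    predicted energy than a randomly chosen weak binder (ties count 1/2).
    """
    s = np.asarray(e_strong, dtype=float)[:, None]   # column: strong binders
    w = np.asarray(e_weak, dtype=float)[None, :]     # row: weak binders
    wins = (s < w).sum() + 0.5 * (s == w).sum()
    return float(wins / (s.size * w.size))

# Toy energies in reduced units (hypothetical values).
auc = auc_strong_vs_weak([-3.1, -2.4, -1.9], [-2.0, -0.7, 0.3])
```

An AUC of 0.5 corresponds to no separation between the two distributions, which is why values above 0.5 are read as successful discrimination.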

We also applied IDEA to calculate the binding affinity of additional TFs with available MITOMI measurements.⁴¹ IDEA’s predicted binding affinity of Pho4 protein, another bHLH transcription factor, correlates well with the experimental measurement (Pearson Correlation 0.6, Fig. S6A) when trained on the only available protein-DNA structure (PDB ID: 1A0A). We further evaluated IDEA’s predictive performance for the zinc-finger protein Zif268, which has a different structure from the bHLH TFs and thus requires a different atom to represent the DNA nucleotides (Table S1). Due to limited experimental data for all the possible DNA sequence combinations, we focused on testing IDEA’s predictions on point-mutated DNA sequences, which is known to be a challenging task due to their minor deviation from the wild-type sequence. Despite the challenge, IDEA achieves accurate predictions, with a Pearson Correlation of 0.57 (Fig. S6B). We further expanded the training dataset to include all the Zif268 structures and their associated sequences from the same CATH zinc finger superfamily (CATH ID: 3.30.160.60). Incorporating these additional training data further improves the predictive accuracy (Pearson Correlation 0.63, Fig. S7).

IDEA Demonstrates Transferability across Proteins in the Same CATH Superfamily

Since IDEA relies on the sequence-structure relationship of given protein-DNA complexes to reach predictive accuracy, we inquired whether the trained energy model from one protein-DNA complex could be generalized to predict the binding specificities of other complexes.

To test this, we assessed the transferability of IDEA predictions across all 11 structurally available protein-DNA complexes within the MAX TF-associated CATH superfamily (CATH ID: 4.10.280.10, Helix-loop-helix DNA-binding domain). We trained IDEA based on each of these 11 complexes and then used the trained model to predict the MAX-based MITOMI binding affinity. Our results show that IDEA generally makes correct predictions of the binding affinity when trained on proteins that are homologous to MAX, with Pearson Correlation coefficients larger than 0.5 (Fig. 3A).

Figure 3. IDEA prediction shows transferability within the same CATH superfamily.
(A) The predicted MAX binding specificity, trained on other protein-DNA complexes within the same protein CATH superfamily, correlates well with experimental measurement. The proteins are ordered by their probability of being homologous to the MAX protein, determined using HHpred.⁴⁴ Training with a homologous protein (determined as a hit by HHpred) usually leads to better predictive performance (Pearson Correlation coefficient > 0.5) compared to non-homologous proteins. (B) Structural alignment between 1HLO (white) and 1A0A (blue), two protein-DNA complexes within the same CATH Helix-loop-helix superfamily. The alignment was performed based on the E-box region of the DNA.⁴⁵ (C) The optimized energy model for 1A0A, a protein-DNA complex structure of the transcription factor Pho4 and DNA, with 33.41% probability of being homologous to the MAX protein. The optimized energy model is presented in reduced units, as explained in the Methods.

The transferability of IDEA within the same CATH superfamily can be understood from the similar structural interfaces among different DNA-binding proteins, which determines a similar learned energy model. For example, the Pho4 protein (PDB ID: 1A0A) shares a highly similar DNA-binding interface with the MAX protein (PDB ID: 1HLO) (Fig. 3B), despite having a 33.41% probability of being homologous. Consequently, the energy model derived from the Pho4-DNA complex (Fig. 3C) exhibits a similar amino-acid-nucleotide interactive pattern as that learned from the MAX-DNA complex (Fig. 2B).

Identification of Genomic Protein-DNA Binding Sites

The genomic locations of DNA binding sites are causally related to major cellular processes.¹⁷ Although multiple techniques have been developed to enable high-throughput mapping of genomic protein-DNA binding locations, such as ChIP-seq,^9,17,46 DAP-seq,¹⁰ and FAIRE-seq,⁴⁷ challenges remain for precisely pinpointing protein-binding sites at a base-pair resolution. Furthermore, the applicability of these techniques for different DNA-binding proteins is restricted by the quality of antibody designs.¹⁷ Therefore, a predictive computational framework would significantly reduce the costs and accelerate the identification of genomic binding sites of DNA-binding proteins.

We incorporated IDEA into a protocol to predict the genomic protein-DNA binding sites at the base-pair resolution, given a DNA-binding protein and a genomic DNA sequence. To evaluate the predictive accuracy of this protocol, we utilized publicly available ChIP-seq data for MAX TF binding in GM12878 lymphoblastoid cell lines.⁴⁸ As the experimental measurements were conducted at a resolution of 420 base pairs, we averaged our modeling prediction over a window spanning 500 base pairs. Since the experimental signals are sparsely distributed across the genome, we focused our prediction on the 1 Mb region of Chromosome 1 that has the densest and most reliable ChIP-seq signals (156000 kb to 157000 kb, representative window shown in Fig. 4A top). Lower predicted binding free energy corresponds to stronger protein-DNA binding affinity, indicating higher-probability binding sites.

Figure 4. IDEA accurately identifies genome-wide protein-binding sites.
(A) The IDEA-predicted MAX transcription factor binding sites on the lymphoblastoid cell chromosome (bottom) correlate well with ChIP-seq measurements (top), shown for a representative 1 Mb region of chromosome 1 where ChIP-seq signals are densest. For visualization purposes, a 1 kb resolution was used to plot the predicted normalized Z scores, with highly probable binding sites represented as red peaks. (B) AUC score for prediction accuracy based on the normalized Z scores, averaged over a 500 bp window to match the experimental resolution of 420 bp.

The predicted binding free energies were further normalized by those of the randomized decoy sequences (see the Methods Section Processing of ChIP-seq Data for details). The identified MAX-binding sites are marked in red, with a normalized Z score < −0.75. Our prediction successfully identifies the majority of the MAX binding sites, achieving an AUC score of 0.81 (Fig. 4B). We also tested IDEA’s predictive performance on two additional typical DNA-binding proteins and successfully identified their genomic binding sites (Fig. S8). Therefore, IDEA provides an accurate initial prediction of genomic recognition sites of DNA-binding proteins based only on DNA sequences, facilitating more focused experimental efforts. Furthermore, IDEA holds the potential to identify binding sites beyond the experimental resolution and can aid in the interpretation of other sequencing techniques to uncover additional factors that influence genome-wide DNA binding, such as chromatin accessibility.^49–51
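The normalization and thresholding described above can be sketched as follows. This is an illustration under stated assumptions, not the published pipeline: the Z score is assumed to be the predicted energy minus the decoy mean, divided by the decoy standard deviation; the 500 bp averaging is assumed to use non-overlapping windows; and the order of averaging and thresholding is a choice made here. The function names are hypothetical.

```python
import numpy as np

def z_scores(site_energies, decoy_energies):
    """Normalize per-position energies by the decoy mean and standard deviation."""
    decoys = np.asarray(decoy_energies, dtype=float)
    sites = np.asarray(site_energies, dtype=float)
    return (sites - decoys.mean()) / decoys.std()

def window_average(values, window=500):
    """Average over consecutive non-overlapping windows (trailing remainder dropped)."""
    values = np.asarray(values, dtype=float)
    n = len(values) // window
    return values[: n * window].reshape(n, window).mean(axis=1)

def call_sites(z, threshold=-0.75):
    """Indices whose normalized Z score falls below the threshold."""
    return np.flatnonzero(np.asarray(z) < threshold)

# Toy example with random numbers standing in for predicted energies.
rng = np.random.default_rng(0)
energies = rng.normal(size=2000)      # one value per base-pair position
decoys = rng.normal(size=1000)        # energies of randomized decoy sequences
z_windowed = window_average(z_scores(energies, decoys), window=500)
sites = call_sites(z_windowed)
```

Lower Z scores mark positions whose predicted binding energy is more favorable than that of typical randomized sequences, which is why the cutoff is applied from below.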

Integration of IDEA into a Residue-Resolution Simulation Model Captures Protein-DNA Binding Dynamics

The trained residue-level model can be incorporated into a simulation model, thus enabling investigations into dynamic interactions between protein and DNA for mechanistic studies. Coarse-grained protein-nucleic acid models have shown strong advantages in investigating large biomolecular systems.^52–56 Notably, the structure-based protein model⁵⁷ and 3-Site-Per-Nucleotide (3SPN) DNA model⁵⁸ have been developed at the residue resolution to enable efficient sampling of biomolecular systems involving protein and DNA. A combination of these two models has proved instrumental in quantitatively studying multiple large chromatin systems,⁵⁹ including nucleosomal DNA unwrapping,^60–62 interactions between nucleosomes,^63,64 large chromatin organizations,^64–66 and its modulation by chromatin factors.^67–69 Although both the protein and DNA force fields have been systematically optimized,^58,70–73 the non-bonded interactions between protein and DNA were primarily guided by Debye-Hückel treatment of electrostatic interactions, which did not consider sequence-specific interactions between individual protein amino acids and DNA nucleotides.

To refine this simulation model by including protein-DNA interaction specificity, we incorporated the IDEA-optimized protein-DNA energy model into the coarse-grained simulation model as a short-range van der Waals (VdW) interaction in the form of a tanh function (see Methods Section Sequence Specific Protein-DNA Simulation Model for modeling details). This refined simulation model is used to simulate the dynamic interactions between DNA-binding proteins and their target DNA sequences. As a demonstration of the working principle of this modeling approach, we selected 9 TF-DNA complexes whose binding affinities were measured by the same experimental group.⁷⁴ We applied IDEA to learn an energy model based on these 9 protein-DNA complexes, which leads to a strong correlation between the predicted binding affinity and the experimental measurements (Pearson Correlation coefficient 0.72, Fig. 5A). The learned energy model was incorporated into the simulation model, which was used to simulate the binding process between the protein and DNA target using umbrella sampling techniques.⁷⁵ The predicted binding free energy⁷⁶ (Fig. S9) from our simulation shows a strong correlation (Pearson Correlation 0.79) and a near-perfect fit with the experimental measurement (Fig. 5B), greatly improved over the previous model, which depicts protein-DNA interactions by non-sequence-specific homogeneous electrostatic interactions (Pearson Correlation 0.6, Fig. S10). Notably, due to an undetermined prefactor in the IDEA training, the IDEA energy model alone can only provide relative binding free energies in reduced units for different protein-DNA complexes. In contrast, the IDEA-incorporated simulation model can predict the absolute binding free energy in physical units given the protein and DNA structures.

Figure 5. Enhanced protein-DNA simulation model by incorporating IDEA-optimized energy model.
(A) The prediction of the protein-DNA binding affinity for 9 protein-DNA complexes using the IDEA-learned energy model shows a strong correlation with the experimental measurements.⁷⁴ ΔΔG represents the changes in binding free energy relative to the protein-DNA complex with the lowest predicted binding free energy. The predicted binding free energies are presented in reduced units, as explained in Methods. (B) Illustration of an example protein-DNA complex structure (PDB ID: 9ANT) and the coarse-grained simulation used to evaluate protein-DNA binding free energy. A typical free energy profile was extracted from the simulation, using the center of mass (COM) distance as the collective variable. The shaded region represents the standard deviation of the mean. Representative structures of bound and unbound proteins are shown above, with protein in blue and DNA in yellow. (C) Incorporating the IDEA-optimized energy model into the simulation model improves the prediction of protein-DNA binding affinity, compared to the prediction by the previous model with electrostatic interactions and uniform non-specific attraction between protein and DNA (Fig. S9). The predicted binding free energies are presented in physical units. Error bars represent the standard deviation of the mean.

Discussion

The protein-DNA interaction landscape has evolved to facilitate precise targeting of proteins towards their functional binding sites, which underlie essential processes in controlling gene expression. These interaction specifics are determined by physicochemical interactions between amino acids and nucleotides. By integrating sequences and structural data from available protein-DNA complexes into an interaction matrix, we introduce IDEA, a data-driven method that optimizes a system-specific energy model. This model enables high-throughput in silico predictions of protein-DNA binding specificities and can be scaled up to predict genomic binding sites of DNA-binding proteins, such as TFs. IDEA achieves accurate de novo predictions using only protein-DNA complex structures and their associated sequences, but its accuracy can be further enhanced by incorporating available experimental data from other binding assay measurements, such as the SELEX data,^15,16,21 achieving accuracy comparable to or better than state-of-the-art methods (Figs. S2 and S5). Despite significant progress in genome-wide sequencing techniques,^9,10,17,47 determining the binding specificities of DNA-binding biomolecules remains time-consuming and expensive. Therefore, IDEA presents a cost-effective alternative for generating the initial predictions before pursuing further experimental refinement.

A key advantage of IDEA is its incorporation of both structural and sequence information, which greatly reduces the demand for extensive training data. Prior efforts have applied deep learning techniques to predict the interactions between proteins and DNA based on their sequences.^24,25,27 These approaches usually require a large amount of sequencing training data. By leveraging the sequence-structure relationship inherent in protein-DNA complex structures, IDEA achieves accurate de novo prediction using only a few protein-DNA complexes within the same protein superfamily.

In addition, despite the complicated in vivo environment, IDEA enables predictions of genomic binding sites and shows good agreement with the ChIP-seq measurements (Fig. 4, S8). Moreover, IDEA features rapid prediction, typically requiring 1-3 days to predict base-pair resolution genomic binding sites of a 1 Mb DNA region using a single CPU. Furthermore, IDEA’s prediction can be trivially parallelized, making it possible to perform genome-wide predictions of DNA recognition sites by a given protein within weeks.

Another highlight of IDEA is its ability to present an interpretable amino acid-nucleotide interaction energy model for given protein-DNA complexes. Unlike other machine-learning approaches, IDEA’s optimized energy model not only predicts binding affinities and binding sites of DNA-binding proteins but also provides an interpretable representation of interaction specifics between each type of amino acid and nucleotide. Additionally, we integrated this physicochemical-based energy model into a simulation framework, thereby improving the characterization of protein-DNA binding dynamics. Therefore, IDEA-based simulation enables investigations into dynamic interactions among various proteins and DNA, facilitating molecular-based simulations to understand the physical mechanisms underlying many DNA-binding processes, such as transcription, epigenetic regulations, and their modulation by sequence mutations such as single-nucleotide polymorphisms (SNPs).^77,78

Since the IDEA-optimized energy model serves as a “molecular grammar” for guiding protein-DNA interaction specifics, it can be expanded to include additional variants of amino acids and nucleotides. With recent advancements in sequencing and structural characterization,^79,80 as well as deep-learning-based structural predictions,^81–83 thousands of structures and sequences have been solved for these variants. The IDEA procedure can be repeated for these variant-included structures to expand the optimized energy model by including modifications such as methylated DNA nucleotides ⁸⁴ and post-translationally modified amino acids.⁸⁵ Such strategies will facilitate studies on the functional relevance of epigenetic variants, such as those caused by exposure to environmental hazards,⁸⁶ and their structural impact on the human genome.^87–89

Methods

Structural Modeling of Protein and DNA

We utilized a coarse-grained framework to extract structural information of protein and DNA at the residue level (Figure 1). Specifically, each protein amino acid was represented by the coordinates of the Cα atom, and each DNA nucleotide was represented by either the P atom in the phosphate group or the C5 atom on the nucleobase. The choice between these two DNA atom types (C5 or P) depended on their distances to surrounding Cα atoms. We hypothesized that DNA atoms closer to Cα atoms would provide more meaningful information on protein-DNA interactions. Therefore, we chose the DNA atom type (C5 or P) with the largest sum of occurrences within 8 Å, 9 Å, and 10 Å distances from their surrounding Cα atoms (Table S1). This selection rule was applied consistently across all complexes in the training dataset to maintain uniform criteria.
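The atom-selection rule above can be sketched in a few lines. This is a minimal illustration, assuming coordinates are given in Å as NumPy arrays of shape (n, 3), that "within" means a distance less than or equal to the cutoff, and that ties default to the P atom; the function names and the tie rule are assumptions made here and are not taken from the IDEA implementation.

```python
import numpy as np

def count_contacts(dna_xyz, ca_xyz, cutoffs=(8.0, 9.0, 10.0)):
    """Sum, over all cutoffs, the number of DNA-atom/C-alpha pairs within each cutoff."""
    dna_xyz = np.asarray(dna_xyz, dtype=float)
    ca_xyz = np.asarray(ca_xyz, dtype=float)
    # Pairwise distance matrix of shape (n_dna, n_ca)
    d = np.linalg.norm(dna_xyz[:, None, :] - ca_xyz[None, :, :], axis=-1)
    return int(sum((d <= c).sum() for c in cutoffs))

def choose_dna_atom(p_xyz, c5_xyz, ca_xyz):
    """Return 'C5' or 'P', whichever has the larger summed contact count."""
    n_p = count_contacts(p_xyz, ca_xyz)
    n_c5 = count_contacts(c5_xyz, ca_xyz)
    return "C5" if n_c5 > n_p else "P"   # tie goes to 'P' (assumption)

# Toy geometry: the C5 atom sits close to two C-alpha atoms, the P atom far away.
ca = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
choice = choose_dna_atom(p_xyz=[[20.0, 0.0, 0.0]], c5_xyz=[[5.0, 0.0, 0.0]], ca_xyz=ca)
```

In this toy case the C5 atom is within all three cutoffs of both Cα atoms, so it would be selected as the nucleotide representative.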

Energy Model Optimization

Training Protocol

We fused the native protein and DNA sequences from the experimentally determined proteinDNA structural interface into an interaction matrix for model training. These sequences, located at the protein-DNA structural interface (defined as those residues whose representative atoms have a distance ≤ 8Å, highlighted in blue in Fig. 1 Left), are hypothesized to have evolved to favor strong amino acid-nucleotide interactions. We collected the sequences from both the strong binders and decoy binders (considered weak binders) for model parameterization. Synthetic decoy binders were generated by randomizing either the DNA (1000 sequences) or protein sequences (10,000 sequences) from the strong binders. The IDEA protocol optimizes a pairwise amino-acid-nucleotide energy model by maximizing the gap in binding energies between the strong and decoy binders. Similar strategies have been successfully applied in protein folding and protein-protein binding.^{34,36,53,90,91} Specifically, the binding energies for strong binders (E_strong) and their corresponding decoy weak binders (E_decoy) are calculated using the following expression:

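where Θ_i,j is a switching function of the distance r_ij between the representative atoms of amino acid i and nucleotide j; it captures an effective interaction range between each amino acid-nucleotide pair within the protein-DNA complex:

$$\Theta_{i,j} = \frac{1}{4}\left(1 + \tanh\left[\kappa\,(r_{ij} - r_{\min})\right]\right)\left(1 + \tanh\left[\kappa\,(r_{\max} - r_{ij})\right]\right)$$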

Here, r_min = −8 Å and r_max = 8 Å, so two residues are defined to be “in contact” if their distance is less than 8 Å. κ (set to 0.7 here) is a scaling factor that modulates the steepness of the hyperbolic tangent function.
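As an illustration of how the contact-weighted binding energy can be evaluated, the sketch below assumes that the switching function takes the double-tanh form Θ = ¼(1 + tanh[κ(r − r_min)])(1 + tanh[κ(r_max − r)]), which is consistent with the parameters quoted above but is an assumption of this example. The dictionary-based γ and the three-letter residue labels are hypothetical conveniences.

```python
import math

R_MIN, R_MAX, KAPPA = -8.0, 8.0, 0.7  # parameter values quoted in the text


def switching(r, r_min=R_MIN, r_max=R_MAX, kappa=KAPPA):
    """Smooth contact indicator: close to 1 well inside r_max, close to 0 beyond it."""
    return 0.25 * (1.0 + math.tanh(kappa * (r - r_min))) * (
        1.0 + math.tanh(kappa * (r_max - r))
    )


def binding_energy(gamma, residues, nucleotides, distances):
    """Sum gamma[(aa, nt)] * Theta(r_ij) over all amino acid-nucleotide pairs."""
    energy = 0.0
    for i, aa in enumerate(residues):
        for j, nt in enumerate(nucleotides):
            energy += gamma[(aa, nt)] * switching(distances[i][j])
    return energy


# Toy example: one arginine near a guanine (5 A) and far from an adenine (20 A).
gamma = {("ARG", "G"): -1.0, ("ARG", "A"): 0.5}
energy = binding_energy(gamma, ["ARG"], ["G", "A"], [[5.0, 20.0]])
```

With these values the distant adenine contributes almost nothing, so the energy is dominated by the arginine-guanine contact.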

IDEA optimizes γ, a protein-DNA energy model represented by a 20-by-4 interaction matrix. Each entry γ_i,j(a_i, a_j) describes the pairwise interaction between an amino acid i and a nucleotide j. In practice, the parameters γ_i,j(a_i, a_j) were optimized to maximize δE/ΔE, where δE = ⟨E_decoy⟩ − ⟨E_strong⟩ is the energy gap between strong and decoy binders and ΔE is the standard deviation of the decoy binding energies. Mathematically, δE can be represented as Aγ, and ΔE can be calculated as √(γᵀBγ), where:

$$A = \langle \phi_{\text{decoy}} \rangle - \langle \phi_{\text{strong}} \rangle, \qquad B = \langle \phi_{\text{decoy}}^{\top} \phi_{\text{decoy}} \rangle - \langle \phi_{\text{decoy}} \rangle^{\top} \langle \phi_{\text{decoy}} \rangle$$

Here, ϕ takes the functional form of E_binding and summarizes the total number of contacts between each type of amino acid and nucleotide in the given protein-DNA complexes. The vector A thus specifies the difference in interaction strengths for each pair of amino acids and nucleotides between the strong and decoy binders, with a dimension of 1 × 80 (20 amino acid types × 4 nucleotide types). Additionally, matrix B is an 80 × 80 covariance matrix that contains information about which types of interactions tend to co-occur in the decoy binders. Optimizing the function δE/ΔE with respect to the parameters γ_i,j(a_i, a_j) corresponds to maximizing the ratio Aγ/√(γᵀBγ). This maximization is effectively attained by maximizing the functional objective R(γ), defined as:

$$R(\gamma) = A\gamma - \lambda \sqrt{\gamma^{\top} B \gamma}$$

where λ is a Lagrange multiplier. Setting the derivative ∂R(γ)/∂γ equal to zero leads to γ ∝ B⁻¹Aᵀ. Here, γ represents an 80 × 1 vector that encodes the learned binding strengths across different amino acid-nucleotide interactions. A filtering procedure is further applied to reduce the noise arising from the finite sampling of certain types of interactions. The matrix B is first diagonalized such that B⁻¹ = PΛ⁻¹P⁻¹, where P is the matrix of eigenvectors of B and Λ is a diagonal matrix composed of B’s eigenvalues. We retained the principal N modes of B, ordered by descending magnitude of their eigenvalues, and replaced the remaining 80 − N eigenvalues in Λ with the Nth eigenvalue of B. In all optimizations reported in this paper, N was set to 70 to maximize the utilization of non-zero eigenvalues in the matrix B. For visualization purposes, the vector γ is reshaped into a 20-by-4 matrix, as shown in Fig. 2B. Because the model optimization leaves an undetermined scaling factor, the predicted energies are presented in reduced units. Only when the optimized energy model is integrated into a simulation model can IDEA-based simulations predict binding free energies in physical units (Fig. 5C).
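The optimization and eigenvalue filtering can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the released IDEA implementation: each row of `phi_strong` and `phi_decoy` is assumed to hold the contact counts ϕ for one binder, B is taken as the population covariance of the decoy rows, and the toy example uses only six interaction types instead of the full set of amino acid-nucleotide pair types. The function name `optimize_gamma` is hypothetical.

```python
import numpy as np


def optimize_gamma(phi_strong, phi_decoy, n_modes):
    """Return gamma proportional to B^-1 A, with small eigenvalues of B floored."""
    # A: difference of mean contact counts between decoy and strong binders.
    a_vec = phi_decoy.mean(axis=0) - phi_strong.mean(axis=0)
    # B: covariance of the decoy contact counts (population normalization).
    b_mat = np.cov(phi_decoy, rowvar=False, bias=True)

    # B is symmetric, so eigh returns real eigenvalues and orthonormal eigenvectors.
    eigvals, eigvecs = np.linalg.eigh(b_mat)
    order = np.argsort(eigvals)[::-1]  # sort modes by descending eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Keep the leading n_modes eigenvalues; replace the rest with the n_modes-th one.
    filtered = eigvals.copy()
    filtered[n_modes:] = eigvals[n_modes - 1]

    b_inv = eigvecs @ np.diag(1.0 / filtered) @ eigvecs.T
    return b_inv @ a_vec


# Toy data: 500 decoys and 20 strong binders described by 6 interaction types.
rng = np.random.default_rng(0)
phi_decoy = rng.random((500, 6))
phi_strong = rng.random((20, 6)) * 0.5
gamma = optimize_gamma(phi_strong, phi_decoy, n_modes=4)
```

Flooring the trailing eigenvalues, rather than discarding those modes, keeps the filtered inverse well defined while limiting the influence of poorly sampled interaction types.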

Enhanced Modeling Prediction with SELEX Data

When additional experimental binding data, such as SELEX data,²⁷ are available, IDEA can be extended to incorporate them and enhance the model’s predictive capabilities. SELEX data provide binding affinities between given transcription factors and DNA sequences, which can be converted into dimensionless binding free energies with the expression E = −ln(Affinity), where Affinity is the normalized affinity generated by the SELEX package. We then calculated the ϕ values by threading the SELEX-provided DNA sequences onto the protein-DNA complex structure; these values were used together with the converted E values to compute the γ energy model based on Equation 1:

Here, γ is the enhanced protein-DNA energy model, represented as an 80 × 1 vector.
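One way to read this step is as a linear inverse problem: if each measured sequence contributes one row of contact counts ϕ and one converted energy E, then γ can be estimated from E ≈ ϕγ. The sketch below uses an ordinary least-squares solve as a stand-in. The solver choice, the function names, and the synthetic data are assumptions of this example and may differ from the procedure used in the study.

```python
import numpy as np


def affinity_to_energy(affinity):
    """Dimensionless binding free energy, E = -ln(affinity)."""
    return -np.log(np.asarray(affinity, dtype=float))


def fit_gamma(phi, energies):
    """Least-squares solution of phi @ gamma = energies."""
    gamma, *_ = np.linalg.lstsq(phi, energies, rcond=None)
    return gamma


# Toy example: 5 sequences and 3 interaction types. In practice each row of phi
# would come from threading one SELEX sequence onto the complex structure.
phi = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 2.0, 1.0],
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 1.0],
])
true_gamma = np.array([-1.0, 0.5, 0.2])
affinity = np.exp(-(phi @ true_gamma))  # synthetic affinities consistent with true_gamma
gamma_fit = fit_gamma(phi, affinity_to_energy(affinity))
```

Because the synthetic affinities are noise-free and ϕ has full column rank, the fit recovers the generating parameters; with real data the residual would reflect measurement noise and model limitations.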

For predicting the MAX-DNA binding specificity, we utilized the Human MAX SELEX data deposited in the European Nucleotide Archive database (accession no. PRJEB25690).^27,92 The R/Bioconductor package SELEX version 1.30.0⁹³ was used to determine the observed R1 count for all 10-mers. Among them, ACCACGTGGT is the motif with the highest frequency, which aligns with the DNA sequence in the MAX-DNA PDB structure (PDB ID: 1HLO), whose full DNA sequence is CACCACGTGGT. Consequently, we used this motif as the reference to align the remaining SELEX-measured 10-mer sequences. A total of 255 10-mer sequences, corresponding to the DNA variants measured from the MITOMI experiment,¹³ were selected. We threaded the PDB structure with these sequences to construct the ϕ and utilized the SELEX-converted binding free energy to compute the γ energy model.

Evaluation of IDEA Predictions Using HT-SELEX data

The processed HT-SELEX data used in this study were taken from Yang et al.,²⁶ and contain processed DNA-binding sequence motifs together with their normalized copy numbers detected in the HT-SELEX experiments, referred to as M-word scores. A higher M-word score corresponds to a stronger binding sequence motif detected in the HT-SELEX experiments. Among all the proteins in the data, 23 have experimentally determined protein-DNA complex structures. We predicted the binding free energies of all the processed sequences of these proteins using our protocol. For evaluating IDEA predictions, we classified motifs with M-word scores above 0.9 as strong binders (label 1) and those with M-word scores below 0.3 as weak binders (label 0). In cases where no sequences had an M-word score below 0.3, we used alternative cutoffs of 0.4 or 0.5 to ensure that at least three weak-binding sequences were included. These labeled strong and weak binders were then used as the ground truth for evaluating IDEA’s predictions. We assessed IDEA’s predictive accuracy by calculating the probability density distributions of the predicted binding free energies (Fig. S4). Well-separated distributions for strong and weak binders indicate a successful prediction. Additionally, we calculated the AUC score based on ROC analysis of these probability distributions. An AUC score of 1.0 means IDEA can fully separate the strong- and weak-binder distributions, while an AUC score of 0.5 indicates no separation.
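The labeling and AUC evaluation can be sketched as follows. The AUC is computed here through its rank interpretation, namely the probability that a randomly chosen strong binder receives a lower predicted energy than a randomly chosen weak binder, with ties counted as one half. This is equivalent to the area under the ROC curve, but the function names and toy values are hypothetical.

```python
def label_by_mword(score, strong_cut=0.9, weak_cut=0.3):
    """Return 1 for strong binders, 0 for weak binders, None for unlabeled motifs."""
    if score > strong_cut:
        return 1
    if score < weak_cut:
        return 0
    return None


def auc_strong_vs_weak(energies, labels):
    """AUC where a lower predicted energy indicates stronger binding."""
    strong = [e for e, y in zip(energies, labels) if y == 1]
    weak = [e for e, y in zip(energies, labels) if y == 0]
    wins = 0.0
    for s in strong:
        for w in weak:
            if s < w:
                wins += 1.0
            elif s == w:
                wins += 0.5
    return wins / (len(strong) * len(weak))


# Toy data: M-word scores and predicted energies for five motifs.
m_word = [0.95, 0.92, 0.50, 0.20, 0.10]
predicted = [-3.0, -2.5, -1.0, -0.5, -2.7]
pairs = [(e, label_by_mword(m)) for e, m in zip(predicted, m_word)]
pairs = [(e, y) for e, y in pairs if y is not None]  # drop unlabeled motifs
auc = auc_strong_vs_weak([e for e, _ in pairs], [y for _, y in pairs])
```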

We further performed 10-fold cross-validation on this dataset, using the protocol described in Methods Section Enhanced Modeling Prediction with SELEX Data. For each protein, we divided the entire dataset into 10 equal, randomly assigned folds. In each iteration, nine of the 10 folds served as the training set and the remaining fold as the test set. This process was repeated 10 times so that each fold served as the test set once. We then reported the average R² score across these iterations to evaluate our model’s predictive performance. Our results are compared with the 1mer and 1mer+shape methods of Yang et al.²⁶ (Fig. S5). This comparison shows that IDEA achieved higher predictive accuracy than these state-of-the-art sequence-based protein-DNA binding predictors for the protein-DNA complexes with experimentally resolved structures.
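The fold assignment and scoring used in such a cross-validation can be sketched with the standard library alone. The helper names, the fixed random seed, and the round-robin assignment of shuffled indices to folds are assumptions of this example; model training and prediction are omitted.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_test_splits(n_samples, k=10, seed=0):
    """Yield (train, test) index lists so that each fold is the test set once."""
    folds = k_fold_indices(n_samples, k, seed)
    for test_id in range(k):
        train = [i for f, fold in enumerate(folds) if f != test_id for i in fold]
        yield train, folds[test_id]


def r_squared(y_true, y_pred):
    """Coefficient of determination between measured and predicted values."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


splits = list(train_test_splits(25))
```

The reported score would then be the mean of `r_squared` over the ten test folds.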

Selection of CATH Superfamily for Testing Model Transferability

The MAX protein is characterized by a helix-loop-helix DNA-binding domain and belongs to the CATH superfamily 4.10.280.10,⁹⁴ which includes a total of 11 protein-DNA structures with the E-box sequence (CACGTG), similar to the MAX-binding DNA. We hypothesized that proteins within the same superfamily exhibit common residue-interaction patterns due to their structural and evolutionary similarities. To test the transferability of our model, we trained it individually on each of those 11 protein-DNA complex structures and their associated sequences, and used each trained model to predict the MAX-DNA binding affinity in the MITOMI dataset.¹³ To examine the connection between the predictive outcome and the probability that the training protein is homologous to MAX, HHpred⁹⁵ was used to search for MAX’s homologous proteins, with the sequence of 1HLO as the query (Fig. 3A).

Processing of ChIP-seq Data

To evaluate the predictive capabilities of our protocol on specific chromosomal segments, we selected ChIP-seq data for three transcription factors—MAX, EGR1, and CTCF. The data were sourced from the ENCODE project, with accession numbers ENCFF361EVH (MAX), ENCSR026GSW (EGR1), and ENCFF960ZGP (CTCF).

For testing, we chose genomic regions with the densest ChIP-seq signals. For MAX and CTCF, we focused on the human GM12878 chromosome 1 (GRCh38) region between 156,000,000 bp and 157,000,000 bp, as this 1 Mb segment contains the densest signals for both transcription factors, with 33 MAX binding sites and 70 CTCF binding sites. For EGR1, we used the HepG2 cell line, focusing on the GRCh38 chr8 region between 143,000,000 bp and 144,000,000 bp, which contains 97 peaks.

To predict binding sites, we scanned the corresponding DNA regions base by base using the DNA sequence closest to each protein in the relevant protein-DNA complex structures. For MAX, we used the 11-bp sequence from the PDB structure 1HLO, generating 1,000,000 − 11 + 1 = 999,990 sequences. For EGR1, we used the 10-bp sequence from PDB 1AAY, producing 1,000,000 − 10 + 1 = 999,991 sequences. For CTCF, no single PDB structure covers the entire protein,⁹⁶ so two PDB structures were used: 8SSS (ZF1–ZF7, 23 bp) and 8SSQ (ZF3–ZF11, 35 bp), generating 1,000,000 − 23 + 1 = 999,978 and 1,000,000 − 35 + 1 = 999,966 sequences, respectively.
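The window counts quoted above follow from sliding a window of width w one base at a time along a region of length L, which yields L − w + 1 windows. A minimal sketch, with hypothetical function names:

```python
def sliding_windows(sequence, width):
    """Return every contiguous subsequence of length `width`, shifted one base at a time."""
    return [sequence[i:i + width] for i in range(len(sequence) - width + 1)]


def n_windows(region_length, width):
    """Number of windows produced by a base-by-base scan."""
    return region_length - width + 1


# Window counts for the 1 Mb regions and the motif lengths used in the text.
counts = {width: n_windows(1_000_000, width) for width in (11, 10, 23, 35)}

# Short illustration on a 13-base toy sequence with an 11-bp window.
windows = sliding_windows("CACCACGTGGTAA", 11)
```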

Predicted binding free energies for each transcription factor were normalized into Z-scores using randomized decoy sequences:

$$Z = \frac{E - \langle E_{\text{decoy}} \rangle}{\text{SD}(E_{\text{decoy}})}$$

where ⟨ ⟩ is the average over all decoy sequences, and SD is the standard deviation. Given the evolutionary preference for strong binding, most Z-scores were negative, indicating stronger binding. ROC curves were then constructed to assess the predictive performance, with lower predicted Z-scores representing stronger binding affinity (Fig. 4, S8).
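The normalization can be sketched as below, assuming the Z-score is the predicted energy minus the decoy mean, divided by the decoy standard deviation. The use of the population (rather than sample) standard deviation is an assumption of this example.

```python
import statistics


def z_score(energy, decoy_energies):
    """Normalize a predicted energy against the decoy energy distribution."""
    mean = statistics.fmean(decoy_energies)
    sd = statistics.pstdev(decoy_energies)  # population SD (assumption)
    return (energy - mean) / sd


# Toy decoy energies with mean 0 and population variance 2.
decoys = [-1.0, 0.0, 1.0, 2.0, -2.0]
z = z_score(-2.0, decoys)
```

A negative Z-score indicates a sequence predicted to bind more strongly than the average decoy.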

Sequence Specific Protein-DNA Simulation Model

We utilized a previously developed protein and DNA residue-level coarse-grained model for simulating protein-DNA binding interactions. The protein was modeled using the Cα-based structure-based model,⁵⁷ with each amino acid represented by a simulation bead located at the Cα site. The DNA was modeled with the 3-Site-Per-Nucleotide (3SPN) DNA model,⁵⁸ with each DNA nucleotide modeled by three coarse-grained sites located at the centers of mass of the phosphate, sugar, and base sites. Both models and their combination have been validated against multiple experimental measurements.^{57,58,61,63–66,68,69,72,97}

Detailed descriptions of this protein-DNA force field have been documented in previous studies.^60,65,66,69 Specifically, we followed Ref. 58 for simulating DNA-DNA interactions, and Ref. 57 for simulating the protein-protein interactions. The protein-DNA interactions were modeled with a long-range electrostatic interaction and a short-range sequence-specific interaction. The electrostatic interactions were simulated with the Debye-Hückel treatment to capture the screening effect at different ionic concentrations, as a function of the distance r_ij between two charged particles:

$$V_{\text{elec}} = \frac{q_i q_j}{4 \pi \epsilon_0 \epsilon\, r_{ij}} \exp\left(-\frac{r_{ij}}{l_D}\right)$$

where ɛ = 78.0 is the dielectric constant of the bulk solvent. q_i and q_j correspond to the charges of the two particles. ɛ₀ = 1 is the vacuum electric permittivity, and l_D is the Debye-Hückel screening length.
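For reference, a screened Coulomb interaction of this kind can be evaluated as in the sketch below, which assumes the standard Debye-Hückel form q_i q_j exp(−r/l_D) / (4π ε₀ ε r) in the reduced units implied by ε₀ = 1. The function name and the example charges, distance, and screening length are hypothetical.

```python
import math


def debye_huckel(q_i, q_j, r, debye_length, epsilon=78.0, epsilon_0=1.0):
    """Screened Coulomb energy between two point charges separated by r."""
    coulomb = q_i * q_j / (4.0 * math.pi * epsilon_0 * epsilon * r)
    return coulomb * math.exp(-r / debye_length)


# Opposite unit charges 10 A apart with a 10 A screening length (toy values).
v = debye_huckel(1.0, -1.0, r=10.0, debye_length=10.0)
```

Shorter screening lengths, corresponding to higher ionic concentrations, weaken the interaction at a given distance.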

The excluded volume effect was modeled using a hard wall potential with an apparent radius of σ = 4.0 Å:

An additional sequence-specific protein-DNA interaction was modeled with a Tanh function between the base site of the DNA and the amino acid, with the form:

where r₀ = 8 Å and η = 0.7 Å⁻¹. W is a 4 × 20 matrix that characterizes the interaction energy between each type of amino acid and DNA nucleotide. Here, W used the normalized IDEA-learned energy model based on 9 protein-DNA complexes (Fig. 5A). The IDEA-learned energy model was normalized by the sum of the absolute value of each term of the energy model. Detailed parameters are provided in Table S2. Additional details on the protein-DNA simulation are provided in the Supplementary Materials.
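The normalization of the learned energy model and a distance-dependent tanh interaction can be sketched as follows. The normalization divides each entry by the sum of absolute values, as described above. The functional form W·½(1 + tanh[η(r₀ − r)]) is an assumption of this example, chosen so that the interaction switches off smoothly beyond r₀; it may differ from the form used in the simulations. All names and toy values are hypothetical.

```python
import math

R0, ETA = 8.0, 0.7  # r0 in angstroms and eta in inverse angstroms, as quoted in the text


def normalize(gamma):
    """Divide each entry by the sum of absolute values of all entries."""
    total = sum(abs(v) for v in gamma.values())
    return {pair: v / total for pair, v in gamma.items()}


def specific_energy(w, aa, nt, r, r0=R0, eta=ETA):
    """Sequence-specific energy for one pair (assumed tanh switching form)."""
    return w[(aa, nt)] * 0.5 * (1.0 + math.tanh(eta * (r0 - r)))


# Toy energy model with three amino acid-nucleotide pair types.
w = normalize({("ARG", "G"): -2.0, ("ASP", "G"): 1.0, ("LYS", "T"): -1.0})
e_contact = specific_energy(w, "ARG", "G", r=8.0)
```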

Data availability

The implementation of the IDEA model, along with training and prediction examples, is available on our GitHub repository and in the Supplementary data.

Acknowledgements

This work was supported by the startup funding from North Carolina State University. Additionally, this research received partial funding from the NC State Genetics and Genomics Academy. We appreciate Dr. Keith Weninger, Dr. Faruck Morcos, and Dr. Qin Zhou for their insightful discussions and Dr. Sebastian Maerkl for guiding us to the MITOMI experimental data.

Additional files

Supplementary Information

References

(1) Lee T. I., Young R. A. (2013) Transcriptional Regulation and Its Misregulation in Disease. Cell 152:1237–1251
(2) Orphanides G., Reinberg D. (2002) A Unified Theory of Gene Expression. Cell 108:439–451
(3) Ren B., Robert F., Wyrick J. J., Aparicio O., Jennings E. G., Simon I., Zeitlinger J., Schreiber J., Hannett N., Kanin E., Volkert T. L., Wilson C. J., Bell S. P., Young R. A. (2000) Genome-Wide Location and Function of DNA Binding Proteins. Science 290:2306–2309
(4) Jaenisch R., Bird A. (2003) Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals. Nat Genet 33:245–254
(5) Latchman D. S. (1997) Transcription factors: an overview. The international journal of biochemistry & cell biology 29:1305–1312
(6) Owen J. A., Osmanović D., Mirny L. (2023) Design principles of 3D epigenetic memory systems. Science 382:eadg3053
(7) Stormo G. D., Zhao Y. (2010) Determining the specificity of protein–DNA interactions. Nat Rev Genet 11:751–760
(8) Solomon M. J., Larsen P. L., Varshavsky A. (1988) Mapping protein-DNA interactions in vivo with formaldehyde: Evidence that histone H4 is retained on a highly transcribed gene. Cell 53:937–947
(9) Park P. J. (2009) ChIP–seq: advantages and challenges of a maturing technology. Nature reviews genetics 10:669–680
(10) Bartlett A., O’Malley R. C., Huang S.-s. C., Galli M., Nery J. R., Gallavotti A., Ecker J. R. (2017) Mapping genome-wide transcription-factor binding sites using DAP-seq. Nat Protoc 12:1659–1672
(11) Berger M. F., Bulyk M. L. (2009) Universal protein-binding microarrays for the comprehensive characterization of the DNA-binding specificities of transcription factors. Nature protocols 4:393–411
(12) Meng X., Brodsky M. H., Wolfe S. A. (2005) A bacterial one-hybrid system for determining the DNA-binding specificity of transcription factors. Nature biotechnology 23:988–994
(13) Maerkl S. J., Quake S. R. (2007) A Systems Approach to Measuring the Binding Energy Landscapes of Transcription Factors. Science 315:233–237
(14) Fordyce P. M., Gerber D., Tran D., Zheng J., Li H., DeRisi J. L., Quake S. R. (2010) De novo identification and biophysical characterization of transcription-factor binding sites with microfluidic affinity analysis. Nature biotechnology 28:970–975
(15) Ogawa N., Biggin M. D., Deplancke B., Gheldof N. (2012) Gene Regulatory Networks: Methods and Protocols. Totowa, NJ: Humana Press pp. 51–63
(16) Isakova A., Groux R., Imbeault M., Rainer P., Alpern D., Dainese R., Ambrosini G., Trono D., Bucher P., Deplancke B. (2017) SMiLE-seq identifies binding motifs of single and dimeric transcription factors. Nature methods 14:316–322
(17) Furey T. S. (2012) ChIP–seq and beyond: new and improved methodologies to detect and characterize protein–DNA interactions. Nat Rev Genet 13:840–852
(18) Bulyk M. L., Huang X., Choo Y., Church G. M. (2001) Exploring the DNA-binding specificities of zinc fingers with DNA microarrays. Proc. Natl. Acad. Sci. U.S.A. 98:7158–7163
(19) Gordân R., Shen N., Dror I., Zhou T., Horton J., Rohs R., Bulyk M. (2013) Genomic Regions Flanking E-Box Binding Sites Influence DNA Binding Specificity of bHLH Transcription Factors through DNA Shape. Cell Reports 3:1093–1104
(20) Afek A., Shi H., Rangadurai A., Sahay H., Senitzki A., Xhani S., Fang M., Salinas R., Mielko Z., Pufall M. A., Poon G. M. K., Haran T. E., Schumacher M. A., Al-Hashimi H. M., Gordân R. (2020) DNA mismatches reveal conformational penalties in protein–DNA recognition. Nature 587:291–296
(21) Jolma A., Kivioja T., Toivonen J., Cheng L., Wei G., Enge M., Taipale M., Vaquerizas J. M., Yan J., Sillanpää M. J., et al. (2010) Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities. Genome research 20:861–873
(22) Kohlberger M., Gadermaier G. (2022) SELEX: Critical factors and optimization strategies for successful aptamer selection. Biotechnology and applied biochemistry 69:1771–1792
(23) Weirauch M. T., Cote A., Norel R., Annala M., Zhao Y., Riley T. R., Saez-Rodriguez J., Cokelaer T., Vedenko A., Talukder S., et al. (2013) Evaluation of methods for modeling transcription factor sequence specificity. Nature biotechnology 31:126–134
(24) Alipanahi B., Delong A., Weirauch M. T., Frey B. J. (2015) Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol 33:831–838
(25) Zhou T., Shen N., Yang L., Abe N., Horton J., Mann R. S., Bussemaker H. J., Gordân R., Rohs R. (2015) Quantitative modeling of transcription factor binding specificities using DNA shape. Proceedings of the National Academy of Sciences 112:4654–4659
(26) Yang L., Orenstein Y., Jolma A., Yin Y., Taipale J., Shamir R., Rohs R. (2017) Transcription factor family-specific DNA shape readout revealed by quantitative specificity models. Molecular systems biology 13
(27) Rastogi C., Rube H. T., Kribelbauer J. F., Crocker J., Loker R. E., Martini G. D., Laptenko O., Freed-Pastor W. A., Prives C., Stern D. L., Mann R. S., Bussemaker H. J. (2018) Accurate and sensitive quantification of protein-DNA binding affinity. Proc. Natl. Acad. Sci. U.S.A. 115
(28) Roche R., Moussad B., Shuvo M. H., Tarafder S., Bhattacharya D. (2024) EquiPNAS: improved protein–nucleic acid binding site prediction using protein-language-model-informed equivariant deep graph neural networks. Nucleic Acids Research 52:e27–e27
(29) Liu Y., Tian B. (2023) Protein–DNA binding sites prediction based on pre-trained protein language model and contrastive learning. Briefings in Bioinformatics 25:bbad488
(30) Nguyen B. P., Nguyen Q. H., Doan-Ngoc G.-N., Nguyen-Vo T.-H., Rahardja S. (2019) iProDNA-CapsNet: identifying protein-DNA binding residues using capsule neural networks. BMC Bioinformatics 20
(31) Siggers T., Gordân R. (2014) Protein–DNA binding: complexities and multi-protein codes. Nucleic Acids Research 42:2099–2111
(32) Sagendorf J. M., Markarian N., Berman H. M., Rohs R. (2019) DNAproDB: an expanded database and web-based tool for structural analysis of DNA–protein complexes. Nucleic Acids Research 48:D277–D287
(33) Harini K., Srivastava A., Kulandaisamy A., Gromiha M. M. (2022) ProNAB: database for binding affinities of protein-nucleic acid complexes and their mutants. Nucleic Acids Res 50:D1528–D1534
(34) Bryngelson J. D., Wolynes P. G. (1987) Spin glasses and the statistical mechanics of protein folding. Proc. Natl. Acad. Sci. U.S.A. 84:7524–7528
(35) Onuchic J. N., Luthey-Schulten Z., Wolynes P. G. (1997) Theory of protein folding: the energy landscape perspective. Annu Rev Phys Chem 48:545–600
(36) Schafer N. P., Kim B. L., Zheng W., Wolynes P. G. (2014) Learning To Fold Proteins Using Energy Landscape Theory. Isr J Chem 54:1311–1337
(37) Chu W.-T., Yan Z., Chu X., Zheng X., Liu Z., Xu L., Zhang K., Wang J. (2021) Physics of biomolecular recognition and conformational dynamics. Rep. Prog. Phys. 84
(38) Burley S. K., et al. (2022) RCSB Protein Data Bank (RCSB.org): delivery of experimentally-determined PDB structures alongside one million computed structure models of proteins from artificial intelligence/machine learning. Nucleic Acids Research 51:D488–D508
(39) Petrenko N., Jin Y., Dong L., Wong K. H., Struhl K. (2019) Requirements for RNA polymerase II preinitiation complex formation in vivo. eLife 8:e43654
(40) Marchal C., Sima J., Gilbert D. M. (2019) Control of DNA replication timing in the 3D genome. Nat Rev Mol Cell Biol 20:721–737
(41) Geertz M., Shore D., Maerkl S. J. (2012) Massively parallel measurements of molecular interaction kinetics on a microfluidic platform. Proceedings of the National Academy of Sciences 109:16540–16545
(42) Zhou Q., Kunder N., De La Paz J. A., Lasley A. E., Bhat V. D., Morcos F., Campbell Z. T. (2018) Global pairwise RNA interaction landscapes reveal core features of protein recognition. Nat Commun 9:2511
(43) Blackwell T. K., Kretzner L., Blackwood E. M., Eisenman R. N., Weintraub H. (1990) Sequence-specific DNA binding by the c-Myc protein. Science 250:1149–1151
(44) Söding J., Biegert A., Lupas A. N. (2005) The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Research 33:W244–W248
(45) Humphrey W., Dalke A., Schulten K. (1996) VMD – Visual Molecular Dynamics. Journal of Molecular Graphics 14:33–38
(46) de Souza N. (2012) The ENCODE project. Nature methods 9:1046–1046
(47) Giresi P. G., Kim J., McDaniell R. M., Iyer V. R., Lieb J. D. (2007) FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates active regulatory elements from human chromatin. Genome Res 17:877–885
(48) Zhang J., et al. (2020) An integrative ENCODE resource for cancer genomics. Nat Commun 11:3696
(49) Boyle A. P., Davis S., Shulha H. P., Meltzer P., Margulies E. H., Weng Z., Furey T. S., Crawford G. E. (2008) High-resolution mapping and characterization of open chromatin across the genome. Cell 132:311–322
(50) Corces M. R., Trevino A. E., Hamilton E. G., Greenside P. G., Sinnott-Armstrong N. A., Vesuna S., Satpathy A. T., Rubin A. J., Montine K. S., Wu B., et al. (2017) An improved ATAC-seq protocol reduces background and enables interrogation of frozen tissues. Nature methods 14:959–962
(51) Granja J. M., Corces M. R., Pierce S. E., Bagdatli S. T., Choudhry H., Chang H. Y., Greenleaf W. J. (2021) ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nature genetics 53:403–411
(52) Savelyev A., Papoian G. A. (2010) Chemically accurate coarse graining of double-stranded DNA. Proc. Natl. Acad. Sci. U.S.A. 107:20340–20345
(53) Davtyan A., Schafer N. P., Zheng W., Clementi C., Wolynes P. G., Papoian G. A. (2012) AWSEM-MD: Protein Structure Prediction Using Coarse-Grained Physical Potentials and Bioinformatically Based Local Structure Biasing. J. Phys. Chem. B 116:8494–8503
(54) Bascom D., Schlick G. (2018) Nuclear Architecture and Dynamics. Elsevier pp. 123–147
(55) Tan C., Takada S. (2020) Nucleosome allostery in pioneer transcription factor binding. Proc. Natl. Acad. Sci. U.S.A. 117:20586–20596
(56) Chakraborty D., Mondal B., Thirumalai D. (2024) Brewing COFFEE: A Sequence-Specific Coarse-Grained Energy Function for Simulations of DNA-Protein Complexes. J. Chem. Theory Comput. 20:1398–1413
(57) Clementi C., Nymeyer H., Onuchic J. N. (2000) Topological and energetic factors: what determines the structural details of the transition state ensemble and “en-route” intermediates for protein folding? an investigation for small globular proteins. Journal of Molecular Biology 298:937–953
(58) Freeman G. S., Hinckley D. M., Lequieu J. P., Whitmer J. K., De Pablo J. J. (2014) Coarse-grained modeling of DNA curvature. The Journal of Chemical Physics 141
(59) Lin X., Qi Y., Latham A. P., Zhang B. (2021) Multiscale modeling of genome organization with maximum entropy optimization. J. Chem. Phys. 155
(60) Zhang B., Zheng W., Papoian G. A., Wolynes P. G. (2016) Exploring the Free Energy Landscape of Nucleosomes. J. Am. Chem. Soc. 138:8126–8133
(61) Lequieu J., Córdoba A., Schwartz D. C., De Pablo J. J. (2016) Tension-Dependent Free Energies of Nucleosome Unwrapping. ACS Cent. Sci. 2:660–666
(62) Parsons T., Zhang B. (2019) Critical role of histone tail entropy in nucleosome unwinding. The Journal of Chemical Physics 150
(63) Moller J., Lequieu J., De Pablo J. J. (2019) The Free Energy Landscape of Internucleosome Interactions and Its Relation to Chromatin Fiber Structure. ACS Cent. Sci. 5:341–348
(64) Lin X., Zhang B. (2024) Explicit ion modeling predicts physicochemical interactions for chromatin organization. eLife 12:RP90073
(65) Ding X., Lin X., Zhang B. (2021) Stability and folding pathways of tetra-nucleosome from six-dimensional free energy surface. Nat Commun 12:1091
(66) Liu S., Lin X., Zhang B. (2022) Chromatin fiber breaks into clutches under tension and crowding. Nucleic Acids Res 50:9738–9747
(67) Watanabe S., Mishima Y., Shimizu M., Suetake I., Takada S. (2018) Interactions of HP1 Bound to H3K9me3 Dinucleosome by Molecular Simulations and Biochemical Assays. Biophysical Journal 114:2336–2351
(68) Leicher R., Ge E. J., Lin X., Reynolds M. J., Xie W., Walz T., Zhang B., Muir T. W., Liu S. (2020) Single-molecule and in silico dissection of the interaction between Polycomb repressive complex 2 and chromatin. Proc Natl Acad Sci USA 117:30465–30475
(69) Lin X., Leicher R., Liu S., Zhang B. (2021) Cooperative DNA looping by PRC2 complexes. Nucleic Acids Research 49:6238–6248
(70) Noel J. K., Levi M., Raghunathan M., Lammert H., Hayes R. L., Onuchic J. N., Whitford P. C. (2016) SMOG 2: A Versatile Software Package for Generating Structure-Based Models. PLoS Comput Biol 12:e1004794
(71) Knotts T. A., Rathore N., Schwartz D. C., De Pablo J. J. (2007) A coarse grain model for DNA. The Journal of Chemical Physics 126
(72) Hinckley D. M., Freeman G. S., Whitmer J. K., De Pablo J. J. (2013) An experimentally-informed coarse-grained 3-site-per-nucleotide model of DNA: Structure, thermodynamics, and dynamics of hybridization. The Journal of Chemical Physics 139
(73) Freeman G. S., Lequieu J. P., Hinckley D. M., Whitmer J. K., De Pablo J. J. (2014) DNA Shape Dominates Sequence Affinity in Nucleosome Formation. Phys. Rev. Lett. 113
(74) Privalov P. L., Dragan A. I., Crane-Robinson C. (2011) Interpreting protein/DNA interactions: distinguishing specific from non-specific and electrostatic from non-electrostatic components. Nucleic Acids Research 39:2483–2491
(75) Torrie G., Valleau J. (1977) Nonphysical sampling distributions in Monte Carlo free-energy estimation: Umbrella sampling. Journal of Computational Physics 23:187–199
(76) Kumar S., Rosenberg J. M., Bouzida D., Swendsen R. H., Kollman P. A. (1992) THE weighted histogram analysis method for free-energy calculations on biomolecules. I. The method. J Comput Chem 13:1011–1021
(77) Hindorff L. A., Sethupathy P., Junkins H. A., Ramos E. M., Mehta J. P., Collins F. S., Manolio T. A. (2009) Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences 106:9362–9367
(78) Lappalainen T., Scott A. J., Brandt M., Hall I. M. (2019) Genomic analysis in the age of human genome sequencing. Cell 177:70–84
(79) Metzker M. L. (2010) Sequencing technologies — the next generation. Nat Rev Genet 11:31–46
(80) Doerr A. (2017) Cryo-electron tomography. Nat Methods 14:34–34
(81) Abramson J., et al. (2024) Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630:493–500
(82) Tunyasuvunakool K., et al. (2021) Highly accurate protein structure prediction for the human proteome. Nature 596:590–596
(83) Baek M., McHugh R., Anishchenko I., Jiang H., Baker D., DiMaio F. (2024) Accurate prediction of protein–nucleic acid complexes using RoseTTAFoldNA. Nat Methods 21:117–121
(84) Moore L. D., Le T., Fan G. (2013) DNA Methylation and Its Basic Function. Neuropsychopharmacol 38:23–38
(85) Strahl B. D., Allis C. D. (2000) The language of covalent histone modifications. Nature 403:41–45
(86) Jirtle R. L., Skinner M. K. (2007) Environmental epigenomics and disease susceptibility. Nat Rev Genet 8:253–262
(87) Portela A., Esteller M. (2010) Epigenetic modifications and human disease. Nat Biotechnol 28:1057–1068
(88) Allis C. D., Jenuwein T. (2016) The molecular hallmarks of epigenetic control. Nat Rev Genet 17:487–500
(89) de Goede O. M., et al. (2021) Population-scale tissue transcriptomics maps long non-coding RNAs to complex disease. Cell 184:2633–2648
(90) Lin X., George J. T., Schafer N. P., Ng Chau K., Birnbaum M. E., Clementi C., Onuchic J. N., Levine H. (2021) Rapid assessment of T-cell receptor specificity of the immune repertoire. Nat Comput Sci 1:362–373
(91) Wang A., Lin X., Chau K. N., Onuchic J. N., Levine H., George J. T. (2024) RACER-m leverages structural features for sparse T cell specificity prediction. Sci. Adv. 10:eadl0161
(92) European Nucleotide Archive (2024) https://www.ebi.ac.uk/ena
(93) Rastogi C., Liu D., Melo L., Bussemaker H. J. (2022) SELEX: Functions for analyzing SELEX-seq data. Bioconductor
(94) Sillitoe I., Bordin N., Dawson N., Waman V. P., Ashford P., Scholes H. M., Pang C. S., Woodridge L., Rauer C., Sen N., et al. (2021) CATH: increased structural coverage of functional space. Nucleic acids research 49:D266–D273
(95) Gabler F., Nam S.-Z., Till S., Mirdita M., Steinegger M., Söding J., Lupas A. N., Alva V. (2020) Protein sequence analysis using the MPI bioinformatics toolkit. Current Protocols in Bioinformatics 72:e108
(96) Yang J., Horton J. R., Liu B., Corces V. G., Blumenthal R. M., Zhang X., Cheng X. (2023) Structures of CTCF–DNA complexes including all 11 zinc fingers. Nucleic acids research 51:8447–8462
(97) Lequieu J., Schwartz D. C., De Pablo J. J. (2017) In silico evidence for sequence-dependent nucleosome sliding. Proc. Natl. Acad. Sci. U.S.A. 114

Article and author information

Author information

Yafan Zhang
Bioinformatics Research Center, North Carolina State University, Raleigh, United States
Irene Silvernail
Department of Physics, North Carolina State University, Raleigh, United States
Zhuyang Lin
Bioinformatics Research Center, North Carolina State University, Raleigh, United States
Xingcheng Lin
Bioinformatics Research Center, North Carolina State University, Raleigh, United States, Department of Physics, North Carolina State University, Raleigh, United States
ORCID iD: 0000-0002-9378-6174
- For correspondence: Lin@ncsu.edu

Author Notes

Competing Interest Statement: The authors have declared no competing interest.

Version history

Preprint posted: December 7, 2024
Sent for peer review: December 11, 2024
Reviewed Preprint version 1: February 7, 2025
Reviewed Preprint version 2: June 9, 2025
Version of Record published: July 17, 2025

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.105565. This DOI represents all versions, and will always resolve to the latest one.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
