Introduction

Gene regulation is essential for controlling the timing and extent of gene expression in cells. It guides important processes from development to disease response.1 Gene expressions are tightly controlled by various DNA binding proteins, including transcription factors (TFs) and epigenetic regulators.24 TFs initiate gene expression by binding to specific DNA sequences, during which time a primary RNA transcript is synthesized from a gene’s DNA.5 On the other hand, epigenetic regulators control gene expression by binding to specific chromatin regions, which spread post-translational modifications that modulate 3D genome organization.6 Therefore, characterizing the DNA-interaction specificities of DNA binding proteins is critical for understanding the molecular mechanisms underlying many DNA-templated processes.7

To establish a comprehensive understanding of DNA binding processes, various experimental technologies have been developed and performed.816 These methods, such as Chromatin Immunoprecipitation followed by sequencing (ChIP-Seq),8,9,17 Protein-binding microarray (PBM),1820 and Systematic Evolution of Ligands by Exponential Enrichment (SELEX),15,16,21 have proven invaluable for measuring DNA-specific protein recognition. Nonetheless, due to the need for protein-specific antibodies,17 as well as the cost and intrinsic bias associated with these experiments,22 high-throughput measurement of large numbers of protein-DNA variants remains challenging.

Computational methods complement experimental efforts by providing the initial filter for assessing protein-DNA binding specificity. Numerous methods have emerged to enable predictions of binding sites and affinities of DNA-binding proteins. 2330 These methods often utilized machine-learning-based training to extract sequence preference information from DNA or protein by utilizing experimental high-throughput (HT) assays,2327 which rely on the availability and quality of experimental binding assays. Additionally, many approaches employ deep neural networks,2830 which could obscure the interpretation of interaction patterns governing protein-DNA binding specificities. Understanding these patterns, however, is crucial for elucidating the molecular mechanisms underlying various DNA-recognition processes, such as those seen in TFs.31

Nowadays, over 5000 protein-DNA 3D structures, including TF-DNA complexes, have been published.32,33 These data provide invaluable resources for understanding the physicochemical properties of protein-DNA binding patterns, extending beyond mere sequence information. Utilization of these data enables the training of a model for predicting the binding affinities of protein-DNA interactions. The very robustness of evolution 3437 offers an approach to extract the sequence-structure relation embedded in these available complexes, which learns an interpretable binding energy landscape that governs the recognition processes of those DNA-binding proteins.

Here, we introduce the Interpretable protein-DNA Energy Associative (IDEA) model, a predictive model that learns protein-DNA physicochemical interactions by fusing available biophysical structures and their associated sequences into an optimized energy model (Figure 1). We show that the model can be used to accurately predict the sequence-specific DNA binding affinities of DNA-binding proteins and is transferrable across the same protein superfamily. Moreover, the model can be enhanced by incorporating experimental binding data and can be generalized to enable base-pair resolution predictions of genomic DNA-binding sites. Notably, IDEA learns the interaction matrix between each amino acid and nucleotide, allowing for direct interpretation of the “molecular grammar” governing the binding specificities of DNA binding proteins. This interpretable energy model is further integrated into a simulation framework, facilitating mechanistic studies of various biomolecular functions involving protein-DNA dynamics.

Results

Residue-Level Protein-DNA Energy Model for Predicting ProteinDNA Recognition Specificities

IDEA is a coarse-grained biophysical model at the residue resolution for investigating the binding interactions between protein and DNA (Fig. 1). It leverages both structures and corresponding sequences of known protein-DNA complexes to learn an interpretable energy model based on the interacting amino acids and nucleotides at the protein-DNA interface. The model was trained using available protein-DNA complexes curated from existing database.32,38 Unlike existing deep-learning-based protein-DNA binding prediction models, IDEA aims to learn a physicochemical-based energy model that quantitatively characterizes sequences-specific interactions between amino acids and nucleotides, thereby interpreting the “molecular grammar” driving the binding energetics of protein-DNA interactions. The optimized energy model can be used to predict the binding affinity of given protein-DNA pairs based on their structures and sequences. Additionally, it enables the prediction of genomic DNA binding sites by a given protein, such as a transcription factor. Finally, the learned energy model is integrated into a simulation framework that can be used to investigate the molecular mechanisms underlying various DNA-templated activities, such as initiation of gene transcription,39 epigenetic regulation of gene expression,4 and DNA replication.40 Further details of the optimization protocol are provided in Methods Section Energy Model Optimization.

Overview of the IDEA protocol.

The protein–DNA complex, represented by the human MAX–DNA complex structure (PDB ID: 1HLO), was used for training the IDEA model. The sequences of the protein and DNA residues that form close contacts (highlighted in blue) in the structure were included in the training dataset. In addition, a series of synthetic decoy sequences were generated by randomizing the contacting residues in both the protein and DNA sequences. The amino acid–nucleotide energy model was then optimized by maximizing the ratio of the binding energy gap (δE) between protein and DNA in the native complex and the decoy complexes relative to the energy variance (ΔE). The optimized energy model can be used for multiple predictive applications, including the evaluation of binding free energies for various protein–DNA sequence pairs, prediction of genomic DNA binding sites by transcription factors or other DNA-binding proteins, and integration into a sequence-specific, residue-resolution simulation framework for dynamic simulations.

IDEA Accurately Predicts Protein-DNA Binding Specificity

We first examine the predictive accuracy of IDEA by comparing its predicted TF-DNA binding affinity with experimental measurements.13,14 We focused on the MAX TF, a basic Helix-loop-helix (bHLH) TF with the most comprehensive experimental binding data. The binding affinity of MAX to various DNA sequences has been quantified by multiple experimental platforms, including different SELEX variants15,16,21 and microfluidic-based MITOMI assay.13,41 Among them, the MITOMI quantitatively measured the binding affinities of MAX TF to a comprehensive set of 255 DNA sequences with mutations in the enhancer box (Ebox) motif, a consensus MAX DNA binding sequence, and can thus serve as a reference for us to benchmark our model predictions. Our de novo prediction based on one MAX crystal structure (PDB ID: 1HLO) correlates well with the experimental values (Pearson Correlation 0.67, Fig. 2A). Including additional human-associated MAX-DNA complex structures and their associated sequences slightly improves the correlation (Pearson Correlation 0.68, Fig. S1), albeit not significantly, likely due to the structural similarity of protein-DNA interfaces across all MAX proteins. Prior works have proven it instrumental in incorporating experimental binding data to improve predicted protein-nucleic acid binding affinities.27,42 Inspired by these approaches, we developed an enhanced predictive protocol that integrates additional experimental binding data if available (see the Methods Section Enhanced Modeling Prediction with SELEX Data). Encouragingly, when training the model with the SELEX-seq data27 of MAX TF, IDEA gained a significant predictive improvement in reproducing the MITOMI measurement (Pearson Correlation 0.79, Fig. S2).

Results for MAX-based predictions.

(A) The binding free energy calculated by IDEA, trained using one MAX–DNA complex (PDB ID: 1HLO), correlates well with experimentally measured MAX–DNA binding free energy.13 ΔΔG represents the changes in binding free energy relative to that of the wild-type protein–DNA complex. (B) The optimized energy model reveals an amino acid–nucleotide interaction pattern governing MAX–DNA recognition. The predicted binding free energies and optimized energy model are presented in reduced units, as explained in the Methods. (C) The 3D structure of the MAX–DNA complex (zoomed in with different views) highlights important amino acid–nucleotide contacts at the protein–DNA interface, where several DNA cytosines (red spheres) form close contacts with arginine (blue spheres). (D) Probability density distribution of strong and weak binders for the protein ZBTB7A. The mean of each distribution is marked with a dashed line. (E) The AUC score for each protein–DNA pair, calculated based on the predicted probability distributions.

IDEA Decodes the Molecular Grammar Governing Protein-DNA Interactions

IDEA protocol learns the physicochemical interactions that determine protein-DNA interactions by utilizing the sequence-structure relationship embedded in the protein-DNA experimental structures. Such a physicochemical interaction pattern can be interpreted directly from the learned energy model. To illustrate this, we focused on the energy model learned from the MAX-DNA complexes (Fig. 2B). Notably, DNA Cytosine (DC) exhibited strong interactions with Arginine (R), consistent with the fact that the E-box region (CACGTG) frequently attracts the positively charged residues of bHLH TFs.43 Importantly, Arginine was in close contact with Cytosine in the crystal structure (Fig. 2C), thus consistent with the strong DC-R interactions shown in the learned energy model. Upon integrating the SELEX data into our model training, we found that the improved energy model shows additional unfavorable interactions between glutamic acid (E) and all DNA nucleotides, consistent with their negative charges (Fig. S3). Thus, including more experimental data can boost IDEA’s predictive accuracy by refining the amino-acid-nucleotide interacting energy model to better align with physical principles.

IDEA Generalizes across Various Protein Families

To examine IDEA’s predictive accuracy across different DNA-binding protein families, we applied it to calculate protein-DNA binding affinities using a comprehensive HT-SELEX dataset.26 We focused on evaluating the capability of IDEA to distinguish strong binders from weak binders for each protein with experimentally determined structures. We calculated the probability density distribution of the top and bottom binders identified in the SELEX experiment. A well-separated distribution indicates the successful identification of strong binders by IDEA (Figure 2D and Fig. S4). Receiver Operating Characteristic (ROC) analysis was performed to calculate the Area Under the Curve (AUC) score for these predictions. Further details are provided in Methods Section Evaluation of IDEA prediction using HT-SELEX data. Our analysis shows that for 80% of the proteins across 14 protein families, IDEA successfully differentiates strong binders from weak ones, with an AUC score greater than 0.5. We further performed 10-fold cross-validation on the binding affinities of the protein-DNA pairs in this dataset and found that IDEA outperforms state-of-theart protein-DNA models for cases with available experimentally determined protein-DNA complex structures (Fig. S5)

We also applied IDEA to calculate the binding affinity of additional TFs with available MITOMI measurements.41 IDEA’s predicted binding affinity of Pho4 protein, another bHLH transcription factor, correlates well with the experimental measurement (Pearson Correlation 0.6, Fig. S6A) by training on the only available protein-DNA structure (PDB ID: 1A0A). We further evaluated IDEA’s predictive performance for the zinc-finger protein Zif268, which has a different structure from the bHLH TFs and thus requires a different atom to represent the DNA nucleotides (Table S1). Due to limited experimental data for all the possible DNA sequence combinations, we focused on testing IDEA’s predictions on point-mutated DNA sequences, which was known to be a challenging task due to their minor deviation from the wild-type sequence. Despite the challenge, IDEA achieves accurate predictions, with a Pearson Correlation of 0.57 (Fig. S6B). We further expanded the training dataset to include all the Zif268 structures and their associated sequences from the same CATH zinc finger superfamily (CATH ID: 3.30.160.60). Incorporating these additional training data further improves the predictive accuracy (Pearson Correlation 0.63, Fig. S7).

IDEA Demonstrates Transferability across Proteins in the Same CATH Superfamily

Since IDEA relies on the sequence-structure relationship of given protein-DNA complexes to reach predictive accuracy, we inquired whether the trained energy model from one proteinDNA complex could be generalized to predict the binding specificities of other complexes.

To test this, we assessed the transferability of IDEA predictions across all 11 structurally available protein-DNA complexes within the MAX TF-associated CATH superfamily (CATH ID: 4.10.280.10, Helix-loop-helix DNA-binding domain). We trained IDEA based on each of these 11 complexes and then used the trained model to predict the MAX-based MITOMI binding affinity. Our results show that IDEA generally makes correct predictions of the binding affinity when trained on proteins that are homologous to MAX, with Pearson Correlation coefficients larger than 0.5 (Fig. 3A).

IDEA prediction shows transferability within the same CATH superfamily.

(A) The predicted MAX binding specificity, trained on other protein-DNA complexes within the same protein CATH superfamily, correlates well with experimental measurement. The proteins are ordered by their probability of being homologous to the MAX protein, determined using HHpred.44 Training with a homologous protein (determined as a hit by HHpred) usually leads to better predictive performance (Pearson Correlation coefficient > 0.5) compared to non-homologous proteins. (B) Structural alignment between 1HLO (white) and 1A0A (blue), two protein-DNA complexes within the same CATH Helix-loop-helix superfamily. The alignment was performed based on the E-box region of the DNA.45 (C) The optimized energy model for 1A0A, a protein-DNA complex structure of the transcription factor Pho4 and DNA, with 33.41% probability of being homologous to the MAX protein. The optimized energy model is presented in reduced units, as explained in the Methods.

The transferability of IDEA within the same CATH superfamily can be understood from the similar structural interfaces among different DNA-binding proteins, which determines a similar learned energy model. For example, the Pho4 protein (PDB ID: 1A0A) shares a highly similar DNA-binding interface with the MAX protein (PDB ID: 1HLO) (Fig. 3B), despite having a 33.41% probability of being homologous. Consequently, the energy model derived from the Pho4-DNA complex (Fig. 3C) exhibits a similar amino-acid-nucleotide interactive pattern as that learned from the MAX-DNA complex (Fig. 2B).

Identification of Genomic Protein-DNA Binding Sites

The genomic locations of DNA binding sites are causally related to major cellular processes. 17 Although multiple techniques have been developed to enable high-throughput mapping of genomic protein-DNA binding locations, such as ChIP–seq, 9,17,46 DAP-seq,10 and FAIREseq,47 challenges remain for precisely pinpointing protein-binding sites at a base-pair resolution. Furthermore, the applicability of these techniques for different DNA-binding proteins is restricted by the quality of antibody designs.17 Therefore, a predictive computational framework would significantly reduce the costs and accelerate the identification of genomic binding sites of DNA-binding proteins.

We incorporated IDEA into a protocol to predict the genomic protein-DNA binding sites at the base-pair resolution, given a DNA-binding protein and a genomic DNA sequence. To evaluate the predictive accuracy of this protocol, we utilized publicly available ChIP-seq data for MAX TF binding in GM12878 lymphoblastoid cell lines.48 As the experimental measurements were conducted at the 420 base pairs resolution, we averaged our modeling prediction over a window spanning 500 base pairs. Since the experimental signals are sparsely distributed across the genome, we focused our prediction on the 1 Mb region of Chromosome 1, which has the densest and most reliable ChIP-seq signals (156000 kb 157000 kb, representative window shown in Fig. 4A top). Lower predicted binding free energy corresponds to stronger protein-DNA binding affinity, indicating higher probability binding sites.

IDEA accurately identifies genome-wide protein-binding sites.

(A) The IDEA-predicted MAX-transcription factor binding sites on the Lymphoblastoid cells chromosome (bottom) correlate well with ChIP-seq measurements (top), shown for a representative 1Mb region of chromosome 1 where ChIP-seq signals are densest. For visualization purposes, a 1 kb resolution was used to plot the predicted normalized Z scores, with highly probable binding sites represented as red peaks. (B) AUC score for prediction accuracy based on the normalized Z scores, averaged over a 500 bp window to match the experimental resolution of 420 bp.

The predicted binding free energies were further normalized by those of the randomized decoy sequences (see the Methods Section Processing of ChIP-seq Data for details). The identified MAX-binding sites are marked in red, with a normalized Z score < −0.75. Our prediction successfully identifies the majority of the MAX binding sites, achieving an AUC score of 0.81 (Fig. 4B). We also tested IDEA’s predictive performance on two additional typical DNA-binding proteins and successfully identified their genomic binding sites (Fig. S8). Therefore, IDEA provides an accurate initial prediction of genomic recognition sites of DNA-binding proteins based only on DNA sequences, facilitating more focused experimental efforts. Furthermore, IDEA holds the potential to identify binding sites beyond the experimental resolution and can aid in the interpretation of other sequencing techniques to uncover additional factors that influence genome-wide DNA binding, such as chromatin accessibility.4951

Integration of IDEA into a Residue-Resolution Simulation Model Captures Protein-DNA Binding Dynamics

The trained residue-level model can be incorporated into a simulation model, thus enabling investigations into dynamic interactions between protein and DNA for mechanistic studies. Coarse-grained protein-nucleic acid models have shown strong advantages in investigating large biomolecular systems.5256 Notably, the structure-based protein model57 and 3-Site-Per-Nucleotide (3SPN) DNA model58 have been developed at the residue resolution to enable efficient sampling of biomolecular systems involving protein and DNA. A combination of these two models has proved instrumental in quantitatively studying multiple large chromatin systems,59 including nucleosomal DNA unwrapping,6062 interactions between nucleosomes,63,64 large chromatin organizations,6466 and its modulation by chromatin factors.6769 Although both the protein and DNA force fields have been systematically optimized,58,7073 the non-bonded interactions between protein and DNA were primarily guided by Debye-Hückel treatment of electrostatic interactions, which did not consider sequencespecific interactions between individual protein amino acids and DNA nucleotides.

To refine this simulation model by including protein-DNA interaction specificity, we incorporated the IDEA-optimized protein-DNA energy model into the coarse-grained simulation model as a short-range Van der Waals (VdW) interaction in the form of a Tanh function (See Methods Section Sequence Specific Protein-DNA Simulation Model for modeling details). This refined simulation model is used to simulate the dynamic interactions between DNA-binding proteins and their target DNA sequences. As a demonstration of the working principle of this modeling approach, we selected 9 TF-DNA complexes whose binding affinities were measured by the same experimental group.74 We applied IDEA to learn an energy model based on these 9 protein-DNA complexes, which leads to a strong correlation between the modeling predicted binding affinity and the experimental measurements (Pearson Correlation coefficient 0.72, Fig. 5A). The learned energy model was incorporated into the simulation model, which was used to simulate the binding process between the protein and DNA target using umbrella sampling techniques.75 The predicted binding free energy76 (Fig. S9) from our simulation shows a strong correlation (Pearson Correlation 0.79) and a near-perfect fit with the experimental measurement (Fig. 5B), greatly improved over the previous model, which depicts protein-DNA interactions by non-sequence-specific homogeneous electrostatics interactions (Pearson Correlation 0.6, Fig. S10). Notably, due to an undetermined prefactor in the IDEA training, it can only provide relative binding free energies in reduced units for different protein-DNA complexes. In contrast, the IDEA-incorporated simulation model can predict the absolute binding free energy given the protein and DNA structures with physical units.

Enhanced protein-DNA simulation model by incorporating IDEAoptimized energy model.

(A) The prediction of the protein-DNA binding affinity for 9 protein-DNA complexes using the IDEA-learned energy model shows a strong correlation with the experimental measurements.74 ΔΔG represents the changes in binding free energy relative to the protein-DNA complex with the lowest predicted binding free energy. The predicted binding free energies are presented in reduced units, as explained in Methods. (B) Illustration of an example protein-DNA complex structure (PDB ID: 9ANT) and the coarse-grained simulation used to evaluate protein-DNA binding free energy. A typical free energy profile was extracted from the simulation, using the center of mass (COM) distance as the collective variable. The shaded region represents the standard deviation of the mean. Representative structures of bound and unbound proteins are shown above, with protein in blue and DNA in yellow. (C) Incorporating the IDEA-optimized energy model into the simulation model improves the prediction of protein-DNA binding affinity, compared to the prediction by the previous model with electrostatic interactions and uniform non-specific attraction between protein and DNA (Fig. S9). The predicted binding free energies are presented in physical units. Error bars represent the standard deviation of the mean.

Discussion

The protein-DNA interaction landscape has evolved to facilitate precise targeting of proteins towards their functional binding sites, which underlie essential processes in controlling gene expression. These interaction specifics are determined by physicochemical interactions between amino acids and nucleotides. By integrating sequences and structural data from available protein-DNA complexes into an interaction matrix, we introduce IDEA, a datadriven method that optimizes a system-specific energy model. This model enables highthroughput in silico predictions of protein-DNA binding specificities and can be scaled up to predict genomic binding sites of DNA-binding proteins, such as TFs. IDEA achieves accurate de novo predictions using only protein-DNA complex structures and their associated sequences, but its accuracy can be further enhanced by incorporating available experimental data from other binding assay measurements, such as the SELEX data,15,16,21 achieving accuracy comparable or better than state-of-the-art methods (Figs. S2 and S5). Despite significant progress in genome-wide sequencing techniques, 9,10,17,47 determining the binding specificities of DNA-binding biomolecules remains time-consuming and expensive. Therefore, IDEA presents a cost-effective alternative for generating the initial predictions before pursuing further experimental refinement.

A key advantage of IDEA is its incorporation of both structural and sequence information, which greatly reduces the demand for extensive training data. Prior efforts have applied deep learning techniques to predict the interactions between proteins and DNA based on their sequences.24,25,27 These approaches usually require a large amount of sequencing training data. By leveraging the sequence-structure relationship inherent in protein-DNA complex structures, IDEA achieves accurate de novo prediction using only a few protein-DNA complexes within the same protein superfamily.

In addition, despite the complicated in vivo environment, IDEA enables predictions of genomic binding sites and shows good agreement with the ChIP-seq measurements (Fig. 4, S8). Moreover, IDEA features rapid prediction, typically requiring 1-3 days to predict base-pair resolution genomic binding sites of a 1 Mb DNA region using a single CPU. Furthermore, IDEA’s prediction can be trivially parallelized, making it possible to perform genome-wide predictions of DNA recognition sites by a given protein within weeks.

Another highlight of IDEA is its ability to present an interpretable amino acid-nucleotide interaction energy model for given protein-DNA complexes. Unlike other machine-learning approaches, IDEA’s optimized energy model not only predicts binding affinities and binding sites of DNA-binding proteins but also provides an interpretable representation of interaction specifics between each type of amino acid and nucleotide. Additionally, we integrated this physicochemical-based energy model into a simulation framework, thereby improving the characterization of protein-DNA binding dynamics. Therefore, IDEA-based simulation enables investigations into dynamic interactions among various proteins and DNA, facilitating molecular-based simulations to understand the physical mechanisms underlying many DNA-binding processes, such as transcription, epigenetic regulations, and their modulation by sequence mutations such as single-nucleotide polymorphisms (SNPs). 77,78

Since the IDEA-optimized energy model serves as a “molecular grammar” for guiding protein-DNA interaction specifics, it can be expanded to include additional variants of amino acids and nucleotides. With recent advancements in sequencing and structural characterization,79,80 as well as deep-learning-based structural predictions,8183 thousands of structures and sequences have been solved for these variants. The IDEA procedure can be repeated for these variant-included structures to expand the optimized energy model by including modifications such as methylated DNA nucleotides 84 and post-translationally modified amino acids.85 Such strategies will facilitate studies on the functional relevance of epigenetic variants, such as those caused by exposure to environmental hazards,86 and their structural impact on the human genome.8789

Methods

Structural Modeling of Protein and DNA

We utilized a coarse-grained framework to extract structural information of protein and DNA at the residue level (Figure 1). Specifically, each protein amino acid was represented by the coordinates of the Cα atom, and each DNA nucleotide was represented by either the P atom in the phosphate group or the C5 atom on the nucleobase. The choice between these two DNA atom types (C5 or P) depended on their distances to surrounding Cα atoms. We hypothesized that DNA atoms closer to Cα atoms would provide more meaningful information on protein-DNA interactions. Therefore, we chose the DNA atom type (C5 or P) with the largest sum of occurrences within 8 Å, 9 Å, and 10 Å distances from their surrounding Cα atoms (Table S1). This selection rule was applied consistently across all complexes in the training dataset to maintain uniform criteria.

Energy Model Optimization

Training Protocol

We fused the native protein and DNA sequences from the experimentally determined proteinDNA structural interface into an interaction matrix for model training. These sequences, located at the protein-DNA structural interface (defined as those residues whose representative atoms have a distance ≤ 8Å, highlighted in blue in Fig. 1 Left), are hypothesized to have evolved to favor strong amino acid-nucleotide interactions. We collected the sequences from both the strong binders and decoy binders (considered weak binders) for model parameterization. Synthetic decoy binders were generated by randomizing either the DNA (1000 sequences) or protein sequences (10,000 sequences) from the strong binders. The IDEA protocol optimizes a pairwise amino-acid-nucleotide energy model by maximizing the gap in binding energies between the strong and decoy binders. Similar strategies have been successfully applied in protein folding and protein-protein binding.34,36,53,90,91 Specifically, the binding energies for strong binders (Estrong) and their corresponding decoy weak binders (Edecoy) are calculated using the following expression:

where Θi,jutilized a switching function that captures an effective interaction range between each amino acid-nucleotide pair within the protein-DNA complex:

here, rmin = −8 Å and rmax = 8 Å. This defines two residues to be “in contact” if their distance is less than 8 Å. κ (here taken 0.7) is a scaling factor that modulates the steepness of the hyperbolic tangent function.

IDEA optimizes γ, a protein-DNA energy model represented by a 20-by-4 interaction matrix. Each entry γi,j(ai, aj) describes the pairwise interaction between an amino acid i and a nucleotide j. In practice, the parameters γi,j(ai, ai) were optimized to maximize δE/ΔE, where δE = ⟨Edecoy⟩ − ⟨Estrong⟩, representing the energy gap between strong and decoy binders. Mathematically, δE can be represented as . The standard deviation of the decoy binding energies ΔE can be calculated as , where:

here, ϕ takes the function form of Ebinding and summarizes the total number of contacts between each type of amino acid and nucleotide in the given protein-DNA complexes. The vector A thus specifies the difference in interaction strengths for each pair of amino acids and nucleotides between the strong and decoy binders, with a dimension of 1 ×300. Additionally, matrix B is a 300 × 300 covariance matrix that contains information about which types of interactions tend to co-occur in the decoy binders. Optimizing the function δE/ΔE with respect to the parameters γi,j(ai, aj) corresponds to maximizing the ratio . This maximization is effectively attained by maximizing the functional objective R(γ), defined as:

where λ is a Lagrange multiplier. Setting the derivative ∂R(γ) equal to zero leads to γB−1A. Here, γ represents a 300 × 1 vector that encodes the learned binding strengths across different amino acid-nucleotide interactions. A filtering procedure is further applied to reduce the noise arising from the finite sampling of certain types of interactions: The matrix B is first diagonalized such that B−1 = P Λ−1P −1, where P is the matrix of eigenvectors of B, and Λ is a diagonal matrix composed of B’s eigenvalues. We retained the principal N modes of B, which are ordered by the descending magnitude of their eigenvalues and replaced the remaining 300 − N eigenvalues in Λ with the N th eigenvalue of B. In all of our optimization reported in this paper, N was set to 70 to maximize the utilization of non-zero eigenvalues in the matrix B. For visualization purposes, the vector γ is reshaped into a 20-by-4 matrix, as shown in Fig. 2B. Due to the presence of an undetermined scaling factor after the model optimization, the predicted energies are presented in reduced units. Only when the optimized energy is integrated into a simulation model, can IDEA-based simulations predict binding free energies in physical units (Fig. 5C).

Enhanced Modeling Prediction with SELEX Data

When additional experimental binding data, such as SELEX,27 is available, we expanded IDEA to incorporate this data for enhancing the model’s predictive capabilities. SELEX data provides binding affinities between given transcription factors and DNA sequences, which can be converted into dimensionless binding free energy with the expression E = − ln(Affinity), with Affinity being the normalized affinity generated from the SELEX package. We then calculated the ϕ value by utilizing the protein-DNA complex structure threaded with the SELEX-provided DNA sequences, which can be used together with the converted E values to compute the γ energy model based on the equation 1:

here γ is the enhanced protein-DNA energy model represented as a 300 × 1 vector.

For predicting the MAX-DNA binding specificity, we utilized the Human MAX SELEX data deposited in the European Nucleotide Archive database (accession no. PRJEB25690).27,92 The R/Bioconductor package SELEX version 1.30.093 was used to determine the observed R1 count for all 10-mers. Among them, ACCACGTGGT is the motif with the highest frequency, which aligns with the DNA sequence in the MAX-DNA PDB structure (PDB ID: 1HLO), whose full DNA sequence is CACCACGTGGT. Consequently, we used this motif as the reference to align the remaining SELEX-measured 10-mer sequences. A total of 255 10-mer sequences, corresponding to the DNA variants measured from the MITOMI experiment,13 were selected. We threaded the PDB structure with these sequences to construct the ϕ and utilized the SELEX-converted binding free energy to compute the γ energy model.

Evaluation of IDEA Predictions Using HT-SELEX data

The processed HT-SELEX data used in this study is from,26 which contains processed DNA binding sequence motifs and their normalized copy number detected from the HT-SELEX experiment, referred to as M-word scores. A higher M-word score corresponds to a stronger binding sequence motif detected in the HT-SELEX experiments. Among all the proteins in the data, 23 have experimentally determined protein-DNA complex structures. We predicted the binding free energies of all the processed sequences of these proteins using our protocol. For evaluating IDEA predictions, we classified motifs with M-word scores above 0.9 as strong binders (label 1) and those with M-word scores below 0.3 as weak binders (label 0). In cases where no sequences had an M-word score below 0.3, we used alternative cutoffs of 0.4 or 0.5 to ensure that at least three weak-binding sequences were included. These labeled strong and weak binders were then used as the ground truth for evaluating IDEA’s predictions. We assessed IDEA’s predictive accuracy by calculating the probability density distributions of the predicted binding free energies (Fig. S4). A well-separated distribution between strong and weak binders indicates a successful prediction. Additionally, we calculated the AUC score based on ROC analysis of these probability distributions. An AUC score of 1.0 means IDEA can fully separate the strongand weak-binder distributions, while an AUC score of 0.5 indicates no separation.

We further performed 10-fold cross-validation on this dataset, using the protocol described in Methods Section Enhanced Modeling Prediction with SELEX Data. For each protein, we divided the entire dataset into 10 equal, randomly assigned folds. In each iteration, we used randomly selected 9 of the 10 folds as the training set and the remaining fold as the test set. This process was repeated 10 times so that each fold served as the test set once. We then reported the average R2 scores across these iterations to evaluate our model’s predictive performance. Our results are compared with the 1mer and 1mer+shape methods from26 (Fig. S5). our comparative analysis shows IDEA achieved higher predictive accuracy than the existing state-of-the-art sequence-based protein-DNA binding predictors for those protein-DNA complexes that have available experimentally resolved structures.

Selection of CATH Superfamily for Testing Model Transferability

The MAX protein is characterized by a Helix-loop-helix DNA-binding domain and belongs to the CATH superfamily 4.10.280.10,94 which include a total of 11 protein-DNA structures with the E-box sequence (CACGTC), similar to the MAX-binding DNA. We hypothesized that proteins within the same superfamily exhibit common residue-interaction patterns due to their structural and evolutionary similarities. To test the transferability of our model, we conducted individual training on each one of those 11 protein-DNA complex structures and their associated sequences, predicting the MAX-DNA binding affinity in the MITOMI dataset.13 To examine the connection between the predictive outcome and the probability that the training protein is homologous to MAX, HHpred95 was utilized to search for MAX’s homologous proteins, using the sequence of 1HLO as the query (Fig. 3A).

Processing of ChIP-seq Data

To evaluate the predictive capabilities of our protocol on specific chromosomal segments, we selected ChIP-seq data for three transcription factors—MAX, EGR1, and CTCF. The data were sourced from the ENCODE project, with accession numbers ENCFF361EVH (MAX), ENCSR026GSW (EGR1), and ENCFF960ZGP (CTCF).

For testing, we chose genomic regions with the densest ChIP-seq signals. For MAX and CTCF, we focused on the human GM12878 chromosome 1 (GRCh38) region between 156,000,000 bp and 157,000,000 bp, as this 1 Mb segment contains the densest signals for both transcription factors, with 33 MAX binding sites and 70 CTCF binding sites. For EGR1, we used the HepG2 cell line, focusing on the GRCh38 chr8 region between 143,000,000 bp and 144,000,000 bp, which contains 97 peaks.

To predict binding sites, we scanned the corresponding DNA regions base by base using the DNA sequence closest to each protein in the relevant protein-DNA complex structures. For MAX, we used the 11-bp sequence from the PDB structure 1HLO, generating 1, 000, 000− 11 + 1 = 999, 990 sequences. For EGR1, we used the 10-bp sequence from PDB 1AAY, producing 1, 000, 000 − 10 + 1 = 999, 991 sequences. For CTCF, no single PDB structure covers the entire protein,96 so two PDB structures were used: 8SSS (ZF1-ZF7, 23-bp) and 8SSQ (ZnFs 3-11, 35-bp), generating 1, 000, 000 −23 + 1 = 999, 978 and 1, 000, 000 −35 + 1 = 999, 966 sequences, respectively.

Predicted binding free energies for each transcription factor were normalized into Z-scores using randomized decoy sequences:

where ⟨ ⟩ is the average over all decoy sequences, and SD is the standard deviation. Given the evolutionary preference for strong binding, most Z-scores were negative, indicating stronger binding. ROC curves were then constructed to assess the predictive performance, with lower predicted Z-scores representing stronger binding affinity (Fig. 4, S8).

Sequence Specific Protein-DNA Simulation Model

We utilized a previously developed protein and DNA residue-level coarse-grained model for simulating protein-DNA binding interactions. The protein was modeled using the Cα-based structure-based model,57 with each amino acid represented by a simulation bead located at the Cα site. The DNA was modeled with the 3-Site-Per-Nucleotide (3SPN) DNA model, 58 with each DNA nucleotide modeled by three coarse-grained sites located at the centers of mass of the phosphate, sugar, and base sites. Both these two models and their combination have been validated to reproduce multiple experimental measurements.57,58,61,6366,68,69,72,97

Detailed descriptions of this protein-DNA force field have been documented in previous studies.60,65,66,69 Specifically, we followed Ref. 58 for simulating DNA-DNA interactions, and Ref. 57 for simulating the protein-protein interactions. The protein-DNA interactions were modeled with a long-range electrostatic interaction and a short-range sequence-specific interaction. The electrostatic interactions were simulated with the Debye-Hückel treatment to capture the screening effect at different ionic concentrations:

where ɛ = 78.0 is the dielectric constant of the bulk solvent. qi and qj correspond to the charges of the two particles. ɛ0 = 1 is the vacuum electric permittivity, and lD is the Debye-Hückel screening length.

The excluded volume effect was modeled using a hard wall potential with an apparent radius of σ = 4.0Å:

An additional sequence-specific protein-DNA interaction was modeled with a Tanh function between the base site of the DNA and the amino acid, with the form:

where r0 = 8 Å and η = 0.7 Å−1. W is a 4 × 20 matrix that characterizes the interaction energy between each type of amino acid and DNA nucleotide. Here, W used the normalized IDEA-learned energy model based on 9 protein-DNA complexes (Fig. 5A). The IDEA-learned energy model was normalized by the sum of the absolute value of each term of the energy model. Detailed parameters are provided in Table S2. Additional details on the protein-DNA simulation are provided in the Supplementary Materials.

Data availability

The implementation of the IDEA model, along with training and prediction examples, is available on our GitHub repository and in the Supplementary data.

Acknowledgements

This work was supported by the startup funding from North Carolina State University. Additionally, this research received partial funding from the NC State Genetics and Genomics Academy. We appreciate Dr. Keith Weninger, Dr. Faruck Morcos, and Dr. Qin Zhou for their insightful discussions and Dr. Sebastian Maerkl for guiding us to the MITOMI experimental data.

Additional files

Supplementary Information