Figures and data

Overview of the IDEA protocol.
The protein–DNA complex, represented by the human MAX–DNA complex structure (PDB ID: 1HLO), was used for training the IDEA model. The sequences of the protein and DNA residues that form close contacts (highlighted in blue on the structure) in the structure were included in the training dataset. In addition, a series of synthetic decoy sequences was generated by randomizing the contacting residues in both the protein and DNA sequences. The amino acid–nucleotide energy model was then optimized by maximizing the ratio of the binding energy gap (δE) between protein and DNA in the native complex and the decoy complexes relative to the energy variance (ΔE). The optimized energy model can be used for multiple predictive applications, including the evaluation of binding free energies for various protein–DNA sequence pairs, prediction of genomic DNA binding sites by transcription factors or other DNA-binding proteins, and integration into a sequence-specific, residue-resolution simulation framework for dynamic simulations.
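The optimization target described above can be illustrated with a minimal Python sketch. It makes several assumptions that the caption does not state: the energy is a sum of pairwise terms over contacting positions, decoys are produced by shuffling each sequence while keeping the native contact map, and ΔE is taken as the standard deviation of the decoy energies. The names `gamma`, `contact_energy`, and `gap_to_spread_ratio` are hypothetical and do not come from the IDEA code.

```python
import random
import statistics


def contact_energy(gamma, protein, dna, contacts):
    """Sum pairwise energies over contacting (protein index, DNA index) pairs.

    Pairs missing from `gamma` contribute zero energy.
    """
    return sum(gamma.get((protein[i], dna[j]), 0.0) for i, j in contacts)


def shuffled(sequence, rng):
    """Return a randomly permuted copy of a sequence string."""
    chars = list(sequence)
    rng.shuffle(chars)
    return "".join(chars)


def gap_to_spread_ratio(gamma, protein, dna, contacts, n_decoys=200, seed=0):
    """Return deltaE / DeltaE for one native complex and shuffled decoys.

    deltaE is the mean decoy energy minus the native energy (assumed sign
    convention); DeltaE is the standard deviation of the decoy energies
    (assumed definition of the spread).
    """
    rng = random.Random(seed)
    e_native = contact_energy(gamma, protein, dna, contacts)
    e_decoys = [
        contact_energy(gamma, shuffled(protein, rng), shuffled(dna, rng), contacts)
        for _ in range(n_decoys)
    ]
    gap = statistics.mean(e_decoys) - e_native
    spread = statistics.pstdev(e_decoys)
    return gap / spread
```

An optimizer would adjust the entries of `gamma` to make this ratio as large as possible; the sketch only evaluates the ratio for a fixed `gamma`.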

Results for MAX-based predictions.
(A) The binding free energies calculated by IDEA, trained using a single MAX–DNA complex (PDB ID: 1HLO), correlate well with experimentally measured MAX–DNA binding free energies.13 ΔΔG represents the changes in binding free energy relative to that of the wild-type protein–DNA complex. (B) The heatmap, derived from the optimized energy model, illustrates key amino acid–nucleotide interactions governing MAX–DNA recognition, showing pairwise interaction energies between 20 amino acids and the four DNA bases: DA (deoxyadenosine), DT (deoxythymidine), DC (deoxycytidine), and DG (deoxyguanosine). Both the predicted binding free energies and the optimized energy model are expressed in reduced units, as explained in the Methods Section: Training Protocol. Each cell represents the optimized energy contribution, where blue indicates more favorable (lower) energy values, and red indicates less favorable (higher) values. (C) The 3D structure of the MAX–DNA complex (zoomed in with different views) highlights key amino acid–nucleotide contacts at the protein–DNA interface. Notably, several DNA deoxycytidines (red spheres) form close contacts with arginines (blue spheres). Additional nucleotide color coding: adenine (yellow spheres), guanine (green spheres), thymine (pink spheres). (D) Probability density distributions of predicted binding free energies for strong (blue) and weak (red) binders of the protein ZBTB7A. The median of each distribution is marked with a dashed line. (E) Summary of AUC scores for protein–DNA pairs across 12 protein families, calculated based on the predicted probability distributions of binding free energies.
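As a rough illustration of how an AUC score can be obtained from two sets of predicted binding free energies, the sketch below uses the rank-based (Mann-Whitney) form of the ROC AUC. It assumes that a lower predicted energy means stronger binding; the function name `auc_from_energies` is hypothetical and the sketch is not the authors' evaluation code.

```python
def auc_from_energies(strong, weak):
    """ROC AUC for separating strong from weak binders by predicted energy.

    Equals the probability that a randomly chosen strong binder has a lower
    predicted energy than a randomly chosen weak binder; ties count as half.
    """
    wins = 0.0
    for e_strong in strong:
        for e_weak in weak:
            if e_strong < e_weak:
                wins += 1.0
            elif e_strong == e_weak:
                wins += 0.5
    return wins / (len(strong) * len(weak))
```

A value of 1.0 means every strong binder is predicted to bind more favorably than every weak binder, and 0.5 corresponds to no separation between the two distributions.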

IDEA prediction shows transferability within the same CATH superfamily.
(A) The predicted MAX binding affinity, trained on other protein-DNA complexes within the same protein CATH superfamily, correlates well with experimental measurement. The proteins are ordered by their probability of being homologous to the MAX protein, determined using HHpred.47 Training with a homologous protein (determined as a hit by HHpred) usually leads to better predictive performance (Pearson correlation coefficient > 0.5) compared to non-homologous proteins. (B) Structural alignment between 1HLO (white) and 1A0A (blue), two protein-DNA complexes within the same CATH Helix-loop-helix superfamily. The alignment was performed based on the E-box region of the DNA.48 (C) The optimized energy model for 1A0A, a protein-DNA complex structure of the transcription factor PHO4 and DNA, with 33.41% probability of being homologous to the MAX protein. The optimized energy model is presented in reduced units, as explained in the Methods Section: Training Protocol.

IDEA accurately identifies genome-wide protein-binding sites.
(A) The IDEA-predicted MAX transcription factor binding sites in lymphoblastoid cells (bottom) correlate well with ChIP-seq measurements (top), shown for a representative 1 Mb region of chromosome 1 where ChIP-seq signals are densest. For visualization purposes, a 1 kb resolution was used to plot the predicted normalized Z scores, with highly probable binding sites represented as red peaks. (B) AUC score for prediction accuracy based on the normalized Z scores, averaged over a 500 bp window to match the experimental resolution of 420 bp.
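The normalization and window averaging mentioned above can be sketched as follows. This is an illustration only: the caption does not say whether the windows slide or tile the region, so non-overlapping windows are assumed here, and the Z score is computed against the mean and population standard deviation of the supplied values. The function names are hypothetical.

```python
import statistics


def z_scores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


def window_average(values, window):
    """Average values over consecutive, non-overlapping windows.

    The last window may be shorter when len(values) is not a multiple
    of `window`.
    """
    return [
        statistics.mean(values[start:start + window])
        for start in range(0, len(values), window)
    ]
```

With one predicted score per base pair, `window_average(z_scores(scores), 500)` would give one value per 500 bp window under these assumptions.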

Enhanced protein-DNA simulation model by incorporating IDEA-optimized energy model.
(A) Predicted protein-DNA binding free energies for 9 protein-DNA complexes using the IDEA-learned energy model correlate with the experimental measurements.77 ΔΔG represents the changes in binding free energy relative to the protein-DNA complex with the lowest predicted value. The predicted binding free energies are presented in reduced units, as explained in Methods. (B) Example of a protein-DNA complex structure (PDB ID: 9ANT) and the coarse-grained simulation used to evaluate protein-DNA binding free energy. A representative free energy profile was extracted from the simulation, using the center of mass (COM) distance between protein and DNA as the collective variable. The shaded region represents the standard deviation of the mean. Representative bound and unbound structures are shown above, with protein in blue and DNA in yellow. (C) Incorporating the IDEA-optimized energy model into the simulation model improves the prediction of protein-DNA binding affinity, compared to the prediction by the previous model with electrostatic interactions and uniform nonspecific attraction between protein and DNA (Figure S14). The predicted binding free energies are presented in physical units. Error bars represent the standard deviation of the mean.

Including additional human MAX-DNA complex structures in IDEA training improves prediction.
Including additional human-associated MAX-DNA complex structures and their associated sequences (PDB IDs: 1HLO, 1NLW, 1NKP) leads to a minor improvement in the correlation between the predicted and experimental binding free energies, with a Pearson correlation coefficient of 0.68 and Spearman’s rank correlation coefficient of 0.65. ΔΔG represents the changes in binding free energy relative to the protein binding to the wild-type DNA sequence. The predicted binding free energies are presented in reduced units, as explained in the Methods Section: Training Protocol.
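For readers unfamiliar with the two correlation measures reported here, the following sketch computes both from paired predicted and experimental values. It is a generic textbook implementation with hypothetical function names, not the authors' analysis script; Spearman's coefficient is computed as the Pearson correlation of average ranks.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def average_ranks(values):
    """Rank values from 1 upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Pearson measures linear agreement between predicted and experimental ΔΔG, while Spearman only asks whether the ordering of the sequences is reproduced.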

Enhanced IDEA prediction by integrating SELEX-seq data for MAX transcription factor.
Incorporating SELEX-seq data27 into IDEA’s training protocol significantly improves its predictive accuracy on MITOMI binding measurements13 for the MAX transcription factor, achieving a Pearson correlation coefficient of 0.79 and a Spearman’s rank correlation coefficient of 0.76.

Refined energy model with SELEX data integration reveals additional physicochemical insights into protein-DNA interactions.
Upon integrating the SELEX data into our model training, we found that the refined energy model reveals additional unfavorable interactions between glutamic acid (E) and deoxycytidine (DC), consistent with their negative charges. The optimized energy model is presented in reduced units, as explained in the Methods Section: Training Protocol.



Evaluation of IDEA’s predictive accuracy for distinguishing strong from weak protein-DNA binding interactions (Part 1 of 3).
Probability density distributions of predicted binding free energies for strong (blue) and weak (red) binders identified from the HT-SELEX experiment26 are shown across multiple protein families. A clear separation between these distributions indicates the model’s effectiveness in identifying high-affinity DNA binders. The darker dashed lines within each distribution correspond to the median binding free energy for strong (blue) and weak (red) binders, respectively. For Mafb, the predicted median binding free energies for strong and weak binders are nearly identical (−103.826 and −103.824, respectively). As a result, the dashed lines almost completely overlap, and only one appears visible in the plot. The Receiver Operating Characteristic (ROC) curve and Precision-Recall (PR) curve adjacent to each density plot provide a quantitative assessment of prediction accuracy, with the Area Under the Curve (AUC) and precision-recall AUC (PRAUC) scores displayed in each panel.

Summary of balanced PRAUC scores for protein-DNA pairs across 12 protein families.

Performance comparison of IDEA model with other prediction methods.
(A) AUC scores and (B) PRAUC scores for identifying strong binders were evaluated across six different prediction methods: IDEA, IDEA augmented with binding sequences (IDEA-seq), ProBound,31 DeepBind,24 the general knowledge-based energy model DBD-Hunter,46 and family-specific knowledge-based energy model rCLAMPS.36 Predictive performances were assessed using HT-SELEX datasets spanning 22 proteins from 12 protein families. Each violin plot displays the distribution of scores for individual targets, with the width indicating score density. The thick grey bar represents the interquartile range (first to third quartiles), and the thin line extends to 1.5 times the interquartile range. Individual data points are depicted as scattered black dots, and the white dot represents the median. Sample sizes (n) are labeled above each group.

IDEA outperforms other predictors in cross-validation analysis of protein-DNA binding affinity.
Results from a 10-fold cross-validation indicate that IDEA achieves higher R² scores for most of the protein-DNA complexes that have available experimentally resolved structures, compared to the 1mer and 1mer+shape methods reported previously.26 In panel (A), each point represents a protein-DNA complex and compares IDEA’s R² score with that of the 1mer method, while panel (B) compares IDEA’s R² score with that of the 1mer+shape method. Points are color-coded by protein family (see Figure 2E for family annotations).
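The cross-validation procedure can be sketched in Python as below. A one-variable least-squares line stands in for the predictor; it is neither IDEA nor the 1mer model and is used only to keep the example self-contained. Pooling all held-out predictions into a single R² (rather than averaging per-fold values) and the interleaved fold assignment are also assumptions, and all function names are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def k_fold_indices(n, k):
    """Yield (train, test) index lists; fold f holds out items f, f+k, f+2k, ..."""
    for fold in range(k):
        test = list(range(fold, n, k))
        train = [i for i in range(n) if i % k != fold]
        yield train, test


def fit_line(x, y):
    """Least-squares slope and intercept for a single predictor (stand-in model)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return slope, mean_y - slope * mean_x


def cross_validated_r2(x, y, k=10):
    """Fit on k-1 folds, predict the held-out fold, and score all predictions."""
    predictions = [0.0] * len(x)
    for train, test in k_fold_indices(len(x), k):
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in test:
            predictions[i] = slope * x[i] + intercept
    return r_squared(y, predictions)
```

Because every prediction is made on data excluded from the fit, the resulting R² reflects generalization rather than goodness of fit on the training sequences.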

IDEA correctly predicts the protein-DNA recognition by additional transcription factors.
(A) The 3D structure of basic helix-loop-helix (bHLH) transcription factor PHO4 and its associated DNA (PDB ID: 1A0A). (B) Model-predicted binding free energies of PHO4 correlate with the experimentally determined binding free energies.13 (C) The 3D structure of zinc finger protein Zif268 and its associated DNA (PDB ID: 1AAY). (D) Model-predicted binding free energies of Zif268 correlate with the experimentally determined binding free energies.43 ΔΔG represents the changes in binding free energy relative to the protein binding to the wild-type DNA sequence.

Enhanced predictive accuracy with inclusion of Zif268 and related CATH protein structures.
Including the structures and associated sequences of 1AAY and other protein-DNA complexes from the same CATH zinc finger superfamily (CATH ID: 3.30.160.60) in the training dataset enhances the predictive accuracy, with a Pearson correlation coefficient of 0.63 and Spearman’s rank correlation coefficient of 0.60.

Spearman’s rank correlation coefficients between predicted and experimental MAX binding affinities across training proteins ordered by probability of being homologous to the MAX protein.

Principal component analysis of the normalized IDEA-learned energy model across 12 protein families.

IDEA accurately identifies genomic binding sites for additional proteins.
(A, B) Predicted binding sites for the EGR1 transcription factor in a 1 Mb region on chromosome 8 of the HepG2 cell line, where ChIP-seq signals are densest. The top plot of (A) shows ChIP-seq signals, while the bottom plot shows IDEA’s predicted binding sites, with highly probable sites marked as red peaks. Normalized Z scores were averaged over a 500 bp window to align with the experimental resolution. The ROC curve (B) shows an AUC score of 0.72, reflecting IDEA’s accuracy in predicting EGR1 binding sites. (C–F) Predicted binding sites for the CTCF (CCCTC-binding factor) protein in the GM12878 cell line on chromosome 1 using two different training structures: 8SSS (C, D) and 8SSQ (E, F). Panels (C) and (E) show IDEA’s predictions compared to ChIP-seq signals, with highly probable binding sites highlighted as red peaks. Panels (D) and (F) present ROC curves with AUC scores of 0.64 and 0.62, respectively, indicating IDEA’s predictive performance using each training structure.

Binding free energy curves calculated from simulations based on non-sequence-specific homogeneous electrostatic potential and IDEA potential models.
The predicted free energy profiles were computed as a function of the protein-DNA center of mass (COM) distance, comparing the non-sequence-specific homogeneous electrostatic potential (H potential) model (black) and the IDEA potential model (red). The line represents the mean free energy calculated from three equal partitions of the simulation trajectories, and the shaded areas were calculated as the standard deviation.
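The partition-based error estimate can be sketched as follows. The sketch assumes that a free energy profile (one value per COM-distance bin) has already been computed separately for each partition; how each profile is obtained from the trajectory is not covered. Using the sample standard deviation and discarding leftover frames when the trajectory length is not divisible by three are assumptions, and the function names are hypothetical.

```python
import statistics


def split_equal(samples, n_parts=3):
    """Split a trajectory into n_parts equal, consecutive blocks.

    Trailing samples that do not fill a complete block are dropped.
    """
    size = len(samples) // n_parts
    return [samples[i * size:(i + 1) * size] for i in range(n_parts)]


def mean_and_std_profile(profiles):
    """Per-bin mean and sample standard deviation across block profiles.

    `profiles` is a list of equal-length lists, one free energy profile
    per trajectory partition.
    """
    per_bin = list(zip(*profiles))
    means = [statistics.mean(values) for values in per_bin]
    stds = [statistics.stdev(values) for values in per_bin]
    return means, stds
```

The mean profile corresponds to the plotted line and the per-bin standard deviation to the shaded band.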

Prediction of protein-DNA binding free energy with the non-sequence-specific homogeneous electrostatic potential model.
Comparison between the simulation-predicted and experimentally assessed binding free energies for 9 protein-DNA complexes. The simulation free energies were computed using a non-sequence-specific homogeneous electrostatic potential (H potential) model. The predicted binding free energies are presented in physical units. Error bars represent the standard deviation of the mean. See Figure S13 for additional details.

Effect of the number of decoy sequences on model generalizability and transferability.
(A-B) Evaluation of IDEA’s generalizability across different decoy numbers, corresponding to Figure 2E. Violin plots summarize the distribution of AUC (A) and PRAUC (B) scores for three decoy combinations. (C-D) Analysis of IDEA transferability within the MAX CATH superfamily, corresponding to Figure 3A. Violin plots summarize the distribution of Pearson correlation coefficients (C) and Spearman’s rank correlation coefficients (D) between predicted and experimental binding affinities for three decoy combinations. In each violin plot, the thick grey bar represents the interquartile range (first to third quartiles), and the thin line extends to 1.5 times the interquartile range. Individual data are depicted as scattered black points, and the white dot represents the median. Sample sizes (n) are labeled above each group. The three tested decoy combinations include: 100 DNA + 1000 protein decoys (DNA100Pro1000), 1000 DNA + 1000 protein decoys (DNA1000Pro1000), and 1000 DNA + 10000 protein decoys (DNA1000Pro10000). The consistent performance across all decoy combinations demonstrates the robustness of the model’s prediction with respect to the number of decoy sequences.

Raw AUC scores for distinguishing strong and weak binders across 22 proteins from 12 protein families using six different prediction methods: IDEA, IDEA augmented with binding sequences (IDEA-seq), ProBound,31 DeepBind,24 DBD-Hunter,46 and rCLAMPS.36

Raw PRAUC scores for distinguishing strong and weak binders across 22 proteins from 12 protein families using six different prediction methods: IDEA, IDEA augmented with binding sequences (IDEA-seq), ProBound,31 DeepBind,24 DBD-Hunter,46 and rCLAMPS.36



This table reports the number of atoms within the cutoff distances from the Cα atoms for all the protein-DNA structures used in this study. For each structure, we selected the DNA atom type (C5 or P) with the largest sum of occurrences within 8 Å, 9 Å, and 10 Å distances from surrounding Cα atoms for modeling. For protein YY1 (PDB ID: 1UBD), where the counts for C5 and P were identical, both models trained on C5 and P achieved an AUC score and PRAUC score of 1.0 for distinguishing strong from weak binders of the HT-SELEX dataset (Figure S4).
The R² values for 10-fold cross-validation on the HT-SELEX dataset were 0.68 for the C5-trained model and 0.647 for the P-trained model. Figures S4, S5, and S7 use the result from the C5-trained model for YY1.
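The selection rule described in this note can be sketched in a few lines of Python. The sketch counts every Cα–DNA atom pair within each cutoff, which is one possible reading of "number of atoms within the cutoff distances"; coordinates are plain (x, y, z) tuples in Å, and the function names are hypothetical.

```python
import math


def count_within(ca_coords, dna_coords, cutoff):
    """Count (C-alpha, DNA atom) pairs separated by at most `cutoff` angstroms."""
    return sum(
        1
        for ca in ca_coords
        for atom in dna_coords
        if math.dist(ca, atom) <= cutoff
    )


def pick_dna_atom_type(ca_coords, dna_by_type, cutoffs=(8.0, 9.0, 10.0)):
    """Return the DNA atom type with the largest summed count over all cutoffs.

    `dna_by_type` maps an atom type label (e.g. "C5" or "P") to a list of
    coordinates. Ties resolve to the first type in the mapping, so tied
    cases need a separate decision.
    """
    totals = {
        atom_type: sum(count_within(ca_coords, coords, c) for c in cutoffs)
        for atom_type, coords in dna_by_type.items()
    }
    best = max(totals, key=totals.get)
    return best, totals
```

For a structure such as YY1, where both atom types give identical totals, the rule alone does not discriminate, which is why the note reports results for both trained models.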
