Figures and data in Interpretable protein-DNA interactions captured by structure-sequence optimization

9 figures, 3 tables and 1 additional file

Figures

Figure 1

Overview of the IDEA protocol.

The protein-DNA complex, represented by the human MAX-DNA complex structure (PDB ID: 1HLO), was used for training the IDEA model. The sequences of the protein and DNA residues that form close contacts in the structure (highlighted in blue) were included in the training dataset. In addition, a series of synthetic decoy sequences was generated by randomizing the contacting residues in both the protein and DNA sequences. The amino acid-nucleotide energy model was then optimized by maximizing the ratio of the binding energy gap ($\delta E$) between protein and DNA in the native complex and the decoy complexes relative to the energy variance ($\Delta E$). The optimized energy model can be used for multiple predictive applications, including the evaluation of binding free energies for various protein-DNA sequence pairs, prediction of genomic DNA binding sites by transcription factors or other DNA-binding proteins, and integration into a sequence-specific, residue-resolution simulation framework for dynamic simulations.
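The quantity being maximized can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the toy three-letter amino acid alphabet, the single-entry energy table `gamma`, the way decoys are drawn (independent random residue identities at each contact), and the use of the standard deviation of decoy energies as the spread. The sketch only evaluates the ratio for a fixed energy table; it does not reproduce the optimization itself.

```python
import random
import statistics

def contact_energy(gamma, contacts):
    """Sum pairwise energies over (amino_acid, nucleotide) contacts; missing pairs count as 0."""
    return sum(gamma.get(pair, 0.0) for pair in contacts)

def gap_to_spread_ratio(gamma, native_contacts, decoy_contact_sets):
    """Return delta_E / Delta_E: (mean decoy energy - native energy) over the decoy energy spread."""
    e_native = contact_energy(gamma, native_contacts)
    e_decoys = [contact_energy(gamma, c) for c in decoy_contact_sets]
    delta_e = statistics.mean(e_decoys) - e_native   # energy gap
    spread = statistics.pstdev(e_decoys)             # spread of decoy energies (assumed: std. dev.)
    return delta_e / spread

# Toy data (hypothetical): three native contacts and 200 randomized decoys.
rng = random.Random(0)
amino_acids, nucleotides = "RKE", "ACGT"
native = [("R", "C"), ("R", "C"), ("K", "G")]
decoys = [[(rng.choice(amino_acids), rng.choice(nucleotides)) for _ in native]
          for _ in range(200)]

# A toy energy table that rewards arginine-cytidine contacts.
gamma = {("R", "C"): -1.0}
ratio = gap_to_spread_ratio(gamma, native, decoys)
print(f"gap-to-spread ratio: {ratio:.2f}")
```

A larger ratio means the native contacts are better separated from the decoy ensemble, which is the property the training is described as maximizing.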

Figure 2 with 9 supplements

Results for MAX-based predictions.

(A) The binding free energies calculated by IDEA, trained using a single MAX-DNA complex (PDB ID: 1HLO), correlate well with experimentally measured MAX-DNA binding free energies (Maerkl and Quake, 2007). $\Delta\Delta G$ represents the changes in binding free energy relative to that of the wild-type protein-DNA complex. (B) The heatmap, derived from the optimized energy model, illustrates key amino acid-nucleotide interactions governing MAX-DNA recognition, showing pairwise interaction energies between the 20 amino acids and the four DNA nucleotides: DA (deoxyadenosine), DT (deoxythymidine), DC (deoxycytidine), and DG (deoxyguanosine). Both the predicted binding free energies and the optimized energy model are expressed in reduced units, as explained in the Materials and methods section: *Training protocol*. Each cell represents the optimized energy contribution, where blue indicates more favorable (lower) energy values, and red indicates less favorable (higher) values. (C) The 3D structure of the MAX-DNA complex (zoomed in with different views) highlights key amino acid-nucleotide contacts at the protein-DNA interface. Notably, several DNA deoxycytidines (red spheres) form close contacts with arginines (blue spheres). Additional nucleotide color coding: adenine (yellow spheres), guanine (green spheres), thymine (pink spheres). (D) Probability density distributions of predicted binding free energies for strong (blue) and weak (red) binders of the protein ZBTB7A. The median of each distribution is marked with a dashed line. (E) Summary of AUC scores for protein-DNA pairs across 12 protein families, calculated based on the predicted probability distributions of binding free energies.

Figure 2—figure supplement 1

Including additional human MAX-DNA complex structures in IDEA training improves prediction.

Including additional human-associated MAX-DNA complex structures and their associated sequences (PDB IDs: 1HLO, 1NLW, 1NKP) leads to a minor improvement in the correlation between the predicted and experimental binding free energies, with a Pearson correlation coefficient of 0.68 and a Spearman’s rank correlation coefficient of 0.65. $\Delta\Delta G$ represents the changes in binding free energy relative to the protein binding to the wild-type DNA sequence. The predicted binding free energies are presented in reduced units, as explained in the Materials and methods section: *Training protocol*.
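For readers who want to reproduce the two correlation measures reported in these captions, the following standard-library Python sketch computes Pearson and Spearman coefficients. The function names and the five data points are hypothetical placeholders, not values from the study.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical predicted (reduced units) and experimental binding free energy changes.
predicted = [0.0, 0.5, 1.2, 2.0, 3.5]
experimental = [0.1, 0.4, 1.0, 2.5, 2.4]
print(pearson(predicted, experimental), spearman(predicted, experimental))
```

Because the predicted energies are in reduced units, only correlation-type measures such as these are meaningful when comparing them with experimental values.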

Figure 2—figure supplement 2

Enhanced IDEA prediction by integrating SELEX-seq data for MAX transcription factor.

Incorporating SELEX-seq data (Rastogi et al., 2018) into IDEA’s training protocol significantly improves its predictive accuracy on MITOMI binding measurements (Maerkl and Quake, 2007) for the MAX transcription factor, achieving a Pearson correlation coefficient of 0.79 and a Spearman’s rank correlation coefficient of 0.76.

Figure 2—figure supplement 3

Refined energy model with SELEX data integration reveals additional physicochemical insights into protein-DNA interactions.

Upon integrating the SELEX data into our model training, we found that the refined energy model reveals additional unfavorable interactions between glutamic acid (E) and deoxycytidine (DC), consistent with their negative charges. The optimized energy model is presented in reduced units, as explained in the Materials and methods section: *Training protocol*.

Figure 2—figure supplement 4

Evaluation of IDEA’s predictive accuracy for distinguishing strong from weak protein-DNA binding interactions.

Probability density distributions of predicted binding free energies for strong (blue) and weak (red) binders identified from the HT-SELEX experiment (Yang et al., 2017) are shown across multiple protein families. A clear separation between these distributions indicates the model’s effectiveness in identifying high-affinity DNA binders. The darker dashed lines within each distribution correspond to the median binding free energy for strong (blue) and weak (red) binders, respectively. For Mafb, the predicted median binding free energies for strong and weak binders are nearly identical, measured at −103.826 and −103.824, respectively. As a result, the dashed lines almost completely overlap, and only one appears visible in the plot. The Receiver Operating Characteristic (ROC) curve and Precision-Recall (PR) curve adjacent to each density plot provide a quantitative assessment of prediction accuracy, with the Area Under the Curve (AUC) and precision-recall AUC (PRAUC) scores displayed in each panel.
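The AUC used to quantify the separation between the two distributions can be sketched in a few lines of Python. This relies on the standard equivalence between the ROC AUC and the probability that a randomly chosen positive outranks a randomly chosen negative; here a lower predicted energy is assumed to indicate a stronger binder. The function name and the example energies are hypothetical.

```python
def roc_auc(strong_energies, weak_energies):
    """AUC as the fraction of (strong, weak) pairs in which the strong binder
    has the lower predicted energy; ties count as half a win."""
    wins = 0.0
    for s in strong_energies:
        for w in weak_energies:
            if s < w:
                wins += 1.0
            elif s == w:
                wins += 0.5
    return wins / (len(strong_energies) * len(weak_energies))

# Hypothetical predicted binding free energies (reduced units).
strong = [-5.0, -4.0, -3.0]
weak = [-4.5, -2.0, -1.0]
auc = roc_auc(strong, weak)   # 7 of the 9 pairs are ordered correctly
print(f"AUC = {auc:.3f}")
```

An AUC near 0.5, as expected when the two medians nearly coincide, means the predicted energies carry little information for separating the two classes.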

Figure 2—figure supplement 5

Summary of balanced PRAUC scores for protein-DNA pairs across 12 protein families.

Figure 2—figure supplement 6

Performance comparison of the IDEA model with other prediction methods.

(A) AUC scores and (B) PRAUC scores for identifying strong binders were evaluated across six different prediction methods: IDEA, IDEA augmented with binding sequences (IDEA-seq), ProBound (Rube et al., 2022), DeepBind (Alipanahi et al., 2015), the general knowledge-based energy model DBD-Hunter (Gao and Skolnick, 2008), and the family-specific knowledge-based energy model rCLAMPS (Wetzel et al., 2022). Predictive performances were assessed using HT-SELEX datasets spanning 22 proteins from 12 protein families. Each violin plot displays the distribution of scores for individual targets, with the width indicating score density. The thick grey bar represents the interquartile range (first to third quartiles), and the thin line extends to 1.5 times the interquartile range. Individual data points are depicted as scattered black dots, and the white dot represents the median. Sample sizes ($n$) are labeled above each group.

Figure 2—figure supplement 6—source data 1

Raw AUC scores for distinguishing strong and weak binders across 22 proteins from 12 protein families using six different prediction methods: IDEA, IDEA augmented with binding sequences (IDEA-seq), ProBound, DeepBind, DBD-Hunter, and rCLAMPS.

https://cdn.elifesciences.org/articles/105565/elife-105565-fig2-figsupp6-data1-v1.xlsx

Figure 2—figure supplement 6—source data 2

Raw PRAUC scores for distinguishing strong and weak binders across 22 proteins from 12 protein families using six different prediction methods: IDEA, IDEA augmented with binding sequences (IDEA-seq), ProBound, DeepBind, DBD-Hunter, and rCLAMPS.

https://cdn.elifesciences.org/articles/105565/elife-105565-fig2-figsupp6-data2-v1.xlsx

Figure 2—figure supplement 7

IDEA outperforms other predictors in cross-validation analysis of protein-DNA binding affinity.

Results from a 10-fold cross-validation indicate that IDEA achieves higher $R^{2}$ scores for most of the protein-DNA complexes that have experimentally resolved structures, compared to the 1mer and 1mer+shape methods reported in Yang et al., 2017. In panel (A), each point represents a protein-DNA complex and compares IDEA’s $R^{2}$ score with that of the 1mer method, while panel (B) compares IDEA’s $R^{2}$ score with that of the 1mer+shape method. Points are color-coded by protein family (see Figure 2E for family annotations).

Figure 2—figure supplement 8

IDEA correctly predicts the protein-DNA recognition by additional transcription factors.

(A) The 3D structure of the basic helix-loop-helix (bHLH) transcription factor PHO4 and its associated DNA (PDB ID: 1A0A). (B) Model-predicted binding free energies of PHO4 correlate with the experimentally determined binding free energies (Maerkl and Quake, 2007). (C) The 3D structure of the zinc finger protein Zif268 and its associated DNA (PDB ID: 1AAY). (D) Model-predicted binding free energies of Zif268 correlate with the experimentally determined binding free energies (Geertz et al., 2012). $\Delta\Delta G$ represents the changes in binding free energy relative to the protein binding to the wild-type DNA sequence. The predicted binding free energies are presented in reduced units, as explained in the Materials and methods section: *Training protocol*.

Figure 2—figure supplement 9

Enhanced predictive accuracy with the inclusion of Zif268 and related CATH protein structures.

Figure 3 with 2 supplements

IDEA prediction shows transferability within the same CATH superfamily.

(A) The predicted MAX binding affinity, trained on other protein-DNA complexes within the same protein CATH superfamily, correlates well with experimental measurement. The proteins are ordered by their probability of being homologous to the MAX protein, determined using HHpred (Söding et al., 2005). Training with a homologous protein (determined as a hit by HHpred) usually leads to better predictive performance (Pearson correlation coefficient > 0.5) compared to non-homologous proteins. (B) Structural alignment between 1HLO (white) and 1A0A (blue), two protein-DNA complexes within the same CATH Helix-loop-helix superfamily. The alignment was performed based on the E-box region of the DNA (Humphrey et al., 1996). (C) The optimized energy model for 1A0A, a protein-DNA complex structure of the transcription factor PHO4 and DNA, with 33.41% probability of being homologous to the MAX protein. The optimized energy model is presented in reduced units, as explained in the Materials and methods section: *Training protocol*.

Figure 3—figure supplement 1

Spearman’s rank correlation coefficients between predicted and experimental MAX binding affinities across training proteins ordered by probability of being homologous to the MAX protein.

Figure 3—figure supplement 2

Principal component analysis of the normalized IDEA-learned energy model across 12 protein families.

Figure 4 with 1 supplement

IDEA accurately identifies genome-wide protein-binding sites.

(A) The IDEA-predicted MAX transcription factor binding sites on a lymphoblastoid cell chromosome (bottom) correlate well with ChIP-seq measurements (top), shown for a representative 1 Mb region of chromosome 1 where ChIP-seq signals are densest. For visualization, the predicted normalized Z scores are plotted at 1 kb resolution, with highly probable binding sites represented as red peaks. (B) AUC score for prediction accuracy based on the normalized Z scores, averaged over a 500 bp window to match the experimental resolution of 420 bp.
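The normalization and window averaging mentioned in this caption can be illustrated with a short Python sketch. The caption does not give the exact definition of the normalized Z score, so the form used here (predicted energy minus the mean, divided by the population standard deviation, with the sign flipped so that favorable sites appear as peaks) is an assumption, as are the function names and the toy energies.

```python
import statistics

def binding_z_scores(energies):
    """Z-normalize predicted energies; the sign is flipped (an assumption)
    so that low-energy, favorable sites give high scores."""
    mu = statistics.mean(energies)
    sigma = statistics.pstdev(energies)
    return [-(e - mu) / sigma for e in energies]

def window_average(values, window):
    """Average consecutive non-overlapping windows of `window` positions;
    a shorter trailing window is averaged as well."""
    return [statistics.mean(values[i:i + window])
            for i in range(0, len(values), window)]

# Toy per-position predicted energies (hypothetical), averaged in windows of 3.
energies = [-1.0, -1.2, -0.9, -4.0, -3.8, -4.1, -1.1, -1.0, -0.8]
scores = window_average(binding_z_scores(energies), 3)
print(scores)   # the middle window, with the lowest energies, scores highest
```

With real data the window would span 500 bp of genomic positions rather than three toy values.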

Figure 4—figure supplement 1

IDEA accurately identifies genomic binding sites for additional proteins.

(**A, B**) Predicted binding sites for the EGR1 transcription factor in a 1 Mb region on chromosome 8 of the HepG2 cell line, where ChIP-seq signals are densest. The top plot of (A) shows ChIP-seq signals, while the bottom plot shows IDEA’s predicted binding sites, with highly probable sites marked as red peaks. Normalized Z scores were averaged over a 500 bp window to align with the experimental resolution. The ROC curve (B) shows an AUC score of 0.72, reflecting IDEA’s accuracy for EGR1 binding site prediction. (**C–F**) Predicted binding sites for the CTCF (CCCTC-binding factor) protein in the GM12878 cell line on chromosome 1 using two different training structures: 8SSS (**C, D**) and 8SSQ (**E, F**). Panels (C) and (E) show IDEA’s predictions compared to ChIP-seq signals, with highly probable binding sites highlighted as red peaks. Panels (D) and (F) present ROC curves with AUC scores of 0.64 and 0.62, respectively, indicating IDEA’s predictive performance using each training structure.

Figure 5 with 2 supplements

Enhanced protein-DNA simulation model by incorporating IDEA-optimized energy model.

(A) Predicted protein-DNA binding free energies for 9 protein-DNA complexes using the IDEA-learned energy model correlate with the experimental measurements (Privalov et al., 2011). $\Delta\Delta G$ represents the changes in binding free energy relative to the protein-DNA complex with the lowest predicted value. The predicted binding free energies are presented in reduced units, as explained in the Materials and methods section: *Training protocol*. (B) Example of a protein-DNA complex structure (PDB ID: 9ANT) and the coarse-grained simulation used to evaluate protein-DNA binding free energy. A representative free energy profile was extracted from the simulation, using the center of mass (COM) distance between protein and DNA as the collective variable. The shaded region represents the standard deviation of the mean. Representative bound and unbound structures are shown above, with protein in blue and DNA in yellow. (C) Incorporating the IDEA-optimized energy model into the simulation model improves the prediction of protein-DNA binding affinity, compared to the prediction by the previous model with electrostatic interactions and uniform nonspecific attraction between protein and DNA (Figure 5—figure supplement 2). The predicted binding free energies are presented in physical units. Error bars represent the standard deviation of the mean from three equally partitioned segments of the simulation trajectory.

Figure 5—figure supplement 1

Binding free energy curves calculated from simulations based on non-sequence-specific homogeneous electrostatic potential and IDEA potential models.

The predicted free energy profiles were computed as a function of the protein-DNA center of mass (COM) distance, comparing the non-sequence-specific homogeneous electrostatic potential (H potential) model (black) and the IDEA potential model (red). The line represents the mean free energy calculated from three equally partitioned segments of the simulation trajectory, and the shaded areas show the standard deviation across these segments.
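The segment-based mean and standard deviation described here can be sketched in Python as follows. The caption does not say how each profile was estimated, so this sketch assumes a plain unbiased histogram estimate, $F(r) = -k_BT \ln P(r)$, shifted so that its minimum is zero; a biased simulation would need reweighting instead. The function names, the `kT` parameter, the bin edges, and the toy distances are all hypothetical.

```python
import math
import statistics

def free_energy_profile(distances, bin_edges, kT=1.0):
    """Histogram estimate F = -kT ln P per bin, shifted so min(F) = 0.
    Raises ValueError if a bin is never visited (F would be infinite)."""
    counts = [0] * (len(bin_edges) - 1)
    for d in distances:
        for b in range(len(counts)):
            if bin_edges[b] <= d < bin_edges[b + 1]:
                counts[b] += 1
                break
    if min(counts) == 0:
        raise ValueError("every bin must be visited at least once")
    total = sum(counts)
    profile = [-kT * math.log(c / total) for c in counts]
    f_min = min(profile)
    return [f - f_min for f in profile]

def segment_mean_and_std(distances, bin_edges, n_segments=3, kT=1.0):
    """Split the trajectory into equal consecutive segments, estimate a profile
    for each, and return the per-bin mean and sample standard deviation."""
    seg_len = len(distances) // n_segments
    profiles = [free_energy_profile(distances[k * seg_len:(k + 1) * seg_len],
                                    bin_edges, kT)
                for k in range(n_segments)]
    mean = [statistics.mean(col) for col in zip(*profiles)]
    std = [statistics.stdev(col) for col in zip(*profiles)]
    return mean, std

# Toy COM distances (hypothetical); each third of the trajectory has the same statistics.
com_distances = [1.0, 1.5, 2.5, 1.2, 1.8, 2.2, 1.1, 1.4, 2.9]
mean_f, std_f = segment_mean_and_std(com_distances, bin_edges=[1.0, 2.0, 3.0])
print(mean_f, std_f)
```

With only three segments the standard deviation is a rough indicator of convergence rather than a precise error estimate.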

Figure 5—figure supplement 2

Prediction of protein-DNA binding free energy with the non-sequence-specific homogeneous electrostatic potential model.

Comparison between the simulation-predicted and experimentally assessed binding free energies for 9 protein-DNA complexes. The simulation free energies were computed using a non-sequence-specific homogeneous electrostatic potential (H potential) model. The predicted binding free energies are presented in physical units. Error bars represent the standard deviation of the mean. See Figure 2—figure supplement 1 for additional details.

Appendix 1—figure 1

Effect of the number of decoy sequences on model generalizability and transferability.

(**A, B**) Evaluation of IDEA’s generalizability across different decoy numbers, corresponding to Figure 2E. Violin plots summarize the distribution of AUC (A) and PRAUC (B) scores for three decoy combinations. (**C, D**) Analysis of IDEA transferability within the MAX CATH superfamily, corresponding to Figure 3A. Violin plots summarize the distribution of Pearson correlation coefficients (C) and Spearman’s rank correlation coefficients (D) between predicted and experimental binding affinities for three decoy combinations. In each violin plot, the thick grey bar represents the interquartile range (first to third quartiles), and the thin line extends to 1.5 times the interquartile range. Individual data points are depicted as scattered black dots, and the white dot represents the median. Sample sizes (n) are labeled above each group. The three tested decoy combinations are: 100 DNA + 1000 protein decoys (DNA100Pro1000), 1000 DNA + 1000 protein decoys (DNA1000Pro1000), and 1000 DNA + 10,000 protein decoys (DNA1000Pro10000). The consistent performance across all decoy combinations demonstrates the robustness of the model’s prediction with respect to the number of decoy sequences.

Author response image 1

Comparison of learned energy models for different protein-DNA complexes: MAX (A), PHO4 (B), and PDX1 (C).

MAX and PHO4 are members of the Helix-loop-helix (HLH) CATH protein superfamily (4.10.280.100), while PDX1 belongs to a different CATH protein superfamily, Homeodomain-like (1.10.10.60).

Author response image 2

Comparison between P and C5 atoms in proximity to the protein in the 3D structures of the MAX-DNA (A) and FOXP-DNA (B) complexes. P atoms (red spheres) and C5 atoms (blue spheres) that are within 10 Å of Cα atoms are highlighted.

Author response image 3

Comparison of simulations using different representative atoms.

(A) Protein-DNA binding simulation with the IDEA model incorporated as short-range van der Waals interactions between protein Cα atoms and nucleic base sites. (B) Protein-DNA binding simulation with the IDEA model incorporated as short-range van der Waals interactions between protein Cα atoms and DNA P atoms. The predicted free energies are robust to the choice of DNA representative atoms. The predicted binding free energies are presented in physical units, and error bars represent the standard deviation of the mean.

Tables

Appendix 1—table 1

This table reports the number of atoms within the cutoff distances from the Cα atoms for all the protein-DNA structures used in this study.

For each structure, we selected for modeling the DNA atom type (C5 or P) with the largest sum of occurrences within 8 Å, 9 Å, and 10 Å of the surrounding Cα atoms. For protein YY1 (PDB ID: 1ubd), where the summed counts for C5 and P were identical, the models trained on C5 and on P both achieved an AUC score and a PRAUC score of 1.0 for distinguishing strong from weak binders in the HT-SELEX dataset (Figure 2—figure supplement 4). The $R^{2}$ values for 10-fold cross-validation on the HT-SELEX dataset were 0.68 for the C5-trained model and 0.647 for the P-trained model. Figure 2—figure supplements 4, 5, and 7 use the results from the C5-trained model for YY1.

PDB ID (DNA atom type)	8 Å	9 Å	10 Å	Sum
1a0a (C5)	15	22	22	59
1a0a (P)	18	19	20	57
1aay (C5)	11	15	16	42
1aay (P)	14	15	15	44
1an4 (C5)	6	11	16	33
1an4 (P)	16	18	24	58
1apl (C5)	7	11	14	32
1apl (P)	10	14	15	39
1dux (C5)	8	9	11	28
1dux (P)	6	7	12	25
1gu4 (C5)	14	18	20	52
1gu4 (P)	13	16	19	48
1hlo (C5)	11	15	19	45
1hlo (P)	13	14	15	42
1ig7 (C5)	6	8	11	25
1ig7 (P)	9	12	13	34
1j47 (C5)	9	12	16	37
1j47 (P)	16	18	18	52
1nk3 (C5)	5	9	11	25
1nk3 (P)	8	9	10	27
1nkp (C5)	11	16	20	47
1nkp (P)	16	20	20	56
1nlw (C5)	11	16	20	47
1nlw (P)	14	19	20	53
5cbx (C5)	8	15	17	40
5cbx (P)	14	14	15	43
6od3 (C5)	5	8	10	23
6od3 (P)	8	11	11	30
7tdw (C5)	6	9	10	25
7tdw (P)	8	10	11	29
7xv6 (C5)	9	16	20	45
7xv6 (P)	14	16	18	48
8e3d (C5)	13	20	23	56
8e3d (P)	20	22	24	66
8k8d (C5)	11	16	18	45
8k8d (P)	9	14	16	39
8osb (C5)	13	22	27	62
8osb (P)	26	30	34	90
8pm7 (C5)	4	8	10	22
8pm7 (P)	9	13	13	35
8ssq (C5)	28	41	46	115
8ssq (P)	34	41	46	121
8sss (C5)	29	39	42	110
8sss (P)	28	35	35	98
9ant (C5)	5	7	11	23
9ant (P)	8	10	12	30
1ozj (C5)	9	13	21	43
1ozj (P)	13	18	20	51
1qrv (C5)	5	9	13	27
1qrv (P)	9	9	11	29
1ubd (C5)	14	21	24	59
1ubd (P)	17	19	23	59
2ezd (C5)	6	9	11	26
2ezd (P)	10	11	11	32
2ezf (C5)	7	9	10	26
2ezf (P)	6	7	10	23
2h1k (C5)	6	8	10	24
2h1k (P)	8	11	12	31
2lef (C5)	12	18	21	51
2lef (P)	20	20	21	61
2wty (C5)	13	19	23	55
2wty (P)	17	20	20	57
2xsd (C5)	11	14	14	39
2xsd (P)	10	11	12	33
3hdd (C5)	5	7	10	22
3hdd (P)	7	8	11	26
4hc7 (C5)	5	10	12	27
4hc7 (P)	8	11	13	32
4xrs (C5)	5	11	17	33
4xrs (P)	11	14	15	40
4y60 (C5)	7	11	14	32
4y60 (P)	15	15	17	47
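
The selection rule stated above the table can be expressed as a short Python sketch. The counts are copied from three rows of the table (1hlo, 1an4, 1ubd); the dictionary layout and the function name are assumptions made for illustration.

```python
# Counts of DNA atoms within 8, 9, and 10 Å of Cα atoms, copied from the table.
COUNTS = {
    "1hlo": {"C5": (11, 15, 19), "P": (13, 14, 15)},
    "1an4": {"C5": (6, 11, 16), "P": (16, 18, 24)},
    "1ubd": {"C5": (14, 21, 24), "P": (17, 19, 23)},
}

def select_atom_type(counts_by_atom):
    """Return the atom type(s) with the largest summed count.
    A tie returns both types, sorted alphabetically."""
    totals = {atom: sum(counts) for atom, counts in counts_by_atom.items()}
    best = max(totals.values())
    return sorted(atom for atom, total in totals.items() if total == best)

for pdb_id, counts in COUNTS.items():
    print(pdb_id, select_atom_type(counts))
```

For 1ubd the two sums are equal (59 each), which is the tie discussed in the table description.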

Appendix 1—table 2

Summary of parameters used in our protein-DNA simulation model (unit: kcal/mol).

Amino acid	A	G	C	T
ALA	–0.0228	0.0101	0.0199	–0.0061
ARG	–0.0402	0.034	–0.0363	–0.021
ASN	–0.0163	0.0252	0.0159	–0.003
ASP	–0.0001	0.012	0.0048	0.0101
CYS	0.0025	0.0107	0.0062	0.0075
GLU	0.0045	–0.021	0.0189	–0.0059
GLN	–0.0138	–0.006	0.0094	–0.0032
GLY	0.0027	0.032	0.0183	–0.0486
HIS	–0.003	0.0086	0.0149	–0.0022
ILE	–0.0188	0.0052	0.014	0.0099
LEU	–0.0148	–0.0216	0.0127	0.0121
LYS	–0.0107	0.0083	–0.0487	0.0115
MET	–0.0049	0.0085	0.002	0.0143
PHE	–0.0015	–0.0117	0.0151	–0.0042
PRO	0.0009	0.024	–0.0073	–0.0087
SER	0.012	–0.01	0.0011	–0.0079
THR	–0.0213	–0.0132	0.0163	0.0066
TRP	–0.0181	0.0058	0.0172	0.003
TYR	–0.0094	0.0016	–0.0117	0.0127
VAL	–0.006	0.0102	0.0013	0.006
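
As a rough illustration of how a pairwise parameter table like this one can be read, the Python sketch below looks up and sums the entries for a list of amino acid-nucleotide contacts. The values are copied from the ARG, LYS, and GLU rows above; the contact list and function name are hypothetical, and the plain sum ignores the distance dependence that a simulation potential would apply, so it is not the simulation energy itself.

```python
# Subset of the table above (kcal/mol), keyed by (amino acid, nucleotide).
PAIR_ENERGY = {
    ("ARG", "A"): -0.0402, ("ARG", "G"): 0.034, ("ARG", "C"): -0.0363, ("ARG", "T"): -0.021,
    ("LYS", "A"): -0.0107, ("LYS", "G"): 0.0083, ("LYS", "C"): -0.0487, ("LYS", "T"): 0.0115,
    ("GLU", "A"): 0.0045, ("GLU", "G"): -0.021, ("GLU", "C"): 0.0189, ("GLU", "T"): -0.0059,
}

def total_pair_energy(contacts):
    """Sum the tabulated parameters over (amino acid, nucleotide) contacts."""
    return sum(PAIR_ENERGY[pair] for pair in contacts)

# Hypothetical contact list: three residues contacting cytosines.
contacts = [("ARG", "C"), ("LYS", "C"), ("GLU", "C")]
energy = total_pair_energy(contacts)
print(f"{energy:.4f} kcal/mol")
```

In this toy example the favorable ARG-C and LYS-C entries outweigh the unfavorable GLU-C entry.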

Author response table 1

Comparison of IDEA performance using two DNA atom selection schemes: the filtering scheme presented in the manuscript (C5 and P atoms) versus using only P atoms.

Cases where the two schemes result in different atom selections are highlighted in bold.

Protein	AUC (C5 & P)	AUC (P only)	PRAUC (C5 & P)	PRAUC (P only)
ELK1	0.810	1.000	0.651	1.000
FOXP3	0.873	0.873	0.738	0.738
MAX	0.941	0.879	0.772	0.897
TCF4	0.638	0.638	0.617	0.617
USF1	0.484	0.484	0.474	0.474
Cebpb	0.983	0.989	0.976	0.942
CEBPB	0.923	0.931	0.691	0.699
Mafb	0.607	0.607	0.539	0.539
Egr1	0.181	0.181	0.275	0.275
YY1	1.000	1.000	1.000	1.000
ZBTB7A	0.998	0.998	0.830	0.830
GATA3	0.816	0.816	0.636	0.636
ALX4	0.631	0.631	0.542	0.542
BARHL2	0.822	0.822	0.737	0.737
MSX1	0.926	0.926	0.815	0.815
PDX1	0.595	0.595	0.488	0.488
HSF1	0.993	0.993	0.992	0.992
SMAD3	0.992	0.992	0.903	0.903
MEIS1	0.797	0.797	0.631	0.631
NR2C2	1.000	1.000	1.000	1.000
NR3C1	0.921	0.921	0.791	0.791
POU3F1	0.324	0.220	0.338	0.338

Additional files

MDAR checklist: https://cdn.elifesciences.org/articles/105565/elife-105565-mdarchecklist1-v1.docx

Cite this article

Yafan Zhang, Irene Silvernail, Zhuyang Lin, Xingcheng Lin (2025). Interpretable protein-DNA interactions captured by structure-sequence optimization. eLife 14:RP105565. https://doi.org/10.7554/eLife.105565.3
