Interpretable protein-DNA interactions captured by structure-sequence optimization

  1. Yafan Zhang
  2. Irene Silvernail
  3. Zhuyang Lin
  4. Xingcheng Lin  Is a corresponding author
  1. Bioinformatics Research Center, North Carolina State University, United States
  2. Department of Physics, North Carolina State University, United States
9 figures, 3 tables and 1 additional file

Figures

Overview of the IDEA protocol.

The protein-DNA complex, represented by the human MAX-DNA complex structure (PDB ID: 1HLO), was used for training the IDEA model. The sequences of the protein and DNA residues that form close contacts (highlighted in blue on the structure) in the structure were included in the training dataset. In addition, a series of synthetic decoy sequences was generated by randomizing the contacting residues in both the protein and DNA sequences. The amino acid-nucleotide energy model was then optimized by maximizing the ratio of the binding energy gap (δE) between protein and DNA in the native complex and the decoy complexes relative to the energy variance (ΔE). The optimized energy model can be used for multiple predictive applications, including the evaluation of binding free energies for various protein-DNA sequence pairs, prediction of genomic DNA binding sites by transcription factors or other DNA-binding proteins, and integration into a sequence-specific, residue-resolution simulation framework for dynamic simulations.

Figure 2 with 9 supplements
Results for MAX-based predictions.

(A) The binding free energies calculated by IDEA, trained using a single MAX-DNA complex (PDB ID: 1HLO), correlate well with experimentally measured MAX-DNA binding free energies (Maerkl and Quake, 2007). ΔΔG represents the changes in binding free energy relative to that of the wild-type protein-DNA complex. (B) The heatmap, derived from the optimized energy model, illustrates key amino acid-nucleotide interactions governing MAX-DNA recognition, showing pairwise interaction energies between 20 amino acids and the four DNA nucleotide—DA (deoxyadenosine), DT (deoxythymidine), DC (deoxycytidine), and DG (deoxyguanosine). Both the predicted binding free energies and the optimized energy model are expressed in reduced units, as explained in the Materials and methods section Training protocol. Each cell represents the optimized energy contribution, where blue indicates more favorable (lower) energy values, and red indicates less favorable (higher) values. (C) The 3D structure of the MAX-DNA complex (zoomed in with different views) highlights key amino acid-nucleotide contacts at the protein-DNA interface. Notably, several DNA deoxycytidines (red spheres) form close contacts with arginines (blue spheres). Additional nucleotide color coding: adenine (yellow spheres), guanine (green spheres), thymine (pink spheres). (D) Probability density distributions of predicted binding free energies for strong (blue) and weak (red) binders of the protein ZBTB7A. The median of each distribution is marked with a dashed line. (E) Summary of AUC scores for protein-DNA pairs across 12 protein families, calculated based on the predicted probability distributions of binding free energies.

Figure 2—figure supplement 1
Including additional human Max-DNA complex structures in IDEA training improves prediction.

Including additional human-associated MAX-DNA complex structures and their associated sequences (PDB IDs: 1HLO, 1NLW, 1NKP) leads to a minor improvement in the correlation between the predicted and experimental binding free energies, with a Pearson correlation coefficient of 0.68 and Spearman’s rank correlation coefficient of 0.65. ΔΔG represents the changes in binding free energy relative to the protein binding to the wild-type DNA sequence. The predicted binding free energies are presented in reduced units, as explained in the Materials and methods section: Training protocol.

Figure 2—figure supplement 2
Enhanced IDEA prediction by integrating SELEX-seq data for MAX transcription factor.

Incorporating SELEX-seq data (Rastogi et al., 2018) into IDEA’s training protocol significantly improves its predictive accuracy on MITOMI binding measurements Maerkl and Quake, 2007 for the MAX transcription factor, achieving a Pearson correlation coefficient of 0.79 and a Spearman’s rank correlation coefficient of 0.76.

Figure 2—figure supplement 3
Refined energy model with SELEX data integration reveals additional physicochemical insights into protein-DNA interactions.

Upon integrating the SELEX data into our model training, we found that the refined energy model reveals additional unfavorable interactions between glutamic acid (E) and deoxycytidine (DC), consistent with their negative charges. The optimized energy model is presented in reduced units, as explained in the Materials and methods section: Training protocol.

Figure 2—figure supplement 4
Evaluation of IDEA’s predictive accuracy for distinguishing strong from weak protein-DNA binding interactions.

Probability density distributions of predicted binding free energies for strong (blue) and weak (red) binders identified from the HT-SELEX experiment (Yang et al., 2017) are shown across multiple protein families. A clear separation between these distributions indicates the model’s effectiveness in identifying high-affinity DNA binders. The darker dashed lines within each distribution correspond to the median binding free energy for strong (blue) and weak (red) binders, respectively. For Mafb, the predicted median binding free energies for strong and weak binders are nearly identical, measured at −103.826 and −103.824, respectively. As a result, the dashed lines almost completely overlap, and only one appears visible in the plot. The Receiver Operating Characteristic (ROC) curve and Precision-Recall (PR) curve adjacent to each density plot provide a quantitative assessment of prediction accuracy, with the Area Under the Curve (AUC) and precision-recall AUC (PRAUC) scores displayed in each panel.

Figure 2—figure supplement 5
Summary of balanced PRAUC scores for protein-DNA pairs across 12 protein families.
Figure 2—figure supplement 6
Performance comparison of the IDEA model with other prediction methods.

(A) AUC scores and (B) PRAUC scores for identifying strong binders were evaluated across six different prediction methods: IDEA, IDEA augmented with binding sequences (IDEA-seq), ProBound (Rube et al., 2022), DeepBind (Alipanahi et al., 2015), the general knowledge-based energy model DBD-Hunter (Gao and Skolnick, 2008), and family-specific knowledge-based energy model rCLAMPS (Wetzel et al., 2022). Predictive performances were assessed using HT-SELEX datasets spanning 22 proteins from 12 protein families. Each violin plot displays the distribution of scores for individual targets, with the width indicating score density. The thick grey bar represents the interquartile range (first to third quartiles), and the thin line extends to 1.5 times the interquartile range. Individual data points are depicted as scattered black dots, and the white dot represents the median. Sample sizes (n) are labeled above each group.

Figure 2—figure supplement 6—source data 1

Raw AUC scores for distinguishing strong and weak binders across 22 proteins from 12 protein families using six different prediction methods: IDEA, IDEA augmented with binding sequences (IDEA-seq), ProBound, DeepBind, DBD-Hunter, and rCLAMPS.

https://cdn.elifesciences.org/articles/105565/elife-105565-fig2-figsupp6-data1-v1.xlsx
Figure 2—figure supplement 6—source data 2

Raw PRAUC scores for distinguishing strong and weak binders across 22 proteins from 12 protein families using six different prediction methods: IDEA, IDEA augmented with binding sequences (IDEA-seq), ProBound, DeepBind, DBD-Hunter, and rCLAMPS.

https://cdn.elifesciences.org/articles/105565/elife-105565-fig2-figsupp6-data2-v1.xlsx
Figure 2—figure supplement 7
IDEA outperforms other predictors in cross-validation analysis of protein-DNA binding affinity.

Results from a 10-fold cross-validation indicate that IDEA achieves higher R2 scores for most of the protein-DNA complexes that have available experimentally resolved structures, compared to the 1mer and 1mer +shape methods reported in Yang et al., 2017. In panel (A), each point represents a protein-DNA complex and compares IDEA’s R2 score with that of the 1mer method, while panel (B) compares IDEA’s R2 score with the 1mer +shape method. Points are color-coded by protein family (see Figure 2E for family annotations).

Figure 2—figure supplement 8
IDEA correctly predicts the protein-DNA recognition by additional transcription factors.

(A) The 3D structure of basic helix-loop-helix (bHLH) transcription factor PHO4 and its associated DNA (PDB ID: 1A0A). (B) Model-predicted binding free energies of PHO4 correlate with the experimentally determined binding free energies (Maerkl and Quake, 2007). (C) The 3D structure of zinc finger protein Zif268 and its associated DNA (PDB ID: 1AAY). (D) Model-predicted binding free energies of Zif268 correlate with the experimentally determined binding free energies (Geertz et al., 2012). ΔΔG represents the changes in binding free energy relative to the protein binding to the wild-type DNA sequence. The predicted binding free energies are presented in reduced units, as explained in the Materials and methods section: Training protocol.

Figure 2—figure supplement 9
Enhanced predictive accuracy with the inclusion of Zif268 and related CATH protein structures.
Figure 3 with 2 supplements
IDEA prediction shows transferability within the same CATH superfamily.

(A) The predicted MAX binding affinity, trained on other protein-DNA complexes within the same protein CATH superfamily, correlates well with experimental measurement. The proteins are ordered by their probability of being homologous to the MAX protein, determined using HHpred (Söding et al., 2005). Training with a homologous protein (determined as a hit by HHpred) usually leads to better predictive performance (Pearson correlation coefficient > 0.5) compared to non-homologous proteins. (B) Structural alignment between 1HLO (white) and 1A0A (blue), two protein-DNA complexes within the same CATH Helix-loop-helix superfamily. The alignment was performed based on the E-box region of the DNA (Humphrey et al., 1996). (C) The optimized energy model for 1A0A, a protein-DNA complex structure of the transcription factor PHO4 and DNA, with 33.41% probability of being homologous to the MAX protein. The optimized energy model is presented in reduced units, as explained in the Materials and methods section: Training protocol.

Figure 3—figure supplement 1
Spearman’s rank correlation coefficients between predicted and experimental MAX binding affinities across training proteins ordered by probability of being homologous to the MAX protein.
Figure 3—figure supplement 2
Principal component analysis of the normalized IDEA-learned energy model across 12 protein families.
Figure 4 with 1 supplement
IDEA accurately identifies genome-wide protein-binding sites.

(A) The IDEA-predicted MAX-transcription factor binding sites on the Lymphoblastoid cells chromosome (bottom) correlate well with ChIP-seq measurements (top), shown for a representative 1 Mb region of chromosome 1 where ChIP-seq signals are densest. For visualization purposes, a 1 kb resolution was used to plot the predicted normalized Z scores, with highly probable binding sites represented as red peaks. (B) AUC score for prediction accuracy based on the normalized Z scores, averaged over a 500 bp window to match the experimental resolution of 420 bp.

Figure 4—figure supplement 1
IDEA accurately identifies genomic binding sites for additional proteins.

(A, B) Predicted binding sites for the EGR1 transcription factor in a 1 Mb region on chromosome 8 of the HepG2 cell line, where ChIP-seq signals are densest. The top plot of (A) shows ChIP-seq signals, while the bottom plot shows IDEA’s predicted binding sites, with highly probable sites marked as red peaks. Normalized Z scores were averaged over a 500 bp window to align with the experimental resolution. The ROC curve (B) shows an AUC score of 0.72, reflecting IDEA’s accuracy for EGR1 binding sites prediction. (C–F) Predicted binding sites for the CTCF (CCCTC-Binding factor) protein in the GM12878 cell line on chromosome 1 using two different training structures: 8SSS (C, D) and 8SSQ (E, F). Panels (C) and (E) show IDEA’s predictions compared to ChIP-seq signals, with highly probable binding sites highlighted as red peaks. Panels (D) and (F) present ROC curves with AUC scores of 0.64 and 0.62, respectively, indicating IDEA’s predictive performance using each training structure.

Figure 5 with 2 supplements
Enhanced protein-DNA simulation model by incorporating IDEA-optimized energy model.

(A) Predicted protein-DNA binding free energies for 9 protein-DNA complexes using the IDEA-learned energy model correlate with the experimental measurements (Privalov et al., 2011). ΔΔG represents the changes in binding free energy relative to the protein-DNA complex with the lowest predicted value. The predicted binding free energies are presented in reduced units, as explained in Materials and methods section Training protocol. (B) Example of a protein-DNA complex structure (PDB ID: 9ANT) and the coarse-grained simulation used to evaluate protein-DNA binding free energy. A representative free energy profile was extracted from the simulation, using the center of mass (COM) distance between protein and DNA as the collective variable. The shaded region represents the standard deviation of the mean. Representative bound and unbound structures are shown above, with protein in blue and DNA in yellow. (C) Incorporating the IDEA-optimized energy model into the simulation model improves the prediction of protein-DNA binding affinity, compared to the prediction by the previous model with electrostatic interactions and uniform nonspecific attraction between protein and DNA (Figure 5—figure supplement 2). The predicted binding free energies are presented in physical units. Error bars represent the standard deviation of the mean from three equally partitioned segments of the simulation trajectory.

Figure 5—figure supplement 1
Binding free energy curves calculated from simulations based on non-sequence-specific homogenous electrostatic potential and IDEA potential models.

The predicted free energy profiles were computed as a function of the Protein-DNA center of mass (COM) distance, comparing the non-sequence-specific homogenous electrostatic potential (H Potential) model (black) and the IDEA potential model (red). The line represents the mean free energy calculated from three equally partitioned segments of the simulation trajectory, and the shaded areas were calculated as the standard deviation.

Figure 5—figure supplement 2
Prediction of protein-DNA binding free energy with the non-sequence-specific homogeneous electrostatic potential model.

Comparison between the simulation-predicted and experimentally-assessed binding free energies for 9 protein-DNA complexes. The simulation free energies were computed using a non-sequence-specific homogenous electrostatic potential (H potential) model. The predicted binding free energies are presented in physical units. Error bars represent the standard deviation of the mean. See Figure 2—figure supplement 1 for additional details.

Appendix 1—figure 1
Effect of the number of decoy sequences on model generalizability and transferrability.

(A, B) Evaluation of IDEA’s generalizability across different decoy numbers, corresponding to Figure 2E. Violin plots summarize the distribution of AUC (A) and PRAUC (B) scores for three decoy combinations. (C, D) Analysis of IDEA transferability within the MAX CATH superfamily, corresponding to Figure 3A. Violin plots summarize the distribution of Pearson correlation coefficients (C) and Spearman’s rank correlation coefficients (D) between predicted and experimental binding affinities for three decoy combinations. In each violin plot, the thick grey bar represents the interquartile range (first to third quartiles), and the thin line extends to 1.5 times the interquartile range. Individual data are depicted as scattered black points, and the white dot represents the median. Sample sizes (n) are labeled above each group. The three tested decoy combinations include: 100 DNA +1000 protein decoys (DNA100Pro1000), 1000 DNA +1000 protein decoys (DNA1000Pro1000), and 1000 DNA +10,000 protein decoys (DNA1000Pro10000). The consistent performance across all decoy combinations demonstrates the robustness of the model’s prediction with respect to the number of decoy sequences.

Author response image 1
Comparison of learned energy models for different protein-DNA complexes: MAX (A), PHO4 (B), and PDX1 (C).

MAX and PHO4 are members of the Helixloop-helix (HLH) CATH protein superfamily (4.10.280.100), while PDX1 belongs to another Homeodomain-like CATH protein superfamily (1.10.10.60).

Author response image 2
Comparison between P and C5 atoms in proximity to the protein 3D structures of MAX–DNA (A) and FOXP-DNA (B) complexes, where P atoms (red sphere) and C5 atoms (blue sphere) that are within 10 A of Cα atoms are highlighted.
Author response image 3
Comparison of simulations using different representative atoms.

(A) Protein-DNA binding simulation with the IDEA-model incorporated as short-range van der Waals between protein Cα atom and nucleic base site. (B) Protein-DNA binding simulation with the IDEA-model incorporated as short-range van der Waals between protein Cα atom and DNA P atoms. The predicted free energies are robust to the choice of DNA representative atoms. The predicted binding free energies are presented in physical units, and error bars represent the standard deviation of the mean.

Tables

Appendix 1—table 1
This table reports the number of atoms within the cutoff distances from the Cα atoms for all the protein-DNA structures used in this study.

For each structure, we selected the DNA atom type (C5 or P) with the largest sum of occurrences within 8 Å, 9 Å, and 10 Å distances from surrounding Cα atoms for modeling. For protein YY1 (PDB ID: 1ubd), where the counts for C5 and P were identical, both models trained on C5 and P achieved an AUC score and PRAUC score of 1.0 for distinguishing strong from weak binders of the HT-SELEX dataset (Figure 2—figure supplement 4). The R2 values for 10-fold cross-validation on the HT-SELEX dataset were 0.68 for the C5-trained model and 0.647 for the P-trained model. Figure 2—figure supplement 4; Figure 2—figure supplement 5; Figure 2—figure supplement 7 use the result from the C5-trained model for YY1.

8 Å9 Å10 ÅSum
1a0a (C5)15222259
1a0a (P)18192057
1aay (C5)11151642
1aay (P)14151544
1an4 (C5)6111633
1an4 (P)16182458
1apl (C5)7111432
1apl (P)10141539
1dux (C5)891128
1dux (P)671225
1gu4 (C5)14182052
1gu4 (P)13161948
1hlo (C5)11151945
1hlo (P)13141542
1ig7 (C5)681125
1ig7 (P)9121334
1j47 (C5)9121637
1j47 (P)16181852
1nk3 (C5)591125
1nk3 (P)891027
1nkp (C5)11162047
1nkp (P)16202056
1nlw (C5)11162047
1nlw (P)14192053
5cbx (C5)8151740
5cbx (P)14141543
6od3 (C5)581023
6od3 (P)8111130
7tdw (C5)691025
7tdw (P)8101129
7xv6 (C5)9162045
7xv6 (P)14161848
8e3d (C5)13202356
8e3d (P)20222466
8k8d (C5)11161845
8k8d (P)9141639
8osb (C5)13222762
8osb (P)26303490
8pm7 (C5)481022
8pm7 (P)9131335
8ssq (C5)284146115
8ssq (P)344146121
8sss (C5)293942110
8sss (P)28353598
9ant (C5)571123
9ant (P)8101230
1ozj (C5)9132143
1ozj (P)13182051
1qrv (C5)591327
1qrv (P)991129
1ubd (C5)14212459
1ubd (P)17192359
2ezd (C5)691126
2ezd (P)10111132
2ezf (C5)791026
2ezf (P)671023
2h1k (C5)681024
2h1k (P)8111231
2lef (C5)12182151
2lef (P)20202161
2wty (C5)13192355
2wty (P)17202057
2xsd (C5)11141439
2xsd (P)10111233
3hdd (C5)571022
3hdd (P)781126
4hc7 (C5)5101227
4hc7 (P)8111332
4xrs (C5)5111733
4xrs (P)11141540
4y60 (C5)7111432
4y60 (P)15151747
Appendix 1—table 2
Summary of parameters used in our protein-DNA simulation model (unit: kcal/mol).
PPSUAGCT
ALA00–0.02280.01010.0199–0.0061
ARG00–0.04020.034–0.0363–0.021
ASN00–0.01630.02520.0159–0.003
ASP00–0.00010.0120.00480.0101
CYS000.00250.01070.00620.0075
GLU000.0045–0.0210.0189–0.0059
GLN00–0.0138–0.0060.0094–0.0032
GLY000.00270.0320.0183–0.0486
HIS00–0.0030.00860.0149–0.0022
ILE00–0.01880.00520.0140.0099
LEU00–0.0148–0.02160.01270.0121
LYS00–0.01070.0083–0.04870.0115
MET00–0.00490.00850.0020.0143
PHE00–0.0015–0.01170.0151–0.0042
PRO000.00090.024–0.0073–0.0087
SER000.012–0.010.0011–0.0079
THR00–0.0213–0.01320.01630.0066
TRP00–0.01810.00580.01720.003
TYR00–0.00940.0016–0.01170.0127
VAL00–0.0060.01020.00130.006
Author response table 1
Comparison of IDEA performance using two DNA atom selection schemes: the filtering scheme presented in the manuscript (C5 and P atoms) versus using only P atoms.

Cases where the two schemes result in different atom selections are highlighted in bold.

ProteinAUC (C5 & P)AUC (P only)PRAUC (C5 & P)PRAUC (P only)
ELK10.8101.0000.6511.000
FOXP30.8730.8730.7380.738
MAX0.9410.8790.7720.897
TCF40.6380.6380.6170.617
USF10.4840.4840.4740.474
Cebpb0.9830.9890.9760.942
CEBPB0.9230.9310.6910.699
Mafb0.6070.6070.5390.539
Egr10.1810.1810.2750.275
YY11.0001.0001.0001.000
ZBTB7A0.9980.9980.8300.830
GATA30.8160.8160.6360.636
ALX40.6310.6310.5420.542
BARHL20.8220.8220.7370.737
MSX10.9260.9260.8150.815
PDX10.5950.5950.4880.488
HSF10.9930.9930.9920.992
SMAD30.9920.9920.9030.903
MEIS10.7970.7970.6310.631
NR2C21.0001.0001.0001.000
NR3C10.9210.9210.7910.791
POU3F10.3240.2200.3380.338

Additional files

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Yafan Zhang
  2. Irene Silvernail
  3. Zhuyang Lin
  4. Xingcheng Lin
(2025)
Interpretable protein-DNA interactions captured by structure-sequence optimization
eLife 14:RP105565.
https://doi.org/10.7554/eLife.105565.3