Figures and data
![](https://prod--epp.elifesciences.org/iiif/2/105565%2Fv1%2Fcontent%2F595895v2_fig1.tif/full/max/0/default.jpg)
Overview of the IDEA protocol.
The protein–DNA complex, represented by the human MAX–DNA complex structure (PDB ID: 1HLO), was used for training the IDEA model. The sequences of the protein and DNA residues that form close contacts (highlighted in blue) in the structure were included in the training dataset. In addition, a series of synthetic decoy sequences were generated by randomizing the contacting residues in both the protein and DNA sequences. The amino acid–nucleotide energy model was then optimized by maximizing the ratio of the binding energy gap (δE) between protein and DNA in the native complex and the decoy complexes relative to the energy variance (ΔE). The optimized energy model can be used for multiple predictive applications, including the evaluation of binding free energies for various protein–DNA sequence pairs, prediction of genomic DNA binding sites by transcription factors or other DNA-binding proteins, and integration into a sequence-specific, residue-resolution simulation framework for dynamic simulations.
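The quantity being maximized can be sketched in a few lines. The example below is illustrative only. It assumes a 20×4 amino acid–nucleotide matrix `gamma`, a contact list of (protein index, DNA index) pairs, decoys made by shuffling each sequence, and the standard deviation of the decoy energies as the spread term ΔE. The function names are hypothetical and are not taken from the IDEA code base. The sketch evaluates the ratio for a given matrix and does not perform the optimization itself.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
NUCLEOTIDES = "ACGT"


def contact_energy(gamma, protein_seq, dna_seq, contacts):
    """Sum gamma[aa, nt] over all (protein index, DNA index) contacts."""
    return sum(
        gamma[AMINO_ACIDS.index(protein_seq[i]), NUCLEOTIDES.index(dna_seq[j])]
        for i, j in contacts
    )


def gap_to_spread_ratio(gamma, protein_seq, dna_seq, contacts,
                        n_decoys=200, seed=0):
    """Return (mean decoy energy - native energy) / std of decoy energies."""
    rng = np.random.default_rng(seed)
    e_native = contact_energy(gamma, protein_seq, dna_seq, contacts)

    decoy_energies = []
    for _ in range(n_decoys):
        # Assumption: a decoy is a random shuffle of the contacting residues.
        decoy_protein = "".join(rng.permutation(list(protein_seq)))
        decoy_dna = "".join(rng.permutation(list(dna_seq)))
        decoy_energies.append(
            contact_energy(gamma, decoy_protein, decoy_dna, contacts)
        )

    decoy_energies = np.array(decoy_energies)
    delta_e = decoy_energies.mean() - e_native  # energy gap (δE)
    spread = decoy_energies.std()               # assumed form of ΔE
    return delta_e / spread
```

A larger ratio means the native complex is better separated from the decoy ensemble, which is the property the training procedure seeks.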
![](https://prod--epp.elifesciences.org/iiif/2/105565%2Fv1%2Fcontent%2F595895v2_fig2.tif/full/max/0/default.jpg)
Results for MAX-based predictions.
(A) The binding free energy calculated by IDEA, trained using one MAX–DNA complex (PDB ID: 1HLO), correlates well with experimentally measured MAX–DNA binding free energy [13]. ΔΔG represents the changes in binding free energy relative to that of the wild-type protein–DNA complex. (B) The optimized energy model reveals an amino acid–nucleotide interaction pattern governing MAX–DNA recognition. The predicted binding free energies and optimized energy model are presented in reduced units, as explained in the Methods. (C) The 3D structure of the MAX–DNA complex (zoomed in with different views) highlights important amino acid–nucleotide contacts at the protein–DNA interface, where several DNA cytosines (red spheres) form close contacts with arginine (blue spheres). (D) Probability density distribution of strong and weak binders for the protein ZBTB7A. The mean of each distribution is marked with a dashed line. (E) The AUC score for each protein–DNA pair, calculated based on the predicted probability distributions.
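The AUC in panel E measures how well the predicted scores separate strong from weak binders. A minimal sketch follows, using the rank-based (pairwise) definition of the AUC. It assumes that a lower predicted energy means stronger binding, a convention this caption does not state, and the function name is hypothetical.

```python
import numpy as np


def auc_from_scores(strong_scores, weak_scores):
    """Probability that a random strong binder scores lower than a random
    weak binder, counting ties as one half (rank-based AUC)."""
    strong = np.asarray(strong_scores, dtype=float)
    weak = np.asarray(weak_scores, dtype=float)
    # Compare every strong binder with every weak binder.
    diff = strong[:, None] - weak[None, :]
    wins = (diff < 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (strong.size * weak.size)
```

An AUC of 1.0 indicates perfect separation of the two distributions, while 0.5 corresponds to no discrimination.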
![](https://prod--epp.elifesciences.org/iiif/2/105565%2Fv1%2Fcontent%2F595895v2_fig3.tif/full/max/0/default.jpg)
IDEA prediction shows transferability within the same CATH superfamily.
(A) The predicted MAX binding specificity, trained on other protein-DNA complexes within the same protein CATH superfamily, correlates well with experimental measurement. The proteins are ordered by their probability of being homologous to the MAX protein, determined using HHpred [44]. Training with a homologous protein (determined as a hit by HHpred) usually leads to better predictive performance (Pearson correlation coefficient > 0.5) compared to non-homologous proteins. (B) Structural alignment between 1HLO (white) and 1A0A (blue), two protein-DNA complexes within the same CATH Helix-loop-helix superfamily. The alignment was performed based on the E-box region of the DNA [45]. (C) The optimized energy model for 1A0A, a protein-DNA complex structure of the transcription factor Pho4 and DNA, with 33.41% probability of being homologous to the MAX protein. The optimized energy model is presented in reduced units, as explained in the Methods.
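The performance criterion in panel A is a Pearson correlation between predicted and measured binding free energies. A minimal sketch of that check is shown below. The function names are hypothetical, and the 0.5 threshold is taken from the caption.

```python
import numpy as np


def pearson_r(predicted, measured):
    """Pearson correlation coefficient between two equal-length arrays."""
    predicted = np.asarray(predicted, dtype=float)
    measured = np.asarray(measured, dtype=float)
    return float(np.corrcoef(predicted, measured)[0, 1])


def exceeds_threshold(predicted, measured, threshold=0.5):
    """True if the correlation exceeds the threshold quoted in the caption."""
    return pearson_r(predicted, measured) > threshold
```

Because the Pearson coefficient is invariant to linear rescaling, predictions in reduced units can be compared directly with measurements in physical units.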
![](https://prod--epp.elifesciences.org/iiif/2/105565%2Fv1%2Fcontent%2F595895v2_fig4.tif/full/max/0/default.jpg)
IDEA accurately identifies genome-wide protein-binding sites.
(A) The IDEA-predicted binding sites of the MAX transcription factor in lymphoblastoid cells (bottom) correlate well with ChIP-seq measurements (top), shown for a representative 1 Mb region of chromosome 1 where ChIP-seq signals are densest. For visualization purposes, a 1 kb resolution was used to plot the predicted normalized Z scores, with highly probable binding sites represented as red peaks. (B) AUC score for prediction accuracy based on the normalized Z scores, averaged over a 500 bp window to match the experimental resolution of 420 bp.
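The normalization and window averaging described here can be sketched as follows. This is an illustration under stated assumptions: the input is one predicted energy per genomic position, the sign is flipped so that lower energies give higher Z scores (consistent with binding sites appearing as peaks), and the windows are non-overlapping. The function name is hypothetical.

```python
import numpy as np


def windowed_z_scores(energies, window=500):
    """Convert per-position energies to Z scores and average them in
    consecutive, non-overlapping windows of `window` positions."""
    e = np.asarray(energies, dtype=float)
    # Assumption: lower energy means more probable binding, so flip the sign.
    # Requires non-constant input, otherwise the standard deviation is zero.
    z = (e.mean() - e) / e.std()
    n_windows = z.size // window
    # Drop the trailing positions that do not fill a complete window.
    trimmed = z[: n_windows * window]
    return trimmed.reshape(n_windows, window).mean(axis=1)
```

Calling the same function with `window=1000` would give the 1 kb resolution used for plotting in panel A.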
![](https://prod--epp.elifesciences.org/iiif/2/105565%2Fv1%2Fcontent%2F595895v2_fig5.tif/full/max/0/default.jpg)
Enhanced protein-DNA simulation model incorporating the IDEA-optimized energy model.
(A) The prediction of the protein-DNA binding affinity for 9 protein-DNA complexes using the IDEA-learned energy model shows a strong correlation with the experimental measurements [74]. ΔΔG represents the changes in binding free energy relative to the protein-DNA complex with the lowest predicted binding free energy. The predicted binding free energies are presented in reduced units, as explained in the Methods. (B) Illustration of an example protein-DNA complex structure (PDB ID: 9ANT) and the coarse-grained simulation used to evaluate protein-DNA binding free energy. A typical free energy profile was extracted from the simulation, using the center of mass (COM) distance as the collective variable. The shaded region represents the standard deviation of the mean. Representative structures of bound and unbound proteins are shown above, with protein in blue and DNA in yellow. (C) Incorporating the IDEA-optimized energy model into the simulation model improves the prediction of protein-DNA binding affinity, compared to the prediction by the previous model with electrostatic interactions and uniform non-specific attraction between protein and DNA (Fig. S9). The predicted binding free energies are presented in physical units. Error bars represent the standard deviation of the mean.
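The ΔΔG convention in panel A, where each value is reported relative to the complex with the lowest predicted binding free energy, amounts to a simple reference shift. A minimal sketch is given below. The function name and the optional `reference` argument are hypothetical additions for illustration.

```python
import numpy as np


def delta_delta_g(free_energies, reference=None):
    """Shift binding free energies so they are relative to a reference.

    With reference=None the lowest value is used, as in panel A; pass an
    explicit value (for example, a wild-type free energy) to use another
    reference.
    """
    g = np.asarray(free_energies, dtype=float)
    if reference is None:
        reference = g.min()
    return g - reference
```

With the default reference, every ΔΔG is non-negative and the strongest predicted binder sits at zero.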