1. Computational and Systems Biology
Download icon

Building accurate sequence-to-affinity models from high-throughput in vitro protein-DNA binding data using FeatureREDUCE

  1. Todd R Riley
  2. Allan Lazarovici
  3. Richard S Mann
  4. Harmen J Bussemaker Is a corresponding author
  1. Columbia University, United States
  2. University of Massachusetts Boston, United States
Research Article
Cited
3
Views
877
Comments
0
Cite as: eLife 2015;4:e06397 doi: 10.7554/eLife.06397

Abstract

Transcription factors are crucial regulators of gene expression. Accurate quantitative definition of their intrinsic DNA binding preferences is critical to understanding their biological function. High-throughput in vitro technology has recently been used to deeply probe the DNA binding specificity of hundreds of eukaryotic transcription factors, yet algorithms for analyzing such data have not yet fully matured. Here, we present a general framework (FeatureREDUCE) for building sequence-to-affinity models based on a biophysically interpretable and extensible model of protein-DNA interaction that can account for dependencies between nucleotides within the binding interface or multiple modes of binding. When training on protein binding microarray (PBM) data, we use robust regression and modeling of technology-specific biases to infer specificity models of unprecedented accuracy and precision. We provide quantitative validation of our results by comparing to gold-standard data when available.

https://doi.org/10.7554/eLife.06397.001

eLife digest

Transcription is the process by which the information contained within DNA is copied to a short-lived molecule called RNA so that it can be transported to other areas of the cell for various purposes. Transcription factors are key components in this process. These proteins recognise and gather at specific sequences of DNA near genes, and then assist the enzymes that copy the information in the gene into a molecule of RNA. This means that transcription factors essentially control which genes are expressed, and when and where these genes are expressed.

Recent technological advances have made it possible to identify where transcription factors can bind within DNA sequences. Yet, while a lot of data has been generated in this area, the computational tools needed to make sense of it have not kept pace.

Now, Riley et al. have developed software called FeatureREDUCE that will allow researchers to build computer predictions of how strongly a transcription factor will interact with specific short sections of DNA sequence. The software can be applied to experimental data collected in so-called ‘protein binding microarray’ experiments. FeatureREDUCE can also be used to investigate questions in the field of transcription factor research that had previously remained unanswered. First, to what level of detail can data obtained from recent technological advances be understood? Second, can transcription factors bind to DNA in more than one way, and can data from protein binding microarrays be used to uncover this?

Riley et al. show that FeatureREDUCE can produce accurate and interpretable clues about the biology behind how transcription factors recognize DNA sequences. This includes how mutations as small as a change to single DNA letter can affect recognition. The next step will be to use the software to make sense of the existing volume of experimental data regarding protein-DNA interactions and data that will be generated in future experiments.

https://doi.org/10.7554/eLife.06397.002

Introduction

Transcription factors (TFs) play a central role in the regulation of gene expression. To be able to understand and predict the behavior of the gene regulatory circuitry in any given organism, we need to know the in vivo DNA binding preferences of the TFs that its genome encodes. In recent years, a number of high-throughput in vitro technologies have been introduced that can provide such information (Berger and Bulyk, 2009; Warren et al., 2006; Maerkl and Quake, 2007; Zhao et al., 2009; Slattery et al., 2011; Berger et al., 2006). However, while the volume of the data generated using these assays dwarfs that of more traditional measurements of protein-DNA interaction strength, the available computational methodology for analyzing them has not fully matured (Weirauch, 2013).

The number of base pairs that constitute the DNA 'footprint' within which base identity can influence binding affinity depends strongly on the three-dimensional structure of the DNA-binding domain (DBD) of the TF. In theory, as long as thermodynamic equilibrium can be assumed, sequence specificity is completely defined by a table containing the (relative) affinity with which the DBD binds to each possible oligonucleotide within the footprint. This tabular approach has been widely used to analyze Protein Binding Microarray (PBM) data (Berger and Bulyk, 2009). It comes with significant challenges, however. First, the size of the oligomer table grows exponentially with footprint size, which in practice limits it to eight base pairs, shorter than the footprint of most TFs. Even for octamer tables, the number of affinity parameters to be estimated is on the same order as the number of PBM data points. This limits precision and necessitates the use of non-parametric methods (as opposed to parameterized biophysical methods), resulting in an associated loss of quantitative information.

A long-standing alternative has been to assume that each nucleotide position within the footprint contributes independently to the overall binding affinity. The most commonly used representation of sequence specificity that makes this independence assumption is the position weight matrix (PWM) (Berg and von Hippel, 1987; Stormo and Fields, 1998; Stormo, 2000; Djordjevic et al., 2003), which defines position-specific base frequencies. Algorithms for inferring the PWM coefficients traditionally aim to maximize its information content relative to a random background model (Lawrence et al., 1993; Frech et al., 1997; Bailey, 1995; Roth et al., 1998). In an alternative approach, sequence specificity is represented in terms of the relative affinity (or, equivalently, the difference, △△G, in binding free energy) associated with each possible point mutation of the optimal sequence (Stormo and Yoshioka, 1991; Stormo et al., 1993), and summarized in the form of a position-specific affinity matrix (PSAM) (Foat et al., 2006; Bussemaker et al., 2007). The difference in philosophy between the PWM and PSAM representations also leads to a different approach to estimating their coefficients. It is no longer the information content (i.e. the height of the letters in the standard sequence logo) that is being optimized, but rather the ability of the PSAM to quantitatively explain the variation in a measurable quantity in terms of variation in the nucleotide sequence associated with each quantity (for instance, the expression level of a gene in terms of its upstream promoter sequence). In the case of PBM data, the PSAM parameters are inferred by performing a nonlinear fit of a sequence-based model that predicts the signal intensity for each probe. The first implementations of this idea were the MatrixREDUCE (Foat et al., 2006; Foat et al., 2005) and PREGO (Tanay, 2006) algorithms; a more recent extension is BEEML-PBM (Zhao and Stormo, 2011).

Whether dependencies between nucleotide positions can be accurately estimated from PBM data and used to refine models of binding specificity remains an open question (Weirauch, 2013; Zhao and Stormo, 2011; Benos et al., 2002). Furthermore, while the existence of alternative binding modes is now widely recognized, accurate quantification of their relative usage has not yet been attempted. To address these needs, we developed our FeatureREDUCE software. It provides a flexible framework for building sequence-to-affinity models from PBM data (Figure 1).

The FeatureREDUCE workflow for analyzing PBM intensities.

(1) A robust method is used to estimate relative affinities for each K-mer of a given length. The K-mer with the highest affinity is chosen as the seed. (2) Using the seed as a reference, robust linear regression is used to estimate the relative affinity parameters in each column of the position-specific affinity matrix (PSAM). (3) With the current affinity model, linear regression is used to estimate the positional bias profile across the probe. (4) An optional step uses nonlinear regression to solve for the free protein concentration. (5) Robust regression is used to estimate free energy contributions associated with higher-order sequence features such as dinucleotides. (6) Steps 2 through 5 are repeated until convergence. (7) The procedure results in a feature-specific affinity model (FSAM) that can be used to predict the relative affinity for any DNA sequence.

https://doi.org/10.7554/eLife.06397.003

Results and discussion

FeatureREDUCE is based on an extensible biophysical model in which the binding free energy difference △△G(SrefS) between an arbitrary nucleotide sequence S and a reference sequence Sref (usually taken as the highest-affinity sequence) is defined as a sum of parameters △△Gφ over all nucleotide sequence 'features' φ that distinguish S from Sref (see Materials and methods). The full set of possible features in which S and Sref differ includes all possible single-base substitutions by default, but can be supplemented with dinucleotides that model dependencies and/or insertions at specific positions within the binding site that model variation in binding mode. Such a feature-based approach has been used previously (Sharon et al., 2008; Gordân et al., 2013; Zhou et al., 2015), but, as we will argue below, our approach to estimating the coefficients of the model is different and optimal.

The contribution of a binding site to the PBM intensity depends on its position within the probe, as was previously demonstrated by planting the same motif at different offsets (Berger et al., 2006). This may preclude accurate model estimation unless it is dealt with explicitly. Moreover, whenever the TF binds near the free end of a probe, loss of contacts with the DNA backbone can reduce binding affinity. FeatureREDUCE captures such spatial bias by introducing an independent multiplicative correction factor for the ratio [TF]/Kd at each offset within the probe (Figure 2a). These coefficients are estimated from the PBM intensities in parallel with the △△G parameters (see Materials and methods). The positional bias profile inferred by FeatureREDUCE for the homo-dimer Cbf1p is shown in Figure 2b. It indicates that the magnitude of the contribution of an individual binding site to the PBM intensity can vary by an order of magnitude depending on its offset within the probe, and that there is preference for Cbf1p binding away from the substrate. For Pho4p, binding near the free end of the probe shows an opposite trend (Figure 2—figure supplement 1). The fraction of the variance (R2) in PBM intensity explained by a 10-bp independent-nucleotide model increases dramatically, from 48% to 71%, after accounting for the positional bias. FeatureREDUCE can also detect any preference for monomeric TFs to bind in one of the two possible orientations on the dsDNA probes. For example, Zif268 exhibits a strong bias for binding to the negative strand over the positive strand (Figure 2c). Indeed, it is known that Zif268 requires non-specific contacts with the DNA backbone on the 5'-end of the motif (Berger et al., 2006). BEEML-PBM (Zhao and Stormo, 2011) also models positional biases along the probes, but it does not account for orientation preference, or for overhang binding at the free end of the probe.

Figure 2 with 1 supplement see all
Quantifying PBM-specific positional and orientational bias.

(a) Accounting for biases related to the position of the binding site within the probe. The effective protein concentration is lower closer to the substrate, presumably due to steric hindrance. Furthermore, binding near the free end of probe is associated with loss of contacts with the DNA backbone. (b) Positional bias profile for the homo-dimeric bHLH transcription factor Cbf1p, as inferred by a model fit to the PBM intensities. (c) Idem, for the monomeric zinc finger transcription factor Zif268. Figure 2—figure supplement 1 shows how positional bias can be used as an indicator of data quality.

https://doi.org/10.7554/eLife.06397.004

The readout of base identity at different nucleotide positions is only approximately independent. Indeed, various studies have analyzed whether representations of sequence specificity that account for nucleotide dependencies are more accurate than those that do not (Sharon et al., 2008; Agius et al., 2010; Lee et al., 2002; Stormo et al., 1986; Zhou and Liu, 2004). Controversy, however, remains about whether the additional parameters associated with such dependencies reflect structural mechanisms or technology-specific biases (Weirauch, 2013). In the biophysical model that underlies FeatureREDUCE, we model dependencies by simply including additional DNA sequence features φ that define base identity at two (or more) nucleotide positions, and estimating the corresponding free energy parameters △△Gφ along with those for single nucleotides (see Materials and methods). The nucleotide dependencies discovered by FeatureREDUCE for Cbf1p are shown in Figure 3a. As expected for a model with additional parameters, accounting for dependencies significantly increased the fraction of the variance that could be explained when training on PBM intensities (R2 improved from 71% to 96%). The real question, however, is how well the inferred model parameters perform on independent validation data.

Figure 3 with 2 supplements see all
Robust estimation of dependencies between nucleotide positions.

(a) Overview of the dependencies between pairs of neighboring nucleotides positions identified by FeatureREDUCE for homodimers of the basic helix-loop-helix (bHLH) factor Cbf1p. (b) Including dinucleotide dependencies in the sequence-to-affinity model, in combination with the use of robust regression, improves the ability to delineate Gene Ontology associations with Cbf1p targets predicted from the genome sequence. Figure 3—figure supplement 1 shows the crucial importance of using robust inference methods for estimating the binding free energy correction terms associated with dinucleotide features. Figure 3—figure supplement 2 shows the underlying cumulative distributions of yeast promoter affinities for 'sulfur compound metabolic process', the GO category with the most statistically significant association with Cbf1p.

https://doi.org/10.7554/eLife.06397.006

A unique aspect of FeatureREDUCE is the use of robust inference techniques (Huber and Ronchetti, 2009), which, as it turns out, is crucial for obtaining accurate estimates of the various contributions to the binding free energy. To demonstrate this, we compared our results to measurements of binding affinity for Cbf1p obtained using the orginal version of the MITOMI technology (Maerkl and Quake, 2007), which are in excellent (R2 = 0.95) agreement with similar measurements obtained using surface plasmon resonance (Berger et al., 2006; Teh et al., 2007). We used these 'gold-standard' binding affinity measurements to assess the quality of the sequence-to-affinity models inferred from PBM data by FeatureREDUCE. When we fit a position-specific affinity matrix (PSAM) based model that ignores dependencies between nucleotides, the root-mean-square error (RMSE) between FeatureREDUCE model predictions and gold-standard MITOMI 1.0 measurements improved from 0.071 to 0.035 when standard least-squares fitting was replaced by robust iteratively re-weighted least-squares (Figure 3—figure supplement 1, left panels). When positional dependencies were added to the model (feature-specific affinity model; FSAM), it actually performed worse than the model assuming independence between nucleotides (position-specific affinity matrix; PSAM) when we used standard least-squares fitting, indicative of over-fitting to noise in the training data. When we used robust regression, however, the RMSE for the FSAM-based model was significantly better than for the PSAM-based model (Figure 3—figure supplement 1, right panels). These results indicate that dependencies within the binding interface indeed exist, and that they can be modeled incorrectly when non-robust regression techniques are used.

We also assessed the ability of our models to make predictions regarding in vivo TF function. First, we found that the inclusion of nucleotide dependencies when predicting aggregated yeast promoter affinities improves the ability to delineate Gene Ontology categories associated with regulation by Cbf1p (see Materials and methods), but again only when robust inference techniques are used (Figure 3B; see also Figure 3—figure supplement 2). Second, to assess the extent to which we can quantitatively predict in vivo binding, we considered the degree of occupancy by Cbf1p at 955 potential genomic binding sites of type NNCACGTGNN (E-box) in yeast cells growing in rich media as measured by ChIP-seq (Zhou and O'Shea, 2011). Using a simple thermodynamic equilibrium model with a single free protein concentration parameter to account for binding saturation in the ChIP-seq experiment, we found that by this in vivo measure, FeatureREDUCE performs well (RMSE = 0.075), and significantly better than BEEML-PBM (Zhao and Stormo, 2011) (RMSE = 0.156); full data are shown in Figure 4.

ChIP-seq based validation of position-specific affinity matrix (PSAM) inferred for Cbf1p.

(a) Direct comparison between relative affinities for 10-mers inferred from PBM intensities by FeatureREDUCE and relative in vivo occupancy at 955 genomic locations of type NNCACGTGNN (E-box with flanks) as measured by ChIP-seq (Zhou and O'Shea, 2011). Trimmed-mean (trim = 10%) ChIP-seq fold-enrichments were computed for all unique 10-mer sequences that occur at least three times in the genome. To account for saturation of higher-affinity binding sites, a basic equilibrium model (green curve) was fit with a single free-protein parameter. Red lines indicate the error between the observed and predicted relative ChIP enrichments. (b) The same plot for the BEEML-PBM algorithm (Zhao and Stormo, 2011). The same equilibrium model (green curve) was fit, but the optimal free protein concentration parameter was much lower than in (a), so the saturation is not apparent in this case.

https://doi.org/10.7554/eLife.06397.009

The accuracy of FeatureREDUCE was demonstrated more broadly in a recent benchmark comparison between 26 different PBM data analysis algorithms (Weirauch, 2013). This study employed two distinct performance metrics. The in vitro metric quantified the accuracy with which probe intensities were predicted in a test PBM experiment using a model trained on independent training data from a PBM experiment for the same TF but with a different probe design; this was done for 66 different TFs from various structural families. The in vivo metric used a non-parametric score to quantify accuracy in ranking sequences in terms of their local ChIP-seq enrichment, for nine different TFs. FeatureREDUCE emerged as the top-performing algorithm according to both criteria.

Weirauch et al. (Weirauch, 2013) first assessed algorithm performance across a large dataset by using cross-validation between two different PBM designs. However, one of their surprising findings was that some algorithms generate models that perform well across different PBM designs but poorly when predicting in vivo binding, presumably due to over-fitting to PBM-specific biases. For example, the performance of one of the algorithms went from second-best in the PBM cross-validation test to worst in the ChIP-seq prediction test (Weirauch, 2013). Likewise, an extension of BEEML-PBM that adds dinucleotide parameters (Zhao et al., 2012) did not perform well on ChIP-seq data (Weirauch, 2013), presumably because it does not employ robust inference techniques. We conclude that FeatureREDUCE is currently the only algorithm that succeeds in parameterizing dependencies within the binding site.

Another current debate in the field is whether or not PBM data can provide evidence of alternative DNA binding modes employed by the same TF (Zhao and Stormo, 2011; Badis et al., 2009; Morris et al., 2011; Gordân et al., 2011). Oligomeric TF complexes can often bind DNA using different relative orientations of and/or spacing between their subunits (Gordân et al., 2011). The modeling framework employed by FeatureREDUCE provides a natural opportunity to perform forward selection of multiple binding modes, represented by distinct PSAMs, which can subsequently be combined into a single predictive model and refined in an iterative manner (see Materials and methods). We applied this procedure for the basic leucine zipper (bZIP) proteins Yap1p and Gcn4p. FeatureREDUCE accurately captures the known intrinsic differences in half-site preference of Yap1p (which prefers TTAC, typical for the C/EBP subfamily (Warren et al., 2012); Figure 5a) and Gcn4p (which prefers TGAC, typical for the CREB subfamily; Figure 5b). It is well known that each protein, when binding DNA as a homo-dimer, can do so with or without a 1-bp overlap between the half-sites. What is unique about our approach is that the multi-PSAM model captures fold-differences in thermodynamic stability between the binding modes by estimating an overall relative affinity coefficient for the secondary PSAM. For instance, the binding affinity of Gcn4p for TGACGTCA is predicted to be 24% of that of TGASTCA (cf. Figure 5b). This is in good agreement with recent high-throughput measurements of Gcn4 binding constants for all 12-mers using HiTS-FLIP (Nutiu et al., 2011), specifically, Kd = 65 nM and 15 nM for the respective sequences. In addition, while the accuracy of the dominant TGASTCA motif essentially does not change, the accuracy of the weaker TGACGTCA motif increases significantly when using our multi-PSAM model compared to the single-fit model (R2 improved from 69% to 80% for sites with relative affinity > 0.1). We note that the results of our analysis can be used in a straightforward manner to predict which binding mode is dominant for a particular DNA sequence.

Quantifying the differential usage of alternative binding modes.

The transcription factors Yap1p (a) and Gcn4p (b) can each bind in two distinct modes, in which the two half-sites respectively do (top) and do not (bottom) overlap. Not only is the sequence of preferred half-site different between the two factors, the preferred binding mode is different too, as indicated by the relative association constant (Ka) inferred from the PBM data by FeatureREDUCE.

https://doi.org/10.7554/eLife.06397.010

In conclusion, FeatureREDUCE analysis yields an accurate and interpretable biophysical affinity model that can provide detailed clues about the structural mechanisms that underlie protein-DNA recognition (Roider et al., 2007; Kinney et al., 2007), such as dependencies between nucleotide positions and the modulation of binding mode by variation in the underlying DNA sequence. Our algorithm allows one to make optimal use of the large volume of data that has been generated using microarray-based protein-nucleotide interaction profiling technology.

Materials and methods

Equilibrium model for TF-DNA interaction

FeatureREDUCE builds upon the biophysical model for protein-DNA interaction on which Matrix-REDUCE (Foat et al., 2006) is based. A transcription factor P binds to DNA sequence S to form a TF-DNA complex PS, with forward and backward rate constants kon and koff, respectively:

(1) P+S koffkon PS

The affinity of P for S can be expressed in terms of an equilibrium association constant Ka(S) or dissociation constant Kd(S):

(2) Ka(S)=1Kd(S)=konkoff=[PS][P][S]

The Gibbs free energy of binding per mole (relative to a 1M reference concentration) is given by ∆G = RT ln(Kd /1M) is, where R is the universal gas constant and T the absolute temperature. The fractional occupancy, N(S), defined as the probability that S is bound by P, can be expressed as:

(3) N(S)=[PS][PS]+[S]=[P][P]+Kd(S)=[P]Ka(S)[P]Ka(S)+1

If we assume a low-concentration regime where [P]Kd(S) where no saturation occurs, the expression for the occupancy simplifies to:

(4) N(S)  [P]Kd(S)=[P]Ka(S)

Relative to a reference sequence Sref (typically chosen to be the highest-affinity sequence), there will be multiplicative change in the affinity Ka, or, equivalently, an additive change ∆∆G in the free energy of binding, for any other sequence S:

(5) Ka(S)Ka(Sref) = exp-G(S)RT

where

(6) G(S)=G(S)-G(Sref)

Feature-based model of sequence specificity

FeatureREDUCE models the relative binding free energy for sequence S as a sum of parameters associated with the DNA sequence features φΦ(S) that characterize S:

(7) G(S)=φΦ(S)Gφ

In this study, we considered both single-nucleotide features (e.g. φ = A1, denoting the presence of an A at position 1 within the binding site window) and adjacent-dinucleotide ones (e.g., φ C3G4, denoting the presence of a CpG dinucleotide starting at position 3). At a given position, exactly one of a set of 4 single-nucleotide features (A, C, G, or T) will be present in any particular sequence. We refer to such a set of mutually exclusive and jointly exhaustive features as a 'block'. Each sequence has exactly one feature from each block. A binding window of length L contains L mononucleotide blocks. There is a one-to-one correspondence between the (exponentiated) ∆∆Gφ/RT values in a mononucleotide block and a column in a position-specific affinity matrix (PSAM). Together, the dinucleotide features constitute − 1 dinucleotide blocks, each consisting of 16 features. Within each block, the 4 or 16 ∆∆Gφ values are only defined up to a common additive constant. The convention we use for mononucleotide blocks is that ∆∆Gφ = 0 for the feature that occurs in the reference sequence. For dinucleotide blocks, however, we use a different convention intended to minimize the number of ∆∆Gφ values that are significantly different from zero (see below for details).

Modeling PBM intensity

The model on which MatrixREDUCE (Foat et al., 2006) was based assumes that the measured fluorescence intensity y(S) for probe S is given by a sum over all possible ways in which the TF can bind to the probe (all possible offsets in either the forward or the reverse direction), which we will here refer to as partial 'views' Sv on the full probe sequence S:

(8) y(S) = β0+β1veΔΔG(Sv)/RT

In FeatureREDUCE, to account for positional and directional biases in the extent to which binding affinity in a particular view contributes to the probe intensity, we introduce coefficients γν that are shared across all probes:

(9) y(S) = β0+β1vγveΔΔG(Sv)/RT

We do not necessarily assume a low TF concentration (cf. Equation 4). Moreover, we include a term ∆∆Gns that accounts for non-specific binding, which helps capture the DNA binding characteristics of the protein succinctly and has a positive effect on numerical convergence when estimating the model parameters. Together, this leads to the following model:

(10) y(S) = β0+β1ν11+γveΔΔG(Sv)/RT+eΔΔGns/RT1

Robust model parameter estimation

FeatureREDUCE estimates the parameters in Equation 9 using iteratively reweighted least squares (IRLS). This procedure down-weights data points that have high residuals compared to the fitted model. However, the weights depend on the residuals and the residuals depend on the weights. This dependency is broken by first choosing uniform initial weights, then iteratively refitting the model and re-calculating new residuals and weights, until convergence. IRLS prevents over-fitting and thereby allows for improved parameter estimation. FeatureREDUCE uses the 'rlm' function in the MASS package for R to perform the robust regression. We set the trimmed probes hyperparameter to 20%, as we previously found this value to be optimal during cross-validation on the DREAM5 dataset (Weirauch, 2013). Repeated rounds of parameter re-estimation that cycle over feature 'blocks' are performed until convergence, first for mononucleotide blocks (resulting in a converged PSAM), and subsequently for dinucleotide blocks. Specifically, when estimating the free energy parameters for a given block B, the following model is fit:

(11) y(S) = β0+φBβφvVφ(S)e-(G(Sv)-Gφ)/RT

Here Vφ(S) denotes the subset of views on  that contain feature φ. Note that only the βφ coefficients are treated as fit parameters here. The ∆∆Gφ values come from the previous iteration. However, they are re-estimated as:

(12) ΔΔGφRT=logβφβPSAM

Here βPSAM stands for the value of β1 in Equation 12 when only mononucleotide features are fit. With this normalization, ∆∆Gφ is no longer equal to zero for dinucleotide features occurring in the reference sequence. However, most of the ∆∆Gφ values for dinucleotide features now tend to be close to zero, which is desirable. In each round, the spatial bias parameters γν are also re-estimated using robust regression. After iteration and convergence over multiple such rounds, FeatureREDUCE fits the additional parameters in the non-linear model in Equation 10 using the Levenberg-Marquardt nonlinear least-squares algorithm.

Seed discovery

Finding a good seed for the feature-based regression procedure is a crucial first step in our algorithm. To select the seed from the set of all oligomers of length K, we developed a dedicated robust iterative algorithm based on trimmed means designed to deal with the sparseness of the design matrix. We fit probe intensities as a weighted sum of the number of occurrences of each of the 4K oligomers:

(13) y(S)=mβmXm(S)

Here Xm(S) denotes the number of occurrences of oligomer m in sequence S. Regression coefficients βm were initialized to a small value (10−4) representative of non-specific binding. Next, for each individual probe S, we determined the coefficient value β’m(S) that exactly predicts the intensity:

(14) βm'(S) = 1Xmy(S)m'm βm'Xm'(S)

We then computed the trimmed mean βm' (removing the top and bottom 15% values) of all β’m(S) values across all probes for which Xm > 0. Finally, we updated the coefficient according to:

(15) βm(1α)βm+αβm'

using a step size α = 0.1. This was done for all oligomers in parallel. After iteration and convergence, the oligomer with the highest regression coefficient value was chosen as the seed.

Palindromic symmetry

This step in the algorithm starts by separately fitting and then comparing positive and negative strand PSAMs. If these are similar according to the L1-norm (within a small tolerance) then the motif is flagged as symmetric. The PSAM is then rebuilt using the highest-affinity symmetric seed. The symmetric version of the binding motif tends to be more accurate while using half the number of parameters.

Motif length

We determine the length of the binding site by adding columns to either side of the PSAM until the coefficient of determination (R2) decreases. An increase indicates that there are direct or indirect protein-DNA contacts being made at the additional positions, while a decrease indicates that we have increased the length of the motif past the effective range of specificity and inadvertently excluded valid binding sites at the end of the PBM probe.

Testing for gene ontology association

Following (Ward and Bussemaker, 2008), we first predict the total affinity of each gene’s upstream region, as a sum over a sliding window of binding affinities computed using our model. Next, all genes were ranked by this total promoter affinity and the Wilcoxon-Mann-Whitney rank-sum test was used to score association with each Gene Ontology category (Gene Ontology and Consortium, 2015). P-values were corrected for multiple testing using a Bonferroni correction based on the total number of GO categories tested in parallel.

Alternative binding modes

To infer multiple-binding models for a single TF, the following procedure was used: (i) Fit a single-binding-model using the standard algorithm. (ii) Fit additional binding-mode model(s) to the residuals of the previous model. (iii) Iteratively update each binding mode as a weighted mean between a newly fit model and the model from the previous iteration, until convergence. (iv) Perform a final multiple regression to determine the relative preference for (and statistical significance of) each binding mode.

Motifs containing poly-G/C stretches

Stretches of four or more guanines affect the efficiency of PBM probe synthesis and were replaced by their reverse complement in some PBM designs (Berger et al., 2008). Therefore, if the highest-affinity motif contains four or more consecutive cytosines, FeatureREDUCE uses only the positive strand to generate the biophysical model; conversely, if the highest-affinity motif contains a poly-G stretch, only the negative strand is used.

Software availability

http://bussemakerlab.org/software/FeatureREDUCE/

References

  1. 1
  2. 2
    High resolution models of transcription factor-DNA affinities improve in vitro and in vivo binding predictions
    1. P Agius
    2. A Arvey
    3. W Chang
    4. WS Noble
    5. C Leslie
    (2010)
    PLoS Computational Biology, 6, 10.1371/journal.pcbi.1000916.
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
    Wiley Series in Probability and Statistics
    1. PJ Huber
    2. E Ronchetti
    (2009)
    354, Robust statistics. 2nd, Wiley Series in Probability and Statistics, Hoboken, NJ, Wiley. .
  19. 19
  20. 20
  21. 21
    Precise physical models of protein-DNA interaction from high-throughput data
    1. JB Kinney
    2. G Tkacik
    3. CG Callan
    (2007)
    Proceedings of the National Academy of Sciences of the United States of America 104:501–506.
    https://doi.org/10.1073/pnas.0609908104
  22. 22
  23. 23
  24. 24
  25. 25
  26. 26
  27. 27
  28. 28
  29. 29
  30. 30
  31. 31
  32. 32
  33. 33
  34. 34
    Specificity of the mnt protein determined by binding to randomized operators
    1. GD Stormo
    2. M Yoshioka
    (1991)
    Proceedings of the National Academy of Sciences of the United States of America 88:5699–5703.
    https://doi.org/10.1073/pnas.88.13.5699
  35. 35
  36. 36
  37. 37
  38. 38
  39. 39
  40. 40
  41. 41
  42. 42
  43. 43
  44. 44
  45. 45
  46. 46
  47. 47
  48. 48
  49. 49

Decision letter

  1. Nir Friedman
    Reviewing Editor; The Hebrew University of Jerusalem, Israel

eLife posts the editorial decision letter and author response on a selection of the published articles (subject to the approval of the authors). An edited version of the letter sent to the authors after peer review is shown, indicating the substantive concerns or comments; minor concerns are not usually shown. Reviewers have the opportunity to discuss the decision before the letter is sent (see review process). Similarly, the author response typically shows only responses to the major concerns raised by the reviewers.

Thank you for sending your work entitled "Building accurate sequence-to-affinity models from high-throughput in vitro protein-DNA binding data using FeatureREDUCE" for consideration at eLife. Your article has been favorably evaluated by Aviv Regev (Senior editor) and three reviewers, one of whom is a member of our Board of Reviewing Editors.

The Reviewing editor and the other reviewers discussed their comments before we reached this decision, and the Reviewing editor has assembled the following comments to help you prepare a revised submission.

The main concerns raised by the reviewers were:

1) Lack of details about the methods, and in particular how dinucleotide interactions are incorporated and their impact on the quality of predictions, and the use of robust regression.

2) More thorough and detailed evaluation of the performance of the method compared to others (the reviews provide specific suggestions).

Reviewer #1:

During recent years there has been a wealth of information gathered about the DNA-binding specificity of proteins using microarrays and sequencing methods. A challenge posed by this data is how to generalize from experimental results to a model of the DNA-binding preferences (or affinities) of the protein. The authors describe a computational tool (FeatureREDUCE) for dealing with these challenges. They evaluate their tool on a few in vitro and in vivo datasets and compare it to recent state of the art tools.

I agree with the authors’ description of a shift in the field from data-poor regime (where only a few binding sets per protein were known) to a data-rich regime where high-throughput methods assigning affinity/preference of the protein to thousands sequences. Yet, the methodology for representing these affinities has been lagging behind the experimental advances. I also agree with their goal of finding such a representation to generalize (i.e., extrapolate) from the results of PBMs (which list affinity value for all possible 8-mers).

To the best of my understanding the new contributions here involve the following:

1) The use of feature-based "free-energy" based method combined with specific robust estimation methods. These extend previous methods of the authors by using published methods.

2) Estimating position-bias in PBM data in parallel to estimating sequence preferences.

3) The use of Poisson regression for SELEX-seq data.

The authors evaluate their claims by using existing datasets to evaluate the results of their method compared to few recent ones in the literature.

I list comments on these aspects of the paper, especially the first, below. These describe what I view as major problems in the current manuscript, but I believe they are all addressable.

The manuscript is, above all, a methods paper. It seems to be carefully designed method to solve a well-established and important problem and it definitely would be suitable for a specialized journal. For a wider audience, I would have expected a much more significant innovation that provided new and useful insights about the general computational problem and the underlying biological reality.

Major concerns:

The authors discuss at length the importance of their free-energy view of the problem. Frankly, the discussion relating free energy representation that defines a Gibbs distribution and probabilistic models (especially exponential models) has been hashed thoroughly in multiple venues and communities. In terms of representation, there is no difference between the representational powers of position-specific Affinity and position specific Probability matrix. I agree that affinities allow an elegant way to adjust the probability due to changes in protein concentration (if you believe the system is in equilibrium). However, this is a much more minute difference than the authors' presentation makes it seem. I would tone down the "sale pitch" on this view. It is a useful one to have (and has been discussed since early days, e.g., Berg and von Hippel, 1987), but it is not inherently superior.

Similarly, the introduction of feature-based representations of free-energy differences (ΔΔG) is mathematically similar or identical to previous methods (e.g., Sharon, Lubliner and Segal, 2008).

There is a great discrepancy between the main text and figures and the Materials and methods regarding the use of features (one of the novelties of the current manuscript). The discussion in the Materials and methods mentions only position-specific mutation and do not mention any higher-order feature, which ones are searched, and how. From Figure 3 I assume these are strings of mutations. However, the way the method generates candidate features (all the ones I see in the figure involve pairs of adjacent positions) or search/decide among them is not specified.

Another discrepancy between the main text and the Materials and methods is the issue of robust estimation. The authors view the use of robust estimation as one of their main contributions (reference in the Abstract, dedicated supplemental figures, etc.). Yet, the Methods are vague about the details. The main text provides a reference for a book about robust statistics, which while surely a useful text, is not helpful for the reader who attempts to understand what actually happens in the estimation procedure.

Much of the evaluation of the method is based on its performance in a published comparison of many computational methods. It is surprising that one of the best performing methods in that comparison has been unpublished in the two years since the comparison was performed (the program has been available for download apparently).

The evaluation carried out in this manuscript is limited for cases where there is external information to compare against. I found the evaluation of GO enrichment p-value of Figure 3 as a very indirect (and inaccurate) indicator of quality (the difference between p-value of 10-8 to 10-7 are not that dramatic and can be driven by changes in the status of few genes). Similarly, the differences in RMSE in Figure 4 are not convincing me that the model is substantially different or better. First of all, ChIP signal (enrichment over background) is often not a reliable quantitative signal. Second, it is unclear how differences in low affinity sequences change actual predictions made with these models. I would expect an evaluation to provide the reader with a sense as to when the precise details do matter. I suspect they do, but the current presentation is focused on various summaries that are not transparent nor reflect actual differences in predictions.

Reviewer #2:

The manuscript describes FeatureREDUCE, a successor to the very successful MatrixREDUCE method for inferring TF binding preferences/(relative affinity) from in vitro binding data. FeatureREDUCE adds robust regression, dinucleotide affinity models, and various features specific for recent large-scale in vitro assays of TF binding (location models for PBMs, Poisson regression for sequencing-based approaches, etc.). Results are presented on a handful of TFs, so these examples simply illustrate the advantages of the new features rather than to extensively test them.

Although it seems iterative, and the validation is limited, this is very important work. The software package would be immediately useful to a large group of researchers. Despite the abundance of work in this area, there's really nothing as comprehensive (and correct) as the described software available. BEEML and BEEML-PBM come close but the manuscript nicely illustrates the advantages of FeatureREDUCE. The fact that this software works well was already illustrated in a previous study (Weirauch et al.) and all the important conceptual and engineering advances over the past few years are implemented in this package. I think eLife would be a great spot for this manuscript to end up.

A few things need to be fixed though. First, there are a number of missing details about the methodology that need to be filled in. Also, the manuscript is a little grandiose and needs to be toned down. Finally, and very importantly, the software needs to be open source and freely available. The value of this manuscript is the associated software; if that software isn't freely available, then there's no point in publishing this manuscript.

Major concerns:

1) It looks like in Figure 4 that you didn't permit BEEML to fit a protein concentration parameter as well, so this isn't a fair comparison. To make it fair, you could i) not fit a protein concentration parameter for FeatureREDUCE or ii) fit one for BEEML. Actually, you should do both, just to be sure.

2) Please provide more details in the "Robust Regression" about the MM-estimator. This technique will be new to most readers. How was it implemented? Are there any other free parameters besides the number of trimmed probes?

3) From Equation 11, it looks like BiasREDUCE requires non-linear regression. If so, then Figure 1 caption #3 is wrong.

4) How do you go from a seed k-mer to an initial PSAM? Presumably, you can't start with any zeros in the PSAM or you will get zero partial derivatives.

5) What is your tolerance for the L1-norm for the palindromes?

6) You need to provide more details about the multi-PSAM mode and experiments. Do you fit an initial PSAM and then fit a second one to the residuals using the entire pipeline (starting with #1)? Do you then re-fit the first PSAM? How do you decide whether or not a second PSAM is necessary?

Reviewer #3:

Riley and colleagues describe the development and application of a computational regression algorithm (FeatureREDUCE) to build more accurate affinity models from PBM or SELEX-seq datasets. This new method relies on iterative regression steps and the incorporation of dinucleotide dependencies to train improved affinity models for transcription factor binding. The utility of these affinity models is demonstrated on transcription factors with simple and complex binding modes.

Overall, this computational approach appears to display improved performance over other existing methods for building affinity models, and consequently will be of value to the broader scientific community. However, the current manuscript does not describe for the end-user the type of models that are generated for use in other applications – in particular how weights for the dinucleotide dependencies in the FSAM can be represented and readily displayed for interpretation. One subset of dinucleotide dependencies is displayed in Figure 3A, where surprisingly (if this reviewer is interpreting the data correctly) the strongest dinucleotide contributions to the model are substantial penalties against the consensus sequence. To this reviewer, this result is counter-intuitive, and deserves comment, as it does not mesh with a simplistic view of dinucleotide effects in protein-DNA recognition (e.g. a single amino acid simultaneously and synergistically contacting neighboring base pairs in the preferred binding site). Moreover, since the authors are claiming the demonstration of the existence of dinucleotide dependencies based on their analysis, which has been a contentious point in the field, further explanation is warranted.

[Editors' note: further revisions were requested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "Building accurate sequence-to-affinity models from high-throughput in vitro protein-DNA binding data using FeatureREDUCE" for further consideration at eLife. Your revised article has been favorably evaluated by Aviv Regev (Senior editor), a Reviewing editor, and two reviewers.

The authors responded to the issues raised. However, the reviewers’ opinion was that this response is too terse/unsatisfactory in places. We strongly recommend that the authors attempt to provide a clear understandable description of the methods, to the extent that someone with reasonable knowledge in the field but not in this project can understand what is being done in each step.

Essential revisions:

1) The software must be deposited to a public repository (GitHub or similar). Or at the very least, the open source version of FeatureREDUCE used in this manuscript should be made available through the supplement for this journal article. Published software should have an additional guarantee of availability though the journal and/or an outside authority like GitHub (or BitBucket, or similar; we do not wish to proscribe the channel for release).

2) Clarify the methods description of discrimination from reference sequence. Reviewer comment (following the authors' response):

Again, if this reviewer is interpreting the data correctly from Figure 3A, there is a penalty against CG dinucleotides at positions -1 and 1. (Likewise CA at -3,-2 and TG at 2,3) Based on the description in the Methods:

"FeatureREDUCE models the binding free energy for sequence S as a sum of coefficients associated with all the "features" that discriminate S from the reference sequence Sref (usually the DNA sequence with the highest affinity). In this study, we considered single-nucleotide features ("A at position 1") and adjacent dinucleotide features ("CG at positions 3 and 4")."

The authors describe the utilization of these features to discriminate S from Sref. The strongest (negative) dinucleotide features in Figure 3A are for dinucleotides that are in Sref, as presumably the highest affinity sequence is the reference sequence. Again, this reviewer may be missing some critical understanding of the algorithm, but based on the methods description the result does not fit with expectation. (Why would dinucleotide dependencies be discovered that apply to Sref?) Consequently, a more adequate description is required in the Materials and methods.

3) “Feature-based modeling of intensities”. This paragraph is hard to make sense of, and uses too many ill-defined terms. Please include what the features are, the coefficient (what are "columns of features"?), and how are defined (as you did for position-specific case).

4) The "Seed detection" paragraph is similarly obscure.

5) Probe-position effects – what is α? Are the γ coefficients shared among all probes? When/how are they estimated?

6) The authors use rlm (from MASS package) to estimate the "parameters in Equation 11". Up to now the authors discussed coefficients (are these parameters?). As far as I can tell, these do not appear in a linear form in Equation 11. How do you resolve that?

https://doi.org/10.7554/eLife.06397.029

Author response

Reviewer #1:

The manuscript is, above all, a methods paper. It seems to be carefully designed method to solve a well-established and important problem and it definitely would be suitable for a specialized journal. For a wider audience, I would have expected a much more significant innovation that provided new and useful insights about the general computational problem and the underlying biological reality.

The technical innovations presented here make it possible to do two things that were not possible before for the first time: (i) capturing dinucleotide dependencies in a way that holds up in cross-platform validation, (ii) quantify the relative importance of iple binding modes for the same transcription factor. Indeed, Reviewer #2 states that our work is important for the genomics community, and that eLife will be an excellent home for it.

The authors discuss at length the importance of their free-energy view of the problem. Frankly, the discussion relating free energy representation that defines a Gibbs distribution and probabilistic models (especially exponential models) has been hashed thoroughly in multiple venues and communities. In terms of representation, there is no difference between the representational powers of position-specific Affinity and position specific Probability matrix. I agree that affinities allow an elegant way to adjust the probability due to changes in protein concentration (if you believe the system is in equilibrium). However, this is a much more minute difference than the authors' presentation makes it seem. I would tone down the "sale pitch" on this view. It is a useful one to have (and has been discussed since early days, e.g., Berg and von Hippel, 1987), but it is not inherently superior.

We thank the reviewer for their careful reading of the manuscript and the constructive feedback. We have revised the text to make clear that we do not claim a new representation of DNA binding specificity, but present an improved method for estimating the binding specificity parameters from high-throughput data.

We agree that the affinity and probability representations of relative base preference can often be used equivalently, as only ratios between bases are physically meaningful. However, this equivalence only holds for the probability that a given protein molecule is bound by a DNA molecule of a particular sequence. Even in this case, as the reviewer notes, they are not exactly the same, as the probabilities can be compressed relative to the affinities due to saturation at higher free protein concentrations.

Affinities always reflect the intrinsic binding preferences of the transcription factor. Base probabilities by contrast are always context dependent. This can lead to confusion and incorrect interpretation. For instance, when base probabilities are used to summarize aligned top-scoring sequences, the proportionality with affinity is lost. This is why we emphasize the difference. In fact, one could say that the essence of what FeatureREDUCE does is to first model DNA sequence probabilities in terms of TF-DNA affinities, and then put this model in reverse in order to infer the affinity parameters from the PBM data.

Nevertheless, we agree with this reviewer that we could have described the similarities and differences with alternative approaches more clearly. Therefore, as you will see, we have thoroughly rewritten the relevant section in our Introduction.

Similarly, the introduction of feature-based representations of free-energy differences (ΔΔ G) is mathematically similar or identical to previous methods (e.g., Sharon, Lubliner and Segal, 2008).

We do agree that there is no essential difference between “energy” and “affinity” representations, as they are related by a simple log-transformation through the Boltzmann factor. We have clarified this in the text and added an additional reference (Gordon et al., 2013) to credit their early use of a feature-based representation of binding specificity.

There is a great discrepancy between the main text and figures and the Materials and methods regarding the use of features (one of the novelties of the current manuscript). The discussion in the Materials and methods mentions only position-specific mutation and do not mention any higher-order feature, which ones are searched, and how. From Figure 3 I assume these are strings of mutations. However, the way the method generates candidate features (all the ones I see in the figure involve pairs of adjacent positions) or search/decide among them is not specified.

We agree, and thank the reviewer for bringing this to our attention. The Materials and methods now have a newly added section “Feature-based models of binding affinity” that describes in more detail how FeatureREDUCE defines dinucleotide features and estimates free energy coefficients for them.

Another discrepancy between the main text and the Methods is the issue of robust estimation. The authors view the use of robust estimation as one of their main contributions (reference in the Abstract, dedicated supplemental figures, etc.). Yet, the Materials and methods are vague about the details. The main text provides a reference for a book about robust statistics, which while surely a useful text, is not helpful for the reader who attempts to understand what actually happens in the estimation procedure.

Again, we agree that this is a good idea, and have provided more detail in the Materials and methods section about the iteratively reweighted least-squares (IRLS) method for robust regression.

Much of the evaluation of the method is based on its performance in a published comparison of many computational methods. It is surprising that one of the best performing methods in that comparison has been unpublished in the two years since the comparison was performed (the program has been available for download apparently).

We would like to emphasize that following up on the study by Weirauch et al. (2013) is only one aspect of our manuscript. We also demonstrate for the first time how the relative preference between multiple binding modes can be accurately modeled.

The evaluation carried out in this manuscript is limited for cases where there is external information to compare against. I found the evaluation of GO enrichment p-value of Figure 3 as a very indirect (and inaccurate) indicator of quality (the difference between p-value of 10-8 to 10-7 are not that dramatic and can be driven by changes in the status of few genes). Similarly, the differences in RMSE in Figure 4 are not convincing me that the model is substantially different or better. First of all, ChIP signal (enrichment over background) is often not a reliable quantitative signal. Second, it is unclear how differences in low affinity sequences change actual predictions made with these models. I would expect an evaluation to provide the reader with a sense as to when the precise details do matter. I suspect they do, but the current presentation is focused on various summaries that are not transparent nor reflect actual differences in predictions.

Both the ChIP and GO analysis serve to show that the affinity models are useful as predictors of in vivo function of the TF. We agree with the reviewer that the in vivo ChIP enrichment signal in general can be highly complex, but the fact that we only use the ChIP enrichment values for Cbf1p at a few hundred predicted in vitro binding sites in the genome makes this problem less severe. The GO analysis admittedly is indirect, but it is an attempt to use the affinity model to predict target genes solely from promoter sequence. This analysis was inspired by a similar one performed by Maerkl & Quake (2007) to validate their MITOMI 1.0 assay. Our affinity model is used to predict the affinity landscape across the entire upstream promoter region of each gene, integrating high and low affinity sites. As such, it puts strong demands on the affinity model. Nevertheless, not only do we find that genes in GO categories consistent with the known function of the TF tend to be associated with higher promoter affinities (and thus presumably more responsive to changes in the activity of the TF), but also that the trends regarding the benefits of robust regression and dinucleotides are in the same direction as when we validate more directly against in vitroMITOMI 1.0 data for the same TF (Figure 3—figure supplement 1). The GO and ChIP analyses are valuable complementary results in our opinion.

To address the reviewer’s concern that it is not obvious that the improvement in RMSE when dinucleotide term are added to the model in Figure 4 is statistically significant, we have added a permutation test that explicitly addresses this: we performed 100,000 iterations of randomly permuting the inferred dinucleotide corrections in Figure 4 to show that the decrease in the RMSE by a third when including dinucleotide dependencies is indeed significant (p-value 2.7e-5). Also, this significant improvement in the affinity model is not due to just one or two binding sites. Rather, as seen by the red residual lines in Figure 4, most of the binding sites show an improvement in their predicted affinities when including the dinucleotide dependencies.

We disagree with the assertion that our analysis is not transparent: we show all the data underlying our summary statistics, both for the ChIP (Figure 4) and the GO (Figure 3—figure supplement 2) analysis. The GO scoring is done using a non-parametric distributional test (Wilcoxon-Mann-Whitney), and we do not classify genes as “target” or “non-target”, and therefore the sensitivity to small changes in the status of genes does not apply here. We have added a new section to the Methods that provides these details about the GO scoring. We now apply a Bonferroni correction to the p-values, and have updated Figure panel 3B accordingly.

Reviewer #2:A few things need to be fixed though. First, there are a number of missing details about the methodology that need to be filled in. Also, the manuscript is a little grandiose and needs to be toned down. Finally, and very importantly, the software needs to be open source and freely available. The value of this manuscript is the associated software; if that software isn't freely available, then there's no point in publishing this manuscript. It would just be free advertising for Columbia University, or Bussemaker lab, or whoever. If that's the goal, then I suggest arXiv. Ideally, the software would be put in GitHub or some other archiving service. 'Available by contacting the authors' is no longer acceptable in this field.

We thank the reviewer for acknowledging the relevance of our work. The missing details about the methods have been added, as described below and in response to the other reviewers. We have also toned down our manuscript where was warranted.

Our software is already publicly available via our lab website. Anybody can register, and obtain it. Our reason for doing this, rather than posting it on a completely open repository such as GitHub, is that we would like to be able to email the users about future updates to the software. In response to the reviewer’s request that the software be open source, we now also include the Java source code along with the compiled classes. We have also added working examples (“demos”) for each type of analysis.

1) It looks like in Figure 4 that you didn't permit BEEML to fit a protein concentration parameter as well, so this isn't a fair comparison. To make it fair, you could i) not fit a protein concentration parameter for FeatureREDUCE or ii) fit one for BEEML. Actually, you should do both, just to be sure.

Although it may not be obvious, the same model with a free protein concentration parameter was fit for BEEML as for FeatureREDUCE. In the case of BEEML, the best fit was obtained with a much smaller value of this protein concentration parameter, which makes the green curve look linear. We have updated the caption to clarify this and dispel any doubts that the comparison was done fairly.

2) Please provide more details in the "Robust Regression" about the MM-estimator. This technique will be new to most readers. How was it implemented? Are there any other free parameters besides the number of trimmed probes?

We have provided more detail in the Methods section about the iteratively reweighted least-squares (IRLS) method for robust regression. FeatureREDUCE uses the “rlm” function in the MASS package for R to perform the robust regression. There are no other free parameters besides the number of trimmed probes. See also our response to Reviewer #1.

3) From Equation 11, it looks like BiasREDUCE requires non-linear regression. If so, then Figure 1 caption #3 is wrong.

We believe the caption is correct. FeatureREDUCE solves Equation 11 by cycling through the steps in Figure 1. BiasREDUCE (step 3) does not fit the free-protein concentration and thus does not need to use non-linear regression. Only SaturationREDUCE (step 4) fits the free-protein concentration and uses non-linear regression.

4) How do you go from a seed k-mer to an initial PSAM? Presumably, you can't start with any zeros in the PSAM or you will get zero partial derivatives.

This statement is not correct. The affinity for an entire binding site is a product over positions (or, more generally, features). Therefore, the partial derivative with respect to one of the model coefficients is a product over all other positions/features. Consequently, there is no reason for the derivative to be zero when the feature coefficient is zero (as would be the case for all three single-base features at any given position that are not in the seed).

5) What is your tolerance for the L1-norm for the palindromes?

The tolerance is configurable. However, the default is (0.15 x motif length). The maximum L1-distance between the positive- and negative-strand PSAMs is (2 x motifLength). We have clarified this in the Materials and methods.

6) You need to provide more details about the multi-PSAM mode and experiments. Do you fit an initial PSAM and then fit a second one to the residuals using the entire pipeline (starting with #1)? Do you then re-fit the first PSAM? How do you decide whether or not a second PSAM is necessary?

We now provide more detail in the Materials and methods section about the modeling of alternative binding modes. The FeatureREDUCE pipeline remains the same. However, at each step, each binding-mode model is fit to the residuals of the other models instead of the original binding data. The relative affinities between the different binding-mode models is used to determine the existence and magnitude of alternative binding.

Reviewer #3:

Overall, this computational approach appears to display improved performance over other existing methods for building affinity models, and consequently will be of value to the broader scientific community. However, the current manuscript does not describe for the end-user the type of models that are generated for use in other applications – in particular how weights for the dinucleotide dependencies in the FSAM can be represented and readily displayed for interpretation. One subset of dinucleotide dependencies is displayed in Figure 3A, where surprisingly (if this reviewer is interpreting the data correctly) the strongest dinucleotide contributions to the model are substantial penalties against the consensus sequence. To this reviewer, this result is counter-intuitive, and deserves comment, as it does not mesh with a simplistic view of dinucleotide effects in protein-DNA recognition (e.g. a single amino acid simultaneously and synergistically contacting neighboring base pairs in the preferred binding site). Moreover, since the authors are claiming the demonstration of the existence of dinucleotide dependencies based on their analysis, which has been a contentious point in the field, further explanation is warranted.

We do not think there is anything counter-intuitive or unusual going on here, but it may require some explanation. We have therefore added some more explanation to the Materials and methods.

The dinucleotide features cannot be seen in isolation from the single-nucleotide features. To be concrete, at the two central positions, –1 and +1, we see a very strong preference for C and G, respectively, at the single-nucleotide level. When we add up the free energy coefficients for sequences that have CG at the center, with two highly favorable mononucleotide contributions and a moderately unfavorable dinucleotide contribution, the net outcome is still a very strong preference for CG over all 15 other dinucleotides. In the present paper, we are agnostic about the structural mechanisms that underlie the value of the dinucleotide coefficients, but our results are not inconsistent with one amino acid simultaneously recognizing two base pairs, the example mentioned by the reviewer.

[Editors' note: further revisions were requested prior to acceptance, as described below.]

Essential revisions: 1) The software must be deposited to a public repository (GitHub or similar). Or at the very least, the open source version of FeatureREDUCE used in this manuscript should be made available through the supplement for this journal article. This is not because we do not trust the authors. Rather, many things that can go wrong when authors are the only ones responsible for the availability of their published software. Published software should have an additional guarantee of availability though the journal and/or an outside authority like GitHub (or BitBucket, or similar; we do not wish to proscribe the channel for release).

We agree, and have created a public repository github.com/FeatureREDUCE/FeatureREDUCE/

2) Clarify the methods description of discrimination from reference sequence. Reviewer comment (following the authors' response):

Again, if this reviewer is interpreting the data correctly from Figure 3A, there is a penalty against CG dinucleotides at positions -1 and 1. (Likewise CA at -3,-2 and TG at 2,3) Based on the description in the Methods: "FeatureREDUCE models the binding free energy for sequence S as a sum of coefficients associated with all the "features" that discriminate S from the reference sequence Sref (usually the DNA sequence with the highest affinity). In this study, we considered single-nucleotide features ("A at position 1") and adjacent dinucleotide features ("CG at positions 3 and 4")." The authors describe the utilization of these features to discriminate S from Sref. The strongest (negative) dinucleotide features in Figure 3A are for dinucleotides that are in Sref, as presumably the highest affinity sequence is the reference sequence. Again, this reviewer may be missing some critical understanding of the algorithm, but based on the methods description the result does not fit with expectation. (Why would dinucleotide dependencies be discovered that apply to Sref?) Consequently, a more adequate description is required in the Methods.

We agree that we did not describe how the ddG values for dinucleotide features shown in Figure 3A are normalized. For single-nucleotide features we use a standard normalization where ddG=0 for features in the optimal reference sequence. However, as we explain in detail in the revised Methods section, the normalization we use for dinucleotide features is quite different, hence the confusion on the part of the reviewer. We have also updated Figure 3A to make everything completely consistent with expectation: a newly added non-zero ddG value associated with the PSAM in Figure 3A now makes the sum over all features associated with the reference sequence equal to zero as expected. We are grateful to the reviewer for insisting on this point.

3) “Feature-based modeling of intensities”. This paragraph is hard to make sense of, and uses too many ill-defined terms. Please include what the features are, the coefficient (what are "columns of features"?), and how are defined (as you did for position-specific case).

We agree, and have completely rewritten this part of the Methods section.

4) "Seed detection" paragraph is similarly obscure.

Again we agree. We have expanded and completely rewritten this paragraph.

5) Probe-position effects – what is α? Are the γ coefficients shared among all probes? When/how are they estimated?

We have clarified the definition of α. Indeed, the γ coefficients are shared between all probes. We have updated the Methods section to make this clear, and also improved the description of these coefficients are defined and estimated.

6) The authors use rlm (from MASS package) to estimate the "parameters in Equation 11". Up to now the authors discussed coefficients (are these parameters?). As far as I can tell, these do not appear in a linear form in Equation 11. How do you resolve that?

Yes, a coefficient is a multiplicative parameter. In linear models, the parameters are therefore called coefficients. However, in nonlinear models the word parameter is more commonly used.

We agree that the procedure for solving the old Equation 11 was not properly described, and have expanded the pertinent section in Materials and methods.

https://doi.org/10.7554/eLife.06397.030

Article and author information

Author details

  1. Todd R Riley

    1. Department of Biological Sciences, Columbia University, New York, United States
    2. Department of Systems Biology, Columbia University, New York, United States
    3. Department of Biology, University of Massachusetts Boston, Boston, United States
    Contribution
    TRR, Conception and design, Analysis and interpretation of data, Drafting or revising the article
    Competing interests
    The authors declare that no competing interests exist.
  2. Allan Lazarovici

    1. Department of Biological Sciences, Columbia University, New York, United States
    2. Department of Electrical Engineering, Columbia University, New York, United States
    Contribution
    AL, Conception and design, Analysis and interpretation of data
    Competing interests
    The authors declare that no competing interests exist.
  3. Richard S Mann

    1. Department of Systems Biology, Columbia University, New York, United States
    2. Department of Biochemistry and Molecular Biophysics, Columbia University, New York, United States
    Contribution
    RSM, Analysis and interpretation of data, Drafting or revising the article
    Competing interests
    The authors declare that no competing interests exist.
  4. Harmen J Bussemaker

    1. Department of Biological Sciences, Columbia University, New York, United States
    2. Department of Systems Biology, Columbia University, New York, United States
    Contribution
    HJB, Conception and design, Analysis and interpretation of data, Drafting or revising the article
    For correspondence
    hjb2004@columbia.edu
    Competing interests
    The authors declare that no competing interests exist.

Funding

National Human Genome Research Institute (R01HG003008)

  • Harmen J Bussemaker

John Simon Guggenheim Memorial Foundation

  • Harmen J Bussemaker

National Institute of General Medical Sciences (R01GM058575)

  • Richard S Mann

National Cancer Institute (U54CA121852)

  • Richard S Mann
  • Harmen J Bussemaker

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank the members of the Bussemaker and Mann labs for comments and feedback during the course of these studies. This work was supported by National Institute of Health grants R01HG003008 to HJB, GM058575 to RSM, F32GM087047 to MS, and F32GM099160 to NA, a John Simon Guggenheim Foundation Fellowship (HJB), and pilot funding from Columbia University’s Research Initiatives in Science and Engineering (RISE) program to HJB and RSM.

Reviewing Editor

  1. Nir Friedman, Reviewing Editor, The Hebrew University of Jerusalem, Israel

Publication history

  1. Received: January 8, 2015
  2. Accepted: December 20, 2015
  3. Accepted Manuscript published: December 23, 2015 (version 1)
  4. Version of Record published: February 9, 2016 (version 2)

Copyright

© 2015, Riley et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 877
    Page views
  • 269
    Downloads
  • 3
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Comments

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

Further reading

    1. Neuroscience
    Lin Cong et al.
    Tools and Resources