Introduction

Intrinsically disordered proteins (IDPs) or regions (IDRs) do not have the luxury of a three-dimensional structure to help decipher the relationship between sequence and function. Instead, dynamics has emerged as a crucial link between sequence and function for IDPs (1). NMR spin relaxation is a uniquely powerful technique for characterizing IDP dynamics, capable of yielding residue-specific information (2). Backbone 15N relaxation experiments typically yield three parameters per residue: transverse relaxation rate (R2), longitudinal relaxation rate (R1), and steady-state heteronuclear Overhauser enhancement (NOE). While all the three parameters depend on ps-ns dynamics, R2 is the one most affected by slower dynamics (10s of ns to 1 μs and beyond). An increase in either the timescale or the amplitude of slower dynamics results in higher R2. For IDPs, R2 is also the parameter that exhibits the strongest dependence on sequence (1, 2).

R2 was noted early on as an important indicator of residual structure in the unfolded state of the structured protein lysozyme (3). This property has since been measured for many IDPs to provide insight into various biophysical processes. Just as the residual structure in the unfolded state biases the folding pathway of lysozyme (3), a nascent α-helix in the free state of Sendai virus nucleoprotein C-terminal domain (Sev-NT), as indicated by highly elevated R2 (4), biases the coupled binding and folding pathway in the presence of its target phosphoprotein (5). Local secondary structure preformation also facilitates the binding of yes-associated protein (YAP) with its target transcription factor (6). Likewise a correlation has been found between R2 in the free state and the membrane binding propensity of synaptobrevin-2: residues with elevated R2 have increased propensity for membrane binding (7). R2 in the free state has also been used to uncover factors that promote liquid-liquid phase separation of IDPs. For example, a nascent α-helix (shown by elevated R2) is important for the phase separation of TDP-43 low-complexity domain, as both the deletion of the helical region and a helix-breaking mutation (Ala to Pro) abrogates phase separation (8). Similarly, nascent α-helices in the free state of cytosolic abundant heat-soluble 8 (CAHS-8), upon raising concentration and lowering temperature stabilize to form the core of fibrous gels (9). For hnRNPA1 low-complexity domain (A1-LCD), aromatic residues giving rise to local peaks in R2 also mediate phase separation (10).

Both NMR relaxation data and molecular dynamics (MD) simulations have revealed determinants of R2 for IDPs. It has been noted that the flexible Gly has a tendency to lower R2 whereas secondary structure and contact formation tend to raise R2 (11). This conclusion agrees well with recent MD simulations (1, 1214). These MD studies, using IDP-specific force fields, are able to predict R2 in quantitative agreement with NMR measurements, without ad hoc reweighting as done in earlier studies. According to MD, most contact clusters are formed by local sequences, within blocks of up to a dozen or so residues (1, 12, 13). Tertiary contacts can also form but are relatively rare; as such their accurate capture requires extremely extensive sampling and still poses a challenge for MD simulations. Opposite to Gly, aromatic residues have been noted as mediators of contact clusters (3, 10).

Schwalbe et al. (15) introduced a mathematical model to describe the R2 profile along the sequence for lysozyme in the unfolded state. The R2 value of a given residue was expressed as the sum of contributions from this residues and its neighbors. This model yields a mostly flat profile across the sequence, except for falloff at the termini, resulting an overall bell shape. Klein-Seetharaman et al. (3) then fit peaks above this flat profile as a sum of Gaussians. Cho et al. (16) proposed bulkiness as a qualitative indicator of backbone dynamics. Recently Sekiyama et al. (17) calculated R2 as the geometric mean of “indices of local dynamics”; the latter were parameterized by fitting to the measured R2 for a single IDP. All these models merely describe the R2 profile of a given IDP, and none of them is predictive.

Here we present a method, SeqDYN, for predicting R2 of IDPs. Using a mathematical model introduced by Li et al. (18) to predict propensities for binding nanoparticles and also adapted for predicting propensities for binding membranes (19), we express the R2 value of a residue as the product of contributing factors from all residues. The contributing factor attenuates as the neighboring residue becomes more distant from the central residue. The model, after training on a set of 45 IDPs, has prediction accuracy that is competitive against that of the recent MD simulations using IDP-specific force fields (1, 1214). For lysozyme, the SeqDYN prediction agrees remarkably well with R2 measured in the unfolded state.

Results

The data set of IDPs with R2 rates

We collected R2 data for a total of 54 nonhomologous IDPs or IDRs (Table S1; Fig. S1). According to indicators from NMR properties, including low or negative NOEs, narrow dispersion in backbone amide proton chemical shifts, and small secondary chemical shifts (SCSs), most of the proteins are disordered with at most transient α-helices. A few are partially folded, including Sev-NT with a well-populated (~80%) long helix (residues 478-491) (20), CREB-binding protein fourth intrinsically disordered linker (CBP-ID4) with > 50% propensities for two long helices (residues 2-25 and 101-128) (21), HOX transcription factor DFD (HOX-DFD) with a well-folded domain comprising three helices (22), and Hahellin (apo form) as a molten globule (23). In Fig. 1, we display representative conformations of five IDPs, ranging from fully disordered MAPK kinase 4 (MKK4) (24) and α-synuclein (25) to Measles virus phosphoprotein N-terminal domain (Mev-PNTD) (26) with transient short helices to Sev-NT and CBP-ID4 with stable long helices. The sequences of all the IDPs are listed in Table S2.

Representative conformations of five IDPs. (A-E) MKK4, α-synuclein, Mev-PNTD, Sev-NT, and CBP-ID4. Conformations were initially generated using TraDES (http://trades.blueprint.org) (46), selected to have radius of gyration close to predicted by a scaling function Rg = 2.54N0.522 (Å) (47). Conformations for residues predicted as helical by PsiPred plus filtering were replaced by ideal helix. Finally residues are colored according to a scheme ranging from green for low predicted R2 to red for high predicted R2.

We used 45 of the 54 IDPs to train and validate SeqDYN and reserved the remaining 9 for testing. The sequence lengths of the training set range from 39 to 406 residues, with an average of 125.3 residues. Altogether R2 data are available for 3966 residues. A large majority (35 out of 45) of the 45 IDPs have mean R2 values (, calculated among all the residues in a protein) between 2.5 and 5.5 s−1 (Table S1 and Fig. 2A). This range is much lower than that of structured proteins with similar sequence lengths. The low values and lack of dependence on sequence length (Fig. S2A) suggest that R2 of the IDPs is mostly dictated by local sequence as opposed to tertiary interaction.

Properties of the 45 IDPs in the training set. (A) Histograms of means and standard deviations, calculated for individual proteins. Curves are drawn to guide the eye. Inset: correlation between and . (B) Experimental mean scaled R2 (msR2) and SeqDYN q parameters, for the 20 types of amino acids. Note that Pro residues have low msR2 for the lack of backbone amide proton. Amino acids are in descending order of q.

The most often used temperature for acquiring the R2 data was 298 K, but low temperatures (277 to 280 K) were used in a few cases (Table S1 and Fig. S2B). Of the seven IDPs with , four can be attributed to low temperatures (2730), one is due to a relatively low temperature (283 K) as well as the presence of glycerol (20% v/v) (31), and two can be explained by tertiary structure formation [a folded domain (22) or molten globule (23)]. A simple reason for higher R2 values at lower temperatures is the higher water viscosity, resulting in slowdown in molecular tumbling; a similar effect is achieved by adding glycerol. In some cases, R2 was measured at both low and room temperatures (4, 10). To a good approximation, the effect of lowering temperature is a uniform scaling of R2 across the IDP sequence. For Sev-NT, downscaling of the R2 values at 278 K by a factor of 2.0 brings them into close agreement with those at 298 K (Fig. S2C), with a root-mean-square-deviation (RMSD) of 0.5 s−1 among all the residues. Likewise, for A1-LCD, downscaling by a factor of 2.4 brings the R2 values at 288 K into good match with those at 298 K (Fig. S2D), with an RMSD of 0.4 s−1. Because SeqDYN is concerned with the sequence dependence of R2, a uniform scaling has no effect on model parameter or prediction; therefore mixing the data from different temperatures is justified. The same can be said about the different magnetic fields in acquiring the R2 data (Table S1 and Fig. S2E). Increasing the magnetic field raises R2 values, and the effect is also approximated well by a uniform scaling (4, 8, 29).

One measure on the level of sequence dependence of R2 is the standard deviation, , calculated among the residues of an IDP. Among the training set, the R2 values of 30 IDPs have moderate sequence variations, with ranging from 0.5 to 1.5 s−1 (Table S1); the histogram of calculated for the entire training set peaks around 0.75 s−1 (Fig. 2A). There is a moderate correlation between and (Fig. 2A, inset), reflecting in part the fact that can be raised simply by a uniform upscaling, e.g., as a result of lowering temperature. Still, only two of the five IDPs with high attributable to lower temperature or presence of glycerol are among the seven IDPs with high sequence variations . Therefore the sequence variation of R2 as captured by manifests mostly the intrinsic effect of the IDP sequence, not the influence of external factors such as temperature or magnetic field strength. The mean value among the training set is 1.24 s−1.

One way to eliminate the influence of external factors is to scale the R2 values of each IDP by its we refer to the results as scaled R2, or sR2. We then pooled the sR2 values for all residues in the training set, and separated them according to amino-acid types. The amino acid type-specific mean sR2 values, or msR2, are displayed in Fig. 2B. The seven amino acids with the highest msR2 are, in descending order, Trp, Arg, Tyr, Phe, Ile, His, and Leu. The presence of all the four aromatic amino acids in this “high-end” group immediately suggests π-π stacking as important for raising R2; the presence of Arg further implicates cation-π interactions. In the other extreme, the seven amino acids with the lowest msR2 are, in ascending order, are Gly, Cys, Val, Asp, Ser, Thr, and Asn. Gly is well-known as a flexible residue; it is also interesting that all the four amino acids with short polar sidechains are found in this “low-end” group. Pro has an excessively low msR2 [with data from only two IDPs (32, 33)], but that is due to the absence of an amide proton.

The SeqDYN model and parameters

The null model is to assume a uniform R2 for all the residues in an IDP. The root-mean-square-error (RMSE) of the null model is equal to the standard deviation, , of the measured R2 values. The mean RMSE, , of the null model, equal to 1.24 s−1 for the training set, serves as the upper bound for evaluating the errors of R2 predictors. The next improvement is a one-residue predictor, where first each residue (with index n) assumes its amino acid-specific mean sR2 (msR2) and then a uniform scaling factor ϒ is applied:

This one-residue model does only minutely better than the null model, with a of 1.22 s−1.

In SeqDYN, we account for the influence of neighboring residues. Specifically, each residue i contributes a factor f(i; n) to the R2 value of residue n. Therefore

where N is the total number of residues in the IDP. The contributing factor depends on the sequence distance s = |in| and the amino-acid type of residue i:

There are 21 global parameters. The first 20 are the q values, one for each of the 20 types of amino acids; the last parameter is b, appearing in the Lorentzian form of the sequence-distance dependence. We define the correlation length, Lcorr, as the sequence distance at which the contributing factor is midway between the values at s = 0 and ∞. It is easy to verify that Lcorr = b−1/2. Note that the single-residue model can be seen as a special case of SeqDYN, with Lcorr set to 0 and q set to msR2.

The functional forms of Eqs (2a,b) were adapted from Li et al. (18); we also used them for predicting residue-specific membrane association propensities of IDPs (19). In these previous applications, a linear term was also present in the denominator of Eq (2b). In our initial training of SeqDYN, the coefficient of the linear term always converged to near zero. We thus eliminated the linear term. In addition to the Lorentzian form, we also tested a Gaussian form for the sequence-distance dependence and found somewhat worse performance. The more gradual attenuation of the Lorentzian form with increasing sequence distance evidently provides an overall better model for the R2 data in the entire training set. Others (16, 17, 24) have modeled R2 as the average of some parameters over a window; a window has an extremely abrupt sequence-distance dependence (1 for s < Lcorr and 0 for 4 > Lcorr).

We parametrized the SeqDYN model represented by Eqs (2a,b) on the training set of 45 IDPs. In addition to the 21 global parameters noted above, there are also 45 local parameters, namely one uniform scaling factor (ϒ) per IDP. The parameter values were selected to minimize the sum of the mean-square-errors for the IDPs in the training set, calculated on R2 data for a total of 3924 residues. We excluded the 42 Pro residues in the training set because, as already noted, their R2 values are lower for chemical reasons. We will present validation and test results below, but first let us look at the parameter values.

The q values are displayed in Fig. 2B alongside msR2. The seven amino acids with the highest q values, in descending order, are Trp, Ile, Tyr, Arg, His, Phe, and Leu. These are exactly the same amino acids in the high-end group for msR2, though their order there is somewhat different. The seven amino acids (excluding Pro) with the lowest q values, in ascending order, are Gly, Asn, Ser, Asp, Val, Thr, and Cys. The composition of the low-end group is also identical to that for msR2. The q values thus also suggest that π-π and cation-π interactions in local sequences may raise R2, whereas Gly and short-polar residues may lower R2.

Given the common amino acids at both the high and low ends for msR2 and q, it is not surprising that these two properties exhibit a strong correlation, with a coefficient of determination (R2; excluding Pro) at 0.92 (Fig. 3A). Also, because the high-end group contains the largest amino acids (e.g., Trp and Tyr) whereas the low-end group contains the smallest amino acids (e.g., Gly and Ser), we anticipated some correlation of msR2 and q with amino-acid size. We measure the latter property by the molecular mass (m). As shown in Fig. 3B, both msR2 and q indeed show medium correlation with m, with R2 = 0.67 (excluding Pro) and 0.61, respectively. A bulkiness parameter was proposed as an indicator of sequence-dependent backbone dynamics of IDPs (16, 24). Bulkiness was defined as the sidechain volume to length ratio, and identifies amino acids with aromatic or branched aliphatic sidechains as bulky (34). We found only modest correlations between either msR2 or q and bulkiness, with R2 just below 0.4 (Fig. 3C).

SeqDYN model parameters. (A) Correlation between msR2 and q. The values are also displayed as bars in Fig. 2B. (B) Correlation of msR2 and q with amino-acid molecular mass. (C) Correlation of msR2 and q with bulkiness. (D) The optimal correlation length and deterioration of SeqDYN prediction as the correlation length is moved away from the optimal value.

The optimized value of b is 3.164 × 10−2, corresponding to an Lcorr of 5.6 residues. The resulting optimized is 0.95 s−1, a clear improvement over the value 1.24 s−1 of the null model. To check the sensitivity of prediction accuracy to b, we set b to values corresponding to Lcorr = 0, 1, 2, .., and retrained SeqDYN for b fixed at each value. Note that the null-model , 1.24 s−1, sets an upper bound. This upper bound is gradually reached when Lcorr is increased from the optimal value. In the opposite direction, when Lcorr is decreased from the optimal value, rises quickly, reaching 1.22 s−1 at Lcorr = 0. The latter is the same as that of the single-residue model. Lastly we note that there is strong correlation between the uniform scaling factors and values among the 45 IDPs (R2 = 0.77), as to be expected. For 39 of the 45 IDPs, ϒ values fall in the range of 0.8 to 2.0 s−1.

As presented next, we evaluate the performance of SeqDYN by leave-one-out cross validation, where each IDP in turn was left out of the training set and the model was trained on the remaining 44 IDPs to predict R2 for the IDP that was left out. The parameters from the leave-one-out (also known as jackknife) training sessions allow us to assess potential bias of the training set. For this purpose, we compare the values of the 21 global parameters, either from the full training set or from taking the averages of the jackknife training sessions. For each of the q parameters, the values from these two methods differ only in the fourth digit; e.g., for Leu, they are both 1.1447 from full training and from jackknife training. The values for b are 3.164 × 10−2 from full training as stated above and 3.163 × 10−2 from jackknife training. The close agreement in parameter values between full training and jackknife training suggests no significant bias in the training set.

Another question of interest is whether the difference between the q parameters of two amino acids is statistically significant. To answer this question, we carried out five-fold cross-validation training, resulting in five independent estimates for each parameter. For example, the mean ± standard deviation of the q parameter is 1.1405 ± 0.0066 for Leu and 1.2174 ± 0.0211 for Ile. A t-test shows that their difference is statistically significant (P < 0.0001). In contrast, the difference between Leu and Phe (q = 1.1552 ± 0.0304) is not significant.

Validation of SeqDYN predictions

We now present leave-one-out cross-validation results. We denote the RMSE of the R2 prediction for the left-out IDP as RMSE(−1). As expected, RMSE(−1) is higher than the RMSE obtained with the IDP kept in the training set, but the increases are generally slight. Specifically, all but eight of the IDPs have increases < 0.1 s−1; the largest increase is 0.35 s−1, for CBP-ID4. The mean RMSE(−1), or , for the 45 IDPs is increased by 0.05 s−1 over , to 1.00 s−1. The latter value is still a distinct improvement over the mean RMSE 1.24 s−1 of the null model. The histogram of RMSE(−1) for the 45 IDPs is shown in Fig. 4A. It peaks at 0.5 s−1, which is a substantial downshift from the corresponding peak at 0.75 s−1 for (Fig. 2A). Thirty-four of the 45 IDPs have RMSE(−1) values lower than the corresponding .

Quality of SeqDYN predictions. (A) Histogram of RMSE(−1). Letters indicate RMSE(−1) values of the IDPs to be presented in panels (B-F). (B-F) Measured (bars) and predicted (curves) R2 profiles for MKK4, α-synuclein, Mev-PNTD, Sev-NT, and CBP-ID4. In (E) and (F), green curves are SeqDYN predictions and red curves are obtained after a helix boost.

To further illustrate the performance of SeqDYN, we present comparison of predicted and measured R2 values for five IDPs: MKK4, α-synuclein, Mev-PNTD, Sev-NT, and CBP-ID4 (Fig. 4B-F). A simple common feature is the falloff of R2 at the N- and C-termini, resulting from missing upstream or downstream residues that otherwise would be coupled to the terminal residues, as first recognized by Schwalbe et al. (15). Representative conformations of the five IDPs are displayed in Fig. 1, with residues colored according to the predicted R2 values. For four of these IDPs, the RMSE(−1) values range from 0.44 to 0.76 s−1 and are scattered around the peak of the histogram, while the RMSE(−1) for the fifth IDP, namely CBP-ID4, the RMSE(−1) value is 2.01 s−1 and falls on the tail of the histogram (Fig. 4A). Figure 4B displays the measured and predicted R2 for MKK4. SeqDYN correctly predicts higher R2 values in the second half of the sequence than in the first half. It even correctly predicts the peak around residue Arg75. The sequence in this region is H72IERLRTH79; six of these eight residues belong to the high-end group. In contrast, the lowest R2 values occur in the sequence S7GGGGSGGGSGSG19, comprising entirely of two amino acids in the low-end group.

R2 values for α-synuclein are shown in Fig. 4C. Here SeqDYN correctly predicts higher R2 near the C-terminus and a dip around Gly68. However, it misses the R2 peaks around Tyr39 and Asp121. MD simulations (1) have found that these R2 peaks can be explained by a combination of secondary structure formation (β-sheet around Tyr39 and polyproline II helix around Asp121) and local (between Tyr39 and Ser42) or long-range (between Asp121 and Lys96) interactions. SeqDYN cannot account for long-range interactions (e.g., between β-strands and between Asp121 and Lys96). Figure 4D shows that SeqDYN gives excellent R2 predictions for Mev-PNTD. It correctly predicts the high peaks around Arg17, Glu31, Leu193, and lower peaks around Arg235 and Trp285, but does underpredict the narrow peak around Tyr113.

The overall R2 profile of Sev-NT is predicted well by SeqDYN, but the peak in the long helical region (residues 478-491) is severely underestimated (green curve in Fig. 4E). A similar situation occurs for CBP-ID4, where the peak in the second long helical region (around Glu113) is underpredicted (green curve in Fig. 4F). While the measured R2 exhibits a higher peak in the second helical region than in the first helical region (around Arg16), the opposite is predicted by SeqDYN. When the R2 data were included in the training set (i.e., full training), the second peak is higher than the first one, but that is not a real prediction because the R2 data themselves were used for training the model. It merely means that the SeqDYN functions can be parameterized to produce any prescribed R2 profile along the sequence. Indeed, when the R2 data of CBP-ID4 alone were used to parameterize SeqDYN, the measured R2 profile is closely reproduced (Fig. S3). The reversal in R2 peak heights between the two helical regions is the reason for the aforementioned unusual increase in RMSE when CBP-ID4 was left out of the training set.

R2 boost in long helical regions

It is apparent that SeqDYN underestimates the R2 of stable long helices. Transient short helices does not seem to be a problem, since these are present, e.g., in Mev-PNTD, where transient helix formation in the first 37 residues and between residues 189-198 (26) coincides with R2 peaks that are correctly predicted by SeqDYN. SeqDYN can treat coupling between residues within the correlation length of 5.6 residues, but a much longer helix would tumble more slowly than implied by an Lcorr of 5.6, and thus it makes sense that SeqDYN would underestimate R2 in that case.

Our solution then is to apply a boost factor to the long helical region. To do so, we have to know whether an IDP does form long helices and if so what the constituent residues are. Secondary structure predictors tend to overpredict α-helices and β-strands for IDPs, as they are trained on structured proteins. One way to counter that tendency is to make the criteria for α-helices and β-strands stricter. We found that, by filtering PsiPred (http://bioinf.cs.ucl.ac.uk/psipred) (35) helix propensity scores (pHlx) with a very high cutoff of 0.99, the surviving helix predictions usually correspond well with residues identified by NMR as having high helix propensities. For example, for Mev-PNTD, PsiPred plus filtering predicts residues 14-17, 28-33, and 191-193 as helical; all of them are in regions that form transient helices according to chemical shifts (26). Likewise long helices are also correctly predicted for Sev-NT (residues 477-489) and CBP-ID4 (residues 6-17 and 105-116).

We apply a boost factor, BHlx, to helices with a threshold length of 12:

The Θ function is 1 if the helix propensity score is above the filtering cutoff and the helix length (LHlx) is above the threshold, and 0 otherwise. With a boost amplitude α at 0.5, the boosted SeqDYN prediction for Sev-NT reaches excellent agreement with the measured R2 (Fig. 4E, red curve). The RMSE(−1) is reduced from 0.76 s−1 to 0.38 s−1 upon boosting. Applying the same helix boost to CBP-ID4 also results in a modest reduction in RMSE(−1), from 2.01 to 1.90 s−1 (Fig. 4F, red curve). The only other IDP for which PsiPred plus filtering predicts a long helix is the N-terminal region of lysyl-tRNA synthetase (KRS-NT). The authors who studied this protein did not report on secondary structure (36), but feeding their reported chemical shifts to the TALOS+ server (https://spin.niddk.nih.gov/bax/nmrserver/talos/) (37) found only short stretches of residues that fall into the helical region of the Ramachandran map. The SeqDYN prediction for KRS-NT is already good [RMSE(−1) = 0.83 s−1]; applying a helix boost would deteriorate the RMSE(−1) to 1.16 s−1.

Further test on a set of nine IDPs

We have reserved nine IDPs for testing SeqDYN (parameterized on the training set of 45 IDPs). The level of disorder in these test proteins also spans the full range, from absence of secondary structures [ChiZ N-terminal region (12), Pdx1 C-terminal region (11), and TIA-1 prion-like domain (17)] to presence of transient short helices [synaptobrevin-2 (7), α-endosulfine (38), YAP (6), angiomotin-like 1 (AMOTL1) (39)] to formation of stable long helices [FtsQ (13) and CAHS-8 (9)]. For eight of the nine test IDPs, the RMSEs of SeqDYN predictions are lower than the experimental values, by an average of 0.62 s−1. For the nineth IDP (Pdx1), the SeqDYN RMSE is slightly higher, by 0.06 s−1, than the experimental .

The comparison of predicted and measured R2 profiles along the sequence is presented in Fig. 5A-I. For ChiZ, SeqDYN correctly predicts the major peak around Arg25 and the minor peak around Arg46 (Fig. 5A). The R2 profile of Pdx1 is largely featureless, except for a dip around Gly216, which is correctly predicted by SeqDYN (Fig. 5B). Correct prediction is also obtained for the higher R2 in the first half of TIA-1 prion-like domain than in the second half (Fig. 5C). SeqDYN gives an excellent prediction for synaptobrevin-2, including a linear increase up to Arg56 and the major peak around Trp89 (Fig. 5D).

Measured (bars) and predicted (curves) R2 profiles for ChiZ N-terminal region, TIA1 prion-like domain, Pdx1 C-terminal region, synaptobrevin-2, α-endosulfine, YAP, AMOTL1, FtsQ, and CAHS-8. In (C), R2 does not fall off at the N-terminus because the sequence is preceded by an expression tag MGSSHHHHHHHHHHHHS. In (H) and (I), green curves are SeqDYN predictions and red curves are obtained after a helix boost.

The prediction is also very good for α-endosulfine, including elevated R2 around Glu34, which coincides with the presence of a transient helix, and depressed R2 in the last 40 residues (Fig. 5E). The only miss is an underprediction for the peak around Lys74. SeqDYN also predicts well the overall shape of the R2 profile for YAP, including peaks around Asn70, Leu91, Arg124, and Arg161, but severely underestimates the peak height around Asn70 (Fig. 5F).NOE signals indicate contacts between Met86, Leu91, Fhe95, and Fhe96 (6); evidently this type of local contacts is captured well by SeqDYN. The R2 elevation around Asn70 is mostly due to helix formation: residues 61-74 have helix propensities up to 40% (6). PsiPred predicts helix for residues 62-73, but only residues 65-68 survive the filtering that we impose, resulting a helix that is too short to apply a helix boost. The prediction for AMOTL1 is mostly satisfactory, including peaks around Phe200 and Arg264 and a significant dip around Gly292 (Fig. 5G). However, whereas the two peaks have approximately equal heights in the measured R2 profile, the predicted peak height around Phe200 is too low. SCSs indicate helix propensity around both R2 peaks (39). PsiPred also predicts helix in both regions, but only five and two residues, respectively, survive after filtering, which are too short for applying a helix boost.

For FtsQ, SeqDYN correctly predicts elevated R2 for the long helix [residues 46-74 (13)] but underestimates the magnitude (RMSE = 2.32 s−1; green curve in Fig. 5H). PsiPred plus filtering predicts a long helix formed by residues 47-73. Applying the helix boost substantially improves the agreement with the measured R2, with RMSE reducing to 1.71 s−1 (red curve in Fig. 5H). SeqDYN also gives a qualitatively correct R2 profile for CAHS-8, with higher R2 for the middle section (residues 95-190) (RMSE = 2.36 s−1; green curve in Fig. 5I). However, it misses the extra elevation in R2 for the first half of the middle section (residues 95-145). According to SCS, the first and second halves have helix propensities of 60% and 30%, respectively (9). PsiPred plus filtering predicts helices for residues 96-121, 124-141, 169-173, and 179-189. Only the first two helices, both in the first half of the middle section, are considered long according to our threshold. Once again, applying the helix boost leads to marked improvement in the predicted in R2, with RMSE reducing to 1.92 s−1 (red curve in Fig. 5I).

Inputting the sequence of a structured protein predicts R2 in the unfolded state

SeqDYN is trained on IDPs, what if we feed it with the sequence of a structured protein? The prediction using the sequence of hen egg white lysozyme, a well-studied single-domain protein, is displayed in Fig. 6A. It shows remarkable agreement with the R2 profile measured by Klein-Seetharaman et al. (3) in the unfolded state (denatured by 8 M urea and reduced to break disulfide bridges), including a major peak around Trp62, a second peak around Trp111, and a third peak around Trp123. Klein-Seetharaman et al. mutated Trp62 to Gly and the major peak all but disappeared. This result is also precisely predicted by SeqDYN with the mutant sequence (Fig. 6B).

Predicted R2 profiles (curves) for the lysozyme sequence and a mutant show close agreement with those measured (bars) on the proteins in the unfolded state. (A) Wild type. (B) With Trp62 to Gly mutation.

Discussion

We have developed a powerful method, SeqDYN, that predicts the backbone amide transverse relaxation rates (R2) of IDPs. The method is based on IDP sequences, is extremely fast, and available as a web server at https://zhougroup-uic.github.io/SeqDYNidp/. The excellent performance supports the notion that the ns-dynamics reported by R2 is coded by the local sequence, comprising up to 6 residues on either side of a given residue. The amino-acid types that contribute the most to coupling within a local sequence are aromatic (Trp, Tyr, Phe, and His), Arg, and long branched aliphatic (Ile and Leu), suggesting the importance of π-π, cation-π, and hydrophobic interactions in raising R2. These interactions are interrupted by Gly and amino acids with short polar sidechains (Ser, Thr, Asn, and Asp), leading to reduced R2. Transient short helices produce moderate elevation in R2, whereas stable long helices result in a big boost in R2. Tertiary contacts can also raise R2, but appears to be infrequent in most IDPs (1).

Our method incorporates ideas from a number of previous efforts at describing R2. The first serious effort was by Schwalbe et al. (15), who accounted for contributions from neighboring residues as additive terms, instead of multiplicative factors as in SeqDYN. Cho et al. (16) and Delaforge et al. (24) used the running average of the bulkiness parameter over a window of five to nine residues as a qualitative indicator of R2. Here again the calculation was based on an additive model. Sekiyama et al. (17) employed a multiplicative model, with R2 calculated as a geometric mean of “indices of local dynamics” over a five-residue window. These indices, akin to our q parameters, were trained on a single IDR (TIA-1 prion-like domain) and used to reproduce the measured R2 for the same IDR. As we have illustrated on CBP-ID4 (Fig. S3), training on a single protein merely biases the parameters to that model and has little value in predicting R2 for other proteins. In comparison, SeqDYN is trained on 45 IDPs and its predictions are robust and achieve quantitative agreement with measured R2.

Ten of the IDPs tested here have been studied recently by MD simulations using IDP-specific force fields (1, 1214). In Table 1 we compare the RMSEs of SeqDYN predictions with those for R2 calculations from MD simulations. For five of these IDPs: A1-LCD, Aβ40, α-synuclein, tau K18, and FtsQ, RMSEs of SeqDYN and MD are remarkably similar. Four of these IDPs lack significant population of α-helices or β-sheets, but FtsQ forms a stable long helix. For one other IDP, namely HOX-DFD, MD, by explicitly modeling its folded domain, does a much better job in predicting R2 than SeqDYN (RMSEs of 1.40 s−1 vs 1.99 s−1). However, for the four remaining IDPs: p53TAD, Pup, Sev-NT, and ChiZ, SeqDYN significantly outperforms MD, with RMSEs averaging only 0.47 s−1, compared to the MD counterpart of 1.14 s−1. Overall, SeqDYN is very competitive against MD in predicting R2, but without the significant computational cost. While MD simulations can reveal details of local interactions, as noted for α-synuclein, and capture tertiary interactions if they occur, they still suffer from perennial problems of force-field imperfection and inadequate sampling. SeqDYN provides an accurate description of IDP dynamics at a “mean-field” level, but could miss idiosyncratic behaviors of specific local sequences.

RMSEs (s−1) of R2 predictions by SeqDYN and MD for 10 IDPs

Deep-learning models have become very powerful, but they usually have millions of parameters and require millions of protein sequences for training (40). In contrast, SeqDYN employs a mathematical model with dozens of parameters and requires only dozens of proteins for training. Reduced models (by collapsing amino acids into a small number of distinct types) have even been trained on < 10 IDPs to predict propensities for binding nanoparticles (18) or membranes (19). The mathematical model-based approach may be useful in other applications where data, similar to R2, are limited, including predictions of IDP secondary chemical shifts or residues that bind drug molecules (41) or protein targets, or even in protein design, e.g., for recognizing an antigenic site or a specific DNA site.

Methods

Collection of IDPs with measured R2

Starting from six nonhomologous IDPs in our previous MD study (1), we obtained R2 data for eight IDPs from the Bimolecular Magnetic Resonance Data Bank (BMRB; https://bmrb.io); data for two other IDPs were from our collaborators (12, 13). Most of the 54 IDPs studied here were from searching the literature. Disorder was judged by dispersion in backbone amide proton chemical shifts, NOE, and SCS. R2 data that were not available from the authors or BMRB were obtained by digitizing R2 plots presented in figures of published papers, using WebPlotDigitizer (https://automeris.io/WebPlotDigitizer) (42) and further inspected visually.

Homology of IDPs was checked by sequence alignment using Clustal W (http://www.clustal.org/clustal2) (43), and presented as a clock-like tree using the “ape” package (http://ape-package.ird.fr) (44). IDPs that had discernible homology with the selected training set were removed. Removed IDPs included HOX-SCR and β-synuclein from our previous MD study (1), due to homology with HOX-DFD and α-synuclein, respectively.

Coding for SeqDYN

The training of SeqDYN was coded in python, similar to our previous work for predicting residue-specific membrane association propensities (ReSMAP; https://zhougroup-uic.github.io/ReSMAPidp/) (19). The cost function was the sum of mean-squared-errors for the IDPs in the training set. We used the least_squares function in scipy.optimize, with Trust Region Reflective as the minimization algorithm and all parameters restricted to the positive range. For the web server (https://zhougroup-uic.github.io/SeqDYNidp/), we rewrote the prediction code javascript.

Acknowledgements

This work was supported by Grant GM118091 from the National Institutes of Health.