Predicting the sequence-dependent backbone dynamics of intrinsically disordered proteins

  1. Sanbo Qin
  2. Huan-Xiang Zhou (corresponding author)
  1. Department of Chemistry, University of Illinois Chicago, United States
  2. Department of Physics, University of Illinois Chicago, United States

eLife assessment

In this useful study, a solid machine learning approach based on a broad set of systems to predict the R2 relaxation rates of residues in intrinsically disordered proteins (IDPs) is described. The ability to predict the patterns of R2 will be helpful to guide experimental studies of IDPs. A potential weakness is that the predicted R2 values may include both fast and slow motions, thus the predictions provide only limited new physical insights into the nature of the underlying protein dynamics, such as the most relevant timescale.

https://doi.org/10.7554/eLife.88958.3.sa0

Abstract

How the sequences of intrinsically disordered proteins (IDPs) code for functions is still an enigma. Dynamics, in particular residue-specific dynamics, holds crucial clues. Enormous efforts have been spent to characterize residue-specific dynamics of IDPs, mainly through NMR spin relaxation experiments. Here, we present a sequence-based method, SeqDYN, for predicting residue-specific backbone dynamics of IDPs. SeqDYN employs a mathematical model with 21 parameters: one is a correlation length and 20 are the contributions of the amino acids to slow dynamics. Training on a set of 45 IDPs reveals aromatic, Arg, and long-branched aliphatic amino acids as the most active in slow dynamics whereas Gly and short polar amino acids as the least active. SeqDYN predictions not only provide an accurate and insightful characterization of sequence-dependent IDP dynamics but may also serve as indicators in a host of biophysical processes, including the propensities of IDP sequences to undergo phase separation.

Introduction

Intrinsically disordered proteins (IDPs) or regions (IDRs) do not have the luxury of a three-dimensional structure to help decipher the relationship between sequence and function. Instead, dynamics has emerged as a crucial link between sequence and function for IDPs (Dey et al., 2022). Nuclear magnetic resonance (NMR) spin relaxation is a uniquely powerful technique for characterizing IDP dynamics, capable of yielding residue-specific information (Camacho-Zarco et al., 2022). Backbone 15N relaxation experiments typically yield three parameters per residue: transverse relaxation rate (R2), longitudinal relaxation rate (R1), and steady-state heteronuclear Overhauser enhancement (NOE). While all three parameters depend on ps-ns dynamics, R2 is the one most affected by slower dynamics (tens of ns to 1 μs). An increase in either the timescale or the amplitude of slower dynamics results in higher R2. For IDPs, R2 is also the parameter that exhibits the strongest dependence on sequence (Dey et al., 2022; Camacho-Zarco et al., 2022).

R2 was noted early on as an important indicator of residual structure in the unfolded state of the structured protein lysozyme (Klein-Seetharaman et al., 2002). This property has since been measured for many IDPs to provide insight into various biophysical processes. Just as the residual structure in the unfolded state biases the folding pathway of lysozyme (Klein-Seetharaman et al., 2002), a nascent α-helix in the free state of Sendai virus nucleoprotein C-terminal domain (Sev-NT), as indicated by highly elevated R2 (Abyzov et al., 2016), biases the coupled binding and folding pathway in the presence of its target phosphoprotein (Schneider et al., 2015). Local secondary structure preformation also facilitates the binding of yes-associated protein (YAP) with its target transcription factor (Feichtinger et al., 2022). Likewise, a correlation has been found between R2 in the free state and the membrane binding propensity of synaptobrevin-2: residues with elevated R2 have increased propensity for membrane binding (Lakomek et al., 2019). R2 in the free state has also been used to uncover factors that promote liquid-liquid phase separation of IDPs. For example, a nascent α-helix (shown by elevated R2) is important for the phase separation of the TDP-43 low-complexity domain, as both the deletion of the helical region and a helix-breaking mutation (Ala to Pro) abrogate phase separation (Conicella et al., 2016). Similarly, nascent α-helices in the free state of cytosolic abundant heat-soluble 8 (CAHS-8), upon raising concentration and lowering temperature, stabilize to form the core of fibrous gels (Malki et al., 2022). For the hnRNPA1 low-complexity domain (A1-LCD), aromatic residues giving rise to local peaks in R2 also mediate phase separation (Martin et al., 2020).

Both NMR relaxation data and molecular dynamics (MD) simulations have revealed determinants of R2 for IDPs. It has been noted that the flexible Gly tends to lower R2, whereas secondary structure and contact formation tend to raise R2 (Cook et al., 2019). This conclusion agrees well with recent MD simulations (Dey et al., 2022; Hicks et al., 2020; Yu and Brüschweiler, 2022; Smrt et al., 2023). These MD studies, using IDP-specific force fields, are able to predict R2 in quantitative agreement with NMR measurements, without ad hoc reweighting as done in earlier studies. According to MD, most contact clusters are formed by local sequences, within blocks of up to a dozen or so residues (Dey et al., 2022; Hicks et al., 2020; Smrt et al., 2023). Tertiary contacts can also form but are relatively rare; as such their accurate capture requires extremely extensive sampling and still poses a challenge for MD simulations. Contrary to Gly, aromatic residues have been noted as mediators of contact clusters (Klein-Seetharaman et al., 2002; Martin et al., 2020).

Schwalbe et al., 1997 introduced a mathematical model to describe the R2 profile along the sequence for lysozyme in the unfolded state. The R2 value of a given residue was expressed as the sum of contributions from this residue and its neighbors. This model yields a mostly flat profile across the sequence, except for a falloff at the termini, resulting in an overall bell shape. Klein-Seetharaman et al., 2002 then fit peaks above this flat profile as a sum of Gaussians. Cho et al., 2007 proposed bulkiness as a qualitative indicator of backbone dynamics. Recently Sekiyama et al., 2022 calculated R2 as the geometric mean of ‘indices of local dynamics’; the latter were parameterized by fitting to the measured R2 for a single IDP. All these models merely describe the R2 profile of a given IDP, and none of them is predictive.

Here, we present a method, SeqDYN, for predicting R2 of IDPs. Using a mathematical model introduced by Li et al., 2020 to predict propensities for binding nanoparticles and also adapted for predicting propensities for binding membranes (Qin et al., 2022), we express the R2 value of a residue as the product of contributing factors from all residues. The contributing factor attenuates as the neighboring residue becomes more distant from the central residue. The model, after training on a set of 45 IDPs, has prediction accuracy that is competitive against that of the recent MD simulations using IDP-specific force fields (Dey et al., 2022; Hicks et al., 2020; Yu and Brüschweiler, 2022; Smrt et al., 2023). For lysozyme and other structured proteins, the SeqDYN prediction agrees remarkably well with R2 measured in their unfolded state.

Results

The data set of IDPs with R2 rates

We collected R2 data for a total of 54 nonhomologous IDPs or IDRs (Table 1; Figure 1). According to indicators from NMR properties, including low or negative NOEs, narrow dispersion in backbone amide proton chemical shifts, and small secondary chemical shifts (SCSs), most of the proteins are disordered with at most transient α-helices. A few are partially folded, including Sev-NT with a well-populated (~80%) long helix (residues 478–491; Jensen et al., 2008), CREB-binding protein fourth intrinsically disordered linker (CBP-ID4) with >50% propensities for two long helices (residues 2–25 and 101–128; Piai et al., 2016), HOX transcription factor DFD (HOX-DFD) with a well-folded domain comprising three helices (Maiti et al., 2019), and Hahellin (apo form) as a molten globule (Patel et al., 2014). In Figure 2, we display representative conformations of five IDPs, ranging from fully disordered MAPK kinase 4 (MKK4; Delaforge et al., 2018) and α-synuclein (Sung and Eliezer, 2007) to Measles virus phosphoprotein N-terminal domain (Mev-PNTD; Milles et al., 2018) with transient short helices to Sev-NT and CBP-ID4 with stable long helices. The sequences of all the IDPs are listed in Appendix 1.

Clock-like tree plot showing lack of homology among the 45 IDPs.

The level of homology between two sequences is measured by the distance from their convergence point to the center of the clock. The highest level of apparent identity is between A1-LCD and TDP-43, at 25%, but these two proteins differ in both secondary structure formation and R2 characteristics. There is, however, a 20-residue overlap between the N-terminus of MBP-xα2 and the C-terminus of rmBG21.

Representative conformations of five IDPs.

(A–E) MKK4, α-synuclein, Mev-PNTD, Sev-NT, and CBP-ID4. Conformations were initially generated using TraDES (http://trades.blueprint.org; Feldman and Hogue, 2002), selected to have a radius of gyration close to that predicted by the scaling function $R_g = 2.54\,N^{0.522}$ (Å) (Bernadó and Blackledge, 2009). Conformations for residues predicted as helical by PsiPred plus filtering were replaced by an ideal helix. Finally, residues are colored according to a scheme ranging from green for low predicted R2 to red for high predicted R2.
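As a small illustration of the scaling function quoted in this caption, the sketch below evaluates the predicted radius of gyration for two chain lengths taken from Table 1. The function name `predicted_rg` is hypothetical and is not part of TraDES or any other tool mentioned here.

```python
def predicted_rg(n_residues: int) -> float:
    """Radius of gyration (in angstroms) from the scaling law Rg = 2.54 * N**0.522."""
    return 2.54 * n_residues ** 0.522


# Chain lengths from Table 1: alpha-synuclein has 140 residues, Sev-NT has 124.
for name, n in [("alpha-synuclein", 140), ("Sev-NT", 124)]:
    print(f"{name}: N = {n}, predicted Rg = {predicted_rg(n):.1f} A")
```

A conformer-selection step would then keep only generated conformations whose computed Rg lies near this target; how close is "close" is not specified in the caption.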

Table 1
Experimental conditions, mean and standard deviation of measured R2, and SeqDYN prediction RMSE.

Protein name | # of res | Temp (K) | B0 (MHz) | R¯2 (s–1) | σR2 (s–1) | RMSE (s–1) | PMID; ref
Training set (45 IDPs) *
A1-LCD | 131 | 298 | 800 | 2.68 | 0.46 | 0.60 | 32029630; Martin et al., 2020
Aβ40 | 40 | 278 | 600 | 3.40 | 0.92 | 0.38 | 31181936; Rezaei-Ghaleh et al., 2019
Ash1 | 83 | 278 | 800 | 9.80 | 1.40 | 1.41 | 27807972; Martin et al., 2016
Beclin1 | 165 | 288 | 800 | 5.37 | 1.03 | 1.14 | 27288992; Yao et al., 2016
CAPRIN1 | 103 | 303 | 600 | 5.34 | 0.88 | 0.72 | 31898464; Wong et al., 2020
CBP-ID4 | 207 | 283 | 700 | 5.45 | 2.55 | 2.01;1.90 | 29790640; Murrali et al., 2018
GbnD4-DHD | 91 | 280 | 700 | 6.81 | 1.55 | 1.28 | 29309054; Jenner et al., 2018
ERD14 | 185 | 288 | 600 | 3.96 | 0.87 | 0.54 | 21336827; Szalainé Ágoston et al., 2011
ExsE | 88 | 298 | 600 | 3.18 | 0.88 | 0.76 | 22138394; Zheng et al., 2012
FCP1 | 85 | 298 | 500 | 2.94 | 0.54 | 0.43 | 26286791; Lawrence and Showalter, 2012
FUS | 163 | 298 | 850 | 3.48 | 0.51 | 0.54 | 26455390; Burke et al., 2015
GAb1 | 82 | 298 | 500 | 3.99 | 0.88 | 0.89 | 34929201; Gruber et al., 2022
hACTR | 69 | 304 | 600 | 3.26 | 0.47 | 0.49 | 18177052; Ebert et al., 2008
Hahellin | 92 | 298 | 800 | 9.94 | 2.69 | 2.85 | 24671380; Patel et al., 2014
hCSD1 | 141 | 298 | 500 | 3.56 | 0.93 | 0.99 | 18537264; Kiss et al., 2008
HOX-DFD | 90 | 298 | 600 | 6.98 | 3.15 | 1.99 | 30802457; Maiti et al., 2019
hZIP4-ICL2 | 100 | 283 | 800 | 9.54 | 2.37 | 1.58 | 30793391; Bafaro et al., 2019
Jaburetox | 94 | 298 | 800 | 6.01 | 2.30 | 2.27 | 25605001; Lopes et al., 2015
KRS-NT | 72 | 303 | 600 | 3.26 | 0.93 | 0.83 | 24983501; Cho et al., 2014
MBP-xα2 | 70 | 295 | 600 | 3.83 | 0.60 | 0.54 | 25343306; De Avila et al., 2014
MKK4 | 86 | 278 | 850 | 4.49 | 1.42 | 0.63 | 29276882; Delaforge et al., 2018
N-Cby | 63 | 298 |  | 4.19 | 1.20 | 1.25 | 21182262; Mokhtarzada et al., 2011
Niv-PNTD | 406 | 288 | 700 | 5.41 | 1.82 | 1.66 | 33177626; Schiavina et al., 2020
NS5A-D2D3 | 268 | 278 | 800 | 8.62 | 3.85 | 2.14 | 26445449; Sólyom et al., 2015
NUPR1 | 93 | 298 | 600 | 2.98 | 0.82 | 0.76 | 31325636; Neira et al., 2019
OPN | 220 | 310 | 800 | 2.59 | 0.82 | 0.54 | 31794728; Mateos et al., 2020
p53TAD | 73 | 298 | 850 | 2.72 | 0.66 | 0.33 | 30240067; Xie et al., 2018
PDEγ | 87 | 298 |  | 3.96 | 1.05 | 0.71 | 18230733; Song et al., 2008
PKIα | 75 | 300 | 900 | 3.41 | 0.87 | 0.52 | 32338601; Olivieri et al., 2020
Mev-PNTD | 304 | 298 | 950 | 2.92 | 0.59 | 0.48 | 30140745; Milles et al., 2018
ProTα | 113 | 283 | 800 | 3.40 | 0.56 | 0.43 | 29466338; Borgia et al., 2018
Pup | 64 | 298 | 850 | 2.66 | 0.51 | 0.43 | 30240067; Xie et al., 2018
rmBG21 | 199 | 300 | 600 | 4.06 | 0.90 | 0.63 | 17676872; Ahmed et al., 2007
RPB1 | 201 | 277 | 850 | 6.48 | 1.74 | 1.33 | 28945358; Janke et al., 2018
securin | 202 | 283 | 500 | 5.49 | 1.13 | 1.08 | 19053469; Csizmok et al., 2008
Sev-NT | 124 | 298 | 600 | 3.20 | 1.42 | 0.76;0.38 | 27112095; Abyzov et al., 2016
Sic1 | 92 | 278 | 500 | 3.34 | 0.59 | 0.48 | 20399186; Mittag et al., 2010
SKIPN | 71 | 298 |  | 5.64 | 1.05 | 1.46 | 20007319; Wang et al., 2010
SLBP-NT | 113 | 298 | 600 | 3.96 | 1.40 | 1.61 | 15260482; Thapar et al., 2004
α-synuclein | 140 | 298 | 600 | 2.96 | 0.53 | 0.44 | 30184304; Rezaei-Ghaleh et al., 2018
SOCS5-JIR | 70 | 303 | 800 | 4.32 | 2.36 | 1.91 | 26173083; Chandrashekaran et al., 2015
tau K18 | 129 | 283 | 700 | 4.12 | 0.95 | 0.83 | 23740819; Barré and Eliezer, 2013
TC1 | 106 | 298 | 600 | 4.65 | 1.61 | 1.24 | 23189168; Cino et al., 2012
TDP-43 | 151 | 283 | 500 | 4.07 | 1.51 | 0.96 | 27545621; Conicella et al., 2016
γ-tubulin-CT | 39 | 288 | 500 | 2.23 | 0.35 | 0.27 | 29127738; Harris et al., 2018
Test set (9 IDPs)
AMOTL1 | 207 | 283 | 800 | 8.45 | 2.55 | 2.04 | 35481651; Vogel et al., 2022
CAHS-8 | 233 | 303 | 850 | 4.43 | 3.25 | 2.36;1.92 | 34750927; Malki et al., 2022
ChiZ | 64 | 298 | 800 | 4.33 | 0.89 | 0.74 | 32585849; Hicks et al., 2020
α-endosulfine | 121 | 298 | 800 | 3.21 | 0.81 | 0.48 | 34346186; Thapa et al., 2022
FtsQ | 99 | 305 | 800 | 6.44 | 3.78 | 2.32;1.71 | 36959324; Smrt et al., 2023
Pdx1 | 83 | 298 | 500 | 2.98 | 0.70 | 0.76 | 30525611; Cook et al., 2019
synaptobrevin-2 | 96 | 278 | 600 | 5.54 | 1.80 | 0.72 | 30975750; Lakomek et al., 2019
TIA-1 | 91 | 310 | 800 | 4.01 | 0.89 | 0.55 | 36112647; Sekiyama et al., 2022
YAP | 122 | 298 | 800 | 3.19 | 1.44 | 1.23 | 35378854; Feichtinger et al., 2022

* For the training set, RMSE is calculated for prediction based on leave-one-out training (using 44 IDPs).
Where two RMSE values are listed, the first number is for the SeqDYN prediction and the second is after applying a helix boost.

We used 45 of the 54 IDPs to train and validate SeqDYN and reserved the remaining 9 for testing. The sequence lengths of the training set range from 39 to 406 residues, with an average of 125.3 residues. Altogether R2 data are available for 3966 residues. A large majority (35) of the 45 IDPs have mean R2 values (R¯2, calculated among all the residues in a protein) between 2.5 and 5.5 s–1 (Table 1 and Figure 3A). This R¯2 range is much lower than that of structured proteins with similar sequence lengths. The low R¯2 values and lack of dependence on sequence length (Figure 3—figure supplement 1A) suggest that R2 of the IDPs is mostly dictated by local sequence instead of tertiary interaction.

Figure 3 (with 1 supplement)
Properties of the 45 IDPs in the training set.

(A) Histograms of means and standard deviations, calculated for individual proteins. Curves are drawn to guide the eye. Inset: correlation between R¯2 and σR2. (B) Experimental mean scaled R2 (msR2) and SeqDYN q parameters, for the 20 types of amino acids. Note that Pro residues have low msR2 because they lack a backbone amide proton. Amino acids are in descending order of q.

The most often used temperature for acquiring the R2 data was 298 K, but low temperatures (277–280 K) were used in a few cases (Table 1 and Figure 3—figure supplement 1B). Of the seven IDPs with R¯2 > 6.4 s–1, four can be attributed to low temperatures (Sólyom et al., 2015; Martin et al., 2016; Janke et al., 2018; Jenner et al., 2018), one is due to a relatively low temperature (283 K) as well as the presence of glycerol (20% v/v; Bafaro et al., 2019), and two can be explained by tertiary structure formation [a folded domain (Maiti et al., 2019) or molten globule (Patel et al., 2014)]. A simple reason for higher R2 values at lower temperatures is the higher water viscosity, resulting in a slowdown in molecular tumbling; a similar effect is achieved by adding glycerol. In some cases, R2 was measured at both low and room temperatures (Abyzov et al., 2016; Martin et al., 2020). To a good approximation, the effect of lowering temperature is a uniform scaling of R2 across the IDP sequence. For Sev-NT, downscaling of the R2 values at 278 K by a factor of 2.0 brings them into close agreement with those at 298 K (Figure 3—figure supplement 1C), with a root-mean-square-deviation (RMSD) of 0.5 s–1 among all the residues. Likewise, for A1-LCD, downscaling by a factor of 2.4 brings the R2 values at 288 K into good match with those at 298 K (Figure 3—figure supplement 1D), with an RMSD of 0.4 s–1. Because SeqDYN is concerned with the sequence dependence of R2, a uniform scaling has no effect on model parameter or prediction; therefore mixing the data from different temperatures is justified. The same can be said about the different magnetic fields in acquiring the R2 data (Table 1 and Figure 3—figure supplement 1E). Increasing the magnetic field raises R2 values, and the effect is also approximated well by a uniform scaling (Abyzov et al., 2016; Conicella et al., 2016; Janke et al., 2018).
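The uniform-scaling argument above can be sketched numerically. The example below assumes a least-squares criterion for choosing the scaling factor, which is an assumption of this sketch (the text does not state how the factors 2.0 and 2.4 were obtained), and it uses made-up R2 profiles rather than measured data. The names `best_uniform_scale` and `rmsd` are hypothetical.

```python
import math


def best_uniform_scale(r2_cold, r2_warm):
    """Least-squares factor k such that r2_cold / k best matches r2_warm.

    Minimising sum((c * x - w)**2) over x = 1/k gives x = sum(c*w) / sum(c*c).
    """
    inv_k = sum(c * w for c, w in zip(r2_cold, r2_warm)) / sum(c * c for c in r2_cold)
    return 1.0 / inv_k


def rmsd(a, b):
    """Root-mean-square deviation between two equally long profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# Hypothetical profiles (not measured data): the cold profile is exactly twice the warm one.
r2_warm = [2.1, 3.0, 4.2, 3.3, 2.5]
r2_cold = [2.0 * v for v in r2_warm]

k = best_uniform_scale(r2_cold, r2_warm)
rescaled = [c / k for c in r2_cold]
print(f"scale factor = {k:.2f}, RMSD after downscaling = {rmsd(rescaled, r2_warm):.2f} s^-1")
```

With real data the residual RMSD would not vanish; the values quoted above (0.5 and 0.4 s–1) indicate how closely a single factor reproduces the room-temperature profiles.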

One measure on the level of sequence dependence of R2 is the standard deviation, σR2, calculated among the residues of an IDP. Among the training set, the R2 values of 30 IDPs have moderate sequence variations, with σR2 ranging from 0.5 to 1.5 s–1 (Table 1); the histogram of σR2 calculated for the entire training set peaks around 0.75 s–1 (Figure 3A). There is a moderate correlation between σR2 and R¯2 (Figure 3A, inset), reflecting in part the fact that σR2 can be raised simply by a uniform upscaling, for example as a result of lowering temperature. Still, only two of the five IDPs with high R¯2 attributable to lower temperature or presence of glycerol are among the seven IDPs with high sequence variations (σR2>2 s–1). Therefore, the sequence variation of R2 as captured by σR2 manifests mostly the intrinsic effect of the IDP sequence, not the influence of external factors such as temperature or magnetic field strength. The mean σR2 value among the training set is 1.24 s–1.

One way to eliminate the influence of external factors is to scale the R2 values of each IDP by its R¯2; we refer to the results as scaled R2, or sR2. We then pooled the sR2 values for all residues in the training set, and separated them according to amino-acid types. The amino acid type-specific mean sR2 values, or msR2, are displayed in Figure 3B. The seven amino acids with the highest msR2 in descending order are Trp, Arg, Tyr, Phe, Ile, His, and Leu. The presence of all the four aromatic amino acids in this “high-end” group immediately suggests π-π stacking as important for raising msR2; the presence of Arg further implicates cation-π interactions. In the other extreme, the seven amino acids with the lowest msR2 in ascending order are Gly, Cys, Val, Asp, Ser, Thr, and Asn. Gly is well-known as a flexible residue; it is also interesting that all the four amino acids with short polar sidechains are found in this “low-end” group. Pro has an excessively low msR2 [with data from only two IDPs (Murrali et al., 2018; Wong et al., 2020)], but that is due to the absence of an amide proton.
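The scaling and pooling procedure described in this paragraph can be sketched as follows. This is a minimal illustration on made-up data: the function name `mean_scaled_r2` and the toy sequences are hypothetical, and the sketch assumes every residue has a measured value (missing residues would have to be skipped in practice).

```python
from collections import defaultdict


def mean_scaled_r2(proteins):
    """Return {amino acid: msR2} from a list of (sequence, r2_values) pairs."""
    pooled = defaultdict(list)
    for seq, r2 in proteins:
        mean_r2 = sum(r2) / len(r2)              # protein-wide mean R2
        for aa, value in zip(seq, r2):
            pooled[aa].append(value / mean_r2)   # scaled R2 (sR2) of this residue
    # Average the pooled sR2 values separately for each amino-acid type.
    return {aa: sum(values) / len(values) for aa, values in pooled.items()}


# Hypothetical toy data (not from the training set).
toy_proteins = [
    ("GWG", [2.0, 4.0, 3.0]),
    ("WG", [6.0, 2.0]),
]
msr2 = mean_scaled_r2(toy_proteins)
for aa in sorted(msr2, key=msr2.get, reverse=True):
    print(f"{aa}: msR2 = {msr2[aa]:.3f}")
```

Dividing by each protein's own mean before pooling is what removes protein-wide effects such as temperature or field strength from the amino-acid averages.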

The SeqDYN model and parameters

The null model is to assume a uniform R2 for all the residues in an IDP. The root-mean-square-error (RMSE) of the null model is equal to the standard deviation, σR2, of the measured R2 values. The mean RMSE, RMSE¯, of the null model, equal to 1.24 s–1 for the training set, serves as the upper bound for evaluating the errors of R2 predictors. The next improvement is a one-residue predictor, where first each residue (with index n) assumes its amino acid-specific mean sR2 (msR2) and then a uniform scaling factor Υ is applied:

(1) $R_2(n) = \Upsilon\,\mathrm{msR_2}(n)$

This one-residue model does only minutely better than the null model, with an RMSE¯ of 1.22 s–1.
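The two baselines can be illustrated with a short sketch. It checks that the RMSE of a uniform (null) prediction equals the standard deviation of the measured values, which holds exactly when the population form of the standard deviation (division by the number of residues) is used, and it evaluates the one-residue predictor for a toy sequence. All numbers, including the msR2 values and the scaling factor, are hypothetical.

```python
import math
import statistics


def rmse(predicted, measured):
    """Root-mean-square error between predicted and measured profiles."""
    return math.sqrt(sum((p - m) ** 2 for p, m in zip(predicted, measured)) / len(measured))


def one_residue_prediction(seq, msr2, upsilon):
    """One-residue model: R2(n) = upsilon * msR2 of the amino acid at position n."""
    return [upsilon * msr2[aa] for aa in seq]


# Hypothetical values for illustration only (not measured data or fitted parameters).
seq = "GWSGY"
measured = [2.0, 4.5, 2.5, 2.2, 4.0]
msr2 = {"G": 0.8, "S": 0.9, "W": 1.4, "Y": 1.3}

null_prediction = [statistics.fmean(measured)] * len(measured)
null_rmse = rmse(null_prediction, measured)
one_res_rmse = rmse(one_residue_prediction(seq, msr2, upsilon=3.0), measured)

print(f"null RMSE = {null_rmse:.3f}, population SD = {statistics.pstdev(measured):.3f}")
print(f"one-residue RMSE = {one_res_rmse:.3f}")
```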

In SeqDYN, we account for the influence of neighboring residues. Specifically, each residue i contributes a factor f(i; n) to the R2 value of residue n. Therefore,

(2a) $R_2(n) = \Upsilon \prod_{i=1}^{N} f(i; n)$

where N is the total number of residues in the IDP. The contributing factor depends on the sequence distance $s = |i - n|$ and the amino-acid type of residue i:

(2b) $f(i; n) = 1 + \dfrac{q(i) - 1}{1 + b s^2}$

There are 21 global parameters. The first 20 are the q values, one for each of the 20 types of amino acids; the last parameter is b, appearing in the Lorentzian form of the sequence-distance dependence. We define the correlation length, Lcorr, as the sequence distance at which the contributing factor is midway between the values at s = 0 and s = ∞. It is easy to verify that $L_\mathrm{corr} = b^{-1/2}$. Note that the single-residue model can be seen as a special case of SeqDYN, with Lcorr set to 0 and q set to msR2.
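A minimal sketch of the model just defined may help make it concrete: the R2 of residue n is a scaling factor times the product, over all residues i (including i = n), of 1 + (q(i) − 1)/(1 + b s²) with s = |i − n|. The function names and the q values below are hypothetical placeholders chosen for illustration; they are not the fitted SeqDYN parameters.

```python
def seqdyn_profile(seq, q, b, upsilon=1.0):
    """R2(n) = upsilon * product over i of [1 + (q[seq[i]] - 1) / (1 + b * s**2)]."""
    n_res = len(seq)
    profile = []
    for n in range(n_res):
        product = 1.0
        for i in range(n_res):
            s = abs(i - n)                                   # sequence distance
            product *= 1.0 + (q[seq[i]] - 1.0) / (1.0 + b * s * s)
        profile.append(upsilon * product)
    return profile


def correlation_length(b):
    """Distance at which the contributing factor is midway between s = 0 and s = infinity."""
    return b ** -0.5


# Placeholder q values for illustration only (not the trained parameters).
q_demo = {"G": 0.9, "S": 0.95, "A": 1.0, "L": 1.15, "W": 1.2}
profile = seqdyn_profile("GGSGSAALWLLWAASGGG", q_demo, b=0.03, upsilon=2.0)
print(" ".join(f"{value:.2f}" for value in profile))
print(f"Lcorr for b = 0.03: {correlation_length(0.03):.2f} residues")
```

Because each factor tends to 1 at large sequence distance, residues far from n have little influence, and a residue near a terminus simply has fewer neighbors contributing factors that differ from 1.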

The functional forms of Equation 2a and Equation 2b were adapted from Li et al., 2020; we also used them for predicting residue-specific membrane association propensities of IDPs (Qin et al., 2022). In these previous applications, a linear term was also present in the denominator of Equation 2b. In our initial training of SeqDYN, the coefficient of the linear term always converged to near zero. We thus eliminated the linear term. In addition to the Lorentzian form, we also tested a Gaussian form for the sequence-distance dependence and found somewhat worse performance. The more gradual attenuation of the Lorentzian form with increasing sequence distance evidently provides an overall better model for the R2 data in the entire training set. Others (Cho et al., 2007; Sekiyama et al., 2022; Delaforge et al., 2018) have modeled R2 as the average of some parameters over a window; a window has an extremely abrupt sequence-distance dependence (1 for s<Lcorr and 0 for s>Lcorr).

We parametrized the SeqDYN model represented by Equation 2a and Equation 2b on the training set of 45 IDPs. In addition to the 21 global parameters noted above, there are also 45 local parameters, namely one uniform scaling factor (Υ) per IDP. The parameter values were selected to minimize the sum of the mean-square-errors for the IDPs in the training set, calculated on R2 data for a total of 3924 residues. We excluded the 42 Pro residues in the training set because, as already noted, their R2 values are lower for chemical reasons. We will present validation and test results below, but first let us look at the parameter values.
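The training objective described above can be written down as a short sketch: a sum, over IDPs, of the mean-square error between predicted and measured R2, with Pro residues and residues lacking data left out of the error. This sketch assumes that Pro residues still contribute their factors to neighboring residues and are only excluded from the error sum, which is one reading of the text. The function names are hypothetical and the optimizer itself is not shown.

```python
def unscaled_profile(seq, q, b):
    """Product of contributing factors for every residue (scaling factor omitted)."""
    profile = []
    for n in range(len(seq)):
        product = 1.0
        for i in range(len(seq)):
            s = abs(i - n)
            product *= 1.0 + (q[seq[i]] - 1.0) / (1.0 + b * s * s)
        profile.append(product)
    return profile


def training_objective(dataset, q, b, upsilons):
    """Sum over IDPs of the per-protein mean-square error.

    dataset  : list of (sequence, measured) pairs; measured[n] is None if no data.
    upsilons : one local scaling factor per IDP, in the same order as dataset.
    """
    total = 0.0
    for (seq, measured), upsilon in zip(dataset, upsilons):
        shape = unscaled_profile(seq, q, b)
        sq_errors = [
            (upsilon * shape[n] - m) ** 2
            for n, m in enumerate(measured)
            if m is not None and seq[n] != "P"   # skip missing data and Pro residues
        ]
        if sq_errors:
            total += sum(sq_errors) / len(sq_errors)
    return total


# Hypothetical toy input: with all q equal to 1 the prediction is simply upsilon.
toy_dataset = [("APG", [2.0, 99.0, 2.0])]
toy_q = {"A": 1.0, "P": 1.0, "G": 1.0}
print(training_objective(toy_dataset, toy_q, b=0.03, upsilons=[3.0]))
```

A general-purpose minimizer would then vary the 20 q values, b, and the per-IDP scaling factors to reduce this quantity.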

The q values are displayed in Figure 3B alongside msR2. In descending order, the seven amino acids with the highest q values are Trp, Ile, Tyr, Arg, His, Phe, and Leu. These are exactly the same amino acids in the high-end group for msR2, though their order there is somewhat different. In ascending order, the seven amino acids (excluding Pro) with the lowest q values are Gly, Asn, Ser, Asp, Val, Thr, and Cys. The composition of the low-end group is also identical to that for msR2. The q values thus also suggest that π-π and cation-π interactions in local sequences may raise R2, whereas Gly and short-polar residues may lower R2.

Given the common amino acids at both the high and low ends for msR2 and q, it is not surprising that these two properties exhibit a strong correlation, with a coefficient of determination (R2; excluding Pro) at 0.92 (Figure 4A). Also, because the high-end group contains the largest amino acids (e.g. Trp and Tyr) whereas the low-end group contains the smallest amino acids (e.g. Gly and Ser), we anticipated some correlation of msR2 and q with amino-acid size. We measure the latter property by the molecular mass (m). As shown in Figure 4B, both msR2 and q indeed show a medium correlation with m, with R2=0.67 (excluding Pro) and 0.61, respectively. A bulkiness parameter was proposed as an indicator of sequence-dependent backbone dynamics of IDPs (Cho et al., 2007; Delaforge et al., 2018). Bulkiness was defined as the sidechain volume-to-length ratio, and identified amino acids with aromatic or branched aliphatic sidechains as bulky (Zimmerman et al., 1968). We found only modest correlations between either msR2 or q and bulkiness, with R2 just below 0.4 (Figure 4C).

Figure 4 (with 1 supplement)
SeqDYN model parameters.

(A) Correlation between msR2 and q. The values are also displayed as bars in Figure 3B. (B) Correlation of msR2 and q with amino-acid molecular mass. (C) Correlation of msR2 and q with bulkiness. (D) The optimal correlation length and deterioration of SeqDYN prediction as the correlation length is moved away from the optimal value.

The optimized value of b is $3.164 \times 10^{-2}$, corresponding to an Lcorr of 5.6 residues. The resulting optimized RMSE¯ is 0.95 s–1, a clear improvement over the value 1.24 s–1 of the null model. To check the sensitivity of prediction accuracy to b, we set b to values corresponding to Lcorr = 0, 1, 2,…, and retrained SeqDYN for b fixed at each value (Figure 4D). Note that the null-model RMSE¯, 1.24 s–1, sets an upper bound. This upper bound is slowly reached when Lcorr is increased from the optimal value. In the opposite direction, when Lcorr is decreased from the optimal value, RMSE¯ rises quickly, reaching 1.22 s–1 at Lcorr = 0. The latter RMSE¯ is the same as that of the single-residue model. Lastly we note that there is a strong correlation between the uniform scaling factors and R¯2 values among the 45 IDPs (R2=0.77), as to be expected. For 39 of the 45 IDPs, Υ values fall in the range of 0.8–2.0 s–1.

As presented next, we evaluate the performance of SeqDYN by leave-one-out cross validation, where each IDP in turn was left out of the training set and the model was trained on the remaining 44 IDPs to predict R2 for the IDP that was left out. The parameters from the leave-one-out (also known as jackknife) training sessions allow us to assess the potential bias of the training set. For this purpose, we compare the values of the 21 global parameters, either from the full training set or from taking the averages of the jackknife training sessions. For each of the q parameters, the values from these two methods differ only in the fourth digit; for example for Leu, they are both 1.1447 from full training and from jackknife training. The values for b are $3.164 \times 10^{-2}$ from full training as stated above and $3.163 \times 10^{-2}$ from jackknife training. The close agreement in parameter values between full training and jackknife training suggests no significant bias in the training set.

Another question of interest is whether the difference between the q parameters of two amino acids is statistically significant. To answer this question, we carried out fivefold cross-validation training, resulting in five independent estimates for each parameter. For example, the mean ± standard deviation of the q parameter is 1.1405 ± 0.0066 for Leu and 1.2174 ± 0.0211 for Ile. A t-test shows that their difference is statistically highly significant (P<0.0001). In contrast, the difference between Leu and Phe (q=1.1552 ± 0.0304) is not significant. t-test results for other pairs of amino acids are found in Figure 4—figure supplement 1.
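The comparison can be sketched from the summary statistics quoted above. The example assumes a two-sample Student's t-test with five estimates per amino acid and sample standard deviations; the exact variant used for the reported P values is not stated, so this is only an illustration. With equal group sizes the pooled-variance t statistic reduces to the expression below, with 2n − 2 = 8 degrees of freedom, for which the two-tailed 5% critical value is 2.306.

```python
import math

N_FOLDS = 5        # five estimates per parameter (fivefold cross-validation)
T_CRIT_DF8 = 2.306  # two-tailed 5% critical value of Student's t with 8 degrees of freedom


def t_statistic(mean1, sd1, mean2, sd2, n=N_FOLDS):
    """Two-sample t statistic for equal group sizes (assumes sample standard deviations)."""
    standard_error = math.sqrt((sd1 ** 2 + sd2 ** 2) / n)
    return abs(mean1 - mean2) / standard_error


# Means and standard deviations of q as quoted in the text.
leu = (1.1405, 0.0066)
ile = (1.2174, 0.0211)
phe = (1.1552, 0.0304)

t_leu_ile = t_statistic(*leu, *ile)
t_leu_phe = t_statistic(*leu, *phe)
print(f"Leu vs Ile: t = {t_leu_ile:.2f}, significant at 5%: {t_leu_ile > T_CRIT_DF8}")
print(f"Leu vs Phe: t = {t_leu_phe:.2f}, significant at 5%: {t_leu_phe > T_CRIT_DF8}")
```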

Validation of SeqDYN predictions

We now present leave-one-out cross-validation results. We denote the RMSE of the R2 prediction for the left-out IDP as RMSE(–1). As expected, RMSE(–1) is higher than the RMSE obtained with the IDP kept in the training set, but the increases are generally slight. Specifically, all but eight of the IDPs have increases <0.1 s–1; the largest increase is 0.35 s–1, for CBP-ID4. The mean RMSE(–1), or RMSE¯(–1), for the 45 IDPs is increased by 0.05 s–1 over RMSE¯, to 1.00 s–1. The latter value is still a distinct improvement over the mean RMSE 1.24 s–1 of the null model. The histogram of RMSE(–1) for the 45 IDPs is shown in Figure 5A. It peaks at 0.5 s–1, which is a substantial downshift from the corresponding peak at 0.75 s–1 for σR2 (Figure 3A). Thirty-four of the 45 IDPs have RMSE(–1) values lower than the corresponding σR2.

Figure 5 (with 1 supplement)
Quality of SeqDYN predictions.

(A) Histogram of RMSE(–1). Letters indicate RMSE(–1) values of the IDPs to be presented in panels (B–F). (B–F) Measured (bars) and predicted (curves) R2 profiles for MKK4, α-synuclein, Mev-PNTD, Sev-NT, and CBP-ID4. In (E) and (F), green curves are SeqDYN predictions and red curves are obtained after a helix boost.

To further illustrate the performance of SeqDYN, we present the comparison of predicted and measured R2 values for five IDPs: MKK4, α-synuclein, Mev-PNTD, Sev-NT, and CBP-ID4 (Figure 5B–F). A simple common feature is the falloff of R2 at the N- and C-termini, resulting from missing upstream or downstream residues that otherwise would be coupled to the terminal residues, as first recognized by Schwalbe et al., 1997. Representative conformations of the five IDPs are displayed in Figure 2, with residues colored according to the predicted R2 values. For four of these IDPs, the RMSE(–1) values range from 0.44 to 0.76 s–1 and are scattered around the peak of the histogram, while for the fifth IDP, namely CBP-ID4, the RMSE(–1) value is 2.01 s–1 and falls on the tail of the histogram (Figure 5A). Figure 5B displays the measured and predicted R2 for MKK4. SeqDYN correctly predicts higher R2 values in the second half of the sequence than in the first half. It even correctly predicts the peak around residue Arg75. The sequence in this region is H72IERLRTH79; six of these eight residues belong to the high-end group. In contrast, the lowest R2 values occur in the sequence S7GGGGSGGGSGSG19, composed entirely of two amino acids in the low-end group.

R2 values for α-synuclein are shown in Figure 5C. Here, SeqDYN correctly predicts higher R2 near the C-terminus and a dip around Gly68. However, it misses the R2 peaks around Tyr39 and Asp121. MD simulations (Dey et al., 2022) have found that these R2 peaks can be explained by a combination of secondary structure formation (β-sheet around Tyr39 and polyproline II helix around Asp121) and local (between Tyr39 and Ser42) or long-range (between Asp121 and Lys96) interactions. SeqDYN cannot account for long-range interactions (e.g. between β-strands and between Asp121 and Lys96). Figure 5D shows that SeqDYN gives excellent R2 predictions for Mev-PNTD. It correctly predicts the high peaks around Arg17, Glu31, Leu193, and lower peaks around Arg235 and Trp285, but does underpredict the narrow peak around Tyr113.

The overall R2 profile of Sev-NT is predicted well by SeqDYN, but the peak in the long helical region (residues 478–491) is severely underestimated (green curve in Figure 5E). A similar situation occurs for CBP-ID4, where the peak in the second long helical region (around Glu113) is underpredicted (green curve in Figure 5F). While the measured R2 exhibits a higher peak in the second helical region than in the first helical region (around Arg16), the opposite is predicted by SeqDYN. When the R2 data were included in the training set (i.e., full training), the second peak is higher than the first one, but that is not a real prediction because the R2 data themselves were used for training the model. It merely means that the SeqDYN functions can be parameterized to produce any prescribed R2 profile along the sequence. Indeed, when the R2 data of CBP-ID4 alone were used to parameterize SeqDYN, the measured R2 profile is closely reproduced (Figure 5—figure supplement 1). The reversal in R2 peak heights between the two helical regions is the reason for the aforementioned unusual increase in RMSE when CBP-ID4 was left out of the training set.

R2 boost in long helical regions

It is apparent that SeqDYN underestimates the R2 of stable long helices. Transient short helices do not seem to be a problem, since these are present, for example in Mev-PNTD, where transient helix formation in the first 37 residues and between residues 189–198 (Milles et al., 2018) coincides with R2 peaks that are correctly predicted by SeqDYN. SeqDYN can treat coupling between residues within the correlation length of 5.6 residues, but a much longer helix would tumble more slowly than implied by an Lcorr of 5.6, and thus it makes sense that SeqDYN would underestimate R2 in that case.

Our solution then is to apply a boost factor to the long helical region. To do so, we have to know whether an IDP does form long helices and if so what the constituent residues are. Secondary structure predictors tend to overpredict α-helices and β-strands for IDPs, as they are trained on structured proteins. One way to counter that tendency is to make the criteria for α-helices and β-strands stricter. We found that, by filtering PsiPred (http://bioinf.cs.ucl.ac.uk/psipred; McGuffin et al., 2000) helix propensity scores (pHlx) with a very high cutoff of 0.99, the surviving helix predictions usually correspond well with residues identified by NMR as having high helix propensities. For example, for Mev-PNTD, PsiPred plus filtering predicts residues 14–17, 28–33, and 191–193 as helical; all of them are in regions that form transient helices according to chemical shifts (Milles et al., 2018). Likewise long helices are also correctly predicted for Sev-NT (residues 477–489) and CBP-ID4 (residues 6–17 and 105–116; Jensen et al., 2008; Piai et al., 2016).

We apply a boost factor, BHlx, to helices with a threshold length of 12:

(3) $B_{\mathrm{Hlx}} = 1 + \alpha\, p_{\mathrm{Hlx}}\, \Theta(p_{\mathrm{Hlx}} \geq 0.99;\ L_{\mathrm{Hlx}} \geq 12)$

The Θ function is 1 if the helix propensity score is at or above the filtering cutoff and the helix length (LHlx) is at or above the threshold, and 0 otherwise. With a boost amplitude α at 0.5, the boosted SeqDYN prediction for Sev-NT reaches excellent agreement with the measured R2 (Figure 5E, red curve). The RMSE(–1) is reduced from 0.76 s–1 to 0.38 s–1 upon boosting. Applying the same helix boost to CBP-ID4 also results in a modest reduction in RMSE(–1), from 2.01 to 1.90 s–1 (Figure 5F, red curve). The only other IDP for which PsiPred plus filtering predicts a long helix is the N-terminal region of lysyl-tRNA synthetase (KRS-NT). The authors who studied this protein did not report on secondary structure (Cho et al., 2014), but feeding their reported chemical shifts to the TALOS+ server (https://spin.niddk.nih.gov/bax/nmrserver/talos/; Shen et al., 2009) revealed only short stretches of residues that fall into the helical region of the Ramachandran map. The SeqDYN prediction for KRS-NT is already good [RMSE(–1) = 0.83 s–1]; applying a helix boost would deteriorate the RMSE(–1) to 1.16 s–1.
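The boost of Equation 3 can be sketched in a few lines of Python. This is a minimal illustration rather than the SeqDYN source code: the function names are hypothetical, the cutoff and length threshold are assumed to be inclusive, and the boost factor is assumed to multiply the unboosted R2 prediction residue by residue.

```python
def helix_boost(p_hlx, alpha=0.5, cutoff=0.99, min_len=12):
    """Per-residue boost factor B_Hlx (a sketch of Equation 3).

    p_hlx: helix propensity scores, one per residue (e.g., from PsiPred).
    A residue is boosted only if its score passes the cutoff AND it lies in
    a contiguous run of passing residues that is at least min_len long.
    """
    n = len(p_hlx)
    boost = [1.0] * n
    start = None
    for i in range(n + 1):  # the extra step closes a run ending at the C-terminus
        passing = i < n and p_hlx[i] >= cutoff
        if passing and start is None:
            start = i
        elif not passing and start is not None:
            if i - start >= min_len:
                for j in range(start, i):
                    boost[j] = 1.0 + alpha * p_hlx[j]
            start = None
    return boost


def apply_boost(r2_pred, p_hlx, **kwargs):
    """Multiply an unboosted R2 profile by the boost factor (assumed usage)."""
    return [r * b for r, b in zip(r2_pred, helix_boost(p_hlx, **kwargs))]
```

With α = 0.5 and scores near 1, residues in a qualifying helix are raised by close to 50%, while shorter filtered helices are left untouched.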

Further test on a set of nine IDPs

We have reserved nine IDPs for testing SeqDYN (parameterized on the training set of 45 IDPs). The level of disorder in these test proteins also spans the full range, from absence of secondary structures [ChiZ N-terminal region (Hicks et al., 2020), Pdx1 C-terminal region (Cook et al., 2019), and TIA-1 prion-like domain (Sekiyama et al., 2022)] to presence of transient short helices [synaptobrevin-2 (Lakomek et al., 2019), α-endosulfine (Thapa et al., 2022), YAP (Feichtinger et al., 2022), angiomotin-like 1 (AMOTL1) (Vogel et al., 2022)] to formation of stable long helices [FtsQ (Smrt et al., 2023) and CAHS-8 (Malki et al., 2022)]. For eight of the nine test IDPs, the RMSEs of SeqDYN predictions are lower than the experimental σR2 values, by an average of 0.66 s–1. For the ninth IDP (Pdx1), the SeqDYN RMSE is slightly higher, by 0.06 s–1, than the experimental σR2. Together, the nine test IDPs have a mean RMSE of 1.13 s–1, close to the mean RMSE(–1) of 1.00 s–1 for the training set in the leave-one-out cross-validation.
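The comparison between RMSE and σR2 can be made concrete with a short sketch. It assumes that σR2 is the standard deviation of the measured R2 values among the residues of an IDP, in which case σR2 equals the RMSE of a featureless prediction fixed at the mean R2; the function names are illustrative only.

```python
import math
import statistics


def rmse(pred, meas):
    """Root-mean-square error between predicted and measured R2 profiles."""
    return math.sqrt(sum((p - m) ** 2 for p, m in zip(pred, meas)) / len(meas))


def flat_baseline_rmse(meas):
    """RMSE of a flat profile set to the mean measured R2.

    This equals the population standard deviation of the measured values,
    which is the quantity assumed here to correspond to sigma_R2.
    """
    mean = statistics.fmean(meas)
    return rmse([mean] * len(meas), meas)
```

Under this assumption, an RMSE below σR2 indicates that the predicted profile captures sequence-dependent variation that a flat profile cannot.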

The comparison of predicted and measured R2 profiles along the sequence is presented in Figure 6A–I. For ChiZ, SeqDYN correctly predicts the major peak around Arg25 and the minor peak around Arg46 (Figure 6A). The R2 profile of Pdx1 is largely featureless, except for a dip around Gly216, which is correctly predicted by SeqDYN (Figure 6B). Correct prediction is also obtained for the higher R2 in the first half of TIA-1 prion-like domain than in the second half (Figure 6C). SeqDYN gives an excellent prediction for synaptobrevin-2, including a linear increase up to Arg56 and the major peak around Trp89 (Figure 6D).

Measured (bars) and predicted (curves) R2 profiles for ChiZ N-terminal region, Pdx1 C-terminal region, TIA-1 prion-like domain, synaptobrevin-2, α-endosulfine, YAP, AMOTL1, FtsQ, and CAHS-8.

In (C), R2 does not fall off at the N-terminus because the sequence is preceded by an expression tag MGSSHHHHHHHHHHHHS. In (H) and (I), green curves are SeqDYN predictions and red curves are obtained after a helix boost.

The prediction is also very good for α-endosulfine, including elevated R2 around Glu34, which coincides with the presence of a transient helix, and depressed R2 in the last 40 residues (Figure 6E). The only miss is an underprediction for the peak around Lys74. SeqDYN also predicts well the overall shape of the R2 profile for YAP, including peaks around Asn70, Leu91, Arg124, and Arg161, but severely underestimates the peak height around Asn70 (Figure 6F). NOE signals indicate contacts between Met86, Leu91, Phe95, and Phe96 (Feichtinger et al., 2022); evidently this type of local contact is captured well by SeqDYN. The R2 elevation around Asn70 is mostly due to helix formation: residues 61–74 have helix propensities up to 40% (Feichtinger et al., 2022). PsiPred predicts helix for residues 62–73, but only residues 65–68 survive the filtering that we impose, resulting in a helix that is too short to apply a helix boost. The prediction for AMOTL1 is mostly satisfactory, including peaks around Phe200 and Arg264 and a significant dip around Gly292 (Figure 6G). However, whereas the two peaks have approximately equal heights in the measured R2 profile, the predicted peak height around Phe200 is too low. SCSs indicate helix propensity around both R2 peaks (Vogel et al., 2022). PsiPred also predicts helix in both regions, but only five and two residues, respectively, survive after filtering, and are too short for applying a helix boost.

For FtsQ, SeqDYN correctly predicts elevated R2 for the long helix [residues 46–74 (Smrt et al., 2023)] but underestimates the magnitude (RMSE = 2.32 s–1; green curve in Figure 6H). PsiPred plus filtering predicts a long helix formed by residues 47–73. Applying the helix boost substantially improves the agreement with the measured R2, with RMSE reducing to 1.71 s–1 (red curve in Figure 6H). SeqDYN also gives a qualitatively correct R2 profile for CAHS-8, with higher R2 for the middle section (residues 95–190; RMSE = 2.36 s–1; green curve in Figure 6I). However, it misses the extra elevation in R2 for the first half of the middle section (residues 95–145). According to SCS, the first and second halves have helix propensities of 60% and 30%, respectively (Malki et al., 2022). PsiPred plus filtering predicts helices for residues 96–121, 124–141, 169–173, and 179–189. Only the first two helices, both in the first half of the middle section, are considered long according to our threshold. Once again, applying the helix boost leads to marked improvement in the predicted R2, with RMSE reducing to 1.92 s–1 (red curve in Figure 6I).

Inputting the sequences of structured proteins predicts R2 in the unfolded state

SeqDYN is trained on IDPs; what if we feed it with the sequence of a structured protein? The prediction using the sequence of hen egg white lysozyme, a well-studied single-domain protein, is displayed in Figure 7A. It shows remarkable agreement with the R2 profile measured by Klein-Seetharaman et al., 2002 in the unfolded state (denatured by 8 M urea at pH 2 and reduced to break disulfide bridges), including a major peak around Trp62, a second peak around Trp111, and a third peak around Trp123. Klein-Seetharaman et al. mutated Trp62 to Gly and the major peak all but disappeared. This result is also precisely predicted by SeqDYN with the mutant sequence (Figure 7B).

R2 profiles predicted (curves) by SeqDYN show close agreement with those measured (bars) on structured proteins in the unfolded state.

(A) Wild-type lysozyme (8 M urea; pH 2; cysteine-methylated). (B) Lysozyme with Trp62 to Gly mutation (pH 2). Methylated cysteines were treated as Ala in the SeqDYN predictions. (C) Apomyoglobin (8 M urea; pH 2.3). (D) Ubiquitin (8 M urea; pH 2).

SeqDYN also predicts well the R2 profiles of other proteins in the unfolded state. For unfolded apomyoglobin (8 M urea; pH 2.3), Schwarzinger et al., 2002 claimed that depressed R2 corresponded to stretches of small amino acids (Gly and Ala), whereas elevated R2 corresponded to local hydrophobic interactions. SeqDYN reproduces all the observed peaks and valleys in the R2 profile (Figure 7C). The deepest valley indeed occurs over a Gly/Ala-rich stretch, G125ADAQGA131, but the highest peak occurs over a stretch, I102KYLEFI108, that contains both hydrophobic and charged residues, all of which are on the high end of the q parameters (Figure 3B). The R2 profile of unfolded ubiquitin (8 M urea; pH 2) is relatively flat, which Wirmer et al., 2006 attributed to lack of residual secondary structure, based on the assumption that β-sheets (major elements of folded ubiquitin) are less resistant to denaturation than α-helices. SeqDYN predicts a relatively flat R2 profile (Figure 7D), but the reason is that the ubiquitin sequence lacks a contiguous stretch of high-q amino acids.
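Because SeqDYN takes only a sequence as input, the mutant and chemically modified proteins above are handled by editing the input sequence before prediction. The helpers below are a minimal sketch of that preparation step; the function names are hypothetical, the residue numbering is assumed to be 1-based and to match the construct being edited, and the toy sequence in the usage note is not lysozyme.

```python
def mutate(seq, position, new_aa, expected_old=None):
    """Return seq with the residue at a 1-based position replaced by new_aa.

    expected_old, if given, guards against numbering mismatches between the
    literature residue number and the construct actually used.
    """
    idx = position - 1
    if not 0 <= idx < len(seq):
        raise ValueError(f"position {position} is outside the sequence")
    if expected_old is not None and seq[idx] != expected_old:
        raise ValueError(f"expected {expected_old} at {position}, found {seq[idx]}")
    return seq[:idx] + new_aa + seq[idx + 1:]


def cys_to_ala(seq):
    """Treat (methylated) cysteines as alanine in the input sequence."""
    return seq.replace("C", "A")
```

For example, `cys_to_ala(mutate("KVCGW", 5, "G", expected_old="W"))` yields the input for a hypothetical Trp-to-Gly mutant with its cysteine treated as Ala.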

Discussion

We have developed a powerful method, SeqDYN, that predicts the backbone amide transverse relaxation rates (R2) of IDPs. The method is based on IDP sequences, is extremely fast, and is available as a web server at https://zhougroup-uic.github.io/SeqDYNidp/ (Qin and Zhou, 2024). The excellent performance supports the notion that the ns-dynamics reported by R2 is coded by the local sequence, comprising up to 6 residues on either side of a given residue. The amino-acid types that contribute the most to coupling within a local sequence are aromatic (Trp, Tyr, Phe, and His), Arg, and long branched aliphatic (Ile and Leu), suggesting the importance of π-π, cation-π, and hydrophobic interactions in raising R2. These interactions are interrupted by Gly and amino acids with short polar sidechains (Ser, Thr, Asn, and Asp), leading to reduced R2. Transient short helices produce moderate elevation in R2, whereas stable long helices result in a big boost in R2. Tertiary contacts can also raise R2, but appear to be infrequent in most IDPs (Dey et al., 2022).

It is also possible that R2 reported by backbone amide 15N relaxation (as is the case for most of the IDPs studied here) may not be particularly sensitive to exchange effects, which likely involve tertiary contact formation. For the D2 domain of p27Kip1, the exchange contributions measured using 15N relaxation were small (<2.5 s–1) but were as large as 25 s–1 when measured by high-power 1H relaxation dispersion (Ban et al., 2017). This experiment measures the effective transverse relaxation rate, R2,eff, over a range of effective radiofrequency ωeff. The exchange contribution is maximal for the value $R_{2,\mathrm{eff}}^{\mathrm{low}\,\omega_{\mathrm{eff}}}$ at low ωeff but is largely quenched for the value $R_{2,0}^{\mathrm{app}}$ in the high-ωeff limit. The SeqDYN prediction for this IDP matches much better with $R_{2,0}^{\mathrm{app}}$ than with $R_{2,\mathrm{eff}}^{\mathrm{low}\,\omega_{\mathrm{eff}}}$ (Figure 8). It is not clear whether this IDP is unique in forming persistent tertiary contacts that give rise to substantial exchange contributions or the 1H relaxation dispersion experiment is unique in reporting the exchange contributions. At the minimum, SeqDYN yields the exchange-free portion of the transverse relaxation rate, enabling easy identification of residues that potentially participate in tertiary contacts. For the D2 domain of p27Kip1, SeqDYN correctly predicts the $R_{2,0}^{\mathrm{app}}$ local maxima at W76 and Y88. It is these same two residues that show substantial exchange contributions and putatively participate in tertiary contact (Ban et al., 2017). Therefore local contacts may seed tertiary contacts. If R2,eff data with substantial exchange contributions become available for more IDPs, SeqDYN may be retrained to make predictions for IDPs forming persistent tertiary contacts.
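The idea of using the prediction as an exchange-free reference can be sketched as a simple screen: residues whose measured rate exceeds the predicted rate by a wide margin are flagged as candidates for exchange contributions. The function name and the 5 s–1 default threshold are illustrative assumptions, not values used in this study.

```python
def exchange_candidates(measured, predicted, threshold=5.0):
    """Return 1-based residue numbers where measured - predicted > threshold.

    measured: per-residue rates in s^-1, with None for residues lacking data
              (e.g., prolines or overlapped peaks).
    predicted: per-residue predicted rates in s^-1, same length as measured.
    threshold: excess rate in s^-1 treated as significant (hypothetical).
    """
    flagged = []
    for i, (m, p) in enumerate(zip(measured, predicted)):
        if m is not None and m - p > threshold:
            flagged.append(i + 1)
    return flagged
```

Any residue flagged this way would still need experimental confirmation, for example by relaxation dispersion.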

Comparison between SeqDYN prediction (curves) and effective transverse relaxation rate (bars) from 1H relaxation dispersion experiment.

(A) R2,eff in the high-ωeff limit. (B) R2,eff at low ωeff.

The q parameters, while introduced here to characterize the propensities of amino acids to participate in local interactions, appear to correlate with the tendencies of amino acids to drive liquid-liquid phase separation. Consistent with the rank order of q, Trp, Tyr, and Arg have been reported to be strong drivers of phase separation, Lys is a moderate driver, whereas Gly and Ser suppress phase separation (Martin et al., 2020; Wong et al., 2020; Wang et al., 2018). Recent measurements of the threshold concentration produced the following order for the propensity of phase separation by eight nonpolar amino acids in homotetrapeptides of the form XXssXX (ss: backbone disulfide bond): Trp > Phe > Leu > Met > Ile > Val > Ala > Pro (Zhang et al., 2024). This order is the same as that of the q parameters, except that the q values of Ile and Val are in the second and last places, respectively. Threshold concentrations of IDPs are now predicted reasonably well by coarse-grained simulations where each amino acid is modeled by a single bead with a Lennard-Jones diameter d0 and a stickiness parameter λ (Tesei and Lindorff-Larsen, 2022). Our q parameter shows a good correlation (coefficient of determination R² = 0.59) with the compound parameter $d_0^3\lambda$ (Figure 9). Therefore, the q parameter may serve as a predictor for the tendency of an amino acid to drive phase separation. In essence, the same ability of an amino acid, for example Trp, to form interactions with neighboring residues of an IDP in the free state also applies when it comes to interactions with residues on neighboring chains in a dense phase.
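The correlation analysis can be outlined as follows, assuming that the reported correlation is the squared Pearson coefficient of a simple linear regression of q against the compound parameter. The function names are hypothetical, and no actual q, d0, or λ values are reproduced here.

```python
def compound_parameter(d0, lam):
    """Compound parameter d0**3 * lambda for a single-bead residue model."""
    return d0 ** 3 * lam


def r_squared(x, y):
    """Squared Pearson correlation, i.e., R^2 of a simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy ** 2 / (sxx * syy)
```

In use, `x` would hold `compound_parameter(d0, lam)` for each of the 20 amino acids and `y` the corresponding q values.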

Correlation between the stickiness parameters (λ) and the NMR relaxation parameters (q).

The regression line is shown as dashes.

Our method incorporates ideas from a number of previous efforts at describing R2. The first serious effort was by Schwalbe et al., 1997, who accounted for contributions from neighboring residues as additive terms, instead of multiplicative factors as in SeqDYN. Cho et al., 2007 and Delaforge et al., 2018 used the running average of the bulkiness parameter over a window of five to nine residues as a qualitative indicator of R2. Here again the calculation was based on an additive model. Sekiyama et al., 2022 employed a multiplicative model, with R2 calculated as a geometric mean of ‘indices of local dynamics’ over a five-residue window. These indices, akin to our q parameters, were trained on a single IDR (TIA-1 prion-like domain) and used to reproduce the measured R2 for the same IDR. As we have illustrated on CBP-ID4 (Figure 5—figure supplement 1), training on a single protein merely biases the parameters to that model and has little value in predicting R2 for other proteins. In comparison, SeqDYN is trained on 45 IDPs and its predictions are robust and achieve quantitative agreement with measured R2.
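The distinction between additive and multiplicative treatments of neighboring residues can be illustrated with two window operations: a running arithmetic mean, as used with the bulkiness parameter, and a running geometric mean, as used with the indices of local dynamics. This is a schematic sketch; the function names are hypothetical and the truncation of windows at the chain termini is an assumption made here for simplicity.

```python
import math


def window_mean(values, half_width):
    """Additive model: arithmetic mean over a centered window."""
    out = []
    for n in range(len(values)):
        window = values[max(0, n - half_width): n + half_width + 1]
        out.append(sum(window) / len(window))
    return out


def window_geomean(values, half_width):
    """Multiplicative model: geometric mean over a centered window.

    All values must be positive.
    """
    out = []
    for n in range(len(values)):
        window = values[max(0, n - half_width): n + half_width + 1]
        out.append(math.exp(sum(math.log(v) for v in window) / len(window)))
    return out
```

A five-residue window corresponds to `half_width=2`. In both schemes every residue in the window carries equal weight, which differs from a contribution that attenuates with sequence distance.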

Ten of the IDPs tested here have been studied recently by MD simulations using IDP-specific force fields (Dey et al., 2022; Hicks et al., 2020; Yu and Brüschweiler, 2022; Smrt et al., 2023). In Table 2, we compare the RMSEs of SeqDYN predictions with those for R2 calculations from MD simulations. For five of these IDPs: A1-LCD, Aβ40, α-synuclein, tau K18, and FtsQ, RMSEs of SeqDYN and MD are remarkably similar. Four of these IDPs lack significant population of α-helices or β-sheets, but FtsQ forms a stable long helix. For one other IDP, namely HOX-DFD, MD, by explicitly modeling its folded domain, does a much better job in predicting R2 than SeqDYN (RMSEs of 1.40 s–1 vs 1.99 s–1). However, for the four remaining IDPs: p53TAD, Pup, Sev-NT, and ChiZ, SeqDYN significantly outperforms MD, with RMSEs averaging only 0.47 s–1, compared to the MD counterpart of 1.14 s–1. Overall, SeqDYN is very competitive against MD in predicting R2, but without the significant computational cost. While MD simulations can reveal details of local interactions, as noted for α-synuclein, and capture tertiary interactions if they occur, they still suffer from perennial problems of force-field imperfection and inadequate sampling. SeqDYN provides an accurate description of IDP dynamics at a ‘mean-field’ level, but could miss idiosyncratic behaviors of specific local sequences.
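The averages quoted for the four IDPs where SeqDYN outperforms MD follow directly from the Table 2 entries, as the short check below shows. Only the variable names are introduced here; the numbers are those listed in Table 2.

```python
# RMSEs (s^-1) from Table 2 for the four IDPs where SeqDYN outperforms MD.
seqdyn_rmse = {"p53TAD": 0.33, "Pup": 0.43, "Sev-NT": 0.38, "ChiZ": 0.74}
md_rmse = {"p53TAD": 1.04, "Pup": 1.00, "Sev-NT": 1.10, "ChiZ": 1.40}

mean_seqdyn = sum(seqdyn_rmse.values()) / len(seqdyn_rmse)
mean_md = sum(md_rmse.values()) / len(md_rmse)

print(f"SeqDYN mean RMSE: {mean_seqdyn:.3f} s^-1")
print(f"MD mean RMSE:     {mean_md:.3f} s^-1")
```

The SeqDYN mean is 0.47 s–1 and the MD mean is 1.135 s–1, consistent with the rounded values given in the text.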

Table 2
RMSEs (s–1) of R2 predictions by SeqDYN and MD for 10 IDPs.

IDP name       SeqDYN       MD
A1-LCD         0.60 *       0.59 §,¶
Aβ40           0.38 *       0.38 §
HOX-DFD        1.99 *       1.40 §
α-synuclein    0.44 *       0.50 §
p53TAD         0.33 *       1.04 **
Pup            0.43 *       1.00 **
Sev-NT         0.38 *,†     1.10 §,††
tau K18        0.83 *       0.80 §
ChiZ           0.74 ‡       1.40 ‡‡
FtsQ           1.71 †,‡     1.70 §§

  *  Based on leave-one-out training (using 44 IDPs).
  †  Helix boost applied.
  ‡  Based on training by the full training set (45 IDPs).
  §
  ¶  RMSE is scaled down by a factor of 2.39, to correct for the effect of temperature (MD at 288 K; see Figure 3—figure supplement 1C).
  **
  ††  RMSE is scaled down by a factor of 2.99, to correct for the effects of temperature and magnetic field (MD at 274 K and 850 MHz; see Figure 3—figure supplement 1B).
  ‡‡  Originally calculated in Hicks et al., 2020 with correction in Hicks et al., 2021.
  §§

Deep-learning models have become very powerful, but they usually have millions of parameters and require millions of protein sequences for training (Rives et al., 2021). In contrast, SeqDYN employs a mathematical model with dozens of parameters and requires only dozens of proteins for training. Reduced models (by collapsing amino acids into a small number of distinct types) have even been trained on <10 IDPs to predict propensities for binding nanoparticles (Li et al., 2020) or membranes (Qin et al., 2022). The mathematical model-based approach may be useful in other applications where data, similar to R2, are limited, including predictions of IDP secondary chemical shifts or residues that bind drug molecules (Robustelli et al., 2022) or protein targets, or even in protein design, for example for recognizing an antigenic site or a specific DNA site.

Methods

Collection of IDPs with measured R2

Starting from six nonhomologous IDPs in our previous MD study (Dey et al., 2022), we obtained R2 data for eight IDPs from the Biological Magnetic Resonance Data Bank (BMRB; https://bmrb.io); data for two other IDPs were from our collaborators (Hicks et al., 2020; Smrt et al., 2023). Most of the 54 IDPs studied here were from searching the literature. Disorder was judged by dispersion in backbone amide proton chemical shifts, NOE, and SCS. R2 data that were not available from the authors or BMRB were obtained by digitizing R2 plots presented in figures of published papers, using WebPlotDigitizer (https://automeris.io/WebPlotDigitizer; Rohatgi, 2022) and further inspected visually.

Homology of IDPs was checked by sequence alignment using Clustal W (http://www.clustal.org/clustal2; Larkin et al., 2007), and presented as a clock-like tree using the ‘ape’ package (http://ape-package.ird.fr; Paradis et al., 2004). IDPs that had discernible homology with the selected training set were removed. Removed IDPs included HOX-SCR and β-synuclein from our previous MD study (Dey et al., 2022), due to homology with HOX-DFD and α-synuclein, respectively.

Coding for SeqDYN

The training of SeqDYN was coded in Python, similar to our previous work for predicting residue-specific membrane association propensities (ReSMAP; https://zhougroup-uic.github.io/ReSMAPidp/; Qin et al., 2022). The cost function was the sum of mean-squared errors for the IDPs in the training set. We used the least_squares function in scipy.optimize, with Trust Region Reflective as the minimization algorithm and all parameters restricted to the positive range. For the web server (https://zhougroup-uic.github.io/SeqDYNidp/; Qin and Zhou, 2024), we rewrote the prediction code in JavaScript.
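The optimization setup described above might be organized as in the following sketch. It is not the SeqDYN training code: the functional form in `predict_r2` is a stand-in assumed purely for illustration (a product of per-residue factors that decay with sequence separation over a correlation length, with any per-protein scaling omitted), and all function names are hypothetical. The point shown is how residuals can be weighted so that the least-squares objective equals a sum of per-IDP mean-squared errors, and how `least_squares` is called with the Trust Region Reflective method and a lower bound of zero.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def predict_r2(seq, q, l_corr):
    """Stand-in model (assumed form): product over residues of
    1 + (q_i - 1) / (1 + (s / l_corr)**2), with s the sequence separation."""
    profile = []
    for n in range(len(seq)):
        value = 1.0
        for i, aa in enumerate(seq):
            s = i - n
            value *= 1.0 + (q[aa] - 1.0) / (1.0 + (s / l_corr) ** 2)
        profile.append(value)
    return profile


def stacked_residuals(params, dataset):
    """Residuals for all IDPs, each weighted by 1/sqrt(N_residues), so that
    the sum of squared residuals equals the sum of per-IDP MSEs.

    params: [l_corr, q_A, q_C, ..., q_Y] (21 values).
    dataset: list of (sequence, measured_r2) pairs.
    """
    l_corr = params[0]
    q = dict(zip(AMINO_ACIDS, params[1:21]))
    residuals = []
    for seq, measured in dataset:
        predicted = predict_r2(seq, q, l_corr)
        weight = 1.0 / math.sqrt(len(seq))
        residuals.extend(weight * (p - m) for p, m in zip(predicted, measured))
    return residuals


def fit(dataset, x0):
    """Minimize the cost with Trust Region Reflective and positive bounds."""
    # Imported here so the helpers above need only the standard library.
    from scipy.optimize import least_squares
    return least_squares(stacked_residuals, x0, args=(dataset,),
                         method="trf", bounds=(0.0, float("inf")))
```

A call such as `fit(dataset, [1.0] * 21)` would return a SciPy result object whose `x` attribute holds the fitted correlation length and the 20 amino-acid parameters.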

Appendix 1

Sequences of 54 IDPs

Terminal tags and other insertions are underlined.

Training set (45 IDPs)

>A1-LCD (Uniprot ID P04256, residues 186-320; deletion of 258-263)
GSMASASSSQRGRSGSGNFGGGRGGGFGGNDNFGRGGNFSGRGGFGGSRGGGGYGGSGDGYNGFGNDGSNFGGGGNYNNQSSNFGPMKGGNFGGRSSGPYGGGGQYFAKPRNQGGYGGSSSSSSYGSGRRF
>Aβ40 (Uniprot ID Q28053, residues 7-46)
DAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGV
>Ash1 (Uniprot ID P34233, residues 420-500)
GASASSSPSPSTPTKSGKMRSRSSSPVRPKAYTPSPRSPNYHRFALDSPPQSPRRSSNSSITKKGSRRSSGSSPTRHTTRVCV
>Beclin1 (Uniprot ID Q14457, residues 1-150)
MGSSHHHHHHSQDPMEGSKTSNNSTMQVSFVSQRSSQPLKLDTSFKILDRVTIQELTAPLLTTAQAKPGETQEEETNSGEEPFIETPRQDGVSRRFIPPARMMSTESANSFTLIGEASDGGTMENLSRRLKVTGDLFDIMSGQTDVDHPLSEESTDTLLDQLDTY
>CAPRIN1 (Uniprot ID Q14444, residues 607-709)
SRGVSRGGSRGARGLMNGYRGPANGFRGGYDGYRPSFSNTPNSGYTQSQFSAPRDYSGYQRDGYQQNFKRGSGQSGPRGAPRGRGGPPRPNRGMPQMNTQQVN
>CBP-ID4 (Uniprot ID Q92793, residues 1852-2057)
MQQQIQHRLQQAQLMRRRMATMNTRNVPQQSLPSPTSAPPGTPTQQPSTPQTPQPPAQPQPSPVSMSPAGFPSVARTQPPTTVSTGKPTSQVPAPPPPAQPPPAAVEAARQIEREAQQQQHLYRVNINNSMPPGRTGMGTPGSQMAPVSLNVPRPNQVSGPVMPSMPPGQWQQAPLPQQQPMPGLPRPVISMQAQAAVAGPRMPSVQ
>GbnD4-DHD (Uniprot ID A0A808VWJ6, residues 412-482)
MKHHHHHHHHGGLVPRGSHGSDEGVPDALRADTVPRAGPVRYARRRYWIGEARSDALAPAAPLEREPLPAEAMGAYFAIRRTDADDTVAAH
>ERD14 (Uniprot ID P42763, residues 1-185)
MAEEIKNVPEQEVPKVATEESSAEVTDRGLFDFLGKKKDETKPEETPIASEFEQKVHISEPEPEVKHESLLEKLHRSDSSSSSSSEEEGSDGEKRKKKKEKKKPTTEVEVKEEEKKGFMEKLKEKLPGHKKPEDGSAVAAAPVVVPPPVEEAHPVEKKGILEKIKEKLPGYHPKTTVEEEKKDKE
>ExsE (Uniprot ID Q9I322, residues 1-81)
MKIESIPPVQPSQDAGAEAVGHFEGRSVTRAAVRGDDRSSVAGLARWLARNVAGDPRSEQALQRLADGDGTPLEARTVRRREFLEGSS
>FCP1 (Uniprot ID Q9Y5B0, residues 879-961)
PGPEEQEEEPQPRKPGTRRERTLGAPASSERSAAGGRGPRGHKRKLNEEDAASESSRESSNEDEGSSSEADEMAKALEAELNDLM
>FUS (Uniprot ID P35637, residues 1-163)
MASNDYTQQATQSYGAYPTQPGQGYSQQSSQPYGQQSYSGYSQSTDTSGYGQSSYSSYGQSQNTGYGTQSTPQGYGSTGGYGSSQSSQSSYGQQSSYPGYGQQPAPSSTSGSYGSSSQSSSYGQPQSGSYSQQPSYGGQQQSYGQQQSYNPPQGYGQQNQYNS
>GAb1 (Uniprot ID B7Z3B9, residues 510-591)
SSPMIKPKGDKQVEYLDLDLDSGKSTPPRKQKSSGSGSSVADERVDYVVVDQQKTLALKSTREAWTDGRQSTESETPAKSVK
>hACTR (Uniprot ID Q9Y6Q9, residues 1023-1091)
GTQNRPLLRNSLDDLVGPPSNLEGQSDERALLDQLHTLLSNTDATGLEEIDRALGIPELVNQGQALEPK
>Hahellin (Uniprot ID Q2SHN6, residues 162-252)
MGEKTVKLYEDTHFKGYSVELPVGDYNLSSLISRGALNDDLSSARVPSGLRLEVFQHNNFKGVRDFYTSDAAELSRDNDASSVRVSKMETTN
>hCSD1 (Uniprot ID P20810, residues 137-277)
AVPVESKPDKPSGKSGMDAALDDLIDTLGGPEETEEENTTYTGPEVSDPMSSTYIEELGKREVTIPPKYRELLAKKEGITGPPADSSKPIGPDDAIDALSSDFTCGSPTAAGKKTEKEESTEVLKAQSAGTVRSAAPPQEK
>HOX-DFD (Uniprot ID P07548, residues 337-426)
TDGERIIYPWMKKIHVAGVANGSYQPGMEPKRQRTAYTRHQILELEKEFHYNRYLTRRRRIEIAHTLVLSERQIKIWFQNRRMKWKKDNK
>hZIP4-ICL2 (Uniprot ID Q6P5W5, residues 424-498)
GDRGPEFELGTLPRDPEDLEDGPCGHSSHSHGGHSHGVSLQLAPSELRQPKPPHEGSRADLVAEESPELLNPEPRRLSPELRLLPYGHGLSAWSHPQFEK
>Jaburetox (Uniprot ID I1K3K3, residues 230-320)
MGPVNEANCKAAMEIVCRREFGHKEEEDASEGVTTGDPDCPFTKAIPREEYANKYGPTIGDKIRLGDTDLIAEIEKDFALYGDESVFGGGKVIH
>KRS-NT (Uniprot ID Q15046, residues 1-72)
MAAVQAAEVKVDGSEPKLSKNELKRRLKAEKKVAEKEAKQKELSEKQLSQATAAATNHTTDNGVGPEEESVD
>MBP-xα2 (Uniprot ID P04370, residues 172-237)
SIGRFFSGDRGAPKRGSGKDSHTRTTHYGSLPQKSQHGRTQDENPVVHFFKNIVTPRTPPPSQGKGRGLS
>MKK4 (Uniprot ID P45985, residues 1-86)
MAAPSPSGGGGSGGGSGSGTPGPVGSPAPGHPAVSSMQGKRKALKLNFANPPFKSTARFTLNPNPTGVQNPHIERLRTHSIESSGK
>N-Cby (Uniprot ID B0QY54, residues 1-63)
MPFFGNTFSPKKTPPRKSASLSNLHSLDRSTREVELGLEYGSPTMNLAGQSLKFENGQWIAET
>Niv-PNTD (Uniprot ID P0C1C7, residues 1-406)
MDKLELVNDGLNIIDFIQKNQKEIQKTYGRSSIQQPSIKDQTKAWEDFLQCTSGESEQVEGGMSKDDGDVERRNLEDLSSTSPTDGTIGKRVSNTRDWAEGSDDIQLDPVVTDVVYHDHGGECTGYGFTSSPERGWSDYTSGANNGNVCLVSDAKMLSYAPEIAVSKEDRETDLVHLENKLSTTGLNPTAVPFTLRNLSDPAKDSPVIAEHYYGLGVKEQNVGPQTSRNVNLDSIKLYTSDDEEADQLEFEDEFAGSSSEVIVGISPEDEEPSSVGGKPNESIGRTIEGQSIRDNLQAKDNKSTDVPGAGPKDSAVKEEPPQKRLPMLAEEFECSGSEDPIIRELLKENSLINCQQGKDAQPPYHWSIERSISPDKTEIVNGAVQTADRQRPGTPMPKSRGIPIKK
>NS5A-D2D3 (Uniprot ID O92972, residues 2163-2419)
GHMASGSLRGGEPEPDVTVLTSMLTDPSHITAETAKRRLARGSPPSLASSSASQLSAPSLKATCTTHHDSPDADLIEANLLWRQEMGGNITRVESENKVVILDSFEPLHADGDEREISVAAEILRKSRKFPSALPIWARPDYNPPLLESWKDPDYVPPVVHGCPLPPTKAPPIPPPRRKRTVVLTESNVSSALAELATKTFGSSGSSAVDSGTATALPDQASDDGDKGSDVESYSSMPPLEGEPGDPDLSDGSWSTVSEEASEDVVCC
>NUPR1 (Uniprot ID O60356, residues 2-82)
MRGSHHHHHHGSATFPPATSAPQQPPGPEDEDSSLDESDLYSLAHSYLGGGGRKGRTKREAAANTNRPSPGGHERKLVTKLQNSERKKRGARR
>OPN (Uniprot ID F1NSM8, residues 46-264)
MHQDHVDSQSQEHLQQTQNDLASLQQTHYSSEENADVPEQPDFPDVPSKSQETVDDDDDDDNDSNDTDESDEVFTDFPTEAPVAPFNRGDNAGRGDSVAYGFRAKAHVVKASKIRKAARKLIEDDATTEDGDSQPAGLWWPKESREQNSRELPQHQSVENDSRPKFDSREVDGGDSKASAGVDSRESQGSVPAVDASNQTLESAEDAEDRHSIENNEVTR
>p53TAD (Uniprot ID P04637, residues 1-71)
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPRV
>PDEγ (Uniprot ID P61248, residues 1-87)
MNLEPPKAEIRSATRVMGGPVTPRKGPPKFKQRQTRQFKSKPPKKGVQGFGDDIPGMEGLGTDITVIAPWEAFNHLELHELAQYGII
>PKIα (Uniprot ID P61925, residues 2-76)
TDVETTYADFIASGRTGRRNAIHDILVSSASGNSNELALKLAGLDINKTEGEEDAQRSSTEQSGEAQGEAAKSES
>Mev-PNTD (Uniprot ID P03422, residues 1-304)
MAEEQARHVKNGLECIRALKAEPIGSLAIEEAMAAWSEISDNPGQERATCREEKAGSSGLSKPCLSAIGSTEGGAPRIRGQGPGESDDDAETLGIPPRNLQASSTGLQCYYVYDHSGEAVKGIQDADSIMVQSGLDGDSTLSGGDNESENSDVDIGEPDTEGYAITDRGSAPISMGFRASDVETAEGGEIHELLRLQSRGNNFPKLGKTLNVPPPPDPGRASTSGTPIKKGTERRLASFGTEIASLLTGGATQCARKSPSEPSGPGAPAGNVPECVSNAALIQEWTPESGTTISPRSQNNEEGG
>ProTα (Uniprot ID P06454, residues 1-111)
GPMSDAAVDTSSEITTKDLKEKKEVVEEAENGRDAPANGNAENEENGEQEADNEVDEEEEEGGEEEEEEEEGDGEEEDGDEDEEAESATGKRAAEDDEDDDVDTKKQKTDEDD
>Pup (Uniprot ID P9WHN4, residues 1-64)
MAQEQTKRGGGGGDDDDIAGSTAAGQERREKLTEETDDLLDEIDDVLEENAEDFVRAYVQKGGQ
>rmBG21 (Uniprot ID P04370, residues 2-190)
GNHSGKRELSAEKASKDGEIHRGEAGKKRSVGKLSQTASEDSDVFGEADAIQNNGTSAEDTAVTDSKHTADPKNNWQGAHPADPGNRPHLIRLFSRDAPGREDNTFKDRPSESDELQTIQEDPTAASGGLDVMASQKRPSQRSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFSGDRGAPKRGSGKVSLEHHHHHH
>RPB1 (Uniprot ID P24928, residues 1773-1970)
GHMSPNYTPTSPNYSPTSPSYSPTSPSYSPTSPSYSPSSPRYTPQSPTYTPSSPSYSPSSPSYSPASPKYTPTSPSYSPSSPEYTPTSPKYSPTSPKYSPTSPKYSPTSPTYSPTTPKYSPTSPTYSPTSPVYTPTSPKYSPTSPTYSPTSPKYSPTSPTYSPTSPKGSTYSPTSPGYSPTSPTYSLTSPAISPDDSDEEN
>securin (Uniprot ID O95997, residues 1-202)
MATLIYVDKENGEPGTRVVAKDGLKLGSGPSIKALDGRSQVSTPRFGKTFDAPPALPKATRKALGTVNRATEKSVKTKGPLKQKQPSFSAKKMTEKTVKAKSSVPASDDAYPEIEKFFPFNPLDFESFDLPEEHQIAHLPLSGVPLMILDEERELEKLFQLGPPSPVKMPSPPWESNLLQSPSSILSTLDVELPPVCCDIDI
>Sev-NT (Uniprot ID Q07097, residues 401-524)
LSGGDGAYHEPTGGGAIEVALDNADIDLETEAHADQDARGWGGESGERWARQVSGGHFVTLHGAERLEEETNDEDVSDIERRIAMRLAERRQEDSATHGDEGRNNGVDHDEDDDAAAVAGIGGI
>Sic1 (Uniprot ID P38634, residues 1-90)
GSMTPSTPPRSRGTRYLAQPSGNTSSSALMQGQKTPQKPSQNLVPVTPSTTKSFKNAPLLAPPNSNMGMTSPFNGLTSPQRSPFPKSSVKRT
>SKIPN (Uniprot ID G3V5R3, residues 59-129)
GDGGAFPEIHVAQYPLDMGRKKKMSNALAIQVDSEGKIKYDAIARQGQSKDKVIYSKYTDLVPKEVMNADD
>SLBP-NT (Uniprot ID Q9VAN6, residues 17-108)
MGSSHHHHHHSSGLVPRGSHMGSGSLNSSASSISIDVKPTMQSWAQEVRAEFGHSDEASSSLNSSAASCGSLAKKETADGNLESKDGEGREMAFEFLDGVNEVKFERLVKEEK
>α-synuclein (Uniprot ID P37840, residues 1-140)
MDVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSKTKEGVVHGVATVAEKTKEQVTNVGGAVVTGVTAVAQKTVEGAGSIAAATGFVKKDQLGKNEEGAPQEGILEDMPVDPDNEAYEMPSEEGYQDYEPEA
>SOCS5-JIR (Uniprot ID A0A5E4BAI0, residues 12-81)
RSLRQRLQDTVGLCFPMRTYSKQSKPLFSNKRKIHLSELMLEKCPFPAGSDLAQKWHLIKQHTAPVSPHS
>tau K18 (Uniprot ID Q9MYX8, residues 186-314)
QTAPVPMPDLKNVKSKIGSTENLKHQPGGGKVQIINKKLDLSNVQSKCGSKDNIKHVPGGGSVQIVYKPVDLSKVTSKAGSLGNIHHKPGGGQVEVKSEKLDFKDRVQSKIGSLDNITHVPGGGNKKIE
>TC1 (Uniprot ID Q9NR00, residues 1-106)
MKAKRSHQAIIMSTSLRVSPSIHGYHFDTASRKKAVGNIFENTDQESLERLFRNSGDKKAEERAKIIFAIDQDVEEKTRALMALKKRTKDKLFQFLKLRKYSIKVH
>TDP-43 (Uniprot ID Q13148, residues 267-414)
GHMNRQLERSGRFGGNPGGFGNQGGFGNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWGMMGMLASQQNQSGPSGNNQNQGNMQREPNQAFGSGNNSYSGSNSGAAIGWGSASNAGSGSGFNGGFGSSMDSKSSGWGM
>γ-tubulin-CT (Uniprot ID P53378, residues 439-473)
LLRGAAEQDSYLDDVLVDDENMVGELEEDLDADGDHKLV

Test set (9 IDPs)

>AMOTL1 (Uniprot ID Q8IY63, residues 178-384)
STQPQQNNEELPTYEEAKAQSQFFRGQQQQQQQQGAVGHGYYMAGGTSQKSRTEGRPTVNRANSGQAHKDEALKELKQGHVRSLSERIMQLSLERNGAKQHLPGSGNGKGFKVGGGPSPAQPAGKVLDPRGPPPEYPFKTKQMMSPVSKTQEHGLFYGDQHPGMLHEMVKPYPAPQPVRTDVAVLRYQPPPEYGVTSRPCQLPFPST
>CAHS-8 (Uniprot ID P0CU50, residues 1-227)
MSGRNVESHMERNEKVVVNNSGHADVKKQQQQVEHTEFTHTEVKAPLIHPAPPIISTGAAGLAEEIVGQGFTASAARISGGTAEVHLQPSAAMTEEARRDQERYRQEQESIAKQQEREMEKKTEAYRKTAEAEAEKIRKELEKQHARDVEFRKDLIESTIDRQKREVDLEAKMAKRELDREGQLAKEALERSRLATNVEVNFDSAAGHTVSGGTTVSTSDKMEIKRNENLYFQ
>ChiZ (Uniprot ID I6YA32, residues 1-64)
MTPVRPPHTPDPLNLRGPLDGPRWRRAEPAQSRRPGRSRPGGAPLRYHRTGVGMSRTGHGSRPV
>α-endosulfine (Uniprot ID O43768, residues 1-121)
MSQKQEEENPAEETGEEKQDTQEKEGILPERAEEAKLKAKYPSLGQKPGGSDFLMKRLQKGQKYFDSGDYNMAKAKMKNKQLPSAGPDKNLVTGDHIPTPQDLPQRKSSLVTSKLAGGQVE
>FtsQ (Uniprot ID Q8IY63, residues 1-99)
MTEHNEDPQIERVADDAADEEAVTEPLATESKDEPAEHPEFEGPRRRARRERAERRAAQARATAIEQARRAAKRRARGQIVSEQNPAKPAARGVVRGLK
>Pdx1 (Uniprot ID P52945, residues 204-283)
GPGEEDKKRGGGTAVGGGGVAEPEQDCAVTSGEELLALPPPPPPGGAVPPAAPVAAREGRLPPGLSASPQPSSVAPRRPQEPR
>synaptobrevin-2 (Uniprot ID P63027, residues 1-96)
MSATAATAPPAAPAGEGGPPAPPPNLTSNRRLQQTQAQVDEVVDIMRVNVDKVLERDQKLSELDDRADALQAGASQFETSAAKLKRKYWWKNLKMM
>TIA-1 (Uniprot ID P31483, residues 320-386)
MGSSHHHHHHHHHHHHSENLYFQGGQYVPNGWQVPAYGVYGQPWSQQGFNQTQSSAPWMGPNYSVPPPQGQNGSMLPSQPAGYRVAGYETQ
>YAP (Uniprot ID P46937, residues 50-171)
AGHQIVHVRGDSETDLEALFNAVMNPKTANVPQTVPMRLRKLPDSFFKPPEPKSHSRQASTDAGTAGALTPQHVRAHSSPASLQLGAVSPGTLTPTGVVSGPAATPTAQHLRQSSFEIPDDV
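The sequence records above follow a FASTA-like layout with a structured header, which can be read programmatically. The parser below is a minimal sketch with hypothetical names; it assumes every header follows the pattern `>name (Uniprot ID X, residues a-b)` with an optional note after a semicolon, and it does not attempt to recover the underlining that marks tags and insertions.

```python
import re

HEADER = re.compile(
    r"^>(?P<name>.+?) \(Uniprot ID (?P<uniprot>\w+), "
    r"residues (?P<start>\d+)-(?P<end>\d+)(?:; (?P<note>[^)]*))?\)$"
)


def parse_records(text):
    """Parse FASTA-like records into dicts with name, uniprot, start, end,
    note (None if absent), and sequence."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            match = HEADER.match(line)
            if match is None:
                raise ValueError(f"unrecognized header: {line}")
            record = match.groupdict()
            record["start"] = int(record["start"])
            record["end"] = int(record["end"])
            record["sequence"] = ""
            records.append(record)
        elif records:
            # Sequence lines are appended to the most recent header.
            records[-1]["sequence"] += line
    return records
```

Because terminal tags and other insertions are included in the listed sequences, the sequence length need not equal the length implied by the residue range.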

Data availability

All data generated or analyzed during this study are included in the manuscript and supplementary files; source data have been provided for Figures 3–9, Figure 3—figure supplement 1, Figure 4—figure supplement 1, and Figure 5—figure supplement 1.


Article and author information

Author details

  1. Sanbo Qin

    Department of Chemistry, University of Illinois Chicago, Chicago, United States
    Contribution
    Resources, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology
    Competing interests
    No competing interests declared
  2. Huan-Xiang Zhou

    1. Department of Chemistry, University of Illinois Chicago, Chicago, United States
    2. Department of Physics, University of Illinois Chicago, Chicago, United States
    Contribution
    Conceptualization, Resources, Data curation, Formal analysis, Supervision, Funding acquisition, Validation, Investigation, Visualization, Methodology, Writing - original draft, Project administration, Writing - review and editing
    For correspondence
    hzhou43@uic.edu
    Competing interests
    No competing interests declared
    ORCID: 0000-0001-9020-0302

Funding

National Institutes of Health (GM118091)

  • Huan-Xiang Zhou

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

This work was supported by Grant GM118091 from the National Institutes of Health.


Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.88958. This DOI represents all versions, and will always resolve to the latest one.

Copyright

© 2023, Qin and Zhou

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.


Cite this article

  1. Sanbo Qin
  2. Huan-Xiang Zhou
(2024)
Predicting the sequence-dependent backbone dynamics of intrinsically disordered proteins
eLife 12:RP88958.
https://doi.org/10.7554/eLife.88958.3
