Predicting the sequence-dependent backbone dynamics of intrinsically disordered proteins
eLife assessment
In this useful study, a solid machine learning approach based on a broad set of systems to predict the R2 relaxation rates of residues in intrinsically disordered proteins (IDPs) is described. The ability to predict the patterns of R2 will be helpful to guide experimental studies of IDPs. A potential weakness is that the predicted R2 values may include both fast and slow motions, thus the predictions provide only limited new physical insights into the nature of the underlying protein dynamics, such as the most relevant timescale.
https://doi.org/10.7554/eLife.88958.3.sa0Useful: Findings that have focused importance and scope
- Landmark
- Fundamental
- Important
- Valuable
- Useful
Solid: Methods, data and analyses broadly support the claims with only minor weaknesses
- Exceptional
- Compelling
- Convincing
- Solid
- Incomplete
- Inadequate
During the peer-review process the editor and reviewers write an eLife Assessment that summarises the significance of the findings reported in the article (on a scale ranging from landmark to useful) and the strength of the evidence (on a scale ranging from exceptional to inadequate). Learn more about eLife Assessments
Abstract
How the sequences of intrinsically disordered proteins (IDPs) code for functions is still an enigma. Dynamics, in particular residue-specific dynamics, holds crucial clues. Enormous efforts have been spent to characterize residue-specific dynamics of IDPs, mainly through NMR spin relaxation experiments. Here, we present a sequence-based method, SeqDYN, for predicting residue-specific backbone dynamics of IDPs. SeqDYN employs a mathematical model with 21 parameters: one is a correlation length and 20 are the contributions of the amino acids to slow dynamics. Training on a set of 45 IDPs reveals aromatic, Arg, and long-branched aliphatic amino acids as the most active in slow dynamics whereas Gly and short polar amino acids as the least active. SeqDYN predictions not only provide an accurate and insightful characterization of sequence-dependent IDP dynamics but may also serve as indicators in a host of biophysical processes, including the propensities of IDP sequences to undergo phase separation.
Introduction
Intrinsically disordered proteins (IDPs) or regions (IDRs) do not have the luxury of a three-dimensional structure to help decipher the relationship between sequence and function. Instead, dynamics has emerged as a crucial link between sequence and function for IDPs (Dey et al., 2022). Nuclear magnetic resonance (NMR) spin relaxation is a uniquely powerful technique for characterizing IDP dynamics, capable of yielding residue-specific information (Camacho-Zarco et al., 2022). Backbone 15N relaxation experiments typically yield three parameters per residue: transverse relaxation rate (), longitudinal relaxation rate (), and steady-state heteronuclear Overhauser enhancement (NOE). While all three parameters depend on ps-ns dynamics, is the one most affected by slower dynamics (10 s of ns to 1 μs). An increase in either the timescale or the amplitude of slower dynamics results in higher . For IDPs, is also the parameter that exhibits the strongest dependence on sequence (Dey et al., 2022; Camacho-Zarco et al., 2022).
was noted early on as an important indicator of residual structure in the unfolded state of the structured protein lysozyme (Klein-Seetharaman et al., 2002). This property has since been measured for many IDPs to provide insight into various biophysical processes. Just as the residual structure in the unfolded state biases the folding pathway of lysozyme (Klein-Seetharaman et al., 2002), a nascent α-helix in the free state of Sendai virus nucleoprotein C-terminal domain (Sev-NT), as indicated by highly elevated (Abyzov et al., 2016), biases the coupled binding and folding pathway in the presence of its target phosphoprotein (Schneider et al., 2015). Local secondary structure preformation also facilitates the binding of yes-associated protein (YAP) with its target transcription factor (Feichtinger et al., 2022). Likewise a correlation has been found between in the free state and the membrane binding propensity of synaptobrevin-2: residues with elevated have increased propensity for membrane binding (Lakomek et al., 2019). in the free state has also been used to uncover factors that promote liquid-liquid phase separation of IDPs. For example, a nascent α-helix (shown by elevated ) is important for the phase separation of the TDP-43 low-complexity domain, as both the deletion of the helical region and a helix-breaking mutation (Ala to Pro) abrogates phase separation (Conicella et al., 2016). Similarly, nascent α-helices in the free state of cytosolic abundant heat-soluble 8 (CAHS-8), upon raising concentration and lowering temperature stabilize to form the core of fibrous gels (Malki et al., 2022). For the hnRNPA1 low-complexity domain (A1-LCD), aromatic residues giving rise to local peaks in also mediate phase separation (Martin et al., 2020).
Both NMR relaxation data and molecular dynamics (MD) simulations have revealed determinants of for IDPs. It has been noted that the flexible Gly tends to lower , whereas secondary structure and contact formation tend to raise (Cook et al., 2019). This conclusion agrees well with recent MD simulations (Dey et al., 2022; Hicks et al., 2020; Yu and Brüschweiler, 2022; Smrt et al., 2023). These MD studies, using IDP-specific force fields, are able to predict in quantitative agreement with NMR measurements, without ad hoc reweighting as done in earlier studies. According to MD, most contact clusters are formed by local sequences, within blocks of up to a dozen or so residues (Dey et al., 2022; Hicks et al., 2020; Smrt et al., 2023). Tertiary contacts can also form but are relatively rare; as such their accurate capture requires extremely extensive sampling and still poses a challenge for MD simulations. Contrary to Gly, aromatic residues have been noted as mediators of contact clusters (Klein-Seetharaman et al., 2002; Martin et al., 2020).
Schwalbe et al., 1997 introduced a mathematical model to describe the profile along the sequence for lysozyme in the unfolded state. The value of a given residue was expressed as the sum of contributions from this residue and its neighbors. This model yields a mostly flat profile across the sequence, except for a falloff at the termini, resulting in an overall bell shape. Klein-Seetharaman et al., 2002 then fit peaks above this flat profile as a sum of Gaussians. Cho et al., 2007 proposed bulkiness as a qualitative indicator of backbone dynamics. Recently Sekiyama et al., 2022 calculated as the geometric mean of ‘indices of local dynamics’; the latter were parameterized by fitting to the measured for a single IDP. All these models merely describe the profile of a given IDP, and none of them is predictive.
Here, we present a method, SeqDYN, for predicting of IDPs. Using a mathematical model introduced by Li et al., 2020 to predict propensities for binding nanoparticles and also adapted for predicting propensities for binding membranes (Qin et al., 2022), we express the value of a residue as the product of contributing factors from all residues. The contributing factor attenuates as the neighboring residue becomes more distant from the central residue. The model, after training on a set of 45 IDPs, has prediction accuracy that is competitive against that of the recent MD simulations using IDP-specific force fields (Dey et al., 2022; Hicks et al., 2020; Yu and Brüschweiler, 2022; Smrt et al., 2023). For lysozyme and other structured proteins, the SeqDYN prediction agrees remarkably well with measured in their unfolded state.
Results
The data set of IDPs with R2 rates
We collected R2 data for a total of 54 nonhomologous IDPs or IDRs (Table 1; Figure 1). According to indicators from NMR properties, including low or negative NOEs, narrow dispersion in backbone amide proton chemical shifts, and small secondary chemical shifts (SCSs), most of the proteins are disordered with at most transient α-helices. A few are partially folded, including Sev-NT with a well-populated (~80%) long helix (residues 478–491; Jensen et al., 2008), CREB-binding protein fourth intrinsically disordered linker (CBP-ID4) with >50% propensities for two long helices (residues 2–25 and 101–128; Piai et al., 2016), HOX transcription factor DFD (HOX-DFD) with a well-folded domain comprising three helices (Maiti et al., 2019), and Hahellin (apo form) as a molten globule (Patel et al., 2014). In Figure 2, we display representative conformations of five IDPs, ranging from fully disordered MAPK kinase 4 (MKK4; Delaforge et al., 2018) and α-synuclein (Sung and Eliezer, 2007) to Measles virus phosphoprotein N-terminal domain (Mev-PNTD; Milles et al., 2018) with transient short helices to Sev-NT and CBP-ID4 with stable long helices. The sequences of all the IDPs are listed in Appendix 1.
We used 45 of the 54 IDPs to train and validate SeqDYN and reserved the remaining 9 for testing. The sequence lengths of the training set range from 39 to 406 residues, with an average of 125.3 residues. Altogether data are available for 3966 residues. A large majority (35 out of 45) of the 45 IDPs have mean values (, calculated among all the residues in a protein) between 2.5 and 5.5 s–1 (Table 1 and Figure 3A). This range is much lower than that of structured proteins with similar sequence lengths. The low values and lack of dependence on sequence length (Figure 3—figure supplement 1A) suggest that of the IDPs is mostly dictated by local sequence instead of tertiary interaction.
The most often used temperature for acquiring the data was 298 K, but low temperatures (277–280 K) were used in a few cases (Table 1 and Figure 3—figure supplement 1B). Of the seven IDPs with > 6.4 s–1, four can be attributed to low temperatures (Sólyom et al., 2015; Martin et al., 2016; Janke et al., 2018; Jenner et al., 2018), one is due to a relatively low temperature (283 K) as well as the presence of glycerol (20% v/v; Bafaro et al., 2019), and two can be explained by tertiary structure formation [a folded domain (Maiti et al., 2019) or molten globule (Patel et al., 2014)]. A simple reason for higher values at lower temperatures is the higher water viscosity, resulting in a slowdown in molecular tumbling; a similar effect is achieved by adding glycerol. In some cases, was measured at both low and room temperatures (Abyzov et al., 2016; Martin et al., 2020). To a good approximation, the effect of lowering temperature is a uniform scaling of across the IDP sequence. For Sev-NT, downscaling of the values at 278 K by a factor of 2.0 brings them into close agreement with those at 298 K (Figure 3—figure supplement 1C), with a root-mean-square-deviation (RMSD) of 0.5 s–1 among all the residues. Likewise, for A1-LCD, downscaling by a factor of 2.4 brings the values at 288 K into good match with those at 298 K (Figure 3—figure supplement 1D), with an RMSD of 0.4 s–1. Because SeqDYN is concerned with the sequence dependence of , a uniform scaling has no effect on model parameter or prediction; therefore mixing the data from different temperatures is justified. The same can be said about the different magnetic fields in acquiring the data (Table 1 and Figure 3—figure supplement 1E). Increasing the magnetic field raises values, and the effect is also approximated well by a uniform scaling (Abyzov et al., 2016; Conicella et al., 2016; Janke et al., 2018).
One measure on the level of sequence dependence of is the standard deviation, , calculated among the residues of an IDP. Among the training set, the values of 30 IDPs have moderate sequence variations, with ranging from 0.5 to 1.5 s–1 (Table 1); the histogram of calculated for the entire training set peaks around 0.75 s–1 (Figure 3A). There is a moderate correlation between and (Figure 3A, inset), reflecting in part the fact that can be raised simply by a uniform upscaling, for example as a result of lowering temperature. Still, only two of the five IDPs with high attributable to lower temperature or presence of glycerol are among the seven IDPs with high sequence variations (>2 s–1). Therefore, the sequence variation of as captured by manifests mostly the intrinsic effect of the IDP sequence, not the influence of external factors such as temperature or magnetic field strength. The mean value among the training set is 1.24 s–1.
One way to eliminate the influence of external factors is to scale the values of each IDP by its ; we refer to the results as scaled , or . We then pooled the values for all residues in the training set, and separated them according to amino-acid types. The amino acid type-specific mean values, or , are displayed in Figure 3B. The seven amino acids with the highest in descending order are Trp, Arg, Tyr, Phe, Ile, His, and Leu. The presence of all the four aromatic amino acids in this “high-end” group immediately suggests π-π stacking as important for raising ; the presence of Arg further implicates cation-π interactions. In the other extreme, the seven amino acids with the lowest in ascending order are Gly, Cys, Val, Asp, Ser, Thr, and Asn. Gly is well-known as a flexible residue; it is also interesting that all the four amino acids with short polar sidechains are found in this “low-end” group. Pro has an excessively low [with data from only two IDPs (Murrali et al., 2018; Wong et al., 2020)], but that is due to the absence of an amide proton.
The SeqDYN model and parameters
The null model is to assume a uniform for all the residues in an IDP. The root-mean-square-error (RMSE) of the null model is equal to the standard deviation, , of the measured values. The mean RMSE, , of the null model, equal to 1.24 s–1 for the training set, serves as the upper bound for evaluating the errors of predictors. The next improvement is a one-residue predictor, where first each residue (with index n) assumes its amino acid-specific mean () and then a uniform scaling factor is applied:
This one-residue model does only minutely better than the null model, with a of 1.22 s–1.
In SeqDYN, we account for the influence of neighboring residues. Specifically, each residue i contributes a factor to the value of residue n. Therefore,
where N is the total number of residues in the IDP. The contributing factor depends on the sequence distance and the amino-acid type of residue :
There are 21 global parameters. The first 20 are the values, one for each of the 20 types of amino acids; the last parameter is , appearing in the Lorentzian form of the sequence-distance dependence. We define the correlation length, , as the sequence distance at which the contributing factor is midway between the values at and . It is easy to verify that . Note that the single-residue model can be seen as a special case of SeqDYN, with set to 0 and set to .
The functional forms of Equation 2a and Equation 2b were adapted from Li et al., 2020; we also used them for predicting residue-specific membrane association propensities of IDPs (Qin et al., 2022). In these previous applications, a linear term was also present in the denominator of Equation 2b. In our initial training of SeqDYN, the coefficient of the linear term always converged to near zero. We thus eliminated the linear term. In addition to the Lorentzian form, we also tested a Gaussian form for the sequence-distance dependence and found somewhat worse performance. The more gradual attenuation of the Lorentzian form with increasing sequence distance evidently provides an overall better model for the data in the entire training set. Others (Cho et al., 2007; Sekiyama et al., 2022; Delaforge et al., 2018) have modeled as the average of some parameters over a window; a window has an extremely abrupt sequence-distance dependence (1 for and 0 for ).
We parametrized the SeqDYN model represented by Equation 2a and Equation 2b on the training set of 45 IDPs. In addition to the 21 global parameters noted above, there are also 45 local parameters, namely one uniform scaling factor () per IDP. The parameter values were selected to minimize the sum of the mean-square-errors for the IDPs in the training set, calculated on data for a total of 3924 residues. We excluded the 42 Pro residues in the training set because, as already noted, their values are lower for chemical reasons. We will present validation and test results below, but first let us look at the parameter values.
The values are displayed in Figure 3B alongside . In descending order, the seven amino acids with the highest values are Trp, Ile, Tyr, Arg, His, Phe, and Leu. These are exactly the same amino acids in the high-end group for , though their order there is somewhat different. In ascending order, the seven amino acids (excluding Pro) with the lowest values are Gly, Asn, Ser, Asp, Val, Thr, and Cys. The composition of the low-end group is also identical to that for . The values thus also suggest that π-π and cation-π interactions in local sequences may raise , whereas Gly and short-polar residues may lower .
Given the common amino acids at both the high and low ends for and , it is not surprising that these two properties exhibit a strong correlation, with a coefficient of determination (R2; excluding Pro) at 0.92 (Figure 4A). Also, because the high-end group contains the largest amino acids (e.g. Trp and Tyr) whereas the low-end group contains the smallest amino acids (e.g. Gly and Ser), we anticipated some correlation of and with amino-acid size. We measure the latter property by the molecular mass (m). As shown in Figure 4B, both and indeed show a medium correlation with m, with R2=0.67 (excluding Pro) and 0.61, respectively. A bulkiness parameter was proposed as an indicator of sequence-dependent backbone dynamics of IDPs (Cho et al., 2007; Delaforge et al., 2018). Bulkiness was defined as the sidechain volume-to-length ratio, and identified amino acids with aromatic or branched aliphatic sidechains as bulky (Zimmerman et al., 1968). We found only modest correlations between either or and bulkiness, with R2 just below 0.4 (Figure 4C).
The optimized value of is 3.164 × 10–2, corresponding to an of 5.6 residues. The resulting optimized is 0.95 s–1, a clear improvement over the value 1.24 s–1 of the null model. To check the sensitivity of prediction accuracy to , we set to values corresponding to = 0, 1, 2,…, and retrained SeqDYN for fixed at each value (Figure 4D). Note that the null-model , 1.24 s–1, sets an upper bound. This upper bound is slowly reached when is increased from the optimal value. In the opposite direction, when is decreased from the optimal value, rises quickly, reaching 1.22 s–1 at = 0. The latter is the same as that of the single-residue model. Lastly we note that there is a strong correlation between the uniform scaling factors and values among the 45 IDPs (R2=0.77), as to be expected. For 39 of the 45 IDPs, values fall in the range of 0.8–2.0 s–1.
As presented next, we evaluate the performance of SeqDYN by leave-one-out cross validation, where each IDP in turn was left out of the training set and the model was trained on the remaining 44 IDPs to predict for the IDP that was left out. The parameters from the leave-one-out (also known as jackknife) training sessions allow us to assess the potential bias of the training set. For this purpose, we compare the values of the 21 global parameters, either from the full training set or from taking the averages of the jackknife training sessions. For each of the parameters, the values from these two methods differ only in the fourth digit; for example for Leu, they are both 1.1447 from full training and from jackknife training. The values for are 3.164×10–2 from full training as stated above and 3.163×10–2 from jackknife training. The close agreement in parameter values between full training and jackknife training suggests no significant bias in the training set.
Another question of interest is whether the difference between the parameters of two amino acids is statistically significant. To answer this question, we carried out fivefold cross-validation training, resulting in five independent estimates for each parameter. For example, the mean ±standard deviation of the parameter is 1.1405 ± 0.0066 for Leu and 1.2174 ± 0.0211 for Ile. A t-test shows that their difference is extremely statistically significant (P<0.0001). In contrast, the difference between Leu and Phe (=1.1552 ± 0.0304) is not significant. t-test results for other pairs of amino acids are found in Figure 4—figure supplement 1.
Validation of SeqDYN predictions
We now present leave-one-out cross-validation results. We denote the RMSE of the prediction for the left-out IDP as RMSE(–1). As expected, RMSE(–1) is higher than the RMSE obtained with the IDP kept in the training set, but the increases are generally slight. Specifically, all but eight of the IDPs have increases <0.1 s–1; the largest increase is 0.35 s–1, for CBP-ID4. The mean RMSE(–1), or , for the 45 IDPs is increased by 0.05 s–1 over , to 1.00 s–1. The latter value is still a distinct improvement over the mean RMSE 1.24 s–1 of the null model. The histogram of RMSE(–1) for the 45 IDPs is shown in Figure 5A. It peaks at 0.5 s–1, which is a substantial downshift from the corresponding peak at 0.75 s–1 for (Figure 3A). Thirty-four of the 45 IDPs have RMSE(–1) values lower than the corresponding .
To further illustrate the performance of SeqDYN, we present the comparison of predicted and measured values for five IDPs: MKK4, α-synuclein, Mev-PNTD, Sev-NT, and CBP-ID4 (Figure 5B–F). A simple common feature is the falloff of at the N- and C-termini, resulting from missing upstream or downstream residues that otherwise would be coupled to the terminal residues, as first recognized by Schwalbe et al., 1997. Representative conformations of the five IDPs are displayed in Figure 2, with residues colored according to the predicted values. For four of these IDPs, the RMSE(–1) values range from 0.44 to 0.76 s–1 and are scattered around the peak of the histogram, while the RMSE(–1) for the fifth IDP, namely CBP-ID4, the RMSE(–1) value is 2.01 s–1 and falls on the tail of the histogram (Figure 5A). Figure 5B displays the measured and predicted for MKK4. SeqDYN correctly predicts higher values in the second half of the sequence than in the first half. It even correctly predicts the peak around residue Arg75. The sequence in this region is H72IERLRTH79; six of these eight residues belong to the high-end group. In contrast, the lowest values occur in the sequence S7GGGGSGGGSGSG19, comprising entirely of two amino acids in the low-end group.
values for α-synuclein are shown in Figure 5C. Here, SeqDYN correctly predicts higher near the C-terminus and a dip around Gly68. However, it misses the peaks around Tyr39 and Asp121. MD simulations Dey et al., 2022 have found that these peaks can be explained by a combination of secondary structure formation (β-sheet around Tyr39 and polyproline II helix around Asp121) and local (between Tyr39 and Ser42) or long-range (between Asp121 and Lys96) interactions. SeqDYN cannot account for long-range interactions (e.g. between β-strands and between Asp121 and Lys96). Figure 5D shows that SeqDYN gives excellent predictions for Mev-PNTD. It correctly predicts the high peaks around Arg17, Glu31, Leu193, and lower peaks around Arg235 and Trp285, but does underpredict the narrow peak around Tyr113.
The overall profile of Sev-NT is predicted well by SeqDYN, but the peak in the long helical region (residues 478–491) is severely underestimated (green curve in Figure 5E). A similar situation occurs for CBP-ID4, where the peak in the second long helical region (around Glu113) is underpredicted (green curve in Figure 5F). While the measured exhibits a higher peak in the second helical region than in the first helical region (around Arg16), the opposite is predicted by SeqDYN. When the data were included in the training set (i.e., full training), the second peak is higher than the first one, but that is not a real prediction because the data themselves were used for training the model. It merely means that the SeqDYN functions can be parameterized to produce any prescribed profile along the sequence. Indeed, when the data of CBP-ID4 alone were used to parameterize SeqDYN, the measured profile is closely reproduced (Figure 5—figure supplement 1). The reversal in peak heights between the two helical regions is the reason for the aforementioned unusual increase in RMSE when CBP-ID4 was left out of the training set.
R2 boost in long helical regions
It is apparent that SeqDYN underestimates the of stable long helices. Transient short helices does not seem to be a problem, since these are present, for example in Mev-PNTD, where transient helix formation in the first 37 residues and between residues 189–198 (Milles et al., 2018) coincides with peaks that are correctly predicted by SeqDYN. SeqDYN can treat coupling between residues within the correlation length of 5.6 residues, but a much longer helix would tumble more slowly than implied by an of 5.6, and thus it makes sense that SeqDYN would underestimate in that case.
Our solution then is to apply a boost factor to the long helical region. To do so, we have to know whether an IDP does form long helices and if so what the constituent residues are. Secondary structure predictors tend to overpredict α-helices and β-strands for IDPs, as they are trained on structured proteins. One way to counter that tendency is to make the criteria for α-helices and β-strands stricter. We found that, by filtering PsiPred (http://bioinf.cs.ucl.ac.uk/psipred; McGuffin et al., 2000) helix propensity scores () with a very high cutoff of 0.99, the surviving helix predictions usually correspond well with residues identified by NMR as having high helix propensities. For example, for Mev-PNTD, PsiPred plus filtering predicts residues 14–17, 28–33, and 191–193 as helical; all of them are in regions that form transient helices according to chemical shifts (Milles et al., 2018). Likewise long helices are also correctly predicted for Sev-NT (residues 477–489) and CBP-ID4 (residues 6–17 and 105–116; Jensen et al., 2008; Piai et al., 2016).
We apply a boost factor, , to helices with a threshold length of 12:
The function is 1 if the helix propensity score is above the filtering cutoff and the helix length () is above the threshold, and 0 otherwise With a boost amplitude at 0.5, the boosted SeqDYN prediction for Sev-NT reaches excellent agreement with the measured (Figure 5E, red curve). The RMSE(–1) is reduced from 0.76 s–1 to 0.38 s–1 upon boosting. Applying the same helix boost to CBP-ID4 also results in a modest reduction in RMSE(–1), from 2.01 to 1.90 s–1 (Figure 5F, red curve). The only other IDP for which PsiPred plus filtering predicts a long helix is the N-terminal region of lysyl-tRNA synthetase (KRS-NT). The authors who studied this protein did not report on secondary structure (Cho et al., 2014), but feeding their reported chemical shifts to the TALOS +server (https://spin.niddk.nih.gov/bax/nmrserver/talos/; Shen et al., 2009) found only short stretches of residues that fall into the helical region of the Ramachandran map. The SeqDYN prediction for KRS-NT is already good [RMSE(–1)=0.83 s–1]; applying a helix boost would deteriorate the RMSE(−1) to 1.16 s–1.
Further test on a set of nine IDPs
We have reserved nine IDPs for testing SeqDYN (parameterized on the training set of 45 IDPs). The level of disorder in these test proteins also spans the full range, from absence of secondary structures [ChiZ N-terminal region (Hicks et al., 2020), Pdx1 C-terminal region (Cook et al., 2019), and TIA-1 prion-like domain (Sekiyama et al., 2022)] to presence of transient short helices [synaptobrevin-2 (Lakomek et al., 2019), α-endosulfine (Thapa et al., 2022), YAP (Feichtinger et al., 2022), angiomotin-like 1 (AMOTL1) (Vogel et al., 2022)] to formation of stable long helices [FtsQ Smrt et al., 2023 and CAHS-8 Malki et al., 2022]. For eight of the nine test IDPs, the RMSEs of SeqDYN predictions are lower than the experimental values, by an average of 0.66 s–1. For the ninth IDP (Pdx1), the SeqDYN RMSE is slightly higher, by 0.06 s–1, than the experimental . Together, the nine test IDPs have a mean RMSE of 1.13 s–1, close to the of 1.00 s–1 for the training set in the leave-one-out cross-validation.
The comparison of predicted and measured profiles along the sequence is presented in Figure 6A–I. For ChiZ, SeqDYN correctly predicts the major peak around Arg25 and the minor peak around Arg46 (Figure 6A). The profile of Pdx1 is largely featureless, except for a dip around Gly216, which is correctly predicted by SeqDYN (Figure 6B). Correct prediction is also obtained for the higher in the first half of TIA-1 prion-like domain than in the second half (Figure 6C). SeqDYN gives an excellent prediction for synaptobrevin-2, including a linear increase up to Arg56 and the major peak around Trp89 (Figure 6D).
The prediction is also very good for α-endosulfine, including elevated around Glu34, which coincides with the presence of a transient helix, and depressed in the last 40 residues (Figure 6E). The only miss is an underprediction for the peak around Lys74. SeqDYN also predicts well the overall shape of the profile for YAP, including peaks around Asn70, Leu91, Arg124, and Arg161, but severely underestimates the peak height around Asn70 (Figure 6F). NOE signals indicate contacts between Met86, Leu91, Fhe95, and Fhe96 (Feichtinger et al., 2022); evidently this type of local contacts is captured well by SeqDYN. The elevation around Asn70 is mostly due to helix formation: residues 61–74 have helix propensities up to 40% (Feichtinger et al., 2022). PsiPred predicts helix for residues 62–73, but only residues 65–68 survive the filtering that we impose, resulting in a helix that is too short to apply a helix boost. The prediction for AMOTL1 is mostly satisfactory, including peaks around Phe200 and Arg264 and a significant dip around Gly292 (Figure 6G). However, whereas the two peaks have approximately equal heights in the measured profile, the predicted peak height around Phe200 is too low. SCSs indicate helix propensity around both peaks (Vogel et al., 2022). PsiPred also predicts helix in both regions, but only five and two residues, respectively, survive after filtering, and are too short for applying a helix boost.
For FtsQ, SeqDYN correctly predicts elevated for the long helix [residues 46–74 Smrt et al., 2023] but underestimates the magnitude (RMSE = 2.32 s–1; green curve in Figure 6H). PsiPred plus filtering predicts a long helix formed by residues 47–73. Applying the helix boost substantially improves the agreement with the measured , with RMSE reducing to 1.71 s–1 (red curve in Figure 6H). SeqDYN also gives a qualitatively correct profile for CAHS-8, with higher for the middle section (residues 95–190; RMSE = 2.36 s–1; green curve in Figure 6I). However, it misses the extra elevation in for the first half of the middle section (residues 95–145). According to SCS, the first and second halves have helix propensities of 60% and 30%, respectively (Malki et al., 2022). PsiPred plus filtering predicts helices for residues 96–121, 124–141, 169–173, and 179–189. Only the first two helices, both in the first half of the middle section, are considered long according to our threshold. Once again, applying the helix boost leads to marked improvement in the predicted in , with RMSE reducing to 1.92 s–1 (red curve in Figure 6I).
Inputting the sequences of structured proteins predicts in the unfolded state
SeqDYN is trained on IDPs, what if we feed it with the sequence of a structured protein? The prediction using the sequence of hen egg white lysozyme, a well-studied single-domain protein, is displayed in Figure 7A. It shows remarkable agreement with the profile measured by Klein-Klein-Seetharaman et al., 2002 in the unfolded state (denatured by 8 M urea at pH 2 and reduced to break disulfide bridges), including a major peak around Trp62, a second peak around Trp111, and a third peak around Trp123. Klein-Seetharaman et al. mutated Trp62 to Gly and the major peak all but disappeared. This result is also precisely predicted by SeqDYN with the mutant sequence (Figure 7B).
SeqDYN also predicts well the profiles of other proteins in the unfolded state. For unfolded apomyoglobin (8 M urea; pH 2.3), Schwarzinger et al., 2002 claimed that depressed corresponded to stretches of small amino acids (Gly and Ala), whereas elevated corresponded to local hydrophobic interactions. SeqDYN reproduces all the observed peaks and valleys in the profile (Figure 7C). The deepest valley indeed occurs over a Gly/Ala-rich stretch, G125ADAQGA131, but the highest peak occurs over a stretch, I102KYLEFI108, that contains both hydrophobic and charged residues, all of which are on the high end of the parameters (Figure 3B). The profile of unfolded ubiquitin (8 M urea; pH 2) is relatively flat, which Wirmer et al., 2006 attributed to lack of residual secondary structure, based on the assumption that β-sheets (major elements of folded ubiquitin) are less resistant to denaturation than α-helices. SeqDYN predicts a relatively flat profile (Figure 7D), but the reason is that the ubiquitin sequence lacks a contiguous stretch of high- amino acids.
Discussion
We have developed a powerful method, SeqDYN, that predicts the backbone amide transverse relaxation rates () of IDPs. The method is based on IDP sequences, is extremely fast, and available as a web server at https://zhougroup-uic.github.io/SeqDYNidp/ (Qin and Zhou, 2024). The excellent performance supports the notion that the ns-dynamics reported by is coded by the local sequence, comprising up to 6 residues on either side of a given residue. The amino-acid types that contribute the most to coupling within a local sequence are aromatic (Trp, Tyr, Phe, and His), Arg, and long branched aliphatic (Ile and Leu), suggesting the importance of π-π, cation-π, and hydrophobic interactions in raising . These interactions are interrupted by Gly and amino acids with short polar sidechains (Ser, Thr, Asn, and Asp), leading to reduced . Transient short helices produce moderate elevation in , whereas stable long helices result in a big boost in . Tertiary contacts can also raise , but appear to be infrequent in most IDPs (Dey et al., 2022).
It is also possible that reported by backbone amide 15N relaxation (as is the case for most of the IDPs studied here) may not be particularly sensitive to exchange effects, which likely involve tertiary contact formation. For the D2 domain of p27Kip1, the exchange contributions measured using 15N relaxation were small (<2.5 s–1) but were as large as 25 s–1 when measured by high-power 1H relaxation dispersion (Ban et al., 2017). This experiment measures the effective transverse relation rate, , over a range of effective radiofrequency . The exchange contribution is maximal for the value at low but is largely quenched for the value in the high- limit. The SeqDYN prediction for this IDP matches much better with than with (Figure 8). It is not clear whether this IDP is unique in forming persistent tertiary contacts that give rise to substantial exchange contributions or the 1H relaxation dispersion experiment is unique in reporting the exchange contributions. At the minimum, SeqDYN yields the exchange-free portion of the transverse relaxation rate, enabling easy identification of residues that potentially participate in tertiary contacts. For the D2 domain of p27Kip1, SeqDYN correctly predicts the local maxima at W76 and Y88. It is these same two residues that show substantial exchange contributions and putatively participate in tertiary contact (Ban et al., 2017). Therefore local contacts may seed tertiary contacts. If data with substantial exchange contributions become available for more IDPs, SeqDYN may be retrained to make predictions for IDPs forming persistent tertiary contacts.
The parameters, while introduced here to characterize the propensities of amino acids to participate in local interactions, appear to correlate with the tendencies of amino acids to drive liquid-liquid phase separation. Consistent with the rank order of , Trp, Tyr, and Arg have been reported to be strong drivers of phase separation, Lys is a moderate driver, whereas Gly and Ser suppress phase separation (Martin et al., 2020; Wong et al., 2020; Wang et al., 2018). Recent measurements of the threshold concentration produced the following order for the propensity of phase separation by eight nonpolar amino acids in homotetrapeptides of the form XXssXX (ss: backbone disulfide bond): Trp > Phe > Leu>Met > Ile>Val > Ala>Pro (Zhang et al., 2024). This order is the same as that of the parameters, except that the values of Ile and Val are in the second and last places, respectively. Threshold concentrations of IDPs are now predicted reasonably well by coarse-grained simulations where each amino acid is modeled by a single bead with a Lennard-Jones diameter and a stickiness parameter (Tesei and Lindorff-Larsen, 2022). Our parameter shows a good correlation (R2=0.59) with the compound parameter (Figure 9). Therefore, the parameter may serve as a predictor for the tendency of an amino acid to drive phase separation. In essence, the same ability of an amino acid, for example Trp, to form interactions with neighboring residues of an IDP in the free state also applies when it comes to interactions with residues on neighboring chains in a dense phase.
Our method incorporates ideas from a number of previous efforts at describing . The first serious effort was by Schwalbe et al., 1997, who accounted for contributions from neighboring residues as additive terms, instead of multiplicative factors as in SeqDYN. Cho et al., 2007 and Delaforge et al., 2018 used the running average of the bulkiness parameter over a window of five to nine residues as a qualitative indicator of . Here again the calculation was based on an additive model. Sekiyama et al., 2022 employed a multiplicative model, with calculated as a geometric mean of ‘indices of local dynamics’ over a five-residue window. These indices, akin to our parameters, were trained on a single IDR (TIA-1 prion-like domain) and used to reproduce the measured for the same IDR. As we have illustrated on CBP-ID4 (Figure 5—figure supplement 1), training on a single protein merely biases the parameters to that model and has little value in predicting for other proteins. In comparison, SeqDYN is trained on 45 IDPs and its predictions are robust and achieve quantitative agreement with measured .
Ten of the IDPs tested here have been studied recently by MD simulations using IDP-specific force fields (Dey et al., 2022; Hicks et al., 2020; Yu and Brüschweiler, 2022; Smrt et al., 2023). In Table 2, we compare the RMSEs of SeqDYN predictions with those for calculations from MD simulations. For five of these IDPs: A1-LCD, Aβ40, α-synuclein, tau K18, and FtsQ, RMSEs of SeqDYN and MD are remarkably similar. Four of these IDPs lack significant population of α-helices or β-sheets, but FtsQ forms a stable long helix. For one other IDP, namely HOX-DFD, MD, by explicitly modeling its folded domain, does a much better job in predicting than SeqDYN (RMSEs of 1.40 s–1 vs 1.99 s–1). However, for the four remaining IDPs: p53TAD, Pup, Sev-NT, and ChiZ, SeqDYN significantly outperforms MD, with RMSEs averaging only 0.47 s–1, compared to the MD counterpart of 1.14 s–1. Overall, SeqDYN is very competitive against MD in predicting , but without the significant computational cost. While MD simulations can reveal details of local interactions, as noted for α-synuclein, and capture tertiary interactions if they occur, they still suffer from perennial problems of force-field imperfection and inadequate sampling. SeqDYN provides an accurate description of IDP dynamics at a ‘mean-field’ level, but could miss idiosyncratic behaviors of specific local sequences.
Deep-learning models have become very powerful, but they usually have millions of parameters and require millions of protein sequences for training (Rives et al., 2021). In contrast, SeqDYN employs a mathematical model with dozens of parameters and requires only dozens of proteins for training. Reduced models (by collapsing amino acids into a small number of distinct types) have even been trained on <10 IDPs to predict propensities for binding nanoparticles (Li et al., 2020) or membranes (Qin et al., 2022). The mathematical model-based approach may be useful in other applications where data, similar to , are limited, including predictions of IDP secondary chemical shifts or residues that bind drug molecules (Robustelli et al., 2022) or protein targets, or even in protein design, for example for recognizing an antigenic site or a specific DNA site.
Methods
Collection of IDPs with measured
Starting from six nonhomologous IDPs in our previous MD study (Dey et al., 2022), we obtained data for eight IDPs from the Bimolecular Magnetic Resonance Data Bank (BMRB; https://bmrb.io); data for two other IDPs were from our collaborators (Hicks et al., 2020; Smrt et al., 2023). Most of the 54 IDPs studied here were from searching the literature. Disorder was judged by dispersion in backbone amide proton chemical shifts, NOE, and SCS. data that were not available from the authors or BMRB were obtained by digitizing plots presented in figures of published papers, using WebPlotDigitizer (https://automeris.io/WebPlotDigitizer; Rohatgi, 2022) and further inspected visually.
Homology of IDPs was checked by sequence alignment using Clustal W (http://www.clustal.org/clustal2; Larkin et al., 2007), and presented as a clock-like tree using the ‘ape’ package (http://ape-package.ird.fr; Paradis et al., 2004). IDPs that had discernible homology with the selected training set were removed. Removed IDPs included HOX-SCR and β-synuclein from our previous MD study (Dey et al., 2022), due to homology with HOX-DFD and α-synuclein, respectively.
Coding for SeqDYN
The training of SeqDYN was coded in python, similar to our previous work for predicting residue-specific membrane association propensities (ReSMAP; https://zhougroup-uic.github.io/ReSMAPidp/; Qin et al., 2022). The cost function was the sum of mean-squared-errors for the IDPs in the training set. We used the least_squares function in scipy.optimize, with Trust Region Reflective as the minimization algorithm and all parameters restricted to the positive range. For the web server (https://zhougroup-uic.github.io/SeqDYNidp/; Qin and Zhou, 2024), we rewrote the prediction code javascript.
Appendix 1
Sequences of 54 IDPs
Terminal tags and other insertions are underlined.
Training set (45 IDPs)
>A1-LCD (Uniprot ID P04256, residues 186-320; deletion of 258-263)
GSMASASSSQRGRSGSGNFGGGRGGGFGGNDNFGRGGNFSGRGGFGGSRGGGGYGGSGDGYNGFGNDGSNFGGGGNYNNQSSNFGPMKGGNFGGRSSGPYGGGGQYFAKPRNQGGYGGSSSSSSYGSGRRF
>Aβ40 (Uniprot ID Q28053, residues 7-46)
DAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGV
>Ash1 (Uniprot ID P34233, residues 420-500)
GASASSSPSPSTPTKSGKMRSRSSSPVRPKAYTPSPRSPNYHRFALDSPPQSPRRSSNSSITKKGSRRSSGSSPTRHTTRVCV
>Beclin1 (Uniprot ID Q14457, residues 1-150)
MGSSHHHHHHSQDPMEGSKTSNNSTMQVSFVSQRSSQPLKLDTSFKILDRVTIQELTAPLLTTAQAKPGETQEEETNSGEEPFIETPRQDGVSRRFIPPARMMSTESANSFTLIGEASDGGTMENLSRRLKVTGDLFDIMSGQTDVDHPLSEESTDTLLDQLDTY
>CAPRIN1 (Uniprot ID Q14444, residues 607-709)
SRGVSRGGSRGARGLMNGYRGPANGFRGGYDGYRPSFSNTPNSGYTQSQFSAPRDYSGYQRDGYQQNFKRGSGQSGPRGAPRGRGGPPRPNRGMPQMNTQQVN
>CBP-ID4 (Uniprot ID Q92793, residues 1852-2057)
MQQQIQHRLQQAQLMRRRMATMNTRNVPQQSLPSPTSAPPGTPTQQPSTPQTPQPPAQPQPSPVSMSPAGFPSVARTQPPTTVSTGKPTSQVPAPPPPAQPPPAAVEAARQIEREAQQQQHLYRVNINNSMPPGRTGMGTPGSQMAPVSLNVPRPNQVSGPVMPSMPPGQWQQAPLPQQQPMPGLPRPVISMQAQAAVAGPRMPSVQ
>GbnD4-DHD (Uniprot ID A0A808VWJ6, residues 412-482)
MKHHHHHHHHGGLVPRGSHGSDEGVPDALRADTVPRAGPVRYARRRYWIGEARSDALAPAAPLEREPLPAEAMGAYFAIRRTDADDTVAAH
>ERD14 (Uniprot ID P42763, residues 1-185)
MAEEIKNVPEQEVPKVATEESSAEVTDRGLFDFLGKKKDETKPEETPIASEFEQKVHISEPEPEVKHESLLEKLHRSDSSSSSSSEEEGSDGEKRKKKKEKKKPTTEVEVKEEEKKGFMEKLKEKLPGHKKPEDGSAVAAAPVVVPPPVEEAHPVEKKGILEKIKEKLPGYHPKTTVEEEKKDKE
>ExsE (Uniprot ID Q9I322, residues 1-81)
MKIESIPPVQPSQDAGAEAVGHFEGRSVTRAAVRGDDRSSVAGLARWLARNVAGDPRSEQALQRLADGDGTPLEARTVRRREFLEGSS
>FCP1 (Uniprot ID Q9Y5B0, residues 879-961)
PGPEEQEEEPQPRKPGTRRERTLGAPASSERSAAGGRGPRGHKRKLNEEDAASESSRESSNEDEGSSSEADEMAKALEAELNDLM
>FUS (Uniprot ID P35637, residues 1-163)
MASNDYTQQATQSYGAYPTQPGQGYSQQSSQPYGQQSYSGYSQSTDTSGYGQSSYSSYGQSQNTGYGTQSTPQGYGSTGGYGSSQSSQSSYGQQSSYPGYGQQPAPSSTSGSYGSSSQSSSYGQPQSGSYSQQPSYGGQQQSYGQQQSYNPPQGYGQQNQYNS
>GAb1 (Uniprot ID B7Z3B9, residues 510-591)
SSPMIKPKGDKQVEYLDLDLDSGKSTPPRKQKSSGSGSSVADERVDYVVVDQQKTLALKSTREAWTDGRQSTESETPAKSVK
>hACTR (Uniprot ID Q9Y6Q9, residues 1023-1091)
GTQNRPLLRNSLDDLVGPPSNLEGQSDERALLDQLHTLLSNTDATGLEEIDRALGIPELVNQGQALEPK>Hahellin (Uniprot ID Q2SHN6, residues 162-252)
MGEKTVKLYEDTHFKGYSVELPVGDYNLSSLISRGALNDDLSSARVPSGLRLEVFQHNNFKGVRDFYTSDAAELSRDNDASSVRVSKMETTN
>hCSD1 (Uniprot ID P20810, residues 137-277)
AVPVESKPDKPSGKSGMDAALDDLIDTLGGPEETEEENTTYTGPEVSDPMSSTYIEELGKREVTIPPKYRELLAKKEGITGPPADSSKPIGPDDAIDALSSDFTCGSPTAAGKKTEKEESTEVLKAQSAGTVRSAAPPQEK
>HOX-DFD (Uniprot ID P07548, residues 337-426)
TDGERIIYPWMKKIHVAGVANGSYQPGMEPKRQRTAYTRHQILELEKEFHYNRYLTRRRRIEIAHTLVLSERQIKIWFQNRRMKWKKDNK
>hZIP4-ICL2 (Uniprot ID Q6P5W5, residues 424-498)
GDRGPEFELGTLPRDPEDLEDGPCGHSSHSHGGHSHGVSLQLAPSELRQPKPPHEGSRADLVAEESPELLNPEPRRLSPELRLLPYGHGLSAWSHPQFEK
>Jaburetox (Uniprot ID I1K3K3, residues 230-320)
MGPVNEANCKAAMEIVCRREFGHKEEEDASEGVTTGDPDCPFTKAIPREEYANKYGPTIGDKIRLGDTDLIAEIEKDFALYGDESVFGGGKVIH
>KRS-NT (Uniprot ID Q15046, residues 1-72)
MAAVQAAEVKVDGSEPKLSKNELKRRLKAEKKVAEKEAKQKELSEKQLSQATAAATNHTTDNGVGPEEESVD
>MBP-xα2 (Uniprot ID P04370, residues 172-237)
SIGRFFSGDRGAPKRGSGKDSHTRTTHYGSLPQKSQHGRTQDENPVVHFFKNIVTPRTPPPSQGKGRGLS
>MKK4 (Uniprot ID P45985, residues 1-86)
MAAPSPSGGGGSGGGSGSGTPGPVGSPAPGHPAVSSMQGKRKALKLNFANPPFKSTARFTLNPNPTGVQNPHIERLRTHSIESSGK
>N-Cby (Uniprot ID B0QY54, residues 1-63)
MPFFGNTFSPKKTPPRKSASLSNLHSLDRSTREVELGLEYGSPTMNLAGQSLKFENGQWIAET
>Niv-PNTD (Uniprot ID P0C1C7, residues 1-406)
MDKLELVNDGLNIIDFIQKNQKEIQKTYGRSSIQQPSIKDQTKAWEDFLQCTSGESEQVEGGMSKDDGDVERRNLEDLSSTSPTDGTIGKRVSNTRDWAEGSDDIQLDPVVTDVVYHDHGGECTGYGFTSSPERGWSDYTSGANNGNVCLVSDAKMLSYAPEIAVSKEDRETDLVHLENKLSTTGLNPTAVPFTLRNLSDPAKDSPVIAEHYYGLGVKEQNVGPQTSRNVNLDSIKLYTSDDEEADQLEFEDEFAGSSSEVIVGISPEDEEPSSVGGKPNESIGRTIEGQSIRDNLQAKDNKSTDVPGAGPKDSAVKEEPPQKRLPMLAEEFECSGSEDPIIRELLKENSLINCQQGKDAQPPYHWSIERSISPDKTEIVNGAVQTADRQRPGTPMPKSRGIPIKK
>NS5A-D2D3 (Uniprot ID O92972, residues 2163-2419)
GHMASGSLRGGEPEPDVTVLTSMLTDPSHITAETAKRRLARGSPPSLASSSASQLSAPSLKATCTTHHDSPDADLIEANLLWRQEMGGNITRVESENKVVILDSFEPLHADGDEREISVAAEILRKSRKFPSALPIWARPDYNPPLLESWKDPDYVPPVVHGCPLPPTKAPPIPPPRRKRTVVLTESNVSSALAELATKTFGSSGSSAVDSGTATALPDQASDDGDKGSDVESYSSMPPLEGEPGDPDLSDGSWSTVSEEASEDVVCC
>NUPR1 (Uniprot ID O60356, residues 2-82)
MRGSHHHHHHGSATFPPATSAPQQPPGPEDEDSSLDESDLYSLAHSYLGGGGRKGRTKREAAANTNRPSPGGHERKLVTKLQNSERKKRGARR
OPN (Uniprot ID F1NSM8, residues 46-264)
MHQDHVDSQSQEHLQQTQNDLASLQQTHYSSEENADVPEQPDFPDVPSKSQETVDDDDDDDNDSNDTDESDEVFTDFPTEAPVAPFNRGDNAGRGDSVAYGFRAKAHVVKASKIRKAARKLIEDDATTEDGDSQPAGLWWPKESREQNSRELPQHQSVENDSRPKFDSREVDGGDSKASAGVDSRESQGSVPAVDASNQTLESAEDAEDRHSIENNEVTR
>p53TAD (Uniprot ID P04637, residues 1-71)
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPRV
>PDEγ (Uniprot ID P61248, residues 1-87)
MNLEPPKAEIRSATRVMGGPVTPRKGPPKFKQRQTRQFKSKPPKKGVQGFGDDIPGMEGLGTDITVIAPWEAFNHLELHELAQYGII
>PKIα (Uniprot ID P61925, residues 2-76)
TDVETTYADFIASGRTGRRNAIHDILVSSASGNSNELALKLAGLDINKTEGEEDAQRSSTEQSGEAQGEAAKSES
>Mev-PNTD (Uniprot ID P03422, residues 1-304)
MAEEQARHVKNGLECIRALKAEPIGSLAIEEAMAAWSEISDNPGQERATCREEKAGSSGLSKPCLSAIGSTEGGAPRIRGQGPGESDDDAETLGIPPRNLQASSTGLQCYYVYDHSGEAVKGIQDADSIMVQSGLDGDSTLSGGDNESENSDVDIGEPDTEGYAITDRGSAPISMGFRASDVETAEGGEIHELLRLQSRGNNFPKLGKTLNVPPPPDPGRASTSGTPIKKGTERRLASFGTEIASLLTGGATQCARKSPSEPSGPGAPAGNVPECVSNAALIQEWTPESGTTISPRSQNNEEGG
>ProTα (Uniprot ID P06454, residues 1-111)
GPMSDAAVDTSSEITTKDLKEKKEVVEEAENGRDAPANGNAENEENGEQEADNEVDEEEEEGGEEEEEEEEGDGEEEDGDEDEEAESATGKRAAEDDEDDDVDTKKQKTDEDD
>Pup (Uniprot ID P9WHN4, residues 1-64)
MAQEQTKRGGGGGDDDDIAGSTAAGQERREKLTEETDDLLDEIDDVLEENAEDFVRAYVQKGGQ
>rmBG21 (Uniprot ID P04370, residues 2-190)
GNHSGKRELSAEKASKDGEIHRGEAGKKRSVGKLSQTASEDSDVFGEADAIQNNGTSAEDTAVTDSKHTADPKNNWQGAHPADPGNRPHLIRLFSRDAPGREDNTFKDRPSESDELQTIQEDPTAASGGLDVMASQKRPSQRSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFSGDRGAPKRGSGKVSLEHHHHHH
>RPB1 (Uniprot ID P24928, residues 1773-1970)
GHMSPNYTPTSPNYSPTSPSYSPTSPSYSPTSPSYSPSSPRYTPQSPTYTPSSPSYSPSSPSYSPASPKYTPTSPSYSPSSPEYTPTSPKYSPTSPKYSPTSPKYSPTSPTYSPTTPKYSPTSPTYSPTSPVYTPTSPKYSPTSPTYSPTSPKYSPTSPTYSPTSPKGSTYSPTSPGYSPTSPTYSLTSPAISPDDSDEEN
>securin (Uniprot ID O95997, residues 1-202)
MATLIYVDKENGEPGTRVVAKDGLKLGSGPSIKALDGRSQVSTPRFGKTFDAPPALPKATRKALGTVNRATEKSVKTKGPLKQKQPSFSAKKMTEKTVKAKSSVPASDDAYPEIEKFFPFNPLDFESFDLPEEHQIAHLPLSGVPLMILDEERELEKLFQLGPPSPVKMPSPPWESNLLQSPSSILSTLDVELPPVCCDIDI
>Sev-NT (Uniprot ID Q07097, residues 401-524)
LSGGDGAYHEPTGGGAIEVALDNADIDLETEAHADQDARGWGGESGERWARQVSGGHFVTLHGAERLEEETNDEDVSDIERRIAMRLAERRQEDSATHGDEGRNNGVDHDEDDDAAAVAGIGGI
>Sic1 (Uniprot ID P38634, residues 1-90)
GSMTPSTPPRSRGTRYLAQPSGNTSSSALMQGQKTPQKPSQNLVPVTPSTTKSFKNAPLLAPPNSNMGMTSPFNGLTSPQRSPFPKSSVKRT
>SKIPN (Uniprot ID G3V5R3, residues 59-129)
GDGGAFPEIHVAQYPLDMGRKKKMSNALAIQVDSEGKIKYDAIARQGQSKDKVIYSKYTDLVPKEVMNADD
>SLBP-NT (Uniprot ID Q9VAN6, residues 17-108)
MGSSHHHHHHSSGLVPRGSHMGSGSLNSSASSISIDVKPTMQSWAQEVRAEFGHSDEASSSLNSSAASCGSLAKKETADGNLESKDGEGREMAFEFLDGVNEVKFERLVKEEK
>α-synuclein (Uniprot ID P37840, residues 1-140)
MDVFMKGLSKAKEGVVAAAEKTKQGVAEAAGKTKEGVLYVGSKTKEGVVHGVATVAEKTKEQVTNVGGAVVTGVTAVAQKTVEGAGSIAAATGFVKKDQLGKNEEGAPQEGILEDMPVDPDNEAYEMPSEEGYQDYEPEA
>SOCS5-JIR (Uniprot ID A0A5E4BAI0, residues 12-81)
RSLRQRLQDTVGLCFPMRTYSKQSKPLFSNKRKIHLSELMLEKCPFPAGSDLAQKWHLIKQHTAPVSPHS
>tau K18 (Uniprot ID Q9MYX8, residues 186-314)
QTAPVPMPDLKNVKSKIGSTENLKHQPGGGKVQIINKKLDLSNVQSKCGSKDNIKHVPGGGSVQIVYKPVDLSKVTSKAGSLGNIHHKPGGGQVEVKSEKLDFKDRVQSKIGSLDNITHVPGGGNKKIE
>TC1 (Uniprot ID Q9NR00, residues 1-106)
MKAKRSHQAIIMSTSLRVSPSIHGYHFDTASRKKAVGNIFENTDQESLERLFRNSGDKKAEERAKIIFAIDQDVEEKTRALMALKKRTKDKLFQFLKLRKYSIKVH
>TDP-43 (Uniprot ID Q13148, residues 267-414)
GHMNRQLERSGRFGGNPGGFGNQGGFGNSRGGGAGLGNNQGSNMGGGMNFGAFSINPAMMAAAQAALQSSWGMMGMLASQQNQSGPSGNNQNQGNMQREPNQAFGSGNNSYSGSNSGAAIGWGSASNAGSGSGFNGGFGSSMDSKSSGWGM
>γ-tubulin-CT (Uniprot ID P53378, residues 439-473)
LLRGAAEQDSYLDDVLVDDENMVGELEEDLDADGDHKLV
Test set (9 IDPs)
>AMOTL1 (Uniprot ID Q8IY63, residues 178-384)
STQPQQNNEELPTYEEAKAQSQFFRGQQQQQQQQGAVGHGYYMAGGTSQKSRTEGRPTVNRANSGQAHKDEALKELKQGHVRSLSERIMQLSLERNGAKQHLPGSGNGKGFKVGGGPSPAQPAGKVLDPRGPPPEYPFKTKQMMSPVSKTQEHGLFYGDQHPGMLHEMVKPYPAPQPVRTDVAVLRYQPPPEYGVTSRPCQLPFPST
>CAHS-8 (Uniprot ID P0CU50, residues 1-227)
MSGRNVESHMERNEKVVVNNSGHADVKKQQQQVEHTEFTHTEVKAPLIHPAPPIISTGAAGLAEEIVGQGFTASAARISGGTAEVHLQPSAAMTEEARRDQERYRQEQESIAKQQEREMEKKTEAYRKTAEAEAEKIRKELEKQHARDVEFRKDLIESTIDRQKREVDLEAKMAKRELDREGQLAKEALERSRLATNVEVNFDSAAGHTVSGGTTVSTSDKMEIKRNENLYFQ
>ChiZ (Uniprot ID I6YA32, residues 1-64)
MTPVRPPHTPDPLNLRGPLDGPRWRRAEPAQSRRPGRSRPGGAPLRYHRTGVGMSRTGHGSRPV
>-endosulfine (Uniprot ID O43768, residues 1-121)
MSQKQEEENPAEETGEEKQDTQEKEGILPERAEEAKLKAKYPSLGQKPGGSDFLMKRLQKGQKYFDSGDYNMAKAKMKNKQLPSAGPDKNLVTGDHIPTPQDLPQRKSSLVTSKLAGGQVE
>FtsQ (Uniprot ID Q8IY63, residues 1-99)
MTEHNEDPQIERVADDAADEEAVTEPLATESKDEPAEHPEFEGPRRRARRERAERRAAQARATAIEQARRAAKRRARGQIVSEQNPAKPAARGVVRGLK
>Pdx1 (Uniprot ID P52945, residues 204-283)
GPGEEDKKRGGGTAVGGGGVAEPEQDCAVTSGEELLALPPPPPPGGAVPPAAPVAAREGRLPPGLSASPQPSSVAPRRPQEPR
>synaptobrevin-2 (Uniprot ID P63027, residues 1-96)
MSATAATAPPAAPAGEGGPPAPPPNLTSNRRLQQTQAQVDEVVDIMRVNVDKVLERDQKLSELDDRADALQAGASQFETSAAKLKRKYWWKNLKMM
>TIA-1 (Uniprot ID P31483, residues 320-386)
MGSSHHHHHHHHHHHHSENLYFQGGQYVPNGWQVPAYGVYGQPWSQQGFNQTQSSAPWMGPNYSVPPPQGQNGSMLPSQPAGYRVAGYETQ
>YAP (Uniprot ID P46937, residues 50-171)
AGHQIVHVRGDSETDLEALFNAVMNPKTANVPQTVPMRLRKLPDSFFKPPEPKSHSRQASTDAGTAGALTPQHVRAHSSPASLQLGAVSPGTLTPTGVVSGPAATPTAQHLRQSSFEIPDDV
Data availability
All data generated or analyzed during this study are included in the manuscript and supplementary files; source data have been provided for Figures 3—9, Figure 3—figure supplement 1, Figure 4—figure supplement 1, and Figure 5—figure supplement 1.
References
-
Identification of dynamic modes in an intrinsically disordered protein using temperature-dependent NMR relaxationJournal of the American Chemical Society 138:6240–6251.https://doi.org/10.1021/jacs.6b02424
-
A small molecule causes a population shift in the conformational landscape of an intrinsically disordered proteinJournal of the American Chemical Society 139:13692–13700.https://doi.org/10.1021/jacs.7b01380
-
Amino acid bulkiness defines the local conformations and dynamics of natively unfolded alpha-synuclein and tauJournal of the American Chemical Society 129:3032–3033.https://doi.org/10.1021/ja067482k
-
Solution ensemble of the C-terminal domain from the transcription factor Pdx1 resembles an excluded volume polymerThe Journal of Physical Chemistry. B 123:106–116.https://doi.org/10.1021/acs.jpcb.8b10051
-
Structural and dynamic characterization of intrinsically disordered human securin by NMR spectroscopyJournal of the American Chemical Society 130:16873–16879.https://doi.org/10.1021/ja805510b
-
Deciphering the dynamic interaction profile of an intrinsically disordered protein by nmr exchange spectroscopyJournal of the American Chemical Society 140:1148–1158.https://doi.org/10.1021/jacs.7b12407
-
Sequence-dependent backbone dynamics of intrinsically disordered proteinsJournal of Chemical Theory and Computation 18:6310–6323.https://doi.org/10.1021/acs.jctc.2c00328
-
Probabilistic sampling of protein conformations: new hope for brute force?Proteins 46:8–23.
-
Macromolecular crowding induces a binding competent transient structure in intrinsically disordered Gab1Journal of Molecular Biology 434:167407.https://doi.org/10.1016/j.jmb.2021.167407
-
Removing thermostat distortions of protein dynamics in constant-temperature molecular dynamics simulationsJournal of Chemical Theory and Computation 17:5920–5932.https://doi.org/10.1021/acs.jctc.1c00448
-
Mechanism of intersubunit ketosynthase-dehydratase interaction in polyketide synthasesNature Chemical Biology 14:270–275.https://doi.org/10.1038/nchembio.2549
-
Quantitative conformational analysis of partially folded proteins from residual dipolar couplings: application to the molecular recognition element of Sendai virus nucleoproteinJournal of the American Chemical Society 130:8055–8061.https://doi.org/10.1021/ja801332d
-
Clustal W and clustal X version 2.0Bioinformatics 23:2947–2948.https://doi.org/10.1093/bioinformatics/btm404
-
Carbon-detected (15)N NMR spin relaxation of an intrinsically disordered protein: FCP1 dynamics unbound and in complex with RAP74The Journal of Physical Chemistry Letters 3:1409–1413.https://doi.org/10.1021/jz300432e
-
Quantitative cooperative binding model for intrinsically disordered proteins interacting with nanomaterialsJournal of the American Chemical Society 142:10730–10738.https://doi.org/10.1021/jacs.0c01885
-
Sequence determinants of the conformational properties of an intrinsically disordered protein prior to and upon multisite phosphorylationJournal of the American Chemical Society 138:15323–15335.https://doi.org/10.1021/jacs.6b10272
-
The PSIPRED protein structure prediction serverBioinformatics 16:404–405.https://doi.org/10.1093/bioinformatics/16.4.404
-
Proline fingerprint in intrinsically disordered proteinsChembiochem 19:1625–1629.https://doi.org/10.1002/cbic.201800172
-
Dynamics of the intrinsically disordered protein NUPR1 in isolation and in its fuzzy complexes with DNA and prothymosin αBiochimica et Biophysica Acta - Proteins and Proteomics 1867:140252.https://doi.org/10.1016/j.bbapap.2019.07.005
-
Conformational propensities and dynamics of a βγ-crystallin, an intrinsically disordered proteinPhysical Chemistry Chemical Physics 16:12703.https://doi.org/10.1039/c3cp53558d
-
WebsiteSeqDYN: Predictor for the Sequence-Dependent Backbone Dynamics of Intrinsically Disordered ProteinsAccessed October 23, 2024.
-
Local and global dynamics in intrinsically disordered synucleinAngewandte Chemie 57:15262–15266.https://doi.org/10.1002/anie.201808172
-
reorientational dynamics of amyloid-β from NMR spin relaxation and molecular simulationThe Journal of Physical Chemistry Letters 10:3369–3375.https://doi.org/10.1021/acs.jpclett.9b01050
-
Molecular basis of small-molecule binding to α-SynucleinJournal of the American Chemical Society 144:2501–2510.https://doi.org/10.1021/jacs.1c07591
-
Visualizing the molecular recognition trajectory of an intrinsically disordered protein using multinuclear relaxation dispersion NMRJournal of the American Chemical Society 137:1220–1229.https://doi.org/10.1021/ja511066q
-
TALOS+: a hybrid method for predicting protein backbone torsion angles from NMR chemical shiftsJournal of Biomolecular NMR 44:213–223.https://doi.org/10.1007/s10858-009-9333-z
-
Residual structure, backbone dynamics, and interactions within the synuclein familyJournal of Molecular Biology 372:689–707.https://doi.org/10.1016/j.jmb.2007.07.008
-
Full backbone assignment and dynamics of the intrinsically disordered dehydrin ERD14Biomolecular NMR Assignments 5:189–193.https://doi.org/10.1007/s12104-011-9297-2
-
Interaction mechanism of endogenous PP2A inhibitor protein ENSA with PP2AThe FEBS Journal 289:519–534.https://doi.org/10.1111/febs.16150
-
A large intrinsically disordered region in SKIP and its disorder-order transition induced by PPIL1 binding revealed by NMRThe Journal of Biological Chemistry 285:4951–4963.https://doi.org/10.1074/jbc.M109.087528
-
Motional properties of unfolded ubiquitin: a model for a random coil proteinJournal of Biomolecular NMR 35:175–186.https://doi.org/10.1007/s10858-006-9026-9
-
NMR experiments for studies of dilute and condensed protein phases: application to the phase-separating protein CAPRIN1Journal of the American Chemical Society 142:2471–2489.https://doi.org/10.1021/jacs.9b12208
-
Quantitative binding behavior of intrinsically disordered proteins to nanoparticle surfaces at individual residue levelChemistry – A European Journal 24:16997–17001.https://doi.org/10.1002/chem.201804556
-
Amino acid-dependent phase equilibrium and material properties of tetrapeptide condensatesCell Reports Physical Science 5:102218.https://doi.org/10.1016/j.xcrp.2024.102218
-
The transiently ordered regions in intrinsically disordered ExsE are correlated with structural elements involved in chaperone bindingBiochemical and Biophysical Research Communications 417:129–134.https://doi.org/10.1016/j.bbrc.2011.11.070
-
The characterization of amino acid sequences in proteins by statistical methodsJournal of Theoretical Biology 21:170–201.https://doi.org/10.1016/0022-5193(68)90069-6
Article and author information
Author details
Funding
National Institutes of Health (GM118091)
- Huan-Xiang Zhou
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
This work was supported by Grant GM118091 from the National Institutes of Health.
Version history
- Preprint posted:
- Sent for peer review:
- Reviewed Preprint version 1:
- Reviewed Preprint version 2:
- Version of Record published:
Cite all versions
You can cite all versions using the DOI https://doi.org/10.7554/eLife.88958. This DOI represents all versions, and will always resolve to the latest one.
Copyright
© 2023, Qin and Zhou
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
Metrics
-
- 771
- views
-
- 52
- downloads
-
- 2
- citations
Views, downloads and citations are aggregated across all versions of this paper published by eLife.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading
-
- Structural Biology and Molecular Biophysics
Peroxisome proliferator-activated receptor gamma (PPARγ) is a nuclear receptor transcription factor that regulates gene expression programs in response to ligand binding. Endogenous and synthetic ligands, including covalent antagonist inhibitors GW9662 and T0070907, are thought to compete for the orthosteric pocket in the ligand-binding domain (LBD). However, we previously showed that synthetic PPARγ ligands can cooperatively cobind with and reposition a bound endogenous orthosteric ligand to an alternate site, synergistically regulating PPARγ structure and function (Shang et al., 2018). Here, we reveal the structural mechanism of cobinding between a synthetic covalent antagonist inhibitor with other synthetic ligands. Biochemical and NMR data show that covalent inhibitors weaken—but do not prevent—the binding of other ligands via an allosteric mechanism, rather than direct ligand clashing, by shifting the LBD ensemble toward a transcriptionally repressive conformation, which structurally clashes with orthosteric ligand binding. Crystal structures reveal different cobinding mechanisms including alternate site binding to unexpectedly adopting an orthosteric binding mode by altering the covalent inhibitor binding pose. Our findings highlight the significant flexibility of the PPARγ orthosteric pocket, its ability to accommodate multiple ligands, and demonstrate that GW9662 and T0070907 should not be used as chemical tools to inhibit ligand binding to PPARγ.
-
- Structural Biology and Molecular Biophysics
SCARF1 (scavenger receptor class F member 1, SREC-1 or SR-F1) is a type I transmembrane protein that recognizes multiple endogenous and exogenous ligands such as modified low-density lipoproteins (LDLs) and is important for maintaining homeostasis and immunity. But the structural information and the mechanisms of ligand recognition of SCARF1 are largely unavailable. Here, we solve the crystal structures of the N-terminal fragments of human SCARF1, which show that SCARF1 forms homodimers and its epidermal growth factor (EGF)-like domains adopt a long-curved conformation. Then, we examine the interactions of SCARF1 with lipoproteins and are able to identify a region on SCARF1 for recognizing modified LDLs. The mutagenesis data show that the positively charged residues in the region are crucial for the interaction of SCARF1 with modified LDLs, which is confirmed by making chimeric molecules of SCARF1 and SCARF2. In addition, teichoic acids, a cell wall polymer expressed on the surface of gram-positive bacteria, are able to inhibit the interactions of modified LDLs with SCARF1, suggesting the ligand binding sites of SCARF1 might be shared for some of its scavenging targets. Overall, these results provide mechanistic insights into SCARF1 and its interactions with the ligands, which are important for understanding its physiological roles in homeostasis and the related diseases.