Protein phase separation is implicated in formation of membraneless organelles, signaling puncta and the nuclear pore. Multivalent interactions of modular binding domains and their target motifs can drive phase separation. However, forces promoting the more common phase separation of intrinsically disordered regions are less understood, with suggested roles for multivalent cation-pi, pi-pi, and charge interactions and the hydrophobic effect. Known phase-separating proteins are enriched in pi-orbital containing residues and thus we analyzed pi-interactions in folded proteins. We found that pi-pi interactions involving non-aromatic groups are widespread, underestimated by force-fields used in structure calculations and correlated with solvation and lack of regular secondary structure, properties associated with disordered regions. We present a phase separation predictive algorithm based on pi interaction frequency, highlighting proteins involved in biomaterials and RNA processing.https://doi.org/10.7554/eLife.31486.001
Protein phase separation has important implications for cellular organization and signaling (Mitrea and Kriwacki, 2016; Brangwynne et al., 2009; Su et al., 2016), RNA processing (Sfakianos et al., 2016), biological materials (Yeo et al., 2011) and pathological aggregation (Taylor et al., 2016). For some systems, multivalent interactions between modular binding domains and cognate peptide motifs underlie phase-separation (Li et al., 2012; Banjade and Rosen, 2014). However, many phase-separating proteins contain large intrinsically disordered protein regions (IDRs) with low complexity sequences that do not form stable folded structure (reviewed in [Mitrea and Kriwacki, 2016; Chong and Forman-Kay, 2016]), including the Nephrin intracellular domain (NICD) (Pak et al., 2016), polyglutamine tracts (Crick et al., 2013), tropoelastin (Yeo et al., 2011), FUS (Burke et al., 2015; Kato et al., 2012), Ddx4 and the homologous LAF-1 (Nott et al., 2015; Elbaum-Garfinkle et al., 2015) and FG-repeat nucleoporins (Frey et al., 2006). The underlying physical principles and chemical interactions that drive phase separation in these IDRs are not well understood. Multivalent (Li et al., 2012; Pierce et al., 2016) electrostatic (Pak et al., 2016; Lin et al., 2016) and cation-pi (Nott et al., 2015; Kim et al., 2016; Sherrill, 2013) interactions and the hydrophobic effect (Yeo et al., 2011) have all been proposed to contribute to IDR phase separation, the latter suggested to be dominant for tropoelastin (Luan et al., 1990). For Ddx4, electrostatic interactions between charge blocks has been demonstrated (Nott et al., 2015; Lin et al., 2016). The abundance of Phe-Gly/Gly-Phe and Arg-Gly/Gly-Arg dipeptides in Ddx4 and the fact that Phe to Ala mutations inhibit phase separation also point to pi-pi and/or cation-pi interactions. The Phe-Gly repeats in FG nucleoporins similarly indicate pi-pi interactions, but the lack of aromatics in elastins and designed phase-separating sequences (Quiroz and Chilkoti, 2015) seems to suggest that they are not essential. Clearly, a number of physical interactions may be sufficient for driving phase separation without being universally necessary, and a better understanding of these interactions is needed to define the balance of forces biological systems use for driving protein phase transitions.
Although pi-pi interactions are commonly associated with aromatic rings, where interaction energy is thought to involve induced quadrupolar electrostatic interactions (Sherrill, 2013), π (pi) orbitals of bonded sp2-hybridized atoms are also found in peptide backbone amide groups and sidechain amide, carboxyl or guanidinium groups. Sidechains with pi bonds include Tyr, Phe, Trp, His, Gln, Asn, Glu, Asp and Arg. Small residues with relatively exposed backbone peptide bonds include Gly, Ser, Thr and Pro. Notably, low complexity IDRs implicated in phase separation of FUS, EWS, hnRNPA1, TIA-1, TDP-43 and the RNA Pol II C-terminal domain (CTD) (Mitrea and Kriwacki, 2016; Taylor et al., 2016; Kato et al., 2012) are very enriched in these residues that have high potential for formation of pi-pi interactions, relative to average occurrence in the proteome. Even elastins, which lack sidechain pi groups but have Val-Pro-Gly-Xxx-Gly repeats (Yeo et al., 2011), are highly enriched in Pro and Gly residues with exposed pi-containing peptide backbones.
Given the high frequency of aromatic residues, arginine and glutamine in many phase-separating sequences, we were motivated to investigate the structural behavior of pi-pi interactions in order to better understand how their observed physical behavior relates to their potential role in phase separation. We first characterized the frequency and correlations of pi interactions in a non-redundant protein set from the RCSB protein data bank (PDB) of folded proteins (Berman et al., 2000). We discovered that planar pi-pi contacts involving a non-aromatic group, including those involving the backbone amide group, are the predominant form of pi-pi interaction, and we showed that planar pi-contact rates can be predicted from sequence. Then, we tested the relevance of these planar pi-pi interactions to phase separation by training a phase separation predictor using only these expected pi-contact rates. We then demonstrated that three of the predicted proteins, FMR1, a multifunctional RNA-binding protein and a neuronal granule component (El Fatimy et al., 2016), engrailed-2, a DNA
-binding homeobox protein, and the pAP isoform of the Human cytomegalovirus capsid scaffolding protein phase separate in isolation in vitro. Analysis of predictions for the full human proteome suggests strong phase-separation propensities for proteins involved in biomaterials and RNA processing, with likely regulation by splicing and post-translational modifications (PTMs).
To determine the frequency of pi-pi interactions and better understand their nature and physical properties, we performed a bioinformatics analysis of folded proteins. We searched the PDB for pi-pi interactions by measuring contact distances between planar surfaces and comparing planar orientations (see Materials and methods), choosing to focus on interactions involving pi-orbital planar surfaces as this category shows the most enrichment over expectations, both in terms of overall frequency (Appendix 1—figure 1) and in relation to resolution. Face-to-face planar pi-pi contacts were defined using a simple distance- and orientation-based metric designed to consistently capture this enrichment across diverse sp2-containing groups (Appendix 1—figure 1A,B,C).
Our analysis was originally intended to explore the known interactions of aromatic sidechains with each other and with arginine, but in order to provide a control group we defined our contact parameters in a way that allowed us to treat all sp2 groups in the same fashion. In high-resolution (≤1.8 Å) and low R-factor (≤0.18) protein crystal structures (N = 5718), we found that planar pi-pi stacking interactions involving non-aromatic atoms outnumber aromatic-aromatic stacking interactions by approximately 13 to 1 (Figure 1A and Appendix 1—table 1) suggesting that, while aromatic sidechains may be enriched in stacking interactions relative to their frequency, there is a more general role for pi-contacts that involve non-aromatic atoms. The vast majority of planar pi-orbital contacts in proteins involve one of five non-aromatic sp2-hybridized sidechains or the peptide bond itself. Fully 36% of observed pi-pi stacking interactions do not involve an aromatic partner, with face-to-face planar contacts between different backbone peptide bonds occurring as often as aromatic face-to-face contacts (Figure 1A). Across the high-resolution set, we observe that 58% of heavy atoms are sp2, of which 10.5% are involved in pi-stacking. Furthermore, 28% of heavy atoms that are not directly involved in pi-stacking are found within van der Waals (VDW) contact distance (4.9 Å) of atoms that are. Thus, planar pi-pi interactions form a common feature of the protein chemical environment. Comparisons to previous work showing that aromatic-aromatic interactions in proteins are instead biased toward face-to-edge or parallel displaced geometries (Hunter et al., 1991; Martinez and Iverson, 2012) are complicated by the observations that VDW contacts between aromatic sidechains coincide with face-to-face pi-pi stacking to a third sp2-hybridized group 49% of the time, and that parallel displacement often accommodates an additional non-aromatic pi-contact to the same planar surface.
Analysis of protein structures showed that the frequencies of planar pi interactions strongly correlate with the power of the experimental data to constrain the structure and with the fit to the data. We identified a linear relationship between contact frequencies and the resolution of crystal structures (Figure 1A). We identify a similar dependence of contact frequency on the relative number of sidechain-specific distance constraints in NMR structures (Appendix 1—figure 2A) and confirm that the dependence on resolution persists for identical sequences solved multiple times at different resolutions (Appendix 1—figure 2B). These data suggest that the relative importance of pi-pi interactions are underestimated in the force-fields that are used in the structure calculations and thus appear more frequently in structures that are heavily constrained by experimental observations. In addition, pi-pi contact frequencies for amino acid and other small sp2-containing ligands bound to proteins (including non-aromatic ligands) are higher than the frequencies observed for the same chemical group found within proteins, despite or perhaps because of their relative freedom of movement (Appendix 1—table 2).
To examine whether sp2-containing sidechains engage in stacking behavior beyond what could be expected for average contact frequencies and overall packing considerations, we determined sidechain contacts to backbone peptide groups, focusing on the percentage of VDW contacts (with two or more pairs of atoms within 4.9 Å) which satisfy our planar-pi criterion, and then compared the frequencies observed for sp2 sidechain groups to those observed for planar surfaces on the terminal end of sp3 sidechains, using atom groups as listed in the Materials and methods section. This metric addresses the issue of amino acid composition effects by taking advantage of the even distribution of backbone groups and allows for normalization of contact frequency for sidechains of different size. Enrichment of sp2 planar contacts relative to sp3 is clearly observed for all sp2 sidechains except Asn and Gln, which our previous analysis showed are more likely to form contacts with their backbone than with their sidechains (Figure 1—figure supplement 1). Further analysis of the relative frequency of planar pi VDW contacts to other VDW contacts as a function of resolution demonstrates that for some contact types the increased pi-contact frequencies with increasing resolution (lower values in Å) are at the expense of decrease in other VDW contacts, suggesting that these contacts represent a specific geometric constraint present in the experimental data, rather than an overall increase in VDW contact frequency at higher resolution (Figure 1—figure supplement 2).
Aromatic groups are known to have favorable interactions with other aromatics, with the peptide backbone (Tóth et al., 2001), and with charged groups. We observed that the guanidine group of arginine is either the first or second most likely planar pi-stacking partner for any given aromatic sidechain, a phenomenon previously described as cation-pi (Sherrill, 2013; Martinez and Iverson, 2012). However, we also observed planar stacking interactions between non-aromatic groups of all kinds, including both anion-to-anion and cation-to-cation, with relative frequencies shown in Figure 1B,C. Surprisingly, 3.6% of arginine sidechains are found in direct, parallel pi-stacking contact with another arginine, despite repulsive charges (Figure 1—figure supplement 2), suggesting that these guanidinium interactions are better described as pi-pi, rather than cation-pi (example shown in Figure 2A).
Modeling and analysis of protein structures typically involves the use of coarse-grained energy functions. To test the degree to which contact frequencies in solved structures derive from experimental constraints, rather than the force fields used, we explored how well planar pi interactions are captured by the simple energy functions used in certain protein modeling protocols. We examined a few different modeling protocols by either running available methods or downloading pre-computed datasets (see Materials and methods). In general, planar pi-pi contacts were lost during simulations (Appendix 1—figure 3A) and energy minimization (Appendix 1—figure 3B). In one older molecular dynamics simulation of folded proteins, made available for 100 proteins via Dynameomics (Kehl et al., 2008), 90% of the planar pi-pi contacts found in the starting structures were lost during simulation, with the majority being lost within the first few simulation steps. Similarly, modeling of the energetic effect of mutations, the ∆∆G of unfolding, using both FOLDX (Schymkowitz et al., 2005) and Rosetta (Kellogg et al., 2011), shows decreased prediction accuracy at positions involved in pi-contacts (Appendix 1—figure 3C-F), based on comparison to a reference set of ∆∆G measurements (Bava et al., 2004). These observed issues in modeling pi-contacts may be overcome by more recent and sophisticated energy functions, but our results are consistent with the inherent energetic importance of planar pi interactions, rather than their observation being due to simple force fields used in refining protein structures.
For exploring the contribution of pi contacts to general structural and functional properties of proteins, we examined contact enrichment for sp2 groups found in a diverse range of interactions. We observe increased frequency of pi-pi contacts at positions with known catalytic function (Furnham et al., 2014), with enrichment of 1.87 ± 0.07 overall and 1.42 ± 0.07 when normalized by residue type (Appendix 1—table 3), with pi-pi contacts often playing a role in defining the geometry of the active site (Figure 2B) or forming networks of pi-pi contacts. We find that hydrogen bond frequency increases at sp2 sidechains involved in pi-contacts (Appendix 1—figure 4), and when sp2 groups hydrogen bond each other we observe increased frequencies of a third sp2 group being found in simultaneous pi-stacking to both the donor and acceptor groups of the hydrogen bond (Appendix 1—figure 1F and Figure 2C), suggesting potential cooperativity via the electrostatic and geometric stabilization of the bond. We also observe up to 20-fold enrichment at the ends of secondary structure elements, relative to the median backbone contact rate of 1.7%, with enriched positions often involving the last hydrogen bond made within a helix or at the end of a strand (Figure 2D and Appendix 1—figure 5), commonly placing them in the context of local capping motifs thought to stabilize secondary structure (Richardson and Richardson, 1988). Finally, we find that protein-RNA interactions typically involve pi-pi contacts, especially with arginine. A detailed description of these observations is included in Appendix 1.
Interactions at the surface of a protein are typically in competition with solvent and their enthalphic contribution often decreases with solvent exposure, as for protein-protein hydrogen bonds (Efimov and Brazhnikov, 2003). Planar pi-pi interactions, in contrast, cannot be formed with water, but often involve groups with hydrogen bond acceptors and donors; thus, we predicted that the frequency of pi-pi interactions in proteins could be increased in more solvated environments. To test this, we identified high-resolution structures with an abundance of solved water and then counted the observed solvent interactions by the number of water oxygen atoms within a broad VDW contact radius (4.9 Å) to each residue. We saw an unambiguous positive correlation between the number of water contacts and the probability that a residue is involved in a planar pi-pi-contact, with a significant increase in average probability observed for each additional water contact (Figure 3A,B), climbing even as the average number of protein:protein VDW contacts declines. This relationship is true for the general case (unspecified residue identity) and is also individually true for each of the nine residues with pi orbital-containing sidechains. However, both the contact frequencies and the amino acid frequencies themselves increase with a greater dependence on solvation for non-aromatic residues, especially for the charged amino acids Arg, Glu and Asp, such that non-aromatic contacts become the dominant form of interaction at high solvation levels (Figure 3—figure supplement 1). The relative increase is highest for contacts involving sidechains of like-charge, especially arginine (Figure 3—figure supplement 2), suggesting that solvation plays a role in the strength of the interactions.
Of relevance to intrinsically disordered protein regions that mediate interactions, we find that planar pi-pi interactions occur more often at positions with properties associated with disorder; they are more prevalent in proteins having overall less rigid secondary structure (with contacts for coil/loop/turn > strand > helix), especially disulfide bond containing proteins (Figure 2F), and in sequences that are locally enriched in residue types associated with backbone flexibility or breaking secondary structure (Gly, Ser, Thr, Pro) (Appendix 1—figure 6). Considering planar pi-pi contact frequencies as a function of the sequence position relative to secondary structure elements, we find that the frequency is highest in long loops, showing a sigmoidal relationship when transitioning from order to disorder that goes from 9.5% probability for residues >7 positions away from the closest loop/turn to 16% for residues >7 positions away from the closest helix/strand (Figure 3C,D).
To test whether these interactions are compatible with the multivalent interactions involved in phase separation, we examined contact statistics for protein interactions, comparing sidechain pi-pi interaction frequencies within a chain to those between chains. We classified interfaces as sequence- or complex-specific (between different chains of a crystal structure) and opportunistic (at crystal packing interfaces). In both cases, we defined interface residues as those with sidechains having at least one VDW contact to any atom in a different chain. We found that both the overall contact frequencies at interface positions and local (<5 residue) contact frequencies remain similar to the frequencies observed at non-interface positions, but that there is a significant exchange of long-range (≥5 residue sequence separation) inter-chain to intra-chain contacts (Figure 4). This exchange is also observed for the residues in interfaces involved in crystal packing interaction, demonstrating that long-range planar pi interactions are not specific to particular protein folds, but are common features of protein-protein interactions. These results suggest that non-local pi-pi contact propensity could play a general role in mediating protein interactions, including those driving phase separation.
In our bioinformatics analyses, we identified a type of interaction, planar pi-pi, which is more prevalent for solvated residues, RNA-binding interactions and regions lacking regular secondary structure. These properties are also associated with the emerging functional class of intrinsically disordered phase-separating proteins that coalesce through fluid, multivalent interactions to form protein-dense cellular bodies or membraneless organelles involved in RNA processing (Mitrea and Kriwacki, 2016), the nuclear pore (Frey et al., 2006) and extracellular biological materials (Yeo et al., 2011). The currently known phase-separating proteins are diverse, both in sequence and function (Mitrea and Kriwacki, 2016; Chong and Forman-Kay, 2016), but many are enriched in motifs we can now associate with high planar pi-pi contact frequencies (i.e. Pro-Gly, Phe-Gly, Ser-Arg, Tyr-Gly and Arg-Gly repeats) (Mitrea and Kriwacki, 2016; Nott et al., 2015; Schmidt and Görlich, 2015).
While phase separation of some proteins has been suggested to be driven by the potential for multivalent aromatic stacking and cation-pi interactions (Nott et al., 2015; Brangwynne et al., 2015), our observations show (i) that planar pi-pi interactions are a much more broadly distributed phenomenon in proteins than previously considered, especially in solvated protein regions, (ii) that aromatic residues are not required, (iii) that backbone pi groups make significant contributions, and (iv) that protein sequence can have distinct effects on both long-range contact propensity and local contact propensity. These led us to hypothesize that the number of pi orbitals available to make long-range multivalent contacts is an important feature in determining whether a disordered protein region can phase separate and, thus, that the sp2-hybridization of the arginine sidechain is more important to phase separation than its charge. We tested this hypothesis using the N-terminal 236 residues of Ddx4, an intrinsically disordered region that contains both Arg-Gly and Phe-Gly dipeptide sequences and that can phase separate (Nott et al., 2015). We removed pi-character while leaving charge intact by replacing all 24 Arg residues with Lys. Matching our expectation, this protein region fails to phase separate under the conditions characterized for the wild-type Ddx4 sequence, even at concentrations of 400 mg/ml, 200 times higher than the lowest concentration for which phase separation is observed for the wild type, and four times higher than observed for constructs with an equivalent mass change from mutating nine phenylalanine residues to alanine (Appendix 1—table 4). We note that arginine is likely key for the phase-separation, association and toxicity of C9orf72, which can encode Gly-Arg and Pro-Arg dipeptide repeat sequences (Lee et al., 2016).
Given this supportive experimental evidence for the role of pi interactions in phase separation and our observation that opportunistic non-local pi interactions are commonly found at protein crystal contacts, we chose to test the importance of these interactions for phase separation by determining the degree to which it is possible to predict general phase separation behavior using solely the pi-pi contact propensity of a protein sequence. We recognize that multiple physical interactions can contribute to driving phase separation (Brangwynne et al., 2015), but our goal was not to predict subtle differences in phase separation propensity or quantitative phase diagrams. Instead, we aimed to merely classify proteins as having the potential to self-associate under particular biological conditions or not, as a test of our hypothesis of the involvement of planar pi interactions. In this exercise, we define phase-separating proteins as those that for presumed functional reasons self-associate in a way that is at least transiently reversible and dynamic, allowing for the protein to self-concentrate as a function of available protein concentration, temperature or other condition. This basic definition does not cover the complexity of the phase diagram, merely the ability to reversibly self-concentrate, and does not consider competing transitions, such as irreversible aggregation and precipitation, which have typically been selected against in the natural sequences on which the predictor is designed to be used.
Using this definition, we applied a constrained training approach divided into two stages. In the first stage, we required accurate prediction of contact propensities for folded proteins, using sequence propensities for both local and non-local contacts. For this aim, we developed a statistical method for predicting the expected number of contacts given a protein sequence, using frequencies taken from the PDB, splitting observations by distinct residue pairs with varying sequence separation and applying a statistical comparison of the full list of pairs associated with a given sp2 group to calculate expectations (see Materials and methods). The reliability of these predictions against folded proteins is given in Figure 5A. We then predicted the number of pi-pi contacts for a list of 11 proteins containing IDRs that have been shown to be sufficient for phase separation behavior in vitro (Figure 5—source data 1A), finding that 8 out of the 11 have a predicted number of planar pi-pi contacts per residue in the 99th percentile relative to folded proteins found in the RCSB PDB (Figure 5B).
For the second stage, we developed a phase separation predictor that ranks sequences only by the weighted combinations of pi-contact frequency predictions, without any other interaction or observational data. We used a stochastic optimization approach to find optimal weights and sequence window normalizations for converting pi-contact frequency predictions into a score function able to discriminate known phase-separating proteins from sequences found in the PDB. The individual components weighted and normalized include: (i) short- and long-range contacts as defined by residue pair sequence separation ≤4 or >4, respectively, (ii) sidechain groups vs. the backbone peptide bond, (iii) absolute predicted frequency vs. normalized frequency compared to the specific group, and (iv) number of carbon atoms in the specific group. In constraining this stage of the test, we defined the fixed goal for optimization as the PDB normalized z-score difference between the highest scoring 1% of the PDB and the lowest scoring member of the phase separation training set. We then trained until reaching a plateau, and at that point we finalized the score, running a single validation test against a testing set of 62 proteins directly associated with phase separation in the literature. This testing set can be divided into three subsets by the nature of the evidence available: (i) sufficient for in vitro phase separation as a purified single component (which matches the training set), (ii) evidence of in vitro phase separation involving a multi-component system (e.g., phase separates on the addition of RNA), without evidence of independent phase separation, and (iii) direct evidence of in cell phase separation (where the protein itself has been labeled and dynamic exchange demonstrated by FRAP or similar methods) without evidence of in vitro phase separation or sufficiency.
We used receiver operating characteristic (ROC) plots comparing predictions of phase-separating proteins within the test set against predictions of phase-separating proteins in the human proteome to assay the ability of the predictor to rank known positives against the members of a set that we assume is primarily negative; the area under the curve (AUC) measurement describes the ability to discriminate between sets. For the human proteome as the negative set, we show an AUC of 0.88 ± 0.02 measured using the entire testing set as a positive, 0.93 ± 0.01 if we exclude sequences which only phase separate in complex with other polymers, and 0.96 ± 0.01 if we restrict to the 32 test set sequences that match the sufficiency criteria used for selecting the training set (Figure 5C and Appendix 1—figure 7A). These measurements are complicated by the potential for homology between test set and training set proteins. To control for this, we also measured discrimination using another positive set of the 59 artificial sequences designed and shown to phase separate by the Chilkoti lab (Quiroz and Chilkoti, 2015; MacEwan et al., 2017; Simon et al., 2017) (details in Figure 5—source data 1C), showing an AUC of 0.86 ± 0.03 against the human proteome as a negative set (Appendix 1—figure 7B).
Interpreting these AUC values is complicated by the fact that the true positive rate of the human proteome is unknown, and our analysis will treat unknown phase-separating proteins as false positives, inaccurately decreasing the AUC. Similar analysis against protein sets with less expected phase separation results in higher AUCs, going from 0.88 ± 0.02 for human to 0.92 ± 0.01 for Caenorhabditis elegans, 0.93 ± 0.02 for Saccharomyces cerevisiae, 0.98 ± 0.01 for Escherichia coli, and 0.97 ± 0.01 for our PDB testing set. Within the comparisons to E. coli and S. cerevisiae proteomes, we show examples of the proteome-dependent score distributions underlying the analysis (Appendix 1—figure 7C,D). Using a defined standard confidence threshold of ≥4.0 standard deviations from the PDB average for the propensity score (PScore) captures 0.3%, 2.2%, and 5.1% of the E. coli, S. cerevisiae, and human proteome sets, respectively, as compared to 0.1% of our full PDB set and 81% (26/32) of the self-sufficient for in vitro phase separation test set (dropping to 36/62 for the entire proteomic test set and to 35/59 for the synthetic test set).
When compared to the unweighted pi-contact predictions, the trained PScore confirms the training results, with the number of test set proteins that fall within or above the top 1% range of the PDB increasing from 11/30 to 29/30 (Figure 5D). This increase is matched by an increase in the percentage of human proteins in the same range, from 2.3% to 13.1%. Even though the score is trained for discrimination against folded proteins, we do not see a systematic increase in the scores of all disordered human proteins. Comparison against a top performing sequence homology-based disorder predictor (Disopred3, [Jones and Cozzetto, 2015]) and a physics-based disorder predictor (IUPRED-Long [Dosztányi et al., 2005a]) shows that disorder predictors are better at discriminating disordered proteins from the PDB and the human proteome, while the PScore is consistently better at identifying phase-separating proteins (Appendix 1—table 5). The majority of the proteins in our phase separation test set show disordered character, and the analysis shows that, while PScore does correlate with disorder, it only highlights a subset of disordered proteins and does not reflect a general disorder prediction (Figure 5—figure supplement 1). As a direct test of this discrimination, we find that using the subset of human proteins with known intrinsic disorder (Piovesan et al., 2017) as the phase separation negative set shows similar results as using the human proteome as the negative, at AUC:0.84 ± 0.03 for the full test set and AUC:0.93 ± 0.02 for the in vitro sufficient set.
We note that the optimization methodology used for developing our predictor, specifically training for discrimination against the PDB, was intended to exclude phase separation involving multivalent binding properties of folded proteins with multiple binding surfaces (Pierce et al., 2016; Marzahn et al., 2016) or multiple folded modular binding domains that interact with multiple linear sequence motifs (Li et al., 2012; Banjade and Rosen, 2014). Thus, we expect and find a lower success rate for prediction of phase separation of proteins using these mechanisms. We also note that the goal of the prediction experiment is to see whether observed phase separation can be predicted exclusively from contact probabilities as a test of the hypothesis that pi interactions are important for phase separation, but that our method uses probabilities found in the PDB, was trained on natural sequences, and was tested using sequences that are either found in nature or were designed based on sequences that are. The ability to predict contacts is expected to decrease for sequences not observed in nature and for sequences relying to a greater degree on other energetic contributions.
In order to identify the contact features that play the largest role in the optimized predictor, we did a retrospective analysis testing the predictive power of different scoring algorithms produced during the training process, and explored potential mechanistic implications by testing the power of individual score components, grouping contact predictions into long-range vs. short-range and backbone vs. sidechain (Appendix 1—table 6). Our analysis shows that, while training did improve the predictor, a comparable result can be obtained by using only the long-range contact rate predictions for the peptide backbone (Figure 5—figure supplement 2, as further described in Appendix 1). This property significantly upweights the role of residues, especially Gly and Pro, that are associated with high overall backbone pi-pi contact frequencies and with lower short-range contact frequencies for local sidechain groups, and is especially important for predicting elastin-like proteins, which often have very few sp2-containing sidechains. Thus, these results highlight the increased availability of sp2 groups for non-local pi-interactions as a key driving force behind the phase separation predictions and is consistent with highly multivalent weak interactions leading to phase separation, both in non-polar structural proteins like elastin and highly charged RNA-binding proteins like FUS or Ddx4.
Many high contact frequency residue types are also associated with disordered proteins in general, so to control for that potential role we took a selection of 3501 human proteins predicted to have long disordered regions (as described in the methods), split them by PScore into high (PScore ≥4) and low (PScore <1) subsets, and compared the sequence characteristics distinguishing high PScore and low PScore sequences (Appendix 1—figure 8A). We find that non-phase-separating intrinsically disordered proteins are actually depleted in Gly and Pro, especially relative to the enrichment seen in phase-separating sequences and sequences predicted to phase separate. Conversely, they are most enriched in Lys, which on average is depleted in phase-separating sequences.
While the division of the predictor into two distinct protocols was used to avoid scores that simply describe sequence similarity to the training set, it is still possible that the training process picked up on specific sequence features in the training set. To explore the contribution of sequence similarity to the score, we made a measurement of sequence profile similarity based on dipeptide composition (neighboring residue pair frequencies). We compared the high scoring regions selected by the predictor to each of the sequences used in the training set (Appendix 1—table 7, see Materials and methods). This analysis, shown in Appendix 1—figure 8B, finds that high scoring (PScore ≥4.0) human proteins are, on average, more similar to the training set than are human proteins in general, but that the majority fall within the normal range. Comparison to a set of 1000 BLAST-level sequence homologs of the training set suggests that the majority of the similarity is compositional preference, not homology.
Both sequence similarity and compositional behavior can also be related to the bias toward disorder regions observed in phase-separating proteins. To characterize this, we again took the high and low PScore subsets of our set of 3501 human proteins predicted to have long disordered regions and then compared their sequence profiles. It has previously been observed that disordered proteins have a Shannon entropy (a measurement of sequence complexity) that is lower, but significantly overlapping with ordered proteins (Romero et al., 2001). We find here that the high PScore set has a Shannon entropy that is far lower than the range seen for low PScore disordered proteins, which have Shannon entropies that fall in the range observed for folded proteins (Appendix 1—figure 8C). Comparing our phase separation test set with the human disprot set we can confirm that this bias toward lower complexity sequences is observed in known phase-separating sequences.
Given the favorable characteristics of our predictor, we investigated correlations of phase separation scores with protein interactions, various biological mechanisms that may regulate phase separation and gene ontology (GO) terms. The principle of sequences with high propensity for non-local pi-pi contact being more likely to self-associate implies that different proteins with high phase separation propensity scores would be more likely to interact with one another. By comparing score pairs from protein interactions taken from the I2D metadatabase (Niu et al., 2010), we confirm that high-scoring proteins and low-scoring proteins are both over two-fold more likely to interact with proteins of similar score, relative to expectations (Figure 6A). This holds true even when comparing interactions between largely hydrophobic or cytoskeletal proteins (such as elastin and collagen) and highly polar RNA-binding proteins (like Ddx4 and FUS).
This like-score interaction propensity is predicted by a model of phase separation in which multivalent but individually low-affinity interactions between proteins of similar character coordinate the formation of large, dynamic complexes. To test this aspect of the score, we looked at large complex formation and interaction propensity by examining the background ‘contamination’ rates observed in affinity purification coupled with mass spectrometry (AP-MS). Large complex formation is measured by the number of negative control experiments in which each human protein appears, over a set of 411 experiments involving non-specific affinity purification steps performed without the specific affinity tag (Mellacheruvu et al., 2013). Within this dataset, we observe that 26/28 of our known human phase-separating proteins show up as a contaminant in at least one experiment (O/E = 3.51), and 17/28 show up more than 10% of the time (O/E = 14.9), confirming that phase-separating proteins show the expected behavior. By binning proteins by prediction scores, we show that this is also a trend for high PScore proteins in general (Figure 6B), suggesting that the pi contacts driving this score may play a general role in localizing proteins to large complexes.
Phase separation behavior could potentially be modulated by the addition, modification, or removal of even small segments with high phase separation propensity, leading to regulation of phase separation by alternative splicing and post-translational modification (Romero et al., 2006; Hegyi et al., 2011). To test the possible regulation by splicing, we ran our predictor against human sequences in the UniProtKB/Swiss-Prot (Magrane et al., 2011) variable splicing database. We found that 40 ± 2% of included proteins strongly predicted to phase separate (PScore ≥4) have alternative splice variants which either remove the prediction or significantly change the score (ΔPScore >1), often having multiple splice variants spanning a wide range of scores (Figure 6C). By comparison, an overall rate of significant changes in score (ΔPScore >1) of 23.0 ± 0.4% is observed for all proteins in the set.
To examine post-translational modifications (PTMs), we analyzed our scores against the database of known PTMs curated by PhosphoSitePlus (Hornbeck et al., 2012). We tested the relationship between predicted propensities and number of PTM sites, controlling for protein length by taking PTM counts from the maximum number of annotations observed for any 100 residue window in a sequence. By comparing populations with an above average number of sites (greater than the average plus one standard deviation) against the baseline frequency, we see enrichment in high PScores (≥4) for a variety of PTM site annotations, including literature annotations of disease relevance and known regulatory function (Appendix 1—table 8). We also observe that for phosphorylation and methylation the absolute number of PTMs correlates with the average PScores observed, with methylation having a stronger effect than phosphorylation, and ubiquitination shown as a negative control (Figure 6D).
Next, we compared our phase separation predictions to known localization or function, as annotated in the GO database (Figure 7A,B,C). Ranking GO terms by enrichment of proteins with prediction values above our threshold (PScore ≥4) enabled us to generate a list of terms associated with significant enrichment of pi-pi contacts (p<0.000001 and 5–50 fold observed over expected); this list includes 4.1% of the 27342 GO terms tested. This subset of the GO database demonstrates enrichment for phase separation propensity in known phase-separated compartments (stress granules, Cajal bodies, post-synaptic density [Zeng et al., 2016]), in RNA processing (transcription, splicing, modification, transport, and stability), in the assembly and plasticity of structural components (cytoskeletal organization, extracellular matrix assembly), and in signaling, regulation, and development (Notch signaling, NF-κB, Wnt). We note that the sequence property predicted here is a physical behavior that occurs on a cellular scale, so the observation of a similar score distribution for a specific biological process, as observed for annotations involving localization to known phase-separating bodies, is an implicit prediction that phase separation is one of the physical mechanisms involved in the process. Consistent with that, we see similar score distributions for many processes involving organization of structural components, signaling, and cell-fate commitment. The property of phase separation is also strongly associated with the regulation and development of multicellular cooperation and neurogenesis. In contrast, the vast majority of GO terms (77.2%) show no enrichment in phase separation propensity, with significantly lower enrichment in categories involving metabolic processes and enzymatic catalysis. A selection of high-scoring human proteins associated with enriched functions is shown with per-residue scores and PTM annotations in Appendix 1—figure 9, with examples chosen from the highest scoring protein in any given gene ontology function/localization annotation related to neuronal plasticity or behavior in A, cytoskeletal biomaterials in B, signaling in C, and extracellular biomaterials in D.
Within the testing set, there exist some proteins which have not been shown to be capable of independent phase separation (Han et al., 2012), and which may associate with phase-separated bodies without sharing the same behavior. One of these, synaptic functional regulator FMR1 (El Fatimy et al., 2016), also known as fragile X syndrome protein FMRP, has a PScore of 4.7, and is involved in RNA binding, neurological development and regulation of translation, all GO terms enriched in high PScores. FMR1 is a multifunctional polyribosome-associated protein, which is highly expressed in the brain and in the testes, and is known to localize to granular bodies with two other proteins (FXR1 and FXR2) (El Fatimy et al., 2016) that are also predicted with high PScores (at 2.9 and 5.3). In order to validate that high PScore predicts sufficiency for phase separation and not associated properties like miscibility in the separated phases of other proteins, or other interactions with phase-separating proteins, we purified the highest scoring region (residues 445–632) and confirmed the ability to spontaneously undergo liquid phase separation at low temperature and high concentrations in physiological buffer conditions (Figure 8A). The concentration required for visual confirmation of liquid phase separation behavior is quite high, at 1 mM FMRP-LCR, but can be reduced through the use of crowding reagents (Figure 8—figure supplement 1A). To confirm the relevance of pi-character, we then replaced all 28 Arg residues with Lys, which resulted in a loss of phase separation behavior (Appendix 1—table 4).
To test whether or not the predictor is applicable to sequences that do not share motifs or functions with any of our training set proteins, we did a manual search for predictions with sequence properties and functions dissimilar from the training set proteins and selected two proteins, human engrailed-2 (UID: P19622, PScore 5.0), a DNA-binding homeobox protein, and the pAP isoform of the Human cytomegalovirus capsid scaffolding protein (UID: P16753-2, PScore 3.8), a protein that plays an essential structural role in assembling the viral capsid, a novel function relative to those known to involve phase separation. Both sequences have little overlap with any of the sequence motifs found in our training set (Appendix 1—table 7), aside from general enrichment in glycine and proline residues. Experimentally, we observe reversible liquid phase separation of pAP protein with increasing temperature, with viscoelastic properties similar to the coacervation of elastins (Figure 8C). We did not observe phase separation of engrailed-2 under the same buffer conditions, even at 1 mM protein concentration, but did observe temperature-dependent liquid droplet formation in the presence of a crowding reagent (20 mg/ml ficol) (Figure 8—figure supplement 1B). While these observations do not represent a robust or comprehensive test of prediction quality, they do suggest that the predictions provide a useful tool for selecting natural proteins capable of self-sufficient liquid demixing.
We tested the potential role played by pi-contacts in mediating phase separation by using the single property of pi-contact frequency to train a simplistic predictor of phase separation behavior found in natural sequences, finding that the single property of long-range pi-contact propensity is sufficient for marking the majority of known phase-separating proteins as outliers relative to the proteome, supporting the hypothesis that this sequence property is commonly associated with phase-separating proteins. While this association is demonstrably useful for identifying phase-separating proteins in proteomic datasets, these contacts may not be the predominant interaction driving the physical process of phase separation for each case, and could instead reflect a modulatory role since it is not exclusive of other interactions like hydrogen bonds and charge interactions. However, tests showing that arginine to lysine mutations abrogate phase separation behavior do provide evidence of the importance of planar sp2 groups for phase-separating systems.
The finding that a single contact potential can generate a reasonably accurate classifier of phase separation behavior suggests that a sequence-based prediction of phase separation behavior is a tractable problem, and that future development of an algorithm that can predict the complexities of the phase transition, environmental effects and concentration requirements is a reasonable goal. This goal could potentially be addressed by introducing the range of phase separation associated sequence properties that were intentionally excluded by our empirical test of the pi-contact association, including the electrostatic effects of charge patterning (Nott et al., 2015; Lin et al., 2016; Das and Pappu, 2013), multivalency of PTM sites and PTM-binding motifs, and transient structural interactions, including strand formation (Murray et al., 2017) and helical interactions (Conicella et al., 2016). There is also a role for incorporating predictions of competing states, the irreversible aggregation propensity of a sequence or its amyloidogenic potential. Incorporating annotation data associated with phase-separating proteins could be another avenue for generating a physiological classifier in a more comprehensive predictor.
The physical nature of pi-pi contacts and their underlying mechanistic relationship to phase separation are not revealed by the simple contact frequency measurements used in our predictions. These contacts are observed in folded proteins, both internally and near solvated interfaces and, while that suggests they play a general role in the energetics of protein-protein interactions, the nature of that role is not clear. There is potential for electrostatic or induced dipole and quadrupolar interactions, especially in the context of other dipole interactions and hydrogen bonds, but the flat surfaces of sp2 groups could also enable solvation to drive contacts and lead to entropic contributions due to the relative freedom of movement inherent in packing flat plates, compared to the more rigid shape complementation involved in packing aliphatic groups. It is interesting to note that these proposed mechanisms could be affected by temperature in opposite ways, and that our predictor using pi-contact frequencies is useful in identifying phase-separating proteins regardless of whether they associate more readily as temperatures decrease (such as Ddx4) (Nott et al., 2015) or increase (such as elastin) (Yeo et al., 2011).
As part of characterizing the proteomic associations highlighted by our empirical prediction test, we point out that manual inspection of our prediction results across the human proteome suggests that planar pi-contact associated phase separation likely facilitates a wide range of cellular functions. To highlight this, we selected a range of examples by taking the highest scoring member of gene ontology categories we found to be generally enriched in high PScore proteins. We see enrichment of phase separation propensity in proteins associated with cytoskeletal organization. These include proteins with known structural roles such as the cytoskeletal intermediate filament proteins desmin and vimentin (PScores 4.3 and 4.4), as well as keratins 8 and 18 (PScores 5.9 and 5.4), with scores deriving primarily from the disordered head and tail domains (Appendix 1—figure 9). Intermediate filaments form through dynamic processes (Yeo et al., 2011) consistent with a model in which phase separation-induced condensation concentrates proteins prior to the formation of (often fibrillar) structure (Yeo et al., 2011). Interestingly, helical domain mutations impeding structure formation cause these four proteins to instead accumulate in protein-rich membraneless inclusions such as Mallory-Denk bodies (Strnad et al., 2008; Goebel and Bornemann, 1993; Ogrodnik et al., 2014). We also predict high PScores for non-structural proteins involved in regulating cytoskeletal organization and in binding some of the previously mentioned cytoskeletal proteins, including focal adhesion kinase 1 (PScore 4.2) and DNAJB homolog 6 (PScore 8.8), the latter of which is also a chaperone that can prevent huntingtin aggregation (Chuang et al., 2002; Hageman et al., 2010; Gillis et al., 2013).
Many of the high PScore predictions involve proteins that are both involved in signaling pathways and known to either localize to membraneless organelles or interact with phase-separating proteins. For example, adenomatous polyposis coli protein (APC) (PScore 3.2) and axin1 (PScore 2.2), involved in the Wnt signaling pathway, interact in a dynamic fashion in the large and multimeric β-catenin destruction complex (Pronobis et al., 2015), and we find high PScores for other critical members of the complex, including β-catenin (PScore 5.5) and GSK3α (PScore 6.4). The β-catenin destruction complex formation is regulated by GSK3 phosphorylation, and we note that the predictor shows a difference between the two human GSK3 orthologs, with GSK3β having a PScore of 2.2. These orthologs are often functionally interchangeable (Doble et al., 2007), but there is evidence of isoform-specific roles for GSK3α (Ma, 2014) and the predicted differences could reflect a difference in modulating phase-separation behavior.
In conclusion, we have shown that planar pi-pi interactions are more prevalent in protein structures than previously described, with potential roles in structural motifs, catalysis and RNA binding. Planar pi-pi contact frequencies are increased in protein segments that lack regular secondary structure or have increased solvent exposure, pointing to their relevance for disordered protein regions. This, together with the enrichment of pi-containing groups in protein regions known to phase separate, provided an impetus for development of a phase-separation predictor based on the likelihood of forming non-local planar pi-pi contacts. The performance of the predictor supports the hypothesis that these pi-pi interactions can drive phase separation. While experimental data and computational work suggest other contributions (Yeo et al., 2011; Pak et al., 2016; Nott et al., 2015; Lin et al., 2016; Kim et al., 2016; Brangwynne et al., 2015), including the hydrophobic effect, electrostatics and multivalent binding of folded protein domains, our prediction test shows that an algorithm focused solely on pi-interactions performs well for the majority of proteins that we identified as phase-separating from the literature (Figure 5C, Figure 5—source data 1B). These results strongly suggest that most phase-separating proteins can make significant non-local planar pi-interactions, even in cases where there are other dominant or required forces driving phase separation. Thus, this represents a valuable tool for the currently expanding field of protein phase separation and its link to biological function and disease (Mitrea and Kriwacki, 2016; Conicella et al., 2016; Patel et al., 2015). In particular, the association of neurological diseases with proteins comprising RNA processing bodies (Vanderweyde et al., 2013), including those known to phase separate, highlights the importance of predictive methods for facilitating mechanistic studies of the underlying biology and pathology.
Protein structures determined by X-ray crystallography were downloaded from the PDB based on lists compiled using the Pisces web server (Wang and Dunbrack, 2003), May 7 2015, which identified 23074 non-redundant chains based on cutoffs of <60%,<5.0, and <0.5 for sequence identity, resolution, and R-factor, respectively. For calculating statistics, high-resolution structures were defined as a subset of 5718 structures with resolution ≤1.8 and R-factor ≤0.18. For structures determined using distance restraints from nuclear magnetic resonance spectroscopy, we took the full list of 2949 PDBs with distance constraints available from the BMRB (Ulrich et al., 2008) database of converted restraints (DOCR) (Doreleijers et al., 2005) as of July 3rd, 2015.
To probe contact geometries we read each set of coordinates into custom python scripts, filtering input data by ignoring sp2 systems that lack any of the expected heavy atoms (<0.1%) and only taking the first set of coordinates when represented by multiple conformations. The sp2 systems were defined by atom names for each of nine sidechain groups (from W,F,Y,H,R,Q,N,E,D), the backbone peptide bond, and the C-terminal carboxyl group. Planar axes were defined as normal vectors by using the cross product method against defined lists of three sequential atoms. VDW contacts between sp2 groups were determined by the full set of heavy atom (C, N, O) distance measurements, using a threshold of ≤4.9 Å to define contacts. This represents the upper range of VDW contacts between sp2 groups (Appendix 1—figure 1E), because we intended to compare contact frequencies by data resolution and did not want to introduce arbitrary energetic cutoffs for atoms with potentially unreliable positions.
In analyzing the planar orientations of sp2 groups found in VDW contact, we found enrichment of in-plane contacts, predominantly face-to-face, so we devised a simple system for identifying them that can be generalized across groups with variable numbers of atoms. These planar surface contacts were defined first by requiring at least two different pairs of atoms to be in VDW contact. Contact distances were further restricted by requiring that surfaces 1.7 Å above the sp2 plane be ≤1.5 Å apart (as shown in Appendix 1—figure 1B). This planar-surface distance requirement is used to ensure contacts that put the pi-orbitals in proximity to one another, and we note that while this threshold will accept atom-atom contacts as far as 4.9 Å apart the majority end up below 4.0 Å (Appendix 1—figure 1E, in purple). To restrict contacts to planar contacts, the dot product of the planar normal vectors were required to have an absolute value ≥0.8 (equivalent to an orientation difference from 0° to ~37°). This threshold retains >80% of the contacts identified by distance, and was chosen because interactions between planar groups show a noticeable enrichment relative to random orientation in this range (Appendix 1—figure 1D).
Annotation data for the full non-redundant set of PDBs analyzed are included in files Figure 1—source data 1 and 2, and scripts for creating contact annotations from a PDB are included in supplemental file Source code 1.
Comprehensive lists of planar pi-pi contacts were computed for each chain and were stored in a database. Contact frequencies were calculated as the total number of observed contacts divided by the total number of residues considered. Residues were counted only for each non-redundant chain, and contacts include both the ones made within that chain and the ones that chain makes to any other chains present in the PDB (except when noted otherwise). Contacts to crystal symmetry partners were also measured but were kept separate and, except where specifically investigated to probe inter-chain contacts, were excluded from analysis based on the observation that VDW contacts made to symmetry partners can contain a small (<1%) population of extreme clashes (atoms <1 Å from one another).
PDBs were sorted into 77 non-overlapping bins first by exact resolution and then by rolling any bins with less than 100 PDBs into the next acceptable bin within 0.25 Å. This method rounds up the small populations of resolution values while retaining as much resolution information as possible. Correlation values and lines of best fit were estimated using linear regression (inherited from the scipy python package, version 0.12.1) against bin averages, with bins weighted by sample size.
In order to compare frequency of contacts involving all 20 common amino acids, we defined involvement based on the participation of any atom from that residue in a planar contact. For most sequence positions, this means at least one contact made to either one of the flanking peptide bonds or, for the nine amino acids that have them, the sidechain group. By this definition, backbone planar contacts involve both flanking residues.
To provide a prior expectation control for enrichment of planar pi-pi contacts, we took the terminal heavy-atom planar surfaces from fully sp3 hybridized sidechains, using the following PDB atom names to define each planar group: Leucine: CD1, CG, CD2; Valine: CG1, CB, CG2; Methionine: CE, SD, CG; Isoleucine: CD1, CG1, CB; Cysteine: SG, CB, CA; Serine: OG, CB, CA; Threonine: OG1, CB, CG2; and Lysine: NZ, CE, CD.
The PDB was screened for crystallographic structures containing either amino acids or other small molecules as free ligands, with the other small molecules being restricted to those that (1) are present in more than 100 structures, (2) have a single sp2 group, and (3) have all heavy atoms (C,N,O) falling within the sp2 plane. These structures were then filtered for resolution (≤3.0 Å) and redundancy (≤90% identity) by using the Pisces web server. Contact frequencies were determined across the full list of ligands in these non-redundant sets, with contacts to amino acid ligands being divided into backbone carboxyl and sidechain sp2 groups. As an internal control for amino acid contact frequencies, contact frequencies were determined for each amino acid based on the same set of structures used to define the ligand frequency. For sidechain groups, the controls are their direct equivalents found within the protein, and for the amino carboxyl groups, we used the protein C-terminal carboxyl groups as the control. Population statistics are summarized in Appendix 1—table 2.
We defined catalytic sites based on direct literature annotation as described in the Catalytic Site Atlas (Furnham et al., 2014), with 2914 residue positions identified over 928 protein structures. The full population of residues across the annotated chains was split into 40 bins according to identity and annotation status. Relative contact involvement frequencies, catalytic vs. non-catalytic, were obtained for each amino acid type. For each catalytic residue, we then identified the total number of VDW contacts made to any other residue, identified which VDW contacts fall into the subset defined by our pi-contact rules, and then, for the 2377 catalytic residues with at least one VDW contact to another catalytic residue, we computed the frequency of VDW contacts that are also pi-contacts.
Hydrogen bond data were calculated using PHENIX (Adams et al., 2010), with amino sidechains allowed to flip 180 deg to maximize the number of donor/acceptor pairs. DSSP (Kabsch and Sander, 1983) was used to define backbone secondary structures. Water contacts were defined by direct distance measurements, with the full set of water molecules, including symmetry partners, extracted using the SYMPEXP function from pymol (Schrodinger LLC, 2015). NMR restraints were obtained for 2949 structures from the Database Of Converted Restraints (Doreleijers et al., 2005). For defining short secondary structure motifs, we used the simplified one letter definitions provided by DSSP (‘H’, ‘B’, ‘E’, ‘G’, ‘I’, ‘T’, ‘S’, and ‘‘), in order to maintain adequate sample size when comparing enrichment across motifs. For comparing ordered and disordered residues, clear helices and strands (‘H’ for α-helix, ‘G’ for 310 helix, and ‘E’ for β-strand) were defined as the ordered assignments, representing regular secondary structure.
We trained a statistical potential for predicting pi-contact frequency from protein sequence for individual sp2 groups, with contacts split by sequence separation into short-range (≤4 residues apart) and long-range (≥5 residues apart, or different chains). We trained against an 80% random cut of the 17388 proteins in our non-redundant crystal structure subset of the PDB, leaving the remaining 20% as a testing set for a single final test of the predictor. The final predictor, covered in detail in Appendix 2, operates by first averaging the frequencies observed for sp2 groups found in specific sequence contexts (with context defined as all residues within 40 amino acids of a given residue) and then comparing the average values to the distributions observed for sp2 groups with the same sequence identity (with nine sequence identities for sidechain sp2 systems, and 400 distinct identities for the peptide backbone), where the final prediction is the contact frequency observed at matching positions in the PDB.
To develop a predictor for the phase separation propensity of a given protein sequence, we started by defining a set of 11 proteins which have been shown, in the literature, to phase separate in vitro as single purified components due to interactions involving intrinsically disordered regions of the protein. We also defined a leave out set of 62 proteins associated with phase separation in the literature by combining proteins matching three distinct criteria: ‘in vitro sufficient’ (N = 32), proteins satisfying the criteria used to select the test set, ‘in vitro insufficient’ (N = 18), proteins for which the literature contains evidence of in vitro phase separation in complex with other proteins or with RNA but phase separation as a single component not observed and ‘in cellulo associated’ (N = 12), proteins without in vitro characterization, but with evidence of phase separation in live cells, as determined both by localization to a known dynamic protein body and by a direct measurement of dynamic character, typically involving FRAP recovery of a fluorescent tag. These benchmark sets are included in supplemental file Figure 5—source data 1A,B and C.
Over these datasets, we only found eight proteins less than 300 residues in length, with the smallest protein sequence observed (RBM3_HUMAN) being 157 residues long. To avoid extrapolating our predictions onto an unobserved class of proteins, we decided to restrict testing to sequences ≥140 residues in length. To define additional control and training sets, we applied this sequence cutoff to a series of datasets, including the PDB sets used for developing the pi-contact predictor, with 13388/17388 training set and 3406/4347 test set sequences retained after restricting by length ≥140, the UniProt human reference proteome (September 2016, 18582/21047 sequences used), and the subset of the human proteome with known disorder, as defined by the DISPROT database (Piovesan et al., 2017) (205/249 sequences).
The phase separation propensity predictor starts by inheriting a table of 8 pi-contact prediction values per sp2 group in the sequence, splitting contacts by i) short-range (≤4 sequence separation) vs. long-range (>4), sidechain vs. backbone, and absolute predicted frequency vs. relative difference from sp2 groups with the same identity (with nine sidechain sp2 groups and 400 backbone groups, split by their associated sequence). Sequences are then scored by a series of weighted sequence window averages. Weights, window length, and normalization parameters were refined using a stochastic optimization process to maximize the score difference between the lowest scoring member of our 11 member training set, and the average score of the highest scoring 1% of the PDB training set. A full training history and details of the final predictor are described in Appendix 2. AUC values at different stages of training the predictor are tabulated in Appendix 1—table 6. Standard error of the mean values for AUC calculations were estimated by bootstrap using sampling with replacement (10,000 iterations) against both the test and human sets.
The final predictor consists of a single python script and associated database files, with the state at time of submission included in supplemental file Source code 1.
Phase separation scores for analyzed proteins were considered with respect to known interactions, and functional annotations, using the gene ontology database (release Oct-04–2016), UniProt’s Swiss-Prot and TREMBL sequence databases, including the reference proteome annotations, vertebrate protein sequence list, and variant splicing data (release 11-May-2016) (Magrane et al., 2011; Pundir et al., 2017; Huntley et al., 2015; Suzek et al., 2007), PTM data from PhosphoSitePlus (release Dec-16–2011) (Hornbeck et al., 2015), human protein-protein interactions collated under the Interologous Interactions Database (I2D) (version 2.9) (Brown and Jurisica, 2007), and background AP-MS detection rates from the Contaminant Repository for Affinity Purification Mass Spectrometry Database (CRAPome version 1.1) (Mellacheruvu et al., 2013).
GO term enrichment data over the full range of propensity scores were analyzed against all proteins with UniProt codes contained within both the vertebrate reference sequences and the gene ontology database. Enrichment scores and p-values for individual GO terms were obtained for a defined four sigma cut against the human proteome by using PANTHER (Thomas et al., 2003; Mi et al., 2017; Mi et al., 2016) analysis.
Per residue disorder predictions were obtained using Disopred3.16 (Jones and Cozzetto, 2015) (standard command line and Refseq database) and IUPRED-Long (Dosztányi et al., 2005a; Dosztányi et al., 2005b) against the phase separation test and training sets, the PDB test set, the human Disprot set, and a random selection of 7397 sequences from the human proteome. To convert these into per-sequence scores for comparison to the PScore, we then used the optimized window averaging method developed during training of the predictor, where the window is defined as all residues within five sequence positions of the highest scoring sixty. These scores can be used to classify whether or not a given sequence has a number of disordered residues comparable to the length of a folded domain, either concentrated in a single large region or distributed throughout the sequence.
Sequence similarity to the proteins within the training set was measured by computing dipeptide sequence profiles (the frequencies of all 400 possible i,i + 1 amino acid combinations in a sequence), calculating block L1 distances between a query dipeptide profile and each of the training set profiles, and then returning the lowest observed distance. Comparison to the sequence similarity of direct homologs was observed against a set of 1100 sequences obtained via BLAST by using the phase separation training set sequences used as queries against the seq database (E = 0.0000001).
Shannon entropy values were calculated for amino acid profiles of sequences by the standard equation (Shannon, 2001), and comparisons of high and low scoring disordered proteins were obtained from the subset of human sequences with Disopred3 predictions > 0.80, using the window averaging method described previously (N = 3501 out of 7397 sequences). High and low PScore sets were defined by PScore ≥4.0 (N = 310) and PScore <1.0 (N = 1044), corresponding to our standard phase separation confidence threshold and scores less than one standard deviation above the PDB average, respectively.
Ddx4: Constructs for Ddx4 1-236 wild type sequence (UID: Q9NQI0-1) and mutants were synthesized and subcloned into a pET Sumo vector (Genscript) to produce His-SUMO-Ddx41-236 (Nott et al., 2015), His-SUMO-Ddx41-236(9FtoA) (Nott et al., 2015), His-SUMO-Ddx41-236(14FtoA) (Brady et al., 2017), and His-SUMO-Ddx41-236(RtoK) (Brady et al., 2017). Protein was overexpressed in E. coli and purified as described previously (Nott et al., 2015). Phase separation was induced at 24°C by dialysis of a high concentration of Ddx4 in 20 mM Na2PO4, 1 M NaCl, 5 mM TCEP, pH 6.5 into a buffer containing 20 mM Na2PO4, 100 mM NaCl, 5 mM TCEP, pH 6.5. Concentrations were measured by spectrophotometry, using an extinction coefficient of 23950 M−1cm−1 at 280 nm.
FMR1: His-SUMO-FMR1445-632 (FMR1 from UID: Q06787-1) and His-SUMO-FMR1445-632(RtoK) were transformed into E. coli BL21-CodonPlus(DE3) RIL cells. Bacteria were grown in Luria Both at 37°C and protein expression was induced with 0.5 mM IPTG at OD600 nm of ~0.6–0.8, followed by overnight growth at 24°C. Cells were harvested by centrifugation and pellets were stored at −20°C. Protein pellets were re-suspended in lysis buffer containing 50 mM NaPO4 pH 8.0, 6 M guanidinium chloride (GdmCl), 500 mM NaCl, 20 mM imidazole and 2 mM DTT. Cells were then lysed via sonication and lysates were cleared by centrifugation at 39,000 g for 30 mins at 4°C. The supernatant was loaded onto a HisTrap column equilibrated with the lysis buffer followed by extensive washing with the same buffer (10 CV). The GdmCl was removed by washing the column with buffer containing 50 mM Na2PO4 pH 8.0, 500 mM NaCl, 20 mM imidazole and 2 mM DTT (10 CV). The protein was then eluted in the same buffer supplemented with 300 mM imidazole. The His-SUMO tag was cleaved with the SUMO protease, Ulp, while dialyzing against 50 mM NaPO4 pH 8.0, 500 mM NaCl, 20 mM imidazole, and 10 mM DTT at 4°C over night. The dialysate was loaded again onto a HisTrap column equilibrated with dialysis buffer to separate the His-SUMO tag and the His-tagged Ulp from the FMR1445-632 protein. All fractions were analyzed by SDS-PAGE, and fractions containing the protein of interest were combined and concentrated with ultrafiltration. Concentrated samples were passed over a Superdex 75 gel filtration column into a final buffer of 50 mM NaPO4 pH 8.0, 2 M GdmCl, 200 mM NaCl, and 2 mM DTT. Protein identity was confirmed by mass spectrometry and frozen at −80°C until use. Concentrations were determined from the absorbance at 280 nm using a molar extinction coefficient of 9970 M−1cm−1.
pAP: His-SUMO-pAPA341Q (SCAF Isoform pAP from UID: P16753-2, with a single mutation A341Q added to confer protease resistance (Brignole and Gibson, 2007)), was transformed, grown, induced, and purified following the protocol for FMR1, but with growth post-induction done for 4 hr at 37°C, and with an additional Superdex 75 gel filtration step added between the first HisTrap step and the Ulp cleavage step. Concentrations were determined from the absorbance at 280 nm using a molar extinction coefficient of 35870 M−1cm−1.
Engrailed-2: The expression and purification steps of engrailed-2 (UID: P19622) from His-SUMO-EN2 were similar to the protocols used for FMR1, but with growth post-induction done for 4 hr at 25°C, and addition of a HiTrap SP XL (GE Healthcare) ion exchange chromatography step between the Ulp cleavage step and a Superdex 75 gel filtration step for removing His-SUMO by increasing NaCl concentration from 50 to 1000 mM in 50 mM NaPO4, 2 mM DTT at pH 7.4. Concentrations were determined from the absorbance at 280 nm using a molar extinction coefficient of 22460 M-1cm-1.
For Figure 8A, concentrated FMR1 protein samples were dialyzed against 20 mM NaPO4 pH 7.4 and 2 mM DTT overnight at 4°C and then diluted with the same buffer to the desired protein concentrations for imaging. Samples were incubated on ice for 5 min before placing them onto a glass cover slip. For Figure 8—figure supplement 1A, FMR1 protein samples were instead dialyzed against 100 mM NaCl, 20 mM NaPO4 pH 7.0 and 5 mM DTT, with 20 mg/ml ficol added prior to imaging.
Concentrated pAP and engrailed-2 protein samples were dialyzed against 100 mM NaCl, 20 mM NaPO4 pH 7.0 and 5 mM DTT overnight at 4°C and then either diluted with the same buffer to the desired protein concentrations for imaging (pAP, Figure 8C) or with the addition of 20 mg/ml ficol (engrailed-2, Figure 8—figure supplement 1B). Droplet images were acquired using differential interference contrast with 40X, 63X or 100X objectives on either a Zeiss Axiovert 200M Epifluorescence microscope or a Zeiss Axio Observer. Temperatures were controlled using a PE100-ZAL inverted Peltier system from Linkam Scientific.
Standard error estimated for measured population parameters was obtained by bootstrap analysis using random sampling with replacement, with 10,000 iterations. For measurements involving populations of features found within the PDB, we split observation data into dependent blocks by sampling against the list of PDBs used in calculating the parameter rather than by against the list of observed features. Statistics for many of these calculations are tabulated in Appendix 1—table 1. ROC curves were calculated as non-parametric step functions by empirical cumulative distributions, and AUC was estimated by direct measurement of the AUC, without smoothing. AUC values are tabulated in Appendix 1—table 6.
To investigate the functional relevance of planar pi-pi interactions, we examined the contact frequencies at positions with known enzymatic function. We observe that residues in our non-redundant protein set which are annotated as being catalytic in the Catalytic Site Atlas (Furnham et al., 2014) (N = 912 PDBs, 2914 catalytic residues) are more likely to be involved in planar pi-pi contacts than expected by the contact frequency of the given residue type, at 1.87 ± 0.07 times expectation overall and 1.42 ± 0.07 when normalized by catalytic residue frequency (Appendix 1—table 3), with Glu, Gln, Asn, and Asp having the largest increases, going from 12–19% average contact frequency to 21–39% in catalytic positions. In addition, by normalizing contact frequency by the number of van der Waals contacts each residue is involved in we observe that catalytic residues are 2.1 ± 0.2 times more likely to have pi-pi stacking interactions with their VDW contacts that are also annotated as being catalytic than they are to non-catalytic neighbors, suggesting that the contacts themselves play a role in defining the constrained geometries of the active site. This analysis only covers the protein-protein interactions made within the binding pocket, and does not cover the nature of the contacts made by catalytic residues to ligands, which, judging from our small molecule binding data (Appendix 1—table 2), could significantly increase the overall rate. (An example of a catalytic side with both protein-protein and protein-ligand pi-pi contacts is shown in Figure 2B.). We note that the effect of these interactions may involve some energetic cooperativity with other interactions, as they are often found in large networks, and commonly involve sidechains stacked simultaneously to both the donor and acceptor groups of an sp2-sp2 hydrogen bond (Figure 2C).
Roughly half of the observed pi-pi contacts involve the peptide backbone and seem to be particularly important in defining structure at the termini of strands and helices (Figure 2D). We found that contact frequency increases at backbone positions located at the transition points between different types of secondary structure, with up to a 3–4 fold increase in contact frequency around the terminal positions in strands and helices (Appendix 1—figure 5) primarily involving local sidechain contacts to the last peptide bond involved in hydrogen bonding. We further observed that many of these contacts are directly involved in motifs known to be important for stabilizing secondary structure during folding, including β-hairpins and helix-caps (Serrano and Fersht, 1989; Trevino et al., 2007). These interactions have previously been recognized for aromatic sidechains (Tóth et al., 2001), but we find that non-aromatic sp2 sidechains are, in aggregate, more likely to form these backbone contacts (Figure 1C).
To examine the relevance of these backbone contacts in defining structural motifs we searched for examples of specific motifs with exceptionally high contact frequencies. We present our top examples in Appendix 1—figure 5. Appendix 1—figure 5A shows a β-hairpin with DSSP assignment (Kabsch and Sander, 1983) ‘ETTTTE’, a motif found in 14.4 ± 0.5% of the structures in our high resolution PDB set (N = 822/5718). Across these observations we find a planar pi-pi contact frequency at the peptide position involved in the hydrogen bond defining the turn 38 ± 2% of the time (N = 401/1051),~22 times the median contact frequency (of 1.7%) over all backbone groups, suggesting that these pi-pi contacts play a role in defining the motif.
As a way of providing a sidechain specific test that excludes aromatic interactions and cation-pi, we looked at arginine residues and found that when arginine is in a helix (≥4 residues from the N-terminal position, by DSSP) the sidechain group forms a pi-pi contact to the peptide bond between residues i-3 and i-2 on average 9.2% of the time (N = 18507), where the arginine is effectively stacking to the surface of the helix. When looking at sidechains found closer to the helix N-terminus we find that this frequency increases to 13.6% for the position where the i-3/i-2 peptide group forms the terminal hydrogen bond (N = 2495), and then drops to 0.14% when it is no longer part of the helix (N = 1438).
We also examined the planar pi-pi contact frequencies in RNA/protein complexes, using 94 X-ray structures with at least 10 RNA bases from our non-redundant proteins set. Overall, 9.2 ± 1.4% (N = 3847) of RNA bases in these RNA/protein complexes form planar pi-pi interactions with protein. This is heavily biased towards bases that are not already stacking to other bases. Thus, bases that have no base stacking interactions (16.2 ± 1.8% of all bases) have planar pi-pi contacts to protein 31.2 ± 4.4% of the time, and those that only stack to RNA on one side (17.3 ± 1.1% of all bases) have planar pi-pi contacts to protein 17.7 ± 2.6% of the time. This often situates the protein/RNA interface at positions for which the RNA secondary structure is disrupted.
Arginine sidechains are the largest contributor to RNA/protein planar pi-pi interactions, found in 35.0 ± 3.3% of cases, compared to 34.7 ± 4.0% involving any of the four aromatic sidechains (Phe, Tyr, His, Trp) and 21.9 ± 2.8% the peptide backbone. Arginine’s role in binding RNA is well known and can also be explained by its abilities to form charge interactions with the phosphate backbone and hydrogen bonds to base pairs. We observe that arginine residues in VDW contact with RNA form planar pi-pi interactions 28% of the time, and that planar pi-pi contacts are, relative to other arginine to RNA VDW contacts, more likely to be found in simultaneous contact to the phosphate group. Thus, arginine in VDW contact with RNA has a 36% and 9% chance of being in contact with PO4 or hydrogen bonded to a base, respectively, while arginine with a pi-pi contact to RNA has a 63% and 15% chance of being in contact with PO4 or hydrogen bonded to a base, demonstrating that these interactions do not compete and may be synergistic or cooperative (Figure 2E).
The protocol for developing the final phase separation predictor was conceived as an empirical test of the hypothesis that pi-contacts have relevance to phase separation, with the logic being that if pi-contacts play a functional or energetic role in mediating protein phase separation then an accurate prediction of pi-contact rates (contacts formed per residue) should often be sufficient for predicting phase-separation behavior. To this end, we defined strict training sets, with 11 phase-separating sequences as the positive standard and sequences in our PDB sets as our negative standard, we kept the testing sets internally blind until choosing an arbitrary point to finalize the predictor, and we intentionally excluded any analysis of sequence features, such as charge patterns, amyloid propensity, and direct homology, that could improve predictor quality without testing this hypothesis.
Our protocol was split into four design steps and one test step, as follows. (1) We defined sequence dependent contact rates for different sp2 groups by taking statistics from the PDB, (2) we measured average frequencies observed for groups found in specific sequence context, (3) we used the context data to train a sequence-based pi-contact rate predictor against the PDB, (4) we then optimized a weighting function for combining pi-contact predictions against the ability to discriminate the phase separation training set proteins from the PDB, and (5) we finalized the weight optimized predictor and tested it against the phase separation test set and proteomic test sets a single time, shown in the main text, with supplemental analysis of predictor quality throughout the design process done as a retrospective analysis.
From the non-redundant PDB chains previously used for structure analysis, we selected nonredundant structures into the training set by randomly selecting 17388 structures. The remaining 4347 structures were taken as the leave out set, being removed and held for the final testing step. Independence of the two sets is restrained only by the 60% identity cutoff used in obtaining the full list, which prevents broadly identical proteins from showing up in both sets. The ability to further guarantee independence is limited by the nature of homology, and was considered outside of the scope of this manuscript.
Sequences were extracted from the PDB REFSEQ annotations and then mapped to residues found in the structure, with missing density annotated as such. Pi-contact observations were then mapped to their involved residues with contacts split by sequence separation into long-range (≥5 or different chain) and short-range (≤4 residue). Contact rates were determined for individual sp2 groups using nine residue identities for the sidechain groups and 400 residue identities for backbone groups, with backbone groups defined by both flanking residues. To measure local sequence effects we also calculated rates for non-flanking residue pairs, up to 40 residues apart, which when combined with the backbone groups produced 16400 residue pair types (20 n-terminal residues x 40 sequence separation distances x 20 C-terminal residues), which are observed in the training set at a median sample size of N = 8622, ranging from N = 563 to N = 41796 from the least to most populated.
For the initial sequence-based pi-contact predictor contact rate, observations for pairs of residues found within a fixed window length were averaged to produce a context-dependent estimate. To account for sampling error in rare sequence pairs we estimated the standard error of these measurements by a limited bootstrap analysis, using 200 randomly sampled (without replacement) 70% cut subsets of the training set. To average observation values, a given sp2 system was first defined by its residue type (a single amino acid for sidechain groups or the sequential amino acids for backbone groups). Next, residues less than 40 residues away in the primary sequence (both sides) were compared to our precomputed database, to obtain a comprehensive list of rate values for all residue-distance-residue pairings found in this window. These values were then averaged using a sampling error correction, intended to weight observations by confidence, using the following equation, where R is the error weighted contact frequency at position x as obtained by averaging over the closest l positions, P is the database frequency observed for x a given residue pair (x and x + y), and σ is the standard error of the mean obtained for that database frequency.
This average value, which represents a very limited sequence-based prediction, does not contain information on whether or not local sequence increases or decreases the group’s contact rate relative to what is expected by the nature of the group on its own. This was calculated by comparing it against values obtained by Equation 1 for every instance of the matched system found in the training set, using the precomputed data to convert the average into a z-score.
We next tested the effect of sequence window size on the ability to predict total number of contacts for a given sequence by iteratively adding two more adjacent flanking residues in order to find an optimal window. While analyzing these scores for their ability to predict contact rates, we observed that the separate z-scores for long-range and short-range contacts, each derived from distinct non-overlapping sets of observations, are not independent; they each contain information on the probability expectation of the other, with sequence dependent correlations that could contain data on competition and cooperation.
To capitalize on the observation that short-range and long-range contact rates carry information on each other, we added an additional rate prediction step where the previous averaged rates are used in tandem short-range/long-range pairs to extract matching observation data from the original rate database. To do this we created a system of lookup tables in which our training set observations are tallied in two-dimensional arrays by half z-score bins. To obtain frequency values using these lookup tables we calculate the pair of z-scores for the query group and match them to a corresponding bin in the lookup tables.
Splitting observations into bins is problematic at extreme z-scores, as it significantly increases sampling error. To address this problem, we developed a smoothing method involving iterative sampling across a range of window-length dependent tables in order to use the natural random variance of the database to average the observations.
This method starts with the initial z-score calculation on a window of sequences covering the group itself and up to one flanking residue on either side, collecting values for observed contacts and total number of database entries from a look up table built using the same window size. We then add additional flanking residues, re-calculate the corresponding z-scores for the new window length, and then add the observed contacts and total entry numbers for that window size to the previous observation. From here, additional residues are added iteratively, one flanking residue and window specific score calculation at a time, up to a maximum window length of 40 residues on either side. The final frequency is determined by the total number of pi-contacts observed for similar groups in the database, summed over all windows, divided by the total number of similar groups found in the database. This is defined in the following equation, where F is the contact frequency for position x, O is the number of database pi-pi contacts observed within a bin defined by a pair of long-range (LR) and short-range (SR) z-scores (Z) calculated for window length i, and T is the total number of groups observed for the same bin.
Our selection criterion in developing this method, for determining scores, window lengths, grid spacing, and other details, was to test options against one another by filling statistical databases from defined 70% cuts of the training set and then testing how well the sum of frequency predictions made for each protein in the remaining 30% correlated with the total number of contacts found in each protein. Once our final pi-contact rate prediction method was selected we then filled our final databases using the full training set and did a single prediction quality test against the leave out database.
As a starting point, we ran the pi-contact predictor developed for the PDB on each sequence in the PDB and in our 11 protein phase-separating protein training set and returned the highest number of contacts predicted for any 100 residue window. These contact frequency predictions showed a reasonably normal distribution for the PDB (skew and kurtosis of 0.22 and 0.62, as calculated using the scipy.stats python package) and above PDB-average predictions for our set of 11 phase-separating proteins (8/11 in the 99th percentile). Additional analysis showed that this enrichment was higher for long-range contact predictions than for short-range contacts. As an aside, the phase-separating protein with the lowest contact frequency, elastin, also has fewer sidechain groups and a lower average mass per residue. However, elastin still has a predicted contact frequency greater than the PDB average because of very high contact frequency predictions for its backbone.
Since different categories of contacts utilized in our predictor could have different effects on phase separation propensity, we trained a phase separation predictor by optimizing contact prediction weights and normalization methods against the ability to discriminate the lowest scoring member of the phase separation training set from the highest scoring members (the top percentile) of the non-redundant PDB set, by using a single score value per protein, as defined in the following equation where D is the relative discrimination score and S is the score function being tested.
We then ran the pi-contact predictor on every sequence with ≥140 residues (based on the length of the smallest phase-separating protein in our training set) and stored prediction values for four contact frequencies (long-range/short-range vs. sidechain/backbone) and four corresponding relative frequencies (the average identity normalized z-scores produced as intermediate values during the contact prediction protocol).
To optimize weights against these eight values, we built a machine learning toolbox for stochastic optimization against our discrimination function. Both random and manual sampling of score functions and weights were tested, with weights determined by brute force sampling against a Metropolis Monte Carlo acceptance criterion, and score function changes tested by branching the weight optimization into parallel runs tested against the same criterion.
In terms of the general flow, we started with score functions that were a weight averaging of contact predictions over fixed window sizes throughout each sequence with the sequence score being the highest score observed. We used the optimization toolbox to test window sizes against one another, with each weight starting at the same value and allowed to deviate by sequential rounds of small random additions and subtractions. In order to avoid scoring the full set of PDB sequences for every small weight change we added fast diversification screening steps, where we introduced sub-optimization rounds against a small select subsets of the PDB (100–2000 sequences), returning new weight sets to the primary optimization protocol for re-scoring against the full PDB and the primary acceptance/rejection step.
After the first few thousand rounds this results in a diverse population of weights associated with similar performance, which can be mined to capture the range of values that are acceptable for any given weight. At this point, we introduce weight changes using random selections from that population, instead of by random addition and subtraction, which allows for more aggressive sampling (how many weights can change per step, and how far they can change) at a lower observed rejection rate. This weight resampling method also allowed for better parallel computing, as the most successful weight combination were able to automatically propagate between processes.
During this process weight optimization was entirely stochastic, but we also tested a variety of score formulations and window lengths by parallel competition. This branched optimization strategy added a series of changes to the base protocol, including normalizing window averages by the number of carbon atoms in the window, adding multiple overlapping window lengths to allow for weights to differ by sequence distance, and finally, by changing how the final score is determined per sequence, which went from the top scoring position in the sequence, to the top decile position, and then to an average over all positions within 5 residues of the top scoring 60.
The final propensity score can be described as the following steps. Raw contact prediction numbers are generated for each residue in the sequence as an initial step. For every residue in the sequence, we then start iterating over the flanking residues while summing up values for eight contact prediction terms (long/short-range vs. backbone/sidechain vs. frequency/z-score) as well as one residue-based term (number of carbon atoms), with final sums being kept for three windows, defined as the closest 40, 80, and 120 flanking residues. Each set of window sums is then scored by the following equation, using the final set of 27 optimized weights, where S is the window score associated with residue x, spanning the closest residues from i = x to i = N, w is a weight constant, C is the number of carbon atoms in a residue, and F and Z are the pi-contact frequency and z-score values from Equations 3 and 2.
Finally, the full list of scores is then sorted, the top scoring 60 residues are identified, and an average score (T) for regions of the sequence associated with this high-scoring subset is obtained by summing over all positions found within 5 residues of the top 60. This value is then scaled by conversion to a z-score relative to our non-redundant PDB set.
One of the key limitations in designing the predictor is that phase separation in biological systems is not a fully defined phenomenon, and calculating true/false positive or negative rates are limited by the fact that there are very few gold standard positives to train against and there is no gold standard negative set. The vast majority of proteomic sequences are simply untested. Our experimental design attempts to address this by limiting training to a single test approach, validated against proteomic data a single time, but this setup meant that many design decisions were made blind, and the relative final impact of the different steps in the protocol remain untested.
To address this, we went through each step, creating matched predictors using the frequency and weight data available at the time, with new score functions returning the highest contact prediction sum observed over any 100 residue window. Results are shown in Appendix 1—table 6, demonstrating an increase in performance against the final test set for each training step.
We then compared the relative effects of different types of contacts, short-range vs. long-range and sidechain vs. backbone, by testing scores made from each side of the comparison, excluding the contact rates from the other. This analysis, also shown in Appendix 1—table 6, identifies baseline backbone pi-contact rates as being sufficient for the majority of the predictions, where the baseline observed rates for our 400 backbone peptide group definitions split the phase separation test set from the human proteome at an AUC of 0.841 ± 0.027, compared to 0.881 ± 0.021 for the final predictor. Score distributions demonstrating the sufficiency of long-range backbone contact predictions in recapitulating the phase separation predictions of the final predictor are shown in Figure 5—figure supplement 2, with panels A and B corresponding to training step 3, short-range backbone only and long-range backbone only, respectively, and panel C corresponding to the final predictor.
PHENIX: a comprehensive Python-based system for macromolecular structure solutionActa Crystallographica Section D Biological Crystallography 66:213–221.https://doi.org/10.1107/S0907444909052925
ProTherm, version 4.0: thermodynamic database for proteins and mutantsNucleic Acids Research 32:120D–121.https://doi.org/10.1093/nar/gkh082
Polymer physics of intracellular phase transitionsNature Physics 11:899–904.https://doi.org/10.1038/nphys3532
Enzymatic activities of human cytomegalovirus maturational protease assemblin and its precursor (pPR, pUL80a) are comparable: [corrected] maximal activity of pPR requires self-interaction through its scaffolding domainJournal of Virology 81:4091–4103.https://doi.org/10.1128/JVI.02821-06
Liquid-liquid phase separation in cellular signaling systemsCurrent Opinion in Structural Biology 41:180–186.https://doi.org/10.1016/j.sbi.2016.08.001
Characterization of a brain-enriched chaperone, MRJ, that inhibits Huntingtin aggregation and toxicity independentlyJournal of Biological Chemistry 277:19831–19838.https://doi.org/10.1074/jbc.M109613200
The Catalytic Site Atlas 2.0: cataloging catalytic sites and residues identified in enzymesNucleic Acids Research 42:D485–D489.https://doi.org/10.1093/nar/gkt1243
The DNAJB6 and DNAJB8 protein chaperones prevent intracellular aggregation of polyglutamine peptidesJournal of Biological Chemistry 288:17225–17237.https://doi.org/10.1074/jbc.M112.421685
PhosphoSitePlus, 2014: mutations, PTMs and recalibrationsNucleic Acids Research 43:D512–D520.https://doi.org/10.1093/nar/gku1267
Pi-pi interactions: the geometry and energetics of phenylalanine-phenylalanine interactions in proteinsJournal of Molecular Biology 218:837–846.https://doi.org/10.1016/0022-2836(91)90271-7
The GOA database: gene Ontology annotation updates for 2015Nucleic Acids Research 43:D1057–D1063.https://doi.org/10.1093/nar/gku1113
Dynameomics: a multi-dimensional analysis-optimized database for dynamic protein dataProtein Engineering Design and Selection 21:379–386.https://doi.org/10.1093/protein/gzn015
Role of conformational sampling in computing mutation-induced changes in protein structure and stabilityProteins: Structure, Function, and Bioinformatics 79:830–838.https://doi.org/10.1002/prot.22921
Sequence-specific polyampholyte phase separation in membraneless organellesPhysical Review Letters 117:178101.https://doi.org/10.1103/PhysRevLett.117.178101
PANTHER version 10: expanded protein families and functions, and analysis toolsNucleic Acids Research 44:D336–D342.https://doi.org/10.1093/nar/gkv1194
Phase separation in biology; functional organization of a higher orderCell Communication and Signaling 14:1.https://doi.org/10.1186/s12964-015-0125-7
Combined covalent-electrostatic model of hydrogen bonding improves structure prediction with RosettaJournal of Chemical Theory and Computation 11:609–622.https://doi.org/10.1021/ct500864r
Multiple weak linear motifs enhance recruitment and processivity in SPOP-mediated substrate ubiquitinationJournal of Molecular Biology 428:1256–1271.https://doi.org/10.1016/j.jmb.2015.10.002
DisProt 7.0: a major update of the database of disordered proteinsNucleic Acids Research 45:D219–D227.https://doi.org/10.1093/nar/gkw1056
Sequence complexity of disordered proteinProteins: Structure, Function, and Genetics 42:38–48.https://doi.org/10.1002/1097-0134(20010101)42:1<38::AID-PROT50>3.0.CO;2-3
The PyMOL Molecular Graphics System, version 1.8The PyMOL Molecular Graphics System.
Intermediate filament cytoskeleton of the liver in health and diseaseHistochemistry and Cell Biology 129:735–749.https://doi.org/10.1007/s00418-008-0431-x
PANTHER: a library of protein families and subfamilies indexed by functionGenome Research 13:2129–2141.https://doi.org/10.1101/gr.772403
Significance of aromatic-backbone amide interactions in protein structureProteins: Structure, Function, and Bioinformatics 43:373–381.https://doi.org/10.1002/prot.1050
Increasing protein conformational stability by optimizing beta-turn sequenceJournal of Molecular Biology 373:211–218.https://doi.org/10.1016/j.jmb.2007.07.061
Yibing ShanReviewing Editor; DE Shaw Research, United States
In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.
Thank you for submitting your article "Pi-Pi Contacts: An Overlooked Protein Feature Relevant To Phase Separation" for consideration by eLife. Your article has been favorably evaluated by a Senior Editor and three reviewers, one of whom, Yibing Shan, is a member of our Board of Reviewing Editors.
The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.
The manuscript by Vernon et al. describes an analysis of (planar) Pi-Pi interactions in proteins, and uses the resulting insights to further our understanding of liquid-liquid phase separating proteins. It concludes that Pi-Pi interactions are generally an important yet overlooked factor in folded proteins. Based on the analysis, the authors propose that these interactions are a key driver of phase separation. These are potentially important findings for our basic understanding of phase separation as well as for our understanding of protein structures.
1) The authors have identified trends of PDB structures without providing any information on the expected prior. Are there more planar Pi-Pi interactions involving non-aromatic residues because there are overwhelmingly more non-aromatic sidechains than aromatic sidechains in the PDB? Throughout the manuscript, the authors present analyses of enrichment/depletion or correlation of Pi-Pi interactions in various settings. These are generally presented clearly and honestly, but at the same time the authors do not adequately discuss the issues of confounding or indirect effects.
1a) The reviewers realize that it is not trivial to device a good prior, given that the distribution of amino acids are not random throughout protein structures. One of the reviewers suggested a prior as the following:
PDB analysis: The analysis in Figure 1 is very interesting, but a little hard to interpret when the numbers are not normalized by the expected background values. For example, there are ~5x more H/R/Q/N/E/D/G residues in the human proteome than there are Y/F/W residues. If we blindly assume pairwise interactions without any bias, we should naively expect ~20x more interactions amongst H/R/Q/N/E/D/G residues than amongst Y/F/W residues. This naive random network approximation is clearly not appropriate, but some accounting of the drastically different numbers of these residues should be included for the reader to make sense of what these analysis mean. Is 13x the expected value without any bias, given the combination of 3D distribution of residues across folded structures and the amino acid frequencies? For example, the counter hypothesis to this is that Y/F/W residues are typically buried in protein cores generally lack cavities, and so through a simply packing-effect there are necessarily many aromatic to non-aromatic contacts. If you consider other atom types (i.e., non sp2 carbon atoms) how do these compare?
When examining different types of inter-residue interactions (Figure 1C) the sidechain and backbone accessible surface area (and numbers of sp2 atoms) differ wildly between different residues. The intention of this figure is to demonstrate that certain types of residues engage more frequently in planar Pi interactions, but without a prior that takes the number/size of those sidechains into account, can one infer any causation from the fact that bigger sidechains with more sp2 atoms are more frequently engaged in planar sp2-X interactions?
1b) Another example of the confounding effects is the enrichment of the Pi-Pi interactions in the active site (Appendix 1 Supplementary Results). Is it clear that this cannot be explained by the fact that active sites have more of the residues that can form the Pi-Pi interactions? Similarly, to what extent that the occurrence of Pi-Pi interactions near water molecules can be explained by the fact that a dominating majority of the residues are polar there? These issues are important because the authors need to establish not just a correlation, but also causation.
1c) It appears that Figures 1B and 1C are exclusively examining planar pi-pi interactions (given the absence of any sidechain interactions in residues that lack sp2 carbons in their sidechains). These analyses provide a relative picture with respect to one another, but there is no information on the global importance of these interactions in the greater context of all interactions in folded proteins. For example, how do these numbers compare to hydrogen bonds or to interactions between non- sp2 atoms and sp2 atoms. Such analysis is needed to understand if planar pi-pi interactions are more frequent than other planar interactions involving sp3 atoms in planar systems.
2) The authors develop a predictor and a PScore for phase separation based on the idea that Pi-Pi interactions are important for this behaviour. This is potentially of widespread interest, but the authors need to clarify the premise of a "phase separation predictor" which is important to this work conceptually. Assigning a score that will be referred to as a phase separation propensity assumes a gross simplification to the physics of phase separation. How should one think about this predictor? For a given set of solution conditions such as pH, salt concentration, type of salt, and temperature are we to think of the score as a predictor of the saturation concentration? Would it be of the low or high concentration arm of the binodal and what if the phase diagram is replete with complexities? It would help if the authors provide a coherent description of the key concepts and parameters that underlie the physics of phase separation for a generic protein under a given set of solution conditions, and with these parameters in hand explain if the Pscore serves as a proxy for one or more of the physical parameters. If so, the authors could assemble a data set from extant measurements to calibrate the Pscore. If the Pscore does not or should not be viewed as a proxy for any physical parameter, then this should be made clear and a detailed explanation of how the parameter should be used and what insight one might gather from its usage would be essential.
The authors fall short of explaining what signals the predictor captures and fully discussing its limitations. It is unclear what goes into the PScore. To this point, for polyNE the Pscore is -10, polyFG gives -6.5 and polyRG +3, while polyH gives +57, polyC gives +28 and polyD gives +12. The PScore is highest for a protein that is basically an Asn-Glu polyelectrolyte repeat, highly unlikely a sequence that will drive phase separation on its own. It will most certainly need a complexing partner and it is this type of caveats that can mislead a naive user. The PScore would be much more interesting if the authors can really demonstrate that the reason why it works (if it does work) is because of some specific mechanistic (e.g. Pi-Pi) insight. As it stands, the link is somewhat poorly explained because there is a lot of fitting and tweaking in the development of PScore. And so it is not clear why it works, and what physical mechanisms – if any – it captures. The authors should both discuss/demonstrate better the limitations of the PScore, and demonstrate that the reason it works is because of the specific mechanism proposed (rather than e.g. some correlated variable etc.).
Overall, the work that concerns the predictor is more of a preliminary nature and its weight should be reduced and the claims should be tone down in the revision.
2a) The authors effectively train a predictor on known sequences. While this has some basis in the information mined from the PDB, to what extent is the predictor simply finding sequences that are similar to known sequences? For example, what is the correlation between proteins identified in the authors' prior study published in Molecular Cell in 2015?
2b) The final optimization appears to involve separating known phase separating proteins from the top scoring proteins in the PDB. How can the authors distinguish between developing a predictor for phase separation vs. one for (long-range) intrinsic disorder? How well would a disorder predictor (e.g. one with a similarly long window size) do? This is important because the goal is not just to separate the two classes of proteins, but also to provide mechanistic insight. Are the Pi-Pi interactions causative, or is it a (spurious) correlation? As the authors are well aware, there are differences in amino acid composition between IDP/IDRs and folded proteins, that include some of the Pi-containing residues (in particular Asn, Glu, Phe, Tyr, Trp have rather different occurrences in IDPs vs. folded proteins). How can it be demonstrated that the optimization isn't merely picking out differences in disorder? One of the reviewers performed a rudimentary comparison between PScores and disorder for a subset of proteins and found a substantial correlation. The authors should use one of the better disorder predictors, perhaps creating a score like the one used (max score over a similar window) and see how well this performs compared to the one actually optimized from the liquid phase separation data.
2c) The authors show a number of examples of predictions in Appendix—Figure 9. It is, however, unclear how these proteins were chosen. It would be useful to have an argumentation for why these particular proteins were selected, since they are obviously not at the top of the list (e.g. FAK1 is #822). This is in part to ensure that the authors haven't selected proteins that they know, via other avenues, will form droplets. For example, one of the proteins presented (Androgen receptor) has been presented at a couple of conferences to form droplets. While I want to stress that I am not implying that the authors have somehow cheated, I think it is crucial that the authors haven't chosen the proteins in Appendix—Figure 9 based on specific knowledge that some of them have already been shown to form droplets. Many RGG domains have been shown to phase separate. Using the predictor to 'select' one (out of the entire proteome) does not help demonstrating the value of their predictor. It would be far more compelling if the authors found a sequence that does not share features with the set of sequences that are already known to phase separate.
2d) The PScore appears to provide the maximum value observed over any 100-residue window. If that's the case, then larger proteins will obviously have a greater likelihood of reaching a high PScore. While this may be fully realistic, it wasn't clear to me whether this has indeed been observed experimentally.
2e) The authors benchmark their predictions using a ROC analysis (Appendix—Figure 7). It is not fully clear what the "negative" set is, but it appears to be the entire proteome. Is that correct? In general, one should be cautious when using ROC analyses and reporting AUC when there is a large imbalance in the size of the two datasets. If the negative set is very large, one can get a rather large AUC even with a somewhat poor predictor. At the same time, if the goal is just to predict the top few percent hits, then a good predictor could have a relatively low AUC, but still find most true positives at a low false discovery rate (see e.g. Berrar and Flach, Briefings in Bioinformatics, 2011). What is the AUC when e.g. averaged over random subsets of "negatives" that are of the same size as the positive subset? Also, although it might become a bit arbitrary, it would be useful to have an idea on a "standard value" above which one would consider the predictions "reliable".
2f) The sequence tested phase separates at 1 mM after incubation on ice for 5 minutes. It would be surprising if, under those conditions, it had not phase separated. For context, many of the other proteins studied undergo phase separation at room temperature at concentrations that are 10^4 lower. If the selected region from FMR1 does undergo phase separation at something closer to physiological conditions including this would be critical, otherwise, this 'experimental validation' is less convincing.
3) The analyses of MD, Rosetta, FoldX are maybe interesting, but would require a much more detailed analysis which is out of scope for a revision. As is, the analyses fall far short of establishing that the Pi-Pi interaction is not represented in these computational models. It is difficult to believe that MD can fold so many proteins while entirely missing something supposed to be so fundamental. The importance of this work does not lie on the notion that the planar Pi-Pi interaction is not accounted for in existing computational models (molecular dynamics, FoldX, and Rosetta). The reviewers suggest the authors to strongly tone down the related statements, if not to delete the MD simulation analysis or move it to a supplemental section. Many of the strong assertions, including statements declaring that this analysis uncovers unappreciated facts should be toned down.
3a) Subsection “Prevalence of Pi contacts in the PDB” and in Figure 1: "These data suggest that the relative importance of pi-pi interactions are underestimated in the force-fields that are used in the structure calculations and thus appear more frequently in structures that are heavily constrained by experimental observations." It might be worth noting here that these are generally extremely primitive FFs, that are in general unrelated to those used e.g. in molecular dynamics simulations. The fact that these force fields do not include Pi-Pi interactions is not surprising, nor very interesting; these force fields are really minimal and capture almost no physical effects well.
3b) Although the analysis of the dynameomics simulations looks interesting, it should be noted that these simulations were performed with a relatively old force field, and that more recent and accurate force fields exist. Although all fixed charge force fields share the same basic functional form, the higher accuracy in more modern force fields would likely mean greater structural stability overall. Thus, whether or not the Pi-Pi interactions are lost in these simulations have little to say about the force fields that are more commonly used in the field nowadays. An interaction not accounted for explicitly in a MD force field could still be accounted for implicitly by the force field which is obtained by fitting to empirical data. For a reference point, past studies (Gallivan and Dougherty, PNAS 1999) showed that although pi-cation interactions are not explicitly accounted for in OPLS-AA force field for MD, but the energy calculation of Lys-Phe and Arg-Phe interaction using this force field correlates reasonably well with the quantum results. MD force fields have been much improved from then.
3d) Both in the MD calculations and Rosetta calculations the benchmark is the number of Pi-Pi contacts found in the crystalline state. In the case of salt bridges, it has been demonstrated by NMR in at least one case that a set of salt bridges found in the crystal structure are absent in solution (JH Tomlinson et al., JACS, 2009). In such cases, the expected (real) behaviour would be that the interactions found in the XR structure should dissolve in solution. From a cursory comparison between the contact frequencies in Xray structures (Appendix—Figure 2A) and NMR structures (Appendix—Figure 2B), it appears there are fewer contacts in the solution state – even when the NMR structures have many constraints. Could this be due to real differences between the two states (and also temperature differences), and would this imply that the loss of contacts e.g. upon minimization (Appendix—Figure 3B) represents reality?
3e) The way the data is presented in Appendix—Figure 3C is confusing. Why not simply compare the experiment and calculated numbers? Also, the Protherm database is notoriously problematic, and curated versions exist (e.g. Conchuir et al., PLOS ONE, 2015 or Kellogg et al., Proteins, 2010).https://doi.org/10.7554/eLife.31486.044
- Julie Deborah Forman-Kay
- Julie Deborah Forman-Kay
- Julie Deborah Forman-Kay
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
We thank Hue Sun Chan, Régis Pomès, Simon Sharpe, Alan Moses, Fred Keeley, Lewis Kay, and Jacob Brady for useful discussions. We acknowledge funding to JDF-K from the Canadian Institutes of Health Research, Natural Sciences and Engineering Council of Canada and Canadian Cancer Society Research Institute. We also acknowledge the use of The Hospital for Sick Children Structural & Biophysical Core Facility.
- Yibing Shan, Reviewing Editor, DE Shaw Research, United States
© 2018, Vernon et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.