Pi-Pi contacts are an overlooked protein feature relevant to phase separation

Abstract
Introduction
Results
Discussion
Materials and methods
Appendix 1
Appendix 2
References
Article and author information
Metrics

Abstract

Protein phase separation is implicated in formation of membraneless organelles, signaling puncta and the nuclear pore. Multivalent interactions of modular binding domains and their target motifs can drive phase separation. However, forces promoting the more common phase separation of intrinsically disordered regions are less understood, with suggested roles for multivalent cation-pi, pi-pi, and charge interactions and the hydrophobic effect. Known phase-separating proteins are enriched in pi-orbital containing residues and thus we analyzed pi-interactions in folded proteins. We found that pi-pi interactions involving non-aromatic groups are widespread, underestimated by force-fields used in structure calculations and correlated with solvation and lack of regular secondary structure, properties associated with disordered regions. We present a phase separation predictive algorithm based on pi interaction frequency, highlighting proteins involved in biomaterials and RNA processing.

https://doi.org/10.7554/eLife.31486.001

Introduction

Protein phase separation has important implications for cellular organization and signaling (Mitrea and Kriwacki, 2016; Brangwynne et al., 2009; Su et al., 2016), RNA processing (Sfakianos et al., 2016), biological materials (Yeo et al., 2011) and pathological aggregation (Taylor et al., 2016). For some systems, multivalent interactions between modular binding domains and cognate peptide motifs underlie phase-separation (Li et al., 2012; Banjade and Rosen, 2014). However, many phase-separating proteins contain large intrinsically disordered protein regions (IDRs) with low complexity sequences that do not form stable folded structure (reviewed in [Mitrea and Kriwacki, 2016; Chong and Forman-Kay, 2016]), including the Nephrin intracellular domain (NICD) (Pak et al., 2016), polyglutamine tracts (Crick et al., 2013), tropoelastin (Yeo et al., 2011), FUS (Burke et al., 2015; Kato et al., 2012), Ddx4 and the homologous LAF-1 (Nott et al., 2015; Elbaum-Garfinkle et al., 2015) and FG-repeat nucleoporins (Frey et al., 2006). The underlying physical principles and chemical interactions that drive phase separation in these IDRs are not well understood. Multivalent (Li et al., 2012; Pierce et al., 2016) electrostatic (Pak et al., 2016; Lin et al., 2016) and cation-pi (Nott et al., 2015; Kim et al., 2016; Sherrill, 2013) interactions and the hydrophobic effect (Yeo et al., 2011) have all been proposed to contribute to IDR phase separation, the latter suggested to be dominant for tropoelastin (Luan et al., 1990). For Ddx4, electrostatic interactions between charge blocks has been demonstrated (Nott et al., 2015; Lin et al., 2016). The abundance of Phe-Gly/Gly-Phe and Arg-Gly/Gly-Arg dipeptides in Ddx4 and the fact that Phe to Ala mutations inhibit phase separation also point to pi-pi and/or cation-pi interactions. The Phe-Gly repeats in FG nucleoporins similarly indicate pi-pi interactions, but the lack of aromatics in elastins and designed phase-separating sequences (Quiroz and Chilkoti, 2015) seems to suggest that they are not essential. Clearly, a number of physical interactions may be sufficient for driving phase separation without being universally necessary, and a better understanding of these interactions is needed to define the balance of forces biological systems use for driving protein phase transitions.

Although pi-pi interactions are commonly associated with aromatic rings, where interaction energy is thought to involve induced quadrupolar electrostatic interactions (Sherrill, 2013), π (pi) orbitals of bonded sp²-hybridized atoms are also found in peptide backbone amide groups and sidechain amide, carboxyl or guanidinium groups. Sidechains with pi bonds include Tyr, Phe, Trp, His, Gln, Asn, Glu, Asp and Arg. Small residues with relatively exposed backbone peptide bonds include Gly, Ser, Thr and Pro. Notably, low complexity IDRs implicated in phase separation of FUS, EWS, hnRNPA1, TIA-1, TDP-43 and the RNA Pol II C-terminal domain (CTD) (Mitrea and Kriwacki, 2016; Taylor et al., 2016; Kato et al., 2012) are very enriched in these residues that have high potential for formation of pi-pi interactions, relative to average occurrence in the proteome. Even elastins, which lack sidechain pi groups but have Val-Pro-Gly-Xxx-Gly repeats (Yeo et al., 2011), are highly enriched in Pro and Gly residues with exposed pi-containing peptide backbones.

Given the high frequency of aromatic residues, arginine and glutamine in many phase-separating sequences, we were motivated to investigate the structural behavior of pi-pi interactions in order to better understand how their observed physical behavior relates to their potential role in phase separation. We first characterized the frequency and correlations of pi interactions in a non-redundant protein set from the RCSB protein data bank (PDB) of folded proteins (Berman et al., 2000). We discovered that planar pi-pi contacts involving a non-aromatic group, including those involving the backbone amide group, are the predominant form of pi-pi interaction, and we showed that planar pi-contact rates can be predicted from sequence. Then, we tested the relevance of these planar pi-pi interactions to phase separation by training a phase separation predictor using only these expected pi-contact rates. We then demonstrated that three of the predicted proteins, FMR1, a multifunctional RNA-binding protein and a neuronal granule component (El Fatimy et al., 2016), engrailed-2, a DNA

-binding homeobox protein, and the pAP isoform of the Human cytomegalovirus capsid scaffolding protein phase separate in isolation in vitro. Analysis of predictions for the full human proteome suggests strong phase-separation propensities for proteins involved in biomaterials and RNA processing, with likely regulation by splicing and post-translational modifications (PTMs).

Results

Prevalence of Pi contacts in the PDB

To determine the frequency of pi-pi interactions and better understand their nature and physical properties, we performed a bioinformatics analysis of folded proteins. We searched the PDB for pi-pi interactions by measuring contact distances between planar surfaces and comparing planar orientations (see Materials and methods), choosing to focus on interactions involving pi-orbital planar surfaces as this category shows the most enrichment over expectations, both in terms of overall frequency (Appendix 1—figure 1) and in relation to resolution. Face-to-face planar pi-pi contacts were defined using a simple distance- and orientation-based metric designed to consistently capture this enrichment across diverse sp²-containing groups (Appendix 1—figure 1A,B,C).

Our analysis was originally intended to explore the known interactions of aromatic sidechains with each other and with arginine, but in order to provide a control group we defined our contact parameters in a way that allowed us to treat all sp² groups in the same fashion. In high-resolution (≤1.8 Å) and low R-factor (≤0.18) protein crystal structures (N = 5718), we found that planar pi-pi stacking interactions involving non-aromatic atoms outnumber aromatic-aromatic stacking interactions by approximately 13 to 1 (Figure 1A and Appendix 1—table 1) suggesting that, while aromatic sidechains may be enriched in stacking interactions relative to their frequency, there is a more general role for pi-contacts that involve non-aromatic atoms. The vast majority of planar pi-orbital contacts in proteins involve one of five non-aromatic sp²-hybridized sidechains or the peptide bond itself. Fully 36% of observed pi-pi stacking interactions do not involve an aromatic partner, with face-to-face planar contacts between different backbone peptide bonds occurring as often as aromatic face-to-face contacts (Figure 1A). Across the high-resolution set, we observe that 58% of heavy atoms are sp², of which 10.5% are involved in pi-stacking. Furthermore, 28% of heavy atoms that are not directly involved in pi-stacking are found within van der Waals (VDW) contact distance (4.9 Å) of atoms that are. Thus, planar pi-pi interactions form a common feature of the protein chemical environment. Comparisons to previous work showing that aromatic-aromatic interactions in proteins are instead biased toward face-to-edge or parallel displaced geometries (Hunter et al., 1991; Martinez and Iverson, 2012) are complicated by the observations that VDW contacts between aromatic sidechains coincide with face-to-face pi-pi stacking to a third sp²-hybridized group 49% of the time, and that parallel displacement often accommodates an additional non-aromatic pi-contact to the same planar surface.

Figure 1 with 2 supplements see all

Download asset Open asset

PDB statistics for planar pi-pi interactions.

(A) Average number of sp² groups involved in planar pi-pi contacts per 100 protein residues binned by crystal structure resolution. Values are shown for contacts defined by the nature of the involved sp² groups, with all groups in black, aromatic to non-aromatic sp² in blue, non-aromatic to non-aromatic in pink, backbone to backbone in gray, and aromatic to aromatic in orange. Error bars show bootstrap SEM. (B) Planar pi-pi contact interaction frequencies for each residue type, with the average across all residue types shown as a red line, and (C) frequency of each residue type in contributing to planar pi-pi interactions, with bars showing overall frequency colored proportionally by the nature of the contact partners. Figure 1—source data 1 and 2.

https://doi.org/10.7554/eLife.31486.002

Figure 1—source data 1 Pi-Pi contact annotations for the full PDB set. Text file listing the pi-pi contacts observed across our non-redundant PDB set, with contact types shown by residue annotations where single amino acid names refer to sidechains and pairs of amino acids refer to the backbone peptide bond between residue i and residue i + 1.: https://doi.org/10.7554/eLife.31486.005
Download elife-31486-fig1-data1-v2.txt
Figure 1—source data 2 Residue and amino acid counts for the full PDB set. Text file listing the residues assessed in each individual PDB chain, used for calculating contact frequencies.: https://doi.org/10.7554/eLife.31486.006
Download elife-31486-fig1-data2-v2.txt

Analysis of protein structures showed that the frequencies of planar pi interactions strongly correlate with the power of the experimental data to constrain the structure and with the fit to the data. We identified a linear relationship between contact frequencies and the resolution of crystal structures (Figure 1A). We identify a similar dependence of contact frequency on the relative number of sidechain-specific distance constraints in NMR structures (Appendix 1—figure 2A) and confirm that the dependence on resolution persists for identical sequences solved multiple times at different resolutions (Appendix 1—figure 2B). These data suggest that the relative importance of pi-pi interactions are underestimated in the force-fields that are used in the structure calculations and thus appear more frequently in structures that are heavily constrained by experimental observations. In addition, pi-pi contact frequencies for amino acid and other small sp²-containing ligands bound to proteins (including non-aromatic ligands) are higher than the frequencies observed for the same chemical group found within proteins, despite or perhaps because of their relative freedom of movement (Appendix 1—table 2).

To examine whether sp²-containing sidechains engage in stacking behavior beyond what could be expected for average contact frequencies and overall packing considerations, we determined sidechain contacts to backbone peptide groups, focusing on the percentage of VDW contacts (with two or more pairs of atoms within 4.9 Å) which satisfy our planar-pi criterion, and then compared the frequencies observed for sp² sidechain groups to those observed for planar surfaces on the terminal end of sp³ sidechains, using atom groups as listed in the Materials and methods section. This metric addresses the issue of amino acid composition effects by taking advantage of the even distribution of backbone groups and allows for normalization of contact frequency for sidechains of different size. Enrichment of sp² planar contacts relative to sp³ is clearly observed for all sp² sidechains except Asn and Gln, which our previous analysis showed are more likely to form contacts with their backbone than with their sidechains (Figure 1—figure supplement 1). Further analysis of the relative frequency of planar pi VDW contacts to other VDW contacts as a function of resolution demonstrates that for some contact types the increased pi-contact frequencies with increasing resolution (lower values in Å) are at the expense of decrease in other VDW contacts, suggesting that these contacts represent a specific geometric constraint present in the experimental data, rather than an overall increase in VDW contact frequency at higher resolution (Figure 1—figure supplement 2).

Aromatic groups are known to have favorable interactions with other aromatics, with the peptide backbone (Tóth et al., 2001), and with charged groups. We observed that the guanidine group of arginine is either the first or second most likely planar pi-stacking partner for any given aromatic sidechain, a phenomenon previously described as cation-pi (Sherrill, 2013; Martinez and Iverson, 2012). However, we also observed planar stacking interactions between non-aromatic groups of all kinds, including both anion-to-anion and cation-to-cation, with relative frequencies shown in Figure 1B,C. Surprisingly, 3.6% of arginine sidechains are found in direct, parallel pi-stacking contact with another arginine, despite repulsive charges (Figure 1—figure supplement 2), suggesting that these guanidinium interactions are better described as pi-pi, rather than cation-pi (example shown in Figure 2A).

Figure 2

Download asset Open asset

Examples of planar pi-pi contacts in folded protein structures.

Pi-pi interactions shown using rods to describe the normal vector of the plane. Rods extend to a carbon VDW radius of 1.7 Å, colored by category with sidechain groups in purple, backbone in blue, small molecule ligands in orange, and RNA in gray. Ligand molecules are green, with relevant water molecules shown as red spheres and hydrogen bonds as yellow lines. (A) Arginine ladder motif in Porin P (PDB:2o4v). (B) Catalytic site from arginine kinase (PDB:1m15). (C) Network of interactions in nitrogenase (PDB: 3u7q). (D) Backbone/sidechain contacts at the ends of secondary structure elements (PDB:4b93). (E) RNA-binding interactions (PDB: 4lgt). (F) Interaction network stacked between disulfide bonds (PDB: 4v2a).

https://doi.org/10.7554/eLife.31486.007

Modeling and analysis of protein structures typically involves the use of coarse-grained energy functions. To test the degree to which contact frequencies in solved structures derive from experimental constraints, rather than the force fields used, we explored how well planar pi interactions are captured by the simple energy functions used in certain protein modeling protocols. We examined a few different modeling protocols by either running available methods or downloading pre-computed datasets (see Materials and methods). In general, planar pi-pi contacts were lost during simulations (Appendix 1—figure 3A) and energy minimization (Appendix 1—figure 3B). In one older molecular dynamics simulation of folded proteins, made available for 100 proteins via Dynameomics (Kehl et al., 2008), 90% of the planar pi-pi contacts found in the starting structures were lost during simulation, with the majority being lost within the first few simulation steps. Similarly, modeling of the energetic effect of mutations, the ∆∆G of unfolding, using both FOLDX (Schymkowitz et al., 2005) and Rosetta (Kellogg et al., 2011), shows decreased prediction accuracy at positions involved in pi-contacts (Appendix 1—figure 3C-F), based on comparison to a reference set of ∆∆G measurements (Bava et al., 2004). These observed issues in modeling pi-contacts may be overcome by more recent and sophisticated energy functions, but our results are consistent with the inherent energetic importance of planar pi interactions, rather than their observation being due to simple force fields used in refining protein structures.

Enrichment of pi-pi contacts in catalytic, capping and RNA-binding sites

For exploring the contribution of pi contacts to general structural and functional properties of proteins, we examined contact enrichment for sp² groups found in a diverse range of interactions. We observe increased frequency of pi-pi contacts at positions with known catalytic function (Furnham et al., 2014), with enrichment of 1.87 ± 0.07 overall and 1.42 ± 0.07 when normalized by residue type (Appendix 1—table 3), with pi-pi contacts often playing a role in defining the geometry of the active site (Figure 2B) or forming networks of pi-pi contacts. We find that hydrogen bond frequency increases at sp² sidechains involved in pi-contacts (Appendix 1—figure 4), and when sp² groups hydrogen bond each other we observe increased frequencies of a third sp² group being found in simultaneous pi-stacking to both the donor and acceptor groups of the hydrogen bond (Appendix 1—figure 1F and Figure 2C), suggesting potential cooperativity via the electrostatic and geometric stabilization of the bond. We also observe up to 20-fold enrichment at the ends of secondary structure elements, relative to the median backbone contact rate of 1.7%, with enriched positions often involving the last hydrogen bond made within a helix or at the end of a strand (Figure 2D and Appendix 1—figure 5), commonly placing them in the context of local capping motifs thought to stabilize secondary structure (Richardson and Richardson, 1988). Finally, we find that protein-RNA interactions typically involve pi-pi contacts, especially with arginine. A detailed description of these observations is included in Appendix 1.

Correlation of pi-pi contacts with solvation and lack of regular structure

Interactions at the surface of a protein are typically in competition with solvent and their enthalphic contribution often decreases with solvent exposure, as for protein-protein hydrogen bonds (Efimov and Brazhnikov, 2003). Planar pi-pi interactions, in contrast, cannot be formed with water, but often involve groups with hydrogen bond acceptors and donors; thus, we predicted that the frequency of pi-pi interactions in proteins could be increased in more solvated environments. To test this, we identified high-resolution structures with an abundance of solved water and then counted the observed solvent interactions by the number of water oxygen atoms within a broad VDW contact radius (4.9 Å) to each residue. We saw an unambiguous positive correlation between the number of water contacts and the probability that a residue is involved in a planar pi-pi-contact, with a significant increase in average probability observed for each additional water contact (Figure 3A,B), climbing even as the average number of protein:protein VDW contacts declines. This relationship is true for the general case (unspecified residue identity) and is also individually true for each of the nine residues with pi orbital-containing sidechains. However, both the contact frequencies and the amino acid frequencies themselves increase with a greater dependence on solvation for non-aromatic residues, especially for the charged amino acids Arg, Glu and Asp, such that non-aromatic contacts become the dominant form of interaction at high solvation levels (Figure 3—figure supplement 1). The relative increase is highest for contacts involving sidechains of like-charge, especially arginine (Figure 3—figure supplement 2), suggesting that solvation plays a role in the strength of the interactions.

Figure 3 with 2 supplements see all

Download asset Open asset

Correlation of planar pi-pi interactions with solvent and lack of secondary structure.

(A) Contact frequency for sidechain groups (red) and backbone (blue) increases with the total number of solved water molecules within 4.9 Å of the residue, based on structures with >1 water oxygen per residue, including all molecules within 8 Å of the chain of interest, including symmetry partners. (B) Representative example of a pi-stacked sidechain in contact with 11 water molecules (PDB:4u98), showing how the interaction does not appear to compete with solvent. (C) Mean contact frequency vs. sequence distance from regular secondary structure and loop/turn regions. (D) Example of the range of interactions found >10 residues from helix/strand secondary structure (PDB:4b4h).

https://doi.org/10.7554/eLife.31486.008

Of relevance to intrinsically disordered protein regions that mediate interactions, we find that planar pi-pi interactions occur more often at positions with properties associated with disorder; they are more prevalent in proteins having overall less rigid secondary structure (with contacts for coil/loop/turn > strand > helix), especially disulfide bond containing proteins (Figure 2F), and in sequences that are locally enriched in residue types associated with backbone flexibility or breaking secondary structure (Gly, Ser, Thr, Pro) (Appendix 1—figure 6). Considering planar pi-pi contact frequencies as a function of the sequence position relative to secondary structure elements, we find that the frequency is highest in long loops, showing a sigmoidal relationship when transitioning from order to disorder that goes from 9.5% probability for residues >7 positions away from the closest loop/turn to 16% for residues >7 positions away from the closest helix/strand (Figure 3C,D).

Pi-pi contacts in protein interactions

To test whether these interactions are compatible with the multivalent interactions involved in phase separation, we examined contact statistics for protein interactions, comparing sidechain pi-pi interaction frequencies within a chain to those between chains. We classified interfaces as sequence- or complex-specific (between different chains of a crystal structure) and opportunistic (at crystal packing interfaces). In both cases, we defined interface residues as those with sidechains having at least one VDW contact to any atom in a different chain. We found that both the overall contact frequencies at interface positions and local (<5 residue) contact frequencies remain similar to the frequencies observed at non-interface positions, but that there is a significant exchange of long-range (≥5 residue sequence separation) inter-chain to intra-chain contacts (Figure 4). This exchange is also observed for the residues in interfaces involved in crystal packing interaction, demonstrating that long-range planar pi interactions are not specific to particular protein folds, but are common features of protein-protein interactions. These results suggest that non-local pi-pi contact propensity could play a general role in mediating protein interactions, including those driving phase separation.

Figure 4

Download asset Open asset

Sidechain contacts at interface positions.

Contact frequencies are shown for the nine sp²-containing sidechain types, split into three bars based on interface proximity. From left to right, these bars are i) no other chain within 4.9 Å of any sidechain atom, ii) within 4.9 Å VDW contact distance of any atoms in a different chain within the unit cell of the crystal, iii) within 4.9 Å of any atoms in a chain from a neighboring unit cell, as determined by crystal symmetry data. Bars are colored by the proportion of total contacts contributed by three categories, bottom/black corresponding to local (sequence separation ≤4 residues) intrachain contacts, middle/blue to non-local intrachain contacts, and top/pink to interchain contacts, showing that overall contact frequencies and local contact frequencies remain similar and that the non-local contacts do not discriminate between intra and interchain.

https://doi.org/10.7554/eLife.31486.011

Importance of pi-pi contacts for phase separation

In our bioinformatics analyses, we identified a type of interaction, planar pi-pi, which is more prevalent for solvated residues, RNA-binding interactions and regions lacking regular secondary structure. These properties are also associated with the emerging functional class of intrinsically disordered phase-separating proteins that coalesce through fluid, multivalent interactions to form protein-dense cellular bodies or membraneless organelles involved in RNA processing (Mitrea and Kriwacki, 2016), the nuclear pore (Frey et al., 2006) and extracellular biological materials (Yeo et al., 2011). The currently known phase-separating proteins are diverse, both in sequence and function (Mitrea and Kriwacki, 2016; Chong and Forman-Kay, 2016), but many are enriched in motifs we can now associate with high planar pi-pi contact frequencies (i.e. Pro-Gly, Phe-Gly, Ser-Arg, Tyr-Gly and Arg-Gly repeats) (Mitrea and Kriwacki, 2016; Nott et al., 2015; Schmidt and Görlich, 2015).

While phase separation of some proteins has been suggested to be driven by the potential for multivalent aromatic stacking and cation-pi interactions (Nott et al., 2015; Brangwynne et al., 2015), our observations show (i) that planar pi-pi interactions are a much more broadly distributed phenomenon in proteins than previously considered, especially in solvated protein regions, (ii) that aromatic residues are not required, (iii) that backbone pi groups make significant contributions, and (iv) that protein sequence can have distinct effects on both long-range contact propensity and local contact propensity. These led us to hypothesize that the number of pi orbitals available to make long-range multivalent contacts is an important feature in determining whether a disordered protein region can phase separate and, thus, that the sp²-hybridization of the arginine sidechain is more important to phase separation than its charge. We tested this hypothesis using the N-terminal 236 residues of Ddx4, an intrinsically disordered region that contains both Arg-Gly and Phe-Gly dipeptide sequences and that can phase separate (Nott et al., 2015). We removed pi-character while leaving charge intact by replacing all 24 Arg residues with Lys. Matching our expectation, this protein region fails to phase separate under the conditions characterized for the wild-type Ddx4 sequence, even at concentrations of 400 mg/ml, 200 times higher than the lowest concentration for which phase separation is observed for the wild type, and four times higher than observed for constructs with an equivalent mass change from mutating nine phenylalanine residues to alanine (Appendix 1—table 4). We note that arginine is likely key for the phase-separation, association and toxicity of C9orf72, which can encode Gly-Arg and Pro-Arg dipeptide repeat sequences (Lee et al., 2016).

Prediction of phase separation using pi-pi contacts

Given this supportive experimental evidence for the role of pi interactions in phase separation and our observation that opportunistic non-local pi interactions are commonly found at protein crystal contacts, we chose to test the importance of these interactions for phase separation by determining the degree to which it is possible to predict general phase separation behavior using solely the pi-pi contact propensity of a protein sequence. We recognize that multiple physical interactions can contribute to driving phase separation (Brangwynne et al., 2015), but our goal was not to predict subtle differences in phase separation propensity or quantitative phase diagrams. Instead, we aimed to merely classify proteins as having the potential to self-associate under particular biological conditions or not, as a test of our hypothesis of the involvement of planar pi interactions. In this exercise, we define phase-separating proteins as those that for presumed functional reasons self-associate in a way that is at least transiently reversible and dynamic, allowing for the protein to self-concentrate as a function of available protein concentration, temperature or other condition. This basic definition does not cover the complexity of the phase diagram, merely the ability to reversibly self-concentrate, and does not consider competing transitions, such as irreversible aggregation and precipitation, which have typically been selected against in the natural sequences on which the predictor is designed to be used.

Using this definition, we applied a constrained training approach divided into two stages. In the first stage, we required accurate prediction of contact propensities for folded proteins, using sequence propensities for both local and non-local contacts. For this aim, we developed a statistical method for predicting the expected number of contacts given a protein sequence, using frequencies taken from the PDB, splitting observations by distinct residue pairs with varying sequence separation and applying a statistical comparison of the full list of pairs associated with a given sp² group to calculate expectations (see Materials and methods). The reliability of these predictions against folded proteins is given in Figure 5A. We then predicted the number of pi-pi contacts for a list of 11 proteins containing IDRs that have been shown to be sufficient for phase separation behavior in vitro (Figure 5—source data 1A), finding that 8 out of the 11 have a predicted number of planar pi-pi contacts per residue in the 99th percentile relative to folded proteins found in the RCSB PDB (Figure 5B).

Figure 5 with 2 supplements see all

Download asset Open asset

Prediction of phase separation based on planar pi-pi interactions.

(A) Reliability plot showing average predicted and observed contact frequencies for percentile bins by pi-pi contact prediction for proteins in the PDB, with PDB sequences used for training in blue and the leave out set in red. Bars show SEM. (B) Highest number of contacts predicted, by window, for two phase separation predictor training sets and three test sets, for the unoptimized predictor. (C) Modified ROC curve showing the final predictor’s performance on three test sets vs. the human proteome, with the full set in pink (N = 62), the full set minus the insufficient for phase separation set shown in green (N = 44), and the sufficient for phase separation set in blue (N = 32). (D) Results for the final predictor (as for panel b) plotted with the predictor’s phase separation propensity scores (PScore). Data underlying B-D included in Figure 5—source data 1 and Figure 5—source data 2.

https://doi.org/10.7554/eLife.31486.012

Figure 5—source data 1 Phase separation training, testing and designed protein test sets. Excel table containing identification and literature references for proteins in the phase separation test and training sets, with sheet one showing the training set proteins, two showing proteomic test set proteins, and three showing synthetic test set proteins.: https://doi.org/10.7554/eLife.31486.015
Download elife-31486-fig5-data1-v2.xls
Figure 5—source data 2 Additional phase separation propensity scores used in final ROC analysis. Excel table containing protein IDs and predicted propensity scores, with different datasets on each sheet. Sheets 1–3 have full predictions for the human, E. coli, S. cerevisiae proteomes, respectively. Sheet four repeats the subset of human proteins found in the DisProt database. Sheet five shows scores for the protein sequences found in our non-redundant PDB set, and sheet six repeats the subset of PDB sequences withheld from predictor training.: https://doi.org/10.7554/eLife.31486.016
Download elife-31486-fig5-data2-v2.xls

For the second stage, we developed a phase separation predictor that ranks sequences only by the weighted combinations of pi-contact frequency predictions, without any other interaction or observational data. We used a stochastic optimization approach to find optimal weights and sequence window normalizations for converting pi-contact frequency predictions into a score function able to discriminate known phase-separating proteins from sequences found in the PDB. The individual components weighted and normalized include: (i) short- and long-range contacts as defined by residue pair sequence separation ≤4 or >4, respectively, (ii) sidechain groups vs. the backbone peptide bond, (iii) absolute predicted frequency vs. normalized frequency compared to the specific group, and (iv) number of carbon atoms in the specific group. In constraining this stage of the test, we defined the fixed goal for optimization as the PDB normalized z-score difference between the highest scoring 1% of the PDB and the lowest scoring member of the phase separation training set. We then trained until reaching a plateau, and at that point we finalized the score, running a single validation test against a testing set of 62 proteins directly associated with phase separation in the literature. This testing set can be divided into three subsets by the nature of the evidence available: (i) sufficient for in vitro phase separation as a purified single component (which matches the training set), (ii) evidence of in vitro phase separation involving a multi-component system (e.g., phase separates on the addition of RNA), without evidence of independent phase separation, and (iii) direct evidence of in cell phase separation (where the protein itself has been labeled and dynamic exchange demonstrated by FRAP or similar methods) without evidence of in vitro phase separation or sufficiency.

We used receiver operating characteristic (ROC) plots comparing predictions of phase-separating proteins within the test set against predictions of phase-separating proteins in the human proteome to assay the ability of the predictor to rank known positives against the members of a set that we assume is primarily negative; the area under the curve (AUC) measurement describes the ability to discriminate between sets. For the human proteome as the negative set, we show an AUC of 0.88 ± 0.02 measured using the entire testing set as a positive, 0.93 ± 0.01 if we exclude sequences which only phase separate in complex with other polymers, and 0.96 ± 0.01 if we restrict to the 32 test set sequences that match the sufficiency criteria used for selecting the training set (Figure 5C and Appendix 1—figure 7A). These measurements are complicated by the potential for homology between test set and training set proteins. To control for this, we also measured discrimination using another positive set of the 59 artificial sequences designed and shown to phase separate by the Chilkoti lab (Quiroz and Chilkoti, 2015; MacEwan et al., 2017; Simon et al., 2017) (details in Figure 5—source data 1C), showing an AUC of 0.86 ± 0.03 against the human proteome as a negative set (Appendix 1—figure 7B).

Interpreting these AUC values is complicated by the fact that the true positive rate of the human proteome is unknown, and our analysis will treat unknown phase-separating proteins as false positives, inaccurately decreasing the AUC. Similar analysis against protein sets with less expected phase separation results in higher AUCs, going from 0.88 ± 0.02 for human to 0.92 ± 0.01 for Caenorhabditis elegans, 0.93 ± 0.02 for Saccharomyces cerevisiae, 0.98 ± 0.01 for Escherichia coli, and 0.97 ± 0.01 for our PDB testing set. Within the comparisons to E. coli and S. cerevisiae proteomes, we show examples of the proteome-dependent score distributions underlying the analysis (Appendix 1—figure 7C,D). Using a defined standard confidence threshold of ≥4.0 standard deviations from the PDB average for the propensity score (PScore) captures 0.3%, 2.2%, and 5.1% of the E. coli, S. cerevisiae, and human proteome sets, respectively, as compared to 0.1% of our full PDB set and 81% (26/32) of the self-sufficient for in vitro phase separation test set (dropping to 36/62 for the entire proteomic test set and to 35/59 for the synthetic test set).

When compared to the unweighted pi-contact predictions, the trained PScore confirms the training results, with the number of test set proteins that fall within or above the top 1% range of the PDB increasing from 11/30 to 29/30 (Figure 5D). This increase is matched by an increase in the percentage of human proteins in the same range, from 2.3% to 13.1%. Even though the score is trained for discrimination against folded proteins, we do not see a systematic increase in the scores of all disordered human proteins. Comparison against a top performing sequence homology-based disorder predictor (Disopred3, [Jones and Cozzetto, 2015]) and a physics-based disorder predictor (IUPRED-Long [Dosztányi et al., 2005a]) shows that disorder predictors are better at discriminating disordered proteins from the PDB and the human proteome, while the PScore is consistently better at identifying phase-separating proteins (Appendix 1—table 5). The majority of the proteins in our phase separation test set show disordered character, and the analysis shows that, while PScore does correlate with disorder, it only highlights a subset of disordered proteins and does not reflect a general disorder prediction (Figure 5—figure supplement 1). As a direct test of this discrimination, we find that using the subset of human proteins with known intrinsic disorder (Piovesan et al., 2017) as the phase separation negative set shows similar results as using the human proteome as the negative, at AUC:0.84 ± 0.03 for the full test set and AUC:0.93 ± 0.02 for the in vitro sufficient set.

We note that the optimization methodology used for developing our predictor, specifically training for discrimination against the PDB, was intended to exclude phase separation involving multivalent binding properties of folded proteins with multiple binding surfaces (Pierce et al., 2016; Marzahn et al., 2016) or multiple folded modular binding domains that interact with multiple linear sequence motifs (Li et al., 2012; Banjade and Rosen, 2014). Thus, we expect and find a lower success rate for prediction of phase separation of proteins using these mechanisms. We also note that the goal of the prediction experiment is to see whether observed phase separation can be predicted exclusively from contact probabilities as a test of the hypothesis that pi interactions are important for phase separation, but that our method uses probabilities found in the PDB, was trained on natural sequences, and was tested using sequences that are either found in nature or were designed based on sequences that are. The ability to predict contacts is expected to decrease for sequences not observed in nature and for sequences relying to a greater degree on other energetic contributions.

Mechanistic implications of the optimized phase separation predictor

In order to identify the contact features that play the largest role in the optimized predictor, we did a retrospective analysis testing the predictive power of different scoring algorithms produced during the training process, and explored potential mechanistic implications by testing the power of individual score components, grouping contact predictions into long-range vs. short-range and backbone vs. sidechain (Appendix 1—table 6). Our analysis shows that, while training did improve the predictor, a comparable result can be obtained by using only the long-range contact rate predictions for the peptide backbone (Figure 5—figure supplement 2, as further described in Appendix 1). This property significantly upweights the role of residues, especially Gly and Pro, that are associated with high overall backbone pi-pi contact frequencies and with lower short-range contact frequencies for local sidechain groups, and is especially important for predicting elastin-like proteins, which often have very few sp²-containing sidechains. Thus, these results highlight the increased availability of sp² groups for non-local pi-interactions as a key driving force behind the phase separation predictions and is consistent with highly multivalent weak interactions leading to phase separation, both in non-polar structural proteins like elastin and highly charged RNA-binding proteins like FUS or Ddx4.

Many high contact frequency residue types are also associated with disordered proteins in general, so to control for that potential role we took a selection of 3501 human proteins predicted to have long disordered regions (as described in the methods), split them by PScore into high (PScore ≥4) and low (PScore <1) subsets, and compared the sequence characteristics distinguishing high PScore and low PScore sequences (Appendix 1—figure 8A). We find that non-phase-separating intrinsically disordered proteins are actually depleted in Gly and Pro, especially relative to the enrichment seen in phase-separating sequences and sequences predicted to phase separate. Conversely, they are most enriched in Lys, which on average is depleted in phase-separating sequences.

While the division of the predictor into two distinct protocols was used to avoid scores that simply describe sequence similarity to the training set, it is still possible that the training process picked up on specific sequence features in the training set. To explore the contribution of sequence similarity to the score, we made a measurement of sequence profile similarity based on dipeptide composition (neighboring residue pair frequencies). We compared the high scoring regions selected by the predictor to each of the sequences used in the training set (Appendix 1—table 7, see Materials and methods). This analysis, shown in Appendix 1—figure 8B, finds that high scoring (PScore ≥4.0) human proteins are, on average, more similar to the training set than are human proteins in general, but that the majority fall within the normal range. Comparison to a set of 1000 BLAST-level sequence homologs of the training set suggests that the majority of the similarity is compositional preference, not homology.

Both sequence similarity and compositional behavior can also be related to the bias toward disorder regions observed in phase-separating proteins. To characterize this, we again took the high and low PScore subsets of our set of 3501 human proteins predicted to have long disordered regions and then compared their sequence profiles. It has previously been observed that disordered proteins have a Shannon entropy (a measurement of sequence complexity) that is lower, but significantly overlapping with ordered proteins (Romero et al., 2001). We find here that the high PScore set has a Shannon entropy that is far lower than the range seen for low PScore disordered proteins, which have Shannon entropies that fall in the range observed for folded proteins (Appendix 1—figure 8C). Comparing our phase separation test set with the human disprot set we can confirm that this bias toward lower complexity sequences is observed in known phase-separating sequences.

Analysis and validation of predictions of phase separation

Given the favorable characteristics of our predictor, we investigated correlations of phase separation scores with protein interactions, various biological mechanisms that may regulate phase separation and gene ontology (GO) terms. The principle of sequences with high propensity for non-local pi-pi contact being more likely to self-associate implies that different proteins with high phase separation propensity scores would be more likely to interact with one another. By comparing score pairs from protein interactions taken from the I2D metadatabase (Niu et al., 2010), we confirm that high-scoring proteins and low-scoring proteins are both over two-fold more likely to interact with proteins of similar score, relative to expectations (Figure 6A). This holds true even when comparing interactions between largely hydrophobic or cytoskeletal proteins (such as elastin and collagen) and highly polar RNA-binding proteins (like Ddx4 and FUS).

Figure 6

Download asset Open asset

Association of phase separation propensity scores with protein interactions, splice isoforms, PTMs, and GO localization, process, and function terms.

(A) Protein-protein interaction enrichment by the PScore of partner 1 vs. the PScore of partner 2. The color gradient shows the natural logarithm of the observed over expected ratio. (B) Percentage of human proteins at each PScore range that are detected in more than 10% of AP-MS negative control experiments. (C), Score ranges for alternative splicing variants shown as vertical lines sorted by reference sequence values. (D), Number of PTMs vs. average relative PScore, with methylation shown in red, phosphorylation in green, and ubiquitination in blue.

https://doi.org/10.7554/eLife.31486.017

This like-score interaction propensity is predicted by a model of phase separation in which multivalent but individually low-affinity interactions between proteins of similar character coordinate the formation of large, dynamic complexes. To test this aspect of the score, we looked at large complex formation and interaction propensity by examining the background ‘contamination’ rates observed in affinity purification coupled with mass spectrometry (AP-MS). Large complex formation is measured by the number of negative control experiments in which each human protein appears, over a set of 411 experiments involving non-specific affinity purification steps performed without the specific affinity tag (Mellacheruvu et al., 2013). Within this dataset, we observe that 26/28 of our known human phase-separating proteins show up as a contaminant in at least one experiment (O/E = 3.51), and 17/28 show up more than 10% of the time (O/E = 14.9), confirming that phase-separating proteins show the expected behavior. By binning proteins by prediction scores, we show that this is also a trend for high PScore proteins in general (Figure 6B), suggesting that the pi contacts driving this score may play a general role in localizing proteins to large complexes.

Phase separation behavior could potentially be modulated by the addition, modification, or removal of even small segments with high phase separation propensity, leading to regulation of phase separation by alternative splicing and post-translational modification (Romero et al., 2006; Hegyi et al., 2011). To test the possible regulation by splicing, we ran our predictor against human sequences in the UniProtKB/Swiss-Prot (Magrane et al., 2011) variable splicing database. We found that 40 ± 2% of included proteins strongly predicted to phase separate (PScore ≥4) have alternative splice variants which either remove the prediction or significantly change the score (ΔPScore >1), often having multiple splice variants spanning a wide range of scores (Figure 6C). By comparison, an overall rate of significant changes in score (ΔPScore >1) of 23.0 ± 0.4% is observed for all proteins in the set.

To examine post-translational modifications (PTMs), we analyzed our scores against the database of known PTMs curated by PhosphoSitePlus (Hornbeck et al., 2012). We tested the relationship between predicted propensities and number of PTM sites, controlling for protein length by taking PTM counts from the maximum number of annotations observed for any 100 residue window in a sequence. By comparing populations with an above average number of sites (greater than the average plus one standard deviation) against the baseline frequency, we see enrichment in high PScores (≥4) for a variety of PTM site annotations, including literature annotations of disease relevance and known regulatory function (Appendix 1—table 8). We also observe that for phosphorylation and methylation the absolute number of PTMs correlates with the average PScores observed, with methylation having a stronger effect than phosphorylation, and ubiquitination shown as a negative control (Figure 6D).

Next, we compared our phase separation predictions to known localization or function, as annotated in the GO database (Figure 7A,B,C). Ranking GO terms by enrichment of proteins with prediction values above our threshold (PScore ≥4) enabled us to generate a list of terms associated with significant enrichment of pi-pi contacts (p<0.000001 and 5–50 fold observed over expected); this list includes 4.1% of the 27342 GO terms tested. This subset of the GO database demonstrates enrichment for phase separation propensity in known phase-separated compartments (stress granules, Cajal bodies, post-synaptic density [Zeng et al., 2016]), in RNA processing (transcription, splicing, modification, transport, and stability), in the assembly and plasticity of structural components (cytoskeletal organization, extracellular matrix assembly), and in signaling, regulation, and development (Notch signaling, NF-κB, Wnt). We note that the sequence property predicted here is a physical behavior that occurs on a cellular scale, so the observation of a similar score distribution for a specific biological process, as observed for annotations involving localization to known phase-separating bodies, is an implicit prediction that phase separation is one of the physical mechanisms involved in the process. Consistent with that, we see similar score distributions for many processes involving organization of structural components, signaling, and cell-fate commitment. The property of phase separation is also strongly associated with the regulation and development of multicellular cooperation and neurogenesis. In contrast, the vast majority of GO terms (77.2%) show no enrichment in phase separation propensity, with significantly lower enrichment in categories involving metabolic processes and enzymatic catalysis. A selection of high-scoring human proteins associated with enriched functions is shown with per-residue scores and PTM annotations in Appendix 1—figure 9, with examples chosen from the highest scoring protein in any given gene ontology function/localization annotation related to neuronal plasticity or behavior in A, cytoskeletal biomaterials in B, signaling in C, and extracellular biomaterials in D.

Figure 7

Download asset Open asset

PScore enrichment by gene ontology annotation for subcellular localization (A), biological process (B), and molecular function (C).

The color gradient shows the natural logarithm of the observed over expected ratio. Heatmaps show enrichment in vertebrate sequences across six defined score ranges, with the highest score range (PScore ≥4) labeled with human enrichment values calculated using PANTHER (see Materials and methods).

https://doi.org/10.7554/eLife.31486.018

Within the testing set, there exist some proteins which have not been shown to be capable of independent phase separation (Han et al., 2012), and which may associate with phase-separated bodies without sharing the same behavior. One of these, synaptic functional regulator FMR1 (El Fatimy et al., 2016), also known as fragile X syndrome protein FMRP, has a PScore of 4.7, and is involved in RNA binding, neurological development and regulation of translation, all GO terms enriched in high PScores. FMR1 is a multifunctional polyribosome-associated protein, which is highly expressed in the brain and in the testes, and is known to localize to granular bodies with two other proteins (FXR1 and FXR2) (El Fatimy et al., 2016) that are also predicted with high PScores (at 2.9 and 5.3). In order to validate that high PScore predicts sufficiency for phase separation and not associated properties like miscibility in the separated phases of other proteins, or other interactions with phase-separating proteins, we purified the highest scoring region (residues 445–632) and confirmed the ability to spontaneously undergo liquid phase separation at low temperature and high concentrations in physiological buffer conditions (Figure 8A). The concentration required for visual confirmation of liquid phase separation behavior is quite high, at 1 mM FMRP-LCR, but can be reduced through the use of crowding reagents (Figure 8—figure supplement 1A). To confirm the relevance of pi-character, we then replaced all 28 Arg residues with Lys, which resulted in a loss of phase separation behavior (Appendix 1—table 4).

Figure 8 with 1 supplement see all

Download asset Open asset

Visual confirmation of phase separation.

(A) Test tubes containing transparent or turbid solutions of 1 mM FMR1 C-terminus (residues 445–632) along with their corresponding DIC microscopy images taken at room temperature or 4°C, respectively. (B) 1 mM FMR1 C-terminus forms droplets exhibiting liquid fusion properties at 4°C. (C) 40 µM solutions of Human Cytalomegalovirus pAP along with corresponding microscopy images taken at room temperature or 80°C, respectively.

https://doi.org/10.7554/eLife.31486.019

To test whether or not the predictor is applicable to sequences that do not share motifs or functions with any of our training set proteins, we did a manual search for predictions with sequence properties and functions dissimilar from the training set proteins and selected two proteins, human engrailed-2 (UID: P19622, PScore 5.0), a DNA-binding homeobox protein, and the pAP isoform of the Human cytomegalovirus capsid scaffolding protein (UID: P16753-2, PScore 3.8), a protein that plays an essential structural role in assembling the viral capsid, a novel function relative to those known to involve phase separation. Both sequences have little overlap with any of the sequence motifs found in our training set (Appendix 1—table 7), aside from general enrichment in glycine and proline residues. Experimentally, we observe reversible liquid phase separation of pAP protein with increasing temperature, with viscoelastic properties similar to the coacervation of elastins (Figure 8C). We did not observe phase separation of engrailed-2 under the same buffer conditions, even at 1 mM protein concentration, but did observe temperature-dependent liquid droplet formation in the presence of a crowding reagent (20 mg/ml ficol) (Figure 8—figure supplement 1B). While these observations do not represent a robust or comprehensive test of prediction quality, they do suggest that the predictions provide a useful tool for selecting natural proteins capable of self-sufficient liquid demixing.

Discussion

We tested the potential role played by pi-contacts in mediating phase separation by using the single property of pi-contact frequency to train a simplistic predictor of phase separation behavior found in natural sequences, finding that the single property of long-range pi-contact propensity is sufficient for marking the majority of known phase-separating proteins as outliers relative to the proteome, supporting the hypothesis that this sequence property is commonly associated with phase-separating proteins. While this association is demonstrably useful for identifying phase-separating proteins in proteomic datasets, these contacts may not be the predominant interaction driving the physical process of phase separation for each case, and could instead reflect a modulatory role since it is not exclusive of other interactions like hydrogen bonds and charge interactions. However, tests showing that arginine to lysine mutations abrogate phase separation behavior do provide evidence of the importance of planar sp² groups for phase-separating systems.

The finding that a single contact potential can generate a reasonably accurate classifier of phase separation behavior suggests that a sequence-based prediction of phase separation behavior is a tractable problem, and that future development of an algorithm that can predict the complexities of the phase transition, environmental effects and concentration requirements is a reasonable goal. This goal could potentially be addressed by introducing the range of phase separation associated sequence properties that were intentionally excluded by our empirical test of the pi-contact association, including the electrostatic effects of charge patterning (Nott et al., 2015; Lin et al., 2016; Das and Pappu, 2013), multivalency of PTM sites and PTM-binding motifs, and transient structural interactions, including strand formation (Murray et al., 2017) and helical interactions (Conicella et al., 2016). There is also a role for incorporating predictions of competing states, the irreversible aggregation propensity of a sequence or its amyloidogenic potential. Incorporating annotation data associated with phase-separating proteins could be another avenue for generating a physiological classifier in a more comprehensive predictor.

The physical nature of pi-pi contacts and their underlying mechanistic relationship to phase separation are not revealed by the simple contact frequency measurements used in our predictions. These contacts are observed in folded proteins, both internally and near solvated interfaces and, while that suggests they play a general role in the energetics of protein-protein interactions, the nature of that role is not clear. There is potential for electrostatic or induced dipole and quadrupolar interactions, especially in the context of other dipole interactions and hydrogen bonds, but the flat surfaces of sp² groups could also enable solvation to drive contacts and lead to entropic contributions due to the relative freedom of movement inherent in packing flat plates, compared to the more rigid shape complementation involved in packing aliphatic groups. It is interesting to note that these proposed mechanisms could be affected by temperature in opposite ways, and that our predictor using pi-contact frequencies is useful in identifying phase-separating proteins regardless of whether they associate more readily as temperatures decrease (such as Ddx4) (Nott et al., 2015) or increase (such as elastin) (Yeo et al., 2011).

As part of characterizing the proteomic associations highlighted by our empirical prediction test, we point out that manual inspection of our prediction results across the human proteome suggests that planar pi-contact associated phase separation likely facilitates a wide range of cellular functions. To highlight this, we selected a range of examples by taking the highest scoring member of gene ontology categories we found to be generally enriched in high PScore proteins. We see enrichment of phase separation propensity in proteins associated with cytoskeletal organization. These include proteins with known structural roles such as the cytoskeletal intermediate filament proteins desmin and vimentin (PScores 4.3 and 4.4), as well as keratins 8 and 18 (PScores 5.9 and 5.4), with scores deriving primarily from the disordered head and tail domains (Appendix 1—figure 9). Intermediate filaments form through dynamic processes (Yeo et al., 2011) consistent with a model in which phase separation-induced condensation concentrates proteins prior to the formation of (often fibrillar) structure (Yeo et al., 2011). Interestingly, helical domain mutations impeding structure formation cause these four proteins to instead accumulate in protein-rich membraneless inclusions such as Mallory-Denk bodies (Strnad et al., 2008; Goebel and Bornemann, 1993; Ogrodnik et al., 2014). We also predict high PScores for non-structural proteins involved in regulating cytoskeletal organization and in binding some of the previously mentioned cytoskeletal proteins, including focal adhesion kinase 1 (PScore 4.2) and DNAJB homolog 6 (PScore 8.8), the latter of which is also a chaperone that can prevent huntingtin aggregation (Chuang et al., 2002; Hageman et al., 2010; Gillis et al., 2013).

Many of the high PScore predictions involve proteins that are both involved in signaling pathways and known to either localize to membraneless organelles or interact with phase-separating proteins. For example, adenomatous polyposis coli protein (APC) (PScore 3.2) and axin1 (PScore 2.2), involved in the Wnt signaling pathway, interact in a dynamic fashion in the large and multimeric β-catenin destruction complex (Pronobis et al., 2015), and we find high PScores for other critical members of the complex, including β-catenin (PScore 5.5) and GSK3α (PScore 6.4). The β-catenin destruction complex formation is regulated by GSK3 phosphorylation, and we note that the predictor shows a difference between the two human GSK3 orthologs, with GSK3β having a PScore of 2.2. These orthologs are often functionally interchangeable (Doble et al., 2007), but there is evidence of isoform-specific roles for GSK3α (Ma, 2014) and the predicted differences could reflect a difference in modulating phase-separation behavior.

In conclusion, we have shown that planar pi-pi interactions are more prevalent in protein structures than previously described, with potential roles in structural motifs, catalysis and RNA binding. Planar pi-pi contact frequencies are increased in protein segments that lack regular secondary structure or have increased solvent exposure, pointing to their relevance for disordered protein regions. This, together with the enrichment of pi-containing groups in protein regions known to phase separate, provided an impetus for development of a phase-separation predictor based on the likelihood of forming non-local planar pi-pi contacts. The performance of the predictor supports the hypothesis that these pi-pi interactions can drive phase separation. While experimental data and computational work suggest other contributions (Yeo et al., 2011; Pak et al., 2016; Nott et al., 2015; Lin et al., 2016; Kim et al., 2016; Brangwynne et al., 2015), including the hydrophobic effect, electrostatics and multivalent binding of folded protein domains, our prediction test shows that an algorithm focused solely on pi-interactions performs well for the majority of proteins that we identified as phase-separating from the literature (Figure 5C, Figure 5—source data 1B). These results strongly suggest that most phase-separating proteins can make significant non-local planar pi-interactions, even in cases where there are other dominant or required forces driving phase separation. Thus, this represents a valuable tool for the currently expanding field of protein phase separation and its link to biological function and disease (Mitrea and Kriwacki, 2016; Conicella et al., 2016; Patel et al., 2015). In particular, the association of neurological diseases with proteins comprising RNA processing bodies (Vanderweyde et al., 2013), including those known to phase separate, highlights the importance of predictive methods for facilitating mechanistic studies of the underlying biology and pathology.

Materials and methods

Key resources table

Reagent type (species) or resource	Designation	Source or reference	Additional information
Recombinant DNA reagent	His-SUMO-Ddx4 ^1-236	PMID 25747659	Expression vector (His-Sumo tagged) for Ddx4 residues 1–236, sequence from UID: Q9NQI0-1 (uniprot identification)
Recombinant DNA reagent	His-SUMO-Ddx4 ^1-236(9FtoA)	PMID 25747659	Expression vector (His-Sumo tagged) for Ddx4 residues 1–236, sequence from UID: Q9NQI0-1, 9 out of 14 phenylalanines mutated to alanine
Recombinant DNA reagent	His-SUMO-Ddx4 ^{1-236(14FtoA)}	PMID 28894006	Expression vector (His-Sumo tagged) for Ddx4 residues 1–236, sequence from UID: Q9NQI0-1, all phenylalanines mutated to alanine
Recombinant DNA reagent	His-SUMO-Ddx4 ^1-236(RtoK)	PMID 28894006	Expression vector (His-Sumo tagged) for Ddx4 residues 1–236, sequence from UID: Q9NQI0-1, all arginines mutated to lysine
Recombinant DNA reagent	His-SUMO-FMR1^445-632	This paper	Expression vector (His-Sumo tagged) for FMR1 residues 445–632, sequence from UID: Q06787-1
Recombinant DNA reagent	His-SUMO-FMR1^{445-632(RtoK)}	This paper	Expression vector (His-Sumo tagged) for FMR1 residues 445–632, sequence from UID: Q06787-1, all arginines mutated to lysine
Recombinant DNA reagent	His-SUMO-pAP^A341Q	This paper	Expression vector (His-Sumo tagged) for SCAF isoform pAP, sequence from UID: P16753-2, alanine 341 mutated to glutamine
Recombinant DNA reagent	His-SUMO-EN2	This paper	Expression vector (His-Sumo tagged) for Engrailed-2, sequence from UID: P19622-1

Measurement			Value	N
Contacts per 100 residues
	Pi-Contacts per 100 residues, averaged over PDBs		6.06 ± 2.5*	5,718 PDBs
	Pi-Contacts per 100 residues, averaged over all residues		6.27 ± 0.03	1,384,228 residues
Atom Contact Probabilities (%)
	Heavy Atoms in a Pi-Contact		6.10 ± 0.03	10,836,487 atoms
	sp² Heavy Atoms in a Pi-Contact		10.52 ± 0.05	6,283,150 atoms
	Heavy Atoms within 4.9 Å of any Pi-Contact		32.1 ± 0.1	10,836,487 atoms
Sidechain-Sidechain Contact Proportions (%)				25,930 contacts
	Aromatic to Aromatic		24.73 ± 0.29	“
	Aromatic to Non-Aromatic		53.24 ± 0.33	“
	Non-Aromatic to Non-Aromatic		22.03 ± 0.28	“
All Contact Proportions (%)				86,860 contacts
	Sidechain to Sidechain		29.85 ± 0.17	“
	Aromatic Sidechain to Backbone		40.41 ± 0.20	“
	Non-Aromatic Sidechain to Backbone		22.80 ± 0.16	“
	Backbone to Backbone		6.94 ± 0.09	“
	Aromatic to Aromatic		7.38 ± 0.10	“
		outnumbered by Aromatic to Non-Aromatic	7.6 ± 0.1 to 1	“
		outnumbered by Non-Aromatic to Non-Aromatic	3.9 ± 0.1 to 1	“
Arginine Sidechain Contacts (per 100 residues)				61,877 residues
	Contact to Aromatic		9.74 ± 0.13	“
	Contact to Backbone		10.6 ± 0.13	“
	Contact to Glutamine/Asparagine Sidechain		1.96 ± 0.06	“
	Contact to Glutamate/Aspartate Sidechain		1.49 ± 0.05	“
	Contact to Arginine Sidechain		3.63 ± 0.11	“

Amino acid sp²group*	# PDBs	# Ligand Groups	Ligand Pi-Pi Contact Frequency (%)	# Protein Groups	Protein contact frequency (%)	O/E (Ligand/Protein)
GLU Sidechain	84	209	16.3 ± 4.1	5353	5.0 ± 0.4	3.3 ± 0.8
HIS Sidechain	36	80	42.5 ± 8.0	530	17.4 ± 2.7	2.4 ± 0.7
PHE Sidechain	36	93	53.8 ± 9.0	1389	28.4 ± 2.3	1.9 ± 0.3
ARG Sidechain	61	145	43.9 ± 6.0	2878	15.9 ± 0.9	2.8 ± 0.5
TYR Sidechain	30	68	58.8 ± 12.2	806	19.9 ± 1.9	3.0 ± 0.7
GLN Sidechain	21	50	22.0 ± 8.9	800	13.0 ± 2.3	1.8 ± 0.8
ASP Sidechain	39	86	3.5 ± 2.0	2153	4.5 ± 0.8	0.8 ± 0.5
TRP Sidechain	43	109	50.5 ± 8.0	377	28.9 ± 2.3	1.7 ± 0.3
ASN Sidechain	11	32	6.3 ± 7.3	466	8.6 ± 1.9	0.7 ± 0.9
Amino Carboxyl	688	1704	8.9 ± 1.1	976	5.7 ± 1.2	1.5 ± 0.4
Small molecule^†	# PDBs	# Free Ligands	Ligand Pi-Pi Contact Frequency (%)	# sp² Atoms	RCSB Ligand ID	Isomeric SMILES
Ethanal	44	76	3.9 ± 2.9	2	ACE	CC = O
Formic Acid	444	2093	11.0 ± 0.8	3	FMT	OC = O
Acetate Ion	1664	4794	12.9 ± 0.6	3	ACT	CC([O-])=O
Acetic Acid	403	1133	13.5 ± 1.5	3	ACY	CC(O)=O
Nitrate Ion	225	852	15.3 ± 1.7	4	NO3	[O-][N+]([O-])=O
Guanidine	32	115	15.7 ± 4.7	4	GAI	NC(N)=N
Urea	23	91	16.5 ± 4.0	4	URE	NC(N)=O
Imidazole	279	684	26.6 ± 2.4	5	IMD	C1C[NH+]C[NH]1

Residue type	Non-catalytic contact frequency (%)	Catalytic residue contact frequency (%)	N catalytic	Enrichment	Non-catalytic percent of VDW (%)	Catalytic residue percent of VDW (%)	Percent of VDW ratio (cat./non)
ANY	13.1 ± 0.1	24.5 ± 0.9	2914	1.87 ± 0.07	1.91 ± 0.09	3.94 ± 0.40	2.06 ± 0.23
HIS	31.9 ± 0.9	35.9 ± 2.3	471	1.12 ± 0.06	3.01 ± 0.26	4.33 ± 0.84	1.45 ± 0.31
ASP	12.9 ± 0.5	21.2 ± 2.0	448	1.65 ± 0.14	1.71 ± 0.22	3.08 ± 0.81	1.83 ± 0.54
GLU	11.7 ± 0.4	32.2 ± 2.5	370	2.75 ± 0.18	2.44 ± 0.31	6.57 ± 1.38	2.74 ± 0.71
ARG	26.7 ± 0.8	30.3 ± 3.2	287	1.14 ± 0.12	1.77 ± 0.29	7.85 ± 1.48	4.57 ± 1.26
LYS	6.2 ± 0.4	9.7 ± 2.0	259	1.56 ± 0.35	0.82 ± 0.20	0.37 ± 0.37	0.48 ± 0.52
TYR	36.1 ± 1.2	30.4 ± 3.7	171	0.84 ± 0.10	2.69 ± 0.39	4.62 ± 1.38	1.75 ± 0.59
SER	8.8 ± 0.5	13.0 ± 2.6	169	1.48 ± 0.28	0.83 ± 0.23	1.39 ± 0.70	1.84 ± 1.20
CYS	8.5 ± 1.3	14.7 ± 2.9	150	1.73 ± 0.26	1.45 ± 0.33	0.92 ± 0.66	0.68 ± 0.54
ASN	18.9 ± 1.1	26.6 ± 4.3	109	1.41 ± 0.20	1.88 ± 0.42	6.11 ± 1.76	3.42 ± 1.29
GLY	12.0 ± 0.7	16.2 ± 4.5	99	1.35 ± 0.41	1.26 ± 0.48	5.87 ± 1.92	5.72 ± 4.33
THR	6.8 ± 0.6	4.7 ± 2.3	86	0.69 ± 0.38	0.34 ± 0.18	0.88 ± 0.87	3.42 ± 4.17
GLN	18.0 ± 1.6	40.3 ± 7.3	62	2.24 ± 0.34	3.12 ± 0.53	1.88 ± 1.30	0.61 ± 0.44
ALA	7.8 ± 0.9	7.0 ± 3.3	57	0.91 ± 0.48	0.85 ± 0.43	1.28 ± 1.28	2.04 ± 2.75
PHE	34.2 ± 2.1	35.9 ± 7.1	53	1.05 ± 0.23	2.52 ± 0.64	4.92 ± 2.39	2.10 ± 1.27
TRP	46.2 ± 3.2	45.1 ± 7.5	51	0.98 ± 0.14	3.57 ± 0.73	2.88 ± 2.00	0.87 ± 0.68
AVG	18.7 ± 0.4	24.2 ± 1.2	N/A	1.42 ± 0.07

Sample	Concentration at which phase separation is observed (conditions)	# F	# R	Total #	Mw (Da)
Ddx4 1–236	(24°C, 20 mM Na₂PO₄ pH 6.5, 100 mM NaCl)
WT	~2 mg/mL	14	24	236	25430
9FtoA	~100 mg/mL	5	24	236	24745
14FtoA	~350 mg/mL	0	24	236	24364
RtoK	Not observed up to 400 mg/mL	14	0	236	24758
FMR1 445–632	(4°C, 20 mM Na₂PO₄ pH 7.4, 2 mM DTT)
WT	~16 mg/mL	2	28	188	20573
RtoK	Not observed up to 216 mg/mL	2	0	188	19789

Share this article

Cite this article

PDB statistics for planar pi-pi interactions.

Figure 1—source data 1

Figure 1—source data 2

Examples of planar pi-pi contacts in folded protein structures.

Correlation of planar pi-pi interactions with solvent and lack of secondary structure.

Sidechain contacts at interface positions.

Prediction of phase separation based on planar pi-pi interactions.

Figure 5—source data 1

Figure 5—source data 2

Association of phase separation propensity scores with protein interactions, splice isoforms, PTMs, and GO localization, process, and function terms.

PScore enrichment by gene ontology annotation for subcellular localization (A), biological process (B), and molecular function (C).

Visual confirmation of phase separation.

Contact definitions.

Cross validation against NMR restraints and X-ray structure resolution.

Pi-pi interactions underestimated by some energy functions.

Hydrogen bonding correlates with planar-pi contacts.

Backbone pi-pi contacts in secondary structure motifs.

Peptide sequence effects on contact frequency.

Phase separation propensity predictor testing.

Sequence comparisons of high PScore proteins.

Prediction examples.

Contact statistics in high resolution, low R-factor protein structures.

Small molecule contact frequencies.

Pi-pi contact enrichment for catalytic residues.

Effect of sp2 sidechain mutations on phase separation.

Comparison of phase separation prediction and disorder prediction.

Retrospective analysis of predictor quality at different stages during the training process.

Sequence similarity comparison.

High PScore enrichment for human proteins with a greater than average number of post-translational modification (PTM) site annotations in Phosphosite+.

Author details

Robert McCoy Vernon

Contribution

Competing interests

Paul Andrew Chong

Contribution

Competing interests

Brian Tsang

Contribution

Competing interests

Tae Hun Kim

Contribution

Competing interests

Alaji Bah

Present address

Contribution

Competing interests

Patrick Farber

Present address

Contribution

Competing interests

Hong Lin

Contribution

Competing interests

Julie Deborah Forman-Kay

Contribution

For correspondence

Competing interests

Citations by DOI

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

Categories and tags

Research organism

Effect of sp² sidechain mutations on phase separation.