Decoding Phase Separation of Prion-Like Domains through Data-Driven Scaling Laws

M Julia Maristany; Anne Aguirre Gonzalez; Jorge R Espinosa; Jan Huertas; Rosana Collepardo-Guevara; Jerelle A Joseph

doi:10.7554/eLife.99068.1

1 Introduction

Biomolecular condensates are highly multicomponent systems with remarkable specificity: they exhibit non-random molecular compositions and, concomitantly, contain distinct physicochemical environments defined by the collective behavior of their components [1, 2]. Phase separation has emerged as a leading mechanism to account for the formation of biomolecular condensates. In this thermodynamic process, segregative (density-driven) and associative (connectivity-driven) transitions are coupled to produce protein- and nucleic acid-rich phases suspended in the cellular cytoplasm or nucleoplasm [3]. Within this framework, intrinsically disordered proteins/regions (IDP/IDRs) have emerged as strong contributors to the multivalency and phase separation capacity of many naturally occurring proteins [1, 4–12]. Moreover, IDRs characterized by amino acid sequences of low complexity and enriched in aromatic residues, such as the prion-like low-complexity domains (PLDs) of RNA–binding proteins [8, 13–16], are prone to form condensates that behave like viscoelastic fluids. Furthermore, the PLDs of various naturally occurring proteins, such as Fused in Sarcoma (FUS), Trans-activation response DNA-binding protein 43 (TDP-43), and the heterogeneous nuclear ribonucleoprotein A1 (hnRNPA1) have been reported to transition from liquid-like condensates into dynamically arrested solids and glasses over time [17, 18]. Such arrested states have been postulated as precursors of pathological protein aggregates and amyloids [19]. Understanding the molecular factors that regulate the ability of PLDs to phase separate is of crucial importance, as these are conjectured to also facilitate their aggregation [20].

Biomolecular condensates are networked fluids sustained by multivalent macromolecules that interconnect with more than two partners simultaneously [1, 2, 21–27]. Thus, the factors that increase the molecular valency, give rise to more densely connected liquid networks, and more stable condensates condensates [1, 21, 22, 25, 28–31]. For PLDs, multivalency has been shown to be strongly promoted by the presence of aromatic residues, which act as “stickers” that enable associative π–π interactions, and also cation–π contacts involving arginine and lysine [8, 9, 32–38]. Indeed, the range of coexistence of PLD condensates has been shown to change most significantly in response to mutations involving aromatic residues [9, 39]. Quantitative experimental phase diagrams for a set of hnRNPA1-IDR variants in combination with mean-field simulations demonstrated that tyrosine is a stronger sticker than phenylalanine, arginine is a context-specific sticker and lysine destabilizes PLD–PLD interactions [35]. More importantly, such study demonstrates that, at least for hnRNPA1, there exist quantitative rules that explain the changes in thermodynamic parameters of condensates in response to precise amino-acid sequence mutations.

In this work, we investigate whether there are scaling laws that can predict the changes in critical parameters of a broader collection of PLDs in response to amino acid sequence mutations. Scaling laws are mathematical relationships that describe how the properties of a system change for a given variable, such as its size or composition [40]. In the context of polymer-like systems such as IDPs, scaling laws have offered invaluable insight into vastly different aspects of their physical behavior: for example, they were instrumental in determining that the relationship between viscosity and radius of gyration in polymer solutions is described by a power law between the two quantities [41]; demonstrating that aging scaling during polymer collapse is a universal property in homopolymers [42]; determining that in single polymer systems, polymer end-to-end distance is directly proportional to their length [43]; showing that foldable sequences with more compact unfolded states are due to protein evolution [44]; and, importantly, illustrating that IDPs indeed behave like a polymer in a good solvent [45]. In the context of protein condensates, we hypothesized that scaling laws, if found, may allow us to quantify how condensates respond to different intrinsic perturbations—such as amino acid mutations or presence/absence of different post-translational modifications, changes in protein length, increased/decreased flexibility—or external stimuli—such as changes to the pH of the medium or the ionic environment. Thus, we focused on on determining if such laws exist for PLDs by exploring six different RNA-binding proteins (i.e., hnRNPA1, TDP43, FUS, EWSR1, RBM14, and TIA1; Fig. 1 A,B), linked to the etiology of neurodegenerative disorders through pathology and genetics [46–56].

Data generation and representation of dataset for probing scaling laws of PLD phase behavior
**(A)** Representation of the prion-like domains (grey-shaded regions) of each considered protein, alongside the associated number of variants simulated, totaling a set of 140 sequences. **(B)** Composition of the wild-type sequence of the PLDs simulated, in terms of the percentage of neutral amino acids (A, C, I, L, M, P, S, T, and V; no net charge at pH 7 and no π electrons in the side chain), glycine (G), negative (D, E), positive (K, H, R), neutral-π (N and Q; no net charge at pH 7 with π electrons in the side chain) and aromatic (F, W, Y) residues. **(C)** Top panel: Representation of the simulation model: each amino acid of the protein is represented by a single bead, bonded interactions are modeled with springs, and non-bounded interactions are modeled with a combination of the Wang–Frenkel potential and a Coulomb term with Debye-Hückel screening. Middle panel: The Direct Coexistence simulation method, in the slab geometry, is employed, where both the protein-rich and protein-depleted phases are simulated simultaneously. Bottom panel: Finally, to obtain the data point in the concentration–temperature phase space, at a given temperature the density profile is computed and the concentrations of the rich and depleted phases are obtained via a suitable fitting (Methods). **(D)** Representation of the entire computational data set. Orange data points represent variants where charged or aromatic residues were mutated (i.e., stickers), while cyan data points represent all other types of mutations studied (i.e., spacers).

Leveraging the accuracy and efficiency of a modern sequence-dependent coarse-grained model [34], we quantify the phase diagrams (temperature-vs-density plane) of an unprecedentedly large set of mutants spanning all six PLDs mentioned above. We performed Direct Coexistence molecular dynamics (MD) simulations (Fig. 1 C), exploring simultaneously the protein-rich and protein-depleted phases of the condensate for each mutant, in the temperature–concentration phase space. Our massive set of simulations encompasses 140 PLD mutants, each sampled at 8 to 12 different temperatures. Accounting for independent replicas to ensure sufficient sampling and convergence, the dataset presented here comprises ∼ 2000 independent Direct Coexistence simulations (Fig. 1 D), totaling 1.2 milliseconds of accumulated sampling.

Our work puts forward a data-driven scaling-law framework for the study of PLD phase behavior. We reveal that the changes in the critical solution temperatures of PLDs, as a function of the number and type of amino acid sequence mutations (e.g., aromatic, positively charged, etc), can indeed be quantified by scaling laws, which are highly conserved across the family of PLDs we investigate. The addition of aromatic residues yields a positive linear correlation between mutation fraction and critical temperature changes, consistent with the dominant role of aromatic residues in PLD phase separation. We find that protein length dictates scaling effects when aromatic residues are mutated into uncharged, non-aromatic amino acids. We also quantify the effects of arginine mutations on condensate stability, where removing arginines decreases the critical temperature of the condensate—revealing an inverse relationship between the fraction of arginine deletions and critical temperatures. Finally, polar amino acids emerge as subtle modulators of the phase behavior of PLDs. By examining the impact of different amino acid substitutions on the critical solution temperature, our results provide insights into the underlying molecular mechanisms of protein condensation. The scaling laws presented in this study serve two purposes: firstly, to verify the generality of the molecular grammar that governs the phase behavior across various PLD proteins, and secondly, to quantify the extent to which the critical solution temperature of a PLD protein solution will change in response to specific amino acid sequence changes. We anticipate that scaling laws for other families of proteins with very different compositional biases (e.g., transcription factors, nucleoporins, or molecular chaperones) may exist, but could be distinct from those we report here for PLDs. Overall, our work suggests that scaling laws represent useful predictors of the phase behavior of protein families.

2 Results

2.1 Generating and Analyzing an Extensive Dataset of Critical Solution Temperatures for Prion-like Domain Variants

The phase behavior of PLDs and their modulation with amino acid sequence mutations has been extensively investigated with a variety of experimental techniques and with theory and simulation methods for systems including the hnNRPA1 PLD [35, 57], the FUS PLD [8, 52, 54, 58–66], and the TDP-43 PLD [48, 49, 53, 67, 68], among many others. Here, we perform residue-resolution coarse-grained Molecular Dynamics (MD) simulations (see Fig. 1 C, with further details in Methods) with our Mpipi model [34] to quantitatively compare the critical parameters of a total of 140 mutants of the PLDs of TDP-43, FUS, EWSR1, hnRNPA1, RBM14, and TIA1 (Fig. 1 A) under consistent solution conditions. From our simulations, we compute temperature-vs-density phase diagrams and estimate the critical solution temperature for each system. The PLDs of the proteins we investigate have similar composition biases—i.e., are highly rich in neutral, polar, and aromatic residues—and present less than 5% of total charged amino acids in their sequence (Fig. 1 B). Thus, our simulation results allow us to assess if there exists a code—i.e., a simple set of scaling laws—that can predict how the critical parameters of the family of PLDs will scale in response to amino acid sequence mutations.

Our residue-resolution model Mpipi (see Methods 4.2.2) is ideally suited to investigate the phase behavior of PLDs because it achieves the correct balance of the dominant π–π and cation–π interactions, which drive the phase behavior of PLDs. Indeed, the Mpipi model recapitulates the quantitative experimental phase diagrams of hnRNPA1 variants by Bremer and colleagues [35] with remarkable accuracy [34]. It is worth noting that the Mpipi model was not parameterized based on the experimental dataset of Bremer and colleagues, but rather the model parameters were developed independently by combining atomistic umbrella sampling MD simulations of amino acid pairs with the bioinformatics data of Vernon and colleagues for π-driven contacts [37]. Using Flory–Huggins–Staverman theory, critical temperatures for phase separation were extracted from the experimental data of Bremer et al., which were then used to validate the Mpipi model [34].

The amino acid sequences of all the proteins we investigated (Supporting Information) were designed based on five main considerations. (1) The set of mutants should serve to probe the effects of both the strong modulators of PLD phase separation—namely, aromatic and arginine interactions—and the subtle modulators—that is, polar and neutral amino acids, previously identified experimentally [8, 9, 35]. (2) To ensure that the observed changes in the critical solution temperatures can be attributed unequivocally to specific amino-acid-type changes, we mutated only one amino acid type at a time. However, multiple mutations of the same amino acid type were allowed. (3) Mutations were distributed homogeneously across the entire protein. (4) Deleted amino acids (e.g. aromatic deletions) were replaced by weakly interacting amino acids (i.e. serines, threonines, glycines, and alanines). For each PLD protein, the identity of the weakly interacting residues used as replacements were chosen to preserve the percentages of the various types of weakly interacting amino acids present originally in the unmodified PLD. (5) Achieve a data set that spans a wide array of critical solution temperatures, as shown in Fig. 1 D.

2.1.1 Extended Validation of the Mpipi Model for Phase behavior of Prion-like Domains

The availability of quantitative experimental binodals for the hnRNPA1-IDR and various amino acid sequence variants [35] has allowed us to test the accuracy of the Mpipi model to describe the phase diagrams of PLD solutions [34]. Here, we make such a test more stringent by extending the hnRNPA1-IDR validation set from 9 PLD variants to 21, selecting mutations spanning changes in aromatic, charged, polar, and neutral residues. The amino acid sequences of all the hnRNPA1-IDR variants considered are described in the Supporting Information and are denoted using the intuitive nomenclature established by Bremer and colleagues: variants that mutate n residues of type X into m residues of type Z are labeled as −nX + mZ, and so forth. For more details on the nomenclature, see Methods.

For each hnRNPA1-IDR variant, we perform Direct Coexistence MD simulations, compute critical temperatures from the simulations, and compare them with the experimental critical temperatures (Fig. 2 A–D). We use the fitting methodology outlined in Methods to estimate the critical solution temperatures from the experimental data.

Mpipi model quantitatively captures phase behavior for a large set of hnRNPA1 PLD mutants.
Mutations are divided into A aromatic, B Arg/Lys, C charged, and D Gly/Ser sequence variants of the PLD of the protein hnRNPA1, and comparison with experimental results. The hnRNPA1 PLD was used to validate the accuracy of the model, by extracting critical temperatures from experimental phase diagrams of a large set of these variants [35]. To assess how well the Mpipi model performed for PLDs, we compared critical temperatures obtained from simulations with the corresponding ones extracted from the experimental data for those variants that were available (depicted in the top right of each binodal, where r is the Pearson coefficient and D is the root mean square deviation between simulated and experimental values). For the variants where the experimental critical temperature could not be determined, we make a comparison between the critical temperature with the saturation concentration, since for these systems we expect them to be inversely correlated. The naming system for the variants is discussed in detail in the text.

When comparing with experimental critical temperatures, our model achieves a Pearson correlation coefficient higher than 0.93 for all types of mutations analyzed. Additionally, for all cases the root-mean-square deviation (D) between the experimental and simulated critical temperatures is ≤ 14 Kelvin. Our extended validation set confirms that the Mpipi potential can effectively capture the thermodynamic behavior of a wide range of hnRNPA1-PLD variants, and suggests that Mpipi is adequate for proteins with similar sequence compositions, as in the set of proteins analyzed in this study.

2.1.2 Large-Scale Data Generation on Critical Parameters of PLD Phase Separation

The goal of generating an unprecedentedly large set of binodal data for protein phase separation presented several unique computational challenges, from the management of considerable computational power to the application of efficient simulation techniques and the implementation of large-scale data analysis.

Due to the amount of computer resources required for the study, totaling over a quarter million CPU hours for the entire project, our first challenge was optimizing the efficiency of our Direct Coexistence MD simulations. In the Direct Coexistence method, the diluted phase and the condensate are simulated in the same rectangular box separated by an interface. The phase diagrams we compute explore the plane of temperature (vertical axis) versus density (horizontal axis). For each phase diagram, we run a set of about 20 Direct Coexistence simulations–each simulation explores the phase behavior of the system at different constant temperatures. For a given value of the temperature, if two phases are detected in our simulations, we measure a “density profile”: the density of the proteins as a function of the long side of the simulation box (Bottom panel in Fig. 1 C). We then plot the two coexistence densities for each temperature we simulate and build the coexistence curve (also known as binodal), Fig. 3. The coexistence curve shows the values of temperature and protein density for which phase separation will occur. The region above the coexistence curve is the “one-phase region” where the temperature is too high for the enthalpic gain coming from the associative protein–protein interactions to overcome the entropic loss upon phase separation. The region below the coexistence curve, the “two-phase coexistence region” represents lower temperatures for which the enthalpic gain due to protein–protein interactions is now sufficiently high to favor phase separation into a condensed (protein-enriched) and a diluted (protein-depleted) liquid phase. The maximum in the coexistence curve is known as the critical point: the upper critical solution temperature (herein, critical solution temperature) and corresponding critical density. For temperatures higher than the critical solution temperature, phase separation is no longer observed.

Complete set of 140 binodals, in the temperature versus density phase space, generated via largscale molecular dynamics simulations.
Binodals are grouped according to the mutation type (Aromatic, Arg/Lys, Gln/Asn (Polar), Ala/Ser and Thr/Gly/Ser (Neutral)). Each set is further grouped based on the PLD in question.

The protein solutions that we investigate with our coarse-grained simulations must be sufficiently large, in terms of the number of copies of the same protein, to minimize finite size effects—for instance, to adequately sample bulk phenomena without dominating interfacial effects. At the same time, our simulations must be small enough to ensure computational convergence in a feasible timescale. Following extensive testing of computational convergence for systems of different sizes and MD trajectories of different time lengths, we determined that a system composed of 64 protein replicas, with trajectories of 200 ns ensured rigorous convergence and reproducible results, while taking into careful consideration the trade-off between accuracy and computational demands. The size of the simulation boxes were also optimized, again to ensure finite size effects did not affect the predictions (See Methods and Materials, and Supporting Information).

To obtain 200 ns of MD trajectories in each computer run, with systems of up to 20,000 atoms, every data point required approximately 12 hours of wall time in 16 CPU nodes. Consequently, each binodal in Fig. 3 required 2,000 CPU hours for computation, and for all 140 protein variants considered, the cumulative computational requirement exceeded a staggering quarter of a million CPU hours. To efficiently manage the workload, since different temperatures of different variants are fully independent runs with no crossover communication between processors, the simulation runs were meticulously scheduled and the resources were dynamically allocated to maximize usage and reduce idle time and resource wastage. More details on technical computational requirements can be found in the Methods section.

Analyzing the data required a complex, multi-step procedure as well. The first step is thorough data cleaning, as well as ensuring that every equilibrium configuration generated via MD is uncorrelated from one another, followed by statistical analysis to interpret the results accurately, and with statistical rigor. All custom scripts developed for this study have been made publicly accessible, to facilitate replication, extension, or application of our methodologies, and are available in our GitHub repository.

Our research marks a significant milestone in the field of computational studies of protein phase behavior, primarily due to our integrated and automated approach focused on streamlining every step of our data generation and processing pipeline—from simulation and storage to analysis and interpretation. Our pipeline was designed to minimize manual intervention, reduce the potential for errors, and maximize the throughput of data processing. As a result, we generated a dataset of critical parameters of unparalleled scale, of over an order of magnitude higher than previously reported studies [9, 35, 69–72]. This extensive dataset (Fig. 3) not only offers new insights into phase separation phenomena but also sets a new benchmark for data-driven research in this field.

2.2 A Data-Driven Scaling Law Approach for Studying Phase Behavior of Prion-like Domains

Scaling laws can reveal key relationships between system variables, such as composition and critical parameters, providing a useful theory to predict and interpret the regulation of complex biological phenomena such as phase separation. In the context of PLD phase behavior, scaling laws would provide a quantitative theory to analyze how alterations in amino acid composition affect the critical solution parameters—a crucial aspect for understanding protein behavior in biological systems. Here, our objective is to discern which scaling laws can effectively capture the variations of PLD critical parameters as a function of sequence mutations. To accomplish this, we employ a systematic approach of varying the amino acid sequence of the proteins considered, to determine how the phase behavior of PLDs changes with amino acid compositional perturbations.

For each type of mutation we analyze, we define and test three types of scaling laws. These scaling laws illustrate a different relationship between critical solution temperature changes and the number of amino acid mutations. The first and simplest scaling law is a linear relationship between these two quantities,

where ΔT_c is the variation on the critical temperature between the mutated protein and the corresponding unmodified reference sequence, X_i is an amino acid (e.g., Y, F, R, and so forth) and is the number of X₁ residues that have been mutated to X₂. Thus, in this case quantifies the variation in the critical solution temperature of the protein solution for a mutation of one X₁ residue to X₂. If the data is well represented by this first scaling law, it would imply that the effects of amino acid mutations on PLD phase behavior depend exclusively on the total number of mutations, but are not affected by other PLD parameters, such as the protein length. Thus, the first scaling law (Eq. 1) is the simplest possible model we test here for how incremental mutations may collectively influence the phase behavior of PLDs.

Next, we define a second type of scaling law, which hypothesizes that the changes to the critical solution temperature upon mutation is proportional to the fraction of residues that are mutated, rather than their raw number. That is,

where, L represents the number of total amino acids in the PLD, or protein length, and is the fraction of X₁ residues that have been mutated to X₂. Thus, this second scaling law hypothesizes that the effect of mutations on the critical solution temperature is more pronounced in shorter proteins (where mutations constitute a larger fraction of the protein) and less so in longer proteins (where mutations represent a smaller fraction of the total number of amino acids). The expectation is that because longer PLDs have higher valencies than shorter ones, the density of inter-molecular connections they form inside condensates is more robust towards perturbations than that of shorter PLDs [73].

Finally, we define a third scaling law,

where the change in critical solution temperature is hypothesized to be directly proportional to the number of mutations but inversely linear to Like our second scaling law, this third relationship also presumes that longer PLDs form condensates with critical parameters that are less affected by mutations than shorter PLDs. However, by considering , instead of L as in Eq. 2, the expectation is that the buffering effects of protein length are more modest than those stated by Eq. 2. Notably, the amino acid composition of PLDs can be understood by the stickers-and-spacers framework [74, 75]. The stickers are amino acids like Tyr or Arg, which confer stability to the liquid network via the formation of associative interactions. Spacers are the residues interspersed among the stickers. Eq. 3 suggests that as a protein lengthens, the number of redundant residues that can establish sufficiently strong inter-protein interactions, i.e. number of stickers, grows too; yet, it grows much more conservatively than L because a fraction of L is made of spacers.

With these scaling laws as ansatz, we set to uncover trends in the dataset we built from our MD simulations of 140 variants of different PLDs. Adopting these laws as foundational hypotheses, we explore the relationship between alterations in composition and changes in critical solution temperatures, offering insights into the molecular drivers of PLD phase separation.

2.2.1 Aromatic Mutations Exhibit a Linear Correlation Between Mutation Fraction and Critical Solution Temperatures

Aromatic amino acids, namely tyrosine (Tyr, Y), phenylalanine (Phe, F), and tryptophan (Trp, W), are widely regarded as essential contributors to phase separation of PLDs, due to their ability to form π–π and cation–π contacts [9, 32, 33, 35, 37, 38, 76–78]. Among aromatics, experiments and simulations agree that Tyr is a stronger contributor to the stability of condensates than Phe [8, 33, 35, 63], and that both the composition and the patterning of aromatic residues in PLDs have an impact on the condensate critical parameters and its material state [18, 70, 79, 80].

Aromatic residues are highly abundant in PLDs, with the fraction of aromatic residues in the PLD variants analyzed in this study ranging from 0.19 to 0.43 (Fig. 1B). Furthermore, because different aromatics are unequal contributors to PLD phase behavior, we analyzed the impact of mutations for Tyr and Phe and Trp separately.

To quantify the impact of mutating Tyr to Phe in shifting the critical solution temperatures of our set of PLDs—i.e., determining which of our proposed scaling laws best describes the data—we designed 17 protein mutants that varied the Tyr to Phe ratio in the family of PLDs and computed their phase diagrams (Fig. 4A). Our simulations reveal that Tyr is stronger than Phe as a driving force for phase separation for the entire PLD family considered, a finding that replicates the results of Bremer et al. [35] for hnRNPA1 and confirms that it extends to other PLD proteins. By analyzing the change of critical solution temperature with respect to the unmodified PLD (or wild type, WT) against the number of mutations, our results reveal a linear relationship between the number of Tyr to Phe mutations and the change in the critical temperature of the system; our first scaling law, S_Y→F, Eq. 4. Here, ΔT_c represents the temperature difference between the mutated protein and its wild-type sequence, and N_Y→F represents the number of Tyr residues that were mutated to Phe in the sequence.

Mutations in aromatic amino acids have strong effects on the critical solution temperature of PLDs.
A. Tyr to Phe mutations. The number of Tyr mutated to Phe (x-axis) versus the change in critical temperature, (y-axis), computed as the critical temperature of the variant minus that of the corresponding PLD wild-type sequence. The trendline defines the simplest, dominating scaling law for this mutation. The R² value of 0.88 shows the agreement of our data to the scaling law defined by Equation 4. B. Tyr or Phe to Trp mutations. In this case, the overall trendline, defined by equation 5 has an R² of 0.99.C. Analysis of the variants involving mutations of aromatic residues to uncharged, non-aromatic amino acids, a.k.a. aromatic deletions. In the y-axis, the change in the critical temperature of the variant to that of the wild type sequence, and in the x-axis, a renormalized measure of mutations: the fraction of aromatic residues mutated times the, to account for the competing physical interactions between PLDs. Error bars indicate the standard error associated with replicas of the simulation.

Note that, by definition S_Y→F = −S_F→Y, since N_Y→F is defined as positive for Tyr to Phe mutations and negative for Phe to Tyr mutations, or, in other words, N_Y→F = −N_F→Y. For our data set, we find S_Y→F to be − 0.40 ± 0.04, where the error given is the standard error. The way to interpret this result is that a single mutation of a Tyr residue to a Phe residue in a given PLD protein would result in a −0.4 K variation in the critical solution temperature for phase separation. Such a modest variation due to a single-point mutation is smaller than the typical error associated with estimating the absolute value of the critical solution temperature itself using residue-resolution coarse-grained Direct Coexistence simulations (approximately 3 K). Unambiguously measuring the value of the critical solution temperature with sub-Kelvin accuracy poses a considerable challenge also in experiments [81, 82]. However, we can uncover statistically significant trends by analyzing the behavior of the critical solution temperature in a large set of PLDs with diverse numbers of mutations rather than a single case. In particular, our analysis reveals that the impact of Tyr to Phe mutations is modest and the effect of multiple mutations can be described adequately as cumulative. In addition to these analyses, the role of Tyr/Phe to Trp mutations, albeit Trp being a rare residue in prion domains, was investigated in TIA1 due to the presence of Trp in the wild type of this PLD, Fig. 4B. The simplest scaling law that describes this mutation (Equation 5) is fundamentally equal to the law describing Tyr to Phe mutations. However, a notable distinction is that the S_F/Y→W scaling constant for our data is approximately 4.3 ± 0.1, which is one order of magnitude greater than S_Y→F, consistent with the much stronger π–π contacts established by Trp. It is important to note that, in this case, similarly to Try to Phe mutations, multiple Ansatzs align closely with the observed data, which underscores the fact that, in the context of these mutations, the size of the side chains does not significantly alter phase behavior, as the predominance of aromatic perturbations exerts a dominating effect.

Across the whole family of PLDs analyzed, our data reveals that changes in aromatic amino acid identity impact the critical solution temperatures modestly (< 1 K per mutations involving F and Y) to strongly (4 K for mutations involving Trp) and independently of the protein length. Such a result is consistent with the mutations in aromatic identity preserving the total number of associative interactions that the PLDs can establish, and hence, the density of molecular connections in the condensed liquid network. That is, aromatic mutations are expected to influence the energetics and lifetimes of the multivalent interactions among PLDs, but not the number of intermolecular contacts they form.

2.2.2 Strong Critical Temperature Reduction Due To Aromatic Deletions is Buffered By Protein Length

We next examine the role of replacing aromatic residues (Tyr and Phe) with uncharged, non-aromatic residues—such as Gly, Ser, Gln, Ala, and Asn—which are prevalent in PLDs. The mutations were designed to preserve amino acid sequence compositional biases by replacing Tyr and Phe with uncharged, non-aromatic residues in the same proportions as they appear in the wild type of each sequence, i.e., maintaining the relative amounts of each type of residue found in the sequence. All amino acid sequences are available in the Supporting Information.

Our simulations show that aromatic deletions strongly reduce the critical solution temperature of the PLD solutions, Fig. 4C. The best scaling law from our ansätz to describe the data is Eq. 3, which involves the square root of the PLD length. That is, S_Y/F→X, where X represents any uncharged, non-aromatic residue.

Here, once again, N_Y/F→X is the number of mutations, and L is the length of the PLD. For our data set, the value of the scaling constant S_Y/F→X is −56 ± 5. To interpret this result, let us consider a model PLD of two hundred amino acids (L = 200). If we mutate one Tyr to an uncharged, non-aromatic amino L acid, the change in the critical temperature—computed via the formula is approximately −4 K. Such a value is consistent with the dominance of π–π and cation–π contacts in PLD phase separation reported experimentally [9, 32, 33, 35, 37, 38, 76–78]. As stated above, aromatic residues behave as stickers within PLDs condensates. When these aromatic residues are replaced with uncharged non-aromatic amino acids the density of connections network of molecular connections in the liquid is strongly perturbed, as we have shown earlier [21]. To observe this in more detail, we computed select contact maps (see Fig. 11) – notably, mutations that removed aromatic amino acids strongly altered the primary sites of interactions.

Our results reveal that the length of the PLDs strongly influences how the critical parameters of PLD solutions change in response to aromatic deletions. The effect of mutations is stronger in shorter proteins because each aromatic deletion represents a higher percentage of the total number of stickers in the PLD, than in longer proteins. Longer proteins have higher valencies and redundant stickers that facilitate a network rearrangement.

Our results highlight the delicate balance between amino acid composition and protein length in determining the phase behavior of PLDs, providing a deeper understanding of the molecular basis of protein phase transitions.

2.2.3 Arginine Mutations Drastically Alter Condensate Stability

The distribution and balance of charged amino acids–i.e., aspartic acid (Asp, D), glutamic acid (Glu, E), histidine (His, H), arginine (Arg, R), and lysine (Lys, K)—within an IDR affects its solubility, conformational ensemble, and the range of stability of the ensuing condensates it might form. PLDs are not particularly rich in charged residues: in the wild type PLDs of our data set, the fraction of charged residue ranges from 0.02 to 0.1 (Fig. 1B), and the fraction of positively charged amino acids is narrowly distributed around 0.015, except for hnRNPA1, which has a value of 0.09 (Fig. 1B). However, despite their low prevalence in PLDs, charged residues, in particular Arg, have been shown to be important contributors to the stability of a wide-range of biomolecular condensates [32, 35, 83–85].

While both Arg and Lys are positively charged amino acids at physiological pH, it has been demonstrated that they have unequal roles in the stability of biomolecular condensates [32, 35, 63, 84, 86]. Both residues can form charge–charge and cation–π associative interactions, but the case of Arg is unique: the sp²-hybridized planar guanidinium group of Arg confers it a special capacity to form strong hybrid π–π/cation–π contacts with the π-electrons of aromatic rings [33]. Thus, to quantify the impact of Arg in the range of stability of PLD condensates, we designed mutants that replaced Arg with Lys and vice-versa (Fig. 5).

Critical effects of Arg mutations in the phase transition temperature of the condensates
**(A)** Analysis of the critical solution temperature of PLDs with Arg to Lys mutations. The number of Arg to Lys mutations divided by PLD length (x-axis) versus the change in the critical solution temperature, that is, T_c(variant) −T_c(WT) (y-axis). The trendline, representing Equation 7, fits the data with an R² = 0.94, and defines the stability measure of this perturbation. **(B)** Analysis of the critical solution temperature of PLDs with Arg mutations to uncharged, non-aromatic amino acids that maintain the WT compositional percentages (mainly Gly, Ser, and Ala. For more details see Supporting Information). Arg deletion fraction (x-axis) versus T_c(variant) −T_c(WT) (y-axis). The trendline, which fits the data with R² = 0.96, defines the stability measure, represented in Equation 8.

The best fitting scaling law associated with mutating Arg to Lys is inversely dependent on the fraction of mutations, rather than the number of residues mutated (Eq. 7). The resulting scaling constant for this particular mutation, i.e., S_R→K, has a value of −1300 ± 100, indicating that for a typical PLD of L = 200, the mutation of a single Arg into Lys results in a reduction of the critical temperature of the system by 6.5 K. Hence, replacing Arg with Lys is a destabilizing mutation of PLD condensates in agreement with the wide-body of experimental evidence [32, 35, 63, 84, 86].

The length dependency of this particular scaling law suggests that the impact of Arg to Lys mutations on the phase behavior of PLDs primarily depends on the relative concentration of these mutations (mutation fraction). Additionally, it proposes that longer sequences may exhibit increased resilience against Arg to Lys mutations. This resilience is likely attributed to the greater separation of charge within the longer sequences. To further analyze the effect of the charge of the PLDs in this set of mutants, in Fig. 5 we include a scalebar showing the net charge per residue (NCPR). Mutants that present higher NCPR present the biggest decrease in their phase-separating propensities, which is consistent with the expectation that condensates with a neutral overall charge are more energetically stable [35, 87, 88].

When we replace Arg with uncharged, non-aromatic residues, the resulting scaling law also depends inversely on the fraction of mutations, with a value of −640 ±50, amounting to a reduction of 3.2 K in T_c per Arg deletion for a model PLD of length L = 200.

Therefore, our results reveal that it is almost twice more destabilizing to mutate Arg to Lys than to replace Arg with any uncharged, non-aromatic amino acid, namely, Gly, Ser or Ala. Mutating Arg to an uncharged non-aromatic amino acid is less destabilizing than mutating it to Lys likely because the chargereduced variants decrease the cation–cation repulsion of the PLDs, with respect to the Arg-rich or Lys-rich systems. Furthermore, mutating Arg to Lys decreases the enthalpic gain for condensate formation not only via the replacement of stronger Arg-based associative interactions for weaker Lys-based ones but also by increasing the overall solubility of the Lys-rich PLD [38, 89], given that the hydration free energy of Arg is less favorable than that of Lys [33, 38, 83].

2.2.4 Polar Amino Acids Subtly Modulate PLD Phase behavior

Uncharged, non-aromatic amino acids, including Gly, Ala, proline (P, Pro), Ser, and Thr, are commonly found in PLDs. Because they generally form weaker residue–residue interactions and present smaller excluded volumes than aromatics, they are classified as spacers that contribute to maintaining the solubility of proteins and material states of condensates [70, 74]. These amino acids can also have other functions in proteins, and the role they play as spacers depends on their chemical make-up, but also on the specific context in which they are found [35, 39, 90].

Here, we investigate the role of glutamine (Gln, Q) and asparagine (Asp, N). Both Gln and Asn are polar amino acids, hence their charge distribution is more asymmetric, presenting a significant dipole moment. Consequently, they can establish transient ion–dipole, dipole–dipole and π–dipole interactions with other residues. While the Mpipi model does not account explicitly for polarity, it is worth noting that it was parametrized based on bioinformatics data, which implicitly incorporates this aspect. Thus, our study allows for the observation of emergent behaviors that stem from this inherent characteristic.

We designed mutations that vary the relative amounts of Gln and Asn, and examined trends in their phase-separating propensity. Increasing the number of Gln per unit length by mutating Asn to Gln increases the ability of the system to phase separate, likely due to Gln having a higher dipole moment than Asn, which is approximated in our model via marginally stronger non-ionic interactions formed by Gln over Asn (Fig. 6 A).

Polar and neutral amino acid mutations show subtle modulation of the critical solution temperatures of PLDsA)
Analysis of the critical solution temperature of PLDs with Asn to Gln mutations. The number of Gln to Asn mutations renormalized by dividing by PLD length (x-axis) versus the temperature change T_c(variant) −T_c(WT) (y-axis). The trendline, which fits the data with an R² = 0.84, fits the scaling law Equation 9. **(B)** Analysis of the critical solution temperature of PLDs with Gly/Ser to Thr mutations (x-axis) versus the temperature change T_c(variant) −T_c(WT) (y-axis). The trendlines, representing Equations 10 and 11 fit the data with an R² = 0.7. **(C)** The number of Ala to Ser mutations (x-axis) versus the temperature change T_c(variant) −T_c(WT). The trendline, Equation 12, fits the data with an R² = 0.77

The value of the scaling constant S_N→Q is 50 ± 5, which is significantly lower than that of mutations of residues with a net charge or aromatic deletions (rounding 1300, for example, for arginine to lysine mutations). This result indicates that mutations of Asn to Gln or vice versa have a smaller impact in the critical parameters than aromatic deletions or mutations of charged residues.

Finally, we also explored the role of Ser, Gly, Thr, and Ala on the critical parameters of the condensate. While Ser and Thr are also polar amino acids, their dipole moment is significantly smaller than that of Gln and Asp. Consistently, the scaling constants involving mutations Ser and Thr mutations—i.e., S_S→T — are considerably smaller (Fig. 6 B and C) overall. A comparable pattern is noted for S_G→T, with G expected to exhibit significantly lower polarity compared to Gln and Asp.

Regarding Ala, we also tested its relative role with respect to Ser.

Amino acids with larger side chains, such as Thr, have a larger excluded volume than those with smaller side chains, such as Gly or Ser. The corresponding value of the scaling constants S_S→T and S_G→T suggests that in the case of Thr mutations, mutations of neutral, non-aromatic amino acids to those with larger excluded volumes inhibit phase separation of the PLD by lowering the corresponding critical temperature, while mutations to those amino acids with smaller excluded volumes have an opposite effect. This is likely due to increased flexibility of the protein backbone, allowing the chain to adopt multiple conformations that promote multivalent contacts. Importantly, these trends cannot be anticipated by inspecting the parameters of the Mpipi model; instead, they emerge from examining the behavior of the collective system.

On the other hand, for mutations involving Ser, Gly, and Ala–all small amino acids with minor side chain size variations–the impact on phase separation seems to stem predominantly from changes in solvation. Solvation plays a key role in phase separation since it can help balance the driving forces of this process [91, 92]. While hydrophobic regions of the protein, with low solvation free energies, tend to form aggregates to reduce interactions with the solvent environment; hydrophilic regions promote interactions with water. Whether these regions inhibit or promote phase separation is likely due to a balance of opposing effects. On the one hand, hydrophilic regions can form energetically favorable contacts with solvent, promoting dispersion and suppressing phase separation. However, when these regions are recruited to condensates the release of water molecules can help to off-set the entropic cost of sequestering hydrophobic regions within a protein droplet, thereby facilitating phase separation.

Specifically, mutations from Gly to Ser (not shown) increase the size of the side chain while also enhancing solvation, while mutations from Ala to Ser do not significantly change the size of the side chain but significantly increase the protein solvation. In the first case, the critical parameters of phase separation remain virtually unchanged, only presenting a very minor decrease. In the case of Ala to Ser mutations, phase separation is slightly enhanced. Both scaling laws S_G→S and S_A→S reflect such an effect, and underlie the context-dependent nature of these mutations.

Overall, mutations of unchanged, non-aromatic amino acids seem to have an impact on phase separation due to two competing effects: a change in side chain size and a change in solvation. However, it is important to note that the magnitude of the scaling laws that govern these mutations is minimal, and thus, likely extremely sensitive to small fluctuations such as changes in the system and solution conditions. Notably, these findings are consistent with experiments that report small changes in phase separation propensity for mutations involving neutral, non-aromatic residues [8, 35].

2.2.5 Beyond composition: patterning considerations

Beyond composition, the relative positioning of amino acids in the sequence, i.e. sequence pattern, also plays a role in protein folding, function and, in this case, phase behavior [9, 93, 94]. Due to the sequence heterogeneity of PLDs, and, more generally, IDPs, it is particularly complex to define adequate patterning measures [88, 95–97]. In this work, we have taken advantage of our large data set to analyze the impact of amino acid sequence patterning by adopting a patterning order parameter [9, 87, 98], σ_aro, that has been designed to describe patterning of aromatic residues in such IDPs.

To compute σ_aro, we divide the amino acid sequence into peptide segments of equal length, compute a measure of sequence aromatic asymmetry for each segment, and finally sum the values of such measure over all the segments. Consistent with literature [9, 87, 98], we take l to be 5 and 6. σ_aro is then given by:

where L is the total number of amino acids in the protein domain, N_stickers the number of sticker residues—i.e. charged or aromatic residues—and σ_i has the functional form

Here, N_aro,i is the number of aromatic residues in the i-th segment, l_i the segment length, and σ_T is a function of the total number of aromatic residues in the sequence N_aro:

In turn, σ_max,i is the measure of aromatic amino acid distribution for an artificial sequence with all the aromatic amino acids concentrated in part of the chain.

Notably, our results, presented in Fig. 7, indicate that the sequences with the most dispersed aromatic residues (i.e., low values of σ_aro) phase separate more readily. Dispersed residues result in greater effective valency of a PLD, leading to a higher probability of forming a percolated network structure to sustain the condensate. Additionally, this result demonstrates that the patterning order parameter, σ_aro, can be used to predict the phase-separating propensity of a PLDs as the function of aromatic residue distribution. This result is consistent with previous studies that demonstrate the importance of aromatic residue patterning on the phase behaviour of PLDs [9, 99].

Prion-like domains featuring sequences with dispersed aromatic residues show greater propensities for phase separation.
A Representation of the aromatic amino acids present in the wild type sequences of each PLD considered B Location of the positively charged (red) and negatively charged (blue) amino acids present in the wild type sequences of each PLD considered C Analysis of the critical solution temperature versus the σ order parameter, defined in Equation 13. Those groups of variants with the same 13 value show their average critical temperature highlighted, and softer values for other data points. For lower values of σ, the critical temperature present in the condensates is higher, and vice versa, indicating that those variants with a more homogeneous distribution of aromatic amino acids have condensates more stable at higher temperatures.

Convergence tests for the WT sequence of the PLD of FUS. The density profiles of different and independent simulations are plotted across the perpendicular axis to the condensate interfaces. Each simulation was performed for different timescales (as specified in the legend) to check for proper convergence.

3 Discussion

Computer simulations have advanced our ability to map emergent properties of protein solutions to the chemical make-up of their individual biomolecules. For instance, modern residue-resolution protein coarsegrained models [34, 76, 100–104] can now be used to quantify the modulation of protein phase behavior as a result of amino acid sequence mutations of their native proteins with near-quantitative agreement with experiments [34, 102, 103]. In this study, we exploit the accuracy and efficiency of our protein model, Mpipi [34], to compute and analyze the temperature-vs-density phase diagrams of an unprecedentedly large set of PLD proteins—140 variants of the proteins hnRNPA1, TDP43, FUS, EWSR1, RBM14, and TIA1. Collectively, our set of phase diagrams reveal conserved scaling laws that quantitatively predict the change in the critical solution temperatures of single-component PLD solutions as a function of the number and type of amino acid mutations of the PLDs.

To facilitate the interpretation of the scaling laws derived in our study, we have organized the data in a table that illustrates the effect of single point mutations on the critical solution temperatures of the family of PLDs, as a function of the PLD length.

Consistent with experimental evidence and computational predictions, interactions involving aromatic residues—namely, π–π and cation–π—are dominant in stabilizing condensates formed by PLDs [8, 34, 35]. Notably, our simulations reveal the largest decrease in the critical solution temperature of a PLD condensate occurs when an aromatic residue is replaced by an uncharged, non-aromatic amino acid. Our scaling law for such aromatic deletions (56 K) predicts that the destabilizing effect of replacing an aromatic with a non-aromatic residue is linearly amplified by the number of such mutations and screened by the square root of the length of the protein (), indicating a disruption in the percolated network of interactions sustaining the condensate. Condensates made of longer PLDs are more robust against aromatic deletions because, by construction, each aromatic residue represents a smaller percentage of the total valency of the PLD in question. In other words, longer PLDs are higher valency molecules than their shorter counterparts and can, thus, form a densely connected percolating condensed-liquid network, even when a few of their aromatic stickers have been deleted.

Our findings also highlight the crucial role of Arg in the critical parameters of PLD condensates. Specifically, our scaling law for Arg deletions (640 K/L) predicts that the destabilizing effect of an Arg mutation to an uncharged, non-aromatic residue is inversely proportional to the length of the protein (L), as well as linear with respect to the number of mutations. More specifically, it indicates that, for a PLD of the length of 100 amino acids, the change in the critical temperature will be of the order of 6.4 K. While formulaically similar to the scaling law for aromatic deletions, a notable difference is a change in the dependence on the PLD length. In the case of Arg deletions, the effect of mutations is screened by L rather than , indicating a stronger screening dependence on the PLD length. This indicates that for long PLDs, the destabilizing effects due to Arg deletions are diminished when compared to aromatic deletions, while in shorter PLDs this trend is reversed with each Arg deletion exerting a more significant destabilizing effect than the aromatic deletions.

These results underscore once again how for PLDs cation–π and π–π interactions are the main drivers of phase separation [9, 32, 33, 35, 37, 38, 76–78]. While it may seem counterintuitive for Arg deletions to have a lesser effect in the critical parameters of longer PLDs than their aromatic counterparts (as Arg residues are known to form strong cation–π interactions [105, 106]) the explanation lies in the type of systems in question, which are particularly rich in aromatic residues. Longer PLDs have, as a result of their higher valency, a larger network of possible pairwise interactions, which can mitigate the destabilizing effect of Arg deletions. In the case of shorter PLDs, each Arg residue represents a larger proportion of the total valency, and thus its deletion has a proportionally larger impact on the critical parameters. In contrast, aromatic residues contribute to the stability of the condensate through both cation–π and π–π interactions, which are predicted to be weaker than their cation–π analogues [33, 34], but more numerous in PLDs. In a longer PLD, the deletion of an aromatic residue results in the loss of more potential interactions, creating a larger destabilizing effect than the deletion of an Arg residue. However, in shorter PLDs, due to their lower total valency, the loss of potential associative interactions that result from the deletion of an aromatic residue is lower. Hence, the destabilizing effect resulting from this mutation is comparatively less than for the Arg deletion.

Our scaling laws have also allowed us to quantify the destabilizing effect of Arg to Lys mutations in the phase separation of PLDs. Concretely, mutations of Arg residues to Lys result in an overall decrease in the critical temperature of phase separation, two-fold in comparison to the impact of mutating Arg into an alternate uncharged non-aromatic residue. Consequently, the scaling law considerably intensifies for mutations from Arg to Lys, nearly doubling the effects of each mutation in the critical temperature of the condensate (See Table 1). This implies that each mutation of Arg to Lys exerts a substantially more destabilizing effect than mutations of Arg to other uncharged, non-aromatic residues, consistent with experimental evidence [32, 35, 84, 107] and highlighting the significant, modulatory role Lys plays in the phase behavior of PLDs.

Predicted effect of single point mutations on the critical temperature, in Kelvin, of PLDs, as a function of their length, L

Uncharged, non-aromatic residues such as Ser, Gly, Asn, Gln, and Thr are found to play a marginal role in PLD condensate stability [35, 39, 70, 74]. Notably, PLD condensates are highly stable in the face of Gly, Ser, Thr and Ala mutations. This behavior is quantified by their respective scaling laws, with values for the scaling constants for mutations from Gly to Thr, Ser to Thr and Ala to Ser at −0.09 K, −0.08 K and 0.16 K respectively. These numerical values indicate that, for a change of one degree in the critical temperature, approximately 10 amino acids of this set would need to undergo mutation. Notably, these scaling laws do not display a significant length dependency. This observation aligns with the idea that these mutations primarily affect the condensate properties through modulation in excluded volume and solvation, rather than by interrupting specific interaction networks.

In contrast, the condensates display heightened sensitivity towards Asn and Gln mutations, underscored by the Asn to Gln scaling of 50 K/L. This law is also inversely length dependent, indicating that mutations on these amino acids affect interaction networks and that mutations Asn to Gln result in a heightened ability of the system to phase separate. We hypothesize that, in addition to differences in solvation volume, the polarity of each amino acid plays a role by forming charge–dipole and dipole–dipole interactions. Both Gln and Asn are polar amino acids, with Gln having a longer side chain than Asn, which could potentially interact with a larger number of nearby atoms. Furthermore, the additional methylene group (-CH2-) may slightly alter the electron distribution around the amide group, which can lead to the formation of more stable dipole–dipole contacts. It is important to note that our computational model, Mpipi, is a charge-fixed model, and hence does not directly capture the effects of polarization. However, the lack of physical details in the Mpipi model is compensated by the parameters having been derived from bioinformatics data.

Collectively, the definition of a set of scaling laws acts as a unifying framework for the study of the critical parameters of PLD phase separation, by allowing for direct comparison of individual sequence effects. Such laws are cumulative, meaning that the effects of successive mutations can be predicted by the addition of the appropriate laws, as long as the resulting mutation keeps within PLD composition and patterning biases.

Our work explores a new framework to describe, quantitatively, the phase behavior of a large set of protein variants via the identification of scaling laws. Our scaling laws can predict changes in the stability of condensates based on amino acid sequence mutations. This code not only allows for testing the effects of composition bias on PLD phase separation but also serves as a predictive tool, allowing for a better selection of mutations to achieve desired changes in different condensates. This study is focused on PLDs, but we hypothesize similar scaling behavior in other IDPs, with the laws having a dependence on the protein length that reflects the nature of the underlying interaction networks of the system. Our findings have important implications for understanding PLD condensation and for designing modifications to counter their aberrant aggregation, by setting the basis for characterizing and quantifying the molecular determinants of PLD phase behavior.

4 Methods and Materials

4.1 Design of the dataset

To design sequence variants to test the phase behavior of prion-like domains, we use in-silico targeted mutagenesis to introduce specific changes to the amino acid sequence of the PLD of interest. These changes involve substituting different amino acids, while carefully maintaining patterning and compositional biases, to test the role of given amino acids in the phase-separating propensities of the respective PLDs.

The nomenclature employed is consistent with that of Bremer and colleagues [35]. Namely, if n amino acids of type X are removed from the sequence and replaced with n amino acids of type Y, the variant is labeled −nX + nY. If instead, n amino acids of type X are removed from the sequence and replaced with amino acids (often different types) in proportion to the compositional bias of the PLD, the variant is simply labeled −nX.

To begin, we select a family of PLDs composed of hNRNPA1, RBM14, FUS, TDP43, TIA1 and EWSR1 and perform mutations on each sequence to generate a set of variants. Accordingly, we assess whether the effects of sequence mutations in PLD phase behaviors hold for the whole family of PLDs. Furthermore, our data set allows us to develop a set of rules that can aid in understanding and predicting PLD phase behavior. The amino acid sequences of the PLDs, and their corresponding variants, investigated herein are provided in the Supporting Information.

In total, we compute phase diagrams of 140 PLD variants via molecular dynamics (MD). This task required a staggering amount of CPU hours, over a quarter million in total. Given these demands, efficient utilization of high-performance computing (HPC) resources and proper parallelization of the computational tasks are paramount to ensuring optimal performance. To this end, we compute MD trajectories using the LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) software [108], exploiting its efficient parallelization algorithms. Another significant computational challenge involves the large size of data generated. Each single run produced a significant amount of trajectory, backup, and analysis data, and a large number of files. On average, each run produced over 50 files and a minimum of 5 GB of data. To efficiently handle this, we implement a robust, tiered, data storage and retrieval system, ensuring that the vast amounts of data is organized through a combination of indexing and partitioning, enhancing search efficiency and accessibility. We leveraged the Lustre filesystem for its high performance and scalability, making it the ideal choice for managing and accessing large volumes of active data required by our computational jobs. Custom scripts are employed for automated data processing, enabling us to extract meaningful insights from the complex datasets. All scripts are available in our GitHub repository.

4.2 Critical Parameter Determination Through MD

We used MD simulations to determine the full binodals (i.e., the boundaries that delimit the two-phase region from the one phase region) for the PLDs. Specifically, we perform MD simulations in the canonical ensemble – i.e., fixed volume (V), temperature (T), and number of particles (N). For each binodal, a series of independent NV T simulations are performed at different temperatures and the density of each of the coexisting phases are determined. Thus, this approach yields binodals in the temperature–density phase space.

4.2.1 Direct Coexistence Simulations

Specifically, we use Direct Coexistence simulations, which enables the simultaneous simulation of both high- and low-density phases of a mixture, separated by an interface, within a single simulation box. Additionally, these simulations are performed with periodic boundary conditions in each dimension.

To initiate the NV T Direct Coexistence simulations, we start with one copy of a PLD in a cubic simulation cell and replicate it to the target number of chains (64 in total). We then iteratively scaled the simulation box to compress the chains into a high density direct coexistence simulation (approx. 0.8–1 g/cm³) [101]. Here, we use a Langevin thermostat [109] (at the target temperature, with a damping time of 40 ps) for temperature control. Keeping the length of the box in the X and Y dimensions fixed, we then elongate the Z dimension of the box. We then simulate each system at the target temperature for over 200 ns (see Convergence test in the SI). From each simulation we compute the density profile in the Z direction and determine the density of the dilute and condense phase. These data are then used to produce the binodals.

4.2.2 Mpipi model

All the simulations are run using the Mpipi coarse-grained force-field [34] with an implicit solvent. In the Mpipi model electrostatic, cation–π and π–π interactions are carefully balanced to capture the phase behavior of PLDs. Notably, Mpipi achieves quantitative agreement with experiments in terms of prediction of critical solution temperatures of such systems. In Mpipi, each amino acid is represented by a single unique bead – i.e., of corresponding mass, molecular diameter (σ), charge (q), and an energy scale that reflects the relative frequency of planar π–π contacts (ϵ) [34].

The potential energy in the Mpipi model is computed as follows:

Beads are connected via harmonic springs i

Here k is the bond force constant (19.1 kcal mol⁻¹ Å⁻²) and is the equilibrium bond length, which is set to 3.81 Å.

The electrostatic contribution is modeled via a Coulomb interaction term, with Debye–Hückel screening.

where q_i and q_j are the charges of the bead i and j, respectively. ϵ₀ is the electric constant and ϵ_r = 80 is the relative dielectric constant of water. κ = 0.126 Å⁻¹ is the inverse Debye screening length, which corresponds to monovalent salt concentration of 150 mM.

Finally, short-ranged non-bonded interactions are modeled with the Wang–Frenkel potential

where E_pair is the potential energy between all pairs of amino acids, ϕ_i,j is the potential energy between the pairs of amino acids i and j, ϵ_ij is an energy parameter that dictates the strength of interaction of that particular pair, σ_i,j is average the Van der Waals radii of the pair that dictate the position of the energy minima, r is the distance between center of masses and ν_i,j and µ_i,j are fitted exponential parameters that determine the shape of the energy curve. Full details of the Mpipi model are described in 34.

4.2.3 Data fitting

To estimate the critical temperatures of the binodals computed in Section 4.2.1, we use the law of coexistence densities

and critical densities are computed by assuming that the law of rectilinear diameters holds, namely

where ρ_high(T), ρ_low(T) and ρ_c are the densities of the high-density and low-density phases and the critical density, respectively, T_c is the critical temperature, and d and A are fitting parameters.

4.3 Finite-size scaling

In direct coexistence simulations in the slab geometry finite-size effects can arise from either the crosssectional area (i.e., size of the interface) or from the long axis (i.e., fraction bulk condensate). To ensure our calculations are robust with respect to system size, we perform finite-size scaling at fixed system number density: (1) varying the size of the cross section and (2) varying the length of the long axis. Using these tests, we confirm that the phase diagrams computed for systems of varying sizes (27, 64, 125 and 216 chains) exhibit no significance difference in predicted densities (see Figure 9 in Supporting Information). Our scaling analysis revealed that 64 chains in a box of dimensions 125x125x1000 is the minimum size that ensures proteins do not interact with their periodic images and that the condensed phase is large enough to represent a true bulk phase.

Acknowledgements

We thank Pin Yu Chew for useful comments on the manuscript. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement No 803326 to R.C.G.). MJM would like to acknowledge the Winton Programme for Physics of Sustainability for doctoral funding. AAG is funded by the ERC (grant agreement No 803326). JRE acknowledges funding from the Ramon y Cajal fellowship (RYC2021-030937-I) and the Spanish National Agency for Research under the grant PID2022-136919NA-C33. This work was supported by the Engineering and Physical Sciences Research Council (EPSRC) [grant number EP/X02332X/1 to JHM] under the UK Research and Innovation (UKRI) Postdoctoral Fellowships Guarantee Scheme [project TF-CHROM-LLPS]. JAJ acknowledges research support from departmental start-up funds provided by the Department of Chemical and Biological Engineering and the Omenn–Darling Bioengineering Institute at Princeton University. This project made use of time on HPC granted via the UK High-End Computing Consortium for Biomolecular Simulation, HECBioSim (http://hecbiosim.ac.uk), supported by EPSRC (grant no. EP/R029407/1).

Supporting Information

A Amino acid Sequences of PLDs

RBM14

TDP43

TIA1

FUS

EWSR1

B Convergence tests

Molecular Dynamic simulations are inherently sensitive to both the length of simulated time and system scaling factor such as the number of particles present in the system and the size of the simulation box. In this study, we systematically tested all these aspects to ensure proper convergence of the results.

The system was placed in the middle of the simulation box, and the length of X and Y was the same as the length of the condensate, while the box was elongated in Z, to about ten times the size of X, Y(see 1). Importantly, the cross-section is large enough to avoid protein self-interactions through the periodic boundary conditions (e.g., cross-section sizes are at least twice the protein radius of gyration in condensate bulk conditions).

We tested for time convergence by plotting the obtained density of the protein wild types after increasing simulation times, from 50 ns to 350 ns. On simulated timesteps above 100 ns, the density of the condensate exhibited no significant differences. We thus concluded that, to optimize speed-up, 200 ns provided sufficient sampling for each point of the binodal.

Finally, we decided to keep the number of particles at a constant value of 64, which we found to be a good balance between minimizing finite-size effects and optimizing simulation speed (see Figure 9). We tested this by varying the corresponding number of protein chains of the FUS WT as a model protein from 27 to 216 and found that in these systems there was not a significant difference in the simulated density. Additionally, with 64 protein chains we could ensure a bulk large enough to sustain the condition that the cross section of the box should be higher than twice the radii of gyration of the protein chain.

B.1 Normalised and comparative contact maps

Further study of the differences between variants at the chemical level was carried out by computing the inter-molecular contacts per residue inside a phase-separated droplet. For each variant, the contact matrix was computed at a temperature around ten percent below the T_c. The cutoff to define a pairwise distance as a contact was dynamically calculated for each pair, as 1.2 times the average Van der Waals radius of both pair residues. Given residues i and j belong to different molecules, the contact maps were computed as shown below:

To get a more accurate picture of the main interactions driving phase separation for each variant, we normalised the contact maps to account for the occurrence of each residue in the sequence. This allowed us to account for the disparities in residue distributions in the sequences, and extract the relevant residue–residue interactions. For each residue–residue pair, the normalisation was performed as follows:

where n_i is the number of i residues in the IDP sequence. The resulting contact matrix is min–max normalised:

The comparative contact maps (Fig. 10 and 11), on the other hand, are computed as the difference in contacts of a protein variant respective to its wild type, as:

Normalised contact maps of different *R to K* variants of PLDs: A) TDP B) FUS, C) hnRNPA1

Normalised contact maps of different aromatic variants of PLDs: A) FUS B) hnRNPA1.

which is finally mean-normalised.

Significance of findings

Strength of evidence

Abstract

1 Introduction

Data generation and representation of dataset for probing scaling laws of PLD phase behavior

2 Results

2.1 Generating and Analyzing an Extensive Dataset of Critical Solution Temperatures for Prion-like Domain Variants

2.1.1 Extended Validation of the Mpipi Model for Phase behavior of Prion-like Domains

Mpipi model quantitatively captures phase behavior for a large set of hnRNPA1 PLD mutants.

2.1.2 Large-Scale Data Generation on Critical Parameters of PLD Phase Separation

Complete set of 140 binodals, in the temperature versus density phase space, generated via largscale molecular dynamics simulations.

2.2 A Data-Driven Scaling Law Approach for Studying Phase Behavior of Prion-like Domains

2.2.1 Aromatic Mutations Exhibit a Linear Correlation Between Mutation Fraction and Critical Solution Temperatures

Mutations in aromatic amino acids have strong effects on the critical solution temperature of PLDs.

2.2.2 Strong Critical Temperature Reduction Due To Aromatic Deletions is Buffered By Protein Length

2.2.3 Arginine Mutations Drastically Alter Condensate Stability

Critical effects of Arg mutations in the phase transition temperature of the condensates

2.2.4 Polar Amino Acids Subtly Modulate PLD Phase behavior

Polar and neutral amino acid mutations show subtle modulation of the critical solution temperatures of PLDsA)

2.2.5 Beyond composition: patterning considerations

Prion-like domains featuring sequences with dispersed aromatic residues show greater propensities for phase separation.

3 Discussion

Predicted effect of single point mutations on the critical temperature, in Kelvin, of PLDs, as a function of their length, L

4 Methods and Materials

4.1 Design of the dataset

4.2 Critical Parameter Determination Through MD

4.2.1 Direct Coexistence Simulations

4.2.2 Mpipi model

4.2.3 Data fitting

4.3 Finite-size scaling

Acknowledgements

Supporting Information

A Amino acid Sequences of PLDs

RBM14

TDP43

TIA1

FUS

EWSR1

B Convergence tests

B.1 Normalised and comparative contact maps

References

Article and author information

Author information

M Julia Maristany

Anne Aguirre Gonzalez

Jorge R Espinosa

Jan Huertas

Rosana Collepardo-Guevara

Jerelle A Joseph

Version history

Cite all versions

Copyright

Metrics