Abstract
Intrinsically disordered regions (IDRs) play a critical role in phase separation and are essential for the formation of membraneless organelles (MLOs). Mutations within IDRs can disrupt their multivalent interaction networks, altering phase behavior and contributing to various diseases. Therefore, examining the evolutionary fitness of IDRs provides valuable insights into the relationship between protein sequences and phase separation. In this study, we utilized the ESM2 protein language model to map the fitness landscape of IDRs. Our findings reveal that IDRs, particularly those actively participating in phase separation, contain conserved amino acids. This conservation is evident through mutational constraints predicted by ESM2 and supported by direct analyses of multiple sequence alignments. These conserved, disordered amino acids include residues traditionally identified as “stickers” as well as “spacers” and frequently form continuous sequence motifs. The strong conservation, combined with their critical role in phase separation, suggests that these motifs act as functional units under evolutionary selection to support stable MLO formation. Our findings underscore the insights into phase separation’s molecular grammar made possible through evolutionary analysis enabled by protein language models.
Introduction
Membraneless organelles (MLOs), such as nucleoli, stress granules, and P-bodies, are distributed throughout diverse cellular environments and play a vital role in forming specialized biochemical compartments that drive essential cellular functions.1–11 These biomolecular condensates typically assemble through phase separation, dynamically recruiting reactants and releasing products to improve the efficiency and specificity of cellular processes.4,12,13 Intrinsically disordered regions (IDRs), which lack well-defined tertiary structures yet exhibit unique structural disorder, frequently act as scaffolds within MLOs.12,14–17 IDRs facilitate multivalent interactions, including π-π stacking, cation-π, and electrostatic interactions, that promote phase separation. Mutations in IDRs can disrupt phase behavior, potentially leading to MLO dysfunction and contributing to diseases such as neurodegenerative disorders and cancer.18–21
Substantial research has focused on linking protein sequences to the phase behaviors of condensates.4,6,15,22–42 These studies support the “stickers and spacers” framework,3,4,43–46 in which specific residues, termed “stickers,” drive strong, specific interactions, while “spacer” regions act as flexible linkers with minimal nonspecific interactions.37,47–50 Sood and Zhang 51 further introduced an evolutionary dimension, proposing that IDRs adapt this framework to balance effective phase separation with compositional specificity. For instance, membraneless organelles (MLOs) form through specific interactions among stickers, while minimizing non-specific interactions among spacers to enable condensate formation with defined structural and compositional properties. This evolutionary pressure supports the enrichment of low-complexity domains (LCDs) in IDRs, reducing nonspecific interactions yet preserving conserved stickers critical for condensate specificity and stability. This evolutionary perspective, while promising, has not been extensively validated.
Evolutionary analysis of IDRs is challenging due to difficulties in sequence alignment.52–58 Recent advances in protein language models59–64 provide alternative approaches for sequence analysis and information decoding. Trained on large protein datasets, such as UniProt,65 these models leverage neural network architectures like the Transformer66 to capture correlations among amino acids within a sequence. These correlations enable the models to identify constraints, enforced by the whole sequence, that influence the chemical identity of amino acids at specific positions. Since the whole sequence is linked to function and stability, the mutational preferences predicted by these models serve as quantitative measures of the fitness of amino acids. In contrast to many traditional methods, these predictions do not rely on homology alignments, making them potentially useful for decoding disordered protein sequences. While these models have been widely applied to analyze mutational effects in folded proteins,67,68 their applicability to studying the fitness and evolutionary pressures on IDRs has yet to be established.
In this study, we employ the Evolutionary Scale Model (ESM2)68 to investigate the fitness landscape of IDRs involved in MLO formation. Our analysis demonstrates the utility of ESM2 for examining disordered sequences, identifying a notable subset of amino acids that exhibit mutation resistance. These amino acids are evolutionarily conserved, as confirmed through multiple sequence alignment analysis. The conserved residues are primarily located in regions associated with phase separation. Importantly, the conserved disordered amino acids include both sticker residues, such as tyrosine (Y), tryptophan (W), and phenylalanine (F), and spacer residues, such as alanine (A), glycine (G), and proline (P). These residues frequently colocalize within continuous sequence stretches, which we identify as motifs. Our findings provide strong evidence for evolutionary pressures acting on specific IDRs to preserve their roles in scaffolding phase separation mechanisms, emphasizing the functional importance of entire motifs rather than individual residues in MLO formation.
Results
Protein Language Model for Quantifying the Fitness Landscape of MLO Proteins
To examine the fitness of IDRs and their connection to phase separation, we compiled a database of human proteins with disordered regions. From this, we identified a subset of 939 proteins associated with the formation of membraneless organelles, referred to as MLO-hIDR. The Methods section provides additional information on the dataset preparation. The MLO-hIDR subset contains proteins with varying numbers of disordered residues, ranging from a few dozen to several thousand per protein (Figure 1A). These proteins are involved in the assembly of various MLOs, including P-bodies, Cajal bodies, and centrosome granules, and are distributed across both nuclear and cytoplasmic compartments.

Figure 1. Predicting the fitness landscape of IDRs involved in MLO formation using ESM2.
(A) Schematic representation of the MLO-hIDR database construction. The right panel shows the distribution of disordered and folded residues identified in each protein, where structural order is assigned using the AlphaFold2 pLDDT score. (B) Workflow of the pre-trained ESM2 model (left)68 for predicting the fitness landscape of a given protein sequence (right). Upon receiving a protein sequence as input, ESM2 generates a log-likelihood ratio (LLR) for each mutation type at each residue position. Using the 20-element LLR vector, we compute the ESM2 score for each residue (Eq. 1) to assess mutational tolerance.
We analyzed proteins in the MLO-hIDR dataset using the protein language model, ESM2.68 As illustrated in Figure 1B, ESM2 is a conditional probabilistic model (masked language model) that predicts the likelihood of specific amino acids appearing at a given position, based on the surrounding sequence context. Using a transformer architecture, it processes amino acid sequences as input and calculates these probabilities. ESM2 was trained on over 250 million protein sequences, enabling it to capture the intricate relationships among amino acids and to quantify the fitness of particular residues at specific sites. Notably, this model does not rely on multiple sequence alignments, thereby addressing common challenges encountered in the evolutionary analysis of IDRs.
The fitness of a specific amino acid at a given site is defined as follows. ESM2 enables the quantification of the probability, or likelihood, of observing any of the 20 amino acids at site i. To assess the preference for a mutant over the wild-type (WT) residue, we calculate the log-likelihood ratio (LLR) between the mutant and WT residues. Consequently, a 20-element vector representing the LLRs for each amino acid can be generated at each site (Figure 1B). This vector is then condensed into a single value, referred to as the ESM2 score, which is derived using an information entropy expression for the LLR probabilities of individual amino acids (Eq. 1 in the Methods Section). The ESM2 score provides a measure of the overall mutational tolerance of a given residue. Lower scores indicate higher mutational constraint and reduced flexibility, implying that these residues are more likely essential for protein function, as they exhibit fewer permissible mutational states. Previous studies have demonstrated a strong correlation between ESM2 scores and changes in free energy related to protein structure stability.67,68
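As an illustration of the quantities defined above, the following minimal Python sketch derives an LLR vector and an entropy-based score from the per-position log-probabilities of a masked language model. The function names, the amino acid ordering, and the softmax normalization of the LLRs before taking the Shannon entropy are assumptions made for illustration; the expression actually used in this work is Eq. 1 in the Methods.

```python
import math

# Assumed ordering of the 20 standard amino acids (illustrative only).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def llr_vector(log_probs, wt_residue):
    """Log-likelihood ratio of each amino acid relative to the wild type at one site.

    log_probs: 20 log-probabilities predicted by the model for this site.
    wt_residue: one-letter code of the wild-type residue.
    """
    wt_log_prob = log_probs[AMINO_ACIDS.index(wt_residue)]
    return [lp - wt_log_prob for lp in log_probs]


def entropy_score(llrs):
    """Shannon entropy (in nats) of softmax-normalized LLRs.

    Assumption: p_i is proportional to exp(LLR_i); this normalization is
    a guess consistent with the text, not a confirmed formula.
    """
    shift = max(llrs)  # subtract the maximum for numerical stability
    weights = [math.exp(x - shift) for x in llrs]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Toy site where every amino acid is equally likely (mutation tolerant).
tolerant = [math.log(1 / 20)] * 20

# Toy site dominated by tryptophan (mutation constrained).
constrained = [math.log(0.001)] * 20
constrained[AMINO_ACIDS.index("W")] = math.log(0.981)

print(entropy_score(llr_vector(tolerant, "A")))
print(entropy_score(llr_vector(constrained, "W")))
```

Under this assumed normalization the score is bounded between 0 and ln 20 (about 3.0), so a site with a single strongly preferred residue scores near zero and a fully tolerant site scores near the upper bound.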
ESM2 Identifies Conserved, Disordered Residues
We next used ESM2 to analyze the fitness of amino acids in both structured and disordered regions. We carried out ESM2 predictions for all proteins in the MLO-hIDR dataset and determined the ESM2 scores of individual amino acids. In addition, to quantify structural disorder, we obtained the AlphaFold2 predicted Local Distance Difference Test (pLDDT) score for each residue. The pLDDT scores have been shown to correlate well with protein flexibility and disorder,69 making them a reliable tool for distinguishing structured from unstructured regions. Following previous studies,70,71 we used a threshold of pLDDT = 70 to differentiate ordered from disordered residues. This threshold reflects amino acid composition preferences for folded versus disordered proteins (see Figure S1).70,71
We first analyzed the relationship between ESM2 and pLDDT scores for human Heterochromatin Protein 1α (HP1α, residues 1–191). HP1α is a crucial chromatin organizer that promotes phase separation and facilitates the compaction of chromatin into transcriptionally inactive regions.33,72–75 HP1α comprises both structured and disordered segments, as illustrated in Figure 2A. Here, residues with pLDDT scores exceeding 70 (indicating ordered regions) are shown in white, while disordered segments (pLDDT ≤ 70) are highlighted in blue. Figure 2B displays the AlphaFold2 predicted structure of HP1α, with residues colored according to their pLDDT scores.

Figure 2. ESM2 predicts fitness landscape for structured and disordered residues.
(A) The ESM2 scores for amino acids in the human HP1α protein (UniProt ID: P45973) are presented, with residues having pLDDT scores below 70 highlighted in blue to signify regions lacking a defined structure. (B) A detailed view of the fitness landscape across three regions with varying degrees of structural order. On the left, the AlphaFold2-predicted structure of the human HP1α protein is displayed in cartoon representation, with residues colored according to their pLDDT scores. Three specific regions, representing flexible disordered (residues 75–85), conserved disordered (residues 87–92), and folded (residues 120–130) segments, are highlighted in blue, orange, and red, respectively, using ball-and-stick styles. The panels on the right depict the ESM2 LLR predictions for each of these regions. (C, D) Histograms of pLDDT and ESM2 score distributions for structured (C) and disordered (D) residues are presented. Contour lines indicate free energy levels computed as −log P (pLDDT, ESM2), where P is the probability density of residues based on their pLDDT and ESM2 scores. Contours are spaced at 0.5-unit intervals to distinguish areas of differing density.
Our analysis demonstrated that HP1α’s structured domains consistently yield low ESM2 scores, reflecting strong mutational constraints characteristic of folded regions. These constraints are further evident in the local LLR predictions, as shown in Figure 2B, where we illustrate the folded region G120-T130. In contrast, disordered regions, including the N-terminal (residues 1–19), C-terminal (175–191), and hinge domain (70–117), typically exhibit higher ESM2 scores, indicating increased mutational flexibility.
Nonetheless, not all disordered regions show similar flexibility. Within the hinge domain, a conserved segment known as the KRK patch (highlighted in orange)72,76–78 shows low ESM2 scores, despite being disordered. This distinction allows us to classify disordered regions into two types: “flexible disordered” regions, which show high ESM2 scores and greater mutational tolerance, and “conserved disordered” regions, which display low ESM2 scores, indicating varying levels of mutational constraint despite a lack of stable folding.
We then examined the distribution of ESM2 scores for all amino acids in the MLO-hIDR dataset to evaluate the generality of these patterns. Amino acids in folded regions (pLDDT > 70) consistently yield low ESM2 scores, reflecting strong mutational constraints. As shown in Figure 2C, the histogram of ESM2 versus pLDDT scores for structured residues reveals a dominant population with low ESM2 values (region a, ESM2 Score ≤ 0.5), consistent with the established understanding that folded domains require structural and functional integrity and are thus more mutation-sensitive.79–81
In contrast, disordered residues (pLDDT ≤ 70) predominantly show high ESM2 scores (region b, ESM2 Score ≥ 2.0), consistent with the rapid evolution and higher mutational tolerance typical of disordered proteins.82–84 However, as shown in Figure 2D, a substantial subset of disordered amino acids also exhibit low ESM2 scores (region a). Given that low ESM2 scores generally reflect mutational constraint in folded proteins, the presence of region a among disordered residues suggests that certain disordered amino acids are evolutionarily conserved and likely functionally significant.
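The residue classes discussed above can be summarized as a simple decision rule. The sketch below is illustrative only: the function name and the "intermediate disordered" label for scores between the two regions are assumptions, while the cutoffs (pLDDT of 70, ESM2 score of 0.5 for region a, and 2.0 for region b) are taken from the text.

```python
def classify_residue(plddt, esm2_score,
                     plddt_cutoff=70.0, conserved_max=0.5, flexible_min=2.0):
    """Assign a residue to a structural/fitness class.

    plddt > plddt_cutoff            -> "folded"
    disordered and score <= 0.5     -> "conserved disordered" (region a)
    disordered and score >= 2.0     -> "flexible disordered" (region b)
    otherwise                       -> "intermediate disordered" (hypothetical label)
    """
    if plddt > plddt_cutoff:
        return "folded"
    if esm2_score <= conserved_max:
        return "conserved disordered"
    if esm2_score >= flexible_min:
        return "flexible disordered"
    return "intermediate disordered"


# Toy residues given as (pLDDT, ESM2 score) pairs.
for plddt, score in [(92.0, 0.2), (45.0, 0.3), (38.0, 2.6), (60.0, 1.1)]:
    print(plddt, score, classify_residue(plddt, score))
```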
ESM2 Scores Correlate with Sequence Conservation
Our analysis indicates that a substantial proportion of amino acids within disordered regions are evolutionarily conserved and exhibit reduced mutational tolerance. To evaluate this hypothesis, we conducted an evolutionary analysis of MLO-hIDR proteins, examining the conservation patterns of individual amino acids.
This analysis was based on multiple sequence alignments (MSAs) of homologs of MLO-hIDR proteins. We employed HHblits for homolog detection, a method particularly suited to IDRs as it effectively captures distant sequence similarities in highly divergent sequences.85–87 The presence of folded domains in these proteins facilitates reliable alignment between references and their query homologs. To exclude sequences that no longer qualify as homologs, we filtered for sequences with at least 20% identity to the reference, resulting in homologous sets ranging from tens to thousands per protein (Figure 3A). From these aligned sequences, we calculated the conservation score for each reference amino acid as the ratio of its occurrence in homologs to the total number of sequences (see Eq. 2 in the Methods Section).

Figure 3. Low ESM2 scores correlate strongly with evolutionary conservation.
(A) Estimating amino acid conservation using multiple sequence alignment. The conservation score calculation is demonstrated for human HP1α protein along with a subset of its homologs found by HHblits. In the aligned sequences, missing residues appear as dashed lines, insertions are shown in lowercase letters, and mismatches are highlighted in red. The three rows below the alignment indicate, at position i: ni, the number of conserved residues from the reference sequence to the query sequences; Ni, the total number of existing residues; and CSi, the conservation score calculated from Eq. 2. The right panel illustrates the distribution of homolog counts for each MLO-hIDR protein found by HHblits. (B) Histograms showing conservation and ESM2 score distributions for all residues in MLO-hIDR, grouped by pLDDT scores from AlphaFold2. The contour lines denote free energy levels, calculated as −log P (CS, ESM2), where P is the probability density of residues based on their conservation and ESM2 scores. Contours are spaced at 0.5-unit intervals to highlight regions of distinct density. (C) Correlation between mean conservation and ESM2 scores for amino acids classified by structural order levels. Pearson correlation coefficients, r, are reported in the legends.
Our findings reveal a strong correlation between ESM2 scores and conservation scores. In Figure 3B, we present the histograms of ESM2 and conservation scores for all amino acids from MLO-hIDR proteins. Given that folded domains generally show higher conservation scores than disordered regions, we further classified residues into four groups based on their AlphaFold2 pLDDT scores to assess conservation patterns across varying levels of structural disorder. This stratification allowed us to analyze conservation trends in detail. Across all categories, we observed bimodal distributions, reinforcing the correlation between increasing ESM2 scores and decreasing conservation.
The conservation of amino acids with low ESM2 scores is also apparent in Figure 3C. For each of the four structural order groups, we computed the average ESM2 and conservation scores for all 20 amino acid types. Methionine (M) was excluded from the correlation analysis due to its frequent position as the initial residue in sequences, which complicates reliable mutational effect predictions.88,89 In each group, we consistently observed a strong correlation between average ESM2 and conservation scores.
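The per-amino-acid comparison described above can be sketched as follows: average the ESM2 and conservation scores for each amino acid type within one structural order group, drop methionine, and compute the Pearson correlation across the remaining types. The record layout and function names are hypothetical, and the toy values are not data from this study.

```python
import math
from collections import defaultdict


def mean_scores_by_type(records, exclude=("M",)):
    """Average ESM2 and conservation scores per amino acid type.

    records: iterable of (amino_acid, esm2_score, conservation_score).
    exclude: amino acid types left out of the analysis (methionine here).
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for aa, esm2, cs in records:
        if aa in exclude:
            continue
        entry = sums[aa]
        entry[0] += esm2
        entry[1] += cs
        entry[2] += 1
    return {aa: (s[0] / s[2], s[1] / s[2]) for aa, s in sums.items()}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy residues from a single structural order group (illustrative values).
records = [
    ("A", 0.1, 0.95), ("A", 0.3, 0.85),
    ("G", 1.0, 0.50),
    ("K", 1.7, 0.20), ("K", 1.9, 0.00),
    ("M", 0.2, 0.10),  # excluded from the correlation
]
means = mean_scores_by_type(records)
r = pearson_r([m[0] for m in means.values()], [m[1] for m in means.values()])
print(means, r)
```

In practice the same two steps would be repeated for each of the four pLDDT groups, giving one correlation coefficient per group.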
While ESM2 scores align closely with conservation scores, the relative conservation of specific amino acids varies across structural order groups. In more disordered regions, hydrophilic residues such as glutamine (Q), lysine (K), and arginine (R) exhibit lower ESM2 scores, indicating that mutations in these residues are particularly detrimental. Conversely, hydrophobic residues like valine (V) and isoleucine (I) show higher ESM2 scores, suggesting they experience reduced evolutionary constraints. In more folded domains, hydrophobic residues such as W and F are more conserved (see Figure S2), consistent with the characteristic conservation patterns of proteins across different disorder levels.90,91 Overall, these findings strongly support our hypothesis that ESM2 scores effectively capture evolutionary conservation, enabling the identification of functionally significant residues through the fitness landscape, independent of structural flexibility.
Conserved, Disordered Residues Localize in Regions Driving Phase Separation
The presence of evolutionarily conserved disordered residues raises the question of their functional significance. To explore this, we divided the MLO-hIDR dataset into two categories: drivers (dMLO-hIDR), which actively drive phase separation, and clients (cMLO-hIDR), which are present in MLOs under certain conditions but do not promote phase separation themselves.92 Additionally, human IDRs not associated with MLOs, termed nMLO-hIDR, were included as a control. To enhance statistical robustness, we extended our dataset by incorporating driver proteins from additional species,93 resulting in the expanded dMLO-IDR dataset.
As illustrated in Figure 4A, there is a progressive increase in the fraction of conserved disordered residues and a corresponding decline in flexible disordered residues from non-phase-separating proteins (nMLO-hIDR) to clients (cMLO-hIDR) and drivers (dMLO-hIDR and dMLO-IDR). Driver proteins, particularly those in the expanded dataset, display a notable reduction in flexible residues. These findings imply that disordered regions with a role in phase separation tend to contain functionally significant and evolutionarily conserved regions.

Figure 4. Phase separation driving IDRs exhibit more conserved disorder.
(A) Distribution of ESM2 scores for disordered residues in proteins from the nMLO-hIDR, cMLO-hIDR, dMLO-hIDR, and dMLO-IDR datasets. Red dots indicate the mean values of the respective distributions. The selection of proteins in the dMLO-IDR dataset is shown in the right panel. See Methods for details of dataset preparation. (B) The classification of three IDR functional groups based on their overlap with the experimentally identified phase separation (PS) segments. (C) Violin plots of the ESM2 score distributions for residues in the three IDR groups: driving (blue), participating (orange), and non-participating (green). (D) Violin plots of the conservation score (CS) distributions for residues in the three IDR groups, with the same coloring scheme as in C.
We further examined the sequence location of conserved, disordered residues in driver proteins (dMLO-IDR). For these proteins, experimentally verified segments have been identified, the deletion or mutation of which impairs phase separation93–97 (Figure 4B). These segments can include both structured and disordered regions. Herein, if a disordered region constitutes over 50% of the phase-separating segment, we designate it as “driving”, indicating a likely critical contribution to phase separation. If the disordered region represents less than 50%, we classify it as “participating”, with a potentially limited role. Finally, if there is no overlap between the disordered region and the phase-separating segment, we categorize it as “non-participating”. The number of IDRs in each of the three groups, along with their amino acid compositions, is shown in Figure S4.
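The grouping rule above can be expressed as a small function. This is a sketch under stated assumptions: regions are given as inclusive residue ranges, the fraction is measured as the share of the phase-separating segment covered by the IDR, and an overlap of exactly 50% (not specified in the text) is assigned to "participating". The function name is hypothetical.

```python
def classify_idr(idr, ps_segments):
    """Classify one IDR by its overlap with phase-separating (PS) segments.

    idr: (start, end) residue range of the disordered region, inclusive.
    ps_segments: list of (start, end) ranges of PS segments, inclusive.
    """
    best_fraction = 0.0
    for seg_start, seg_end in ps_segments:
        overlap = min(idr[1], seg_end) - max(idr[0], seg_start) + 1
        if overlap > 0:
            # Fraction of the PS segment that the IDR accounts for.
            fraction = overlap / (seg_end - seg_start + 1)
            best_fraction = max(best_fraction, fraction)
    if best_fraction > 0.5:
        return "driving"
    if best_fraction > 0.0:
        return "participating"  # includes the unspecified 50% boundary case
    return "non-participating"


ps = [(20, 70)]  # toy PS segment of 51 residues
print(classify_idr((10, 60), ps))    # covers 41 of 51 residues
print(classify_idr((10, 25), ps))    # covers 6 of 51 residues
print(classify_idr((100, 120), ps))  # no overlap
```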
We then analyzed the distribution of ESM2 predictions across these IDR groups. Consistent with Figure 4A, we observed a significantly higher proportion of conserved disordered residues within driving IDRs, while few were present in non-participating IDRs (Figure 4C). Supporting the ESM2 predictions, conservation analysis based on MSA also indicated that driving IDRs contain a greater concentration of conserved residues (Figure 4D). Collectively, these findings demonstrate that ESM2 effectively identifies evolutionarily conserved functional sites, enriched in IDR regions likely involved in driving phase separation.
Conserved, Disordered Residues Form Motifs
Finally, we investigated the chemical identities of conserved residues within driving IDRs to understand their potential role in phase separation. Figure 5A displays the average ESM2 LLR predictions for each of the 20 amino acids in the mutational matrix, indicating that mutations to most amino acids are generally unfavorable, as reflected by their low, negative LLR values. This trend is particularly pronounced in driving IDRs compared to nMLO-IDRs or non-participating IDRs (Figure S5).

Figure 5. ESM2 Identifies Functional Motifs in driving IDRs.
(A) Mean LLR values for the 20 amino acids calculated by averaging across all residues of each amino acid type. (B) Clustering of amino acids based on the two UMAP embeddings of their LLR vectors presented in part A. The dashed lines are manually added for a clear visualization of the separation of each group. (C) Percentage of conserved residues located in motifs for each amino acid type. (D) Word cloud of motifs identified by ESM2. The word font size reflects the relative motif length, while the color represents the proportion of “sticker” residues (Y, F, W, R, K, and Q) within each motif.
We further characterize these conserved residues within driving IDRs. Using hierarchical clustering on two UMAP-derived embeddings from the LLR vectors, we grouped amino acids into five clusters (Figure 5B). This approach distinguishes more conserved residues (Groups I to III) from the more flexible residues (Groups IV and V). Notably, W, F, and Y—often referred to as “stickers” due to their crucial role in phase separation43,98–100—are uniquely grouped within the highly conserved Group I. These findings support the expectation that amino acids essential to phase separation are often evolutionarily conserved, aligning with their central role in functional stability.
Interestingly, residues in Groups II and III, which include traditional “spacers” (G, A, P, and S), also show high conservation and resistance to most mutation types, particularly hydrophobic mutations (Figure 5A). Spacer residues, generally regarded as less critical for interactions driving condensate formation, were unexpectedly conserved, suggesting a broader functional relevance than previously assumed.
We propose that this conservation pattern for spacers is likely not due to isolated residue conservation but may instead reflect the conservation of specific sequence stretches. To examine this hypothesis, we identified ESM2 “motifs” as continuous sequence regions with average ESM2 scores below 0.5. A full list of motifs is available in the appendix (ESM2_motif_with_exp_ref.csv). We observed that conserved amino acids with ESM2 scores below 0.5 are predominantly located within these motifs (see Figure 5C and Figure S6A). For instance, conserved glycine residues have a 97.9% likelihood of being part of an ESM2 motif, with similar probabilities observed for other spacers, such as alanine (99.0%) and proline (93.1%), as well as for sticker residues like Y, W, and F.
These results suggest that IDRs crucial for phase separation frequently contain conserved sequence motifs composed of both sticker and spacer residues. Interestingly, many of these motifs have been experimentally validated as essential for phase separation, with representative motifs for each driving IDR listed in ESM2_motif_with_exp_ref.csv. In these cases, mutations or deletions have been shown to disrupt phase separation. For visualization, a word cloud of these motifs is presented in Figure 5D. Altogether, our analysis suggests that IDRs tend to cluster conserved residues into motifs that carry out significant biological roles in phase separation.
Conclusions and Discussion
We have utilized the protein language model ESM2 to investigate the fitness landscape of IDRs. Our analysis reveals a substantial population of amino acids resistant to mutation. Multiple sequence alignment further confirms the evolutionary conservation of these amino acids. Notably, these conserved, disordered residues are predominantly located in regions actively involved in phase separation, contributing to the formation of membraneless organelles. These findings underscore evolutionary constraints on specific IDRs to preserve their functional roles in scaffolding phase separation processes.
Our results indicate that the conserved amino acids encompass both “sticker” and “spacer” classifications, as defined in recent literature.72,100–108 This observation suggests that classifications at the level of individual amino acids may obscure their true functional roles in MLO formation. Instead, the evolutionarily conserved motifs, which can be identified through straightforward analysis of ESM2 score profiles, represent functionally significant units that include combinations of both stickers and spacers. Experimental perturbation of these motifs in vivo may further elucidate their functional importance.
Outside the conserved motifs, IDRs exhibit a higher tolerance for mutations, albeit with a general preference for hydrophilic residues. This arrangement of motif and non-motif regions in protein sequences aligns with a generalized stickers-and-spacers framework. Specifically, the evolutionarily conserved motifs likely engage in strong interactions to drive phase separation, while interactions among non-motif regions remain attenuated. This framework supports the formation of MLOs with well-defined chemical composition and interaction networks, as suggested by Sood and Zhang 51. Investigating the interaction strengths of the identified motifs presents an interesting avenue for future research.
Methods
Data Collection and Preprocessing
MLO-hIDR and nMLO-hIDR Dataset
To construct the MLO-hIDR dataset, we initially identified human proteins with disordered residues using the UniProtKB database.109 From these proteins, we selected candidates with at least 10% of their residues exhibiting a pLDDT score ≤ 70, yielding a total of 5,121 candidates. This subset was subsequently cross-referenced with proteins listed in the CD-code Database110 to derive the final dataset of 939 proteins associated with membraneless organelle formation (Figure 1A). The CD-code Database further categorizes these proteins into two groups: driver datasets (dMLO-hIDR, n=82) and client datasets (cMLO-hIDR, n=814). Additional information on MLO-hIDR candidate proteins and their biological roles is available in the appendix file MLO-hIDR.csv. Proteins among the 5,121 that are not linked to membraneless organelle formation are included in the nMLO-hIDR dataset.
dMLO-IDR dataset
In addition to datasets comprising only human proteins, we developed a specialized dMLO-IDR database incorporating proteins from diverse species involved in driving phase separation. This database includes all Driver proteins cataloged in the MLOsMetaDB database,93 which documents experimentally validated disordered and phase-separating regions. Beginning with 780 Driver proteins, we filtered the dataset to retain entries where disordered regions overlap with phase-separating segments, yielding 399 candidates across 40 species (Figure 4A). To further refine the dataset, we analyzed sequence identity, excluding homologous pairs with sequence identity exceeding 50%. This process resulted in a final set of 341 non-redundant candidates. These proteins play critical roles in mediating the formation of various MLOs, including P-bodies, stress granules, paraspeckles, and centrosomes (see Appendix: dMLO-IDR.csv).
AlphaFold2 score for structural order
The pLDDT scores for human proteins were retrieved from the AlphaFold Protein Structure Database111 for the Homo sapiens reference proteome (Reference Proteome ID: UP000005640).
ESM2 predictions for Mutational Preferences
We employed the code and pretrained parameters for ESM2 available from the model’s official GitHub repository at https://github.com/facebookresearch/esm to conduct mutational predictions. To optimize computational efficiency, we utilized the esm2_t33_650M_UR50D model, which has 650 million parameters and achieves prediction accuracy comparable to the larger 15B parameter model.67 In addition to calculating the log-likelihood ratio (LLR) for individual mutations, we defined an ESM2 score at each position to quantify mutational tolerance, formulated as
where LLRi denotes the LLR value for the i-th amino acid mutation type.
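The displayed form of Eq. 1 is not reproduced in this text. Based only on the description in the Results (an information entropy expression for the LLR probabilities of individual amino acids), one form consistent with that description is sketched below. The softmax normalization that defines p_i and the use of the natural logarithm are assumptions, not a confirmed transcription of the original equation.

```latex
\[
  \text{ESM2 score} \;=\; -\sum_{i=1}^{20} p_i \,\ln p_i ,
  \qquad
  p_i \;=\; \frac{\exp(\mathrm{LLR}_i)}{\sum_{j=1}^{20} \exp(\mathrm{LLR}_j)}
\]
```

Under this assumed form the score is 0 when a single amino acid carries all of the probability and reaches ln 20 (about 3.0) when all 20 amino acids are equally likely, which is compatible with the thresholds of 0.5 and 2.0 used in the Results.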
Evolutionary Sequence Analysis
We performed multiple sequence alignment (MSA) analysis using HHblits from the HH-suite3 software suite,112–115 a widely used open-source toolkit known for its sensitivity in detecting sequence similarities and identifying protein folds. HHblits builds MSAs through iterative database searches, sequentially incorporating matched sequences into the query MSA with each iteration.
The HH-suite3 software was obtained from its GitHub repository (https://github.com/soedinglab/hh-suite). Homologous sequences were identified through searches against the UniRef30 protein database (release 2023/02).116 For each query, we performed three iterations of HHblits searches, incorporating sequences from profile HMM matches with an E-value threshold of 0.001 into the query MSA in each cycle. Using a lower E-value threshold (closer to 0) ensures greater sequence similarity among the matches, while multiple iterations enhance the alignment’s depth and accuracy. The resulting alignments in A3M format were converted to CLUSTAL format using the reformat.pl script provided in HH-suite, aligning all sequences to a uniform length (Figure 3A). To refine alignment quality by focusing on closely related homologs, we filtered out sequences with ≤ 20% identity to the query, excluding weakly related sequences where only short segments show similarity to the reference.
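For orientation, the search and format conversion described above correspond to command lines of roughly the following shape. The sketch only assembles the argument lists; the file names and the database prefix are hypothetical placeholders, and any additional options used in this work are not shown.

```python
def hhblits_command(query_fasta, database_prefix, out_a3m,
                    iterations=3, evalue=1e-3):
    """Argument list for an iterative HHblits search.

    -n sets the number of search iterations and -e the E-value cutoff for
    including matches in the query MSA; -oa3m writes the alignment in A3M format.
    """
    return [
        "hhblits",
        "-i", query_fasta,
        "-d", database_prefix,
        "-oa3m", out_a3m,
        "-n", str(iterations),
        "-e", str(evalue),
    ]


def reformat_command(a3m_path, clustal_path):
    """Argument list for converting A3M to CLUSTAL with HH-suite's reformat.pl."""
    return ["reformat.pl", "a3m", "clu", a3m_path, clustal_path]


# Hypothetical paths, for illustration only.
search = hhblits_command("P45973.fasta", "databases/UniRef30_2023_02", "P45973.a3m")
convert = reformat_command("P45973.a3m", "P45973.clu")
print(" ".join(search))
print(" ".join(convert))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once HH-suite and the UniRef30 database are installed locally.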
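The conservation score $\mathrm{CS}_i$ for position $i$ was then calculated from the MSA as

$$\mathrm{CS}_i = \frac{n_i(\mathrm{ref{=}query})}{N_i(\mathrm{non\_gap})},$$

where $n_i(\mathrm{ref{=}query})$ is the number of aligned sequences whose residue at position $i$ matches the residue of the reference (query) sequence, and $N_i(\mathrm{non\_gap})$ is the total number of non-gap residues at position $i$ across the aligned sequences.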
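The two counts defined here suggest a per-column ratio of matches to non-gap residues. The sketch below illustrates that reading under several assumptions that are not stated in the text: the first MSA row is the gapless query, every row has the query's length, identity to the query is the fraction of query positions with an identical residue, and the query itself is counted in both totals. The function names are hypothetical.

```python
GAP = "-"

def identity_to_query(query, seq):
    """Fraction of query positions at which seq carries the same residue."""
    matches = sum(1 for q, s in zip(query, seq) if s == q)
    return matches / len(query)

def conservation_scores(msa, min_identity=0.20):
    """Per-position conservation from an MSA whose first row is the query.

    Sequences with identity <= min_identity to the query are discarded,
    mirroring the 20% filter described in the text.
    """
    query = msa[0]
    kept = [s for s in msa if identity_to_query(query, s) > min_identity]
    scores = []
    for i, q in enumerate(query):
        column = [s[i] for s in kept]
        non_gap = sum(1 for r in column if r != GAP)
        same = sum(1 for r in column if r == q)
        # The query is always kept, so non_gap is at least 1.
        scores.append(same / non_gap)
    return scores

# Toy alignment: the last row shares no residue with the query and is dropped.
toy_msa = ["MKSY", "MKSF", "MR-Y", "GGGG"]
toy_scores = conservation_scores(toy_msa)
```

Gaps are excluded from the denominator, so a column that is mostly gaps but otherwise identical to the query still scores 1.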
Motif Identification
We defined motifs as contiguous stretches of amino acid sequences with an average ESM2 score of 0.5 or lower. To identify these motifs within a given IDR, we implemented an iterative procedure. Starting from the N-terminus of the sequence, we first identified the initial residue with an ESM2 score below 0.5, denoted as i. From this position, residues were sequentially appended in the C-terminal direction until the cumulative segment’s average ESM2 score exceeded 0.5. The residue where this threshold was surpassed was denoted as j. The segment spanning from i to j – 1, i.e., (i, i + 1, …, j – 1), was then considered a candidate motif. This process was repeated starting from residue j, continuing iteratively until the C-terminal end of the IDR was reached.
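The scanning procedure can be sketched as follows. This is an illustrative reading rather than the authors' code: indices are 0-based with exclusive ends, the function name `candidate_motifs` is hypothetical, and a segment that reaches the C-terminus without its average exceeding the threshold is assumed to be kept as a candidate.

```python
def candidate_motifs(scores, threshold=0.5):
    """Greedy N-to-C scan returning (start, end) pairs, end exclusive."""
    motifs = []
    n = len(scores)
    pos = 0
    while pos < n:
        # Find the first residue i at or after pos with a score below threshold.
        i = pos
        while i < n and scores[i] >= threshold:
            i += 1
        if i == n:
            break
        # Extend toward the C-terminus until the running average exceeds
        # the threshold; j is the residue at which that happens.
        total = 0.0
        j = i
        while j < n:
            total += scores[j]
            if total / (j - i + 1) > threshold:
                break
            j += 1
        motifs.append((i, j))  # residues i .. j-1
        pos = j                # restart the scan from residue j
    return motifs

toy_scores = [0.9, 0.2, 0.3, 0.9, 0.95, 0.1, 0.8]
toy_motifs = candidate_motifs(toy_scores)
```

Because the criterion is the segment average, a candidate can contain individual residues scoring above the threshold, as residue 3 does in the toy example.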
When two motifs lie close together along the sequence, they may be merged into a single motif. Specifically, if the starting position of one motif is within 8 residues of the ending position of another, we define a candidate segment as the sequence spanning both motifs and the intervening residues. If the candidate segment’s average ESM2 score is below 0.5, it is included as a merged motif, replacing the individual motifs in the final list (Appendix: ESM2_motif_with_exp_ref.csv). The analyses in Figure 5 include all motifs of length n ≥ 4; varying the minimum motif length n does not alter the overall conclusions (Figure S6B).
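A minimal sketch of the merging and length-filtering steps follows. It assumes 0-based (start, end) pairs with exclusive ends, reads "within 8 residues" as the distance from the last residue of one motif to the first residue of the next being at most 8, and applies merging greedily from the N-terminus so that a merged motif can absorb further neighbors. These choices and the function names are assumptions, not details given in the text.

```python
def merge_motifs(motifs, scores, max_gap=8, threshold=0.5):
    """Merge neighboring motifs when the spanning segment stays below threshold."""
    merged = []
    for start, end in sorted(motifs):
        if merged:
            prev_start, prev_end = merged[-1]
            # Distance from the last residue of the previous motif
            # to the first residue of the current one.
            if start - (prev_end - 1) <= max_gap:
                segment = scores[prev_start:end]
                if sum(segment) / len(segment) < threshold:
                    merged[-1] = (prev_start, end)
                    continue
        merged.append((start, end))
    return merged

def filter_by_length(motifs, min_len=4):
    """Keep motifs spanning at least min_len residues."""
    return [(s, e) for s, e in motifs if e - s >= min_len]

toy_scores = [0.1, 0.1, 0.1, 0.9, 0.9, 0.1, 0.1, 0.1] + [0.9] * 20 + [0.2, 0.2]
toy_motifs = [(0, 3), (5, 8), (28, 30)]
toy_merged = merge_motifs(toy_motifs, toy_scores)
toy_final = filter_by_length(toy_merged)
```

In the toy input, the first two motifs are joined because the eight-residue span averages 0.3, while the third lies too far away to be considered and is then removed by the length filter.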
References
- (1) A Guide to Membraneless Organelles and Their Various Roles in Gene Regulation. Nature Reviews Molecular Cell Biology 24:288–304
- (2) Biomolecular Condensates: Organizers of Cellular Biochemistry. Nature Reviews Molecular Cell Biology 18:285–298
- (3) Physical Principles Underlying the Complex Biology of Intracellular Phase Transitions. Annual Review of Biophysics 49:107–133
- (4) Phase Transitions of Associative Biomacromolecules. Chemical Reviews 123:8945–8987
- (5) An Introduction to the Stickers-and-Spacers Framework as Applied to Biomolecular Condensates. In: Zhou H.-X., Spille J.-H., Banerjee P. R. (eds.)
- (6) Molecular Determinants for the Layering and Coarsening of Biological Condensates. Aggregate 3:e306
- (7) Organization and Function of Non-dynamic Biomolecular Condensates. Trends in Biochemical Sciences 43:81–94
- (8) Microphase Separation Produces Interfacial Environment within Diblock Biomolecular Condensates. eLife
- (9) Compartmentalization with Nuclear Landmarks Yields Random, yet Precise, Genome Organization. Biophysical Journal 122:1376–1389
- (10) Phase Separation and Correlated Motions in Motorized Genome. The Journal of Physical Chemistry B 126:5619–5628
- (11) OpenNucleome for High Resolution Nuclear Structural and Dynamical Modeling. eLife
- (12) Liquid-Liquid Phase Separation in Biology. Annual Review of Cell and Developmental Biology 30:39–58
- (13) Coexisting Liquid Phases Underlie Nucleolar Subcompartments. Cell 165:1686–1697
- (14) How Do Intrinsically Disordered Protein Regions Encode a Driving Force for Liquid–Liquid Phase Separation? Current Opinion in Structural Biology 67:41–50
- (15) A Conceptual Framework for Understanding Phase Separation and Addressing Open Questions and Challenges. Molecular Cell 82:2201–2214
- (16) Biomolecular Phase Separation: From Molecular Driving Forces to Macroscopic Properties. Annual Review of Physical Chemistry 71
- (17) Stochasticity of Biological Soft Matter: Emerging Concepts in Intrinsically Disordered Proteins and Biological Phase Separation. Trends in Biochemical Sciences 44:716–728
- (18) Liquid–Liquid Phase Separation in Disease. Annual Review of Genetics 53:171–194
- (19) Phase Separation and Neurodegenerative Diseases: A Disturbance in the Force. Developmental Cell 55:45–68
- (20) Liquid-Liquid Phase Separation of Tau: From Molecular Biophysics to Physiology and Disease. Protein Science: A Publication of the Protein Society 30:1294–1314
- (21) Comparative Roles of Charge, π, and Hydrophobic Interactions in Sequence-Dependent Phase Separation of Intrinsically Disordered Proteins. Proceedings of the National Academy of Sciences 117:28795–28805
- (22) Polymer Physics of Intracellular Phase Transitions. Nature Physics 11:899–904
- (23) Developments in Describing Equilibrium Phase Transitions of Multivalent Associative Macromolecules. Current Opinion in Structural Biology 79:102540
- (24) Protein Network Structure Enables Switching between Liquid and Gel States. Journal of the American Chemical Society 142:874–883
- (25) Convergence of Artificial Protein Polymers and Intrinsically Disordered Proteins. Biochemistry 57:2405–2414
- (26) Thermodynamics of High Polymer Solutions. The Journal of Chemical Physics 10:51–61
- (27) Some Properties of Solutions of Long-chain Compounds. The Journal of Physical Chemistry 46:151–158
- (28) Sequence Determinants of Protein Phase Behavior from a Coarse-Grained Model. PLOS Computational Biology 14:e1005941
- (29) Simulation Methods for Liquid–Liquid Phase Separation of Disordered Proteins. Current Opinion in Chemical Engineering 23:92–98
- (30) Maximum Entropy Optimized Force Field for Intrinsically Disordered Proteins. Journal of Chemical Theory and Computation 16:773–781
- (31) Physics-Driven Coarse-Grained Model for Biomolecular Phase Separation with near-Quantitative Accuracy. Nature Computational Science 1:732–743
- (32) Improved Coarse-Grained Model for Studying Sequence Dependent Phase Separation of Disordered Proteins. Protein Science 30:1371–1379
- (33) Consistent Force Field Captures Homologue-Resolved HP1 Phase Separation. Journal of Chemical Theory and Computation 17:3134–3144
- (34) Accurate Model of Liquid–Liquid Phase Behavior of Intrinsically Disordered Proteins from Optimization of Single-Chain Properties. Proceedings of the National Academy of Sciences 118:e2111696118
- (35) Simulation of FUS Protein Condensates with an Adapted Coarse-Grained Model. Journal of Chemical Theory and Computation 17:525–537
- (36) On the Stability and Layered Organization of Protein-DNA Condensates. Biophysical Journal 121:1727–1737
- (37) Phase Separation of Protein Mixtures Is Driven by the Interplay of Homotypic and Heterotypic Interactions. Nature Communications 14:5527
- (38) Direct Prediction of Intrinsically Disordered Protein Conformational Properties from Sequence. Nature Methods
- (39) Prediction of Phase Separation Propensities of Disordered Proteins from Sequence. bioRxiv https://doi.org/10.1101/2024.06.03.597109
- (40) Toward Accurate Simulation of Coupling between Protein Secondary Structure and Phase Separation. Journal of the American Chemical Society 146:342–357
- (41) Coarse-Grained Models to Study Protein–DNA Interactions and Liquid–Liquid Phase Separation. Journal of Chemical Theory and Computation 20:1717–1731
- (42) OpenABC Enables Flexible, Simplified, and Efficient GPU Accelerated Simulations of Biomolecular Condensates. PLOS Computational Biology 19:e1011442
- (43) A Molecular Grammar Governing the Driving Forces for Phase Separation of Prion-like RNA Binding Proteins. Cell 174:688–699
- (44) Thermodynamics of Associative Polymer Blends. Macromolecules 51:5918–5932
- (45) LASSI: A Lattice Model for Simulating Phase Transitions of Multivalent Proteins. PLOS Computational Biology 15:e1007028
- (46) Decoding the Physical Principles of Two-Component Biomolecular Phase Separation. eLife 10:e62403
- (47) Intrinsically Disordered Linkers Determine the Interplay between Phase Separation and Gelation in Multivalent Proteins. eLife 6:e30294
- (48) Deciphering How Naturally Occurring Sequence Features Impact the Phase Behaviours of Disordered Prion-like Domains. Nature Chemistry 14:196–207
- (49) Condensates Formed by Prion-like Low-Complexity Domains Have Small-World Network Structures and Interfaces Defined by Expanded Conformations. Nature Communications 13:7722
- (50) Dynamic Metastable Long-Living Droplets Formed by Sticker-Spacer Proteins. eLife 9:e56159
- (51) Preserving Condensate Structure and Composition by Lowering Sequence Complexity. Biophysical Journal 123:1815–1826
- (52) The Difficulty of Aligning Intrinsically Disordered Protein Sequences as Assessed by Conservation and Phylogeny. PLOS One 18:e0288388
- (53) Intrinsically Disordered Proteins and Intrinsically Disordered Protein Regions. Annual Review of Biochemistry 83:553–584
- (54) Evolutionary Rate Heterogeneity in Proteins with Long Disordered Regions. Journal of Molecular Evolution 55:104–110
- (55) Analysis of the Relationships between Evolvability, Thermodynamics, and the Functions of Intrinsically Disordered Proteins/Regions. Computational Biology and Chemistry 41:51–57
- (56) An Easy Protocol for Evolutionary Analysis of Intrinsically Disordered Proteins. In: Kragelund B. B., Skriver K. (eds.)
- (57) A Comprehensive Benchmark Study of Multiple Sequence Alignment Methods: Current Challenges and Future Perspectives. PLOS One 6:e18093
- (58) KMAD: Knowledge-Based Multiple Sequence Alignment for Intrinsically Disordered Proteins. Bioinformatics 32:932–936
- (59) The Language of Proteins: NLP, Machine Learning & Protein Sequences. Computational and Structural Biotechnology Journal 19:1750–1758
- (60) Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences. Proceedings of the National Academy of Sciences 118:e2016239118
- (61) CodeTrans: Towards Cracking the Language of Silicon’s Code Through Self-Supervised Deep Learning and High Performance Computing. arXiv https://doi.org/10.48550/arXiv.2104.02443
- (62) UDSMProt: Universal Deep Sequence Models for Protein Classification. Bioinformatics 36:2401–2409
- (63) Unified Rational Protein Engineering with Sequence-Based Deep Representation Learning. Nature Methods 16:1315–1322
- (64) ProteinBERT: A Universal Deep-Learning Model of Protein Sequence and Function. Bioinformatics 38:2102–2110
- (65) UniProtKB/Swiss-Prot, the Manually Annotated Section of the UniProt KnowledgeBase: How to Use the Entry View. In: Edwards D. (ed.)
- (66) Attention Is All You Need. Advances in Neural Information Processing Systems
- (67) Genome-Wide Prediction of Disease Variant Effects with a Deep Protein Language Model. Nature Genetics 55:1512–1522
- (68) Evolutionary-Scale Prediction of Atomic-Level Protein Structure with a Language Model. Science 379:1123–1130
- (69) Highly Accurate Protein Structure Prediction with AlphaFold. Nature 596:583–589
- (70) AlphaFold and Implications for Intrinsically Disordered Proteins. Journal of Molecular Biology 433:167208
- (71) Systematic Identification of Conditionally Folded Intrinsically Disordered Regions by AlphaFold2. Proceedings of the National Academy of Sciences of the United States of America 120:e2304302120
- (72) Liquid Droplet Formation by HP1α Suggests a Role for Phase Separation in Heterochromatin. Nature 547:236–240
- (73) HP1 Reshapes Nucleosome Core to Promote Phase Separation of Heterochromatin. Nature 575:390–394
- (74) HP1-driven Phase Separation Recapitulates the Thermodynamics and Kinetics of Heterochromatin Condensate Formation. Proceedings of the National Academy of Sciences 120:e2211855120
- (75) HP1a Promotes Chromatin Liquidity and Drives Spontaneous Heterochromatin Compartmentalization. bioRxiv https://doi.org/10.1101/2024.10.18.618981
- (76) The Hinge and Chromo Shadow Domain Impart Distinct Targeting of HP1-like Proteins. Molecular and Cellular Biology 21:2555–2569
- (77) Mutations in the Heterochromatin Protein 1 (HP1) Hinge Domain Affect HP1 Protein Interactions and Chromosomal Distribution. Chromosoma 113:370–384
- (78) HP1α Is a Chromatin Crosslinker That Controls Nuclear and Mitotic Chromosome Mechanics. eLife 10:e63972
- (79) Conserved Residues and the Mechanism of Protein Folding. Nature 379:96–98
- (80) Conservation of Folding and Stability within a Protein Family: The Tyrosine Corner as an Evolutionary Cul-de-Sac. Journal of Molecular Biology 295:641–649
- (81) Conservation of Protein Structure over Four Billion Years. Structure (London, England: 1993) 21:1690–1697
- (82) Evolution and Disorder. Current Opinion in Structural Biology 21:441–446
- (83) Advantages of Proteins Being Disordered. Protein Science 23:539–550
- (84) From Sequence and Forces to Structure, Function, and Evolution of Intrinsically Disordered Proteins. Structure 21:1492–1499
- (85) Predicting MoRFs in Protein Sequences Using HMM Profiles. BMC Bioinformatics 17:504
- (86) Insights from Analyses of Low Complexity Regions with Canonical Methods for Protein Sequence Comparison. Briefings in Bioinformatics 23:bbac299
- (87) CLIP: Accurate Prediction of Disordered Linear Interacting Peptides from Protein Sequences Using Co-Evolutionary Information. Briefings in Bioinformatics 24:bbac502
- (88) Prediction of Protein Half-lives from Amino Acid Sequences by Protein Language Models. bioRxiv https://doi.org/10.1101/2024.09.10.612367
- (89) Language Models Enable Zero-Shot Prediction of the Effects of Mutations on Protein Function. In: Proceedings of the 35th International Conference on Neural Information Processing Systems, pp. 29287–29303
- (90) Prediction of Folding Patterns for Intrinsic Disordered Protein. Scientific Reports 13:20343
- (91) Amino Acid Homorepeats in Proteins. Nature Reviews Chemistry 4:420–434
- (92) Integration of Data from Liquid–Liquid Phase Separation Databases Highlights Concentration and Dosage Sensitivity of LLPS Drivers. International Journal of Molecular Sciences 22:3017
- (93) MLOsMetaDB, a Meta-Database to Centralize the Information on Liquid–Liquid Phase Separation Proteins and Membraneless Organelles. Protein Science 33:e4858
- (94) PhaSePro: The Database of Proteins Driving Liquid–Liquid Phase Separation. Nucleic Acids Research 48:D360–D367
- (95) PhaSepDB: A Database of Liquid–Liquid Phase Separation Related Proteins. Nucleic Acids Research 48:D354–D359
- (96) DrLLPS: A Data Resource of Liquid–Liquid Phase Separation in Eukaryotes. Nucleic Acids Research 48:D288–D295
- (97) LLPSDB: A Database of Proteins Undergoing Liquid–Liquid Phase Separation in Vitro. Nucleic Acids Research 48:D320–D327
- (98) Learning the Molecular Grammar of Protein Condensates from Sequence Determinants and Embeddings. Proceedings of the National Academy of Sciences 118:e2019053118
- (99) Classification of Proteins Inducing Liquid–Liquid Phase Separation: Sequential, Structural and Functional Characterization. Journal of Biochemistry 173:255–264
- (100) Expanding the Molecular Language of Protein Liquid–Liquid Phase Separation. Nature Chemistry 16:1113–1124
- (101) Phase Transition of Spindle-Associated Protein Regulate Spindle Apparatus Assembly. Cell 163:108–122
- (102) Identifying Sequence Perturbations to an Intrinsically Disordered Protein That Determine Its Phase-Separation Behavior. Proceedings of the National Academy of Sciences of the United States of America 117:11421–11431
- (103) Single Amino Acid Substitutions in Stickers, but Not Spacers, Substantially Alter UBQLN2 Phase Transitions and Dense Phase Material Properties. The Journal of Physical Chemistry B 123:3618–3629
- (104) Nuclear Condensates of the Polycomb Protein Chromobox 2 (CBX2) Assemble through Phase Separation. Journal of Biological Chemistry 294:1451–1463
- (105) A Conserved Core Region of the Scaffold NEMO Is Essential for Signal-Induced Conformational Change and Liquid-Liquid Phase Separation. Journal of Biological Chemistry 299:105396
- (106) Self-Interaction of NPM1 Modulates Multiple Mechanisms of Liquid–Liquid Phase Separation. Nature Communications 9:842
- (107) Regulation of Zebrafish Dorsoventral Patterning by Phase Separation of RNA-binding Protein Rbm14. Cell Discovery 5:1–17
- (108) Liquid–Liquid Phase Separation in Human Health and Diseases. Signal Transduction and Targeted Therapy 6:1–16
- (109) The UniProt Consortium. Annotation of Biologically Relevant Ligands in UniProtKB Using ChEBI. Bioinformatics 39:btac793
- (110) CD-CODE: Crowdsourcing Condensate Database and Encyclopedia. Nature Methods 20:673–676
- (111) AlphaFold Protein Structure Database in 2024: Providing Structure Coverage for over 214 Million Protein Sequences. Nucleic Acids Research 52:D368–D375
- (112) UniRef: Comprehensive and Non-Redundant UniProt Reference Clusters. Bioinformatics 23:1282–1288
- (113) UniRef Clusters: A Comprehensive and Scalable Alternative for Improving Sequence Similarity Searches. Bioinformatics 31:926–932
- (114) HHblits: Lightning-Fast Iterative Protein Sequence Searching by HMM-HMM Alignment. Nature Methods 9:173–175
- (115) HH-suite3 for Fast Remote Homology Detection and Deep Protein Annotation. BMC Bioinformatics 20:473
- (116) Uniclust Databases of Clustered and Deeply Annotated Protein Sequences and Alignments. Nucleic Acids Research 45:D170–D176
Cite all versions
You can cite all versions using the DOI https://doi.org/10.7554/eLife.105309. This DOI represents all versions, and will always resolve to the latest one.
Copyright
© 2025, Zhang et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.