Introduction

Membraneless organelles (MLOs), such as nucleoli, stress granules, and P-bodies, are distributed throughout diverse cellular environments and play a vital role in forming specialized biochemical compartments that drive essential cellular functions.1–11 These biomolecular condensates typically assemble through phase separation, dynamically recruiting reactants and releasing products to improve the efficiency and specificity of cellular processes.4,12,13 Intrinsically disordered regions (IDRs), which lack well-defined tertiary structures yet exhibit unique structural disorder, frequently act as scaffolds within MLOs.12,14–17 IDRs facilitate multivalent interactions, including π-π stacking, cation-π, and electrostatic interactions, that promote phase separation. Mutations in IDRs can disrupt phase behavior, potentially leading to MLO dysfunction and contributing to diseases such as neurodegenerative disorders and cancer.18–21

Substantial research has focused on linking protein sequences to the phase behaviors of condensates.4,6,15,22–42 These studies support the “stickers and spacers” framework,3,4,43–46 in which specific residues, termed “stickers,” drive strong, specific interactions, while “spacer” regions act as flexible linkers with minimal nonspecific interactions.37,47–50 Sood and Zhang 51 further introduced an evolutionary dimension, proposing that IDRs adapt this framework to balance effective phase separation with compositional specificity. For instance, MLOs form through specific interactions among stickers, while minimizing nonspecific interactions among spacers to enable condensate formation with defined structural and compositional properties. This evolutionary pressure supports the enrichment of low-complexity domains (LCDs) in IDRs, reducing nonspecific interactions yet preserving conserved stickers critical for condensate specificity and stability. This evolutionary perspective, while promising, has not been extensively validated.

Evolutionary analysis of IDRs is challenging due to difficulties in sequence alignment.52–58 Recent advances in protein language models59–64 provide alternative approaches for sequence analysis and information decoding. Trained on large protein datasets, such as UniProt,65 these models leverage neural network architectures like the Transformer66 to capture correlations among amino acids within a sequence. These correlations enable the models to identify constraints, enforced by the whole sequence, that influence the chemical identity of amino acids at specific positions. Since the whole sequence is linked to function and stability, the mutational preferences predicted by these models serve as quantitative measures of the fitness of amino acids. In contrast to many traditional methods, these predictions do not rely on homology alignments, making them potentially useful for decoding disordered protein sequences. While these models have been widely applied to analyze mutational effects in folded proteins,67,68 their applicability to studying the fitness and evolutionary pressures on IDRs has yet to be established.

In this study, we employ the Evolutionary Scale Model (ESM2)68 to investigate the fitness landscape of IDRs involved in MLO formation. Our analysis demonstrates the utility of ESM2 for examining disordered sequences, identifying a notable subset of amino acids that exhibit mutation resistance. These amino acids are evolutionarily conserved, as confirmed through multi-sequence alignment analysis. The conserved residues are primarily located in regions associated with phase separation. Importantly, the conserved disordered amino acids include both sticker residues, such as tyrosine (Y), tryptophan (W), and phenylalanine (F), and spacer residues, such as alanine (A), glycine (G), and proline (P). These residues frequently colocalize within continuous sequence stretches, which we identify as motifs. Our findings provide strong evidence for evolutionary pressures acting on specific IDRs to preserve their roles in scaffolding phase separation mechanisms, emphasizing the functional importance of entire motifs rather than individual residues in MLO formation.

Results

Protein Language Model for Quantifying the Fitness Landscape of MLO Proteins

To examine the fitness of IDRs and their connection to phase separation, we compiled a database of human proteins with disordered regions. From this, we identified a subset of 939 proteins associated with the formation of membraneless organelles, referred to as MLO-hIDR. The Methods section provides additional information on the dataset preparation. The MLO-hIDR subset contains proteins with varying numbers of disordered residues, ranging from a few dozen to several thousand per protein (Figure 1A). These proteins are involved in the assembly of various MLOs, including P-bodies, Cajal bodies, and centrosome granules, and are distributed across both nuclear and cytoplasmic compartments.

Predicting the fitness landscape of IDRs involved in MLO formation using ESM2.

(A) Schematic representation of the MLO-hIDR database construction. The right panel shows the distribution of disordered and folded residues identified in each protein, where structural order is assigned using the AlphaFold2 pLDDT score. (B) Workflow of the pre-trained ESM2 model (left)68 for predicting the fitness landscape of a given protein sequence (right). Upon receiving a protein sequence as input, ESM2 generates a log-likelihood ratio (LLR) for each mutation type at each residue position. Using the 20-element LLR vector, we compute the ESM2 score for each residue (Eq. 1) to assess mutational tolerance.

We analyzed proteins in the MLO-hIDR dataset using the protein language model, ESM2.68 As illustrated in Figure 1B, ESM2 is a conditional probabilistic model (masked language model) that predicts the likelihood of specific amino acids appearing at a given position, based on the surrounding sequence context. Using a transformer architecture, it processes amino acid sequences as input and calculates these probabilities. ESM2 was trained on over 250 million protein sequences, enabling it to capture the intricate relationships among amino acids and to quantify the fitness of particular residues at specific sites. Notably, this model does not rely on multiple sequence alignments, thereby addressing common challenges encountered in the evolutionary analysis of IDRs.

The fitness of a specific amino acid at a given site is defined as follows. ESM2 enables the quantification of the probability, or likelihood, of observing any of the 20 amino acids at site i. To assess the preference for a mutant over the wild-type (WT) residue, we calculate the log-likelihood ratio (LLR) between the mutant and WT residues. Consequently, a 20-element vector representing the LLRs for each amino acid can be generated at each site (Figure 1B). This vector is then condensed into a single value, referred to as the ESM2 score, which is derived using an information entropy expression for the LLR probabilities of individual amino acids (Eq. 1 in the Methods Section). The ESM2 score provides a measure of the overall mutational tolerance of a given residue. Lower scores indicate higher mutational constraint and reduced flexibility, implying that these residues are more likely essential for protein function, as they exhibit fewer permissible mutational states. Previous studies have demonstrated a strong correlation between ESM2 scores and changes in free energy related to protein structure stability.67,68
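The LLR vector described above can be illustrated with a minimal sketch. It assumes the model's per-site probabilities are already available as a plain dictionary; the function name `llr_vector` and the toy probabilities are hypothetical and are not taken from the ESM2 code base.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def llr_vector(probs, wt):
    """Return {aa: log(p_aa / p_wt)} for a single site.

    probs: dict mapping each amino acid to the probability the model
           assigns at this site (hypothetical input for illustration).
    wt:    one-letter code of the wild-type residue at this site.
    """
    log_wt = math.log(probs[wt])
    return {aa: math.log(probs[aa]) - log_wt for aa in AMINO_ACIDS}

# Toy site where the wild-type glycine is strongly preferred.
probs = {aa: 0.01 for aa in AMINO_ACIDS}
probs["G"] = 0.81
llrs = llr_vector(probs, "G")
print(llrs["G"], round(llrs["A"], 3))
```

By construction the wild-type entry is zero, and a negative LLR means the model considers the mutant less likely than the wild type at that site.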

ESM2 Identifies Conserved, Disordered Residues

We next used ESM2 to analyze the fitness of amino acids in both structured and disordered regions. We carried out ESM2 predictions for all proteins in the MLO-hIDR dataset, and determined the ESM2 scores of individual amino acids. In addition, to quantify structural disorder, we computed the AlphaFold2 predicted Local Distance Difference Test (pLDDT) scores for each residue. The pLDDT scores have been shown to correlate well with protein flexibility and disorder,69 making them a reliable tool for distinguishing structured from unstructured regions. Following previous studies,70,71 we used a threshold of pLDDT = 70 to differentiate ordered from disordered residues. This threshold reflects amino acid composition preferences for folded versus disordered proteins (see Figure S1).70,71

We first analyzed the relationship between ESM2 and pLDDT scores for human Heterochromatin Protein 1α (HP1α, residues 1–191). HP1α is a crucial chromatin organizer that promotes phase separation and facilitates the compaction of chromatin into transcriptionally inactive regions.33,72–75 HP1α comprises both structured and disordered segments, as illustrated in Figure 2A. Here, residues with pLDDT scores exceeding 70 (indicating ordered regions) are shown in white, while disordered segments (pLDDT ≤ 70) are highlighted in blue. Figure 2B displays the AlphaFold2 predicted structure of HP1α, with residues colored according to their pLDDT scores.

ESM2 predicts fitness landscape for structured and disordered residues.

(A) The ESM2 scores for amino acids in the human HP1α protein (UniProt ID: P45973) are presented, with residues having pLDDT scores below 70 highlighted in blue to signify regions lacking a defined structure. (B) A detailed view of the fitness landscape across three regions with varying degrees of structural order. On the left, the AlphaFold2-predicted structure of the human HP1α protein is displayed in cartoon representation, with residues colored according to their pLDDT scores. Three specific regions, representing flexible disordered (residues 75–85), conserved disordered (residues 87–92), and folded (residues 120–130) segments, are highlighted in blue, orange, and red, respectively, using ball-and-stick styles. The panels on the right depict the ESM2 LLR predictions for each of these regions. (C, D) Histograms of pLDDT and ESM2 score distributions for structured (C) and disordered (D) residues are presented. Contour lines indicate free energy levels computed as −log P (pLDDT, ESM2), where P is the probability density of residues based on their pLDDT and ESM2 scores. Contours are spaced at 0.5-unit intervals to distinguish areas of differing density.

Our analysis demonstrated that HP1α’s structured domains consistently yield low ESM2 scores, reflecting strong mutational constraints characteristic of folded regions. These constraints are further evident in the local LLR predictions, as shown in Figure 2B, where we illustrate the folded region G120-T130. In contrast, disordered regions, including the N-terminal (residues 1–19), C-terminal (175–191), and hinge domain (70–117), typically exhibit higher ESM2 scores, indicating increased mutational flexibility.

Nonetheless, not all disordered regions show similar flexibility. Within the hinge domain, a conserved segment known as the KRK patch (highlighted in orange)72,76–78 shows low ESM2 scores, despite being disordered. This distinction allows us to classify disordered regions into two types: “flexible disordered” regions, which show high ESM2 scores and greater mutational tolerance, and “conserved disordered” regions, which display low ESM2 scores, indicating varying levels of mutational constraint despite a lack of stable folding.

We then examined the distribution of ESM2 scores for all amino acids in the MLO-hIDR dataset to evaluate the generality of these patterns. Amino acids in folded regions (pLDDT > 70) consistently yield low ESM2 scores, reflecting strong mutational constraints. As shown in Figure 2C, the histogram of ESM2 versus pLDDT scores for structured residues reveals a dominant population with low ESM2 values (region a, ESM2 Score ≤ 0.5), consistent with the established understanding that folded domains require structural and functional integrity and are thus more mutation-sensitive.79–81

In contrast, disordered residues (pLDDT ≤ 70) predominantly show high ESM2 scores (region b, ESM2 Score ≥ 2.0), consistent with the rapid evolution and higher mutational tolerance typical of disordered proteins.82–84 However, as shown in Figure 2D, a substantial subset of disordered amino acids also exhibit low ESM2 scores (region a). Given that low ESM2 scores generally reflect mutational constraint in folded proteins, the presence of region a among disordered residues suggests that certain disordered amino acids are evolutionarily conserved and likely functionally significant.
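The residue classes discussed in this section can be summarized as a small decision rule. The sketch below uses the thresholds quoted in the text (pLDDT = 70, ESM2 score ≤ 0.5 for region a, ≥ 2.0 for region b). The function name and the "intermediate disordered" label for scores between the two regions are assumptions made here for completeness.

```python
def classify_residue(plddt, esm2_score, plddt_cutoff=70.0, low=0.5, high=2.0):
    """Assign a residue to a structural/fitness class.

    Thresholds follow the values quoted in the text. The label for
    scores strictly between `low` and `high` is an assumption.
    """
    if plddt > plddt_cutoff:
        return "folded"
    if esm2_score <= low:
        return "conserved disordered"   # region a among disordered residues
    if esm2_score >= high:
        return "flexible disordered"    # region b
    return "intermediate disordered"    # hypothetical catch-all label

# Illustrative (pLDDT, ESM2 score) pairs, not real data.
for plddt, score in [(92.0, 0.2), (45.0, 0.3), (38.0, 2.6), (55.0, 1.2)]:
    print(plddt, score, classify_residue(plddt, score))
```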

ESM2 Scores Correlate with Sequence Conservation

Our analysis indicates that a substantial proportion of amino acids within disordered regions are evolutionarily conserved and exhibit reduced mutational tolerance. To evaluate this hypothesis, we conducted an evolutionary analysis of MLO-hIDR proteins, examining the conservation patterns of individual amino acids.

This analysis was based on a multi-sequence alignment (MSA) of homologs of MLO-hIDR proteins. We employed HHblits for homolog detection, a method particularly suited to IDRs as it effectively captures distant sequence similarities in highly divergent sequences.85–87 The presence of folded domains in these proteins facilitates reliable alignment between references and their query homologs. To exclude sequences that no longer qualify as homologs, we filtered for sequences with at least 20% identity to the reference, resulting in homologous sets ranging from tens to thousands per protein (Figure 3A). From these aligned sequences, we calculated the conservation score for each reference amino acid as the ratio of its occurrence in homologs to the total number of sequences (see Eq. 2 in the Methods Section).

Low ESM2 scores correlate strongly with evolutionary conservation.

(A) Estimating amino acid conservation using multiple sequence alignment. The conservation score calculation is demonstrated for human HP1α protein along with a subset of its homologs found by HHblits. In the aligned sequences, missing residues appear as dashed lines, insertions are shown in lowercase letters, and mismatches are highlighted in red. The three rows below the alignment indicate, at position i: ni, the number of query sequences in which the reference residue is conserved; Ni, the total number of existing residues; and CSi, the conservation score calculated from Eq. 2. The right panel illustrates the distribution of homolog counts for each MLO-hIDR protein found by HHblits. (B) Histograms showing conservation and ESM2 score distributions for all residues in MLO-hIDR, grouped by pLDDT scores from AlphaFold2. The contour lines denote free energy levels, calculated as −log P (CS, ESM2), where P is the probability density of residues based on their conservation and ESM2 scores. Contours are spaced at 0.5-unit intervals to highlight regions of distinct density. (C) Correlation between mean conservation and ESM2 scores for amino acids classified by structural order levels. Pearson correlation coefficients, r, are reported in the legends.

Our findings reveal a strong correlation between ESM2 scores and conservation scores. In Figure 3B, we present the histograms of ESM2 and conservation scores for all amino acids from MLO-hIDR proteins. Given that folded domains generally show higher conservation scores than disordered regions, we further classified residues into four groups based on their AlphaFold2 pLDDT scores to assess conservation patterns across varying levels of structural disorder. This stratification allowed us to analyze conservation trends in detail. Across all categories, we observed bimodal distributions, reinforcing the correlation between increasing ESM2 scores and decreasing conservation.

The conservation of amino acids with low ESM2 scores is also apparent in Figure 3C. For each of the four structural order groups, we computed the average ESM2 and conservation scores for all 20 amino acid types. Methionine (M) was excluded from the correlation analysis due to its frequent position as the initial residue in sequences, which complicates reliable mutational effect predictions.88,89 In each group, we consistently observed a strong correlation between average ESM2 and conservation scores.
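The per-residue-type averaging and correlation described above can be sketched as follows. The record layout `(amino_acid, esm2_score, conservation_score)`, the helper names, and the toy values are assumptions for illustration. Methionine is excluded, as in the text.

```python
import math
from collections import defaultdict

def mean_by_residue(records, exclude=("M",)):
    """Average ESM2 and conservation scores per amino acid type.

    records: iterable of (aa, esm2_score, conservation_score) tuples
             (hypothetical layout). Residue types in `exclude` are skipped.
    """
    acc = defaultdict(lambda: [0.0, 0.0, 0])
    for aa, esm2, cs in records:
        if aa in exclude:
            continue
        acc[aa][0] += esm2
        acc[aa][1] += cs
        acc[aa][2] += 1
    return {aa: (s / n, c / n) for aa, (s, c, n) in acc.items()}

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Toy records, not real data.
records = [("W", 0.2, 0.9), ("W", 0.4, 0.8), ("G", 1.0, 0.6),
           ("S", 2.0, 0.3), ("S", 2.4, 0.2), ("M", 0.1, 1.0)]
means = mean_by_residue(records)
aas = sorted(means)
r = pearson_r([means[a][0] for a in aas], [means[a][1] for a in aas])
print(round(r, 3))
```

A negative r is the expected sign here, since lower ESM2 scores correspond to higher conservation.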

While ESM2 scores align closely with conservation scores, the relative conservation of specific amino acids varies across structural order groups. In more disordered regions, hydrophilic residues such as glutamine (Q), lysine (K), and arginine (R) exhibit lower ESM2 scores, indicating that mutations in these residues are particularly detrimental. Conversely, hydrophobic residues like valine (V) and isoleucine (I) show higher ESM2 scores, suggesting they experience reduced evolutionary constraints. In more folded domains, hydrophobic residues such as W and F are more conserved (see Figure S2), consistent with the characteristic conservation patterns of proteins across different disorder levels.90,91 Overall, these findings strongly support our hypothesis that ESM2 scores effectively capture evolutionary conservation, enabling the identification of functionally significant residues through the fitness landscape, independent of structural flexibility.

Conserved, Disordered Residues Localize in Regions Driving Phase Separation

The presence of evolutionarily conserved disordered residues raises the question of their functional significance. To explore this, we divided the MLO-hIDR dataset into two categories: drivers (dMLO-hIDR), which actively drive phase separation, and clients (cMLO-hIDR), which are present in MLOs under certain conditions but do not promote phase separation themselves.92 Additionally, human IDRs not associated with MLOs, termed nMLO-hIDR, were included as a control. To enhance statistical robustness, we extended our dataset by incorporating driver proteins from additional species,93 resulting in the expanded dMLO-IDR dataset.

As illustrated in Figure 4A, there is a progressive increase in the fraction of conserved disordered residues and a corresponding decline in flexible disordered residues from non-phase-separating proteins (nMLO-hIDR) to clients (cMLO-hIDR) and drivers (dMLO-hIDR and dMLO-IDR). Driver proteins, particularly those in the expanded dataset, display a notable reduction in flexible residues. These findings imply that disordered regions with a role in phase separation tend to contain functionally significant and evolutionarily conserved regions.

Phase separation driving IDRs exhibit more conserved disorder.

(A) Distribution of ESM2 scores for disordered residues in proteins from the nMLO-hIDR, cMLO-hIDR, dMLO-hIDR, and dMLO-IDR datasets. Red dots indicate the mean values of the respective distributions. The selection of proteins in the dMLO-IDR dataset is shown in the right panel. See also Methods for details on dataset preparation. (B) The classification of three IDR functional groups based on their overlap with the experimentally identified phase separation (PS) segments. (C) Violin plot of the ESM2 score distribution for residues in the three IDR groups: driving (blue), participating (orange), and non-participating (green). (D) Violin plot of the conservation score (CS) distribution for residues in the three IDR groups, with the same coloring scheme as in C.

We further examined the sequence location of conserved, disordered residues in driver proteins (dMLO-IDR). For these proteins, experimentally verified segments have been identified, the deletion or mutation of which impairs phase separation93–97 (Figure 4B). These segments can include both structured and disordered regions. Herein, if a disordered region constitutes over 50% of the phase-separating segment, we designate it as “driving”, indicating a likely critical contribution to phase separation. If the disordered region represents less than 50%, we classify it as “participating”, with a potentially limited role. Finally, if there is no overlap between the disordered region and the phase-separating segment, we categorize it as “non-participating”. The number of IDRs in each of the three groups, along with their amino acid compositions, is shown in Figure S4.
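The three-way classification above can be expressed as a short function. This is a sketch under stated assumptions: regions are given as inclusive 1-based `(start, end)` residue ranges, the overlap fraction is measured relative to the length of the phase-separating segment, and an overlap of exactly 50% (not specified in the text) is assigned to "participating".

```python
def overlap_length(a, b):
    """Number of residues shared by two inclusive (start, end) ranges."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)

def classify_idr(idr, ps_segment):
    """Classify an IDR by how much of the PS segment it covers.

    idr, ps_segment: inclusive 1-based (start, end) residue ranges.
    The treatment of an exact 50% overlap is an assumption.
    """
    shared = overlap_length(idr, ps_segment)
    if shared == 0:
        return "non-participating"
    segment_len = ps_segment[1] - ps_segment[0] + 1
    return "driving" if shared / segment_len > 0.5 else "participating"

# Hypothetical ranges for illustration.
print(classify_idr((1, 80), (1, 100)))
print(classify_idr((1, 30), (1, 100)))
print(classify_idr((200, 250), (1, 100)))
```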

We then analyzed the distribution of ESM2 predictions across these IDR groups. Consistent with Figure 4A, we observed a significantly higher proportion of conserved disordered residues within driving IDRs, while few were present in non-participating IDRs (Figure 4C). Supporting the ESM2 predictions, conservation analysis based on MSA also indicated that driving IDRs contain a greater concentration of conserved residues (Figure 4D). Collectively, these findings demonstrate that ESM2 effectively identifies evolutionarily conserved functional sites, enriched in IDR regions likely involved in driving phase separation.

Conserved, Disordered Residues Form Motifs

Finally, we investigated the chemical identities of conserved residues within driving IDRs to understand their potential role in phase separation. Figure 5A displays the average ESM2 LLR predictions for each of the 20 amino acids in the mutational matrix, indicating that mutations to most amino acids are generally unfavorable, as reflected by their low, negative LLR values. This trend is particularly pronounced in driving IDRs compared to nMLO-IDRs or non-participating IDRs (Figure S5).

ESM2 Identifies Functional Motifs in driving IDRs.

(A) Mean LLR values for the 20 amino acids calculated by averaging across all residues of each amino acid type. (B) Clustering of amino acids based on the two UMAP embeddings of their LLR vectors presented in part A. The dashed lines are manually added to visualize the separation between groups. (C) The percentage of conserved residues located in motifs for each amino acid type. (D) Word cloud of motifs identified by ESM2. The word font size reflects the relative motif length, while the color represents the proportion of “sticker” residues (Y, F, W, R, K, and Q) within each motif.

We further characterized these conserved residues within driving IDRs. Using hierarchical clustering on two UMAP-derived embeddings from the LLR vectors, we grouped amino acids into five clusters (Figure 5B). This approach distinguishes more conserved residues (Groups I to III) from the more flexible residues (Groups IV and V). Notably, W, F, and Y, often referred to as “stickers” due to their crucial role in phase separation,43,98–100 are uniquely grouped within the highly conserved Group I. These findings support the expectation that amino acids essential to phase separation are often evolutionarily conserved, aligning with their central role in functional stability.

Interestingly, residues in Groups II and III, which include traditional “spacers” (G, A, P, and S), also show high conservation and resistance to most mutation types, particularly hydrophobic mutations (Figure 5A). Spacer residues, generally regarded as less critical for interactions driving condensate formation, were unexpectedly conserved, suggesting a broader functional relevance than previously assumed.

We propose that this conservation pattern for spacers is likely not due to isolated residue conservation but may instead reflect the conservation of specific sequence stretches. To examine this hypothesis, we identified ESM2 “motifs” as continuous sequence regions with average ESM2 scores below 0.5. A full list of motifs is available in the appendix (ESM2_motif_with_exp_ref.csv). We observed that conserved amino acids with ESM2 scores below 0.5 are predominantly located within these motifs (see Figure 5C and Figure S6A). For instance, conserved glycine residues have a 97.9% likelihood of being part of an ESM2 motif, with similar probabilities observed for other spacers, such as alanine (99.0%) and proline (93.1%), as well as for sticker residues like Y, W, and F.

These results suggest that IDRs crucial for phase separation frequently contain conserved sequence motifs composed of both sticker and spacer residues. Interestingly, many of these motifs have been experimentally validated as essential for phase separation, with representative motifs for each driving IDR listed in ESM2_motif_with_exp_ref.csv. In these cases, mutations or deletions have been shown to disrupt phase separation. For visualization, a word cloud of these motifs is presented in Figure 5D. Altogether, our analysis suggests that IDRs tend to cluster conserved residues into motifs that carry out significant biological roles in phase separation.

Conclusions and Discussion

We have utilized the protein language model ESM2 to investigate the fitness landscape of IDRs. Our analysis reveals a substantial population of amino acids resistant to mutation. Multi-sequence alignment further confirms the evolutionary conservation of these amino acids. Notably, these conserved, disordered residues are predominantly located in regions actively involved in phase separation, contributing to the formation of membraneless organelles. These findings underscore evolutionary constraints on specific IDRs to preserve their functional roles in scaffolding phase separation processes.

Our results indicate that the conserved amino acids encompass both “sticker” and “spacer” classifications, as defined in recent literature.72,100–108 This observation suggests that classifications at the level of individual amino acids may obscure their true functional roles in MLO formation. Instead, the evolutionarily conserved motifs, which can be identified through straightforward analysis of ESM2 score profiles, represent functionally significant units that include combinations of both stickers and spacers. Experimental perturbations of these motifs in in vivo studies may further elucidate their functional importance.

Outside the conserved motifs, IDRs exhibit a higher tolerance for mutations, albeit with a general preference for hydrophilic residues. This arrangement of motif and non-motif regions in protein sequences aligns with a generalized stickers-and-spacers framework. Specifically, the evolutionarily conserved motifs likely engage in strong interactions to drive phase separation, while interactions among non-motif regions remain attenuated. This framework supports the formation of MLOs with well-defined chemical composition and interaction networks, as suggested by Sood and Zhang 51. Investigating the interaction strengths of the identified motifs presents an interesting avenue for future research.

Methods

Data Collection and Preprocessing

MLO-hIDR and nMLO-hIDR Dataset

To construct the MLO-hIDR dataset, we initially identified human proteins with disordered residues using the UniProtKB database.109 From these proteins, we selected candidates with at least 10% of their residues exhibiting a pLDDT score ≤ 70, yielding a total of 5,121 candidates. This subset was subsequently cross-referenced with proteins listed in the CD-code Database110 to derive the final dataset of 939 proteins associated with membraneless organelle formation (Figure 1A). The CD-code Database further categorizes these proteins into two groups: driver datasets (dMLO-hIDR, n=82) and client datasets (cMLO-hIDR, n=814). Additional information on MLO-hIDR candidate proteins and their biological roles is available in the appendix file MLO-hIDR.csv. Proteins among the 5,121 that are not linked to membraneless organelle formation are included in the nMLO-hIDR dataset.
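The candidate-selection criterion (at least 10% of residues with pLDDT ≤ 70) can be sketched as a simple filter. The function names and the toy pLDDT profiles are hypothetical. Retrieval of the scores themselves is not shown.

```python
def disordered_fraction(plddt_scores, cutoff=70.0):
    """Fraction of residues with pLDDT at or below the cutoff."""
    if not plddt_scores:
        raise ValueError("empty pLDDT profile")
    return sum(1 for v in plddt_scores if v <= cutoff) / len(plddt_scores)

def is_candidate(plddt_scores, min_fraction=0.10, cutoff=70.0):
    """True if the protein meets the disordered-residue criterion."""
    return disordered_fraction(plddt_scores, cutoff) >= min_fraction

# Toy per-residue pLDDT profiles, not real data.
mostly_folded = [90.0] * 19 + [50.0]   # 5% disordered
borderline = [90.0] * 9 + [50.0]       # 10% disordered
print(is_candidate(mostly_folded), is_candidate(borderline))
```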

dMLO-IDR dataset

In addition to datasets comprising only human proteins, we developed a specialized dMLO-IDR database incorporating proteins from diverse species involved in driving phase separation. This database includes all Driver proteins cataloged in the MLOsMetaDB database,93 which documents experimentally validated disordered and phase-separating regions. Beginning with 780 Driver proteins, we filtered the dataset to retain entries where disordered regions overlap with phase-separating segments, yielding 399 candidates across 40 species (Figure 4A). To further refine the dataset, we analyzed sequence identity, excluding homologous pairs with sequence identity exceeding 50%. This process resulted in a final set of 341 non-redundant candidates. These proteins play critical roles in mediating the formation of various MLOs, including P-bodies, stress granules, paraspeckles, and centrosomes (see Appendix: dMLO-IDR.csv).

AlphaFold2 score for structural order

The pLDDT scores for human proteins were retrieved from the AlphaFold Protein Structure Database111 by selecting the Homo sapiens organism (Reference Proteome ID: UP000005640).

ESM2 predictions for Mutational Preferences

We employed the code and pretrained parameters for ESM2 available from the model’s official GitHub repository at https://github.com/facebookresearch/esm to conduct mutational predictions. To optimize computational efficiency, we utilized the esm2_t33_650M_UR50D model, which has 650 million parameters and achieves prediction accuracy comparable to the larger 15B parameter model.67 In addition to calculating the log-likelihood ratio (LLR) for individual mutations, we defined an ESM2 score at each position to quantify mutational tolerance, formulated as

where LLRi denotes the LLR value for the i-th amino acid mutation type.
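One plausible reading of the entropy-based score described in the Results is sketched below. It assumes that the 20 LLR values are converted to probabilities with a softmax and that the score is the Shannon entropy of those probabilities in natural-log units. This functional form is an assumption made for illustration and should be checked against Eq. 1. The function name `esm2_score` is hypothetical.

```python
import math

def esm2_score(llrs):
    """Entropy (in nats) of softmax-normalized LLR values.

    Assumed form: p_i = exp(LLR_i) / sum_j exp(LLR_j),
    score = -sum_i p_i * ln(p_i). Not confirmed by the source.
    """
    shift = max(llrs)                               # for numerical stability
    weights = [math.exp(x - shift) for x in llrs]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# A fully tolerant site (all mutations as likely as wild type)
# versus a constrained site (all 19 mutations strongly disfavored).
uniform = esm2_score([0.0] * 20)
constrained = esm2_score([0.0] + [-10.0] * 19)
print(round(uniform, 3), round(constrained, 3))
```

Under this assumed form the score ranges from 0 (a single permitted residue) to ln 20 ≈ 3.0 (all residues equally likely), which is compatible with the 0.5 and 2.0 thresholds used in the Results.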

Evolutionary Sequence Analysis

We performed multi-sequence alignment (MSA) analysis using HHblits from the HH-suite3 software suite,112–115 a widely used open-source toolkit known for its sensitivity in detecting sequence similarities and identifying protein folds. HHblits builds MSAs through iterative database searches, sequentially incorporating matched sequences into the query MSA with each iteration.

The HH-suite3 software was obtained from its GitHub repository (https://github.com/soedinglab/hh-suite). Homologous sequences were identified through searches against the UniRef30 protein database (release 2023/02).116 For each query, we performed three iterations of HHblits searches, incorporating sequences from profile HMM matches with an E-value threshold of 0.001 into the query MSA in each cycle. Using a lower E-value threshold (closer to 0) ensures greater sequence similarity among the matches, while multiple iterations enhance the alignment’s depth and accuracy. The resulting alignments in A3M format were converted to CLUSTAL format using the reformat.pl script provided in HH-suite, aligning all sequences to a uniform length (Figure 3A). To refine alignment quality by focusing on closely related homologs, we filtered out sequences with ≤ 20% identity to the query, excluding weakly related sequences where only short segments show similarity to the reference.
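The identity filter can be sketched on already aligned sequences. The definition of identity used here (matches divided by the number of non-gap query columns) is an assumption, since the text does not state the denominator. Following the Methods wording, hits with identity at or below 20% are discarded.

```python
def percent_identity(query, hit, gap="-"):
    """Percent identity of an aligned hit relative to the query.

    Assumed definition: matching columns divided by the number of
    columns where the query has a residue.
    """
    if len(query) != len(hit):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(q, h) for q, h in zip(query, hit) if q != gap]
    matches = sum(1 for q, h in columns if q == h)
    return 100.0 * matches / len(columns)

def filter_homologs(query, hits, min_identity=20.0):
    """Keep hits whose identity to the query exceeds min_identity."""
    return [h for h in hits if percent_identity(query, h) > min_identity]

# Toy aligned sequences, not real data.
query = "ACDEFGHIKL"
hits = ["ACDEF-----", "A---------", "AC--------"]
kept = filter_homologs(query, hits)
print(kept)
```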

The conservation score CSi for position i was then calculated from the MSA as

$$\mathrm{CS}_i = \frac{n_i(\mathrm{ref{=}query})}{N_i(\mathrm{non\_gap})} \qquad (2)$$

where ni(ref=query) represents the number of aligned sequences in which the residue at position i matches the reference sequence, and Ni(non_gap) represents the total number of non-gap residues at position i across the aligned sequences.
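A minimal sketch of the conservation score calculation follows. It assumes the homologs have been reduced to the reference's columns (insertions removed), that gaps are written as "-", and that the reference itself is not counted among the aligned sequences. These are assumptions, as is the choice to return `None` for columns with no aligned residue.

```python
def conservation_scores(reference, homologs, gap="-"):
    """Per-position conservation score CS_i = n_i / N_i.

    reference: ungapped reference sequence.
    homologs:  aligned sequences of the same length as the reference.
    Returns None where no homolog has a residue at that position.
    """
    scores = []
    for i, ref_aa in enumerate(reference):
        column = [h[i] for h in homologs if h[i] != gap]    # N_i residues
        if not column:
            scores.append(None)
            continue
        n_match = sum(1 for aa in column if aa == ref_aa)   # n_i
        scores.append(n_match / len(column))
    return scores

# Toy alignment, not real data.
reference = "AKR"
homologs = ["AKR", "A-K", "GKR"]
cs = conservation_scores(reference, homologs)
print(cs)
```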

Motif Identification

We defined motifs as contiguous stretches of amino acid sequences with an average ESM2 score of 0.5 or lower. To identify these motifs within a given IDR, we implemented an iterative procedure. Starting from the N-terminus of the sequence, we first identified the initial residue with an ESM2 score below 0.5, denoted as i. From this position, residues were sequentially appended in the C-terminal direction until the cumulative segment’s average ESM2 score exceeded 0.5. The residue where this threshold was surpassed was denoted as j. The segment spanning from i to j – 1, i.e., (i, i + 1, …, j – 1), was then considered a candidate motif. This process was repeated starting from residue j, continuing iteratively until the C-terminal end of the IDR was reached.

When two motifs are in close proximity along the sequence, they may be merged into a single motif. Specifically, if the starting position of one motif is within 8 residues of the ending position of another, we define a candidate segment as the sequence spanning both motifs and the intervening residues. If the candidate segment’s average ESM2 score is below 0.5, it is included as a merged motif, replacing the individual motifs in the final list (Appendix: ESM2_motif_with_exp_ref.csv). The analyses shown in Figure 5 include all motifs with length n ≥ 4; however, varying the minimal motif length n does not alter the overall conclusions (Figure S6B).
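The scan-and-merge procedure can be sketched as follows. Several details are assumptions: positions are 0-based, a segment is extended while its running average stays at or below 0.5, "within 8 residues" is read as a start-to-end distance of at most 8, and merging is done greedily from the N-terminus. The function names are hypothetical.

```python
def find_motifs(scores, threshold=0.5):
    """Scan N- to C-terminus for candidate motifs as inclusive (start, end)."""
    motifs, pos, n = [], 0, len(scores)
    while pos < n:
        if scores[pos] >= threshold:       # motif must start below threshold
            pos += 1
            continue
        start, end, total = pos, pos, 0.0
        # Extend while the running average stays at or below the threshold.
        while end < n and (total + scores[end]) / (end - start + 1) <= threshold:
            total += scores[end]
            end += 1
        motifs.append((start, end - 1))    # residues i .. j-1
        pos = end                          # resume the scan at residue j
    return motifs

def merge_motifs(motifs, scores, max_gap=8, threshold=0.5):
    """Greedily merge neighbouring motifs if the spanning segment qualifies."""
    merged = []
    for start, end in motifs:
        if merged and start - merged[-1][1] <= max_gap:
            span_start = merged[-1][0]
            span = scores[span_start:end + 1]
            if sum(span) / len(span) < threshold:
                merged[-1] = (span_start, end)
                continue
        merged.append((start, end))
    return merged

def filter_by_length(motifs, min_len=4):
    """Keep motifs with at least min_len residues."""
    return [m for m in motifs if m[1] - m[0] + 1 >= min_len]

# Toy ESM2 score profile, not real data.
scores = [2.0, 0.1, 0.1, 0.1, 0.1, 3.0, 0.2, 0.2, 0.2, 0.2, 2.5]
candidates = find_motifs(scores)
final = filter_by_length(merge_motifs(candidates, scores))
print(candidates, final)
```

In this toy profile the two candidates are separated by a single high-scoring residue, and the spanning segment still averages below 0.5, so they are merged into one motif.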