Protein language model identifies disordered, conserved motifs implicated in phase separation
Figures
Predicting the mutational landscape of intrinsically disordered regions (IDRs) involved in membraneless organelle (MLO) formation (MLO-hProt) using ESM2.
(A) Schematic representation of the MLO-hProt database construction. The right panel shows the distribution of disordered and folded residues identified in each protein, where structural order is assigned using the AlphaFold2-predicted Local Distance Difference Test (pLDDT) score. (B) Workflow of the pretrained ESM2 model (left) (Lin et al., 2023) for predicting the mutational landscape of a given protein sequence (right). Upon receiving a protein sequence as input, ESM2 generates a log-likelihood ratio (LLR) for each mutation type at each residue position. Using the 20-element LLR vector, we compute the ESM2 score for each residue (Equation 1) to assess mutational tolerance.
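The LLR step in panel B can be illustrated with a minimal sketch. It assumes the common convention for masked language model variant scoring, in which the LLR of a substitution is the log-probability of the mutant residue minus that of the wild-type residue at the same position. The function name and the toy probabilities are hypothetical, and the per-residue summary of Equation 1 is not reproduced here because it is not given in this section.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def llr_vector(log_probs, wild_type):
    """Return the 20-element LLR vector for one sequence position.

    log_probs: dict mapping amino acid letter -> log-probability at this
    position (hypothetical input, e.g. derived from language model output).
    Assumed convention: LLR(aa) = log p(aa) - log p(wild type).
    """
    ref = log_probs[wild_type]
    return {aa: log_probs[aa] - ref for aa in AMINO_ACIDS}

# Toy position: 81% of the probability on the wild-type leucine,
# the remaining 19% spread evenly over the other 19 amino acids.
probs = {aa: 0.01 for aa in AMINO_ACIDS}
probs["L"] = 0.81
log_probs = {aa: math.log(p) for aa, p in probs.items()}
llr = llr_vector(log_probs, "L")
```

Under this convention the wild-type entry is always zero, and strongly negative entries mark substitutions the model considers unlikely.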
ESM2 predicts mutational landscape for structured and disordered residues.
(A) The ESM2 scores for amino acids in the human HP1α protein (UniProt ID: P45973) are presented, with residues having predicted Local Distance Difference Test (pLDDT) scores below 70 highlighted in blue to signify regions lacking a defined structure. (B) A detailed view of the mutational landscape across three regions with varying degrees of structural order. On the left, the AlphaFold2-predicted structure of the human HP1α protein is displayed in cartoon representation, with residues colored according to their pLDDT scores. Three specific regions, representing flexible disordered (residues 75–85), conserved disordered (residues 87–92), and folded (residues 120–130) segments, are highlighted in blue, orange, and red, respectively, using ball-and-stick styles. The panels on the right depict the ESM2 log-likelihood ratio (LLR) predictions for each of these regions. Histograms of pLDDT and ESM2 score distributions for structured (C) and disordered (D) residues are presented. Contour lines indicate free energy levels computed as −log P(pLDDT, ESM2), where P is the probability density of residues based on their pLDDT and ESM2 scores. Contours are spaced at 0.5-unit intervals to distinguish areas of differing density.
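The free energy contours in panels C and D are derived from a two-dimensional probability density. The sketch below shows one way such a surface could be built from paired pLDDT and ESM2 values. It is an illustration only: the bin edges, the toy data, the use of the natural logarithm, and the shift that sets the minimum to zero are all assumptions rather than details taken from the source.

```python
import math
from collections import Counter

def free_energy_surface(xs, ys, x_edges, y_edges):
    """Return {(i, j): -log P} for occupied bins, shifted so the minimum is 0.

    P is the histogram-estimated probability density in bin (i, j).
    Empty bins are omitted because -log(0) is undefined.
    """
    def bin_index(value, edges):
        # Index of the half-open bin [edges[k], edges[k + 1]) holding value.
        for k in range(len(edges) - 1):
            if edges[k] <= value < edges[k + 1]:
                return k
        return None

    counts = Counter()
    for x, y in zip(xs, ys):
        i, j = bin_index(x, x_edges), bin_index(y, y_edges)
        if i is not None and j is not None:
            counts[(i, j)] += 1

    total = sum(counts.values())
    energy = {}
    for (i, j), c in counts.items():
        area = (x_edges[i + 1] - x_edges[i]) * (y_edges[j + 1] - y_edges[j])
        energy[(i, j)] = -math.log(c / (total * area))
    f_min = min(energy.values())
    return {b: f - f_min for b, f in energy.items()}

# Toy data: four well-folded residues and two disordered ones.
plddt = [90, 92, 95, 91, 40, 45]
esm2 = [0.4, 0.6, 0.5, 0.3, 2.5, 2.8]
surface = free_energy_surface(plddt, esm2, [0, 50, 100], [0.0, 1.5, 3.0])

# Contour levels spaced at 0.5-unit intervals, as in the caption.
levels = [0.5 * k for k in range(1, 7)]
```

In this toy case the densest bin sits at zero and the other occupied bin lies log 2 above it, since it holds half as many residues in a bin of equal area.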
Fraction of different amino acids in structured and disordered residues identified from proteins in the MLO-hProt dataset (939).
The structured (predicted Local Distance Difference Test [pLDDT] >70) and the disordered (pLDDT ≤70) residues were identified using the AlphaFold2 pLDDT score.
ESM2 and AlphaFold2 predictions for all proteins in the dMLO-hProt dataset.
ESM2 scores for all amino acids in the dMLO-hProt proteins (full-length) are shown, with UniProt IDs indicated above each plot. Residues with predicted Local Distance Difference Test (pLDDT) ≤70 are colored blue (disordered). Of these residues, those with ESM2 scores ≤1.5 are colored plum (conserved disordered segments).
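The coloring rule used in these plots amounts to a two-threshold classification. A minimal sketch follows, using the cutoffs stated in the captions (pLDDT 70 and ESM2 score 1.5); the function name and label strings are hypothetical.

```python
def classify_residue(plddt, esm2_score, plddt_cutoff=70.0, esm2_cutoff=1.5):
    """Label a residue using the thresholds quoted in the figure captions."""
    if plddt > plddt_cutoff:
        return "structured"
    # Disordered residues (pLDDT <= cutoff) are split by ESM2 score.
    if esm2_score <= esm2_cutoff:
        return "conserved disordered"
    return "disordered"

labels = [classify_residue(p, s) for p, s in [(85.0, 0.3), (70.0, 1.5), (40.0, 2.7)]]
```

Both cutoffs are inclusive on the disordered and conserved side, matching the "≤70" and "≤1.5" wording above.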
Low ESM2 scores correlate strongly with evolutionary conservation.
(A) Estimating amino acid conservation using multiple sequence alignment. The conservation score calculation is demonstrated for the human HP1α protein along with a subset of its homologs found by HHblits. In the aligned sequences, missing residues appear as dashed lines, insertions are shown in lowercase letters, and mismatches are highlighted in red. The three rows below the alignment give, for each alignment position, the number of query-sequence residues that match the reference sequence, the total number of residues present at that position, and the conservation score calculated from Equation 2. The right panel illustrates the distribution of homolog counts found by HHblits for each MLO-hProt protein. (B) Histograms showing conservation and ESM2 score distributions for all residues in MLO-hProt, grouped by predicted Local Distance Difference Test (pLDDT) scores from AlphaFold2. The contour lines denote free energy levels, calculated as −log P(CS, ESM2), where P is the probability density of residues based on their conservation score (CS) and ESM2 score. Contours are spaced at 0.5-unit intervals to highlight regions of distinct density. (C) Correlation between mean conservation and ESM2 scores for amino acids classified by structural order levels. Pearson correlation coefficients, r, are reported in the legends.
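The per-position bookkeeping described for panel A can be sketched as follows. Equation 2 is not reproduced in this section, so the score here is assumed to be the ratio of matching residues to residues present, which is what the three rows under the alignment suggest. The function name and toy alignment are hypothetical, and whether the reference row itself is counted is left open.

```python
def conservation_scores(reference, homologs):
    """Return (matches, present, score) for each reference position.

    reference: ungapped reference sequence.
    homologs: aligned rows in which "-" marks a missing residue and
    lowercase letters mark insertions relative to the reference.
    Assumed score: matches / present (0.0 if no homolog covers the position).
    """
    # Drop insertions so each row has exactly one column per reference position.
    rows = ["".join(c for c in row if not c.islower()) for row in homologs]
    results = []
    for i, ref_aa in enumerate(reference):
        column = [row[i] for row in rows]
        present = sum(1 for c in column if c != "-")
        matches = sum(1 for c in column if c == ref_aa)
        score = matches / present if present else 0.0
        results.append((matches, present, score))
    return results

# Toy alignment: one identical homolog, one with an insertion and a
# mismatch, and one with a leading gap and a mismatch at the last position.
scores = conservation_scores("MKRV", ["MKRV", "MRaRV", "-KRI"])
```

Gapped positions shrink the denominator rather than counting as mismatches, so poorly covered positions are not penalized for missing data.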
Average ESM2 score for amino acids with different structural order, averaged over residues from proteins in the MLO-hProt dataset (939).
Phase separation-driving intrinsically disordered regions (IDRs) exhibit more conserved disorder.
(A) Distribution of ESM2 scores for disordered residues in proteins from the nMLO-hIDR, cMLO-hIDR, dMLO-hIDR, and dMLO-IDR datasets. Red dots indicate the mean values of the respective distributions. The selection of proteins in the dMLO-IDR dataset is shown in the right panel. See Methods for details of dataset preparation. (B) The classification of three IDR functional groups based on their overlap with the experimentally identified phase separation (PS) segments. (C) Violin plots of the ESM2 score distribution for residues in the three IDR groups: driving (blue), participating (orange), and non-participating (green). The left panel shows the corresponding violin plots of the conservation score (CS), using the same coloring scheme as in the right. Pairwise statistical comparisons were conducted using two-sided Mann–Whitney U tests on the ESM2 score distributions (null hypothesis: the two groups have equal medians). p-values indicate the probability of observing rank differences at least as extreme as those observed under the null hypothesis. Statistical significance is denoted as follows: ***p < 0.001; **p < 0.01; *p < 0.05; †p < 0.10 (marginal); n.s.: not significant, p ≥ 0.10.
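The significance notation in this legend maps a p-value to a symbol. A small helper capturing that mapping might look like the sketch below. The function name is hypothetical; the p-value itself would come from a two-sided Mann–Whitney U test, for example `scipy.stats.mannwhitneyu(x, y, alternative="two-sided")`, which is not called here so that the example has no dependencies.

```python
def significance_symbol(p_value):
    """Map a p-value to the symbol used in the figure legends."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    if p_value < 0.10:
        return "†"  # marginal
    return "n.s."

symbols = [significance_symbol(p) for p in (0.0004, 0.004, 0.04, 0.07, 0.3)]
```

The strict inequalities follow the legend, so a p-value of exactly 0.05 is reported as marginal rather than significant.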
Fraction of different amino acids across proteins from various datasets.
Panels (A, B) show amino acid composition of disordered and folded residues, respectively. For nMLO-/cMLO-/dMLO-hProt sets, order uses AlphaFold2-predicted Local Distance Difference Test (pLDDT); for dMLO-IDR, disorder is from MLOsmetaDB (Orti et al., 2024).
Disorder annotation of dMLO-IDR proteins in the MobiDB database (Di Domenico et al., 2012).
(A) Computational predictors (fraction disordered). (B) Experimental techniques (fraction disordered).
Disorder annotation of human proteins in MobiDB (Di Domenico et al., 2012).
(A) Top-left: pie chart of fraction annotated as disordered versus not for MLO-hIDR; top-right: histogram of lengths of the 24% lacking annotations; bottom: distributions analogous to Figure 4—figure supplement 2. (B) Same analyses for nMLO-hIDR.
Violin plots of conservation scores for disordered residues across nMLO-hIDR, cMLO-hIDR, dMLO-hIDR, and dMLO-IDR.
Two-sided Mann–Whitney U tests; significance: ***p < 0.001; **p < 0.01; *p < 0.05.
Statistics of three intrinsically disordered region (IDR) subgroups in dMLO-IDR.
(A) Counts (and circle areas) per subgroup; overlaps arise when subgroups share proteins. (B) Amino acid proportions across IDRs per subgroup.
Statistics of conserved amino acids in regions with varying contributions to phase separation.
(A) Fraction of conserved residues per region relative to total conserved per protein; ‘LLPS’ = driving/participating (Figure 4B), ‘non-LLPS’ = not involved. (B) Per-region probability of conservation (conserved count/segment length). Means shown with diamonds. Thresholds: ESM2 ≤0.5, ≤1.0, ≤1.5. Two-sided Mann–Whitney U tests; significance: ***p < 0.001; **p < 0.01; *p < 0.05; †p < 0.10 (marginal); n.s.: not significant, p ≥ 0.10.
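The per-region probability in panel B is the conserved count divided by the segment length, evaluated at each ESM2 threshold. A minimal sketch, with a hypothetical function name and made-up scores:

```python
def conserved_fraction(esm2_scores, threshold):
    """Fraction of residues in a segment with ESM2 score <= threshold."""
    if not esm2_scores:
        raise ValueError("segment must contain at least one residue")
    conserved = sum(1 for s in esm2_scores if s <= threshold)
    return conserved / len(esm2_scores)

# Toy segment of eight residues, evaluated at the three caption thresholds.
segment = [0.3, 0.9, 1.2, 2.4, 0.5, 1.6, 3.0, 1.0]
fractions = {t: conserved_fraction(segment, t) for t in (0.5, 1.0, 1.5)}
```

Because the thresholds are nested, the fraction can only grow or stay the same as the threshold is relaxed from 0.5 to 1.5.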
ESM2 identifies conserved motifs in driving intrinsically disordered regions (IDRs).
(A) Mean log-likelihood ratio (LLR) values for the 20 amino acids, calculated by averaging across all residues of each amino acid type. (B) Clustering of amino acids based on the two-dimensional UMAP embedding of their LLR vectors presented in part A. The dashed lines were added manually to make the separation between groups easier to see. (C) The percentage of conserved residues located in motifs for all amino acids. (D) Word cloud of motifs identified by ESM2. The word font size reflects the relative motif length, while the color represents the proportion of ‘sticker’ residues (Y, F, W, R, K, and Q) within each motif.
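The averaging in panel A groups per-residue LLR vectors by wild-type amino acid and takes the element-wise mean. The sketch below illustrates that grouping; the function name is hypothetical, and the toy vectors have two elements instead of 20 only to keep the example short.

```python
from collections import defaultdict

def mean_llr_by_residue_type(sequence, llr_rows):
    """Average LLR vectors over all residues sharing a wild-type amino acid.

    sequence: wild-type amino acid at each position.
    llr_rows: one LLR vector per position, all of equal length.
    """
    sums = {}
    counts = defaultdict(int)
    for aa, row in zip(sequence, llr_rows):
        if aa not in sums:
            sums[aa] = [0.0] * len(row)
        sums[aa] = [s + v for s, v in zip(sums[aa], row)]
        counts[aa] += 1
    return {aa: [s / counts[aa] for s in total] for aa, total in sums.items()}

# Toy input: two glycines and one alanine with 2-element LLR vectors.
mean_llr = mean_llr_by_residue_type("GAG", [[0.0, -2.0], [-1.0, 0.0], [0.0, -4.0]])
```

Stacking the resulting vectors, one row per amino acid type, gives the kind of matrix that a dimensionality-reduction method such as UMAP can then embed.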
Mean log-likelihood ratio (LLR) values across 20 amino acids.
(A) Structured residues from nMLO-hIDR (predicted Local Distance Difference Test [pLDDT] >70), (B) disordered residues from nMLO-hIDR (pLDDT ≤70), (C) driving intrinsically disordered regions (IDRs) from dMLO-IDR, and (D) non-participating IDRs from dMLO-IDR.
Distribution of conserved residues in motifs for driving intrinsically disordered regions (IDRs).
(A) Box plot comparing conserved amino acids in non-motifs versus motifs with scatter overlay. (B) Proportion conserved within motifs across minimum motif length thresholds.
ESM2 predictions as indicators of pathogenic mutations.
ClinVar (March 2024). (A) Log-likelihood ratio (LLR) distributions for mutations in folded versus disordered residues by ClinVar categories (Landrum et al., 2014; Landrum et al., 2018); red dashed line: LLR = −5.4 (ESM2 score 0.5 if uniform). (B) LLR distributions across membraneless organelle (MLO) databases (pathogenic vs. benign as defined). (C) ClinVar category pies for variants in motif versus non-motif regions of dMLO-IDR and MLO-hProt.
Comparison of the correlation between AlphaFold2 pLDDT scores and conservation scores with the correlation between ESM2 scores and conservation scores.
Calculations were performed using proteins in the MLO-hProt dataset. (A) Correlation between the mean AlphaFold2 pLDDT scores and conservation scores for various amino acids. Pearson correlation coefficients (r) are indicated in the figure legends. The four panels on the right present analogous correlation plots for amino acids grouped by structural order, as defined by their pLDDT scores. (B) Same as in part A, but for ESM2 scores.
Histograms of the ESM2 score and the conservation score, presented in a format consistent with Figure 3B of the main text.
The conservation scores were computed using aligned sequences with identity thresholds of ≥0%, ≥10%, ≥20%, and ≥40% (left to right). Contour lines represent different levels of −log P(CS, ESM2), where P is the joint probability density of conservation score (CS) and ESM2 score. Contours are spaced at 0.5-unit intervals, highlighting regions of distinct density.
Number of peripheral residues, and their length relative to the full motif length, identified from both directions.
(A) The unique motifs identified in the N-to-C terminal direction. (B) The unique motifs identified in the C-to-N terminal direction.
Violin plots illustrating the distribution of conservation scores for disordered residues across the nMLO-hIDR, cMLO-hIDR, dMLO-hIDR, and dMLO-IDR datasets.
Pairwise statistical comparisons were conducted using two-sided Mann–Whitney U tests on the conservation score distributions (null hypothesis: the two groups have equal medians). p-values indicate the probability of observing rank differences at least as extreme as those observed under the null hypothesis. Statistical significance is denoted as follows: ***p < 0.001; **p < 0.01; *p < 0.05.
Screenshot of information provided by the DisProt database.
Detailed annotations of biological functions and structural features, along with experimental references, are accessible via mouse click.
Additional files
- MDAR checklist
  - https://cdn.elifesciences.org/articles/105309/elife-105309-mdarchecklist1-v1.pdf
- Supplementary file 1
  dMLO-IDR motifs identified by low ESM2 fitness scores and corroborated by experimental evidence.
  - https://cdn.elifesciences.org/articles/105309/elife-105309-supp1-v1.csv
- Supplementary file 2
  Protein-level annotations from the MLO-hIDR dataset, including UniProt identifiers, organism, database source, LLPS roles, associated membraneless organelles (MLOs), intrinsically disordered regions (IDRs), LCRs, LLPS regions, and Pfam domains.
  - https://cdn.elifesciences.org/articles/105309/elife-105309-supp2-v1.csv
- Supplementary file 3
  Protein-level annotations from the dMLO-IDR dataset, including UniProt identifiers, organism, database source, LLPS roles, associated membraneless organelles (MLOs), intrinsically disordered regions (IDRs), LCRs, LLPS regions, and Pfam domains.
  - https://cdn.elifesciences.org/articles/105309/elife-105309-supp3-v1.csv