Figures and data

Predicting the fitness landscape of IDRs involved in MLO formation using ESM2.
(A) Schematic representation of the MLO-hIDR database construction. The right panel shows the distribution of disordered and folded residues identified in each protein, where structural order is assigned using the AlphaFold2 pLDDT score. (B) Workflow of the pre-trained ESM2 model (left; ref. 68) for predicting the fitness landscape of a given protein sequence (right). Upon receiving a protein sequence as input, ESM2 generates a log-likelihood ratio (LLR) for each mutation type at each residue position. Using the 20-element LLR vector, we compute the ESM2 score for each residue (Eq. 1) to assess mutational tolerance.
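
The LLR step in panel B can be illustrated with a minimal sketch. It assumes that per-position log-probabilities over the 20 amino acids are already available from the model; the input format, the function names, and the toy distributions are hypothetical. Eq. 1 is not reproduced in this section, so the per-residue score below uses the plain mean of the LLR vector as a stand-in assumption, not as the paper's definition.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def llr_rows(sequence, log_probs):
    """Return one {amino acid: LLR} dict per position.

    log_probs[i][aa] is the model log-probability of amino acid `aa` at
    position i (hypothetical input format; produce it however ESM2 is run).
    LLR = log p(mutant) - log p(wild type), so the wild-type entry is 0.
    """
    rows = []
    for wild_type, lp in zip(sequence, log_probs):
        rows.append({aa: lp[aa] - lp[wild_type] for aa in AMINO_ACIDS})
    return rows


def residue_score(llr_row):
    # ASSUMPTION: stand-in for Eq. 1 (not shown in this section):
    # the plain mean of the 20-element LLR vector.
    return sum(llr_row.values()) / len(llr_row)


def toy_log_probs(wild_type, p_wt):
    """Toy distribution: p_wt on the wild type, the rest spread evenly."""
    p_other = (1.0 - p_wt) / (len(AMINO_ACIDS) - 1)
    return {aa: math.log(p_wt if aa == wild_type else p_other)
            for aa in AMINO_ACIDS}


# Position 1 (G): the model is indifferent, so mutations are tolerated.
# Position 2 (W): the model strongly prefers the wild type.
sequence = "GW"
log_probs = [toy_log_probs("G", 0.05), toy_log_probs("W", 0.81)]
rows = llr_rows(sequence, log_probs)
scores = [residue_score(row) for row in rows]
print(scores)
```

Under this stand-in definition, a more negative score marks a position where the model penalizes most substitutions, which is the sense in which a low score indicates low mutational tolerance.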

ESM2 predicts fitness landscape for structured and disordered residues.
(A) The ESM2 scores for amino acids in the human HP1α protein (UniProt ID: P45973) are presented, with residues having pLDDT scores below 70 highlighted in blue to signify regions lacking a defined structure. (B) A detailed view of the fitness landscape across three regions with varying degrees of structural order. On the left, the AlphaFold2-predicted structure of the human HP1α protein is displayed in cartoon representation, with residues colored according to their pLDDT scores. Three specific regions, representing flexible disordered (residues 75–85), conserved disordered (residues 87–92), and folded (residues 120–130) segments, are highlighted in blue, orange, and red, respectively, using ball-and-stick styles. The panels on the right depict the ESM2 LLR predictions for each of these regions. (C, D) Histograms of pLDDT and ESM2 score distributions for structured (C) and disordered (D) residues are presented. Contour lines indicate free energy levels computed as −log P (pLDDT, ESM2), where P is the probability density of residues based on their pLDDT and ESM2 scores. Contours are spaced at 0.5-unit intervals to distinguish areas of differing density.
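
The contour construction in panels C and D can be sketched as follows: bin the (pLDDT, ESM2 score) pairs into a two-dimensional histogram, normalize it to a probability density P, and take −log P per bin. The bin edges, the toy data, and the function names are illustrative assumptions; the binning actually used for the figure is not stated in the legend.

```python
import bisect
import math


def free_energy_grid(xs, ys, x_edges, y_edges):
    """Return F[i][j] = -log P for each 2D bin; empty bins map to math.inf."""
    nx, ny = len(x_edges) - 1, len(y_edges) - 1
    counts = [[0] * ny for _ in range(nx)]

    def bin_index(value, edges):
        if value < edges[0] or value > edges[-1]:
            return None  # outside the histogram range
        # The last bin is closed on the right.
        return min(bisect.bisect_right(edges, value) - 1, len(edges) - 2)

    for x, y in zip(xs, ys):
        i, j = bin_index(x, x_edges), bin_index(y, y_edges)
        if i is not None and j is not None:
            counts[i][j] += 1

    total = sum(map(sum, counts))
    if total == 0:
        raise ValueError("no data points fall inside the bin edges")

    grid = []
    for i in range(nx):
        row = []
        for j in range(ny):
            area = (x_edges[i + 1] - x_edges[i]) * (y_edges[j + 1] - y_edges[j])
            density = counts[i][j] / (total * area)  # probability density
            row.append(-math.log(density) if density > 0 else math.inf)
        grid.append(row)
    return grid


def contour_levels(grid, spacing=0.5):
    """Levels at multiples of `spacing` between the finite min and max of F."""
    finite = [v for row in grid for v in row if math.isfinite(v)]
    start = math.ceil(min(finite) / spacing) * spacing
    n = int((max(finite) - start) // spacing)
    return [start + k * spacing for k in range(n + 1)]


# Toy data: four well-structured residues and two disordered ones.
plddt = [90, 92, 95, 91, 30, 35]
esm2 = [-8.0, -7.5, -9.0, -8.2, -2.0, -7.0]
grid = free_energy_grid(plddt, esm2, x_edges=[0, 50, 100], y_edges=[-10, -5, 0])
levels = contour_levels(grid)
print(grid, levels)
```

Densely populated bins receive the lowest free energy, so contours drawn at 0.5-unit spacing step outward from the most populated region of the plot.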

Low ESM2 scores correlate strongly with evolutionary conservation.
(A) Estimating amino acid conservation using multiple sequence alignment. The conservation score calculation is demonstrated for the human HP1α protein along with a subset of its homologs found by HHblits. In the aligned sequences, missing residues appear as dashed lines, insertions are shown in lowercase letters, and mismatches are highlighted in red. The three rows below the alignment give, for each position i: n_i, the number of residues conserved between the reference sequence and the query sequences; N_i, the total number of existing residues; and CS_i, the conservation score calculated from Eq. 2. The right panel illustrates the distribution of homolog counts for each MLO-hIDR found by HHblits. (B) Histograms showing conservation and ESM2 score distributions for all residues in MLO-hIDR, grouped by pLDDT scores from AlphaFold2. The contour lines denote free energy levels, calculated as −log P (CS, ESM2), where P is the probability density of residues based on their conservation and ESM2 scores. Contours are spaced at 0.5-unit intervals to highlight regions of distinct density. (C) Correlation between mean conservation and ESM2 scores for amino acids classified by structural order levels. Pearson correlation coefficients, r, are reported in the legends.
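
The per-position counting in panel A can be sketched as below. Eq. 2 is not reproduced in this section, so the ratio CS_i = n_i / N_i used here is an assumption, as is the choice to count only the homologs and not the reference itself. The toy alignment and function names are hypothetical; lowercase letters are treated as insertions and "-" as a missing residue, following the description in the legend.

```python
def strip_insertions(aligned):
    """Drop lowercase insertion letters so columns line up with the reference."""
    return "".join(c for c in aligned if not c.islower())


def conservation_table(reference, homologs):
    """Return a list of (n_i, N_i, CS_i) tuples, one per reference position.

    ASSUMPTION: CS_i = n_i / N_i, standing in for Eq. 2 (not shown here).
    """
    columns = [strip_insertions(h) for h in homologs]
    for seq in columns:
        if len(seq) != len(reference):
            raise ValueError("aligned homolog does not match reference length")

    table = []
    for i, ref_aa in enumerate(reference):
        present = [seq[i] for seq in columns if seq[i] != "-"]
        N_i = len(present)                               # residues that exist
        n_i = sum(1 for aa in present if aa == ref_aa)   # residues that match
        cs_i = n_i / N_i if N_i else float("nan")
        table.append((n_i, N_i, cs_i))
    return table


# Toy alignment: "a" in the third homolog is an insertion, "-" is a gap.
reference = "MKEG"
homologs = ["MKEG", "MRE-", "M-EaG", "LKDG"]
table = conservation_table(reference, homologs)
for position, row in enumerate(table, start=1):
    print(position, row)
```

Positions where no homolog has a residue are reported as NaN here so that they can be excluded from downstream averages; that handling is also a choice made for this sketch.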

IDRs driving phase separation exhibit more conserved disorder.
(A) Distribution of ESM2 scores for disordered residues in proteins from the nMLO-hIDR, cMLO-hIDR, dMLO-hIDR, and dMLO-IDR datasets. Red dots indicate the mean values of the respective distributions. The selection of proteins in the dMLO-IDR dataset is shown in the right panel. See Methods for details of dataset preparation. (B) Classification of IDRs into three functional groups based on their overlap with experimentally identified phase separation (PS) segments. (C) Violin plots of the ESM2 score distributions for residues in the three IDR groups: driving (blue), participating (orange), and non-participating (green). (D) Violin plots of the conservation score (CS) distributions for residues in the three IDR groups, using the same color scheme as in C.

ESM2 identifies functional motifs in driving IDRs.
(A) Mean LLR values for the 20 amino acids, calculated by averaging across all residues of each amino acid type. (B) Clustering of amino acids based on the two UMAP embedding dimensions of their LLR vectors presented in panel A. The dashed lines were added manually to visualize the separation between groups. (C) The percentage of conserved residues located in motifs, for each amino acid. (D) Word cloud of motifs identified by ESM2. The font size reflects the relative motif length, while the color represents the proportion of “sticker” residues (Y, F, W, R, K, and Q) within each motif.
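
Two of the quantities in this legend are simple enough to sketch directly: the per-type mean LLR vectors of panel A and the sticker proportion used to color the motifs in panel D. The function names, the list-based LLR format, and the toy inputs are illustrative assumptions; the sticker set is taken from the legend.

```python
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
STICKERS = set("YFWRKQ")  # sticker residues as listed in the legend


def mean_llr_by_type(sequence, llr_vectors):
    """Average the 20-element LLR vectors over all residues of each type.

    llr_vectors[i] is the LLR vector at position i, ordered as AMINO_ACIDS
    (hypothetical input format). Returns {amino acid: mean LLR vector}.
    """
    sums = defaultdict(lambda: [0.0] * len(AMINO_ACIDS))
    counts = defaultdict(int)
    for wild_type, vector in zip(sequence, llr_vectors):
        counts[wild_type] += 1
        for k, value in enumerate(vector):
            sums[wild_type][k] += value
    return {aa: [s / counts[aa] for s in sums[aa]] for aa in counts}


def sticker_fraction(motif):
    """Proportion of sticker residues within a motif."""
    return sum(aa in STICKERS for aa in motif) / len(motif)


# Toy input: two glycines with different LLR vectors and one tyrosine.
sequence = "GYG"
llr_vectors = [[-1.0] * 20, [-6.0] * 20, [-3.0] * 20]
means = mean_llr_by_type(sequence, llr_vectors)
print(means["G"][:3], means["Y"][:3])
print(sticker_fraction("RGYG"), sticker_fraction("GGSG"))
```

The resulting dictionary holds one 20-element vector per amino acid type present in the input, which is the kind of input a dimensionality-reduction step such as the UMAP projection in panel B would take.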