Figures and data

Graphical summary of our functional characterisation of the human proteome.
(1) For all human proteins, we collected the corresponding sequence and predicted structure. We used these as input to ESM-1b and ESM-IF to generate variant effect scores probing conservation and stability. (2) We combined these data as input to FunC-ESMs, which assigns an output label to each variant. We used these labels to further assign a functional annotation to all residues. (3) We collected clinical annotations from ClinVar. We analysed these in light of the different ESM scores and used FunC-ESMs to provide insight into the mechanism of loss of function. (4) The resulting information is available in an online database hosted at the Electronic Research Data Archive (ERDA) at the University of Copenhagen (KU/UCPH) via https://sid.erda.dk/cgi-sid/ls.py?share_id=DUWFpyjZp0. The results can also be accessed with a Colab notebook available at https://github.com/KULL-Centre/_2024_cagiada-jonsson-func.
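As an illustration of step (2), the sketch below shows one way two per-variant scores could be combined into the three variant labels used in this work. The decision rule, the sign convention (lower scores are more damaging) and the function and parameter names are assumptions made for illustration only; the thresholds are left as required arguments because no values are stated here.

```python
def classify_variant(esm1b_score, esmif_score, esm1b_threshold, esmif_threshold):
    """Assign a variant label from two scores (illustrative rule, not the published one).

    Assumptions: lower scores are more damaging; the ESM-1b score reports on
    loss of function in general and the ESM-IF score on loss of stability.
    """
    if esm1b_score > esm1b_threshold:
        # Tolerated according to the conservation-like score.
        return "WT-like"
    if esmif_score > esmif_threshold:
        # Predicted to lose function while remaining stable.
        return "stable-but-inactive"
    # Predicted to lose function together with stability.
    return "total-loss"


# Placeholder thresholds of -1.0, chosen only to exercise the three branches.
labels = [
    classify_variant(-0.2, -0.3, -1.0, -1.0),
    classify_variant(-3.0, -0.3, -1.0, -1.0),
    classify_variant(-3.0, -4.0, -1.0, -1.0),
]
```

With these placeholder values, the three calls fall into the ‘WT-like’, ‘stable-but-inactive’ and ‘total-loss’ branches, in that order.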

Analysis of functionally relevant and structurally critical residues in the human proteome.
We compared the percentage of (A) functionally relevant and (B) structurally critical residues for proteins of different sizes. In each plot, the white dot in the centre of the distribution represents the median, the thick black bar represents the interquartile range and the thin black line represents the range of the data. The number of proteins in each subset is shown at the top of each violin plot. (C) Mean radial distribution functions (g(r)) for all analysed human proteins clustered by chain length. The blue lines show g(r) for the functionally relevant residues, the red lines for the structurally critical residues, and the grey lines for a control group. Below each plot we show in black the ratio of g(r) for the functionally relevant residues to that for the control group; values greater than one show that functionally relevant residues are more strongly clustered than the control group.

Molecular mechanisms of pathogenic variants.
(A) Assessment of the accuracy of the two ESM models as predictors of clinical pathogenicity using receiver operating characteristic analysis. We used the ESM-1b and ESM-IF models to separate the clinically annotated ‘benign’ and ‘pathogenic’ missense variants and calculated the area under the curve (AUC) for each model. (B) Classification of benign and pathogenic missense variants using FunC-ESMs: ‘WT-like’ (green), ‘stable-but-inactive’ (blue), and ‘total-loss’ (red). We further divide these variants into subgroups depending on whether they are located in (C) folded or (D) intrinsically disordered regions, and whether they are (E) buried or (F) exposed. The total number of variants in each subset is shown at the top of each subplot for the different conditions. All bars are annotated with both the absolute counts and percentages for each category.
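The AUC in panel (A) can be read as the probability that a randomly chosen pathogenic variant is scored as more damaging than a randomly chosen benign variant. The sketch below computes it by direct pairwise comparison. It assumes that lower scores are more damaging, and the function name is hypothetical; it is not the code used to produce the figure.

```python
def roc_auc(pathogenic_scores, benign_scores):
    """AUC as the fraction of (pathogenic, benign) pairs ranked correctly.

    Assumption: a lower score means a more damaging prediction.
    Ties between a pathogenic and a benign score count as half a correct pair.
    """
    if not pathogenic_scores or not benign_scores:
        raise ValueError("both classes need at least one variant")
    correct = 0.0
    for p in pathogenic_scores:
        for b in benign_scores:
            if p < b:
                correct += 1.0
            elif p == b:
                correct += 0.5
    return correct / (len(pathogenic_scores) * len(benign_scores))
```

The pairwise loop is quadratic in the number of variants; for large sets a rank-based formulation gives the same value more efficiently.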

Details of the threshold selection for the FunC-ESMs model.
(A) Scatter plot of Spearman’s correlations between the experimental results and the predictions of ESM-1b (x-axis) and ESM-IF (y-axis) for proteins included in the ProteinGym database. The red dotted lines indicate the systems (shown in green) used to select the threshold for ESM-1b. (B) Comparison between ESM-IF variant scores and ΔΔG measurements of over 200,000 variants (Tsuboyama et al., 2023). The red dotted line represents the threshold that we used to define destabilized variants. (C and D) ROC curves used to select the thresholds for the two ESM predictors used in our FunC-ESMs model. In each plot, the AUC is shown in the bottom right corner and a yellow dot indicates the true positive and false positive rates for the threshold we selected by maximising Youden’s index.
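Youden’s index is J = TPR − FPR, and the selected threshold is the one that maximises J along the ROC curve. The sketch below scans every observed score as a candidate threshold. It assumes that a variant is called positive when its score is at or below the threshold; the function name and the returned fields are hypothetical.

```python
def youden_threshold(scores, is_positive):
    """Return the threshold maximising Youden's J = TPR - FPR.

    Assumption: a variant is predicted positive when score <= threshold.
    """
    pairs = list(zip(scores, is_positive))
    n_pos = sum(1 for _, positive in pairs if positive)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes need at least one variant")
    best = None
    for threshold in sorted(set(scores)):
        true_pos = sum(1 for s, positive in pairs if positive and s <= threshold)
        false_pos = sum(1 for s, positive in pairs if not positive and s <= threshold)
        tpr = true_pos / n_pos
        fpr = false_pos / n_neg
        j = tpr - fpr
        # Keep the first threshold reaching the highest J seen so far.
        if best is None or j > best["youden_j"]:
            best = {"threshold": threshold, "youden_j": j, "tpr": tpr, "fpr": fpr}
    return best


# Toy data: three positives with low scores, three negatives with high scores.
result = youden_threshold(
    [-9.0, -8.0, -7.0, -2.0, -1.0, 0.0],
    [True, True, True, False, False, False],
)
```

The `tpr` and `fpr` fields of the result correspond to the coordinates of the dot marked on each ROC curve.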

Validation of FunC-ESMs and comparison to other methods.
(A) Comparison of predictions of stable-but-inactive (SBI) variants by FunC-ESMs (lavender) and the Functional Model (Cagiada et al., 2023) (green) for the proteins that are part of the dataset used to train the Functional Model. To make the comparison fair, for each protein we retrained the Functional Model with that protein excluded from the training set. (B) Comparison of FunC-ESMs (lavender), a classification based on GEMME and Rosetta (Cagiada et al., 2023) (orange), and the Functional Model (Cagiada et al., 2023) (green) in predicting the variant mechanism and residue classes. The comparison uses an experimental classification of the GRB2-SH3 system. The y-axis shows the value of each metric (defined on the x-axis for each bar series).

Validation of FunC-ESMs on data on glucokinase.
(A) Comparison of predictions from FunC-ESMs (lavender) and the Functional Model (Cagiada et al., 2023) (green) with experimental data for human glucokinase (Gersing et al., 2023a,b). (B) Cartoon visualisation of the classification using the experimental data (left), the Functional Model (centre) and FunC-ESMs (right) for human glucokinase. The upper part of the panel shows the full structure of glucokinase, with residues coloured according to the resulting residue class; a white colour indicates residues that were not classified in the experiment. The lower panel shows the area around the active site, including the glucose molecule (in yellow) and all side chains of residues closer than 7 Å.

Percentages of variants and residues in each FunC-ESMs class.
Distribution of (A, B, C) variants and (D, E, F) residues for (A, D) all residues, (B, E) residues in folded regions and (C, F) residues in intrinsically disordered regions of the human proteome.

Contribution of different amino acid types and substitutions to the different predicted classes in folded regions.
For each amino acid type, we present (A) the overall counts and (B) the counts normalised by residue frequencies in the folded regions, showing how each amino acid type contributes to the four position classes in the FunC-ESMs predictions. For each type of substitution (pair of ‘start’ and ‘end’ amino acid type), we show (C) the raw counts and (D) the normalised counts of variants falling into the three possible variant classes. The counts in (D) are normalised for each ‘start/end’ amino acid pair across the three different output labels.

Statistics on solvent exposure and secondary structure elements in the folded domains of the human proteome.
(A) Percentages of residues for each of the FunC-ESMs classes in exposed and buried regions and (B) percentages of each of the four residue classes in buried or exposed regions. (C) Cumulative distribution of the half-sphere exposure components, with the up-component representing the contacts mostly surrounding the side chains and the down-component representing the direction of the backbone. The median value for each class is reported as a dashed line.
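The two half-sphere exposure components in panel (C) can be sketched as follows. For a given residue, neighbouring Cα atoms within a cut-off radius are counted in the half-sphere pointing along the Cα→Cβ vector (up) or in the opposite half-sphere (down). This Cβ-based variant, the 13 Å default radius and the function name are assumptions for illustration; the variant and radius used for the figure are not stated here.

```python
import math


def half_sphere_exposure(ca_coords, cb_coords, index, radius=13.0):
    """Count neighbouring CA atoms in the up and down half-spheres of one residue.

    Assumptions: the side-chain direction is approximated by the CA->CB vector,
    so the residue at `index` needs a CB atom (glycine would need a virtual one);
    the radius is in angstrom and is an illustrative default.
    """
    ca_i = ca_coords[index]
    side_chain = [b - a for a, b in zip(ca_i, cb_coords[index])]
    up = down = 0
    for j, ca_j in enumerate(ca_coords):
        if j == index:
            continue
        offset = [c - a for a, c in zip(ca_i, ca_j)]
        if math.sqrt(sum(d * d for d in offset)) > radius:
            continue  # outside the sphere around this residue
        # The sign of the dot product decides which half-sphere the neighbour is in.
        if sum(s * d for s, d in zip(side_chain, offset)) >= 0:
            up += 1
        else:
            down += 1
    return up, down


# Toy coordinates: one neighbour on the side-chain side, one on the backbone
# side, and one beyond the cut-off radius.
ca = [(0.0, 0.0, 0.0), (0.0, 0.0, 5.0), (0.0, 0.0, -5.0), (0.0, 0.0, 20.0)]
cb = [(0.0, 0.0, 1.5), None, None, None]
up_count, down_count = half_sphere_exposure(ca, cb, 0)
```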