(A) ‘Fishtail plot’ of ΔΔG values versus allele frequencies for all variants listed in gnomAD (gray), as well as those analyzed by Takahashi et al. (2007); the latter are color-coded by DME. Note that the leftmost group of colored dots represents variants that have been reported in patients but are not recorded in gnomAD (thus their allele frequency in gnomAD is zero). Variants with common to intermediate frequencies are all predicted to be stable, while some rare variants are predicted to be destabilized. ΔΔGs for gnomAD variants are provided as source data (Figure 6—source data 1); those for variants characterized by Takahashi et al. (2007) are provided in Table 1 and as source data (Table 1—source data 1). (B) FoldX ΔΔG for benign (blue), likely benign (cyan), likely pathogenic (orange), and pathogenic (red) variants that are reported in ClinVar with ‘at least one star’ curation. The whiskers represent the mean and standard error of the mean. (C) Evolutionary sequence energies for ClinVar-reported variants, color scheme as in (B). The whiskers represent the mean and standard error of the mean. (D) Landscape of variant tolerance obtained by combining changes in protein stability (x axis) and evolutionary sequence energies (y axis), such that the upper right corner contains the variants most likely to be detrimental, while those in the lower left corner are predicted to be stable and are observed in MLH1 homologs. The green background density illustrates the distribution of all variants listed in gnomAD. The combination of metrics captures most non-functional variants (DME scores 0 or 1). Outliers are discussed in the main text. (E) Logistic regression model of FoldX ΔΔGs and evolutionary sequence energies. Pathogenic variants are shown in red, benign variants in blue. Dot shape indicates whether the pathogenicity of the respective variant was correctly predicted by a regression model trained on all but this data point (‘jackknife’; TP, true positives; FN, false negatives; FP, false positives; TN, true negatives).
Parameters for a model trained on the full dataset are: FoldX ΔΔG weight 0.52, evolutionary sequence energy weight 3.50, intercept −1.55. (F) ROC curves for the logistic regression model, FoldX ΔΔGs, evolutionary sequence energies, and the ensemble predictor REVEL, assessing their performance in separating benign from pathogenic variants. TPR, true positive rate. FPR, false positive rate. Standard deviations in AUC were determined by performing 100 ROC analyses on randomly sampled, balanced subsets containing equal numbers of positive and negative cases. (G) Integration of the potential effects these variants may have on splicing in their genomic context. Purple squares indicate pathogenic variants that are predicted to affect splicing (SpliceAI, threshold 0.5). No benign variants are predicted to affect splicing. Mapping to genomic loci (Yates et al., 2015) and SpliceAI scores for ClinVar entries used in this work are provided as source data (Figure 6—source data 2).
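
As an illustration of how the parameters quoted for panel (E) could be applied, the following is a minimal sketch that evaluates a two-feature logistic model with those weights. It assumes the standard logistic form, that the positive class is ‘pathogenic’, and a decision threshold of 0.5; none of these details are stated in the caption. The function names are hypothetical, and the feature scaling is assumed to match whatever was used when the model was fit.

```python
import math

# Parameters quoted in the caption for the model trained on the full dataset.
FOLDX_WEIGHT = 0.52      # weight on the FoldX ΔΔG
SEQ_ENERGY_WEIGHT = 3.50  # weight on the evolutionary sequence energy
INTERCEPT = -1.55


def pathogenic_probability(ddg: float, seq_energy: float) -> float:
    """Probability of the positive class under a standard logistic model.

    Assumption: the positive class is 'pathogenic' and the inputs are on
    the same scale as the features used to fit the model.
    """
    z = INTERCEPT + FOLDX_WEIGHT * ddg + SEQ_ENERGY_WEIGHT * seq_energy
    return 1.0 / (1.0 + math.exp(-z))


def classify(ddg: float, seq_energy: float, threshold: float = 0.5) -> str:
    """Label a variant; the 0.5 threshold is an assumption, not from the source."""
    if pathogenic_probability(ddg, seq_energy) >= threshold:
        return "pathogenic"
    return "benign"


def boundary_seq_energy(ddg: float) -> float:
    """Sequence energy at which the model is undecided (z = 0) for a given ΔΔG."""
    return -(INTERCEPT + FOLDX_WEIGHT * ddg) / SEQ_ENERGY_WEIGHT


if __name__ == "__main__":
    # Hypothetical input values chosen only to exercise the functions.
    for ddg, energy in [(0.0, 0.0), (3.0, 1.0)]:
        p = pathogenic_probability(ddg, energy)
        print(f"ddG={ddg:.1f}, energy={energy:.1f}: p={p:.3f} -> {classify(ddg, energy)}")
```

Under these assumptions, the line returned by `boundary_seq_energy` corresponds to the decision boundary separating the two predicted classes in the ΔΔG versus sequence energy plane; the larger weight on the sequence energy means a unit change in that feature shifts the log-odds more than a unit change in ΔΔG.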