Overview of VAMP-seq datasets and proteins. (A) Structures of proteins previously studied with VAMP-seq coloured by average VAMP-seq abundance score per residue. Averages are shown for all residues with abundance scores available, even if only one or a few scores have been reported for the residue. Homodimer structures are shown for NUDT15 and ASPA, and residues with no reported scores are shown in grey. (B–E) Overview of VAMP-seq dataset sizes, completeness and noise levels, showing specifically (B) the average number of variant scores per residue position, with 19 (indicated with dashed line) being the highest possible value, (C) mutational completeness, that is, the percentage of all possible single residue substitution variants with an abundance score in the dataset, (D) the total number of single residue substitution abundance scores per protein, and (E) Pearson’s correlation coefficient r between originally reported and resampled abundance score datasets, with resampling based on reported abundance score standard deviations.
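The quantities in panels C and E can be illustrated with a minimal sketch. The function names are hypothetical, and drawing each resampled score from a Gaussian centred on the reported score is an assumption; the caption only states that resampling was based on the reported standard deviations.

```python
import random
from math import sqrt

def mutational_completeness(n_scored_variants, n_residues):
    """Percentage of the 19 * n_residues possible single substitutions with a score."""
    return 100.0 * n_scored_variants / (19 * n_residues)

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def resampled_correlation(scores, sds, seed=0):
    """Correlate reported scores with one resampled dataset.

    Assumption: each resampled score is drawn from a normal distribution
    with the reported score as mean and the reported SD as width.
    """
    rng = random.Random(seed)
    resampled = [rng.gauss(s, sd) for s, sd in zip(scores, sds)]
    return pearson_r(scores, resampled)
```

In this sketch, larger reported standard deviations lower the expected r, which is why r can serve as a rough noise indicator for a dataset.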

Average abundance score substitution matrices predict unseen abundance data. Average abundance scores for all possible amino acid residue substitution types for residues sitting in either (A) structurally buried or (B) solvent-exposed environments in the wild-type protein structures. The bar plot to the left of each substitution matrix indicates the average number of abundance scores that were used to calculate score averages in each row of the matrix. Amino acids are sorted according to row averages in the global substitution matrix (Fig. S7). (C) Correlations between experimental abundance scores and abundance scores predicted from average abundance score substitution matrices using leave-one-protein-out cross-validation. The matrices used for predictions were constructed using all data (Global) and for different residue categories based on residue solvent-exposure (Exposure), secondary structure (Secondary) or both solvent-exposure and secondary structure (Combined). The vertical black bars mark the median correlation coefficient for each category. (D) Correlations between experimental abundance scores and abundance scores predicted from average abundance score substitution matrices for residues sitting in either buried or exposed environments. Scores were predicted for each protein (indicated above each plot) using average abundance score substitution matrices calculated using an increasing number of VAMP-seq datasets. Orange data points mark the prediction result from a single specific combination of datasets, and the average for each dataset count is indicated with black circles.
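The cross-validation scheme in panel C can be sketched as follows. The record layout `(protein, wild_type, mutant, score)` and the function names are hypothetical, and only the Global matrix is shown; environment-specific matrices would apply the same averaging to the records of each residue category.

```python
from collections import defaultdict

def average_matrix(records):
    """Map each (wild_type, mutant) substitution type to its mean score."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for _protein, wt, mut, score in records:
        sums[(wt, mut)] += score
        counts[(wt, mut)] += 1
    return {key: sums[key] / counts[key] for key in sums}

def leave_one_protein_out(records):
    """For each protein, predict its scores from a matrix built without it.

    Returns {protein: [(experimental, predicted), ...]}. Substitution types
    absent from the training data are skipped (an assumption of this sketch).
    """
    results = {}
    for held_out in sorted({r[0] for r in records}):
        train = [r for r in records if r[0] != held_out]
        matrix = average_matrix(train)
        results[held_out] = [
            (score, matrix[(wt, mut)])
            for protein, wt, mut, score in records
            if protein == held_out and (wt, mut) in matrix
        ]
    return results
```

The (experimental, predicted) pairs for each held-out protein can then be passed to any correlation function to obtain the per-protein coefficients.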

Analysis of average abundance score substitution matrices. (A) Hierarchical clustering of average abundance score substitution matrices for buried and solvent-exposed environments. Clustering was performed along both axes of the matrices. Grey squares indicate a synonymous substitution. (B) Principal component analysis of substitution profiles describing the effects of all substitutions from and to each of the twenty amino acid residues. The analysis was performed using average abundance scores for residues in buried and exposed environments separately. (C) Analysis of substitution profiles of loop residues with backbone dihedral angles similar to those found in left-handed helices. The heat map shows average abundance scores for all loop residues with left-handed helix-like backbone dihedral angles in the six proteins. Grey squares indicate a synonymous substitution as well as no data. The plot below the heat map shows average abundance scores for substitutions to each of the twenty amino acid residues for this particular class of loop residues, with error bars indicating the standard deviation over all abundance scores for the substitution type. The left bar plot shows, for each residue type, the total number of loop residues found across the six proteins to adopt the left-handed helix-like backbone conformation.

Homodimerisation stabilises NUDT15 and ASPA in vivo. RMSD between abundance score substitution profile and average abundance scores in buried (orange) and exposed (green) environments for residues in (A) the NUDT15 dimer interface and (B) the ASPA dimer interface. For each interface residue, the average abundance score (grey) is also shown with error bars that indicate the abundance score standard deviation at the position. If RMSDexposed is smaller than RMSDburied, the abundance score substitution profile of the residue resembles the average profile of an exposed residue more than that of a buried residue of the same amino acid type, and vice versa. RMSD values were only calculated for residues with at least five reported abundance scores. (C) NUDT15 dimer with one monomer represented by cartoon of backbone and one monomer shown in transparent surface representation. Side chains are shown for all interface residues, and interface residues with RMSDburied smaller than RMSDexposed are shown in orange, while interface residues with RMSDexposed smaller than RMSDburied are shown in green. (D, E) ASPA dimer shown from two different angles represented similarly to the NUDT15 dimer.
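The RMSD comparison described above can be sketched as follows. The dictionary-based profile layout and function names are hypothetical, and counting only substitutions present in both the residue profile and the average profile is an assumption of this sketch.

```python
from math import sqrt

def profile_rmsd(residue_scores, average_profile, min_scores=5):
    """RMSD between one residue's substitution profile and an average profile.

    residue_scores: {mutant: abundance score} for a single residue position.
    average_profile: {mutant: average score} for the same wild-type amino
    acid type in a given environment (buried or exposed).
    Returns None when fewer than min_scores substitutions can be compared.
    """
    shared = [m for m in residue_scores if m in average_profile]
    if len(shared) < min_scores:
        return None
    squared = sum((residue_scores[m] - average_profile[m]) ** 2 for m in shared)
    return sqrt(squared / len(shared))

def closest_environment(residue_scores, buried_profile, exposed_profile):
    """Return 'buried' or 'exposed', whichever average profile is closer."""
    rmsd_buried = profile_rmsd(residue_scores, buried_profile)
    rmsd_exposed = profile_rmsd(residue_scores, exposed_profile)
    if rmsd_buried is None or rmsd_exposed is None:
        return None
    return "buried" if rmsd_buried < rmsd_exposed else "exposed"
```

An interface residue classified as 'buried' here tolerates substitutions about as poorly as a typical buried residue of the same type would.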

VAMP-seq abundance score distributions shown separately for each of the six datasets included in our analysis. The datasets only include scores for single residue substitution variants at residue sites in soluble protein regions.

VAMP-seq abundance score standard deviations plotted as a function of abundance scores separately for each VAMP-seq dataset. The standard deviations were calculated using a varying number of experimental replicates in the original VAMP-seq publications.

Summary of protein structure and sequence compositions. (A) The per-protein fractions of residues forming different types of secondary structure are shown alongside the fractions of residues found at structurally buried and solvent-exposed sites. For NUDT15 and ASPA, fractions were calculated using homodimer structures. The six proteins have relatively similar structural compositions, with exceptions such as the relative helical enrichment in CYP2C9 and the enrichment of loops, and thereby also of exposed sites, in PRKN. (B) Protein sequence compositions are similarly shown as residue type fractions with respect to the total number of residues in each protein.

rASA and WCN correlate with single residue substitution abundance scores. Abundance scores for all single residue substitution variants are plotted against residue (A) rASA and (B) WCN values calculated using the wild-type protein structures. Pearson’s correlation coefficient r and Spearman’s rank correlation coefficient rs between abundance scores and rASA or WCN values were calculated for each protein and are shown in the individual plots.

rASA and WCN correlate with residue-averaged abundance scores. Residue-averaged abundance scores are plotted against residue (A) rASA and (B) WCN values calculated using the wild-type protein structures. Pearson’s correlation coefficient r and Spearman’s rank correlation coefficient rs between residue-averaged abundance scores and rASA or WCN were calculated for each protein and are shown in the individual plots. Averages are shown for all residues with abundance scores available, even if only one or few abundance scores have been reported for a given residue.

Grid search for selection of structure feature cutoffs to classify residues as solvent-exposed or structurally buried. (A) Pearson’s correlation coefficient r between experimental and predicted abundance scores, with predictions based on exposure-based substitution matrices of average abundance scores calculated using different rASA and WCN cutoff values to classify residues as buried or exposed. Predictions were made using leave-one-protein-out cross-validation, and r is an average of results obtained for each of the proteins in our analysis. (B) Mean absolute error (MAE) between experimental and predicted abundance scores for the same predictions as in panel A. Based on the results shown in panels A and B, we have used an rASA cutoff value of 0.1 to classify residues as solvent-exposed or buried in all other analyses in this work. (C, D) Results for a grid search analysis similar to that presented in panels A and B, but using median rather than average abundance scores to construct substitution matrices. We explored whether using exposure-based substitution matrices of median abundance scores rather than average abundance scores would result in higher prediction accuracy. We found that, for our selected cutoffs, both the r and the MAE between experimental and predicted abundance scores are highly similar for the median- and average-based matrices.
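The cutoff selection can be illustrated with a simplified sketch that covers the rASA cutoff only. Treating residues with rASA equal to the cutoff as buried is an assumption, and the `evaluate` callable is a hypothetical stand-in for the full cross-validated prediction described above.

```python
def classify_residue(rasa, cutoff=0.1):
    """Label a residue by relative accessible surface area.

    Assumption: residues with rASA at or below the cutoff count as buried.
    """
    return "buried" if rasa <= cutoff else "exposed"

def mean_absolute_error(experimental, predicted):
    """MAE between paired experimental and predicted abundance scores."""
    pairs = list(zip(experimental, predicted))
    return sum(abs(e - p) for e, p in pairs) / len(pairs)

def grid_search(cutoffs, evaluate):
    """Return the cutoff with the lowest error.

    evaluate: hypothetical callable mapping a cutoff to an error value,
    for example the cross-validated MAE obtained with that cutoff.
    """
    return min(cutoffs, key=evaluate)
```

The same loop could maximise r instead of minimising the MAE by passing an `evaluate` that returns the negative correlation coefficient.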

Average abundance score substitution matrices calculated using the six VAMP-seq datasets. The average score matrices were calculated by averaging over all scores, scores for residues forming helix, scores for residues forming strand and scores for residues forming loop structure, with structural context evaluated on wild-type structures. In each panel, the bar plot to the left of the average score matrix shows the average number of data points used to calculate the averages in a matrix row.

VAMP-seq abundance scores plotted against abundance score predictions obtained from average abundance score substitution matrices constructed without consideration of the structural context of the wild-type residue (Global in Fig. 2). All predictions were performed using leave-one-protein-out cross-validation. Data point density is shown using an upper limit of 50 on the colour scale, meaning that a bin will be coloured red if it contains 50 data points or more.

VAMP-seq abundance scores plotted against abundance score predictions obtained from average abundance score substitution matrices constructed by considering the degree of solvent-exposure of the wild-type residue (Exposure in Fig. 2). All predictions were performed using leave-one-protein-out cross-validation. Data point density is shown using an upper limit of 50 on the colour scale, meaning that a bin will be coloured red if it contains 50 data points or more.

VAMP-seq abundance scores plotted against abundance score predictions obtained from average abundance score substitution matrices constructed by considering the secondary structure context of the wild-type residue (Secondary in Fig. 2). All predictions were performed using leave-one-protein-out cross-validation. Data point density is shown using an upper limit of 50 on the colour scale, meaning that a bin will be coloured red if it contains 50 data points or more.

VAMP-seq abundance scores plotted against abundance score predictions obtained from average abundance score substitution matrices constructed by considering both the degree of solvent-exposure and the secondary structure context of the wild-type residue (Combined in Fig. 2). All predictions were performed using leave-one-protein-out cross-validation. Data point density is shown using an upper limit of 50 on the colour scale, meaning that a bin will be coloured red if it contains 50 data points or more.

Variant abundance scores correlate with calculated ΔΔG values. ΔΔG values for all single residue substitution variants in each of the six proteins were estimated using Rosetta and correlate moderately well with variant abundance scores, as indicated by the rs value per dataset in the individual plots. Data point density is shown using an upper limit of 130 on the colour scale.

Average abundance score substitution matrices for (A) buried, (B) exposed and (C) loop environments calculated without abundance scores from the PRKN VAMP-seq dataset, but including all data from the five remaining datasets.

Heatmaps indicating the degree of symmetry across the diagonal in average abundance score substitution matrices. For each of the three residue environments (global, buried and exposed), the symmetry of a given substitution type was evaluated by subtracting the average abundance score for the reverse substitution from the average abundance score of the forward substitution. Hence, white indicates that a given residue pair behaves symmetrically, red indicates that the substitution is less disruptive in the forward than in the reverse direction, and blue means that the substitution is more disruptive in the forward than in the reverse direction.
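The forward-minus-reverse calculation can be written as a short sketch. The dictionary layout keyed by `(wild_type, mutant)` is hypothetical, and skipping pairs for which only one direction has data is an assumption.

```python
def asymmetry(matrix):
    """Forward minus reverse average score for each substitution type.

    matrix: {(wild_type, mutant): average abundance score}.
    A value near zero means the pair behaves symmetrically; a positive
    value means the forward substitution has the higher average score.
    """
    return {
        (wt, mut): matrix[(wt, mut)] - matrix[(mut, wt)]
        for (wt, mut) in matrix
        if wt != mut and (mut, wt) in matrix
    }
```

By construction, the entry for a reverse substitution is the negative of the entry for the corresponding forward substitution.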

Correlations between experimental abundance scores and abundance scores predicted from the six average abundance score substitution matrices that consider both residue burial and secondary structure context of the wild-type residue. Scores were predicted for each protein (plot title) using average abundance score substitution matrices calculated with an increasing number of VAMP-seq datasets, though always leaving out the dataset for which predictions are performed. Orange data points mark prediction results from a single specific combination of datasets, and black data points are the average of the orange ones.

Analysis of the impact of amino acid helix propensity on average abundance scores. Pearson’s correlation coefficient, r, between the average abundance scores for residues sitting in α-helical structure (either all α-helical structure or in only buried or exposed α-helical structure) and substitution matrices constructed from different helix propensity scales reporting on the ΔΔG of helix residue substitutions. Helix propensity scales were obtained from and are named as in previous work comparing the different scales (Pace and Scholtz, 1998). The Explt scale reports on helix propensities for solvent-exposed, non-capping residues (Pace and Scholtz, 1998), the AK/AQ (Rohl et al., 1996) and AGADIR (Muñoz and Serrano, 1995) scales were constructed from experimental data on peptides in solution, and the C. & F. (Chou and Fasman, 1978) and Williams (Williams et al., 1987) scales are based on the frequency of amino acid occurrence in α-helices, buried as well as exposed. Finally, the Luque scale was originally derived from structure-based thermodynamic calculations and reports on helix formation propensities for residues in solvent-exposed helices (Luque et al., 1996).

Comparison of average abundance score substitution matrices across proteins and residue environments. The mean absolute error and correlation coefficient were calculated between pairs of substitution matrices considering all abundance score averages in the two matrices to evaluate matrix similarity. (A) Comparison of buried environment matrices that were each calculated using data from only a single protein. Protein names on the axes thus indicate the single dataset used for the substitution matrix calculation. (B) Comparison of exposed environment matrices each calculated with data from a single protein. (C) Comparison of buried environment matrices, with matrices calculated using five out of six VAMP-seq datasets. Protein names on the axes indicate which dataset was omitted from the substitution matrix calculation. (D) Comparison of exposed environment matrices, with matrices calculated using five out of six VAMP-seq datasets. (E) Comparison of matrices for different structural environments using matrices based on all data from all six proteins.

Discovering residues with abundance score substitution profiles non-typical for their structural environment. Each data point in the plots corresponds to a single residue coloured by its wild-type structure environment type and for which we calculated RMSDexposed and RMSDburied. Solvent-exposed residues (blue) are expected to have RMSDexposed < RMSDburied and thus appear below the plot diagonals, while the opposite is true for buried residues (yellow). Blue data points falling above the diagonal thus correspond to solvent-exposed residues with substitution profiles most similar to the average profile of buried residues, and yellow data points below the diagonal represent buried residues with substitution profiles relatively similar to the average profile of exposed residues. We quantified the extent to which buried and exposed residues respectively satisfy RMSDburied < RMSDexposed and RMSDexposed < RMSDburied by calculating the balanced classification accuracy (Acc.) per dataset. Data are only shown for residues with at least five measured abundance scores.
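The balanced classification accuracy can be sketched as the mean of the per-class accuracies. The input layout is hypothetical, and counting residues with equal RMSD values as exposed is an assumption of this sketch.

```python
def balanced_accuracy(residues):
    """Mean of the fractions of buried and exposed residues classified correctly.

    residues: iterable of (environment, rmsd_buried, rmsd_exposed), where
    environment is 'buried' or 'exposed' from the wild-type structure.
    Both environments must be represented at least once.
    """
    correct = {"buried": 0, "exposed": 0}
    total = {"buried": 0, "exposed": 0}
    for environment, rmsd_buried, rmsd_exposed in residues:
        total[environment] += 1
        predicted = "buried" if rmsd_buried < rmsd_exposed else "exposed"
        if predicted == environment:
            correct[environment] += 1
    return 0.5 * (
        correct["buried"] / total["buried"]
        + correct["exposed"] / total["exposed"]
    )
```

Averaging the two class accuracies keeps the larger of the two classes from dominating the result when a protein has unequal numbers of buried and exposed residues.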

Protein structures with solvent-exposed residues found to have a smaller RMSDburied than RMSDexposed shown in dark orange. To make the results easier to interpret visually, only residues with a difference between RMSDexposed and RMSDburied larger than 0.05 are coloured orange here. NUDT15 and ASPA are shown as homodimers, but the exposed residues discovered in the analysis are shown for only one of the monomers in the complexes.

Overview of structure and sequence data used for analysis. Protein sequences are given by their UniProt identifiers and crystal structure input files (Lee et al., 1999; Wu et al., 2007; Wester et al., 2004; Carter et al., 2015; Le Coq et al., 2008; Kumar et al., 2015) by their PDB identifiers. We used the PDB file chains and residues noted in the last two columns.