Deep mutational scanning and machine learning reveal structural and molecular rules governing allosteric hotspots in homologous proteins

  1. Megan Leander
  2. Zhuang Liu
  3. Qiang Cui (corresponding author)
  4. Srivatsan Raman (corresponding author)
  1. Department of Biochemistry, University of Wisconsin-Madison, United States
  2. Department of Physics, Boston University, United States
  3. Department of Chemistry, Boston University, United States
  4. Department of Bacteriology, University of Wisconsin-Madison, United States
  5. Department of Chemical and Biological Engineering, University of Wisconsin-Madison, United States
6 figures, 2 tables and 6 additional files

Figures

Figure 1 with 7 supplements
Allosteric hotspots in four bacterial allosteric transcription factors (aTFs) identified using deep mutational scanning (DMS).

(A) Nonfluorescent cells in the TetR, TtgR, MphR, and RolR single-mutant library were sorted (gray bar) in the presence (light shade) and absence (dark shade) of 1 µM anhydrotetracycline (aTC), 500 …

Figure 1—figure supplement 1
Experimental scheme for deep mutational scanning.

Protein-wide, single-site saturation mutagenesis of four TetR-like family allosteric transcription factors (aTFs) – TetR, TtgR, RolR, and MphR – using reporter-based screening followed by deep …

Figure 1—figure supplement 2
A detailed summary of all single-mutant phenotypes for every position within the proteins.

Heatmaps detailing the effect of all single mutants at every position in (A) TetR, (B) TtgR, (C) MphR, and (D) RolR are shown. Wild-type residues are black, mutations that do not affect protein …

Figure 1—figure supplement 3
Histograms of weighted scores and thresholds for identifying hotspots.

The distribution of weighted scores for every position in (A) TetR, (B) TtgR, (C) MphR, and (D) RolR is shown. Box and whisker plots above each histogram illustrate the spread of the data where …

Figure 1—figure supplement 4
Correlation of weighted scores between ×5 and ×10 read count thresholds.

The correlation of weighted scores for every position using a ×5 or ×10 read count threshold is shown for (A) TetR, (B) TtgR, (C) MphR, and (D) RolR. The red and orange lines illustrate the spread …

Figure 1—figure supplement 5
Distribution of allosteric hotspots in TetR homologs.

The percent of hotspots in the four main structural regions of the TetR homologs. Regions were broken into groups based on the crystal structures of TetR (PDB ID: 4AC0), TtgR (PDB ID: 2UXU), MphR …

Figure 1—figure supplement 6
Conservation of allosteric hotspots.

Average conservation score of all positions considered inactive or having no effect in (A) TetR, (B) TtgR, (C) MphR, and (D) RolR. Data are shown as mean ± SEM.

Figure 1—figure supplement 7
Comparison of experimental hotspots with predictions made by the Ohm server.

Allosteric hotspots of TetR, TtgR, MphR, and RolR determined by experiments (A, C, E, G) and the Ohm webserver (B, D, F, H) differ significantly. The Ohm webserver identifies critical residues along …

Figure 2 with 2 supplements
Hotspots enriched among long-range interactions (LRIs).

(A) Residue-residue contact map showing LRIs within each homolog. The LRIs are grouped by color, following standard k-means clustering, representing different regions of the protein. Inset shows …

Figure 2—figure supplement 1
Hotspot interactions are more likely to be long range than those of non-hotspots.

The percent of hotspot and non-hotspot residues participating in long-range interactions (LRIs) in each homolog protein.

Figure 2—figure supplement 2
Elbow method to determine the optimal number of clusters.

The optimal number of clusters to use for the k-means clustering of long-range interactions (LRIs) in each homolog was determined by iteratively calculating the variance within clusters for 1–25 …
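The elbow procedure in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: it runs a plain k-means on 2D points (standing in for residue-pair coordinates on a contact map) and records the within-cluster sum of squares for each cluster count. The function names, the deterministic farthest-point seeding, and the toy points are all assumptions made for the example.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two 2D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def init_centroids(points, k):
    """Deterministic farthest-point seeding (assumed here for reproducibility)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iters=100):
    """Lloyd's k-means; returns the final centroids."""
    centroids = init_centroids(points, k)
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append((
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                ))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return centroids


def wcss(points, centroids):
    """Within-cluster sum of squares: each point to its nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)


def elbow_curve(points, max_k):
    """WCSS for k = 1..max_k; the 'elbow' is where the curve flattens."""
    return [wcss(points, kmeans(points, k)) for k in range(1, max_k + 1)]


# Toy data (hypothetical): three tight 5x5 blobs of (i, j) pairs.
centers = [(10, 60), (40, 120), (90, 180)]
toy_points = [
    (cx + dx, cy + dy)
    for cx, cy in centers
    for dx in range(-2, 3)
    for dy in range(-2, 3)
]
curve = elbow_curve(toy_points, 6)
```

On the toy data the curve drops steeply up to three clusters and then flattens, which is the pattern the elbow method looks for. Real contact-map data would usually need several random restarts per cluster count rather than a single deterministic seeding.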

Figure 3 with 1 supplement
Mutational preferences and physicochemical properties of dead variants.

(A) Percentage of mutations (final mutated state) among dead (red) and not-dead (gray) variants from deep mutational scanning (DMS) data for all four homologs combined. (B) Comparison of …

Figure 3—figure supplement 1
Enrichment of mutations in allosterically dead or no effect variants.

Mutations in (A) TetR, (B) TtgR, (C) MphR, and (D) RolR were separated based on their effect on protein function, dead (red) or no effect (gray), and the proportion of each of the 20 amino acids …

Figure 4 with 8 supplements
Machine learning identifies structural and molecular features that differentiate allosteric hotspots.

(A) The full list of 27 features is shown at the top. The F scores (a measure of importance) of the features for each of the four allosteric transcription factors (aTFs) are shown below. (B) Frequency …

Figure 4—figure supplement 1
Global features have the highest Jensen-Shannon divergence (JSD).

The full list of 27 features is shown at the top. The JSDs (a measure of importance) of the features for each of the four allosteric transcription factors (aTFs) are shown below. JSD is a measure of …
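As a rough illustration of how a Jensen-Shannon divergence could be computed for one feature, the sketch below bins hotspot and non-hotspot values on a shared grid and applies the standard JSD definition with base-2 logarithms, so the result lies between 0 and 1. The binning scheme, bin count, and function names are assumptions for this example; the caption does not specify how the authors discretized the feature values.

```python
import math


def histogram(values, lo, hi, bins):
    """Normalized histogram of values over [lo, hi] with equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        idx = max(0, min(idx, bins - 1))  # clamp the right edge into the last bin
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def feature_jsd(hotspot_values, non_hotspot_values, bins=10):
    """JSD between the two groups' distributions of a single feature."""
    combined = list(hotspot_values) + list(non_hotspot_values)
    lo, hi = min(combined), max(combined)
    if hi == lo:
        return 0.0  # identical constant values: no separation
    p = histogram(hotspot_values, lo, hi, bins)
    q = histogram(non_hotspot_values, lo, hi, bins)
    return jsd(p, q)
```

A value near 0 means the feature is distributed the same way in both groups, while a value near 1 means the two distributions barely overlap.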

Figure 4—figure supplement 2
Distributions of z-scored feature values for TetR hotspots and non-hotspots (features 1–27).

The 27 plots show the distributions of z-scored feature values for TetR hotspots (hs) and non-hotspots (non-hs) for features 1–27, as labeled in the plot titles. The distributions of hotspots …

Figure 4—figure supplement 3
Distributions of z-scored feature values for MphR hotspots and non-hotspots (features 1–27).

The 27 plots show the distributions of z-scored feature values for MphR hotspots (hs) and non-hotspots (non-hs) for features 1–27, as labeled in the plot titles. The distributions of hotspots …

Figure 4—figure supplement 4
Distributions of z-scored feature values for TtgR hotspots and non-hotspots (features 1–27).

The 27 plots show the distributions of z-scored feature values for TtgR hotspots (hs) and non-hotspots (non-hs) for features 1–27, as labeled in the plot titles. The distributions of hotspots …

Figure 4—figure supplement 5
Distributions of z-scored feature values for RolR hotspots and non-hotspots (features 1–27).

The 27 plots show the distributions of z-scored feature values for RolR hotspots (hs) and non-hotspots (non-hs) for features 1–27, as labeled in the plot titles. The distributions of hotspots …

Figure 4—figure supplement 6
Average and best F1 scores of 4–10 feature combinations converge after 10 generations in the genetic algorithm feature selection.

The plots show the average and best F1 scores for 4–10 feature combinations as a function of generation in the genetic algorithm feature selection for the four homologous allosteric transcription …
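For readers unfamiliar with the metric, the sketch below computes an F1 score for binary hotspot labels, assuming the standard definition (the harmonic mean of precision and recall). The function name and the 0/1 label encoding are illustrative assumptions; this is only the metric, not the genetic algorithm or the neural network used for feature selection.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels, where 1 marks a hotspot and 0 a non-hotspot."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because hotspots are a minority class, F1 is more informative here than plain accuracy, which a classifier could inflate by predicting "non-hotspot" everywhere.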

Figure 4—figure supplement 7
Machine learning identifies structural and molecular features that differentiate allosteric hotspots.

Frequency of appearance of the 27 features in the top ten 1–10 feature combinations ranked by F1 score for each protein (labeled on top). Rows 2–28 correspond to features 1–27; row 1 is the average …

Figure 4—figure supplement 8
Positions of centrality peaks.

Plots of centrality against residue number for each protein (labeled by the title), with the four red stars marking the positions of centrality peaks 1–4 from left to right. The centrality peaks are …

Figure 5
Cross-protein prediction – predicting allosteric hotspots in one homolog using data from other homologs.

(A) Best cross-protein predictions without (CPP, yellow) and with transfer learning (CPP_TL, green) achieved for each protein using models trained with 1–10 features and different training data. The …

Figure 6 with 1 supplement
Predicting allosteric hotspots using homology models.

(A) Correlation between relative performance and the identity between the template protein and the target protein for modeling. (B) Correlation between relative performance and the root mean squared …

Figure 6—figure supplement 1
Sequence identity and root mean squared distance (RMSD) between template protein and target protein are anticorrelated.

Correlation of sequence identity and RMSD between the four allosteric transcription factors (aTFs) and their corresponding templates used in generating homology models. R squared shows the …
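A correlation of this kind can be summarized with a Pearson coefficient and its square. The sketch below is a generic illustration; the identity and RMSD numbers are hypothetical placeholders chosen to be perfectly linear, not values from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


# Hypothetical template/target pairs: percent identity and RMSD in angstroms.
identity_pct = [30, 40, 50, 60]
rmsd = [3.0, 2.5, 2.0, 1.5]

r = pearson_r(identity_pct, rmsd)
r_squared = r ** 2
```

A negative `r` corresponds to the anticorrelation named in the figure title, and `r_squared` is the fraction of variance in one quantity explained by a linear fit to the other.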

Tables

Table 1
Mutation phenotype prediction performance (a).

            TetR        MphR        RolR        TtgR
UniRep1900  0.50±0.01   0.65±0.00   0.57±0.01   0.43±0.01
feat1927    0.53±0.00   0.69±0.00   0.59±0.02   0.44±0.00
random      0.11        0.12        0.09        0.07

  a. Performance is the average over five repetitions of fivefold cross-validation. UniRep1900 and feat1927 show the best NN performance using only UniRep features and using UniRep features in combination with the 27 physical features, respectively. Data are presented as average ± std.
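The evaluation protocol in the footnote (five repetitions of fivefold cross-validation) can be sketched as below. This is a minimal illustration under stated assumptions: `score_fn` is a hypothetical callback that trains on the training indices and returns a score on the test indices, the splits are unstratified, and the standard deviation is taken over the five per-repetition means. The footnote does not say whether the authors computed the spread over repetitions or over all 25 folds.

```python
import random
import statistics


def kfold_indices(n, k, rng):
    """Shuffle indices 0..n-1 and deal them into k near-equal folds."""
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def repeated_kfold(n, score_fn, repeats=5, k=5, seed=0):
    """Return (mean, std) of per-repetition mean scores.

    score_fn(train_idx, test_idx) -> float is supplied by the caller.
    """
    rng = random.Random(seed)
    repetition_means = []
    for _ in range(repeats):
        folds = kfold_indices(n, k, rng)  # fresh shuffle for each repetition
        scores = []
        for i, test_idx in enumerate(folds):
            # Every fold except fold i forms the training set.
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(score_fn(train_idx, test_idx))
        repetition_means.append(statistics.mean(scores))
    return statistics.mean(repetition_means), statistics.stdev(repetition_means)
```

Repeating the split with different shuffles reduces the dependence of the reported score on any single partition of the data.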

Table 2
Hotspot prediction performance (a).

            TetR        MphR        RolR        TtgR
feat27      0.83±0.02   0.82±0.02   0.64±0.02   0.54±0.03
UniRep1900  0.61±0.07   0.50±0.02   0.32±0.03   0.35±0.03
random      0.19        0.16        0.26        0.21

  a. feat27 is the fitness of the best-performing feature combination that emerged from feature selection with the GA-NN approach. Performance is the average over five repetitions of fivefold cross-validation, presented as average ± std.
