Figures and data

Proteins with the highest frequency of mutation events are associated with antibiotic resistance.
(A) Workflow used to create our combined dataset of non-synonymous substitutions and in-frame indel mutations mapped to protein 3D structures, covering 92% of the M. tuberculosis H37Rv proteome. Starting from a dataset of homoplastic mutations from 31,428 MTBC isolates17,29, we mapped the mutations to protein sequences and then to 3D structures drawn from a combination of experimentally determined (RCSB PDB30) and computationally predicted (AlphaFold31) structures. (B) The total number of non-synonymous mutation events per protein versus the percentage of amino acids in the protein’s structure that have been mutated at least once. Marginal histograms are displayed along both axes. The proteins with the highest frequency of mutations are those associated with antibiotic resistance according to the WHO catalogue of resistance-associated mutations32.

G-statistic reveals clustering of mutations in antibiotic resistance-conferring proteins.
(A) Workflow to compute the residue-wise Getis-Ord statistic for proteins in M. tuberculosis. (B) Results for four proteins. KatG is shown in complex with heme (orange), PDB ID = 4C51 chain A.36 RpoB is shown in complex with rifampin (orange), PDB ID = 5UH6 chain C,37 aligned as described in Methods. PncA is shown in complex with Fe2+, PDB ID = 3PL1 chain A.38 RsmG (encoded by the gidB gene) is shown as the Thermus thermophilus structure in complex with adenosine monophosphate (the streptomycin binding site in M. tuberculosis is unknown and is therefore not shown), PDB ID = 3G8A chain A.39
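
As an illustration of what a residue-wise Getis-Ord computation can look like, the sketch below evaluates the standard Gi* z-score for every residue from per-residue mutation counts and 3D coordinates. It is a minimal sketch under stated assumptions and not the pipeline used for this figure: the binary distance-cutoff weights, the 8 Å default radius, and the function name getis_ord_gi_star are all hypothetical choices made for the example.

```python
import math


def getis_ord_gi_star(coords, counts, radius=8.0):
    """Return one Gi* z-score per residue.

    coords : list of (x, y, z) tuples, one per residue (e.g. C-alpha atoms)
    counts : list of mutation-event counts, one per residue
    radius : ASSUMED neighbourhood cutoff in the same units as coords;
             weights are binary (1 inside the cutoff, 0 outside) and the
             residue itself is included, as in the Gi* form of the statistic
    """
    n = len(counts)
    if n < 2 or len(coords) != n:
        raise ValueError("need matching coords/counts for at least 2 residues")

    mean = sum(counts) / n
    # Population variance of the counts; clamp tiny negative rounding errors.
    variance = max(sum(c * c for c in counts) / n - mean * mean, 0.0)
    s = math.sqrt(variance)

    scores = []
    for i in range(n):
        weights = [
            1.0 if math.dist(coords[i], coords[j]) <= radius else 0.0
            for j in range(n)
        ]
        sum_w = sum(weights)
        sum_w2 = sum(w * w for w in weights)
        numerator = sum(w * x for w, x in zip(weights, counts)) - mean * sum_w
        spread = (n * sum_w2 - sum_w ** 2) / (n - 1)
        denominator = s * math.sqrt(spread) if spread > 0 else 0.0
        # Undefined cases (no variance, or every residue is a neighbour)
        # are reported as 0.0 rather than raising.
        scores.append(numerator / denominator if denominator > 0 else 0.0)
    return scores


# Toy example: a mutated patch of three residues and a distant unmutated patch.
toy_coords = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (20, 0, 0), (21, 0, 0), (22, 0, 0)]
toy_counts = [5, 4, 6, 0, 0, 0]
toy_scores = getis_ord_gi_star(toy_coords, toy_counts, radius=3.0)
```

In the toy example the three mutated residues receive positive scores (a local excess of mutation events) and the three unmutated residues receive negative scores of equal magnitude. On a real protein the coordinates would come from the PDB or AlphaFold model, and significance would still need to be assessed with an appropriate multiple-testing correction.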

Benchmarking the ability of G-scores to find significant clustering in protein structures.
Procedure for generating downsampled true-positive and true-negative examples from real proteins. Five scores were then tested for their ability to distinguish true positives from true negatives.

Hits of proteome-wide screen for clustering of mutations.
(A) Pipeline for detecting hits in 3,687 proteins. (B) GO terms with the highest fold enrichment (all significant at FDR < 0.05). (C) Examples of proteins with significant clustering. All structures shown are from AlphaFold with low-confidence residues filtered out (note that the relative domain orientation for PknB and PknH is predicted with low confidence).

Performance of classification models in predicting whether mutations are resistance (R)-conferring according to the mutation catalog.
Proximity-1D was trained using only the distance along the primary sequence to the nearest known R mutation, Proximity-3D was trained using the 3D distance to the nearest known R mutation, and G-score was trained using the G-score calculated in this manuscript.
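
The two proximity features described above can be sketched as follows. This is an illustrative sketch only, assuming residues are identified by sequence position and that one 3D coordinate per residue is available. The function names proximity_1d and proximity_3d and the toy coordinates are hypothetical and are not taken from the manuscript's code.

```python
import math


def proximity_1d(position, known_r_positions):
    """Distance in residues along the sequence to the nearest known R mutation."""
    return min(abs(position - p) for p in known_r_positions)


def proximity_3d(position, known_r_positions, coords):
    """Euclidean distance in the structure to the nearest known R mutation.

    coords maps residue position -> (x, y, z); which atom represents each
    residue (e.g. C-alpha) is an assumption left to the caller.
    """
    return min(math.dist(coords[position], coords[p]) for p in known_r_positions)


# Toy structure: residue 10 is far from residue 50 in sequence but
# folded close to it in space.
toy_coords = {
    10: (0.0, 0.0, 0.0),
    50: (0.0, 4.0, 0.0),
    90: (30.0, 0.0, 0.0),
}
known_r = [50, 90]

d1 = proximity_1d(10, known_r)              # 40 residues along the sequence
d3 = proximity_3d(10, known_r, toy_coords)  # 4.0 distance units in 3D
```

The toy case shows why the two features can disagree: a residue can be distant in sequence yet spatially adjacent to a known R site. When such features feed a classifier, the set of known R positions should exclude the mutation being scored, otherwise the distance is trivially zero.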