Inferring joint sequence-structural determinants of protein functional specificity
Figures

BPPS-SIPRIS analysis of the GNAT superfamily and Gna1-family based on structural coordinates for Gna1 (pdb: 4ag9) (Dorfmueller et al., 2012).
SIPRIS clearly associates Gna1-residues with the substrate and homodimeric interfaces (p=8.5 × 10−7). Color scheme: homodimer subunits A and B, green and blue backbones, respectively; BPPS-defined Gna1-family residues in subunits A and B, magenta and red sidechains, respectively (glycine residues are shown as Cα atom spheres); GNAT superfamily residues, yellow sidechains; ligands, cyan. Lys116 (shown in light red) is outside of the SIPRIS defined cluster, but forms a hydrogen bond to a CoA phosphate group. BPPS-SIPRIS spherical clustering identified the GNAT superfamily residues shown (p=1.7 × 10−5). The following figure supplement and source data are available for Figure 1.
-
Figure 1—source data 1
Contrast alignments for Gna1 N-acetyltransferase.
- https://doi.org/10.7554/eLife.29880.005

Applying SIPRIS to the Gna1 protein in conjunction with various methods.
Residue sets were defined using BPPS and three other programs (with default parameter settings). Residue color schemes: BPPS: GNAT superfamily, yellow; Gna1-family, red; FRpred: conserved, yellow; subtype, red; CLIPS-1D: structurally important, yellow; ligand binding, orange; catalytic, red; ET: residues of high functional importance, orange; CoA and substrate, cyan. The SIPRIS predefined clustering p-values corresponding to the homodimer/substrate interface are indicated below each image.

BPPS-SIPRIS analysis of R4 P-loop GTPases.
Bound guanine nucleotide (shown in cyan) allows orientation of each subfigure relative to the others. (A). BPPS-defined hierarchical relationships among the GTPases examined here. (B). Entamoeba histolytica Rho1 GTPase (pdb: 3refB) (Bosch et al., 2011). Color scheme: R4-specific residues forming a BPPS-SIPRIS-defined hydrogen-bond network (p=8.3 × 10−5), red sidechains; residues conserved in P-loop GTPases and interacting with bound guanine nucleotide, yellow sidechains; atoms forming hydrogen bonds, CPK coloring. Modeled hydrogen atoms were generated using the Reduce program (Word et al., 1999). (C). Rab4 bound to GTP and to the Rab-binding domain of Rabenosyn (pdb: 1z0kA [Eathiraj et al., 2005]). BPPS-SIPRIS-defined residues distinctive of R4 (red sidechains) and Rab (orange) have core and Rabenosyn-contacting predefined cluster p-values of 2.6 × 10−6 and 2.9 × 10−8, respectively. The sensor threonine (Thr40) has substantial van der Waals contact with Glu44; Thr40 is an R4-specific (red) residue outside of the SIPRIS-defined cluster. (D). Rab8a in complex with the GTP analog, GNP, and with Ocrl1 (residues 540–678) (pdb: 3qbtA) (Hou et al., 2011). Residues distinctive of Rab GTPases (orange) and of the Rab8 subgroup (green) are enriched at the Ocrl1 interface (p=5.2 × 10−7 and 6.1 × 10−6, respectively). (E). Rab8a homodimeric complex (pdb: 4lhwAB) (Guo et al., 2013). Rab-specific residues (orange) are enriched at the homodimeric interface (p=8.7 × 10−7). The following source data are available for Figure 2.
-
Figure 2—source data 1
Contrast alignments for Rab8, Rab4 and Rho1 GTPases.
- https://doi.org/10.7554/eLife.29880.007

BPPS-SIPRIS analysis of translation-associated P-loop NTPases.
(A). Thermus aquaticus EF-Tu complexed with the antibiotic enacyloxin IIA, a GTP analog, and Phe-tRNA (pdb: 1ob5) (Parmeggiani et al., 2006). Color scheme: BPPS-SIPRIS defined GTPase-, TF- and EF-Tu/CysN-specific residues, yellow, red, and orange sidechains, respectively; GTPase domain backbone, green; C-terminal β-barrel domains, gray; Phe-tRNA, teal; 5’ end nucleotide bases, light cyan; guanine nucleotide, cyan; enacyloxin IIA, greenish-cyan. Spheres indicate glycine Cα atoms. (B). BPPS-SIPRIS cluster of EF-Tu TF-residues centered on EF-Ts Phe81 at the EF-Tu/EF-Ts interface (pdb: 1efu) (Kawashima et al., 1996). Regions in EF-Ts conserved between E. coli and cow are shown in cyan both in the figure and in the corresponding alignment below it. (C). P. aeruginosa EF-Tu bound to the Tse6 toxin domain (pdb: 4zv4) (Whitney et al., 2015). EF-Tu His20, which corresponds to His19 in (B), appears to form a salt bridge with Glu291 of Tse6. In light pink are regions of Tse6 contacting EF-Tu. Spherically clustered residues (p=0.0060) centered on Glu291 of Tse6 are shown with red sidechains. (D). Spherically clustered EF-Tu/CysN residues (orange; p=6.3 × 10−5) within the CysND complex (pdb: 1zun) (Mougous et al., 2006). (E). Spherically clustered EF-Tu/CysN-residues in EF-Tu (pdb: 1ob5) (p=1.0 × 10−6). (F). Human eIF4AIII bound to RNA, ADP, and the γ-phosphate transition state mimic AlF3 (pdb: 3ex7) (Nielsen et al., 2009). Color scheme: eIF4AIII N- and C-terminal domains, violet and green, respectively; RNA and ADP, cyan; AlF3, light cyan; superfamily-conserved catalytic residues, yellow sidechains; RNA helicase-specific residues clustered on (light cyan-colored) RNA bases 4–5, red; other RNA helicase-specific residues, light red; C-terminal catalytic residues, bright green. The following source data are available for Figure 3.
-
Figure 3—source data 1
Contrast alignments for EF-Tu GTPase and eIF4AIII RNA helicase.
- https://doi.org/10.7554/eLife.29880.009

BPPS-SIPRIS analysis of synaptojanin/EEP domains.
(A). The two major groups of the BPPS-defined EEP hierarchy examined here. (B). Human APE1 phosphorothioate substrate complex (pdb: 5dfi) (Freudenthal et al., 2015). Replacement of the phosphodiester bond with phosphorothioate prohibits cleavage by APE1 at the abasic site (circled). Cys310, which is nitrosated, is indicated. Color scheme: APE1 backbone trace, green; DNA strand containing the abasic site, cyan; complementary strand, marine blue; the BPPS-SIPRIS-defined residues distinctive of the EEP superfamily and of the exoIII-AP-endo family, yellow and red sidechains, respectively; basic residues within a loop interacting with the major groove of DNA, purple. (C). Close up of the APE1 active site. EEP-specific residues forming a hydrogen-bond network are shown with yellow sidechains. For clarity, only a few of the EEP- and exoIII-AP-endo-specific residues in the network are shown. The following source data are available for Figure 4.
-
Figure 4—source data 1
Contrast alignments for APE1 endonuclease.
- https://doi.org/10.7554/eLife.29880.011

BPPS-SIPRIS analysis of synaptojanin/EEP domains within INPP5 proteins.
Color code: EEP-residues, yellow sidechains; INPP5 residues, red sidechains; INPP5B-, INPP5E- and SHIP2-subfamily residues, orange sidechains; ligands, cyan; atoms involved in hydrogen bonds, CPK coloring. (A). Human INPP5B in complex with phosphatidylinositol 3,4-bisphosphate (pdb: 4cml) (Trésaugues et al., 2014), which is associated with cytosolic and mitochondrial membranes (Speed et al., 1995). BPPS-SIPRIS results: EEP spherical cluster, p=5.8 × 10−13; INPP5 spherical cluster, p=3.9 × 10−7; INPP5B spherical cluster, p=0.0021. (B). INPP5 hydrogen bond network within human INPP5B (pdb: 3mtc) (unpublished). (C). View of INPP5-residues (in 3mtc) that bind the 4-phosphate group required for substrate recognition. (D). Human INPP5B with phosphate bound to a possible membrane interaction or allosteric site (Mills et al., 2016). (E). Human INPP5B Ocrl with glycerol bound to the same site as indicated in (D) (Trésaugues et al., 2014). (F). INPP5 subgroups within the BPPS-defined hierarchy. (G). Human INPP5E (pdb: 2xsw) (unpublished), which is associated with the primary cilium, an organelle involved in signal transduction (Jacoby et al., 2009) (spherical cluster, p=3.6 × 10−4). (H). Human SHIP2 (pdb: 4a9c) (Mills et al., 2012), which is associated with membrane ruffle formation (Hasegawa et al., 2011) (spherical cluster, p=0.30). The following source data are available for Figure 5.
-
Figure 5—source data 1
Contrast alignments for INPP5 phosphatases.
- https://doi.org/10.7554/eLife.29880.013

BPPS-SIPRIS analysis of DNA glycosylases.
(A). Thymine DNA glycosylase (TDG) family (red sidechains) and metazoan subfamily (orange sidechains) residues forming a significant hydrogen bond network (p=3.5 × 10−5) within human TDG (pdb: 5hf7) (Pidugu et al., 2016). (B). TDG H-bond network consisting of residues distinctive both of all TDGs (red sidechains) and of metazoan TDGs (orange sidechains). This network includes hydrogen bonds to DNA oxygen atoms on either side of the thymine base to be excised (cyan); note that Phe238 and Tyr235 appear to position the N-terminus of their helix to hydrogen bond to substrate backbone oxygens; another such hydrogen bond involves Ser273, a residue generally conserved in the entire superfamily. The water molecule shown may act as the nucleophile in the reaction. For clarity, not all of the BPPS-SIPRIS-defined residues are shown. (C). TDG hydrogen-bond network residues may help position basic residues (green sidechains) interacting with the minor and major grooves of DNA. (D). TDG family-specific hydrogen-bond network residues surrounding a proposed catalytic water molecule (red sphere with dot cloud). (E). A BPPS-SIPRIS-defined H-bond network (p=1.7 × 10−5) distinct from that of TDG within Thermus thermophilus uracil DNA glycosylase (UDG) (pdb: 2dp6). The following source data are available for Figure 6.
-
Figure 6—source data 1
Contrast alignments for DNA glycosylases.
- https://doi.org/10.7554/eLife.29880.015

Overview of BPPS-SIPRIS analysis.
(A) Steps required for a BPPS-SIPRIS analysis. The fatax program adds phylum-annotations to database sequences. MAPGAPS detects and aligns database sequences containing the domain defined by a cma-formatted MSA or hiMSA. (MAPGAPS can also convert an MSA from fasta- to cma-format.) This creates an MSA that step 1 of BPPS then partitions hierarchically into subgroups based on discriminating pattern residues, as illustrated schematically in (B). Step E of BPPS checks for consistency between BPPS step 1 runs. Step 2 of BPPS adjusts the sub-alignment for each subgroup to align and possibly assign pattern residues to regions uniquely conserved in that subgroup, thereby creating a hiMSA. Step 3 of BPPS creates, for each node in the hiMSA, lineage-specific ‘contrast alignments’, as illustrated schematically in (C), and a corresponding input file to SIPRIS, which identifies statistically significant structural interaction networks associated with pattern residues. For further descriptions, see text. (B) Schematic diagram of the node 8 contrast alignment. Sequences assigned to node 8's subtree (green subfamily nodes in (C)) constitute a ‘foreground’ partition; sequences assigned to the other nodes of the subtree rooted at the parent of node 8 (gray subfamily nodes in (C)) constitute a ‘background’ partition, and the remaining sequences constitute a non-participating partition. Green horizontal bars in (B) represent foreground sequences. The green vertical bars in (B) represent conserved foreground residue patterns (as shown below each bar); these diverge from (or contrast with) the background compositions at those positions (white vertical bars). Red vertical bars above quantify the degree of divergence. (C) Schematic diagram of a BPPS-3-generated set of ‘contrast alignments’ corresponding to the node 9 lineage of the sequence hierarchy in (A). Within a hiMSA, there is one such lineage for each leaf node.
Horizontal lines represent aligned sequences and are colored by level in the hierarchy. Thin light gray horizontal lines represent non-homologous and deleted regions. Vertical lines represent the contrasting pattern positions upon which the hierarchy is based and are similarly colored by levels. The trees shown correspond to each subgroup along the lineage. The colored, gray and white nodes in each tree correspond, respectively, to their alignment foreground, background and non-participating partitions. The background for the entire superfamily (lower right) consists of standard amino acid frequencies at each position.
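The foreground/background/non-participating split described in this caption can be sketched in a few lines. This is only an illustration of the partition rule as worded above, not the BPPS implementation: the function name `contrast_partitions`, the `parent` dictionary encoding, and the toy hierarchy are all assumptions made for the example.

```python
from collections import defaultdict

def contrast_partitions(parent, node):
    """Split hierarchy nodes into (foreground, background, non-participating)
    for the contrast alignment of `node`.

    `parent` maps each node id to its parent id (None for the root).
    This encoding is hypothetical and chosen only for illustration.
    """
    children = defaultdict(list)
    for child, par in parent.items():
        if par is not None:
            children[par].append(child)

    def subtree(root):
        # Collect root and all of its descendants.
        seen, stack = set(), [root]
        while stack:
            n = stack.pop()
            seen.add(n)
            stack.extend(children[n])
        return seen

    foreground = subtree(node)
    par = parent[node]
    # For the root there are no background nodes; the caption states that
    # standard amino acid frequencies serve as the background in that case.
    background = set() if par is None else subtree(par) - foreground
    non_participating = set(parent) - foreground - background
    return foreground, background, non_participating

# Toy hierarchy (hypothetical): 1 is the root; 3 has children 4 and 5; 5 has 6 and 7.
parent = {1: None, 2: 1, 3: 1, 4: 3, 5: 3, 6: 5, 7: 5}
fg, bg, rest = contrast_partitions(parent, 5)
print(sorted(fg), sorted(bg), sorted(rest))
```

Sequences would then be assigned to a partition according to the node they belong to; one such split exists for every node along a root-to-leaf lineage.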

Eleven haloacid dehalogenase sequences that the SFLD assigned to SG1129, but that are more closely related to SG1130 sequences.
The Venn diagram shows the overlap between the subgroups BSG15, SG1129 and SG1130 with the numbers of sequences indicated. The table gives the mean pairwise gapped BLAST scores for the 11 sequences assigned to both SG1129 and BSG15 versus the sequence sets shown; this analysis indicates that the 11 sequences should be reassigned from SG1129 to SG1130. Similar analyses indicate that four other sequences in SG1129 should be reassigned to SG1135 (based on mean scores of 27 versus 139) and that a sequence in SG1136 should be reassigned to SG1137 (based on a mean score of 8 versus 149).
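The reassignment argument above compares mean pairwise scores of a set of query sequences against two candidate subgroups. A minimal sketch of that comparison follows; the sequence names, the score values, and the helper names `mean_pairwise_score` and `better_subgroup` are hypothetical, and in practice the scores would come from gapped BLAST runs rather than a hand-written table.

```python
from statistics import mean

def mean_pairwise_score(queries, references, score):
    """Mean of score(q, r) over all query/reference pairs, skipping self-pairs."""
    return mean(score(q, r) for q in queries for r in references if q != r)

def better_subgroup(queries, subgroups, score):
    """Return the subgroup name with the highest mean score, plus all means."""
    means = {name: mean_pairwise_score(queries, members, score)
             for name, members in subgroups.items()}
    return max(means, key=means.get), means

# Hypothetical pairwise scores, stored symmetrically under a frozenset key.
toy_scores = {
    frozenset(("q1", "a1")): 20, frozenset(("q1", "a2")): 30,
    frozenset(("q2", "a1")): 25, frozenset(("q2", "a2")): 25,
    frozenset(("q1", "b1")): 140, frozenset(("q2", "b1")): 150,
}

def lookup(x, y):
    return toy_scores[frozenset((x, y))]

best, means = better_subgroup(
    ["q1", "q2"],
    {"SG_A": ["a1", "a2"], "SG_B": ["b1"]},
    lookup,
)
print(best, means)
```

With these toy numbers the queries score far higher against `SG_B`, which is the kind of gap the caption uses to argue for reassignment.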
Tables
Summary of BPPS-SIPRIS results for the most significant cluster in each test case.
https://doi.org/10.7554/eLife.29880.002

| Protein | PDB structure | SIPRIS mode* | Focal point† | Dist.‡ | Init.‡ | Term.‡ | SIPRIS p-value | Tree level§ | Interpretive comments# |
|---|---|---|---|---|---|---|---|---|---|
| Gna1 | 4ag9A | p=BDF | - | 22 | 57 | 71 | 8.5 × 10−7 | 1 | Substrate and homodimeric interfaces |
| | | S | CoA | 17 | 41 | 87 | 6.8 × 10−5 | 0 | CoA-binding subdomain |
| | | S | - | 23 | 56 | 72 | 9.3 × 10−6 | 1 | DCA-based clustering |
| | | S | - | 14 | 21 | 107 | 2.5 × 10−4 | 1 | Structure-based clustering |
| Rho1 | 3refB | B | - | 20 | 53 | 100 | 8.3 × 10−5 | 1 | (Active site secondary shell) |
| | | C | - | 22 | 55 | 98 | 7.8 × 10−7 | 1 | “ “ “ “ |
| Rab4 | 1z0kA | S | - | 10 | 11 | 153 | 2.1 × 10−5 | 1 | (Active site secondary shell) |
| | | C | - | 25 | 91 | 73 | 2.6 × 10−6 | 1 | “ “ “ “ |
| | | p=B | - | 14 | 23 | 141 | 2.9 × 10−8 | 2 | Interface with Rabenosyn-5 |
| | | S | - | 22 | 42 | 122 | 4.8 × 10−10 | 2 | “ “ “ “ |
| Rab8 | 3qbtA | p=B | - | 13 | 23 | 139 | 5.2 × 10−7 | 2 | Interface with Ocrl1 |
| | | p=B | - | 12 | 23 | 139 | 6.1 × 10−6 | 3 | Interface with Ocrl1 helix |
| | 4lhwB | p=A | - | 10 | 14 | 148 | 8.7 × 10−7 | 2 | Homodimeric interface |
| EF-Tu | 1ob5A | S | - | 18 | 33 | 150 | 1.4 × 10−7 | 1 | (GTP to tRNA allosteric link) |
| | | S | - | 23 | 71 | 112 | 1.0 × 10−6 | 2 | (GTP/tRNA allosteric link to β-barrel) |
| | | S | 1B | 22 | 81 | 102 | 1.3 × 10−5 | 1 | Cluster around 5’ base 1 of tRNA |
| | | S | 2B | 18 | 47 | 136 | 2.6 × 10−6 | 1 | Cluster around 5’ base 2 of tRNA |
| | 1efuA | S | 81B | 14 | 49 | 128 | 5.2 × 10−5 | 1 | (Nucleotide exchange allosteric network) |
| | 4zv4A | S | 291C | 21 | 66 | 109 | 0.0060 | 1 | (Mediates hijacking by Tse6 toxin) |
| CysN | 1zunB | S | - | 23 | 79 | 118 | 6.3 × 10−5 | 2 | (Allosteric link to β-barrel domain) |
| eIF4AIII | 3ex7H | p=J | - | 11 | 18 | 128 | 6.4 × 10−6 | 1 | (ATP to RNA allosteric link) |
| | | S | 4J | 13 | 18 | 128 | 5.1 × 10−7 | 1 | Cluster around RNA rotation bond |
| | | S | 5J | 16 | 41 | 105 | 5.5 × 10−4 | 1 | “ “ “ “ “ |
| APE1 | 5dfiA | H | 11P | 9 | 13 | 238 | 5.2 × 10−6 | 0 | Abasic site H-bond network |
| | | H | 11P | 22 | 99 | 152 | 1.6 × 10−6 | 1 | “ “ “ “ |
| | | H | - | 25 | 137 | 114 | 1.7 × 10−6 | 1 | (Active site secondary shell) |
| | | H | 9P | 25 | 137 | 114 | 1.9 × 10−7 | 1 | H-bond network positioning abasic site |
| | | H | 12P | 23 | 119 | 132 | 7.6 × 10−6 | 1 | “ “ “ “ “ |
| Inpp5b | 4cmlA | S | - | 24 | 69 | 216 | 5.8 × 10−13 | 0 | Active site core residues |
| | | S | - | 21 | 77 | 208 | 3.9 × 10−7 | 1 | (Substrate recognition with allosteric link) |
| | | S | - | 12 | 30 | 255 | 0.0022 | 2 | (Membrane substrate sequestration) |
| Inpp5b | 3mtcA | S | - | 22 | 91 | 194 | 8.0 × 10−7 | 1 | (Substrate recognition with allosteric link) |
| | | S | - | 12 | 29 | 256 | 0.0015 | 2 | (Membrane substrate sequestration) |
| Inpp5e | 2xswA | S | - | 25 | 140 | 148 | 3.7 × 10−7 | 1 | (Substrate recognition with allosteric link) |
| | | S | - | 9 | 13 | 275 | 3.6 × 10−4 | 2 | (Membrane substrate sequestration) |
| SHIP2 | 4a9cA | S | - | 17 | 38 | 260 | 6.0 × 10−8 | 1 | (Substrate recognition with allosteric link) |
| | | S | - | 4 | 4 | 294 | 0.30 | 2 | (Membrane substrate sequestration) |
| TDG | 5hf7A | H | 17D | 19 | 97 | 76 | 4.1 × 10−4 | 1 | H-bond network around excised base |
| | | H | - | 20 | 98 | 75 | 3.5 × 10−5 | 1 | H-bond network around catalytic water |
| UDG | 2dp6A | B | - | 13 | 17 | 121 | 1.7 × 10−5 | 1 | H-bond network distinct from TDG |
-
*Modes: S, spherical expansion; C, core expansion; H, hydrogen bond expansion (involving sidechain interactions); B, hydrogen bond expansion (also involving backbone-to-backbone interactions); P, predefined clustering (residues in the cluster are those interacting with the chain(s) whose pdb identifiers are given to the right of the equal sign).
†Focal points defining starting residue(s): ‘-’, analysis was optimized over multiple starting residues (i.e., no focal point); CoA, cluster initiated from the residue closest to Coenzyme A; others, cluster initiated from the residue closest to the indicated position and chain (e.g., 1B = position 1 in pdb chain B).
-
‡Nature of the optimum cluster: dist., the number of distinguishing residues within the cluster (total = 25); init., the total number of residues within the cluster; term., the number of residues outside of the cluster.
§Codes designate pattern residue class: 0, superfamily; 1, family; 2, subfamily; 3, sub-subfamily. In the figures, these correspond to residues with yellow, red, orange and green sidechains, respectively.
-
#Comments in parentheses indicate possible functions.
Structural diversity among proteins identified and aligned by MAPGAPS.
https://doi.org/10.7554/eLife.29880.017

Superfamily | Structures*: % ID | Structures*: No. | RMSD† (Å): Avg | RMSD† (Å): Min | RMSD† (Å): Max | RMSD† (Å): S.D. | Domain length‡: MSA | Domain length‡: Avg | Domain length‡: S.D. | Resolution (Å): Avg | Resolution (Å): Max |
---|---|---|---|---|---|---|---|---|---|---|---|
GNAT | 27 | 16 | 3.25 | 1.0 | 6.7 | 1.4 | 125 | 139.8 | 17.0 | 1.94 | 2.61 |
GTPases | 30 | 20 | 3.96 | 0.6 | 14.7 | 3.5 | 164 | 195.9 | 41.6 | 2.31 | 3.10 |
Helicases | 40 | 12 | 6.39 | 2.6 | 9.8 | 1.8 | 466 | 482.8 | 60.7 | 2.86 | 3.56 |
EEP | 40 | 16 | 3.02 | 0.8 | 5.2 | 0.95 | 241 | 259.0 | 27.6 | 2.07 | 2.99 |
UDG/TDG | 40 | 8 | 2.54 | 1.1 | 3.6 | 0.69 | 125 | 135.9 | 12.7 | 1.83 | 2.58 |
-
*NMR and poor resolution structures were not used; no two proteins in each set contained more than the indicated level of percent sequence identity (% ID); pdb identifiers for these are given in supplementary file 1.
†RMSDs were computed using MUSTANG (Konagurthu et al., 2006) with default parameters; the structural coordinates used for the analysis were limited to the domain of interest.
-
‡The number of aligned columns in the MSA, and the average length and standard deviation of the domain ‘footprint’.
Summary of BPPS results for five superfamilies.
https://doi.org/10.7554/eLife.29880.018

| Superfamily | Subgroup | # Sequences | % Identity* | # Nodes in subtree | Minimum subtree size |
|---|---|---|---|---|---|
| GNAT | | 237,359 | 98 | 44 | 200 |
| | Gna1 family | 1243 | | 1 | |
| GTPases | | 127,418 | 95 | 121 | 500 |
| | R4 family | 18,901 | | 26 | |
| | Rab subfamily | 7002 | | 12 | |
| | Rab8 sub-subfamily | 3312 | | 7 | |
| | TF family | 25,224 | | 10 | |
| | EFTu/CysN subfamily | 4429 | | 3 | |
| Helicases | | 131,321 | 98 | 47 | 300 |
| | RNA helicases | 36,788 | | 8 | |
| EEP | | 45,799 | 99 | 166 | 100 |
| | exoIII-AP-endo | 13,711 | | 47 | |
| | INPP5 | 3855 | | 14 | |
| TDG/UDG | | 23,592 | 98 | 47 | 100 |
| | TDG | 1639 | | 6 | |
| | UDG | 376 | | 1 | |
-
*The maximum % identity allowed between any two sequences in the set
Summary of SFLD benchmarking of BPPS.
https://doi.org/10.7554/eLife.29880.023

Superfamily | # subgroups: SFLD | # subgroups: BPPS | BPPS min.* | Annotated by SFLD: No | Annotated by SFLD: Yes | Annotated by SFLD: expt† | BSG‡ | BPPS conflicts§: Error | BPPS conflicts§: ? | BPPS conflicts§: Correct | Maximum % errors# |
---|---|---|---|---|---|---|---|---|---|---|---|
radical SAM | 49 | 17 | 800 | 52,608 | 17,680 | 12 | 13,676 | 10 | 6 | 326 | 0.12 |
glutathione transferase | 26 | 15 | 100 | 6921 | 3633 | 0 | 1945 | 0 | 0 | 0 | 0 |
peroxiredoxin | 6 | 11 | 100 | 3870 | 5521 | 0 | 5255 | 0 | 1 | 0 | 0.02 |
haloacid dehalogenase | 24 | 28 | 200 | 21,768 | 33,379 | 9 | 26,589 | 35 | 66 | 27 | 0.38 |
isoprenoid synthase I | 9 | 7 | 200 | 9666 | 1604 | 55 | 1536 | 0 | 0 | 0 | 0 |
isoprenoid synthase II | 3 | 5 | 100 | 6974 | 671 | 38 | 591 | 1 | 0 | 0 | 0.17 |
nitroreductase | 110 | 11 | 200 | 0 | 17,318 | 0 | 7242 | 20 | 11 | 0 | 0.43 |
enolase | 8 | 8 | 800 | 26,227 | 2267 | 7 | 2143 | 0 | 0 | 0 | 0 |
total: | | | | | 82,073 | 121 | 58,977 | 66 | 84 | 353 | avg: 0.14 |
-
*The minimum number of sequences required for each BPPS subgroup.
†Numbers of experimentally validated annotations.
-
‡The number of SFLD annotated sequences assigned by BPPS to a subgroup.
§The number of SFLD annotated sequences in conflict with BPPS classification; error, SFLD annotation appears to be correct; ‘?’, not sure whether SFLD or BPPS is correct; correct, BPPS appears to be correct
-
#Percent erroneous or ambiguous (‘?’) BPPS assignments among annotated sequences not assigned to a root node.
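The "Maximum % errors" column can be reproduced from the other columns if it is read as (Error + ?) divided by the BSG count, times 100. That reading is an assumption based on the footnotes, but it matches every tabulated value, as the short check below shows. The dictionary simply restates the BSG, Error and ? entries from the table; the function name is hypothetical.

```python
# (BSG, Error, ?) for each superfamily, copied from the table above.
sfld_rows = {
    "radical SAM": (13676, 10, 6),
    "glutathione transferase": (1945, 0, 0),
    "peroxiredoxin": (5255, 0, 1),
    "haloacid dehalogenase": (26589, 35, 66),
    "isoprenoid synthase I": (1536, 0, 0),
    "isoprenoid synthase II": (591, 1, 0),
    "nitroreductase": (7242, 20, 11),
    "enolase": (2143, 0, 0),
}

def max_percent_errors(bsg, errors, ambiguous):
    """Upper bound on the BPPS error rate: erroneous plus ambiguous
    assignments as a percentage of subgroup-assigned annotated sequences."""
    return 100.0 * (errors + ambiguous) / bsg

percents = {name: round(max_percent_errors(*vals), 2)
            for name, vals in sfld_rows.items()}
average = round(sum(percents.values()) / len(percents), 2)

for name, pct in percents.items():
    print(f"{name}: {pct}")
print("average:", average)
```

Counting the ambiguous ('?') cases as errors is what makes the figure a maximum rather than an estimate.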
Correspondence between BPPS and SFLD subgroups for haloacid dehalogenases*.
https://doi.org/10.7554/eLife.29880.024

| BPPS subgroup ID | SFLD subgroup ID‡ | SFLD & BPPS§ | SFLD total# | SFLD %# |
|---|---|---|---|---|
| root† | various | 1531 | 1618 | 96.2 |
| | 1138 | 82 | 833 | 9.8 |
| 34 | 0 | 200 | 21768 | 0.9 |
| | 1129 | 3 | 129 | 2.3 |
| 23 | 0 | 101 | 21768 | 0.5 |
| | 1124 | 1 | 495 | 0.2 |
| | 1135 | 125 | 9423 | 1.3 |
| 21 | 0 | 158 | 21768 | 0.7 |
| | 1124 | 91 | 495 | 18.4 |
| | 1145 | 2 | 43 | 4.7 |
| 20 | 0 | 46 | 21768 | 0.2 |
| | 1131 | 162 | 201 | 80.6 |
| 25 | 0 | 76 | 21768 | 0.3 |
| | 1135 | 311 | 9423 | 3.3 |
| 2 | 0 | 1915 | 21768 | 8.8 |
| | 2 | 10091 | 11846 | 85.2 |
| 3 | 0 | 937 | 21768 | 4.3 |
| | 1129 | 4 | 129 | 3.1 |
| | 1134 | 1 | 866 | 0.1 |
| | 1135 | 4500 | 9423 | 47.8 |
| | 1139 | 4 | 1851 | 0.2 |
| | 1140 | 1 | 821 | 0.1 |
| 4 | 0 | 2422 | 21768 | 11.1 |
| | 2 | 1 | 11846 | 0.0 |
| | 1137 | 9 | 1430 | 0.6 |
| | 1140 | 3 | 821 | 0.4 |
| | 1141 | 53 | 278 | 19.1 |
| | 1142 | 2 | 236 | 0.8 |
| | 1144 | 2497 | 2759 | 90.5 |
| 5 | 0 | 229 | 21768 | 1.1 |
| | 2 | 986 | 11846 | 8.3 |
| 6 | 0 | 342 | 21768 | 1.6 |
| | 1124 | 360 | 495 | 72.7 |
| 7 | 0 | 330 | 21768 | 1.5 |
| | 1134 | 628 | 866 | 72.5 |
| 33 | 0 | 100 | 21768 | 0.5 |
| | 1134 | 153 | 866 | 17.7 |
| 8 | 0 | 57 | 21768 | 0.3 |
| | 1133 | 400 | 400 | 100 |
| 32 | 0 | 32 | 21768 | 0.1 |
| | 1139 | 218 | 1851 | 11.8 |
| 31 | 0 | 28 | 21768 | 0.1 |
| | 1139 | 216 | 1851 | 11.7 |
| 9 | 0 | 195 | 21768 | 0.9 |
| | 1134 | 1 | 866 | 0.1 |
| | 1139 | 942 | 1851 | 50.9 |
| 10 | 0 | 105 | 21768 | 0.5 |
| | 1137 | 896 | 1430 | 62.7 |
| 11 | 0 | 284 | 21768 | 1.3 |
| | 1135 | 3 | 9423 | 0.0 |
| 12 | 0 | 478 | 21768 | 2.2 |
| | 1136 | 1 | 246 | 0.4 |
| | 1137 | 178 | 1430 | 12.4 |
| 13 | 0 | 117 | 21768 | 0.5 |
| | 1138 | 751 | 833 | 90.2 |
| 14 | 0 | 1034 | 21768 | 4.8 |
| | 1135 | 32 | 9423 | 0.3 |
| 15 | 0 | 525 | 21768 | 2.4 |
| | 1129 | 11 | 129 | 8.5 |
| | 1130 | 809 | 836 | 96.8 |
| | 1132 | 6 | 227 | 2.6 |
| | 1135 | 63 | 9423 | 0.7 |
| | 1139 | 3 | 1851 | 0.2 |
| 16 | 0 | 337 | 21768 | 1.5 |
| | 1135 | 1 | 9423 | 0.0 |
| | 1140 | 670 | 821 | 81.6 |
| 17 | 0 | 230 | 21768 | 1.1 |
| | 1135 | 288 | 9423 | 3.1 |
| 18 | 0 | 505 | 21768 | 2.3 |
| | 1135 | 950 | 9423 | 10.1 |
| 19 | 0 | 197 | 21768 | 0.9 |
| | 1129 | 3 | 129 | 2.3 |
| 22 | 0 | 338 | 21768 | 1.6 |
| 24 | 0 | 110 | 21768 | 0.5 |
| | 1135 | 107 | 9423 | 1.1 |
-
*Erroneous, ambiguous and corrected classifications are shown as italicized, underlined, and bold, respectively.
†Averages over 12 root-assigned subgroups.
-
‡SFLD subgroups represented in each BPPS subgroup; zero indicates the SFLD unannotated sequence set.
§The number of sequences in both the SFLD and BPPS subgroups in each row.
-
#Total number of sequences in each SFLD subgroup and the percentage of these in the BPPS subgroup.
Haloacid dehalogenase SG1129 sequences that BPPS assigned to distinct subgroups (BSG).
https://doi.org/10.7554/eLife.29880.025

BSG | # seqs in BSG | # SG1129 seqs in BSG* | % matches to BPPS pattern†: BSG 34 | BSG 3 | BSG 15 | BSG 19 | Mean score vs other seqs‡: in SG1129 | in BSG |
---|---|---|---|---|---|---|---|---|
34 | 203 | 3 | 80 | 31 | 37 | 8 | 14 | 399 |
3§ | 5447 | 4 | 6 | 96 | 55 | 5 | 27 | 137 |
15# | 1417 | 11 | 10 | 50 | 87 | 12 | 21 | 104 |
19 | 200 | 3 | 8 | 21 | 36 | 83 | 9 | 488 |
root | 16,869 | 108 | 9 | 40 | 42 | 8 | na | na |
-
*The number of SG1129 sequences assigned to the BSG in each row.
†Average percentage of matches to the pattern residues for their assigned BSG among the SG1129 sequences. The highest percentages (bold) correspond to each BSG’s own pattern.
-
‡The mean pairwise BLAST scores of the reassigned sequences against the remaining sequences either in SG1129 or in the BSG for that row.
§This BSG corresponds to SG1135. (See Appendix 1—table 2.)
-
#This BSG corresponds to SG1130. (See Appendix 1—table 2.)
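The "% matches to BPPS pattern" columns report, for a set of sequences, the average percentage of a subgroup's pattern positions at which the sequence carries a pattern residue. A minimal sketch of that calculation follows. The data layout (a dictionary from alignment column to allowed residues, and equal-length aligned strings) and the toy values are assumptions for illustration, not the BPPS file format.

```python
def percent_pattern_matches(aligned_seqs, pattern):
    """Average, over sequences, of the percentage of pattern positions matched.

    `pattern` maps a 0-based alignment column to the set of residues
    allowed at that column (hypothetical representation).
    """
    per_seq = []
    for seq in aligned_seqs:
        hits = sum(seq[col] in allowed for col, allowed in pattern.items())
        per_seq.append(100.0 * hits / len(pattern))
    return sum(per_seq) / len(per_seq)

# Toy pattern and sequences (hypothetical).
toy_pattern = {0: {"K"}, 2: {"D", "E"}, 4: {"G"}}
toy_seqs = ["KADAG", "KAEAA", "RADAG"]

match_pct = percent_pattern_matches(toy_seqs, toy_pattern)
print(round(match_pct, 1))
```

Evaluating the same sequences against several subgroup patterns, as the table does, amounts to calling the function once per pattern and comparing the results.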
Average percentage of matches to various BPPS subgroup (BSG) patterns for haloacid dehalogenase sequences assigned to SFLD subgroup SG1135.
https://doi.org/10.7554/eLife.29880.026

BSG ID | # seqs in BSG | # SG1135 seqs | % matches to each BSG pattern for SG1135 sequences*: BSG 15 | BSG 16 | BSG 25 | BSG 3 | BSG 11 | BSG 14 | BSG 17 | BSG 18 | BSG 24 | BSG 23 | Mean score† vs others: in SG1135 | in BSG | New SGs‡ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
15 | 1417 | 63 | 69 | 35 | 32 | 51 | 8 | 42 | 19 | 56 | 22 | 20 | 63 | 64 | ? |
16 | 1008 | 1 | 43 | 40 | 52 | 52 | 12 | 40 | 40 | 52 | 24 | 8 | 110 | 14 | error |
25 | 387 | 311 | 48 | 37 | 84 | 57 | 13 | 36 | 43 | 45 | 27 | 10 | 80 | 309 | yes |
3 | 5447 | 4500 | 49 | 25 | 35 | 90 | 15 | 37 | 20 | 38 | 20 | 9 | 148 | 132 | ? |
11 | 287 | 3 | 33 | 29 | 37 | 51 | 93 | 32 | 17 | 32 | 16 | 8 | 44 | 632 | yes |
14 | 1066 | 32 | 49 | 27 | 25 | 37 | 9 | 77 | 18 | 47 | 19 | 13 | 42 | 130 | yes |
17 | 518 | 288 | 41 | 39 | 41 | 42 | 8 | 30 | 88 | 48 | 25 | 6 | 55 | 321 | yes |
18 | 1455 | 950 | 58 | 34 | 28 | 44 | 9 | 50 | 16 | 88 | 20 | 15 | 43 | 169 | yes |
24 | 217 | 107 | 42 | 35 | 26 | 44 | 9 | 32 | 20 | 42 | 92 | 4 | 53 | 293 | yes |
23 | 227 | 125 | 62 | 32 | 23 | 41 | 4 | 32 | 10 | 36 | 14 | 90 | 45 | 403 | yes |
root | n.a. | 3040 | 46 | 37 | 41 | 56 | 12 | 40 | 29 | 46 | 23 | 10 | n.a. | n.a. | n.a. |
-
*Average percentage of matches to the pattern residues for their assigned BSG among the SG1135 sequences. The highest percentages (bold) correspond to the highest percentage in each row.
†The mean pairwise BLAST scores of the BPPS-assigned sequences against the remaining sequences either in SG1135 or in the BSG for that row. The highest scores in each row are bold. (See Appendix 1—table 2.)
-
‡A ‘yes’ in this column indicates that the SG1135 sequences assigned to the BSG in that row likely correspond to a subgroup distinct from SG1135; ‘?’ indicates a possible subcategory of SG1135; ‘error’ indicates a BPPS misclassification.
BPPS-SIPRIS analyses using MAPGAPS (MG) versus Jackhmmer (JH) generated MSAs as input.
https://doi.org/10.7554/eLife.29880.027

| Protein | MSA* | SIPRIS mode† | Dist.† | Init.† | Term.† | SIPRIS p-value | Tree level‡ | Optimal BPPS pattern ∩ SIPRIS cluster§ |
|---|---|---|---|---|---|---|---|---|
| Gna1 | JH | p=BDF | 21 | 69 | 87 | 1.2 × 10−5 | 1 | F93,I94,D105,K136,H95,Y68,Y135,T44,R102,E104, E90,C141,K92,L40,L43,V88,V134,Y36,L27,F58,V89 |
| | MG | p=BDF | 22 | 57 | 71 | 8.5 × 10−7 | 1 | F93,I94,D105,K136,H95,Y68,Y135,T44,R102,E104, E90,C141,K92,M61,L40,L43,V134,F54,Y36,F58, G98,V89 |
| | JH | S | 14 | 21 | 135 | 2.1 × 10−5 | 1 | E90,K92,R102,V89,V88,G101,Y135,I94,V134,K136, F93,E104,Y68,H95 |
| | MG | S | 14 | 21 | 107 | 2.5 × 10−4 | 1 | E90,K92,R102,V89,G101,Y135,I94,V134,K136,F93, G98,E104,Y68,H95 |
| APE1 | JH | H | 15 | 38 | 219 | 5.0 × 10−5 | 0 | V206,L167,Q95,S66,G209,W67,P311,H309,D283, S307,N68,D210,E96,N212,R185 |
| | MG | H | 16 | 33 | 218 | 4.2 × 10−7 | 0 | V206,L167,F165,Q95,S66,G209,V69,W67,H309, D283,T265,S307,N68,D210,E96,N212 |
| | JH | H | 25 | 158 | 99 | 8.8 × 10−6 | 1 | Y128,E154,R156,Y171,P173,W188,D70,W267,N277, E236,C310,G71,R237,D219,R254,R281,V213,A214, L62,G279,K98,V131,A175,L72,R181 |
| | MG | H | 25 | 137 | 114 | 1.7 × 10−6 | 1 | Y128,G155,E154,D152,R156,Y171,P173,R185,D70, W267,N277,W188,E236,C310,Y269,R237,D219, R254,V213,A214,Y264,G279,K98,A175,G145 |
| Rho1 | JH | B | 20 | 63 | 110 | 3.5 × 10−4 | 1 | S106,D28,W114,Y81,E117,Y89,A76,Q78,L84,E79, K22,W73,F176,F99,F107,V101,E163,Y161,G149, C153 |
| | MG | B | 20 | 53 | 100 | 8.3 × 10−5 | 1 | S106,D28,W114,Y81,Y89,A76,Q78,L84,E79,K22, W73,F99,F107,V101,V144,R137,E163,Y161,G149, C153 |
| | JH | C | 24 | 82 | 91 | 2.4 × 10−6 | 1 | L84,Y81,Q78,A76,E117,W114,Y89,D28,V24,E79, W73,K22,F99,F107,Y161,C153,G149,S106,V101, E163,G131,F176,F40,F57 |
| | MG | C | 22 | 55 | 98 | 7.8 × 10−7 | 1 | L84,Y81,Q78,A76,T52,W114,Y89,D28,E79,W73,K22, F99,F107,Y161,C153,G149,V144,S106,V101,E163, G131,R137 |
| eIF4AIII | JH | p=J | 8 | 18 | 212 | 2.7 × 10−4 | 1 | G165,R116,D169,R166,G143,T115,G196,P164 |
| | MG | p=J | 11 | 18 | 128 | 6.4 × 10−6 | 1 | G165,F197,R116,Q200,D169,R166,G143,T115, G142,G196,P164 |
-
*Input MSA: Jackhmmer, JH; MAPGAPS, MG.
†Explained in the footnotes to Table 1.
-
‡Codes designate BPPS category: 0, superfamily; 1, family.
§Pattern residue discrepancies between the Jackhmmer and MAPGAPS runs are shown in bold.
Additional files
-
Supplementary file 1
The pdb files used for computing RMSDs in Table 2.
- https://doi.org/10.7554/eLife.29880.019
-
Transparent reporting form
- https://doi.org/10.7554/eLife.29880.020