Figures and data in Inferring joint sequence-structural determinants of protein functional specificity

8 figures, 8 tables and 2 additional files

Figures

Figure 1 with 1 supplement

BPPS-SIPRIS analysis of the GNAT superfamily and Gna1-family based on structural coordinates for Gna1 (pdb: 4ag9) (Dorfmueller et al., 2012).

SIPRIS clearly associates Gna1-residues with the substrate and homodimeric interfaces (p=8.5 × 10⁻⁷). Color scheme: homodimer subunits A and B, green and blue backbones, respectively; BPPS-defined Gna1-family residues in subunits A and B, magenta and red sidechains, respectively (glycine residues are shown as C_α atom spheres); GNAT superfamily residues, yellow sidechains; ligands, cyan. Lys116 (shown in light red) is outside of the SIPRIS defined cluster, but forms a hydrogen bond to a CoA phosphate group. BPPS-SIPRIS spherical clustering identified the GNAT superfamily residues shown (p=1.7 × 10⁻⁵). The following figure supplement and source data are available for Figure 1.

https://doi.org/10.7554/eLife.29880.003

Figure 1—source data 1 Contrast alignments for Gna1 N-acetyltransferase.: https://doi.org/10.7554/eLife.29880.005

Figure 1—figure supplement 1

Applying SIPRIS to the Gna1 protein in conjunction with various methods.

Residue sets were defined using BPPS and three other programs (with default parameter settings). Residue color schemes: BPPS: GNAT superfamily, yellow; Gna1-family, red; FRpred: conserved, yellow; subtype, red; CLIPS-1D: structurally important, yellow; ligand binding, orange; catalytic, red; ET: residues of high functional importance, orange; CoA and substrate, cyan. The SIPRIS predefined clustering p-values corresponding to the homodimer/substrate interface are indicated below each image.

https://doi.org/10.7554/eLife.29880.004

Figure 2

BPPS-SIPRIS analysis of R⁴ P-loop GTPases.

Bound guanine nucleotide (shown in cyan) allows orientation of each subfigure relative to the others. (A). BPPS-defined hierarchical relationships among the GTPases examined here. (B). *Entamoeba histolytica* Rho1 GTPase (pdb: 3refB) (Bosch et al., 2011). Color scheme: R⁴-specific residues forming a BPPS-SIPRIS-defined hydrogen-bond network (p=8.3 × 10⁻⁵), red sidechains; residues conserved in P-loop GTPases and interacting with bound guanine nucleotide, yellow sidechains; atoms forming hydrogen bonds, CPK coloring. Modeled hydrogen atoms were generated using the Reduce program (Word et al., 1999). (C). Rab4 bound to GTP and to the Rab-binding domain of Rabenosyn (pdb: 1z0kA) (Eathiraj et al., 2005). BPPS-SIPRIS-defined residues distinctive of R⁴ (red sidechains) and Rab (orange) have core and Rabenosyn-contacting predefined cluster p-values of 2.6 × 10⁻⁶ and 2.9 × 10⁻⁸, respectively. The sensor threonine (Thr40) has substantial van der Waals contact with Glu44; Thr40 is an R⁴-specific (red) residue outside of the SIPRIS-defined cluster. (D). Rab8a in complex with the GTP analog, GNP, and with Ocrl1 (residues 540–678) (pdb: 3qbtA) (Hou et al., 2011). Residues distinctive of Rab GTPases (orange) and of the Rab8 subgroup (green) are enriched at the Ocrl1 interface (p=5.2 × 10⁻⁷ and 6.1 × 10⁻⁶, respectively). (E). Rab8a homodimeric complex (pdb: 4lhwAB) (Guo et al., 2013). Rab-specific residues (orange) are enriched at the homodimeric interface (p=8.7 × 10⁻⁷). The following source data are available for Figure 2.

https://doi.org/10.7554/eLife.29880.006

Figure 2—source data 1 Contrast alignments for Rab8, Rab4 and Rho1 GTPases.: https://doi.org/10.7554/eLife.29880.007

Figure 3

BPPS-SIPRIS analysis of translation-associated P-loop NTPases.

(A). *Thermus aquaticus* EF-Tu complexed with the antibiotic enacyloxin IIA, a GTP analog, and Phe-tRNA (pdb: 1ob5) (Parmeggiani et al., 2006). Color scheme: BPPS-SIPRIS defined GTPase-, TF- and EF-Tu/CysN-specific residues, yellow, red, and orange sidechains, respectively; GTPase domain backbone, green; C-terminal β-barrel domains, gray; Phe-tRNA, teal; 5’ end nucleotide bases, light cyan; guanine nucleotide, cyan; enacyloxin IIA, greenish-cyan. Spheres indicate glycine C_α atoms. (B). BPPS-SIPRIS cluster of EF-Tu TF-residues centered on EF-Ts Phe81 at the EF-Tu/EF-Ts interface (pdb: 1efu) (Kawashima et al., 1996). Regions in EF-Ts conserved between *E. coli* and cow are shown in cyan both in the figure and in the corresponding alignment below it. (C). *P. aeruginosa* EF-Tu bound to the Tse6 toxin domain (pdb: 4zv4) (Whitney et al., 2015). EF-Tu His20, which corresponds to His19 in (B), appears to form a salt bridge with Glu291 of Tse6. In light pink are regions of Tse6 contacting EF-Tu. Spherically clustered residues (p=0.0060) centered on Glu291 of Tse6 are shown with red sidechains. (D). Spherically clustered EF-Tu/CysN residues (orange; p=6.3 × 10⁻⁵) within the CysND complex (pdb: 1zun) (Mougous et al., 2006). (E). Spherically clustered EF-Tu/CysN-residues in EF-Tu (pdb: 1ob5) (p=1.0 × 10⁻⁶). (F). Human eIF4AIII bound to RNA, ADP, and the γ-phosphate transition state mimic AlF₃ (pdb: 3ex7) (Nielsen et al., 2009). Color scheme: eIF4AIII N- and C-terminal domains, violet and green, respectively; RNA and ADP, cyan; AlF₃, light cyan; superfamily-conserved catalytic residues, yellow sidechains; RNA helicase-specific residues clustered on (light cyan-colored) RNA bases 4–5, red; other RNA helicase-specific residues, light red; C-terminal catalytic residues, bright green. The following source data are available for Figure 3.

https://doi.org/10.7554/eLife.29880.008

Figure 3—source data 1 Contrast alignments for EFTu GTPase and eIF4AIII RNA helicase.: https://doi.org/10.7554/eLife.29880.009

Figure 4

BPPS-SIPRIS analysis of synaptojanin/EEP domains.

(A). The two major groups of the BPPS-defined EEP hierarchy examined here. (B). Human APE1 phosphorothioate substrate complex (pdb: 5dfi) (Freudenthal et al., 2015). Replacement of the phosphodiester bond with phosphorothioate prohibits cleavage by APE1 at the abasic site (circled). Cys310, which is nitrosated, is indicated. Color scheme: APE1 backbone trace, green; DNA strand containing the abasic site, cyan; complementary strand, marine blue; the BPPS-SIPRIS-defined residues distinctive of the EEP superfamily and of the exoIII-AP-endo family, yellow and red sidechains, respectively; basic residues within a loop interacting with the major groove of DNA, purple. (C). Close up of the APE1 active site. EEP-specific residues forming a hydrogen-bond network are shown with yellow sidechains. For clarity, only a few of the EEP- and exoIII-AP-endo-specific residues in the network are shown. The following source data are available for Figure 4.

https://doi.org/10.7554/eLife.29880.010

Figure 4—source data 1 Contrast alignments for APE1 endonuclease.: https://doi.org/10.7554/eLife.29880.011

Figure 5

BPPS-SIPRIS analysis of synaptojanin/EEP domains within INPP5 proteins.

Color code: EEP-residues, yellow sidechains; INPP5 residues, red sidechains; INPP5B-, INPP5E- and SHIP2-subfamily residues, orange sidechains; ligands, cyan; atoms involved in hydrogen bonds, CPK coloring. (A). Human INPP5B in complex with phosphatidylinositol 3,4-bisphosphate (pdb: 4cml) (Trésaugues et al., 2014), which is associated with cytosolic and mitochondrial membranes (Speed et al., 1995). BPPS-SIPRIS results: EEP spherical cluster, p=5.8 × 10⁻¹³; INPP5 spherical cluster, p=3.9 × 10⁻⁷; INPP5B spherical cluster, p=0.0021. (B). INPP5 hydrogen bond network within human INPP5B (pdb: 3mtc) (unpublished). (C). View of INPP5-residues (in 3mtc) that bind the 4-phosphate group required for substrate recognition. (D). Human INPP5B with phosphate bound to a possible membrane interaction or allosteric site (Mills et al., 2016). (E). Human INPP5B Ocrl with glycerol bound to the same site as indicated in (D) (Trésaugues et al., 2014). (F). INPP5 subgroups within the BPPS-defined hierarchy. (G). Human INPP5E (pdb: 2xsw) (unpublished), which is associated with the primary cilium, an organelle involved in signal transduction (Jacoby et al., 2009) (spherical cluster, p=3.6 × 10⁻⁴). (H). Human SHIP2 (pdb: 4a9c) (Mills et al., 2012), which is associated with membrane ruffle formation (Hasegawa et al., 2011) (spherical cluster, p=0.30). The following source data are available for Figure 5.

https://doi.org/10.7554/eLife.29880.012

Figure 5—source data 1 Contrast alignments for INPP5 phosphatases.: https://doi.org/10.7554/eLife.29880.013

Figure 6

BPPS-SIPRIS analysis of DNA glycosylases.

(A). Thymine DNA glycosylase (TDG) family (red sidechains) and metazoan subfamily (orange sidechains) residues forming a significant hydrogen bond network (p=3.5 × 10⁻⁵) within human TDG (pdb: 5hf7) (Pidugu et al., 2016). (B). TDG H-bond network consisting of residues distinctive both of all TDGs (red sidechains) and of metazoan TDGs (orange sidechains). This network includes hydrogen bonds to DNA oxygen atoms on either side of the thymine base to be excised (cyan); note that Phe238 and Tyr235 appear to position the N-terminus of their helix to hydrogen bond to substrate backbone oxygens; another such hydrogen bond involves Ser273, a residue generally conserved in the entire superfamily. The water molecule shown may act as the nucleophile in the reaction. For clarity, not all of the BPPS-SIPRIS-defined residues are shown. (C). TDG hydrogen-bond network residues may help position basic residues (green sidechains) interacting with the minor and major grooves of DNA. (D). TDG family-specific hydrogen-bond network residues surrounding a proposed catalytic water molecule (red sphere with dot cloud). (E). A BPPS-SIPRIS-defined H-bond network (p=1.7 × 10⁻⁵) distinct from that of TDG within *Thermus thermophilus* uracil DNA glycosylase (UDG) (pdb: 2dp6). The following source data are available for Figure 6.

https://doi.org/10.7554/eLife.29880.014

Figure 6—source data 1 Contrast alignments for DNA glycosylases.: https://doi.org/10.7554/eLife.29880.015

Figure 7

Overview of BPPS-SIPRIS analysis.

(A) Steps required for a BPPS-SIPRIS analysis. The fatax program adds phylum-annotations to database sequences. MAPGAPS detects and aligns database sequences containing the domain defined by a cma-formatted MSA or hiMSA. (MAPGAPS can also convert an MSA from fasta- to cma-format.) This creates an MSA that step 1 of BPPS then partitions hierarchically into subgroups based on discriminating pattern residues, as illustrated schematically in (B). Step E of BPPS checks for consistency between BPPS step 1 runs. Step 2 of BPPS adjusts the sub-alignment for each subgroup to align and possibly assign pattern residues to regions uniquely conserved in that subgroup, thereby creating a hiMSA. Step 3 of BPPS creates, for each node in the hiMSA, lineage-specific ‘contrast alignments’, as is illustrated schematically in (C), and a corresponding input file to SIPRIS, which identifies statistically significant structural interaction networks associated with pattern residues. For further descriptions, see text. (B) Schematic diagram of the node 8 contrast alignment. Sequences assigned to node 8's subtree (green subfamily nodes in (C)) constitute a ‘foreground’ partition; sequences assigned to the other nodes of the subtree rooted at the parent of node 8 (gray subfamily nodes in (C)) constitute a ‘background’ partition, and the remaining sequences constitute a non-participating partition. Green horizontal bars in (B) represent foreground sequences. The green vertical bars in (B) represent conserved foreground residue patterns (as shown below each bar); these diverge from (or contrast with) the background compositions at those positions (white vertical bars). Red vertical bars above quantify the degree of divergence. (C) Schematic diagram of a BPPS-3-generated set of ‘contrast alignments’ corresponding to the node 9 lineage of the sequence hierarchy in (A). Within a hiMSA, there is one such lineage for each leaf node. Horizontal lines represent aligned sequences and are colored by level in the hierarchy. Thin light gray horizontal lines represent non-homologous and deleted regions. Vertical lines represent the contrasting pattern positions upon which the hierarchy is based and are similarly colored by levels. The trees shown correspond to each subgroup along the lineage. The colored, gray and white nodes in each tree correspond, respectively, to their alignment foreground, background and non-participating partitions. The background for the entire superfamily (lower right) consists of standard amino acid frequencies at each position.
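The partitioning rule described in (B) and the per-leaf lineages described in (C) can be sketched in a few lines of Python. This is only an illustration of the caption's wording, not the BPPS implementation: the tree below, its node numbers, and the names `PARENT`, `subtree`, `contrast_partitions` and `lineage` are all hypothetical.

```python
from typing import Dict, List, Optional, Set

# Hypothetical hierarchy (child -> parent); the root has parent None.
# The topology is invented for illustration and is not read off the figure.
PARENT: Dict[int, Optional[int]] = {1: None, 2: 1, 3: 1, 7: 3, 8: 3, 9: 8, 10: 8}


def subtree(node: int) -> Set[int]:
    """Return the node together with all of its descendants."""
    nodes = {node}
    frontier = [node]
    while frontier:
        current = frontier.pop()
        children = [c for c, p in PARENT.items() if p == current]
        nodes.update(children)
        frontier.extend(children)
    return nodes


def contrast_partitions(node: int) -> Dict[str, Set[int]]:
    """Split all nodes into foreground, background and non-participating sets."""
    foreground = subtree(node)
    parent = PARENT[node]
    if parent is None:
        # Root node: per the caption, the background is a set of standard
        # amino acid frequencies rather than a set of sequences.
        background: Set[int] = set()
    else:
        background = subtree(parent) - foreground
    non_participating = set(PARENT) - foreground - background
    return {
        "foreground": foreground,
        "background": background,
        "non_participating": non_participating,
    }


def lineage(leaf: int) -> List[int]:
    """Return the path of nodes from the root down to the given node."""
    path: List[int] = []
    node: Optional[int] = leaf
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path[::-1]


print(contrast_partitions(8))
print(lineage(9))
```

In this toy tree, one contrast alignment would be produced for each node returned by `lineage(9)`, each with its own foreground and background sets.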

https://doi.org/10.7554/eLife.29880.016

Appendix 1—figure 1

Eleven haloacid dehalogenase sequences that the SFLD assigned to SG1129, but that are more closely related to SG1130 sequences.

The Venn diagram shows the overlap between the subgroups BSG15, SG1129 and SG1130 with the numbers of sequences indicated. The table gives the mean pairwise gapped BLAST scores for the 11 sequences assigned to both SG1129 and BSG15 versus the sequence sets shown; this analysis indicates that the 11 sequences should be reassigned from SG1129 to SG1130. Similar analyses indicate that four other sequences in SG1129 should be reassigned to SG1135 (based on mean scores of 27 versus 139) and that a sequence in SG1136 should be reassigned to SG1137 (based on a mean score of 8 versus 149).

https://doi.org/10.7554/eLife.29880.022

Tables

Table 1

Summary of BPPS-SIPRIS results for the most significant cluster in each test case.

https://doi.org/10.7554/eLife.29880.002

Protein	PDB structure	SIPRIS mode^*	Focal point^†	BPPS-SIPRIS^‡ Dist.	BPPS-SIPRIS^‡ Init.	BPPS-SIPRIS^‡ Term.	SIPRIS p-value	Tree level^§	Interpretive comments^#
Gna1	4ag9A	p=BDF	-	22	57	71	8.5 × 10⁻⁷	1	Substrate and homodimeric interfaces
		S	CoA	17	41	87	6.8 × 10⁻⁵	0	CoA-binding subdomain
		S	-	23	56	72	9.3 × 10⁻⁶	1	DCA-based clustering
		S	-	14	21	107	2.5 × 10⁻⁴	1	Structure-based clustering
Rho1	3refB	B	-	20	53	100	8.3 × 10⁻⁵	1	(Active site secondary shell)
		C	-	22	55	98	7.8 × 10⁻⁷	1	“ “ “ “
Rab4	1z0kA	S	-	10	11	153	2.1 × 10⁻⁵	1	(Active site secondary shell)
		C	-	25	91	73	2.6 × 10⁻⁶	1	“ “ “ “
		p=B	-	14	23	141	2.9 × 10⁻⁸	2	Interface with Rabenosyn-5
		S	-	22	42	122	4.8 × 10⁻¹⁰	2	“ “ “ “
Rab8	3qbtA	p=B	-	13	23	139	5.2 × 10⁻⁷	2	Interface with Ocrl1
		p=B	-	12	23	139	6.1 × 10⁻⁶	3	Interface with Ocrl1 helix
	4lhwB	p=A	-	10	14	148	8.7 × 10⁻⁷	2	Homodimeric interface
EF-Tu	1ob5A	S	-	18	33	150	1.4 × 10⁻⁷	1	(GTP to tRNA allosteric link)
		S	-	23	71	112	1.0 × 10⁻⁶	2	(GTP/tRNA allosteric link to β-barrel)
		S	1B	22	81	102	1.3 × 10⁻⁵	1	Cluster around 5’ base 1 of tRNA
		S	2B	18	47	136	2.6 × 10⁻⁶	1	Cluster around 5’ base 2 of tRNA
	1efuA	S	81B	14	49	128	5.2 × 10⁻⁵	1	(Nucleotide exchange allosteric network)
	4zv4A	S	291C	21	66	109	0.0060	1	(Mediates hijacking by Tse6 toxin)
CysN	1zunB	S	-	23	79	118	6.3 × 10⁻⁵	2	(Allosteric link to β-barrel domain)
eIF4AIII	3ex7H	p=J	-	11	18	128	6.4 × 10⁻⁶	1	(ATP to RNA allosteric link)
		S	4J	13	18	128	5.1 × 10⁻⁷	1	Cluster around RNA rotation bond
		S	5J	16	41	105	5.5 × 10⁻⁴	1	“ “ “ “ “
APE1	5dfiA	H	11P	9	13	238	5.2 × 10⁻⁶	0	Abasic site H-bond network
		H	11P	22	99	152	1.6 × 10⁻⁶	1	“ “ “ “
		H	-	25	137	114	1.7 × 10⁻⁶	1	(Active site secondary shell)
		H	9P	25	137	114	1.9 × 10⁻⁷	1	H-bond network positioning abasic site
		H	12P	23	119	132	7.6 × 10⁻⁶	1	“ “ “ “ “
Inpp5b	4cmlA	S	-	24	69	216	5.8 × 10⁻¹³	0	Active site core residues
		S	-	21	77	208	3.9 × 10⁻⁷	1	(Substrate recognition with allosteric link)
		S	-	12	30	255	0.0022	2	(Membrane substrate sequestration)
Inpp5b	3mtcA	S	-	22	91	194	8.0 × 10⁻⁷	1	(Substrate recognition with allosteric link)
		S	-	12	29	256	0.0015	2	(Membrane substrate sequestration)
Inpp5e	2xswA	S	-	25	140	148	3.7 × 10⁻⁷	1	(Substrate recognition with allosteric link)
		S	-	9	13	275	3.6 × 10⁻⁴	2	(Membrane substrate sequestration)
SHIP2	4a9cA	S	-	17	38	260	6.0 × 10⁻⁸	1	(Substrate recognition with allosteric link)
		S	-	4	4	294	0.30	2	(Membrane substrate sequestration)
TDG	5hf7A	H	17D	19	97	76	4.1 × 10⁻⁴	1	H-bond network around excised base
		H	-	20	98	75	3.5 × 10⁻⁵	1	H-bond network around catalytic water
UDG	2dp6A	B	-	13	17	121	1.7 × 10⁻⁵	1	H-bond network distinct from TDG

^*Modes: S, spherical expansion; C, core expansion; H, hydrogen bond expansion (involving sidechain interactions); B, hydrogen bond expansion (also involving backbone-to-backbone interactions); P, predefined clustering (residues in the cluster are those interacting with the chain(s) whose pdb identifiers are given to the right of the equal sign).

^†Focal points defining starting residue(s): ‘-’, analysis was optimized over multiple starting residues (i.e., no focal point); CoA, cluster initiated from the residue closest to Coenzyme A; others, cluster initiated from the residue closest to the indicated position and chain (e.g., 1B = position 1 in pdb chain B).
^‡Nature of the optimum cluster: dist., the number of distinguishing residues within the cluster (total = 25); init., the total number of residues within the cluster; term., the number of residues outside of the cluster.

^§Codes designate pattern residue class: 0, superfamily; 1, family; 2, subfamily; 3, sub-subfamily. In the figures, these correspond to residues with yellow, red, orange and green sidechains, respectively.
^#Comments in parentheses indicate possible functions.
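If Init. and Term. count the residues inside and outside the cluster, as footnote ‡ states, then their sum should equal the number of residues analyzed and should therefore be the same for every row of a given structure. The following sketch checks this for a few rows copied from Table 1; the function name `domain_sizes` is hypothetical, and the check is a reading aid rather than part of SIPRIS.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# (structure, Dist., Init., Term.) copied from selected rows of Table 1.
ROWS: List[Tuple[str, int, int, int]] = [
    ("4ag9A", 22, 57, 71),
    ("4ag9A", 17, 41, 87),
    ("4ag9A", 23, 56, 72),
    ("4ag9A", 14, 21, 107),
    ("5dfiA", 9, 13, 238),
    ("5dfiA", 22, 99, 152),
    ("5dfiA", 25, 137, 114),
]


def domain_sizes(rows: List[Tuple[str, int, int, int]]) -> Dict[str, Set[int]]:
    """Collect the distinct Init. + Term. totals seen for each structure."""
    sizes: Dict[str, Set[int]] = defaultdict(set)
    for structure, dist, init, term in rows:
        # Distinguishing residues in the cluster cannot outnumber the cluster.
        if dist > init:
            raise ValueError(f"{structure}: Dist. {dist} exceeds Init. {init}")
        sizes[structure].add(init + term)
    return dict(sizes)


for structure, totals in domain_sizes(ROWS).items():
    status = "consistent" if len(totals) == 1 else "inconsistent"
    print(structure, sorted(totals), status)
```

A structure that reports more than one total would indicate a transcription error in one of its rows.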

Table 2

Structural diversity among proteins identified and aligned by MAPGAPS.

https://doi.org/10.7554/eLife.29880.017

Superfamily	Structures^* % ID	Structures^* No.	RMSD^† (Å) Avg	RMSD^† (Å) Min	RMSD^† (Å) Max	RMSD^† (Å) S.D.	Domain length^‡ MSA	Domain length^‡ Avg	Domain length^‡ S.D.	Resolution (Å) Avg	Resolution (Å) Max
GNAT	27	16	3.25	1.0	6.7	1.4	125	139.8	17.0	1.94	2.61
GTPases	30	20	3.96	0.6	14.7	3.5	164	195.9	41.6	2.31	3.10
Helicases	40	12	6.39	2.6	9.8	1.8	466	482.8	60.7	2.86	3.56
EEP	40	16	3.02	0.8	5.2	0.95	241	259.0	27.6	2.07	2.99
UDG/TDG	40	8	2.54	1.1	3.6	0.69	125	135.9	12.7	1.83	2.58

^*NMR and poor resolution structures were not used; no two proteins in each set contained more than the indicated level of percent sequence identity (% ID); pdb identifiers for these are given in supplementary file 1.

^†RMSDs were computed using MUSTANG (Konagurthu et al., 2006) with default parameters; the structural coordinates used for the analysis were limited to the domain of interest.
^‡The number of aligned columns in the MSA, and the average length and standard deviation of the domain ‘footprint’.

Table 3

Summary of BPPS results for five superfamilies.

https://doi.org/10.7554/eLife.29880.018

Superfamily	Subgroup	# Sequences	% Identity^*	# Nodes in subtree	Minimum subtree size
GNAT		237,359	98	44	200
	Gna1 family	1243		1
GTPases		127,418	95	121	500
	R⁴ family	18,901		26
	Rab subfamily	7002		12
	Rab8 sub-subfamily	3312		7
	TF family	25,224		10
	EFTu/CysN subfamily	4429		3
Helicases		131,321	98	47	300
	RNA helicases	36,788		8
EEP		45,799	99	166	100
	exoIII-AP-endo	13,711		47
	INPP5	3855		14
TDG/UDG		23,592	98	47	100
	TDG	1639		6
	UDG	376		1

^*The maximum % identity allowed between any two sequences in the set

Appendix 1—table 1

Summary of SFLD benchmarking of BPPS.

https://doi.org/10.7554/eLife.29880.023

Superfamily	# subgroups (SFLD)	# subgroups (BPPS)	BPPS min.^*	Annotated by SFLD: No	Annotated by SFLD: Yes	Annotated by SFLD: expt^†	BSG^‡	BPPS conflicts^§: Error	BPPS conflicts^§: ?	BPPS conflicts^§: Correct	Maximum % errors^#
radical SAM	49	17	800	52,608	17,680	12	13,676	10	6	326	0.12
glutathione transferase	26	15	100	6921	3633	0	1945	0	0	0	0
peroxiredoxin	6	11	100	3870	5521	0	5255	0	1	0	0.02
haloacid dehalogenase	24	28	200	21,768	33,379	9	26,589	35	66	27	0.38
isoprenoid synthase I	9	7	200	9666	1604	55	1536	0	0	0	0
isoprenoid synthase II	3	5	100	6974	671	38	591	1	0	0	0.17
nitroreductase	110	11	200	0	17,318	0	7242	20	11	0	0.43
enolase	8	8	800	26,227	2267	7	2143	0	0	0	0
				total:	82,073	121	58,977	66	84	353	avg: 0.14

^*The minimum number of sequences required for each BPPS subgroup.

^†Numbers of experimentally validated annotations.
^‡The number of SFLD annotated sequences assigned by BPPS to a subgroup.

^§The number of SFLD annotated sequences in conflict with BPPS classification; error, SFLD annotation appears to be correct; ‘?’, not sure whether SFLD or BPPS is correct; correct, BPPS appears to be correct
^#Percent erroneous or ambiguous (‘?’) BPPS assignments among annotated sequences not assigned to a root node.
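The maximum % errors column appears to be the sum of the ‘Error’ and ‘?’ counts divided by the BSG count. That formula is inferred from footnote # rather than stated explicitly in the table, but it reproduces the reported values for the rows with non-zero conflicts, as the following sketch shows (the function name `max_percent_errors` is hypothetical).

```python
from typing import Dict, Tuple

# superfamily -> (BSG, Error, ?) copied from Appendix 1—table 1.
ROWS: Dict[str, Tuple[int, int, int]] = {
    "radical SAM": (13676, 10, 6),
    "peroxiredoxin": (5255, 0, 1),
    "haloacid dehalogenase": (26589, 35, 66),
    "isoprenoid synthase II": (591, 1, 0),
    "nitroreductase": (7242, 20, 11),
}


def max_percent_errors(bsg: int, errors: int, unsure: int) -> float:
    """Percent of subgroup-assigned annotated sequences that are erroneous or unsure.

    Assumed formula: 100 * (Error + ?) / BSG, rounded to two decimals.
    """
    return round(100.0 * (errors + unsure) / bsg, 2)


for name, (bsg, errors, unsure) in ROWS.items():
    print(f"{name}: {max_percent_errors(bsg, errors, unsure):.2f}")
```

Treating every ambiguous case as an error is what makes this an upper bound on the BPPS error rate.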

Appendix 1—table 2

Correspondence between BPPS and SFLD subgroups for haloacid dehalogenases*.

https://doi.org/10.7554/eLife.29880.024

BPPS subgroup ID	SFLD subgroup ID^‡	SFLD & BPPS^§	SFLD total^#	SFLD %^#
root^†	various	1531	1618	96.2
	1138	82	833	9.8
34	0	200	21768	0.9
	1129	3	129	2.3
23	0	101	21768	0.5
	1124	1	495	0.2
	1135	125	9423	1.3
21	0	158	21768	0.7
	1124	91	495	18.4
	1145	2	43	4.7
20	0	46	21768	0.2
	1131	162	201	80.6
25	0	76	21768	0.3
	1135	311	9423	3.3
2	0	1915	21768	8.8
	2	10091	11846	85.2
3	0	937	21768	4.3
	1129	4	129	3.1
	1134	1	866	0.1
	1135	4500	9423	47.8
	1139	4	1851	0.2
	1140	1	821	0.1
4	0	2422	21768	11.1
	2	1	11846	0.0
	1137	9	1430	0.6
	1140	3	821	0.4
	1141	53	278	19.1
	1142	2	236	0.8
	1144	2497	2759	90.5
5	0	229	21768	1.1
	2	986	11846	8.3
6	0	342	21768	1.6
	1124	360	495	72.7
7	0	330	21768	1.5
	1134	628	866	72.5
33	0	100	21768	0.5
	1134	153	866	17.7
8	0	57	21768	0.3
	1133	400	400	100
32	0	32	21768	0.1
	1139	218	1851	11.8
31	0	28	21768	0.1
	1139	216	1851	11.7
9	0	195	21768	0.9
	1134	1	866	0.1
	1139	942	1851	50.9
10	0	105	21768	0.5
	1137	896	1430	62.7
11	0	284	21768	1.3
	1135	3	9423	0.0
12	0	478	21768	2.2
	1136	1	246	0.4
	1137	178	1430	12.4
13	0	117	21768	0.5
	1138	751	833	90.2
14	0	1034	21768	4.8
	1135	32	9423	0.3
15	0	525	21768	2.4
	1129	11	129	8.5
	1130	809	836	96.8
	1132	6	227	2.6
	1135	63	9423	0.7
	1139	3	1851	0.2
16	0	337	21768	1.5
	1135	1	9423	0.0
	1140	670	821	81.6
17	0	230	21768	1.1
	1135	288	9423	3.1
18	0	505	21768	2.3
	1135	950	9423	10.1
19	0	197	21768	0.9
	1129	3	129	2.3
22	0	338	21768	1.6
24	0	110	21768	0.5
	1135	107	9423	1.1

*Erroneous, ambiguous and corrected classifications are shown as italicized, underlined, and bold, respectively.

^†Averages over 12 root-assigned subgroups.
^‡SFLD subgroups represented in each BPPS subgroup; zero indicates the SFLD unannotated sequence set.

^§The number of sequences in both the SFLD and BPPS subgroups in each row.
^#Total number of sequences in each SFLD subgroup and the percentage of these in the BPPS subgroup.

Appendix 1—table 3

Haloacid dehalogenase SG1129 sequences that BPPS assigned to distinct subgroups (BSG).

https://doi.org/10.7554/eLife.29880.025

BSG	# seqs in BSG	# SG1129 seqs in BSG^*	% matches to BPPS pattern^†: BSG 34	% matches: BSG 3	% matches: BSG 15	% matches: BSG 19	Mean score^‡ vs other seqs in SG1129	Mean score^‡ vs other seqs in BSG
34	203	3	80	31	37	8	14	399
3^§	5447	4	6	96	55	5	27	137
15^#	1417	11	10	50	87	12	21	104
19	200	3	8	21	36	83	9	488
root	16,869	108	9	40	42	8	na	na

*The number of SG1129 sequences assigned to the BSG in each row.

^†Average percentage of matches to the pattern residues for their assigned BSG among the SG1129 sequences. The highest percentages (bold) correspond to each BSG’s own pattern.
^‡The mean pairwise BLAST scores of the reassigned sequences against the remaining sequences either in SG1129 or in the BSG for that row.

^§This BSG corresponds to SG1135. (See Appendix 1—table 2.)
^#This BSG corresponds to SG1130. (See Appendix 1—table 2.)
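The reassignment argument rests on comparing two mean pairwise BLAST scores for each row: the score against the rest of SG1129 and the score against the rest of the BSG. A minimal sketch of that comparison, using the values from the table above, follows; the function name `closer_to_bsg` and the simple "larger mean wins" rule are assumptions made for illustration, not a description of how the scores were generated.

```python
from typing import List, Tuple

# (BSG, number of SG1129 sequences, mean score vs SG1129, mean score vs BSG)
# copied from Appendix 1—table 3; the root row has no scores and is omitted.
ROWS: List[Tuple[str, int, int, int]] = [
    ("34", 3, 14, 399),
    ("3", 4, 27, 137),
    ("15", 11, 21, 104),
    ("19", 3, 9, 488),
]


def closer_to_bsg(rows: List[Tuple[str, int, int, int]]) -> List[str]:
    """Return the BSGs whose SG1129 members score higher against the BSG."""
    return [bsg for bsg, _count, vs_sfld, vs_bsg in rows if vs_bsg > vs_sfld]


for bsg, count, vs_sfld, vs_bsg in ROWS:
    ratio = vs_bsg / vs_sfld
    print(f"BSG {bsg}: {count} sequences, BSG/SG1129 score ratio {ratio:.1f}")

print(closer_to_bsg(ROWS))
```

Every row in the table satisfies the rule, which is consistent with the caption's claim that BPPS assigned these sequences to distinct subgroups.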

Appendix 1—table 4

Average percentage of matches to various BPPS subgroup (BSG) patterns for haloacid dehalogenase sequences assigned to SFLD subgroup SG1135.

https://doi.org/10.7554/eLife.29880.026

BSG ID	# seqs in BSG	# SG1135 seqs	% match^* to BSG 15	BSG 16	BSG 25	BSG 3	BSG 11	BSG 14	BSG 17	BSG 18	BSG 24	BSG 23	Mean score^† vs others in SG1135	Mean score^† vs others in BSG	New SGs^‡
15	1417	63	69	35	32	51	8	42	19	56	22	20	63	64	?
16	1008	1	43	40	52	52	12	40	40	52	24	8	110	14	error
25	387	311	48	37	84	57	13	36	43	45	27	10	80	309	yes
3	5447	4500	49	25	35	90	15	37	20	38	20	9	148	132	?
11	287	3	33	29	37	51	93	32	17	32	16	8	44	632	yes
14	1066	32	49	27	25	37	9	77	18	47	19	13	42	130	yes
17	518	288	41	39	41	42	8	30	88	48	25	6	55	321	yes
18	1455	950	58	34	28	44	9	50	16	88	20	15	43	169	yes
24	217	107	42	35	26	44	9	32	20	42	92	4	53	293	yes
23	227	125	62	32	23	41	4	32	10	36	14	90	45	403	yes
root	n.a.	3040	46	37	41	56	12	40	29	46	23	10	n.a.	n.a.	n.a.

*Average percentage of matches to the pattern residues for their assigned BSG among the SG1135 sequences. The highest percentage in each row is shown in bold.

^†The mean pairwise BLAST scores of the BPPS-assigned sequences against the remaining sequences either in SG1135 or in the BSG for that row. The highest scores in each row are bold. (See Appendix 1—table 2.)
^‡A ‘yes’ in this column indicates that the SG1135 sequences assigned to the BSG in that row likely correspond to a subgroup distinct from SG1135; ‘?’ indicates a possible subcategory of SG1135; ‘error’ indicates a BPPS misclassification.
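One way to read the percentage columns is to ask, for each row, whether the sequences match their own BSG pattern at least as well as any other pattern. The sketch below applies that check to the values in the table above; the names `PATTERNS`, `MATCHES` and `own_pattern_is_best` are hypothetical, and the check is offered as a reading aid, not as the criterion the authors used.

```python
from typing import Dict, List

# Column order of the ten BSG patterns in Appendix 1—table 4.
PATTERNS: List[str] = ["15", "16", "25", "3", "11", "14", "17", "18", "24", "23"]

# Row BSG -> % matches to each pattern, in the column order above.
MATCHES: Dict[str, List[int]] = {
    "15": [69, 35, 32, 51, 8, 42, 19, 56, 22, 20],
    "16": [43, 40, 52, 52, 12, 40, 40, 52, 24, 8],
    "25": [48, 37, 84, 57, 13, 36, 43, 45, 27, 10],
    "3": [49, 25, 35, 90, 15, 37, 20, 38, 20, 9],
    "11": [33, 29, 37, 51, 93, 32, 17, 32, 16, 8],
    "14": [49, 27, 25, 37, 9, 77, 18, 47, 19, 13],
    "17": [41, 39, 41, 42, 8, 30, 88, 48, 25, 6],
    "18": [58, 34, 28, 44, 9, 50, 16, 88, 20, 15],
    "24": [42, 35, 26, 44, 9, 32, 20, 42, 92, 4],
    "23": [62, 32, 23, 41, 4, 32, 10, 36, 14, 90],
}


def own_pattern_is_best(bsg: str) -> bool:
    """True if the row's own pattern has the highest % match in that row."""
    scores = MATCHES[bsg]
    own_score = scores[PATTERNS.index(bsg)]
    return own_score == max(scores)


mismatched = [bsg for bsg in MATCHES if not own_pattern_is_best(bsg)]
print(mismatched)
```

The only row that fails the check is BSG 16, the same row that the last column of the table marks as a BPPS misclassification.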

Appendix 1—table 5

BPPS-SIPRIS analyses using MAPGAPS (MG) versus Jackhmmer (JH) generated MSAs as input.

https://doi.org/10.7554/eLife.29880.027

Protein	MSA^*	SIPRIS mode^†	BPPS-SIPRIS^† Dist.	BPPS-SIPRIS^† Init.	BPPS-SIPRIS^† Term.	SIPRIS p-value	Tree level^‡	Optimal BPPS pattern ∩ SIPRIS cluster^§
Gna1	JH	p=BDF	21	69	87	1.2 × 10⁻⁵	1	F93,I94,D105,K136,H95,Y68,Y135,T44,R102,E104, E90,C141,K92,L40,L43,V88,V134,Y36,L27,F58,V89
	MG	p=BDF	22	57	71	8.5 × 10⁻⁷	1	F93,I94,D105,K136,H95,Y68,Y135,T44,R102,E104, E90,C141,K92,M61,L40,L43,V134,F54,Y36,F58, G98,V89
	JH	S	14	21	135	2.1 × 10⁻⁵	1	E90,K92,R102,V89,V88,G101,Y135,I94,V134,K136, F93,E104,Y68,H95
	MG	S	14	21	107	2.5 × 10⁻⁴	1	E90,K92,R102,V89,G101,Y135,I94,V134,K136,F93, G98,E104,Y68,H95
APE1	JH	H	15	38	219	5.0 × 10⁻⁵	0	V206,L167,Q95,S66,G209,W67,P311,H309,D283, S307,N68,D210,E96,N212,R185
	MG	H	16	33	218	4.2 × 10⁻⁷	0	V206,L167,F165,Q95,S66,G209,V69,W67,H309, D283,T265,S307,N68,D210,E96,N212
	JH	H	25	158	99	8.8 × 10⁻⁶	1	Y128,E154,R156,Y171,P173,W188,D70,W267,N277, E236,C310,G71,R237,D219,R254,R281,V213,A214, L62,G279,K98,V131,A175,L72,R181
	MG	H	25	137	114	1.7 × 10⁻⁶	1	Y128,G155,E154,D152,R156,Y171,P173,R185,D70, W267,N277,W188,E236,C310,Y269,R237,D219, R254,V213,A214,Y264,G279,K98,A175,G145
Rho1	JH	B	20	63	110	3.5 × 10⁻⁴	1	S106,D28,W114,Y81,E117,Y89,A76,Q78,L84,E79, K22,W73,F176,F99,F107,V101,E163,Y161,G149, C153
	MG	B	20	53	100	8.3 × 10⁻⁵	1	S106,D28,W114,Y81,Y89,A76,Q78,L84,E79,K22, W73,F99,F107,V101,V144,R137,E163,Y161,G149, C153
	JH	C	24	82	91	2.4 × 10⁻⁶	1	L84,Y81,Q78,A76,E117,W114,Y89,D28,V24,E79, W73,K22,F99,F107,Y161,C153,G149,S106,V101, E163,G131,F176,F40,F57
	MG	C	22	55	98	7.8 × 10⁻⁷	1	L84,Y81,Q78,A76,T52,W114,Y89,D28,E79,W73,K22, F99,F107,Y161,C153,G149,V144,S106,V101,E163, G131,R137
eIF4AIII	JH	p=J	8	18	212	2.7 × 10⁻⁴	1	G165,R116,D169,R166,G143,T115,G196,P164
	MG	p=J	11	18	128	6.4 × 10⁻⁶	1	G165,F197,R116,Q200,D169,R166,G143,T115, G142,G196,P164

*Input MSA: Jackhmmer, JH; MAPGAPS, MG.

^†Explained in the footnotes to Table 1.
^‡Codes designate BPPS category: 0, superfamily; 1, family.

^§Pattern residue discrepancies between the Jackhmmer and MAPGAPS runs are shown in bold.

Additional files

Supplementary file 1 The pdb files used for computing RMSDs in Table 2.: https://doi.org/10.7554/eLife.29880.019
Transparent reporting form: https://doi.org/10.7554/eLife.29880.020

Andrew F Neuwald, L Aravind, Stephen F Altschul (2018) Inferring joint sequence-structural determinants of protein functional specificity. eLife 7:e29880. https://doi.org/10.7554/eLife.29880
