Inferring joint sequence-structural determinants of protein functional specificity

  1. Andrew F Neuwald (corresponding author)
  2. L Aravind
  3. Stephen F Altschul
  1. University of Maryland School of Medicine, United States
  2. National Institutes of Health, United States
8 figures, 8 tables and 2 additional files

Figures

Figure 1 with 1 supplement
BPPS-SIPRIS analysis of the GNAT superfamily and Gna1-family based on structural coordinates for Gna1 (pdb: 4ag9) (Dorfmueller et al., 2012).

SIPRIS clearly associates Gna1-residues with the substrate and homodimeric interfaces (p=8.5 × 10−7). Color scheme: homodimer subunits A and B, green and blue backbones, respectively; BPPS-defined Gna1-family residues in subunits A and B, magenta and red sidechains, respectively (glycine residues are shown as Cα atom spheres); GNAT superfamily residues, yellow sidechains; ligands, cyan. Lys116 (shown in light red) is outside of the SIPRIS defined cluster, but forms a hydrogen bond to a CoA phosphate group. BPPS-SIPRIS spherical clustering identified the GNAT superfamily residues shown (p=1.7 × 10−5). The following figure supplement and source data are available for Figure 1.

https://doi.org/10.7554/eLife.29880.003
Figure 1—source data 1

Contrast alignments for Gna1 N-acetyltransferase.

https://doi.org/10.7554/eLife.29880.005
Figure 1—figure supplement 1
Applying SIPRIS to the Gna1 protein in conjunction with various methods.

Residue sets were defined using BPPS and three other programs (with default parameter settings). Residue color schemes: BPPS: GNAT superfamily, yellow; Gna1-family, red; FRpred: conserved, yellow; subtype, red; CLIPS-1D: structurally important, yellow; ligand binding, orange; catalytic, red; ET: residues of high functional importance, orange; CoA and substrate, cyan. The SIPRIS predefined clustering p-values corresponding to the homodimer/substrate interface are indicated below each image.

https://doi.org/10.7554/eLife.29880.004
Figure 2
BPPS-SIPRIS analysis of R4 P-loop GTPases.

Bound guanine nucleotide (shown in cyan) allows orientation of each subfigure relative to the others. (A). BPPS-defined hierarchical relationships among the GTPases examined here. (B). Entamoeba histolytica Rho1 GTPase (pdb: 3refB) (Bosch et al., 2011). Color scheme: R4-specific residues forming a BPPS-SIPRIS-defined hydrogen-bond network (p=8.3 × 10−5), red sidechains; residues conserved in P-loop GTPases and interacting with bound guanine nucleotide, yellow sidechains; atoms forming hydrogen bonds, CPK coloring. Modeled hydrogen atoms were generated using the Reduce program (Word et al., 1999). (C). Rab4 bound to GTP and to the Rab-binding domain of Rabenosyn (pdb: 1z0kA) (Eathiraj et al., 2005). BPPS-SIPRIS-defined residues distinctive of R4 (red sidechains) and Rab (orange) have core and Rabenosyn-contacting predefined cluster p-values of 2.6 × 10−6 and 2.9 × 10−8, respectively. The sensor threonine (Thr40) has substantial van der Waals contact with Glu44; Thr40 is an R4-specific (red) residue outside of the SIPRIS-defined cluster. (D). Rab8a in complex with the GTP analog, GNP, and with Ocrl1 (residues 540–678) (pdb: 3qbtA) (Hou et al., 2011). Residues distinctive of Rab GTPases (orange) and of the Rab8 subgroup (green) are enriched at the Ocrl1 interface (p=5.2 × 10−7 and 6.1 × 10−6, respectively). (E). Rab8a homodimeric complex (pdb: 4lhwAB) (Guo et al., 2013). Rab-specific residues (orange) are enriched at the homodimeric interface (p=8.7 × 10−7). The following source data are available for Figure 2.

https://doi.org/10.7554/eLife.29880.006
Figure 2—source data 1

Contrast alignments for Rab8, Rab4 and Rho1 GTPases.

https://doi.org/10.7554/eLife.29880.007
Figure 3
BPPS-SIPRIS analysis of translation-associated P-loop NTPases.

(A). Thermus aquaticus EF-Tu complexed with the antibiotic enacyloxin IIA, a GTP analog, and Phe-tRNA (pdb: 1ob5) (Parmeggiani et al., 2006). Color scheme: BPPS-SIPRIS-defined GTPase-, TF- and EF-Tu/CysN-specific residues, yellow, red, and orange sidechains, respectively; GTPase domain backbone, green; C-terminal β-barrel domains, gray; Phe-tRNA, teal; 5’ end nucleotide bases, light cyan; guanine nucleotide, cyan; enacyloxin IIA, greenish-cyan. Spheres indicate glycine Cα atoms. (B). BPPS-SIPRIS cluster of EF-Tu TF-residues centered on EF-Ts Phe81 at the EF-Tu/EF-Ts interface (pdb: 1efu) (Kawashima et al., 1996). Regions in EF-Ts conserved between E. coli and cow are shown in cyan both in the figure and in the corresponding alignment below it. (C). P. aeruginosa EF-Tu bound to the Tse6 toxin domain (pdb: 4zv4) (Whitney et al., 2015). EF-Tu His20, which corresponds to His19 in (B), appears to form a salt bridge with Glu291 of Tse6. In light pink are regions of Tse6 contacting EF-Tu. Spherically clustered residues (p=0.0060) centered on Glu291 of Tse6 are shown with red sidechains. (D). Spherically clustered EF-Tu/CysN residues (orange; p=6.3 × 10−5) within the CysND complex (pdb: 1zun) (Mougous et al., 2006). (E). Spherically clustered EF-Tu/CysN-residues in EF-Tu (pdb: 1ob5) (p=1.0 × 10−6). (F). Human eIF4AIII bound to RNA, ADP, and the γ-phosphate transition state mimic AlF3 (pdb: 3ex7) (Nielsen et al., 2009). Color scheme: eIF4AIII N- and C-terminal domains, violet and green, respectively; RNA and ADP, cyan; AlF3, light cyan; superfamily-conserved catalytic residues, yellow sidechains; RNA helicase-specific residues clustered on (light cyan-colored) RNA bases 4–5, red; other RNA helicase-specific residues, light red; C-terminal catalytic residues, bright green. The following source data are available for Figure 3.

https://doi.org/10.7554/eLife.29880.008
Figure 3—source data 1

Contrast alignments for EF-Tu GTPase and eIF4AIII RNA helicase.

https://doi.org/10.7554/eLife.29880.009
Figure 4
BPPS-SIPRIS analysis of synaptojanin/EEP domains.

(A). The two major groups of the BPPS-defined EEP hierarchy examined here. (B). Human APE1 phosphorothioate substrate complex (pdb: 5dfi) (Freudenthal et al., 2015). Replacement of the phosphodiester bond with phosphorothioate prohibits cleavage by APE1 at the abasic site (circled). Cys310, which is nitrosated, is indicated. Color scheme: APE1 backbone trace, green; DNA strand containing the abasic site, cyan; complementary strand, marine blue; the BPPS-SIPRIS-defined residues distinctive of the EEP superfamily and of the exoIII-AP-endo family, yellow and red sidechains, respectively; basic residues within a loop interacting with the major groove of DNA, purple. (C). Close up of the APE1 active site. EEP-specific residues forming a hydrogen-bond network are shown with yellow sidechains. For clarity, only a few of the EEP- and exoIII-AP-endo-specific residues in the network are shown. The following source data are available for Figure 4.

https://doi.org/10.7554/eLife.29880.010
Figure 4—source data 1

Contrast alignments for APE1 endonuclease.

https://doi.org/10.7554/eLife.29880.011
Figure 5
BPPS-SIPRIS analysis of synaptojanin/EEP domains within INPP5 proteins.

Color code: EEP-residues, yellow sidechains; INPP5 residues, red sidechains; INPP5B-, INPP5E- and SHIP2-subfamily residues, orange sidechains; ligands, cyan; atoms involved in hydrogen bonds, CPK coloring. (A). Human INPP5B in complex with phosphatidylinositol 3,4-bisphosphate (pdb: 4cml) (Trésaugues et al., 2014), which is associated with cytosolic and mitochondrial membranes (Speed et al., 1995). BPPS-SIPRIS results: EEP spherical cluster, p=5.8 × 10−13; INPP5 spherical cluster, p=3.9 × 10−7; INPP5B spherical cluster, p=0.0021. (B). INPP5 hydrogen bond network within human INPP5B (pdb: 3mtc) (unpublished). (C). View of INPP5-residues (in 3mtc) that bind the 4-phosphate group required for substrate recognition. (D). Human INPP5B with phosphate bound to a possible membrane interaction or allosteric site (Mills et al., 2016). (E). Human INPP5B Ocrl with glycerol bound to the same site as indicated in (D) (Trésaugues et al., 2014). (F). INPP5 subgroups within the BPPS-defined hierarchy. (G). Human INPP5E (pdb: 2xsw) (unpublished), which is associated with the primary cilium, an organelle involved in signal transduction (Jacoby et al., 2009) (spherical cluster, p=3.6 × 10−4). (H). Human SHIP2 (pdb: 4a9c) (Mills et al., 2012), which is associated with membrane ruffle formation (Hasegawa et al., 2011) (spherical cluster, p=0.30). The following source data are available for Figure 5.

https://doi.org/10.7554/eLife.29880.012
Figure 5—source data 1

Contrast alignments for INPP5 phosphatases.

https://doi.org/10.7554/eLife.29880.013
Figure 6
BPPS-SIPRIS analysis of DNA glycosylases.

(A). Thymine DNA glycosylase (TDG) family (red sidechains) and metazoan subfamily (orange sidechains) residues forming a significant hydrogen bond network (p=3.5 × 10−5) within human TDG (pdb: 5hf7) (Pidugu et al., 2016). (B). TDG H-bond network consisting of residues distinctive both of all TDGs (red sidechains) and of metazoan TDGs (orange sidechains). This network includes hydrogen bonds to DNA oxygen atoms on either side of the thymine base to be excised (cyan); note that Phe238 and Tyr235 appear to position the N-terminus of their helix to hydrogen bond to substrate backbone oxygens; another such hydrogen bond involves Ser273, a residue generally conserved in the entire superfamily. The water molecule shown may act as the nucleophile in the reaction. For clarity, not all of the BPPS-SIPRIS-defined residues are shown. (C). TDG hydrogen-bond network residues may help position basic residues (green sidechains) interacting with the minor and major grooves of DNA. (D). TDG family-specific hydrogen-bond network residues surrounding a proposed catalytic water molecule (red sphere with dot cloud). (E). A BPPS-SIPRIS-defined H-bond network (p=1.7 × 10−5) distinct from that of TDG within Thermus thermophilus uracil DNA glycosylase (UDG) (pdb: 2dp6). The following source data are available for Figure 6.

https://doi.org/10.7554/eLife.29880.014
Figure 6—source data 1

Contrast alignments for DNA glycosylases.

https://doi.org/10.7554/eLife.29880.015
Figure 7
Overview of BPPS-SIPRIS analysis.

(A) Steps required for a BPPS-SIPRIS analysis. The fatax program adds phylum annotations to database sequences. MAPGAPS detects and aligns database sequences containing the domain defined by a cma-formatted MSA or hiMSA. (MAPGAPS can also convert an MSA from fasta- to cma-format.) This creates an MSA that step 1 of BPPS then partitions hierarchically into subgroups based on discriminating pattern residues, as illustrated schematically in (B). Step E of BPPS checks for consistency between BPPS step 1 runs. Step 2 of BPPS adjusts the sub-alignment for each subgroup to align and possibly assign pattern residues to regions uniquely conserved in that subgroup, thereby creating a hiMSA. Step 3 of BPPS creates, for each node in the hiMSA, lineage-specific ‘contrast alignments’, as is illustrated schematically in (C), and a corresponding input file to SIPRIS, which identifies statistically significant structural interaction networks associated with pattern residues. For further descriptions, see text. (B) Schematic diagram of the node 8 contrast alignment. Sequences assigned to node 8's subtree (green subfamily nodes in (C)) constitute a ‘foreground’ partition; sequences assigned to the other nodes of the subtree rooted at the parent of node 8 (gray subfamily nodes in (C)) constitute a ‘background’ partition, and the remaining sequences constitute a non-participating partition. Green horizontal bars in (B) represent foreground sequences. The green vertical bars in (B) represent conserved foreground residue patterns (as shown below each bar); these diverge from (or contrast with) the background compositions at those positions (white vertical bars). Red vertical bars above quantify the degree of divergence. (C) Schematic diagram of a BPPS-3-generated set of ‘contrast alignments’ corresponding to the node 9 lineage of the sequence hierarchy in (A). Within a hiMSA, there is one such lineage for each leaf node.
Horizontal lines represent aligned sequences and are colored by level in the hierarchy. Thin light gray horizontal lines represent non-homologous and deleted regions. Vertical lines represent the contrasting pattern positions upon which the hierarchy is based and are similarly colored by levels. The trees shown correspond to each subgroup along the lineage. The colored, gray and white nodes in each tree correspond, respectively, to their alignment foreground, background and non-participating partitions. The background for the entire superfamily (lower right) consists of standard amino acid frequencies at each position.

https://doi.org/10.7554/eLife.29880.016
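The foreground, background and non-participating partitions described in the overview caption can be sketched at the level of hierarchy nodes. This is a minimal Python sketch, not the BPPS implementation: the example tree, its node numbering and the function names are hypothetical and serve only to illustrate the partitioning rule.

```python
from collections import defaultdict

def subtree(children, node):
    """Return the set of nodes in the subtree rooted at `node` (inclusive)."""
    seen, stack = set(), [node]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(children.get(n, ()))
    return seen

def contrast_partitions(parent, node):
    """Split hierarchy nodes into (foreground, background, non_participating).

    `parent` maps each node to its parent (None for the root).
    """
    children = defaultdict(list)
    for child, par in parent.items():
        if par is not None:
            children[par].append(child)
    all_nodes = set(parent)
    foreground = subtree(children, node)
    par = parent[node]
    if par is None:
        # For the root, the caption states that the background is a set of
        # standard amino acid frequencies rather than a set of sequences.
        return foreground, set(), all_nodes - foreground
    family = subtree(children, par)           # subtree rooted at the parent
    background = family - foreground          # "the other nodes" of that subtree
    return foreground, background, all_nodes - family

# Hypothetical hierarchy (NOT the tree drawn in the figure).
parent = {1: None, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 3, 9: 8, 10: 8}
fg, bg, rest = contrast_partitions(parent, 8)
```

In a real analysis each node would carry the sequences assigned to it, so the three node sets translate directly into three sequence partitions.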
Appendix 1—figure 1
Eleven haloacid dehalogenase sequences that the SFLD assigned to SG1129, but that are more closely related to SG1130 sequences.

The Venn diagram shows the overlap between the subgroups BSG15, SG1129 and SG1130 with the numbers of sequences indicated. The table gives the mean pairwise gapped BLAST scores for the 11 sequences assigned to both SG1129 and BSG15 versus the sequence sets shown; this analysis indicates that the 11 sequences should be reassigned from SG1129 to SG1130. Similar analyses indicate that four other sequences in SG1129 should be reassigned to SG1135 (based on mean scores of 27 versus 139) and that a sequence in SG1136 should be reassigned to SG1137 (based on a mean score of 8 versus 149).

https://doi.org/10.7554/eLife.29880.022
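The reassignment argument in this caption rests on comparing mean pairwise scores of a query set against two candidate subgroups. A minimal sketch of that comparison follows, assuming pairwise scores are already available as a lookup table; the sequence identifiers, subgroup labels and scores below are invented for illustration and are not taken from the SFLD or from the paper's BLAST runs.

```python
from itertools import product
from statistics import mean

def mean_pairwise_score(queries, targets, scores):
    """Mean score over all (query, target) pairs, skipping self-pairs."""
    vals = [scores[frozenset((q, t))]
            for q, t in product(queries, targets) if q != t]
    return mean(vals)

def closer_subgroup(queries, subgroups, scores):
    """Return the subgroup with the highest mean pairwise score, plus all means."""
    means = {name: mean_pairwise_score(queries, members, scores)
             for name, members in subgroups.items()}
    return max(means, key=means.get), means

# Hypothetical data: two query sequences scored against two subgroups.
scores = {
    frozenset(("q1", "a1")): 20, frozenset(("q1", "a2")): 30,
    frozenset(("q2", "a1")): 25, frozenset(("q2", "a2")): 35,
    frozenset(("q1", "b1")): 140, frozenset(("q1", "b2")): 150,
    frozenset(("q2", "b1")): 130, frozenset(("q2", "b2")): 160,
}
subgroups = {"SG_A": ["a1", "a2"], "SG_B": ["b1", "b2"]}
best, means = closer_subgroup(["q1", "q2"], subgroups, scores)
```

Here the queries score far higher against the hypothetical `SG_B`, which is the kind of evidence the caption uses to argue for reassignment.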

Tables

Table 1
Summary of BPPS-SIPRIS results for the most significant cluster in each test case.
https://doi.org/10.7554/eLife.29880.002
| Protein | PDB structure | SIPRIS mode* | Focal point† | BPPS-SIPRIS‡ Dist. | BPPS-SIPRIS‡ Init. | BPPS-SIPRIS‡ Term. | SIPRIS p-value | Tree level§ | Interpretive comments# |
|---|---|---|---|---|---|---|---|---|---|
| Gna1 | 4ag9A | p=BDF | - | 22 | 57 | 71 | 8.5 × 10−7 | 1 | Substrate and homodimeric interfaces |
| | | S | CoA | 17 | 41 | 87 | 6.8 × 10−5 | 0 | CoA-binding subdomain |
| | | S | - | 23 | 56 | 72 | 9.3 × 10−6 | 1 | DCA-based clustering |
| | | S | - | 14 | 21 | 107 | 2.5 × 10−4 | 1 | Structure-based clustering |
| Rho1 | 3refB | B | - | 20 | 53 | 100 | 8.3 × 10−5 | 1 | (Active site secondary shell) |
| | | C | - | 22 | 55 | 98 | 7.8 × 10−7 | 1 | “ “ “ “ |
| Rab4 | 1z0kA | S | - | 10 | 11 | 153 | 2.1 × 10−5 | 1 | (Active site secondary shell) |
| | | C | - | 25 | 91 | 73 | 2.6 × 10−6 | 1 | “ “ “ “ |
| | | p=B | - | 14 | 23 | 141 | 2.9 × 10−8 | 2 | Interface with Rabenosyn-5 |
| | | S | - | 22 | 42 | 122 | 4.8 × 10−10 | 2 | “ “ “ “ |
| Rab8 | 3qbtA | p=B | - | 13 | 23 | 139 | 5.2 × 10−7 | 2 | Interface with Ocrl1 |
| | | p=B | - | 12 | 23 | 139 | 6.1 × 10−6 | 3 | Interface with Ocrl1 helix |
| | 4lhwB | p=A | - | 10 | 14 | 148 | 8.7 × 10−7 | 2 | Homodimeric interface |
| EF-Tu | 1ob5A | S | - | 18 | 33 | 150 | 1.4 × 10−7 | 1 | (GTP to tRNA allosteric link) |
| | | S | - | 23 | 71 | 112 | 1.0 × 10−6 | 2 | (GTP/tRNA allosteric link to β-barrel) |
| | | S | 1B | 22 | 81 | 102 | 1.3 × 10−5 | 1 | Cluster around 5’ base 1 of tRNA |
| | | S | 2B | 18 | 47 | 136 | 2.6 × 10−6 | 1 | Cluster around 5’ base 2 of tRNA |
| | 1efuA | S | 81B | 14 | 49 | 128 | 5.2 × 10−5 | 1 | (Nucleotide exchange allosteric network) |
| | 4zv4A | S | 291C | 21 | 66 | 109 | 0.0060 | 1 | (Mediates hijacking by Tse6 toxin) |
| CysN | 1zunB | S | - | 23 | 79 | 118 | 6.3 × 10−5 | 2 | (Allosteric link to β-barrel domain) |
| eIF4AIII | 3ex7H | p=J | - | 11 | 18 | 128 | 6.4 × 10−6 | 1 | (ATP to RNA allosteric link) |
| | | S | 4J | 13 | 18 | 128 | 5.1 × 10−7 | 1 | Cluster around RNA rotation bond |
| | | S | 5J | 16 | 41 | 105 | 5.5 × 10−4 | 1 | “ “ “ “ “ |
| APE1 | 5dfiA | H | 11P | 9 | 13 | 238 | 5.2 × 10−6 | 0 | Abasic site H-bond network |
| | | H | 11P | 22 | 99 | 152 | 1.6 × 10−6 | 1 | “ “ “ “ |
| | | H | - | 25 | 137 | 114 | 1.7 × 10−6 | 1 | (Active site secondary shell) |
| | | H | 9P | 25 | 137 | 114 | 1.9 × 10−7 | 1 | H-bond network positioning abasic site |
| | | H | 12P | 23 | 119 | 132 | 7.6 × 10−6 | 1 | “ “ “ “ “ |
| Inpp5b | 4cmlA | S | - | 24 | 69 | 216 | 5.8 × 10−13 | 0 | Active site core residues |
| | | S | - | 21 | 77 | 208 | 3.9 × 10−7 | 1 | (Substrate recognition with allosteric link) |
| | | S | - | 12 | 30 | 255 | 0.0022 | 2 | (Membrane substrate sequestration) |
| Inpp5b | 3mtcA | S | - | 22 | 91 | 194 | 8.0 × 10−7 | 1 | (Substrate recognition with allosteric link) |
| | | S | - | 12 | 29 | 256 | 0.0015 | 2 | (Membrane substrate sequestration) |
| Inpp5e | 2xswA | S | - | 25 | 140 | 148 | 3.7 × 10−7 | 1 | (Substrate recognition with allosteric link) |
| | | S | - | 9 | 13 | 275 | 3.6 × 10−4 | 2 | (Membrane substrate sequestration) |
| SHIP2 | 4a9cA | S | - | 17 | 38 | 260 | 6.0 × 10−8 | 1 | (Substrate recognition with allosteric link) |
| | | S | - | 4 | 4 | 294 | 0.30 | 2 | (Membrane substrate sequestration) |
| TDG | 5hf7A | H | 17D | 19 | 97 | 76 | 4.1 × 10−4 | 1 | H-bond network around excised base |
| | | H | - | 20 | 98 | 75 | 3.5 × 10−5 | 1 | H-bond network around catalytic water |
| UDG | 2dp6A | B | - | 13 | 17 | 121 | 1.7 × 10−5 | 1 | H-bond network distinct from TDG |

*Modes: S, spherical expansion; C, core expansion; H, hydrogen bond expansion (involving sidechain interactions); B, hydrogen bond expansion (also involving backbone-to-backbone interactions); P, predefined clustering (residues in the cluster are those interacting with the chain(s) whose pdb identifiers are given to the right of the equal sign).

†Focal points defining starting residue(s): ‘-’, analysis was optimized over multiple starting residues (i.e., no focal point); CoA, cluster initiated from the residue closest to Coenzyme A; others, cluster initiated from the residue closest to the indicated position and chain (e.g., 1B = position 1 in pdb chain B).

‡Nature of the optimum cluster: dist., the number of distinguishing residues within the cluster (total = 25); init., the total number of residues within the cluster; term., the number of residues outside of the cluster.

§Codes designate pattern residue class: 0, superfamily; 1, family; 2, subfamily; 3, sub-subfamily. In the figures, these correspond to residues with yellow, red, orange and green sidechains, respectively.

#Comments in parentheses indicate possible functions.
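To build intuition for how the dist., init. and term. counts relate to cluster significance, the sketch below computes a generic hypergeometric tail probability. This is explicitly not the SIPRIS statistic, whose definition is given in the article text rather than on this page; it is a stand-in enrichment calculation, and it assumes that the "total = 25" in the footnote is the total number of distinguishing residues. The function name is hypothetical.

```python
from math import comb

def hypergeom_tail(dist, init, term, total_pattern=25):
    """P(X >= dist) when `init` residues are drawn at random from
    init + term residues, of which `total_pattern` are distinguishing.

    Generic enrichment test used for illustration only; it is NOT the
    p-value computed by SIPRIS.
    """
    population = init + term
    upper = min(total_pattern, init)
    favorable = sum(
        comb(total_pattern, k) * comb(population - total_pattern, init - k)
        for k in range(dist, upper + 1)
    )
    return favorable / comb(population, init)

# Counts read from the first Gna1 row of Table 1 (dist. 22, init. 57, term. 71).
p = hypergeom_tail(22, 57, 71)
```

Because the two statistics differ, the value of `p` should not be expected to reproduce the p-value reported in the table; the sketch only shows that concentrating most distinguishing residues in the cluster is unlikely by chance.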

Table 2
Structural diversity among proteins identified and aligned by MAPGAPS.
https://doi.org/10.7554/eLife.29880.017
| Superfamily | Structures* % ID | Structures* No. | RMSD† (Å) Avg | RMSD† (Å) Min | RMSD† (Å) Max | RMSD† (Å) S.D. | Domain length‡ MSA | Domain length‡ Avg | Domain length‡ S.D. | Resolution (Å) Avg | Resolution (Å) Max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GNAT | 27 | 16 | 3.25 | 1.0 | 6.7 | 1.4 | 125 | 139.8 | 17.0 | 1.94 | 2.61 |
| GTPases | 30 | 20 | 3.96 | 0.6 | 14.7 | 3.5 | 164 | 195.9 | 41.6 | 2.31 | 3.10 |
| Helicases | 40 | 12 | 6.39 | 2.6 | 9.8 | 1.8 | 466 | 482.8 | 60.7 | 2.86 | 3.56 |
| EEP | 40 | 16 | 3.02 | 0.8 | 5.2 | 0.95 | 241 | 259.0 | 27.6 | 2.07 | 2.99 |
| UDG/TDG | 40 | 8 | 2.54 | 1.1 | 3.6 | 0.69 | 125 | 135.9 | 12.7 | 1.83 | 2.58 |

*NMR and poor resolution structures were not used; no two proteins in each set contained more than the indicated level of percent sequence identity (% ID); pdb identifiers for these are given in supplementary file 1.

†RMSDs were computed using MUSTANG (Konagurthu et al., 2006) with default parameters; the structural coordinates used for the analysis were limited to the domain of interest.

‡The number of aligned columns in the MSA, and the average length and standard deviation of the domain ‘footprint’.

Table 3
Summary of BPPS results for five superfamilies.
https://doi.org/10.7554/eLife.29880.018
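| Superfamily | Subgroup | # Sequences | % Identity* | # Nodes in subtree | Minimum subtree size |
|---|---|---|---|---|---|
| GNAT | | 237,359 | 98 | 44 | 200 |
| | Gna1 family | 1243 | | 1 | |
| GTPases | | 127,418 | 95 | 121 | 500 |
| | R4 family | 18,901 | | 26 | |
| | Rab subfamily | 7002 | | 12 | |
| | Rab8 sub-subfamily | 3312 | | 7 | |
| | TF family | 25,224 | | 10 | |
| | EFTu/CysN subfamily | 4429 | | 3 | |
| Helicases | | 131,321 | 98 | 47 | 300 |
| | RNA helicases | 36,788 | | 8 | |
| EEP | | 45,799 | 99 | 166 | 100 |
| | exoIII-AP-endo | 13,711 | | 47 | |
| | INPP5 | 3855 | | 14 | |
| TDG/UDG | | 23,592 | 98 | 47 | 100 |
| | TDG | 1639 | | 6 | |
| | UDG | 376 | | 1 | |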
  1. *The maximum % identity allowed between any two sequences in the set

Appendix 1—table 1
Summary of SFLD benchmarking of BPPS.
https://doi.org/10.7554/eLife.29880.023
Superfamily# subgroupsBPPSAnnotated by SFLDBSGBPPS conflicts§Maximum
SFLDBPPSmin.*NoYesexptError?Correct% errors#
radical SAM491780052,60817,6801213,6761063260.12
glutathione transferase261510069213633019450000
peroxiredoxin61110038705521052550100.02
haloacid dehalogenase242820021,76833,379926,5893566270.38
isoprenoid synthase I97200966616045515360000
isoprenoid synthase II351006974671385911000.17
nitroreductase11011200017,31807242201100.43
enolase8880026,2272267721430000
total:82,07312158,9776684353avg: 0.14
  1. *The minimum number of sequences required for each BPPS subgroup.

    Numbers of experimentally validated annotations.

  2. The number of SFLD annotated sequences assigned by BPPS to a subgroup.

    §The number of SFLD annotated sequences in conflict with BPPS classification; error, SFLD annotation appears to be correct; ‘?’, not sure whether SFLD or BPPS is correct; correct, BPPS appears to be correct

  3. #Percent erroneous or ambiguous (‘?’) BPPS assignments among annotated sequences not assigned to a root node.
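The "Maximum % errors" column follows from the footnoted definitions: erroneous plus ambiguous assignments, divided by the number of SFLD-annotated sequences that BPPS placed in a subgroup (the BSG column). The short sketch below works through that arithmetic; the function name is hypothetical, and the input counts are the values read from the radical SAM and haloacid dehalogenase rows of the flattened table.

```python
def max_percent_errors(n_error, n_ambiguous, n_in_subgroups):
    """Percent of subgroup-assigned annotated sequences whose BPPS
    assignment is erroneous or ambiguous ('?')."""
    return 100.0 * (n_error + n_ambiguous) / n_in_subgroups

# radical SAM: 10 errors, 6 ambiguous, 13,676 annotated sequences in subgroups
radical_sam = round(max_percent_errors(10, 6, 13676), 2)
# haloacid dehalogenase: 35 errors, 66 ambiguous, 26,589 in subgroups
had = round(max_percent_errors(35, 66, 26589), 2)
```

Both results agree with the percentages listed for those rows (0.12 and 0.38).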

Appendix 1—table 2
Correspondence between BPPS and SFLD subgroups for haloacid dehalogenases*.
https://doi.org/10.7554/eLife.29880.024
Subgroup IDsSFLD&SFLD#
BPPSSFLDBPPS§Total%
rootvarious1531161896.2
1138828339.8
340200217680.9
112931292.3
230101217680.5
112414950.2
113512594231.3
210158217680.7
11249149518.4
11452434.7
20046217680.2
113116220180.6
25076217680.3
113531194233.3
201915217688.8
2100911184685.2
30937217684.3
112941293.1
113418660.1
11354500942347.8
1139418510.2
114018210.1
4024222176811.1
21118460.0
1137914300.6
114038210.4
11415327819.1
114222360.8
11442497275990.5
50229217681.1
2986118468.3
60342217681.6
112436049572.7
70330217681.5
113462886672.5
330100217680.5
113415386617.7
8057217680.3
1133400400100
32032217680.1
1139218185111.8
31028217680.1
1139216185111.7
90195217680.9
113418660.1
1139942185150.9
100105217680.5
1137896143062.7
110284217681.3
1135394230.0
120478217682.2
113612460.4
1137178143012.4
130117217680.5
113875183390.2
1401034217684.8
11353294230.3
150525217682.4
1129111298.5
113080983696.8
113262272.6
11356394230.7
1139318510.2
160337217681.5
1135194230.0
114067082181.6
170230217681.1
113528894233.1
180505217682.3
1135950942310.1
190197217680.9
112931292.3
220338217681.6
240110217680.5
113510794231.1
  1. *Erroneous, ambiguous and corrected classifications are shown as italicized, underlined, and bold, respectively.

    Averages over 12 root-assigned subgroups.

  2. SFLD subgroups represented in each BPPS subgroup; zero indicates the SFLD unannotated sequence set.

    §The number of sequences in both the SFLD and BPPS subgroups in each row.

  3. #Total number of sequences in each SFLD subgroup and the percentage of these in the BPPS subgroup.
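The correspondence table is in effect a cross-tabulation of two classifications of the same sequences. A minimal sketch of how such a table could be derived is given below, assuming each sequence carries one SFLD label and one BPPS label; the identifiers, labels and function name are hypothetical.

```python
from collections import Counter

def subgroup_overlap(sfld, bpps):
    """Cross-tabulate two classifications of the same sequences.

    Returns {(bpps_group, sfld_group): (shared, sfld_total, percent)}, where
    `percent` is the share of the SFLD subgroup found in the BPPS subgroup.
    """
    shared = Counter((bpps[s], sfld[s]) for s in sfld if s in bpps)
    totals = Counter(sfld.values())
    return {
        (b, f): (n, totals[f], round(100.0 * n / totals[f], 1))
        for (b, f), n in shared.items()
    }

# Hypothetical assignments for four sequences.
sfld = {"s1": "SG_X", "s2": "SG_X", "s3": "SG_X", "s4": "SG_Y"}
bpps = {"s1": "BSG_1", "s2": "BSG_1", "s3": "BSG_2", "s4": "BSG_2"}
overlap = subgroup_overlap(sfld, bpps)
```

Each entry corresponds to one row of the table: the shared count, the SFLD subgroup total, and the percentage of that total falling in the BPPS subgroup.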

Appendix 1—table 3
Haloacid dehalogenase SG1129 sequences that BPPS assigned to distinct subgroups (BSG).
https://doi.org/10.7554/eLife.29880.025
BPPSSG1129% matches to BPPS patternMean score vs other seqs:
BSG# seqsIn BSG*BSG 34BSG 3BSG 15BSG 19In SG1129In BSG
342033803137814399
3§5447469655527137
15#1417111050871221104
19200382136839488
root16,869108940428nana
  1. *The number of SG1129 sequences assigned to the BSG in each row.

    Average percentage of matches to the pattern residues for their assigned BSG among the SG1129 sequences. The highest percentages (bold) correspond to each BSG’s own pattern.

  2. The mean pairwise BLAST scores of the reassigned sequences against the remaining sequences either in SG1129 or in the BSG for that row.

    §This BSG corresponds to SG1135. (See Appendix 1—table 2.)

  3. #This BSG corresponds to SG1130. (See Appendix 1—table 2.)
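The "% matches to BPPS pattern" columns report how often a set of sequences carries the residues that define a subgroup's pattern. The sketch below shows one way to compute such a percentage, assuming sequences are rows of a common alignment and a pattern is a mapping from alignment column to the residues allowed there; the pattern, sequences and function name are invented for illustration and do not come from the BPPS output.

```python
def percent_pattern_matches(sequences, pattern):
    """Average percentage of pattern positions matched by aligned sequences.

    `sequences` is a list of equal-length aligned strings; `pattern` maps a
    0-based alignment column to the set of residues allowed at that column.
    """
    total = matched = 0
    for seq in sequences:
        for column, allowed in pattern.items():
            total += 1
            if seq[column] in allowed:
                matched += 1
    return 100.0 * matched / total if total else 0.0

# Hypothetical pattern: 'A' at column 0 and 'E' at column 3.
pattern = {0: {"A"}, 3: {"E"}}
sequences = ["ACDE", "ACDF", "GCDE"]
pct = percent_pattern_matches(sequences, pattern)
```

Because every sequence is scored over the same pattern positions, pooling the matches gives the same value as averaging per-sequence percentages.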

Appendix 1—table 4
Average percentage of matches to various BPPS subgroup (BSG) patterns for haloacid dehalogenase sequences assigned to SFLD subgroup SG1135.
https://doi.org/10.7554/eLife.29880.026
BSGSG1135% matches to each BSG pattern for SG1135 sequences*:Mean score vs othersnew
ID# seqs# seqs1516253111417182423In SG1135In BSGSGs
1514176369353251842195622206364?
1610081434052521240405224811014error
253873114837845713364345271080309yes
3544745004925359015372038209148132?
112873332937519332173216844632yes
14106632492725379771847191342130yes
1751828841394142830884825655321yes
181455950583428449501688201543169yes
2421710742352644932204292453293yes
23227125623223414321036149045403yes
rootn.a.304046374156124029462310n.a.n.a.n.a.
  1. *Average percentage of matches to the pattern residues for their assigned BSG among the SG1135 sequences. The highest percentages (bold) correspond to the highest percentage in each row.

    The mean pairwise BLAST scores of the BPPS-assigned sequences against the remaining sequences either in SG1135 or in the BSG for that row. The highest scores in each row are bold. (See Appendix 1—table 2.)

  2. A ‘yes’ in this column indicates that the SG1135 sequences assigned to the BSG in that row likely correspond to a subgroup distinct from SG1135; ‘?’ indicates a possible subcategory of SG1135; ‘error’ indicates a BPPS misclassification.

Appendix 1—table 5
BPPS-SIPRIS analyses using MAPGAPS (MG) versus Jackhmmer (JH) generated MSAs as input.
https://doi.org/10.7554/eLife.29880.027
| Protein | MSA* | SIPRIS mode† | BPPS-SIPRIS† Dist. | BPPS-SIPRIS† Init. | BPPS-SIPRIS† Term. | SIPRIS p-value | Tree level‡ | Optimal BPPS pattern ∩ SIPRIS cluster§ |
|---|---|---|---|---|---|---|---|---|
| Gna1 | JH | p=BDF | 21 | 69 | 87 | 1.2 × 10−5 | 1 | F93,I94,D105,K136,H95,Y68,Y135,T44,R102,E104,E90,C141,K92,L40,L43,V88,V134,Y36,L27,F58,V89 |
| | MG | p=BDF | 22 | 57 | 71 | 8.5 × 10−7 | 1 | F93,I94,D105,K136,H95,Y68,Y135,T44,R102,E104,E90,C141,K92,M61,L40,L43,V134,F54,Y36,F58,G98,V89 |
| | JH | S | 14 | 21 | 135 | 2.1 × 10−5 | 1 | E90,K92,R102,V89,V88,G101,Y135,I94,V134,K136,F93,E104,Y68,H95 |
| | MG | S | 14 | 21 | 107 | 2.5 × 10−4 | 1 | E90,K92,R102,V89,G101,Y135,I94,V134,K136,F93,G98,E104,Y68,H95 |
| APE1 | JH | H | 15 | 38 | 219 | 5.0 × 10−5 | 0 | V206,L167,Q95,S66,G209,W67,P311,H309,D283,S307,N68,D210,E96,N212,R185 |
| | MG | H | 16 | 33 | 218 | 4.2 × 10−7 | 0 | V206,L167,F165,Q95,S66,G209,V69,W67,H309,D283,T265,S307,N68,D210,E96,N212 |
| | JH | H | 25 | 158 | 99 | 8.8 × 10−6 | 1 | Y128,E154,R156,Y171,P173,W188,D70,W267,N277,E236,C310,G71,R237,D219,R254,R281,V213,A214,L62,G279,K98,V131,A175,L72,R181 |
| | MG | H | 25 | 137 | 214 | 1.7 × 10−6 | 1 | Y128,G155,E154,D152,R156,Y171,P173,R185,D70,W267,N277,W188,E236,C310,Y269,R237,D219,R254,V213,A214,Y264,G279,K98,A175,G145 |
| Rho1 | JH | B | 20 | 63 | 110 | 3.5 × 10−4 | 1 | S106,D28,W114,Y81,E117,Y89,A76,Q78,L84,E79,K22,W73,F176,F99,F107,V101,E163,Y161,G149,C153 |
| | MG | B | 20 | 53 | 100 | 8.3 × 10−5 | 1 | S106,D28,W114,Y81,Y89,A76,Q78,L84,E79,K22,W73,F99,F107,V101,V144,R137,E163,Y161,G149,C153 |
| | JH | C | 24 | 82 | 91 | 2.4 × 10−6 | 1 | L84,Y81,Q78,A76,E117,W114,Y89,D28,V24,E79,W73,K22,F99,F107,Y161,C153,G149,S106,V101,E163,G131,F176,F40,F57 |
| | MG | C | 22 | 55 | 98 | 7.8 × 10−7 | 1 | L84,Y81,Q78,A76,T52,W114,Y89,D28,E79,W73,K22,F99,F107,Y161,C153,G149,V144,S106,V101,E163,G131,R137 |
| eIF4AIII | JH | p=J | 8 | 18 | 212 | 2.7 × 10−4 | 1 | G165,R116,D169,R166,G143,T115,G196,P164 |
| | MG | p=J | 11 | 18 | 128 | 6.4 × 10−6 | 1 | G165,F197,R116,Q200,D169,R166,G143,T115,G142,G196,P164 |

*Input MSA: Jackhmmer, JH; MAPGAPS, MG.

†Explained in the footnotes to Table 1.

‡Codes designate BPPS category: 0, superfamily; 1, family.

§Pattern residue discrepancies between the Jackhmmer and MAPGAPS runs are shown in bold.

Cite this article

  1. Andrew F Neuwald
  2. L Aravind
  3. Stephen F Altschul
(2018)
Inferring joint sequence-structural determinants of protein functional specificity
eLife 7:e29880.
https://doi.org/10.7554/eLife.29880