Inferring joint sequence-structural determinants of protein functional specificity
Abstract
Residues responsible for allostery, cooperativity, and other subtle but functionally important interactions remain difficult to detect. To aid such detection, we employ statistical inference based on the assumption that residues distinguishing a protein subgroup from evolutionarily divergent subgroups often constitute an interacting functional network. We identify such networks with the aid of two measures of statistical significance. One measure aids identification of divergent subgroups based on distinguishing residue patterns. For each subgroup, a second measure identifies structural interactions involving pattern residues. Such interactions are derived either from atomic coordinates or from Direct Coupling Analysis scores, used as surrogates for structural distances. Applying this approach to N-acetyltransferases, P-loop GTPases, RNA helicases, synaptojanin-superfamily phosphatases and nucleases, and thymine/uracil DNA glycosylases yielded results congruent with biochemical understanding of these proteins, and also revealed striking sequence-structural features overlooked by other methods. These and similar analyses can aid the design of drugs targeting allosteric sites.
https://doi.org/10.7554/eLife.29880.001Introduction
Residues remote from an enzyme’s active site can influence catalytic activity and substrate specificity. It has been proposed that an enzyme generally has multiple conformational states that modulate its function, with residues remote from the active site often shifting the enzyme’s conformational equilibrium to favor interactions associated with specific substrates or reactions (Ramanathan et al., 2014; Bhabha et al., 2015; Whitney et al., 2016; Campbell et al., 2016). Computational methods can help identify such functionally relevant non-active-site residues and their interactions. For example, direct coupling analysis (DCA) (Morcos et al., 2011), which predicts structural contacts from covarying residue pairs in a multiple sequence alignment (MSA), has been used to infer major conformational transitions for Hsp70 chaperones (Malinverni et al., 2015) and to explain the conformational heterogeneity seen in molecular dynamics simulations (Sutto et al., 2015). Statistical Coupling Analysis (SCA) (Lockless and Ranganathan, 1999) seeks to identify structural pathways of ‘energetic connectivity’ by applying principal component analysis to a covariance matrix to identify groups of coevolving residue positions (Halabi et al., 2009). SCA has been used to design proteins (Reynolds et al., 2013) and to predict surface sites (Reynolds et al., 2011) and hydrophobic cavities (Tanwar et al., 2013) involved in allosteric regulation. Here, we investigate residue interaction networks by combining two correlation analysis methods distinct from DCA and SCA (see Figure 7): Bayesian Partitioning with Pattern Selection (BPPS) (Neuwald, 2014a; Neuwald, 2014b), which identifies arbitrarily large correlated residue patterns arising through evolutionary divergence, and Structurally Interacting Pattern Residues’ Inferred Significance (SIPRIS), which we first describe here.
BPPS relies on the observation that protein superfamilies often diverge into subgroups, each adapting the superfamily’s structural core to fill a functional niche. Often a subgroup G diverges further into smaller subgroups, each conserving residues constrained by G’s function, as well as other residues constrained by more specialized functions. Repeated rounds of such divergence have led to hierarchically arranged subgroups, each of which conserves distinctive residues at particular positions. BPPS identifies and characterizes these subgroups by partitioning an MSA into a hierarchically nested series of MSAs, a hiMSA, based on correlated residue patterns that are distinctive of each subgroup and that often include non-active site residues.
For each subgroup of interest, the SIPRIS program takes a BPPS-defined residue pattern as input, as well as structural coordinates for a protein from that subgroup. It then identifies the statistically most significant network of pattern residues embedded within a structurally defined cluster, with a view to suggesting hypotheses for experimental investigation. Such a network is doubly significant inasmuch as BPPS identifies significant residue patterns in the absence of structural data, whereas SIPRIS defines structural clusters in the absence of sequence data. In this way, SIPRIS may statistically validate the output of BPPS or other sequence-based methods. Of course, a set of residues identified by a sequence-based method may still be biologically relevant despite a lack of SIPRIS-assigned significance. However, as we illustrate, BPPS-SIPRIS analyses often elucidate sequence/structural properties that conventional computational and experimental approaches have failed to detect.
Results
SIPRIS takes as input: (1) structural coordinates for a protein of interest; (2) a set of residues defined by BPPS; and, optionally, (3) a predefined cluster of residues, or a starting residue defined either explicitly or as the residue closest to a ‘focal point’ molecule or atom. If a third input is absent, then SIPRIS uses each of the BPPS-defined residues as a starting residue, in turn, and returns the most significant result. Nested clusters are defined around a starting residue in one of three ways: (i) ‘Spherical expansion’, which sequentially adds residues closest to the starting residue, which thus forms the center of each cluster. (ii) ‘Core expansion’, which sequentially adds the residue closest to a residue within the cluster’s ‘core’. This core is defined as the starting residue R plus all cluster residues whose distance to their kth closest cluster residue is less than R’s distance to its kth closest cluster residue (with k = 7 by default; this was selected empirically to avoid both spherical- and tentacle-shaped clusters). In this case, the cluster typically expands less symmetrically. (iii) Hydrogen-bond-network expansion, which sequentially adds a residue forming the closest sidechain-to-sidechain or sidechain-to-backbone hydrogen bond with a cluster residue. (iv) For spherical or core clustering, SIPRIS may also take DCA scores (Marks et al., 2012, 2011) as a surrogate for 3D structural distances. SIPRIS evaluates the intersection between clusters and BPPS-defined residue sets with a p-value.
We applied BPPS-SIPRIS to a GCN5-like N-acetyltransferase (GNAT), several P-loop GTPases, an RNA Superfamily-II helicase, several members of the Synaptojanin/Exonuclease-Endonuclease-Phosphatase (EEP) superfamily, and two uracil/thymine DNA glycosylases. These results are summarized in Table 1. (Go to sipris.igs.umaryland.edu for BPPS output alignments.)
Distinct N-acetyltransferase cofactor- and substrate-binding subdomains
GNATs catalyze the transfer of a carboxylic acyl group from Coenzyme A (acyl-CoA) to a diversity of substrates. Previously, a BPPS analysis of glucosamine-6-phosphate N-acetyltransferase (Gna1) led to two observations (Neuwald and Altschul, 2016a) (Figure 1): (1) Within the homodimeric structure of Gna1 (pdb: 4ag9) (Dorfmueller et al., 2012), BPPS-defined residues for this family are contributed by both subunits to form the dimeric interface and the active site for each subunit. In contrast, within a single subunit most of these residues are far from the active site and face away from it. Thus, the BPPS analysis implicates family-specific residues in the formation of this unusual substrate-binding pocket between subunits. (2) Residues conserved in the GNAT superfamily cluster within an acyl-CoA binding subdomain distinct from the homodimer/substrate interacting subdomain. This raises the question: How likely is such a structural distribution of these family and superfamily residues to have occurred by chance?
-
Figure 1—source data 1
- https://doi.org/10.7554/eLife.29880.005
SIPRIS returns a p-value of 8.5 × 10−7 for the intersection between Gna1-family residues and the predefined cluster of 57 residues contacting either the substrate or the other subunit (for residues conserved across GNATs, the corresponding p-value was 0.96). Among the 25 Gna1-family residues defined by BPPS, 22 intersect with the structurally defined cluster. The three remaining residues may perform complementary functions: Gly35 and Gly101 by imparting backbone flexibility and Lys116 by helping properly position CoA via interaction with a CoA phosphate group.
SIPRIS returns a p-value of 6.8 × 10−5 for the intersection between a (spherical) CoA-centered cluster and the set of residues conserved in all GNATs. (The corresponding p-value for Gna1-family residues is >0.99.) Of the 25 residues most distinctive of GNATs, 17 are among the 41 residues of this CoA-centered cluster. Hence, in the absence of explicit structural information, BPPS detects structurally and presumably biologically relevant features: GNAT-residues that map to an acetyl-CoA-binding module and Gna1-family residues that map to a substrate-specific ‘reaction chamber’ facilitating acetylation of glucosamine-6-phosphate.
DCA-based SIPRIS analysis
Spherical clustering using residue-to-residue pseudo-distances based on DCA pairwise scores (instead of actual structural distances) likewise identifies these Gna1 structural features. In fact, the DCA-based p-value for Gna1-family residues (9.3 × 10−6) was more significant than the corresponding structurally based p-value (2.5 × 10−4). We suggest two possible reasons for this. First, DCA scores are based on multiple sequences (1200 in this case) and thus implicitly on multiple structures rather than one. Second, DCA scores should be affected by pairwise contacts between homodimeric subunits, whereas SIPRIS currently considers distances only within a single subunit. Thus, DCA- and structurally-based analyses provide somewhat different perspectives.
Likely determinants of GTPase family and subfamily functional specificity
P-loop GTPases, upon binding to GTP versus GDP, undergo a conformational change in their so-called switch I and switch II regions that depends on the presence of a γ-phosphate group; this acts as a signal to downstream cellular components. We applied SIPRIS to two major subgroups: Rab/Rho/Ras/Ran GTPases (termed R4) and translation factor (TF) GTPases (Figure 2A).
-
Figure 2—source data 1
- https://doi.org/10.7554/eLife.29880.007
R4 GTPases function as on/off switches regulating cellular processes. GTPase activating proteins (GAPs) facilitate hydrolysis of bound GTP (the ‘on’ state) to GDP (the ‘off’ state). Guanine nucleotide exchange factors (GEFs) turn GTPases back on by stimulating replacement of GDP with GTP. SIPRIS identifies a significant network of BPPS-defined R4 residues. In Rho1 GTPases, this appears within a hydrogen-bond cluster (p=8.3 × 10−5; Figure 2B) or within a core cluster (p=7.8 × 10−7). In most Rab GTPases, this network often appears within a spherical or core cluster (e.g., Figure 2C) and, rarely, within a hydrogen-bond cluster (e.g. Rab9, pdb:1s8f [Wittmann and Rudolph, 2004]; p=9.0 × 10−4). We postulate that a significant hydrogen-bond network forms only in certain conformations. These R4 sequence/structural configurations correspond to features identified through previous analyses, including: (i) Several aromatic-CH-π interactions proposed to stabilize β-strands (Merkel and Regan, 1998) associated with the P-loop and with the guanine binding loop, and to facilitate guanine nucleotide exchange (Neuwald, 2009a) (Phe99-Gly131 and Trp114-Gly27 in Figure 2B). (ii) A salt bridge also associated with the guanine-binding loop (Arg137-Glu163 in Figure 2B). (iii) Residues forming a switch II ‘charge dipole pocket’ proposed to facilitate conformational changes associated with the switching mechanism (Neuwald, 2009b). And (iv) glutamine and glutamate residues proposed to function in GTP hydrolysis (Vetter and Wittinghofer, 2001) and nucleotide exchange (Gasper et al., 2008), respectively. We propose that, together, these residues, which adjoin the GTP-binding site from the guanine-binding loop to the γ-phosphate interacting switch II region, constitute in large part the R4 switching mechanism.
SIPRIS identifies a network of residues distinctive of the Rab subfamily of R4 GTPases within a spherical cluster in the switch I and II regions (p=4.8 × 10−10 for Rab4). Rab subfamily residues also intersect with those residues contacting Rab-binding domains, with high significance based on predefined clustering: for Rab4-Rabenosyn-5 (Figure 2C) (Eathiraj et al., 2005) and Rab8a-Ocr1 (Hou et al., 2011) (Figure 2D) p=2.9 × 10−8 and 5.2 × 10−7, respectively. This occurs despite the Rabenosyn and Ocrl1 domains being structurally distinct. Rab subfamily residues are similarly enriched at the Rab8a homodimeric interface (p=8.7 × 10−7) (Figure 2E) (Guo et al., 2013), supporting the notion that these residues can interact with diverse structural folds. For the Rab4 structure in Figure 2C, Thr40, another R4-specific residue, albeit one outside of the SIPRIS-defined cluster, corresponds to the switch I residue that senses the γ-phosphate of GTP. This residue establishes its greatest contact area (45 Å2) with Glu44, one of the Rab-specific residues contacting Rabenosyn-5; thus Thr40 and Glu44 may link sensing of the γ-phosphate to substrate binding. For Rab8a both Rab- and Rab8-specific residues appear to mediate binding to the Ocr1 domain (Figure 2D); in all, 19 of the 23 Rab8-Ocrl1 interface residues are distinctive of either the Rab subfamily or the Rab8 sub-subfamily. Many of the Rab8-residues interact with an N-terminal helix extending out of the Ocrl1 β-sandwich domain, perhaps thereby compensating for the lack of binding specificity of Rab-subfamily residues.
BPPS grouped translation factor (TF) GTPases into a single family (Figure 2A), which includes initiation factors (e.g. IF2 and eIF5B), sulfate adenyltransferases (CysN), ribosome-releasing factor 2, peptide chain release factor 3, elongation factors EF-Tu, EF1α and selenocysteine-specific elongation factor, EF4, aEF2, and EF-G (Leipe et al., 2002). Within Thermus aquaticus EF-Tu complexed with a GTP analog, Phe-tRNA, and the antibiotic Enacyloxin IIA (Parmeggiani et al., 2006), TF-specific residues (Figure 3A) spherically cluster around the switch I and II and P-loop regions (p=1.4 × 10−7); this differs from the R4-residue arrangement in Figure 2B. The two 5’-terminal tRNA nucleotide bases, which base-pair with the 3’ strand to which the aminoacyl group is attached, establish the greatest contact with the EF-Tu GTPase domain among all the bases of the tRNA. TF-specific residues cluster around these 5’ bases (p=1.3 × 10−5 and 2.6 × 10−6, respectively) and link the 5’ region of aa-tRNA to the GTP γ-phosphate; this cluster includes Thr62, which senses γ-phosphate. We hypothesize that, upon correct tRNA-anticodon pairing with its mRNA codon, these TF residues assist in coupling GTP hydrolysis to coordinated conformational changes that dissociate EF-Tu from the ribosome and from tRNA, which can then fully enter the ribosomal A site.
-
Figure 3—source data 1
- https://doi.org/10.7554/eLife.29880.009
TF-specific residues also may be important for guanine nucleotide exchange mediated by EF-Ts. Within the structure of EF-Tu bound to EF-Ts (pdb: 1efu) (Kawashima et al., 1996), 14 TF-residues form a (spherical) cluster (p=5.2 × 10−5; Figure 3B) centered on Phe81 of EF-Ts, the residue with the greatest area of contact with EF-Tu. These TF-residues, which include His19, His84, and Gln114 of EF-Tu, adjoin two regions of EF-Ts contacting EF-Tu and are conserved across bacteria and eukaryotes (Figure 3B). His19, which is located in the P-loop of EF-Tu, is the residue that is most characteristic of these translation factors. Both His19 and Gln114 have been implicated in nucleotide exchange (Zhang et al., 1998), and in destabilization of Mg+2 coordination (leading to guanine nucleotide release) upon intrusion of EF-Ts Phe81 near His84 of EF-Tu (Schümmer et al., 2007). Given recent evidence for an EF-Tu/Ts·GTP·aa-tRNA quaternary complex (Burnett et al., 2014), we conjecture that TF-residues may help couple GTP-hydrolysis-mediated loading of aa-tRNA onto the ribosome with nucleotide exchange by EF-Ts. P. aeruginosa Tse6 toxin (Whitney et al., 2015) appears to have hijacked this TF interaction interface with EF-Ts (Figure 3C).
BPPS partitions EF-Tu and CysN into a common subfamily within the TF family, consistent with earlier analysis supporting their specific relationship (Leipe et al., 2002; Inagaki et al., 2002). CysN, together with the catalytic CysD subunit, form a sulfate adenylyltransferase complex involved in sulfur assimilation. The CysND-catalyzed reaction is analogous to the first step in charging a tRNA, and CysN’s contact sites with CysD are similar to, and include residues homologous to, EF-Tu’s contact sites with aa-tRNA. Within the CysND complex (pdb: 1zun) (Mougous et al., 2006) EF-Tu/CysN-residues cluster around the switch I and II regions (p=6.3 × 10−5; Figure 3D). In CysN, these residues adjoin contact regions with CysD and with the CysN C-terminal linker and β-barrel domains. Analogously in EF-Tu, they are proximal to the contact region with aa-tRNA and the EF-Tu C-terminal linker and β-barrel domains (Figure 3E). Within EF-Tu these residues are also located between the bound antibiotic enacyloxin IIA and the GTPase- and TF-specific residues (Figure 3A). Because enacyloxin IIA hinders the release of EF-Tu-GDP from the ribosome (Parmeggiani et al., 2006), we hypothesize that these residues may help mediate this process.
Comparison of two P-loop NTPase superfamilies: eIF4AIII RNA helicase
For comparison, we analyzed another nucleic-acid-associated P-loop NTPase, the Superfamily II RNA helicase eIF4AIII, which is a component of the exon junction complex (EJC). The EJC deposits onto spliced mRNAs and plays an important role in mRNA transport, translation, and quality control. RNA helicases are part of a huge group of NTPases that undergo ATP-hydrolysis-coupled conformational changes to unwind double-stranded nucleic acids, translocate nucleic acids or re-distribute protein complexes on nucleic acids (Anantharaman et al., 2002; Bourgeois et al., 2016; Lohman et al., 2008; Northall et al., 2016). For the transition state structure of eIF4AIII bound to RNA, a predefined cluster of RNA helicase-specific residues contacting RNA is highly significant (p=6.4 × 10−6; Figure 3F). Focal point spherical clustering indicates that these residues are centered on RNA bases 4 and 5 (p=5.1 × 10−7 and p=5.5 × 10−4, respectively), which establish the greatest contact with the ATPase domain. These observations and a rotated bond between bases 4 and 5 suggest that these residues help couple ATP hydrolysis to disruption of duplex RNA. Clusters centered on other bases are not significant (p>0.9). Most of the remaining RNA helicase-specific residues surround key active site residues or interact with C-terminal domain catalytic residues, including two arginine fingers (Figure 3F). Given this configuration, ATP hydrolysis seems likely to shift the relative orientations of the N- and C-terminal domains, both of which interact with RNA.
Residue networks adapting the EEP catalytic core to diverse substrates
EEP enzymes cleave phosphodiester bonds in substrates that include nucleic acids and phospholipids. To identify residues likely responsible for EEP functional divergence, we applied BPPS-SIPRIS to APE1, an exonuclease III-like apurinic/apyrimidinic endonuclease (exoIII-AP-endo), and several inositol polyphosphate 5-phosphatases (INPP5) (Figure 4A).
-
Figure 4—source data 1
- https://doi.org/10.7554/eLife.29880.011
APE1 participates in the DNA excision repair pathway by incising the apurinic/apyrimidinic (AP) site phosphodiester backbone; this generates a single nucleotide DNA gap with 3’-hydroxyl and 5’-deoxyribose phosphate termini—a cytotoxic intermediate substrate that is then processed by DNA polymerase β (Liu et al., 2007). A proposed mechanism for APE1 (Mol et al., 2000) involves superfamily-conserved active site residues forming hydrogen bonds with the oxygen atoms of the phosphate group at the abasic site. Consistent with this, SIPRIS identifies a superfamily-conserved hydrogen-bond network centered on the abasic site (p=5.2 × 10−6) within a structure of APE1 bound to DNA harboring an abasic site phosphate group analog (phosphorothioate) in one strand (Figure 4B,C). Centering on adjacent bases in the same strand was less significant (p>0.003). For exoIII-AP-endo-conserved residues SIPRIS identifies a significant hydrogen-bond network centered on the abasic site (p=1.6 × 10−6) or on adjacent bases 8–9 and 12–13 (p=1.9 × 10−7 to 7.6 × 10−6); these residues may contextually position catalytic residues around the abasic site. In particular, regions associated with these residues insert into the DNA major and minor grooves on either side of the abasic site, and form a kink in and engulf the target DNA strand (Figure 4B). Thus, exoIII-AP-endo residues appear to form a substrate-specific ‘reaction chamber’, as might be expected. They also tend to aggregate between the catalytic core and a loop containing basic residues that interact with the major groove of DNA (Figure 4B). Modification by nitric oxide (nitrosation) of one of these residues, Cys310, results in dissociation of APE1 from DNA and relocation to the cytoplasm (Qu et al., 2007); thus, the associated hydrogen-bond network may communicate the nitrosation signal to the DNA-binding site.
BPPS-SIPRIS-defined INPP5-residues also form a significant hydrogen bond network (p=1.1 × 10−7) adjacent to the superfamily-conserved cluster (Figure 5A,B). We hypothesize that this network recognizes inositol polyphosphates harboring phosphate groups at positions 4 and 5 of the inositol ring. INPP5 phosphatases cleave the 5-phosphate, but require for recognition the 4-phosphate, which directly interacts with three network-associated basic residues—perhaps thereby mediating substrate recognition (Figure 5C). In some structures, the INPP5 network residues most remote from the catalytic core are part of a cleft accommodating a phosphate or a glycerol (Figure 5D,E), suggesting that these may form another (unknown) membrane interaction site or an allosteric site that binds a molecule similar to the known substrate.
-
Figure 5—source data 1
- https://doi.org/10.7554/eLife.29880.013
INPP5 proteins regulate diverse cellular processes, including postsynaptic vesicular trafficking, insulin signaling, cell growth and survival, and endocytosis. With this in mind, we examined three INPP5 subfamilies: INPP5B, INPP5E and SHIP2 (Figure 5F). Residues that most distinguish the INPP5B subfamily form a cluster between the proposed membrane interacting region (Trésaugues et al., 2014) and the EEP catalytic core (Figure 5A). INPP5E- and SHIP2-specific residues also cluster in this same region (Figure 5G,H)—although the SHIP2 cluster is not statistically significant. This suggests a possible role for these residues in sequestering specific membrane-associated phosphoinositide substrates from the lipid bilayer.
Family-specific catalysis: thymine DNA glycosylases
Uracil DNA glycosylases (UDGs) remove uracil from DNA, thereby initiating the DNA base excision repair pathway (Aravind and Koonin, 2000). Uracil may be incorporated into DNA by DNA polymerase or by cytosine deamination. Thymine DNA glycosylases (TDGs) initiate base excision repair by removing T from G·T mispairs, which can be due to deamination of 5-methylcytosine. These enzymes also remove oxidized derivatives of methyl cytosine such as 5-formyl and 5-carboxymethyl cytosine, which are epigenetic marks or intermediates in the reset of 5mC marks by the TET enzymes (Pastor et al., 2013). Within the structure for human TDG (Pidugu et al., 2016) BPPS-SIPRIS identifies a significant hydrogen-bond network associated with TDG-family residues (Figure 6A,B); also in this network are residues classified by BPPS to a metazoan TDG subfamily. Like APE1, network residues appear to position loops containing basic residues that, in this case, interact with both the major and minor grooves of bound DNA (Figure 6C). Network residues also form hydrogen bonds to DNA oxygen atoms on either side of the thymine base being excised—suggesting that they may help position the substrate for catalysis by sensing particular sequence contexts (Figure 6B). Near the center of this network and in contact with the targeted thymine base is the residue most distinctive of metazoan TDGs, Asn230 (Figure 6B and Figure 6—source data 1); in other TDG subfamilies, a hydrophobic residue occurs at this position. Other TDG-residues in this network encase a water molecule believed to function as a nucleophile in catalysis (Pidugu et al., 2016) (Figure 6D). Hence, for TDG, family-specific residues may play a critical catalytic role. UDG harbors a hydrogen bond network distinct from that of TDG (Figure 6E), indicating a mechanistic divergence.
-
Figure 6—source data 1
- https://doi.org/10.7554/eLife.29880.015
Applying SIPRIS with other methods
Applying SIPRIS in conjunction with various protein function determining residue (FDR) methods (Casari et al., 1995; Ye et al., 2008; Pirovano et al., 2006; Kalinina et al., 2004; Hannenhalli and Russell, 2000; Livingstone and Barton, 1996; Mihalek et al., 2004; Mirny and Gelfand, 2002; Lichtarge et al., 1996; Sankararaman and Sjölander, 2008; Fischer et al., 2008; Kalinina et al., 2009; Janda et al., 2012; Janda et al., 2014; Marttinen et al., 2006; Kolesov and Mirny, 2009; Wilkins et al., 2012; Chakraborty and Chakrabarti, 2015; Gaucher et al., 2002; Xin and Radivojac, 2011; Capra and Singh, 2008) is straightforward in principle. However, several factors complicate comparisons to BPPS-SIPRIS. First, a fair number of published FDR methods are no longer available as source code, executables or over the world wide web (e.g. INTREPID [Sankararaman and Sjölander, 2008] and MINER [La and Livesay, 2005]). Second, many FDR methods (e.g. GroupSim [Capra and Singh, 2008]) require user-provided input, such as an MSA, a phylogenetic tree, or prespecified categories with corresponding sequence assignments for each category. This confounds the comparison because the contribution of each user-provided component to overall performance is unclear. In contrast, BPPS-SIPRIS requires no input beyond the query and database sequences, and its algorithmic components are statistically coherent. Third, those FDR methods not requiring user-generated input typically are based on a phylogenetic tree; this renders infeasible their application to large sequence sets, which is a key aspect of SIPRIS’s ability to detect biologically relevant features. Our attempts to input even moderately large sequence sets to various FDR programs resulted in runtime errors. By focusing on a hierarchy of subgroups, each defined by a correlated residue pattern, BPPS eliminates the need for a phylogenetic tree, which would introduce more complexity than either is necessary or can be reliably inferred.
Finally, BPPS-SIPRIS aims to identify biologically relevant interaction networks whose functions are not necessarily known, whereas FDR methods generally try to identify residues responsible for well-characterized functions—such as catalysis or substrate recognition—that can be experimentally benchmarked (Chakrabarti and Panchenko, 2009). However, as has been noted (Dessimoz et al., 2013; Jiang et al., 2014), we lack reliable gold standards for many functionally relevant residues, due to a lack of experimental characterization. Consequently, methods designed to identify residues with specific, known functions, if successful, will tend to penalize residues involved in unknown functions. In contract, the goal of BPPS-SIPRIS is to recognize also such residues of unknown function.
With this in mind, we compared the BPPS-SIPRIS analyses in this study to SIPRIS analyses based on the FRpred (Fischer et al., 2008), CLIPS-1D (Janda et al., 2012), and Evolutionary Trace (ET) (Lichtarge et al., 1996; Wilkins et al., 2012) programs, which define residue sets given only a query sequence. These and similar methods differ from BPPS by not classifying sequences into divergent subgroups per se. Instead, FRpred seeks to classify residues as catalytic, ligand binding and subtype-specific. FRpred catalytic and ligand-binding residues generally correspond to superfamily-conserved residues, whereas FRpred subtype-specific residues fail to correspond to any BPPS subgroups. For example, when we ran the Rab4 analysis as in Figure 2C using FRpred-defined residue sets instead of BPPS-defined sets, the first two FRpred categories nearly entirely overlapped with each other and with the Rab4 structural core; the subtype-specific category failed to return a significant cluster (p>0.05). SIPRIS analyses of other protein domains yielded similar results. CLIPS-1D defines catalytic, ligand-binding and structural categories, which likewise fail to correspond to BPPS subgroups. ET assigns residue functional importance scores without splitting into categories, and thus fails to differentiate between BPPS subgroups. As previously noted (Madabushi et al., 2002), high ET-scoring residues are often clustered structurally, which SIPRIS analyses confirm. Due to methodological differences, however, BPPS-SIPRIS clustering identifies sequence/structural features distinct from these other methods, as illustrated in Figure 1—source data 1. Although other methods may identify biologically relevant residues different than those identified here, this study suggests that by characterizing divergent subgroups, BPPS-SIPRIS analyses can identify significant, otherwise overlooked sequence/structural properties.
Discussion
Active site residues directly involved in catalysis are believed often to communicate with a network of other functionally important residues, some of which may be far from the active site (Sunden et al., 2015). The problem of identifying these networks is fundamental for understanding how proteins work. As illustrated here, BPPS-SIPRIS analyses can reveal information relevant to functional specialization by identifying statistically significant interaction networks. This includes, for example: (1) The nitrosation associated network in APE1 of the synaptojanin (EEP) superfamily. (2) The protein-protein interaction interfaces for diverse R4 GTPases. (3) The protein-protein interaction interface in EF-Tu, which can be hijacked by the P. aeruginosa Tse6 toxin (Whitney et al., 2015). In each of these cases, the residue-networks identified by our analysis suggest features congruent with current biochemical understanding of these proteins. Additionally, our analyses generated the following hypotheses: (1) Family-specific residues form hydrogen bonds (Figure 4C) responsible for APE1 abasic site substrate specificity. (2) INPP5 family and sub-family specific residues (Figure 5E–F) mediate, respectively, allosteric regulation and sequestration of specific membrane-associated phosphoinositide substrates from the lipid bilayer. (3) A hydrogen bond network associated with the residue most distinctive of metazoan TDGs, Asn230 in humans, mediates substrate-specific catalysis in DNA glycosylases, perhaps related to the discrimination of epigenetic marks present in metazoan DNA (Pastor et al., 2013; Zhang et al., 2012), such as 5-fC and 5-caC.
More generally our analyses suggest: (1) Family-specific residues often form a substrate-specific ‘reaction chamber’ associated with the structural core and active site, as seen for Gna1-related acetyltransferases, phosphoesterases related to APE1, and DNA glycosylases. (2) Subfamily-specific residues serve subordinate roles, such as mediating interactions with effector proteins, or coupling conformational changes to signaling. In this way, the same basic structural core and catalytic mechanism may accommodate a wide variety of cellular functions.
The SIPRIS clustering strategies described here accommodate further development. For example, one might use consensus distances from multiple structures to reduce noise. An open question is the significance of multiple BPPS-SIPRIS networks for a single subgroup, analogous to that for multiple regions of similarity between two sequences (Karlin and Altschul, 1993). Additional strategies include: applying BPPS-SIPRIS to functionally interacting proteins, treating them as a single sequence; and defining clusters using features such as secondary structure, surface accessibility or electrostatic potential. BPPS identifies correlated residue patterns presumably associated with functional specialization, and SIPRIS identifies correlations between defined residue sets and structural features. In contrast, DCA identifies correlations between pairs of residues that presumably interact structurally. Combining BPPS-SIPRIS with DCA may improve protein modeling and the characterization of functional interactions. Given the statistical and information theoretic foundation of these methods, one should be able to combine them in a principled manner.
In summary, the BPPS-SIPRIS system should aid the characterization of functionally interacting residues remote from protein active sites.
Materials and methods
BPPS-SIPRIS overview
Request a detailed protocolBPPS-SIPRIS analysis involves the following steps, as illustrated in Figure 7: (1) MAPGAPS (Neuwald, 2009) detects and aligns protein database sequences containing the domain of interest starting from a representative (‘seed’) MSA or from an hiMSA, either of which may be either curated manually or created automatically. This generates an MSA. (2) Bayesian Partitioning with Pattern Selection (BPPS) (Neuwald, 2014a; Neuwald, 2014b; Neuwald and Altschul, 2016a) is applied in three steps: (i) Step 1 uses Markov chain Monte Carlo (MCMC) sampling to partition the MSA into hierarchically-arranged subgroups based on the correlated residue patterns most distinctive of each subgroup. (ii) Step 2 converts the MSA into a hiMSA based on the BPPS hierarchy. (iii) Step 3 creates subgroup ‘contrast alignments’ and corresponding SIPRIS input files. (3) The SIPRIS program performs pattern residue cluster analyses and, as a runtime option, will create corresponding PyMOL (Schrodinger, 2010) scripts for viewing clusters within 3D structures (as in Figures 1–6). Each step in this process applies statistical criteria to ensure significance (see below).
Software and data availability
Request a detailed protocolBPPS-SIPRIS software, source code, instructions, and the input data required to perform the analyses described here are available at sipris.igs.umaryland.edu; this includes: (1) the MAPGAPS, BPPS, and SIPRIS programs; (2) MSA format conversion programs; (3) a phylum annotation program (fatax); and (4) the full multiple sequence alignments and pdb structural coordinate files used as input to BPPS and SIPRIS. The source code is available at sourceforge (sourceforge.net/p/bpps-sipris/code/; Neuwald, 2017). A copy is archived at https://github.com/elifesciences-publications/bpps-sipris-code.. The fatax program annotates sequences by phylum and kingdom based on the National Center for Biotechnology Information (NCBI) taxdump and prot.accession2taxid files, available at ftp://ftp.ncbi.nlm.nih.gov/. MAPGAPS searches were performed on the NCBI nr, env_nr and translated est databases (April 8, 2016 releases). Modeled hydrogen atoms were added to structural coordinate files using the Reduce program (Word et al., 1999) (http://kinemage.biochem.duke.edu/software/reduce.php).
MAPGAPS search and alignment
Request a detailed protocolMAPGAPS (Neuwald, 2009) creates an MSA by: (1) Taking as input either a small but ideally very accurate MSA, each sequence of which represents a distinct subgroup within a protein superfamily, or, alternatively, a set of hierarchically aligned MSAs, each of which represents a distinct subgroup. For the analyses here, we obtained from the NCBI conserved domain database (CDD) a set of hierarchically aligned MSAs or, if unavailable, a single curated MSA. A hiMSA from a previous BPPS analysis may also be used. (2) Creating a hidden Markov model (HMM) profile for each subgroup based on the input MSA. (3) Searching a protein sequence database and aligning each significantly scoring sequence (i.e. with p≤0.001) to the profile yielding the highest score. (4) Multiply aligning all the sequences obtained in this way using an alignment among profiles as a template (Neuwald, 2009). This process creates a large MSA that generally preserves the accuracy of the input alignment; BPPS uses this MSA as input. Table 2 describes the structural diversity of proteins with known structure identified in this way and included in our analysis. For a superfamily of domains near the limit of current sequence analysis methods’ ability to identify as related, we find that an average RMSD of 3.75 Å is typical. The RMSDs for the GNAT, EEP and UDG/TDG superfamilies fall below this value. Those for the GTPases are slightly higher, which can easily be explained by the conformational variability arising from GTPases’ function as switches. The helicases yield unusually high RMSDs, which are likely due to the large domain-domain movements typical of this clade.
MAPGAPS query alignments
Request a detailed protocolCurated MSAs used for MAPGAPS searches were constructed as follows: The GNAT and GTPase MSAs were curated by L. Aravind’s and A. Neuwald’s group, respectively. The NCBI CDD resource group curated the other query MSAs; the CDD codes are: cd00046, DEAD-like helicase superfamily; cd08372, Exonuclease-Endonuclease-Phosphatase (EEP) domain superfamily; and cd09593, Uracil-DNA glycosylases (UDG) and related enzymes. Using these MSAs as MAPGAPS queries, we searched the NCBI nr, env_nr and translated EST databases for matching sequences. For ESTs, we obtained organism codon usage and taxonomic information from NCBI taxdump files.
BPPS sampling
Request a detailed protocolStep 1 of the BPPS (Neuwald, 2014a, Neuwald, 2014b) program stochastically partitions an MSA into hierarchically arranged subgroups (i.e. nodes). Starting from a single root node, it attaches or removes leaf nodes, moves subtrees, inserts or deletes internal nodes, moves sequences between nodes, and modifies the ‘discriminating’ pattern for each node. BPPS samples from among possible patterns for each subgroup based on how well each pattern distinguishes subgroup-assigned sequences (termed ‘foreground’ sequences) from sequences assigned to the rest of the parent node’s subtree (termed ‘background’ sequences); Figure 7B illustrates this schematically. An optional Step E checks for consistency between BPPS Step 1 runs. Step 2 of BPPS (Neuwald and Altschul, 2016a) uses a combination of multiple sequence alignment and BPPS MCMC sampling. The Gibbs Sampler for Multi-alignment Optimization (GISMO) (Neuwald and Altschul, 2016b) adjusts each sub-group’s alignment by adding regions conserved in the subgroup but not in the superfamily as a whole. Further BPPS sampling then adjusts subgroup sequence and pattern assignments taking into consideration these newly aligned regions. This converts the MSA into a hierarchical interrelated MSA (hiMSA) (Figure 7C). Step 3 creates, for individual nodes in the hiMSA, both a rich text formatted (rtf) contrast alignment (as shown, for example, in figure source data files) and corresponding SIPRIS input files. Table 3 summarizes results for the five superfamilies analyzed here.
SIPRIS
Request a detailed protocolSIPRIS relies on a statistical approach termed Initial Cluster Analysis (ICA), which addresses the following questions: Consider a string of 0 s and 1 s of length L and containing D 1 s. Are some or all of the 1 s significantly clustered near the start of the sequence, and, if so, how surprising is the most significant such clustering? Elsewhere we describe and validate ICA (Altschul and Neuwald, 2017), which has a variety of biomedical applications. Here, we focus on the statistical and information theoretical bases of ICA as applied to BPPS-SIPRIS analysis.
BPPS-defined residue sets
Request a detailed protocolStep 2 of BPPS generates a hiMSA (Figure 7). For each subgroup (i.e. subtree) G within a hierarchy, BPPS defines a corresponding set of ‘discriminating’ residues that most distinguish members of that subgroup from closely related subgroups. This set is ordered from the most to the least distinguishing residues. We assume that these residues are likely responsible for functions specific to subgroup G. Although such a set typically includes residues with well-characterized functions, our focus is on residues of unknown functional relevance. When mapped to available structures, these distinguishing residues may readily suggest plausible hypotheses; in this respect, a BPPS analysis is informative by itself. However, SIPRIS can obtain deeper insight into and corroboration of a BPPS analysis by identifying significant overlap between BPPS-defined discriminating residues and structurally defined residue sets; we term the intersection of two such sets a BPPS-SIPRIS cluster. SIPRIS analysis was motivated, in part, by Karlin and Zhu’s approach (Karlin and Zhu, 1996) for identifying significant clusters of residues that share physical-chemical properties.
BPPS-SIPRIS predefined clusters
Request a detailed protocolThe simplest BPPS-SIPRIS analysis is based on a specific, predefined structural cluster of n residues. This corresponds to a ball-in-urn problem, in which the BPPS-defined distinguishing residues correspond to N1 red balls, the remaining residues to N2 black balls, and the cluster to n balls drawn from the urn. The probability that at least x of the n residues are distinguishing (i.e. are ‘red’) is given by the cumulative hypergeometric distribution:
BPPS-SIPRIS optimized-clusters
Request a detailed protocolSimilar to BPPS-predefined clustering is choosing the optimal BPPS-structural cluster among various alternatives. To construct these, we start from a well-defined position in space, and sequentially add ‘structurally adjacent’ residues (variously defined, as described in Results) to generate a set of nested, structurally defined clusters. From this nested set, we select the structural cluster that optimally overlaps with the BPPS-defined residue set by applying the Minimum Description Length (MDL) principle (Grunwald, 2007), as described in the next section. Optimizing over different starting residues, or different numbers of discriminating residues, requires further p-value adjustment, for which we currently apply the overly-conservative Bonferroni correction to obtain an upper bound.
The MDL principle
Request a detailed protocolTo avoid overfitting BPPS-SIPRIS statistical models to observed data, we apply the MDL principle (Grunwald, 2007), which can be understood as formalizing Occam's Razor (‘a model should not be needlessly complex’). Conceptually, this principle claims that the best among a set of alternative models is that which minimizes the description length of the model, plus the maximum-likelihood description length of the data given the model. This approach accounts for the implicit number of independent tests performed when optimizing the parameters of a model, and strikes a balance between a model's complexity and its ability to fit the data—in our case to describe biologically relevant amino acid residue patterns. More formally, a theory is a probability distribution over all possible sets of data, and a model is a parameterized set of theories. The description length of the data D given a model M, is then defined by DL(D|M) ≡ -log P(D|T), where T is maximum-likelihood theory contained in M (i.e. the theory which yields the greatest probability for D). The description length of the model M is defined by DL(M) ≡ log(N), where N is the number of the effectively distinct theories (i.e. parameter settings) M accommodates (Grunwald, 2007). The MDL principle aims to minimize DL(D|M)+DL(M).
MDL applied to BPPS-SIPRIS clustering
Request a detailed protocolBPPS-optimized clustering presents several mathematical challenges. Computing valid p-values requires adjusting for the multiple tests implicit in optimizing over starting residues and clusters. Also, this optimization itself may carry an implicit bias favoring small or large clusters, as outlined below.
We start with a null model in which discriminating residues (e.g. defined by BPPS) are distributed randomly throughout an entire sequence. Given a fixed number of discriminating residues, this model yields a uniform likelihood for all sets of data, and serves as a basis of comparison for likelihoods generated by an alternative model. This model divides the sequence into an initial segment of length x (which we refer to as a cluster) having m discriminating residues, and a terminal segment of length y having n discriminating residues. The model assumes discriminating residues are generated with different probabilities in the initial and terminal segments, and its maximum-likelihood theory assigns the likelihood to the data. For a particular cut-point x, this likelihood requires choosing the discriminating-residue probabilities m/x and n/y for the initial and terminal segments, and is easily normalized for the selection of these parameters. Our aim, however, is to pick the x (i.e. cluster) that yields the greatest likelihood for the data. Applying the MDL principle requires calculating the effective number of independent tests N implicit in choosing x (Altschul and Neuwald, 2017). By treating x as a continuous as opposed to a discrete parameter, we are able to calculate its Fisher information (Altschul and Neuwald, 2017), and thus N.
One subtlety is that simply choosing the cut point x yielding the greatest likelihood implicitly favors low or high values of x. This occurs because the Fisher information is greater at extreme values of x, implying that the likelihoods are more independent of one another at those values. Empirical analyses show that this bias toward large and small clusters often yields suboptimal results from a biological perspective. However, by adding an x-dependent correction, derived from the Fisher information, to our optimization, we may flatten the implicit prior associated with x (Altschul and Neuwald, 2017). Random simulation shows that analytic p-values computed using our approach fall within about 20% of empirical p-values. We still need to adjust these p-values for clusters found using different starting residues. Absent a better approach, we currently apply the simple but overly conservative Bonferroni correction (Bonferroni, 1936).
Runtimes
Request a detailed protocolThe runtime bottleneck in an analysis is BPPS. BPPS runtimes depend on the desired depth of the hierarchy, on the width of and the number of sequences in the input MSA and on the minimum number of sequences required to define a leaf node. For example, on a 64-bit Linux workstation, a 125,000-sequence GTPase MSA requires about 4 weeks to generate a 120 node hierarchy up to eight nodes deep and with a minimum leaf node size of 500 sequences. Note that much of this time is spent marginally refining a hierarchy. This approach is not recommended. Instead, we suggest running an initial analysis at a depth of 1 and then using the BPPS ‘focus’ option with a maximum depth of 2–4 to expand the subtree for a specific major node of interest. For the GTPase MSA, this approach takes less than a few days.
MSA cma format
Request a detailed protocolThe programs used here require cma-formatted MSAs. The cma (collinear multiple alignment) format, which is unique to our programs, allows the specification of a hierarchically-arranged set of MSAs, such as are created in step 2 of BPPS and which serve as input to the MAPGAPS program. (MAPGAPS will also take as input a single MSA either in cma or fasta format.) For a single MSA, the cma format consists of a header line, such as ‘[0_(1)=name(135){go = 10000,gx = 2000,pn = 1000.0,lf = 0,rf = 0}:'. The leftmost ‘0’ labels this as the root node of a hierarchical MSA; ‘(1)’ indicates a single aligned block (this parameter is utilized during MCMC sampling); ‘name’ labels the MSA; ‘135’ indicates the number of aligned sequences; and the string in curly brackets gives parameter settings that are not used here. This is followed by a second header line, such as ‘(20)********************', where ‘20’ indicates the number of aligned columns and the asterisks designate which columns MCMC column should be sampled (Neuwald et al., 1997).
Each sequence in the MSA is specified by three lines. An example of the first line is ‘$41 = 34(28):', where ‘$41’ indicates that this is the 41st sequence, ‘34’ indicates the total number of residues in the sequence and ‘28’ the number of residues and gaps (‘-‘) minus the number of insertions (this information is required for MCMC sampling). The second line gives a fasta formatted identifier and description, such as ‘>4ABC_A’. And the third line, such as ‘{(QEYP)ID-QTGKCEPYigqiTKCStfLPNST(NVTN)}*', specifies the aligned sequence where residues within parentheses represent regions flanking the aligned region on either side; upper- and lower-case letters represent matching and insertion residues, respectively; and gap characters represent deletions. The curly brackets on each end allow multiple aligned blocks to be defined during MCMC sampling. The last line of the MSA, such as ‘_0].', indicates the end of the MSA; this syntax allows multiple (hierarchically arranged) MSAs to be included within a single input file.
Additional considerations
Request a detailed protocolBPPS assigns a log-odds score to each pattern residue; ranked by these scores, a specific number of positions are considered by SIPRIS. SIPRIS identifies the statistically most significant intersection, if any, between the BPPS- and structurally defined residue sets; adjusting its p-value for the number of starting residues considered. Note that discriminating residues outside of the intersection may have BPPS scores as high as or higher than those within; SIPRIS makes no distinctions in this regard.
PyMol 3D visualization
Request a detailed protocolGiven structural coordinates as input, the BPPS and SIPRIS programs will generate PyMol (Schrodinger, 2010) scripts to aid visualization of BPPS-defined residues and of BPPS-SIPRIS structural networks, respectively.
Appendix 1
BPPS-SIPRIS and functional binding sites
BPPS-SIPRIS does not seek to identify functional binding sites per se, though it often reveals such sites after the fact. We illustrate this by comparing our analyses of Gna1, Rab4 and Rab8 with annotations on the Inferred Biomolecular Interactions Server (IBIS) (Shoemaker et al., 2012)—keeping in mind that BPPS-SIPRIS will also identify residues having other roles. Thirteen of the 22 residues in the Gna1 family-specific network are among the 27 residues annotated on the IBIS as binding either to substrate (N-acetyl-D-glucosamine-6-phosphate) or to the other homodimeric subunit. Eleven of the 14 residues in the Rab4 subfamily-specific network are among the 15 Rabenosyn-5 binding residues annotated by the IBIS. Structures are available for Rab8 bound to Ocrl1 (3qbt) (Hou et al., 2011), to the guanine nucleotide exchange factor Rabin8 (4lhx) (Guo et al., 2013), and to the bMERB domain of Mical-cL (5szi) (Rai et al., 2016). The IBIS annotates nine Rab8 residues as binding to all three components: K10, I43, D44, F45, I47, W62, I73, Y77, and R79. Seven of these (all but the flanking residues, K10 and R79) are included among the 19 Rab8 subfamily and sub-subfamily residues that interact with the Ocrl1 binding interface defined by SIPRIS. Of course, SIPRIS performs a benchmarking role similar to that of the IBIS inasmuch as both utilize 3D interactions within high quality crystal structures as gold standards.
Benchmarking BPPS against the SFLD
We benchmarked BPPS by applying it to eight superfamilies within the Structure Function Linkage Database (SFLD) (Akiva et al., 2014); other SFLD superfamilies contained too few sequences or subgroups (<2) for BPPS. SFLD classifies each superfamily into subgroups based on sequence and structural considerations and, to a very limited extent, on experimental data. Most SFLD superfamilies also include unannotated and automatically annotated sequences, which were used to enlarge the input sets, but not for benchmarking. We aligned the sequences for each superfamily using GISMO (Neuwald and Altschul, 2016b), and then removed both redundant (>98% identical) sequences and sequences having deletions in more than 30% of aligned columns. To ensure sufficient statistical support, we required that each BPPS leaf node correspond to at least 100 to 800 sequences, depending on superfamily size and diversity.
BPPS misclassified at most 0.14% of the annotated sequences overall (Appendix 1—table 1). For some superfamilies, BPPS assigned some subgroups to multiple nodes or several subgroups to a single node; in such cases, however, there were no inherent conflicts. As illustrated in Appendix 1—table 2, for haloacid dehalogenases (HADs): (i) BPPS assigned certain SFLD subgroups (SGs) to the root node, presumably due to their being too small or lacking a significant distinguishing pattern. (ii) For certain SFLD subgroups BPPS assigns some sequences to the root node due to a significant number of pattern mismatches (e.g., SG1138). (iii) BPPS misclassified 101 HAD sequences, probably due to misalignments. (This MSA contains a large number of indels rendering it relatively inaccurate—a problem that is addressed in Materials and Methods.) (iv) Several sequences appear to be erroneously assigned by SFLD to SG1129 (Appendix 1—table 3). BPPS assigned 108 of the 129 sequences in SG1129 to the root, but 21 sequences to four other nodes. As shown in Appendix 1—table 3, using the criterion of mean BLAST score, the sequences assigned to each node better match non-SG1129 sequences of that node than they do SG1129 sequences assigned to other nodes. This analysis avoids problems that may arise from MSA errors and suggests how best to reassign misclassified sequences (Appendix 1—figure 1 and Appendix 1—table 4). A similar analysis of the radical SAM superfamily indicates that 326 sequences assigned by SFLD to SG1118 are more closely related to SG1066 than they are to other SG1118 sequences. The respective high mean BLAST scores of 429 and 210 may suggest that SG1118 and SG1066 should be merged. In this way, BPPS may be useful for protein subgroup database curation.
Dependency of BPPS-SIPRIS on the input MSA
We assessed the dependence of BPPS-SIPRIS on the quality of the input using jackhmmer (Finn et al., 2015) MSAs, which should be less accurate than the MAPGAPS MSAs used here (see Materials and methods and Figure 7). Jackhmmer, which applies a query-centric iterative algorithm similar to that of PSI-BLAST (Altschul et al., 1997), was run over the EV-fold website (evfold.org) with its default parameter settings (five iterations; aligned columns or sequences with >30% deletions ignored). Using Gna1, APE1, Rho1 and eIF4AIII as queries, jackhmmer yielded MSAs of 107,738, 36,297, 99,838 and 107,213 sequences, respectively; these correspond to four of the five superfamilies examined here. (For the UDG/TDG superfamily, jackhmmer pulled in too few sequences for a valid comparison.) In each case, BPPS-SIPRIS identified nearly the same residue sets using either type of MSA (Appendix 1—table 5), with the p-values for jackhmmer MSAs tending to be less significant. Thus, BPPS-SIPRIS yields fairly consistent results using very different MSAs, although using MAPGAPS with a curated MSA helps detect more sequences and improve sensitivity.
References
-
The structure-function linkage databaseNucleic Acids Research 42:D521–D530.https://doi.org/10.1093/nar/gkt1130
-
Gapped BLAST and PSI-BLAST: a new generation of protein database search programsNucleic Acids Research 25:3389–3402.https://doi.org/10.1093/nar/25.17.3389
-
Initial Cluster AnalysisJournal of Computational Biology 24:.https://doi.org/10.1089/cmb.2017.0050
-
Comparative genomics and evolution of proteins involved in RNA metabolismNucleic Acids Research 30:1427–1464.https://doi.org/10.1093/nar/30.7.1427
-
Keep on moving: discovering and perturbing the conformational dynamics of enzymesAccounts of Chemical Research 48:423–430.https://doi.org/10.1021/ar5003158
-
Teoria statistica delle classi e calcolo delle probabilitàPubblicazioni Del R Istituto Superiore Di Scienze Economiche E Commerciali Di Firenze 8:3–62.
-
Unique structural and nucleotide exchange features of the Rho1 GTPase of Entamoeba histolyticaJournal of Biological Chemistry 286:39236–39246.https://doi.org/10.1074/jbc.M111.253898
-
The multiple functions of RNA helicases as drivers and regulators of gene expressionNature Reviews Molecular Cell Biology 17:426–438.https://doi.org/10.1038/nrm.2016.50
-
Direct evidence of an elongation factor-Tu/Ts·GTP·Aminoacyl-tRNA quaternary complexJournal of Biological Chemistry 289:23917–23927.https://doi.org/10.1074/jbc.M114.583385
-
The role of protein dynamics in the evolution of new enzyme functionNature Chemical Biology 12:944–950.https://doi.org/10.1038/nchembio.2175
-
A method to predict functional residues in proteinsNature Structural & Molecular Biology 2:171–178.https://doi.org/10.1038/nsb0295-171
-
A survey on prediction of specificity-determining sites in proteinsBriefings in Bioinformatics 16:71–88.https://doi.org/10.1093/bib/bbt092
-
CAFA and the open world of protein function predictionsTrends in Genetics 29:609–610.https://doi.org/10.1016/j.tig.2013.09.005
-
Structural and biochemical characterization of a trapped coenzyme A adduct of Caenorhabditis elegans glucosamine-6-phosphate N-acetyltransferase 1Acta Crystallographica Section D Biological Crystallography 68:1019–1029.https://doi.org/10.1107/S0907444912019592
-
Capturing snapshots of APE1 processing DNA damageNature Structural & Molecular Biology 22:924–931.https://doi.org/10.1038/nsmb.3105
-
Predicting functional divergence in protein evolution by site-specific rate shiftsTrends in Biochemical Sciences 27:315–321.https://doi.org/10.1016/S0968-0004(02)02094-7
-
Intermediates in the guanine nucleotide exchange reaction of Rab8 protein catalyzed by guanine nucleotide exchange factors Rabin8 and GRABJournal of Biological Chemistry 288:32466–32474.https://doi.org/10.1074/jbc.M113.498329
-
Analysis and prediction of functional sub-types from protein sequence alignmentsJournal of Molecular Biology 303:61–76.https://doi.org/10.1006/jmbi.2000.4036
-
SH3YL1 regulates dorsal ruffle formation by a novel phosphoinositide-binding domainThe Journal of Cell Biology 193:901–916.https://doi.org/10.1083/jcb.201012161
-
Using evolutionary information to find specificity-determining and co-evolving residuesMethods in Molecular Biology 541:421–448.https://doi.org/10.1007/978-1-59745-243-4_18
-
MUSTANG: a multiple structural alignment algorithmProteins: Structure, Function, and Bioinformatics 64:559–574.https://doi.org/10.1002/prot.20921
-
MINER: software for phylogenetic motif identificationNucleic Acids Research 33:W267–W270.https://doi.org/10.1093/nar/gki465
-
Classification and evolution of P-loop GTPases and related ATPasesJournal of Molecular Biology 317:41–72.https://doi.org/10.1006/jmbi.2001.5378
-
An evolutionary trace method defines binding surfaces common to protein familiesJournal of Molecular Biology 257:342–358.https://doi.org/10.1006/jmbi.1996.0167
-
Coordination of steps in single-nucleotide base excision repair mediated by apurinic/apyrimidinic endonuclease 1 and DNA polymerase betaJournal of Biological Chemistry 282:13532–13541.https://doi.org/10.1074/jbc.M611295200
-
Non-hexameric DNA helicases and translocases: mechanisms and regulationNature Reviews Molecular Cell Biology 9:391–401.https://doi.org/10.1038/nrm2394
-
Structural clusters of evolutionary trace residues are statistically significant and common in proteinsJournal of Molecular Biology 316:139–154.https://doi.org/10.1006/jmbi.2001.5327
-
Protein structure prediction from sequence variationNature Biotechnology 30:1072–1080.https://doi.org/10.1038/nbt.2419
-
Aromatic rescue of glycine in beta sheetsFolding and Design 3:449–456.https://doi.org/10.1016/S1359-0278(98)00062-5
-
A family of evolution-entropy hybrid methods for ranking protein residues by importanceJournal of Molecular Biology 336:1265–1282.https://doi.org/10.1016/j.jmb.2003.12.078
-
Inference of functionally-relevant N-acetyltransferase residues based on statistical correlationsPLOS Computational Biology 12:e1005294.https://doi.org/10.1371/journal.pcbi.1005294
-
Bayesian top-down protein sequence alignment with inferred position-specific gap penaltiesPLOS Computational Biology 12:e1004936.https://doi.org/10.1371/journal.pcbi.1004936
-
Extracting protein alignment models from the sequence databaseNucleic Acids Research 25:1665–1677.https://doi.org/10.1093/nar/25.9.1665
-
The charge-dipole pocket: a defining feature of signaling pathway GTPase on/off switchesJournal of Molecular Biology 390:142–153.https://doi.org/10.1016/j.jmb.2009.05.001
-
Protein domain hierarchy Gibbs sampling strategiesStatistical Applications in Genetics and Molecular Biology 13:497–517.https://doi.org/10.1515/sagmb-2014-0008
-
A Bayesian sampler for optimization of protein domain hierarchiesJournal of Computational Biology 21:269–286.https://doi.org/10.1089/cmb.2013.0099
-
Enacyloxin IIa pinpoints a binding pocket of elongation factor Tu for development of novel antibioticsJournal of Biological Chemistry 281:2893–2900.https://doi.org/10.1074/jbc.M505951200
-
TETonic shift: biological roles of TET proteins in DNA demethylation and transcriptionNature Reviews Molecular Cell Biology 14:341–356.https://doi.org/10.1038/nrm3589
-
Sequence comparison by sequence harmony identifies subtype-specific functional sitesNucleic Acids Research 34:6540–6548.https://doi.org/10.1093/nar/gkl901
-
Nitric oxide controls nuclear export of APE1/Ref-1 through S-nitrosation of cysteines 93 and 310Nucleic Acids Research 35:2522–2532.https://doi.org/10.1093/nar/gkl1163
-
Protein conformational populations and functionally relevant substatesAccounts of Chemical Research 47:149–156.https://doi.org/10.1021/ar400084s
-
Evolution-based design of proteinsMethods in Enzymology 523:213–235.https://doi.org/10.1016/B978-0-12-394292-0.00010-2
-
Tissue distribution and intracellular localisation of the 75-kDa inositol polyphosphate 5-phosphataseEuropean Journal of Biochemistry 234:216–224.https://doi.org/10.1111/j.1432-1033.1995.216_c.x
-
Evolution of a protein interaction domain family by tuning conformational flexibilityJournal of the American Chemical Society 138:15150–15156.https://doi.org/10.1021/jacs.6b05954
-
Evolutionary trace for prediction and redesign of protein functional sitesMethods in Molecular Biology 819:29–42.https://doi.org/10.1007/978-1-61779-465-0_3
-
Visualizing and quantifying molecular goodness-of-fit: small-probe contact dots with explicit hydrogen atomsJournal of Molecular Biology 285:1711–1733.https://doi.org/10.1006/jmbi.1998.2400
-
Computational methods for identification of functional residues in protein structuresCurrent Protein & Peptide Science 12:456–469.https://doi.org/10.2174/138920311796957685
-
Thymine DNA glycosylase specifically recognizes 5-carboxylcytosine-modified DNANature Chemical Biology 8:328–330.https://doi.org/10.1038/nchembio.914
-
Mutational analysis of the roles of residues in Escherichia coli elongation factor Ts in the interaction with elongation factor TuJournal of Biological Chemistry 273:4556–4562.https://doi.org/10.1074/jbc.273.8.4556
Article and author information
Author details
Funding
University of Maryland
- Andrew F Neuwald
National Institutes of Health (Intramural Research Program)
- L Aravind
- Stephen F Altschul
The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.
Acknowledgements
LA and SFA were supported by the Intramural Research Program of the National Institutes of Health, National Library of Medicine. AFN received no specific funding for this work, but was supported by the University of Maryland, Baltimore. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Copyright
This is an open-access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.
Metrics
-
- 1,486
- views
-
- 230
- downloads
-
- 14
- citations
Views, downloads and citations are aggregated across all versions of this paper published by eLife.
Download links
Downloads (link to download the article as PDF)
Open citations (links to open the citations from this article in various online reference manager services)
Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)
Further reading
-
- Computational and Systems Biology
- Microbiology and Infectious Disease
Timely and effective use of antimicrobial drugs can improve patient outcomes, as well as help safeguard against resistance development. Matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) is currently routinely used in clinical diagnostics for rapid species identification. Mining additional data from said spectra in the form of antimicrobial resistance (AMR) profiles is, therefore, highly promising. Such AMR profiles could serve as a drop-in solution for drastically improving treatment efficiency, effectiveness, and costs. This study endeavors to develop the first machine learning models capable of predicting AMR profiles for the whole repertoire of species and drugs encountered in clinical microbiology. The resulting models can be interpreted as drug recommender systems for infectious diseases. We find that our dual-branch method delivers considerably higher performance compared to previous approaches. In addition, experiments show that the models can be efficiently fine-tuned to data from other clinical laboratories. MALDI-TOF-based AMR recommender systems can, hence, greatly extend the value of MALDI-TOF MS for clinical diagnostics. All code supporting this study is distributed on PyPI and is packaged at https://github.com/gdewael/maldi-nn.
-
- Computational and Systems Biology
- Genetics and Genomics
Enhancers and promoters are classically considered to be bound by a small set of transcription factors (TFs) in a sequence-specific manner. This assumption has come under increasing skepticism as the datasets of ChIP-seq assays of TFs have expanded. In particular, high-occupancy target (HOT) loci attract hundreds of TFs with often no detectable correlation between ChIP-seq peaks and DNA-binding motif presence. Here, we used a set of 1003 TF ChIP-seq datasets (HepG2, K562, H1) to analyze the patterns of ChIP-seq peak co-occurrence in combination with functional genomics datasets. We identified 43,891 HOT loci forming at the promoter (53%) and enhancer (47%) regions. HOT promoters regulate housekeeping genes, whereas HOT enhancers are involved in tissue-specific process regulation. HOT loci form the foundation of human super-enhancers and evolve under strong negative selection, with some of these loci being located in ultraconserved regions. Sequence-based classification analysis of HOT loci suggested that their formation is driven by the sequence features, and the density of mapped ChIP-seq peaks across TF-bound loci correlates with sequence features and the expression level of flanking genes. Based on the affinities to bind to promoters and enhancers we detected five distinct clusters of TFs that form the core of the HOT loci. We report an abundance of HOT loci in the human genome and a commitment of 51% of all TF ChIP-seq binding events to HOT locus formation thus challenging the classical model of enhancer activity and propose a model of HOT locus formation based on the existence of large transcriptional condensates.