1. Structural Biology and Molecular Biophysics
  2. Computational and Systems Biology
Download icon

Prediction of enzymatic pathways by integrative pathway mapping

  1. Sara Calhoun
  2. Magdalena Korczynska
  3. Daniel J Wichelecki
  4. Brian San Francisco
  5. Suwen Zhao
  6. Dmitry A Rodionov
  7. Matthew W Vetting
  8. Nawar F Al-Obaidi
  9. Henry Lin
  10. Matthew J O'Meara
  11. David A Scott
  12. John H Morris
  13. Daniel Russel
  14. Steven C Almo
  15. Andrei L Osterman
  16. John A Gerlt  Is a corresponding author
  17. Matthew P Jacobson  Is a corresponding author
  18. Brian K Shoichet  Is a corresponding author
  19. Andrej Sali  Is a corresponding author
  1. University of California, San Francisco, United States
  2. University of Illinois, United States
  3. Sanford Burnham Prebys Medical Discovery Institute, United States
  4. Russian Academy of Sciences, Russia
  5. Albert Einstein College of Medicine, United States
Research Article
  • Cited 6
  • Views 1,883
  • Annotations
Cite this article as: eLife 2018;7:e31097 doi: 10.7554/eLife.31097

Abstract

The functions of most proteins are yet to be determined. The function of an enzyme is often defined by its interacting partners, including its substrate and product, and its role in larger metabolic networks. Here, we describe a computational method that predicts the functions of orphan enzymes by organizing them into a linear metabolic pathway. Given candidate enzyme and metabolite pathway members, this aim is achieved by finding those pathways that satisfy structural and network restraints implied by varied input information, including that from virtual screening, chemoinformatics, genomic context analysis, and ligand -binding experiments. We demonstrate this integrative pathway mapping method by predicting the L-gulonate catabolic pathway in Haemophilus influenzae Rd KW20. The prediction was subsequently validated experimentally by enzymology, crystallography, and metabolomics. Integrative pathway mapping by satisfaction of structural and network restraints is extensible to molecular networks in general and thus formally bridges the gap between structural biology and systems biology.

https://doi.org/10.7554/eLife.31097.001

Introduction

Problem and approach

The functions of most sequenced proteins have not been determined by experiment (Gerlt et al., 2011; Jacobson et al., 2014; Schnoes et al., 2009). They are also difficult to predict for enzymes with less than 60% sequence identity to characterized enzymes (Radivojac et al., 2013). The problem is much greater when seeking to predict the functions of entire metabolic pathways. Here, we propose a computational approach that outputs an enzymatic pathway and corresponding ligands, given a set of potential enzymes and metabolites, using information derived by experiment and/or computational analyses (Figure 1). The approach benefits from two considerations. First, predicting an entire pathway may sometimes be easier than predicting individual enzymatic functions in isolation, because the product of one enzyme is the substrate for the next in the pathway. Thus, even when each enzyme’s ligand assignment is ambiguous, the ligand assignments consistent with both enzymes may be more precise and accurate. Second, while it may be impossible to identify a pathway using information from any single method, there may be sufficient information provided by several methods. Our approach was inspired by the previous work using metabolite docking to multiple enzymes and substrate-binding proteins hypothesized to participate in the pathway (Jacobson et al., 2014; Kalyanaraman and Jacobson, 2010; Macchiarulo et al., 2004; Zhao et al., 2013), integrative structure determination of large macromolecular assemblies (Alber et al., 2007; Russel et al., 2012), and comparative genomics approaches for metabolic reconstructions (Markowitz et al., 2010; Karp et al., 2016; Osterman and Overbeek, 2003; Overbeek et al., 2005; Yamanishi et al., 2007; Ye et al., 2005).

Figure 1 with 3 supplements see all
Overview of integrative pathway mapping method.

The four stages of integrative modeling are: (1) Gathering information, (2) Designing model representation and evaluation, (3) Sampling good models, and (4) Analyzing models and information. (1) Here, the input information is gathered from seven different sources used to determine the candidate proteins, such as co-localization and conservation in the genome neighborhood, and the scoring restraints (docking scores from virtual screening, chemical transformations, ensemble similarity calculations of virtual screening hits from similarity ensemble approach, DSF screening hits, metabolic endpoints, and characterized chemical reactions). (2) A pathway model is represented as a graph composed of protein and ligand nodes. Proteins are depicted as diamonds and ligands are depicted as circles, with lines showing the node patterns evaluated by a given type of information. (3) The combinatorial optimization problem is solved by Monte Carlo simulated annealing sampling, consisting of randomly swapping nodes in and out of the pathway model and rearranging the edges between the nodes. (4) The final analysis stage involves assessing the sampling, precision, and accuracy of the models.

https://doi.org/10.7554/eLife.31097.002

A pathway model is represented as a graph in which the enzymes, substrates, and products are nodes and the enzymatic reactions are edges (Figure 1). Input information, such as scores from experimental ligand screening, molecular docking screens, and chemical similarity, is encoded as ‘network’ restraints on the identity of the nodes and edges in the pathway; these restraints are combined into a scoring function. An ensemble of pathways consistent with the input information is computed by a Monte Carlo algorithm that samples well-scoring pathways over possible enzymes and metabolites. The resulting ensemble of good-scoring pathway models is assessed by its precision, its satisfaction of the input restraints, and ideally experimental observations not used in its construction. In addition to gauging the accuracy and precision of the models and the observations, this analysis can identify the most informative future experiments. Because the approach ranks alternative pathways using all available information, it in principle produces maximally accurate, precise, and complete pathway models given that information. The process of data gathering and modeling can iterate until a satisfactory model is obtained. We suggest that the four stages of integrative pathway mapping by satisfaction of network restraints mimic how human experts often derive and test pathway models.

Results

The approach begins with a list of candidate proteins (here enzymes, binding proteins, and transporters) and a list of endogenous metabolites that are candidate substrates or products of these enzymes (Figure 1, Figure 1—figure supplement 1). The pathway members can be winnowed from the entire proteome by predicting functionally related proteins using information about the genome organization that is often available for bacterial pathways (Zhao et al., 2014). For example, for the gulonate pathway, we identified five metabolic enzymes that are conserved in the genome neighborhood of the TRAP transporter gene by constructing a genome neighborhood network (GNN) (Figure 1—figure supplement 2); the GNN approach has been demonstrated to accurately predict enzymes and transporters that function together in metabolic pathways based on conserved protein families in genome neighborhoods across different species (Zhao et al., 2014). The network restraints can then be inferred in multiple ways, exemplified by the following restraints in this study (Figure 2A). First, for each candidate, the libraries of endogenous metabolites are docked against either an experimentally determined structure if available or a comparative structure model (Mysinger and Shoichet, 2010). In the case of glycolysis, 2965 sugars in the KEGG database were screened against two crystal structures and eight comparative models for the 10 enzymes in this pathway (Kalyanaraman and Jacobson, 2010). Second, with the top 500 metabolites docked-and-ranked against each of the enzymes, the pathway enzymes may be linked by the similarity of their high-ranking docked ligands, here using the chemoinformatic Similarity Ensemble Approach (SEA) (Keiser et al., 2007; Lin et al., 2013); other related approaches can also be used (Besnard et al., 2012; Gregori-Puigjané and Mestres, 2006; Mestres et al., 2006; Nidhi et al., 2006; Paolini et al., 2006). The restraint can be informative because enzymes are often more likely to be pathway neighbors when their high-scoring docked ligands resemble each other. For instance, the top 500 metabolites of 3-phosphate dehydrogenase in the glycolysis pathway (as ranked by docking) are dominated by six chemotypes, while the phosphoglycerate kinase has three of these chemotypes overrepresented. This similarity is captured by the SEA E-value of 9.5 10−63, suggesting that the observed level of similarity between the two predicted ligand lists is unlikely to have occurred by chance (Figure 2—figure supplement 1C). Thus, the two enzymes are linked by their related predicted metabolites. Third, consideration of the enzymatic reaction types assigned to the enzymes’ superfamilies restrains the reactions in the pathway models. We require that the predicted metabolites can actually be substrates or products of an enzyme, given its reaction profile extracted from its protein family annotation. As an example, the glyceraldehyde 3-phosphate dehydrogenase is assigned the reactions that can convert an aldehyde to a phosphate and vice versa (Supplementary file 6). Finally, all available experimental screening hits, substrate specificities from homology, constraints on the pathway endpoints, and other information can also be considered. Each of these considerations is added to a scoring function that ranks alternative pathway models by assessing their consistency with the available information (Figure 2B). Thus, pathway models are preferred when they contain (i) good-scoring metabolite-enzyme pairs, (ii) pairs of neighboring enzymes that share chemotypes, (iii) pairs of neighboring enzymes catalyzing chemical reactions that allow the product of an upstream enzyme to be a substrate of the downstream enzyme, etc. This integrative approach does not require that each type of restraint be available for each protein and metabolite, nor that each restraint is accurate or precise; it only requires that the scoring function consisting of all restraints is sufficiently accurate and precise.

Figure 2 with 1 supplement see all
Representation of alternative models obtained based on consistency with input information provided for the glycolysis benchmark pathway.

(A) Example of three alternative models evaluated using different types of restraints based on modeling of the glycolysis pathway with a subset of pathways shown. The restraints on node patterns are shown using colored lines (blue – docking restraints, green – SEA restraints, purple – chemical transformation restraints, red – restraints with unfavorable scores). Metabolites are labeled by KEGG ID and enzymes are labeled by step in glycolysis pathway. On the left, alternate model one is consistent with docking scores, but not with all SEA scores and chemical transformations. In the middle, alternate model two is consistent with the docking scores and SEA scores, but not with chemical transformations. On the right, alternate model three is consistent with docking scores, SEA scores, and chemical transformations, thus increasing the rank of the correct enzyme-substrate pairings. (B) Alternative models shown with chemical structures. (C), Ranks of correct substrate for the corresponding enzyme at each step in the glycolysis benchmark case. 1 – glucokinase, 2 – phosphoglucose isomerase, 3 – phosphofructokinase, 4 – fructose bisphosphate aldolase, 5 – triosephosphate isomerase, 6 – glyceraldehyde 3-phosphate dehydrogenase, 7 – phosphoglycerate 8 – phosphoglycerate mutase 9 – enolase and 10 – pyruvate kinase.

https://doi.org/10.7554/eLife.31097.006

Benchmarking

The method was tested by retrospectively ordering enzymes and identifying their substrates in three well characterized pathways: glycolysis (10 enzymes) (Kalyanaraman and Jacobson, 2010), cytidine monophosphate 3-deoxy-D-manno-octulosonate 8-phosphate (CMP KDO-8P) biosynthesis (four enzymes), and serine biosynthesis (five enzymes) (Supplementary files 1 and 2). Docking screens of several thousand metabolites against comparative models of the enzymes, chemical transformation annotations of the enzymes, and the chemoinformatic SEA analysis were the input information for mapping each pathway. Because the functions of these enzymes have been characterized, homology-based annotations were not included as restraints for the purposes of the retrospective benchmark. The method successfully identified the substrates and products and correctly ordered all pathway components in the top-scoring models (Figure 2; Supplementary files 1 and 2). The method performed well even when the number and identity of pathway enzymes were unspecified (Figure 3AB) or when the candidate enzymes set was incomplete (Figure 2—figure supplement 1D).

Figure 3 with 2 supplements see all
12 representative predictions of the L-gulonate TRAP-SBP catabolic pathway.

(A) 12 representative pathway models of TRAP SBP pathway predictions ordered by score, starting from the top with the best-scored prediction. The scores of the representative pathways are listed to the right of the corresponding pathway. Pathway enzymes are labeled by numbers as follows: 1 – HiGulD, 2 – HiUxuB, 3 – HiUxuA, 4 – HiKdgK, 5 – HiKdgA. (B) Graphical representation of an ensemble of representative pathway models. The predicted components in the ensemble of pathway models at each position are vertically aligned to the corresponding position in the gray pathway on the top. Ligand components are shown as circle nodes with the color corresponding to the ligand identity. Chemical structures are shown in Figure 3—figure supplement 2. Pathway enzymes are shown as diamond nodes with the same numbering as above. Edges are colored by individual pathway model prediction. The validated prediction is shown by black edges, enzyme nodes are colored black, and substrate/product nodes are outlined in black.

https://doi.org/10.7554/eLife.31097.008

The accuracy of the predicted benchmark pathways is not limited by the lack of sampling (Figure 3—figure supplement 1), but rather by the input information (Figure 2A). Thus, the integration of multiple types of information improves the accuracy and precision of pathway prediction (Figure 2BC). For example, the correct substrate of the dehydrogenase in the glycolysis pathway, glyceraldehyde 3-phosphate (G3P), is predicted to be most consistent with all the input information. Although by docking alone, the rank of G3P is only 117 out of the 2965 metabolites docked, the additional restraints from SEA and chemical transformations lead to the overall top ranking of G3P (Figure 2A).

Several caveats bear mentioning. Both thorough sampling and accurate scoring become more difficult when the number of possible pathways increases (which in turn arises from a large set of candidate enzymes and metabolites), when some enzymes or metabolites are not in the input set, when the pathway is long, or its length is unknown. Here, only linear pathways are sampled; thus, non-linear pathways, including cyclic pathways, are not modeled. The preparation of input information requires manual processing. Although docking, chemoinformatics, comparative modeling, chemical transformations, and differential scanning fluorimetry (DSF) screening information may be collected in an automated way, the quality of information often benefits from expert choices. For example, comparative model building can be especially time consuming when low sequence similarity structures are available for target building, and docking may require expert intervention when parameterization of cofactors is necessary for correctly defining the binding site. Nevertheless, we emphasize that once the input information is provided, its conversion into the predicted pathway is automated and does not require human intervention. Finally, docking against modeled structures will sometimes fail, even with the added advantages of insisting on consistency in docking hit lists. Some of these pitfalls can be detected through testing the thoroughness of sampling (Figure 3—figure supplement 1), statistical bootstrapping and jack-knifing tests (Efron, 1981), and by direct experimental testing of predictions (Figure 4). The method becomes more robust when the pathway start and end are defined. More generally, failures can also be reduced by introducing restraints or constraints that limit the size of the input enzyme and metabolite sets, by improving the accuracy of the scoring function, by limiting the sampling, or by further filtering the set of good-scoring solutions.

Figure 4 with 2 supplements see all
Catabolic pathway of H. influenzae Rd KW20.

(A) The best-scoring pathway identified using the integrative mapping approach is annotated with experimental evidence: enzyme activity (blue), fitness growth determinants (red), transcript analyses on L-gulonate media (orange), atomic structure (green), and isotopic metabolic labeling (purple). The pathway demonstrates L-gulonate degradation into glyceraldehyde 3-phosphate and pyruvate. Bonds undergoing changes in the subsequent steps are colored in red. (B) Kinetics of pathway enzymes on predicted substrates. (C) Crystal structure of L-gulonate bound to SBP TRAP (PDB ID: 4PBQ). (D) Knockout growth assays of H. influenzae strains, ΔGulP (gulonate transporter periplasmic subunit) and ΔGulD (L-gulonate dehydrogenase), when grown on D-glucose vs. L-gulonate as a sole carbon source. (E) Fold change in expression for each gene when grown on the indicated carbon source, relative to growth on glucose. Error bars indicate one standard deviation for three biological replicates.

https://doi.org/10.7554/eLife.31097.011

Prospective prediction of the L-gulonate catabolic pathway in Haemophilus influenzae Rd KW20

L-gulonate and D-mannonate were identified as potential ligands of the TRAP solute binding protein (SBP) from H. influenzae (Supplementary file 3), using DSF screening of a library of 189 compounds (Vetting et al., 2015). The HiGulPQM TRAP transporter consists of three subunits, including the periplasmic SBP HiGulP, and two membrane components HiGulQ and HiGulM. SBPs recognize substrates to be imported into the cytosol by the transporter. Because these sugars are not involved in central carbon metabolism, the observation suggested an uncharacterized pathway that converts L-gulonate or D-mannonate into substrates for central carbon metabolism. While this pathway had been proposed based on the DSF screening hits and genome neighborhood analysis, we sought to predict it using integrative pathway mapping, based on the following information (Figure 1, Figure 1—figure supplement 1).

First, the position of TRAP SBP was fixed at the pathway start and its ligand was constrained to be L-gulonate or D-mannonate; this positioning is reasonable given the TRAP’s role as a transporter. Second, five possible pathway enzymes (a dehydratase, two dehydrogenases, a kinase, and an aldolase) were identified from the genome neighborhood around the TRAP-solute-binding protein (Uniprot ID P71336) (Figure 1—figure supplement 2A–C, Supplementary file 2). Third, 3650 out of the 14,212 metabolites in KEGG (Kanehisa et al., 2016) were identified as the smallest unique ligand set of substrates or products for these enzymes, based on the top scoring docking hits, which were optimized by chemical transformations and chemical similarity (Supplementary file 2). Fourth, the pathway was constrained to end in a metabolite of central metabolism (Supplementary file 4). Fifth, the dehydratase (Uniprot ID P44488) was hypothesized to be a D-mannonate dehydratase because of high sequence similarity to a characterized D-mannonate dehydratase, UxuA, in other organisms (73% sequence identity to UxuA in E. coli) (Dreyer, 1987). Finally, as in the benchmarking, the scoring function used docking, SEA E-values, and chemical transformations inferred from annotations in Pfam (Finn et al., 2016).

The sampling algorithm found 154 unique high-scoring pathways, which clustered into 12 groups (Figure 3AB). The best-scoring pathway, starting from L-gulonate as the ligand of the TRAP SBP, begins with its oxidation to D-fructuronate by the first dehydrogenase (HiGulD) as the first catalytic step (Figure 4A). Next, reduction of D-fructuronate by the second dehydrogenase (HiUxuB) produces D-mannonate; its dehydration by the dehydratase (HiUxuA) produces 2-keto-3-deoxy-D-gluconate. The last few steps in the pathway model are part of a conserved Entner-Dourdoroff pathway, ending with glyceraldehyde 3-phosphate and pyruvate, known members of central metabolism. Even when individual restraints are excluded from the input information, the best-scoring pathway falls within the top-scoring pathway models. For example, the same pathway has the highest score if either the starting or end point is not restrained. If information about both starting and end points is excluded, this pathway model drops to 15th in score.

Experimental testing of the L-gulonate catabolic pathway

The pathway model was tested experimentally in five independent ways, including by enzyme activity, X-ray crystallography, fitness growth assays of the deletion mutants, transcript analyses, and isotopic metabolic labeling (Figure 4A–E).

First, all enzymes had kcat/KM values larger than 103 M−1s−1 for their predicted substrates (Figure 4B). Initially, HiGulD had negligible activity with its putative substrate L-gulonate. However, the pathway prediction was deemed to be of sufficient quality to encourage optimization of enzyme purification, ultimately producing an active enzyme with a kcat/KM value of 104 M−1s−1 for L-gulonate. All enzymes exhibit micromolar KM values, except for HiUxuA (potentially reflecting high D-mannonate concentrations in the cytosol). Second, the model is supported by the crystallographic structure of the complex between the TRAP SBP protein and L-gulonate (Vetting et al., 2015) (Figure 4C). Third, knockouts (KOs) of ∆HiGulP SBP and ∆HiGulD dehydrogenase were constructed. ∆HiGulP and ∆HiGulD KO strains retain the ability to grow on glucose (Figure 4D, left), while they do not grow on L-gulonate (Figure 4D, right). Fourth, all predicted pathway encoding genes, including HiGulPQM transporter, HiGulD, HiUxuA, HiUxuB, HiKdgK, and HiKdgA, are upregulated when H. influenzae is grown on L-gulonate or D-mannonate as the sole carbon source (Figure 4E). Fifth, when H. influenzae was incubated with U-13C-L-gulonate during the early exponential phase, even for one minute, substantial labeling of central carbon metabolites was observed, indicating rapid cellular uptake and metabolism of L-gulonate (Figure 4—figure supplement 1A). In addition, time-dependent labeling of D-fructuronate was observed (Figure 4—figure supplement 1B), further supporting the first predicted step in the L-gulonate catabolic pathway (Figure 4—figure supplement 1C). Finally, identification of this pathway in H. influenzae allowed us to reconstruct the L-gulonate and hexuronate pathways in related bacteria, mapping their conservation and variation to better understand the evolution and function of the pathway (Figure 4—figure supplement 2).

Discussion

A number of methods have been developed to predict metabolic enzymes and pathways (Jacobson et al., 2014; Planes and Beasley, 2008). The most common method assigns enzyme function from its sequence similarity to a characterized enzyme (Lee et al., 2007), sometimes allowing genome-scale metabolic reconstructions (Karp et al., 2016; Bordbar et al., 2014). However, similarity-based approaches often fail when the sequence identity drops below 60% (Radivojac et al., 2013) or when distant homologs are functionally divergent. Virtual screening used for integrative pathway mapping, even against comparative models, can predict substrates more accurately than homology-based transfer (Fan et al., 2009). In other cases, functional linking based on omics data, such as gene clusters, phylogenetic profiles, and gene expression profiles, can also guide functional prediction (Osterman and Overbeek, 2003; Overbeek et al., 1999). Methods that integrate sequence similarity and functional linking can improve predictions (Plata et al., 2012), as can approaches that incorporate modeling of metabolic flux with genomics-based metabolic reconstruction to identify missing enzymes (Karp et al., 2016; Bordbar et al., 2014; Monk et al., 2014). Several studies have combined structural information with metabolic reconstructions for genome-scale analysis (Zhang et al., 2009; Brunk et al., 2016; Chang et al., 2010). Still, most of the methods are limited by the biochemical knowledge available and the reactions that are mapped. Studies that deorphanize enzyme function (Irwin et al., 2005; Korczynska et al., 2014) or annotate new pathways (Zhao et al., 2013) will enhance the accuracy and applicability of these computational methods (Bordbar et al., 2014) as well as our integrative method. However, a key strength of our integrative approach is its ability to predict pathways that contain previously unknown biochemical reactions, and to assemble pathways de novo from simple and often newly predicted enzyme activities.

Integrative pathway mapping provides a flexible and general approach to functional annotation and pathway modeling. Because it generalizes functional annotation into the sampling of pathways consistent with any available input information, it can use more information than alternative methods and thus, at least in principle, produces more accurate, precise, and complete answers. For example, while there are numerous methods for predicting functions by combining information (Yamanishi et al., 2007; Ye et al., 2005; Plata et al., 2012; Green and Karp, 2007; Kharchenko et al., 2006; Smith et al., 2012; Zhu et al., 2012), the generality and flexibility of integrative pathway mapping allows us to combine structural information with other types of data in a most straightforward manner. If not all bacterial pathways have enough information from the genome context to infer the pathway enzymes, many do. Moreover, no single type of input information is essential, provided sufficient information is available from other sources. For example, potential pathway members in prokaryotes could also be obtained from regulon analysis based on predicting conserved binding sites for transcriptional regulators (Ravcheev et al., 2013; Rodionova et al., 2013). Other approaches for identifying candidate pathway members are especially needed for eukaryotes, because the relationship between genome neighborhood and pathway membership is significantly weaker in eukaryotes than in prokaryotes. For some pathways in eukaryotes, consideration of homology and biochemical function as well as direct experimental evidence, such as spatial co-localization by proteomics or chemical cross-linking, could be used to identify a set of potential protein members for pathway mapping. Thus, the integrative approach is at least in principle not limited to prokaryotic pathways. If docking struggles to prioritize the right substrates as top ranking hits, it often ranks the right ones well (Korczynska et al., 2014; Hall et al., 2010; Hermann et al., 2007); insisting that the product of one step feed into the next provides a surprisingly useful criterion not only for pathway membership and ordering, but also for re-prioritizing the correct substrate from the docking candidates. The integrative approach strengthens what would ordinarily be approximate answers by insisting on maximal possible consistency across the enzymes and across different types of information. Because of the generality of integrative pathway mapping, new sources of information can be incorporated, including knockout screens and known metabolic capabilities. While not all types of input for integrative pathway mapping can be obtained automatically (e.g. docking, experimental measurements), the mapping itself is entirely automated. With further development, the framework may be applicable on a larger scale, approaching complete genomes, but mapping topologies of networks will be more demanding as it will require more input information and larger computation.

Materials and methods

Key resources table
Reagent type
(species) or resource
DesignationSource or referenceIdentifiersAdditional information
Gene (Haemophilus
influenzae)
UxuAThis paper,
pNYCOMPSC-tagless
HiUxuA vector
Uniprot:P44488See Supplementary file 7. cloned into the
C-terminal TEV cleavable 10x-Histag
containing vector pNYCOMPS-LIC-TH10-
ccdB (C-term) such that the tag is out of frame
Gene (H.
influenzae)
GulDThis paper,
pNYCOMPSC-tagless
HiGulD vector
Uniprot:Q57517See Supplementary file 7. cloned into the
C-terminal TEV cleavable 10x-Histag
containing vector pNYCOMPS-LIC-TH10-
ccdB (C-term) such that the tag is out of frame
Gene (H.
influenzae)
KdgKThis paper,
HiKdgK-pSGC-His vector
Uniprot:P44482See Supplementary file 7. cloned into the
N-terminal TEV cleavable 6x-Histag
containing vector pNIC28-Bsa4
Gene (H.
influenzae)
UxuBThis paper,
HiUxuB-pSGC-His vector
Uniprot:P44481See Supplementary file 7. cloned into the
N-terminal TEV cleavable 6x-Histag
containing vector pNIC28-Bsa4
Gene (H.
influenzae)
KdgAThis paper,
HiKdgA-pSGC-His vector
Uniprot:P44480See Supplementary file 7. cloned into the
N-terminal TEV cleavable 6x-Histag
containing vector pNIC28-Bsa4
Gene (H.
influenzae)
GulPThis paperUniprot:P71336See Supplementary files 8 and 9
Gene (H.
influenzae)
GulQThis paperUniprot:P44484See Supplementary files 8 and 9
Gene (H.
influenzae)
GulMThis paperUniprot:P44483See Supplementary files 8 and 9
Gene (H.
influenzae)
UxuRThis paperUniprot:P44487See Supplementary files 8 and 9
Oligonucleotide
(H.
influenzae)
UxuA, UxuR, GulD,
GulP, GulQ, GulM,
KdgK, UxuB,
KdgA, Hflu
This paperSee Supplementary file 8. qRT-PCR
oligonucleotide sequences used for gene
expression profiling
Strain, strain
background
(H. influenzae
Rd KW20)
H. fluhttps://www.atcc.orgATCC 51907Supplementary file 9. Genetic deletion
mutants of the putative L-gulonate
catabolism pathway in H. influenzae Rd KW20
Genetic reagent
(H. influenzae)
∆GulPThis paperSupplementary file 8. Genetic deletion
mutants of the putative L-gulonate
catabolism pathway in H. influenzae Rd KW20
Genetic reagent
(H.
influenzae)
∆GulDThis paperSupplementary file 8. Genetic deletion
mutants of the putative L-gulonate
catabolism pathway in H. influenzae Rd KW20
Transfected
construct (E. coli
BL21 (DE3)
BL21 (DE3) E. coli
containing the pRIL
plasmid
StratageneGrowth media contain 25 μg/mL
Kanamycin or 100 μg/mL Carbomycin and
34 μg/mL Chloramphenicol
Commercial
assay or kit
RNAprotect Bacteria
Reagent
QiagenCat No./ID: 76506
Commercial
assay or kit
RNeasy Mini KitQiagenCat No./ID: 74104
Commercial
assay or kit
ProtoScript First
Strand cDNA
Synthesis Kit
New England BioLabsCat No./ID: E6300S
Chemical compound,
drug
2-keto-3-deoxy-D-
gluconate
 Enzymatically synthesizedCAS: 17510-99-5Enzymatic synthesis by D-mannonate
dehydratase (Uniprot ID B0T0B1). Verified
via 1H-NMR
Chemical compound,
drug
2-keto-3-deoxy-D-
gluconate-6-
phosphate
 Enzymatically synthesizedCAS: 884312-23-6Enzymatic synthesis by D-mannonate
dehydratase (Uniprot ID B0T0B1) and 1 μM
2-keto-3-deoxy-D-gluconate kinase
(Uniprot ID A4XF21). Verified via 1H-NMR
Software, algorithmIntegrative Pathway
Mapping
This paperhttps://github.com/salilab/pathway_mappingThe source code for the IMP program,
benchmark, input scripts files, and output
files for the benchmark and the gulonate
pathway calculations are available here(50)
Software, algorithmIMP programRussel D, et al, Putting the
pieces together: integrative
structure determination of
macromolecular
assemblies. PLoS
Biology. 10(1):e1001244, 2012
http://integrativemodeling.orgIntegrative modeling
Software, algorithmMODELLERB. Webb, A. Sali. Comparative
Protein Structure Modeling
Using Modeller. Current
Protocols in Bioinformatics,
John Wiley & Sons, Inc., 5.6.1-
5.6.32, 2014.
https://salilab.org/modeller/Comparative modeling
Software, algorithmDOCK3.6Mysinger MM, Shoichet BK.
Rapid context-dependent
ligand desolvation in molecular
docking. J Chem Inf Model.
50(9):1561-73, 2010.
http://dock.compbio.ucsf.edu/Docking
Software, algorithmAutomated version
DOCK3.6
Irwin JJ, et al. Automated
Docking Screens: A Feasibility
Study. J. Med. Chem.
52(18)5712–5720, 2009.
http://blaster.docking.org/Docking
Software, algorithmSimilarity Ensemble
Approach (SEA)
Keiser MJ, et al. Relating
protein pharmacology
by ligand chemistry.
Nat Biotechnol. 25(2):
197-206, 2007.
http://sea.bkslab.org/SEA chemo-informatic calculations
Software, algorithmOpenEye Scientific
Software
OpenEye Scientific Software I.
OEChem. 2.0.2 ed2014.
https://www.eyesopen.com/In silico chemical transformations
Software, algorithmRDKitLandrum G. RDKit:
Open-source
cheminformatics.
Release_2016.03.1 ed2016
http://www.rdkit.org/Chemical similarity calculations
Software, algorithmEFI-ESTGerlt JA, et al. Enzyme Function
Initiative-Enzyme Similarity
Tool (EFI-EST): A web tool for
generating protein sequence
similarity networks. Biochim.
Biophys. Acta. 1854(8):1019-
1037, 2015.
http://efi.igb.illinois.edu/efi-est/index.phpGenome neighborhood networks
Software, algorithmPythoscape v1.0Barber AE, Babbitt PC.
Pythoscape: a framework for
generation of large protein
similarity networks.
Bioinformatics. 28(21):2845-
2846, 2012.
http://www.rbvi.ucsf.edu/trac/PythoscapeSequence similarity networks
Software, algorithmCytoscape v3.4Shannon P, et al. Cytoscape: a
software environment for
integrated models of
biomolecular interaction
networks. Genome Res.
13(11):2498-504, 2003.
http://www.cytoscape.org/Network visualization

Computational methods

Integrative pathway mapping

The method computes all linear pathway models consistent with the input information, in a four-stage process (Figure 1). First, input information has to be collected from computational and/or experimental sources. Here, three established and convenient computational methods (i.e. molecular docking, Similarity Ensemble Approach, and chemical transformation analysis) were selected to illustrate the idea of integrative pathway mapping and to benchmark it on three known pathways. In principle, subsets of input information can be missing. Moreover, additional types of information can be added, hopefully improving the accuracy, precision, and applicability of the approach, as illustrated by the gulonate pathway prediction that also depends on DSF data, pathway anchor points, and protein homology considerations. Second, each data point is converted into a pathway restraint via a Z-score. The score of a pathway model is then a sum of these Z-scores. Third, the good scoring pathways are found by Monte Carlo sampling of pathways consisting of input enzymes and metabolites. Finally, the good scoring pathways are analyzed. Next, we describe the four stages of integrative pathway mapping in turn, using the L-gulonate catabolic pathway as an example (Figure 1).

Stage 1: Gathering information

Information for the pathway mapping cases comes from the following sources: high-throughput DSF screening, genome context, structure-based docking screens, chemical transformations based on Pfam classification (Finn et al., 2016), and knowledge of central metabolism. With this information in hand, we use it to design representation, scoring, and sampling, which determine the output models.

Stage 2: Designing pathway model representation and evaluation

For pathways of unknown length, we model pathways of each possible length independently, and then select an optimal combination of pathway length and score. The pathway model is represented as a linear graph, in which the molecular components are represented by nodes and the interactions are represented by edges. In the specific case of a metabolic pathway, the two classes of molecular components are the metabolites (substrates and products) and the proteins, which are binding proteins, transporters, or enzymes converting substrates to products. In addition, we allow for the inclusion of a dummy node representing an unknown and uncharacterized protein in the pathway. The sampling space of the models is constrained by the candidate enzyme and metabolite node identities that are given as input, as well as the linearity and length of the pathway.

A sequence similarity network (SSN) and genome neighborhood network (GNN) were constructed using the EFI-EST webserver (Gerlt et al., 2015) and Pythoscape v1.0 software (Barber and Babbitt, 2012) for an anchor protein, TRAP SBPs (Uniprot ID P71336 and Uniprot ID A7JQX0), to provide candidate pathway members (Zhao et al., 2014; Gerlt et al., 2015). The network stringency for computing iso-functional clusters was set to an E-value cutoff of 10−120, corresponding to a median sequence identity between proteins of ~60% (Vetting et al., 2015). At this stringency, the majority of experimentally annotated TRAP SBPs are assigned to isofunctional clusters in the SSN. The full GNN was clustered based on Pfam designation into individual neighborhood nodes in the genome neighborhood of cluster 223, which included the TRAP transporter anchor protein (Figure 1—figure supplement 2). Analysis of the GNN identified five enzyme families as candidate pathway members, including two dehydrogenases, one sugar dehydratase, one carbohydrate kinase, and one aldolase. The genes associated with these families in H. influenzae are co-localized in the genome with the TRAP SBP gene. This step can be substituted or supplemented by any other method that identifies candidate genes, including but not limited to: (1) colocalization of genes providing operon/metabolic context for prokaryotic proteins (Overbeek et al., 1999), (2) coexpression measured through chip-based and RNA-seq technologies (Wang et al., 2009), (3) co-regulation predicted by upstream DNA motifs (Pilpel et al., 2001; Rodionov, 2007), (4) protein-protein interaction studies (Bork et al., 2004; Meier et al., 2013), (5) protein fusion events (Enright et al., 1999; Marcotte et al., 1999), and (6) phylogenetic profiles across different genomes (Pellegrini et al., 1999).

To obtain the smallest candidate subset of KEGG that contains all metabolites needed to predict a pathway, we considered only the metabolites with good virtual screening scores against any of the candidate proteins as well as metabolites that can be derived from the virtual screening hits by applying chemical transformations related to the known activities of enzymes in the relevant superfamilies. Therefore, the top 1000-scoring metabolites from each docking screen are added to a single list of metabolites. Chemical transformations performed by each predicted enzyme are applied on the top-scoring metabolites using OEChem Tools (OpenEye Scientific Software I, 2014) excluding metabolites with no matches to the substrate motifs. Products of these reactions are compared by RDKit Morgan fingerprints (Landrum, 2016; Rogers and Hahn, 2010) to the metabolites from the KEGG LIGAND database (Kanehisa et al., 2016; Kanehisa and Goto, 2000). KEGG metabolites that have a Tanimoto coefficient above 0.75 to the products are added to the list of metabolites. This final list of metabolites contains 3650 unique ligands that are considered as the sampling space for candidate metabolite nodes.

Scoring pathway models

Information about the pathway is encoded as pathway restraints that are summed into a scoring function. For example, a candidate edge between a given enzyme and metabolite is restrained by a virtual screening score for the pair. In an attempt to ‘weigh’ each piece of information optimally, each term in the scoring function is expressed as a Z-score. Our scoring function can in principle benefit from all available information, even when some information is not available for every enzyme or ligand. In such cases, the corresponding terms are simply omitted from the scoring function. A dummy node in a model contributes no score, except towards the chemical transformation term. Thus, the scoring function (ZPathway) for ranking alternative L-gulonate pathways is a sum of Z-scores for each type of restraint, including virtual screening (ZVS), chemical transformations (ZCT), SEA analysis (ZSEA), known pathway boundaries (ZCM), high-throughput screening (ZHTS), and homology to characterized enzymes (ZHS):

ZPathway= ZVS+ ZCT+ZSEA +ZCM+ ZHTS+ZHS

Next, we define these specific pathway restraints.

Molecular docking screens

Favorable binding interactions predicted by docking can illuminate the identity of a ligand-protein pair. Pathway models with ligand-protein pairs that have favorable docking scores are more likely to be correct than those that have unfavorable docking scores. For each candidate pathway protein, a crystal structure or homology model, generated by MODELLER (Sali and Blundell, 1993; Eswar et al., 2006), was prepared with an automated pipeline for docking (Irwin et al., 2009) (Supplementary files 5 and 6). The proposed active site for each enzyme was identified by superimposing liganded structures of closely related family members or related domains; for the enzymes considered here, identical results would have been obtained by identifying the largest cavity on the structure, for example, by using program PocketPicker (Coleman and Sharp, 2010). Co-factors (as generated by PRODRG server [Schüttelkopf and van Aalten, 2004]), metal ions, and water molecules were included in the protein structure preparation where required for enzyme function (Irwin et al., 2005; Korczynska et al., 2014; London et al., 2014). The KEGG database (Kanehisa et al., 2016; Kanehisa and Goto, 2000) of 14,212 unique metabolites from the ZINC database (Irwin et al., 2012) was docked against each target with DOCK3.6 in an automated fashion (Mysinger and Shoichet, 2010) (http://dock.compbio.ucsf.edu/). Compounds were ranked by a physics-based scoring function that evaluates ligand-protein complementarity considering van der Waals and electrostatic interactions, corrected for ligand desolvation (Mysinger and Shoichet, 2010; Meng et al., 1992; Wei et al., 2002). For each protein, the docking scores were converted to Z-scores by subtracting the mean and dividing by the standard deviation of the docking scores. The docking Z-score for the entire pathway model is the normalized sum of the Z-scores for all substrate-enzyme and product-enzyme pairs:

ZVS = 1NiNZi,

where N is the total number of enzyme-substrate and enzyme-product pairs in the pathway model. Similar normalizations of docking scores have been described for other applications (Casey et al., 2009).

Chemical transformations

Chemical transformations derived from protein family annotations can help identify the substrate and product of an enzyme. These generic chemical transformations describe the enzymatic reaction without precise knowledge of the substrate, encoding the differences between the reactant and product on a more general level. While the full Enzyme Commission (EC) number describes the substrate specificity of an enzyme, the generic chemical transformation typically corresponds to the third level EC classification (Hatzimanikatis et al., 2005). For example, a serine acetyltransferase is a transferase that catalyzes the reaction of converting an alcohol into an ester. Pathway models with substrate-product pairs that match the chemical transformations are more likely to be correct than those pairs that do not (however, the final predicted reactions reflect the totality of all restraints, not only chemical transformation restraints).

The chemical transformation is determined from the Pfam classification (Finn et al., 2016) from the generic chemical reaction or reactions that are conserved across members of the protein family. Multiple chemical transformations may be considered for a single family. Relying on the library generation tool in OEChem Tools, the in silico transformation using SMIRKS strings was performed on each metabolite, represented as a SMILES string (Supplementary files 5 and 6). SMIRKS strings encode a generic reaction composed of a structural motif or pattern in the substrate and the corresponding pattern in the resulting product. Using RDKit, the Tanimoto coefficient between the transformed molecule and every other metabolite is computed based on the Morgan fingerprints, which are graph-based circular fingerprints useful for structural comparisons, with chirality taken into account. Because no transformation is defined for a dummy node, the Tanimoto coefficient between the substrate of the dummy node and every other metabolite is computed. For molecules with undefined stereocenters, the highest Tanimoto coefficient for up to 16 distinct stereoisomers was used.

Tanimoto coefficients were converted into Z-scores, similarly to the docking Z-scores. The score for transformations is the average Z-score over the Z-scores of all substrate-product-enzyme node triads in a model:

ZCT = 1NiNZi,

where N is the total number of substrate-product-enzyme triads in the pathway model.

Similarity ensemble approach

Comparison of ensembles of ligands using the Similarity Ensemble Approach (SEA) version 1.0 can predict functionally-linked proteins from the similarity of their ligands (Keiser et al., 2007), irrespective of their sequence or structural similarities (Lin et al., 2013). It is more likely that enzymes with high ligand similarity than enzymes with low ligand similarity are adjacent to each other in a pathway, as exemplified by a DUDE analysis (Irwin et al., 2012) (Figure 2—figure supplement 1C). Ensembles of predicted ligands can be obtained from virtual screening and used to restrain the identities of pairs of enzymes in a pathway (Fan et al., 2013). The top 500 docking-ranked metabolites for each enzyme were considered as the ligand ensemble. SEA E-values were calculated based on the similarity between these top 500 metabolites for each pair of enzymes in a putative pathway. The SEA E-value reflects the significance of the similarity between ligand ensembles for a pair of enzymes, compared to an expectation for two similarly sized sets of randomly selected metabolites from KEGG.

Specifically, the SEA E-value (evalue) for two consecutive pathway proteins A and B is first converted into the SAB score:

SAB= wAB*FAB , wAB=min(-logevalue, 50)

where FAB is 150min(wAA,wBB), modeling our confidence in the SEA analysis. Next, the SAB score is normalized into a Z-score Si by subtracting the mean and dividing by the standard deviation obtained from the distribution of SAB scores for all pairs of input enzymes, whether or not they are linked in the pathway. Finally, the SEA component (ZSEA) of the integrative pathway score ZPathway is the Si score averaged over all consecutive protein pairs in the tested pathway:

ZSEA= 1NiNSi

Pathway nodes and boundaries

Any known nodes of the pathway can be easily specified as constraints on the search for pathways that satisfy all input information. In the particular case of the L-gulonate catabolic pathway, a solute-binding protein (SBP) subunit of a TRAP transporter was identified, based on its strong sequence similarity to the TRAP SBP family. Thus, this transporter defines the start of the metabolic pathway to be modeled, with the rest of the pathway corresponding to intracellular enzymes acting on the transporter’s substrate in series.

Similarly, knowledge about the endpoints of metabolic pathways can also constrain integrative pathway mapping. For catabolic pathways of sugars, we assume that the pathway produces a compound that feeds into central carbohydrate metabolism. Therefore, pathways in which the final product is a compound in central metabolism are more likely to be correct (Supplementary file 4).

The final metabolite in the model is compared by the Tanimoto coefficient using RDKit Morgan Fingerprints to all metabolites in central metabolism, and the maximum Tanimoto coefficient is used. The central metabolism endpoint score is:

ZCM=TC TC SD,

where TC is the Tanimoto coefficient between the final product in the pathway model and the most similar compound in central metabolism. TC- is the average and SD is the standard deviation of Tanimoto coefficients for all comparisons between each candidate metabolite and all compounds in central metabolism.

Differential scanning fluorimetry screening hits

In our predicted pathway, L-gulonate and D-mannonate were identified as hits for the TRAP solute binding protein (ie, the first protein in the pathway), using screening of 189 compounds by DSF (Vetting et al., 2015). We assume that true substrates have high chemical similarity to the screening hits. Thus, the hits are compared by the Tanimoto coefficient using RDKit Morgan Fingerprints to the substrate of the screened enzyme in the pathway model.

ZHTS=TC TC SD,

where TC is the Tanimoto coefficient between the substrate in the pathway model and the hit in the screening assay. For multiple screening hits, the maximum of the Tanimoto coefficients between the substrate and hits is used. TC- is the average and SD is the standard deviation of Tanimoto coefficients between the substrate and all metabolites.

Homology to a characterized enzyme

Substrates of enzymes in the pathway model are expected to be the same or similar to substrates of homologs that share high sequence similarity. For the L-gulonate pathway, the dehydratase has 73% sequence identity to a characterized mannonate dehydratase in E. coli. Therefore, pathways in which the substrate of the dehydratase is similar to D-mannonate are more likely to be correct than others. The proposed dehydratase substrate in an evaluated pathway model was compared to D-mannonate by the Tanimoto coefficient, using RDKit Morgan Fingerprints:

ZHS=TC TC SD,

where TC is the Tanimoto coefficient between the proposed substrate and the known substrate. TC- is the average and SD is the standard deviation of Tanimoto coefficients for all comparisons between each candidate metabolite and the known substrate.

Stage 3: Sampling good models

With the scoring function in hand, the next step is to find pathway models that score well. These models are obtained by sampling candidate metabolites and proteins at each position in the linear pathway of a given length. We use Monte Carlo (MC) sampling by the Metropolis-Hastings algorithm with simulated annealing (Hastings, 1970; Kirkpatrick et al., 1983). The set of MC moves includes (i) swapping components of the same type within the graph and (ii) replacing a component in the graph with an unused candidate component of the same type. At each MC step, if the pathway score improves, the new model is accepted. Otherwise, the new model is accepted if a randomly sampled number from the uniform distribution between 0 and 1 is less than the acceptance probability computed by the standard Metropolis criterion:

p=exp-DT

where D is the difference between the old and new pathway scores and T is the simulated annealing ‘temperature’ parameter. The temperature drops over the course of the MC run:

T=0.3*0.2N+0.1

where N is the MC step number normalized by the total number of MC steps. With these parameters, a sampling run generally converges (Figure 3—figure supplement 1E), terminating after 5,000,000 MC steps. 1000 independent runs are performed, and the models sampled from all runs are combined. The unique sampled models with a score above a cutoff (good-scoring models) are considered in the analysis; the cutoff is two standard deviations below the best score. For the glycolysis pathway, which is about twice as long as the other pathways, a more stringent cutoff of 1.5 standard deviations was used such that the number of good-scoring pathways is comparable to that for the other benchmark pathways. The standard deviation is calculated from a distribution of scores of random models. Convergence of sampling was tested by determining the fraction of unique clusters as a function of the number of independent runs (Figure 3—figure supplement 1).

Stage 4: Analyzing models and information

The resulting ensemble of good-scoring models is analyzed in terms of its precision and satisfaction of the restraints (Figure 2). Using hierarchical clustering in the scikit-learn python package (Pedregosa et al., 2011), the pathway models are clustered by a pairwise distance metric (here, the Hamming distance). The Hamming distance is the number of positions at which the corresponding nodes are differently divided by the number of nodes in the pathway (including both proteins and ligands). Clusters are determined at the cutoff distance of 0.2.

A Cytoscape (Shannon et al., 2003) app, NetIMP, was written to perform visualization of the models (Figure 1—figure supplement 3). The app takes as input a JSON formatted file of with the IMP results, including the overall scores and the scores for the individual restraints. The app constructs a ‘union network’ that is the union of all models and presents it to the user. A slider is available to adjust the minimum score for inclusion in the union network. Each of the corresponding models are shown in a results panel to the right of the network. The user can highlight the model in the relevant network by selecting the row. Checkboxes allow the user to view the individual restraints, including whether or not the model satisfies the restraint. NetIMP is available in the Cytoscape app store.

Computational cost

The total computation time depends on the numbers of enzymes and ligands considered. The computationally most demanding part of integrative pathway mapping is the preparation of the input information, not the sampling of the alternative pathways. For the three benchmark and L-gulonate pathways, the computing times for various steps in the process are as follows. Parallelized virtual screening of a library of ~20,000 compounds by docking each ligand against each of ~10 enzymes takes just over an hour on a cluster of 1400 nodes. The SEA analysis takes ~15 min to screen ~20,000 compounds against ~10 target proteins on 10 computing nodes. Finally, chemical transformation calculations take ~1 hr for 10 enzymes on a single node. Once the inputs are in hand, the runtime for a single Monte Carlo sampling run is approximately 1 hr on a single computing node; a few hundred independent Monte Carlo optimizations are typically performed in parallel on a cluster of compute nodes. In conclusion, the entire process, from preparation of inputs to the sampling of the pathways, can be performed in a few hours on a cluster of a few hundred nodes.

Benchmarking

We assess our method on three known metabolic pathways, including the glycolysis pathway (10 enzymes, 2965 potential ligands), CMP KDO-8P biosynthesis pathway (four enzymes, 3336 potential ligands), and serine biosynthesis (five enzymes, 3494 potential ligands) (Supplementary file 1). Pathway docking was performed against crystal structures and comparative models (Supplementary file 5). The score of the pathway model is a sum of the individual Z-score terms for the docking screen, SEA calculation, and chemical transformations. The ability to recover the true ligand-enzyme pair is evaluated by the relative frequency of the ligand-enzyme edge occurring in the good-scoring pathway models. Therefore, we compare the rank of the substrate or product for a given enzyme based on the initial docking score to the rank based on the frequency of the corresponding ligand-enzyme edge.

Benchmark assessment for decoy and dummy enzymes

The enzyme composition of a pathway is not always known with certainty, so we considered other scenarios beyond the simple case (Figure 3AB). We tested whether or not our method could detect the correct pathway when non-pathway enzymes, or decoys, are included in the initial candidate set of enzymes. An additional term that considers membership in the same gene cluster is included in the scoring function. Conserved gene clusters identified from comparative genome neighborhood analysis (Figure 1—figure supplement 2) can be informative about the functional relationships of genes acting in the same pathway (Overbeek et al., 1999). Pathway models with all members of a gene cluster are more likely to be correct than those that are missing members or containing non-members.

The set of protein pairs associated with the gene cluster identified by genome neighborhood analysis is compared to sets of protein pairs from all possible subsets of the protein candidates. The intersection between the sets of pairs is normalized by the larger of the number of pairs in the sets. The possible subsets range in number from as few as three to the number of total candidate proteins. The gene cluster score is:

ZGC=GC GCSD,

where GC is the normalized intersection between the set of protein pairs in the gene cluster and the set of protein pairs in the pathway model. GC- is the average and SD is the standard deviation of the normalized intersection for all comparisons between the proteins associated with the gene cluster and possible subsets of candidate proteins.

For the case in which the candidate set of enzymes is incomplete, a dummy enzyme that represents an unknown pathway enzyme was used (Figure 2—figure supplement 1D). For each of the pathways, one enzyme was replaced with a dummy enzyme in the initial set of candidate enzymes. For the serine biosynthesis case, the correct pathway with the dummy enzyme included was ranked as the top-scoring model (Figure 2—figure supplement 1D). For the other test cases, the inclusion of the dummy lowered the overall ranking of the correct pathway model, but the correct pathway was still within the top-scoring models.

Experimental methods

Cloning, expression, and purification of HiUxuB, HiKdgK, and HiKdgA

The genes HiUxuB (Uniprot ID P44481), HiKdgK (Uniprot ID P44482), and HiKdgA (Uniprot ID P44480) were amplified from H. influenzae strain Rd KW20 (ATCC 51907) genomic DNA. PCR was performed using KOD Extreme DNA Polymerase (Novagen) according to the manufacturer’s guidelines. The conditions were: 2 min at 95°C, followed by 40 cycles of 20 s at 95°C, 20 s at 66°C, and 20 s at 72°C. Primers are listed in Supplementary file 7. The amplified fragments were cloned into the N-terminal TEV cleavable 6x-Histag containing vector pNIC28-Bsa4 (pSGC-His), by ligation-independent cloning (Aslanidis and de Jong, 1990; Savitsky et al., 2010).

The HiUxuB-pSGC-His, HiKdgK-pSGC-His, HiKdgA-pSGC-His vectors were transformed into BL21 (DE3) E. coli containing the pRIL plasmid (Stratagene) and used to inoculate a 20 mL 2 x YT culture containing 25 μg/mL Kanamycin or 100 μg/mL Carbomycin and 34 μg/mL Chloramphenicol. The cultures were grown overnight at 37°C in a shaking incubator. The overnight culture was used to inoculate 2 L of PASM-5052 auto-induction media containing 150 mM 2–2-bipyridyl, 1 mM ZnCl2, and 1 mM MnCl2 (Studier, 2005) that was incubated at 37°C in a LEX48 airlift fermenter for 4 hr and then at 22°C overnight. The culture was harvested and pelleted by centrifugation.

Cells were suspended in lysis buffer (20 mM HEPES, pH 7.5, 500 mM NaCl, 20 mM imidazole, and 10% Glycerol) and lysed by sonication. The lysate was clarified by centrifugation at 35,000 x g for 30 min, loaded onto a 5-mL Strep-Tactin column (IBA) on an AKTAxpress FPLC (GE Healthcare), and washed with five column volumes of lysis buffer, and eluted in StrepB buffer (20 mM HEPES, pH 7.5, 500 mM NaCl, 20 mM Imidazole, 10% glycerol, and 2.5 mM desthiobiotin). The eluent was loaded onto a 1-mL HisTrap FF column (GE Healthcare), washed with 10 column volumes of lysis buffer, and eluted in buffer containing 20 mM HEPES pH 7.5, 500 mM NaCl, 500 mM Imidazole, and 10% glycerol. The purified sample was loaded onto a HiLoad S200 16/60 PR gel filtration column, which was equilibrated with SECB buffer (20 mM HEPES, pH 7.5, 150 mM NaCl, 10% glycerol, and 5 mM DTT). Peak fractions were collected, protein was analyzed by SDS-PAGE, concentrated to 2.4 g/L, 1.9 g/L, and 2.0 g/L, respectively, flash frozen in liquid nitrogen, and stored at −80°C.

Cloning, expression, and purification of HiGulD and HiUxuA

The genes HiGulD (Uniprot ID Q57517) and HiUxuA (Uniprot ID P44488) were amplified from H. influenzae strain Rd KW20 (ATCC 51907) genomic DNA. PCR was performed using KOD Extreme DNA Polymerase (Novagen) according to the manufacturer’s guidelines. The conditions were: 2 min at 95°C, followed by 40 cycles of 20 s at 95°C, 20 s at 66°C, and 20 s at 72°C. Primers are listed in Supplementary file 7. The amplified fragment was cloned into the C-terminal TEV cleavable 10x-Histag containing vector pNYCOMPS-LIC-TH10-ccdB (C-term) such that the tag is out of frame (pNYCOMPSC-tagless), by ligation-independent cloning (Aslanidis and de Jong, 1990; Savitsky et al., 2010).

The pNYCOMPSC-tagless HiGulD and HiUxuA constructs were transformed into E. coli BL21 (DE3) for expression. Both HiGulD and HiUxuA were purified from 1 L of culture using DEAE Sepharose, Q‐Sepharose, and phenyl-Sepharose columns (all Amersham Biosciences) as previously described (Wichelecki et al., 2014). Proteins were concentrated to 15 g/L and 6 g/L, respectively, flash frozen in liquid nitrogen, and stored at −80°C.

Preparation of 2-keto-3-deoxy-D-gluconate

2-Keto-3-deoxy-D-gluconate was synthesized via an enzymatic procedure. The reaction (1.5 mL) contained 50 mM potassium HEPES, pH 7.9, 10 mM MgCl2, 100 mM D-mannonate, and 1 μM D-mannonate dehydratase (Uniprot ID B0T0B1). The reaction was left to proceed at 37°C for 48 hr. Afterward, the enzyme was removed by filtration using 30,000 NMWL ultrafiltration membranes (Millipore). The identity of the product was verified via 1H-NMR.

Preparation of 2-keto-3-deoxy-D-gluconate-6-phosphate

2-Keto-3-deoxy-D-gluconate-6P was synthesized via an enzymatic procedure. The reaction (1.5 mL) contained 100 mM potassium HEPES, pH 7.9, 10 mM MgCl2, 120 mM ATP, 100 mM D-mannonate, 1 μM D-mannonate dehydratase (Uniprot ID B0T0B1), and 1 μM 2-keto-3-deoxy-D-gluconate kinase (Uniprot ID A4XF21). The reaction was left to proceed at 37°C for 48 hr. Afterward, the enzyme was removed by filtration using 10,000 NMWL ultrafiltration membranes (Millipore). The identity of the product was verified via 1H-NMR.

Kinetic assays of L-gulonate catabolic pathway proteins

The kinetic assays were run in 200 μL aliquots at 37°C and monitored using a continuous spectrophotometric assay. The identities of all products were verified via 1H-NMR.

Oxidation was quantitated by measuring the increase in absorbance at 340 nm (ε = 6220 M−1 cm‐1) of L‐gulonate at carbon-5 by HiGulD (50 mM Tris, pH 9, 1.5 mM NAD+, and 200 nM HiGulD). The substrate concentration was varied from 100 μM to 10 mM.

Oxidation was quantitated by measuring the increase in absorbance at 340 nm (ε = 6220 M−1 cm‐1) of D-mannonate at carbon-5 by HiUxuB (50 mM Tris, pH 9, 10 mM MgCl2, 1.5 mM NAD+, and 2 nM HiUxuB). The substrate concentration was varied from 50 μM to 5 mM.

Dehydration was quantitated by measuring the decrease in absorbance at 340 nm (ε = 6220 M−1 cm‐1) of D-mannonate by HiUxuA (50 mM potassium HEPES, pH 7.9, 10 mM MgCl2, 1.5 mM ATP, 1.5 mM PEP, 0.16 mM NADH, 9 units of pyruvate kinase, 9 units of lactate dehydrogenase, 18 units of 2-keto-3-deoxy-D-gluconate kinase, and 200 nM HiUxuA). The substrate concentration was varied from 100 μM to 30 mM.

Phosphorylation was quantitated by measuring the decrease in absorbance at 340 nm (ε = 6220 M−1 cm‐1) of 2-keto-3-deoxy-D-gluconate by HiKdgK in a coupled assay with lactate dehydrogenase (50 mM potassium HEPES, pH 7.9, 10 mM MgCl2, 1.5 mM ATP, 1.5 mM PEP, 0.16 mM NADH, 9 units of pyruvate kinase, 9 units of lactate dehydrogenase, and 200 nM HiKdgK). The substrate concentration was varied from 100 μM to 5 mM.

Cleavage was quantitated by measuring the decrease in absorbance at 340 nm (ε = 6220 M−1 cm‐1) of 2-keto-3-deoxy-D-gluconate-6P by HiKdgA in a coupled assay with lactate dehydrogenase (50 mM potassium HEPES, pH 7.9, 10 mM MgCl2, 1.5 mM PEP, 0.16 mM NADH, 9 units of lactate dehydrogenase, and 200 nM HiKdgA). The substrate concentration was varied from 100 μM to 5 mM.

Bacterial strains and growth conditions

H. influenzae Rd KW20 (ATCC 51907) was grown aerobically at 37°C with shaking at 225 rpm, and was routinely cultured in Brain Heart Infusion (BHI, Difco) broth or on BHI solid medium, supplemented with nicotinamide adenine dinucleotide (NAD) and hemin at 10 µg mL−1 (sBHI). For gene expression analyses and carbon utilization studies, the defined medium of Coleman et al. (2003) was used. Glucose-free RPMI-1640 (Sigma R1383) was supplemented with the following additives: HEPES, 6 mg mL−1; NaHCO3, 2 mg mL−1; inosine, 1.75 mg mL−1; uracil, 87 µg mL−1; NAD, 10 µg mL−1; hemin, 10 µg mL−1. D-glucose, L-gulonate, or D-mannonate (10 mM) served as the source of carbon. Kanamycin was added at 10 µg mL−1 when appropriate.

Growth curves

Growth curves were recorded using the Bioscreen C instrument (Growth Curves, USA) and 100-well plates. Starter cultures of H. influenzae Rd KW20 were grown overnight in sBHI, washed in minimal medium lacking carbon source, and re-suspended in an equivalent volume of minimal medium lacking carbon source. Each well contained 300 µL minimal medium with D-glucose, L-gulonate, or D-mannonate (10 mM), and was inoculated to 1% with washed starter culture. Plates were incubated at 37°C with continuous shaking at medium amplitude and the optical density at 600 nanometers (OD600) was recorded every 30 min for 48 hr.

Transcriptional analysis

Starter cultures of H. influenzae Rd KW20 were grown overnight in sBHI, washed in minimal medium lacking carbon source, and re-suspended in an equivalent volume of minimal medium lacking carbon source. This culture was used to inoculate 5 mL minimal medium (1% inoculum) with 10 mM glucose, and cultures were grown until OD6000.3–0.5. Cells were washed and re-suspended in 4 mL minimal medium lacking carbon source. Cultures were divided into two equal 2 mL volumes, 10 mM glucose was added to one volume and 10 mM L-gulonate or D-mannonate was added to the other, and the cultures were grown until OD6000.8–1.0. At the time of cell harvest, one volume of RNAprotect Bacteria Reagent (Qiagen) was immediately added to two volumes of each actively growing culture. Samples were mixed by vortexing for 10 s and incubated for 5 min at room temperature. Cells were pelleted, flash frozen in liquid nitrogen, and stored at −80°C.

RNA isolation was performed in an RNAse-free environment at room temperature using the RNeasy Mini Kit (Qiagen) according to the manufacturer’s instructions. Cells were disrupted according to the ‘Enzymatic Lysis Protocol’ in the RNAprotect Bacteria Reagent Handbook (Qiagen); lysozyme (Thermo-Pierce) was used at 15 mg/mL. RNA concentrations were determined by absorption at 260 nanometers (A260) using the Nanodrop 2000 (Thermo) and absorption ratios A260/A280 and A260/A230 were used to assess sample integrity and purity. Isolated RNA was stored at −80°C until further use.

cDNA synthesis was performed using 300 ng of total isolated RNA with the ProtoScript First Strand cDNA Synthesis Kit (NEB), according to the manufacturer’s instructions. Primers for quantitative real-time (qRT) PCR were designed using the Primer3 primer tool; amplicons were 150–200 bps in length. Primers were 18 to 24 nucleotides in length and had a theoretical Tm of 55–60°C. Primer efficiency was determined to be at least 90% for each primer pair. Primer sequences are provided in Supplementary file 8. qRT-PCRs were carried out in 96-well plates using the LightCycler 480 II instrument (Roche) with the LightCycler 480 SYBR Green I Master Mix (Roche), according to the manufacturer’s instructions. Minus-RT controls were performed to verify the absence of genomic DNA in each RNA sample, for each gene target analyzed. Relative changes in gene expression were analyzed by the 2-∆∆CT method (Livak and Schmittgen, 2001), using the 16S rRNA gene as a reference. Each qRT-PCR was performed in triplicate, and fold-changes are the averages of at least three biological replicates.

Gene disruption

To create a genetic deletion of the putative L-gulonate SBP (HI0052), triple overlap extension PCR was used. Briefly, using Pfu Ultra High-Fidelity DNA polymerase (Agilent), three PCR fragments were generated: (a) a fragment corresponding to the genomic region ~1000 bps upstream of HI0052 was amplified from H. influenzae Rd KW20 genomic DNA with primers Del_HI0052_arm1fwd and Del_HI0052_arm1rev, (b) the kanamycin resistance cassette from p34s-Km (Dennis and Zylstra, 1998) was amplified with primers Kan_OL_delHI0052_fwd and Kan_OL_delHI0052_rev, and (c) a fragment corresponding to the genomic region ~1000 bps downstream of HI0052 was amplified from H. influenzae Rd KW20 genomic DNA with primers Del_HI0052_arm2fwd and Del_HI0052_arm2rev. The 3’ end of fragment ‘a’ and the 5’ end of the kanamycin resistance cassette (fragment ‘b’) were engineered with 50 bps of identical overlapping sequence, as were the 3’ of the kanamycin resistance cassette and the 5’ end of fragment ‘c’. One hundred ng of each of these PCR fragments were combined in a triple overlap extension PCR with primers Del_HI0052_arm1fwd and Del_HI0052_arm2rev to generate a ~3 kb fragment with arms homologous to the genomic regions flanking HI0052, with an intervening kanamycin resistance cassette.

The same approach was used to create a genetic deletion of the putative L-gulonate dehydrogenase (HI0053). To generate the triple overlap extension product for deletion of HI0053, primers Del_HI0053_arm1fwd and Del_HI0053_arm1rev and primers Del_HI0053_arm2fwd and Del_HI0053_arm2rev were used to amplify the regions ~1000 bps upstream and downstream of HI0053, respectively, from H. influenzae Rd KW20 genomic DNA. The kanamycin resistance cassette from p34s-Km was amplified with primers Kan_OL_delHI0053_fwd and Kan_OL_delHI0053_rev. One hundred ng of each of these PCR fragments were combined in a triple overlap extension PCR with primers Del_HI0053_arm1fwd and Del_HI0053_arm2rev to generate a ~3 kb fragment with arms homologous to the genomic regions flanking HI0053, with an intervening kanamycin resistance cassette. Primer sequences are provided in Supplementary file 9.

Each of the triple overlap PCR products was gel-purified and 100 ng was transformed into 1 mL of H. influenzae Rd KW20 cells made competent by the M-IV method (Poje and Redfield, 2003). Double crossover recombinants were selected by resistance to kanamycin and confirmed by genomic PCRs.

Cell preparation and metabolite extraction

GC-MS-based metabolic analysis of whole cell extracts was carried out with samples of H. influenzae Rd KW20 grown with 13C labeled L-gulonate or D-glucose, following the procedure of Zhao et al. (2013). Cells grown in rich medium were diluted 1:100 into defined medium with 10 mM unlabeled L-gulonate or D-glucose as added carbon source and grown to an OD600 of 0.6 (approximately 18 hr). Cells were harvested by centrifugation (4000 × g, 10 min, 4°C), washed twice in defined medium without added carbon source, and re-suspended in this medium. Cell density was adjusted to OD600 = 6.0, and the cell suspension was then depleted of catabolic metabolites by incubation at 37°C for 30 min before transferring to ice. A mixture of 5 mM 13C labeled L-gulonate plus 5 mM unlabeled L-gulonate, or a mixture of 5 mM 13C labeled D-glucose plus 5 mM unlabeled D-glucose, was added to the samples followed by incubation at 37°C. At time points of 1, 2, 10, and 60 min, samples were pelleted by centrifugation (16,000 × g for 1 min), supernatants were removed, and cell pellets were flash frozen in liquid nitrogen. Samples were stored at −80°C prior to extraction. Metabolites were extracted directly from cell pellets by re-suspension in 0.5 mL extraction buffer (40:40:20 mixture of methanol:acetonitrile:water spiked with 1 mM L-norvaline for an internal standard) followed by 10 min of vortexing at room temperature. Cell extracts were cleared of debris via two rounds of centrifugation at 16,000 × g for 1 min, split into two equal portions, dried, and stored at −80°C prior to analysis.

Cell extracts were derivatized by one of two methods. To determine labeling of small metabolites of central carbon metabolism (glycolysis, TCA cycle and amino acids), extracts were derivatized with isobutylhydroxylamine and N-tert-butyldimethylsilyl-N-​methyltrifluoroacetamide (TBDMS), and analyzed by GC-MS as described before (Ratnikov et al., 2015). Data were calculated in terms of fractional 13C labeling (the average 13C labeling across all metabolite carbons). Alternatively, to determine the labeling of gulonate and other 6-carbon molecules in the proposed pathway for L-gulonate metabolism, extracts were derivatized with 30 µL methoxyamine hydrochloride (Sigma, 20 mg/ml in pyridine) for 20 min at 80°C, followed by 30 µL BSTFA (trimethylsilylating reagent, Thermo) for 60 min at 80°C. Derivatized metabolites were analyzed by GC-MS as described before (Scott et al., 2011), using a modified temperature gradient: initial temperature was 60°C, held for 4 min, rising at 20 °C/min to 280°C, held for 4 min. Metabolites were identified by matching elution times and mass fragment patterns to standards. Labeling data for D-fructuronate were calculated as the ratio of mass 268: mass 264. Mass 264 corresponds to the fragment of D-fructuronate containing the four carbon atoms C3-C6 of the carbon backbone plus the complete derivatized side-chains (methoxaminated keto group and three trimethylsilylated hydroxyl groups) of this C3-C6 fragment, formula C14H34NO4Si3. Mass 268 corresponds to the same fragment with the backbone carbons 13C-labeled.

Code availability

The source code for the IMP program, benchmark, input scripts files, and output files for the benchmark and the gulonate pathway calculations are available at http://integrativemodeling.org and https://github.com/salilab/pathway_mapping (Calhoun, 2017; copy available at https://github.com/elifesciences-publications/pathway_mapping). Open access metabolite docking and libraries at http://metabolite.docking.org/ and at http://blaster.docking.org/. NetIMP is available in the Cytoscape app store.

References

  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
    Plasposons: modular self-cloning minitransposon derivatives for rapid genetic analysis of gram-negative bacterial genomes
    1. JJ Dennis
    2. GJ Zylstra
    (1998)
    Applied and Environmental Microbiology 64:2710–2715.
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
  24. 24
  25. 25
  26. 26
  27. 27
  28. 28
  29. 29
  30. 30
  31. 31
  32. 32
  33. 33
  34. 34
  35. 35
  36. 36
  37. 37
  38. 38
  39. 39
  40. 40
  41. 41
  42. 42
  43. 43
    RDKit: Open-source cheminformatics
    1. G Landrum
    (2016)
    RDKit: Open-source cheminformatics, Release_2016.03.1.
  44. 44
  45. 45
  46. 46
  47. 47
  48. 48
  49. 49
  50. 50
  51. 51
  52. 52
  53. 53
  54. 54
  55. 55
  56. 56
  57. 57
    OEChem, version 2.0.2
    1. OpenEye Scientific Software I
    (2014)
    OEChem.
  58. 58
  59. 59
  60. 60
  61. 61
  62. 62
  63. 63
    Scikit-learn: Machine Learning in Python
    1. F Pedregosa
    2. G Varoquaux
    3. A Gramfort
    4. V Michel
    5. B Thirion
    6. O Grisel
    7. M Blondel
    8. P Prettenhofer
    9. R Weiss
    10. V Dubourg
    11. J Vanderplas
    12. A Passos
    13. D Cournapeau
    14. M Brucher
    15. M Perrot
    16. E Duchesnay
    (2011)
    Journal of Machine Learning Research 12:2825–2830.
  64. 64
  65. 65
  66. 66
  67. 67
  68. 68
    Transformation of Haemophilus influenzae
    1. G Poje
    2. RJ Redfield
    (2003)
    Methods in Molecular Medicine 71:57–70.
  69. 69
  70. 70
  71. 71
  72. 72
    Extended-connectivity fingerprints
    1. D Rogers
    2. M Hahn
    (2010)
    Journal of Chemical Information and Modeling 50:742–754.
    https://doi.org/10.1021/ci100050t
  73. 73
  74. 74
  75. 75
  76. 76
  77. 77
  78. 78
  79. 79
  80. 80
  81. 81
  82. 82
  83. 83
  84. 84
  85. 85
  86. 86
  87. 87
  88. 88
  89. 89
  90. 90
  91. 91

Decision letter

  1. Nir Ben-Tal
    Reviewing Editor; Tel Aviv University, Israel

In the interests of transparency, eLife includes the editorial decision letter and accompanying author responses. A lightly edited version of the letter sent to the authors after peer review is shown, indicating the most substantive concerns; minor comments are not usually included.

Thank you for submitting your article "Integrative mapping of enzymatic pathways" for consideration by eLife. Your article has been favorably evaluated by Michael Marletta (Senior Editor) and three reviewers, one of whom, Nir Ben-Tal (Reviewer #1), is a member of our Board of Reviewing Editors.

The reviewers have discussed the reviews with one another and the Reviewing Editor has drafted this decision to help you prepare a revised submission.

The manuscript introduces a novel approach to enzyme annotation based on analyzing the enzyme within the context of its metabolic pathway. Structural information is utilized to enhance systems models. In particular, predictions of the substrates and products of the enzymes, as well as structural data (from experiment and prediction) are used within the framework of Monte Carlo search to assemble putative metabolic pathways. The most highly ranked pathway is considered as hypothesis, and experiments are designed to examine it.

The strength of this study is that it is an intuitive, well-fleshed out method. Structural prediction data from docking, which is not so strong on its own for functional prediction, is coupled to locations within a pathway, restricting the search space of possible solutions to chemically similar compounds in the pathway proximity of each other. There is also strong experimental evidence for parts of the pathway prediction in H. influenzae.

The weakness of this study is that the initial starting points to build these pathways seems vague, pathway models are not fully experimentally verified (although parts have strong evidence), and application on a larger (genome) scale is not discussed, where this would be incredibly useful. An assumption is made to start with genome organization to select the pathway members which is unexplained and should be backed up.

Opinion:

It is an interesting approach but because of several hype statements the reader ends up somewhat disappointed. The approach is validated on three well known and one novel pathway. This is very impressive both in the scope of work and in the conceptual completeness, but potentially misleading as the computational approaches clearly depend on the amount and quality of already existing information. From experimental structures cocrystallized with correct ligands that would help in virtual screening, to reliance on well understood biochemistry and on homology to already characterized enzymes this approach seems to be aiming for semi-automated reanalysis of already well characterized pathways rather than on annotations of novel, uncharacterized ones – which seems to be the stated motivation for this work. In this context the benchmarks and the "discovery" of L-gulonate catabolic pathway in H. influenzae (which seems to be already well annotated in public databases) are not very convincing. It is also a bit unclear what constitutes actual results of this paper, as significant elements of L-gulonate catabolic pathway discovery in H. influenzae were already published by the same authors in 2015. The authors should either rebrand their approach as aiming at cleaning and solving apparent inconsistencies in existing annotations, or come up with more convincing examples of discovery and annotations of genuinely novel functions/pathways. Such examples would provide more realistic evaluation of the proposed approach. In summary: The main text should make it clearer as to what the immediate application is. Is this method currently limited to filling in parts of a pathway, or is it feasible to now try reconstructing entire cells from "bags" of proteins and metabolites? If the latter, suitable examples should be presented.

Major issues:

1) Assuming that the examples will not change, many statements should be tuned down, starting in the title and Abstract, to reflect the fact that: (a) the proposed approach can, at best, resolve conflicts in existing annotations rather than suggest new annotation, and (b) that even that requires a lot of manual interventions, as opposed to fully automated method.

2) The reasons for the data types chosen to be integrated over other potentially useful sources should be flashed. Reasons could include increasing the computational time, limitations of experimental data, etc.

3) Methods of picking the initial proteins should be explained, since that is a nebulous first step to get around before using this useful method.

4) Selecting based on proximity in genome organization is not backed up (subsection “Problem and approach”, last paragraph). Statistics or citations should be shown on known pathways if this is true. This is an important step to explain because it seems like it may ignore many enzymes which are not in the genome neighborhood. In eukaryotic organisms this does not seem to be nearly as applicable as to prokaryotes.

5) Computational time is unknown, throughout the paper – estimated time to run all docking predictions, chemical similarity, binding site locating, MC sampling, etc. What is the bottleneck? Or main source of human time/input?

6) Is docking done only to one predicted active site in an enzyme? How reliable or comprehensive the binding site estimates used here are? How much does docking add over geometric + biochemical comparisons of active sites? Would it be sufficient to do that and compute similarity analyses of that, rather than running many docking simulations?

7) Discussion, first paragraph. There is additional related work of genome-scale metabolic reconstructions that include structural data:

- http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000938 – structural properties used in predicting drug off target pathways.

- https://bmcsystbiol.biomedcentral.com/articles/10.1186/s12918-016-0271-6 – structural properties used to compare lifestyles and pathway usage.

8) Figure 4E – what is the reason behind higher gene expression increases of UxuB, GulPQM, GulD on a D-mannonate carbon source compared to L-gulonate? Since these proteins are involved in L-gulonate transport and metabolism it would seem like they should be expressed higher on L-gulonate carbon source, not D-mannonate which comes after them in the proposed pathway.

9) What is the author's opinion on applying this method on a genome scale (i.e. entire metabolic network)? Is it feasible in terms of speed and data curation? Is it possible for the authors to generate genome-scale network predictions and compare them to available reconstructed metabolic models? These would be interesting questions to address in the main text.

10) How broadly applicable the approach is? How many new enzymes can be annotated this way?

11) The statement concerning the applicability of the method also to protein-protein interaction (PPI) networks is a bit of a stretch. For now, the essence of the method is docking of the metabolites to their binding/catalytic sites. In PPI this component will have to be replaced, and it is unclear with what. Protein-protein docking, the first possibility that comes to mind, is far less accurate, especially for transient interactions. I would eliminate the statement, or tune is down significantly.

12) Methods:a) The terms in the first equation should be defined.

b) Using the DOCK scores. Usually, in drug discovery campaigns, the docking scores are considered an indication for possible binding, but the actual values or ranks are often inaccurate. It is surprising that here they are taken as a given and nevertheless the pipeline works fine.

https://doi.org/10.7554/eLife.31097.025

Author response

Major issues:

1) Assuming that the examples will not change, many statements should be tuned down, starting in the title and Abstract, to reflect the fact that: (a) the proposed approach can, at best, resolve conflicts in existing annotations rather than suggest new annotation, and (b) that even that requires a lot of manual interventions, as opposed to fully automated method.

We have edited the manuscript throughout to tone down overreaching statements. For example:

The title was changed to: "Prediction of enzymatic pathways by integrative pathway mapping”.

We rephrased claims that we discovered a new pathway, instead focusing on development of an approach to predict the functions of the individual enzymes. For example, in the Abstract, “A computational method that can identify” → “predicts”; and Demonstrated the utility by discovering a L-gulonate pathway” → “predicted”. We also removed any mention of “discovery” of the pathway throughout the main text. For instance, “we sought to predict this putative pathway using integrative pathway mapping”.

The input is indeed not prepared entirely automatically, which is true for all automated methods as at some point inputs must be prepared at least in part “by hand”. However, once all the input information is in hand, our method does convert it into the output automatically. For example, the input for the L-gulonate pathway prediction was mainly docking results, chemoinformatics analysis, chemical transformations, genomic neighbourhood, and experimental data obtained as the project evolved. This information was then fed into integrative pathway mapping that automatically produced the output (i.e., the gulonate pathway).

We clarified this point: “The preparation of input information requires manual processing. […] Nevertheless, we emphasize that once the input information is provided, its conversion into the predicted pathway is automated and does not require human intervention.”

Also, we state: “While not all types of input for integrative pathway mapping can be obtained automatically (e.g., docking, experimental measurements), the mapping itself is entirely automated.”

As the reviewer expected, we have indeed not eliminated the description of the L-gulonate pathway from the manuscript. As described in the summary above, this was a genuinely new pathway prediction, one tested and supported in this work with an experimental rigor still relatively rare in the field.

2) The reasons for the data types chosen to be integrated over other potentially useful sources should be flashed. Reasons could include increasing the computational time, limitations of experimental data, etc.

A major goal of the manuscript is to introduce an integrative approach to pathway mapping. Accordingly, we selected several accessible types of input to demonstrate this idea, while suggesting that additional sources of information could be considered in the future. As more types of information are added to the integrative approach, they might further improve its accuracy, precision, and applicability. Docking, the chemoinformatic Similarity Ensemble Approach, and chemical reactions seemed like sensible initial inputs, as they exploit established methods. The application to the gulonate pathway illustrates the use of several other types of input, including experimental data from differential scanning fluorimetry (it can be done in high-throughput) and additional theoretical considerations (e.g., the end point of the gulonate pathway must be a known metabolite), thus illustrating the relative ease with which information about a pathway can be included.

In Materials and methods, we now add: “First, input information has to be collected from computational and/or experimental sources. […] Moreover, additional types of information can be added, hopefully improving the accuracy, precision, and applicability of the approach, as illustrated by the gulonate pathway prediction that also depends on differential scanning fluorimetry data, pathway anchor points, and protein homology considerations.”

3) Methods of picking the initial proteins should be explained, since that is a nebulous first step to get around before using this useful method.

The candidate enzymes involved in the gulonate pathway were selected by identifying genes that are conserved in the genome neighborhood of the TRAP SBP gene (which encodes the transporter known to import gulonate into the cell) across multiple bacterial species. This analysis involved the construction of a genome neighborhood network (Zhao et al., eLife, 2014), which displays conserved protein families found in proximity to the TRAP SBP gene.

Thus, we added the following description: “For example, for the gulonate pathway, we identified 5 metabolic enzymes that are conserved in the genome neighborhood of the TRAP transporter gene by constructing a genome neighborhood network (GNN); the GNN approach has been demonstrated to accurately predict enzymes and transporters that function together in metabolic pathways based on conserved protein families in genome neighborhoods across different species (Zhao et al., 2014).”

Also, specific details for the gulonate pathway are given in Materials and methods: “A sequence similarity network (SSN) and genome neighborhood network (GNN) were constructed using the EFI-EST webserver (Gerlt et al., 2015) and Pythoscape v1.0 software (Barber and Babbitt, 2012) for an anchor protein, TRAP SBPs (Uniprot ID P71336 and Uniprot ID A7JQX0), to provide candidate pathway members […] Analysis of the GNN identified five enzyme families as candidate pathway members, including two dehydrogenases, one sugar dehydratase, one carbohydrate kinase, and one aldolase. The genes associated with these families in H. influenzae are co-localized in the genome with the TRAP SBP gene.”

4) Selecting based on proximity in genome organization is not backed up (subsection “Problem and approach”, last paragraph). Statistics or citations should be shown on known pathways if this is true. This is an important step to explain because it seems like it may ignore many enzymes which are not in the genome neighborhood. In eukaryotic organisms this does not seem to be nearly as applicable as to prokaryotes.

We agree. Accordingly, we added a discussion of the relationship between the genome neighborhood and pathway membership for prokaryotes. We also point out that the relationship is not nearly as strong in eukaryotes, most likely invalidating this particular consideration for predicting eukaryotic pathways. We also note that the integrative framework allows for incorporating other information about pathway membership (e.g., based on cellular localization, homology, biochemical function, and direct experimental evidence), which might reduce the size of the input set of potential protein members.

In Discussion, we now write: “Moreover, no single type of input information is essential, provided sufficient information is available from other sources. […] Thus, the integrative approach is at least in principle not limited to prokaryotic pathways.”

Also, in section “Stage 2: Designing pathway model representation and evaluation” in Discussion, we added: “This step can be substituted or supplemented by any other method that identifies candidate genes, including but not limited to: 1) colocalization of genes providing operon/metabolic context for prokaryotic proteins (Overbeek et al., 1999), 2) coexpression measured through chip-based and RNA-seq technologies (Barber and Babbitt, 2012), 3) co-regulation predicted by upstream DNA motifs (Wang, Gerstein and Snyder, 2009; Pilpel, Sudarsanam and Church 2001), 4) protein-protein interaction studies (Rodionov, 2007; Bork et al., 2004), 5) protein fusion events (Meier, Sit and Quake, 2013), and 6) phylogenetic profiles across different genomes (Marcotte et al., 1999).”

5) Computational time is unknown, throughout the paper – estimated time to run all docking predictions, chemical similarity, binding site locating, MC sampling, etc. What is the bottleneck? Or main source of human time/input?

To address this point, we added a new section entitled “Computational cost”, in Methods): “The total computation time depends on the numbers of enzymes and ligands considered. […] In conclusion, the entire process, from preparation of inputs to the sampling of the pathways, can be performed in a few hours on a cluster of a few hundred nodes.”

6) Is docking done only to one predicted active site in an enzyme? How reliable or comprehensive the binding site estimates used here are? How much does docking add over geometric + biochemical comparisons of active sites? Would it be sufficient to do that and compute similarity analyses of that, rather than running many docking simulations?

For each enzyme, as we now describe in the section “Molecular docking screens”, the docking was run against a single predicted binding site. Identical predictions were obtained by two independent methods: (1) similarity to a protein structure with a known site and (2) identifying the largest cavity in the structure using the program PocketPicker (Irwin et al., 2009).

As the reviewer points out, one could take known ligands of proteins homologous to a potential pathway member and rank the corresponding protein ligand pairs higher than those with other potential ligands, perhaps considering the degree of similarity of active sites and/or ligands. We explored this idea, but did not incorporate it in the current version of the method because we do not always have information about homologous proteins and their ligands. Perhaps more importantly, benchmarking showed that virtual screening can potentially identify ligands more accurately than a simple homology-based ligand transfer (Fan et al., J. Chem. Inf. Model., 2009, PMID: 19845314) (now pointed out in the first paragraph of the Discussion). Nevertheless, homology-based transfer may be one of the scoring function terms to add in the future, to further improve our integrative approach.

7) Discussion, first paragraph. There is additional related work of genome-scale metabolic reconstructions that include structural data:

- http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000938 – structural properties used in predicting drug off target pathways.

- https://bmcsystbiol.biomedcentral.com/articles/10.1186/s12918-016-0271-6 – structural properties used to compare lifestyles and pathway usage.

We added these citations in the first paragraph of the Discussion.

8) Figure 4E – what is the reason behind higher gene expression increases of UxuB, GulPQM, GulD on a D-mannonate carbon source compared to L-gulonate? Since these proteins are involved in L-gulonate transport and metabolism it would seem like they should be expressed higher on L-gulonate carbon source, not D-mannonate which comes after them in the proposed pathway.

The relative natural abundances of the two carbon sources may impact the upregulation of genes involved for D-mannonate conversion. For example, D-mannonate is a part of the catabolic pathway of D-glucuronate, which is commonly found in nature. L-gulonate metabolism intersects with D-glucuronate metabolism at fructuronate, but we are unaware of any previous reports of the use of L-gulonate.

One explanation for the higher expression of GulPQM on a D-mannotate source is that the GulPQM transporter could have a lower binding efficiency for D-mannonate vs. L-gulonate (differential scanning fluorimetry results, Supplementary file 3). Furthermore, UxuB and GulD may be co-regulated with GulPQM and, thus, is also upregulated when grown on D-mannonate.

9) What is the author's opinion on applying this method on a genome scale (i.e. entire metabolic network)? Is it feasible in terms of speed and data curation? Is it possible for the authors to generate genome-scale network predictions and compare them to available reconstructed metabolic models? These would be interesting questions to address in the main text.

The framework of the approach can, in principle, take any information that can be used to rank alternative pathways, and then samples alternative pathways to find those that are highly ranked. Thus, it could take information used for constructing entire metabolic networks by others (or indeed other networks) and, given a sufficiently large computing facility and perhaps improved sampling algorithm, in principle predict entire metabolic networks. The quality of the output is limited by the amount of information and exhaustiveness of sampling. In practice, however, reconstructing entire metabolic networks is out of scope for this first paper and for the current implementation of the method. We do believe, as alluded to above, that the method can, even now, complement the genome reconstruction methods in the realm of “dark” metabolic space. Still, not wishing to mislead on this point and succumb to hype, we prefer to leave such possible future applications uncommented on in the manuscript. We beg your indulgence here.

We now end Discussion with an appropriately cautious statement: “With further development, the framework may be applicable on a larger scale, approaching complete genomes, but mapping topologies of networks will be more demanding as it will require more input information and larger computation.”

10) How broadly applicable the approach is? How many new enzymes can be annotated this way?

To answer this question to a degree, we consider the fraction of enzymes that have known X-ray structures or sufficiently high quality comparative protein structure models – the latter of which we use in this manuscript – for application of virtual screening: domains in about 70% of proteins have known or modelable structures (Pieper et al., Nucl Acids Res 42, 336-346, 2014); not every enzyme in the pathway needs to be analyzed by virtual screening (cf, the dummy enzyme benchmarking in Figure 2—figure supplement 1). Still, we estimate that not more than one half of known enzymatic pathways of comparable length to those benchmarked could be mapped by our method.

To test the generality of integrative pathway mapping approach, we are currently applying it to three prokaryotic pathways of interest to the Enzyme Function Initiative consortium, to which all authors belong. In particular, we are focusing our attention on predicting new reactions that have not been associated with a specific enzyme; all predictions are linked to a specific gene that can then be experimentally tested.

11) The statement concerning the applicability of the method also to protein-protein interaction (PPI) networks is a bit of a stretch. For now, the essence of the method is docking of the metabolites to their binding/catalytic sites. In PPI this component will have to be replaced, and it is unclear with what. Protein-protein docking, the first possibility that comes to mind, is far less accurate, especially for transient interactions. I would eliminate the statement, or tune is down significantly.

We agree, and have eliminated the statement.

12) Methods:a) The terms in the first equation should be defined.

We agree, and now define each of the terms in the equation.

b) Using the DOCK scores. Usually, in drug discovery campaigns, the docking scores are considered an indication for possible binding, but the actual values or ranks are often inaccurate. It is surprising that here they are taken as a given and nevertheless the pipeline works fine.

The reviewer correctly points out that the scores and the ranks in a docking campaigns are often inaccurate, which is why the integrative mapping approach relies on several inputs – and not only on one source of information such as docking – to reach a consensus ranking. However, docking can enrich for real ligands at the top of the docking list. Therefore, in the L-gulonate pathway example we limit the input to the top 500 docking-ranked metabolites. This is illustrated directly in Figure 2C where only a combination of all three scoring function terms results in a correct ordering of enzymes and substrate assignments, while scoring functions containing only two or less types of scoring terms result in imperfect predictions.

https://doi.org/10.7554/eLife.31097.026

Article and author information

Author details

  1. Sara Calhoun

    Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, United States
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing
    Contributed equally with
    Magdalena Korczynska
    Competing interests
    No competing interests declared
  2. Magdalena Korczynska

    Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, United States
    Contribution
    Conceptualization, Data curation, Software, Formal analysis, Validation, Investigation, Visualization, Methodology, Writing—original draft, Project administration, Writing—review and editing
    Contributed equally with
    Sara Calhoun
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-1339-5553
  3. Daniel J Wichelecki

    1. Institute for Genomic Biology, University of Illinois, Urbana, United States
    2. Department of Biochemistry, University of Illinois, Urbana, United States
    3. Department of Chemistry, University of Illinois, Urbana, United States
    Contribution
    Formal analysis, Validation, Investigation, Writing—review and editing
    Competing interests
    No competing interests declared
  4. Brian San Francisco

    Institute for Genomic Biology, University of Illinois, Urbana, United States
    Contribution
    Formal analysis, Validation, Investigation, Writing—review and editing
    Competing interests
    No competing interests declared
  5. Suwen Zhao

    Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, United States
    Contribution
    Validation, Investigation, Writing—original draft
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-5609-434X
  6. Dmitry A Rodionov

    1. Sanford Burnham Prebys Medical Discovery Institute, La Jolla, United States
    2. A.A. Kharkevich Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russia
    Contribution
    Data curation
    Competing interests
    No competing interests declared
  7. Matthew W Vetting

    Department of Biochemistry, Albert Einstein College of Medicine, New York, United States
    Contribution
    Validation, Investigation
    Competing interests
    No competing interests declared
  8. Nawar F Al-Obaidi

    Department of Biochemistry, Albert Einstein College of Medicine, New York, United States
    Contribution
    Resources, Supervision, Validation, Writing—review and editing
    Competing interests
    No competing interests declared
  9. Henry Lin

    Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, United States
    Contribution
    Validation, Investigation
    Competing interests
    No competing interests declared
  10. Matthew J O'Meara

    Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, United States
    Contribution
    Investigation
    Competing interests
    No competing interests declared
  11. David A Scott

    Sanford Burnham Prebys Medical Discovery Institute, La Jolla, United States
    Contribution
    Data curation
    Competing interests
    No competing interests declared
  12. John H Morris

    Resource for Biocomputing, Visualization and Informatics, Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, United States
    Contribution
    Data curation
    Competing interests
    No competing interests declared
  13. Daniel Russel

    Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, United States
    Contribution
    Data curation
    Competing interests
    No competing interests declared
  14. Steven C Almo

    Department of Biochemistry, Albert Einstein College of Medicine, New York, United States
    Contribution
    Resources, Funding acquisition, Writing—review and editing
    Competing interests
    No competing interests declared
  15. Andrei L Osterman

    Sanford Burnham Prebys Medical Discovery Institute, La Jolla, United States
    Contribution
    Data curation
    Competing interests
    No competing interests declared
  16. John A Gerlt

    1. Institute for Genomic Biology, University of Illinois, Urbana, United States
    2. Department of Biochemistry, University of Illinois, Urbana, United States
    3. Department of Chemistry, University of Illinois, Urbana, United States
    Contribution
    Resources, Supervision, Funding acquisition, Project administration, Writing—review and editing
    For correspondence
    j-gerlt@illinois.edu
    Competing interests
    No competing interests declared
  17. Matthew P Jacobson

    Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, United States
    Contribution
    Resources, Supervision, Funding acquisition, Writing—review and editing
    For correspondence
    matt.jacobson@ucsf.edu
    Competing interests
    Consultant to and stockholder of Schrodinger LLC, which licenses, develops, and distributes some of the software used in this work
  18. Brian K Shoichet

    Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, United States
    Contribution
    Conceptualization, Resources, Data curation, Supervision, Funding acquisition, Writing—original draft, Writing—review and editing
    For correspondence
    bshoichet@gmail.com
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-6098-7367
  19. Andrej Sali

    1. Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, United States
    2. Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, United States
    3. California Institute for Quantitative Biosciences, University of California, San Francisco, San Francisco, United States
    Contribution
    Conceptualization, Resources, Data curation, Supervision, Funding acquisition, Writing—original draft, Writing—review and editing
    For correspondence
    Sali@Salilab.org
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-0435-6197

Funding

National Institutes of Health (U54 GM093342)

  • John A Gerlt

National Institute of General Medical Sciences (P41-GM103311)

  • Andrej Sali

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

We thank Robert Munson for helpful discussions and advice on microbiological methods for Haemophilus influenzae. Chakrapani Kalyanaraman generously provided docking data. Martin Steinegger helped explore alternative pathway sampling methods in the initial stages of development. We thank OpenEye Scientific software for the use of OEChem tools. This work was supported by NIH U54 GM093342 (to JAG) and by NIGMS P41-GM103311 to Dr Thomas E Ferrin for the development of computational tools and Resource for Biocomputing, Visualization, and Informatics (RBVI) at UCSF, which partially supports the work of Dr A Sali.

Reviewing Editor

  1. Nir Ben-Tal, Tel Aviv University, Israel

Publication history

  1. Received: August 8, 2017
  2. Accepted: December 18, 2017
  3. Version of Record published: January 29, 2018 (version 1)

Copyright

© 2018, Calhoun et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 1,883
    Page views
  • 313
    Downloads
  • 6
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, Scopus, PubMed Central.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Download citations (links to download the citations from this article in formats compatible with various reference manager tools)

Open citations (links to open the citations from this article in various online reference manager services)

Further reading

    1. Biochemistry and Chemical Biology
    2. Structural Biology and Molecular Biophysics
    Nicholas Chim et al.
    Short Report
    1. Microbiology and Infectious Disease
    2. Structural Biology and Molecular Biophysics
    Wen-Fan Shen et al.
    Research Article