Bactabolize: A tool for high-throughput generation of bacterial strain-specific metabolic models

Ben Vezina; Stephen C. Watts; Jane Hawkey; Helena B. Cooper; Louise M. Judd; Adam Jenney; Jonathan M. Monk; Kathryn E. Holt; Kelly L. Wyres

doi:10.7554/eLife.87406.1

eLife assessment

This study presents Bactabolize, a useful tool for the rapid genome-scale reconstruction of bacteria, applied here to strains of Klebsiella pneumoniae. Using a solid and validated methodology, the tool takes reference reconstructions as the input and can generate new reconstructions for target strains from these. However, the evidence to substantiate all of the justifications and practical applications of the models produced is still incomplete, and the work would benefit from more rigorous approaches regarding the accuracy of the predicted models.

https://doi.org/10.7554/eLife.87406.1.sa3

Significance of findings

useful: Findings that have focused importance and scope

landmark
fundamental
important
valuable
useful

Strength of evidence

solid: Methods, data and analyses broadly support the claims with only minor weaknesses

incomplete: Main claims are only partially supported

exceptional
compelling
convincing
solid
incomplete
inadequate

During the peer-review process the editor and reviewers write an eLife assessment that summarises the significance of the findings reported in the article (on a scale ranging from landmark to useful) and the strength of the evidence (on a scale ranging from exceptional to inadequate). Learn more about eLife assessments

Abstract

Metabolic capacity can vary substantially within a bacterial species, leading to ecological niche separation, as well as differences in virulence and antimicrobial susceptibility. Genome-scale metabolic models are useful tools for studying the metabolic potential of individuals, and with the rapid expansion of genomic sequencing there is a wealth of data that can be leveraged for comparative analysis. However, there exist few tools to construct strain-specific metabolic models at scale.

Here we describe Bactabolize (github.com/kelwyres/Bactabolize), a reference-based tool which rapidly produces strain-specific metabolic models and growth phenotype predictions. We describe a pan reference model for the priority antimicrobial-resistant pathogen, Klebsiella pneumoniae (github.com/kelwyres/KpSC-pan-metabolic-model), and a quality control framework for using draft genome assemblies as input for Bactabolize.

The Bactabolize-derived model for K. pneumoniae reference strain KPPR1 outperformed the CarveMe-derived model across ≥201 substrate and ≥1220 knockout mutant growth predictions. Novel draft genomes passing our systematically-defined quality control criteria resulted in models with a high degree of completeness (≥99% genes and reactions captured) and high accuracy (mean 0.97, n=10).

We anticipate the tools and framework described herein will facilitate large-scale metabolic modelling analyses that broaden our understanding of diversity within bacterial species and inform novel control strategies for priority pathogens.

Introduction

Bacteria exhibit metabolic diversity and can utilise a broad range of substrates for growth. It has become clear amongst pathogens that there is an intertwined relationship between metabolism and nutrient usage with virulence and antimicrobial resistance (1-7). Comparative analyses of metabolic profiles (e.g. substrate usage) are key to fully understanding these relationships.

Traditionally, these profiles have been assessed via phenotypic growth on a limited number of substrates, such as those used to delineate between species (8-10) which form the basis of a number of commercial products for species identification. However, these methods are not sufficiently discriminatory for in-depth comparisons within species, and alternative approaches such as the Omnilog Phenotype MicroArray system (Biolog) are too expensive and/or labour intensive for application to large numbers of isolates. Similarly, probing of essential metabolism-associated genes via transposon mutant libraries (e.g. to identify novel virulence factors and therapeutic targets) (4, 11, 12) cannot be easily scaled across diverse bacterial populations.

Genome-scale metabolic models or metabolic reconstructions are a computational approach to analysing the metabolic potential of an organism, within which the entire biochemical network is represented as a stoichiometric matrix (13). Metabolic models are constructed programmatically, but typically informed and at least partially validated using phenotypic growth data (14-16). Once constructed, they can be run through simulations and analysed under various contexts, such as in silico growth experiments (Flux Balance Analysis [FBA]) to predict substrate usage profiles (17), evaluate the impact of single gene knockouts on growth (14, 18), and identify metabolic chokepoints for drug targets (19), among others. Traditionally, metabolic models are strain-specific (i.e. each model represents a unique individual http://bigg.ucsd.edu/models) and may not be applicable to other isolates due to unrepresented genetic diversity.

We recently described 37 curated strain-specific models for the Klebsiella pneumoniae Species Complex (KpSC) (14) comprised of K. pneumoniae and its close relatives (20). These organisms are a common cause of healthcare-associated infections world-wide, and among the World Health Organization’s priority antimicrobial resistant pathogens (21). KpSC are highly diverse and gene content can differ substantially between strains (22, 23). Accordingly, our models varied in terms of gene and reaction content, resulting in variable growth substrate usage profiles and metabolic redundancy (14). Similar variation has also been described in other key bacterial pathogens e.g. Escherichia coli (24), Salmonella enterica (25), Staphylococcus aureus (26) and Pseudomonas aeruginosa (27). This is highly relevant to the use of metabolic models for the exploration of virulence and antimicrobial resistance, and for the identification of novel drug targets. Therefore, such works should seek to include multiple strain-specific models, and in some cases 100s-1000s of models may be required to accurately represent population diversity (22, 28, 29).

There are several open source tools currently available that can rapidly produce strain-specific metabolic models, including CarveMe (30), ModelSEED (31) and KBase (32) (see the recent review by Mendoza and colleagues for comparative descriptions (33)), as well as a recently published modelling and analysis pipeline, ChiMera, which leverages CarveMe for model construction (34). In their systematic analysis Mendoza et al. indicated CarveMe and ModelSEED to be of particular interest for large-scale studies due to their speed and model quality (33). Like KBase, ModelSEED is a web interface application, limiting its utility for high-throughput analysis of 100s – 1000s of bacterial genomes. CarveMe is a command line application; it is open source but is dependent on commercial solvers such as CPLEX (free for academic use). However, its use of a universal reference model may limit specificity of strain-specific models (35), and result in overestimation of model genes. These limitations can be overcome by manual curation of the output models, but such curation is highly labour intensive and not suitable for high-throughput analyses. Furthermore, the CarveMe database (BiGG universal_model) appears to be no longer actively maintained, meaning that there is no opportunity to integrate novel structural and/or biochemical data as these become available in the literature (as discussed in COBRA community forums).

Here, we present Bactabolize (available at https://github.com/kelwyres/Bactabolize), an easy-to-use tool which allows scalable production of strain-specific draft metabolic models and prediction of growth phenotypes. Bactabolize builds upon the reference-based model reconstruction approach described by Norsigian et al. (35), leveraging the COBRApy framework (36) and BiGG nomenclature (37). We present a pan-metabolic reference model for the KpSC (derived from our 37 curated strain-specific models (14)), and describe an exemplar quality control framework for the application of Bactabolize to KpSC draft genome assemblies. We show that Bactabolize can rapidly produce strain-specific models from draft genomes with a high degree of completeness (as compared to models generated from completed genome assemblies), resulting in highly accurate growth predictions that match or exceed the accuracy of models from CarveMe and manual curation efforts.

Results

Description of Bactabolize

Bactabolize is written in Python 3 and utilises the metabolic modelling library COBRApy (36). Bactabolize has four main commands:

Draft model generation (draft_model command), which generates a strain-specific draft metabolic reconstruction (‘model’) using the approach outlined previously (35), and uses gap-filling to identify any missing reactions required to simulate growth in the user-specified conditions
Patching incomplete models (patch_model command) by the addition of missing reactions e.g. those identified by the automated gap-filling process
Substrate usage analysis via Flux Balance Analysis (FBA) (fba command) to predict growth outcomes for a specified range of substrates supported by the model(s)
Fig. 1).

Simplified overview of Bactabolize’s main commands. In pink is the *draft_model* command, which builds a draft strain-specific metabolic model using an input reference model and an input target assembly (approach adapted from (35)). If the model fails to simulate growth, Bactabolize will attempt automated gap-filling and produce a model patch file. The *patch_model* command (orange) allows the addition of missing reactions to produce a valid draft model that can simulate growth in a user-specified growth environment. A functioning model can be passed to the *fba* command (yellow), which performs Flux Balance Analysis to simulate growth in the user specified conditions, across all carbon, nitrogen, phosphorus and sulfur metabolite sources supported by the model under aerobic and anerobic conditions. The *sgk* command (blue) shows the Single Gene Knockout analysis, which outputs a predicted phenotype. User inputs and outputs are shown in white boxes while Bactabolize commands are shown inside the grey box.

Additional processing scripts are provided alongside Bactabolize to improve model metadata annotation (improve_model_annotations.py), convert models generated using KBase and ModelSEED to Bactabolize/BiGG-compatible format (SEED_to_BiGG_model_convert.sh), generate network graph files from models (model_to_network_graph.py) and merging output FBA profiles (merge_fba_profiles_longtable.sh).

For draft model construction, Bactabolize requires users to provide an input assembly (annotated or unannotated FASTA or Genbank format respectively), a reference model (JSON format) and the corresponding reference sequence data (gene and protein sequences in two separate multi-fasta files or a single Genbank annotation in a .gbk file) (Figure S1). If the input assembly is unannotated, Bactabolize will identify coding sequences using Prodigal (38) but will otherwise honour the existing coding sequence (CDS) notations and optionally use Prodigal to search for additional CDS. Draft genome-scale metabolic models are output in both SMBL v3.1 (39) and JSON formats (one pair of files for each independent strain-specific model), along with an optional MEMOTE quality report (40). Bactabolize will identify orthologs in the input genome(s) compared to the reference sequence data using Bi-directional BLAST (41) Best Hits (BBH) (42) using BLAST+ (35). Users can parameterise the ortholog finding settings (coverage and identity thresholds) for BBH. Alternatively, there is the option of using protein similarity to identify orthologs instead of identity.

Once a draft model has been constructed, it is validated via a simulated growth experiment on user-input choice of media and atmosphere (aerobic or anerobic). Predefined media include BG11 (Gibco), M9 + glucose (35), nutrient media (43), Luria-Bertani (LB) (43), Tryptic Soy (TSA) (43), TSA + sheep blood (43), LB as specified by the CarveMe developers (30), Chemically Defined Medium (CDM)-like (33), Plantarum Minimal Medium (PMM) PMM5-like (33) and PMM7-like (33). Users can also define custom media as Bactabolize supports several complex media ingredients, including peptone (peptic digest of bovine and porcine tissue) (44-46), tryptone (pancreatic digest of casein) (44, 46, 47), soy peptone/soytone (digest of soymeal) (44, 46, 48, 49), yeast extract (50-55) and beef extract (44, 46). If the model fails to simulate growth, gap-filling is performed to indicate missing reactions. Users can add these reactions to a patch JSON file and optionally use the patch_model command to correct the model (Figure S2). Bactabolize uses a conservative gap-filling approach that only adds the minimum number of reactions to enable growth under the chosen conditions. We recommend testing the models in minimal media and atmosphere expected to support growth for all isolates of the species of interest, unless the user has access to matched phenotypic data demonstrating growth for individual isolates in specific conditions. Aggressive gap-filling will effectively homogenise the models and should be avoided if the goal is to understand the underlying strain diversity.

Substrate usage analysis (the fba command) is performed iteratively for each possible carbon, nitrogen, sulfur and phosphor substrate supported by the model(s) (Figure S3), by replacing the default substrate in the user specified growth medium (specified in the fba_spec JSON file). For example, in M9 media the default substrates are glucose (carbon), ammonia (nitrogen), sulphate (sulfur) and phosphate (phosphor). Each substrate can be tested in aerobic and/or anaerobic conditions. Growth prediction output is recorded in a tab delimited file (one per strain). The merge_fba_profiles_longtable.sh helper script will combine the outputs for multiple strains into a single file for downstream analysis.

The growth impacts of single-gene knockout mutations can be simulated via the sgk command (Figure S4). Bactabolize will iterate through every gene in the model, temporarily removing it and its associated reactions (unless they are also associated with another gene) and running FBA to simulate growth in the user-specified conditions. The output is comparable to single-gene knockout studies such as transposon mutagenesis and can be used to probe gene essentiality.

We recorded the time required for Bactabolize to build draft models and performed 1692 independent growth predictions for each of 35 KpSC genomes (tested in triplicate) on a high-performance computing cluster (Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz and 340 GB of requested memory on a CentOS Linux release 7.9.2009 environment). The mean CPU time required for model construction was 98.41 seconds (range 83.72 - 112.55 seconds), while the mean CPU time for growth predictions was 88.79 seconds (range 85.15 - 103.37 seconds). On a standard consumer laptop (Intel(R) Core(TM) i5-8365U CPU @ 1.60GHz and 15 GB of memory on Windows Subsystem for Linux (WSL1) environment), the mean CPU time for model construction was 87.27 seconds (range 76.133 – 102.694 seconds), while growth predictions took 82.19 seconds (range 72.24 – 103.37 seconds).

KpSC pan-metabolic reference model

We constructed a species complex-specific pan-metabolic reference model by combining a collection of 37 manually curated models for which we have previously demonstrated high accuracy (range 88.3%–96.8% for prediction of 94 distinct growth phenotypes (14)). These models represent a diverse collection of KpSC (14) (including at least one each of the seven major taxa in the complex; K. pneumoniae, Klebsiella variicola subsp variicola, Klebsiella variicola subsp tropica, Klebsiella quasipneumoniae subsp quasipneumoniae, Klebsiella quasipneumoniae subsp similipneumoniae, Klebsiella quaisivariicola, Klebsiella africana). The combined pan-model, known as KpSC-pan v1, comprises a total of 1265 distinct genes, 2319 reactions and 1696 metabolites, and is available at github.com/kelwyres/KpSC-pan-metabolic-model.

Performance comparison

We compared the output and performance of Bactabolize to CarveMe (30) and a manually curated metabolic reconstruction of K. pneumoniae strain KPPR1 (also known as VK055 and ATCC 43816, metabolic model named iKp1289) (15). This isolate was chosen as there is a completed genome sequence (Genbank accession: CP009208), single-source growth phenotype (15) and single-gene knockout growth essentiality data available (56). Draft models were built using; i) Bactabolize with the KpSC pan v1 reference; ii) CarveMe, with its universal reference model (CarveMe universal); and iii) CarveMe, with KpSC-pan v1 reference (CarveMe KpSC pan).

Importantly, neither K. pneumoniae KPPR1 nor its genetic lineage (7 gene multi-locus sequence type, ST493), are represented in the KpSC pan-reference model, meaning these benchmarking comparisons were on equal footing.

The Bactabolize draft model captured a comparable number of genes and reactions (n = 1233 and 2307, respectively) to the manually curated model (n = 1289 and 2484, respectively) but fewer than the CarveMe universal model (n = 1960 and 2857)

Fig. 2A). In contrast, the number of metabolites represented in the Bactabolize and CarveMe universal models were similar (1696 vs 1737) and both were lower than the number represented in iKp1289 (n = 1827). The CarveMe KpSC pan model method captured considerably more genes than any of the other models (n = 2407), but these were associated with many fewer unique reactions and metabolites (1206 and 825, respectively). Upon further investigation we determined that this method resulted in the over prescription of gene reaction rules (GPRs) to multiple reactions (mean 2.2 GPRs per reaction when compared to Bactabolize using the same pan reference model: 1.94 GPRs per reaction; and CarveMe Universal: 2.12 GPRs per reaction). MEMOTE scores, (produced by the MEMOTE report (40)) indicate the quality of the model metadata annotations, with the scores ranging between 0 – 100%. These provide a measure of model portability and the level of connected databases available to support the metabolite, reaction and genetic information represented in the model, but bear no reflection on model accuracy.

*K. pneumoniae* KPPR1 metabolic model benchmarking comparisons. A) Counts of model features; genes, metabolites and reactions captured by each model. Exchanges refers to number of exchange reactions, a subset of reactions involved in substrate uptake, which determine the number of distinct growth substrates for which phenotypes can be predicted with the model. B) MEMOTE scores indicating the richness of annotations and metadata for metabolic model features according to database outlinks. SBO refers to score of Systems Biology Ontology (SBO), a controlled vocabulary for systems biology. Consistency refers to the score of stoichiometric consistency and chemical formulae annotation. Total refers to total MEMOTE score, as a combination of all previous scores, and is shown in bold. C) Counts of carbon, nitrogen, phosphorus and sulfur growth substrates that can be simulated by models and for which matched phenotypes were available for comparison (15). Hatched columns indicate the total number of substrates for which phenotypic data for *K. pneumoniae* KPPR1 were described (15). D) and E) Accuracy metrics for predicted to true phenotypes for the growth substrates shown in D and E, respectively False-negatives, true-negatives, false-positives and true-positives are coloured as shown in legend. F) and G) Accuracy metrics for the KPPR1 single-gene knockout mutant library described in (56) shown in F and G, respectively. Numbers of true positives and false positives are shown to the left of the respective columns.

Bactabolize performs on the lower end, with CarveMe universal performing the best (Fig. 2B). However, Bactabolize using the KpSC-pan model outperforms the model propagation mode of CarveMe using the same reference model (Fig. 2B). Work is ongoing to improve the annotations in the KpSC-pan reference model, to improve large-scale model propagation.

We assessed the performance of each model for in silico prediction of growth phenotypes compared to the previously published experimental data (15). Accuracy, sensitivity, specificity, precision and F1 scores were calculated (57). Note that the specific set of growth substrates and gene knockouts that can be simulated is determined by the sets of genes and metabolites captured by each model and is therefore model-dependent (Data S1 and S2). Among those with matched experimental phenotype data, the Bactabolize and CarveMe universal models were able to predict growth for a greater number of carbon, nitrogen, phosphorous and sulfur substrates than both the iKp1289 model and the CarveMe KpSC pan models (Fig. 2C, Data S1). While the CarveMe universal model had the highest number of true-positive growth predictions overall, it also had a comparably high number of false-positive predictions (Fig. 2D). In contrast, the Bactabolize model had fewer false-positive predictions, resulting in the highest overall accuracy metrics (Fig. 2E, Data S1). Similarly, while the CarveMe universal model resulted in the highest absolute number of true-positive gene essentiality predictions, driving a high accuracy, the Bactabolize model was associated with the greatest overall precision, sensitivity, and specificity (Figs. 2F & 2G).

Quality control framework for input genome assemblies

There are now thousands of bacterial genomes available in public databases, the majority of which are in draft form. If we are to use these data for high-throughput metabolic modelling studies, it is essential to evaluate the expected model accuracies and understand the minimum input genome quality requirements. Here we performed a systematic analysis leveraging our published curated KpSC models (n=37, (14)), which were generated using completed genome sequences and were therefore considered to represent ‘complete’ models. We randomly subsampled the corresponding Illumina read sets to various depths (10 – 100x, increments of 10) in triplicate and generated draft assemblies that were passed to Bactabolize for generation of draft metabolic models (Data S3). Due to low read depth (≤30x), two isolates were removed from this analysis. Additionally, ten 10x depth read samples failed to produce assemblies, leaving 1040 draft genomes for analysis. The resulting draft metabolic models were compared to the complete models to; i) determine the proportions of complete model genes and reactions captured in the draft models; and ii) compare 846 in silico aerobic growth predictions in M9 minimal media, where growth on 266 carbon, 153 nitrogen, 59 phosphorus and 25 sulfur sources were examined. Substrates containing multiple elements were tested as sole sources of each element independently and in combination, e.g. 1,5-Diaminopentane was tested as a sole carbon, sole nitrogen and sole carbon plus nitrogen source.

As expected, assembly quality generally increased with increasing sequencing depth i.e. assemblies generated from higher depth read sets were associated with higher N50 values, fewer contigs and fewer assembly graph dead-ends, although the rate of improvement drastically declined beyond 40-50x depth (Figure S5, Data S3). We noted that it was rare for draft models to capture 100% of model genes and reactions (just 420 of all 1040 draft assemblies were associated with models that captured 100% model genes) (Data S3, Figure S6), even when using the highest quality draft genomes. However, ≥99% of genes and reactions were commonly captured, which plateaued from 40x depth onwards (Figure S6). Therefore, we sought to evaluate whether ≥99% model capture would produce functionally accurate models.

We used FBA to simulate substrate growth profiles for the 40x depth assemblies, representing a sequencing depth that can be routinely achieved with standard Illumina library preparations. All but one assembly triplicate set (isolate SB4767 98% gene capture, 99% reaction capture) captured ≥99% but ≤100% model genes and/or reactions. The substrate growth profiles were then compared to those of the complete models. The vast majority of draft models produced accurate growth predictions; 102 of 108 models resulted in predictions with 100% concordance to those from the corresponding complete models. Three models for K. quasipneumoniae similipneumoniae isolate SB164 resulted in predictions with a mean of 99.8% concordance. The remaining three models were for isolate SB4767 and resulted in mean of 80.4% concordance. Notably, these models were those representing <99% gene capture. Together, these data suggest that draft models capturing ≥99% of the complete model genes/reactions generate highly accurate growth predictions and that these capture rates can be readily achieved from draft genome assemblies.

We investigated the relationships between assembly quality metrics and model gene/reaction capture in more detail. Variation in assembly graph dead-ends accounted for the greatest amount of variation in model capture, closely followed by raw contig counts (cubic polynomial fit, R² of ≥0.98 for graph dead-ends, R² of ≥0.9 for contig count). A segmented linear model was fitted to N50 length (R² ≥ 0.83), producing a breakpoint at 25153 bp (Fig. 3).

Scatterplots showing distribution of best performing assembly metrics ‘assembly graph dead ends’, ‘contigs’ and ‘N50’ against model feature capture (genes and reactions). Each point represents the mean values from a single genome (technical triplicate) and is coloured by model quality. ‘Good’ models capture ≥99% of the model metric as compared to the corresponding complete model (shown at each facet), ‘Bad’ models capture <99%. Cubic polynomial line plotted for assembly ‘graph dead ends’, ‘contigs’, while a segmented linear model was plotted for ‘N50’. R² is shown on each panel.

To further explore the optimum thresholds for assembly metrics, we tallied the number of draft assemblies resulting in ≥99% and <99% gene and reaction capture at increasing graph dead-end and contig count count-offs, and decreasing N50 cut-offs. Draft models that captured ≥99% of the complete model genes/reactions were considered ‘good’ models, whereas draft models that captured <99% of complete model genes/reactions were considered ‘bad’ models. The optimum threshold for assembly graph dead end was determined to be ≤200. At this value, 94.44% of ‘good’ models were captured, and 0% ‘bad’ models. The optimum threshold for contig counts was determined as ≤130 contigs at which 67.92% of ‘good’ and 0% ‘bad’ models were captured (Fig. 4). The optimum threshold for N50 was determined to be ≥65000, at which 94.97% of ‘good’ and 1.71% of ‘bad’ models were captured. The assembly graph dead-end threshold results in comparatively higher sensitivity (i.e. a higher proportion of ‘good’ models pass the threshold) than contig count and comparatively better specificity (i.e. lower proportion of ‘bad’ models pass the threshold) than N50, but the underlying metric information is not universally available because many isolate genomes are deposited in public databases only as assemblies without the associated assembly graph. We therefore recommend a three-tier approach, whereby the assembly graph dead-end criterion is preferenced if available, followed by N50 and then contig count.

Line graphs showing the impact of assembly metric cut-off thresholds on model feature capture (n = 1040). ‘Good’ models which captured ≥99% of model features are shown in green, while ‘bad’ models captured <99% model features are shown in gold. The blue dotted line shows the metric cut-off thresholds, to minimize the number of models that capture <99% model features and maximise models that capture ≥99%. Metric cut-off statistics are calculated in intervals of 10 for assembly graph dead ends and contigs, and every 5000 for N50.

Impact of gap-filling models

Of the 901 draft genome assemblies which passed our QC criteria (≤200 assembly graph dead ends), 23 of the resulting draft models failed to simulate growth in M9 minimal media with glucose (despite capturing ≥99% of the genes and reactions in the corresponding complete models). It is expected that all KpSC models should be able to simulate growth on M9 media with glucose as a sole carbon source, as this central metabolism is universal amongst KpSC. To replace missing, critical reactions required for growth on M9 with glucose, we investigated model gap-filling using the patch_model command of Bactabolize. We then assessed the accuracy of the gap-filled models for prediction of growth on the full range of substrates, as compared to the predictions from the corresponding complete models.

Gap filling added 1 – 3 missing reactions to each model, with a median of one, fully restoring biomass production in M9 media with glucose in all but two of the 23 failed models. The missing reactions appeared to be random genes across these 23 genomes, likely due to missing information in these assemblies.

Substrate usage predictions from the 21 successfully gap-filled models were highly accurate, with 18/21 having a prediction concordance of ≥99% across all 846 growth conditions (12/21 had 100% concordance) (Figure S7). We therefore conclude that models generated for genome assemblies passing our QC criteria, which have been gap-filled to successfully simulate growth on minimal media plus glucose, are suitable for the prediction of growth across a range of substrates.

Predictive accuracy of draft models

We assessed the accuracy of Bactabolize for the construction of draft models for 10 novel KpSC clinical isolates, representing five of the major taxa in the complex. We included five isolates for which the associated STs were represented in the KpSC-pan v1 model and five isolates with STs that were not represented. Whole genome sequence data were generated on the Illumina platform and draft assemblies generated de novo. The resultant assemblies had 0-4 graph dead-ends, N50s of 151958-388486 bp and 83-187 contigs (Data S4), within the tiered threshold values.

FBA was performed, and the predicted growth profiles compared to matched phenotypic growth data for 16 carbon sources derived from Vitek GN ID cards. Though the number of tested carbon sources was limited, all were associated with high accuracy metrics (Fig. 5, Data S4). As expected, models for isolates with STs represented in the KpSC-pan v1 reference performed slightly better (mean accuracy = 0.98) than those for non-represented STs (mean accuracy = 0.95).

A) Comparisons of predicted to true phenotypes for 16 carbon source substrates. False-negatives, true-negatives, false-positives and true-positives are coloured as shown in legend. Each column represents a different isolate, separated by ST representation in the KpSC-pan model. B) Accuracy metrics for predicted vs phenotypic growth comparisons shown in A. Each column represents a different isolate, coloured by taxa and separated by ST representation in the KpSC-pan v1 model.

Discussion

In this work we described Bactabolize, a pipeline for rapid and scalable production of accurate bacterial strain-specific metabolic models and growth phenotype predictions. We describe a pan-reference model for the KpSC and demonstrate that a draft strain-specific model generated de novo via Bactabolize using the KpSC-pan v1 reference was highly accurate for growth phenotype prediction (85.79% accuracy for substrate usage across 190 substrates, and 80.57% for gene essentiality across 1220 genes). Importantly, we also described a quality control framework for the use of draft genome assemblies as input for metabolic reconstructions. We used a systematic analysis to; i) evaluate the proportion of gene and reaction capture compared to the corresponding ‘completed’ models; ii) define quality control thresholds for input assemblies (three tier approach for KpSC; ≤200 assembly graph dead ends, followed by ≥65000 N50, followed by ≤130 contigs); and iii) estimate the accuracy of the resultant growth predictions. While the quality control thresholds and accuracy estimates are specific to KpSC, the conceptual framework can be applied to any organism and is essential to support the confident application of metabolic modelling for large-scale genome datasets. We appreciate that assembly graphs may not be available for dead end count, e.g. for draft genome assemblies accessed via public repositories, however we encourage users to include this information in their quality control procedures wherever possible (e.g. using the recently published counter tool available at https://github.com/rrwick/GFA-dead-end-counter), because these counts represent a direct reflection of the completeness of the genome assembly. In contrast, contig counts and N50 are influenced by biological features such as repeat copy numbers as well as the underlying sequence data quality e.g. a bacterial genome harbouring many insertion sequence insertions will result in a draft assembly with a high number of contigs regardless of the sequence data quality and completeness.

Bactabolize’s reference-based reconstruction approach is reductive, meaning the resultant draft models will comprise only the genes, reactions and metabolites present in the reference, or a subset thereof, and will not include novel reactions unless they are manually identified and curated by the user. This is an important caveat that should be considered carefully for application of Bactabolize to large genome data sets, particularly for genetically diverse organisms such as those in the KpSC. The use of a pan-reference derived from multiple curated strain-specific models results in greater representation of the population diversity and partially alleviates the shortcomings of the reference-based approach. However, draft models constructed for strains with a corresponding lineage represented in the reference are likely more accurate. Our analysis indicated that a draft KpSC model generated by Bactabolize with the KpSC pan v1 reference was equally or more accurate than the current gold standard automated approach, CarveMe with universal model (30), and outperformed a manually curated model (15). The latter was constructed using the KBase pipeline (32), which uses RAST to annotate the sequences with Enzyme Commission numbers. It has been demonstrated several times that the Enzyme Commission scheme has systematic errors (58, 59), leading to a loss in accuracy when compared to the ortholog identification methods used in CarveMe and Bactabolize.

The CarveMe universal reference model captures greater diversity than the KpSC pan v1 reference, which resulted in a comparatively greater number of genes, reactions and metabolites in the corresponding CarveMe draft model, and ability to simulate growth outcomes for a greater number of distinct substrates (Figure 2). Overall, CarveMe (with the universal model) performed extremely well, with high numbers of true-positive growth predictions. However, these were also accompanied by comparatively higher numbers of false-positive predictions, which resulted in a lower overall accuracy score for substrate usage analysis compared to Bactabolize with the KpSC-pan v1 reference (Figure 2), and comparatively lower sensitivity and specificity for the gene essentiality analysis. False-positive predictions may indicate that the relevant metabolic machinery are present in the cell but were not active during the growth experiments (e.g. due to lack of gene expression). In this regard, false-positives are not always a sign of model inaccuracy. However, false-positive predictions can also occur from incorrect gene annotations e.g. due to reduced specificity of ortholog assignment resulting from the use of the universal model without manual curation. Given a key objective here is to facilitate high-throughput analysis for large numbers of genomes, it is not feasible to expect that all models will be manually curated, and therefore we believe that identifying fewer genes with lower overall error rates provides greater confidence in the resulting draft models. We also note that the CarveMe universal reference model is no longer being actively maintained, but in contrast, user defined species-(or genera-) specific references can be iteratively curated and updated to incorporate new knowledge and data as they become available. Accordingly, the accuracy of models derived from such references is expected to continually improve.

Bactabolize and the KpSC pan v1 model are freely available under open source licenses and satisfy the four features of the FAIR research principles (findability, accessibility, interoperability and reusability) (60). In addition to the KpSC pan-reference described here, a pan-reference model has been described previously for Salmonella enterica (representing 410 strains (25)). We are actively working to expand and improve the KpSC pan reference model and welcome similar efforts to generate high quality references for other organisms. Together these resources will facilitate population wide metabolic analyses for global priority pathogens, which can be used to understand how they transmit, cause disease and evolve drug-resistance, and to identify novel therapeutic targets.

Methods

Bactabolize pipeline

Bactabolize utilises the existing metabolic modelling library COBRApy (36) and Python 3 (61). All code is freely available and open source at GitHub (www.github.com/kelwyres/Bactabolize) under a GNU General Public License v3.0. Users should additionally cite COBRApy (36) if Bactabolize is used.

Klebsiella pneumoniae Species Complex-pan metabolic model

The 37 metabolic models from a previous study (14) were combined with the iY1228 model using the create_master_model.py script (available at 10.6084/m9.figshare.21728717). Briefly, all GPRs from the iYL1228 model and the associated sequences were included, as well as new GPRs identified from the 36 additional strains by manual curation following comparison to the matched phenotype data (as described in (14)). Additionally, orthologous sequence variants with <75% nucleotide identity to gene sequences associated with these gene reaction rules (GPRs) were added if there was phenotype data supporting the reaction. The biomass reaction was updated, removing the metabolites udpgalur_c and udpgal_c as their production was strain-specific.

Metadata annotations were improved using the improve_model_annotations.py script (also available in the Bactabolize code repository) resulting in the KpSC_pan v1 used in this study, available at www.github.com/kelwyres/KpSC-pan-metabolic-model.

Draft model generation

Bactabolize draft models were generated using the draft_model command in Bactabolize v1 with the KpSC-pan v1 model as a reference, and the following options:

--min_coverage 25 --min_pident 80 --media M9 --atmosphere aerobic

CarveMe draft models were generated firstly using the universal reference with the following commands: ‘-g M9 -i M9’. The --universe-file mode was also used, so the KpSC-pan model could be used as a reference, with the previously described command.

Speed calculations

Bactabolize draft_model and fba commands were timed via a script using the date +%s.%N command run before and after command on the MASSIVE computing cluster (Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz and 340 GB of memory, CentOS Linux release 7.9.2009 environment). Speed tests were also performed on a standard consumer laptop with the following hardware: Intel(R) Core(TM) i5-8365U CPU @ 1.60GHz and 15 GB of memory on Windows Subsystem for Linux (WSL1) environment.

Performance comparison

The genome of K. pneumoniae KPPR1 was obtained from Genbank under the accession: CP009208, and draft metabolic models were generated using Bactabolize and CarveMe as described above. The previously described, manually curated model for KPPR1 (iKp1289) was also included for comparison (15). The following KPPR1 phenotype data were retrieved from published studies: BIOLOG Phenotypic Microarray data (15) and single gene knockout data inferred from the outputs of a TraDIS transposon mutagenesis library (56).

A list of BIOLOG growth substrates for plates PM1, PM2A, PM3B, and PM4A (62) were converted where possible to BiGG and SEED IDs by manual search of the BiGG (bigg.ucsd.edu) and SEED websites (https://modelseed.org/biochem/compounds). An updated BiGG to SEED dictionary can be found in Data S1. A total of 143 of 190 carbon, 82 of 95 nitrogen, 46 of 59 phosphor and 26 of 35 sulfur substrates were successfully matched to BiGG and SEED IDs (Data S1). These growth data were compared to in silico predictions generated via FBA using the fba command from Bactabolize to optimise the biomass objective function with the following options:

--fba_spec_name m9 --fba_open_value -20

Gene essentiality was inferred from single gene knockout growth predictions using the sgk command from Bactabolize with the following options to mirror the growth conditions of the TraDIS library (LB media grown aerobically):

--media_type lb --atmosphere aerobic

In all cases, an objective value cut-off of ≥10⁻⁴ was used to indicate binarised growth as per previous studies (14, 63).

In silico predictions were compared to matched phenotype data and the following accuracy metrics were calculated:

Quality control framework

Illumina read sets (250 bp paired end) and completed genome sequences for 37 KpSC isolates were described previously (14). Here we randomly subsampled the Illumina reads at various depths (10 – 100, by increments of 10) using rasusa version 0.3.0 (64) in technical triplicate. Reads were then trimmed using TrimGalore version 0.5.0 (65) and assembled de novo with Unicycler version 0.4.7 (66), default parameters. Assembly statistics and assembly graph dead ends were calculated using the GFA-dead-end-counter version 1.0.0 (https://github.com/rrwick/GFA-dead-end-counter) (67). Draft metabolic models were generated with Bactabolize using the KpSC-pan v1 reference, and growth substrate profiles were predicted as described above. We compared the outputs from models generated for draft genome assemblies to those generated for the corresponding completed genomes. Where necessary models were gap-filled via the patch_model command.

Predictive accuracy of draft models

Novel growth phenotype data were generated for 10 KpSC clinical isolates from our in house collection using the VITEK 2 GN ID card system as described previously (14). Briefly, isolates were grown on Tryptic Soy (OXOID) agar plates overnight at 37°C, then analysed using VITEK 2 GN ID cards (bioMérieux) and read on the VITEK 2 Compact (bioMérieux) as per manufacturer’s instructions using software version 8.0. DNA was extracted for whole-genome sequencing via Genfind v3 extraction kit, library preparation performed using Nextera Flex (Illumina) using ¼ reagents. Paired-end read data (300 bp) were generated on an Illumina NovaSeq6000 SP v1.0 and have been deposited in the European Nucleotide Archive under Bioproject PRJNA777643 (individual read accession numbers are given in Data S4). Draft genome assemblies were generated with Unicycler, and draft metabolic models and growth predictions were generated with Bactabolize as described above.

Statistics and visualisation

Statistical analysis and graphical visualisation were performed using R version 4.0.3 [23], RStudio version 1.3.1093 (68), with the following software packages: tidyverse version 1.3.1 (69), viridis version 0.5.1 (70), RColorBrewer version 1.1-2 (71), ggpubr version 0.4.0 (72) ggpmisc version 0.4.4 (73), aplot version 0.1.6 (74), colorspace version 2.0-2 (75), ggpattern version 0.4.3-3 (76), ggtext version 0.1.1 (77) and glue version 1.4.2 (78).

Linear regression analysis was performed in R using the lm function in R and a third degree polynomial model was fitted to plots with the following equation: y ∼ poly(x, 3, raw = TRUE). The segmented linear model was fitted using segmented version 1.6-2 (79).

All code used to generate results can be found as supplemental material, https://github.com/kelwyres/Bactabolize and on Figshare (10.6084/m9.figshare.21728717).

Logo

The Bactabolize logo was constructed in Inkscape version 1.0.1 (80). The font used is Proportional TFB (81) and Element (82).

Supporting information

Author contributions

Conceptualization: BV, JH, JMM, KEH, KLW

Methodology: SCW, BV, LMJ, JH, JMM, KLW

Software: SCW, BV

Validation: BV, HBC, KLW

Formal analysis: BV, KLW

Investigation: BV, KLW Resources: AJ

Writing - Original Draft: BV, KLW

Writing - Review & Editing: All authors

Visualization: BV, KLW

Supervision: JMM, KEH, KLW

Project administration: KLW

Funding acquisition: JMM, KEH, KLW

All authors contributed to and approve of the manuscript in its current form.

Acknowledgements

We thank Sylvain Brisse for the isolates used in this study and review of the manuscript.

Supplementary figures

Flow diagram showing the overview of the draft_model module from Bactabolize, which produces draft metabolic models. Input and output files are shown in light pink while Bactabolize processes are shown in dark pink. Third-party dependencies are indicated within the white boxes.

Flow diagram showing the overview of the patch_model module from Bactabolize, which patches metabolic models that do not simulate growth. Input and output files are shown in light orange while Bactabolize processes are shown in dark orange. Third-party dependencies are indicated within the white boxes.

Flow diagram showing the overview of the fba module from Bactabolize, which performs growth simulations using Flux Balance Analysis. Input and output files are shown in light yellow while Bactabolize processes are shown in dark yellow. Third-party dependencies are indicated within white boxes. C, carbon; N, nitrogen, P, phosphorus; S, sulphur; O₂, oxygen.

Flow diagram showing the overview of the sgk module from Bactabolize, which performs Single Gene Knockout analysis. Input and output files are shown in light blue while Bactabolize processes are shown in dark blue. Third-party dependencies are indicated within white boxes.

Raincloud plot showing distributions of assembly metrics across various read subsampling depths (10x increments).

Line graph showing the capture of model features of draft assemblies (short read only) at various depths, compared to the corresponding completed genome (long-read + short read assemblies).

Faceted graphs showing the number of substrate usage (fba module) discrepancies of gap-filled models (patch_model module) which initially did not produce biomass (models which failed to simulate growth). The dots indicate percentage concordance with the completed genome model, while the columns indicate number of substrates with discrepancies (no simulated growth in patched model, but growth in completed genome model).

References

1.
1. Su Q
2. Guan T
3. He Y
4. Lv H.
2016Siderophore Biosynthesis Governs the Virulence of Uropathogenic Escherichia coli by Coordinately Modulating the Differential MetabolismJournal of Proteome Research 15:1323–32Google Scholar
2.
1. Wu Y
2. Meng Y
3. Qian L
4. Ding B
5. Han H
6. Chen H
7. et al
The Vancomycin Resistance-Associated Regulatory System VraSR Modulates Biofilm Formation of Staphylococcus epidermidis in an ica-Dependent MannermSphere 6:e00641–21Google Scholar
3.
1. Mir M
2. Prisic S
3. Kang C-M
4. Lun S
5. Guo H
6. Murry JP
7. et al
2014Mycobacterial gene cuvA is required for optimal nutrient utilization and virulenceInfection and immunity 82:4104–17Google Scholar
4.
1. Vornhagen J
2. Sun Y
3. Breen P
4. Forsyth V
5. Zhao L
6. Mobley HLT
7. et al
2019The Klebsiella pneumoniae citrate synthase gene, gltA, influences site specific fitness during infectionPLOS Pathogens 15:e1008010Google Scholar
5.
1. Eberl C
2. Weiss AS
3. Jochum LM
4. Durai Raj AC
5. Ring D
6. Hussain S
7. et al
2021E. coli enhance colonization resistance against Salmonella Typhimurium by competing for galactitol, a context-dependent limiting carbon sourceCell Host & Microbe Google Scholar
6.
1. Jenior ML
2. Dickenson ME
3. Papin JA
2021Genome-scale metabolic modeling reveals increased reliance on valine catabolism in clinical isolates of Klebsiella pneumoniae. bioRxivGoogle Scholar
7.
1. Hudson AW
2. Barnes AJ
3. Bray AS
4. Zafar MA
2022Klebsiella pneumoniae L-Fucose metabolism promotes gastrointestinal colonization and modulates its virulence determinants. bioRxivGoogle Scholar
8.
1. Rodrigues C
2. Passet V
3. Rakotondrasoa A
4. Diallo TA
5. Criscuolo A
6. Brisse S.
2019Description of Klebsiella africanensis sp. nov., Klebsiella variicola subsp. tropicalensis subsp. nov. and Klebsiella variicola subsp. variicola subsp. novRes Microbiol 170:165–70Google Scholar
9.
1. Blin C
2. Passet V
3. Touchon M
4. Rocha EPC
5. Brisse S.
2017Metabolic diversity of the emerging pathogenic lineages of Klebsiella pneumoniaeEnvironmental Microbiology 19:1881–98Google Scholar
10.
1. Brisse S
2. Fevre C
3. Passet V
4. Issenhuth-Jeanjean S
5. Tournebize R
6. Diancourt L
7. et al
2009Virulent Clones of Klebsiella pneumoniae: Identification and Evolutionary Scenario Based on Genomic and Phenotypic CharacterizationPLOS ONE 4:e4982Google Scholar
11.
1. Mobegi FM
2. Saft van Hijum
3. Burghout P
4. Bootsma HJ
5. Spw de Vries
6. Ce van der Gaast-de Jongh
7. et al
2014From microbial gene essentiality to novel antimicrobial drug targetsBMC Genomics 15:958Google Scholar
12.
1. Hogan AM
2. Scoffone VC
3. Makarov V
4. Gislason AS
5. Tesfu H
6. Stietz MS
7. et al
2018Competitive Fitness of Essential Gene Knockdowns Reveals a Broad-Spectrum Antibacterial Inhibitor of the Cell Division Protein FtsZAntimicrobial Agents and Chemotherapy 62:e01231–18Google Scholar
13.
1. Edwards JS
2. Palsson BO
1999Systems Properties of the Haemophilus influenzaeRd Metabolic GenotypeJournal of Biological Chemistry 274:17410–6Google Scholar
14.
1. Hawkey J
2. Vezina B
3. Monk JM
4. Judd LM
5. Harshegyi T
6. López-Fernández S
7. et al
2022A curated collection of Klebsiella metabolic models reveals variable substrate usage and gene essentialityGenome Research Google Scholar
15.
1. Henry CS
2. Rotman E
3. Lathem WW
4. Tyo KEJ
5. Hauser AR
6. Mandel MJ
2017Generation and Validation of the iKp1289 Metabolic Model for Klebsiella pneumoniae KPPR1The Journal of Infectious Diseases 215:S37–S43Google Scholar
16.
1. Liao Y-C
2. Huang T-W
3. Chen F-C
4. Charusanti P
5. Hong JSJ
6. Chang H-Y
7. et al
2011An Experimentally Validated Genome-Scale Metabolic Reconstruction of Klebsiella pneumoniae MGH 78578, iYL1228Journal of Bacteriology 193:1710–7Google Scholar
17.
1. Orth JD
2. Thiele I
3. Palsson BØ.
2010What is flux balance analysis?Nature Biotechnology 28:245–8Google Scholar
18.
1. Stanway RR
2. Bushell E
3. Chiappino-Pepe A
4. Roques M
5. Sanderson T
6. Franke-Fayard B
7. et al
2019Genome-Scale Identification of Essential Metabolic Processes for Targeting the Plasmodium Liver StageCell 179:1112–28Google Scholar
19.
1. Ramos PIP
2. Porto Fernández Do
3. Lanzarotti E
4. Sosa EJ
5. Burguener G
6. Pardo AM
7. et al
2018An integrative, multi-omics approach towards the prioritization of Klebsiella pneumoniae drug targetsScientific Reports 8:10755Google Scholar
20.
1. Wyres KL
2. Lam MMC
3. Holt KE
2020Population genomics of Klebsiella pneumoniaeNature Reviews Microbiology 18:344–59Google Scholar
21.
1. WHO
2017WHO publishes list of bacteria for which new antibiotics are urgently neededhttps://www.who.int/news/item/27-02-2017-who-publishes-list-of-bacteria-for-which-new-antibiotics-are-urgently-needed Google Scholar
22.
1. Gorrie CL
2. Mirčeta M
3. Wick RR
4. Judd LM
5. Lam MMC
6. Gomi R
7. et al
2022Genomic dissection of Klebsiella pneumoniae infections in hospital patients reveals insights into an opportunistic pathogenNature Communications 13:3017Google Scholar
23.
1. Holt KE
2. Wertheim H
3. Zadoks RN
4. Baker S
5. Whitehouse CA
6. Dance D
7. et al
2015Genomic analysis of diversity, population structure, virulence, and antimicrobial resistance Klebsiella pneumoniae, an urgent threat to public healthProceedings of the National Academy of Sciences 112:E3574Google Scholar
24.
1. Monk JM
2. Charusanti P
3. Aziz RK
4. Lerman JA
5. Premyodhin N
6. Orth JD
7. et al
2013Genome-scale metabolic reconstructions of multiple Escherichia coli strains highlight strain-specific adaptations to nutritional environmentsProceedings of the National Academy of Sciences 110:20338Google Scholar
25.
1. Seif Y
2. Kavvas E
3. Lachance J-C
4. Yurkovich JT
5. Nuccio S-P
6. Fang X
7. et al
2018Genome-scale metabolic reconstructions of multiple Salmonella strains reveal serovar-specific metabolic traitsNature Communications 9:3771Google Scholar
26.
1. Bosi E
2. Monk JM
3. Aziz RK
4. Fondi M
5. Nizet V
6. Palsson BØ
2016Comparative genome-scale modelling of Staphylococcus aureus strains identifies strain-specific metabolic capabilities linked to pathogenicityProceedings of the National Academy of Sciences 113:E3801–E9Google Scholar
27.
1. Bartell JA
2. Blazier AS
3. Yen P
4. Thøgersen JC
5. Jelsbak L
6. Goldberg JB
7. et al
2017Reconstruction of the metabolic network of Pseudomonas aeruginosa to interrogate virulence factor synthesisNature Communications 8:14631Google Scholar
28.
1. Croucher NJ
2. Coupland PG
3. Stevenson AE
4. Callendrello A
5. Bentley SD
6. Hanage WP
2014Diversification of bacterial genome content through distinct mechanisms over different timescalesNature Communications 5:5471Google Scholar
29.
1. Cummins EA
2. Hall RJ
3. Connor C
4. McInerney JO
5. McNally A.
2022Distinct evolutionary trajectories in the Escherichia coli pangenome occur within sequence typesMicrobial Genomics 8Google Scholar
30.
1. Machado D
2. Andrejev S
3. Tramontano M
4. Patil KR
2018Fast automated reconstruction of genome-scale metabolic models for microbial species and communitiesNucleic Acids Res 46:7542–53Google Scholar
31.
1. Seaver SMD
2. Liu F
3. Zhang Q
4. Jeffryes J
5. Faria JP
6. Edirisinghe JN
7. et al
2021The ModelSEED Biochemistry Database for the integration of metabolic annotations and the reconstruction, comparison and analysis of metabolic models for plants, fungi and microbesNucleic Acids Res 49:D575–D88Google Scholar
32.
1. Arkin AP
2. Cottingham RW
3. Henry CS
4. Harris NL
5. Stevens RL
6. Maslov S
7. et al
2018KBase: The United States Department of Energy Systems Biology KnowledgebaseNature Biotechnology 36:566–9Google Scholar
33.
1. Mendoza SN
2. Olivier BG
3. Molenaar D
4. Teusink B.
2019A systematic assessment of current genome-scale metabolic reconstruction toolsGenome Biology 20:158Google Scholar
34.
1. Tamasco G
2. Kumar M
3. Zengler K
4. Silva-Rocha R
5. da Silva RR
2022ChiMera: an easy to use pipeline for bacterial genome based metabolic network reconstruction, evaluation and visualizationBMC Bioinformatics 23:512Google Scholar
35.
1. Norsigian CJ
2. Fang X
3. Seif Y
4. Monk JM
5. Palsson BO
2020A workflow for generating multi-strain genome-scale metabolic models of prokaryotesNature Protocols 15:1–14Google Scholar
36.
1. Ebrahim A
2. Lerman JA
3. Palsson BO
4. Hyduke DR
2013COBRApy: COnstraints-Based Reconstruction and Analysis for PythonBMC Systems Biology 7:74Google Scholar
37.
1. Schellenberger J
2. Park JO
3. Conrad TM
4. Palsson BØ
2010BiGG: a Biochemical Genetic and Genomic knowledgebase of large scale metabolic reconstructionsBMC Bioinformatics 11:213Google Scholar
38.
1. Hyatt D
2. Chen G-L
3. LoCascio PF
4. Land ML
5. Larimer FW
6. Hauser LJ
2010Prodigal: prokaryotic gene recognition and translation initiation site identificationBMC Bioinformatics 11:119Google Scholar
39.
1. Keating SM
2. Waltemath D
3. König M
4. Zhang F
5. Dräger A
6. Chaouiya C
7. et al
2020SBML Level 3: an extensible format for the exchange and reuse of biological modelsMolecular Systems Biology 16:e9110Google Scholar
40.
1. Lieven C
2. Beber ME
3. Olivier BG
4. Bergmann FT
5. Ataman M
6. Babaei P
7. et al
2020MEMOTE for standardized genome-scale metabolic model testingNature Biotechnology 38:272–6Google Scholar
41.
1. Camacho C
2. Coulouris G
3. Avagyan V
4. Ma N
5. Papadopoulos J
6. Bealer K
7. et al
2009BLAST+: architecture and applicationsBMC Bioinformatics 10:421Google Scholar
42.
1. Hernández-Salmerón JE
2. Moreno-Hagelsieb G.
2020Progress in quickly finding orthologs as reciprocal best hits: comparing blast, last, diamond and MMseqs2BMC Genomics 21:741Google Scholar
43.
2009BD. Difco™ & BBL™ Manual, 2nd Edition. In: BD, editor. 2009.Google Scholar
44.
1. Biosciences BD.
2015BD Bionutrients Technical Manual: BD Biosciences – Advanced BioprocessingBD□Biosciences Google Scholar
45.
1. Loginova LI
2. Manuilova VP
3. Tolstikov VP
1974Content of free amino acids in peptone and the dynamics of their consumption in the microbiological synthesis of dextranPharmaceutical Chemistry Journal 8:249–51Google Scholar
46.
1. ThermoFisherScientific
2019Technical guide to peptones, supplements, and feeds: Enhancing performance of mammalian and microbial bioprocessesThermoFisherScientific Google Scholar
47.
1. Clausen E
2. Gildberg A
3. Raa J.
1985Preparation and testing of an autolysate of fish viscera as growth substrate for bacteriaApplied and Environmental Microbiology 50:1556–7Google Scholar
48.
1. Hagely KB
2. Palmquist D
3. Bilyeu KD
2013Classification of Distinct Seed Carbohydrate Profiles in SoybeanJournal of Agricultural and Food Chemistry 61:1105–11Google Scholar
49.
1. Choct M
2. Dersjant-Li Y
3. McLeish J
4. Peisker M.
2010Soy Oligosaccharides and Soluble Non-starch Polysaccharides: A Review of Digestion, Nutritive and Anti-nutritive Effects in Pigs and PoultryAsian-Australasian Journal of Animal Sciences :23Google Scholar
50.
1. Tomé D.
2021Yeast Extracts: Nutritional and Flavoring Food IngredientsACS Food Science & Technology 1:487–94Google Scholar
51.
1. Plata MR
2. Koch C
3. Wechselberger P
4. Herwig C
5. Lendl B.
2013Determination of carbohydrates present in Saccharomyces cerevisiae using mid-infrared spectroscopy and partial least squares regressionAnal Bioanal Chem 405:8241–50Google Scholar
52.
1. Liu Y
2. Huang G
3. Lv M.
2018Extraction, characterization and antioxidant activities of mannan from yeast cell wallInternational Journal of Biological Macromolecules 118:952–6Google Scholar
53.
1. Blagović B.
2001Lipid Composition of Brewer’s YeastFood Technology and Biotechnology 39:175–81Google Scholar
54.
1. Blagović B.
2. Mesarić M
3. Marić V
4. Rupčić J.
2005Characterization of lipid components in the whole cells and plasma membranes of baker’s YeastCroatica Chemica Acta 78:479–84Google Scholar
55.
1. Avramia I
2. Amariei S.
2021Spent Brewer’s Yeast as a Source of Insoluble β-GlucansInternational Journal of Molecular Sciences 22Google Scholar
56.
1. Short Francesca L
2. Di Sario G
3. Reichmann Nathalie T
4. Kleanthous C
5. Parkhill J
6. Taylor Peter W
7. et al
2020Genomic Profiling Reveals Distinct Routes To Complement Resistance in Klebsiella pneumoniaeInfection and Immunity 88:e00043–20Google Scholar
57.
1. Powers DMW.
2020Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlationarXiv Google Scholar
58.
1. Green ML
2. Karp PD
2005Genome annotation errors in pathway databases due to semantic ambiguity in partial EC numbersNucleic Acids Res 33:4035–9Google Scholar
59.
1. Rembeza E
2. Engqvist MKM
2021Experimental and computational investigation of enzyme functional annotations uncovers misannotation in the EC 1.1.3.15 enzyme classPLOS Computational Biology 17:e1009446Google Scholar
60.
1. Wilkinson MD
2. Dumontier M
3. Aalbersberg IJ
4. Appleton G
5. Axton M
6. Baak A
7. et al
2016The FAIR Guiding Principles for scientific data management and stewardshipScientific Data 3:160018Google Scholar
61.
1. Van Rossum G
2. Drake F. L.
Python 3 Reference ManualScotts Valley :CA2009Google Scholar
62.
1. BIOLOG
2020Phenotype MicroArraysBIOLOG Google Scholar
63.
1. Norsigian CJ
2. Attia H
3. Szubin R
4. Yassin AS
5. Aziz RK
6. et al
2019Comparative Genome-Scale Metabolic Modeling of Metallo-Beta-Lactamase–Producing Multidrug-Resistant Klebsiella pneumoniae Clinical IsolatesFrontiers in Cellular and Infection Microbiology 9Google Scholar
64.
1. Hall MB.
2019Rasusa: Randomly subsample sequencing reads to a specified coveragehttps://github.com/mbhall88/rasusa Google Scholar
65.
1. Krueger F.
Trim Galore GitHub2012https://github.com/FelixKrueger/TrimGalore Google Scholar
66.
1. Wick RR
2. Judd LM
3. Gorrie CL
4. Holt KE
2017Unicycler: Resolving bacterial genome assemblies from short and long sequencing readsPLOS Computational Biology 13:e1005595Google Scholar
67.
1. Wick RR.
2023Dead-end count for QC of short-read assembliesGoogle Scholar
68.
1. RStudio-Team
2020RStudio: Integrated Development for R. BostonMA RStudio Google Scholar
69.
1. Wickham H
2. Averick M
3. Bryan J
4. Chang W
5. McGowan LDA
6. François R
7. et al
2019Welcome to the TidyverseJournal of Open Source Software 4:1686Google Scholar
70.
1. Garnier S.
2018viridis: Default Color Maps from ‘matplotlib’Google Scholar
71.
1. Neuwirth E.
RColorBrewer: ColorBrewer Palettes:1–2Google Scholar
72.
1. Kassambara A.
ggpubr: ‘ggplot2’ Based Publication Ready Plots:0.5.0Google Scholar
73.
1. Aphalo PJ
2. Slowikowski K
3. Mouksassi S.
ggpmisc: Miscellaneous Extensions to ‘ggplot2’. 0.4.1 ed2021Google Scholar
74.
1. Yu G.
aplot: Decorate a ‘ggplot’ with Associated Information:0.0.6Google Scholar
75.
1. Zeileis A
2. Fisher JC
3. Hornik K
4. Ihaka R
5. McWhite CD
6. Murrell P
7. et al
2020colorspace: A Toolbox for Manipulating and Assessing Colors and PalettesJournal of Statistical Software 96:1–49Google Scholar
76.
1. F.C. M
2. Davis TL
2022Package ‘ggpattern’Google Scholar
77.
1. Wilke CO
ggtext: Improved Text Rendering Support for ‘ggplot2’:0.1.1Google Scholar
78.
1. Hester J
2. Bryan J
RStudio. glue: Interpreted String Literals:1.6.2Google Scholar
79.
1. Muggeo VMR
segmented: Regression Models with Break-Points / Change-Points (with Possibly Random Effects) Estimation:1–2Google Scholar
80.
1. The-Inkscape-Team
Inkscape:1.0.1Google Scholar
81.
1. zanatlija
2012Proportional TFBGoogle Scholar
82.
1. weknow
2015ElementGoogle Scholar

Article and author information

Author information

Ben Vezina
Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Victoria, Australia
ORCID iD: 0000-0003-4224-2537
- Corresponding authors (benjamin.vezina@monash.edu, kelly.wyres@monash.edu)
- These authors contributed equally
Stephen C. Watts
Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Victoria, Australia
ORCID iD: 0000-0001-7084-635X
- These authors contributed equally
Jane Hawkey
Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Victoria, Australia
ORCID iD: 0000-0001-9661-5293
Helena B. Cooper
Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Victoria, Australia
ORCID iD: 0000-0002-7096-0010
Louise M. Judd
Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Victoria, Australia
ORCID iD: 0000-0003-3613-4839
Adam Jenney
Microbiology Unit, Alfred Health, Melbourne, Victoria, Australia
ORCID iD: 0000-0002-5237-2506
Jonathan M. Monk
Department of Bioengineering, University of California, San Diego, CA, United States of America
ORCID iD: 0000-0002-3895-8949
Kathryn E. Holt
Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Victoria, Australia, Department of Infection Biology, London School of Hygiene and Tropical Medicine, London, UK
ORCID iD: 0000-0003-3949-2471
Kelly L. Wyres
Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Victoria, Australia
ORCID iD: 0000-0002-4033-6639
- Corresponding authors (benjamin.vezina@monash.edu, kelly.wyres@monash.edu)

Version history

Preprint posted: February 27, 2023
Sent for peer review: March 7, 2023
Reviewed Preprint version 1: May 30, 2023
Reviewed Preprint version 2: September 25, 2023
Version of Record published: October 10, 2023

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.87406. This DOI represents all versions, and will always resolve to the latest one.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

views: 4,129
downloads: 211
citations: 15

Views, downloads and citations are aggregated across all versions of this paper published by eLife.

Significance of findings

Strength of evidence

Abstract

Introduction

Results

Description of Bactabolize

KpSC pan-metabolic reference model

Performance comparison

Quality control framework for input genome assemblies

Impact of gap-filling models

Predictive accuracy of draft models

Discussion

Methods

Bactabolize pipeline

Klebsiella pneumoniae Species Complex-pan metabolic model

Draft model generation

Speed calculations

Performance comparison

Quality control framework

Predictive accuracy of draft models

Statistics and visualisation

Logo

Supporting information

Author contributions

Acknowledgements

Supplementary figures

References

Article and author information

Author information

Ben Vezina+

Stephen C. Watts+

Jane Hawkey

Helena B. Cooper

Louise M. Judd

Adam Jenney

Jonathan M. Monk

Kathryn E. Holt

Kelly L. Wyres

Version history

Cite all versions

Copyright

Metrics

Ben Vezina

Stephen C. Watts