Scheme depicting our new genome mining pipeline to precisely predict the biosynthesis, the molecular structure and the uptake machinery of pyoverdines, a family of iron-scavenging siderophores produced by members of the Pseudomonas genus.

The grey rounded outer rectangle represents a bacterial cell. The red and blue arrow-shaped boxes stand for the synthetase and receptor genes for pyoverdines, respectively. Synthetase genes are transcribed and translated to form the n-modular NRPS enzymes. These enzymes synthesize the peptide backbone of pyoverdine through an assembly line using their repeating module units, with the A domain being responsible for substrate selection and the E domain for chirality. The n-substrate siderophores are then exported to the extracellular space for iron chelation. Membrane-embedded TonB-dependent receptors recognize the ferri-siderophore complex and import it into the cell. Bold black text and black arrows describe the multi-step computational methods we developed to reconstruct the entire process from genome sequence data. First, the annotation pipeline was improved (building on antiSMASH, ref. 23) to extract the complete sequence of pyoverdine synthetase genes from draft genomes. Second, NRPSMotifFinder was used to define A and E domains and to determine the exact motif-intermotif structure of the pyoverdine assembly line. Third, intermotif regions most indicative of substrate specificity were used to develop a phylogeny-focused method for precise product prediction. Fourth, a sequence-region-based annotation method was combined with genome architecture features to identify FpvA, the receptors responsible for ferri-pyoverdine import.

Improved annotation pipeline reveals a vast diversity of pyoverdine synthetase genes.

a. Improved annotation pipeline based on the raw annotation from antiSMASH (ref. 23). b. The annotation pipeline was applied to 9599 Pseudomonas genomes (94% draft genomes). Genomes were separated into three categories. Yellow: genomes without a pyoverdine cluster. Green: genomes with a complete pyoverdine cluster. Red: genomes with an incomplete pyoverdine synthetase cluster. The red category comprised genomes with truly incomplete clusters (lacking Flu or Pep synthetic genes) and genomes with likely truncated synthetic genes at the edge of contigs. c. Distributions of the sequence length (upper panel) and the number of A domains (lower panel) across all genomes with a complete synthetase cluster. d. Workflow applied to separate the 9599 Pseudomonas genomes into the three categories described in b and to remove redundant genomes that are phylogenetically highly similar and carry highly similar pyoverdine synthetases. The red star indicates the start of the workflow. e. Phylogenetic tree depicting the relationship among the 1928 non-redundant Pseudomonas strains (1664 producers and 264 non-producers) based on the concatenated alignment of 400 single-copy conserved genes in their genomes. The inner ring depicts the taxonomic classification, including the four most prevalent species. The outer ring shows the number of A domains present in the pyoverdine synthetase assembly line of each strain.

Phylogeny-focused substrate prediction for pyoverdine synthetase assembly lines.

a. Information from 101 reference A domains with known amino acid substrates was used to develop an algorithm that predicts substrates from A domain sequence data with high accuracy. The challenge is to group the variable A domains into clusters that predict the same substrate (captured by the silhouette index). To find the most distinctive algorithm, we combined different feature sequences of A domains (Amotif) with different distance and linkage methods in our hierarchical clustering analyses. The best performing path is shown in pink. b. Heatmap showing the hierarchically clustered distances of the 101 reference A domains as a function of the feature sequence used. The color bar in the lower right shows the sequence distance. Left panel: complete A domain sequences. Middle panel: Amotif3-6 sequences. Right panel: Amotif4-5 sequences. The heatmaps show that hierarchical clustering associated sequence distances with substrates most reliably when based on Amotif4-5 sequences. The experimentally validated substrates are shown on top of each heatmap and consist of 13 amino acids and 2 amino acid derivatives: Ser (Serine), Lys (Lysine), Thr (Threonine), Ala (Alanine), Gly (Glycine), Gln (Glutamine), Arg (Arginine), Asp (Aspartic acid), Glu (Glutamic acid), Tyr (Tyrosine), Orn (Ornithine), Rsc (succinic acid derivatives), Dab (2,4-diaminobutyric acid), OHAsp (Aspartic acid derivative), and OHOrn (a general term for three ornithine derivatives: FoOHOrn, AcOHOrn, cOHOrn). c. Phylogeny-focused substrate prediction pipeline for query A domains (grey circle) based on Amotif4-5 feature sequence comparisons. X1 and X2 represent the feature distances between the query A domain and its two closest reference A domains (blue and green circles), respectively. Three rules, based on the feature sequence distances X1 and X2 and a threshold value of 0.7 (50% similarity), are used to make substrate predictions for the query A domain. There are three possible outcomes: unambiguous substrate prediction (blue or green squares), ambiguous substrate prediction (dual-colored squares), and no prediction (“unknown”). d. Phylogenetic tree of 20 Pseudomonas strains and visualization of their predicted and actual pyoverdine structures to validate our phylogeny-focused substrate prediction pipeline. Strains in red font are cases with novel (not yet characterized) pyoverdine structures. 151 out of the 160 substrates (94.4%) were correctly predicted. The nine inconsistencies are boxed in blue (Lysine and Ornithine are indistinguishable), in dashed black (detection of “unknown” substrates), and in red (true mismatches). Note that our prediction pipeline (like any other pipeline) cannot distinguish between modified variants of the same amino acid. For example, it cannot distinguish the three derivatives of ornithine (FoOHOrn, AcOHOrn, cOHOrn).
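
The three-outcome decision in panel c can be illustrated with a minimal sketch. The legend states only that three rules use X1, X2 and the 0.7 threshold; it does not spell the rules out. The specific rules below, the function name `predict_substrate`, and the "Sub1/Sub2" notation for a dual prediction are therefore assumptions made for illustration, not the published method.

```python
from typing import List, Tuple

THRESHOLD = 0.7  # feature-distance cutoff quoted in the legend


def predict_substrate(
    distances: List[Tuple[str, float]], threshold: float = THRESHOLD
) -> str:
    """Predict a substrate from (reference substrate, feature distance) pairs.

    The three rules here are ASSUMED for illustration; the legend only says
    that three rules based on X1, X2 and the threshold exist.
    """
    if len(distances) < 2:
        raise ValueError("need at least two reference A domains")
    # X1 and X2 are the distances to the two closest reference A domains.
    ranked = sorted(distances, key=lambda pair: pair[1])
    (sub1, x1), (sub2, x2) = ranked[0], ranked[1]
    # Assumed rule 1: even the nearest reference is too distant -> no prediction.
    if x1 > threshold:
        return "unknown"
    # Assumed rule 2: only one reference is close, or both agree -> unambiguous.
    if x2 > threshold or sub1 == sub2:
        return sub1
    # Assumed rule 3: two close references that disagree -> dual prediction.
    return f"{sub1}/{sub2}"


# Hypothetical query whose two nearest references disagree.
print(predict_substrate([("Lys", 0.2), ("Orn", 0.3), ("Ser", 0.9)]))
```

Under these assumed rules, a dual prediction corresponds to the dual-colored squares in panel c, and the Lys/Orn case mirrors the blue-boxed inconsistencies in panel d.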

Predicted pyoverdine structural diversity based on our developed algorithm mapped onto the phylogenetic tree of all 1928 (non-redundant) Pseudomonas strains.

In this figure, the strains are arranged based on the phylogenetic tree without showing the tree itself. The stacked boxes in the outermost circle show the predicted structure of pyoverdines, whereby each color represents a specific amino acid substrate. Strains without boxes represent non-producers (n = 264). Boxes with two colors indicate cases of ambiguous (dual) substrate prediction. The inner circle shows the taxonomic species classification following Figure 2e. Because the allocation of strains to species names is often imprecise, we divided the 1928 strains by their phylogenetic distance into 18 clades (color shadings in inner-most circle), out of which 13 contained more than one strain. Lines within the inner-most circle link strains from different clades that share the same pyoverdine structures, whereby line colors represent the shared unique pyoverdines. The bending of the lines represents the phylogenetic sequence distances of the connected strain pairs.

A sequence-region-based identification pipeline for annotating FpvA receptors.

a. Heatmap displaying the hierarchically clustered sequence distances (p-distance calculation method, identity (%) = (1 - sequence distance) * 100) of 35 reference siderophore receptors identified in Pseudomonas spp., based on full sequences. No clear discrimination between FpvA, FpvB and other receptors is possible. The order of receptors is consistent across panels (b), (e), and (f). b. The pHMM scores of the three standard receptor domains (STN, Plug, and TonBDR) vary across the 35 reference sequences (A: FpvA, B: FpvB and NA: others), but do not allow the receptor groups to be distinguished. c. FpvA region-based conservation scores from a multiple sequence alignment of the 35 reference sequences mapped to the FpvA sequence of strain P. aeruginosa PAO1. All residues within the top 10% of the conservation score are denoted with black dots. For each region flanked by two black dots, we calculated the FpvA identification score (heatmap), representing the ability to distinguish FpvA from non-FpvA receptors. d. Mapping of the two regions with the highest FpvA identification scores, R1 (dark red) and R2 (orange), to the crystal structure of FpvA from PAO1 conjugated with pyoverdine (PDB 2IAH). e. Heatmap showing the hierarchically clustered sequence distances of 35 reference siderophore receptors based on the R1 sequence region. A clear discrimination between FpvA/FpvB and other receptors emerges. f. Heatmap showing the hierarchically clustered sequence distances of 35 reference siderophore receptors based on the R2 sequence region. A clear discrimination between FpvA and FpvB receptors emerges. g. The pHMM scores of regions R1 and R2 for the 35 siderophore reference receptors are contrasted against each other, yielding a clear separation between FpvA, FpvB and other receptors. Dashed lines indicate the pHMM threshold scores used for later analysis. h. Flowchart showing all steps involved in the FpvA annotation from genome sequence data. The red star indicates the start of the workflow.
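
The distance measure in panel a can be made concrete with a short sketch: the p-distance is the fraction of differing positions between two aligned sequences, and the legend's formula converts it to percent identity. The function names and the handling of gap columns (skipped here, i.e. pairwise deletion) are assumptions. The legend does not state how gaps were treated.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of differing positions between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a, seq_b):
        # Assumption: columns with a gap in either sequence are ignored.
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return mismatches / compared


def identity_percent(distance: float) -> float:
    """identity (%) = (1 - sequence distance) * 100, as given in the legend."""
    return (1 - distance) * 100


# Toy aligned fragments (hypothetical, not real receptor sequences).
d = p_distance("ACDE", "ACDF")
print(d, identity_percent(d))
```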

Application of the receptor annotation pipeline to the full database.

a. Applying the receptor annotation pipeline to the genomes of the 1928 non-redundant Pseudomonas strains yields 14301 Fpv-like receptors, which segregate into 4547 FpvA receptors (red box), 615 FpvB receptors, and 9139 other receptors, based on the pHMM score thresholds for regions R1 and R2. The heatmap indicates receptor density. b. Sequence distance matrix between the 35 reference sequences (y-axis) and the 4547 annotated FpvA sequences in the full database (x-axis). Database sequences were ordered by hierarchical clustering and segregated into 114 groups. 2254 of the annotated FpvA sequences have low sequence identity (< 60%) to the reference receptors, pointing to many novel subtypes of FpvA receptors. c. Genomic distance (in base pairs) between each Fpv-like receptor sequence and its pyoverdine peptide synthetase gene (Pep) for annotated FpvA receptors (upper panel), FpvB receptors (middle panel) and other receptors (lower panel). d. Distribution of the genomic distance between each FpvA receptor and its nearest pyoverdine peptide synthetase, depending on whether the annotated FpvA receptor has high sequence similarity (blue, ≥ 50%) or low sequence similarity (yellow, < 50%) with at least one of the 21 reference FpvAs. e. FpvA region-based conservation scores from a multiple sequence alignment of all annotated FpvA receptors that are proximate (< 20 kbp) to the pyoverdine synthetase cluster, mapped to the FpvA sequence of strain P. aeruginosa PAO1. All residues within the top 10% of the conservation score are denoted with black dots. For each region flanked by two black dots, we calculated the group identification score (heatmap, lower panel), representing the ability of the region to distinguish between different groups of FpvA receptors. Four regions in the plug domains had a particularly high group identification score (together called the feature sequence). They are mapped to the crystal structure of FpvA from PAO1 conjugated with pyoverdine (PDB 2IAH, upper panel). All four regions surround the pyoverdine transmission channel and are shown in the respective heatmap color. f. Heatmap showing the hierarchically clustered distances between the 4547 annotated FpvA receptors based on the feature sequence (comprising the four regions with the highest group identification scores). The analysis identifies 94 receptor groups with a 70% identity threshold. g. The diversity of FpvA receptors across the 13 phylogenetic clades containing more than one strain. Receptor diversity was calculated as the Shannon entropy, similar to alpha-diversity in microbial communities.
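
The receptor diversity measure in the final panel can be sketched as the Shannon entropy of receptor-group frequencies within a clade. This is a minimal illustration: the function name, the group labels, and the use of the natural logarithm are assumptions, since the legend does not state the logarithm base.

```python
import math
from collections import Counter
from typing import Iterable


def shannon_entropy(group_labels: Iterable[str]) -> float:
    """Shannon entropy H = -sum(p_i * ln(p_i)) over receptor-group frequencies."""
    counts = Counter(group_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one receptor is required")
    # Assumption: natural logarithm; another base only rescales the values.
    return -sum((n / total) * math.log(n / total) for n in counts.values())


# Hypothetical clades: one dominated by a single receptor group, one evenly mixed.
uniform_clade = ["group_1"] * 8
mixed_clade = ["group_1", "group_2", "group_3", "group_4"] * 2
print(shannon_entropy(uniform_clade), shannon_entropy(mixed_clade))
```

A clade in which every strain carries a receptor from the same group has zero entropy, and entropy rises as receptors spread more evenly over more groups.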

Clustering heatmap of sequence distance matrix between A domains from 1664 strains and 101 reference A domains. The color on top indicates the substrate of reference A domains.

Clustering heatmap of the sequence distance matrix between 318 Amotif4-5 sequences from Pseudomonas strains in the MIBiG database (Supplementary_table4). MIBiG is a standard annotation database of biosynthetic gene clusters. We retrieved all secondary metabolites produced by members of the Pseudomonas genus from MIBiG. The colors on top of the heatmap indicate the substrate of the A domains. We focus on domains with close sequence distance but distinct substrate specificity (marked with red text). Checking the original references showed that these domains were labeled incorrectly in MIBiG (Supplementary_table5).

Comparison of prediction accuracy in the structural composition of secondary metabolites between our approach and antiSMASH in Burkholderiales. Each data point in the violin plot symbolizes a distinct secondary metabolite. The median is denoted by a red circle, while the mean is represented by a blue ‘×’.

Comparison of clustering and grouping effects of complete sequences (a) and feature sequences (b) of all proximate-receptors.

Applying the receptor annotation pipeline to >30,000 complete bacterial genomes in NCBI based on the pHMM score thresholds for regions R1 and R2. The heatmap indicates receptor density.

The distribution of 31,936 FpvAs spans various taxonomic categories. Taxa with an abundance exceeding 1/500 are highlighted with colors, while others remain white. Taxa with an abundance exceeding 1/50 are labeled with their names.