Phylogeny-focused substrate prediction for pyoverdine synthetase assembly lines.
a. Information from 101 reference A domains with known amino acid substrates was used to develop an algorithm that predicts substrates from A domain sequence data with high accuracy. The challenge is to group the variable A domains into clusters that predict the same substrate (clustering quality is captured by the silhouette index). To find the most discriminating algorithm, we combined different feature sequences of A domains (Amotif) with different distance and linkage methods in our hierarchical clustering analyses. The best performing path is shown in pink. b. Heatmaps showing the hierarchically clustered distances of the 101 reference A domains as a function of the feature sequence used. The color bar in the lower right shows the sequence distance. Left panel: complete A domain sequences. Middle panel: Amotif3-6 sequences. Right panel: Amotif4-5 sequences. The heatmaps show that hierarchical clustering associated sequence distances with substrates most reliably when using the Amotif4-5 sequences. The experimentally validated substrates are shown on top of each heatmap and consist of 13 amino acids and 2 amino acid derivatives: Ser (Serine), Lys (Lysine), Thr (Threonine), Ala (Alanine), Gly (Glycine), Gln (Glutamine), Arg (Arginine), Asp (Aspartic acid), Glu (Glutamic acid), Tyr (Tyrosine), Orn (Ornithine), Rsc (succinic acid derivatives), Dab (2,4-diaminobutyric acid), OHAsp (Aspartic acid derivative), and OHOrn (a general term for three ornithine derivatives: FoOHOrn, AcOHOrn, cOHOrn). c. Phylogeny-focused substrate prediction pipeline for query A domains (grey circle) based on Amotif4-5 feature sequence comparisons. X1 and X2 represent the feature distances between the query A domain and its two closest reference A domains (blue and green circles), respectively. Three rules, based on the feature sequence distances X1 and X2 and a threshold value of 0.7 (50% similarity), are used to make substrate predictions for the query A domain.
There are three possible outcomes: unambiguous substrate prediction (blue or green squares), ambiguous substrate prediction (dual-colored squares), and no prediction (“unknown”). d. Phylogenetic tree of 20 Pseudomonas strains and visualization of their predicted and actual pyoverdine structures to validate our phylogeny-focused substrate prediction pipeline. Strains in red font have novel (not yet characterized) pyoverdine structures. 151 out of the 160 substrates (94.4%) were correctly predicted. The nine inconsistencies are boxed in blue (Lysine and Ornithine are indistinguishable), in dashed black (detection of “unknown” substrates), and in red (true mismatches). Note that our prediction pipeline (like any other pipeline) cannot distinguish between modified variants of the same amino acid; for example, it cannot distinguish the three derivatives of ornithine (FoOHOrn, AcOHOrn, cOHOrn).
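
For readers unfamiliar with the silhouette index mentioned in panel a, the following is a minimal sketch of how it can be computed from a pairwise distance matrix and substrate labels. It uses the standard textbook definition and is not taken from the authors' code; the function name `silhouette_index`, the toy distance values, and the convention of scoring singleton groups as 0 are assumptions made for illustration only.

```python
from statistics import mean


def silhouette_index(dist, labels):
    """Mean silhouette width for items grouped by label.

    dist   : square, symmetric matrix (list of lists) of pairwise distances
    labels : one group label per item (here: the known substrate)
    """
    # Collect the item indices belonging to each label.
    groups = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, []).append(idx)
    if len(groups) < 2:
        raise ValueError("the silhouette index needs at least two groups")

    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in groups[lab] if j != i]
        if not own:
            # Assumed convention: a singleton group scores 0.
            widths.append(0.0)
            continue
        # a: mean distance to the other members of the item's own group.
        a = mean(dist[i][j] for j in own)
        # b: mean distance to the nearest other group.
        b = min(
            mean(dist[i][j] for j in members)
            for other, members in groups.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return mean(widths)


# Hypothetical toy data: four A domains with made-up feature distances.
toy_dist = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.8],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]
well_grouped = silhouette_index(toy_dist, ["Ser", "Ser", "Lys", "Lys"])
badly_grouped = silhouette_index(toy_dist, ["Ser", "Lys", "Ser", "Lys"])
print(well_grouped > badly_grouped)
```

Values close to 1 indicate that domains sharing a substrate lie close together and far from domains with other substrates, which is the property used in panel a to compare feature sequences, distance measures, and linkage methods.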