Figures and data in Deep evolutionary analysis reveals the design principles of fold A glycosyltransferases


7 figures, 1 table and 4 additional files

Figures

Figure 1


Glycosyltransferase (GT) folds and mechanisms.

Top: The three representative structural folds of GTs. The GT-A fold is characterized by a single globular domain that contains an α/β/α Rossmann nucleotide binding domain (shown: 2rj7; GT6). The GT-B fold enzymes are usually metal independent and contain two α/β/α domains separated by a flexible linker region, with the substrate binding cleft in between (shown: 1jg7; GT63). The GT-C fold enzymes are hydrophobic integral membrane proteins that generally use lipid phosphate linked sugar donors and have multiple transmembrane helices (shown: 6gxc; GT66). Bottom: The mechanism of sugar transfer employed by GTs. Inverting GTs follow a direct displacement S_N2-like mechanism that results in an inverted anomeric configuration. The mechanism for retaining GTs is still under debate, although a same-side S_Ni-type reaction has recently been proposed in which the donor phosphate oxygen acts as a catalytic base and deprotonates the acceptor hydroxyl, facilitating a same-side attack that results in retention of the anomeric configuration. The enzyme and catalytic base B are shown in orange. A generic hexose with an α-linkage to a nucleoside diphosphate is used. Other mechanisms possibly employed by GTs are discussed in detail in M.

Figure 1—source data 1 List of CAZy GT families. The structural fold and the number of sequences from each taxonomic group are shown. The numbers of sequences that have a structure or are characterized are also provided.: https://cdn.elifesciences.org/articles/54532/elife-54532-fig1-data1-v2.xlsx

Figure 2 with 3 supplements


The GT-A common core and its elements.

(A) Schematic of the GT-A common core with 231 aligned positions. Conserved secondary structures (red α-helices, blue β-sheets, green loops) and hypervariable regions (HVs; orange) are shown. The conservation score for each aligned position is plotted in the line graph above the schematic. Evolutionarily constrained regions in the core, namely the hydrophobic positions (yellow) and the active site residues (DxD: cyan, xED: magenta, G-loop: green, C-His: olive), are highlighted above the positions. (B) The conserved secondary structures and the locations of the HVs are shown in the N-terminal GT2 domain of the multidomain chondroitin polymerase structure from E. coli (PDB: 2z87), which is used as a prototype because it displays the closest similarity to the common core consensus. (C) Active site residues of the prototypic GT-A structure. The metal ion and donor substrate are shown as a brown sphere and sticks, respectively. (D) Architecture of the hydrophobic core (yellow: core conserved in all Rossmann fold containing enzymes, red: core elements present only in the GT-A fold). Residues are labeled based on their aligned positions. Numbers within parentheses indicate their position in the prototypic (PDB: 2z87) structure.

Figure 2—figure supplement 1


Structure based sequence alignment showing the hydrophobic residue positions present across a collection of Rossmann fold like enzymes.

The conserved hydrophobic positions are highlighted in yellow blocks. The aligned positions indicated at the top correspond to the aligned positions in Figure 2D. The alignment extends until the DxD motif. Other regions were left unaligned due to very low homology.

Figure 2—figure supplement 2


Changes in the extended hydrophobic core residues in selected retaining families.

(A) The conserved hydrophobic core in the prototypic GT (2z87). (B and C) A hydrophobic residue in the core is substituted by an arginine in GT15 and by a glutamate in GT55, respectively. The charged residue replacing the hydrophobic residue of the core is highlighted in red sticks. The xED motif is shown in magenta.

Figure 2—figure supplement 3


Comparison of structures for HV regions across GT-A families.

The GT-A common core is shown in surface representation in the middle. HVs are shown in shades of orange (HV1: light orange, HV2: dark orange, HV3: orange red). Root Mean Square Deviation (RMSD) was calculated by aligning the core GT-A domains of representative structures with and without the HVs. A significant reduction in the RMSD values was observed after removing the HVs, as shown in the box plot in the center. *p-value<0.0001, t-test.
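
The RMSD comparison described in this caption can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes the two structures are already superposed and share one coordinate per aligned position, whereas the caption describes realigning the domains with and without the HVs. The function name, coordinates and HV indices are hypothetical toy values.

```python
import math

def rmsd(coords_a, coords_b, keep=None):
    """RMSD between two equal-length lists of (x, y, z) tuples.

    Assumes the structures are already superposed and that index i
    refers to the same aligned position in both. `keep` optionally
    restricts the calculation to a subset of position indices.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have the same length")
    indices = range(len(coords_a)) if keep is None else keep
    squared = [
        sum((p - q) ** 2 for p, q in zip(coords_a[i], coords_b[i]))
        for i in indices
    ]
    if not squared:
        raise ValueError("no positions selected")
    return math.sqrt(sum(squared) / len(squared))

# Toy data (hypothetical): four aligned positions, the last one in an HV.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
hv_positions = {3}
core_only = [i for i in range(len(structure_a)) if i not in hv_positions]

rmsd_with_hvs = rmsd(structure_a, structure_b)
rmsd_without_hvs = rmsd(structure_a, structure_b, keep=core_only)
```

In this toy case the only deviation sits in the HV position, so dropping it brings the RMSD to zero; with real structures a superposition step would normally be repeated for each residue subset.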

Figure 3 with 3 supplements


Phylogenetic tree highlighting the 53 major GT-A fold subfamilies.

Tips in this tree represent GT-A sub-families condensed from the original tree for illustration. Support values are indicated using different circles. Circles at the tips indicate bootstrap support for the GT-A family clade represented by that tip. Tips missing the circles represent GT-A families that do not form a single monophyletic clade. Nodes missing circles have a bootstrap support of less than 50% and are unresolved. Icon labels indicate the taxonomic diversity of that sub clade. Colors indicate the mechanism for the families (blue: inverting, red: retaining). This condensed tree was generated by collapsing clades to the deepest node that includes sequences from the same family. For GT-A families that did not form a monophyletic clade, the clade that included the most sequences from that family was chosen. Branch lengths may approximate the original distances, but are not drawn to scale. A detailed tree with support values, expanded nodes and scaled branch lengths is provided in Figure 3—figure supplement 1 and in Newick format in Figure 3—source data 4. The family names are described in Figure 3—source data 1.
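
The clade-collapsing rule in this caption can be sketched on a toy tree. This is an illustration under stated assumptions, not the authors' pipeline: the tree is a nested tuple rather than a parsed Newick file, each leaf is just a family label, and the function names and the tree topology are hypothetical.

```python
from collections import Counter

def leaf_families(node):
    """Return the family labels under a node (leaf = str, internal = tuple)."""
    if isinstance(node, str):
        return [node]
    labels = []
    for child in node:
        labels.extend(leaf_families(child))
    return labels

def largest_pure_clades(tree):
    """Map each family to the size of its largest single-family clade."""
    best = {}

    def visit(node):
        labels = leaf_families(node)
        if len(set(labels)) == 1:
            # Most inclusive node containing only this family: collapse here.
            family = labels[0]
            best[family] = max(best.get(family, 0), len(labels))
            return
        for child in node:
            visit(child)

    visit(tree)
    return best

# Hypothetical topology: GT2 is monophyletic, GT8 is split by a GT6 sequence.
toy_tree = ((("GT2", "GT2"), "GT2"), ((("GT8", "GT8"), "GT6"), "GT8"))

clades = largest_pure_clades(toy_tree)
totals = Counter(leaf_families(toy_tree))
# A family is monophyletic here if its largest pure clade holds all its leaves.
monophyletic = {family: clades[family] == totals[family] for family in totals}
```

For the split family, the sketch keeps the clade holding the most sequences from that family, mirroring the rule stated in the caption.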

Figure 3—source data 1 List of GT-A fold families and subfamilies. For each of these families, the groups obtained by the pattern based classification are provided in the ‘GT-A pattern based group’ column. Taxonomic distribution and a short description of these groups are also provided.: https://cdn.elifesciences.org/articles/54532/elife-54532-fig3-data1-v2.xlsx
Figure 3—source data 2 The 993 representative GT-A domain sequences included in the phylogenetic analysis. The GT-A family and the pattern based classification group for each sequence are indicated in the ‘GT-A family’ and the ‘GT-A pattern based group’ columns. The domain start and end positions are indicated. Sequences for the domain region and the full length sequences are also provided. An alignment of these sequences is available in Figure 3—source data 3.: https://cdn.elifesciences.org/articles/54532/elife-54532-fig3-data2-v2.xlsx
Figure 3—source data 3 The trimmed FASTA alignment of the 231 positions of the GT-A core used for phylogeny.: https://cdn.elifesciences.org/articles/54532/elife-54532-fig3-data3-v2.txt
Figure 3—source data 4 The phylogenetic tree file for the 993 GT-A fold sequences in Newick format.: https://cdn.elifesciences.org/articles/54532/elife-54532-fig3-data4-v2.txt

Figure 3—figure supplement 1


Complete phylogenetic tree of 993 representative GT-A sequences.

Sequences are provided in Figure 3—source data 2. Clades are colored for each of the 53 GT-A families and labeled. Values at nodes indicate bootstrap support with 1000 replicates. Values for all major nodes are indicated. This tree is also provided in Newick text format in Figure 3—source data 4.

Figure 3—figure supplement 2


Clade specific conserved features in the HVs.

The conserved mode of donor binding in clade 9, the conserved mode of acceptor binding in clade 2 and the conserved QXXRW motif in clade 1 are illustrated. HVs are shown in orange. Metal ions are shown as spheres. Red bars above the alignment indicate the significance of conservation of the residue in each column (a higher bar is more significantly conserved). Below every position in the alignment, numbers indicate the extent of conservation of residues at that position.

Figure 3—figure supplement 3


Sankey diagram comparing topologies of the phylogenetic tree with PDB- and HMM-based clustering of GT-A families.

Each column highlights clusters of GT-A families obtained through different methods (from left to right: PDB structural alignment clustering, GT-A phylogeny and HMM-distance based tree). Corresponding GT-A families within clusters are connected through colored links. Non-overlapping links indicate agreement in the placement of families across methods. Full clusters and trees are shown below the columns.

Figure 4


Variations in the GT-A conserved core.

(A) Weblogos depicting the conservation of active site residues in the common core are shown for each of the GT-A families. Residues are colored based on their physicochemical properties. (B) Variations in the C-His are compensated either by a water molecule (red sphere) or by other charged residues (olive sticks) that conserve its interactions. The metal ion is shown as a purple sphere. The donor substrate is shown as brown lines. Interactions between the residues, metal ion and the donor are shown using dotted lines.

Figure 5


Family specific conserved features in the HV regions correlate with acceptor recognition and specificity.

(A) Conserved residues in HV2 of the DPM1 sequences in the GT2-DP subfamily coordinate the phosphate group of the acceptor. (B) HV1 of GT16 MGAT1 provides acceptor specificity. (C) HV2 and HV3 of the EXTL GT64 family (C-terminal GT domain of the multidomain sequences) coordinate the acceptor. Left: Alignments highlighting the constrained residues are shown for each family. The family specific conserved residues are shown using black dots above the alignment. Red bars above these dots indicate the significance of conservation (a higher bar corresponds to a more significantly conserved position). Right: Representative PDB structures are shown for each family (GT2-DP: 5mm1, GT16: 5vcs, GT64: 1on8). Donor substrates are colored brown. Acceptors are colored purple. HVs are highlighted in orange. The positions of the conserved DxD and xED motifs for each structure are shown as cyan and magenta circles, respectively.

Figure 6 with 2 supplements


Machine learning (ML) approach for predicting donor class.

(A) Brief pipeline of the ML analysis. Training set inputs into the pipeline are shown in green boxes. Steps of the ML analysis in purple boxes are associated with different panels of the figure. (B) Percent accuracy based on 10-fold cross validation (CV) for each of the trained ML models. (C) Confusion matrix from the best model (GDBT using 239 features). (D) Scatter plot showing the probability scores assigned for each predicted sequence by the predicted donor type. Colors indicate the confidence level of the prediction based on the probability of assignment to a given donor class as well as the confidence interval of the predicted class, i.e. the difference in probability values between the 1st and 2nd predicted classes (Figure 6—source data 2).
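
The confidence measure in panel D combines the top class probability with its gap to the runner-up. A minimal sketch of that idea follows; the thresholds (`min_top`, `min_margin`), the three level names and the two placeholder classes are hypothetical, since the caption does not state the cutoffs or list all six donor classes.

```python
def confidence_label(probabilities, min_top=0.5, min_margin=0.2):
    """Return (top class, margin, level) from per-class probabilities.

    The margin is the top probability minus the runner-up probability.
    Thresholds are illustrative assumptions, not values from the paper.
    """
    ranked = sorted(probabilities.items(), key=lambda kv: kv[1], reverse=True)
    (top_class, top_p), (_, second_p) = ranked[0], ranked[1]
    margin = top_p - second_p
    if top_p >= min_top and margin >= min_margin:
        level = "high"
    elif margin >= min_margin / 2:
        level = "medium"
    else:
        level = "low"
    return top_class, margin, level

# Hypothetical class-wise probabilities for two sequences; "other_1" and
# "other_2" stand in for the donor classes not named in the captions.
clear_case = {"Gal": 0.70, "GalNAc": 0.15, "GlcNAc": 0.05,
              "Man": 0.05, "other_1": 0.03, "other_2": 0.02}
ambiguous_case = {"Gal": 0.35, "GalNAc": 0.33, "GlcNAc": 0.12,
                  "Man": 0.10, "other_1": 0.06, "other_2": 0.04}

clear_result = confidence_label(clear_case)
ambiguous_result = confidence_label(ambiguous_case)
```

Using the margin alongside the top probability separates a prediction that clearly wins from one that only narrowly beats the second-ranked class.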

Figure 6—source data 1 List of the 713 training dataset sequences used for machine learning. The ‘Assigned Donor Class’ column indicates one of the six classes the donor belongs to.: https://cdn.elifesciences.org/articles/54532/elife-54532-fig6-data1-v2.xlsx
Figure 6—source data 2 Results for donor prediction using the GDBT ML model for GT-A sequences from five model organisms. The validation datasets (highlighted in blue rows) include GTs that have some experimental characterization but were not included in the characterized dataset. The validation set was used to compare the model predictions with the experimental results. The ‘Match Experimental’ column indicates whether the prediction matched experimental results. The prediction set includes predictions for GTs of unknown function. The ‘Confidence’ column gives the confidence of the prediction, which was derived from the probability for the 1st class and its difference from the probability for the 2nd class. Probabilities for all six classes are provided in the ‘Classwise Probablity’ columns.: https://cdn.elifesciences.org/articles/54532/elife-54532-fig6-data2-v2.xlsx

Figure 6—figure supplement 1


Sequence homology-based network of all the experimentally characterized sequences from the GT-A fold families.

Nodes represent the sequences that were annotated as characterized and collected from the CAZy database to be used in the training dataset for ML. The color and shape of the nodes indicate the donor specificity for that sequence. An edge between two nodes indicates that the sequences are homologous with an e-value better than 1e-5. A smaller edge distance indicates a higher similarity between nodes. An edge-weighted spring embedded layout from Cytoscape was implemented to minimize edge crossings and enhance visual interpretability. At multiple locations in the network, closely related sequences differ in donor specificity, which makes prediction through similarity alone difficult.
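
The edge rule in this caption (an edge when the e-value is better than 1e-5) can be sketched as follows. This is a toy illustration, not the authors' workflow: the hit tuples, sequence names and donor labels are hypothetical, and the layout step performed in Cytoscape is not reproduced.

```python
E_VALUE_CUTOFF = 1e-5  # cutoff stated in the caption

def build_edges(hits, cutoff=E_VALUE_CUTOFF):
    """Collapse pairwise hits into undirected edges that pass the cutoff.

    `hits` is an iterable of (query, subject, e_value). Self hits are
    skipped and the best (lowest) e-value is kept per sequence pair.
    """
    edges = {}
    for query, subject, e_value in hits:
        if query == subject or e_value >= cutoff:
            continue
        pair = tuple(sorted((query, subject)))
        edges[pair] = min(e_value, edges.get(pair, e_value))
    return edges

def mixed_donor_edges(edges, donor_of):
    """Edges linking homologous sequences annotated with different donors."""
    return sorted(p for p in edges if donor_of[p[0]] != donor_of[p[1]])

# Hypothetical pairwise search results and donor annotations.
hits = [
    ("seqA", "seqB", 1e-40),
    ("seqB", "seqA", 1e-38),
    ("seqB", "seqC", 1e-7),
    ("seqA", "seqC", 0.5),   # fails the cutoff, so no edge
    ("seqC", "seqC", 0.0),   # self hit, ignored
]
donor_of = {"seqA": "Gal", "seqB": "Gal", "seqC": "GlcNAc"}

edges = build_edges(hits)
conflicts = mixed_donor_edges(edges, donor_of)
```

Edges returned by `mixed_donor_edges` correspond to the situation the caption highlights, where close homologs use different donors and similarity alone would mislead a prediction.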

Figure 6—figure supplement 2


Distribution of training and prediction datasets used in machine learning.

The size of the bubbles next to GT-A family names indicates the number of sequences in the training and prediction set from that family. The color of the bubbles indicates whether they belong to the training or prediction set.

Figure 6—figure supplement 2—source data 1 Distribution of sequences across different families. The counts in this table were mapped onto the phylogenetic tree in Figure 6—figure supplement 2.: https://cdn.elifesciences.org/articles/54532/elife-54532-fig6-figsupp2-data1-v2.xlsx

Figure 7


Top contributing features from the GDBT model associated with sugar donor specificity.

(A) Heatmap showing the contributions of representative features. Features are ordered based on their importance for the final GDBT model along the vertical axis. The heatmap colors indicate how important each feature is for a given sugar donor type, with red indicating ranks 1–10 (highly important) (M). (B–E) Contributing features important for individual donor types are mapped onto representative structures. The amino acids at the feature positions are shown in yellow sticks and labelled. Feature positions distal from the donor binding site are shown in green sticks. Labels include the amino acid code, the aligned residue position and the amino acid position in the crystal structure within parentheses. The donor substrate with the sugar is shown in lines with surface bounds. Divalent metal ions are shown as spheres. The αC helix is shown. (B) Gal features mapped to a bovine β-1,4 Gal transferase (PDB ID: 1o0r). (C) GalNAc features mapped to a human UDP-GalNAc: polypeptide alpha-N-acetylgalactosaminyltransferase (PDB ID: 2d7i). (D) GlcNAc features mapped to a rabbit N-acetylglucosaminyltransferase I (PDB ID: 1foa). (E) Man features mapped to a bacterial Mannosyl-3-Phosphoglycerate Synthase (PDB ID: 2wvl).

Figure 7—source data 1 Comparison of feature importance in the full GDBT model with feature importance for each sugar donor type. ‘GDBT_full’ columns include the rank and score for the 239 features in the GDBT model. The remaining columns show the rank and score of the same feature for the classification of the respective donor sugars. Feature positions ranked from 1 to 10 (most important) are colored red and those ranked 11–20 are colored orange.: https://cdn.elifesciences.org/articles/54532/elife-54532-fig7-data1-v2.xlsx

Tables

Key resources table

Reagent type (species) or resource	Designation	Source or reference	Identifiers
Software, algorithm	CAZy database	doi: 10.1093/nar/gkt1178	CAZy- Carbohydrate Active Enzyme, RRID:SCR_012909
Software, algorithm	mapgaps	doi: 10.1093/bioinformatics/btp342
Software, algorithm	omcBPPS	doi: 10.1089/cmb.2013.0099
Software, algorithm	GT-A family classification and sequences	This paper	doi:10.5061/dryad.v15dv41sh
Software, algorithm	MAFFT v7.3	doi: 10.1093/molbev/mst010	MAFFT, RRID:SCR_011811
Software, algorithm	Expresso from the t-coffee suite	doi: 10.1093/nar/gkl092	T-Coffee, RRID:SCR_011818
Software, algorithm	IQTree v1.6.1	doi: 10.1093/molbev/msu300
Software, algorithm	PyMOL v2.0.6	Schrödinger	PyMOL, RRID:SCR_000305
Software, algorithm	Python v3 with package scikitlearn	Pedregosa, 2011	scikit-learn, RRID:SCR_002577
Software, algorithm	R package ‘randomForest’	Liaw and Wiener, 2002	RandomForest Package in R, RRID:SCR_015718
Software, algorithm	WEKA version 3.8.3	Witten et al., 2016	Weka, RRID:SCR_001214

Additional files

Supplementary file 1 The CAZy GT-A families included in the analysis. The ‘Alignment source’ column includes the CDD alignment identifiers used to build the seed profiles for a given GT-A family. If a suitable CDD profile was not available, the seed profiles were built by manually selecting and aligning representative sequences.: https://cdn.elifesciences.org/articles/54532/elife-54532-supp1-v2.xlsx
Supplementary file 2 Mapping of the 231 aligned positions to representative crystal structures available for all GT-A fold families. The ‘Position Description’ column includes the labels for conserved motifs and hypervariable regions for the aligned positions. The GT family, PDB IDs and references to PubMed IDs are indicated in the header columns.: https://cdn.elifesciences.org/articles/54532/elife-54532-supp2-v2.xlsx
Supplementary file 3 The ancient archaeal and bacterial sequences that most closely resemble the GT-A core consensus. These sequences were collected by running a BLAST search against a single consensus sequence generated from the seed profiles of all GT-A families. The hits were further filtered to keep sequences that only had the minimal GT-A core (Materials and methods).: https://cdn.elifesciences.org/articles/54532/elife-54532-supp3-v2.xlsx
Transparent reporting form: https://cdn.elifesciences.org/articles/54532/elife-54532-transrepform-v2.pdf

Rahil Taujale, Aarya Venkat, Liang-Chin Huang, Zhongliang Zhou, Wayland Yeung, Khaled M Rasheed, Sheng Li, Arthur S Edison, Kelley W Moremen, Natarajan Kannan (2020) Deep evolutionary analysis reveals the design principles of fold A glycosyltransferases. eLife 9:e54532. https://doi.org/10.7554/eLife.54532
