Chemical structures of lysine lipids. Lys-PG, lysyl-phosphatidylglycerol; Lys-Glc-DAG, lysylglucosyl-diacylglycerol; Lys-Glc2-DAG, lysyl-diglucosyl-diacylglycerol.

Summary of different MprF variants expressed in S. mitis and the lysine lipids they produce.

Percentage of amino acid identity and similarity compared to S. agalactiae COH1 MprF; data obtained from BLASTp. The lipids each strain synthesizes are denoted by a checkmark or an x. Lys-PG, lysyl-phosphatidylglycerol; Lys-Glc-DAG, lysyl-glucosyl-diacylglycerol.

Synthesis of lysine lipids (Lys-PG and Lys-Glc-DAG) in S. mitis expressing mprFs from S. agalactiae, S. salivarius, and S. ferus. (a) S. mitis NCTC12261 with empty vector control (pABG5) lacks lysine lipids; (b) S. agalactiae mprF (pGBSMprF) produces both Lys-PG and Lys-Glc-DAG; (c) S. salivarius mprF produces only Lys-PG; (d) S. ferus mprF produces only Lys-Glc-DAG. Left panels: total ion chromatograms (TIC); middle panels: mass spectra at retention times 19.5–21.5 min showing Lys-PG and PC; right panels: mass spectra at retention times 26–30 min showing Lys-Glc-DAG. Note: “*” marks an extraction artifact arising from the chloroform used. DAG, diacylglycerol; MHDAG, monohexosyldiacylglycerol; DHDAG, dihexosyldiacylglycerol; PG, phosphatidylglycerol; Lys-PG, lysyl-phosphatidylglycerol; Lys-Glc-DAG, lysylglucosyl-diacylglycerol; PC, phosphatidylcholine.

Schematic of the RBM methodology. An aligned set of protein sequences is first used to learn a hidden unit representation that best describes the statistics of the sequence dataset, given restrictions on the hidden unit representation. The individual hidden units can then be studied to find particular units that allow useful enzyme classification, and their weights can be meaningfully interpreted as statistically co-varying sequence configurations. The classification can also be used to create filtered datasets on which to train further models.

Example of hidden unit analysis and usage. (a) The structure of PDB:7DUW, with the red colored region being the transmembrane flippase domain and the yellow boxed region the cytosolic domain, on which we focus. (b) The activations produced by inputting a sequence into a hidden unit, which yields a single number corresponding to a summation of negatively and positively weighted residues. Activations are shown for the entire training set (histogram in blue), with sequences corresponding to predominantly positively weighted residues highlighted. (c) Hidden unit from an RBM trained on the Pfam DUF2156 domain. The MSA positions 152 and 212 correspond to residues S684 and R742, respectively. (d) Residues (in yellow) in the MprF cytosolic domain which form the binding pocket for Lys-tRNALys (the ligand analogue L-lysine amide shown in green), from PDB:4V36. LYN, L-lysine amide.
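The summation described in panel (b) can be illustrated with a minimal sketch. It assumes the activation is a plain sum of per-position, per-residue weights; the trained RBM may additionally apply biases or a nonlinearity. The function name `hidden_unit_input` and the weight values are hypothetical and are not taken from the trained model.

```python
def hidden_unit_input(sequence, weights):
    """Sum the weights of the residues observed at each aligned position.

    `weights` maps (position, residue) -> weight; pairs that are absent
    contribute zero. Positions are 0-based indices into the aligned sequence.
    """
    return sum(weights.get((pos, res), 0.0) for pos, res in enumerate(sequence))


# Hypothetical weights for a two-column toy alignment (illustrative only).
toy_weights = {
    (0, "S"): 1.5,   # positively weighted residue
    (1, "R"): 1.0,   # positively weighted residue
    (0, "A"): -0.5,  # negatively weighted residue
}

positive_like = hidden_unit_input("SR", toy_weights)  # 1.5 + 1.0
negative_like = hidden_unit_input("AK", toy_weights)  # -0.5 + 0.0
```

Sequences dominated by positively weighted residues land on the positive side of the histogram, and those dominated by negatively weighted residues on the negative side.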

Proposed set of hidden units for classifying lipid specificity. (a) Two hidden units found in an RBM trained on the filtered Pfam dataset. The hidden unit residues are highlighted in the PDB:4V34 structure, with arrows pointing to their corresponding residue sets. (b) The activations of the hidden units when scoring the sequences used in the training set and sequences from NCBI not used during training (N=23,138). S. agalactiae produces Lys-Glc-DAG and Lys-PG, while B. licheniformis and P. aeruginosa produce only aminoacylated-PG. Q1-Q4 are quadrant labels which we refer to throughout the paper.

Identification of Lys-Glc2-DAG in E. dispar. (a) Positive ion mass spectrum of Lys-Glc2-DAG species in E. dispar. Major Lys-Glc2-DAG species are labeled with the number of carbon atoms (before colon) and double bonds (after colon). (b) MS/MS product ions and fragmentation scheme of the most abundant Lys-Glc2-DAG ion at m/z 1047.73. (c) Extracted ion chromatograms of LC/MS of lysine lipids (Lys-PG, Lys-Glc-DAG, Lys-Glc2-DAG) and Glc2-DAG separated on an amino HPLC column. Glc2-DAG, diglucosyl-diacylglycerol; Lys-PG, lysyl-phosphatidylglycerol; Lys-Glc-DAG, lysyl-glucosyl-diacylglycerol; Lys-Glc2-DAG, lysyl-diglucosyl-diacylglycerol.

Table of all strains studied, the quadrant they occupy, and the lipids they synthesize.

** indicates trace amounts of Lys-Glc-DAG present in lipid extractions. * indicates heterologous expression of mprF in Streptococcus mitis. Lys-Glc-DAG, lysyl-glucosyl-diacylglycerol; Lys-Glc2-DAG, lysyl-diglucosyl-diacylglycerol; Lys-PG, lysyl-phosphatidylglycerol.

E. faecalis MprF2 and E. faecium MprF1 confer Lys-Glc2-DAG synthesis. (a) OG1RF-WT; (b) OG1RF_10760::Tn lacks lysine lipids; (c) OG1RF_10760::Tn + pABG5 lacks lysine lipids; (d) OG1RF_10760::Tn + pOGMprF2 restores lysine lipids; (e) OG1RF_10760::Tn + pEFMprF1 restores lysine lipids; (f) OG1RF_10760::Tn + pEFMprF2 lacks lysine lipids. Expression of OGMprF2 or EFMprF1 in the OG1RF_10760::Tn mutant restores Lys-Glc2-DAG synthesis. Shown are the extracted ion chromatograms of lysine lipids (Lys-PG, Lys-Glc2-DAG) and Glc2-DAG separated on an amino HPLC column. Note: Lys-Glc-DAG was found in trace amounts or was missing from lipid extractions. Glc2-DAG, diglucosyl-diacylglycerol; Lys-PG, lysyl-phosphatidylglycerol; Lys-Glc2-DAG, lysyl-diglucosyl-diacylglycerol.

The two proposed hidden unit activations with all of the sequences from Table 2 labeled. Protein IDs of highlighted sequences are listed in Table S1. (#) indicates that the lipid activity was confirmed through heterologous expression.

Assessment of the RBM hidden unit’s reliance on total sequence identity in classification. A sliding-window average is computed: a window of width 1 is slid from the negative end to the positive end of the hidden unit activation range, and at each position sequences are sampled from within the window. The sampled sequences are compared pairwise with one another, and the Hamming distances computed across their entire sequence length are averaged. This sampling procedure is repeated 30 times for each window, with the light blue shading representing the 95% confidence interval.
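The sampling procedure can be sketched as follows. This is a minimal illustration, not the authors' implementation: the number of sequences sampled per window (`n_sample`), the random seed, and the function names are assumptions, and the 95% confidence interval is omitted. Sequences are assumed to be aligned and of equal length.

```python
import random
from itertools import combinations


def hamming(a, b):
    """Number of aligned positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def mean_pairwise_hamming(seqs):
    """Average Hamming distance over all unordered pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


def windowed_distance(scored, start, width=1.0, n_sample=10, n_repeat=30, seed=0):
    """Mean pairwise Hamming distance among sequences whose activation lies in
    [start, start + width), averaged over n_repeat random samples.

    `scored` is a list of (activation, aligned_sequence) tuples.
    Returns None when the window holds fewer than two sequences.
    """
    rng = random.Random(seed)
    in_window = [seq for act, seq in scored if start <= act < start + width]
    if len(in_window) < 2:
        return None
    k = min(n_sample, len(in_window))
    repeats = [mean_pairwise_hamming(rng.sample(in_window, k)) for _ in range(n_repeat)]
    return sum(repeats) / len(repeats)
```

Calling `windowed_distance` at successive values of `start` traces the curve; a flat curve would suggest that the hidden unit activation does not simply track overall sequence identity.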

The full sequences of our pre-filtered multiple sequence alignment were analyzed to identify which Pfam domains were present in them. We then split these results based on the procedure in Figure 3, where “Removed sequences” indicates the domain composition of sequences removed by this procedure and “Training sequences” indicates that of the data on which we trained our final model. For clarity, we omit from this plot domains which occur in fewer than 1% of the sequences of their respective sets.
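The 1% cutoff used for plotting can be expressed as a short sketch. The function name, the input layout (one list of domain names per sequence), and the placeholder domain `"X"` are hypothetical; only the threshold comes from the legend.

```python
from collections import Counter


def frequent_domains(domain_sets, min_fraction=0.01):
    """Fraction of sequences containing each domain, keeping only domains
    present in at least `min_fraction` of the sequences in this set.

    A domain is counted once per sequence, even if it occurs several times.
    """
    n = len(domain_sets)
    counts = Counter(d for domains in domain_sets for d in set(domains))
    return {d: c / n for d, c in counts.items() if c / n >= min_fraction}


# Toy set: "X" occurs in 1 of 200 sequences (0.5%) and is therefore dropped.
toy_set = [["DUF2156"]] * 199 + [["DUF2156", "X"]]
kept = frequent_domains(toy_set)
```

The filter would be applied separately to the removed and training sets, since the legend defines the threshold relative to each set.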

Identification of Lys-Glc2-DAG, which is highly retentive on a silica HPLC column. A) The total ion chromatogram of normal phase LC/MS of E. dispar lipids separated on a silica HPLC column. B) The positive ion mass spectrum of Lys-Glc2-DAG eluting at the end of the LC gradient. Glc-DAG, glucosyl-diacylglycerol; Glc2-DAG, diglucosyl-diacylglycerol; PG, phosphatidylglycerol; Lyso-Glc2-DAG, lyso diglucosyl-diacylglycerol; Lys-PG, lysyl-phosphatidylglycerol; Lys-Glc-DAG, lysyl-glucosyl-diacylglycerol; Lyso Lys-PG, lyso lysyl-phosphatidylglycerol; Lys-Glc2-DAG, lysyl-diglucosyl-diacylglycerol.

All Enterococcus sequences analyzed in the course of this study.

Sequence Locus IDs for all sequences listed in Table 2. Bold denotes confirmed mprF allele.

Fisher’s exact test for determining whether positive values of Hidden Unit 2 are predictive of GlcN-DAG specificity. p=0.028.
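For readers unfamiliar with the test, a minimal standard-library sketch of a two-sided Fisher's exact test on a 2×2 table is given below. The counts in the example are hypothetical and do not reproduce the reported p-value; the contingency table underlying p=0.028 is not restated here, and whether the reported test was one- or two-sided is not assumed.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed one.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(k):
        # Probability that the top-left cell equals k, given fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))


# Hypothetical counts: rows = positive / non-positive activation,
# columns = lipid produced / not produced.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

Here the rows would correspond to sequences with positive versus non-positive Hidden Unit 2 activation, and the columns to the presence or absence of the lipid specificity being tested.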

E. coli strains and plasmids used.

Primers used in this study. Red indicates sequence complementarity to pABG5.

mprF sequences synthesized by Genewiz.

Red indicates sequence complementarity to pABG5.