Peer review process
Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.
Editors
- Reviewing Editor: Anne-Florence Bitbol, Ecole Polytechnique Federale de Lausanne (EPFL), Lausanne, Switzerland
- Senior Editor: Qiang Cui, Boston University, Boston, United States of America
Reviewer #1 (Public Review):
The basic approach is that the authors first train an RBM on all MprF sequences, and then use this analysis to identify a subset of the family that catalyzes the addition of amino acids to PG. Then a second RBM is trained on this subset.
In the initial RBM training, a particular hidden unit is identified that has a sparse and bimodal activation in response to the input sequences. The contribution of individual residues is shown in Figure 3c, which highlights one of the strengths of this RBM implementation: it is interpretable in a physically meaningful way. However, several decisions are made here whose justification is not entirely clear.
i) Some of the residues in Fig 3c are stated as "relevant" for aminoacylated PG production. But is this the only such hidden unit? Or are there others that are sparse, bimodal, and involve "relevant" AA?
ii) In order to filter the sequences for the second stage, only those that produce an activation over +2.0 in this particular hidden unit were taken. How was this choice made?
iii) How many sequences are in the set before and after this filtering? On the basis of the strength of the results that follow I expect that there are good reasons for these choices, but they should be more carefully discussed.
iv) Do the authors think that this gets all of the aminoacylated PG enzymes? Or are some missed?
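To make questions ii) and iii) concrete, the filtering step can be pictured as a simple threshold on the input that each sequence sends to the chosen hidden unit. The following is a minimal sketch only, not the authors' implementation: the alphabet ordering, the weight layout (alignment columns by residue states), the function names, and the toy weights are all assumptions made for illustration. Only the +2.0 cutoff is taken from the manuscript.

```python
import numpy as np

# Assumed encoding: 20 amino acids plus a gap symbol.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
AA_INDEX = {aa: k for k, aa in enumerate(ALPHABET)}


def hidden_unit_input(seq, weights):
    """Sum the weight of the observed residue at each aligned position.

    weights has shape (L, 21): one row per alignment column (assumed layout).
    """
    idx = np.array([AA_INDEX[aa] for aa in seq])
    return float(weights[np.arange(len(seq)), idx].sum())


def filter_by_activation(seqs, weights, threshold=2.0):
    """Keep sequences whose input to the chosen hidden unit exceeds threshold."""
    return [s for s in seqs if hidden_unit_input(s, weights) > threshold]


# Toy weights for a three-column alignment (hypothetical values).
weights = np.zeros((3, len(ALPHABET)))
weights[0, AA_INDEX["K"]] = 1.5
weights[2, AA_INDEX["D"]] = 1.0
weights[2, AA_INDEX["A"]] = -1.0

seqs = ["KAD", "KAA", "GAD"]
kept = filter_by_activation(seqs, weights)
print(len(seqs), "sequences before filtering,", len(kept), "after")
```

Reporting how the number of retained sequences changes as the threshold is varied around +2.0 would show how sensitive the second-stage training set is to this choice.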
The authors show that they can classify members of the family by training a second RBM on the filtered sequences. They do this by identifying two hidden unit activations in particular (Figure 5b) which seem to be useful for determining lipid substrate specificity, and they test several variants that produce different responses in these two hidden units by experimentally determining what lipids they produce (Table 2). However, criticisms similar to those raised above apply here as well, namely regarding the selection of which weights should be used to classify the enzymes' function. Again the approach is to identify hidden unit activations that are sparse (with respect to the input sequence), have a high overall magnitude, and "involve residues which could be plausibly linked to the lipid binding specificity."
i) Two hidden units are identified as useful for classification, but how many candidates are there that pass the first two criteria? Indeed, how many hidden units are there?
ii) The criterion "involve residues which could be plausibly linked to the lipid binding specificity" is again vague. Do all of the other candidate hidden units *not* involve significant contributions from substrate-binding residues? Maybe one of the other units does a better job of discriminating substrate specificity. (As indicated in Figure 8, there are examples of enzymes that confound the proposed classification.) Why combine the activations of two units for the classification, instead of 1 or 3 or...?
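One way to answer i) would be to compute the first two criteria for every hidden unit and report how many pass. The sketch below is an illustration under stated assumptions rather than the authors' procedure: the weight tensor layout, the use of the weight norm for magnitude, the participation ratio as a sparsity measure, and both cutoffs are hypothetical choices.

```python
import numpy as np


def unit_statistics(W):
    """Return (magnitude, participation ratio) for each hidden unit.

    W has shape (M, L, q): hidden units x alignment columns x residue states
    (assumed layout). Units with all-zero weights are not handled here.
    """
    site_norm_sq = (W ** 2).sum(axis=2)             # shape (M, L)
    magnitude = np.sqrt(site_norm_sq.sum(axis=1))   # overall weight norm
    # Participation ratio: about 1/L when one column dominates,
    # 1 when all columns contribute equally.
    pr = site_norm_sq.sum(axis=1) ** 2 / (
        W.shape[1] * (site_norm_sq ** 2).sum(axis=1)
    )
    return magnitude, pr


def candidate_units(W, min_magnitude, max_participation):
    """Indices of units that are both large in magnitude and sparse."""
    magnitude, pr = unit_statistics(W)
    return np.flatnonzero((magnitude >= min_magnitude) & (pr <= max_participation))


# Toy weights: 3 hidden units, 4 alignment columns, 2 residue states.
W = np.zeros((3, 4, 2))
W[0, 1, 0] = 3.0   # sparse and strong
W[1, :, :] = 1.0   # strong but spread over every column
W[2, 0, 0] = 0.1   # sparse but weak

selected = candidate_units(W, min_magnitude=1.0, max_participation=0.5)
print("units passing both criteria:", selected.tolist())
```

A table of this kind, listing every unit that passes the quantitative criteria, would let readers judge whether the third, qualitative criterion singled out the two chosen units or whether other candidates were set aside.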
Reviewer #2 (Public Review):
In "Lipid discovery enabled by sequence statistics and machine learning" Christensen et al. address an important question: how can bacteria modify lipid charges to produce cationic lipids, which tend to confer resistance to cationic antibiotics? One of the enzymes involved in this process is MprF, which can, by transferring amino acids (in particular lysine) from charged tRNA, modify the charge of anionic membrane phospholipids from negative to positive. Recent work has shown that MprF can also modify another substrate, the glycolipid glucosyl-diacylglycerol, which is neutral. These findings immediately raise two questions: what are the determinants in the MprF sequence controlling the lipid substrates it can modify? Are there other, so far unknown, substrates for MprF?
Christensen et al. address both of these questions in an elegant way, combining sequence analysis with machine-learning methods and experimental characterisation of the enzymatic products through mass spectrometry. Using restricted Boltzmann machines (RBM), an unsupervised architecture extracting statistical features from the sequence data, they identify putative amino-acid motifs along the MprF sequences possibly related to the substrate identity, select some bacterial species whose wild-type sequence contains those motifs, and validate the biological role of the motifs by identifying the produced lipids. Remarkably, with this approach, the authors find a novel cationic lipid with two glucosyl groups.
Besides these new results on MprF and its operation, the present work is appealing, as it shows that the functional characterisation of a very small number of proteins (here, three!) combined with the guided classification of homologous sequence data with appropriate machine-learning methods can lead to the discovery of new functionalities.
Reviewer #3 (Public Review):
Summary:
Following the earlier finding that the Streptococcus agalactiae MprF enzyme can also synthesize lysyl-glucosyl-diacylglycerol (Lys-Glc-DAG), in addition to the already known lysyl-phosphatidylglycerol (Lys-PG), the authors' aim in the current manuscript was to investigate the molecular determinants of MprF lipid substrate specificity in a variety of bacterial species.
Strengths:
- In general, the manuscript is well constructed and easy to follow, especially taking into account the multidisciplinary aspect of it (computational machine learning combined with lipid biology).
- The added value of the restricted Boltzmann machine (RBM) approach, in comparison to standard computational pairwise sequence statistics, becomes evident. This is exemplified by a successful, although not perfect, classification and categorization of MprF activity.
- The MS analysis (monoisotopic mass, plus fragmentation pattern), convincingly shows the identification of a novel lipid species Lys-Glc2-DAG.
Weaknesses:
- In many of the analyzed strains, the presence of the lipid species Lys-PG, Lys-Glc-DAG, and Lys-Glc2-DAG is correlated with the presence of the MprF enzyme(s), but one should keep in mind that a multitude of other membrane proteins are present that in theory could be involved in the synthesis as well. Therefore, there is no direct evidence that the MprF enzymes are linked to the synthesis of these lipid species. Although it is unlikely that other enzymes are involved, this weakens the connection between the observed lipids and the type of MprF.
- Related to this, in a few cases MprF activity is tested, but the manuscript does not contain any information on protein expression levels. Heterologous expression of membrane proteins is in general challenging and, for various reasons, proteins may end up not being expressed at all. As an example, the absence of activity for the E. faecalis MprF1 and E. faecium MprF2 could very well be explained by the complete absence of the protein.
Overall, the authors largely achieved their goals, as the applied RBM approach led to specific sequence determinants in MprF enzymes that could categorize the specificity of these enzymes. The experimental data could largely confirm this categorization, although a stronger connection between synthesized lipids and enzyme activity would have further strengthened the observations.
The work currently focuses only on MprF enzymes, but could in theory be expanded to other categories of lipid-synthesizing enzymes. In other words, the RBM approach could have an impact on the lipid synthesis field if it were an easily applicable tool. Moreover, the lipids synthesized by MprF (Lys-PG, but also other cationic lipids) play an important role in bacterial resistance against certain antibiotics.