Generative power of a protein language model trained on multiple sequence alignments

  1. Damiano Sgarbossa
  2. Umberto Lupo  Is a corresponding author
  3. Anne-Florence Bitbol  Is a corresponding author
  1. Ecole Polytechnique Federale de Lausanne (EPFL), Switzerland

Abstract

Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families and learn constraints associated to protein structure and function. They thus open the possibility for generating novel sequences belonging to protein families. Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates to this end. We propose and test an iterative method that directly employs the masked language modeling objective to generate sequences using MSA Transformer. We demonstrate that the resulting sequences score as well as natural sequences, for homology, coevolution and structure-based measures. For large protein families, our synthetic sequences have similar or better properties compared to sequences generated by Potts models, including experimentally-validated ones. Moreover, for small protein families, our generation method based on MSA Transformer outperforms Potts models. Our method also more accurately reproduces the higher-order statistics and the distribution of sequences in sequence space of natural data than Potts models. MSA Transformer is thus a strong candidate for protein sequence generation and protein design.

Data availability

Python code for generating sequences using the iterative masking procedure is available in our GitHub repository: https://github.com/Bitbol-Lab/Iterative_masking.Raw data was collected from two public sources: 1) MSAs from the Pfam database (https://pfam.xfam.org/); 2) further MSAs from https://github.com/matteofigliuzzi/bmDCA. We generated sequences with bmDCA using code publicly available at https://github.com/ranganathanlab/bmDCA.

Article and author information

Author details

  1. Damiano Sgarbossa

    Institute of Bioengineering, Ecole Polytechnique Federale de Lausanne (EPFL), Lausanne, Switzerland
    Competing interests
    The authors declare that no competing interests exist.
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-7878-6061
  2. Umberto Lupo

    Institute of Bioengineering, Ecole Polytechnique Federale de Lausanne (EPFL), Lausanne, Switzerland
    For correspondence
    umberto.lupo@epfl.ch
    Competing interests
    The authors declare that no competing interests exist.
  3. Anne-Florence Bitbol

    Institute of Bioengineering, Ecole Polytechnique Federale de Lausanne (EPFL), Lausanne, Switzerland
    For correspondence
    anne-florence.bitbol@epfl.ch
    Competing interests
    The authors declare that no competing interests exist.
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-1020-494X

Funding

European Research Council (851173)

  • Damiano Sgarbossa
  • Umberto Lupo
  • Anne-Florence Bitbol

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Reviewing Editor

  1. Lucy J Colwell, Cambridge University, United Kingdom

Version history

  1. Preprint posted: April 15, 2022 (view preprint)
  2. Received: April 28, 2022
  3. Accepted: February 2, 2023
  4. Accepted Manuscript published: February 3, 2023 (version 1)
  5. Version of Record published: March 24, 2023 (version 2)

Copyright

© 2023, Sgarbossa et al.

This article is distributed under the terms of the Creative Commons Attribution License permitting unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 4,523
    views
  • 599
    downloads
  • 6
    citations

Views, downloads and citations are aggregated across all versions of this paper published by eLife.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Damiano Sgarbossa
  2. Umberto Lupo
  3. Anne-Florence Bitbol
(2023)
Generative power of a protein language model trained on multiple sequence alignments
eLife 12:e79854.
https://doi.org/10.7554/eLife.79854

Share this article

https://doi.org/10.7554/eLife.79854

Further reading

    1. Cell Biology
    2. Computational and Systems Biology
    Thomas Grandits, Christoph M Augustin ... Alexander Jung
    Research Article

    Computer models of the human ventricular cardiomyocyte action potential (AP) have reached a level of detail and maturity that has led to an increasing number of applications in the pharmaceutical sector. However, interfacing the models with experimental data can become a significant computational burden. To mitigate the computational burden, the present study introduces a neural network (NN) that emulates the AP for given maximum conductances of selected ion channels, pumps, and exchangers. Its applicability in pharmacological studies was tested on synthetic and experimental data. The NN emulator potentially enables massive speed-ups compared to regular simulations and the forward problem (find drugged AP for pharmacological parameters defined as scaling factors of control maximum conductances) on synthetic data could be solved with average root-mean-square errors (RMSE) of 0.47 mV in normal APs and of 14.5 mV in abnormal APs exhibiting early afterdepolarizations (72.5% of the emulated APs were alining with the abnormality, and the substantial majority of the remaining APs demonstrated pronounced proximity). This demonstrates not only very fast and mostly very accurate AP emulations but also the capability of accounting for discontinuities, a major advantage over existing emulation strategies. Furthermore, the inverse problem (find pharmacological parameters for control and drugged APs through optimization) on synthetic data could be solved with high accuracy shown by a maximum RMSE of 0.22 in the estimated pharmacological parameters. However, notable mismatches were observed between pharmacological parameters estimated from experimental data and distributions obtained from the Comprehensive in vitro Proarrhythmia Assay initiative. This reveals larger inaccuracies which can be attributed particularly to the fact that small tissue preparations were studied while the emulator was trained on single cardiomyocyte data. Overall, our study highlights the potential of NN emulators as powerful tool for an increased efficiency in future quantitative systems pharmacology studies.

    1. Computational and Systems Biology
    2. Neuroscience
    Domingos Leite de Castro, Miguel Aroso ... Paulo Aguiar
    Research Article Updated

    Closed-loop neuronal stimulation has a strong therapeutic potential for neurological disorders such as Parkinson’s disease. However, at the moment, standard stimulation protocols rely on continuous open-loop stimulation and the design of adaptive controllers is an active field of research. Delayed feedback control (DFC), a popular method used to control chaotic systems, has been proposed as a closed-loop technique for desynchronisation of neuronal populations but, so far, was only tested in computational studies. We implement DFC for the first time in neuronal populations and access its efficacy in disrupting unwanted neuronal oscillations. To analyse in detail the performance of this activity control algorithm, we used specialised in vitro platforms with high spatiotemporal monitoring/stimulating capabilities. We show that the conventional DFC in fact worsens the neuronal population oscillatory behaviour, which was never reported before. Conversely, we present an improved control algorithm, adaptive DFC (aDFC), which monitors the ongoing oscillation periodicity and self-tunes accordingly. aDFC effectively disrupts collective neuronal oscillations restoring a more physiological state. Overall, these results support aDFC as a better candidate for therapeutic closed-loop brain stimulation.