Generative power of a protein language model trained on multiple sequence alignments

  1. Damiano Sgarbossa
  2. Umberto Lupo (corresponding author)
  3. Anne-Florence Bitbol (corresponding author)
  1. École Polytechnique Fédérale de Lausanne (EPFL), Switzerland

Abstract

Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families and learn constraints associated with protein structure and function. They thus open the possibility of generating novel sequences belonging to protein families. Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates to this end. We propose and test an iterative method that directly employs the masked language modeling objective to generate sequences using MSA Transformer. We demonstrate that the resulting sequences score as well as natural sequences on homology, coevolution, and structure-based measures. For large protein families, our synthetic sequences have similar or better properties compared with sequences generated by Potts models, including experimentally validated ones. Moreover, for small protein families, our generation method based on MSA Transformer outperforms Potts models. Our method also reproduces the higher-order statistics and the distribution of natural sequences in sequence space more accurately than Potts models do. MSA Transformer is thus a strong candidate for protein sequence generation and protein design.
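
To make the procedure concrete, here is a minimal sketch of an iterative masking loop, assuming the publicly released MSA Transformer checkpoint from the fair-esm package. The toy MSA, the masking fraction, the number of iterations, and the greedy (argmax) filling of masked positions are illustrative assumptions rather than the exact settings of our method; the reference implementation is in the GitHub repository linked under Data availability.

```python
# Minimal sketch of iterative masking with MSA Transformer (pip install fair-esm).
# Hyperparameters and the toy MSA are placeholders; see the Iterative_masking
# repository for the procedure actually used in this work.
import torch
import esm

model, alphabet = esm.pretrained.esm_msa1b_t12_100M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()

# Toy input MSA: (label, aligned sequence) pairs of equal length.
msa = [
    ("seq1", "MKTAYIAKQR"),
    ("seq2", "MKTAYIVKQR"),
    ("seq3", "MKSAYIAKQR"),
]
_, _, tokens = batch_converter([msa])  # shape: (1, num_seqs, seq_len + 1)

p_mask, n_iter = 0.1, 20  # illustrative values
with torch.no_grad():
    for _ in range(n_iter):
        # Randomly mask a fraction p_mask of positions; column 0 holds the
        # per-sequence start token added by the tokenizer and is left intact.
        mask = torch.rand(tokens.shape) < p_mask
        mask[:, :, 0] = False
        masked = tokens.clone()
        masked[mask] = alphabet.mask_idx

        # Refill the masked positions with the model's highest-scoring tokens.
        logits = model(masked)["logits"]
        tokens[mask] = logits.argmax(dim=-1)[mask]

# Decode the generated MSA back to amino-acid strings (dropping the start token).
generated = ["".join(alphabet.get_tok(int(i)) for i in row[1:]) for row in tokens[0]]
print(generated)
```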

Data availability

Python code for generating sequences using the iterative masking procedure is available in our GitHub repository: https://github.com/Bitbol-Lab/Iterative_masking. Raw data were collected from two public sources: (1) MSAs from the Pfam database (https://pfam.xfam.org/); (2) further MSAs from https://github.com/matteofigliuzzi/bmDCA. We generated sequences with bmDCA using code publicly available at https://github.com/ranganathanlab/bmDCA.
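
For comparison, bmDCA fits a Potts model (single-site fields and pairwise couplings) to an MSA and then generates sequences by Markov chain Monte Carlo sampling from that model. The snippet below is a generic, illustrative single-site Metropolis sampler for a Potts model; it is not the bmDCA implementation, and the fields h, couplings J, and all parameter values are hypothetical stand-ins for parameters that would normally be inferred from data.

```python
# Illustrative single-site Metropolis sampler for a Potts model with energy
# E(x) = -sum_i h_i(x_i) - sum_{i<j} J_ij(x_i, x_j). Not the bmDCA code; the
# parameters below are random placeholders rather than inferred values.
import numpy as np

def sample_potts(h, J, n_steps=10_000, rng=None):
    """h: (L, q) fields; J: (L, L, q, q) symmetric couplings.
    Returns one sequence encoded as integers in [0, q)."""
    rng = np.random.default_rng() if rng is None else rng
    L, q = h.shape
    seq = rng.integers(q, size=L)  # random initial sequence
    for _ in range(n_steps):
        i = rng.integers(L)        # pick a site
        a_new = rng.integers(q)    # propose a new state for that site
        a_old = seq[i]
        # Energy change of the single-site substitution.
        dE = h[i, a_old] - h[i, a_new]
        dE += sum(J[i, j, a_old, seq[j]] - J[i, j, a_new, seq[j]]
                  for j in range(L) if j != i)
        if dE <= 0 or rng.random() < np.exp(-dE):
            seq[i] = a_new  # accept the proposal
    return seq

# Toy usage with random (meaningless) parameters: L = 10 positions, q = 21 states.
L, q = 10, 21
rng = np.random.default_rng(0)
h = rng.normal(size=(L, q))
J = rng.normal(scale=0.1, size=(L, L, q, q))
J = (J + J.transpose(1, 0, 3, 2)) / 2  # enforce J_ij(a, b) == J_ji(b, a)
print(sample_potts(h, J, rng=rng))
```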

Article and author information

Author details

  1. Damiano Sgarbossa

    Institute of Bioengineering, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
    Competing interests
    The authors declare that no competing interests exist.
    ORCID iD: 0000-0002-7878-6061
  2. Umberto Lupo

    Institute of Bioengineering, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
    For correspondence
    umberto.lupo@epfl.ch
    Competing interests
    The authors declare that no competing interests exist.
  3. Anne-Florence Bitbol

    Institute of Bioengineering, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
    For correspondence
    anne-florence.bitbol@epfl.ch
    Competing interests
    The authors declare that no competing interests exist.
    ORCID iD: 0000-0003-1020-494X

Funding

European Research Council (851173)

  • Damiano Sgarbossa
  • Umberto Lupo
  • Anne-Florence Bitbol

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Copyright

© 2023, Sgarbossa et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.


Cite this article

Damiano Sgarbossa, Umberto Lupo, Anne-Florence Bitbol (2023) Generative power of a protein language model trained on multiple sequence alignments. eLife 12:e79854. https://doi.org/10.7554/eLife.79854
