Generative power of a protein language model trained on multiple sequence alignments

  1. Damiano Sgarbossa
  2. Umberto Lupo  Is a corresponding author
  3. Anne-Florence Bitbol  Is a corresponding author
  1. Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
  2. SIB Swiss Institute of Bioinformatics, Switzerland

Abstract

Computational models starting from large ensembles of evolutionarily related protein sequences capture a representation of protein families and learn constraints associated to protein structure and function. They thus open the possibility for generating novel sequences belonging to protein families. Protein language models trained on multiple sequence alignments, such as MSA Transformer, are highly attractive candidates to this end. We propose and test an iterative method that directly employs the masked language modeling objective to generate sequences using MSA Transformer. We demonstrate that the resulting sequences score as well as natural sequences, for homology, coevolution, and structure-based measures. For large protein families, our synthetic sequences have similar or better properties compared to sequences generated by Potts models, including experimentally validated ones. Moreover, for small protein families, our generation method based on MSA Transformer outperforms Potts models. Our method also more accurately reproduces the higher-order statistics and the distribution of sequences in sequence space of natural data than Potts models. MSA Transformer is thus a strong candidate for protein sequence generation and protein design.

Editor's evaluation

This important study proposes a method to sample novel sequences from a protein language model that could have exciting applications in protein sequence design. The claims are supported by a solid benchmarking of the designed sequences in terms of quality, novelty and diversity.

https://doi.org/10.7554/eLife.79854.sa0

Introduction

Designing new proteins with specific structure and function is a highly important goal of bioengineering. Indeed, it can allow to tune their stability or their biochemical properties, including their enzymatic activities, enabling important medical applications. The search for novel proteins is difficult due to the huge size of protein sequence space: for instance, there are 20100 different possible sequences for a short protein domain with 100 amino acids. Furthermore, only a small fraction of this space comprises sequences that do fold, as demonstrated by experiments studying random sequences (Socolich et al., 2005), and by theoretical arguments based on the physics of disordered systems (Bialek, 2012). De novo or rational protein design, which starts with target three-dimensional structures and physico-chemical potentials, can generate proteins which are not in a known protein family (Dahiyat and Mayo, 1997; Kuhlman et al., 2003; Liang et al., 2009), but is generally restricted to small proteins (Rocklin et al., 2017). Conversely, directed evolution allows to perform a local search of sequence space, but generally remains limited to the vicinity of a natural sequence (Arnold, 2018).

Generative computational models that build on the breadth of available natural protein sequence data, and capture a representation of protein families, now offer great alternatives that can allow to sample novel sequences belonging to protein families. In particular, Potts models, or DCA models (Weigt et al., 2009; Morcos et al., 2011; Marks et al., 2011; Ekeberg et al., 2013), which are pairwise maximum entropy models trained to reproduce the one- and two-body statistics of the sequences of a family, allow direct sampling from a probability distribution modeling this family (Figliuzzi et al., 2018), and have been used successfully for protein design (Russ et al., 2020). Variational autoencoders are deep learning models which also allow sampling, and they have been shown to successfully produce functional proteins (Hawkins-Hooker et al., 2021a), although their statistical properties appear to have a lower quality than with Potts models (McGee et al., 2021).

Protein language models are deep learning models based on natural language processing methods, especially attention (Bahdanau et al., 2015) and transformers (Vaswani et al., 2017). They are trained on large ensembles of protein sequences, and capture long-range dependencies within a protein sequence (Alley et al., 2019; Elnaggar et al., 2021; Rives et al., 2021; Rao et al., 2021b; Vig et al., 2021; Madani et al., 2020; Madani et al., 2021; Rao et al., 2021a). These pre-trained models are able to predict structure in an unsupervised way (Rao et al., 2021b), either taking as input a single sequence (Rives et al., 2021) or a multiple sequence alignment (MSA) (Rao et al., 2021a), potentially by transferring knowledge from their large training set (Bhattacharya et al., 2020; Bhattacharya et al., 2022). The great success of supervised protein structure prediction by AlphaFold (Jumper et al., 2021) is partly based on the use of transformers. It is therefore of strong interest to assess the generative ability of protein language models, and recent works show that this has high potential (Madani et al., 2021; Johnson et al., 2021; Hawkins-Hooker et al., 2021b; Ferruz et al., 2022; Hie et al., 2022).

Correlations in amino-acid usage that can be observed between the columns of MSAs of homologous proteins (Casari et al., 1995; Lapedes et al., 1999; Dunn et al., 2008) were experimentally demonstrated to be highly important to generate functional synthetic proteins (Socolich et al., 2005; Bialek and Ranganathan, 2007). The importance of pairwise coevolution signals was then corroborated by the success of Potts models at predicting structural contacts (Weigt et al., 2009; Marks et al., 2011; Morcos et al., 2011; Sułkowska et al., 2012; Ekeberg et al., 2013), analyzing mutational effects (Dwyer et al., 2013; Cheng et al., 2014; Cheng et al., 2016; Figliuzzi et al., 2016), protein evolution (de la Paz et al., 2020) and conformational changes (Morcos et al., 2013; Malinverni et al., 2015), designing proteins (Russ et al., 2020), and predicting protein–protein interaction partners (Bitbol et al., 2016; Gueudré et al., 2016; Cong et al., 2019; Green et al., 2021). Protein language models that take MSAs as input (Rao et al., 2021a; Jumper et al., 2021) are able to directly exploit this covariation signal, and are thus particularly interesting candidates for protein design. Thus motivated, we focus on MSA Transformer (Rao et al., 2021a), a protein language model which was trained on MSAs using the masked language modeling (MLM) objective, without additional supervised training – by contrast to AlphaFold (Jumper et al., 2021). We ask how the generative properties of MSA Transformer compare to those of Boltzmann machine DCA (bmDCA) (Figliuzzi et al., 2018; Russ et al., 2020), a state-of-the-art generative Potts model.

We propose and test a generating method that directly uses the MLM objective in an iterative way to generate sequences using MSA Transformer. Using homology, coevolution, and structural scores, we demonstrate that the sequences generated by this method score as well as natural sequences. We further show that this good performance is not restricted to synthetic sequences that are very similar to natural sequences. For large protein families, our synthetic sequences have homology and structure-based scores as good as or better than sequences generated by bmDCA, and have similar properties to experimentally validated bmDCA-generated sequences. Moreover, for small protein families, our generation method based on MSA Transformer outperforms bmDCA, by providing synthetic sequences that score well without being extremely similar to natural ones. However, we find that bmDCA better reproduces the one- and two-body statistics of the natural MSAs than MSA Transformer when used with default parameters, consistently with its training objective. Interestingly, the opposite generally holds for higher-order statistics. MSA-Transformer–generated sequences also better reproduce the distribution of sequences in sequence space than bmDCA-generated ones. Our conclusion is that MSA Transformer is a strong candidate for protein sequence generation and protein design.

Results

An iterative masking procedure allows MSA Transformer to generate novel sequences with high scores

Can the protein language model MSA Transformer (Rao et al., 2021a) be used to generate sequences that are credible members of protein families? How do its generative abilities compare to Potts models inferred by bmDCA (Figliuzzi et al., 2018), a state-of-the-art generative DCA method which has been experimentally shown to generate functional proteins (Russ et al., 2020)? To address these questions, we developed and employed an iterative masking procedure to generate synthetic MSAs from natural MSAs of 14 different large Pfam protein families (see Appendix 1—table 5) and 7 small ones (see Appendix 1—table 6) with MSA Transformer, as described in ‘Using MSA Transformer to generate sequences via an iterative masking procedure’. We also generated synthetic sequences by Markov Chain Monte Carlo (MCMC) sampling from Potts models inferred from these MSAs by bmDCA, using two variants that differ by sampling temperature T and regularization strength λ, matching, respectively, the default parameters employed in Figliuzzi et al., 2018, and some of those used in Russ et al., 2020, see ‘Sampling sequences from Potts models’ for details. For each protein family, we thus obtained four different MSAs of the same depth: the natural one, the one generated by our iterative masking procedure using MSA Transformer, and the two MSAs sampled from the inferred Potts model. To characterize each sequence, we consider four different scores (see ‘Scoring individual sequences’). First, we assess the quality of the generated sequences as homologs of the protein family of interest; we do this via the HMMER (http://hmmer.org) score of the hidden Markov model employed by Pfam to retrieve natural homologs. Second, we consider a score that accounts for coevolution between amino-acid sites, namely the statistical energy score from the Potts model fitted on the natural MSA. Third, we determine AlphaFold’s confidence in its determination of the three-dimensional structure of these sequences, via the predicted local-distance difference test (pLDDT) score. Fourth, to assess whether the predicted structures are similar to the native ones, we compute the root-mean-squared deviation (RMSD) between a reference experimental structure and the AlphaFold predicted structures. The first three scores are such that higher values are better, while smaller RMSD values indicate that predicted structures are similar to the native ones. Together, these scores account for very different aspects of proteins, namely homology, coevolution and structure.

Let us first consider the 14 large protein families in Appendix 1—table 5, where MSAs are deep enough to accurately fit Potts models using bmDCA (Figliuzzi et al., 2018). Figure 1 shows that, for all these protein families, and for these four different scores, the sequences generated by MSA Transformer using our iterative masking procedure have scores that are at least as good as those of natural sequences, and better than those of sequences generated by bmDCA with default parameters (Figliuzzi et al., 2018), as confirmed by the Kolmogorov–Smirnov test (see Appendix 1—table 1). Decreasing the sampling temperature and the regularization strength used with bmDCA improves the statistical energy score as expected (Russ et al., 2020), but also other scores. These other scores, and most importantly our two structural scores, are similar or better for MSA-Transformer–generated sequences compared to those generated by bmDCA with non-default parameters. In particular, the median pLDDT score is larger for the former than for the latter in 11 protein families out of 14, by a margin larger than the standard deviation in 4 of them (see Appendix 1—table 2). These results demonstrate that MSA Transformer is a good candidate to generate synthetic sequences from protein families, and that our iterative masking procedure allows to perform such generation.

Figure 1 with 1 supplement see all
Comparison of homology, coevolution, and structure-based scores between natural sequences and sequences generated by MSA Transformer or Boltzmann machine DCA (bmDCA).

For each Pfam family in Appendix 1—table 5, we compare a natural MSA from Pfam and three synthetic MSAs of the same depth. The first synthetic MSA was obtained using MSA Transformer via our iterative masking procedure, and the second and third ones were generated by a Potts model inferred from the natural MSA using bmDCA with two different pairs (λ,T) of regularization strength λ and sampling temperature T. For each of the four scores described in ‘Scoring individual sequences’, we show the distributions of score values among sequences in each MSA as a violin plot. Higher score values are better for all scores except root-mean-squared deviation (RMSD) (bottom panel), where smaller values indicate a closer match to an experimental structure. Top panel: For each Pfam family, HMMER scores are divided by the highest score found in the natural MSA. Note that sequences below HMMER’s default homology detection score (E-value larger than 10), and whose HMMER score is thus 0, are not shown (the median over families of the fraction of such sequences is 2% for bmDCA (10−2, 1.00)-generated MSAs, while there are no such sequences among the MSA-Transformer–generated ones). Second panel: Statistical energy scores are defined as minus the bmDCA statistical energies. To accommodate the highly family-dependent ranges of these scores, for each Pfam family we show their values after shifting by the mean score in the natural MSA, and normalizing by the standard deviation of natural MSA scores. Third panel: AlphaFold’s predicted local-distance difference test (pLDDT) confidence scores. Bottom panel: RMSD of predicted structures with respect to the experimental structures in Appendix 1—table 5. Structural scores (pLDDT and RMSD) were computed on 200 randomly chosen sequences from each MSA. All kernel-smoothed histograms are normalized such that all violins have the same maximal width. Outliers (less than 1% in all cases) were discarded for legibility.

How different are these synthetic sequences from the natural ones? In particular, are the best-scoring sequences novel, or are they almost copies of natural sequences? In Figure 2 we show, for two example protein families (PF00072 and PF00153), the HMMER score and the DCA statistical energy score versus the sequence’s Hamming distance to its closest natural sequence in the natural MSA.

Homology and coevolution scores versus distance to the natural multiple sequence alignment (MSA), for protein families PF00072 and PF00153.

We show contour plots of the HMMER score and the statistical energy score (defined as minus the DCA statistical energy, shifted by its mean value in the natural MSA) versus the Hamming distance of each sequence to the closest natural sequence (which is not itself, in the case of natural sequences). Results are shown for natural sequences and for sequences generated using MSA Transformer and Boltzmann machine DCA (bmDCA) (the same two (λ,T) pairs as in Figure 1 are used for bmDCA). The lightest contours shown include 99% of the cumulative probability mass.

From the marginal distributions of the Hamming distances in Figure 2, we observe that MSA Transformer generates sequences with variable distances to their closest natural sequences, and that these distances are overall larger than those between natural sequences and their closest neighbors (excluding themselves). With default parameters, bmDCA generates sequences which are generally very different from the natural ones, but decreasing sampling temperature makes bmDCA-generated sequences more similar to natural ones and to each other, see Figure 1—figure supplement 1. Besides, the marginal distributions of scores illustrate the general observation made on Figure 1 and in Appendix 1—table 2 that MSA-Transformer–generated sequences have good scores. Moreover, the plots in Figure 2 reveal that the MSA-Transformer–generated sequences featuring the highest HMMER scores tend to have large Hamming distances to natural sequences, that is to be truly novel (see also ‘Choosing parameters in the iterative masking procedure’). We observe these trends for most large protein families studied, and they are robust to using BLOSUM similarity scores (Henikoff and Henikoff, 1992) instead of Hamming distances. Therefore, our sequence generation method based on MSA Transformer is not reaching good scores by just reproducing natural sequences. Besides, the diversity of MSA-Transformer–generated MSAs, as measured by their effective depth (Equation 8), is only slightly smaller than that of natural MSAs (see Figure 1—figure supplement 1). Conversely, bmDCA at low temperature produces highly redundant sequences (Figure 1—figure supplement 1), which are concentrated in specific regions of the score versus distance space in Figure 2. Indeed, sequence generation by bmDCA is then constrained to exploring the local minima of the Potts model energy landscapes.

Sequence generation by the iterative masking procedure is successful for small protein families

Accurately fitting Potts models requires deep and diverse MSAs, as evidenced by the strong dependence of structural contact prediction by Potts models on MSA depth (Marks et al., 2011; Morcos et al., 2011). By contrast, MSA Transformer was trained on many MSAs, and is able to transfer knowledge across protein families. It outperforms Potts models at unsupervised contact prediction most strongly for shallow MSAs (Rao et al., 2021a). How does sequence generation using our iterative masking procedure based on MSA Transformer compare to bmDCA in the case of small protein families?

To address this question, we generated synthetic MSAs starting from seven small families, using both our iterative masking procedure based on MSA Transformer and bmDCA with default parameters and with low sampling temperature. Figure 3 reports all four scores discussed above in the case of these seven small families, listed in Appendix 1—table 6. We observe that MSA-Transformer–generated sequences have similar HMMER scores and structural scores to natural sequences. MSA-Transformer–generated sequences also generally have better HMMER scores and structural scores than those generated by bmDCA with default parameters. While low-temperature bmDCA yields better statistical energy scores (as expected), and also gives HMMER scores and structural scores comparable to natural sequences, it in fact generates sequences that are almost exact copies of natural ones (see Figure 3, bottom row). By contrast, MSA Transformer produces sequences that are quite different from natural ones, and have very good scores. Thus, our method based on MSA Transformer is particularly promising in the tricky case of small protein families.

Application of our sequence generation method based on MSA Transformer to small protein families.

We consider seven small protein families, with natural MSAs that comprise from nine to a few hundreds of sequences, see Appendix 1—table 6. As in Figure 1, for each family, we compare the natural MSA and three synthetic MSAs of the same depth. In all cases, we show violin plots of the same four scores as for large families in Figure 1, as well as of the Hamming distance to the closest natural sequence, which is not itself in the case of natural sequences (‘Distance’). For the three smallest families (left panel; fewer than 40 sequences), we also show the score of each individual sequence as a swarm plot. Note that while we employ the same sampling temperatures T as in Figure 1 for Boltzmann machine DCA (bmDCA), here, we use regularization strength λ=10-2 throughout, due to MSA shallowness (see ‘Sampling sequences from Potts models’).

Higher-order statistics are better reproduced by MSA Transformer, while lower-order statistics are better reproduced by bmDCA

How well do synthetic MSAs generated by our method based on MSA Transformer, and by bmDCA, reproduce the statistics of amino-acid usage observed in natural MSAs? To address this question, we consider the r20 score (Haldane et al., 2018; McGee et al., 2021), which quantifies the statistical similarity of two datasets at various orders (see ‘Analyzing the statistics of MSAs’). We compute it between each of our synthetic MSAs and the corresponding natural one, for the 14 large protein families in Appendix 1—table 5. We also present as reference an assumption-free null model, namely the r20 score between two subsets of each natural MSA. Figure 4 shows that bmDCA with default parameters is most often the best method at reproducing lower-order statistics, while MSA Transformer is the best at reproducing higher-order statistics, in all families considered. bmDCA at lower temperature performs more poorly at reproducing the statistics of natural MSAs than other methods, because low-temperature biases the sampling (bmDCA models are effectively learned at temperature T=1).

Figure 4 with 3 supplements see all
Similarity of statistics between synthetic and natural multiple sequence alignments (MSAs).

To compare the statistics of synthetic and natural MSAs at various orders, we compute r20 scores (Haldane et al., 2018; McGee et al., 2021), and plot them versus the number of different MSA columns that are considered (see ‘Analyzing the statistics of MSAs’ for details). All families in Figure 5 are considered. For each of them, the reference MSA comprises either half of the natural MSA (with sequences selected uniformly at random), or 30,000 sequences from it if the natural MSA depth is larger than 60,000. The null model compares the other half of the natural MSA to this reference MSA. It yields an estimate of the expected r20 scores due only to finite-size effects in a model-free, purely data-driven way.

To have a more detailed insight into lower-order correlations, we estimate frequencies and information theory measures, at the one-, two-, and three-body level, from our natural and synthetic MSAs, and compare them (see ‘Analyzing the statistics of MSAs’). Figure 4—figure supplement 1 shows that one- and two-body statistics are generally better reproduced by bmDCA with default parameters than by MSA Transformer, while results are more mixed for three-body statistics. Figure 4—figure supplement 2 and Figure 4—figure supplement 3 show a comparison of second- and third-order connected correlations for PF00072 and PF00153. For PF00072, bmDCA reproduces better the second- but also third-order connected correlations of the natural data than MSA Transformer, while for PF00153, MSA Transformer reproduces the third-order connected correlations better than bmDCA, consistently with Figure 4. Potts models are pairwise maximum entropy models constrained to match the one- and two-body frequencies from natural MSAs. Thus, bmDCA is trained to reproduce these frequencies, and achieves these objectives quite well, although the comparison to the null model in Figure 4—figure supplement 2 and Figure 4—figure supplement 3 hints that further improvements remain possible, see Meshulam et al., 2021. MSA Transformer has entirely different training objectives, but, interestingly, it performs comparably at reproducing three-body statistics and is better at reproducing even higher-order statistics than bmDCA.

MSA Transformer captures well the distribution of sequences in sequence space

How are synthetic MSAs generated by MSA Transformer and bmDCA impacted by the heterogeneous repartition of natural sequences in sequence space? While natural protein sequences in a family have evolved from a common ancestor along a phylogeny, synthetic sequences do not have a real evolutionary history. However, as bmDCA and MSA Transformer are trained on natural data, they can capture phylogenetic correlations (Lupo et al., 2022). Besides, inferred Potts models are known to be impacted by phylogenetic correlations (Weigt et al., 2009; Marks et al., 2011; Qin and Colwell, 2018; Vorberg et al., 2018; Rodriguez Horta et al., 2019; Rodriguez Horta and Weigt, 2021; Marmier et al., 2019; Colavin et al., 2022; Gerardos et al., 2022; Dietler et al., 2023).

To analyze the overall distribution of MSA sequences in sequence space, we first perform a principal component (PC) analysis of one-hot encoded MSAs, and focus on the top two PCs (Figliuzzi et al., 2018) (see ‘Characterizing the distribution of sequences in MSAs’). Figure 5 shows the distribution of sequences in the space spanned by these top two PCs, for natural and synthetic MSAs, in the cases of PF00072 and PF00153. We observe that MSA Transformer is able to generate sequences with a distribution in sequence space that is very similar to that of the natural MSA. Conversely, bmDCA captures the overall shape of this distribution, but appears to smooth it compared to the natural data with default parameters and to restrict to sparse regions of the sequence space at low temperature, consistently with our previous results. These observations are general across all the deep MSAs we considered (see Figure 5—figure supplement 1 and Figure 5—figure supplement 2). Note that a limitation of this analysis is that the top two PCs explain a small fraction of the variance in all cases (see Figure 5).

Figure 5 with 4 supplements see all
Distribution of sequences in sequence space, for families PF00072 and PF00153.

We show the distribution of one-hot encoded natural and synthetic sequences projected in the subspace of the first two principal components of the natural multiple sequence alignment (MSA). The same axis limits are used within one family, except for Boltzmann machine DCA (bmDCA) (10−3, 0.33) in the case of PF00072. Note that the fraction of the total variance explained by the first two principal components of each MSA is less than 4% for all families and all generation methods.

Next, to assess whether generated sequences most resemble natural ones that are well represented in their family or, rather, rare ones, we consider the closest natural sequence to each synthetic sequence, and count the neighbors of this natural sequence in the natural MSA (see ‘Characterizing the distribution of sequences in MSAs’). Figure 5—figure supplement 3 compares the distribution of these numbers of neighbors for natural sequences and for the closest natural sequences to generated sequences, in the cases of PF00072 and PF00153. It shows that bmDCA generates sequences similar to natural sequences with fewer neighbors than typical in the natural data. Conversely, MSA Transformer generates sequences whose closest natural sequences have a distribution of number of neighbors similar to that of the natural MSA. This suggests that our generation method based on MSA Transformer tends to sample from denser regions of the sequence space than bmDCA, while not reproducing natural sequences (see also Figure 2 and ‘Choosing parameters in the iterative masking procedure’).

Finally, to analyze in more detail the apparent relatedness of generated sequences, and compare it to real phylogenetic relationships in natural sequences, we infer phylogenetic trees from each synthetic and natural MSA, and analyze the eigenvalue spectrum of their modified graph Laplacian (MGL) to compare them (Lewitus and Morlon, 2016) (see ‘Characterizing the distribution of sequences in MSAs’). Figure 5—figure supplement 4 compares the density of these eigenvalue spectra for natural and synthetic MSAs regarding families PF00072 and PF00153. The skewness and the position of such distributions are indicators of the topology of the tree. In particular, distributions with negative skewness (right unbalanced) or which are shifted to the right, correspond to ‘tippy’ trees, while the opposite case corresponds to ‘stemmy’ trees (Lewitus and Morlon, 2016), which feature an accumulation of recent speciation events (short leaves length) (Molina‐Venegas, 2021). In this light, Figure 5—figure supplement 4 shows that both MSA Transformer and low-temperature bmDCA generate sequences with an apparent phylogeny that is more stemmy than the natural one, while bmDCA with default parameters yields a slightly more tippy tree. This is consistent with our observations regarding sequence diversity, which is larger than in natural data for bmDCA with default parameters, slightly smaller than in natural data using MSA Transformer and much lower using low-temperature bmDCA (see Figure 1—figure supplement 1).

Comparison with published experimental datasets

How do the sequences generated by our method based on MSA Transformer compare to published protein design experimental datasets? Recently, sequences sampled from a bmDCA Potts model of the chorismate mutase protein family were experimentally demonstrated to be functional (Russ et al., 2020). In Figure 6—figure supplement 1, we show plots analogous to those in Figure 2, plus additional ones for our two structural scores (pLDDT and RMSD), in the case of chorismate mutase. This allows a detailed comparison between the sequences we generate using MSA Transformer and the sequences designed in Russ et al., 2020 using bmDCA with a combination of different temperatures and regularization strengths. We find that our method based on MSA Transformer produces sequences that score as well as artificial sequences which have been tested experimentally. Besides, we obtained these results without fine-tuning the parameters of our generative procedure to this family, while several specific combinations of parameters were used in Russ et al., 2020.

To further compare our generated sequences to those tested experimentally in Russ et al., 2020, we consider relative enrichment, which is the experimental score used in Russ et al., 2020 to assess the function of chorismate mutase enzymes. This score was measured in Russ et al., 2020 for all sequences in the natural MSA and for sequences generated with bmDCA. We estimate the expected relative enrichment of our generated sequences as the relative enrichment of the closest natural sequence. To test our estimation procedure, we estimate the relative enrichments of the bmDCA-generated sequences from Russ et al., 2020, and we compare them to the experimentally measured values. We focus on the top third of sequences in terms of pLDDT scores, as it was shown in Malbranke et al., 2021 that good structural scores help to select functional sequences. Figure 6 shows that in this ensemble, sequences with a high (resp. low) estimated score have a high (resp. low) experimental score too. Next, we compare the distributions of estimated relative enrichment for sequences generated using our method based on MSA Transformer and for the bmDCA-generated sequences from Russ et al., 2020. Figure 6 shows that they are quite similar to each other. This holds both when focusing on the top third of sequences in terms of pLDDT scores for each generation method, and when considering all generated sequences. Furthermore, in the high-pLDDT case, these distributions are quite similar to the distribution of measured relative enrichment for bmDCA-generated sequences. Importantly, a similar fraction of MSA-Transformer–generated sequences and of bmDCA-generated sequences have a large estimated relative enrichment. This suggests that our method based on MSA Transformer should be able to generate functional sequences.

Figure 6 with 2 supplements see all
Comparison of our generated sequences to those experimentally tested in Russ et al., 2020, for the chorismate mutase family.

Left: The estimated relative enrichment (r.e.) scores of the Boltzmann machine DCA (bmDCA)-generated sequences that are in the top 33% in terms of predicted local-distance difference test (pLDDT) scores are plotted versus their experimentally measured counterparts from Russ et al., 2020. We estimate the expected r.e. of these generated sequences as the r.e. of the closest natural sequence measured in Russ et al., 2020. We observe that high estimated r.e. is associated with high measured r.e., as 71% of sequences with estimated r.e. > 0.4 (green) also have measured r.e. > 0.4. Note that in the top marginals (showing the measured r.e. for bmDCA-generated sequences), the green and yellow histograms are stacked on top of each other. Thus, the stacked histogram shows the distribution of all measured r.e. values for bmDCA-generated sequences that are in the top 33% in terms of pLDDT scores. Top right: Overlaid histograms of estimated r.e. are shown for our MSA-Transformer–generated sequences and for the bmDCA-generated ones from Russ et al., 2020, restricting in both cases to the sequences with top 33% pLDDT scores. Bottom right: Same as top right, but considering all generated sequences.

While the data from Russ et al., 2020 is particularly well suited to retrospectively evaluate our sequence generation method, we also propose a comparison of the distributions of scores based on experimental deep mutational scans (DMS) for protein families PF00595 (McLaughlin et al., 2012) and PF13354 (Stiffler et al., 2015). We compute these DMS scores for each natural and synthetic sequence, by summing the experimentally measured effects of the relevant single-point mutations with respect to the reference sequence of the experimental studies. Figure 6—figure supplement 2 shows the distribution of the DMS scores of natural and generated sequences for these two families. Our sequence generation method based on MSA Transformer better reproduces the DMS score distribution of natural sequences than bmDCA, and generates sequences with better average DMS scores. Despite the potential limitations of our DMS scores, for example their additivity, these results corroborate our other findings and provide further encouragement for our sequence generation method based on MSA Transformer.

Discussion

In this work, we proposed an iterative masking procedure which directly exploits the MLM objective of protein language models to generate sequences using the MSA-based neural language model MSA Transformer. We found that these sequences score as well as natural ones on three very different aspects, namely homology, coevolution, and structure. For large protein families, our synthetic sequences have homology and structure-based scores at least as good as bmDCA-generated sequences, and have similar properties to experimentally validated ones. Moreover, our generation method based on MSA Transformer is less limited by shallow MSAs than bmDCA, and is thus particularly promising for small protein families. Besides, MSA-Transformer–generated sequences better reproduce the higher-order statistics and the distribution of sequences in sequence space of natural data than bmDCA-generated ones. Conversely, bmDCA, with default parameters, better reproduces first- and second-order statistics, consistently with its training objective.

Our results are highly promising for sequence generation by MSA-based protein language models, and we hope that they will motivate further studies, especially experimental tests. They also show that protein deep learning models based on the MLM objective have great generative potential, despite not being obvious generative models. More generally, our results reinforce the new promising ‘coevolution-driven’ protein design approach of learning from sequences of evolutionarily related proteins the constraints associated to protein structure and function. This concept differs from structure- and physics-based de novo design (Dahiyat and Mayo, 1997; Kuhlman et al., 2003; Liang et al., 2009), and from the new possibility to use supervised deep learning models able to accurately predict protein structures (Jumper et al., 2021; Baek et al., 2021; Chowdhury et al., 2023) for structure-driven sequence generation (Anishchenko et al., 2021). One can view the coevolution-driven approach as intermediate between structure-based approaches and directed evolution ones (Arnold, 2018). The coevolution-driven approach was recently experimentally validated in the case of bmDCA Potts models, which capture pairwise coevolution patterns in MSAs (Russ et al., 2020), and for variational autoencoders (Hawkins-Hooker et al., 2021a; McGee et al., 2021). Protein language models trained on MSAs provide state-of-the-art unsupervised contact prediction and are able to capture coevolutionary patterns in their tied row attentions (Rao et al., 2021a), and capture phylogenetic relationships in column attentions (Lupo et al., 2022). This makes them ideal candidates to generate new protein sequences from given families. However, contrary to Potts models and variational autoencoders (McGee et al., 2021), they do not allow direct sampling from a probability distribution over sequences (Goyal et al., 2021). Here, we demonstrated the power of a simple generation method directly based on the MLM objective used for the training of MSA-based protein language models. It differs from using a decoder, which, though designed to perform autoregressive generation of amino acids to form a new sequence, requires training a full encoder–decoder model and learning a parametric function mapping an MSA to a distribution over its sequences (Hawkins-Hooker et al., 2021b). We instead directly employed the representation of protein families captured by the self-supervised model MSA Transformer to generate sequences. More sophisticated sampling methods could be considered along this line (Goyal et al., 2021), but our minimal approach already gives very promising results.

We have focused on a large protein language model and compared it to the simplest model capturing coevolution, namely the Potts model, but we note that interpretable models of intermediate complexity such as restricted Boltzmann machines (Tubiana et al., 2019) could also be explored for coevolution-driven protein design. All these methods rely on MSAs; this is very useful to capture coevolution, but also means that one has to rely on potentially imperfect alignments. Thus, starting from alignment-free methods (Bileschi et al., 2022; Shin et al., 2021; Madani et al., 2021) also constitutes a promising direction.

Methods

Using MSA Transformer to generate sequences via an iterative masking procedure

Iterative masking procedure

In order to generate new sequences using MSA Transformer, we directly leverage the model’s ability to assign, to arbitrary masked residue positions, a probability for each of the possible amino-acid tokens, given by the softmax of the model’s output logits (Wang and Cho, 2019; Goyal et al., 2021; Rao et al., 2021b). Indeed, in its pre-training, MSA Transformer applies the MLM objective to a training set of 26 million MSAs (Rao et al., 2021b). For this, it minimizes a pseudolikelihood loss, which reads, for an MSA M, and a version M~ of M in which some amino acids (those in a ‘mask’) are masked:

(1) LMLM(M,M~;θ)=(m,i)masklogp(xm,iM~;θ).

Here, xm,i denotes the amino acid at the ith residue position in the mth sequence of M, and θ denotes all model parameters. For each position i in each sequence m, the model outputs one value (‘logit’) per amino-acid/gap symbol, and softmax-normalizing all values from this location in the MSA yields the conditional probabilities p(xm,i|M~;θ) in Equation 1, which are then summed over the subset of masked MSA locations.

We propose an iterative masking procedure (see Figure 7) which, given an arbitrary MSA M of natural sequences, proceeds as follows:

  1. If necessary, subsample M to obtain an input MSA M for MSA Transformer. The depth of M is chosen given the memory footprint of MSA Transformer. In practice, we use input MSAs containing 600 sequences picked uniformly at random from our natural MSA. (During training, the authors of Rao et al., 2021a kept LM<214, where L is sequence length and M is MSA depth. However, we found that during inference we can use 217 tokens on an Nvidia V100 32GB GPU.) Note that, for large protein families, multiple 600-sequence MSAs obtained using the procedure presented here are then combined into a single MSA of the same depth as the natural one (see below).

  2. Randomly mask each residue of M with a masking probability p , otherwise leave it unchanged. In practice, we choose p=0.1 (see ‘Choosing parameters in the iterative masking procedure’).

  3. Feed the masked MSA to the model, and fill each masked entry with the token with highest probability (obtained from the model’s output logits).

  4. Repeat Steps 2–3 a number of times. In practice, we stop the algorithm after I=200 iterations.

Figure 7 with 4 supplements see all
Iterative masking procedure to generate sequences using MSA Transformer.

Here, the red hashtag (#) stands for a masked amino acid, while blue uppercase letters stand for predicted amino acids at the masked positions.

As natural MSAs, we use Pfam full MSAs for 14 protein families, described in ‘Datasets’. For each natural MSA M, we repeat the procedure above multiple times, sampling sequences each time from M without replacement to obtain a different input MSA M in Step 1, until all the sequences in M are used. Note that sequences remain aligned at all times during the procedure. Combining the MSAs resulting from all these batches then yields a synthetic MSA with the same depth as the natural one, which ensures that the statistical properties of the synthetic MSA are subject to the same magnitude of finite-size errors as those of the natural MSA.

Choosing parameters in the iterative masking procedure

Figure 7—figure supplement 1 illustrates, in the case of Pfam family PF00153, for different values of the masking probability p, how different properties of the generated MSAs evolve with the number I of iterations in the iterative masking procedure. For p<0.5, we observe a gradual divergence from the initial natural sequences (Figure 7—figure supplement 1A, B) and a simultaneous increase of scores (Figure 7—figure supplement 1C, D, see ‘Scoring individual sequences’ for definitions) and decrease of MSA diversity (Figure 7—figure supplement 1E), and then a saturation of these various measures, as I increases. Our choice I=200 is motivated by the fact that plateaus are reached at this point. However, the final values of all scores depend on p. Figure 7—figure supplement 4 shows the contact maps inferred by MSA Transformer (using the logistic regression on tied row attentions trained in Rao et al., 2021a) from generated sequences, for various values of I and p, in the case of family PF00153. We observe that the contact map characteristic of the protein family of interest gets gradually lost as I is increased for larger values of p (see Figure 7—figure supplement 1F and Figure 7—figure supplement 4). These issues when p is increased are understandable, given that the pseudolikelihood loss used for the MLM objective in MSA Transformer ignores dependencies between masked entries. We note that despite this, larger values of p yield overall better average HMMER scores (Eddy, 1998) and statistical energy scores (for p<0.5). Our choice of p=0.1 is motivated by the fact that this value is close to that employed in the training of the model (p0.12) (Rao et al., 2021a), and that it better preserves contact maps. The product pI gives the average number of times that each amino acid of the MSA is changed during the generation process. With our choices, each amino acid is masked 20 times on average.

The behaviors observed in Figure 7—figure supplement 1 for PF00153 are generic across the protein families we studied, as can be seen in Figure 7—figure supplement 2 and Figure 7—figure supplement 3, which show the same data as in Figure 7—figure supplement 1 for Pfam families PF00096 and PF13354 (which have different sequence lengths). This demonstrates that our sequence generation method is robust. In particular, as the parameters p=0.1 and I=200 yield satisfactory convergence of MSA properties and preservation of contact maps in all cases, we used these parameters throughout, without any family-specific fine-tuning.

The sequences thus generated by our method do not coincide with natural ones. The fraction of MSA-Transformer–generated sequences which are identical to sequences in the input natural MSAs is below 5×10-4 for all large families considered, except three families with low diversity and/or very short sequence length (PF00096, PF00397, and PF00595).

Variants of the iterative masking procedure

In our algorithm, we mask tokens randomly throughout the input MSA. We also explored an alternative procedure where masking is restricted to the first sequence of the input MSA. Thus, all other sequences act as a context for the first sequence which is gradually modified. This can be done either with a fixed context, or by sampling different sequences from the natural MSA at each iteration to form a variable context. Note that the procedure with fixed context is reminiscent of the non-iterative one used in Meier et al., 2021 to compute DMS scores from MSA Transformer. For the same masking probability p=0.1 as in our standard procedure (note that fewer iterations are needed for convergence, in practice I=20 suffices), the alternative procedure with fixed context yields sequences that are overall slightly less different from natural ones than the standard iterative masking procedure, while the opposite holds with variable context. Besides, both alternative procedures yield sequences with better HMMER scores, but worse statistical energy scores, than natural ones – see Appendix 1—table 3. Finally, the two- and three-body statistics (defined in ‘Analyzing the statistics of MSAs’) of the natural MSA are less well reproduced using these alternative procedures than the standard one – see Appendix 1—table 3. We also note that these variants are computationally more demanding. In this context, we decided to focus on the standard iterative masking procedure.

There are also different ways of selecting the token to fill each masked position. We have chosen a greedy sampling method where the token with highest probability is selected. We also explored an alternative method where the new token to fill the masked position is chosen by sampling the probability distribution given by the softmax of the logits, see Equation 1. This method allows to introduce a sampling temperature T into the softmax operation and compute the probability as p=softmax(ξ/T), where ξ is the logit vector. Note that the greedy method that we employ corresponds to sampling at T=0. We found that MSAs generated with higher values of T are farther from the corresponding natural MSAs, showing that increasing this sampling temperature promotes originality. However, they are of lower quality according to our HMMER and statistical energy scores, and reproduce the statistics of the natural data less well. These results, summarized in Appendix 1—table 3, motivated us to mainly consider greedy sampling.

Finally, in our iterative masking procedure, we subsample the initial natural MSAs uniformly at random. We also tried diversity maximizing sampling (Rao et al., 2021a), but we found that random sampling gives slightly better results.

Sampling sequences from Potts models

To sample independent equilibrium sequences from Potts models, we used the strategy described in Lupo et al., 2022. Specifically, we fitted Potts models on each of our natural MSAs using bmDCA (Figliuzzi et al., 2018) (https://github.com/ranganathanlab/bmDCA; Figliuzzi and Barrat-Charlaix, 2020). Using bmDCA is known to yield Potts models with good generative power (Figliuzzi et al., 2018; Russ et al., 2020).

Consider a sequence of L amino-acid sites. We denote by xi{1,,q} the state of site i{1,,L}, where q=21 is the number of possible states, namely the 20 natural amino acids and the alignment gap. The Potts model Hamiltonian of a sequence x=(x1,,xL) reads (Weigt et al., 2009; Cocco et al., 2018):

(2) H(x)=i=1Lhi(xi)j=1Li=1j1eij(xi,xj).

For each MSA M in Appendix 1—table 5, we inferred parameters hi(xi) and eij(xi,xj) by bmDCA (Figliuzzi et al., 2018; Russ et al., 2020). The Potts model probability distribution is then given by the Boltzmann distribution associated to the Hamiltonian H in Equation 2:

(3) P(x)=eH(x)/TZ,

where Z is a constant ensuring normalization and T is a parameter whose default value is 1. To generate a synthetic MSA from M, we performed equilibrium MCMC sampling from the Potts model with Hamiltonian H in Equation 2. Specifically, we used the implementation in Lupo et al., 2022 of the Metropolis–Hastings algorithm, in which each step is a proposed mutation at a single amino-acid site. We started from a set of M randomly and independently initialized sequences, where M is the depth of M, and made a total number N of Monte Carlo steps on each sequence. For each M, suitable values for N are estimated by bmDCA during its training, to ensure that Metropolis–Hastings sampling reaches thermal equilibrium after N steps when starting from a randomly initialized sequence (Figliuzzi et al., 2018). We thus used the value of N estimated by bmDCA at the end of training. This yielded, for each MSA in Appendix 1—table 5, a synthetic MSA of the same depth, composed of independent equilibrium sequences.

This procedure allows to tune the sampling temperature T, in a similar spirit as for MSA Transformer, cf. ‘Variants of the iterative masking procedure’. This amounts to tuning the selection strength. Recall that Potts models are inferred at T=1, which is thus the default value. Using MCMC sampling as described above, we first generated synthetic MSAs at T=1, and using regularization strength λ=10-2. These correspond to the default parameters (λ,T), matching those employed in Figliuzzi et al., 2018, and allowing direct comparison with those results. Importantly, using sampling temperature T=1 means that the distribution learnt from natural data is directly sampled. However, it was found in Russ et al., 2020 that sequences generated at T=1 have worse statistical energy scores than natural sequences, due at least in part to high regularization, and that this can be corrected by lower-temperature sampling. Therefore, for completeness, we also considered all parameter combinations (λ,T) used in Russ et al., 2020 for PF00072. Appendix 1—table 4 shows that decreasing sampling temperature strongly improves the mean statistical energy score, as it should, and somewhat improves HMMER scores and structural scores. However, this comes at the cost of decreasing MSA diversity and getting sequences substantially more similar to natural ones. It also strongly impairs the fitting of the one- and two-body statistics. The effect of changing regularization strength (at inference) appears to be more minor, but decreasing it allows to somewhat mitigate the loss of diversity associated to lowering temperature. In light of these results, and to make our comparison to bmDCA comprehensive, we used (λ,T)=(10-3,0.33) (Russ et al., 2020) in addition to (λ,T)=(10-2,1) (Figliuzzi et al., 2018) throughout our analysis of deep MSAs. In the case of shallow MSAs (3), we employed (λ,T)=(10-2,0.33) instead of (λ,T)=(10-3,0.33) because shallow MSAs require stronger regularization strengths.

Scoring individual sequences

We use different scores to compare natural and generated sequences.

First, HMMER scores (Eddy, 1998) are computed, for each sequence, from the Pfam profile Hidden Markov Models (HMM), employing the function hmmsearch from the HMMER Suite version 3.3.2 (http://hmmer.org). HMMER scores are homology scores, which are in particular used in Pfam to search sequence databases for sequence homologs and to construct full MSAs starting from curated seed MSAs. Higher HMMER scores indicate better matches to the Pfam HMM.

Second, DCA statistical energy scores are computed for each sequence using the Potts model Hamiltonian H in Equation 2 with the couplings and the fields inferred by bmDCA on the natural MSA of the family of interest (see ‘Sampling sequences from Potts models’). The statistical energy score is then defined as the opposite of the statistical energy, that is -H(x) for a sequence x, so that, here too, higher values mean better scores.

We also compute AlphaFold (Jumper et al., 2021) structural prediction confidence scores, that is pLDDT values. Given the computational cost, for each natural or generated MSA, we evaluate pLDDT values for a subset of 200 randomly sampled sequences.

Finally, we compute the root-mean-square deviation (RMSD) between a reference experimental structure of the family of focus (see list in Appendix 1—table 5) and the AlphaFold predicted structures, also for a subset of 200 randomly sampled sequences in each MSA.

Because AlphaFold takes MSAs as input, we compute these two structural scores using the whole natural MSA of the family of interest as context in all cases. In addition, for the protein family PF00072, we also used fully synthetic MSAs as input to AlphaFold. Structural scores are then very similar to those obtained using natural context (see Appendix 1—table 4).

Analyzing the statistics of MSAs

To compare the generated MSAs to the natural ones, we consider different statistical measures.

First, to analyze how faithfully the generated MSAs reproduce the statistics of the natural ones at various orders, we compute the r20 score (Haldane et al., 2018; McGee et al., 2021). Specifically, to obtain Figure 4, we analyze the frequency of subsequences spanning 2–10 non-contiguous columns. In each of 1000 randomly sampled sets of columns for each subsequence length, we compute the frequency of the 20 most frequent words in natural and synthetic MSAs of the family considered, and evaluate the Pearson correlation between these top 20 frequencies in the MSA of focus and those in a reference MSA. We then average these Pearson correlation values over all sets of 1000 columns, yielding the r20 score.

To further inspect low-order statistics, in each MSA, we compute the one-body frequencies of occurrence of each amino acid at each site, the two-body frequencies of each pair of amino acids at each pair of sites, and the three-body frequencies associated to triplets. We denote them by fi(x), fij(x,y), fijk(x,y,z), where i, j, and k denote sites, while x, y, and z represent amino acids (see ‘Sampling sequences from Potts models’). We then estimate the second- and third-order connected correlations as:

(4) Cij(x,y)=fij(x,y)-fi(x)fj(y);
(5) Cijk(x,y,z)=fijk(x,y,z)fij(x,y)fk(z)fik(x,z)fj(y)fjk(y,z)fi(x)+2fi(x)fj(y)fk(z).

We also compute the ‘plug-in’ estimates of the Shannon entropy of each site Hi, and of the two- and three-body joint entropies Hij and Hijk, from the frequencies. They yield the plug-in estimates of the mutual information Iij between two columns, and of the co-information Iijk between three columns:

(6) Iij=Hi+Hj-Hij;
(7) Iijk=Hi+Hj+Hk-Hij-Hik-Hjk+Hijk.

Co-information is a measure of higher-order statistical dependencies (McGill, 1954; Timme et al., 2014; Quax et al., 2017; Rosas et al., 2019), which generalizes mutual information to triplets of random variables, vanishes for independent variables, and reflects the balance between redundancy and synergy in these triplets (Williams and Beer, 2010; Rosas et al., 2016). A systematic finite-size error occurs when estimating entropies using the plug-in estimate from frequencies measured in finite datasets (Bialek, 2012), and it affects entropy-derived quantities such as mutual information and co-information. Here, we do not attempt to correct it. Rather, we only make comparisons between MSAs of the same length and depth, which are affected by the same finite-size errors.

Characterizing the distribution of sequences in MSAs

Another way of studying the properties of generated MSAs is to analyze the distribution of their sequences in sequence space, and to compare it to that of natural sequences in the same family.

First, to assess whether generated sequences most resemble natural ones that are well represented in their family or, rather, rare ones, we consider for each synthetic sequence its closest natural sequence. We then count the number of neighbors of this natural sequence in the natural MSA, that is the number of natural sequences that have (normalized) Hamming distance below δ=0.2 with the sequence of interest. Note that the inverse of this number of neighbors gives the sequence weight wi introduced in Equation 8.

Second, to explore the distributions in sequence space of sequences within each MSA, and compare synthetic and natural MSAs, we associate to each sequence the concatenation of the one-hot encodings of each of its amino acids (Figliuzzi et al., 2018). We perform a PC analysis of the matrix corresponding to the natural MSA in this representation. We can then represent natural and synthetic sequences as points projected in the space defined by the first two PCs of the natural MSA.

Third, to analyze in more detail the apparent relatedness of generated sequences, and compare it to real phylogenetic relationships in natural sequences, we infer phylogenetic trees from each MSA using FastTree 2 (Price et al., 2010). To quantitatively compare the topologies of these trees, which do not have the same leaves, we analyze the eigenvalue spectrum of their MGL (Lewitus and Morlon, 2016). The MGL of a phylogenetic tree is defined as the difference between its degree matrix (a diagonal matrix whose ith diagonal entry is the sum of the branch lengths from node i to all other nodes in the tree) and the matrix of patristic distances (whose (i,j) th entry is the branch length between nodes i and j). Given the computational cost of running such an analysis on our deep MSAs, we use a bootstrap-aggregating strategy in the spirit of Colijn and Plazzotta, 2018. Namely, for each MSA we compute 200 different trees, each one inferred from a different sub-MSA of 500 sequences, itself randomly sampled from the whole MSA. Then, for each of these trees, we compute the eigenvalue spectrum of the MGL. Next, we merge all these spectra together to obtain a single eigenvalue spectral density. Note that this method has the advantage of not depending on the details of the topology of one large inferred tree, which are known to be sensitive to the choice of phylogeny reconstruction algorithm.

Datasets

To generate synthetic MSAs with MSA Transformer and bmDCA and compare them to their natural counterparts, we consider the deep Pfam ‘full’ alignments (Mistry et al., 2021) associated to 14 different protein domains (Appendix 1—table 5). Each MSA is a matrix M with L columns, representing the different amino-acid sites, and M rows. Each row i, denoted by x(i), represents one sequence of the alignment. We refer to L as the MSA length, and to M as its depth. For all our MSAs, M>36000. These alignments are the same as in Lupo et al., 2022, except that we removed PF13354 (Beta-lactamase2) from this set of deep MSAs because of its smaller depth. However, this family is included in our additional analyses (see Appendix 1—table 6).

Deep MSAs generally include some highly similar sequences due to phylogenetic relatedness. This can be characterized via the effective depth (Weigt et al., 2009)

(8) Meff(δ):=i=1Mwi,withwi:=|{i:dH(x(i),x(i))<δ}|1,

where dH(x,e) is the (normalized) Hamming distance between two sequences x and e, that is the fraction of sites where the amino acids differ, and we set δ=0.2. Note that the inverse of the sequence weight wi in Equation 8 is the number of neighbors in ‘Characterizing the distribution of sequences in MSAs’, and that Meff(0.2)/M can be as low as 0.06 for our natural MSAs.

All these families were previously shown to be well fitted by Potts models inferred by bmDCA (Figliuzzi et al., 2018), making our results on sequence generation by bmDCA readily comparable with previous results. Our domains’ short lengths are convenient because bmDCA is computationally demanding, and also in view of MSA Transformer’s large memory footprint, which is O(LM2)+O(L2). Furthermore, their large depth is crucial to our comparisons, as it allows Potts models to be accurately fitted (Figliuzzi et al., 2018).

We extended our study to small protein families by considering seven additional families, listed in Appendix 1—table 6, for which we also started from Pfam ‘full’ MSAs. These families comprise from nine to a few hundreds of sequences. We also considered two additional protein families, also listed in Appendix 1—table 6, for our comparison with published experimental datasets.

Appendix 1

Appendix 1—table 1
p values of the Kolmogorov–Smirnov test comparing the distributions of homology, coevolution, and structure-based scores across natural and synthetic multiple sequence alignments (MSAs).

For each score except the root-mean-squared deviation (RMSD), we test the null hypothesis that the scores of MSA-Transformer–generated sequences are greater or equal than those of Boltzmann machine DCA (bmDCA)-generated sequences, in the (stringent) sense that the cumulative distribution function of the former is always below that of the latter. Here, bmDCA1 stands for bmDCA with (λ,T)=(10-3,0.33) and bmDCA2 for bmDCA with (λ,T)=(10-2,1). For the RMSD, the null hypothesis is that the scores of MSA-Transformer–generated sequences are smaller or equal than those of bmDCA-generated sequences (recall that smaller RMSDs are better). In all cases, a p value close to one (resp. zero) means that the null hypothesis tested should be accepted (resp. rejected). Reported zero p values are too small to be properly assessed by the algorithm.

Pfam IDHMMER scoreStatistical energy scorepLDDT confidenceRMSD
MSA Tr. ≥bmDCA1MSA Tr. ≥bmDCA2MSA Tr. ≥bmDCA1MSA Tr. ≥bmDCA2MSA Tr. ≥bmDCA1MSA Tr. ≥bmDCA2MSA Tr. ≤bmDCA1MSA Tr. ≤bmDCA2
PF0000401.001.00.201.00.330.96
PF000050.931.001.06.5 · 10−191.00.780.96
PF000410.991.001.01.01.02.9 · 10−145.9 · 10 −3
PF000724.7 · 10−121.001.09.1 · 10−61.03.4 · 10−60.96
PF000769.5 · 10−1341.001.06.5 · 10−191.09.2 · 10−51.0
PF000960.041.001.00.010.920.922.4 · 10−5
PF001530.911.001.00.980.849.3 · 10−101.0
PF002715.9 · 10−301.001.00.331.00.020.88
PF0039701.001.01.5 · 10−31.04.8 · 10−100.38
PF0051201.001.04.3 · 10−31.00.960.67
PF005950.831.001.01.0 · 10−151.07.2 · 10−241.0
PF015350.981.001.01.01.00.784.1 · 10−8
PF0251801.001.01.01.01.01.0
PF076791.01.001.01.01.04.3 · 10−31.0
Appendix 1—table 2
Median homology, coevolution, and structure-based scores in natural and synthetic sequences.

We report the median values of each of the scores shown in Figure 1, as well as their standard deviations (between parentheses), for natural sequences (’Nat.’), for sequences generated by our method based on MSA Transformer, and for sequences generated by Boltzmann machine DCA (bmDCA) at low temperature, that is with (λ,T)=(10-3,0.33) (denoted by ‘bmDCA’). Scores are normalized as Figure 1, except that, for statistical energy, we subtract the median of natural scores instead of the mean for clarity (therefore, all natural MSAs have median 0 and standard deviation 1 for this score). For all scores, the best median among those of the two synthetic MSAs is shown in bold, and for the predicted local-distance difference test (pLDDT) score, it is shown in red if it is better than that the other synthetic MSA by a margin larger than the largest standard deviation. Recall that higher values are better for all scores, except root-mean-squared deviation (RMSD), for which the opposite holds.

HMMER scoreStatistical energy scorepLDDT confidence (%)RMSD (Å)
Pfam IDNat.MSA Tr.bmDCANat.MSA Tr.bmDCANat.MSA Tr.bmDCANat.MSA Tr.bmDCA
PF000040.5 (0.2)0.6 (0.2)0.8 (0.2)0 (1)0.8 (0.9)1.6 (0.1)85.4 (4.1)85.8 (4.5)81.7 (0.7)3.4 (0.8)2.8 (0.7)3.6 (0.5)
PF000050.7 (0.1)0.8 (0.1)0.8 (0.1)0 (1)1.8 (0.9)3.1 (0.2)83.0 (6.9)89.0 (4.2)91.6 (1.6)3.8 (1.2)2.8 (1.0)2.8 (0.8)
PF000410.6 (0.1)0.9 (0.2)0.5 (0.1)0 (1)1.5 (1.0)4.9 (0.5)90.0 (4.5)92.0 (3.2)79.2 (2.7)2.1 (2.1)2.9 (0.5)3.4 (2.2)
PF000720.7 (0.1)0.9 (0.1)0.8 (0.1)0 (1)2.1 (0.8)3.8 (0.3)94.5 (3.4)94.9 (1.9)94.1 (0.5)2.4 (0.3)2.3 (0.1)2.1 (0.1)
PF000760.6 (0.1)0.8 (0.2)0.8 (0.1)0 (1)1.5 (0.8)3.5 (0.2)82.2 (4.4)84.6 (4.8)87.6 (1.5)1.8 (0.5)1.4 (0.6)1.4 (0.1)
PF000960.8 (0.0)0.9 (0.0)0.9 (0.0)0 (1)2.2 (0.8)2.8 (0.3)93.0 (2.0)94.0 (0.8)93.7 (0.2)0.6 (0.1)0.4 (0.1)0.4 (0.0)
PF001530.5 (0.1)0.6 (0.1)0.5 (0.1)0 (1)0.6 (0.8)2.6 (0.3)65.0 (5.0)66.6 (6.2)64.9 (4.3)5.1 (1.8)4.4 (1.5)4.3 (1.1)
PF002710.5 (0.1)0.5 (0.1)0.5 (0.2)0 (1)1.0 (0.9)2.4 (0.3)78.4 (4.6)86.4 (5.4)83.8 (2.2)2.0 (0.8)2.3 (0.6)1.8 (0.1)
PF003970.7 (0.1)0.8 (0.1)0.8 (0.1)0 (1)0.5 (0.9)2.1 (0.2)88.1 (2.2)88.9 (2.4)88.2 (1.0)0.9 (0.3)0.9 (0.3)0.9 (0.1)
PF005120.5 (0.1)0.8 (0.2)0.7 (0.2)0 (1)1.5 (1.0)3.2 (0.3)91.0 (4.0)90.2 (4.0)89.5 (1.5)2.1 (0.6)2.2 (0.5)3.1 (0.2)
PF005950.7 (0.1)0.7 (0.1)0.7 (0.1)0 (1)0.5 (0.9)2.6 (0.3)93.4 (4.5)94.0 (1.8)95.1 (0.8)1.8 (0.4)1.7 (0.5)1.4 (0.2)
PF015350.5 (0.1)0.9 (0.2)0.6 (0.1)0 (1)2.3 (1.1)4.1 (0.2)82.4 (6.2)94.3 (5.5)77.9 (3.6)1.0 (1.1)0.4 (0.7)0.5 (0.4)
PF025180.6 (0.2)0.8 (0.2)0.7 (0.2)0 (1)1.9 (0.9)3.5 (0.2)88.0 (6.0)91.0 (6.3)73.6 (2.3)4.1 (0.9)3.9 (0.5)4.7 (1.1)
PF076790.5 (0.1)0.7 (0.2)0.4 (0.1)0 (1)1.7 (1.0)5.2 (0.6)93.5 (3.8)95.3 (2.9)89.8 (2.2)1.3 (1.0)1.2 (0.5)1.2 (0.2)
Appendix 1—table 3
Comparing different generation methods of MSA Transformer.

Various scores are shown for the natural MSA of protein family PF00153 and for synthetic MSAs generated in different ways from this family (each synthetic MSA comprises 10,000 sequences). For generation using MSA Transformer (see ‘Using MSA Transformer to generate sequences via an iterative masking procedure’), our standard iterative masking procedure is shown with its default greedy sampling (corresponding to T=0) and two higher temperatures. Variants of the procedure where only the first sequence is masked (‘Context’, either fixed or variable, both with greedy sampling) are also shown. We report the mean Hamming distance to the closest natural sequence, which is not itself in the case of natural sequences (‘Distance’) as well as the mean HMMER and statistical energy scores (‘-Energy’) described in ‘Scoring individual sequences’. Note that statistical energy scores are shifted by the mean value obtained for the natural MSA (which is −235.8). We also report the Pearson correlations between the two- and three-body statistics of the natural and the generated MSAs, denoted, respectively, by ρ[Cij] and ρ[Cijk] (for the natural MSA we report the Pearson correlation between two halves of this MSA), as illustrated in Figure 4—figure supplement 2.

ScoreNatural sequencesMSA Transformer
Iterative maskingContext (greedy)
GreedyT=0.5T=1.0FixedVariable
Distance0.1550.2710.3050.5140.2320.262
HMMER48.058.258.148.458.763.8
− Energy013.08.5−42.0−15.4−13.2
ρ[Cij]0.940.840.840.620.730.81
ρ[Cijk]0.890.800.760.410.660.77
Appendix 1—table 4
Impact of regularization strength and sampling temperature on sequence generation by Boltzmann machine DCA (bmDCA), for family PF00072.

We compare MSAs obtained using bmDCA with different regularization strengths λ (for inference) and sampling temperatures T (for generation) with the natural and the MSA-Transformer–generated MSAs. In each case, we report the average of the Hamming distances of each sequence to their closest natural neighbor, which is not itself in the case of natural sequences (‘Distance’), as well as the effective MSA depth, the scores defined in ‘Scoring individual sequences’, and the Pearson correlation coefficients of the two- and three-body connected correlations computed from natural and generated MSAs (ρ[Cij] and ρ[Cijk]). For MSA Transformer ('MSA Tr.'), bmDCA (0.01,1) and bmDCA (0.001,0.33), we also computed structural scores by feeding the entire synthetic MSA to Alphafold as context MSA (instead of using the natural MSA as context, see ‘Scoring individual sequences’). Structural scores are then very similar to those obtained using natural context.

TypeλTDistanceMeff(0.2)HMMER- Energyρ[Cij]ρ[Cijk]pLDDT (%)RMSD (Å)
Natural--0.19340,18090.300.990.8893.62.5
MSA Tr.--0.3489304119.159.10.730.5394.72.35
Same as above, with synthetic context:95.12.37
bmDCA0.0110.5573,06266.5−37.00.960.5884.32.58
Same as above, with synthetic context:83.92.70
bmDCA0.010.660.29418,911101.792.20.480.1194.22.61
bmDCA0.010.330.25112103.2118.30.420.0594.22.55
bmDCA0.00110.52573,06286.9−18.30.970.6389.72.44
bmDCA0.0010.660.29621,294103.989.30.480.1994.32.6
bmDCA0.0010.330.27414107.7109.60.40.1394.02.14
Same as above, with synthetic context:94.22.24
Appendix 1—table 5
Pfam families and natural MSAs used in our analysis.

L denotes the length of an MSA, M its depth, and Meff(0.2) its effective depth with distance threshold δ=0.2, see Equation 8. The reference experimental PDB structures used for our root-mean-squared deviation (RMSD) calculations, and their resolutions (‘Resol.’), are also reported.

Pfam IDFamily nameLMMeff(0.2)PDB IDResol.
PF00004AAA13239,27790494D812.40 Å
PF00005ABC_tran13768,89143,8811L7V3.20 Å
PF00041fn38542,72117,7823UP12.15 Å
PF00072Response_reg11273,06340,1803ILH2.59 Å
PF00076RRM_16951,96420,2733NNH2.75 Å
PF00096zf-C2H22338,99612,5814R2A1.59 Å
PF00153Mito_carr9493,77617,8591OCK2.20 Å
PF00271Helicase_C11166,80925,0173EX72.30 Å
PF00397WW3139,04533614REX1.60 Å
PF00512HisKA66154,99867,3033DGE2.80 Å
PF00595PDZ8271,30340531BE91.82 Å
PF01535PPR31109,06437,5144M572.86 Å
PF02518HATPase_c11180,71459,1893G7E2.20 Å
PF07679I-set9036,14114,6111FHG2.00 Å
Appendix 1—table 6
Other Pfam families and natural MSAs used in our analysis.

L denotes the length of an MSA and M its depth. The reference experimental PDB structures used for our root-mean-squared deviation (RMSD) calculations, and their resolutions, are also reported.

Pfam IDFamily nameLMPDB IDResol.
PF01356A_amylase_inhib68511OK00.93 Å
PF03440APT87146RO02.13 Å
PF04008Adenosine_kin1543421WVQ1.45 Å
PF06351Allene_ox_cyc1753782BRJ1.50 Å
PF06355Aegerolysin1313226MYI1.15 Å
PF16747Adhesin_E125316GUT1.63 Å
PF18648ADPRTs_Tse215595AKO2.40 Å
PF13354Beta-lactamase219846426QW81.10 Å
-Chorismate mutase Russ et al., 20209611301ECM2.20 Å

Data availability

Python code for generating sequences using the iterative masking procedure is archived at: https://doi.org/10.5281/zenodo.7684052. Raw data were collected from two public sources: (1) MSAs from the Pfam database (https://pfam.xfam.org/); (2) further MSAs from https://github.com/matteofigliuzzi/bmDCA (Barrat-Charlaix, 2017). We generated sequences with bmDCA using code publicly available at https://github.com/ranganathanlab/bmDCA (Figliuzzi and Barrat-Charlaix, 2020).

References

  1. Conference
    1. Bahdanau D
    2. Cho K
    3. Bengio Y
    (2015)
    Neural machine translation by jointly learning to align and translate
    International Conference on Learning Representations.
  2. Conference
    1. Bhattacharya N
    2. Thomas N
    3. Rao R
    4. Dauparas J
    5. Koo PK
    6. Baker D
    7. Song YS
    8. Ovchinnikov S
    (2022)
    Interpreting potts and transformer protein models through the lens of simplified attention
    Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing. pp. 34–45.
  3. Book
    1. Bialek W
    (2012)
    Biophysics: Searching for principles
    Princeton University Press.
  4. Conference
    1. Elnaggar A
    2. Heinzinger M
    3. Dallago C
    4. Rehawi G
    5. Wang Y
    6. Jones L
    7. Gibbs T
    8. Feher T
    9. Angerer C
    10. Steinegger M
    11. Bhowmik D
    12. Rost B
    (2021)
    ProtTrans: Towards cracking the language of life’s code through self-supervised deep learning and high performance computing
    IEEE Transactions on Pattern Analysis and Machine Intelligence.
  5. Conference
    1. Hawkins-Hooker A
    2. Jones DT
    3. Paige B
    (2021b)
    MSA-Conditioned generative protein language models for fitness landscape modelling and design
    In Machine Learning for Structural Biology Workshop NeurIPS.
  6. Conference
    1. Lapedes AS
    2. Giraud BG
    3. Liu L
    4. Stormo GD
    (1999)
    Correlated mutations in models of protein sequences: Phylogenetic and structural effects
    Statistics in Molecular Biology and Genetics - IMS Lecture Notes - Monograph Series.
  7. Conference
    1. Rao RM
    2. Liu J
    3. Verkuil R
    4. Meier J
    5. Canny J
    6. Abbeel P
    7. Sercu T
    8. Rives A
    (2021a)
    MSA Transformer
    Proceedings of the 38th International Conference on Machine Learning. pp. 18–24.
  8. Conference
    1. Rao R
    2. Meier J
    3. Sercu T
    4. Ovchinnikov S
    5. Rives A
    (2021b)
    Transformer protein language models are unsupervised structure learners
    In International Conference on Learning Representations.
  9. Conference
    1. Vaswani A
    2. Shazeer N
    3. Parmar N
    4. Uszkoreit J
    5. Jones L
    6. Gomez AN
    7. Kaiser Ł
    8. Polosukhin I
    (2017)
    Attention is all you need
    Advances in Neural Information Processing Systems. pp. 5998–6008.

Decision letter

  1. Lucy J Colwell
    Reviewing Editor; Cambridge University, United Kingdom
  2. Aleksandra M Walczak
    Senior Editor; CNRS, France

Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "Generative power of a protein language model trained on multiple sequence alignments" for consideration by eLife. Your article has been reviewed by 2 peer reviewers, and the evaluation has been overseen by Lucy Colwell as Reviewing Editor and José Faraldo-Gómez as Senior Editor. All reviewers have opted to remain anonymous.

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

(1) Please address the technical points raised by each referee below, paying particular attention to those around the generalizability of hyperparameter choices, the training of bmDCA, and the comparison between bmDCA and the MSA transformer and the sampling temperatures used, and the evaluation of sample diversity.

(2) Both reviewers note the lack of experimental validation. Adding some level of experimental validation of the proposed method would significantly improve the manuscript. If this is not possible, an alternative option might be to build a retrospective model evaluation using previously published protein design experimental datasets.

Reviewer #1 (Recommendations for the authors):

Sampling hyperparameters:

Figures S1/2 convincingly motivate the choice of parameters for one protein family, but it is unclear whether these would generalize well for other proteins. In particular, it is unclear whether the number of iterations should be the same across protein sizes.

bmDCA training:

In figure 3 middle left panel (comparing MSA covariance versus bmDCA-generated covariance) shows a fairly small regression line slope (0.67). This suggests too large a regularization on couplings and/or an insufficient number of iterations. Have you checked that the algorithm successfully converged, or was the maximum number of iterations reached? What about using a smaller regularization strength (two values were used in Russ et al. Science 2020, one small and one large)? I know that parameter tuning is a weakness of bmDCA, but the proposed method also has free parameters and it is not clear whether they are universal for all protein lengths. This could also explain the discrepancy in the topology of the sequence distribution shown in Figure 5 (weak couplings cannot create "ferromagnetic" multimodal distributions).

Evaluation of sequence quality:

I feel that the comparison of Figure 1 is a bit unfair. On the one hand, the proposed MSA-Transformer sampling method generates samples from the T=0 conditional distribution (according to step 3, the bottom of p9), whereas the bmDCA generates samples from the T=1 distribution. It is known that samples from the T=1 distribution have worse statistical energy scores than natural sequences due to the regularization of fields/couplings and that this must be adequately compensated by low-temperature sampling (Russ et al. Science 2020). The authors argue p13 that low-temperature sampling results in distortion of the first and second-order moments of the data, but I think this is not a practical problem for protein design (by definition, an optimal design protocol will discard all suboptimal amino acids for a given position and only retain the most favorable ones).

The comparison of AlphaFold confidence scores between MSAtransformer-generated and natural sequences is interesting but the results are not discussed at all in the main text. Is there a statistically significant difference between the distributions? Moreover, it is not shown here whether the predicted structures are similar to the native ones. Please provide a fold similarity metric such as backbone RMSD values.

Evaluation of sample diversity:

The authors convincingly show that sequences generated by MSAtransformer differ substantially from natural sequences. However, given that the MSA-Transformer samples are essentially obtained by zero-temperature dynamics, it is important to assess how diverse are the generated sequences. This is partially addressed in Figure S5, but consider using simpler metrics. Can the authors provide basic estimates of sample diversity? (e.g., effective number of sequences as a function of the number of samples).

Reviewer #2 (Recommendations for the authors):

I have a series of questions that if resolved could help, in my opinion, to create an improved version of this interesting study.

1. The manuscript describes how MSA Transformers could lead to metrics that are "even better" than the natural sequences. I feel that this statement is a bit misleading as the natural sequences are a baseline to compare. There are no better properties than those of the natural sequences because that is what we observe in nature. I would suggest removing those statements as they might produce confusion. On the other hand, if the authors would provide examples of sequences that are "better" in terms of function, stability, or any other quantifiable metric, then I would agree that these statements would make more sense.

2. The authors mention methods that can be trained in an unsupervised way to predict structure based on a single sequence. I wonder if this statement can be clarified, as it is obvious that those methods use several sequences as training. Are these methodologies based on physical principles instead of learning from sequences?

3. It is not clear to me what is the importance of the correlation between HMMER and Hamming distance, the authors should provide more intuition on why this is a relevant metric. Since those correlations are quite low for both models I am not sure if it contributes to the analysis in a meaningful way.

4. Are the contact maps created from the generated MSA of Transformer better than those of bmDCA? My understanding is that the value of p=0.1 optimizes contact map accuracy, but how are those compared against DCA maps?

5. Haldane et al. (https://doi.org/10.1016/j.bpj.2017.10.028) has proposed how pairwise statistics seem to be enough to recapitulate higher order mutational patterns in natural sequences. Could the authors comment on this, and mention if there is a substantial advantage in capturing 3-body statistics in the process of protein design or generative modeling?

6. It would be interesting to see how Alphafold would perform if some of these generated MSAs are used as input. I don't think it is needed to test it with all the families, but it would be an interesting experiment for a single family.

7. The families used in this study appear to have enough statistics to perform well, this is reasonable, however, for the sake of comparison, it would be interesting to see how the MSA Transformer would compare against bmDCA for a family with just a few sequence members. Are the trends the same? Or do we see a change in performance between the two generative methodologies?

https://doi.org/10.7554/eLife.79854.sa1

Author response

Essential revisions:

(1) Please address the technical points raised by each referee below, paying particular attention to those around the generalizability of hyperparameter choices, the training of bmDCA, and the comparison between bmDCA and the MSA transformer and the sampling temperatures used, and the evaluation of sample diversity.

We thank the editor for highlighting the technical points raised by the reviewers. We have treated all of these points thoroughly. Each of them is explained below in our point-by-point response to each reviewer. Here is a brief summary of how we addressed these points.

– Regarding the generalizability of hyperparameter choices, the values we chose yielded satisfactory convergence of MSA properties and preservation of contact maps for all families considered. This holds true in our revised version, where we considered 9 additional protein families (listed in Table 6), with various lengths and depths. In this context, we chose not to engage in family-specific parameter tuning. We now illustrate this robustness better, in two new figures (Figure 7—figure supplements 2 and 3).

– About the training of bmDCA, we did our best to reproduce the published bmDCA results, and to train bmDCA as in the original papers. In our response to reviewer 1 below, we show quantitative comparisons of the fitting performance of two-body correlations between our bmDCA results and published ones.

– About the comparison between bmDCA and MSA Transformer and the sampling temperatures used, we have now performed a comprehensive comparison of our

MSA-Transformer–generated data not only to bmDCA-generated data at sampling temperature T=1, but also at lower sampling temperatures and regularization strengths, following [Russ et al. 2020]. We also now provide an analysis of the statistical significance of the difference between the distributions of all scores in our various generated datasets in Table 1, using the Kolmogorov-Smirnov test, and in Table 2, focusing on the median and standard deviation of scores.

– Regarding the evaluation of sample diversity, we now provide a study of the effective MSA depth in Figure 1—figure supplement 1, and discuss this point thoroughly.

(2) Both reviewers note the lack of experimental validation. Adding some level of experimental validation of the proposed method would significantly improve the manuscript. If this is not possible, an alternative option might be to build a retrospective model evaluation using previously published protein design experimental datasets.

We thank the editor and the reviewers for these very relevant suggestions. While we were not in a position to conduct experiments, we performed a retrospective model evaluation using previously published protein design experimental datasets, and added a new section about this at the end of Results. There, we present a detailed comparison of our generated sequences to those experimentally validated in [Russ et al. 2020], and a calculation of deep mutational scanning scores for two other protein families. The corresponding data is shown in our new Figure 6 and Figure 6—figure supplements 1 and 2.

Reviewer #1 (Recommendations for the authors):

Sampling hyperparameters:

Figures S1/2 convincingly motivate the choice of parameters for one protein family, but it is unclear whether these would generalize well for other proteins. In particular, it is unclear whether the number of iterations should be the same across protein sizes.

We thank the reviewer for raising this important question. We agree that optimal hyperparameters could potentially depend on protein family characteristics. Throughout our work, we used the following hyperparameter values: masking probability p=0.1 and number of iterations I=200, because they worked well throughout. To illustrate this robustness better, we produced two new figures (Figure 7—figure supplement 2 and Figure 7—figure supplement 3) similar to Figure S1 (now Figure 7—figure supplement 1) for two other protein families, namely PF00096 and PF13354, which are respectively the shortest (L=23) and the longest (L=198) considered in our study. We also added a paragraph in Methods that discusses this point:

"The behaviors observed in Figure 7—figure supplement 1 for PF00153 are generic across the protein families we studied, as can be seen in Figure 7—figure supplement 2 and Figure 7—figure supplement 3, which show the same data as in Figure 7—figure supplement 1 for Pfam families PF00096 and PF13354 (which have different sequence lengths). This demonstrates that our sequence generation method is robust. In particular, as the parameters p = 0.1 and I = 200 yield satisfactory convergence of MSA properties and preservation of contact maps in all cases, we used these parameters throughout, without any family-specific fine-tuning."

bmDCA training:

In figure 3 middle left panel (comparing MSA covariance versus bmDCA-generated covariance) shows a fairly small regression line slope (0.67). This suggests too large a regularization on couplings and/or an insufficient number of iterations. Have you checked that the algorithm successfully converged, or was the maximum number of iterations reached? What about using a smaller regularization strength (two values were used in Russ et al. Science 2020, one small and one large)? I know that parameter tuning is a weakness of bmDCA, but the proposed method also has free parameters and it is not clear whether they are universal for all protein lengths. This could also explain the discrepancy in the topology of the sequence distribution shown in Figure 5 (weak couplings cannot create "ferromagnetic" multimodal distributions).

We did our best to reproduce the published bmDCA results, and to train bmDCA as in published papers. Our baseline fitting method employs the default parameters of [Figliuzzi et al. 2018] and uses the exact same convergence criteria, and our sampling method employs the equilibration time determined there. We chose to employ the default parameters to allow for direct comparison with the results of [Figliuzzi et al. 2018]. Below, we show quantitative comparisons of the fitting performance of two-body correlations between our bmDCA results and published ones.

More generally, we agree with the reviewer about the importance of regularization strength and temperature in bmDCA, as reported in [Russ et al. 2020]. Throughout our revised manuscript, we now present results with parameter values used in [Russ et al. 2020], in addition to our results with default parameters. We explain this important point in more detail in our response to the next reviewer question.

Details about bmDCA fitting quality:

We checked convergence and we double-checked that our results are consistent with these published results and those of [Trinquier et al. 2021]. In particular, the Pearson correlation values between the pairwise correlations in the natural and bmDCA-generated data are consistent with those reported in [Figliuzzi et al. 2018] and [Trinquier et al. 2021]:

Author response table 1
FamilyPearson (Figliuzzi/Trinquier)Pearson (ours)
PF000040.95/-0.98
PF000050.95/-0.96
PF000410.97/-0.96
PF000720.98/0.930.96
PF000760.98/0.970.95
PF000960.99/-0.93
PF001530.97/-0.92
PF00595-/0.970.93
PF015350.99/-0.98
PF025180.97/-0.97
PF076790.95/-0.96

As can be observed from this table, our results are similar to the published ones, and the range of differences between our results and published ones appears comparable to that between the results of two papers by the same group.

Regarding the slope of 0.67 mentioned by the reviewer for PF00153 (now shown in Figure 4—figure supplement 3), unfortunately, we cannot compare it directly to previous results, because the value of this slope was not reported in previous works. However, there are 3 other families for which this slope is explicitly reported in [Trinquier et al. 2021], and here is the comparison in this case:

Author response table 2
FamilySlope (Trinquier)Slope (ours)
PF000720.740.88
PF000760.920.82
PF005950.880.75

From these three examples, our slopes do not seem further from one than in [Trinquier et al. 2021].

Evaluation of sequence quality:

I feel that the comparison of Figure 1 is a bit unfair. On the one hand, the proposed MSA-Transformer sampling method generates samples from the T=0 conditional distribution (according to step 3, the bottom of p9), whereas the bmDCA generates samples from the T=1 distribution. It is known that samples from the T=1 distribution have worse statistical energy scores than natural sequences due to the regularization of fields/couplings and that this must be adequately compensated by low-temperature sampling (Russ et al. Science 2020). The authors argue p13 that low-temperature sampling results in distortion of the first and second-order moments of the data, but I think this is not a practical problem for protein design (by definition, an optimal design protocol will discard all suboptimal amino acids for a given position and only retain the most favorable ones).

We thank both reviewers for raising this very interesting point. We have now performed a comprehensive comparison of our MSA-Transformer–generated data not only to bmDCA-generated data at sampling temperature T=1 but also at lower sampling temperatures. We considered the two temperature values chosen in [Russ et al. 2020], namely T=0.33 and T=0.66. For completeness, we also considered the two values of regularization strength λ from [Russ et al. 2020] for these three temperatures, in the case of family PF00072, as reported in Table 4. Given the relatively small impact of λ observed there, we kept only one value of λ for each value of T in the rest of our manuscript – namely, λ=0.01 for T=1 to match the parameters in [Figliuzzi et al. 2018], and λ=0.001 for T=0.33 and T=0.66 as it gave slightly better scores in Table 4. Note that for our additional study of small protein families (see below), we employed λ=0.01 throughout because it is better suited to small families.

In particular, to make Figure 1 fairer, as per the reviewer's remark, we now include results obtained for bmDCA at λ=0.001 and T=0.33 in this figure. For completeness, we also include them in all other figures of the revised manuscript. Results for T=0.66 were quite similar to those obtained for T=0.33, and we also report them in Table 4 and Figure 5—figure supplements 1-2 for completeness.

Our general findings, which are discussed in the revised manuscript, are that decreasing T indeed improves the scores of bmDCA-generated sequences. However, the main improvement regards statistical energy (as expected from lowering T), while the improvements of other scores (HMMER score, and, more importantly, structural scores) are more modest. Even using T=0.33 for bmDCA, our MSA-Transformer–generated sequences have similar or better scores compared to bmDCA-generated sequences, apart from statistical energy (see Figure 1 and Table 1 and 1c). Moreover, we find that decreasing T with bmDCA substantially decreases MSA diversity, while MSA-Transformer–generated sequences do not suffer from such an issue (see Figure 1—figure supplement 1). In fact, at low T, bmDCA concentrates on local minima of the statistical energy landscape (see Figures 2, 5 and Figure 5—figure supplements 1-2), resulting in low diversity.

Overall, these new results confirm that our procedure for generating sequences using MSA Transformer is promising, featuring scores at least comparable with low-temperature bmDCA sequences, and high diversity.

The comparison of AlphaFold confidence scores between MSAtransformer-generated and natural sequences is interesting but the results are not discussed at all in the main text. Is there a statistically significant difference between the distributions? Moreover, it is not shown here whether the predicted structures are similar to the native ones. Please provide a fold similarity metric such as backbone RMSD values.

We thank the reviewer for these important remarks. We now present an analysis of the statistical significance of the difference between the distributions of all scores in the various generated datasets in Table 1, using the Kolmogorov-Smirnov test, and we also compare the median and standard deviation of all scores in the natural and generated datasets in Table 2. We revised the discussion in the main text accordingly.

In addition, we agree that checking whether the predicted structures are similar to the native ones is an important test that goes beyond the AlphaFold pLDDT. We therefore added an additional score throughout our manuscript, which is the RMSD between a reference experimental structure of the family (see Table 5) and the AlphaFold structure predicted for each sequence studied. The results from the RMSD analysis corroborate those obtained with pLDDT and show that predicted structures are indeed similar to the native ones. These results are now discussed in the main text. We believe that this point strengthens our conclusions, and we thank the reviewer for suggesting this analysis.

Evaluation of sample diversity:

The authors convincingly show that sequences generated by MSAtransformer differ substantially from natural sequences. However, given that the MSA-Transformer samples are essentially obtained by zero-temperature dynamics, it is important to assess how diverse are the generated sequences. This is partially addressed in Figure S5, but consider using simpler metrics. Can the authors provide basic estimates of sample diversity? (e.g., effective number of sequences as a function of the number of samples).

We thank the reviewer for asking this interesting question. We added a new supplementary figure, Figure 1—figure supplement 1, to address this point. In this figure, we show the effective number of sequences Meff for the MSAs we generated, versus the similarity threshold employed to define Meff. We find that the sequence diversity of MSA-Transformer–generated MSAs is slightly smaller than that of the natural MSAs, but remains of the same order of magnitude. Therefore, the method we propose to generate sequences using MSA Transformer preserves diversity to a large extent, despite using zero-temperature dynamics. This is probably enabled by the fact that we start from an MSA and not from a single sequence, and that MSA

Transformer uses the whole MSA as context.

Conversely, Figure 1—figure supplement 1 shows that low-temperature bmDCA sampling leads to substantially reduced Meff values, consistently with the idea that only the local minima of the energy landscape are then sampled. More precisely, we observe a regular increase of Meff with the similarity threshold for MSA-Transformer–generated data as well as for natural data, while the increase is more stepwise for low-temperature bmDCA.

Reviewer #2 (Recommendations for the authors):

I have a series of questions that if resolved could help, in my opinion, to create an improved version of this interesting study.

1. The manuscript describes how MSA Transformers could lead to metrics that are "even better" than the natural sequences. I feel that this statement is a bit misleading as the natural sequences are a baseline to compare. There are no better properties than those of the natural sequences because that is what we observe in nature. I would suggest removing those statements as they might produce confusion. On the other hand, if the authors would provide examples of sequences that are "better" in terms of function, stability, or any other quantifiable metric, then I would agree that these statements would make more sense.

We agree with the reviewer that natural sequences are the reference here. We changed the wording and we no longer claim that generated sequences have "better scores", in order to avoid any confusion about this.

2. The authors mention methods that can be trained in an unsupervised way to predict structure based on a single sequence. I wonder if this statement can be clarified, as it is obvious that those methods use several sequences as training. Are these methodologies based on physical principles instead of learning from sequences?

We agree that this statement was a bit misleading, and we thank the reviewer for pointing it out. We rephrased it to clarify that these models, which are trained on a large database of sequences, are then able to predict the structure taking as input either a single sequence (ESM1b and now ESM2) or an MSA (MSA Transformer). The corresponding sentence now reads:

“These pre-trained models are able to predict structure in an unsupervised way [21], either taking as input a single sequence [20] or an MSA [25], potentially by transferring knowledge from their large training set [26, 27].”

3. It is not clear to me what is the importance of the correlation between HMMER and Hamming distance, the authors should provide more intuition on why this is a relevant metric. Since those correlations are quite low for both models I am not sure if it contributes to the analysis in a meaningful way.

We thank the reviewer for raising this important point. One may have feared that generated sequences that are most similar to natural ones might always have higher scores, which would show that the model has difficulty to generalize and may be overfitting. The fact that HMMER scores tend to be positively correlated with Hamming distances for sequences generated using our method based on MSA Transformer is one of several pieces of evidence that this is not the case. However, we agree that our detailed discussion of the correlations between HMMER scores and Hamming distances gave too much emphasis to this point. Therefore, we have now strongly reduced this discussion.

4. Are the contact maps created from the generated MSA of Transformer better than those of bmDCA? My understanding is that the value of p=0.1 optimizes contact map accuracy, but how are those compared against DCA maps?

It was shown in the original MSA Transformer paper, [Rao et al. 2021], that MSA Transformer outperforms DCA at unsupervised contact prediction on natural data. In addition, in Figure 7—figure supplements 1 to 4, we show that for small values of the masking probability p (e.g. p=0.1), the contact maps inferred by MSA Transformer from our MSAs generated using our iterative masking procedure based on MSA Transformer reproduce quite well those of natural MSAs of the same protein family. The goal here was to check that MSA Transformer was remaining within the protein family of focus when using our generation method. We now clarified this point.

In addition, for the protein family PF00072, we now also used fully synthetic MSAs (either generated using our method based on MSA Transformer or generated by bmDCA) as input to AlphaFold (Table 4). We find that structural scores (both pLDDT and RMSD) are fully in line with those of natural data, both for our method based on MSA Transformer and for low-temperature bmDCA, while bmDCA at T=1 gives poorer results.

5. Haldane et al. (https://doi.org/10.1016/j.bpj.2017.10.028) has proposed how pairwise statistics seem to be enough to recapitulate higher order mutational patterns in natural sequences. Could the authors comment on this, and mention if there is a substantial advantage in capturing 3-body statistics in the process of protein design or generative modeling?

We thank the reviewer for pointing out this interesting study that we now cite. We took inspiration from it, and from the later study [McGee et al. 2021] by some of the same authors, and investigated higher-order statistics in natural and synthetic MSAs (beyond 2- and 3-body ones that were already studied in the first version of our manuscript). Our new Figure 4 shows the r20 score that quantifies the similarity of statistics at various orders between two datasets. It also presents as reference an assumption-free null model, namely the r20 score between two subsets of each natural MSA. As now mentioned in the main text,

"Figure 4 shows that bmDCA with default parameters is most often the best method at reproducing lower-order statistics, but that MSA Transformer is the best at reproducing higher-order statistics, in all families considered."

While pairwise models are extremely successful at modeling the statistics of protein MSAs and at making predictions e.g. about their structure, interactions, and mutational effects, this does not necessarily imply that higher-order statistics are always fully captured by these models. Besides, the training of these models is difficult and subject to finite size issues, as discussed e.g. in [McGee et al. 2021]. We note that in the context of neuroscience [Meshulam et al. 2021], a better agreement is obtained on 3-body correlations using pairwise maximum entropy models.

We now mention this point: "Thus, bmDCA is trained to reproduce these frequencies, and achieves these objectives quite well, although the comparison to the null model in Figure 4—figure supplement 2 and Figure 4—figure supplement 3 hints that further improvements remain possible, see [51]." Either way, here, we take these results as a pragmatic indication that lower-order statistics are better reproduced by bmDCA while higher-order statistics are better reproduced by our method based on MSA Transformer.

6. It would be interesting to see how Alphafold would perform if some of these generated MSAs are used as input. I don't think it is needed to test it with all the families, but it would be an interesting experiment for a single family.

We thank the reviewer for this interesting suggestion. As mentioned above, for the protein family PF00072, we now used fully synthetic MSAs (either generated using our method based on MSA Transformer or generated by bmDCA) as input to AlphaFold (Table 4). We find that structural scores (both pLDDT and RMSD) are then very similar to those obtained using natural context (which is our usual procedure). They are also fully in line with those of natural data for our method based on MSA Transformer, illustrating the robustness of the good structural scores of our synthetic sequences. We have added the sentence:

“In addition, for the protein family PF00072, we also used fully synthetic MSAs as input to AlphaFold. Structural scores are then very similar to those obtained using natural context (see Table 4)”.

7. The families used in this study appear to have enough statistics to perform well, this is reasonable, however, for the sake of comparison, it would be interesting to see how the MSA Transformer would compare against bmDCA for a family with just a few sequence members. Are the trends the same? Or do we see a change in performance between the two generative methodologies?

We thank the reviewer for asking this extremely interesting question. We now present new results for smaller protein families, listed in Table 6, whose shallow MSAs make it more challenging to accurately fit Potts models. Our results are shown in the new Figure 3, and discussed in the main text, in a new section titled "Sequence generation by the iterative masking procedure is successful for small protein families" at the end of Results. As mentioned there,

“We observe that MSA-Transformer–generated sequences have similar HMMER scores and structural scores to natural sequences. MSA-Transformer–generated sequences also generally have better HMMER scores and structural scores than those generated by bmDCA with default parameters. While low-temperature bmDCA yields better statistical energy scores (as expected), and also gives HMMER scores and structural scores comparable to natural sequences, it in fact generates sequences that are almost exact copies of natural ones. By contrast, MSA Transformer produces sequences that are quite different from natural ones, and have very good scores. Thus, our method based on MSA Transformer is particularly promising in the tricky case of small protein families.”

This analysis shows that our method not only performs as well as bmDCA for large families, but also has a broader scope, as it is less limited by MSA depth than bmDCA. We believe that this makes our manuscript stronger, and we thank the reviewer again for suggesting this very relevant additional study.

https://doi.org/10.7554/eLife.79854.sa2

Article and author information

Author details

  1. Damiano Sgarbossa

    1. Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
    2. SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
    Contribution
    Conceptualization, Software, Validation, Investigation, Visualization, Methodology, Writing – original draft
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0002-7878-6061
  2. Umberto Lupo

    1. Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
    2. SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
    Contribution
    Conceptualization, Resources, Supervision, Methodology, Writing – review and editing
    For correspondence
    umberto.lupo@epfl.ch
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0001-6767-493X
  3. Anne-Florence Bitbol

    1. Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
    2. SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
    Contribution
    Conceptualization, Supervision, Funding acquisition, Methodology, Writing – original draft, Writing – review and editing
    For correspondence
    anne-florence.bitbol@epfl.ch
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-1020-494X

Funding

European Research Council (851173)

  • Damiano Sgarbossa
  • Umberto Lupo
  • Anne-Florence Bitbol

The funders had no role in study design, data collection, and interpretation, or the decision to submit the work for publication.

Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement no. 851173, to A-FB).

Senior Editor

  1. Aleksandra M Walczak, CNRS, France

Reviewing Editor

  1. Lucy J Colwell, Cambridge University, United Kingdom

Version history

  1. Preprint posted: April 15, 2022 (view preprint)
  2. Received: April 28, 2022
  3. Accepted: February 2, 2023
  4. Accepted Manuscript published: February 3, 2023 (version 1)
  5. Version of Record published: March 24, 2023 (version 2)

Copyright

© 2023, Sgarbossa et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 3,111
    Page views
  • 440
    Downloads
  • 3
    Citations

Article citation count generated by polling the highest count across the following sources: PubMed Central, Crossref, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Damiano Sgarbossa
  2. Umberto Lupo
  3. Anne-Florence Bitbol
(2023)
Generative power of a protein language model trained on multiple sequence alignments
eLife 12:e79854.
https://doi.org/10.7554/eLife.79854

Further reading

    1. Computational and Systems Biology
    2. Neuroscience
    Huu Hoang, Shinichiro Tsutsumi ... Keisuke Toyama
    Research Article

    Cerebellar climbing fibers convey diverse signals, but how they are organized in the compartmental structure of the cerebellar cortex during learning remains largely unclear. We analyzed a large amount of coordinate-localized two-photon imaging data from cerebellar Crus II in mice undergoing 'Go/No-go' reinforcement learning. Tensor component analysis revealed that a majority of climbing fiber inputs to Purkinje cells were reduced to only four functional components, corresponding to accurate timing control of motor initiation related to a Go cue, cognitive error-based learning, reward processing, and inhibition of erroneous behaviors after a No-go cue. Changes in neural activities during learning of the first two components were correlated with corresponding changes in timing control and error learning across animals, indirectly suggesting causal relationships. Spatial distribution of these components coincided well with boundaries of Aldolase-C/zebrin II expression in Purkinje cells, whereas several components are mixed in single neurons. Synchronization within individual components was bidirectionally regulated according to specific task contexts and learning stages. These findings suggest that, in close collaborations with other brain regions including the inferior olive nucleus, the cerebellum, based on anatomical compartments, reduces dimensions of the learning space by dynamically organizing multiple functional components, a feature that may inspire new-generation AI designs.

    1. Computational and Systems Biology
    2. Genetics and Genomics
    John F Ouyang, Kunal Mishra ... Jacques Behmoaras
    Research Article

    Tissue fibrosis affects multiple organs and involves a master-regulatory role of macrophages which respond to an initial inflammatory insult common in all forms of fibrosis. The recently unravelled multi-organ heterogeneity of macrophages in healthy and fibrotic human disease suggests that macrophages expressing osteopontin (SPP1), associate with lung and liver fibrosis. However, the conservation of this SPP1+ macrophage population across different tissues, and its specificity to fibrotic diseases with different etiologies remain unclear. Integrating 15 single cell RNA-sequencing datasets to profile 235,930 tissue macrophages from healthy and fibrotic heart, lung, liver, kidney, skin and endometrium, we extended the association of SPP1+ macrophages with fibrosis to all these tissues. We also identified a subpopulation expressing matrisome-associated genes (e.g., matrix metalloproteinases and their tissue inhibitors), functionally enriched for ECM remodelling and cell metabolism, representative of a matrisome-associated macrophage (MAM) polarization state within SPP1+ macrophages. Importantly, the MAM polarization state follows a differentiation trajectory from SPP1+ macrophages and is associated with a core set of regulon activity. SPP1+ macrophages without the MAM polarization state (SPP1+MAM-) show a positive association with ageing lung in mice and humans. These results suggest an advanced and conserved polarization state of SPP1+ macrophages in fibrotic tissues resulting from prolonged inflammatory cues within each tissue microenvironment.