Generative power of a protein language model trained on multiple sequence alignments

  1. Damiano Sgarbossa
  2. Umberto Lupo (corresponding author)
  3. Anne-Florence Bitbol (corresponding author)
  1. Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
  2. SIB Swiss Institute of Bioinformatics, Switzerland
7 figures, 8 tables and 1 additional file

Figures

Figure 1 with 1 supplement
Comparison of homology, coevolution, and structure-based scores between natural sequences and sequences generated by MSA Transformer or Boltzmann machine DCA (bmDCA).

For each Pfam family in Appendix 1—table 5, we compare a natural MSA from Pfam and three synthetic MSAs of the same depth. The first synthetic MSA was obtained using MSA Transformer via our iterative masking procedure, and the second and third ones were generated by a Potts model inferred from the natural MSA using bmDCA with two different pairs (λ,T) of regularization strength λ and sampling temperature T. For each of the four scores described in ‘Scoring individual sequences’, we show the distributions of score values among sequences in each MSA as a violin plot. Higher score values are better for all scores except root-mean-squared deviation (RMSD) (bottom panel), where smaller values indicate a closer match to an experimental structure. Top panel: For each Pfam family, HMMER scores are divided by the highest score found in the natural MSA. Note that sequences below HMMER’s default homology detection score (E-value larger than 10), and whose HMMER score is thus 0, are not shown (the median over families of the fraction of such sequences is 2% for bmDCA (10⁻², 1.00)-generated MSAs, while there are no such sequences among the MSA-Transformer–generated ones). Second panel: Statistical energy scores are defined as minus the bmDCA statistical energies. To accommodate the highly family-dependent ranges of these scores, for each Pfam family we show their values after shifting by the mean score in the natural MSA, and normalizing by the standard deviation of natural MSA scores. Third panel: AlphaFold’s predicted local-distance difference test (pLDDT) confidence scores. Bottom panel: RMSD of predicted structures with respect to the experimental structures in Appendix 1—table 5. Structural scores (pLDDT and RMSD) were computed on 200 randomly chosen sequences from each MSA. All kernel-smoothed histograms are normalized such that all violins have the same maximal width. Outliers (less than 1% in all cases) were discarded for legibility.
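
The two normalizations described for the top two panels can be sketched as follows. This is only an illustration on made-up numbers: the function names are hypothetical, and using the population standard deviation (rather than the sample one) is an assumption, since the caption does not specify it.

```python
from statistics import mean, pstdev

def normalize_hmmer(scores, natural_scores):
    """Divide HMMER scores by the highest score found in the natural MSA."""
    top = max(natural_scores)
    return [s / top for s in scores]

def normalize_energy(scores, natural_scores):
    """Shift by the natural mean, then scale by the natural standard deviation."""
    mu = mean(natural_scores)
    sigma = pstdev(natural_scores)  # assumption: population standard deviation
    return [(s - mu) / sigma for s in scores]

# Toy values, not data from the paper.
natural_hmmer = [40.0, 50.0, 80.0]
generated_hmmer = [60.0, 88.0]
hmmer_norm = normalize_hmmer(generated_hmmer, natural_hmmer)

natural_energy = [-2.0, 0.0, 2.0]
energy_norm = normalize_energy(natural_energy, natural_energy)
print(hmmer_norm, energy_norm)
```

With this convention a generated sequence can exceed 1 on the HMMER axis, and natural statistical energy scores have mean 0 and unit spread by construction.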

Figure 1—figure supplement 1
Multiple sequence alignment (MSA) diversity for each protein family and each generation method.

We show the relative effective depth of each MSA (natural or generated), that is Meff/M, where Meff is effective MSA depth and M is actual MSA depth, versus the similarity threshold 1-δ used to define the effective depth (see Equation 8).
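
As a rough sketch of the quantity plotted here, the snippet below computes an effective depth by down-weighting each sequence by the number of sequences within normalized Hamming distance δ of it. The weight definition (inverse neighbor count, with the sequence itself included and a strict inequality) is an assumption about Equation 8, which is not reproduced on this page; the toy MSA and function names are illustrative.

```python
def hamming(a, b):
    """Normalized Hamming distance between two aligned sequences of equal length."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def effective_depth(msa, delta=0.2):
    """Sum of weights w_i = 1 / #{j : d(i, j) < delta}; the count includes i itself."""
    total = 0.0
    for s in msa:
        n_close = sum(hamming(s, t) < delta for t in msa)
        total += 1.0 / n_close
    return total

# Toy MSA: three near-identical sequences and one unrelated sequence.
toy_msa = ["ACDEFGHIKL", "ACDEFGHIKL", "ACDEFGHIKV", "WYWYWYWYWY"]
m_eff = effective_depth(toy_msa, delta=0.2)
relative_depth = m_eff / len(toy_msa)
print(m_eff, relative_depth)
```

The three similar sequences jointly count as one effective sequence, so the relative effective depth of this toy alignment is about one half.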

Figure 2
Homology and coevolution scores versus distance to the natural multiple sequence alignment (MSA), for protein families PF00072 and PF00153.

We show contour plots of the HMMER score and the statistical energy score (defined as minus the DCA statistical energy, shifted by its mean value in the natural MSA) versus the Hamming distance of each sequence to the closest natural sequence (which is not itself, in the case of natural sequences). Results are shown for natural sequences and for sequences generated using MSA Transformer and Boltzmann machine DCA (bmDCA) (the same two (λ,T) pairs as in Figure 1 are used for bmDCA). The lightest contours shown include 99% of the cumulative probability mass.
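
The horizontal axis of these plots can be computed along the following lines. This is a minimal sketch with hypothetical names and toy sequences; it assumes the normalized Hamming distance and shows how a natural sequence is prevented from matching itself.

```python
def hamming(a, b):
    """Normalized Hamming distance between two aligned sequences of equal length."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def distance_to_closest_natural(seq, natural_msa, exclude_index=None):
    """Smallest distance from seq to the natural MSA.

    exclude_index skips one natural sequence, which is needed when seq is
    itself a member of the natural MSA.
    """
    return min(
        hamming(seq, nat)
        for k, nat in enumerate(natural_msa)
        if k != exclude_index
    )

natural = ["ACDE", "ACDF", "WYDE"]
d_generated = distance_to_closest_natural("ACDE", natural)                 # identical to natural[0]
d_natural = distance_to_closest_natural(natural[0], natural, exclude_index=0)
print(d_generated, d_natural)
```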

Figure 3
Application of our sequence generation method based on MSA Transformer to small protein families.

We consider seven small protein families, with natural MSAs that comprise from nine to a few hundred sequences (see Appendix 1—table 6). As in Figure 1, for each family, we compare the natural MSA and three synthetic MSAs of the same depth. In all cases, we show violin plots of the same four scores as for large families in Figure 1, as well as of the Hamming distance to the closest natural sequence, which is not itself in the case of natural sequences (‘Distance’). For the three smallest families (left panel; fewer than 40 sequences), we also show the score of each individual sequence as a swarm plot. Note that while we employ the same sampling temperatures T as in Figure 1 for Boltzmann machine DCA (bmDCA), here we use regularization strength λ=10⁻² throughout, due to MSA shallowness (see ‘Sampling sequences from Potts models’).

Figure 4 with 3 supplements
Similarity of statistics between synthetic and natural multiple sequence alignments (MSAs).

To compare the statistics of synthetic and natural MSAs at various orders, we compute r20 scores (Haldane et al., 2018; McGee et al., 2021), and plot them versus the number of different MSA columns that are considered (see ‘Analyzing the statistics of MSAs’ for details). All families in Appendix 1—table 5 are considered. For each of them, the reference MSA comprises either half of the natural MSA (with sequences selected uniformly at random), or 30,000 sequences from it if the natural MSA depth is larger than 60,000. The null model compares the other half of the natural MSA to this reference MSA. It yields an estimate of the expected r20 scores due only to finite-size effects in a model-free, purely data-driven way.
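
For readers unfamiliar with r20 scores, here is a hedged sketch of one common way to compute them: for many random subsets of columns, take the 20 most frequent subsequences in the reference MSA and correlate their counts with those in the MSA being tested, then average. The number of subsets, the handling of degenerate cases, and all names are assumptions; the exact procedure used for this figure is the one described in ‘Analyzing the statistics of MSAs’.

```python
import random
from collections import Counter

def pearson(xs, ys):
    """Pearson correlation, or None when one of the inputs has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None
    return cov / (vx * vy) ** 0.5

def r20(reference, target, n_columns, n_subsets=50, top=20, seed=0):
    """Average correlation of top-word counts over random column subsets."""
    rng = random.Random(seed)
    length = len(reference[0])
    correlations = []
    for _ in range(n_subsets):
        cols = sorted(rng.sample(range(length), n_columns))
        ref = Counter(tuple(s[c] for c in cols) for s in reference)
        tgt = Counter(tuple(s[c] for c in cols) for s in target)
        words = [w for w, _ in ref.most_common(top)]
        # Counts are used instead of frequencies: Pearson correlation is
        # unchanged by rescaling either variable.
        rho = pearson([ref[w] for w in words], [tgt[w] for w in words])
        if rho is not None:  # assumption: skip subsets with undefined correlation
            correlations.append(rho)
    return sum(correlations) / len(correlations)

# Toy MSAs, not data from the paper.
reference = ["AAAA"] * 6 + ["CCCC"] * 3 + ["DDDD"] * 1
target = ["AAAA"] * 5 + ["CCCC"] * 3 + ["DDDD"] * 2
score_self = r20(reference, reference, n_columns=2)
score_other = r20(reference, target, n_columns=2)
print(score_self, score_other)
```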

Figure 4—figure supplement 1
Ability of generated sequences to reproduce one-, two-, and three-body statistics.

In all panels, each marker represents a family in Appendix 1—table 5. For each statistical or information measure (rows), different scores (columns) comparing generated and natural sequences are the coordinates of these points. All scores are such that being close to 0 is better, and their value for multiple sequence alignments (MSAs) generated by Boltzmann machine DCA (bmDCA) with default parameters is shown versus that for MSAs generated by MSA Transformer. The statistical or information measures considered in each row are defined in ‘Analyzing the statistics of MSAs’ – from the top: one-body frequency, two- and three-body connected correlations, entropy, mutual information, and co-information. For each of them, we consider its values over all MSA columns (or pairs or triplets of columns), and all amino acids if appropriate, for both natural and synthetic MSAs. To obtain the vertical and horizontal coordinates (respectively) of the markers in each panel, we compare these values for each natural MSA with the values from the corresponding synthetic MSAs generated by bmDCA with default parameters or by our method based on MSA Transformer (respectively). We use four different scores for this comparison, and devote each column of the figure to one of these scores – from the left: |ρ−1| where ρ denotes the Pearson correlation; |Slope−1| where ‘Slope’ means the slope of best linear fit (see Figure 4—figure supplement 2 and Figure 4—figure supplement 3 for illustrations of these first two quantities in the case of two- and three-body connected correlations for families PF00072 and PF00153); the Jensen–Shannon divergence between the distributions of values; the Wasserstein distance between these distributions. For each statistical or information measure (row) and each score (column), and for each family in Appendix 1—table 5, we have one value of the score comparing the natural and bmDCA-generated MSAs and another one comparing the natural and MSA-Transformer–generated MSAs. We plot the former value versus the latter, yielding one marker per protein family in each plot. Thus, each plot compares the ability of bmDCA and MSA Transformer to reproduce the statistics of the natural data. Blue markers (above the diagonal) mean that the scores for MSA-Transformer–generated MSAs are better, while green markers (below the diagonal) mean the opposite.
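
The first two scores, |ρ−1| and |Slope−1|, can be illustrated as below. The sketch assumes that the best-fit line regresses the generated values on the natural ones (the caption does not state the direction), and the numbers are invented to show that a perfect correlation does not imply a slope of 1.

```python
def pearson_and_slope(natural_values, generated_values):
    """Pearson correlation and least-squares slope of generated vs natural values."""
    n = len(natural_values)
    mx = sum(natural_values) / n
    my = sum(generated_values) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(natural_values, generated_values))
    sxx = sum((x - mx) ** 2 for x in natural_values)
    syy = sum((y - my) ** 2 for y in generated_values)
    return sxy / (sxx * syy) ** 0.5, sxy / sxx

# Toy statistics: the generated values are exactly half the natural ones.
natural = [0.0, 0.1, 0.2, 0.3]
generated = [0.0, 0.05, 0.1, 0.15]
rho, slope = pearson_and_slope(natural, generated)
rho_score = abs(rho - 1)      # close to 0: correlations are perfectly aligned
slope_score = abs(slope - 1)  # close to 0.5: magnitudes are underestimated
print(rho_score, slope_score)
```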

Figure 4—figure supplement 2
Two- and three-body connected correlations estimated from generated multiple sequence alignments (MSAs) versus the natural one, for family PF00072.

Relationships between connected correlations estimated from the MSA generated by MSA Transformer or Boltzmann machine DCA (bmDCA), and those estimated from the natural MSA, are shown as binned scatter plots both for two-body (top row) and three-body (bottom row) statistics. We include a null model (third column) obtained by splitting the natural MSA in half and comparing the statistics of one half with those of the other. It yields an estimate of the expected dispersion in these plots due only to finite-size effects in a model-free, purely data-driven way. Pearson correlation coefficients ρ, and slopes of lines of best fit, are reported in each case. For the comparisons involving MSA Transformer, we used Equations 4 and 5 to estimate correlations. On the other hand, since bmDCA is trained to reproduce frequencies rescaled with phylogenetic weights (wi in Equation 8), for the comparisons involving bmDCA we rescaled the natural frequencies before using Equations 4 and 5 to estimate correlations.
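
As an illustration of the two-body statistics compared in these scatter plots, the sketch below uses the standard definition of a connected correlation, C_ij(a,b) = f_ij(a,b) − f_i(a) f_j(b), computed from unweighted frequencies. That this matches Equation 4 is an assumption, since the equations are not reproduced on this page; phylogenetic reweighting and the three-body case are not shown.

```python
from collections import Counter

def connected_correlations(msa, i, j):
    """Two-body connected correlations between columns i and j of an MSA.

    Returns a dict mapping each residue pair (a, b), with a observed in
    column i and b observed in column j, to f_ij(a, b) - f_i(a) * f_j(b).
    """
    m = len(msa)
    f_i = Counter(s[i] for s in msa)
    f_j = Counter(s[j] for s in msa)
    f_ij = Counter((s[i], s[j]) for s in msa)
    return {
        (a, b): f_ij[(a, b)] / m - (f_i[a] / m) * (f_j[b] / m)
        for a in f_i
        for b in f_j
    }

# Toy MSA with two perfectly coupled columns.
toy_msa = ["AC", "AC", "DE", "DE"]
c01 = connected_correlations(toy_msa, 0, 1)
print(c01)
```

Pairs that co-occur more often than expected from the single-column frequencies get positive values, and pairs that never co-occur get negative ones.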

Figure 4—figure supplement 3
Two- and three-body connected correlations estimated from generated multiple sequence alignments (MSAs) versus the natural one, for family PF00153.

Same as Figure 4—figure supplement 2, but for family PF00153.

Figure 5 with 4 supplements
Distribution of sequences in sequence space, for families PF00072 and PF00153.

We show the distribution of one-hot encoded natural and synthetic sequences projected in the subspace of the first two principal components of the natural multiple sequence alignment (MSA). The same axis limits are used within one family, except for Boltzmann machine DCA (bmDCA) (10⁻³, 0.33) in the case of PF00072. Note that the fraction of the total variance explained by the first two principal components of each MSA is less than 4% for all families and all generation methods.

Figure 5—figure supplement 1
Distribution of sequences in sequence space for all large protein families in our dataset (part 1).

We show the distribution of one-hot encoded natural and synthetic sequences projected in the subspace of the first two principal components of the natural multiple sequence alignment (MSA), as in Figure 5 for families PF00072 and PF00153, but also including Boltzmann machine DCA (bmDCA) at (λ,T)=(10⁻³, 0.66).

Figure 5—figure supplement 2
Distribution of sequences in sequence space for all large protein families in our dataset (part 2).

We show the distribution of one-hot encoded natural and synthetic sequences projected in the subspace of the first two principal components of the natural multiple sequence alignment (MSA), as in Figure 5 for families PF00072 and PF00153, but also including Boltzmann machine DCA (bmDCA) at (λ,T)=(10⁻³, 0.66).

Figure 5—figure supplement 3
Neighbors of natural and synthetic sequences, for families PF00072 and PF00153.

We show the distribution of the number of neighbors of sequences in the natural multiple sequence alignment (MSA), and the distribution of the number of neighbors of the closest natural sequence to each of our generated sequences. Given a sequence in a natural MSA, its number of neighbors is the number of natural sequences that are within a (normalized) Hamming distance δ=0.2 from it. The moving average of the results is shown, using a window representing 5% of the total number of points.

Figure 5—figure supplement 4
Comparing phylogenies inferred from natural and generated multiple sequence alignments (MSAs) for families PF00072 and PF00153.

We show the averaged spectra of modified graph Laplacian (MGL) matrices computed, for each tree inferred from an MSA, by using the leaves of multiple sub-trees made of 500 randomly sampled sequences from the MSA of interest. Specifically, in each case, we perform an average over 200 different sub-trees of the histograms of counts of the eigenvalues (see ‘Characterizing the distribution of sequences in MSAs’). We compare MSAs generated using either MSA Transformer (left) or Boltzmann machine DCA (bmDCA) (center and right) to the natural MSA.

Figure 6 with 2 supplements
Comparison of our generated sequences to those experimentally tested in Russ et al., 2020, for the chorismate mutase family.

Left: The estimated relative enrichment (r.e.) scores of the Boltzmann machine DCA (bmDCA)-generated sequences that are in the top 33% in terms of predicted local-distance difference test (pLDDT) scores are plotted versus their experimentally measured counterparts from Russ et al., 2020. We estimate the expected r.e. of these generated sequences as the r.e. of the closest natural sequence measured in Russ et al., 2020. We observe that high estimated r.e. is associated with high measured r.e., as 71% of sequences with estimated r.e. > 0.4 (green) also have measured r.e. > 0.4. Note that in the top marginals (showing the measured r.e. for bmDCA-generated sequences), the green and yellow histograms are stacked on top of each other. Thus, the stacked histogram shows the distribution of all measured r.e. values for bmDCA-generated sequences that are in the top 33% in terms of pLDDT scores. Top right: Overlaid histograms of estimated r.e. are shown for our MSA-Transformer–generated sequences and for the bmDCA-generated ones from Russ et al., 2020, restricting in both cases to the sequences with top 33% pLDDT scores. Bottom right: Same as top right, but considering all generated sequences.
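
The nearest-neighbor estimate described above, and the kind of fraction quoted for the 0.4 threshold, can be sketched as follows. All sequences and r.e. values here are invented, the function names are hypothetical, and ties between equally close natural sequences are broken by taking the first one, which is an assumption.

```python
def hamming(a, b):
    """Normalized Hamming distance between two aligned sequences of equal length."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def estimate_re(seq, natural_msa, natural_re):
    """Estimated r.e. of seq: the measured r.e. of its closest natural sequence."""
    closest = min(range(len(natural_msa)), key=lambda k: hamming(seq, natural_msa[k]))
    return natural_re[closest]

def fraction_confirmed(estimated, measured, threshold=0.4):
    """Among sequences with estimated r.e. above threshold, fraction also measured above it."""
    high = [m for e, m in zip(estimated, measured) if e > threshold]
    return sum(m > threshold for m in high) / len(high)

# Toy data, not values from Russ et al., 2020.
natural_msa = ["ACDE", "WYDE"]
natural_re = [0.9, 0.1]
generated = ["ACDF", "WYDF", "ACEE"]
measured = [0.8, 0.0, 0.2]

estimated = [estimate_re(s, natural_msa, natural_re) for s in generated]
confirmed = fraction_confirmed(estimated, measured)
print(estimated, confirmed)
```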

Figure 6—figure supplement 1
Homology, coevolution, and structural scores versus distance to the natural multiple sequence alignment (MSA), for the chorismate mutase family.

We assess the performance of our generative method in the case of the chorismate mutase family, by comparing our generated sequences (‘MSA Tr.’) to natural ones, and to the sequences generated in Russ et al., 2020 using Boltzmann machine DCA (bmDCA) at various values of the sampling temperature and of the regularization strength. Our synthetic sequences were generated using the iterative masking procedure based on MSA Transformer, starting from the natural alignment used in Russ et al., 2020. Our sequences score similarly to natural sequences and to the bmDCA-generated sequences from Russ et al., 2020, which were experimentally validated there.

Figure 6—figure supplement 2
Deep mutational scanning (DMS) scores for families PF00595 and PF13354.

In DMS experiments, the fitness effects of all possible one-point mutations from a reference natural sequence are measured. To assign a DMS score to natural and generated sequences, we align each of them to the reference sequence, and sum the experimentally measured fitness effects of the relevant amino acids at each position. Higher values of the DMS score are better, as they mean higher fitness. We show normalized histograms of the DMS scores for sequences in the natural and generated multiple sequence alignments (MSAs) of protein families PF00595 and PF13354, based on the DMS experiments in McLaughlin et al., 2012 and Stiffler et al., 2015, respectively. For generated sequences, we restrict to those whose Hamming distance to their closest natural neighbor is larger than δ=0.2.
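
A minimal sketch of this additive DMS score is given below. It assumes the sequence is already aligned to the reference, that positions matching the reference contribute zero, and that gaps are skipped; these conventions, the data layout, and the toy values are illustrative assumptions rather than details taken from the paper.

```python
def dms_score(aligned_seq, reference, fitness_effects):
    """Sum single-mutation fitness effects over positions that differ from the reference.

    fitness_effects[pos][aa] is the measured effect of mutating the reference
    residue at position pos to amino acid aa.
    """
    total = 0.0
    for pos, (aa, ref_aa) in enumerate(zip(aligned_seq, reference)):
        if aa == ref_aa or aa == "-":
            continue  # assumption: reference residues and gaps contribute 0
        total += fitness_effects[pos][aa]
    return total

# Toy reference and effects, not experimental data.
reference = "ACD"
fitness_effects = [
    {"C": -1.0, "D": 0.5},  # mutations away from A at position 0
    {"A": -2.0},            # mutations away from C at position 1
    {"E": 0.25},            # mutations away from D at position 2
]
score = dms_score("CCE", reference, fitness_effects)
print(score)
```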

Figure 7 with 4 supplements
Iterative masking procedure to generate sequences using MSA Transformer.

Here, the red hashtag (#) stands for a masked amino acid, while blue uppercase letters stand for predicted amino acids at the masked positions.
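
The loop depicted in this figure can be sketched as follows. The predictor below is a deliberately simple stand-in (it fills each masked position with the most common unmasked residue in its column) and is not MSA Transformer; the masking probability, number of iterations, and all function names are illustrative assumptions.

```python
import random
from collections import Counter

MASK = "#"

def mask_msa(msa, p_mask, rng):
    """Replace each residue by the mask token independently with probability p_mask."""
    return ["".join(MASK if rng.random() < p_mask else aa for aa in seq) for seq in msa]

def toy_predictor(masked_msa):
    """Hypothetical stand-in for the model: fill masks with the column's most common residue."""
    length = len(masked_msa[0])
    fill = []
    for c in range(length):
        counts = Counter(seq[c] for seq in masked_msa if seq[c] != MASK)
        fill.append(counts.most_common(1)[0][0] if counts else "-")
    return [
        "".join(fill[c] if aa == MASK else aa for c, aa in enumerate(seq))
        for seq in masked_msa
    ]

def iterative_masking(msa, predict, p_mask=0.1, n_iterations=10, seed=0):
    """Repeatedly mask the current MSA and replace masked entries by predictions."""
    rng = random.Random(seed)
    current = list(msa)
    for _ in range(n_iterations):
        masked = mask_msa(current, p_mask, rng)
        current = predict(masked)
    return current

toy_msa = ["ACDEF", "ACDEF", "ACDQF", "GCDEF"]
generated = iterative_masking(toy_msa, toy_predictor, p_mask=0.3, n_iterations=5, seed=1)
print(generated)
```

In practice `predict` would wrap a call to the trained model; keeping it as a plain function argument makes the masking loop independent of how predictions are obtained.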

Figure 7—figure supplement 1
Evolution of mean scores during the iterative masking procedure, for family PF00153.

Average scores of the generated sequences are reported for different iteration numbers and masking probabilities. The scores employed are: (A) Hamming distances to the closest natural sequence, (B) Mutual information between synthetic and natural columns of the MSAs, (C) HMMER scores – see ‘Scoring individual sequences’, (D) Statistical energy scores (negative DCA statistical energies, shifted by their mean value for natural sequences – see ‘Scoring individual sequences’), (E) Pairwise Hamming distances between the generated sequences, and (F) Mean square deviations (MSD) of the predicted contact maps at each iteration from the initial one (zero iterations). The synthetic MSAs, comprising 5000 sequences, were generated with MSA Transformer starting from natural sequences of family PF00153.

Figure 7—figure supplement 2
Evolution of mean scores during the iterative masking procedure, for family PF00096.

Same as in Figure 7—figure supplement 1, but for PF00096. This family has the shortest length among those considered here, namely L=23 (see Appendix 1—table 5).

Figure 7—figure supplement 3
Evolution of mean scores during the iterative masking procedure, for family PF13354.

Same as in Figure 7—figure supplement 1, but for PF13354. This family has the largest length among those considered here, namely L=198 (see Appendix 1—table 6).

Figure 7—figure supplement 4
Evolution of inferred contact maps during the iterative masking procedure, for family PF00153.

Contact maps obtained from MSA Transformer for different iteration numbers and masking probabilities are reported. Probabilities of contacts are computed using the logistic regression on the output attention matrices from Rao et al., 2021a. The input of the model consists of 100 different sequences chosen uniformly at random from the synthetic MSA generated at each iteration of our iterative masking procedure. These contact maps were employed to compute the mean square deviations shown in Figure 7—figure supplement 1F.

Tables

Appendix 1—table 1
p values of the Kolmogorov–Smirnov test comparing the distributions of homology, coevolution, and structure-based scores across natural and synthetic multiple sequence alignments (MSAs).

For each score except the root-mean-squared deviation (RMSD), we test the null hypothesis that the scores of MSA-Transformer–generated sequences are greater than or equal to those of Boltzmann machine DCA (bmDCA)-generated sequences, in the (stringent) sense that the cumulative distribution function of the former is always below that of the latter. Here, bmDCA1 stands for bmDCA with (λ,T)=(10⁻³, 0.33) and bmDCA2 for bmDCA with (λ,T)=(10⁻², 1). For the RMSD, the null hypothesis is that the scores of MSA-Transformer–generated sequences are smaller than or equal to those of bmDCA-generated sequences (recall that smaller RMSDs are better). In all cases, a p value close to one (resp. zero) means that the null hypothesis tested should be accepted (resp. rejected). Reported zero p values are too small to be properly assessed by the algorithm.

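| Pfam ID | HMMER score (MSA Tr. ≥ bmDCA1) | HMMER score (MSA Tr. ≥ bmDCA2) | Statistical energy score (MSA Tr. ≥ bmDCA1) | Statistical energy score (MSA Tr. ≥ bmDCA2) | pLDDT confidence (MSA Tr. ≥ bmDCA1) | pLDDT confidence (MSA Tr. ≥ bmDCA2) | RMSD (MSA Tr. ≤ bmDCA1) | RMSD (MSA Tr. ≤ bmDCA2) |
|---|---|---|---|---|---|---|---|---|
| PF00004 | 0 | 1.0 | 0 | 1.0 | 0.20 | 1.0 | 0.33 | 0.96 |
| PF00005 | 0.93 | 1.0 | 0 | 1.0 | 6.5 · 10⁻¹⁹ | 1.0 | 0.78 | 0.96 |
| PF00041 | 0.99 | 1.0 | 0 | 1.0 | 1.0 | 1.0 | 2.9 · 10⁻¹⁴ | 5.9 · 10⁻³ |
| PF00072 | 4.7 · 10⁻¹² | 1.0 | 0 | 1.0 | 9.1 · 10⁻⁶ | 1.0 | 3.4 · 10⁻⁶ | 0.96 |
| PF00076 | 9.5 · 10⁻¹³⁴ | 1.0 | 0 | 1.0 | 6.5 · 10⁻¹⁹ | 1.0 | 9.2 · 10⁻⁵ | 1.0 |
| PF00096 | 0.04 | 1.0 | 0 | 1.0 | 0.01 | 0.92 | 0.92 | 2.4 · 10⁻⁵ |
| PF00153 | 0.91 | 1.0 | 0 | 1.0 | 0.98 | 0.84 | 9.3 · 10⁻¹⁰ | 1.0 |
| PF00271 | 5.9 · 10⁻³⁰ | 1.0 | 0 | 1.0 | 0.33 | 1.0 | 0.02 | 0.88 |
| PF00397 | 0 | 1.0 | 0 | 1.0 | 1.5 · 10⁻³ | 1.0 | 4.8 · 10⁻¹⁰ | 0.38 |
| PF00512 | 0 | 1.0 | 0 | 1.0 | 4.3 · 10⁻³ | 1.0 | 0.96 | 0.67 |
| PF00595 | 0.83 | 1.0 | 0 | 1.0 | 1.0 · 10⁻¹⁵ | 1.0 | 7.2 · 10⁻²⁴ | 1.0 |
| PF01535 | 0.98 | 1.0 | 0 | 1.0 | 1.0 | 1.0 | 0.78 | 4.1 · 10⁻⁸ |
| PF02518 | 0 | 1.0 | 0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| PF07679 | 1.0 | 1.0 | 0 | 1.0 | 1.0 | 1.0 | 4.3 · 10⁻³ | 1.0 |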
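
The one-sided test described in the caption of Appendix 1—table 1 can be illustrated with the sketch below. It computes the statistic D⁺ = max over x of F_first(x) − F_second(x) between empirical cumulative distribution functions and converts it to a p value with the classical large-sample approximation exp(−2 D⁺² nm/(n+m)). The use of this asymptotic formula is an assumption (the page does not state which implementation was used), and the samples are toy values.

```python
import math

def ecdf(sample, x):
    """Empirical cumulative distribution function of sample evaluated at x."""
    return sum(v <= x for v in sample) / len(sample)

def one_sided_ks(first, second):
    """Test the null hypothesis that the CDF of `first` is never above that of `second`.

    Returns (D_plus, p_value), using the asymptotic one-sided approximation.
    """
    points = sorted(set(first) | set(second))
    d_plus = max(ecdf(first, x) - ecdf(second, x) for x in points)
    d_plus = max(d_plus, 0.0)
    n, m = len(first), len(second)
    p_value = math.exp(-2.0 * d_plus ** 2 * n * m / (n + m))
    return d_plus, p_value

# Toy scores: the first sample is uniformly higher than the second.
high_scores = [5.0, 6.0, 7.0, 8.0]
low_scores = [1.0, 2.0, 3.0, 4.0]
d_accept, p_accept = one_sided_ks(high_scores, low_scores)  # null holds
d_reject, p_reject = one_sided_ks(low_scores, high_scores)  # null violated
print(p_accept, p_reject)
```

For the RMSD columns, where smaller is better, the two samples would simply be passed in the opposite order.
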
Appendix 1—table 2
Median homology, coevolution, and structure-based scores in natural and synthetic sequences.

We report the median values of each of the scores shown in Figure 1, as well as their standard deviations (between parentheses), for natural sequences (‘Nat.’), for sequences generated by our method based on MSA Transformer, and for sequences generated by Boltzmann machine DCA (bmDCA) at low temperature, that is with (λ,T)=(10⁻³, 0.33) (denoted by ‘bmDCA’). Scores are normalized as in Figure 1, except that, for statistical energy, we subtract the median of natural scores instead of the mean for clarity (therefore, all natural MSAs have median 0 and standard deviation 1 for this score). For all scores, the best median among those of the two synthetic MSAs is shown in bold, and for the predicted local-distance difference test (pLDDT) score, it is shown in red if it is better than that of the other synthetic MSA by a margin larger than the largest standard deviation. Recall that higher values are better for all scores, except root-mean-squared deviation (RMSD), for which the opposite holds.

| Pfam ID | HMMER score: Nat. | HMMER score: MSA Tr. | HMMER score: bmDCA | Statistical energy score: Nat. | Statistical energy score: MSA Tr. | Statistical energy score: bmDCA | pLDDT confidence (%): Nat. | pLDDT confidence (%): MSA Tr. | pLDDT confidence (%): bmDCA | RMSD (Å): Nat. | RMSD (Å): MSA Tr. | RMSD (Å): bmDCA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PF00004 | 0.5 (0.2) | 0.6 (0.2) | 0.8 (0.2) | 0 (1) | 0.8 (0.9) | 1.6 (0.1) | 85.4 (4.1) | 85.8 (4.5) | 81.7 (0.7) | 3.4 (0.8) | 2.8 (0.7) | 3.6 (0.5) |
| PF00005 | 0.7 (0.1) | 0.8 (0.1) | 0.8 (0.1) | 0 (1) | 1.8 (0.9) | 3.1 (0.2) | 83.0 (6.9) | 89.0 (4.2) | 91.6 (1.6) | 3.8 (1.2) | 2.8 (1.0) | 2.8 (0.8) |
| PF00041 | 0.6 (0.1) | 0.9 (0.2) | 0.5 (0.1) | 0 (1) | 1.5 (1.0) | 4.9 (0.5) | 90.0 (4.5) | 92.0 (3.2) | 79.2 (2.7) | 2.1 (2.1) | 2.9 (0.5) | 3.4 (2.2) |
| PF00072 | 0.7 (0.1) | 0.9 (0.1) | 0.8 (0.1) | 0 (1) | 2.1 (0.8) | 3.8 (0.3) | 94.5 (3.4) | 94.9 (1.9) | 94.1 (0.5) | 2.4 (0.3) | 2.3 (0.1) | 2.1 (0.1) |
| PF00076 | 0.6 (0.1) | 0.8 (0.2) | 0.8 (0.1) | 0 (1) | 1.5 (0.8) | 3.5 (0.2) | 82.2 (4.4) | 84.6 (4.8) | 87.6 (1.5) | 1.8 (0.5) | 1.4 (0.6) | 1.4 (0.1) |
| PF00096 | 0.8 (0.0) | 0.9 (0.0) | 0.9 (0.0) | 0 (1) | 2.2 (0.8) | 2.8 (0.3) | 93.0 (2.0) | 94.0 (0.8) | 93.7 (0.2) | 0.6 (0.1) | 0.4 (0.1) | 0.4 (0.0) |
| PF00153 | 0.5 (0.1) | 0.6 (0.1) | 0.5 (0.1) | 0 (1) | 0.6 (0.8) | 2.6 (0.3) | 65.0 (5.0) | 66.6 (6.2) | 64.9 (4.3) | 5.1 (1.8) | 4.4 (1.5) | 4.3 (1.1) |
| PF00271 | 0.5 (0.1) | 0.5 (0.1) | 0.5 (0.2) | 0 (1) | 1.0 (0.9) | 2.4 (0.3) | 78.4 (4.6) | 86.4 (5.4) | 83.8 (2.2) | 2.0 (0.8) | 2.3 (0.6) | 1.8 (0.1) |
| PF00397 | 0.7 (0.1) | 0.8 (0.1) | 0.8 (0.1) | 0 (1) | 0.5 (0.9) | 2.1 (0.2) | 88.1 (2.2) | 88.9 (2.4) | 88.2 (1.0) | 0.9 (0.3) | 0.9 (0.3) | 0.9 (0.1) |
| PF00512 | 0.5 (0.1) | 0.8 (0.2) | 0.7 (0.2) | 0 (1) | 1.5 (1.0) | 3.2 (0.3) | 91.0 (4.0) | 90.2 (4.0) | 89.5 (1.5) | 2.1 (0.6) | 2.2 (0.5) | 3.1 (0.2) |
| PF00595 | 0.7 (0.1) | 0.7 (0.1) | 0.7 (0.1) | 0 (1) | 0.5 (0.9) | 2.6 (0.3) | 93.4 (4.5) | 94.0 (1.8) | 95.1 (0.8) | 1.8 (0.4) | 1.7 (0.5) | 1.4 (0.2) |
| PF01535 | 0.5 (0.1) | 0.9 (0.2) | 0.6 (0.1) | 0 (1) | 2.3 (1.1) | 4.1 (0.2) | 82.4 (6.2) | 94.3 (5.5) | 77.9 (3.6) | 1.0 (1.1) | 0.4 (0.7) | 0.5 (0.4) |
| PF02518 | 0.6 (0.2) | 0.8 (0.2) | 0.7 (0.2) | 0 (1) | 1.9 (0.9) | 3.5 (0.2) | 88.0 (6.0) | 91.0 (6.3) | 73.6 (2.3) | 4.1 (0.9) | 3.9 (0.5) | 4.7 (1.1) |
| PF07679 | 0.5 (0.1) | 0.7 (0.2) | 0.4 (0.1) | 0 (1) | 1.7 (1.0) | 5.2 (0.6) | 93.5 (3.8) | 95.3 (2.9) | 89.8 (2.2) | 1.3 (1.0) | 1.2 (0.5) | 1.2 (0.2) |
Appendix 1—table 3
Comparing different generation methods of MSA Transformer.

Various scores are shown for the natural MSA of protein family PF00153 and for synthetic MSAs generated in different ways from this family (each synthetic MSA comprises 10,000 sequences). For generation using MSA Transformer (see ‘Using MSA Transformer to generate sequences via an iterative masking procedure’), our standard iterative masking procedure is shown with its default greedy sampling (corresponding to T=0) and two higher temperatures. Variants of the procedure where only the first sequence is masked (‘Context’, either fixed or variable, both with greedy sampling) are also shown. We report the mean Hamming distance to the closest natural sequence, which is not itself in the case of natural sequences (‘Distance’) as well as the mean HMMER and statistical energy scores (‘-Energy’) described in ‘Scoring individual sequences’. Note that statistical energy scores are shifted by the mean value obtained for the natural MSA (which is −235.8). We also report the Pearson correlations between the two- and three-body statistics of the natural and the generated MSAs, denoted, respectively, by ρ[Cij] and ρ[Cijk] (for the natural MSA we report the Pearson correlation between two halves of this MSA), as illustrated in Figure 4—figure supplement 2.

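| Score | Natural sequences | MSA Transformer, iterative masking: greedy | MSA Transformer, iterative masking: T=0.5 | MSA Transformer, iterative masking: T=1.0 | MSA Transformer, context (greedy): fixed | MSA Transformer, context (greedy): variable |
|---|---|---|---|---|---|---|
| Distance | 0.155 | 0.271 | 0.305 | 0.514 | 0.232 | 0.262 |
| HMMER | 48.0 | 58.2 | 58.1 | 48.4 | 58.7 | 63.8 |
| −Energy | 0 | 13.0 | 8.5 | −42.0 | −15.4 | −13.2 |
| ρ[Cij] | 0.94 | 0.84 | 0.84 | 0.62 | 0.73 | 0.81 |
| ρ[Cijk] | 0.89 | 0.80 | 0.76 | 0.41 | 0.66 | 0.77 |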
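
The difference between greedy sampling (T=0) and the higher-temperature variants compared in Appendix 1—table 3 can be sketched as follows. The snippet assumes the usual convention of dividing the model's output logits by T before applying a softmax; the logits and function name are illustrative, not taken from the model.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Return an index: the argmax at T=0, otherwise a draw from softmax(logits / T)."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda k: logits[k])
    scaled = [value / temperature for value in logits]
    top = max(scaled)
    # Subtracting the maximum keeps the exponentials numerically stable.
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(0)
toy_logits = [0.1, 2.0, -1.0]  # made-up scores for three candidate amino acids
greedy_choice = sample_token(toy_logits, 0, rng)
sampled_choice = sample_token(toy_logits, 1.0, rng)
print(greedy_choice, sampled_choice)
```

Higher temperatures flatten the distribution and therefore increase diversity, which is consistent with the larger distances and lower scores reported at T=1.0 in the table.
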
Appendix 1—table 4
Impact of regularization strength and sampling temperature on sequence generation by Boltzmann machine DCA (bmDCA), for family PF00072.

We compare MSAs obtained using bmDCA with different regularization strengths λ (for inference) and sampling temperatures T (for generation) with the natural and the MSA-Transformer–generated MSAs. In each case, we report the average of the Hamming distances of each sequence to its closest natural neighbor, which is not itself in the case of natural sequences (‘Distance’), as well as the effective MSA depth, the scores defined in ‘Scoring individual sequences’, and the Pearson correlation coefficients of the two- and three-body connected correlations computed from natural and generated MSAs (ρ[Cij] and ρ[Cijk]). For MSA Transformer (‘MSA Tr.’), bmDCA (0.01,1) and bmDCA (0.001,0.33), we also computed structural scores by feeding the entire synthetic MSA to AlphaFold as context MSA (instead of using the natural MSA as context, see ‘Scoring individual sequences’). Structural scores are then very similar to those obtained using natural context.

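| Type | λ | T | Distance | Meff(0.2) | HMMER | −Energy | ρ[Cij] | ρ[Cijk] | pLDDT (%) | RMSD (Å) |
|---|---|---|---|---|---|---|---|---|---|---|
| Natural | - | - | 0.193 | 40,180 | 90.3 | 0 | 0.99 | 0.88 | 93.6 | 2.5 |
| MSA Tr. | - | - | 0.348 | 9304 | 119.1 | 59.1 | 0.73 | 0.53 | 94.7 | 2.35 |
| Same as above, with synthetic context | | | | | | | | | 95.1 | 2.37 |
| bmDCA | 0.01 | 1 | 0.55 | 73,062 | 66.5 | −37.0 | 0.96 | 0.58 | 84.3 | 2.58 |
| Same as above, with synthetic context | | | | | | | | | 83.9 | 2.70 |
| bmDCA | 0.01 | 0.66 | 0.294 | 18,911 | 101.7 | 92.2 | 0.48 | 0.11 | 94.2 | 2.61 |
| bmDCA | 0.01 | 0.33 | 0.251 | 12 | 103.2 | 118.3 | 0.42 | 0.05 | 94.2 | 2.55 |
| bmDCA | 0.001 | 1 | 0.525 | 73,062 | 86.9 | −18.3 | 0.97 | 0.63 | 89.7 | 2.44 |
| bmDCA | 0.001 | 0.66 | 0.296 | 21,294 | 103.9 | 89.3 | 0.48 | 0.19 | 94.3 | 2.6 |
| bmDCA | 0.001 | 0.33 | 0.274 | 14 | 107.7 | 109.6 | 0.4 | 0.13 | 94.0 | 2.14 |
| Same as above, with synthetic context | | | | | | | | | 94.2 | 2.24 |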
Appendix 1—table 5
Pfam families and natural MSAs used in our analysis.

L denotes the length of an MSA, M its depth, and Meff(0.2) its effective depth with distance threshold δ=0.2, see Equation 8. The reference experimental PDB structures used for our root-mean-squared deviation (RMSD) calculations, and their resolutions (‘Resol.’), are also reported.

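| Pfam ID | Family name | L | M | Meff(0.2) | PDB ID | Resol. |
|---|---|---|---|---|---|---|
| PF00004 | AAA | 132 | 39,277 | 9049 | 4D81 | 2.40 Å |
| PF00005 | ABC_tran | 137 | 68,891 | 43,881 | 1L7V | 3.20 Å |
| PF00041 | fn3 | 85 | 42,721 | 17,782 | 3UP1 | 2.15 Å |
| PF00072 | Response_reg | 112 | 73,063 | 40,180 | 3ILH | 2.59 Å |
| PF00076 | RRM_1 | 69 | 51,964 | 20,273 | 3NNH | 2.75 Å |
| PF00096 | zf-C2H2 | 23 | 38,996 | 12,581 | 4R2A | 1.59 Å |
| PF00153 | Mito_carr | 94 | 93,776 | 17,859 | 1OCK | 2.20 Å |
| PF00271 | Helicase_C | 111 | 66,809 | 25,017 | 3EX7 | 2.30 Å |
| PF00397 | WW | 31 | 39,045 | 3361 | 4REX | 1.60 Å |
| PF00512 | HisKA | 66 | 154,998 | 67,303 | 3DGE | 2.80 Å |
| PF00595 | PDZ | 82 | 71,303 | 4053 | 1BE9 | 1.82 Å |
| PF01535 | PPR | 31 | 109,064 | 37,514 | 4M57 | 2.86 Å |
| PF02518 | HATPase_c | 111 | 80,714 | 59,189 | 3G7E | 2.20 Å |
| PF07679 | I-set | 90 | 36,141 | 14,611 | 1FHG | 2.00 Å |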
Appendix 1—table 6
Other Pfam families and natural MSAs used in our analysis.

L denotes the length of an MSA and M its depth. The reference experimental PDB structures used for our root-mean-squared deviation (RMSD) calculations, and their resolutions, are also reported.

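| Pfam ID | Family name | L | M | PDB ID | Resol. |
|---|---|---|---|---|---|
| PF01356 | A_amylase_inhib | 68 | 51 | 1OK0 | 0.93 Å |
| PF03440 | APT | 87 | 14 | 6RO0 | 2.13 Å |
| PF04008 | Adenosine_kin | 154 | 342 | 1WVQ | 1.45 Å |
| PF06351 | Allene_ox_cyc | 175 | 378 | 2BRJ | 1.50 Å |
| PF06355 | Aegerolysin | 131 | 322 | 6MYI | 1.15 Å |
| PF16747 | Adhesin_E | 125 | 31 | 6GUT | 1.63 Å |
| PF18648 | ADPRTs_Tse2 | 155 | 9 | 5AKO | 2.40 Å |
| PF13354 | Beta-lactamase2 | 198 | 4642 | 6QW8 | 1.10 Å |
| - | Chorismate mutase (Russ et al., 2020) | 96 | 1130 | 1ECM | 2.20 Å |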
Author response table 1
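| Family | Pearson (Figliuzzi/Trinquier) | Pearson (ours) |
|---|---|---|
| PF00004 | 0.95/- | 0.98 |
| PF00005 | 0.95/- | 0.96 |
| PF00041 | 0.97/- | 0.96 |
| PF00072 | 0.98/0.93 | 0.96 |
| PF00076 | 0.98/0.97 | 0.95 |
| PF00096 | 0.99/- | 0.93 |
| PF00153 | 0.97/- | 0.92 |
| PF00595 | -/0.97 | 0.93 |
| PF01535 | 0.99/- | 0.98 |
| PF02518 | 0.97/- | 0.97 |
| PF07679 | 0.95/- | 0.96 |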
Author response table 2
| Family | Slope (Trinquier) | Slope (ours) |
|---|---|---|
| PF00072 | 0.74 | 0.88 |
| PF00076 | 0.92 | 0.82 |
| PF00595 | 0.88 | 0.75 |


Damiano Sgarbossa, Umberto Lupo, Anne-Florence Bitbol (2023). Generative power of a protein language model trained on multiple sequence alignments. eLife 12:e79854. https://doi.org/10.7554/eLife.79854