Semantical and geometrical protein encoding toward enhanced bioactivity and thermostability

  1. Yang Tan
  2. Bingxin Zhou (corresponding author)
  3. Lirong Zheng
  4. Guisheng Fan
  5. Liang Hong (corresponding author)
  1. Shanghai-Chongqing Institute of Artificial Intelligence, Shanghai Jiao Tong University, China
  2. School of Information Science and Engineering, East China University of Science and Technology, China
  3. Zhangjiang Institute for Advanced Study, Shanghai Jiao Tong University, China
  4. Shanghai Artificial Intelligence Laboratory, China
  5. Shanghai Jiao Tong University, Institute of Natural Sciences, China
  6. Shanghai National Center for Applied Mathematics (SJTU Center), Shanghai Jiao Tong University, China
5 figures, 7 tables and 2 additional files

Figures

An illustration of ProtSSN, which extracts the semantical and geometrical characteristics of a protein from its sequentially ordered global construction and spatially gathered local contacts, using protein language models and equivariant graph neural networks.

The encoded hidden representation can be used for downstream tasks such as variant effect prediction, which assesses the impact of mutating a few sites of a protein on its functionality.
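As a minimal sketch of this encoder pairing (a hypothetical implementation, not the released ProtSSN code), per-residue language-model embeddings can serve as node features of a residue graph processed by an E(n)-equivariant layer; the coordinate-update branch of a full EGNN is omitted for brevity:

```python
import torch
import torch.nn as nn

class EGNNLayer(nn.Module):
    """Minimal E(n)-equivariant message-passing layer (Satorras et al. style).

    h: per-residue features, here imagined as PLM (e.g. ESM2) embeddings.
    x: C-alpha coordinates. Only the invariant feature update is shown."""

    def __init__(self, dim: int):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * dim + 1, dim), nn.SiLU())
        self.node_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.SiLU())

    def forward(self, h, x, edge_index):
        src, dst = edge_index                                # residue index pairs
        d2 = ((x[src] - x[dst]) ** 2).sum(-1, keepdim=True)  # invariant: squared distance
        msg = self.edge_mlp(torch.cat([h[src], h[dst], d2], dim=-1))
        agg = torch.zeros_like(h).index_add_(0, dst, msg)    # sum incoming messages
        return self.node_mlp(torch.cat([h, agg], dim=-1))

def knn_edges(x: torch.Tensor, k: int = 20) -> torch.Tensor:
    """Connect each residue to its k nearest spatial neighbours."""
    dist = torch.cdist(x, x)
    dist.fill_diagonal_(float("inf"))                        # exclude self-loops
    src = dist.topk(k, largest=False).indices.reshape(-1)
    dst = torch.arange(x.size(0)).repeat_interleave(k)
    return torch.stack([src, dst])

# Toy usage: 120 residues, 512-d embeddings, arbitrary coordinates.
h = torch.randn(120, 512)       # stand-in for per-residue ESM2 embeddings
x = torch.randn(120, 3) * 10.0
h = EGNNLayer(512)(h, x, knn_edges(x, k=20))
```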

Number of trainable parameters and Spearman’s ρ correlation on DTm, DDG, and ProteinGym v1, with the median value indicated by the dashed lines.

Dot, cross, and diamond markers represent sequence-based, structure-based, and sequence-structure models, respectively.
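The ρ reported here is the standard rank correlation between predicted and experimentally measured variant effects; with SciPy it can be computed per assay as follows (toy numbers, not data from the figure):

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical example: rank-correlate predicted mutation scores against
# measured fitness for one assay.
predicted = np.array([0.8, -1.2, 0.1, 2.3, -0.5])
measured  = np.array([1.1, -0.7, 0.3, 1.9, -0.2])
rho, pval = spearmanr(predicted, measured)
print(f"Spearman's rho = {rho:.3f}")  # rank correlation in [-1, 1]
```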

Ablation study on ProteinGym v0 with different modular settings of ProtSSN.

Each record represents the average Spearman’s correlation of all assays. (a) Performance using different structure encoders: Equivariant Graph Neural Networks (EGNN) (orange) versus GCN/GAT (purple). (b) Performance using different node attributes: ESM2-embedded hidden representation (orange) versus one-hot encoding (purple). (c) Performance with varying numbers of EGNN layers. (d) Performance with different versions of ESM2 for sequence encoding. (e) Performance using different amino acid perturbation strategies during pre-training.
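Panel (e) concerns perturbing amino acids during denoising-style pre-training. A generic sketch of one such strategy (uniform random substitution; the paper's exact scheme may differ) is:

```python
import random

AAS = "ACDEFGHIKLMNPQRSTVWY"

def perturb(seq: str, rate: float = 0.15, seed: int = 0):
    """Substitute a random fraction of residues; the denoising objective is to
    recover the original amino acid at each perturbed site. One generic
    strategy (uniform substitution); ProtSSN's exact recipe may differ."""
    rng = random.Random(seed)
    noisy, targets = list(seq), {}
    for i, aa in enumerate(seq):
        if rng.random() < rate:
            noisy[i] = rng.choice(AAS)   # other schemes could bias by, e.g., BLOSUM
            targets[i] = aa
    return "".join(noisy), targets

noisy, targets = perturb("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
print(noisy)
print(targets)                           # positions to reconstruct during training
```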

An example source record of the mutation assay.

The record is derived from DDG for the A chain of protein 1A7V, measured at pH 6.5 and 25°C.
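For illustration, such a record can be modeled as a small data structure. Field names below are hypothetical, and only the values shown in the figure (1A7V, chain A, pH 6.5, 25°C) come from the source; the mutation and score are invented placeholders:

```python
from dataclasses import dataclass

@dataclass
class AssayRecord:
    """One row of a mutation assay, mirroring the Figure 5 example.
    Field names are illustrative; the released datasets may use
    different headers."""
    pdb_id: str
    chain: str
    mutation: str   # e.g. "A123G": wild type, position, substitution (placeholder)
    ph: float
    temperature_c: float
    score: float    # measured ddG (or dTm) for this variant (placeholder value)

rec = AssayRecord("1A7V", "A", "A123G", 6.5, 25.0, -1.2)
wt, pos, mut = rec.mutation[0], int(rec.mutation[1:-1]), rec.mutation[-1]
print(f"{wt}{pos}{mut} on {rec.pdb_id}_{rec.chain} at pH {rec.ph}")
```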

Author response image 1
Scatter plots of the number of MSA sequences and Spearman’s correlation.

Tables

Table 1
Spearman’s ρ correlation of variant effect prediction with zero-shot methods on DTm, DDG, and ProteinGym v1.
The last six columns report ProteinGym v1 Spearman’s ρ by category.

| Model | Version | # Params (million) | DTm | DDG | Activity | Binding | Expression | Organismal | Stability | Overall fitness |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Sequence encoder |  |  |  |  |  |  |  |  |  |  |
| RITA | Small | 30 | 0.122 | 0.143 | 0.294 | 0.275 | 0.337 | 0.327 | 0.289 | 0.304 |
| RITA | Medium | 300 | 0.131 | 0.188 | 0.352 | 0.274 | 0.406 | 0.371 | 0.348 | 0.350 |
| RITA | Large | 680 | 0.213 | 0.236 | 0.359 | 0.291 | 0.422 | 0.374 | 0.383 | 0.366 |
| RITA | xlarge | 1,200 | 0.221 | 0.264 | 0.402 | 0.302 | 0.423 | 0.387 | 0.445 | 0.373 |
| ProGen2 | Small | 151 | 0.135 | 0.194 | 0.333 | 0.275 | 0.384 | 0.337 | 0.349 | 0.336 |
| ProGen2 | Medium | 764 | 0.226 | 0.214 | 0.393 | 0.296 | 0.436 | 0.381 | 0.396 | 0.380 |
| ProGen2 | Base | 764 | 0.197 | 0.253 | 0.396 | 0.294 | 0.444 | 0.379 | 0.383 | 0.379 |
| ProGen2 | Large | 2,700 | 0.181 | 0.226 | 0.406 | 0.294 | 0.429 | 0.379 | 0.396 | 0.381 |
| ProGen2 | xlarge | 6,400 | 0.232 | 0.270 | 0.402 | 0.302 | 0.423 | 0.387 | 0.445 | 0.392 |
| ProtTrans | bert | 420 | 0.268 | 0.313 | – | – | – | – | – | – |
| ProtTrans | bert_bfd | 420 | 0.217 | 0.293 | – | – | – | – | – | – |
| ProtTrans | t5_xl_uniref50 | 3,000 | 0.310 | 0.365 | – | – | – | – | – | – |
| ProtTrans | t5_xl_bfd | 3,000 | 0.239 | 0.334 | – | – | – | – | – | – |
| Tranception | Small | 85 | 0.119 | 0.169 | 0.287 | 0.349 | 0.319 | 0.270 | 0.258 | 0.288 |
| Tranception | Medium | 300 | 0.189 | 0.256 | 0.349 | 0.285 | 0.409 | 0.362 | 0.342 | 0.349 |
| Tranception | Large | 700 | 0.197 | 0.284 | 0.401 | 0.289 | 0.415 | 0.389 | 0.381 | 0.375 |
| ESM-1v | – | 650 | 0.279 | 0.266 | 0.390 | 0.268 | 0.431 | 0.362 | 0.476 | 0.385 |
| ESM-1b | – | 650 | 0.271 | 0.343 | 0.428 | 0.289 | 0.427 | 0.351 | 0.500 | 0.399 |
| ESM2 | t12 | 35 | 0.214 | 0.216 | 0.314 | 0.292 | 0.364 | 0.218 | 0.439 | 0.325 |
| ESM2 | t30 | 150 | 0.288 | 0.317 | 0.391 | 0.328 | 0.425 | 0.305 | 0.510 | 0.392 |
| ESM2 | t33 | 650 | 0.330 | 0.392 | 0.425 | 0.339 | 0.415 | 0.338 | 0.523 | 0.419 |
| ESM2 | t36 | 3,000 | 0.327 | 0.351 | 0.417 | 0.322 | 0.425 | 0.379 | 0.509 | 0.410 |
| ESM2 | t48 | 15,000 | 0.311 | 0.252 | 0.405 | 0.318 | 0.425 | 0.388 | 0.488 | 0.405 |
| CARP | – | 640 | 0.288 | 0.333 | 0.395 | 0.274 | 0.419 | 0.364 | 0.414 | 0.373 |
| Structure encoder |  |  |  |  |  |  |  |  |  |  |
| ESM-if1 | – | 142 | 0.395 | 0.409 | 0.368 | 0.392 | 0.403 | 0.324 | 0.624 | 0.422 |
| Sequence + structure encoder |  |  |  |  |  |  |  |  |  |  |
| MIF-ST | – | 643 | 0.400 | 0.406 | 0.390 | 0.323 | 0.432 | 0.373 | 0.486 | 0.401 |
| SaProt | Masked | 650 | 0.382 | – | 0.459 | 0.382 | 0.485 | 0.371 | 0.583 | 0.456 |
| SaProt | Unmasked | 650 | 0.376 | 0.359 | 0.450 | 0.376 | 0.460 | 0.372 | 0.577 | 0.447 |
| ProtSSN (ours) | k20_h512 | 148 | 0.419 | 0.442 | 0.458 | 0.371 | 0.436 | 0.387 | 0.566 | 0.444 |
| ProtSSN (ours) | Ensemble | 1,467 | 0.425 | 0.440 | 0.466 | 0.371 | 0.451 | 0.398 | 0.568 | 0.451 |
  1. The top three are highlighted by first, second, and third.
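Most zero-shot sequence models in this table score a variant as a log-odds ratio between the mutant and wild-type residues under the model's per-position distribution. A generic sketch (any masked protein language model; the array shapes and fake probabilities are illustrative, not a specific codebase):

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
IDX = {a: i for i, a in enumerate(AA)}

def log_odds_score(log_probs: np.ndarray, mutations: list[str]) -> float:
    """Zero-shot variant score: sum over mutated sites of
    log p(mutant) - log p(wild type). `log_probs` is (L, 20),
    per-residue log-probabilities from any protein model."""
    score = 0.0
    for m in mutations:                  # e.g. "A123G", 1-indexed positions
        wt, pos, mut = m[0], int(m[1:-1]) - 1, m[-1]
        score += log_probs[pos, IDX[mut]] - log_probs[pos, IDX[wt]]
    return score

L = 200
lp = np.log(np.random.dirichlet(np.ones(20), size=L))  # fake model output
print(log_odds_score(lp, ["A123G", "K45E"]))
```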

Table 2
Influence of folding strategies (AlphaFold2 and ESMFold) on prediction performance for structure-involved models.
Interaction and Catalysis denote ProteinGym-interaction and ProteinGym-catalysis; diff = AlphaFold2 − ESMFold (↓).

| Model | DTm AlphaFold2 | DTm ESMFold | DTm diff | DDG AlphaFold2 | DDG ESMFold | DDG diff | Interaction AlphaFold2 | Interaction ESMFold | Interaction diff | Catalysis AlphaFold2 | Catalysis ESMFold | Catalysis diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Avg. pLDDT | 90.86 | 83.22 | – | 95.19 | 86.03 | – | 82.86 | 65.69 | – | 85.80 | 73.45 | – |
| ESM-if1 | 0.395 | 0.371 | 0.024 | 0.457 | 0.488 | −0.031 | 0.351 | 0.259 | 0.092 | 0.386 | 0.368 | 0.018 |
| MIF-ST | 0.400 | 0.378 | 0.022 | 0.438 | 0.423 | −0.015 | 0.390 | 0.327 | 0.063 | 0.408 | 0.388 | 0.020 |
| k30_h1280 | 0.384 | 0.370 | 0.014 | 0.396 | 0.390 | 0.006 | 0.398 | 0.373 | 0.025 | 0.443 | 0.439 | 0.004 |
| k30_h768 | 0.359 | 0.356 | 0.003 | 0.378 | 0.366 | 0.012 | 0.400 | 0.374 | 0.026 | 0.442 | 0.436 | 0.006 |
| k30_h512 | 0.413 | 0.399 | 0.014 | 0.408 | 0.394 | 0.014 | 0.400 | 0.372 | 0.028 | 0.447 | 0.442 | 0.005 |
| k20_h1280 | 0.415 | 0.391 | 0.024 | 0.429 | 0.410 | 0.019 | 0.399 | 0.365 | 0.034 | 0.446 | 0.441 | 0.005 |
| k20_h768 | 0.415 | 0.403 | 0.012 | 0.419 | 0.397 | 0.022 | 0.401 | 0.370 | 0.031 | 0.449 | 0.442 | 0.007 |
| k20_h512 | 0.419 | 0.395 | 0.024 | 0.441 | 0.432 | 0.009 | 0.406 | 0.371 | 0.035 | 0.449 | 0.439 | 0.010 |
| k10_h1280 | 0.406 | 0.391 | 0.015 | 0.426 | 0.411 | 0.015 | 0.396 | 0.365 | 0.031 | 0.440 | 0.434 | 0.006 |
| k10_h768 | 0.400 | 0.391 | 0.009 | 0.414 | 0.400 | 0.014 | 0.379 | 0.349 | 0.030 | 0.431 | 0.421 | 0.010 |
| k10_h512 | 0.383 | 0.364 | 0.019 | 0.424 | 0.414 | 0.010 | 0.389 | 0.364 | 0.025 | 0.440 | 0.432 | 0.008 |
  1. The top three are highlighted by first, second, and third.
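The Avg. pLDDT row can be reproduced from the predicted structures themselves: AlphaFold2 and ESMFold both write per-residue pLDDT into the B-factor column of their output PDB files. A sketch with Biopython (the file name is hypothetical):

```python
from Bio.PDB import PDBParser  # pip install biopython

def mean_plddt(pdb_path: str) -> float:
    """Average per-residue pLDDT for an AlphaFold2/ESMFold model,
    read from the B-factor of one representative atom (CA) per residue."""
    structure = PDBParser(QUIET=True).get_structure("model", pdb_path)
    scores = [res["CA"].get_bfactor()
              for res in structure.get_residues() if "CA" in res]
    return sum(scores) / len(scores)

# e.g. mean_plddt("1a7v_A_esmfold.pdb")  # hypothetical file name
```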

Table 3
Variant effect prediction on ProteinGym v0 with both zero-shot and few-shot methods.

Results are retrieved from Notin et al., 2022a.

Columns: Single/Double/All give ρ by mutation depth; Low/Medium/High give ρ by MSA depth; Prokaryote/Human/Eukaryote/Virus give ρ by taxon.

| Model | Type | # Params (million) | Single | Double | All | Low | Medium | High | Prokaryote | Human | Eukaryote | Virus |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Few-shot methods |  |  |  |  |  |  |  |  |  |  |  |  |
| SiteIndep | Single | – | 0.378 | 0.322 | 0.378 | 0.431 | 0.375 | 0.342 | 0.343 | 0.375 | 0.401 | 0.406 |
| EVmutation | Single | – | 0.423 | 0.401 | 0.423 | 0.406 | 0.403 | 0.484 | 0.499 | 0.396 | 0.429 | 0.381 |
| Wavenet | Single | – | 0.399 | 0.344 | 0.400 | 0.286 | 0.404 | 0.489 | 0.492 | 0.373 | 0.442 | 0.321 |
| GEMME | Single | – | 0.460 | 0.397 | 0.463 | 0.444 | 0.446 | 0.520 | 0.505 | 0.436 | 0.479 | 0.451 |
| DeepSequence | Single | – | 0.411 | 0.358 | 0.416 | 0.386 | 0.391 | 0.505 | 0.497 | 0.396 | 0.461 | 0.332 |
| DeepSequence | Ensemble | – | 0.433 | 0.394 | 0.435 | 0.386 | 0.411 | 0.534 | 0.522 | 0.405 | 0.480 | 0.361 |
| EVE | Single | – | 0.451 | 0.406 | 0.452 | 0.417 | 0.434 | 0.525 | 0.518 | 0.411 | 0.469 | 0.436 |
| EVE | Ensemble | – | 0.459 | 0.409 | 0.459 | 0.424 | 0.441 | 0.532 | 0.526 | 0.419 | 0.481 | 0.437 |
| MSA Transformer | Single | 100 | 0.405 | 0.358 | 0.426 | 0.372 | 0.415 | 0.500 | 0.506 | 0.387 | 0.468 | 0.379 |
| MSA Transformer | Ensemble | 500 | 0.440 | 0.374 | 0.440 | 0.387 | 0.428 | 0.513 | 0.517 | 0.398 | 0.467 | 0.406 |
| Tranception-R | Single | 700 | 0.450 | 0.427 | 0.453 | 0.442 | 0.438 | 0.500 | 0.495 | 0.424 | 0.485 | 0.434 |
| TranceptEVE | Single | 700 | 0.481 | 0.445 | 0.481 | 0.454 | 0.465 | 0.542 | 0.539 | 0.447 | 0.498 | 0.461 |
| Zero-shot methods |  |  |  |  |  |  |  |  |  |  |  |  |
| RITA | Ensemble | 2,210 | 0.393 | 0.236 | 0.399 | 0.350 | 0.414 | 0.407 | 0.391 | 0.391 | 0.405 | 0.417 |
| ESM-1v | Ensemble | 3,250 | 0.416 | 0.309 | 0.417 | 0.390 | 0.378 | 0.536 | 0.521 | 0.439 | 0.423 | 0.268 |
| ProGen2 | Ensemble | 10,779 | 0.421 | 0.312 | 0.423 | 0.384 | 0.421 | 0.467 | 0.497 | 0.412 | 0.459 | 0.373 |
| ProtTrans | Ensemble | 6,840 | 0.417 | 0.360 | 0.413 | 0.372 | 0.395 | 0.492 | 0.498 | 0.419 | 0.400 | 0.322 |
| ESM2 | Ensemble | 18,843 | 0.415 | 0.316 | 0.413 | 0.391 | 0.381 | 0.509 | 0.508 | 0.456 | 0.461 | 0.213 |
| SaProt | Ensemble | 1,285 | 0.447 | 0.368 | 0.450 | 0.456 | 0.410 | 0.544 | 0.534 | 0.464 | 0.460 | 0.334 |
| ProtSSN | Ensemble | 1,476 | 0.433 | 0.381 | 0.433 | 0.406 | 0.402 | 0.532 | 0.530 | 0.436 | 0.491 | 0.290 |
  1. The top three are highlighted by first, second, and third.
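Category-wise scores like those above are plain grouped averages of per-assay Spearman values. A sketch with pandas (toy data; the column names are illustrative, not the released schema):

```python
import pandas as pd

# Hypothetical per-assay results table.
df = pd.DataFrame({
    "assay":     ["a1", "a2", "a3", "a4"],
    "msa_depth": ["Low", "High", "Low", "Medium"],
    "taxon":     ["Human", "Virus", "Prokaryote", "Human"],
    "spearman":  [0.41, 0.29, 0.53, 0.44],
})
# Average Spearman per category, as in Table 3's column groups.
print(df.groupby("msa_depth")["spearman"].mean())
print(df.groupby("taxon")["spearman"].mean())
```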

Table 4
Average Spearman’s ρ correlation of variant effect prediction on DTm and DDG for zero-shot methods with model ensemble.

The values within () indicate the standard deviation of bootstrapping.

| Model | # Params (million) | DTm | DDG |
| --- | --- | --- | --- |
| RITA | 2,210 | 0.195 (0.045) | 0.255 (0.061) |
| ESM-1v | 3,250 | 0.300 (0.036) | 0.310 (0.054) |
| Tranception | 1,085 | 0.202 (0.039) | 0.277 (0.062) |
| ProGen2 | 10,779 | 0.293 (0.042) | 0.282 (0.063) |
| ProtTrans | 6,840 | 0.323 (0.039) | 0.389 (0.059) |
| ESM2 | 18,843 | 0.346 (0.035) | 0.383 (0.055) |
| SaProt | 1,285 | 0.392 (0.040) | 0.415 (0.061) |
| ProtSSN | 1,476 | 0.425 (0.033) | 0.440 (0.057) |
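The parenthesized values are bootstrap standard deviations. A sketch of that estimate (the resampling unit and the number of resamples are assumptions; the authors' protocol may differ):

```python
import numpy as np
from scipy.stats import spearmanr

def bootstrap_std(pred, truth, n_boot=1000, seed=0):
    """Standard deviation of Spearman's rho over bootstrap resamples,
    the quantity shown in parentheses in Table 4."""
    rng = np.random.default_rng(seed)
    n = len(pred)
    rhos = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample with replacement
        rho, _ = spearmanr(pred[idx], truth[idx])
        rhos.append(rho)
    return float(np.std(rhos))

pred = np.random.default_rng(1).normal(size=80)   # toy scores
truth = 0.6 * pred + np.random.default_rng(2).normal(size=80)
print(round(bootstrap_std(pred, truth), 3))
```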
Table 5
Statistical summary of DTm and DDG.
| pH range | DTm # Assays | DTm Avg. # mut | DTm Avg. AA | DDG # Assays | DDG Avg. # mut | DDG Avg. AA |
| --- | --- | --- | --- | --- | --- | --- |
| Acid (pH < 7) | 29 | 29.1 | 272.8 | 21 | 22.6 | 125.3 |
| Neutral (pH = 7) | 14 | 37.4 | 221.2 | 10 | 23.1 | 78.3 |
| Alkaline (pH > 7) | 23 | 50.1 | 233.3 | 5 | 24.6 | 101.4 |
| Sum | 66 | 38.2 | 221.2 | 36 | 23.0 | 108.8 |
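The pH bins in this summary can be reproduced with a simple grouping; a sketch with pandas over hypothetical per-assay metadata (column names are illustrative):

```python
import pandas as pd

def ph_bin(ph: float) -> str:
    """Assign an assay to the pH bins used in Table 5."""
    if ph < 7:
        return "Acid (pH < 7)"
    if ph == 7:
        return "Neutral (pH = 7)"
    return "Alkaline (pH > 7)"

# Toy per-assay metadata.
assays = pd.DataFrame({"ph": [5.0, 7.0, 8.5, 6.5], "n_mut": [12, 40, 55, 20]})
assays["bin"] = assays["ph"].map(ph_bin)
print(assays.groupby("bin").agg(n_assays=("ph", "size"), avg_mut=("n_mut", "mean")))
```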
Table 6
Details of zero-shot baseline models.
| Model | Type | Description | Source code |
| --- | --- | --- | --- |
| RITA (Hesslow et al., 2022) | Seq | A generative protein language model with billion-level parameters | https://github.com/lightonai/RITA (Hesslow and Poli, 2023) |
| ProGen2 (Nijkamp et al., 2023) | Seq | A generative protein language model with billion-level parameters | https://github.com/salesforce/progen (Nijkamp, 2022) |
| ProtTrans (Elnaggar et al., 2021) | Seq | Transformer-based models trained on a large protein sequence corpus | https://github.com/agemagician/ProtTrans (Elnaggar and Heinzinger, 2025) |
| Tranception (Notin et al., 2022a) | Seq | An autoregressive model for variant effect prediction with a retrieval mechanism | https://github.com/OATML-Markslab/Tranception (Notin, 2023) |
| CARP (Yang et al., 2024) | Seq | Pretrained CNN protein sequence masked language models of various sizes | https://github.com/microsoft/protein-sequence-models (microsoft, 2024) |
| ESM-1b (Rives et al., 2021), ESM-1v (Meier et al., 2021), ESM2 (Lin et al., 2023) | Seq | Masked language model-based pre-training methods with various pre-training datasets and positional embedding strategies | https://github.com/facebookresearch/esm (facebookresearch, 2023) |
| ESM-if1 (Hsu et al., 2022) | Struct | An inverse folding method with masked language modeling and geometric vector perceptron (Jing et al., 2020) | https://github.com/facebookresearch/esm |
| MIF-ST (Yang et al., 2023) | Seq + Struct | Pretrained masked inverse folding models with sequence pretraining transfer | https://github.com/microsoft/protein-sequence-models |
| SaProt (Su et al., 2023) | Seq + Struct | Structure-aware vocabulary for protein language modeling with FoldSeek | https://github.com/westlake-repl/SaProt (Su and fajieyuan, 2025) |
Author response table 1
Spearman’s correlation between the number of MSA sequences and the model’s performance.
| Model Name | Model Type | Spearman’s score |
| --- | --- | --- |
| EVE (ensemble) | Alignment-based model | 0.239 |
| GEMME | Alignment-based model | 0.207 |
| TranceptEVE L | Hybrid: Alignment & PLM | 0.237 |
| MSA Transformer (ensemble) | Hybrid: Alignment & PLM | 0.262 |
| ESM-IF1 | Inverse folding model | 0.346 |
| ESM2 (650M) | Protein language model | 0.297 |
| ESM-1v (ensemble) | Protein language model | 0.372 |
| SaProt (650M) | Hybrid: Structure & PLM | 0.260 |
| ProtSSN (ensemble) | Hybrid: Structure & PLM | 0.217 |

Additional files

MDAR checklist
https://cdn.elifesciences.org/articles/98033/elife-98033-mdarchecklist1-v1.pdf
Source code 1

The source code .ZIP file contains the complete implementation of model training and evaluation, the associated processed datasets, and a README document that provides general instructions.

https://cdn.elifesciences.org/articles/98033/elife-98033-code1-v1.zip


Cite this article

Yang Tan, Bingxin Zhou, Lirong Zheng, Guisheng Fan, Liang Hong (2025) Semantical and geometrical protein encoding toward enhanced bioactivity and thermostability. eLife 13:RP98033. https://doi.org/10.7554/eLife.98033.4