Introduction

Enzymes are efficient, non-toxic, and sustainable alternatives to traditional chemical catalysts in applications such as bio-remediation, pharmaceutical production, and biofuel production 1. Catalytic residues are the amino acids in a protein sequence directly involved in its mechanism of action. They participate in catalysis both directly and indirectly; for example, they can stabilise transition states 2, act as direct electron donors, or interact with secondary residues, water molecules, or cofactors that are in turn directly involved in the catalytic mechanism. As catalytic activity depends on specific structural conformations and sequence context, identifying catalytic residues remains a challenge.

Catalytic residue annotations can be used to infer protein function and enable insights into mechanisms of action and evolutionary relationships 3–5. Annotations are used to facilitate enzyme engineering methods such as directed evolution and rational design 6, and can help construct the theoretical catalytic sites known as theozymes that are used to condition de novo enzyme design 7. Experimental methods to determine catalytic residues are time-consuming, labour-intensive, and costly, as they often require experimentally determined enzyme-substrate structures. To expedite annotation, computational approaches have been developed to predict catalytic residues from protein sequences or predicted structures 8–11, using annotations in databases such as UniProt 12 and M-CSA, a manually curated enzyme mechanism database 13. Existing computational methods can be broadly grouped into three categories: 1) similarity-based methods and sequence alignment, 2) structural methods, or 3) statistical and machine learning approaches.

Similarity-based approaches are effective when homologous sequences have been previously annotated. Structure-based approaches use a variety of geometric search algorithms to find potential binding pockets and then map sequence conservation scores to these regions to predict catalytic residues with higher specificity 14–16. Geometric methods improve the accuracy of approaches reliant on homology, but the limitation of requiring homologous sequences remains. Data-driven approaches such as statistical models or machine learning algorithms offer a more flexible framework 9,11,17 but are restricted by the quality and diversity of datasets available for training. The current machine learning (ML) state-of-the-art (SOTA) tools apply multimodal deep learning strategies and often emphasise the value of the structural modality. Shen et al. 10 combined an adaptive edge-gated graph attention neural network (AEGAN) with graphs constructed using the 3D structure and sequence-derived position-specific scoring matrices. SCREEN leverages protein sequence and structure by combining graph convolutional neural networks and contrastive learning on Protein Language Models (PLMs) 9. EasIFA, developed by Wang et al. 11, predicts both catalytic residues and binding sites by incorporating graph attention representations of chemical reactions with PLM information and structure representations. Despite the reported improvements over sequence-based methods by multimodal approaches, current benchmarks inherently obscure the reliance on structural prediction, as they often contain sequences with known structures, which are more likely to be predicted accurately by tools such as AlphaFold 18. This bias may limit the evaluation of generalisability to unseen data, i.e. enzymes without experimentally derived structures. Furthermore, predicting structures remains computationally intensive.

In contrast to structural models, PLMs learn the distribution of amino acids across protein sequences. PLMs embed biologically meaningful representations that capture complex dependencies between residues within a protein sequence 19–21. These models are trained in an unsupervised manner on large numbers (billions) of diverse sequences 22,23. Remote homology search and alignment algorithms based on PLM embeddings have been reported to outperform sequence similarity-based methods at low sequence identities 24,25. In enzyme function prediction, models highlight the utility of PLM embeddings in combination with contrastive learning to achieve SOTA ML performance in the prediction of enzyme function (by proxy of enzyme classification numbers) and the associated reactions catalysed 26–28; however, similarity-based approaches (e.g. BLAST or reaction distance) perform comparably 28. We hypothesised that using PLM embeddings to predict catalytic residues would complement sequence similarity approaches in low homology settings.

Contrastive learning is a discriminative modelling approach that facilitates learning by distinguishing between paired observations in data that are “similar” or “different”, thus enabling information to be extracted from dense PLM embeddings 29. Data engineering, such as choosing a dataset to enhance specific relationships 30, enables domain expertise to inform pair schemes for training. For example, it is common to explicitly incorporate and maximise “hard negatives” (samples that are dissimilar but close in the latent space) to refine the model’s ability to discriminate subtle differences in the feature space 31–33.

Although pair schemes can be generated automatically in contrastive learning frameworks 30, we posited that the physical principles of enzyme function could be used to maximise hard negatives and create an accurate catalytic residue prediction algorithm, as in ref 28. Enzyme function can be grouped into classes using the domain-standard Enzyme Commission (EC) ontology, which classifies enzymes into four levels based on the reactions they catalyse. The first level represents the general type of reaction (e.g., oxidoreductases), while subsequent levels add specificity, capturing information about substrates, cofactors, and products 34. This hierarchical structure provides a basis for relationships between enzyme functions and mechanisms that can be used to create biologically meaningful pairs.
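To make the hierarchy concrete, membership in an EC class at a given level reduces to a prefix comparison on the dotted EC string. The sketch below is illustrative only; the helper names `ec_prefix` and `share_ec_class` are hypothetical and not taken from the Squidly codebase:

```python
def ec_prefix(ec: str, level: int) -> tuple:
    """Truncate an EC number (e.g. '3.1.3.48') to its first `level` fields."""
    return tuple(ec.split(".")[:level])

def share_ec_class(ec_a: str, ec_b: str, level: int = 2) -> bool:
    """True if two EC numbers agree down to the requested level of the hierarchy."""
    return ec_prefix(ec_a, level) == ec_prefix(ec_b, level)

# '3.1.3.48' and '3.1.1.1' share the first two levels (hydrolases acting on
# ester bonds), while '3.1.3.48' and '2.7.1.1' diverge at the first level.
```

Grouping sequences by such prefixes (here at the 2nd level) is what allows pairs to be drawn within or across broader reaction classes.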

To overcome the compute limitations of structural methods and existing benchmarks, we developed Squidly and a new benchmark for low sequence identity catalytic residue predictions. We combined a contrastive learning framework on PLM per-token embeddings with a rationally designed hierarchical pair scheme to create a sequence-based catalytic residue predictor that is accurate, scalable, and able to generalise in the absence of sequence similarity. Squidly exceeds the performance of both BLAST and SOTA ML tools in existing benchmarks, achieving an F1 of greater than 0.85 across enzyme families, in addition to generalising to low sequence identity enzymes, with an F1 of 0.64 on sequences with less than 30% identity. Squidly is over 50x faster than folding a structure, enabling it to be used to annotate existing databases, in metagenomics pipelines, and to filter de novo designed proteins. Finally, owing to the added interpretability of sequence similarity-based approaches, we provide Squidly as an ensemble with BLAST and suggest using Squidly where the sequence identity is less than 50%. We envision Squidly will be broadly applicable in the enzyme engineering field and will complement existing sequence homology and structure-based methods.

Methods

AEGAN training and test datasets

To ensure comparability with the SOTA tools for catalytic residue prediction, the same training, test, and benchmark datasets (Uni14230, Uni3175) were utilised as described in prior work 10. These were originally sourced from UniProt and M-CSA databases. The Uni14230 and Uni3175 datasets also include computational and manually curated predictions from SwissProt. To reduce redundancy, the data was filtered to retain only sequences with less than 60% sequence identity, resulting in 8,784 sequences in the training set and 1,955 sequences in the test set. Additionally, Squidly was evaluated against six previously reported benchmark datasets 35–38. The EF family, superfamily, fold, and HA superfamily datasets were specifically designed to include one representative sequence from each enzyme family, fold, or superfamily present in existing databases, and are thus not designed to predict “novel” active sites 35,36. The NN and PC datasets were both constructed to be structurally and functionally heterogeneous based on SCOP and CATH classifications 37,38. These datasets were generated prior to 2007 and are therefore limited in their catalytic diversity compared to current databases. Both AEGAN and SCREEN were originally benchmarked against distinct subsets of these datasets; however, we found that they contained sequences with high identity to their respective training data, as detailed in Supplementary Table 2. Squidly’s reported performance on these datasets is based on the Uni14230 training set, to ensure a fair comparison with AEGAN and given the lower overall sequence similarity between the training and test sets.

New benchmark: CataloDB

To address the shortcomings of existing benchmark datasets for the evaluation of catalytic residue predictions, we introduce a benchmarking dataset named CataloDB. CataloDB collects the annotation data available for enzymes in the SwissProt database. The total set of enzymes includes experimentally determined annotations in addition to reviewed sequences with known structures; see Appendix A.1 for more details. Sequences with greater than 90% sequence similarity, based on shared overlap, were removed using MMseqs2 39. Structural and sequence-based clustering was performed on the filtered dataset to create a test set of 232 sequences with less than 30% sequence and structural identity to the training set (the remaining 5357 sequences). Structural identity was calculated using Foldseek 40. Sequences present in any of the six commonly used benchmarks were omitted from the training data to ensure comparability between CataloDB and the existing benchmarks. Additional detailed descriptions of the data are available in Appendix A.1, and see Supplementary Figures S1-S6 for the distribution of amino acids and EC classes in the benchmark dataset.

To enable direct comparison with the SOTA tool SCREEN, we retrained a lightweight version of the model using the CataloDB benchmark training data. This version of SCREEN, provided by the original authors, omits costly evolutionary features that are cumbersome for larger datasets. The authors report that these features have minimal impact on performance 9. Minor changes were made to the source code to improve model training stability. Notably, we allowed early stopping after 5 epochs of training, rather than after 200 epochs as in the original source code, as we observed exploding gradients at later epochs when training on CataloDB.

ESM2 embeddings

Each sequence was encoded by either the 3-billion-parameter (3B) or 15-billion-parameter (15B) ESM2 model 23. Pre-trained model weights were used without fine-tuning, and per-token embeddings were extracted from the final layer of the encoder. These embeddings comprise a vector of length 2560 (3B) or 5120 (15B) for each amino acid in the sequence. No normalisation was applied to the embeddings before downstream use.

Contrastive model

The supervised contrastive model is a network that takes per-residue embeddings as input (2560- or 5120-length vectors) and is trained to output a new embedding (N=128) for the residue. The model contains an initial dropout layer connected to the inputs, with a dropout rate of 10%, and two subsequent hidden layers of sizes 1280 and 640. During training, batches of 16,000 pairs are passed through the model, and the distance between the output representations of each pair is calculated. The cosine embedding loss function implemented by PyTorch is used to calculate the loss based on these distances:

loss(x1, x2, y) = 1 − cos(x1, x2),                if y = 1
loss(x1, x2, y) = max(0, cos(x1, x2) − margin),   if y = −1

where x1 and x2 are the input vectors and y is the label, with y = 1 indicating a positive pair and y = −1 indicating a negative pair. When y = 1, the loss function rewards the model for maximising the cosine similarity, cos(x1, x2); when y = −1, it minimises the cosine similarity. Importantly, the pair selection scheme optimises for hard-negative pairs, and thus the margin is set to 0 when defining the loss function.
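The behaviour of this loss with margin = 0 can be mirrored for a single pair in plain Python. This is a minimal sketch of the semantics of PyTorch's CosineEmbeddingLoss, not the training code; the function names here are hypothetical:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_embedding_loss(x1, x2, y, margin=0.0):
    """Single-pair loss: 1 - cos(x1, x2) for positive pairs (y = 1),
    max(0, cos(x1, x2) - margin) for negative pairs (y = -1)."""
    c = cosine(x1, x2)
    return 1.0 - c if y == 1 else max(0.0, c - margin)
```

With margin = 0, any residual similarity between a negative pair contributes to the loss, which is why the scheme pairs well with hard negatives: the model is penalised until negative pairs are pushed towards orthogonality.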

Reaction informed pair-mining

Positive and negative pairs were created to enable contrastive comparisons: in the simplest form, positives are pairs of exclusively catalytic or exclusively non-catalytic residues, while negatives pair a catalytic with a non-catalytic residue. As considering all possible pairs of amino acids is intractable, we subsample from this pool to make training feasible, and we tested three subsampling methods. Specifically, we used information about the EC number of the sequence and the amino acid identities to refine the pair selection process. For further information, see Appendix A.2.

Three pair schemes were created to test for the effect of the pair scheme, herein referred to as “reaction-informed pair-mining”. See Figure 1 for a visual description.

Representation of the pair-mining schemes depicting the negative and positive classes for each scheme.

Scheme 1 is the most permissive and is amino acid and EC unaware. Schemes 2 and 3 are more restrictive, requiring both the amino acids and the EC classes to be shared or different in the pair-mining scheme. Scheme 3 included additional negative pairs to separate EC numbers in the latent space.

Scheme 1 samples pairs across all sequences and residues such that positive pairs are pairs of residues that are either both catalytic or both non-catalytic.

Scheme 2 samples positive and negative pairs much like in scheme 1, except that each pair of residues must be the same amino acid and must have come from a sequence which has the same EC number.

Scheme 3 performs the same pair sampling as in scheme 2; however, scheme 3 also includes additional negative pairs that are either exclusively catalytic residues or exclusively non-catalytic and come from different EC numbers or amino acids.

To combat the inherent bias in the data towards certain EC numbers, sequences were grouped by EC number at the 2nd level, and up to 600 catalytic and 600 non-catalytic residues were sampled from each EC group. An equal number of positive and negative pairs was generated for each EC group. A sample-limit parameter caps the total number of pairs made for each group; we set it to 16,000, providing a total of 8–10 million pairs for schemes 2 and 3. Scheme 1 was then trained with an equal total number of pairs.
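The constraints of scheme 2 can be sketched as follows. This is illustrative pseudocode only: the residue dictionary keys and the function `mine_pairs_scheme2` are hypothetical names, not taken from the Squidly source, and the real pipeline additionally applies the per-EC sampling caps described above:

```python
import random

def mine_pairs_scheme2(residues, n_pairs, seed=0):
    """Sketch of scheme 2 pair-mining. Each residue is a dict with
    'aa' (amino acid), 'ec2' (EC number truncated to the 2nd level),
    and 'catalytic' (bool). Positive pairs (label 1) share catalytic
    status; negative pairs (label -1) mix a catalytic with a
    non-catalytic residue. Both members of a pair must share the same
    amino acid and the same 2nd-level EC class."""
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < n_pairs:
        a, b = rng.sample(residues, 2)
        if a["aa"] != b["aa"] or a["ec2"] != b["ec2"]:
            continue  # scheme 2 rejects pairs differing in amino acid or EC class
        label = 1 if a["catalytic"] == b["catalytic"] else -1
        pairs.append((a, b, label))
    return pairs
```

Scheme 3 would extend this with extra negative pairs drawn across different EC classes or amino acids, pushing distinct reaction classes apart in the latent space.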

LSTM classifier

All LSTM classification models were trained using the same parameters. The models are bidirectional LSTM networks designed for binary classification of catalytic versus non-catalytic residues in protein sequences. The model takes as input 128-dimensional per-residue embeddings (from the contrastive model) that are processed by a 2-layer bidirectional LSTM (hidden size 128) with a dropout rate of 0.2, followed by a linear layer mapping the output to a single logit score. Class imbalance is addressed by weighting the misclassification of catalytic residues 100 times higher during loss computation. The models were trained using a 90:10 train-validation split, with a batch size of 400 sequences, a learning rate of 0.01, and early stopping with a patience of 5 epochs. These parameters were not optimised. See Figure 2 for a depiction of the model architecture.
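The 100x weighting of catalytic misclassifications corresponds to a positively weighted binary cross-entropy on the logit output. The sketch below shows the semantics for a single residue; `weighted_bce_with_logits` is a hypothetical pure-Python illustration of the weighting, not the training code (which uses a deep-learning framework):

```python
import math

def weighted_bce_with_logits(logit, target, pos_weight=100.0):
    """Binary cross-entropy on a single logit, with the catalytic
    (positive, target = 1.0) class up-weighted by pos_weight so that
    missing a catalytic residue costs far more than a false positive."""
    p = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    eps = 1e-12  # guards the logs against p == 0 or p == 1
    return -(pos_weight * target * math.log(p + eps)
             + (1.0 - target) * math.log(1.0 - p + eps))
```

With pos_weight = 100, a confidently missed catalytic residue incurs roughly 100 times the loss of an equally confident false positive, counteracting the extreme rarity of catalytic residues within a sequence.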

Overview of the model, depicting a sequence being translated into per-token embeddings via ESM2; each per-token embedding is then translated into a smaller representation space by the MLP trained with contrastive loss (CL).

Each residue from the CL model is then predicted as catalytic or not catalytic via a bidirectional LSTM network.

Squidly ensemble and uncertainty quantification

For each training dataset, five models were ensembled by taking the per-residue outputs from the LSTMs and combining them to compute the mean score, along with the predictive entropy for each residue 41. Users can specify a threshold for classifying a residue as catalytic (between 0 and 1): if the mean score exceeds this threshold, the residue is designated as “catalytic”. Users can optionally use the uncertainty to further filter predictions by confidence.
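The per-residue combination step can be sketched as follows. The helper name `ensemble_prediction` is hypothetical, and the binary Shannon entropy of the mean score is assumed here as the predictive-entropy measure:

```python
import math

def ensemble_prediction(scores, threshold=0.5):
    """Combine per-residue scores from the ensemble members into a mean
    score, a catalytic call against the user threshold, and the binary
    Shannon entropy of the mean score as an uncertainty estimate."""
    p = sum(scores) / len(scores)
    eps = 1e-12  # guards the logs against p == 0 or p == 1
    entropy = -(p * math.log2(p + eps) + (1 - p) * math.log2(1 - p + eps))
    return {"catalytic": p > threshold, "score": p, "entropy": entropy}
```

When the five models agree confidently the entropy is near 0, while disagreement (mean near 0.5) drives it towards 1 bit, giving users a simple knob for filtering uncertain calls.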

BLAST similarity-based search

BLAST is a local alignment search tool able to identify homologous sequences in a reference database 42. For our study, we used DIAMOND 43, a BLAST-like aligner, to search the training set for sequences similar to each test set sequence. The ultra-sensitive flag was used to identify sequences with low sequence similarities. Upon retrieval of results, the hit with the highest sequence similarity was retained. This hit was then aligned against the query using the alignment tool Clustal Omega 44. The gapped alignment was used to infer the catalytic residues of the query sequence.
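The final transfer step amounts to walking both aligned strings in parallel and mapping the hit's catalytic positions onto query positions. A minimal sketch (the function `transfer_catalytic_sites` is a hypothetical illustration; 0-based ungapped residue positions and '-' gap characters are assumed):

```python
def transfer_catalytic_sites(query_aln, hit_aln, hit_catalytic):
    """Map catalytic residue positions (0-based, ungapped) from an aligned
    hit onto the query via the gapped pairwise alignment. query_aln and
    hit_aln are equal-length aligned strings containing '-' for gaps."""
    q_pos = h_pos = 0
    transferred = []
    for q_char, h_char in zip(query_aln, hit_aln):
        # Transfer only where both sequences have a residue in this column
        if q_char != "-" and h_char != "-" and h_pos in hit_catalytic:
            transferred.append(q_pos)
        if q_char != "-":
            q_pos += 1
        if h_char != "-":
            h_pos += 1
    return transferred
```

Catalytic positions that fall opposite a gap in the query are simply dropped, which is one reason homology transfer loses recall as alignments degrade at low identity.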

Results

ESM2 per-token embeddings contain catalytic residue specific variation

To illustrate the information content of per-residue embeddings for catalytic residue differentiation, we performed principal component analysis (PCA) on equally sampled catalytic and non-catalytic residues from EC 3.1 and EC 2.7, Figure 3. For EC 3.1, the first two principal components explained 5.24% and 1.86% of the variance, respectively. Although the explained variance in PCs 1 and 2 is minimal, the separation in PCA space indicates that the two largest signatures of linear variation are related to catalytic roles, with visible clusters of catalytic (red-coloured) and non-catalytic (blue-coloured) residues, Figure 3. The separability under a linear transformation suggests that ESM2 per-residue embeddings could be used to capture catalytic-specific signals.

Principal component analysis of per-residue embeddings from EC numbers 3.1 and 2.7.

In the upper row, red colours indicate a catalytic residue, and blue, non-catalytic. In the lower row, residues are coloured by amino acid class which represent their structural and chemical properties. The x and y axes represent the first and second principal components in each subplot. A clear separation of catalytic and non-catalytic residues is seen in the principal components which explain the most variance in our data.

Reaction informed pair-mining improves upon contrastive learning and state of the art prediction performance

We hypothesised that contrastive learning would present an effective approach to leverage the rich feature space of ESM2 embeddings to distinguish between catalytic and non-catalytic roles. We implemented a biologically informed hierarchical contrastive learning pair selection framework to select pairs which are informative at the per-residue level. Three datasets were created to compare the performance of different pair scheme selection criteria (see Methods for details).

Squidly performance on the Uni3175 dataset

Scheme 1, see Figure 1, is a “control” scheme and uses a standard contrastive loss function with random pair selection. We observed that scheme 1 performs poorly overall on the Uni3175 dataset, Figure 4. Across all datasets, scheme 1 achieved relatively low F1 scores, with mean F1 scores of 0.49 and 0.43 for the 3B and 15B ESM2 models, respectively. Scheme 1 failed to surpass the LSTM classification model using unchanged ESM2 embeddings, which had a mean F1 score of 0.73, Figure 4. Schemes 2 and 3, both of which employ reaction-informed pair-mining, exhibited improved performance relative to both scheme 1 and the ESM2 baseline. Scheme 2 achieved mean F1 scores of 0.80 and 0.82 using the 3B and 15B ESM2 models, respectively. Scheme 3 improves upon scheme 2 by contrasting catalytic and non-catalytic sites based on distinct EC and amino acid labels, Figure 1, resulting in mean F1 scores of 0.79 and 0.84. The best performing model achieved an F1 score of 0.86 using scheme 3 and the 15B ESM2 model.

Comparison of the Squidly contrastive learning framework pair schemes on the Uni3175 dataset.

We directly compare the schemes with AEGAN, the tool by Shen et al., the authors of this dataset 10. Schemes 2 and 3 (S2 and S3) using the largest ESM2 model’s (15 billion parameters) embeddings slightly improved upon AEGAN’s performance. The smaller ESM2 model (3 billion parameters) still performs competitively, at a heavily reduced computational cost. In comparison to the rationally informed pair schemes, the scheme 1 and raw-embedding LSTM models performed poorly.

The results from the Uni3175 benchmark show that Squidly improved marginally upon the performance of previous work and the current SOTA ML prediction model AEGAN by Shen et al. 10. Squidly uses only sequence information, through ESM2 embeddings, for prediction, which, compared to the homology and structural modalities included in AEGAN, makes for a more flexible predictor without compromising performance. Overall, the results on the Uni3175 dataset indicate that the rationally informed pair-mining strategy is an effective method for discriminative tasks using per-token ESM2 embeddings. They also indicate that sequence-based models using PLM embeddings can compete with models using structural modalities. See Methods for details on each scheme design.

Squidly outperforms other ML-based methods yet works best in conjunction with sequence similarity methods

Squidly was evaluated against six previously reported benchmark datasets, namely the EF fold, family, and superfamily benchmarks, as well as the HA superfamily, NN, and PC benchmarks; see Supplementary Table 1 and Appendix A.3 for a description of each dataset, and Figure 5 for sequence identity between test and train sets. Squidly’s performance exceeded that of the SOTA models AEGAN and SCREEN, Table 1. When compared to the BLAST baseline, we find that BLAST outperforms all other ML models except Squidly, likely due to the homologous sequences shared between the test and train sets, see Appendix A.3 and Supplementary Table 2 for details.

Sequence similarity based on sequence identity to the closest sequence in the training set between each family.

F1 scores of tools on 6 common benchmark datasets.

We hypothesised that BLAST would work best where a similar sequence existed, while Squidly would excel in low sequence similarity settings. To find the optimal conditions for integration, we varied the sequence identity cut-off at which each predictor was used, Figure 6. For sequences with low sequence identity (<50%), using Squidly improves the F1 score. However, Squidly is more likely than BLAST to record false positives where similar sequences exist, while BLAST is more precise. The cut-off of 50% sequence identity was therefore used to compare the performance of the ensemble method with existing tools. On all datasets, an ensemble of BLAST and Squidly using this cut-off achieves SOTA performance, Table 1.

F1, recall, and precision for the ensemble of Squidly and BLAST with different sequence identity cut-offs.

The cut-offs indicate the point at which to transition from a Squidly to a BLAST prediction. At 100% only Squidly is used for predictions, and at 0%, BLAST is used, unless there are no homologous sequences. For sequences with less than 50% sequence identity (dashed black line), Squidly predictions are used. The dashed horizontal line represents the total score for BLAST on the specific dataset, while the dots represent the combined BLAST and Squidly ensemble approach. Squidly reports higher scores for recall, while BLAST is more precise. The performance is best when used as an ensemble.
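The resulting decision rule is simple to state in code. This is a hypothetical sketch (`ensemble_route` is an illustrative name): BLAST's annotation transfer is used when a sufficiently similar hit exists, with Squidly as the fallback, including when BLAST finds no homologue at all:

```python
def ensemble_route(identity, squidly_pred, blast_pred, cutoff=50.0):
    """Route between predictors: use the BLAST annotation transfer when a
    hit exists with identity >= cutoff (in percent); otherwise fall back
    to Squidly. blast_pred is None when BLAST returns no homologue."""
    if blast_pred is not None and identity >= cutoff:
        return blast_pred
    return squidly_pred
```

Lowering the cut-off trades BLAST's precision for Squidly's recall, which is the trade-off swept in Figure 6.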

Squidly is scalable and fast

Squidly’s sequence-based approach reduces the overall computational cost of prediction, enabling the screening of large databases. SOTA ML models rely on accurate structures as input, and thus structural prediction must be performed prior to inference when screening databases such as metagenomic samples. Squidly’s smaller 3B model can be run locally and predicted roughly 6 sequences per second in our tests on an NVIDIA H100 GPU. Compared to the current widely used structure prediction tool Chai 45, Squidly is approximately 50-fold faster (e.g. N=108, Squidly=108 s, Chai=5757 s). Note that this underestimates the total runtime of structure-based tools, as they must still run their own inference on the predicted structures, which adds further time, Table 2.

Resource usage measurements for Squidly models when predicting 1000 sequences with an average length of 401 residues.

Squidly generalises to low identity sequences

To evaluate the capacity of Squidly to generalise to low identity sequences, we developed CataloDB, a benchmark dataset with low structural and sequence identity; see Methods and Figure 7A, B for similarities between the test and train sets. This benchmark serves as an alternative test set that takes structural and sequence similarity into account, allowing for fair comparison between future structural and sequence-based tools and testing in low identity settings. The Squidly 3B and 15B models showed similar performance on the benchmark, with most sequences (N=148/232 and N=154/232, respectively) recording at least one correct catalytic residue. For comparison, BLAST correctly recovered a catalytic residue for only 68 sequences. Both Squidly models performed similarly, with F1 scores of 0.64 and 0.68, precision of 0.80 and 0.82, and recall of 0.53 and 0.58 for the 15B and 3B models, respectively, while BLAST recorded an F1 of 0.37, recall of 0.24, and precision of 0.72, see Figure 8. We also retrained SCREEN, which performed above BLAST but still below Squidly, Figure 7C; see Methods for details. Reflecting the bias of public data, CataloDB is skewed towards well-represented EC numbers and amino acids. For example, EC 6 was represented by only three sequences in the test data, and EC 7 had no representation. Overall, these results suggest that Squidly can generalise to low identity settings and be used for predictions where traditional homology transfer is not possible. However, caution should be taken when it is applied to poorly represented enzyme classes such as EC 6 and 7.

Distribution of the catalytic residues in the proposed benchmark dataset.

A. The distribution of the identity of the catalytic residues is not even and represents the uneven distribution of the catalytic mechanism within the benchmark dataset. B Almost 50% of the data points are in the well characterised class of EC 3, which are hydrolases. Neither EC 7 nor EC 6 is represented, which are translocases and ligases, respectively.

A. BLAST sequence similarity to the training set for low identity CataloDB test set sequences. B. Structural similarity for the final test set, filtered on both structure and sequence. C. F1, recall, and precision for Squidly, SCREEN and BLAST on CataloDB.

Discussion

Rationally constraining pair generation in contrastive learning can improve model performance on biological data. Using publicly available EC number annotations and amino acid sequences, we categorised pairs of catalytic residues into distinct groups enriched with hard negatives to enable sequence-only prediction of catalytic residues. This approach led to substantial improvements compared to a standard LSTM classifier and typical pair-mining strategies. Squidly significantly outperformed other ML-based sequence methods and showed a marginal improvement over SOTA ML-based catalytic residue prediction models that use structural data, while running in a fraction of the time. Furthermore, Squidly performs comparably to BLAST when homologous sequences exist, yet also shows good performance in low-identity settings, where BLAST is unable to identify any similar sequences. Given the complementarity of BLAST and Squidly, we provide Squidly as an ensemble with sequence similarity, leveraging the interpretability of sequence similarity-based methods where homologous sequences exist and Squidly in out-of-distribution settings. The focus of this work was catalytic site prediction; however, combined binding and catalytic site prediction models, such as EasIFA 11, are of similar utility, and therefore future work should additionally consider binding sites.

A limitation of this study is the inherent bias in publicly available data towards well studied enzyme classes. This bias likely confounds the perceived performance on less studied mechanisms, and in these instances, for example EC classes 6 and 7, the results should be considered with caution. A further challenge is the lack of a consolidated benchmark for catalytic residue prediction, making direct and fair comparison between tools difficult. Notably, the multiclass classification objective and benchmarks used to evaluate EasIFA 11 made it infeasible to compare performance on the binary catalytic residue prediction task. Existing benchmark datasets do not capture the diversity of enzyme catalysis. Although we have provided a new benchmark that improves upon previous datasets by using structural identity filtering, the core issues remain. CataloDB and current benchmarks are small and include entire enzyme families. While low sequence and structural similarity samples are held out from training, they may not accurately represent the generalisability of these methods in out-of-distribution settings. Furthermore, the datasets often consist of sequences with known structures in the PDB, many of which were likely part of the AlphaFold training set or belong to structurally similar clusters represented in it. Tools also evaluate their performance using known, rather than predicted, structures 10. Consequently, SOTA tools that rely on AlphaFold-predicted structures tend to perform better in these benchmarks but are likely to underperform in practical applications involving truly unique and uncharacterised sequences. This issue also applies, albeit to a lesser extent, to the method proposed in this study, which utilises ESM2 representations. While ESM2’s training set is more diverse and comprehensive, it may not generalise as well to distant out-of-distribution sequences as to the well studied sequences used in the benchmarks.

Squidly and other data-driven approaches to catalytic residue prediction should be used with caution when applied to retrieve understudied catalytic mechanisms. Currently, there exists a limited set of experimentally validated annotations 13. Datasets such as Uni14230 and CataloDB used for training in this study include SwissProt sequences with reviewed computational annotations to increase the size of the training data. However, this does not improve the diversity of catalytic mechanisms and enzyme sequences, as these methods reflect the inherent biases in databases towards well represented EC classes. Squidly, having learnt from this biased data, may indeed replicate this bias during inference and is therefore less likely to provide insights into novel or understudied catalytic mechanisms. Despite the limited data for catalytic residue prediction, we show that an ensemble approach of machine learning and sequence-similarity-based methods achieves SOTA performance across multiple benchmark sets. By leveraging contrastive learning, we find that ML sequence-based predictions can generalise to low sequence/structural similarity settings. Finally, the methods reported in this paper that leverage rational data engineering for contrastive learning have broad implications for learning across biological data. Biological sequences are generated through evolutionary processes occurring over billions of years, resulting in hierarchically structured data with significant variance between evolutionarily distant sequences. Moreover, mechanisms in biochemistry are often similar for related chemical reactions. By leveraging ontologies such as EC numbers, pairs can be selected rationally to improve training on proteins that are not only evolutionarily related but also share broader biochemical functions. A wide range of alternative ontologies exists that can capture different structures within biological data, suggesting that this approach is likely to be applicable across other biological classification tasks, particularly in studies utilising contrastive frameworks for per-token in-sequence classification 46. Future studies should explore and compare additional ontologies to better understand the utility of such data structures.

Acknowledgements

A.M. is supported by the Schmidt Science Fellows, in partnership with the Rhodes Trust. W.R. is supported by the Australian Government Research Training Program (RTP) Scholarship. The authors thank Charlie Trimble for his generous contributions to the project.

Additional information

Code and data availability

All source code is available at Github (https://github.com/WRiegs/Squidly). All associated data including CataloDB is available at Zenodo (https://doi.org/10.5281/zenodo.15541320).

Author contributions statement

W.R. and A.M. conceived the experiments; W.R. designed and implemented the model; W.R. and A.M. ran the experiments, analysed the results, and wrote the manuscript. M.B. and F.H.A. reviewed the manuscript and provided input to the research.

A.1 Rationale for developing CataloDB

To enable comparisons to existing work, we utilised the AEGAN training and test datasets as described above. However, the limitations identified motivated us to create a new benchmark, CataloDB, for future evaluations. The pre-existing benchmarks were published prior to 2007 and were developed by selecting proteins with active site annotations representing unique protein families, superfamilies, or folds. Their construction lacked standardised metrics for ensuring appropriate similarity separation between training and test data. The continued use of these historical benchmarks likely persists to maintain backward compatibility with earlier methods. However, these six benchmarks predominantly represent well-studied folds and families, making it challenging to create properly separated training and test sets.

Our results demonstrate that BLAST already performs comparably or better when predicting catalytic residues in sequences with greater than 0.35 identity to the training data. Machine learning methods should therefore aim to generalise effectively to low-identity sequences. CataloDB addresses this challenge and these concerns by containing test sequences with strictly less than 0.3 sequence and structural identity to the training set. The benchmark is more extensive than previous collections while still allowing for a larger training set with a higher redundancy threshold (90%), enabling machine learning methods to better learn the underlying data distribution.
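The holdout criterion can be sketched as follows. This is an illustrative simplification, not the CataloDB pipeline itself: in practice, sequence and structural identities would come from dedicated tools (e.g. MMseqs2 or Foldseek), and here `identity` and `split_holdout` are hypothetical names standing in for that machinery.

```python
def split_holdout(candidates, train_ids, identity, threshold=0.3):
    """Return candidate IDs whose maximum identity to any training sequence
    is strictly below the threshold, making them valid held-out test samples."""
    held_out = []
    for cid in candidates:
        # Highest identity between this candidate and the whole training set
        max_id = max(identity(cid, tid) for tid in train_ids)
        if max_id < threshold:
            held_out.append(cid)
    return held_out
```

With a toy identity lookup, a candidate at 0.5 identity to a training sequence would be rejected, while one whose best match is 0.2 would be retained.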

Evaluation of SwissProt data for use in CataloDB

To overcome the limited availability of experimentally validated catalytic residue annotations, recent machine learning tools have included annotations from sequences reviewed by SwissProt’s expert review panel in their training and test data 9,10. However, the validity of these annotations is not a given. SwissProt provides limited information on how its review process for catalytic mechanisms is conducted 12. These annotations may have been generated manually or by computational predictions and likely contain the same bias present in the original experimentally validated set. Here we present an initial analysis, performed using an earlier version of Squidly, to determine whether reviewed sequences improved performance and, thus, the quality of the training data.

Three datasets were curated to evaluate the impact of reviewed sequences on training catalytic residue prediction models. Dataset 1 contained 2,209 sequences with only experimentally validated catalytic residue annotations. Dataset 2 included 6,437 sequences, comprising experimentally validated annotations and sequences with catalytic residues supported by known structures. Dataset 3 included 99,926 sequences, comprising all catalytic residue annotations in the SwissProt database. Redundancy reduction removed sequences with over 90% sequence identity. After filtering, Dataset 1 was reduced to 2,030 sequences, Dataset 2 to 5,921 sequences, and Dataset 3, which had the highest proportion of redundant sequences, to 57,698 sequences. See Table 2 for details.
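The redundancy reduction step can be sketched as a greedy clustering pass. This is a simplification of what dedicated tools such as CD-HIT or MMseqs2 do far more efficiently; `pairwise_identity` and `reduce_redundancy` are hypothetical names for illustration only.

```python
def reduce_redundancy(seq_ids, pairwise_identity, max_identity=0.9):
    """Greedily keep a sequence only if it is below max_identity to every
    representative kept so far, removing >90%-identical redundancy."""
    representatives = []
    for sid in seq_ids:
        if all(pairwise_identity(sid, rep) < max_identity for rep in representatives):
            representatives.append(sid)
    return representatives
```

Note that the result of a greedy pass depends on input order; production clustering tools typically sort sequences (e.g. by length) first to make representatives deterministic.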

Contrary to the expectation that increasing data size would improve model performance, we found that expanding the training data with reviewed sequences from SwissProt does not reduce the bias or improve the diversity of the available experimental data. Figures S2, S4 and S6 show the EC distributions (EC numbers up to the second tier) across the datasets. All three datasets are biased toward hydrolases (EC 3), followed by classes 2, 1, 4, 5, 6, and 7. The relative abundance of EC class 2 increases from Dataset 1 to Dataset 2 to Dataset 3, while other classes remain relatively constant. This trend suggests that the additional sequences in the reviewed datasets likely originate from similarity-based prediction methods that incorporate experimental data, thus maintaining the relative EC representation despite the increasing dataset size.
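Collapsing annotations to the second EC tier, as done for these distributions, is a one-line transformation. A minimal sketch (the function name `ec_second_tier` is ours, not from the study's code):

```python
from collections import Counter

def ec_second_tier(ec_numbers):
    """Collapse full EC numbers (e.g. '3.4.21.1') to their first two
    levels ('3.4') and count how many sequences fall in each class."""
    return Counter(".".join(ec.split(".")[:2]) for ec in ec_numbers)
```

Applied to a dataset's EC annotations, the resulting counts directly give the second-tier distributions compared across Datasets 1–3.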

Figures S1, S3 and S5 compare the distributions of amino acids acting as catalytic residues across the datasets. The distributions are dominated by histidine, aspartic acid, glutamic acid, cysteine, and serine, in line with the dominant classes being hydrolases. Dataset 2 and Dataset 3 show a relative increase in aspartic acid abundance compared to Dataset 1. Despite this, and the much larger size of Datasets 2 and 3, their amino acid distributions closely resemble Dataset 1, indicating that the reviewed datasets contribute little diversity in rare catalytic residues, which may correspond to less studied mechanisms. Overall, these findings suggest that training on the additional data will not greatly improve model performance when generalising to sequences with low identity to the experimental ground truth set. Based on our analyses, we use Dataset 2 for CataloDB.

A.2 Reaction-informed pair schemes

By performing contrastive learning with pairs, we move from having too few samples to having too many. The large pool of amino acid pairs in our dataset varies in how informative each pair is for the model. Therefore, we must find effective ways to downsample and select the most informative pairs for the model to learn from. In particular, we want to maximise the proportion of hard negative pairs in the training data.

The Uni14230 training dataset contains 8,735 non-redundant sequences from Dataset 1 (experimentally validated) with an average sequence length of 407 residues. Given the total pool of 3,555,145 residues, a total of C(3,555,145, 2) = 3,555,145 × 3,555,144 / 2 ≈ 6.32 × 10^12 unique pairwise comparisons can be made. We developed pair schemes 2 and 3 to capture the most informative pairs from this available data. Maximising hard negatives improves the model’s ability to discriminate catalytic and non-catalytic residues in a more generalisable way. If the two residue embeddings in a pair are inherently very distinct, the model is unlikely to learn much from them. This is not just because our loss function will penalise the model less during training, but because the relationships extracted from the feature space to separate these two inputs might not be related to our primary objective.
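The scale of the pair pool can be verified directly from the binomial coefficient:

```python
import math

n_residues = 3_555_145  # total residue pool reported for the training set

# Number of unordered residue pairs, C(n, 2) = n(n - 1) / 2
n_pairs = math.comb(n_residues, 2)

print(f"{n_pairs:.2e}")  # prints 6.32e+12
```

At roughly 6.3 trillion candidate pairs, exhaustive training is infeasible, which motivates the informed downsampling schemes described above.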

Reaction-informed pair mining assumes that key biological contexts create harder negative pairs for the model to differentiate. First, we make the basic assumption that the model will struggle to correctly discriminate between pairs of similar amino acids. A contrastive model is likely to find pairs drawn from a single amino acid type, such as histidine, harder to contrast, because the unique structure and properties of each amino acid determine the catalytic or non-catalytic roles that these amino acids play within enzymes. Therefore, we decided to implicitly include the amino acid character of each catalytic site in the pair sampling scheme.

Another assumption we made was inspired by the use of contrastive learning in another catalytic site prediction tool, developed by Tong Pan et al. 9. Their contrastive model creates EC-specific representations of sequences for further use in a convolutional graph neural network classification model. Their contrastive model, however, did not use these contexts to better differentiate between catalytic and non-catalytic sites, but as a support to ensure information about the sequence EC is propagated through their multimodal system. EC numbers separate enzymes based on the reactions they perform. Moreover, lower-level EC numbers group enzymes by the types of substrate bonds they cleave, or the mechanisms by which they catalyse reactions. Inspired by this idea, we leveraged EC numbers to sample hard negative pairs. We specifically chose to represent sequences with only the first two levels of their EC numbers so that there is sufficient variety in the data representing each label.
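The combined idea, matching pairs on both amino acid type and second-tier EC number, can be sketched as a bucketed sampler. This is an illustrative sketch under our own naming (`sample_hard_negatives`, the residue-dict fields), not the paper's implementation.

```python
import random
from collections import defaultdict

def sample_hard_negatives(residues, n_pairs, seed=0):
    """Sample catalytic / non-catalytic residue pairs that share both the
    amino acid type and the two-level EC number of their parent sequence.
    residues: list of dicts with keys 'aa', 'ec2' (e.g. '3.4'), 'catalytic'."""
    # Bucket residue indices by (amino acid, two-level EC number)
    buckets = defaultdict(lambda: {"pos": [], "neg": []})
    for i, r in enumerate(residues):
        key = (r["aa"], r["ec2"])
        buckets[key]["pos" if r["catalytic"] else "neg"].append(i)

    # Only buckets containing both catalytic and non-catalytic residues
    # can yield hard negative pairs.
    eligible = [b for b in buckets.values() if b["pos"] and b["neg"]]

    rng = random.Random(seed)
    pairs = []
    while eligible and len(pairs) < n_pairs:
        b = rng.choice(eligible)
        pairs.append((rng.choice(b["pos"]), rng.choice(b["neg"])))
    return pairs
```

Every sampled pair contrasts a catalytic with a non-catalytic residue of the same amino acid in enzymes catalysing related reactions, which is precisely the "hard" regime the scheme targets.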

Brief description of methods compared to Squidly in common benchmarks.

A.3 Evaluation of Existing Benchmarks

The six standard benchmarks traditionally used for catalytic residue prediction exhibit methodological inconsistencies in the literature. In our comparative analysis, we discovered that the state-of-the-art methods benchmarked in this study, specifically SCREEN and AEGAN, utilised different, independent subsets of these benchmarks, complicating direct performance comparisons. According to the SCREEN authors, data preparation involved clustering the EF family and M-CSA training data to less than 0.4 sequence identity, selecting unique cluster representatives for training and validation. The authors reported that they “excluded enzymes from these five test datasets from our training/validation dataset.”

Benchmark datasets in prior work.

Our analysis identified that up to 85.4% of the test sequences had greater than 0.9 identity to the training set in the already-filtered subsets provided by the authors. Notably, this high sequence similarity appears reduced in the EF superfamily and EF fold datasets, likely a result of the authors’ use of the closely related EF family dataset during training set curation. The test results published by SCREEN align with our observations, showing a marked increase in F1 score on the HA, NN, and PC datasets compared with the EF datasets, corresponding to their differing levels of data leakage.
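An audit of this kind reduces to computing, for each test sequence, its identity to the nearest training sequence. A minimal sketch, assuming a hypothetical `identity` function in place of a real alignment tool:

```python
def leakage_fraction(test_ids, train_ids, identity, threshold=0.9):
    """Fraction of test sequences whose nearest training sequence
    exceeds the given identity threshold (i.e. likely leaked)."""
    leaked = sum(
        1 for t in test_ids
        if max(identity(t, tr) for tr in train_ids) > threshold
    )
    return leaked / len(test_ids)
```

Run over the published train/test splits, this statistic directly yields figures such as the 85.4% reported above.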

Additionally, the curation process employed in SCREEN for the training data appears unnecessarily stringent, with the average identity to the closest training sequence measured at only 0.21. Such sparse training data likely hampers machine learning models attempting to fit the underlying data distribution effectively. We hypothesise that this factor may explain why AEGAN reportedly performed poorly when retrained on this data in the SCREEN authors’ experiments.

Amino acid distribution of Catalytic Residues in Dataset 1.

Dataset 1 contains 3,873 catalytic residues, with 15 possible amino acids. The dataset is skewed towards certain amino acids that are commonly involved in catalysis.

Dataset 1 EC Distribution.

Dataset 1 is made up of 2,210 sequences from SwissProt. The sequences have experimental validation for their catalytic residue annotations. Validation sets are derived exclusively from this distribution. A clear bias exists towards hydrolases (EC 3.X).

Amino acid distribution of Catalytic Residues in Dataset 2.

Catalytic residue distribution. Dataset 2 contains 10,564 catalytic residues, with 18 possible amino acids. Amino acids M, V and F are additions not seen in Dataset 1, although they occur in very small numbers.

Dataset 3 EC Distribution.

Dataset 3 is made up of 48,625 sequences, taken from all available proteins with catalytic site annotations in SwissProt. The distribution closely resembles that of Dataset 2, again with a notable increase in EC 2.

Amino acid distribution of Catalytic Residues in Dataset 3.

Dataset 3 contains 78,966 catalytic residues, with 19 possible catalytic amino acids. Isoleucine is an additional amino acid not seen in Dataset 1 or 2. The distribution is otherwise very similar to those of Datasets 1 and 2.
