Figures and data

Overview of the model.
A. A sequence is translated into per-token embeddings via ESM2; each per-token embedding is then projected into a smaller representation space by an MLP trained with a contrastive loss (CL). Each residue representation from the CL model is then predicted as catalytic or non-catalytic by a bidirectional LSTM network. The CL and LSTM models are ensembled (5 models) to improve performance and capture variance between the models. B. Representation of the pair-mining schemes, depicting the negative and positive classes for each scheme. Scheme 1 is the most permissive and is amino-acid- and EC-unaware. Schemes 2 and 3 are more restrictive, requiring both the amino acids and the EC classes to be shared or different within a pair. Scheme 3 includes additional negative pairs to separate EC numbers in the latent space.
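The pair-labelling rules in B can be sketched as follows. This is a hypothetical illustration of the scheme definitions only, not the authors' implementation; the `Residue` fields and function names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Residue:
    aa: str          # amino acid one-letter code
    ec: str          # EC class of the parent enzyme
    catalytic: bool  # catalytic-residue annotation


def scheme1(a: Residue, b: Residue):
    # Scheme 1: most permissive, amino-acid- and EC-unaware; the pair
    # label depends only on whether the catalytic annotations match.
    return "positive" if a.catalytic == b.catalytic else "negative"


def scheme2(a: Residue, b: Residue):
    # Scheme 2: reaction-informed; positive pairs must share the amino
    # acid, the EC class, and the catalytic annotation.
    if a.catalytic and b.catalytic and a.aa == b.aa and a.ec == b.ec:
        return "positive"
    if a.catalytic != b.catalytic:
        return "negative"
    return None  # pair not mined


def scheme3(a: Residue, b: Residue):
    # Scheme 3: as Scheme 2, plus additional negative pairs between
    # catalytic residues of different EC classes, pushing EC numbers
    # apart in the latent space.
    label = scheme2(a, b)
    if label is None and a.catalytic and b.catalytic and a.ec != b.ec:
        return "negative"
    return label
```

Under these rules, two catalytic histidines from different EC classes form no pair under Scheme 2 but a negative pair under Scheme 3, which is what drives the EC separation described above.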

Performance of Squidly pair-mining schemes on the Uni3175 dataset.
A. We compare the Squidly schemes, without the ensemble, against AEGAN, the tool by Shen et al., the authors of this dataset 10. The reaction-informed schemes (Schemes 2 and 3) achieved the highest F1 scores, slightly surpassing AEGAN when using the 15B ESM2 model. The 3B ESM2 model remains competitive at a much lower computational cost. Compared with the reaction-informed pair schemes, Scheme 1 (random pairing, S1) and the raw-embedding LSTM models performed poorly. B. Sequence identity of each family to the closest sequence in the training set. C. F1, recall, and precision for the Squidly ensemble (3B, Scheme 3) combined with BLAST at different sequence identity cut-offs. The cut-off indicates the point at which to transition from a Squidly to a BLAST prediction: at 100%, only Squidly is used for predictions; at 0%, BLAST is used unless there are no similar sequences. For sequences with less than 30% sequence identity (dashed black line), Squidly predictions are used. The dashed horizontal line represents the total score for BLAST on the specific dataset, while the dots represent the combined BLAST and Squidly ensemble approach. Squidly achieves higher recall, while BLAST is more precise; performance is best when the two are combined.
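The transition rule in panel C reduces to a simple per-sequence decision, sketched below. Argument names are illustrative assumptions; the dashed 30% line in panel C corresponds to `cutoff=0.30`.

```python
def hybrid_prediction(identity, squidly_pred, blast_pred, cutoff=0.30):
    """Panel C's transition rule, sketched: transfer the BLAST annotation
    when the best training-set hit meets the identity cutoff, otherwise
    fall back to the Squidly model prediction.

    identity     -- sequence identity (0-1) to the closest training sequence
    blast_pred   -- BLAST-transferred annotation, or None if no hit exists
    """
    if blast_pred is not None and identity >= cutoff:
        return blast_pred
    return squidly_pred
```

This captures why the combination outperforms either tool alone: precise BLAST transfer dominates in the high-identity regime, while the higher-recall model covers remote sequences and sequences with no BLAST hit at all.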

F1 scores of tools on 6 common benchmark datasets.

Resource usage measurements for Squidly models compared with the structure prediction tool Chai-1 when predicting 100 sequences with an average length of 443 residues.

A. BLAST sequence similarity to the training set for low-identity CataloDB test set sequences. B. Structural similarity for the final test set, filtered on both structure and sequence. C. F1, recall, and precision for Squidly (Scheme 3), SCREEN, and BLAST on CataloDB. D. The number of CataloDB catalytic residues correctly predicted (true positives) by Squidly (Scheme 3) for the 6 available EC classes, relative to the decreasing total number of true catalytic residues. E. Precision, F1, and recall of Squidly predictions for the 6 available EC classes. F. The distribution of catalytic residue identities is uneven and reflects the uneven distribution of catalytic mechanisms within the benchmark dataset. G. Almost 50% of the data points belong to the well-characterised class EC 3, the hydrolases. Neither EC 7 nor EC 6 is represented; these are the translocases and ligases, respectively. H. Performance of the Squidly 15B ensemble model on the CataloDB benchmark under varying prediction and variance thresholds. Note that the default 0.5 prediction cutoff was used to evaluate Squidly's performance in C.

Principal component analysis of per-residue embeddings from EC numbers 3.1 and 2.7.
In the upper row, red indicates a catalytic residue and blue a non-catalytic residue. In the lower row, residues are coloured by amino acid class, which reflects their structural and chemical properties. The x and y axes represent the first and second principal components in each subplot. A clear separation of catalytic and non-catalytic residues is seen along the principal components that explain the most variance in our data.
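The projection behind this figure can be reproduced in a few lines of PCA via SVD. The sketch below uses random stand-in data; the real analysis would substitute the per-residue embeddings from the contrastive model.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for per-residue embeddings (n residues x d dimensions);
# the figure itself uses embeddings from the contrastive (CL) model.
X = rng.normal(size=(500, 64))

# PCA via SVD on mean-centred data.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

pcs = Xc @ Vt[:2].T                # project onto the first two components
explained = S**2 / (S**2).sum()    # variance explained per component
```

`pcs[:, 0]` and `pcs[:, 1]` give the x and y coordinates of each residue in the subplots, and `explained[:2]` is the fraction of variance carried by the two plotted components.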

A. Recall, precision, and false positive rate (FPR) for AEGAN, EasIFA, and Squidly on the filtered EasIFA test subset. EasIFA shows balanced performance, Squidly is competitive given its sequence-only inputs, and AEGAN displays unusually low precision that may reflect a benchmarking artefact. B. Most of the test set comprises sequences from EC classes 2 and 3, with no representation of classes 4, 6, and 7. C. The differences in false positives and true positives predicted by Squidly's best model and EasIFA. Although the two tools agree on all of the true positives predicted by both, Squidly has a higher tendency towards false positives, many of which are true binding sites.

The default ensemble threshold for prediction and variance cutoffs of 0.6 and 0.225 (outlined in red) were selected to balance the precision and recall of the Squidly ensemble model on the Uni3175 benchmark under varying mean prediction and variance thresholds.
Shown are F1 scores, precision, and recall for the 3B and 15B models. A. F1 of the 3B model, peaking at 0.70. B. F1 of the 15B model, peaking at 0.69. C. Precision of the 3B model, which increases as the cutoff increases. D. Precision of the 15B model. E. Recall of the 3B model. F. Recall of the 15B model.
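The decision implied by these two cutoffs can be sketched as a simple rule over the five per-model scores. This is a minimal sketch; the function name and exact aggregation are assumptions, not the published implementation.

```python
from statistics import mean, pvariance


def ensemble_call(scores, mean_cutoff=0.6, var_cutoff=0.225):
    """Call a residue catalytic when the ensemble mean score clears
    `mean_cutoff` AND the models agree, i.e. the (population) variance
    of their scores stays below `var_cutoff`.

    scores -- per-model catalytic probabilities, e.g. 5 values.
    """
    return mean(scores) >= mean_cutoff and pvariance(scores) <= var_cutoff
```

The variance cutoff rejects residues the models disagree on even when the mean is high, which is what trades recall for precision in the sweeps shown here.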

Performance of the Squidly ensemble model on the CataloDB benchmark under varying mean prediction and variance thresholds.
Shown are F1 scores, precision, and recall for the 3B and 15B models. A. F1 of the 3B model, peaking at 0.70. B. F1 of the 15B model, peaking at 0.69. C. Precision of the 3B model, which increases as the cutoff increases. D. Precision of the 15B model. E. Recall of the 3B model. F. Recall of the 15B model. The default ensemble threshold for prediction and variance cutoffs of 0.6 and 0.225 (outlined in red) were selected using the Uni3175 benchmark.

The maximum F1 score achievable with the Squidly ensemble model for each EC number in the CataloDB benchmark under varying mean prediction and variance thresholds.
The default thresholds of 0.6 and 0.225, determined using the Uni3175 benchmark, are relatively stable across EC numbers. However, the data for EC 5 and EC 6 are limited (N = 8 and N = 3, respectively), so users are advised to further test the decision thresholds for their intended application.

Amino acid distribution of Catalytic Residues in Dataset 1.
Dataset 1 contains 3,873 catalytic residues spanning 15 possible amino acids. The dataset is skewed towards certain amino acids that are commonly involved in catalysis.

Dataset 1 EC Distribution.
Dataset 1 is made up of 2,210 sequences from Swiss-Prot. The sequences have experimental validation for their catalytic residue annotations. Validation sets are derived exclusively from this distribution. A clear bias exists towards hydrolases (EC 3.X).

Amino acid distribution of Catalytic Residues in Dataset 2.
Dataset 2 contains 10,564 catalytic residues spanning 18 possible amino acids. Amino acids M, V, and F are additions not seen in Dataset 1, although they occur in very small numbers.

Dataset 3 EC Distribution.
Dataset 3 is made up of 48,625 sequences, taken from all available proteins with catalytic site annotations in Swiss-Prot. A very similar distribution is seen between Datasets 3 and 2, with a notable increase in EC 2 again.

Amino acid distribution of Catalytic Residues in Dataset 3.
Dataset 3 contains 78,966 catalytic residues spanning 19 possible catalytic amino acids. Isoleucine is an additional amino acid not seen in Datasets 1 or 2. The distribution here is very similar to those of Datasets 1 and 2.


Brief description of methods compared to Squidly in common benchmarks.
