Representation of the pair-mining schemes depicting the negative and positive classes for each scheme.

Scheme 1 is the most permissive and is unaware of both amino acid identity and EC class. Schemes 2 and 3 are more restrictive, requiring both the amino acid and the EC class to be either shared or different within a mined pair. Scheme 3 also includes additional negative pairs to separate EC numbers in the latent space.
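The three schemes can be illustrated with a minimal sketch. The exact pairing rules are not spelled out in this caption, so the rules below are one possible reading and should be treated as assumptions: Scheme 1 pairs residues on the catalytic label alone; Scheme 2 only forms pairs when the amino acid and EC class match; Scheme 3 adds catalytic-catalytic negatives that share an amino acid but differ in EC class. The `Residue` record and the functions `label_pair` and `mine_pairs` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Residue:
    seq_id: str      # identifier of the parent sequence
    pos: int         # residue position in that sequence
    aa: str          # one-letter amino acid code
    ec: str          # EC class of the parent enzyme (granularity is an assumption)
    catalytic: bool  # True if annotated as a catalytic residue


def label_pair(a, b, scheme):
    """Return 'pos', 'neg', or None if the pair is not used (assumed rules)."""
    same_aa = a.aa == b.aa
    same_ec = a.ec == b.ec
    if scheme == 1:
        # Scheme 1: only the catalytic label matters.
        if a.catalytic and b.catalytic:
            return "pos"
        if a.catalytic != b.catalytic:
            return "neg"
        return None
    if a.catalytic and b.catalytic:
        if same_aa and same_ec:
            return "pos"
        # Scheme 3 only: extra negatives that push EC classes apart.
        if scheme == 3 and same_aa and not same_ec:
            return "neg"
        return None
    # Catalytic vs non-catalytic: keep only matched amino acid and EC class.
    if a.catalytic != b.catalytic and same_aa and same_ec:
        return "neg"
    return None


def mine_pairs(residues, scheme):
    """Enumerate all residue pairs and sort them into positives and negatives."""
    pairs = {"pos": [], "neg": []}
    for a, b in combinations(residues, 2):
        label = label_pair(a, b, scheme)
        if label is not None:
            pairs[label].append((a, b))
    return pairs
```

Under these assumed rules, Scheme 1 yields the most pairs from the same residues, and Scheme 3 yields the same positives as Scheme 2 plus additional negatives.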

Overview of the model. A sequence is translated into per-token embeddings via ESM2; each per-token embedding is then mapped into a smaller representation space by the MLP trained with contrastive loss (CL).

Each residue representation from the CL model is then classified as catalytic or non-catalytic by a bidirectional LSTM network.

Principal component analysis of per-residue embeddings from EC numbers 3.1 and 2.7.

In the upper row, red indicates a catalytic residue and blue a non-catalytic residue. In the lower row, residues are coloured by amino acid class, which represents their structural and chemical properties. The x and y axes of each subplot show the first and second principal components. Catalytic and non-catalytic residues are clearly separated along the principal components that explain the most variance in our data.

Comparison of the Squidly contrastive learning framework pair schemes on the Uni3175 dataset.

We directly compare the schemes with AEGAN, the tool by Shen et al., the authors of this dataset [10]. Schemes 2 and 3 (S2 and S3), using embeddings from the largest ESM2 model (15 billion parameters), slightly improved upon AEGAN’s performance. The smaller ESM2 model (3 billion parameters) still performs competitively at a heavily reduced computational cost. Compared with the rationally informed pair schemes, the Scheme 1 and raw-embedding LSTM models performed poorly.

Sequence similarity for each family, measured as sequence identity to the closest sequence in the training set.

F1 scores of tools on 6 common benchmark datasets.

F1, recall, and precision for the ensemble of Squidly and BLAST with different sequence identity cut-offs.

The cut-offs indicate the point at which predictions transition from Squidly to BLAST. At 100%, only Squidly is used for predictions; at 0%, BLAST is used unless no homologous sequence is found. For sequences with less than 50% sequence identity (dashed black line), Squidly predictions are used. The dashed horizontal line represents the total score for BLAST on the specific dataset, while the dots represent the combined BLAST and Squidly ensemble approach. Squidly achieves higher recall, while BLAST is more precise. Performance is best when the two are used as an ensemble.
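The cut-off rule described above can be sketched as a small selection function. This is an illustrative sketch, not the authors' implementation: the function name `ensemble_predict`, its arguments, and the handling of a hit whose identity exactly equals the cut-off (here it falls back to Squidly, so that a 100% cut-off uses Squidly only) are assumptions.

```python
from typing import Optional, Set


def ensemble_predict(
    squidly_sites: Set[int],
    blast_sites: Optional[Set[int]],
    blast_identity: Optional[float],
    cutoff: float,
) -> Set[int]:
    """Choose between Squidly and BLAST predictions for one sequence.

    blast_sites and blast_identity are None when BLAST finds no homologous
    sequence. Identity and cutoff are percentages in [0, 100].
    """
    has_hit = blast_sites is not None and blast_identity is not None
    # Assumption: BLAST is used only when identity is strictly above the cut-off.
    if has_hit and blast_identity > cutoff:
        return set(blast_sites)
    # Otherwise (low identity or no homologue) fall back to Squidly.
    return set(squidly_sites)
```

With a 50% cut-off, a query whose closest hit has 80% identity takes the BLAST annotation, while a query at 30% identity, or one with no hit at all, takes the Squidly prediction.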

Resource usage measurements for Squidly models when predicting 1000 sequences with an average length of 401 residues.

Distribution of the catalytic residues in the proposed benchmark dataset.

A. The identities of the catalytic residues are unevenly distributed, reflecting the uneven distribution of catalytic mechanisms within the benchmark dataset. B. Almost 50% of the data points are in the well-characterised class EC 3 (hydrolases). Neither EC 7 (translocases) nor EC 6 (ligases) is represented.

A. BLAST sequence similarity to the training set for low identity CataloDB test set sequences. B. Structural similarity for the final test set, filtered on both structure and sequence. C. F1, recall, and precision for Squidly, SCREEN and BLAST on CataloDB.
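For reference, precision, recall, and F1 over predicted catalytic residues can be computed as in the sketch below. How the scores are aggregated is not stated in this caption; pooling residue counts across all sequences (micro-averaging) is an assumption, and the function name `residue_prf1` is hypothetical.

```python
def residue_prf1(true_sites, pred_sites):
    """Pooled precision, recall, and F1 over catalytic residue positions.

    Both arguments map a sequence identifier to a set of residue positions.
    Counts are pooled over all sequences before the scores are computed
    (micro-averaging, an assumption).
    """
    tp = fp = fn = 0
    for seq_id in true_sites.keys() | pred_sites.keys():
        truth = true_sites.get(seq_id, set())
        pred = pred_sites.get(seq_id, set())
        tp += len(truth & pred)  # correctly predicted catalytic residues
        fp += len(pred - truth)  # predicted but not annotated
        fn += len(truth - pred)  # annotated but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```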

Brief description of methods compared to Squidly in common benchmarks.

Benchmark datasets in prior work.

Amino acid distribution of Catalytic Residues in Dataset 1.

Dataset 1 contains 3,873 catalytic residues spanning 15 distinct amino acids. The dataset is skewed towards certain amino acids that are commonly involved in catalysis.

Dataset 1 EC Distribution.

Dataset 1 is made up of 2,210 sequences from Swissprot. The sequences have experimental validation for the catalytic residue annotations. Validation sets are derived exclusively from this distribution. A clear bias exists for hydrolases (EC 3.X).

Amino acid distribution of Catalytic Residues in Dataset 2.

Dataset 2 contains 10,564 catalytic residues, with 18 possible amino acids. Amino acids M, V and F are additions not seen in Dataset 1, although they occur only rarely.

Dataset 3 EC Distribution.

Dataset 3 is made up of 48,625 sequences. The sequences are taken from all the available proteins with catalytic site annotations in Swissprot. A very similar distribution is seen between datasets 3 and 2, with a notable increase in EC 2 again.

Amino acid distribution of Catalytic Residues in Dataset 3.

Dataset 3 contains 78,966 catalytic residues, with 19 possible catalytic amino acids. Isoleucine is an additional amino acid not seen in Dataset 1 or 2. The distribution seen here is very similar to those of Datasets 1 and 2.
