Sensitive remote homology search by local alignment of small positional embeddings from protein language models

Sean R. Johnson; Meghana Peshwa; Zhiyi Sun

doi:10.7554/eLife.91415.1

Introduction

A common method for assigning a putative function to a protein sequence is to find sequences with experimentally determined functions that have similarities in sequence, structure, or evolutionary origin to the unannotated sequence (Loewenstein et al., 2009). Direct comparisons of primary sequence, for example using BLASTP (Camacho et al., 2009), are fast and reliable but show poor ability to detect homologs with less than about 20% identity to the query (Rost, 1999). Popular approaches for higher sensitivity sequence searches involve using sequence profiles, for example with PSI-BLAST (Altschul et al., 1997), HMMER3 (Eddy, 2011), HH-suite3 (Steinegger et al., 2019), or MMseqs2 (Steinegger and Söding, 2017). Sequence profiles are derived from multiple sequence alignments (MSAs) and are often modeled as profile hidden Markov models (HMMs), for example in HMMER and HH-suite. Profile HMMs model each position as the probability of each amino acid at the position together with insertion and deletion probabilities. Because of their reliance on the construction of MSAs, profile-based methods can have high computational overhead for database construction, query preparation, or both.

Protein structure searches also show higher sensitivity than sequence searches (Jambrich et al., 2023). Until recently, the utility of structure searches for protein annotation was limited by the lack of extensive reference databases and the inability to predict structures quickly and reliably for sequences lacking experimentally determined structures. In the past several years accurate protein structure prediction programs such as AlphaFold2 (Jumper et al., 2021) and ESMFold (Lin et al., 2023) have led to a massive increase in the size of databases of predicted protein structures. Coupled with fast structure search algorithms such as Foldseek (van Kempen et al., 2023), RUPEE (Ayoub and Lee, 2019), and Dali (Holm, 2022), structure prediction programs provide another powerful tool for remote homology detection. Foldseek achieves fast structure search by encoding the tertiary interactions of each amino acid in the 20-letter 3D interaction (3Di) alphabet. By using a structure alphabet of the same size as the amino acid alphabet, Foldseek can leverage optimized sequence search algorithms originally developed for amino acid sequences (Steinegger and Söding, 2017; van Kempen et al., 2023). Structure search methods suffer from some of the same drawbacks as profile-based methods, including the computational cost of converting primary sequences to predicted structures.

Emerging methods for protein annotation and remote homology detection rely on deep neural networks taking protein sequences as inputs and producing either a classification from a controlled vocabulary (Bileschi et al., 2022; Sanderson et al., 2023), a natural language description (Gane et al., 2022), positional embeddings, or a sequence embedding. Positional embeddings are fixed length vectors for each amino acid position of the protein. Positional embeddings produced by popular protein language models (pLMs) usually have large dimensions such as 1024 for ProtT5-XL-U50 (Elnaggar et al., 2021) and 2560 for ESM-2 3B (Lin et al., 2023). Sequence embeddings represent an entire sequence and are often calculated by element-wise averaging of the positional embeddings. Positional and sequence embeddings can be used for remote homology detection by using them to calculate substitution matrices in pairwise local alignments (Kaminski et al., 2022; Pantolini et al., 2022; Ye and Iovino, 2023), or by k-nearest neighbors searches (Hamamsy et al., 2022; Schütze et al., 2022), respectively.

While each of these emerging methods shows promise for improving sensitivity of protein search and annotation, they suffer various limitations. Classification models and methods relying on sequence embeddings struggle at discriminating individual domains of multi-domain proteins. Methods relying on large positional embeddings are space inefficient, and current search implementations are slow compared to other methods. Smaller positional embeddings would be more amenable to algorithmic optimizations using single instruction multiple data (SIMD) capabilities of central processing units (CPUs) that contribute to the speed of optimized sequence search algorithms (Buchfink et al., 2021; Eddy, 2011; Steinegger et al., 2019).

We recognized that profile HMMs and 3Di sequences are types of positional embeddings with dimensionality as low as 1 (3Di sequences) to about 25 (profile HMMs, including amino acid frequencies and state transition probabilities). We test the hypothesis that the ESM-2 3B pLM (Lin et al., 2023) could be used to directly convert primary amino acid sequences into profile HMMs compatible with HMMER3 or HH-suite, and 3Di sequences compatible with Foldseek, providing a sequence search workflow leveraging the speed and sensitivity advantages of profile and structure search algorithms with an accelerated query preparation step enabled by the pLM.

Results

ESM-2 was already pretrained on the masked language modeling task (Devlin et al., 2019; Lin et al., 2023) of predicting amino acid distributions at masked positions of input sequences, therefore no additional fine-tuning was necessary to induce it to produce probabilities compatible with profile HMM tools (Figure 1A). Positional amino acid frequencies predicted by ESM-2 3B resembled those found in MSAs built from sequence searches (Figure 2).

Schematics of embedding models and the experimental design.
(A) ESM-2 3B can be directly used to predict amino acid probability distributions at masked positions. Our implementation uses seven passes. The second pass is shown in the figure. (B) ESM-2 3B 3Di, a fine-tuned ESM-2 3B with a small CNN top model can be used to predict 3Di sequences from amino acid sequences. (C) Data flow from amino acid sequences through embedding models and other programs to produce files used in homology searches. Bold words correspond to line labels in Figure 3.

Logos related to the example test sequence YBGC_HELPY__14-90 from the 4HBT family.
(A) 4HBT family hmm from Pfam 32. (B) hmmbuild with default settings on an MSA of the top 100 hits supplied by an online blast search (blast.ncbi.nlm.nih.gov) of YBGC_HELPY__14-90 against the NCBI clustered nr database. (C) hmmbuild with default settings on the MSA sampled from the ESM-2 3B positional probabilities for YBGC_HELPY__14-90. (D) and (E) hmmbuild with Dirichlet priors disabled on the same MSAs as for (B) and (C), respectively. All logos were generated by uploading the corresponding .hmm file to skylign.org (Wheeler et al., 2014).

To produce Foldseek-compatible 3Di sequences, we trained a two-layer 1D convolutional neural network (CNN) to convert positional embeddings from the last transformer layer of ESM-2 3B into 3Di sequences, we then unfroze the last transformer layer and fine-tuned it together with the CNN. The fine-tuned model, which we call ESM-2 3B 3Di (Figure 1B), converted amino acids to 3Di with an accuracy of 64% compared to a test set of 3Di sequences derived from AlphaFold2 predicted structures.

To evaluate the capacity of small embeddings to improve search sensitivity, we generated predicted profiles and 3Di sequences from clustered Pfam 32 splits (Bileschi et al., 2022) and converted them into formats compatible with various search tools (Figure 1C). Pfam (Mistry et al., 2021) is a set of manually curated multiple sequence alignments of families of homologous protein domains. Some families presumed to have a common evolutionary origin are further grouped into clans. In the clustered splits, each family is divided into train and test groups such that each sequence in the test group has less than 25% identity to the most similar protein in the train group (Bileschi et al., 2022). The sensitivity of a search algorithm is evaluated by the ability to match sequences from the test groups to their corresponding train group at the family or clan level.

Methods using predicted profiles and those using predicted 3Di sequences both showed greater accuracy than phmmer (amino acid to amino acid) searches across all identity bins, and hmmscan (amino acid to profile HMM) searches on test sequences below 20% identity to the closest train sequence (Figure 3, compare lines 6 and 7 with lines 1 and 2). Foldseek searches where both the queries and database consisted of 3Di sequences produced by ESM-2 3B 3Di (line 7) performed the best overall. Foldseek considers both 3Di and amino acid sequences in its alignments and can therefore be conceptualized as using a 2-byte embedding. Running Foldseek in 3Di-only mode (line 8), a 1-byte embedding, led to a decrease in accuracy but still outperformed phmmer across all identity bins, and hmmscan on bins below 20% identity. We also tried creating HMMER3 profiles from predicted 3Di sequences (line 9). These performed worse than single-sequence Foldseek searches. Patching HMMER3 to use 3Di-derived background frequencies and Dirichlet priors (line 10) did not improve performance. Furthermore, we noticed that HMMER3 runs very slowly on 3Di sequences and profiles, presumably because the prefiltering steps were not optimized with 3Di sequences in mind.

Homology detection accuracy.
Test sequences were binned based on percent identity to the closest training sequence in the same family and annotated based on the top scoring hit from a search against the entire set of training sequences or training sequence family profiles, depending on the algorithm. (A) family recovery accuracy by bin. (B) clan recovery accuracy. (C) and (D) are detail views of the 0.8 to 1.0 accuracy range. There are a total of 21,293 test sequences. 12,246 test sequences have clan assignments.

While faster than MSA construction or full structure prediction, pLM embedding still have non-trivial computational overhead. This limits the possibility of using methods based on pLM embeddings to perform sensitive homology searches against large metagenomic databases such as the 2.4 billion sequence Mgnify database (Richardson et al., 2023). It would be desirable to have search methods where the database can remain as amino acid sequences or other cheaply-calculated representations and the queries can be processed with more expensive methods. To this end, we tested hmmsearch using ESM2 3B generated profiles as queries against amino acid databases (lines 3 and 4). This is similar to a two-iteration PSI-BLAST (Altschul et al., 1997) or JackHMMER (Johnson et al., 2010) search where the first search and MSA-building step is replaced by a pLM embedding step. Curiously, hmmsearch using profiles built directly from the pLM probabilities (line 3) had the lowest accuracy of any algorithm. Nevertheless, profiles processed with hmmbuild from HMMER3, applying HMMER3 Dirichlet priors on top of the pLM probabilities (line 4), had better family prediction accuracy than phmmer at all but the highest identity bin, and accuracy on par with hmmscan at 18% identity and lower.

Discussion

Our results show that small positional embeddings produced by pLMs can enhance search sensitivity, potentially with less computational overhead than MSA construction or full structure prediction. While this manuscript was being submitted, another group reported an amino acid to 3Di translation model, ProstT5, that also showed strong performance on remote homology detection (Heinzinger et al., 2023). There are many possible directions for future development of improved embeddings, faster conversion and search programs, and comprehensive reference databases. Of particular interest could be conversion to profile HMM-style embeddings in a single pass, rather than the seven we required, and predicting state transition probabilities. Small positional embeddings optimal for local alignment could also be learned directly from a differentiable alignment algorithm (Petti et al., 2023) instead of relying on proxy tasks of amino acid or 3Di prediction. Finally, asymmetric architectures where embedding a database sequence is cheaper than embedding a query, analogous to searches with profile HMM queries against primary sequence databases, could be a powerful method for improving search sensitivity against large and growing reference databases. We’ve made our model training, search, and data analysis code publicly available. We hope our results and code will serve as a springboard for further exploration of the utility of low-dimensionality positional embeddings of protein sequences.

Methods

Alignments

Unless otherwise noted, multiple sequence alignments were made using MAFFT (v7.505) (Katoh and Standley, 2013) with options --anysymbol --maxiterate 1000 --globalpair. Protein alignments used the BLOSUM62 substitution matrix (Henikoff and Henikoff, 1992). 3Di alignments used the 3Di substitution matrix from Foldseek (van Kempen et al., 2023) (https://github.com/steineggerlab/foldseek/blob/master/data/mat3di.out).

Patching HMMER3 with background frequencies and Dirichlet priors for 3Di

We created a fork of the HMMER3 program, replacing amino acid background frequencies and Dirichlet priors with values calculated from the 3Di alphabet instead of the amino acid alphabet (https://github.com/seanrjohnson/hmmer3di). To generate a set of 3Di MSAs, we converted the AlphaFold UniProt Foldseek database (Jumper et al., 2021; van Kempen et al., 2023; Varadi et al., 2022) to a 3Di fasta file. We then looked up every sequence name from the Pfam 35 seed file in the UniProt 3Di fasta file and, for cases where the corresponding sequence was identifiable, extracted the sub-sequence corresponding to the Pfam 35 seed. 3Di seeds from each profile were aligned using MAFFT. MSA columns with more than 10 rows were used to calculate background frequencies and Dirichlet priors using the HMMER3 program esl-mixdchlet fit with options -s 17 9 20. Amino acid background frequencies and Dirichlet priors in the HMMER3 source code were then replaced with the newly calculated 3Di background frequencies and Dirichlet priors. We call the patched HMMER3 as HMMER3Di.

Fine tuning ESM-2 3B to convert amino acid sequences into 3Di sequences

A 1D CNN was added on top of ESM-2 3B. The CNN takes as input position-wise embeddings from the last transformer layer of ESM-2 3B. The CNN consists of two layers, the first layer has 2560 input channels (the size of the embeddings from ESM-2 3B), and 300 output channels, kernel size 5, stride 1, padding 3. The second layer has 300 input channels and 21 output channels (one for each 3Di symbol plus a padding symbol), kernel size 5, stride 1, padding 1. The model was trained with a weighted cross-entropy loss function using weights of 0.1 * the diagonal from 3Di substitution matrix. The neural network was implemented in PyTorch (Paszke et al., 2019).

Training data was derived from the Foldseek AlphaFold2 UniProt50 dataset (Jumper et al., 2021; van Kempen et al., 2023; Varadi et al., 2022), a reduced-redundancy subset of the UniProt AlphaFold2 structures. The Foldseek database was downloaded using the Foldseek “databases” command line program, converted into protein and 3Di fasta files, filtered to remove sequences smaller than 120 amino acids and larger than 1000 amino acids, and split into train, validation, and test subsets, 90%:5%:5%, (33,924,764: 1,884,709: 1,884,710 sequences)

With ESM-2 layers frozen, the CNN was trained on the task of converting amino acids to 3Di sequences using the AdamW optimizer with learning rate 0.001, weight decay 0.001, and exponential learning rate decrease (gamma 0.98, applied every 100 batches). Training sequences were randomly selected in batches of 15 sequences. Training proceeded for 1301 batches, leading to a training accuracy of about 58%.

The last transformer layer of ESM-2 was then unfrozen and training restarted from the saved weights, with the same training parameters. Training continued for another 24001 batches of 10 random training sequences. Accuracy on the final training batch was 65%. Using the final trained weights, test sequences were converted to 3Di at an accuracy of 64.4%. We call the fine-tuned model ESM-2 3B 3Di.

The trained weights are available on Zenodo (https://doi.org/10.5281/zenodo.8174959). The training code is available on Github (https://github.com/seanrjohnson/esmologs). The model training code should be useful for fine-tuning ESM-2 to convert amino acid sequences to various other kinds of sequences, such as secondary structure codes, etc.

Pfam 32 clustered splits

Pfam 32 clustered splits (Bileschi et al., 2022) were downloaded from: https://console.cloud.google.com/storage/browser/brain-genomics-public/research/proteins/pfam/clustered_split. Data for mapping of individual Pfam 32 families to clans was downloaded from: https://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam32.0/. The sequences from the train, dev, and test splits were sorted into unaligned fasta files according to their split and family (aa seq).

Predicting 3Di sequences

Each training and test sequence was converted into a predicted 3Di sequence (pred 3Di) using the ESM-2 3B 3Di model described above. MAFFT alignments were made of both the amino acid and predicted 3Di training and test sequences for each family. HMMER3 profiles were built from the alignments using either unaltered HMMER3 (pred 3Di hmmer profile), or HMMER3Di (pred 3Di hmmer3Di profile).

Predicting profiles to generate HH-suite hhm files and HMMER3 hmm files from single sequences

Positional amino acid probabilities were calculated for unaligned train and test sequence using pre-trained ESM-2 3B in the following algorithm.

Prepend sequence with M. This is because ESM-2 3B has a strong bias towards predicting M as the first amino acid in every sequence, and most Pfam domains don’t start with M.
Mask the first, eighth, etc. position in the sequence
Run a forward pass of ESM-2 3B over the masked sequence.
Save the logits for each of the 20 amino acid tokens for each masked position.
Shift the masks one position to the right.
Repeat steps 3 through 5 another 6 times until logits have been saved for every position.
Use the softmax function to calculate the amino acid probabilities at each position from the logits.

Note that in our actual implementation, a single forward pass was run on a batch of seven copies of the input sequence, each with different masking.

The positional probabilities were written directly as an HH-suite compatible .hhm file (pred hhsuite profile). A 40 sequence fasta MSA file was written where each sequence was randomly sampled from the probability distribution. Hmmbuild was run with default settings on the sampled MSA (pred hmmer profile dchlet) and with the -pnone setting, which disables adjustments to the probability distribution based on Dirichlet priors (pred hmmer profile).

Building HH-suite hhm and HMMER3 hmm profiles from train amino acid MSAs

The amino acid MSA for each training family were converted into an HH-suite database (aa hhsuite profile) with the following bash script:

echo ‘#’$profile_name > msa/${profile_name}.a3m hhfilter -i $MSA_FASTA -a msa/${profile_name}.a3m -M 50 -id 90; hhconsensus -i msa/${profile_name}.a3m -o consensus/${profile_name}.a3m hhmake -name $base -i consensus/${profile_name}.a3m -o hhm/${base}.hhm ffindex_build -s db_hhm.ffdata db_hhm.ffindex hhm ffindex_build -s db_a3m.ffdata db_a3m.ffindex consensus cstranslate -f -x 0.3 -c 4 -I a3m -i db_a3m -o db_cs219

HMMER3 profiles were built by calling hmmbuild with default settings on the training family amino acid MSAs (aa hmmer profile).

Building HH-suite databases from predicted profiles

HH-suite databases were built from predicted profiles .hhm files and the corresponding sampled 40 sequence MSA fasta files using the following bash script:

ffindex_build -s db_hhm.ffdata db_hhm.ffindex esm2_3B_profiles ffindex_build -s db_a3m.ffdata db_a3m.ffindex esm2_3B_sampled_msas cstranslate -f -x 0.3 -c 4 -I a3m -i db_a3m -o db_cs219

Building Foldseek databases from predicted 3Di sequences

Amino acid and predicted 3Di fasta files were converted into Foldseek-compatible databases using a new script, fasta2foldseek.py, available from the esmologs python package (see below).

Hmmscan, phmmer, and hmmsearch HMMER3 searches

In an attempt to mimic the Top pick HMM strategy reported by (Bileschi et al., 2022) we ran all HMMER3 searches in up to two iterations. The first iteration was run with default settings. For test sequences where no hits were detected among the training sequences or profiles, depending on the program, a second iteration was run with the addition of parameters intended to maximize sensitivity at the expense of search speed:

--max -Z 1 --domZ 1 -E 1000000 --domE 1000000

It should be noted that while our phmmer results are directly comparable to the phmmer results reported by Bileschi et al., our hmmscan results are not directly comparable to the reported “Top Pick HMM” results because we re-aligned the training sequences for each family instead of using the Pfam seed alignments. Still our results were very similar. We observed a 17.6% error rate (3744 test sequences with mispredicted family assignments) by hmmscan, compared to the reported 18.1% error rate (3844 mispredictions).

3Di_hmmscan HMMER3Di searches

3Di_hmmscan searches were performed using the same two iteration method described above for searches using standard HMMER3 programs.

Hhblits HH-suite searches

Hhbilts was run with the options:

-tags -n 1 -v 0

Foldseek searches

After converting both query and target amino acid and predicted 3Di fasta files into Foldseek compatible databases (see above), Foldseek searches were run with the following commands:

foldseek search test_db train_db foldseek_results tmpFolder foldseek convertalis test_db train_db aln_3Di_only ../hits.tsv --format- output query,target,bits

For 3Di-only searches, the option --alignment-type 0 was added to the search call.

Data and Code availability

HMMER3 patched with 3Di background frequencies and Dirichlet priors: https://github.com/seanrjohnson/hmmer3di

Code for neural network training, sequence searches, and data analysis: https://github.com/seanrjohnson/esmologs

Model weights for ESM-2 3B 3Di, predicted profiles and 3Di sequences from the Pfam 32 clustered splits, and other data necessary to reproduce the analyses: https://doi.org/10.5281/zenodo.8174959

Acknowledgements

We thank Sergey Ovchinnikov for helpful discussion about protein language models, Sean Eddy for helpful discussion about HMMER3, and Gary Smith for IT support.

Sensitive remote homology search by local alignment of small positional embeddings from protein language models

Significance of findings

Strength of evidence

Abstract

Introduction

Results

Schematics of embedding models and the experimental design.

Logos related to the example test sequence YBGC_HELPY__14-90 from the 4HBT family.

Homology detection accuracy.

Discussion

Methods

Alignments

Patching HMMER3 with background frequencies and Dirichlet priors for 3Di

Fine tuning ESM-2 3B to convert amino acid sequences into 3Di sequences

Pfam 32 clustered splits

Predicting 3Di sequences

Predicting profiles to generate HH-suite hhm files and HMMER3 hmm files from single sequences

Building HH-suite hhm and HMMER3 hmm profiles from train amino acid MSAs

Building HH-suite databases from predicted profiles

Building Foldseek databases from predicted 3Di sequences

Hmmscan, phmmer, and hmmsearch HMMER3 searches

3Di_hmmscan HMMER3Di searches

Hhblits HH-suite searches

Foldseek searches

Data and Code availability

Acknowledgements

References

Article and author information

Author information

Sean R. Johnson

Meghana Peshwa

Zhiyi Sun

Version history

Cite all versions

Copyright

Metrics

Be the first to read new articles from eLife