Computational and Systems Biology

Sensitive remote homology search by local alignment of small positional embeddings from protein language models

Sean R. Johnson
Meghana Peshwa
Zhiyi Sun

New England Biolabs Inc., 240 County Road, Ipswich, MA 01938, United States

https://doi.org/10.7554/eLife.91415.2

Open access

Figures and data

Schematics of embedding models and the experimental design.
(A) ESM-2 3B can be used directly to predict amino acid probability distributions at masked positions. Our implementation uses seven passes; the second pass is shown in the figure. (B) ESM-2 3B 3Di, a fine-tuned ESM-2 3B with a small CNN top model, can be used to predict 3Di sequences from amino acid sequences. (C) Data flow from amino acid sequences through embedding models and other programs to produce the files used in homology searches. Bold words correspond to line labels in Figure 3.
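
The caption does not say how positions are divided among the seven passes in (A). One plausible scheme, assumed here purely for illustration, is that each pass masks every seventh position starting from a different offset, so that every position is masked exactly once across all passes. The function names and the `<mask>` token string are hypothetical, and no language model is called.

```python
def mask_passes(seq_len, n_passes=7):
    """Return, for each pass, the positions masked in that pass.

    Assumption: pass k masks positions k, k + n_passes, k + 2 * n_passes, ...
    """
    return [list(range(offset, seq_len, n_passes)) for offset in range(n_passes)]


def masked_sequence(seq, positions, mask_token="<mask>"):
    """Replace the residues at the given positions with a mask token."""
    tokens = list(seq)
    for p in positions:
        tokens[p] = mask_token
    return tokens


passes = mask_passes(20)
second_pass = masked_sequence("ACDEFGHIKLMNPQRSTVWY", passes[1])
```

Under this assumption, the second pass over a 20-residue sequence masks positions 1, 8, and 15 (zero-based), and the model's predictions at the masked positions from all seven passes would together give one probability distribution per residue.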

Logos related to the example test sequence YBGC_HELPY____14-90 from the 4HBT family.
(A) The 4HBT family HMM from Pfam 32. (B) hmmbuild with default settings on an MSA of the top 100 hits from an online BLAST search (blast.ncbi.nlm.nih.gov) of YBGC_HELPY____14-90 against the NCBI clustered nr database. (C) hmmbuild with default settings on the MSA sampled from the ESM-2 3B positional probabilities for YBGC_HELPY____14-90. (D) and (E) hmmbuild with Dirichlet priors disabled on the same MSAs as for (B) and (C), respectively. All logos were generated by uploading the corresponding .hmm file to skylign.org (Wheeler et al., 2014).

Homology detection accuracy.
Test sequences were binned by percent identity to the closest training sequence in the same family and annotated with the top-scoring hit from a search against either the entire set of training sequences or the training sequence family profiles, depending on the algorithm. (A) and (C) show family recovery accuracy by bin. (B) and (D) show clan recovery accuracy. (A) and (B) compare amino acid profile-based methods. (C) and (D) compare Foldseek-based methods. Dashed lines are controls. There are 21,293 test sequences in total, of which 12,246 have clan assignments.
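
The per-bin accuracy described in this caption can be sketched as follows. This is a minimal illustration, not the authors' code: the bin edges, the record layout, and the function name are assumptions, since the caption does not give them.

```python
from collections import defaultdict


def accuracy_by_bin(records, bin_edges):
    """Fraction of test sequences whose top hit carries the correct label, per bin.

    records: iterable of (percent_identity, true_label, predicted_label).
    bin_edges: ascending edges, e.g. [0, 20, 40, 60, 80, 100] (hypothetical).
    Bins are half-open [lo, hi), except the last, which includes its upper edge.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for pid, true_label, predicted in records:
        for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
            if lo <= pid < hi or (hi == bin_edges[-1] and pid == hi):
                total[(lo, hi)] += 1
                correct[(lo, hi)] += int(predicted == true_label)
                break
    return {b: correct[b] / total[b] for b in total}


example = [(10, "A", "A"), (15, "A", "B"), (50, "C", "C"), (100, "D", "D")]
result = accuracy_by_bin(example, [0, 20, 40, 60, 80, 100])
```

The same function serves for family and clan recovery: only the labels passed in change, and for clan accuracy the records would be restricted to test sequences that have a clan assignment.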

SCOPe40 benchmark.
Cumulative distribution plots of the number of queries attaining each level of sensitivity to the first false positive fold at the (A) family, (B) superfamily, and (C) fold level. (D) Data preparation, search times, and average sensitivity at the superfamily level. (E) Comparison of average sensitivity at the superfamily level of Foldseek run with queries and databases prepared by different methods.
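
As a rough illustration of the metric plotted in (A) to (C), the sketch below computes, for one query, the fraction of possible true positives ranked above the first false positive, and then the fraction of queries reaching each sensitivity level. The labelling of hits as `"tp"`, `"fp"`, or `"ignore"` and the function names are assumptions made for this example; how hits are assigned those labels at each SCOPe level is not specified in the caption.

```python
def sensitivity_to_first_fp(ranked_labels, n_possible_tp):
    """Fraction of possible true positives found before the first false positive.

    ranked_labels: hit labels sorted best-first, each "tp", "fp", or "ignore".
    n_possible_tp: number of true positives that exist in the database for the query.
    """
    if n_possible_tp == 0:
        return 0.0
    found = 0
    for label in ranked_labels:
        if label == "fp":
            break  # stop counting at the first false positive
        if label == "tp":
            found += 1
    return found / n_possible_tp


def cumulative_distribution(sensitivities, thresholds):
    """Fraction of queries whose sensitivity is at least each threshold."""
    n = len(sensitivities)
    return [sum(s >= t for s in sensitivities) / n for t in thresholds]


s = sensitivity_to_first_fp(["tp", "ignore", "tp", "fp", "tp"], 4)
curve = cumulative_distribution([0.5, 1.0, 0.0], [0.0, 0.5, 1.0])
```

In the example, two of four possible true positives rank above the first false positive, so the query's sensitivity is 0.5; the third true positive, ranked after the false positive, does not count.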
