Determining de novo antibody sequences from cryoEM data with ModelAngelo. A) Exemplary map (with top10% alignment scores of 1076/1174 for HC/LC) from the benchmark dataset, representing an Influenza B virus neuraminidase (NA) in complex with four copies of a neutralizing Fab, at global FSC resolution of 2.3 Å. Shown are the deposited map, model, and the de novo model generated by ModelAngelo, along with a detailed view of CDRH3. B) Consensus sequences for heavy and light chains as generated by Stitch compared to the true sequences. Sequencing errors are indicated by an asterisk (*).

V-gene assignment from ModelAngelo data. A) Correlation between Stitch alignment score and sequence identity between the top-scoring V-gene of the ModelAngelo vs PDB sequence of the heavy and light chain variable domains, as indicated with the non-parametric Spearman correlation coefficient. B) Distribution of V-gene sequence identity for progressive alignment score cutoffs, compared to the pairwise V-gene sequence identity in the IMGT repertoire.

Schematic workflow to estimate de novo antibody sequences by the integration of cryoEM and LC-MS/MS data with Stitch.

Sequencing CR3022 with integrated cryoEM and LC-MS/MS data. A) Shown are the deposited cryoEM map (global FSC resolution 4.1 Å), model, and de novo ModelAngelo output for the CR3022 Fab in complex with the SARS-CoV-2 Spike S1 subunit. The sequences were extracted from the de novo model and used as input for Stitch, resulting in the identification of the indicated V-genes and CDR3 sequences. These variable domains were used as templates in Stitch to assemble the LC-MS/MS derived de novo peptides. B) Consensus sequences for CR3022 from the integrated cryoEM-MS data in Stitch compared to the true heavy and light chain sequences. Sequencing errors are indicated with an asterisk (*).

Targeted sequencing of CR3022 against a complex background of other antibodies. Plotted are the de novo consensus sequence identities derived from the LC-MS/MS data using either the true sequences, the full IMGT repertoire, or the ModelAngelo-derived variable domains as templates. We compare the output from the CR3022 dataset alone (‘No backgr.’) with the output after adding either a diffuse polyclonal IgG background from a COVID-19 patient (‘+whole IgG’) or full datasets from five additional anti-Influenza-HA monoclonals (‘+5 mAbs’). Use of decoy sequences as indicated by dark/light colors.

Inferring V-genes from published EMPEM data. A) Plotted are all non-zero alignment scores in Stitch from published EMPEM maps. B) Views of the variable domains of EMPEM maps with alignment scores >50 for both heavy and light chains. The EMDB identifiers are indicated at each panel.