Schematic of integrated multi-dimensional proteogenomics framework for mapping tumor-specific proteoforms.

(A) ProteomeGenerator3 workflow overview: transcriptomes and proteomes derived from the same biological sample are analyzed in parallel using high-coverage Oxford Nanopore Technologies (ONT) cDNA sequencing and high-resolution, high-accuracy multi-dimensional mass spectrometry. ProteomeGenerator3 processes the FASTQ-formatted mRNA sequencing reads to assemble predicted transcripts, determine coding sequences and splice isoforms, and generate FASTA-formatted proteogenomic databases. These databases include both canonical and non-canonical protein isoforms and are used in subsequent mass spectrometry searches. The diaPASEF-acquired timsTOF .d files are searched against the resulting FASTA database using FragPipe, enabling sensitive and comprehensive identification of expressed protein isoforms from the sample-specific proteogenomic database. (B) Transcripts from long-read sequencing that give rise to coding sequences (CDS) are classified as “canonical” (annotated in GENCODE) or “non-canonical” (not annotated in GENCODE, including neogenes and alternative splicing isoforms). (C) Sample-specific transcriptomes are used to search high-resolution mass spectrometry data from orthogonally fractionated, multi-enzyme protein digests, allowing detection of non-canonical proteins deriving from new transcripts or from known transcripts previously annotated as “non-coding”. Created with BioRender.com.

Overview of proteogenomics protocol.

Oxford Nanopore Technologies (ONT) cDNA sequencing and MS proteomics were performed on the Ewing sarcoma cell line. Module 1 (labeled “Align”) shows alignment and quality control (QC) of long-read cDNA sequencing data using the nf-core/nanoseq pipeline and fusion calling with JAFFAL. Module 2 (labeled “transcript assembly”) shows transcript assembly performed with Bambu and extraction of cDNA sequences using gffread. Module 3 (labeled “Proteomics”) shows a simplified version of the FragPipe diaTracer workflow, which is used for peptide identification and quantification from diaPASEF-acquired data. diaTracer processes diaPASEF .d files to generate pseudo-MS/MS spectra (mzML) for MSFragger, enabling DDA-like database searching. This is followed by peptide-spectrum matching (PSM) using MSFragger; the matches are then rescored with deep learning-based algorithms via MSBooster and Percolator. Protein inference is conducted using ProteinProphet, and filtering is applied at 1% FDR at the PSM, ion, peptide, and protein levels using Philosopher. Spectral libraries are generated directly from the DIA data using EasyPQP. Quantification is extracted from the DIA data using DIA-NN. Panel A was created with BioRender.com.

Quantification of protein IDs and sequence coverage of the A673 proteome using multidimensional proteogenomics.

(A) Schematic representation of the fractionation strategy employed in this study. A673 peptides were first fractionated using strong cation exchange (SCX) chromatography, with stepwise elution performed at increasing KCl concentrations of 50 mM, 100 mM, 250 mM, and 500 mM. Each SCX fraction was subsequently subjected to high-pH reverse-phase (RP) fractionation (pH 10). Peptides were eluted in 10 stepwise fractions with increasing acetonitrile (ACN) concentrations using a spin column. Selected fractions were concatenated into 8 final fractions, resulting in a total of 32 fractions per enzyme (4 SCX steps × 8 RP fractions), which were then analyzed by LC-MS/MS. Abbr. FT – flow-through. (B) Schematic depicting the multi-protease digestion strategy employed for deep sequence coverage of proteins. The table summarizes all proteolytic enzymes used in this study along with their specific cleavage sites. Cleavage positions are denoted by a prime symbol (′), indicating whether the bond is cleaved on the N-terminal or C-terminal side of the specified amino acid residue. (C) Bar plot showing the total number of protein IDs identified with different enzyme combinations. The corresponding mean amino acid sequence coverage is shown above each bar. The matrix indicates which enzymes are used in each combination. (D) Bar plot evaluating different combinations of orthogonal peptide fractionation methods and/or varying numbers of peptide fractions to identify the most effective strategy for maximizing protein identifications and coverage. (E) Venn diagram showing counts of ORFs from the A673 transcriptome (A673 transcriptome, yellow), counts of proteoforms detected when diaPASEF proteomics data of the A673 cell line were searched using the A673-specific protein search database generated from the A673 transcriptome (MS detected, pink), and counts of all proteins in the SwissProt human reference proteome (SwissProt, blue). Panels A and B were created with BioRender.com.
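The fraction count in panel A follows from crossing the four SCX salt steps with the eight concatenated high-pH RP fractions. The short Python sketch below only illustrates that bookkeeping; the label format is hypothetical, and the legend does not state which of the 10 RP eluates were pooled, so no concatenation scheme is modeled.

```python
from itertools import product

# SCX stepwise elution concentrations stated in the legend (mM KCl).
SCX_STEPS_MM = [50, 100, 250, 500]
# Number of final high-pH RP fractions after concatenation (from the legend).
N_RP_FINAL = 8


def fraction_labels(scx_steps=SCX_STEPS_MM, n_rp=N_RP_FINAL):
    """Return one label per (SCX step, RP fraction) pair.

    The label format is a hypothetical naming convention, not one
    taken from the study.
    """
    return [
        f"SCX{kcl}mM_RP{rp:02d}"
        for kcl, rp in product(scx_steps, range(1, n_rp + 1))
    ]


labels = fraction_labels()
print(len(labels))  # 4 SCX steps x 8 RP fractions per enzyme
```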

Long-read transcriptomics of an Ewing sarcoma cell line detects non-canonical transcriptional events.

(A) Upset plot comparing the long-read transcript assembly of the A673 cell line to GENCODE v45 and GTEx v9 (ref. 85). Vertical bars show intersection sizes and horizontal bars show total counts for each set within the A673 transcript assembly. The “A673” row in the upset matrix refers to all transcripts detected from long-read transcript assembly of the A673 cell line. The remaining rows depict the subsets of A673 transcripts that overlapped with the GENCODE v45 and GTEx v9 annotations, as well as the subset of A673 transcripts predicted to be coding by TransDecoder (ref. 84). “Non-canonical” refers to transcripts not found in GENCODE v45. (B-C) Bar plots of alternative (alt) splicing events detected within non-canonical transcripts in the A673 cell line. Alt splicing events across all transcripts are shown in green, whereas those found in transcripts with predicted ORFs are shown in purple. “ME exons” refers to mutually exclusive exons. (D) Upset plot depicting putative neogenes detected in the A673 transcript assembly. The strip plot in the top panel shows expression levels of neogenes quantified using Bambu (ref. 75) in counts per million (CPM). In the upset plot, vertical bars show intersection sizes of sets and horizontal bars show total counts for each set. The “Mol Cell 2022” row shows the neogenes initially detected in ref. 3, which were also found in the GENCODE v45 annotation. (E) IGV snapshot showing the detected neogene from ref. 3 in yellow (labeled “ENST00000411596.1 (EW_NG3)”). The GGAA track shows microsatellite binding regions near the transcriptional start site. (F) IGV snapshot of one of the A673-specific neogenes discovered in our long-read RNA-seq (shown in yellow and labeled “BambuTx9216”).
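For readers unfamiliar with upset plots, the two bar types described above can be sketched as follows. This is a minimal illustration, assuming each transcript carries a set of category labels and that intersections are exclusive (each transcript is counted once, under its exact combination of sets); the transcript IDs and memberships are invented for the example and are not data from the study.

```python
from collections import Counter


def intersection_sizes(memberships):
    """Vertical bars: count transcripts per exact combination of sets."""
    return Counter(frozenset(cats) for cats in memberships.values())


def set_totals(memberships):
    """Horizontal bars: count transcripts per individual set."""
    totals = Counter()
    for cats in memberships.values():
        totals.update(cats)
    return totals


# Hypothetical toy memberships (not real A673 transcripts).
toy = {
    "tx1": {"A673", "GENCODE"},
    "tx2": {"A673"},
    "tx3": {"A673", "GENCODE", "GTEx"},
    "tx4": {"A673", "GENCODE"},
}

for combo, n in intersection_sizes(toy).items():
    print(sorted(combo), n)
print(dict(set_totals(toy)))
```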

Identification of non-canonical proteins using end-to-end proteogenomics protocol.

(A) Upset plot showing proteins with unique peptides from Figure 4A which had no BLASTp match to any protein in UniProt. (B) Histograms of posterior error probability (PEP) scores for SwissProt proteins and the “unknown” proteins from panel A, all of which were identified in the sample-specific A673 computational proteomics search with FragPipe. The PEP score of each protein is the mean over all of its unique peptides. (C) Gene set enrichment analysis performed for the unknown proteins shown in panels A-B using the MSigDB Hallmark 2020 gene sets. (D-F) Examples of unknown proteins identified at the THAP7, EXOC7, and CD276 loci, respectively. For each panel, tracks from top to bottom show: ideogram, ONT read coverage, detected peptides (orange) shown on transcript structure (green), detected transcripts shown with predicted ORFs (the unknown protein is highlighted in green), and RefSeq Select.
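The per-protein PEP summary used in panel B (mean over a protein's unique peptides) can be sketched as below. This is a minimal illustration under stated assumptions: the input is a list of hypothetical (protein, peptide, PEP) tuples already restricted to unique peptides, and when the same peptide is reported more than once the lowest PEP is kept, a choice the legend does not specify.

```python
from collections import defaultdict
from statistics import mean


def protein_mean_pep(rows):
    """Return {protein_id: mean PEP over its distinct peptides}.

    rows: iterable of (protein_id, peptide_sequence, pep) tuples.
    Assumption: repeated observations of a peptide are collapsed to
    their minimum PEP before averaging.
    """
    best = defaultdict(dict)
    for protein, peptide, pep in rows:
        previous = best[protein].get(peptide)
        if previous is None or pep < previous:
            best[protein][peptide] = pep
    return {protein: mean(peps.values()) for protein, peps in best.items()}


# Hypothetical toy input (identifiers and scores are invented).
rows = [
    ("unknown_1", "AAK", 0.01),
    ("unknown_1", "AAK", 0.03),
    ("unknown_1", "CCR", 0.03),
    ("unknown_2", "DDK", 0.002),
]
print(protein_mean_pep(rows))
```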

Raw read distributions from long-read cDNA sequencing of the A673 cell line, generated with NanoPlot.

The y-axis of the main plot shows Phred scores and the x-axis shows read lengths in kilobases.

Bar plots highlighting the total number of protein IDs identified with different enzyme combinations for membrane (A), cytoplasmic (B), and nuclear (C) proteins.

The corresponding mean amino acid sequence coverage is shown above each bar. The matrix indicates which enzymes are used in each combination.

Evaluation of different DDA-PASEF fragmentation isolation polygons for the identification of singly (+1) and multiply charged (≥+2) chymotryptic peptides.

A-D) Exemplary heatmaps of ion intensities (blue-red scale) across the inverse ion mobility (1/K₀) vs. m/z dimensions showing fragmentation events (black line). E-H) All peptides identified across the 1/K₀ vs. m/z dimensions, colored by charge state. The percentages of detected singly charged (+1) and multiply charged (≥+2) peptides are also highlighted for the none-, extended-, standard-, and +1 tailored-DDA-polygon acquisition strategies. I) Average number of unique peptides identified per injection for each method (error bars represent mean ± SD, n = 3). J-K) Venn diagrams illustrating the overlap in identified unique peptides (J) and proteins (K) among the four tested DDA-PASEF isolation window strategies.

Evaluation of different DIA-PASEF fragmentation isolation window strategies for the identification of singly charged (+1) and multiply charged (≥+2) chymotryptic peptides.

A–B) Representative heatmaps of ion intensities (blue-to-red scale) across the inverse ion mobility (1/K₀) vs. m/z dimensions, indicating fragmentation events (blue boxes, single injection). C–D) Distribution of all identified peptides across the 1/K₀ vs. m/z dimensions, color-coded by charge state. The percentages of detected singly charged (+1) and multiply charged (≥+2) peptides are highlighted for the standard and double-rainbow DIA-PASEF acquisition strategies. E–F) Peptide length and charge state distributions (7–50 amino acids) for the standard and double-rainbow DIA-PASEF isolation windows. The total number of identified peptides for each specific charge state is indicated in blue. G) Frequency distribution of identified peptide lengths and charge states using the standard and double-rainbow DIA-PASEF approaches. H) Number of MS2 scans triggered per injection in each method (error bars represent mean ± SD, n = 3). I) Venn diagrams comparing the numbers of identified proteins and peptides between the standard and double-rainbow DIA-PASEF isolation window strategies. J) Gene Ontology (GO) cellular component analysis performed on the 263 proteins uniquely identified using the double-rainbow DIA-PASEF approach. Enrichment analysis was conducted using the STRING v12.0 database (ref. 92).

Expression of non-canonical transcripts in the A673 transcript assembly which overlapped with the GTEx annotation.

(A) Heatmap of log-transformed transcripts per million (TPM) for individual transcripts, in which each column represents a different GTEx sample, and each row represents a different transcript. (B) Tukey boxplots with strip plots overlaid depicting log-transformed TPMs, in which each point represents the mean log2(TPM+1) for a given transcript taken over all samples in a given site.
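The summary statistic plotted in panel B, the mean of log2(TPM+1) for each transcript over all samples from a given site, can be sketched as follows. This is an illustrative sketch only; the record layout and the example values are hypothetical, and the log transform is applied per sample before averaging, as the legend wording suggests.

```python
import math
from collections import defaultdict


def mean_log2_tpm_by_site(records):
    """Return {(transcript_id, site): mean of log2(TPM + 1) over samples}.

    records: iterable of (transcript_id, site, tpm) tuples, one per sample.
    """
    acc = defaultdict(lambda: [0.0, 0])  # running [sum, count] per key
    for transcript, site, tpm in records:
        entry = acc[(transcript, site)]
        entry[0] += math.log2(tpm + 1.0)
        entry[1] += 1
    return {key: total / n for key, (total, n) in acc.items()}


# Hypothetical toy records (not GTEx data).
records = [
    ("tx1", "Lung", 0.0),
    ("tx1", "Lung", 3.0),
    ("tx1", "Liver", 1.0),
]
print(mean_log2_tpm_by_site(records))
```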

Examples of A673 neogenes appearing in gene deserts (previously unannotated genomic regions).

For each panel, tracks from top to bottom show: ONT read coverage, transcript structure of detected neogenes, GENCODE v45, and RefSeq (downloaded from UCSC on 07/09/2025).

Benchmarking of computational proteomics search methods.

A) Bar plot depicting the number of quantified proteoforms in the A673 tryptic dataset analyzed using Spectronaut 19 (direct DIA), FragPipe+MSFragger v22.0 (diaTracer workflow), and DIA-NN v1.9.1 (library-free mode). The A673 PG3 protein sequence database, including non-human Archaebacteria loki proteins, was used for analysis. B) Sensitivity vs. specificity for the different computational proteomics search methods, using the results of panel A. Sensitivity was calculated as the fraction of identified protein groups relative to the total number of annotated A673 transcripts. Specificity was assessed as the inverse proportion of erroneously matched A. loki sequences among all identifications.
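The two benchmarking metrics in panel B can be written out as a short Python sketch. It rests on one interpretive assumption: "inverse proportion" is read here as one minus the fraction of A. loki matches among all identifications; the legend does not give the exact formula. All counts in the usage example are invented and are not results from the study.

```python
def sensitivity(n_identified_groups, n_annotated_transcripts):
    """Identified protein groups as a fraction of annotated A673 transcripts."""
    if n_annotated_transcripts <= 0:
        raise ValueError("n_annotated_transcripts must be positive")
    return n_identified_groups / n_annotated_transcripts


def specificity(n_loki_matches, n_total_identifications):
    """Assumed form: 1 - (A. loki matches / all identifications)."""
    if n_total_identifications <= 0:
        raise ValueError("n_total_identifications must be positive")
    return 1.0 - n_loki_matches / n_total_identifications


# Hypothetical counts for illustration only.
print(sensitivity(5000, 20000))
print(specificity(10, 1000))
```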

EWSR1:FLI1 fusion protein amino acid sequence and MS-identified chymotryptic peptides.

A) Structure of EWS, FLI1, and EWSR1:FLI1. Abbr. BD – Binding Domain, NLS – Nuclear Localization Signal/Sequence. B) The full-length amino acid (aa) sequence of the EWSR1:FLI1 fusion protein is shown, encompassing the N-terminal domain derived from EWSR1 (in blue) and the C-terminal ETS DNA-binding domain contributed by FLI1 (in purple). The breakpoint resulting from the chromosomal translocation t(11;22)(q24;q12) in A673 is indicated by vertical lines separating the EWSR1 and FLI1 segments. Peptides identified by LC-MS/MS are underlined, demonstrating sequence coverage (34%) across the fusion junction as well as within both EWSR1- and FLI1-derived regions. The presence of a peptide spanning the fusion breakpoint (underlined in green) confirms expression of the chimeric protein at the proteomic level. C) Peptide (MS1) and D) precursor ion (MS2) extracted ion chromatogram (XIC) profiles for the EWSR1:FLI1 fusion protein with precursor GQQNPSYDSVRRGAW.3. Panels A and B were created with BioRender.com.
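A sequence coverage figure such as the one quoted above is typically computed as the fraction of residues covered by at least one identified peptide. The sketch below shows one simple way to do this, assuming exact substring matching of peptides to the protein sequence; it is not necessarily how the reported value was derived, and the sequence and peptides in the example are invented rather than taken from EWSR1:FLI1.

```python
def sequence_coverage(protein_seq, peptides):
    """Fraction of residues covered by at least one peptide occurrence."""
    if not protein_seq:
        return 0.0
    covered = [False] * len(protein_seq)
    for peptide in peptides:
        if not peptide:
            continue
        start = protein_seq.find(peptide)
        while start != -1:
            # Mark every residue spanned by this occurrence.
            for i in range(start, start + len(peptide)):
                covered[i] = True
            # Look for further (possibly overlapping) occurrences.
            start = protein_seq.find(peptide, start + 1)
    return sum(covered) / len(protein_seq)


# Hypothetical toy sequence and peptides.
toy_protein = "MKTAYIAKQR"
toy_peptides = ["MKT", "KTAY", "XYZ"]
print(sequence_coverage(toy_protein, toy_peptides))
```

Overlapping peptides are counted once per residue, so adding redundant peptides never pushes coverage above 1.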

Pfam search results for non-canonical EXOC7 proteoform.

Pfam search results corresponding to the non-canonical proteoform depicted in Figure 5E. For this proteoform, we found a unique tryptic peptide (PEP = 2e-4) mapping to an alternative ORF on ENST00000465252.5. Twelve Pfam domain matches were found (shown above).

Evaluation of SCX and High-pH RP fractionation strategies to optimize fraction numbers without compromising proteome coverage.