End-to-end proteogenomics for discovery of cryptic and non-canonical cancer proteoforms using long-read transcriptomics and multi-dimensional proteomics

  1. Tow Center for Developmental Oncology, Department of Pediatrics, Memorial Sloan Kettering Cancer Center, New York, United States
  2. Computational Oncology Program, Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, United States
  3. Molecular Pharmacology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, United States
  4. Departments of Pediatrics, Pharmacology, and Physiology & Biophysics, Weill Cornell Medical College, Cornell University, New York, United States

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.

Editors

  • Reviewing Editor
    Volker Dötsch
    Goethe University Frankfurt, Frankfurt am Main, Germany
  • Senior Editor
    Volker Dötsch
    Goethe University Frankfurt, Frankfurt am Main, Germany

Reviewer #1 (Public review):

In this study, the authors provide an integrated proteogenomics pipeline to enable the discovery of novel peptides in an Ewing sarcoma cell line (A673). To identify novel, fully resolved full-length isoforms, they performed long-read RNA sequencing (Oxford Nanopore Technologies). Then, to increase the chance of detecting Ewing-specific neopeptides, the authors combined two approaches: a multi-protease digestion and a multi-dimensional proteomics approach.

Given the importance of novel isoforms and cryptic sites in neoantigen discovery and its putative applications in immunotherapy, this method and resource paper is of interest to the Ewing sarcoma community and potentially to a broader cancer audience. The originality of this paper lies mostly in its optimized method for discovering novel peptides (long-read sequencing combined with multi-protease, multi-dimensional trapped ion mobility spectrometry parallel accumulation-serial fragmentation mass spectrometry). Although, to my knowledge, no study combining long-read sequencing and proteomics methods has been published on Ewing sarcoma, this study appears limited in a few respects:

(1) The study is restricted to the analysis of a single cell line (A673). The authors should consider extending the analysis to other Ewing cell lines.

(2) The characterization of the 1121 non-canonical transcripts can be improved. How many are just splice variants of known genes, and how many are bona fide neogenes? In this respect, the definition of what the authors call neogene is quite unclear. Is a transcript with a new exon reported as a neogene? Is a transcript with a new start site reported as a neogene? It should be clearly indicated which categories of Figure 4B are reported on Figure 4D. A general flow chart would be very useful to help follow the analysis process.

(3) Similarly, the authors detect 3216 A673-specific proteins with no match in SwissProt. This number decreases to 72 "putative non-canonical proteoforms with unique peptides after BLASTp" against UniProt. Again, a flow chart would make it easier to follow the step-by-step analysis.

(4) Finally, only 17 spectral matches are suggested to be derived from non-canonical proteoforms. It would be important to compare the spectrum of these detected peptides with that of synthetic peptides. Such an analysis would enable us to assess the number of reliably detected proteoforms that can be expected in an Ewing sarcoma cell line.

(5) It is very unclear what the authors want to highlight in Supplementary Figure 5. Is it that non-canonical transcripts are broadly expressed in normal tissue? This again raises the question of how neogenes and non-canonical transcripts are defined. Apparently, this figure shows that these non-canonical transcripts contain a large proportion of canonical sequence, which accounts for the strong signal in many normal tissues. A similar heatmap could be presented that includes only the non-canonical sequences of the non-canonical transcripts. This figure should also include Ewing sarcoma samples.

Reviewer #2 (Public review):

The paper from Kulej et al. reports a set of tools for proteogenomic analysis of cancer proteomes. Their approach utilizes modern methods in long-read RNA sequencing to assemble a proteome database that is specific to Ewing sarcoma-derived A673 cells. To maximize proteome coverage and therefore increase the odds of detecting cancer-specific alterations at the protein level, the authors use multiple enzymes (trypsin, GluC, etc.) to digest cellular proteins and then perform multidimensional peptide fractionation. Peptide samples are then analyzed by LC-MS/MS using data-dependent and data-independent schemes on a timsTOF mass spectrometer. Proteogenomics is an important area of investigation for cancer research and does require new informatics tools.

The authors describe an end-to-end workflow where they claim to have optimized four different steps:

(1) Assembly of a sample-specific protein database using long-read transcriptomic data.

(2) Use of 8 different proteolytic enzymes to maximize diversity of peptides.

(3) Multiple stages of peptide fractionation using SCX and high-pH reversed-phase chromatography.

(4) Use of acquisition methods on the timsTOF mass spectrometer to provide MS/MS data from both singly charged and multiply charged peptides.
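
To illustrate why step (2) can increase sequence coverage, the following is a minimal in silico sketch, not the authors' implementation. It assumes simplified cleavage rules (trypsin cleaves after K or R unless followed by P; GluC cleaves after E only), no missed cleavages, and a hypothetical detectable peptide length window of 7 to 30 residues. The toy sequence and all function names are illustrative assumptions.

```python
import re

# Simplified cleavage rules, assumed for illustration only.
# Trypsin: cleave C-terminal to K or R unless the next residue is P.
# GluC: cleave C-terminal to E (cleavage after D is ignored here).
CLEAVAGE_RULES = {
    "trypsin": re.compile(r"(?<=[KR])(?!P)"),
    "gluc": re.compile(r"(?<=E)"),
}


def digest(sequence, protease):
    """Return fully cleaved peptides (no missed cleavages) in sequence order."""
    return [p for p in CLEAVAGE_RULES[protease].split(sequence) if p]


def covered_positions(sequence, proteases, min_len=7, max_len=30):
    """Return residue indices covered by peptides inside the length window."""
    covered = set()
    for protease in proteases:
        start = 0
        for peptide in digest(sequence, protease):
            if min_len <= len(peptide) <= max_len:
                covered.update(range(start, start + len(peptide)))
            start += len(peptide)
    return covered


# Hypothetical toy sequence, not taken from the manuscript.
toy = "AKGKSK" + "AAE" + "L" * 7 + "R"

trypsin_only = covered_positions(toy, ["trypsin"])
combined = covered_positions(toy, ["trypsin", "gluc"])
print(len(trypsin_only), "of", len(toy), "residues covered by trypsin alone")
print(len(combined), "of", len(toy), "residues covered by trypsin plus GluC")
```

In this toy case, the lysine-rich N-terminus yields tryptic peptides that are too short for the assumed window, whereas the GluC peptide spanning the same region is long enough to be counted. Real digests involve missed cleavages and protease-specific efficiencies that this sketch does not model.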

The authors published two earlier versions of ProteomeGenerator (versions 1 and 2) in the Journal of Proteome Research. In these earlier versions, 'ProteomeGenerator' was the set of software tools designed to integrate DNA and RNA sequencing to create a sample-specific protein database. To test the performance of each ProteomeGenerator version, the authors generated LC-MS/MS data using a combination of trypsin and LysC in one paper, and trypsin, LysC, and GluC in the other. In both papers, they performed some level of peptide fractionation prior to LC-MS/MS. They acquired LC-MS/MS data on a Thermo Q-Exactive in one paper and a Thermo Orbitrap mass spectrometer in the other.

In the current paper, the primary innovation is the use of long-read sequencing to potentially improve the quality of the sample-specific protein database. The other three components noted above are incremental compared to the authors' previous two papers and to generally accepted practices in the field of proteomics. To note one example, the authors previously digested proteins using three enzymes and now use eight. Similarly, they are now using a Bruker timsTOF mass spectrometer instead of one from Thermo. The detailed descriptions of the use of many enzymes, peptide fractionation, etc., create a very technically oriented paper, similar to or more so than the authors' earlier papers in the Journal of Proteome Research. So, while there is enthusiasm for the use of long-read sequencing across biomedical research, the impact here for proteogenomic applications is somewhat lost amid the technical description of experimental details that are not particularly innovative. In this respect, the report is not well matched to a broad readership.

Author response:

We thank the editors and reviewers for their thoughtful, constructive, and fair evaluation of our manuscript. We appreciate the recognition of the value of an end-to-end proteogenomics framework integrating long-read transcriptomics with deep proteomic analysis, and we are grateful for the specific guidance on how to strengthen clarity, generality, and impact for a broad scientific readership. We outline below the key revisions we plan to undertake in response to the public reviews.

Reviewer #1

We thank the reviewer for their positive assessment of the relevance of this work to Ewing sarcoma and cancer proteogenomics.

Scope and generality.

We agree that analysis of a single cell line limits generalization. In the revised manuscript, we will extend the ProteomeGenerator3 workflow to additional tumor specimens, including Ewing sarcoma tumors, to assess reproducibility and biological relevance beyond a single test cancer cell line.

Definitions and analytical clarity.

We will clarify definitions of non-canonical transcripts, alternative splice isoforms, and neogenes, and explicitly distinguish these categories throughout the manuscript. We will add a summary flow diagram that tracks transcripts through classification, ORF prediction, and proteoform detection, clarifying how Figures 4B and 4D relate.

Proteoform filtering and confidence.

To improve transparency, we will add a step-wise schematic summarizing how candidate non-canonical proteoforms are filtered to a high-confidence subset, including SwissProt comparison, BLASTp filtering, peptide uniqueness, and competitive database searches.
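
As an illustration only, a step-wise filter of this kind could be tabulated along the following lines. This is a minimal sketch under stated assumptions, not the ProteomeGenerator3 implementation: the record fields (`in_swissprot`, `blastp_identical`, `unique_peptides`), the filter labels, and the four toy candidates are all hypothetical, and the competitive database search step is omitted.

```python
def run_funnel(candidates, steps):
    """Apply filters in order and record how many candidates survive each one."""
    remaining = list(candidates)
    report = [("input", len(remaining))]
    for label, keep in steps:
        remaining = [c for c in remaining if keep(c)]
        report.append((label, len(remaining)))
    return remaining, report


# Hypothetical candidate records; field names are illustrative assumptions.
candidates = [
    {"id": "P1", "in_swissprot": True, "blastp_identical": True, "unique_peptides": 0},
    {"id": "P2", "in_swissprot": False, "blastp_identical": True, "unique_peptides": 2},
    {"id": "P3", "in_swissprot": False, "blastp_identical": False, "unique_peptides": 0},
    {"id": "P4", "in_swissprot": False, "blastp_identical": False, "unique_peptides": 3},
]

# Each step is a (label, predicate) pair; order matters for the reported counts.
steps = [
    ("no SwissProt match", lambda c: not c["in_swissprot"]),
    ("no identical BLASTp hit", lambda c: not c["blastp_identical"]),
    ("at least one unique peptide", lambda c: c["unique_peptides"] > 0),
]

survivors, report = run_funnel(candidates, steps)
for label, count in report:
    print(f"{label}: {count}")
```

Recording the count after every step, rather than only the final subset, is what lets a reader trace how a large starting set is narrowed to the high-confidence candidates.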

Validation.

We agree that orthogonal validation is important. We will include additional analyses of non-canonical proteoforms detected recurrently in additional tumor specimens to provide an empirical estimate of reliably detectable non-canonical proteoforms.

Supplementary Figure 5.

We will revise the presentation and explanation of this figure to avoid misinterpretation, including analyses focused specifically on non-canonical sequence segments and inclusion of tumor samples for direct comparison.

Reviewer #2

We thank the reviewer for placing this work in context with our prior ProteomeGenerator publications and for their guidance on framing the manuscript for a broad audience.

Emphasizing the central conceptual advance.

We agree that the primary innovation is the use of long-read transcriptomics to generate sample-specific proteogenomic databases. In the revised manuscript, we will directly compare long-read-derived and short-read-derived databases applied to the same samples and proteomic data, explicitly demonstrating where long-read sequencing enables discovery inaccessible to short-read approaches.

Manuscript reorganization.

We will substantially revise the manuscript to foreground the biological and conceptual consequences of long-read-enabled proteogenomics, using focused examples. Detailed descriptions of protease selection, fractionation, and acquisition optimization will be moved to supplementary methods, while retaining key conclusions about their impact on discovery.

Positioning of technical advances.

We will frame multi-protease and acquisition strategies as general principles required for unbiased proteoform discovery, rather than as static technical prescriptions, emphasizing their relevance across evolving proteomics platforms.

Overall Significance

In the revised manuscript, we will more clearly articulate that this work establishes long-read-informed, sample-specific proteogenomics as a discovery-grade framework, revealing cancer-specific proteoforms that are systematically invisible to reference-based and short-read-driven approaches, with broad implications for cancer biology and biomarker discovery.

We thank the editors and reviewers again for their constructive feedback, which we believe will substantially strengthen the clarity and broad impact of this work.
