On the discovered Cancer Driving Nucleotides (CDNs) –Distributions across genes, cancer types and patients

Lingjie Zhang; Tong Deng; Zhongqi Liufu; Xiangnyu Chen; Shijie Wu; Xueyu Liu; Changhao Shi; Bingjie Chen; Zheng Hu; Qichun Cai; Xuemei Lu; Chenli Liu; Mengfeng Li; Haijun Wen; Chung-I Wu

doi:10.7554/eLife.99341.1

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.

Reviewing Editor
Detlef Weigel
Max Planck Institute for Biology Tübingen, Tübingen, Germany
Senior Editor
Detlef Weigel
Max Planck Institute for Biology Tübingen, Tübingen, Germany

Reviewer #1 (Public Review):

The study investigates Cancer Driving Nucleotides (CDNs) using the TCGA database, finding that these recurring point mutations could greatly enhance our understanding of cancer genomics and improve personalized treatment strategies. Despite identifying 50-150 CDNs per cancer type, the research reveals that a significant number remain undiscovered, limiting current therapeutic applications, and underscoring the need for further larger-scale research.

Strengths:

The study provides a detailed examination of cancer-driving mutations at the nucleotide level, offering a more precise understanding than traditional gene-level analyses. The authors found a significant number of CDNs remain undiscovered, with only 0-2 identified per patient out of an expected 5-8, indicating that many important mutations are still missing. The study indicated that identifying more CDNs could potentially significantly impact the development of personalized cancer therapies, improving patient outcomes.

Weaknesses:

The study is constrained by relatively small sample sizes for each cancer type, which reduces the statistical power and robustness of the findings. ICGC and other large-scale WGS datasets are publicly available but were not included in this study.

To be able to identify rare driver mutations, more samples are needed to improve the statistical power, which is well-known in cancer research.

The challenges in direct functional testing of CDNs due to the complexity of tumor evolution and unknown mutation combinations limit the practical applicability of the findings.

The QC of the TCGA data was not very strict, i.e, "patients with more than 3000 coding region point mutations were filtered out as potential hypermutator phenotypes", it would be better to remove patients beyond +/- 3*S.D from the mean number of mutations for each cancer type. Given some point mutations with >3 hits in the TCGA dataset, they were just false positive mutation callings, particularly in the large repeat regions in the human genome.

The codes for the statistical calculation (i.e., calculation of Ai_e, et al) are not publicly available, which makes the findings hard to be replicated.

https://doi.org/10.7554/eLife.99341.1.sa2

Reviewer #2 (Public Review):

Summary:

The study proposes that many cancer driver mutations are not yet identified but could be identified if they harbor recurrent SNVs. The paper leverages the analysis from Paper #1 that used quantitative analysis to demonstrate that SNVs or CDNs seen 3 or more times are more likely to occur due to selection (ie a driver mutation) than they are to occur by chance or random mutation.

Strengths:

Empirically, mutation frequency is an excellent marker of a driver gene because canonical driver mutations typically have recurrent SNVs. Using the TCGA database, the paper illustrates that CDNs can identify canonical driver mutations (Figure 3) and that most CDNs are likely to disrupt protein function (Figure 2). In addition, CDNs can be shared between cancer types (Figure 4).

Weaknesses:

Driver alteration validation is difficult, with disagreements on what defines a driver mutation, and how many driver mutations are present in a cancer. The value proposed by the authors is that the identification of all driver genes can facilitate the design of patient-specific targeting therapies, but most targeted therapies are already directed towards known driver genes. There is an incomplete discussion of oncogenes (where activating mutations tend to target a single amino acid or repeat) and tumor suppressor genes (where inactivating mutations may be more spread across the gene). Other alterations (epigenetic, indels, translocations, CNVs) would be missed by this type of analysis.

The method could be more valuable when applied to the noncoding genome, where driver mutations in promoters or enhancers are relatively rare, or as yet to be discovered. Increasingly more cancers have had whole genome sequencing. Compared to WES, criteria for driver mutations in noncoding regions are less clear, and this method could potentially provide new noncoding driver CDNs. Observing the same mutation in more than one cancer specimen is empirically unusual, and the authors provide a solid quantitative analysis that indicates many recurrent mutations are likely to be cancer-driver mutations.

https://doi.org/10.7554/eLife.99341.1.sa1

Author response:

We are grateful to the reviewers and editors for their insightful comments. All recognized that, while mutation recurrences have been used for inferring cancer drivers, our approach has the rigor of quantitative analysis. We would like to add that, without rigorously ruling out mutational hotspots, most CDNs have not been accepted as driver mutations.

This paper develops the theory stating that (i) recurrent point mutations are true Cancer Driving Nucleotides (CDNs); and (ii) non-recurrent mutations are unlikely to be CDNs. The reviewers question that, with the theory, we still have not discovered new driving mutations. This is done in the companion paper. Table 3 shows that, averaged across cancer types, the conventional method would identify 45 CDGs while the CDN method tallies 258 CDGs. The power of the CDN method in identifying new driver genes is evident.

The second question is "By this theory, will we be able discover most CDNs when the sample size increases from ~ 1000 to 10,000?" This is a question of forecast and can be partially answered using GENIE data. Fig. 7 of this study shows that, when n increases from ~ 1000 to ~ 9,000, the numbers of discovered CDNs increase by 3 – 5 fold, most of which come from the two-hit class, as expected.

Fig. 7 also addresses the queries whether we have used datasets other than TCGA. We indeed have used all public data, including GENIE, ICGC and other integrated resources such as COSMIC. For the main study, we rely on TCGA because it is unbiased for estimating the probability of CDN occurrences. In many datasets, the numerators are given but the denominators are not (the number of patients with the mutation / the total number of patients surveyed).

The third question is about mutation recurrences among cancer types. As stated by one reviewer, "different cancer types have unique mutational landscapes". While this is true when the analysis is done at the whole-gene level, one gets a different picture at the nucleotide level where the resolution is much higher. The pan-cancer trend of point mutations is evident in Fig. 4 of the companion paper.

Again, we heartily appreciate the criticisms and suggestions of the reviewers and editors!

https://doi.org/10.7554/eLife.99341.1.sa0

On the discovered Cancer Driving Nucleotides (CDNs) –Distributions across genes, cancer types and patients

Peer review process

Editors

Be the first to read new articles from eLife