The theory of massively repeated evolution and full identifications of Cancer Driving Nucleotides (CDNs)

  1. State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
  2. State Key Laboratory of Genetic Resources and Evolution/Yunnan Key Laboratory of Biodiversity Information, Kunming Institute of Zoology, The Chinese Academy of Sciences, Kunming, China
  3. GMU-GIBH Joint School of Life Sciences, Guangzhou Medical University, Guangzhou, China
  4. CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
  5. Department of Ecology and Evolution, University of Chicago, Chicago, USA

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.


  • Reviewing Editor
    Detlef Weigel
    Max Planck Institute for Biology Tübingen, Tübingen, Germany
  • Senior Editor
    Detlef Weigel
    Max Planck Institute for Biology Tübingen, Tübingen, Germany

Reviewer #1 (Public Review):

The authors developed a rigorous methodology for identifying all Cancer Driving Nucleotides (CDNs) by leveraging the concept of massively repeated evolution in cancer. By focusing on mutations that recur frequently across pan-cancer data, they aimed to differentiate true driver mutations from neutral mutations, ultimately enhancing the understanding of the mutational landscape that drives tumorigenesis. Their goal was to compile a comprehensive catalogue of CDNs to inform more effective targeted therapies and address issues such as drug resistance.


Strengths:

(1) The authors introduced the concept of using massively repeated evolution to identify CDNs. This approach recognizes that advantageous mutations recur frequently (at least 3 times) across cancer patients, providing a lens to identify true cancer drivers.

(2) The theory showed the feasibility of identifying almost all CDNs if the number of sequenced patients increases to 100,000 for each cancer type.
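The recurrence criterion described above can be illustrated with a minimal sketch. This is not the authors' pipeline. The input layout (pairs of patient identifier and site label), the function name `recurrent_sites`, and the example site labels are all assumptions made for illustration; only the threshold of 3 recurrences comes from the review.

```python
from collections import Counter

def recurrent_sites(mutations, min_recurrence=3):
    """Return {site: i} for sites mutated in at least `min_recurrence` patients.

    `mutations` is an iterable of (patient_id, site) pairs (hypothetical layout).
    A patient is counted at most once per site.
    """
    unique_pairs = set(mutations)  # drop duplicate calls within one patient
    counts = Counter(site for _, site in unique_pairs)
    return {site: i for site, i in counts.items() if i >= min_recurrence}

# Hypothetical toy input: one site seen in three patients, one seen in two.
example = [
    ("p1", "chr1:100"), ("p2", "chr1:100"), ("p3", "chr1:100"),
    ("p3", "chr1:100"),  # duplicate call in the same patient
    ("p1", "chr2:5"), ("p2", "chr2:5"),
]
print(recurrent_sites(example))
```

Counting each patient once per site keeps the recurrence number i a count of independent tumors rather than of repeated calls.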


Weaknesses:

(1) The methodology remains theoretical, and no novel driver mutations were identified in this study.

(2) Different cancer types have unique mutational landscapes. The methodology, while robust, might face challenges in uniformly identifying CDNs across various cancers with distinct genetic and epigenetic contexts.

(3) Line 223 states: "In other words, the sequences surrounding the high-recurrence sites appear rather random." Because this was a pan-cancer analysis, the unique patterns of each cancer type could be strongly diluted in the pooled data.

(4) To solidify the findings, the results need to be replicated in an independent dataset.

(5) The key scripts and the list of key results (i.e., CDN sites with i ≥ 3) need to be shared to enable replication, validation, and further research. So far, only CDN sites with i ≥ 20 have been shared.

(6) The versions of data used in this study are not clearly detailed, such as the specific version of gnomAD and the version and date of TCGA data downloaded from the GDC Data Portal.

Reviewer #2 (Public Review):


The authors propose that cancer driver mutations can be identified as Cancer Driving Nucleotides (CDNs). CDNs are defined as SNVs that recur frequently at the same site in genes. There are many ways to define cancer driver mutations; both the strength and the weakness of this approach lie in its reliance on statistics to define them.


There are many well-known approaches and studies that have already identified many canonical driver mutations. A potential strength is that mutation frequencies may be able to identify as yet unrecognized driver mutations. The authors use a previously developed method to estimate mutation hotspots across the genome (Dig, Sherman et al. 2022). That publication has already used cancer sequence data to infer driver mutations based on higher-than-expected mutation frequencies. The advance here is to further illustrate that recurrent mutations (3 or more mutations at the same base, termed CDNs) are more likely to be the result of selection for a driver mutation (Figure 3). Further analysis indicates that mutation sequence context (Figure 4) or mutation mechanisms (Figure 5) are unlikely to be major causes of recurrent point mutations. Finally, they calculate (Figure 6) that most driver mutations identifiable by the CDN approach could be identified with about 100,000 to one million tumor coding genomes.
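The reasoning that three or more hits at one base are unlikely to arise by chance can be sketched with a simple neutral model. The sketch below assumes every site mutates independently at one uniform rate and that hit counts are Poisson distributed. This is a simplification and not necessarily the model used in the manuscript. The function names and all parameter values in the usage line are hypothetical.

```python
from math import exp, factorial

def poisson_tail(i_min, lam):
    """P(X >= i_min) for X ~ Poisson(lam), summed term by term.

    Summing the tail directly avoids the cancellation error of 1 - P(X < i_min)
    when the tail is tiny. Intended for moderate lam (exp(-lam) must not underflow).
    """
    if lam == 0:
        return 1.0 if i_min == 0 else 0.0
    term = exp(-lam) * lam ** i_min / factorial(i_min)
    total = 0.0
    k = i_min
    # Keep adding while terms are still rising (k <= lam) or still significant.
    while term > 0.0 and (k <= lam or term > 1e-17 * total):
        total += term
        k += 1
        term *= lam / k
    return total

def expected_neutral_sites(num_sites, n_patients, rate_per_site, i_min=3):
    """Expected number of sites with >= i_min hits under the uniform neutral model."""
    lam = n_patients * rate_per_site  # expected hits per site across all patients
    return num_sites * poisson_tail(i_min, lam)

# Hypothetical values chosen only to show the shape of the calculation.
print(expected_neutral_sites(num_sites=10**7, n_patients=10**4, rate_per_site=1e-7))
```

Under this kind of model the expected number of chance recurrences falls steeply as the threshold i rises, which is the intuition behind treating highly recurrent sites as candidates for selection.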


The manuscript does provide specific examples where recurrent mutations identify known driver mutations, but it does not identify "new" candidate driver mutations. Driver mutation validation is difficult and, at least clinically, frequency (i.e., observation in multiple other cancer samples) is indeed commonly used to judge whether an SNV has driver potential. The method would miss alternative ways to trigger driver alterations (translocations, indels, epigenetic changes, CNVs). Nevertheless, the value of the manuscript is its quantitative analysis of why mutation frequencies can identify cancer driver mutations.

Author response:

We are grateful to the reviewers and editors for their insightful comments. All recognized that, while mutation recurrences have been used for inferring cancer drivers, our approach has the rigor of quantitative analysis. We would like to add that, without rigorously ruling out mutational hotspots, most CDNs have not been accepted as driver mutations.

This paper develops the theory stating that i) recurrent point mutations are true Cancer Driving Nucleotides (CDNs); and ii) non-recurrent mutations are unlikely to be CDNs. The reviewers point out that, even with the theory, we have not yet discovered new driving mutations. That discovery is made in the companion paper. Table 3 shows that, averaged across cancer types, the conventional method would identify 45 CDGs while the CDN method tallies 258 CDGs. The power of the CDN method in identifying new driver genes is evident.

The second question is: "By this theory, will we be able to discover most CDNs when the sample size increases from ~1,000 to 10,000?" This is a question of forecasting and can be partially answered using GENIE data. Fig. 7 of this study shows that, when n increases from ~1,000 to ~9,000, the number of discovered CDNs increases 3- to 5-fold; most of the newly discovered CDNs come from the two-hit class, as expected.
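Why a larger n moves sites from the two-hit class past the recurrence threshold can be sketched with a binomial calculation. This is an illustration only, not the analysis behind Fig. 7. It assumes each patient independently carries a given CDN with a fixed prevalence p; the function name and the value p = 0.001 are hypothetical.

```python
from math import comb

def prob_at_least(i_min, n, p):
    """P(X >= i_min) for X ~ Binomial(n, p): chance a site is seen in >= i_min of n patients."""
    below = sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(i_min))
    return 1.0 - below

# Hypothetical prevalence of one CDN among patients of a cancer type.
p = 0.001
for n in (1000, 9000):
    print(n, prob_at_least(3, n, p))
```

With these assumed numbers, a site that rarely reaches 3 hits at n of about 1,000 almost always does so at n of about 9,000, which matches the qualitative trend described above.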

Fig. 7 also addresses the question of whether we have used datasets other than TCGA. We have indeed used all public data, including GENIE, ICGC and other integrated resources such as COSMIC. For the main study, we rely on TCGA because it is unbiased for estimating the probability of CDN occurrences. In many datasets, the numerator (the number of patients with the mutation) is given but the denominator (the total number of patients surveyed) is not.

The third question is about mutation recurrences among cancer types. As stated by one reviewer, "different cancer types have unique mutational landscapes". While this is true when the analysis is done at the whole-gene level, one gets a different picture at the nucleotide level where the resolution is much higher. The pan-cancer trend of point mutations is evident in Fig. 4 of the companion paper.

Again, we heartily appreciate the criticisms and suggestions of the reviewers and editors!
