Research Article

The theory of massively repeated evolution and full identifications of cancer-driving nucleotides (CDNs)

State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, China
State Key Laboratory of Genetic Resources and Evolution/Yunnan Key Laboratory of Biodiversity Information, Kunming Institute of Zoology, The Chinese Academy of Sciences, China
GMU-GIBH Joint School of Life Sciences, Guangzhou Medical University, China
CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Institute of Advanced Technology, Chinese Academy of Sciences, China
Innovation Center for Evolutionary Synthetic Biology, Sun Yat-sen University, China
Department of Ecology and Evolution, University of Chicago, United States

Dec 17, 2024

https://doi.org/10.7554/eLife.99340.3

Open access

eLife Assessment

This important paper introduces a theoretical framework and methodology for identifying Cancer Driving Nucleotides (CDNs), primarily based on single nucleotide variant (SNV) frequencies. A variety of solid approaches indicate that a mutation recurring three or more times is more likely to reflect selection rather than being the consequence of a mutation hotspot. The method is rigorously quantitative, though the requirement for larger datasets to fully identify all CDNs remains a noted limitation. The work will be of broad interest to cancer geneticists and evolutionary biologists.

https://doi.org/10.7554/eLife.99340.3.sa0

Significance of the findings:

Important: Findings that have theoretical or practical implications beyond a single subfield

Strength of evidence:

Solid: Methods, data and analyses broadly support the claims with only minor weaknesses

During the peer-review process the editor and reviewers write an eLife Assessment that summarises the significance of the findings reported in the article (on a scale ranging from landmark to useful) and the strength of the evidence (on a scale ranging from exceptional to inadequate).

Abstract

Tumorigenesis, like most complex genetic traits, is driven by the joint actions of many mutations. At the nucleotide level, such mutations are cancer-driving nucleotides (CDNs). The full sets of CDNs are necessary, and perhaps even sufficient, for the understanding and treatment of each cancer patient. Currently, only a small fraction of CDNs is known as most mutations accrued in tumors are not drivers. We now develop the theory of CDNs on the basis that cancer evolution is massively repeated in millions of individuals. Hence, any advantageous mutation should recur frequently and, conversely, any mutation that does not is either a passenger or deleterious mutation. In the TCGA cancer database (sample size n=300–1000), point mutations may recur in i out of n patients. This study explores a wide range of mutation characteristics to determine the limit of recurrences (i^*) driven solely by neutral evolution. Since no neutral mutation can reach i^*=3, all mutations recurring at i≥3 are CDNs. The theory shows the feasibility of identifying almost all CDNs if n increases to 100,000 for each cancer type. At present, only <10% of CDNs have been identified. When the full sets of CDNs are identified, the evolutionary mechanism of tumorigenesis in each case can be known and, importantly, gene targeted therapy will be far more effective in treatment and robust against drug resistance.

Introduction

Cancers are complex genetic traits with multiple mutations that interact to yield the ensemble of tumor phenotypes. The ensemble has been characterized as ‘cancer hallmarks’ that include sustaining growth signaling, evading growth suppression, resisting apoptosis, achieving immortality, executing metastasis and so on (Hanahan and Weinberg, 2000; Hanahan and Weinberg, 2011; Hanahan, 2022). It seems likely that each of the 6–10 cancer hallmarks is governed by a set of mutations. Most, if not all, of these mutations are jointly needed to drive the tumorigenesis.

In the genetic sense, cancers do not differ fundamentally from other complex traits whereby multiple mutations are simultaneously needed to execute the program. A well-known example is the genetics of speciation whereby interspecific hybrids are either sterile or infertile even though they do not have deleterious genes (Wu and Ting, 2004; Wang et al., 2022; Wu, 2022). A recent example is SARS-CoV-2. The early onset of COVID-19 requires all four mutations of the D614G group and the later Delta strain has 31 mutations accrued in three batches (Ruan et al., 2022b; Ruan et al., 2022a; Cao et al., 2023; Ruan et al., 2023). While cancer research has often proceeded one mutation at a time, each of the mutations has been shown to be insufficient for tumorigenesis until many (Ortmann et al., 2015; Takeda, 2021; Hodis et al., 2022) are co-introduced.

We now aim for the identification of all (or at least most) of the driver mutations in each patient. Both functional tests and treatments demand such identifications. The number of key drivers has been variously estimated to be 6–10 (Martincorena et al., 2017; Anandakrishnan et al., 2019; ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium, 2020). Although cancer driving ‘point mutations’, referred to as Cancer Driving Nucleotides (or CDNs), are not the only drivers, they are indeed abundant (ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium, 2020). Furthermore, CDNs, being easily quantifiable, may be the only type of drivers that can be fully identified (see below). Here, we will focus on the clonal mutations present in all cells of the tumor without considering within-tumor heterogeneity for now (Ling et al., 2015; Turajlic et al., 2019; Black and McGranahan, 2021; Chen et al., 2022a; Zhai et al., 2022; Bian et al., 2023; Zhu et al., 2023).

Since somatic evolution proceeds in parallel in millions of humans, point mutations can recur multiple times as shown in Figure 1. The recurrences should permit the detection of advantageous mutations with unprecedented power. The converse should also be true that mutations that do not recur frequently are unlikely to be advantageous. Figure 1 depicts organismal evolution and cancer evolution. While both panels A and B show 7 mutations, there can be two patterns for cancer evolution - the pattern in (C) where all mutations are at different sites is similar to the results of organismal evolution whereas the pattern in (D) is unique in cancer evolution. The hotspot of recurrences shown in (D) holds the key to finding all CDNs in cancers.

Figure 1

Two modes of DNA sequence evolution.

(A) A hypothetical example of DNA sequences in organismal evolution. (B) Cancer evolution that experiences the same number of mutations as in (A) but with many short branches. (C) A common pattern of sequence variation in cancer evolution. (D) In cancer evolution, the same mutation at the same site may occasionally be seen in multiple sequences. The recurrent sites could be either mutational or functional hotspots, their distinction being the main objective of this study.

In the literature, hotspots of recurrent mutations have been commonly reported (Gartner et al., 2013; Chang et al., 2016; Cannataro et al., 2018; Buisson et al., 2019; Hess et al., 2019; Stobbe et al., 2019; Juul et al., 2021; Nesta et al., 2021; Zhao et al., 2021; Bergstrom et al., 2022; Sherman et al., 2022; Wong et al., 2022; Zeng and Bromberg, 2022). A hotspot, however, could be either a mutational or functional hotspot. Mutational hotspots are the properties of the mutation machinery that would include nucleotide composition, local chromatin structure, timing of replication, etc. (Stamatoyannopoulos et al., 2009; Pleasance et al., 2010; Makova and Hardison, 2015; Polak et al., 2015; Martincorena et al., 2017). In contrast, functional hotspots are CDNs under positive selection. CDN evolution is akin to ‘convergent evolution’ that repeats itself in different taxa (He et al., 2020a; He et al., 2020b; Wu et al., 2020; Pan et al., 2022b; Wu, 2023) and is generally considered the most convincing proof of positive selection.

While many studies conclude that sites of mutation recurrence are largely mutational hotspots (Buisson et al., 2019; Hess et al., 2019; Stobbe et al., 2019; Nesta et al., 2021; Bergstrom et al., 2022), others deem them functional hotspots, driven by positive selection (Gartner et al., 2013; Chang et al., 2016; Bailey et al., 2018; Cannataro et al., 2018; Juul et al., 2021; Zhao et al., 2021; Zeng and Bromberg, 2022). In the attempt to distinguish between these two hypotheses, studies make assumptions, often implicitly, about the relative importance of the two mechanisms in their estimation procedures. The conclusions naturally manifest the assumptions made when extracting information on mutation and selection from the same data (Elliott and Larsson, 2021).

This study consists of three parts. First, the mutational characteristics of sequences surrounding CDNs are analyzed. Second, a rigorous probability model is developed to compute the recurrence level at any sample size. Above a threshold of recurrence, all mutations are CDNs. Third, we determine the necessary sample sizes that will yield most, if not all, CDNs. In the companion study, the current cancer genomic data are analyzed for the characteristics of CDNs that have already been discovered (Zhang et al., 2024). Together, these two studies show how full functional tests and precise target therapy can be done on each cancer patient.

Results

In PART I, we search for the general mutation characteristics at and near high recurrence sites by both machine learning and extensive sequence comparisons. In PART II, we develop the mathematical theory for the maximal level of recurrences of neutral mutations (designated i^*). CDNs are thus defined as mutations with ≥i^* recurrences in n cancer samples. We then expand in PART III the theory to very large sample sizes (n≥10⁵), thus making it possible to identify all CDNs.

To carry out the analyses, we first compile from the TCGA database the statistics of multiple hit sites (i hits in n samples) in 12 cancer types. This study focuses on the mutational characteristics pertaining to recurrence sites. We often present the three cancer types with the largest sample sizes (lung, breast, and CNS cancers), while many analyses are based on pan-cancer data. Analyses of each of the 12 cancer types individually will be done in the companion paper. The TCGA database is used as it is well established and covers the entire coding region (Cancer Genome Atlas Research Network et al., 2013). Other larger databases (Cerami et al., 2012; Tate et al., 2019; de Bruijn et al., 2023) are employed when the whole exon analyses are not crucial.

The compilation of multi-hit sites across all genes in the genome

Throughout the study, S_i denotes the number of synonymous sites where the same nucleotide mutation occurs in i samples among n patients. A_i is the equivalent of S_i for non-synonymous (amino acid altering) sites. Table 1 presents the numbers from lung cancer for demonstration. It also shows the A_i and S_i numbers with CpG sites filtered out. There are 22.5 million nonsynonymous sites, among which ~0.2 million sites have one hit (A₁=195,958) in 1035 patients. The number then decreases sharply as i increases. Thus, A₂=2946 (number of 2-hit sites), A₃=99, and A₄ +A₅ +...=79. We also note that the A_i/S_i ratio rises with i: it is 2.89, 2.82 and 3.04 for i=0, 1 and 2, then 4.71 at i=3, and so on.

Table 1

An example of A_i and S_i (from lung cancer, n=1035).

	All sites			CpG sites removed
i	A_i	S_i	A_i / S_i	A_i	S_i	A_i / S_i
0	22540623	7804281	2.89	21375384	7014012	3.04
1	195958	69393	2.82	168371	56821	2.96
2	2946	969	3.04	2188	643	3.4
3	99	21	4.71	68	16	4.25
4	23	1	23	17	1	17
5	16	0	16 : 0	9	0	9 : 0
6	10	0	10 : 0	6	0	6 : 0
7	5	0	5 : 0	5	0	5 : 0
8	8	0	8 : 0	6	0	6 : 0
9	4	0	4 : 0	3	0	3 : 0
≥3	178	22	8.09	122	17	7.18
≥4	79	1	79	54	1	54
[10-20]	7	1	7	4	0	4 : 0
≥20	6	0	6 : 0	4	0	4 : 0

Note: The ratio A_i/S_i is provided as a measure of selection strength.

Figure 2 shows the average of A_i and S_i among the 12 cancer types (see Methods). The salient features are shown by differences between the solid and dotted lines. As will be detailed in PART II, the dotted lines, extending linearly from i=0 to i=1 in logarithmic scale, should be the expected values of A_i and S_i if mutation rate is the sole driving force. In the actual data, A₁ and S₁ decrease to ~0.002 of A₀ and S₀, this step being the least affected by selection (see PART II later). For A₂ and S₂, the decrease is only to ~0.01 of A₁ and S₁. The decrease from i to i+1 becomes smaller and smaller as i increases, suggesting that the process is not entirely neutral. Furthermore, the lower panel of Figure 2 shows that A_i/S_i continues to rise as i increases. These patterns again suggest a stronger positive selection at higher i values. The extrapolation lines shown in Figure 2 roughly define i=3 as a cutoff where the expected A₃ falls below 1 (see PART II for details). The precise model of PART II will define high recurrence sites (i≥3) as CDNs.

Figure 2

The average A_i and S_i values across different i ranges (X-axis).

(Top) The average of A_i and S_i in the log scale. Color lines: full data; gray lines: CpG sites removed. The dashed lines are linear extrapolations. (Bottom) The A_i/S_i ratio as a function of i. The drop of the A_i/S_i ratio at i = 8 and 9 is due to the potential synonymous CDNs, see Supplementary file 1.

PART I - The mutational characteristic of high recurrence sites

In this part, the analyses are done in two different ways. The sequence-feature approach is to examine the mutation characteristics of sequence features (say, 3-mers, 5-mers, etc.) across patients. The patient-feature approach is to examine patients for their mutation signatures and mutation loads.

The sequence-feature approach

The simplest and best-known sequence feature associated with high mutation rate is CpG sites. In mammals, methylation and de-amination would enhance the mutation rate from CpG to TpG or CpA by five- to tenfold (Hodgkinson and Eyre-Walker, 2011; Ségurel et al., 2014). As the CpG site mutagenesis has been extensively reported, we only present the confirmation in the Supplement (Figure 2—figure supplement 1). Indeed, CpG sites account for ~6.5% of the coding sequences but contribute ~22% of the mutated sites in Figure 2. Hence, the sevenfold increase in the CpG mutation rate should contribute more to A_i and S_i as i increases. Table 1 has shown the effects of filtering out CpG sites in the counts of recurrences. Clearly, CpG sites do contribute disproportionately to the recurrences but, even when they are separately analyzed, the conclusion is unchanged. As shown later in PART II, every increment of i should reduce the site number to ~0.002 of the preceding count in the TCGA database. Thus, even with a 10-fold increase in mutation rate, the reduction factor would still be 0.02. In the theory sections, CpG site mutations are incorporated into the model.

In this section, we aim to find out how extreme the mutation mechanisms must be to yield the observed recurrences. If these mechanisms seem implausible, we may reject the mutational-hotspot hypothesis and proceed to test the functional hotspot hypothesis.

The analyses of mutability variation by Artificial Intelligence (AI)

The variation of mutation rate at site level could be shaped by multiple mutational characteristics. Epigenomic features, such as chromatin structure and accessibility, could affect regional mutation rate at kilobase or even megabase scale (Stamatoyannopoulos et al., 2009; Makova and Hardison, 2015), while nucleotide biases by mutational processes typically span only a few base pairs around the mutated site (Roberts et al., 2013; Haradhvala et al., 2018; Herzog et al., 2021). AI-powered multi-modal integration offers a new tool to quantify the joint effect of various factors on mutation rate variability (Luo et al., 2019; Sherman et al., 2022; Song et al., 2023). Here we explore the association between the mutation recurrence (i) and site-level mutation rate predicted by AI.

Figure 3A shows the mutation rate landscape across all recurrence sites in breast, CNS and lung cancer using the deep learning framework Dig (Sherman et al., 2022). In this approach, the mutability of a focus site is calculated based on both local stretch of DNA and broader scale of epigenetic features. The X-axis shows all mutated sites with i>0 scanned by Dig. While the mutation rate fluctuates around the average level, we detect no significant difference in mutation rate as a function of i (Methods). In CNS, two sites exhibited exceptional mutability at i=6, surpassing the average by tenfold. Unsurprisingly, these two are CpG sites and correspond to amino acid change of V774M and R222C in EGFR (Supplementary file 1), which are canonical actionable driver mutations in glioma target therapy. In other words, the two sites called by AI for possible high mutability appear to be selection driven.

Figure 3

Site-level mutation rate variation obtained from Dig (Sherman et al., 2022), a published AI tool.

(A) Each dot represents the expected SNVs (Y-axis) at a site where missense mutations occurred i times in the corresponding cancer population. The boxplot shows the overall distribution of mutability at i, with the red dashed line denoting the average. There is no observable trend that sites of higher i are more mutable (The blank areas are due to the absence of CDNs with mutation recurrence counts of 8 or 9 in CNS cancer mutation data, see Supplementary file 1). (B) A detailed look at the coding region of *PAX3* gene in colon cancer. The expected mutability of sites in the 200 bp window is plotted. The three mutated sites in this window, marked by green and red (a CDN site) stars, are not particularly mutable. Overall, the mutation rate varies by about tenfold as is generally known for CpG sites.

In Figure 3B, we take a closer look at how a CDN is situated against the background of mutation rate variation, using the example of PAX3 (Paired Box Homeotic Gene 3; Wang et al., 2008; Li et al., 2019). In this typical example, Dig predicts site mutability to vary from site to site. In the lower panel is an expanded look at a stretch of 200 bps. In this stretch, about 8% of sites are five- to tenfold more mutable than the average. But none of them are mutated in the data (i=0). There are indeed three sites with i>0 in this DNA segment including a CDN site C1271T (marked by the red star). This CDN site is estimated to have a twofold elevation in mutability, which is less than 1/50 of the necessary mutability to reach i≥3. The other two mutated sites, marked by the green stars, are also indicated.

Other AI methods have also been used in the mutability analysis (Fang et al., 2022), reaching nearly the same conclusion. Overall, while AI often suggests sequence context to influence the local variation in mutation rate, the reported variation does not correspond to the distribution of CDNs. In the next subsection, we further explore the local contexts for potential biases in mutability.

The conventional analyses of local contexts - from 3-mers to 101-mers

Since the AI analyses suggest the dominant role of local sequence context in mutability, we carry out such conventional analyses in depth. Other than the CpG sites, local features such as the TCW (W=A or T) motif recognized by the APOBEC family of cytidine deaminases (Burns et al., 2013; Roberts et al., 2013) would have impacts as well. We first calculate the mutation rate for motifs of 3-mer, 5-mer and 7-mer, which number 64, 1024 and 16,384, respectively (see Methods). The pan-cancer analyses across the 12 cancer types are shown in Figure 4A. We use α to designate the fold change between the most mutable motif and the average. As the number of motifs increases 16-fold between each length class, the α value increases from 4.7 to 8.8 and then to 11.5. Nevertheless, even the most mutable 7-mer, TAACGCG, which has a CpG site at the center, is only 11.52-fold higher than the average. This spread is insufficient to account for the high recurrences, whose site counts fall to ~0.002 of the preceding count for each increment of i (see PART II below).
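As a minimal sketch of how α could be computed for one motif length, the snippet below tallies, for every k-mer centered on a site, how often it occurs and how many mutations fall on its central base. The function name `motif_alpha`, the input layout (a sequence string plus a position-to-count dictionary) and the choice of the occurrence-weighted mean as "the average" are illustrative assumptions, not the procedure described in Methods.

```python
from collections import Counter


def motif_alpha(sequence, mutated_positions, k=3):
    """Return (alpha, rates) for k-mers centered on each site.

    sequence          -- reference sequence as a string (hypothetical input)
    mutated_positions -- dict: 0-based position -> number of mutations seen there
    alpha             -- rate of the most mutable motif divided by the
                         occurrence-weighted mean rate (an assumed definition)
    """
    half = k // 2
    occurrences = Counter()  # how many times each motif occurs
    hits = Counter()         # mutations observed at the central base of each motif
    for center in range(half, len(sequence) - half):
        motif = sequence[center - half:center + half + 1]
        occurrences[motif] += 1
        hits[motif] += mutated_positions.get(center, 0)

    rates = {motif: hits[motif] / occurrences[motif] for motif in occurrences}
    mean_rate = sum(hits.values()) / sum(occurrences.values())
    if mean_rate == 0:
        raise ValueError("no mutations fall on any scored site")
    return max(rates.values()) / mean_rate, rates


# Toy input: three mutations on the central G of one "CGT" occurrence.
alpha, rates = motif_alpha("ACGTACGTAC", {2: 3}, k=3)
```

With real data the same tally would be run once per motif length (3, 5 and 7) over all coding sites.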

Figure 4

Conventional analyses of local contexts at recurrence sites.

(A) From top panel down: for the 64 (4³) 3-mer motifs, their mutation rates are shown on the X-axis. The ratio of the most mutable motif to the average mutability (α) is 4.69. For the 1024 (4⁵) 5-mer and 16,384 (4⁷) 7-mer motifs, the α values are, respectively, 8.79 and 11.52. The most mutable motifs, as expected, are dominated by CpG’s. (B) Each dot represents the motif surrounding a high-recurrence site. The recurrence number is shown on the X-axis and the mutability of the associated motif (mutations per 0.1 M) is shown on the Y-axis. The average mutation rate across all motifs of given length category is indicated by a red horizontal dashed line. The absence of a trend indicates that the high recurrence sites are not associated with the mutability of the motif. (C) The analysis is extended to longer motifs surrounding each CDN (21, 41, 61, 81, and 101 bp). For each length group, all pairwise comparisons are enumerated. The observed distributions (black bars and points) are compared to the expected Poisson distributions (red bars and curves) and no difference is observed. Thus, local sequences of CDNs do not show higher-than-expected similarity.

A more direct approach is given in Figure 4B, as explained in the legend. The absence of a trend shows that the high recurrence sites are not associated with the mutability of the motif. In Figure 4C, the analysis extends to longer motifs of 21 bp, 41 bp, 61 bp, 81 bp, and 101 bp surrounding the high-recurrence sites. For example, the motif of 101 bp may be (10, 90), (20, 80) and so on either side of a recurrence site (Figure 4—figure supplement 1). We then compute the pairwise differences in sequences of the motifs among recurrence sites. The logic is that, if certain motifs dictate high mutation rates, we may observe unusually high sequence similarity in the pairwise comparisons. As can be seen in all 5 panels of Figure 4C, there are no outliers in the tail of the distribution. In other words, the sequences surrounding the high-recurrence sites appear rather random. Detailed motif analysis of CDNs within individual cancer types using deep learning models (ResNet, LSTM and GRU) further supports this conclusion.
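The pairwise comparison can be sketched as a histogram of mismatch counts over all pairs of equal-length motifs. The helper name `pairwise_differences` and the use of a plain position-by-position mismatch count are assumptions made for illustration; the comparison against an expected distribution, as in Figure 4C, is not reproduced here.

```python
from collections import Counter
from itertools import combinations


def pairwise_differences(motifs):
    """Histogram of mismatch counts over all pairs of equal-length motifs.

    motifs -- list of sequence strings flanking recurrence sites
              (hypothetical input; all must share one length)
    """
    if len({len(motif) for motif in motifs}) > 1:
        raise ValueError("all motifs must have the same length")
    histogram = Counter()
    for first, second in combinations(motifs, 2):
        # Count positions at which the two motifs carry different bases.
        mismatches = sum(a != b for a, b in zip(first, second))
        histogram[mismatches] += 1
    return histogram


# Toy input with three 4-bp motifs, giving three pairwise comparisons.
histogram = pairwise_differences(["ACGT", "ACGA", "TTTT"])
```

An excess of pairs with very few mismatches would be the signal of a shared hypermutable motif; the text reports no such excess.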

In conclusion, the analyses by the sequence approach do not find any association between high recurrence sites (i≥3) and the mutability of the local sequences.

The patient-feature approach

In this second approach, we examine the mutation characteristics among patients across sequence features. The first question is whether high recurrence sites tend to happen in patients with higher mutation loads. Figure 5A depicts the distribution of mutation loads among patients harboring a CDN of recurrence i. Hence, a patient’s load may appear several times in the plot, each appearance corresponding to one CDN in the patient’s data. For the comparison across i values, the mutation load is normalized by a z-score within each cancer population to equalize the three cancer types. The overall trend shows consistently that patients with recurrence sites do not bias toward high mutation loads. The presence of recurrence sites in patients with low mutation loads suggests that overall mutation burden is not a determining factor of recurrence.
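The within-cancer normalization can be illustrated with a short sketch. It assumes raw per-patient mutation counts grouped by cancer type and a population standard deviation; whether the loads are transformed before normalization is not stated in this section, so none is applied. The function name `zscore_loads` is hypothetical.

```python
from statistics import mean, pstdev


def zscore_loads(loads_by_cancer):
    """Z-score mutation loads separately within each cancer type.

    loads_by_cancer -- dict: cancer type -> list of per-patient mutation loads
                       (hypothetical input layout)
    """
    normalized = {}
    for cancer, loads in loads_by_cancer.items():
        center = mean(loads)
        spread = pstdev(loads)  # population standard deviation (an assumption)
        if spread == 0:
            raise ValueError(f"no variation in mutation load for {cancer}")
        normalized[cancer] = [(load - center) / spread for load in loads]
    return normalized


# Toy loads: each cancer type is centered and scaled on its own.
scores = zscore_loads({"lung": [10, 20, 30], "breast": [1, 2, 3, 4]})
```

Because each cancer type is scaled on its own, a lung patient and a breast patient with the same z-score sit equally far from their respective cohort means.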

Figure 5

Patient level analysis for mutation load and mutational signatures.

(A) Boxplot depicting the distribution of mutation load among patients with recurrent mutations. The X-axis denotes the count of recurrent mutations, while the Y-axis depicts the normalized z-score of mutation load (see Methods). The green dashed line indicates the mean mutation load. In short, the mutation load does not influence the mutation recurrence among patients. (B) Signature analysis in patients with mutations of recurrences ≥i^* (X-axis). For lung cancer (left), the upper panel presents the number of patients for each group, while the lower panel depicts the relative contribution of mutational signatures. For breast cancer, APOBEC-related signatures (SBS2 and SBS13) are notably elevated in all groups of patients with i^*≥3, while patients with mutations of recurrence ≥ 20 in CNS cancer exhibit an increased exposure to SBS11 (Blough et al., 2011; Lin et al., 2021; Noeuveglise et al., 2023). Again, patients with higher mutation recurrences do not differ in their mutation signatures.

With the results of Figure 5A, we then ask a related question: whether these high recurrence mutations are driven by factors that affect mutation characteristics. Such influences have been captured by the analyses of ‘mutational signatures’ (Alexandrov et al., 2013; Alexandrov et al., 2020). Each signature represents a distinct mutation pattern (e.g. high rate of TCT ->TAT and other tri-nucleotide changes) associated with a known factor, such as smoking or an aberrant mutator (aristolochic acid, for example). Each patient’s mutation profile can then be summarized by the composite of multiple mutational signatures.

The issue is thus whether a patient’s CDNs can be explained by the patient’s composite mutational signatures. Figure 5B reveals that in lung cancer, the signature compositions among patients with different recurrence cutoffs are statistically indistinguishable (Methods). Smoking (signature SBS4) consistently emerges as the predominant mutational process across all levels of recurrences. In breast cancer, while SBS2 and SBS13 exhibit some differences in the bins of i^*=2 and i^*=3, the profiles remain rather constant for all bins of i^*≥3. The two lowest bins, not unexpectedly, are also different from the rest in the total mutation load (see Figure 5A). In Appendix 1—table 1, we provide a comprehensive review of supporting literature on genes with recurrence sites of i≥3 for breast cancer. In CNS cancer, SBS11 appears significantly different across bins, in particular, i≥20. This is a signature associated with Temozolomide treatment and should be considered a secondary effect. In short, while there are occasional differences in mutational signatures across i^* bins, none of such differences can account for the recurrences (see PART II).

To conclude PART I, the high-recurrence sites do not stand out for their mutation characteristics. Therefore, the variation in mutation rate across the whole genome can reasonably be approximated mathematically by a continuous distribution, as will be done below.

PART II - The theory of CDNs

We now develop the theory for S_i and A_i under neutrality where i is the recurrence of the mutation at each site. We investigate the maximal level of neutral mutation recurrences (i^*), at and above which the expected values of S_i and A_i are both <1. Since no neutral mutations are expected to reach i^*, every mutation with the recurrence of i^* or larger should be non-neutral. Importantly, given that the expected S_i and A_i are functions of Uⁱ, where U=nE(u) is in the order of 10^–3 to 10^–2, i^* is insensitive to a wide range of mutation scenarios. For that reason, the conclusion is robust.

The mutation rate of each nucleotide (u) follows a Gamma distribution with a scale parameter θ and a shape parameter k. Gamma distribution is often used for its flexibility and, in this context, models the waiting time required to accumulate k mutations. Its mean (=kθ) and variance (=kθ²) are determined by both parameters but the shape (skewness and kurtosis) is determined only by k. In particular, the Gamma distribution has a long tail suited to modeling a small number of sites with very high mutation rate.

We now use synonymous (S_i) mutations as the proxy for neutrality. Hence, in n samples,

S_{i} = \sum_{l = 1}^{L_{S}} C_{n}^{i} {[u (l)]}^{i} {[1 - u (l)]}^{n - i} \sim C_{n}^{i} L_{S} E (u^{i})

where L_S is the total number of synonymous sites and u(l) is the mutation rate of site l. In the approximation on the right, the term ${[1 - u (l)]}^{n - i}$ is dropped. We note that ${[1 - u (l)]}^{n - i} \approx e^{- u (l) (n - i)} \sim 1$, as u is in the order of 10^–6 and n is in the order of 10² from the TCGA data. With the gamma distribution of u, whereby the i^th moment is given by

E (u^{i}) = \frac{Γ (k + i)}{Γ (k)} θ^{i} = \frac{Γ (k + i)}{Γ (k) \cdot k^{i}} {[E (u)]}^{i}

we obtain:

S_{i} = L_{S} \cdot g (i, k) {[n E (u)]}^{i}

where:

g (i, k) = C_{k + i - 1}^{i} \cdot \frac{1}{k^{i}}
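The step from the moment formula to g(i, k) can be spelled out as follows. This is a sketch that assumes i ≪ n, so that C_n^i ≈ n^i/i!, and reads the binomial coefficient through Gamma functions when k is not an integer; both points are our reading rather than statements in the text.

```latex
\begin{aligned}
S_{i} &\sim C_{n}^{i} L_{S} E (u^{i})
       \approx \frac{n^{i}}{i!} \cdot L_{S} \cdot \frac{Γ (k + i)}{Γ (k) \cdot k^{i}} {[E (u)]}^{i} \\
      &= L_{S} \cdot \frac{Γ (k + i)}{i! \, Γ (k)} \cdot \frac{1}{k^{i}} \cdot {[n E (u)]}^{i}
       = L_{S} \cdot C_{k + i - 1}^{i} \cdot \frac{1}{k^{i}} \cdot {[n E (u)]}^{i}
\end{aligned}
```

For integer k the last equality is the identity Γ(k+i)/[i! Γ(k)] = (k+i−1)!/[i! (k−1)!].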

In a condensed form,

S_{i} = G \cdot {[n E (u)]}^{i}

where $G = L_{S} \cdot g (i, k)$. Similarly, if nonsynonymous mutations are assumed neutral, then

A_{i} = L_{A} \cdot g (i, k) {[n E (u)]}^{i} \sim 2.3 \cdot S_{i}

The number of 2.3 is roughly the ratio of the total number of nonsynonymous over that of synonymous sites (Hartl and Clark, 1989; Li, 1997; Chen et al., 2019). This number would vary moderately among cancers depending on their nucleotide substitution patterns.

E(u) of Equations 2 and 3 is generally (1~5)×10^–6 per site in cancer genomic data and n is generally between 300 and 1000. Hence, nE(u) is the total mutation rate summed over all n patients and is generally between 0.001 and 0.005 in the TCGA data. Given nE(u) is in the order of 10^–3, S_i and A_i would both decrease by 2~3 orders of magnitude with each increment of i by 1. We note that the total number of synonymous sites, L_S, is ~0.9 × 10⁷ and L_A is ~2.3 times larger. Therefore, S₃ <1 and A₃ <1. When i reaches 4, S_{i ≥4} and A_{i ≥4} would both be ≪ 1 when averaged over cancer types.
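The orders of magnitude quoted above can be checked numerically with a short sketch of S_i = L_S · g(i, k) · [nE(u)]^i. The values L_S = 0.9 × 10⁷ and n = 1000 come from the text, E(u) = 3 × 10^–6 is picked from the stated range, and the shape parameter k = 1 is purely an illustrative assumption; the function names are hypothetical.

```python
from math import factorial, gamma


def g(i, k):
    """g(i, k) = C(k+i-1, i) / k**i, via Gamma functions so k may be non-integer."""
    return gamma(k + i) / (gamma(k) * factorial(i) * k**i)


def expected_sites(i, total_sites, n, mean_rate, k):
    """Expected number of neutral sites hit i times: L * g(i, k) * [n E(u)]**i."""
    return total_sites * g(i, k) * (n * mean_rate) ** i


L_S = 0.9e7        # synonymous sites, as given in the text
N_PATIENTS = 1000  # upper end of the TCGA sample sizes
MEAN_RATE = 3e-6   # E(u), chosen within the stated (1~5) x 10^-6 range
SHAPE_K = 1.0      # Gamma shape parameter: an assumed value for illustration

for i in range(1, 5):
    print(i, expected_sites(i, L_S, N_PATIENTS, MEAN_RATE, SHAPE_K))
```

Under these assumed inputs, each increment of i multiplies the expectation by nE(u) = 0.003, and the expected S₃ falls below 1.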

For each cancer type, the conclusion of S₃ <1 and A₃ <1 is valid with the actual value of S₃ and A₃ ranging between 0.01 and 1. In other words, with n<1000, neutral mutations are unlikely to recur 3 times or more in the TCGA data (i^*=3). While S_{i ≥3} and A_{i ≥3} sites are high-confidence CDNs, the value of i^* is a function of n. At n≤1,000 for the TCGA data, i^* should be 3 but, when n reaches 10,000, i^* will be 6. The benefits of large n’s will be explored in PART III.

Possible outliers to the distribution of mutation rate

Although we have explored extensively the sequence contexts, other features beyond DNA sequences could still lead to outliers to the mathematical distribution. These features may include DNA stem loops (Buisson et al., 2019) or unusual epigenetic features (Zheng et al., 2014; Makova and Hardison, 2015; Supek and Lehner, 2015). We therefore expand the model by assuming a small fraction of sites (p) to be hyper-mutable, that is, α-fold more mutable than the genomic mean. Most likely, α and p are inversely related. For the bulk of sites (1 − p) of the genome, we assume that their mutations follow the Gamma distribution. (Nevertheless, the bulk can be assumed to have a fixed mutation rate of E(u) without affecting the conclusion qualitatively.)

We let p range from 10^–2 to 10^–5 and α up to 1000. As no stretches of DNA show such unusually high mutation rate (1000-fold higher than the average), such sites are assumed to be scattered across the genome and are rare. With the parameter space defined above, we choose the (p, α) pairs that agree with the observed values of S₁ to S₃, which are sufficiently large for estimations. Table 2 presents the value range and standard deviation for p and α across the six cancer types that have >500 patient samples. Among the six cancer types, the lung cancer data do not conform to the constraints and we set p=0. With observed values for S₁ to S₃ as constraints, S₄ rarely exceeds 1. Hence, even with the purely conjectured existence of outliers in mutation rate, i^*=4 is already too high a cutoff.
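To make the two-class idea concrete, the sketch below computes the expected number of sites with i hits when a fraction p of sites mutates at α times the genomic mean. It is not the fitted model behind Table 2: it assumes a fixed rate for the bulk of sites (the simplification the text says is qualitatively harmless), rescales that bulk rate so the genomic mean stays at E(u), and uses the full binomial term because n·α·E(u) is no longer small. All names and the numerical inputs are hypothetical.

```python
from math import comb


def binomial_pmf(i, n, rate):
    """Probability that a site with per-patient rate `rate` is hit in i of n patients."""
    return comb(n, i) * rate**i * (1 - rate) ** (n - i)


def expected_sites_with_outliers(i, total_sites, n, mean_rate, p, alpha):
    """Expected number of i-hit sites in a two-class (bulk + outlier) model.

    p     -- assumed fraction of hyper-mutable sites
    alpha -- assumed fold elevation of their rate over the genomic mean
    """
    if p * alpha >= 1:
        raise ValueError("p * alpha must stay below 1 to keep the bulk rate positive")
    outlier_rate = alpha * mean_rate
    # Bulk rate rescaled so the genome-wide mean remains mean_rate (an assumption).
    bulk_rate = mean_rate * (1 - p * alpha) / (1 - p)
    return total_sites * (
        (1 - p) * binomial_pmf(i, n, bulk_rate) + p * binomial_pmf(i, n, outlier_rate)
    )


# Illustrative inputs only: compare S_4 with and without a hyper-mutable class.
baseline = expected_sites_with_outliers(4, 0.9e7, 1000, 3e-6, p=0.0, alpha=1.0)
with_outliers = expected_sites_with_outliers(4, 0.9e7, 1000, 3e-6, p=1e-4, alpha=100.0)
```

Scanning such a function over a grid of (p, α) values and keeping the pairs that reproduce the observed S₁ to S₃ would be one way to mimic the selection of parameter pairs described above.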

Table 2

Summary for modeling outlier sites in six cancer types.

Cancer Type	S₃	p	α	S₄	S₅
Lung*	--	0.0	--	--	--
Breast	0.12	8.75E-04 (8.21E-04)	88.6 (32.0)	0.102 (0.068)	0.004 (0.004)
CNS	0.02	2.73E-04 (1.09E-04)	295.1 (57.0)	0.448 (0.173)	0.026 (0.015)
Kidney	0.03	3.03E-05 (2.98E-05)	304.1 (108.0)	0.067 (0.056)	0.005 (0.006)
Upper-AD tract	0.47	0.002 (0.001)	48.9 (10.7)	0.174 (0.078)	0.005 (0.003)
Large intestine	1.03	0.009 (0.001)	51.6 (1.4)	0.998 (0.087)	0.026 (0.003)

Note – For each cancer type, p stands for the proportion of highly mutable sites, with mutation rate being α-fold of the average. S₃ gives the expected number without mutable outliers (p=0). S₄ and S₅ denote the expected number with the best (p, α) pairs with the standard deviation in parentheses. For lung cancer, S₂ and S₃ do not fit the outlier model (Table 2—source data 1); therefore, we set p=0.

Table 2—source data 1

The outlier model parameters and expected Si values for 6 cancer types analyzed.

‘pMinor’ and ‘alpha’ correspond to (p, α) as described in the main text. ‘Eu’ represents the average mutation rate per site per patient for the given cancer type (‘ccType’). ‘s2Expt’, ‘s3Expt’, ‘4’, ‘s5’ and ‘s6’ are the expected S_i values for i = 2, 3, 4, 5, 6, respectively. ‘s2Obsv’ and ‘s3Obsv’ represent the observed values of S₂ and S₃. https://cdn.elifesciences.org/articles/99340/elife-99340-table2-data1-v1.zip

Table 2 suggests that p has to be smaller than 10^–5 and α>1,000 to yield S₄ >1. Since the coding region has 3×10⁷ sites, p<10^–5 would mean that the outliers are at most in the low hundreds. In other words, the number of high recurrence sites projected by the theory is close to the observed numbers. Therefore, there are really no unknown outlier sites of high mutation rate. Positive selection would be a more straightforward explanation, as explained below.

The influence of selection on mutation occurrences

We now show that, although the mutational bias alone cannot account for the high occurrences, selection can easily do so. We assume a fraction, f, of A_i’s to be under positive selection. The fraction should be small, probably ≪ 0.01, and will be labeled $A_{i}^{*}$. The rest, labeled A_i, is considered neutral. Hence, A_i is proportional to $[n E (u)]^{i}$ and A_i/S_i =L_A/L_S ~2.3. Like S_{i ≥4}, A_{i ≥4} ≪ 1. In contrast,

A_{i}^{*} = G^{*} [w \cdot n E (u)]^{i}

\frac{A_{i}^{*}}{S_{i}} = \frac{G^{*}}{G} w^{i}

where G^* is also a constant but its value depends on f. The crucial parameter, w (=2 Ns), is the selective advantage (s) scaled by the population size of progenitor cancer cells (N). Since w can easily be >10, even at i=3, $A_{3}^{*} / S_{3}$ would be >100 times as large as $A_{1}^{*} / S_{1}$. In other words, observed mutation recurrences at i≥3 for advantageous mutations should not be uncommon. Equation 4 also shows that w and E(u) jointly affect the recurrence; therefore, CpG sites (many of which fall in functional sites) are expected to be strongly represented among high recurrence sites.
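As a small numerical illustration of the ratio $A_{i}^{*}/S_{i} = (G^{*}/G) w^{i}$, the sketch below computes how much larger the ratio is at recurrence i than at recurrence j. The constant G^*/G cancels, leaving w^(i−j). The function name and the values of w are illustrative assumptions, not estimates from the data.

```python
def selection_fold(w: float, i: int, j: int = 1) -> float:
    """Fold change of A_i*/S_i relative to A_j*/S_j.

    Both ratios share the constant G*/G, so only w**(i - j) remains.
    """
    return w ** (i - j)


# Illustrative (assumed) scaled selective advantages w = 2Ns.
for w in (5, 10, 20):
    print(f"w={w}: ratio at i=3 is {selection_fold(w, 3)}-fold that at i=1")
```

Under these assumptions, any w of 10 or more gives at least a 100-fold enrichment of selected sites at i=3 relative to i=1.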

PART III. The theory of large samples (n > 10⁵) and identification of all CDNs

Using the theory of CDN developed above, the companion paper shows that each sequenced cancer genome in the current databases, on average, harbors only 1~2 CDNs. The number varies in this range depending on the cancer type (Zhang et al., 2024). For comparison, tumorigenesis may require at least 5~10 driver mutations as estimated by various criteria (Armitage and Doll, 1954; Hanahan and Weinberg, 2011; Belikov, 2017; Martincorena et al., 2017; Anandakrishnan et al., 2019; ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium, 2020). The results show that there are many more CDNs that have not been discovered. This is not unexpected since most CDNs are found in <1% of patients. If each CDN is observed in 1% of patients and each patient has 5 CDNs, then there should be at least 500 CDNs for each cancer type.

In the companion study that uses the A/S ratios of Figure 1, the estimated number of CDNs ranges from 500 to 2000, whereas the current estimate based on A_{i ≥3} sites is only 50~100 (Zhang et al., 2024). Where, then, are the undiscovered CDNs and how can they be found? Since all A_{i ≥3} sites are concluded to be CDNs, the bulk of CDNs must be among A₁ and A₂ sites. The best way to identify the CDNs hidden in A₁ and A₂ is to increase the sample size, n, dramatically.

We hence extend Equation 1 for S_i and A_i to large n’s. Note that Equation 1 drops the term of $(1 - u)^{n - i} \sim e^{- u (n - i)}$ as it is ~1, when nE(u) ≪ 1. With a large n when $e^{- u (n - i)}$ is not near 1, the recurrence of mutations would follow a Poisson distribution with the expected value of nE(u). Assuming that u follows a gamma distribution with a shape parameter of k, the probability of observing i mutations would follow the negative binomial distribution as shown below:

f (i | k, n, E (u)) = \frac{Γ (i + k)}{Γ (i + 1) Γ (k)} k^{k} {[n E (u)]}^{i} {[k + n E (u)]}^{- k - i}

The cumulative distribution function for Equation 6 is then:

F (i \leq i^{*}) = \sum_{i = 1}^{i^{*}} f (i | k, n, E (u))

Then, by definition, A_{i ≥i*} should be ≪ 1 so that mutations with recurrences i≥i^* could be defined as CDNs. Thus:

F (i \leq i^{*}) = 1 - \frac{ε}{L_{A}}

where ε = A_{i ≥i*} denotes the number of sites with mutation recurrence ≥i^* under the sole influence of mutational force. $\frac{ε}{L_{A}}$ could then be regarded as the significance level of i^* since it controls the overall false positive rate of CDNs.
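The negative binomial probability and the cutoff condition above can be sketched numerically as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the values L_A = 2×10⁷, E(u) = 3×10⁻⁶ and ε = 1 are assumed for demonstration only. The tail is taken as the probability of recurrence ≥ i^*, matching the definition of ε.

```python
import math


def nb_pmf(i: int, k: float, mean: float) -> float:
    """Gamma-Poisson (negative binomial) probability of i recurrences.

    k is the gamma shape parameter and mean = n * E(u).
    """
    log_p = (math.lgamma(i + k) - math.lgamma(i + 1) - math.lgamma(k)
             + k * math.log(k) + i * math.log(mean)
             - (k + i) * math.log(k + mean))
    return math.exp(log_p)


def cutoff(k: float, n: int, eu: float, sites: float,
           eps: float = 1.0, i_max: int = 10_000) -> int:
    """Smallest i* for which the expected number of neutral sites
    with recurrence >= i* is at most eps."""
    mean = n * eu
    tail = 1.0  # P(recurrence >= 0)
    for i in range(i_max):
        if sites * tail <= eps:
            return i
        tail -= nb_pmf(i, k, mean)  # now P(recurrence >= i + 1)
    raise ValueError("no cutoff found below i_max")


# Assumed, illustrative parameters (not fitted values).
L_A, EU = 2e7, 3e-6
for k in (1.0, 2.0, 5.0):
    print(k, cutoff(k, n=1000, eu=EU, sites=L_A))
```

With k=1 the probability reduces to a geometric distribution, which is the special case treated next.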

Specifically, with k=1, the probability function of mutation recurrence at a given site reduces to a geometric distribution with P=1 / (1+nE(u)). The cumulative distribution function (CDF) is then:

F (i \leq i^{*}) = 1 - {(1 - \frac{1}{1 + n E (u)})}^{i^{*}}

Combined with Equation 5, the mutation recurrence cutoff i^* of being a CDN could be expressed as:

i^{*} = \frac{\log (\frac{ε}{L_{A}})}{\log (\frac{n E (u)}{1 + n E (u)})}

For very large n, 1/nE(u) is small and i^*/n can be approximated as

i^{*} / n = \log (L_{A}) \cdot E (u) \sim 5 \times 10^{- 5}

Equation 11 shows that i^*/n would approach an asymptote as n increases. This asymptotic value is attained when n reaches ~10⁶.
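The closed-form cutoff for k=1 and its large-n limit can be checked numerically. In the sketch below, the function name and the parameter values (L_A = 2×10⁷, E(u) = 3×10⁻⁶, ε = 1) are assumptions chosen only so that log(L_A)·E(u) is close to 5×10⁻⁵. It prints the real-valued cutoff before any rounding.

```python
import math


def i_star(n: float, eu: float, sites: float, eps: float = 1.0) -> float:
    """Real-valued cutoff i* for k = 1 (geometric case)."""
    m = n * eu
    return math.log(eps / sites) / math.log(m / (1.0 + m))


# Assumed, illustrative parameters (not fitted values).
L_A, EU = 2e7, 3e-6
asymptote = math.log(L_A) * EU

for n in (10**3, 10**4, 10**5, 10**6, 10**7):
    ratio = i_star(n, EU, L_A) / n
    print(f"n={n:>8}  i*/n={ratio:.3e}  asymptote={asymptote:.3e}")
```

Because log(nE(u) / (1 + nE(u))) ≈ −1/(nE(u)) when nE(u) is large, the ratio i^*/n falls toward log(L_A)·E(u) from above as n grows.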

Figure 6 shows the range of i^* for n up to 10⁶. As expected, i^* increases by small increments while n increases in 10-fold jumps. For example, when n increases by 3 orders of magnitude, from 100 to 100,000, i^* increases only fourfold, from 3 to 12. The disproportional increment between i^* and n explains why we use the actual number of i^* for the cutoff, instead of the ratio i^*/n. As shown in the inset of Figure 6, the ratio of i^*/n would approach the asymptote at n~10⁶, where an advantageous mutation only needs to rise to a prevalence of 0.00006 to be detected. With n reaching this level, we shall be able to separate most CDNs from the mutation background.

Figure 6


i^* values (Y-axis, log scale) against sample sizes (n), X-axis across different shape parameter k’s.

The Y axis presents the i^* values under different sample sizes (n) of the X-axis in log scale. Five shape parameters (k) of the gamma-Poisson model are used. In the literature on the evolution of mutation rate, k is usually greater than 1. The inset figure illustrates how i^*/n (prevalence) would decrease with increasing sample sizes. The prevalence would approach the asymptotic line of $\log (L_{A}) \cdot E (u)$ when n reaches 10⁶. In short, more CDNs (those with lower prevalence) will be discovered as n increases. Beyond n=10⁶, there will be no gain.

When n approaches 10⁵, the number of CDNs will likely increase more than 10-fold as conjectured in Figure 7A. In that case, every patient would have, say, 5 CDNs that can be subjected to gene targeting. (The companion study shows that, at present, an average cancer patient would have fewer than one targetable CDN.) Before the project is realized, it is nevertheless possible to test some aspects of it using the GENIE data of targeted sequencing. Such screening for mutations in the roughly 700 canonical genes serves the purpose of diagnosis, with n ranging between 10 and 17 thousand for the breast, lung and CNS cancers. Clearly, GENIE efforts did not engage in discovering new mutations although they would discover additional CDNs in the canonical genes. In the companion study, we demonstrate that the analysis of CDNs identifies a potential set of 1.6 times more driver genes than those detected by whole gene selection signal calls (Zhang et al., 2024).

Figure 7


Analysis of CDNs with expanded sample set in GENIE.

(A) Schematic illustrating the impact of sample size expansion on the number of discovered CDNs. The two vertical lines show the cutoffs of i^*/n at (3/1000) vs. (12/100,000). The Y axis shows that the potential number of sites would decrease with i^*/n, which is a function of selective advantage. The area between the two cutoffs below the line represents the new CDNs to be discovered when n reaches 100,000. The power of n=100,000 is even larger if the distribution follows the blue dashed line. (B) The prevalence (i/n) of sites is well correlated between datasets of different n (TCGA with n<1000 and GENIE with generally tenfold higher n), as it should be. Sites are displayed by color. ‘1-hit’: CDNs identified in GENIE that remain singletons in TCGA; ‘2-hit’: CDNs identified in GENIE that are doubletons in TCGA; ‘CDN both’: CDNs identified in both databases. (C–E) CDNs discovered in GENIE (n>9000) but absent in TCGA (n<1000). The newly discovered CDNs may fall in TCGA as 0–2 hit sites. The numbers in the middle column show the percentage of lower recurrence (non-CDN) sites in TCGA that are detected as CDNs in the GENIE database, which has much larger n’s.

Figure 7A assumes that the prevalence, i/n, should not be much affected by n, but the cutoff for CDNs, i^*/n, would decrease rapidly as n increases. Figure 7B shows that the i/n ratios indeed correspond well between TCGA and GENIE, which differ by 10–20-fold in sample size. Importantly, as predicted in Figure 6, the number of CDNs increases by three to fivefold (Figure 7C–E). Many of these newly discovered CDNs from GENIE are found in the A₁ and A₂ classes of TCGA while many more are found in the A₀ class in TCGA. In conclusion, increasing n by one to two orders of magnitude would be the simplest means of finding all CDNs.

Discussion

The nature of high-recurrence mutations has been controversial. Many authors have argued for mutational hotspots (Hess et al., 2019; Stobbe et al., 2019; Nesta et al., 2021; Bergstrom et al., 2022; Wong et al., 2022) but just as many have contended that they are CDNs driven by selection (Gartner et al., 2013; Chang et al., 2016; Bailey et al., 2018; Cannataro et al., 2018; Juul et al., 2021; Zhao et al., 2021; Zeng and Bromberg, 2022). While the two views co-exist, they are in fact incompatible. If the mutational hotspot hypothesis is correct, the selection hypothesis, and the determination of CDNs, would not be needed.

We believe that this study is the first to comprehensively test the null hypothesis of mutational hotspots. In PART I, the mutational characteristics near all putative CDNs are examined and PART II presents the probability theory based on the analyses. The conclusion is that it is possible to reject the null hypothesis for recurrences as low as 3 in the TCGA data. The main reason for the high sensitivity is shown in Equations 2 and 3 where A_i and S_i are proportional to $[n E (u)]^{i}$ or, roughly, 0.002ⁱ. We recognize that the conclusion is based on what we currently know about mutation mechanisms. In a sense, the theory developed here can help the search for such unknown mutation mechanisms, if they do exist. Finally, the theory developed would permit explorations on several new fronts when the sample size, n, expands to 10⁵.

The first front is to identify (nearly) all CDNs. When n reaches 10⁵, any point mutation with a prevalence higher than 12/100,000 would be a CDN, which is 25-fold more sensitive than in the TCGA data (3/1000). The companion analysis suggests that CDNs with lower prevalence, say 12/100,000, may still be highly tumorigenic in patients with the said mutation. If prevalence and potency are indeed poorly correlated, the search for lower prevalence CDNs by increasing n to 10⁵ is equivalent to searching for less common but still potent cancer driving mutations.

The second front is functional testing in patient-derived cell lines. When we have all CDNs identified, a patient can be expected to have multiple (≥5; Zhang et al., 2024) CDNs. These mutations will be the basis of in vitro tests, as well as animal model experiments, by gene editing, as shown recently (Hodis et al., 2022). Targeting multiple mutations simultaneously is necessary and may even be sufficient.

The third front, and arguably the most important one, is cancer treatment by targeted therapy (Dang et al., 2017; Danesi et al., 2021; Waarts et al., 2022; Lin et al., 2023; Zhou et al., 2023). When multiple CDN mutations in the same patient can be simultaneously targeted, the efficacy should be high. No less crucial, resistance to treatment should be diminished since it would be harder for cancer cells to evolve multiple escape routes to evade multiple drugs. Moreover, CDN analysis is crucial for stratifying patients for targeted therapy, as only targeting genes that are positively selected during cancer evolution can truly achieve therapeutic effects.

There will be other fronts to explore with the full set of CDNs. A large database will facilitate the detection of negative selection which has eluded detection (Chen et al., 2022b). Chen et al. have analyzed a curious phenomenon in somatic evolution, which they term ‘quasi-neutral evolution’ (Chen et al., 2019). It will also be possible to study the evolution of mutation mechanisms in cancer cells based on such large samples (Jackson and Loeb, 1998; Ruan et al., 2020). This last topic is addressed in Appendix 1 Note 5 (Appendix 1—figure 2). Finally, at the center of evolutionary genetics is the multi-genic interactions that control complex phenotypes such as human diseases (e.g. diabetes; Vujkovic et al., 2020; Lagou et al., 2023; Xue et al., 2023; Suzuki et al., 2024), genetics of speciation (Chen et al., 2022b; Pan et al., 2022b) and the emergence of viral strains (Deng et al., 2022; Pan et al., 2022a). Cancers may be the first such complex genetic systems that can be unraveled thanks to the massively repeated evolution. As cancer genomics is increasingly adopted in cancer treatment, these benefits should become apparent when n reaches 10⁵ for most cancer types.

Methods

Data collection

Single nucleotide variation (SNV) data for the TCGA cohort was downloaded from the GDC Data Portal (https://portal.gdc.cancer.gov/, data version 2022-02-28). Only mutations identified by at least two pipelines were included in this study. Mutations were further filtered based on their population frequency recorded in the Genome Aggregation Database (gnomAD, version v2.1.1), with an upper threshold of 1‰. We focused on coding region mutations of missense, nonsense, and synonymous types on autosomes. The mutation load for each patient was defined as the sum of these three types of mutations. Patients with a mutation load exceeding 3000 were identified as having a mutator phenotype and were excluded from our analysis. In total, 7369 samples representing 12 cancer types were included for mutation analysis.

Additional mutation data was acquired from the AACR Project GENIE Consortium via cBioPortal. Due to the prevalence of targeted sequencing within the dataset, filtering was implemented to ensure the inclusion of samples with sequencing assays encompassing all exonic regions of target genes. Furthermore, sample-level filtering was performed to guarantee a unique sequencing sample per patient. Germline filtering was applied to the resulting point mutations, removing mutations with SNP frequencies exceeding 0.0005 in any subpopulation annotated by gnomAD. To exclude patients exhibiting hypermutator phenotypes, mutation loads were scaled to the whole exon level with $\tilde{n_{L}} = n_{l} \cdot \frac{L}{l}$ , where $n_{l}$ and l represent the mutation load and genomic length of target sequencing region, respectively. L denotes the genome-wide whole coding region length. Patients with $\tilde{n_{L}} > 3000$ were subsequently excluded, consistent with the threshold employed for the TCGA dataset.
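The panel-to-exome scaling of mutation load described above amounts to a one-line rescaling followed by the same threshold used for TCGA. The sketch below is illustrative: the function names are hypothetical, and the whole coding length of 3×10⁷ sites is the rounded figure quoted in the main text.

```python
def scaled_load(panel_load: float, panel_length: float,
                coding_length: float = 3e7) -> float:
    """Scale a mutation load observed on a targeted panel of length l
    to the whole coding region of length L: n_L = n_l * L / l."""
    return panel_load * coding_length / panel_length


def is_hypermutator(panel_load: float, panel_length: float,
                    coding_length: float = 3e7,
                    threshold: float = 3000) -> bool:
    """Apply the exclusion threshold to the scaled load."""
    return scaled_load(panel_load, panel_length, coding_length) > threshold


# Hypothetical patient: 150 coding mutations on a 1.2 Mb panel.
print(scaled_load(150, 1.2e6), is_hypermutator(150, 1.2e6))
```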

Calculation of missense and synonymous site number (L_A and L_S)

The idea of missense or synonymous sites originates from the question of how many missense or synonymous mutations would be expected if each site of the genome were mutated once given background mutation patterns. Here, the background mutation patterns refer to intrinsic biases in the mutational process, such as the over-representation of C>T (or G>A) mutations at CpG sites due to spontaneous deamination of 5-methylcytosine. In coding regions, fourfold degenerate sites are generally considered neutral, as any mutation path would not alter the encoded amino acid sequence. This analysis follows established methods at the single-base level to infer the expected number of missense and synonymous sites across the genome (Gojobori et al., 1982; Wu and Maeda, 1987; Hartl and Clark, 1989; Martincorena et al., 2017).

To illustrate the calculation process, we provide an example of synonymous site estimation. At fourfold degenerate sites throughout the genome, we tally the number of mutations from base m to base v as $n_{m > v}$ $(m, v \in \{A, C, G, T\}, m \neq v)$. The likelihood of observing a mutation from reference base m to variant base v will be $r_{m > v} = \frac{n_{m > v}}{N_{m}}$, where $N_{m}$ represents the number of fourfold degenerate sites with reference base m. We then normalize all likelihoods by $R_{m > v} = \frac{r_{m > v}}{\sum r}$, where $\sum r$ represents the sum of likelihoods across the 12 possible mutation paths. $R_{m > v}$ thus describes the relative probability that a mutation, once it occurs, is $m > v$ at any site. For a given genomic coding region of length L, the synonymous site number will be:

L_{S} = \sum_{L} δ_{m > v}^{s y n} R_{m > v}

with $δ_{m > v}^{s y n}$ being a Kronecker delta function where:

$δ_{m > v}^{s y n} = 1$ if $m > v$ is synonymous

$δ_{m > v}^{s y n} = 0$ otherwise

Similarly, the expected number of missense sites L_A, is calculated as follows:

L_{A} = \sum_{L} δ_{m > v}^{m i s} R_{m > v}
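The site-number calculation can be sketched on a toy sequence. The code below follows the two sums literally; the function names, the toy counts, and the six-base sequence are hypothetical, and the treatment of stop codons (changes to or from a stop are counted in neither class) is an assumption. With R normalized over the 12 paths, the absolute values depend on that normalization, whereas the ratio L_A/L_S does not.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}


def relative_rates(counts, n_sites):
    """R_{m>v}: per-path rates r = n_{m>v} / N_m, normalized to sum to 1.

    counts  -- {(m, v): mutations m>v seen at fourfold degenerate sites}
    n_sites -- {m: number of fourfold degenerate sites with reference m}
    """
    r = {path: c / n_sites[path[0]] for path, c in counts.items()}
    total = sum(r.values())
    return {path: value / total for path, value in r.items()}


def site_numbers(cds, rates):
    """Return (L_S, L_A) for an in-frame coding sequence."""
    l_syn = l_mis = 0.0
    for start in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[start:start + 3]
        ref_aa = CODON_TABLE[codon]
        for pos, m in enumerate(codon):
            for v in BASES:
                if v == m:
                    continue
                alt_aa = CODON_TABLE[codon[:pos] + v + codon[pos + 1:]]
                if alt_aa == ref_aa:
                    l_syn += rates[(m, v)]
                elif alt_aa != "*" and ref_aa != "*":
                    l_mis += rates[(m, v)]
                # changes to or from a stop codon are counted in neither sum
    return l_syn, l_mis


# Toy input: equal counts on all 12 paths, so every R_{m>v} is 1/12.
uniform_counts = {(m, v): 10 for m in BASES for v in BASES if m != v}
uniform_sites = {m: 100 for m in BASES}
R = relative_rates(uniform_counts, uniform_sites)
L_S, L_A = site_numbers("ATGGCT", R)
print(L_S, L_A, L_A / L_S)
```

In this toy case the only synonymous changes are the three at the third position of GCT, so the result is easy to verify by hand.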

Calculation of A_i and S_i

For each mutated site, we track its number of recurrent mutations i (i>0) across two mutation categories: missense and synonymous. Subsequently, we aggregate across the entire coding region to count the number of sites harboring i missense mutations (A_i) and synonymous mutations (S_i). For i=0, we define A₀ and S₀ as the estimated number of potential missense and synonymous sites that remain unmutated within the current sample size: $A_{0} = L_{A} - Σ_{i > 0} A_{i}$, $S_{0} = L_{S} - Σ_{i > 0} S_{i}$.
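The tally of A_i or S_i can be sketched with two nested counts. The function name and the toy site labels below are hypothetical. The input is assumed to list one entry per patient per mutated site for a single mutation category.

```python
from collections import Counter


def recurrence_spectrum(mutated_sites, total_sites):
    """Map recurrence i to the number of sites with exactly i hits.

    mutated_sites -- iterable of site identifiers, one entry per patient
                     carrying a mutation of the given category at that site
    total_sites   -- L_A or L_S for that category, used to fill in i = 0
    """
    hits_per_site = Counter(mutated_sites)
    spectrum = Counter(hits_per_site.values())
    spectrum[0] = total_sites - sum(spectrum.values())
    return dict(spectrum)


# Toy example: site "s1" is hit in three patients, "s3" in two, "s2" in one.
toy = ["s1", "s1", "s1", "s2", "s3", "s3"]
print(recurrence_spectrum(toy, total_sites=10))
```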

AI-based mutation rate analysis

To capture the complex interplay of genomic and epigenomic factors influencing mutation susceptibility, we employed pre-trained artificial intelligence (AI) models from Dig, an aggregated tool combining deep learning and probabilistic models (Sherman et al., 2022). Downloaded from the Dig data portal (http://cb.csail.mit.edu/cb/DIG/downloads/), these models leverage a rich set of features encompassing both kilobase-scale epigenomic context (replication timing, chromatin accessibility, etc.) and fine-grained base-pair level information (such as sequence context biases) to predict the site-level mutation rate. For each cancer type, we re-fitted the pre-trained models with mutations analyzed in our study. The mutation rate for each site, scaled by population size, was obtained via the elementDriver function within Dig, and was represented by EXP_SNV from the final results. For a closer look at the mutation rate landscape of PAX3, we re-fitted the AI model with point mutations from large intestine cancer. The mutation rates were generated site-by-site for the coding regions of PAX3.

Given the scarcity of mutated sites with recurrence i≥3 (comprising only 0.15% of all mutated sites), a rigorous statistical approach was adopted to assess the significance of mutability differences between these high-hit groups and low-hit groups. We implemented the procedure as follows:

(1) Raw significance level: For each recurrence group i (containing A_i sites), a one-sided Kolmogorov-Smirnov (K-S) test was employed to calculate a raw significance level (denoted as $p_{0}$ ) against the low-hit group.

(2) Resampling for Significance Pool: we resample A_i sites from the entire pool of mutated sites with missense mutations. The significance $p_{j}$ from the one-sided K-S test is calculated against the low-hit group. The resampling process was repeated 100,000 times, generating a distribution of resampled significance levels, denoted as $\{p_{j}, j = 1, 2, \ldots, 100{,}000\}$.

(3) Adjusted Significance Level: the raw significance $p_{0}$ was then compared to the resampled significance pool $\{p_{j}, j = 1, 2, \ldots, 100{,}000\}$. The proportion of $p_{0} < p_{j}$ was then calculated as the adjusted significance level that accounted for potential sampling effects.

Motif-based mutability

For a given nucleotide base, we extended the sequence to each side by 1, 2, and 3 base pairs, producing sets of 3-mer, 5-mer, and 7-mer motifs, respectively. We then pooled point mutations from 12 cancer types to create a comprehensive dataset. For 3-mer and 5-mer motifs, we utilized synonymous mutations as the reference for mutation rate calculations. For 7-mer motifs, the vast number of possible sequence combinations (4⁷=16,384) posed a challenge, as synonymous mutations alone might not adequately cover all potential contexts. To address this, we employed all singleton mutations (1-hit mutations) from both missense and synonymous categories for 7-mer motif analysis. This decision was based on the assumption that singleton mutations are less affected by selective pressures, supported by the genome-wide observation that the ratio of missense to synonymous singletons (A₁/S₁) approximates the ratio of unmutated missense to synonymous sites (A₀/S₀).

The site number for 3-mer and 5-mer motifs of a given context c was calculated as follows:

L_{c, S} = \sum_{L} δ_{c, m > v}^{s y n} \cdot R_{m > v}

which is an extension of Equation S1, with $δ_{c, m > v}^{s y n}$ being a Kronecker delta function where:

$δ_{c, m > v}^{s y n} = 1$ if base change of m to v (m > v) is synonymous under sequence context c.

$δ_{c, m > v}^{s y n} = 0$ otherwise.

The mutation rate then could be expressed as:

μ_{c} = \frac{n_{c, m > v}^{s y n}}{L_{c, s}}

For 7-mer motifs, the calculation is:

L_{c} = L_{c, S} + L_{c, A} = \sum_{L} δ_{c, m > v}^{s y n} \cdot R_{m > v} + \sum_{L} δ_{c, m > v}^{m i s} \cdot R_{m > v}

μ_{c} = \frac{n_{c, m > v}^{s y n} + n_{c, m > v}^{m i s}}{L_{c}}

Where for Equations S3 and S4, $n_{c, m > v}^{s y n}$ and $n_{c, m > v}^{m i s}$ represent the mutation numbers with m>v being synonymous and missense under sequence context c, respectively. The mutation rate is then scaled as the expected mutation number per 10⁵ corresponding sequence motifs for better presentation.

Significance for motif enrichment (Figure 4B) mirrored the AI analysis. For each i≥3 site, we calculated raw K-S p-values against motif mutabilities (denoted as $p_{0}$). These were then compared to a resampled significance pool $\{p_{j}, j = 1, 2, \ldots, 100{,}000\}$, with the proportion of $p_{0} \leq p_{j}$ employed as the final p-value, depicting enrichment significance for highly mutable motifs in recurrence group i against low hits.

Consensus length comparison

To explore potential sequence motifs associated with recurrent mutations (i≥3), we employ a sliding window of 10 bp stride to extract the local context from reference genome (Figure 4—figure supplement 1). We examined diverse window sizes (21, 41, 61, 81, and 101 bp) to capture potential motifs of varying lengths and distances to the mutated site. Consensus length of local contexts was measured by Hamming distance in pairwise comparisons of aligned windows (with same stride) between mutated sites.

To prioritize sequence similarities likely driven by mutational mechanisms rather than functional constraints or gene structure, we restricted consensus comparisons to non-homologous genes. This approach effectively mitigated potential biases arising from homologous genes (e.g., KRAS and NRAS) or repeated domains within a single gene (e.g., FBXW7). The statistical significance of observed consensus lengths was assessed using the K-S test, which compared the empirical distribution of consensus lengths against a Poisson distribution, with mean of λ set to one-quarter of the window size, which reflects the expected distribution under random scenarios.
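The pairwise comparison of aligned windows can be sketched as below. Here the consensus length is taken to be the number of matching positions, i.e. the window size minus the Hamming distance. That reading is an assumption, chosen because it agrees with the stated random expectation of one-quarter of the window size. The function names and toy sequences are hypothetical.

```python
def consensus_length(window_a: str, window_b: str) -> int:
    """Number of identical positions between two aligned windows,
    computed as window size minus Hamming distance."""
    if len(window_a) != len(window_b):
        raise ValueError("windows must have the same length")
    hamming = sum(a != b for a, b in zip(window_a, window_b))
    return len(window_a) - hamming


def expected_random_consensus(window_size: int) -> float:
    """Mean consensus length if bases matched by chance with
    probability 1/4, used as the Poisson mean."""
    return window_size / 4


# Toy 21-bp windows around two mutated sites (hypothetical sequences).
w1 = "ACGTACGTACGTACGTACGTA"
w2 = "ACGTTCGTACGAACGTACGTT"
print(consensus_length(w1, w2), expected_random_consensus(len(w1)))
```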

Mutational signature analysis

The mutation load of each patient can be further decomposed into several known mutational processes, which are represented by mutational signatures. In general, each mutational signature embodies the relative mutabilities across distinct mutational contexts. Leveraging single base substitution (SBS) signatures from COSMIC (v3.3), we employed the SignatureAnalyzer tool to quantify the contribution of each signature to individual mutational loads (Kim et al., 2016). For composition analysis in Figure 5B, we focused on signatures contributing at least 2% to the total mutations within a given cancer type, given that there are 79 mutational signatures in use for deconvolution.

To assess signature contribution changes across recurrence cutoffs (i), we grouped patients with mutations of recurrence i≥i^* and scaled signature contributions to 1 to cancel out the population size effect. Pairwise K-S test between different i^*s is employed to determine whether signature contributions are significantly different under each i^*.

Outlier model

The purpose of the outlier model is to investigate if high-hit sites could be explained by a fraction (p) of highly mutable sites (with mutability α-fold higher). By definition, we have:

S_{i} = (1 - p) \cdot L_{S} \cdot {[n E (u)]}^{i} + p \cdot L_{S} \cdot {[α \cdot n E (u)]}^{i}

where n represents the population size and E(u) denotes the average mutation rate per site per patient.

We let p range from 10^–5 to 10^–2 and α from 1.1 to 1,000. For each (p, α) pair, we solve Equation S5 based on the observed S₁ (i=1) to obtain E(u). Then, we calculate the expected S_i with i=2, 3. The (p, α) pairs are filtered by imposing constraints grounded in the observed S₂ and S₃ values. Specifically, we retained only those pairs whose expected S₂ and S₃ values resided within the 95% quantile range of a Poisson distribution with λ set to the observed values. This filtering process yielded biologically plausible (p, α) pairs that were then used to derive S₄ and S₅. Finally, we computed the mean and standard deviation for p, α, S₄, and S₅ across all filtered pairs to capture their central tendencies and variability.
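The grid search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the 95% range is taken as the central 2.5% to 97.5% Poisson quantiles, and the sketch solves for the product m = nE(u) directly, which is equivalent to solving for E(u) at a fixed n.

```python
import math


def expected_s(i, p, alpha, m, l_s):
    """Equation S5 with m = n * E(u)."""
    return l_s * ((1 - p) * m ** i + p * (alpha * m) ** i)


def solve_mean(s1_obs, p, alpha, l_s):
    """At i = 1, S_1 = L_S * m * (1 - p + p * alpha), so m follows directly."""
    return s1_obs / (l_s * (1 - p + p * alpha))


def poisson_interval(lam, level=0.95):
    """Central quantile range of Poisson(lam), by accumulating the pmf."""
    if lam <= 0:
        return 0, 0
    lo_q, hi_q = (1 - level) / 2, 1 - (1 - level) / 2
    cdf, k, lower = 0.0, 0, None
    while True:
        cdf += math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
        if lower is None and cdf >= lo_q:
            lower = k
        if cdf >= hi_q:
            return lower, k
        k += 1


def fit_outlier_pairs(s_obs, l_s, p_grid, alpha_grid):
    """Keep (p, alpha) pairs whose expected S_2 and S_3 fall in the Poisson
    range around the observed values; report expected S_4 and S_5."""
    bounds = {i: poisson_interval(s_obs[i]) for i in (2, 3)}
    kept = []
    for p in p_grid:
        for alpha in alpha_grid:
            m = solve_mean(s_obs[1], p, alpha, l_s)
            if all(bounds[i][0] <= expected_s(i, p, alpha, m, l_s) <= bounds[i][1]
                   for i in (2, 3)):
                kept.append((p, alpha,
                             expected_s(4, p, alpha, m, l_s),
                             expected_s(5, p, alpha, m, l_s)))
    return kept


# Hypothetical observed counts and L_S, for illustration only.
observed = {1: 21980.0, 2: 439.96, 3: 80.08}
p_grid = [10 ** e for e in (-5, -4, -3, -2)]
alpha_grid = [1.1, 10.0, 100.0, 1000.0]
for row in fit_outlier_pairs(observed, 1e7, p_grid, alpha_grid):
    print(row)
```

The mean and standard deviation of p, α, S₄ and S₅ over the retained rows would then correspond to the summary statistics described in the text.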

CDN analysis in GENIE

To circumvent potential biases in E(u) estimation stemming from the varying target gene coverage across sequencing panels within the GENIE dataset, we leveraged E(u) values derived from the corresponding cancer types in the TCGA dataset. Specifically, we focused on 1-hit synonymous mutations within coding regions, as these are generally considered to be the least influenced by selective pressures in coding regions. Based on Equation 1 from the main text, we have:

S_{1} = L_{S} \cdot n E (u) e^{- (n - 1) E (u)}

where $e^{- (n - 1) E (u)}$ comes from the approximation of $[1 - E (u)]^{(n - 1)}$ from the binomial distribution, and n is the population size in TCGA. The calculation for the threshold i^*, based on Equation 10 from the main text, is:

i^{*} = \frac{l o g (\frac{ε}{L_{A}})}{l o g (\frac{n_{e} E (u)}{1 + n_{e} E (u)})}

The only difference in Equation S7 is that we use n_e to represent the number of patients sequenced for a target gene in GENIE, considering the overlapping between different assays in use. In essence, we will have for each gene a CDN threshold i^*.
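The two steps above, estimating E(u) from TCGA singletons and then computing a per-gene cutoff in GENIE, can be sketched as follows. The function names are hypothetical. The bisection solver, the choice of ε = 1, rounding i^* up to an integer, and all numerical inputs are assumptions made for illustration.

```python
import math


def solve_eu(s1_obs: float, l_s: float, n: int) -> float:
    """Solve S_1 = L_S * n * u * exp(-(n - 1) * u) for u by bisection.

    The right-hand side increases in u on [0, 1/(n - 1)], which is
    the branch relevant when n * u is small.
    """
    def s1(u):
        return l_s * n * u * math.exp(-(n - 1) * u)

    lo, hi = 0.0, 1.0 / (n - 1)
    if s1_obs > s1(hi):
        raise ValueError("observed S_1 exceeds the attainable maximum")
    for _ in range(200):
        mid = (lo + hi) / 2
        if s1(mid) < s1_obs:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def gene_cutoff(n_e: int, eu: float, l_a: float, eps: float = 1.0) -> int:
    """Equation S7, rounded up to the next integer recurrence."""
    m = n_e * eu
    return math.ceil(math.log(eps / l_a) / math.log(m / (1 + m)))


# Hypothetical inputs: S_1 generated from E(u) = 3e-6 with L_S = 9e6, n = 1000.
s1_toy = 9e6 * 1000 * 3e-6 * math.exp(-999 * 3e-6)
eu_hat = solve_eu(s1_toy, l_s=9e6, n=1000)
for n_e in (1000, 5000, 15000):
    print(n_e, gene_cutoff(n_e, eu_hat, l_a=2e7))
```

Since n_e differs between genes according to panel coverage, the same E(u) can yield different cutoffs for different genes.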

The comparison of CDNs between TCGA and GENIE is restricted to genes sequenced by GENIE panels. For CDNs identified in GENIE using Equation S7, we investigated their hit information and CDN identity within the TCGA dataset. The CDN flow proportion depicted in Figure 7C–E represents the ratio of sites identified as CDNs in GENIE to $A_{i}^{*}$ (i=0, 1, 2) of TCGA. $A_{i}^{*}$ mirrors A_i but specifically considers the coding region sequenced by the GENIE panel. For CDNs identified in TCGA, the flow ratio is just the proportion of sites being identified as CDNs in GENIE. Notably, for CDNs identified in GENIE but lacking mutations in TCGA (i=0), $A_{0}^{*}$ is obtained using Equation S2 with L being the length of the coding region sequenced in GENIE.

Appendix 1

1. Literature support for CDNs identified in breast cancer

Verification of site level positive selection in cancer genome has primarily focused on canonical cancer drivers. For non-canonical candidates with experimentally proven tumorigenic activity, CDN sites within these genes emerge as potential key drivers due to their statistically stronger selective advantage. Under this premise, we search for literature evidence for genes harboring CDN sites in breast cancer.

Among the 17 genes with CDN sites in breast cancer, 11 are recognized as canonical drivers by all three major driver gene lists (Appendix 1—table 1). Four genes (CDC42BPA, ERBB3, KIF1B, NUP93), despite lacking inclusion in canonical breast cancer driver lists, possess explicit experimental support indicating their driving roles in breast tumorigenesis. HIST1H3B has been recognized as a driver gene in breast cancer by IntOGen, corroborated by literature supporting its association with breast cancer. The four mutation recurrences with the R6C amino acid alteration in RARS2 have been proposed to be linked to defects in mitochondrial transport; the explicit role of RARS2 in breast cancer tumorigenesis remains to be explored.

Appendix 1—table 1

Literature support for CDN genes in breast cancer.

Gene Id	Gene Name	Support
*AKT1*	v-akt murine thymoma viral oncogene homolog 1	① ② ③
*CDC42BPA*	CDC42 binding protein kinase alpha (DMPK-like)	Unbekandt and Olson, 2014; Collins et al., 2018; Kwa et al., 2021; Jiang et al., 2023
*CDH1*	cadherin 1, type 1, E-cadherin (epithelial)	① ② ③
*ERBB2*	v-erb-b2 avian erythroblastic leukemia viral oncogene homolog 2	① ② ③
*ERBB3*	v-erb-b2 avian erythroblastic leukemia viral oncogene homolog 3	Holbro et al., 2003; Xue et al., 2006; Hamburger, 2008; Sithanandam and Anderson, 2008; Stern, 2008; Huang et al., 2010
*FGFR2*	fibroblast growth factor receptor 2	① ② ③
*FOXA1*	forkhead box A1	① ② ③
*GATA3*	GATA binding protein 3	① ② ③
*HIST1H3B*	histone cluster 1, H3b	① ② ③ Xie et al., 2019; Wang et al., 2023 *
*KIF1B*	kinesin family member 1B	Munirajan et al., 2008; Yu and Feng, 2010; Liu et al., 2022
*KRAS*	Kirsten rat sarcoma viral oncogene homolog	① ② ③
*NUP93*	nucleoporin 93 kDa	Bersini et al., 2020; Nataraj et al., 2022
*PIK3CA*	phosphatidylinositol-4,5-bisphosphate 3-kinase, catalytic subunit alpha	① ② ③
*PTEN*	phosphatase and tensin homolog	① ② ③
*RARS2*	arginyl-tRNA synthetase 2, mitochondrial	Wang et al., 2020 *
*SF3B1*	splicing factor 3b, subunit 1, 155 kDa	① ② ③
*TP53*	tumor protein p53	① ② ③

The circled numbers indicate inclusion of the target gene in the following driver gene lists: ① CGC Tier-1 list, ② IntOGen, ③ Bailey’s list.
Inclusion requires that the target gene is annotated as a cancer driver in breast cancer.
* Ambiguous: the literature indicates an association between the candidate gene and breast cancer, but lacks explicit experimental evidence.

2. The impact of k for gamma-binomial model

From Equation 2 in the main text, S_i is affected by two terms: G and nE(u), where G is:

G = L_{s} \cdot g (i, k) = L_{s} \cdot C_{k + i - 1}^{i} \cdot \frac{1}{k^{i}}

L_s is a constant for a given cancer type. Here we demonstrate how $g (i, k)$ varies with i and k, and explain why S_i at k=1 gives the upper bound of the CDN cutoff.

Consider the fold change of G from i-1 to i:

φ (i, k) = \frac{g (i, k)}{g (i - 1, k)} = \frac{C_{k + i - 1}^{i} \cdot \frac{1}{k^{i}}}{C_{k + i - 2}^{i - 1} \cdot \frac{1}{k^{i - 1}}} = \frac{k + i - 1}{i k}

Appendix 1—figure 1 illustrates how $φ (i, k)$ changes with i and k. With k ranging from 0.1 to 10, the curve of $φ (i, k)$ shows how strongly G affects S_i with each increment of i. As detailed in Section 3, k should be >1 to be biologically meaningful. In such cases, $φ (i, k)$ is always <1 for i>1, meaning that G acts together with nE(u) to decrease S_i. The diminishing effect of $φ (i, k)$ intensifies as k increases. A higher k value would suggest that, for most sites across the genome, mutability falls within a narrow range of >0. In an extreme case, when k=10, $φ (i, k)$ = 0.145 at i=20, which is 2 orders of magnitude weaker than nE(u) for cancer types in TCGA. In practice, k is usually estimated to be between 2 and 5, depending on the cancer type being investigated (Appendix 1—table 2). Consequently, the reduction of S_i with each increase of i is predominantly governed by nE(u), and the S_i values with k=1 represent the upper limit driven solely by mutational force.
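The behaviour of the fold change can be checked numerically. The sketch below is illustrative only: the function names `g` and `phi` are our own, and the generalized binomial coefficient is evaluated with gamma functions so that non-integer k is allowed. Simplifying the ratio of g(i, k) to g(i-1, k) gives (k + i - 1)/(ik), which the code compares against the direct ratio.

```python
import math

def g(i, k):
    """g(i, k) = C(k+i-1, i) / k**i; gamma functions allow non-integer k."""
    log_binom = math.lgamma(k + i) - math.lgamma(i + 1) - math.lgamma(k)
    return math.exp(log_binom - i * math.log(k))

def phi(i, k):
    """Fold change g(i, k) / g(i-1, k) in closed form."""
    return (k + i - 1) / (i * k)

# Tabulate phi over the range of k shown in Appendix 1, figure 1.
for k in (0.1, 1, 2, 5, 10):
    row = ", ".join(f"{phi(i, k):.3f}" for i in (2, 5, 10, 20))
    print(f"k={k}: phi at i=2,5,10,20 -> {row}")

# Direct ratio versus closed form at the value quoted in the text.
print(g(20, 10) / g(19, 10), phi(20, 10))
```

For k=1 the fold change equals 1 at every i, so G no longer changes with i and only nE(u) lowers S_i.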

Appendix 1—figure 1


The trend of $φ (i, k)$ with each increase of recurrence (i, the x-axis) under different shape parameters of the gamma distribution (k, designated by different colors).

Appendix 1—table 2

k estimated from 12 cancer types.

Cancer type	k
Breast	5.05
CNS	2.59
Endometrium	5.49
Kidney	7.70
Large intestine	4.76
Liver	5.23
Lung	2.62
Ovary	4.30
Prostate	3.60
Stomach	4.17
Upper-AD tract	4.14
Urinary tract	6.14
merged set^*	3.27

Note: Estimates of k are derived from negative binomial regression, based on synonymous changes aggregated by the 3 bp local context at mutated sites across all coding genes. The estimation method is implemented in the dndscv package.
* The merged set contains mutation information from all 12 cancer types.

3. The impact of negative selection on shape parameter k

Across studies that model mutability variation across the genome with a gamma distribution, the shape parameter k is always pivotal. Adopting a dichotomous perspective, we ask how k compares to 1 under large sample sizes (n≥10⁶). This question is fundamentally linked to the prevalence of negative selection across the cancer genome, as the observed mutation abundance reflects both mutational and selective forces. If purifying selection operated extensively throughout the genome in cancer evolution, even at synonymous sites (Sharp and Li, 1987; Plotkin and Kudla, 2011; Gartner et al., 2013; Chu and Wei, 2019), most genomic sites would not exhibit mutations, resulting in k≤1. In attempts to detect negative selection signals in cancer, however, researchers typically identify only a limited number of genes (Luo et al., 2008; Van den Eynden et al., 2016; Zapata et al., 2018; Bányai et al., 2021). In a CRISPR-Cas9 loss-of-function screen covering 16,540 genes across 558 cancer cell lines, only approximately 6% of genes were under strong negative selection in at least 90% of the cell lines (De Kegel and Ryan, 2019). In an in-house mutation accumulation experiment in HCT116 (a human colorectal carcinoma cell line; unpublished data), the proportion of mutations under strong negative selection is 0.66%, with a selection coefficient (s) of –0.6 (indicating that the survivability of the mutant is 40% of that of the wildtype). This evidence, in concordance with the quasi-neutrality of cancer evolution, suggests that purifying selection is indeed rare in cancer evolution. The mutability of the majority of genomic sites is therefore greater than 0, with shape parameter k>1.

4. Detailed derivation for negative binomial distribution and approximation of i^*/n

The derivation of Equation 5 from the Gamma-Poisson mixture is standard in statistical analysis. Here, we assume that the mutation recurrence (i) observed at the site level across the genome follows a Poisson distribution, $\mathrm{Pois} (i | λ)$ , where the expected number of mutation recurrences $λ$ follows a Gamma distribution, $\mathrm{Gamma} (λ | k, θ)$ , with k and $θ$ being the shape and scale parameters, respectively. The marginal probability of i is then obtained by integrating over $λ$ :

f (i | k, θ) = \int \mathrm{Pois} (i | λ) \cdot \mathrm{Gamma} (λ | k, θ) \cdot d λ

= \int \frac{λ^{i} e^{- λ}}{i!} \cdot \frac{1}{Γ (k) θ^{k}} λ^{k - 1} e^{- \frac{λ}{θ}} \cdot d λ

= \frac{θ^{- k}}{i! Γ (k)} \int λ^{i} e^{- λ} \cdot λ^{k - 1} e^{- \frac{λ}{θ}} \cdot d λ

$= \frac{θ^{- k}}{i! Γ (k)} \int λ^{(i + k) - 1} e^{- (1 + \frac{1}{θ}) λ} \cdot d λ$ (S9)

Now we use the fact that the probability density function of the Gamma distribution integrates to 1:

\int \mathrm{Gamma} (λ | k, θ) \cdot d λ = 1

That is:

\int \frac{1}{Γ (k) θ^{k}} λ^{k - 1} e^{- \frac{λ}{θ}} \cdot d λ = 1

Therefore,

$\int λ^{k - 1} e^{\frac{- λ}{θ}} \cdot d λ = Γ (k) {(\frac{1}{θ})}^{- k}$ (S10)

Comparing with Equation S10 (with k replaced by i+k and $\frac{1}{θ}$ replaced by $1 + \frac{1}{θ}$ ), Equation S9 can be rewritten as:

f (i | k, θ) = \frac{θ^{- k}}{i! Γ (k)} \cdot {(1 + \frac{1}{θ})}^{- (i + k)} Γ (i + k)

= \frac{θ^{- k}}{i! Γ (k)} \cdot {(\frac{θ}{1 + θ})}^{i + k} Γ (i + k)

$= \frac{Γ (i + k)}{Γ (i + 1) Γ (k)} {(\frac{1}{1 + θ})}^{k} {(\frac{θ}{1 + θ})}^{i}$ (S11)

Note that the mean for Gamma distribution is $k θ = n E (u)$ , which leads to $θ = \frac{n E (u)}{k}$ . Then, the negative-binomial form of Equation S11 could be further expressed as:

f (i | k, θ) = \frac{Γ (i + k)}{Γ (i + 1) Γ (k)} {(\frac{k}{k + n E (u)})}^{k} {(\frac{n E (u)}{k + n E (u)})}^{i}

= \frac{Γ (i + k)}{Γ (i + 1) Γ (k)} k^{k} {[n E (u)]}^{i} {[k + n E (u)]}^{- k - i}

Which is Equation 6 from the main text.
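The equivalence between the Gamma-Poisson mixture and the negative binomial form can be verified numerically. The following is a minimal sketch using only the Python standard library; the function names, the midpoint-rule integration, and the parameter values (k=2, nE(u)=5) are illustrative assumptions and not part of the original analysis.

```python
import math

def nb_pmf(i, k, mean):
    """Negative binomial form of the result above, with mean = n*E(u)."""
    log_p = (math.lgamma(i + k) - math.lgamma(i + 1) - math.lgamma(k)
             + k * math.log(k) + i * math.log(mean)
             - (k + i) * math.log(k + mean))
    return math.exp(log_p)

def mixture_pmf(i, k, mean, upper=100.0, steps=100_000):
    """Integrate Pois(i | lam) * Gamma(lam | k, theta) over lam (midpoint rule)."""
    theta = mean / k                      # scale parameter, from k * theta = n*E(u)
    h = upper / steps
    const = -math.lgamma(i + 1) - math.lgamma(k) - k * math.log(theta)
    total = 0.0
    for j in range(steps):
        lam = (j + 0.5) * h
        total += math.exp(const + (i + k - 1) * math.log(lam)
                          - (1 + 1 / theta) * lam)
    return total * h

k, mean = 2.0, 5.0  # illustrative values only
for i in range(0, 9, 2):
    print(i, nb_pmf(i, k, mean), mixture_pmf(i, k, mean))
```

The two columns should agree to within the numerical integration error, and the closed form sums to 1 over i.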

With k=1, $n E (u) = k θ = θ$ , Equation S11 then transforms to:

f (i | 1, θ) = \frac{Γ (i + 1)}{Γ (i + 1) Γ (1)} {(\frac{1}{1 + θ})}^{1} {(\frac{θ}{1 + θ})}^{i}

= (\frac{1}{1 + θ}) \cdot {(1 - \frac{1}{1 + θ})}^{i}

= (\frac{1}{1 + n E (u)}) \cdot {(1 - \frac{1}{1 + n E (u)})}^{i}

Which is a geometric distribution with $p = \frac{1}{1 + n E (u)}$ .
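As a quick check of the k=1 special case, the sketch below compares the negative binomial form with the geometric form. The function names and the value nE(u)=5 are illustrative assumptions.

```python
import math

def nb_pmf(i, k, mean):
    """Negative binomial probability of recurrence i, with mean = n*E(u)."""
    log_p = (math.lgamma(i + k) - math.lgamma(i + 1) - math.lgamma(k)
             + k * math.log(k) + i * math.log(mean)
             - (k + i) * math.log(k + mean))
    return math.exp(log_p)

def geometric_pmf(i, mean):
    """Geometric probability with p = 1 / (1 + n*E(u))."""
    p = 1.0 / (1.0 + mean)
    return p * (1.0 - p) ** i

mean = 5.0  # illustrative n*E(u)
for i in (0, 1, 5, 20):
    print(i, nb_pmf(i, 1.0, mean), geometric_pmf(i, mean))
```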

For the approximation of i^*/n, we let ε=1; Equation 10 from the main text can then be rewritten as:

i^{*} \cdot \log (\frac{1}{1 + \frac{1}{n E (u)}}) = \log (\frac{1}{L_{A}})

$i^{*} \cdot \log (1 + \frac{1}{n E (u)}) = \log (L_{A})$ (S12)

For the left side of Equation S12, we use the first-order Taylor expansion, which holds when $\frac{1}{n E (u)}$ is small:

\log (1 + \frac{1}{n E (u)}) \approx \frac{1}{n E (u)}

Substituting this into Equation S12, we have:

\frac{i^{*}}{n} = \log (L_{A}) \cdot E (u)

Which is Equation 11 from main text.
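To see how closely the first-order form tracks the exact solution of Equation S12, the two can be compared for increasing n. This is a hedged sketch: the function names are our own, E(u) = 5 × 10^–6 is the value used for Appendix 1, figure 2, and the value of L_A is a hypothetical placeholder chosen only for illustration.

```python
import math

def i_star_exact(n, e_u, l_a):
    """Solve i* * log(1 + 1/(n*E(u))) = log(L_A) for i*."""
    return math.log(l_a) / math.log(1 + 1 / (n * e_u))

def i_star_approx(n, e_u, l_a):
    """First-order Taylor form: i*/n = log(L_A) * E(u)."""
    return n * math.log(l_a) * e_u

def rel_error(n, e_u, l_a):
    """Relative gap between the exact and the approximate cutoff."""
    exact = i_star_exact(n, e_u, l_a)
    return (exact - i_star_approx(n, e_u, l_a)) / exact

E_U = 5e-6  # E(u), as used for Appendix 1, figure 2
L_A = 3e7   # hypothetical number of sites, for illustration only

for n in (1e5, 1e6, 1e7, 1e8):
    print(f"n={n:.0e}  exact i*={i_star_exact(n, E_U, L_A):.1f}  "
          f"approx i*={i_star_approx(n, E_U, L_A):.1f}  "
          f"relative error={rel_error(n, E_U, L_A):.3f}")
```

Because log(1+x) < x for x > 0, the approximation always lies below the exact value, and the gap shrinks as nE(u) grows.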

5. Probing mutation rate variation with large samples

With a large number of samples sequenced, the data yield an additional benefit by revealing the evolution of the mutation rate itself. Given that the mutation rate per site is extremely small, the evolution of the mutation rate has been a most challenging issue (André and Godelle, 2006; Lynch, 2010; Lynch, 2011; Lynch et al., 2016; Ruan et al., 2020; Wei et al., 2022). In particular, without the check of selection, the mutation rate is liable to be trapped in runaway evolution (Ruan et al., 2020).

Appendix 1—figure 2


The gamma distribution of recurrences (i) under different shapes.

With E(u) = 5 × 10^–6, we set the shape parameter k to 0.2, 1, and 5, represented by three distinct colors. The number of synonymous sites with recurrence i (S_i) is shown on the y-axis. In the context of a large sample size (n=10⁶), the S_i distribution clearly distinguishes between different k values, mitigating the overdispersion issue encountered in smaller sample sizes. The inset depicts the distribution on a log10 scale for i≥10, with a horizontal dashed line indicating S_i=1, where i^* is the CDN cutoff.

The theory of mutation rate evolution should be based on the distribution of the per-site mutation rate across the genome. However, the empirical data so far only yield the mean. In particular, the spectrum of S_i’s for i’s close to 1 would be most informative about the evolution of the mutation mechanism. Appendix 1—figure 2 shows the S_i spectrum with k=0.2, 1, or 5 in a Gamma distribution. Note the mode of the distribution (i.e. the peak of the curve) for the three curves, which is at 0 or >0 depending on whether k≤1 or k>1. Clearly, the observed S_i’s can distinguish among the three distributions only when n is very large. The implications of such distributions for the theory of mutation rate evolution are addressed in Discussion.
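The dependence of the mode on k can be illustrated with the negative binomial form derived in Section 4. The sketch below uses n=10⁶ and E(u) = 5 × 10^–6 as in the figure legend; the number of synonymous sites (`SITES`) is a hypothetical placeholder, and the function names are our own.

```python
import math

def nb_pmf(i, k, mean):
    """Negative binomial probability of recurrence i, with mean = n*E(u)."""
    log_p = (math.lgamma(i + k) - math.lgamma(i + 1) - math.lgamma(k)
             + k * math.log(k) + i * math.log(mean)
             - (k + i) * math.log(k + mean))
    return math.exp(log_p)

def mode(k, mean, i_max=200):
    """Recurrence i in [0, i_max] at which the probability peaks."""
    return max(range(i_max + 1), key=lambda i: nb_pmf(i, k, mean))

N, E_U = 1e6, 5e-6
SITES = 1e7          # hypothetical number of synonymous sites
MEAN = N * E_U       # n*E(u)

for k in (0.2, 1.0, 5.0):
    spectrum = [SITES * nb_pmf(i, k, MEAN) for i in range(6)]
    print(f"k={k}: mode at i={mode(k, MEAN)}, "
          f"S_0..S_5 = {[round(s) for s in spectrum]}")
```

With these settings the peak sits at i=0 for k=0.2 and k=1, and moves to i>0 for k=5.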

Data availability

The key scripts used in this study are available at GitLab, copy archived at Zhang, 2024. A subset of key example files for breast cancer analysis can be found in the "/example_data_files" directory. The complete list of CDNs analyzed in this study is provided in Supplementary file 1.

References

1. Alexandrov LB
2. Nik-Zainal S
3. Wedge DC
4. Aparicio SAJR
5. Behjati S
6. Biankin AV
7. Bignell GR
8. Bolli N
9. Borg A
10. Børresen-Dale A-L
11. Boyault S
12. Burkhardt B
13. Butler AP
14. Caldas C
15. Davies HR
16. Desmedt C
17. Eils R
18. Eyfjörd JE
19. Foekens JA
20. Greaves M
21. Hosoda F
22. Hutter B
23. Ilicic T
24. Imbeaud S
25. Imielinski M
26. Jäger N
27. Jones DTW
28. Jones D
29. Knappskog S
30. Kool M
31. Lakhani SR
32. López-Otín C
33. Martin S
34. Munshi NC
35. Nakamura H
36. Northcott PA
37. Pajic M
38. Papaemmanuil E
39. Paradiso A
40. Pearson JV
41. Puente XS
42. Raine K
43. Ramakrishna M
44. Richardson AL
45. Richter J
46. Rosenstiel P
47. Schlesner M
48. Schumacher TN
49. Span PN
50. Teague JW
51. Totoki Y
52. Tutt ANJ
53. Valdés-Mas R
54. van Buuren MM
55. van ’t Veer L
56. Vincent-Salomon A
57. Waddell N
58. Yates LR
59. Zucman-Rossi J
60. Futreal PA
61. McDermott U
62. Lichter P
63. Meyerson M
64. Grimmond SM
65. Siebert R
66. Campo E
67. Shibata T
68. Pfister SM
69. Campbell PJ
70. Stratton MR
71. Australian Pancreatic Cancer Genome Initiative
72. ICGC Breast Cancer Consortium
73. ICGC MMML-Seq Consortium
74. ICGC PedBrain
(2013) Signatures of mutational processes in human cancer
Nature 500:415–421.

https://doi.org/10.1038/nature12477
(2020) The repertoire of mutational signatures in human cancer
Nature 578:94–101.

https://doi.org/10.1038/s41586-020-1943-3
(2019) Estimating the number of genetic mutations (hits) required for carcinogenesis based on the distribution of somatic mutations
PLOS Computational Biology 15:e1006881.

https://doi.org/10.1371/journal.pcbi.1006881
1. André JB
2. Godelle B
(2006) The evolution of mutation rate in finite asexual populations
Genetics 172:611–626.

https://doi.org/10.1534/genetics.105.046680
1. Armitage P
2. Doll R
(1954) The age distribution of cancer and a multi-stage theory of carcinogenesis
British Journal of Cancer 8:1–12.

https://doi.org/10.1038/bjc.1954.1
1. Bailey MH
2. Tokheim C
3. Porta-Pardo E
4. Sengupta S
5. Bertrand D
6. Weerasinghe A
7. Colaprico A
8. Wendl MC
9. Kim J
10. Reardon B
11. Ng PKS
12. Jeong KJ
13. Cao S
14. Wang Z
15. Gao J
16. Gao Q
17. Wang F
18. Liu EM
19. Mularoni L
20. Rubio-Perez C
21. Nagarajan N
22. Cortés-Ciriano I
23. Zhou DC
24. Liang WW
25. Hess JM
26. Yellapantula VD
27. Tamborero D
28. Gonzalez-Perez A
29. Suphavilai C
30. Ko JY
31. Khurana E
32. Park PJ
33. Van Allen EM
34. Liang H
35. MC3 Working Group
36. Cancer Genome Atlas Research Network
37. Lawrence MS
38. Godzik A
39. Lopez-Bigas N
40. Stuart J
41. Wheeler D
42. Getz G
43. Chen K
44. Lazar AJ
45. Mills GB
46. Karchin R
47. Ding L
(2018) Comprehensive characterization of cancer driver genes and mutations
Cell 173:371–385.

https://doi.org/10.1016/j.cell.2018.02.060
1. Bányai L
2. Trexler M
3. Kerekes K
4. Csuka O
5. Patthy L
(2021) Use of signals of positive and negative selection to distinguish cancer genes and passenger genes
eLife 10:e59629.

https://doi.org/10.7554/eLife.59629
1. Belikov AV
(2017) The number of key carcinogenic events can be predicted from cancer incidence
Scientific Reports 7:12170.

https://doi.org/10.1038/s41598-017-12448-7
1. Bergstrom EN
2. Luebeck J
3. Petljak M
4. Khandekar A
5. Barnes M
6. Zhang T
7. Steele CD
8. Pillay N
9. Landi MT
10. Bafna V
11. Mischel PS
12. Harris RS
13. Alexandrov LB
(2022) Mapping clustered mutations in cancer reveals APOBEC3 mutagenesis of ecDNA
Nature 602:510–517.

https://doi.org/10.1038/s41586-022-04398-6
1. Bersini S
2. Lytle NK
3. Schulte R
4. Huang L
5. Wahl GM
6. Hetzer MW
(2020) Nup93 regulates breast tumor growth by modulating cell proliferation and actin cytoskeleton remodeling
Life Science Alliance 3:e201900623.

https://doi.org/10.26508/lsa.201900623
1. Bian S
2. Wang Y
3. Zhou Y
4. Wang W
5. Guo L
6. Wen L
7. Fu W
8. Zhou X
9. Tang F
(2023) Integrative single-cell multiomics analyses dissect molecular signatures of intratumoral heterogeneities and differentiation states of human gastric cancer
National Science Review 10:wad094.

https://doi.org/10.1093/nsr/nwad094
1. Black JRM
2. McGranahan N
(2021) Genetic and non-genetic clonal diversity in cancer evolution
Nature Reviews. Cancer 21:379–392.

https://doi.org/10.1038/s41568-021-00336-2
(2011) Effect of aberrant p53 function on temozolomide sensitivity of glioma cell lines and brain tumor initiating cells from glioblastoma
Journal of Neuro-Oncology 102:1–7.

https://doi.org/10.1007/s11060-010-0283-9
1. Buisson R
2. Langenbucher A
3. Bowen D
4. Kwan EE
5. Benes CH
6. Zou L
7. Lawrence MS
(2019) Passenger hotspot mutations in cancer driven by APOBEC3A and mesoscale genomic features
Science 364:eaaw2872.

https://doi.org/10.1126/science.aaw2872
(2013) Evidence for APOBEC3B mutagenesis in multiple human cancers
Nature Genetics 45:977–983.

https://doi.org/10.1038/ng.2701
(2013) The cancer genome atlas pan-cancer analysis project
Nature Genetics 45:1113–1120.

https://doi.org/10.1038/ng.2764
(2018) Effect sizes of somatic mutations in cancer
Journal of the National Cancer Institute 110:1171–1177.

https://doi.org/10.1093/jnci/djy168
1. Cao Y
2. Chen L
3. Chen H
4. Cun Y
5. Dai X
6. Du H
7. Gao F
8. Guo F
9. Guo Y
10. Hao P
11. He S
12. He S
13. He X
14. Hu Z
15. Hoh BP
16. Jin X
17. Jiang Q
18. Jiang Q
19. Khan A
20. Kong HZ
21. Li J
22. Li SC
23. Li Y
24. Lin Q
25. Liu J
26. Liu Q
27. Lu J
28. Lu X
29. Luo S
30. Nie Q
31. Qiu Z
32. Shi T
33. Song X
34. Su J
35. Tao SC
36. Wang C
37. Wang CC
38. Wang GD
39. Wang J
40. Wu Q
41. Wu S
42. Xu S
43. Xue Y
44. Yang W
45. Yang Z
46. Ye K
47. Ye YN
48. Yu L
49. Zhao F
50. Zhao Y
51. Zhai W
52. Zhang D
53. Zhang L
54. Zheng H
55. Zhou Q
56. Zhu T
57. Zhang YP
(2023) Was Wuhan the early epicenter of the COVID-19 pandemic?-A critique
National Science Review 10:wac287.

https://doi.org/10.1093/nsr/nwac287
1. Cerami E
2. Gao J
3. Dogrusoz U
4. Gross BE
5. Sumer SO
6. Aksoy BA
7. Jacobsen A
8. Byrne CJ
9. Heuer ML
10. Larsson E
11. Antipin Y
12. Reva B
13. Goldberg AP
14. Sander C
15. Schultz N
(2012) The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data
Cancer Discovery 2:401–404.

https://doi.org/10.1158/2159-8290.CD-12-0095
1. Chang MT
2. Asthana S
3. Gao SP
4. Lee BH
5. Chapman JS
6. Kandoth C
7. Gao J
8. Socci ND
9. Solit DB
10. Olshen AB
11. Schultz N
12. Taylor BS
(2016) Identifying recurrent mutations in cancer reveals widespread lineage diversity and mutational specificity
Nature Biotechnology 34:155–163.

https://doi.org/10.1038/nbt.3391
1. Chen B
2. Shi Z
3. Chen Q
4. Shen X
5. Shibata D
6. Wen H
7. Wu CI
(2019) Tumorigenesis as the paradigm of quasi-neutral molecular evolution
Molecular Biology and Evolution 36:1430–1441.

https://doi.org/10.1093/molbev/msz075
1. Chen B
2. Wu X
3. Ruan Y
4. Zhang Y
5. Cai Q
6. Zapata L
7. Wu CI
8. Lan P
9. Wen H
(2022a) Very large hidden genetic diversity in one single tumor: evidence for tumors-in-tumor
National Science Review 9:wac250.

https://doi.org/10.1093/nsr/nwac250
1. Chen Q
2. Yang H
3. Feng X
4. Chen Q
5. Shi S
6. Wu CI
7. He Z
(2022b) Two decades of suspect evidence for adaptive molecular evolution-negative selection confounding positive-selection signals
National Science Review 9:wab217.

https://doi.org/10.1093/nsr/nwab217
1. Chu D
2. Wei L
(2019) Nonsynonymous, synonymous and nonsense mutations in human cancer-related genes undergo stronger purifying selections than expectation
BMC Cancer 19:359.

https://doi.org/10.1186/s12885-019-5572-x
1. Collins KAL
2. Stuhlmiller TJ
3. Zawistowski JS
4. East MP
5. Pham TT
6. Hall CR
7. Goulet DR
8. Bevill SM
9. Angus SP
10. Velarde SH
11. Sciaky N
12. Oprea TI
13. Graves LM
14. Johnson GL
15. Gomez SM
(2018) Proteomic analysis defines kinase taxonomies specific for subtypes of breast cancer
Oncotarget 9:15480–15497.

https://doi.org/10.18632/oncotarget.24337
1. Danesi R
2. Fogli S
3. Indraccolo S
4. Del Re M
5. Dei Tos AP
6. Leoncini L
7. Antonuzzo L
8. Bonanno L
9. Guarneri V
10. Pierini A
11. Amunni G
12. Conte P
(2021) Druggable targets meet oncogenic drivers: opportunities and limitations of target-based classification of tumors and the role of Molecular Tumor Boards
ESMO Open 6:100040.

https://doi.org/10.1016/j.esmoop.2020.100040
1. Dang CV
2. Reddy EP
3. Shokat KM
4. Soucek L
(2017) Drugging the “undruggable” cancer targets
Nature Reviews. Cancer 17:502–508.

https://doi.org/10.1038/nrc.2017.36
1. de Bruijn I
2. Kundra R
3. Mastrogiacomo B
4. Tran TN
5. Sikina L
6. Mazor T
7. Li X
8. Ochoa A
9. Zhao G
10. Lai B
(2023) analysis and visualization of longitudinal genomic and clinical data from the AACR Project GENIE biopharma collaborative in cBioPortal
Cancer Research 83:3861–3867.

https://doi.org/10.1158/0008-5472.CAN-23-0816
1. De Kegel B
2. Ryan CJ
(2019) Paralog buffering contributes to the variable essentiality of genes in cancer cell lines
PLOS Genetics 15:e1008466.

https://doi.org/10.1371/journal.pgen.1008466
1. Deng S
2. Xing K
3. He X
(2022) Mutation signatures inform the natural host of SARS-CoV-2
National Science Review 9:wab220.

https://doi.org/10.1093/nsr/nwab220
1. Elliott K
2. Larsson E
(2021) Non-coding driver mutations in human cancer
Nature Reviews. Cancer 21:500–509.

https://doi.org/10.1038/s41568-021-00371-z
1. Fang Y
2. Deng S
3. Li C
(2022) A generalizable deep learning framework for inferring fine-scale germline mutation rate maps
Nature Machine Intelligence 4:1209–1223.

https://doi.org/10.1038/s42256-022-00574-5
1. Gartner JJ
2. Parker SCJ
3. Prickett TD
4. Dutton-Regester K
5. Stitzel ML
6. Lin JC
7. Davis S
8. Simhadri VL
9. Jha S
10. Katagiri N
11. Gotea V
12. Teer JK
13. Wei X
14. Morken MA
15. Bhanot UK
16. Chen G
17. Elnitski LL
18. Davies MA
19. Gershenwald JE
20. Carter H
21. Karchin R
22. Robinson W
23. Robinson S
24. Rosenberg SA
25. Collins FS
26. Parmigiani G
27. Komar AA
28. Kimchi-Sarfaty C
29. Hayward NK
30. Margulies EH
31. Samuels Y
32. Becker J
33. Benjamin B
34. Blakesley R
35. Bouffard G
36. Brooks S
37. Coleman H
38. Dekhtyar M
39. Gregory M
40. Guan X
41. Gupta J
42. Han J
43. Hargrove A
44. Ho S
45. Johnson T
46. Legaspi R
47. Lovett S
48. Maduro Q
49. Masiello C
50. Maskeri B
51. McDowell J
52. Montemayor C
53. Mullikin J
54. Park M
55. Riebow N
56. Schandler K
57. Schmidt B
58. Sison C
59. Stantripop M
60. Thomas J
61. Thomas P
62. Vemulapalli M
63. Young A
64. NISC Comparative Sequencing Program
(2013) Whole-genome sequencing identifies a recurrent functional synonymous mutation in melanoma
PNAS 110:13481–13486.

https://doi.org/10.1073/pnas.1304227110
1. Gojobori T
2. Li WH
3. Graur D
(1982) Patterns of nucleotide substitution in pseudogenes and functional genes
Journal of Molecular Evolution 18:360–369.

https://doi.org/10.1007/BF01733904
1. Hamburger AW
(2008) The role of ErbB3 and its binding partners in breast cancer progression and resistance to hormone and tyrosine kinase directed therapies
Journal of Mammary Gland Biology and Neoplasia 13:225–233.

https://doi.org/10.1007/s10911-008-9077-5
1. Hanahan D
2. Weinberg RA
(2000) The hallmarks of cancer
Cell 100:57–70.

https://doi.org/10.1016/s0092-8674(00)81683-9
1. Hanahan D
2. Weinberg RA
(2011) Hallmarks of cancer: the next generation
Cell 144:646–674.

https://doi.org/10.1016/j.cell.2011.02.013
1. Hanahan D
(2022) Hallmarks of cancer: new dimensions
Cancer Discovery 12:31–46.

https://doi.org/10.1158/2159-8290.CD-21-1059
1. Haradhvala NJ
2. Kim J
3. Maruvka YE
4. Polak P
5. Rosebrock D
6. Livitz D
7. Hess JM
8. Leshchiner I
9. Kamburov A
10. Mouw KW
11. Lawrence MS
12. Getz G
(2018) Distinct mutational signatures characterize concurrent loss of polymerase proofreading and mismatch repair
Nature Communications 9:1746.

https://doi.org/10.1038/s41467-018-04002-4
Book
1. Hartl DL
2. Clark AG
(1989)
Principles of population genetics

Sunderland, Mass: Sinauer.
1. He Z
2. Xu S
3. Shi S
(2020a) Adaptive convergence at the genomic level-prevalent, uncommon or very rare?
National Science Review 7:947–951.

https://doi.org/10.1093/nsr/nwaa076
1. He Z
2. Xu S
3. Zhang Z
4. Guo W
5. Lyu H
6. Zhong C
7. Boufford DE
8. Duke NC
9. Consortium TIM
10. Shi S
(2020b) Convergent adaptation of the genomes of woody plants at the land-sea interface
National Science Review 7:978–993.

https://doi.org/10.1093/nsr/nwaa027
1. Herzog M
2. Alonso-Perez E
3. Salguero I
4. Warringer J
5. Adams DJ
6. Jackson SP
7. Puddu F
(2021) Mutagenic mechanisms of cancer-associated DNA polymerase ϵ alleles
Nucleic Acids Research 49:3919–3931.

https://doi.org/10.1093/nar/gkab160
1. Hess JM
2. Bernards A
3. Kim J
4. Miller M
5. Taylor-Weiner A
6. Haradhvala NJ
7. Lawrence MS
8. Getz G
(2019) Passenger hotspot mutations in cancer
Cancer Cell 36:288–301.

https://doi.org/10.1016/j.ccell.2019.08.002
1. Hodgkinson A
2. Eyre-Walker A
(2011) Variation in the mutation rate across mammalian genomes
Nature Reviews. Genetics 12:756–766.

https://doi.org/10.1038/nrg3098
1. Hodis E
2. Torlai Triglia E
3. Kwon JYH
4. Biancalani T
5. Zakka LR
6. Parkar S
7. Hütter J-C
8. Buffoni L
9. Delorey TM
10. Phillips D
11. Dionne D
12. Nguyen LT
13. Schapiro D
14. Maliga Z
15. Jacobson CA
16. Hendel A
17. Rozenblatt-Rosen O
18. Mihm MC Jr
19. Garraway LA
20. Regev A
(2022) Stepwise-edited, human melanoma models reveal mutations’ effect on tumor and microenvironment
Science 376:eabi8175.

https://doi.org/10.1126/science.abi8175
1. Holbro T
2. Beerli RR
3. Maurer F
4. Koziczak M
5. Barbas CF
6. Hynes NE
(2003) The ErbB2/ErbB3 heterodimer functions as an oncogenic unit: ErbB2 requires ErbB3 to drive breast tumor cell proliferation
PNAS 100:8933–8938.

https://doi.org/10.1073/pnas.1537685100
1. Huang X
2. Gao L
3. Wang S
4. McManaman JL
5. Thor AD
6. Yang X
7. Esteva FJ
8. Liu B
(2010) Heterotrimerization of the growth factor receptors erbB2, erbB3, and insulin-like growth factor-I receptor in breast cancer cells resistant to herceptin
Cancer Research 70:1204–1214.

https://doi.org/10.1158/0008-5472.CAN-09-3321
1. ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium
(2020) Pan-cancer analysis of whole genomes
Nature 578:82–93.

https://doi.org/10.1038/s41586-020-1969-6
1. Jackson AL
2. Loeb LA
(1998) The mutation rate and cancer
Genetics 148:1483–1490.

https://doi.org/10.1093/genetics/148.4.1483
1. Jiang Z
2. Ju Y
3. Ali A
4. Chung PED
5. Skowron P
6. Wang D-Y
7. Shrestha M
8. Li H
9. Liu JC
10. Vorobieva I
11. Ghanbari-Azarnier R
12. Mwewa E
13. Koritzinsky M
14. Ben-David Y
15. Woodgett JR
16. Perou CM
17. Dupuy A
18. Bader GD
19. Egan SE
20. Taylor MD
21. Zacksenhaus E
(2023) Distinct shared and compartment-enriched oncogenic networks drive primary versus metastatic breast cancer
Nature Communications 14:4313.

https://doi.org/10.1038/s41467-023-39935-y
1. Juul RI
2. Nielsen MM
3. Juul M
4. Feuerbach L
5. Pedersen JS
(2021) The landscape and driver potential of site-specific hotspots across cancer genomes
NPJ Genomic Medicine 6:33.

https://doi.org/10.1038/s41525-021-00197-6
1. Kim J
2. Mouw KW
3. Polak P
4. Braunstein LZ
5. Kamburov A
6. Tiao G
7. Kwiatkowski DJ
8. Rosenberg JE
9. Van Allen EM
10. D’Andrea AD
11. Getz G
(2016) Somatic ERCC2 mutations are associated with a distinct genomic signature in urothelial tumors
Nature Genetics 48:600–606.

https://doi.org/10.1038/ng.3557
1. Kwa MQ
2. Brandao R
3. Phung TH
4. Ge J
5. Scieri G
6. Brakebusch C
(2021) MRCKα is dispensable for breast cancer development in the MMTV-PyMT model
Cells 10:942.

https://doi.org/10.3390/cells10040942
1. Lagou V
2. Jiang L
3. Ulrich A
4. Zudina L
5. González KSG
6. Balkhiyarova Z
7. Faggian A
8. Maina JG
9. Chen S
10. Todorov PV
11. Sharapov S
12. David A
13. Marullo L
14. Mägi R
15. Rujan R-M
16. Ahlqvist E
17. Thorleifsson G
18. Gao Η
19. Εvangelou Ε
20. Benyamin B
21. Scott RA
22. Isaacs A
23. Zhao JH
24. Willems SM
25. Johnson T
26. Gieger C
27. Grallert H
28. Meisinger C
29. Müller-Nurasyid M
30. Strawbridge RJ
31. Goel A
32. Rybin D
33. Albrecht E
34. Jackson AU
35. Stringham HM
36. Corrêa IR Jr
37. Farber-Eger E
38. Steinthorsdottir V
39. Uitterlinden AG
40. Munroe PB
41. Brown MJ
42. Schmidberger J
43. Holmen O
44. Thorand B
45. Hveem K
46. Wilsgaard T
47. Mohlke KL
48. Wang Z
49. GWA-PA Consortium
50. Shmeliov A
51. den Hoed M
52. Loos RJF
53. Kratzer W
54. Haenle M
55. Koenig W
56. Boehm BO
57. Tan TM
58. Tomas A
59. Salem V
60. Barroso I
61. Tuomilehto J
62. Boehnke M
63. Florez JC
64. Hamsten A
65. Watkins H
66. Njølstad I
67. Wichmann H-E
68. Caulfield MJ
69. Khaw K-T
70. van Duijn CM
71. Hofman A
72. Wareham NJ
73. Langenberg C
74. Whitfield JB
75. Martin NG
76. Montgomery G
77. Scapoli C
78. Tzoulaki I
79. Elliott P
80. Thorsteinsdottir U
81. Stefansson K
82. Brittain EL
83. McCarthy MI
84. Froguel P
85. Sexton PM
86. Wootten D
87. Groop L
88. Dupuis J
89. Meigs JB
90. Deganutti G
91. Demirkan A
92. Pers TH
93. Reynolds CA
94. Aulchenko YS
95. Kaakinen MA
96. Jones B
97. Prokopenko I
98. Meta-Analysis of Glucose and Insulin-Related Traits Consortium (MAGIC)
(2023) GWAS of random glucose in 476,326 individuals provide insights into diabetes pathophysiology, complications and treatment stratification
Nature Genetics 55:1448–1461.

https://doi.org/10.1038/s41588-023-01462-3
Book
1. Li WH
(1997)
Molecular Evolution

Sunderland, Mass: Sinauer Associates.
1. Li M
2. Chen W
3. Sun X
4. Wang Z
5. Zou X
6. Wei H
7. Wang Z
8. Chen W
(2019) Metastatic colorectal cancer and severe hypocalcemia following irinotecan administration in a patient with X-linked agammaglobulinemia: a case report
BMC Medical Genetics 20:157.

https://doi.org/10.1186/s12881-019-0880-1
1. Lin L
2. Cai J
3. Tan Z
4. Meng X
5. Li R
6. Li Y
7. Jiang C
(2021) Mutant IDH1 enhances temozolomide sensitivity via regulation of the ATM/CHK2 pathway in glioma
Cancer Research and Treatment 53:367–377.

https://doi.org/10.4143/crt.2020.506
1. Lin J
2. Zhan G
3. Liu J
4. Maimaitiyiming Y
5. Deng Z
6. Li B
7. Su K
8. Chen J
9. Sun S
10. Zheng W
11. Yu X
12. He F
13. Cheng X
14. Wang L
15. Shen B
16. Yao Z
17. Yang X
18. Zhang J
19. He W
20. Wu H
21. Naranmandura H
22. Chang KJ
23. Min J
24. Ma J
25. Björklund M
26. Xu PF
27. Wang F
28. Hsu CH
(2023) YTHDF2-mediated regulations bifurcate BHPF-induced programmed cell deaths
National Science Review 10:wad227.

https://doi.org/10.1093/nsr/nwad227
1. Ling S
2. Hu Z
3. Yang Z
4. Yang F
5. Li Y
6. Lin P
7. Chen K
8. Dong L
9. Cao L
10. Tao Y
11. Hao L
12. Chen Q
13. Gong Q
14. Wu D
15. Li W
16. Zhao W
17. Tian X
18. Hao C
19. Hungate EA
20. Catenacci DVT
21. Hudson RR
22. Li WH
23. Lu X
24. Wu CI
(2015) Extremely high genetic diversity in a single tumor points to prevalence of non-Darwinian cell evolution
PNAS 112:E6496.

https://doi.org/10.1073/pnas.1519556112
1. Liu L
2. Zhang Z
3. Xia X
4. Lei J
(2022) KIF18B promotes breast cancer cell proliferation, migration and invasion by targeting TRIP13 and activating the Wnt/β-catenin signaling pathway
Oncology Letters 23:112.

https://doi.org/10.3892/ol.2022.13232
1. Luo B
2. Cheung HW
3. Subramanian A
4. Sharifnia T
5. Okamoto M
6. Yang X
7. Hinkle G
8. Boehm JS
9. Beroukhim R
10. Weir BA
11. Mermel C
12. Barbie DA
13. Awad T
14. Zhou X
15. Nguyen T
16. Piqani B
17. Li C
18. Golub TR
19. Meyerson M
20. Hacohen N
21. Hahn WC
22. Lander ES
23. Sabatini DM
24. Root DE
(2008) Highly parallel identification of essential genes in cancer cells
PNAS 105:20380–20385.

https://doi.org/10.1073/pnas.0810485105
1. Luo P
2. Ding Y
3. Lei X
4. Wu FX
(2019) deepDriver: predicting cancer driver genes based on somatic mutations using deep convolutional neural networks
Frontiers in Genetics 10:13.

https://doi.org/10.3389/fgene.2019.00013
1. Lynch M
(2010) Evolution of the mutation rate
Trends in Genetics 26:345–352.

https://doi.org/10.1016/j.tig.2010.05.003
1. Lynch M
(2011) The lower bound to the evolution of mutation rates
Genome Biology and Evolution 3:1107–1118.

https://doi.org/10.1093/gbe/evr066
1. Lynch M
2. Ackerman MS
3. Gout JF
4. Long H
5. Sung W
6. Thomas WK
7. Foster PL
(2016) Genetic drift, selection and the evolution of the mutation rate
Nature Reviews. Genetics 17:704–714.

https://doi.org/10.1038/nrg.2016.104
1. Makova KD
2. Hardison RC
(2015) The effects of chromatin organization on variation in mutation rates in the genome
Nature Reviews. Genetics 16:213–223.

https://doi.org/10.1038/nrg3890
1. Martincorena I
2. Raine KM
3. Gerstung M
4. Dawson KJ
5. Haase K
6. Van Loo P
7. Davies H
8. Stratton MR
9. Campbell PJ
(2017) Universal patterns of selection in cancer and somatic tissues
Cell 171:1029–1041.

https://doi.org/10.1016/j.cell.2017.09.042
1. Munirajan AK
2. Ando K
3. Mukai A
4. Takahashi M
5. Suenaga Y
6. Ohira M
7. Koda T
8. Hirota T
9. Ozaki T
10. Nakagawara A
(2008) KIF1Bbeta functions as a haploinsufficient tumor suppressor gene mapped to chromosome 1p36.2 by inducing apoptotic cell death
The Journal of Biological Chemistry 283:24426–24434.

https://doi.org/10.1074/jbc.M802316200
1. Nataraj NB
2. Noronha A
3. Lee JS
4. Ghosh S
5. Mohan Raju HR
6. Sekar A
7. Zuckerman B
8. Lindzen M
9. Tarcitano E
10. Srivastava S
11. Selitrennik M
12. Livneh I
13. Drago-Garcia D
14. Rueda O
15. Caldas C
16. Lev S
17. Geiger T
18. Ciechanover A
19. Ulitsky I
20. Seger R
21. Ruppin E
22. Yarden Y
(2022) Nucleoporin-93 reveals a common feature of aggressive breast cancers: robust nucleocytoplasmic transport of transcription factors
Cell Reports 38:110418.

https://doi.org/10.1016/j.celrep.2022.110418
1. Nesta AV
2. Tafur D
3. Beck CR
(2021) Hotspots of human mutation
Trends in Genetics 37:717–729.

https://doi.org/10.1016/j.tig.2020.10.003
(2023) Impact of EGFR A289T/V mutation on relapse pattern in glioblastoma
ESMO Open 8:100740.

https://doi.org/10.1016/j.esmoop.2022.100740
1. Ortmann CA
2. Kent DG
3. Nangalia J
4. Silber Y
5. Wedge DC
6. Grinfeld J
7. Baxter EJ
8. Massie CE
9. Papaemmanuil E
10. Menon S
11. Godfrey AL
12. Dimitropoulou D
13. Guglielmelli P
14. Bellosillo B
15. Besses C
16. Döhner K
17. Harrison CN
18. Vassiliou GS
19. Vannucchi A
20. Campbell PJ
21. Green AR
(2015) Effect of mutation order on myeloproliferative neoplasms
The New England Journal of Medicine 372:601–612.

https://doi.org/10.1056/NEJMoa1412098
1. Pan Y
2. Liu P
3. Wang F
4. Wu P
5. Cheng F
6. Jin X
7. Xu S
(2022a) Lineage-specific positive selection on ACE2 contributes to the genetic susceptibility of COVID-19
National Science Review 9:nwac118.

https://doi.org/10.1093/nsr/nwac118
1. Pan Y
2. Zhang C
3. Lu Y
4. Ning Z
5. Lu D
6. Gao Y
7. Zhao X
8. Yang Y
9. Guan Y
10. Mamatyusupu D
11. Xu S
(2022b) Genomic diversity and post-admixture adaptation in the Uyghurs
National Science Review 9:nwab124.

https://doi.org/10.1093/nsr/nwab124
1. Pleasance ED
2. Cheetham RK
3. Stephens PJ
4. McBride DJ
5. Humphray SJ
6. Greenman CD
7. Varela I
8. Lin M-L
9. Ordóñez GR
10. Bignell GR
11. Ye K
12. Alipaz J
13. Bauer MJ
14. Beare D
15. Butler A
16. Carter RJ
17. Chen L
18. Cox AJ
19. Edkins S
20. Kokko-Gonzales PI
21. Gormley NA
22. Grocock RJ
23. Haudenschild CD
24. Hims MM
25. James T
26. Jia M
27. Kingsbury Z
28. Leroy C
29. Marshall J
30. Menzies A
31. Mudie LJ
32. Ning Z
33. Royce T
34. Schulz-Trieglaff OB
35. Spiridou A
36. Stebbings LA
37. Szajkowski L
38. Teague J
39. Williamson D
40. Chin L
41. Ross MT
42. Campbell PJ
43. Bentley DR
44. Futreal PA
45. Stratton MR
(2010) A comprehensive catalogue of somatic mutations from a human cancer genome
Nature 463:191–196.

https://doi.org/10.1038/nature08658
1. Plotkin JB
2. Kudla G
(2011) Synonymous but not the same: the causes and consequences of codon bias
Nature Reviews. Genetics 12:32–42.

https://doi.org/10.1038/nrg2899
(2015) Cell-of-origin chromatin organization shapes the mutational landscape of cancer
Nature 518:360–364.

https://doi.org/10.1038/nature14221
1. Roberts SA
2. Lawrence MS
3. Klimczak LJ
4. Grimm SA
5. Fargo D
6. Stojanov P
7. Kiezun A
8. Kryukov GV
9. Carter SL
10. Saksena G
11. Harris S
12. Shah RR
13. Resnick MA
14. Getz G
15. Gordenin DA
(2013) An APOBEC cytidine deaminase mutagenesis pattern is widespread in human cancers
Nature Genetics 45:970–976.

https://doi.org/10.1038/ng.2702
1. Ruan Y
2. Wang H
3. Chen B
4. Wen H
5. Wu CI
(2020) Mutations beget more mutations: rapid evolution of mutation rate in response to the risk of runaway accumulation
Molecular Biology and Evolution 37:1007–1019.

https://doi.org/10.1093/molbev/msz283
1. Ruan Y
2. Hou M
3. Tang X
4. He X
5. Lu X
6. Lu J
7. Wu CI
8. Wen H
(2022a) The runaway evolution of SARS-CoV-2 leading to the highly evolved delta strain
Molecular Biology and Evolution 39:msac046.

https://doi.org/10.1093/molbev/msac046
1. Ruan Y
2. Wen H
3. Hou M
4. He Z
5. Lu X
6. Xue Y
7. He X
8. Zhang YP
9. Wu CI
(2022b) The twin-beginnings of COVID-19 in Asia and Europe: one prevails quickly
National Science Review 9:nwab223.

https://doi.org/10.1093/nsr/nwab223
1. Ruan Y
2. Wen H
3. Hou M
4. Zhai W
5. Xu S
6. Lu X
(2023) On the epicenter of COVID-19 and the origin of the pandemic strain
National Science Review 10:nwac286.

https://doi.org/10.1093/nsr/nwac286
(2014) Determinants of mutation rate variation in the human germline
Annual Review of Genomics and Human Genetics 15:47–70.

https://doi.org/10.1146/annurev-genom-031714-125740
1. Sharp PM
2. Li WH
(1987) The codon Adaptation Index--a measure of directional synonymous codon usage bias, and its potential applications
Nucleic Acids Research 15:1281–1295.

https://doi.org/10.1093/nar/15.3.1281
1. Sherman MA
2. Yaari AU
3. Priebe O
4. Dietlein F
5. Loh PR
6. Berger B
(2022) Genome-wide mapping of somatic mutation rates uncovers drivers of cancer
Nature Biotechnology 40:1634–1643.

https://doi.org/10.1038/s41587-022-01353-8
1. Sithanandam G
2. Anderson LM
(2008) The ERBB3 receptor in cancer and cancer gene therapy
Cancer Gene Therapy 15:413–448.

https://doi.org/10.1038/cgt.2008.15
1. Song Q
2. Li M
3. Li Q
4. Lu X
5. Song K
6. Zhang Z
7. Wei J
8. Zhang L
9. Wei J
10. Ye Y
11. Zha J
12. Zhang Q
13. Gao Q
14. Long J
15. Liu X
16. Lu X
17. Zhang J
(2023) DeepAlloDriver: a deep learning-based strategy to predict cancer driver mutations
Nucleic Acids Research 51:W129–W133.

https://doi.org/10.1093/nar/gkad295
(2009) Human mutation rate associated with DNA replication timing
Nature Genetics 41:393–395.

https://doi.org/10.1038/ng.363
1. Stern DF
(2008) ERBB3/HER3 and ERBB2/HER2 duet in mammary development and breast cancer
Journal of Mammary Gland Biology and Neoplasia 13:215–223.

https://doi.org/10.1007/s10911-008-9083-7
1. Stobbe MD
2. Thun GA
3. Diéguez-Docampo A
4. Oliva M
5. Whalley JP
6. Raineri E
7. Gut IG
(2019) Recurrent somatic mutations reveal new insights into consequences of mutagenic processes in cancer
PLOS Computational Biology 15:e1007496.

https://doi.org/10.1371/journal.pcbi.1007496
1. Supek F
2. Lehner B
(2015) Differential DNA mismatch repair underlies mutation rate variation across the human genome
Nature 521:81–84.

https://doi.org/10.1038/nature14173
1. Suzuki K
2. Hatzikotoulas K
3. Southam L
4. Taylor HJ
5. Yin X
6. Lorenz KM
7. Mandla R
8. Huerta-Chagoya A
9. Melloni GEM
10. Kanoni S
11. Rayner NW
12. Bocher O
13. Arruda AL
14. Sonehara K
15. Namba S
16. Lee SSK
17. Preuss MH
18. Petty LE
19. Schroeder P
20. Vanderwerff B
21. Kals M
22. Bragg F
23. Lin K
24. Guo X
25. Zhang W
26. Yao J
27. Kim YJ
28. Graff M
29. Takeuchi F
30. Nano J
31. Lamri A
32. Nakatochi M
33. Moon S
34. Scott RA
35. Cook JP
36. Lee J-J
37. Pan I
38. Taliun D
39. Parra EJ
40. Chai J-F
41. Bielak LF
42. Tabara Y
43. Hai Y
44. Thorleifsson G
45. Grarup N
46. Sofer T
47. Wuttke M
48. Sarnowski C
49. Gieger C
50. Nousome D
51. Trompet S
52. Kwak S-H
53. Long J
54. Sun M
55. Tong L
56. Chen W-M
57. Nongmaithem SS
58. Noordam R
59. Lim VJY
60. Tam CHT
61. Joo YY
62. Chen C-H
63. Raffield LM
64. Prins BP
65. Nicolas A
66. Yanek LR
67. Chen G
68. Brody JA
69. Kabagambe E
70. An P
71. Xiang AH
72. Choi HS
73. Cade BE
74. Tan J
75. Broadaway KA
76. Williamson A
77. Kamali Z
78. Cui J
79. Thangam M
80. Adair LS
81. Adeyemo A
82. Aguilar-Salinas CA
83. Ahluwalia TS
84. Anand SS
85. Bertoni A
86. Bork-Jensen J
87. Brandslund I
88. Buchanan TA
89. Burant CF
90. Butterworth AS
91. Canouil M
92. Chan JCN
93. Chang L-C
94. Chee M-L
95. Chen J
96. Chen S-H
97. Chen Y-T
98. Chen Z
99. Chuang L-M
100. Cushman M
101. Danesh J
102. Das SK
103. de Silva HJ
104. Dedoussis G
105. Dimitrov L
106. Doumatey AP
107. Du S
108. Duan Q
109. Eckardt K-U
110. Emery LS
111. Evans DS
112. Evans MK
113. Fischer K
114. Floyd JS
115. Ford I
116. Franco OH
117. Frayling TM
118. Freedman BI
119. Genter P
120. Gerstein HC
121. Giedraitis V
122. González-Villalpando C
123. González-Villalpando ME
124. Gordon-Larsen P
125. Gross M
126. Guare LA
127. Hackinger S
128. Hakaste L
129. Han S
130. Hattersley AT
131. Herder C
132. Horikoshi M
133. Howard A-G
134. Hsueh W
135. Huang M
136. Huang W
137. Hung Y-J
138. Hwang MY
139. Hwu C-M
140. Ichihara S
141. Ikram MA
142. Ingelsson M
143. Islam MT
144. Isono M
145. Jang H-M
146. Jasmine F
147. Jiang G
148. Jonas JB
149. Jørgensen T
150. Kamanu FK
151. Kandeel FR
152. Kasturiratne A
153. Katsuya T
154. Kaur V
155. Kawaguchi T
156. Keaton JM
157. Kho AN
158. Khor C-C
159. Kibriya MG
160. Kim D-H
161. Kronenberg F
162. Kuusisto J
163. Läll K
164. Lange LA
165. Lee KM
166. Lee M-S
167. Lee NR
168. Leong A
169. Li L
170. Li Y
171. Li-Gao R
172. Ligthart S
173. Lindgren CM
174. Linneberg A
175. Liu C-T
176. Liu J
177. Locke AE
178. Louie T
179. Luan J
180. Luk AO
181. Luo X
182. Lv J
183. Lynch JA
184. Lyssenko V
185. Maeda S
186. Mamakou V
187. Mansuri SR
188. Matsuda K
189. Meitinger T
190. Melander O
191. Metspalu A
192. Mo H
193. Morris AD
194. Moura FA
195. Nadler JL
196. Nalls MA
197. Nayak U
198. Ntalla I
199. Okada Y
200. Orozco L
201. Patel SR
202. Patil S
203. Pei P
204. Pereira MA
205. Peters A
206. Pirie FJ
207. Polikowsky HG
208. Porneala B
209. Prasad G
210. Rasmussen-Torvik LJ
211. Reiner AP
212. Roden M
213. Rohde R
214. Roll K
215. Sabanayagam C
216. Sandow K
217. Sankareswaran A
218. Sattar N
219. Schönherr S
220. Shahriar M
221. Shen B
222. Shi J
223. Shin DM
224. Shojima N
225. Smith JA
226. So WY
227. Stančáková A
228. Steinthorsdottir V
229. Stilp AM
230. Strauch K
231. Taylor KD
232. Thorand B
233. Thorsteinsdottir U
234. Tomlinson B
235. Tran TC
236. Tsai F-J
237. Tuomilehto J
238. Tusie-Luna T
239. Udler MS
240. Valladares-Salgado A
241. van Dam RM
242. van Klinken JB
243. Varma R
244. Wacher-Rodarte N
245. Wheeler E
246. Wickremasinghe AR
247. van Dijk KW
248. Witte DR
249. Yajnik CS
250. Yamamoto K
251. Yamamoto K
252. Yoon K
253. Yu C
254. Yuan J-M
255. Yusuf S
256. Zawistowski M
257. Zhang L
258. Zheng W
259. Raffel LJ
260. Igase M
261. Ipp E
262. Redline S
263. Cho YS
264. Lind L
265. Province MA
266. Fornage M
267. Hanis CL
268. Ingelsson E
269. Zonderman AB
270. Psaty BM
271. Wang Y-X
272. Rotimi CN
273. Becker DM
274. Matsuda F
275. Liu Y
276. Yokota M
277. Kardia SLR
278. Peyser PA
279. Pankow JS
280. Engert JC
281. Bonnefond A
282. Froguel P
283. Wilson JG
284. Sheu WHH
285. Wu J-Y
286. Hayes MG
287. Ma RCW
288. Wong T-Y
289. Mook-Kanamori DO
290. Tuomi T
291. Chandak GR
292. Collins FS
293. Bharadwaj D
294. Paré G
295. Sale MM
296. Ahsan H
297. Motala AA
298. Shu X-O
299. Park K-S
300. Jukema JW
301. Cruz M
302. Chen Y-DI
303. Rich SS
304. McKean-Cowdin R
305. Grallert H
306. Cheng C-Y
307. Ghanbari M
308. Tai E-S
309. Dupuis J
310. Kato N
311. Laakso M
312. Köttgen A
313. Koh W-P
314. Bowden DW
315. Palmer CNA
316. Kooner JS
317. Kooperberg C
318. Liu S
319. North KE
320. Saleheen D
321. Hansen T
322. Pedersen O
323. Wareham NJ
324. Lee J
325. Kim B-J
326. Millwood IY
327. Walters RG
328. Stefansson K
329. Ahlqvist E
330. Goodarzi MO
331. Mohlke KL
332. Langenberg C
333. Haiman CA
334. Loos RJF
335. Florez JC
336. Rader DJ
337. Ritchie MD
338. Zöllner S
339. Mägi R
340. Marston NA
341. Ruff CT
342. van Heel DA
343. Finer S
344. Denny JC
345. Yamauchi T
346. Kadowaki T
347. Chambers JC
348. Ng MCY
349. Sim X
350. Below JE
351. Tsao PS
352. Chang K-M
353. McCarthy MI
354. Meigs JB
355. Mahajan A
356. Spracklen CN
357. Mercader JM
358. Boehnke M
359. Rotter JI
360. Vujkovic M
361. Voight BF
362. Morris AP
363. Zeggini E
364. VA Million Veteran Program
(2024) Genetic drivers of heterogeneity in type 2 diabetes pathophysiology
Nature 627:347–357.

https://doi.org/10.1038/s41586-024-07019-6
1. Takeda H
(2021) A platform for validating colorectal cancer driver genes using mouse organoids
Frontiers in Genetics 12:698771.

https://doi.org/10.3389/fgene.2021.698771
1. Tate JG
2. Bamford S
3. Jubb HC
4. Sondka Z
5. Beare DM
6. Bindal N
7. Boutselakis H
8. Cole CG
9. Creatore C
10. Dawson E
11. Fish P
12. Harsha B
13. Hathaway C
14. Jupe SC
15. Kok CY
16. Noble K
17. Ponting L
18. Ramshaw CC
19. Rye CE
20. Speedy HE
21. Stefancsik R
22. Thompson SL
23. Wang S
24. Ward S
25. Campbell PJ
26. Forbes SA
(2019) COSMIC: the catalogue of somatic mutations in cancer
Nucleic Acids Research 47:D941–D947.

https://doi.org/10.1093/nar/gky1015
(2019) Resolving genetic heterogeneity in cancer
Nature Reviews. Genetics 20:404–416.

https://doi.org/10.1038/s41576-019-0114-6
1. Unbekandt M
2. Olson MF
(2014) The actin-myosin regulatory MRCK kinases: regulation, biological functions and associations with human cancer
Journal of Molecular Medicine 92:217–225.

https://doi.org/10.1007/s00109-014-1133-6
(2016) Somatic mutation patterns in hemizygous genomic regions unveil purifying selection during tumor evolution
PLOS Genetics 12:e1006506.

https://doi.org/10.1371/journal.pgen.1006506
1. Vujkovic M
2. Keaton JM
3. Lynch JA
4. Miller DR
5. Zhou J
6. Tcheandjieu C
7. Huffman JE
8. Assimes TL
9. Lorenz K
10. Zhu X
11. Hilliard AT
12. Judy RL
13. Huang J
14. Lee KM
15. Klarin D
16. Pyarajan S
17. Danesh J
18. Melander O
19. Rasheed A
20. Mallick NH
21. Hameed S
22. Qureshi IH
23. Afzal MN
24. Malik U
25. Jalal A
26. Abbas S
27. Sheng X
28. Gao L
29. Kaestner KH
30. Susztak K
31. Sun YV
32. DuVall SL
33. Cho K
34. Lee JS
35. Gaziano JM
36. Phillips LS
37. Meigs JB
38. Reaven PD
39. Wilson PW
40. Edwards TL
41. Rader DJ
42. Damrauer SM
43. O’Donnell CJ
44. Tsao PS
45. HPAP Consortium
46. Regeneron Genetics Center
47. VA Million Veteran Program
48. Chang K-M
49. Voight BF
50. Saleheen D
(2020) Discovery of 318 new risk loci for type 2 diabetes and related vascular outcomes among 1.4 million participants in a multi-ancestry meta-analysis
Nature Genetics 52:680–691.

https://doi.org/10.1038/s41588-020-0637-y
(2022) Targeting mutations in cancer
The Journal of Clinical Investigation 132:e154943.

https://doi.org/10.1172/JCI154943
1. Wang Q
2. Fang WH
3. Krupinski J
4. Kumar S
5. Slevin M
6. Kumar P
(2008) Pax genes in embryogenesis and oncogenesis
Journal of Cellular and Molecular Medicine 12:2281–2294.

https://doi.org/10.1111/j.1582-4934.2008.00427.x
1. Wang J
2. Vallee I
3. Dutta A
4. Wang Y
5. Mo Z
6. Liu Z
7. Cui H
8. Su AI
9. Yang XL
(2020) Multi-omics database analysis of aminoacyl-tRNA synthetases in cancer
Genes 11:1384.

https://doi.org/10.3390/genes11111384
1. Wang X
2. He Z
3. Guo Z
4. Yang M
5. Xu S
6. Chen Q
7. Shao S
8. Li S
9. Zhong C
10. Duke NC
11. Shi S
(2022) Extensive gene flow in secondary sympatry after allopatric speciation
National Science Review 9:nwac280.

https://doi.org/10.1093/nsr/nwac280
1. Wang H
2. Liu B
3. Long J
4. Yu J
5. Ji X
6. Li J
7. Zhu N
8. Zhuang X
9. Li L
10. Chen Y
11. Liu Z
12. Wang S
13. Zhao S
(2023) Integrative analysis identifies two molecular and clinical subsets in Luminal B breast cancer
iScience 26:107466.

https://doi.org/10.1016/j.isci.2023.107466
1. Wei W
2. Ho WC
3. Behringer MG
4. Miller SF
5. Bcharah G
6. Lynch M
(2022) Rapid evolution of mutation rate and spectrum in response to environmental and population-genetic challenges
Nature Communications 13:4752.

https://doi.org/10.1038/s41467-022-32353-6
(2022) Association of mutation signature effectuating processes with mutation hotspots in driver genes and non-coding regions
Nature Communications 13:178.

https://doi.org/10.1038/s41467-021-27792-6
1. Wu CI
2. Maeda N
(1987) Inequality in mutation rates of the two strands of DNA
Nature 327:169–170.

https://doi.org/10.1038/327169a0
1. Wu CI
2. Ting CT
(2004) Genes and speciation
Nature Reviews. Genetics 5:114–122.

https://doi.org/10.1038/nrg1269
1. Wu CI
2. Wang GD
3. Xu S
(2020) Convergent adaptive evolution: how common, or how rare?
National Science Review 7:945–946.

https://doi.org/10.1093/nsr/nwaa081
1. Wu CI
(2022) What are species and how are they formed?
National Science Review 9:nwad017.

https://doi.org/10.1093/nsr/nwad017
1. Wu CI
(2023) The genetics of race differentiation: should it be studied?
National Science Review 10:nwad068.

https://doi.org/10.1093/nsr/nwad068
1. Xie W
2. Zhang J
3. Zhong P
4. Qin S
5. Zhang H
6. Fan X
7. Yin Y
8. Liang R
9. Han Y
10. Liao Y
11. Yu X
12. Long H
13. Lv Z
14. Ma C
15. Yu F
(2019) Expression and potential prognostic value of histone family gene signature in breast cancer
Experimental and Therapeutic Medicine 18:4893–4903.

https://doi.org/10.3892/etm.2019.8131
1. Xue C
2. Liang F
3. Mahmood R
4. Vuolo M
5. Wyckoff J
6. Qian H
7. Tsai K-L
8. Kim M
9. Locker J
10. Zhang Z-Y
11. Segall JE
(2006) ErbB3-dependent motility and intravasation in breast cancer metastasis
Cancer Research 66:1418–1426.

https://doi.org/10.1158/0008-5472.CAN-05-0550
1. Xue D
2. Narisu N
3. Taylor DL
4. Zhang M
5. Grenko C
6. Taylor HJ
7. Yan T
8. Tang X
9. Sinha N
10. Zhu J
11. Vandana JJ
12. Nok Chong AC
13. Lee A
14. Mansell EC
15. Swift AJ
16. Erdos MR
17. Zhong A
18. Bonnycastle LL
19. Zhou T
20. Chen S
21. Collins FS
(2023) Functional interrogation of twenty type 2 diabetes-associated genes using isogenic human embryonic stem cell-derived β-like cells
Cell Metabolism 35:1897–1914.

https://doi.org/10.1016/j.cmet.2023.09.013
1. Yu Y
2. Feng YM
(2010) The role of kinesin family proteins in tumorigenesis and progression
Cancer 116:5150–5160.

https://doi.org/10.1002/cncr.25461
(2018) Negative selection in tumor genome evolution acts on essential cellular functions and the immunopeptidome
Genome Biology 19:67.

https://doi.org/10.1186/s13059-018-1434-0
1. Zeng Z
2. Bromberg Y
(2022) Inferring potential cancer driving synonymous variants
Genes 13:778.

https://doi.org/10.3390/genes13050778
1. Zhai W
2. Lai H
3. Kaya NA
4. Chen J
5. Yang H
6. Lu B
7. Lim JQ
8. Ma S
9. Chew SC
10. Chua KP
11. Alvarez JJS
12. Chen PJ
13. Chang MM
14. Wu L
15. Goh BKP
16. Chung AY-F
17. Chan CY
18. Cheow PC
19. Lee SY
20. Kam JH
21. Kow AW-C
22. Ganpathi IS
23. Chanwat R
24. Thammasiri J
25. Yoong BK
26. Ong DB-L
27. de Villa VH
28. Dela Cruz RD
29. Loh TJ
30. Wan WK
31. Zeng Z
32. Skanderup AJ
33. Pang YH
34. Madhavan K
35. Lim TK-H
36. Bonney G
37. Leow WQ
38. Chew V
39. Dan YY
40. Tam WL
41. Toh HC
42. Foo RS-Y
43. Chow PK-H
(2022) Dynamic phenotypic heterogeneity and the evolution of multiple RNA subtypes in hepatocellular carcinoma: the PLANET study
National Science Review 9:nwab192.

https://doi.org/10.1093/nsr/nwab192
Software
1. Zhang L
(2024) CDN_V1, version swh:1:rev:967361fff2b70ae2a39360e5546c18710dc3700f
Software Heritage.

https://archive.softwareheritage.org/swh:1:dir:537fa75d5dbe96ca6724820877ba5255b2d9cac3;origin=https://gitlab.com/ultramicroevo/cdn_v1;visit=swh:1:snp:f4700c8f857c51a5745c5f3ef4b6c6dbddc3b4c0;anchor=swh:1:rev:967361fff2b70ae2a39360e5546c18710dc3700f
1. Zhang L
2. Deng T
3. Liufu Z
4. Chen X
5. Wu S
6. Liu X
7. Shi C
8. Chen B
9. Hu Z
10. Cai Q
11. Lu X
12. Liu C
13. Li M
14. Wen H
15. Wu CI
(2024) On the discovered cancer driving nucleotides (CDNs) – distributions across genes, cancer types and patients
eLife 01:99341.1.

https://doi.org/10.7554/eLife.99341.1
1. Zhao W
2. Yang J
3. Wu J
4. Cai G
5. Zhang Y
6. Haltom J
7. Su W
8. Dong MJ
9. Chen S
10. Wu J
11. Zhou Z
12. Gu X
(2021) CanDriS: posterior profiling of cancer-driving sites based on two-component evolutionary model
Briefings in Bioinformatics 22:bbab131.

https://doi.org/10.1093/bib/bbab131
1. Zheng CL
2. Wang NJ
3. Chung J
4. Moslehi H
5. Sanborn JZ
6. Hur JS
7. Collisson EA
8. Vemula SS
9. Naujokas A
10. Chiotti KE
11. Cheng JB
12. Fassihi H
13. Blumberg AJ
14. Bailey CV
15. Fudem GM
16. Mihm FG
17. Cunningham BB
18. Neuhaus IM
19. Liao W
20. Oh DH
21. Cleaver JE
22. LeBoit PE
23. Costello JF
24. Lehmann AR
25. Gray JW
26. Spellman PT
27. Arron ST
28. Huh N
29. Purdom E
30. Cho RJ
(2014) Transcription restores DNA repair to heterochromatin, determining regional mutation rates in cancer genomes
Cell Reports 9:1228–1234.

https://doi.org/10.1016/j.celrep.2014.10.031
1. Zhou Y
2. Litfin T
3. Zhan J
(2023) 3 = 1 + 2: how the divide conquered de novo protein structure prediction and what is next?
National Science Review 10:nwad259.

https://doi.org/10.1093/nsr/nwad259
1. Zhu H
2. Lin Y
3. Lu D
4. Wang S
5. Liu Y
6. Dong L
7. Meng Q
8. Gao J
9. Wang Y
10. Song N
11. Suo Y
12. Ding L
13. Wang P
14. Zhang B
15. Gao D
16. Fan J
17. Gao Q
18. Zhou H
(2023) Proteomics of adjacent-to-tumor samples uncovers clinically relevant biological events in hepatocellular carcinoma
National Science Review 10:nwad167.

https://doi.org/10.1093/nsr/nwad167

Article and author information

Author details

Lingjie Zhang

State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China

Contribution
Conceptualization, Data curation, Formal analysis, Visualization, Methodology, Writing – original draft, Project administration

Competing interests
No competing interests declared

ORCID iD: 0000-0002-6506-4457
Tong Deng

State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China

Contribution
Data curation, Validation, Investigation

Competing interests
No competing interests declared
Zhongqi Liufu
1. State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
2. State Key Laboratory of Genetic Resources and Evolution/Yunnan Key Laboratory of Biodiversity Information, Kunming Institute of Zoology, The Chinese Academy of Sciences, Kunming, China
Contribution
Validation, Investigation

Competing interests
No competing interests declared
Xueyu Liu

State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China

Contribution
Validation, Visualization

Competing interests
No competing interests declared
Bingjie Chen
1. State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
2. GMU-GIBH Joint School of Life Sciences, Guangzhou Medical University, Guangzhou, China
Contribution
Validation, Investigation

Competing interests
No competing interests declared
Zheng Hu

CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China

Contribution
Supervision, Validation, Investigation, Project administration

Competing interests
No competing interests declared

ORCID iD: 0000-0003-1552-0060
Chenli Liu

CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China

Contribution
Validation, Project administration

Competing interests
No competing interests declared
Miles E Tracy

State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China

Contribution
Writing – review and editing

Competing interests
No competing interests declared
Xuemei Lu

State Key Laboratory of Genetic Resources and Evolution/Yunnan Key Laboratory of Biodiversity Information, Kunming Institute of Zoology, The Chinese Academy of Sciences, Kunming, China

Contribution
Conceptualization, Supervision, Validation, Investigation, Project administration

For correspondence
xuemeilu@mail.kiz.ac.cn

Competing interests
No competing interests declared

ORCID iD: 0000-0001-6044-6002
Hai-Jun Wen
1. State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
2. Innovation Center for Evolutionary Synthetic Biology, Sun Yat-sen University, Guangzhou, China
Contribution
Validation

For correspondence
wenhj5@mail.sysu.edu.cn

Competing interests
No competing interests declared

ORCID iD: 0000-0001-8676-1254
Chung-I Wu
1. State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
2. Innovation Center for Evolutionary Synthetic Biology, Sun Yat-sen University, Guangzhou, China
3. Department of Ecology and Evolution, University of Chicago, Chicago, United States
Contribution
Conceptualization, Supervision, Funding acquisition, Validation, Investigation, Methodology, Project administration, Writing – review and editing

For correspondence
ciwu@uchicago.edu

Competing interests
No competing interests declared

ORCID iD: 0000-0001-7263-4238

Funding

National Natural Science Foundation of China (32150006)

Xuemei Lu
Chung-I Wu

Guangdong Key R&D Project of China (2022B1111030001)

Hai-Jun Wen

National Natural Science Foundation of China (32293193)

Chung-I Wu

National Natural Science Foundation of China (32293190)

Chung-I Wu

Yunnan Revitalization Talent Support Program Top Team (202405AS350022)

Xuemei Lu
Chung-I Wu

National Natural Science Foundation of China (82341092)

Hai-Jun Wen

National Natural Science Foundation of China (32200493)

Chung-I Wu

National Key Research and Development Program of China (2021YFC2301300)

Chung-I Wu

National Key Research and Development Program of China (2021YFC0863400)

Chung-I Wu

Yunnan Revitalization Talent Support Program Yunling Scholar Project

Xuemei Lu

National Natural Science Foundation of China (32370659)

Chung-I Wu

Guangdong Basic and Applied Basic Research Foundation (2023A1515010016)

Chung-I Wu

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

The authors gratefully acknowledge the following for their support in the initiation of the Cancer Driving Nucleotide (CDN) project: the First Affiliated Hospital and the Seventh Affiliated Hospital of Sun Yat-sen University; Cancer Center of Clifford Hospital, Jinan University; Cancer Hospital Chinese Academy of Medical Sciences, Shenzhen Center; Guangdong Academy of Medical Sciences, Guangdong Provincial People’s Hospital. We thank the Kunming Institute of Zoology for valuable discussions on the CDN concept. We are also grateful to Weiwei Zhai, Qianfei Wang, and Weini Huang for their insightful comments and suggestions. Finally, we acknowledge the American Association for Cancer Research (AACR) and The Cancer Genome Atlas (TCGA) project for providing invaluable datasets and resources that have significantly advanced our understanding of cancer biology and improved patient outcomes. This work was supported by the National Natural Science Foundation of China (32150006 to CIW and XL; 32293193, 32293190, 32200493, and 32370659 to CIW; 82341092 to HJW), the National Key Research and Development Projects of the Ministry of Science and Technology of China (2021YFC2301300, 2021YFC0863400), the Guangdong Key Research and Development Program (No. 2022B1111030001), the Guangdong Basic and Applied Basic Research Foundation (No. 2023A1515010016), the Yunnan Revitalization Talent Support Program Top Team (202405AS350022 to CIW and XL), and the Yunnan Revitalization Talent Support Program Yunling Scholar Project (XL).

Version history

Sent for peer review: May 29, 2024
Preprint posted: June 5, 2024
Reviewed Preprint version 1: September 3, 2024
Reviewed Preprint version 2: October 25, 2024
Version of Record published: December 17, 2024

Cite all versions

You can cite all versions using the DOI https://doi.org/10.7554/eLife.99340. This DOI represents all versions, and will always resolve to the latest one.

Copyright

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.