Two modes of DNA sequence evolution. (A) A hypothetical example of DNA sequences in organismal evolution. (B) Cancer evolution with the same number of mutations as in (A) but with many short branches. (C) A common pattern of sequence variation in cancer evolution. (D) In cancer evolution, the same mutation at the same site may occasionally be seen in multiple sequences. The recurrent sites could be either mutational or functional hotspots; distinguishing between the two is the main objective of this study.

An example of Ai and Si (from lung cancer, n = 1035).

The average Ai and Si values across different i ranges (X-axis). Top: The average of Ai and Si on a log scale. Colored lines: full data; gray lines: CpG sites removed. The dashed lines are linear extrapolations. Bottom: The Ai / Si ratio as a function of i. The drop in the Ai / Si ratio in the i range [8, 9] is due to potential synonymous CDNs; see Supplement File S2.

Site-level mutation rate variation obtained from Dig (Sherman et al. 2022), a published AI tool. (A) Each dot represents the expected SNVs (Y-axis) at a site where missense mutations occurred i times in the corresponding cancer population. The boxplot shows the overall distribution of mutability at i, with the red dashed line denoting the average. There is no observable trend that sites of higher i are more mutable. (The blank areas are due to the absence of CDNs with mutation recurrence counts of 8 or 9 in the CNS cancer mutation data; see Supplement File S2.) (B) A detailed look at the coding region of the PAX3 gene in colon cancer. The expected mutability of sites in the 200 bp window is plotted. The three mutated sites in this window, marked by green and red (a CDN site) stars, are not particularly mutable. Overall, the mutation rate varies by about 10-fold, as is generally known for CpG sites.

Conventional analyses of local contexts at recurrence sites. (A) From the top panel down: for the 64 (= 4^3) 3-mer motifs, the mutation rates are shown on the X-axis. The ratio of the mutability of the most mutable motif to the average mutability (α) is 4.69. For the 1024 (= 4^5) 5-mer and 16,384 (= 4^7) 7-mer motifs, the α values are 8.79 and 11.52, respectively. The most mutable motifs, as expected, are dominated by CpGs. (B) Each dot represents the motif surrounding a high-recurrence site. The recurrence number is shown on the X-axis and the mutability of the associated motif (mutations per 0.1) is shown on the Y-axis. The average mutation rate across all motifs of a given length category is indicated by a red horizontal dashed line. The absence of a trend indicates that high recurrence is not associated with the mutability of the motif. (C) The analysis is extended to longer motifs surrounding each CDN (21, 41, 61, 81, and 101 bp). For each length group, all pairwise comparisons are enumerated. The observed distributions (black bars and points) are compared to the expected Poisson distributions (red bars and curves), and no difference is observed. Thus, local sequences of CDNs do not show higher-than-expected similarity.
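As a rough illustration of the motif counts and the α ratio in panel (A), the following minimal sketch enumerates all k-mer motifs and computes α as the highest motif mutability divided by the average. The function names and the toy rate table are hypothetical and are not taken from the study's pipeline.

```python
from itertools import product

def all_motifs(k):
    """Enumerate every k-mer over the four nucleotides (4**k motifs)."""
    return ["".join(p) for p in product("ACGT", repeat=k)]

def alpha(rates):
    """Ratio of the most mutable motif's rate to the mean rate.

    `rates` maps motif -> mutation rate; the values here are
    illustrative assumptions, not measured rates.
    """
    values = list(rates.values())
    return max(values) / (sum(values) / len(values))

# Motif counts for the three motif lengths mentioned in the legend.
counts = {k: len(all_motifs(k)) for k in (3, 5, 7)}

# Toy rate table (hypothetical numbers) to show how alpha is formed.
toy_rates = {"ACG": 4.0, "AAA": 1.0, "TTT": 1.0}
toy_alpha = alpha(toy_rates)
```

With real data, `rates` would hold one entry per motif of a given length, so that α summarizes how far the most mutable context departs from the average.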

Patient-level analysis of mutation load and mutational signatures. (A) Boxplot depicting the distribution of mutation load among patients with recurrent mutations. The X-axis denotes the count of recurrent mutations, while the Y-axis depicts the normalized z-score of mutation load (see Methods). The green dashed line indicates the mean mutation load. In short, the mutation load does not influence the mutation recurrence among patients. (B) Signature analysis in patients with mutations of recurrence ≥ i* (X-axis). For lung cancer (left), the upper panel presents the number of patients in each group, while the lower panel depicts the relative contribution of mutational signatures. For breast cancer, APOBEC-related signatures (SBS2 and SBS13) are notably elevated in all groups of patients with i* ≥ 3, while patients with mutations of recurrence ≥ 20 in CNS cancer exhibit an increased exposure to SBS11 (Blough et al. 2011; Lin et al. 2021; Noeuveglise et al. 2023). Again, patients with higher mutation recurrences do not differ in their mutation signatures.
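The z-score on the Y-axis of panel (A) can be sketched as below. This is only a generic standardization, assuming loads are normalized within a single cancer type using the population standard deviation; the actual normalization is the one described in Methods, and the function name and input values are hypothetical.

```python
from statistics import mean, pstdev

def mutation_load_zscores(loads):
    """Standardize per-patient mutation loads to z-scores.

    Assumption: all patients in `loads` belong to one cancer type and
    the population standard deviation is used.
    """
    mu = mean(loads)
    sd = pstdev(loads)
    if sd == 0:
        raise ValueError("all mutation loads are identical; z-score undefined")
    return [(x - mu) / sd for x in loads]

# Hypothetical mutation loads for three patients.
z = mutation_load_zscores([10, 20, 30])
```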

Summary for modeling outlier sites in 6 cancer types

i* values (Y-axis, log scale) plotted against sample size (n, X-axis) for different values of the shape parameter k. Five shape parameters (k) of the gamma-Poisson model are used. In the literature on the evolution of mutation rate, k is usually greater than 1. The inset illustrates how i*/n (prevalence) decreases with increasing sample size. The prevalence approaches the asymptotic line of [ ] when n reaches 10^6. In short, more CDNs (those with lower prevalence) will be discovered as n increases. Beyond n = 10^6, there will be no gain.
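To make the dependence of i* on n and k concrete, here is a minimal sketch under stated assumptions. It treats the recurrence at a neutral site as gamma-Poisson (negative binomial) with shape k and mean n times a per-site rate, and defines i* as the smallest recurrence for which the expected number of neutral sites reaching it falls below a threshold. That definition, the function names, and all numeric inputs are illustrative assumptions and may differ from the model used in the study.

```python
import math

def nb_log_pmf(x, k, m):
    """Log-probability of x hits under a gamma-Poisson (negative
    binomial) model with shape k and mean m."""
    return (math.lgamma(x + k) - math.lgamma(k) - math.lgamma(x + 1)
            + k * math.log(k / (k + m)) + x * math.log(m / (k + m)))

def nb_upper_tail(i, k, m, max_terms=100000):
    """P(X >= i), summed term by term so that very small tail
    probabilities are not lost to rounding in 1 - CDF."""
    total = 0.0
    for x in range(i, i + max_terms):
        term = math.exp(nb_log_pmf(x, k, m))
        total += term
        # Past the mean the terms only shrink; stop once negligible.
        if x > m and term <= total * 1e-17:
            break
    return total

def i_star(n, mean_rate, k, n_sites, eps=1.0):
    """Smallest recurrence i such that fewer than `eps` of `n_sites`
    neutral sites are expected to show >= i hits among n samples.
    (Hypothetical cutoff rule, used for illustration only.)"""
    m = n * mean_rate  # expected hits per site across n samples
    i = 1
    while n_sites * nb_upper_tail(i, k, m) >= eps:
        i += 1
    return i

# Illustrative inputs (assumptions): per-site rate 1e-6, 3e7 sites, k = 1.
small = i_star(1000, 1e-6, 1.0, 3e7)
large = i_star(100000, 1e-6, 1.0, 3e7)
```

With these toy inputs the cutoff rises from 3 at n = 1000 to 8 at n = 100,000, so the prevalence cutoff i*/n falls from 3e-3 to 8e-5. This is the qualitative behavior shown in the inset: the recurrence cutoff grows much more slowly than the sample size.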

Analysis of CDNs with the expanded sample set in GENIE. (A) Schematic illustrating the impact of sample size expansion on the number of discovered CDNs. The two vertical lines show the cutoffs of i*/n at 3/1000 vs. 12/100,000. The Y-axis shows that the potential number of sites decreases with i*/n, which is a function of selective advantage. The area between the two cutoffs below the line represents the new CDNs to be discovered when n reaches 100,000. The power of n = 100,000 is even larger if the distribution follows the blue dashed line. (B) The prevalence (i*/n) of sites is well correlated between datasets of different n (TCGA with n < 1000 and GENIE with n generally 10-fold higher), as it should be. Sites are color-coded. “1-hit”: CDNs identified in GENIE that remain singletons in TCGA. “2-hit”: CDNs identified in GENIE that are doubletons in TCGA. “CDN both”: CDNs identified in both databases. (C-E) CDNs discovered in GENIE (n > 9000) but absent in TCGA (n < 1000). The newly discovered CDNs may appear in TCGA as 0- to 2-hit sites. The numbers in the middle column show the percentage of lower-recurrence (non-CDN) sites in TCGA that are detected as CDNs in the GENIE database, which has a much larger n.