The theory of massively repeated evolution and full identifications of cancer-driving nucleotides (CDNs)

  1. Lingjie Zhang
  2. Tong Deng
  3. Zhongqi Liufu
  4. Xueyu Liu
  5. Bingjie Chen
  6. Zheng Hu
  7. Chenli Liu
  8. Miles E Tracy
  9. Xuemei Lu (corresponding author)
  10. Hai-Jun Wen (corresponding author)
  11. Chung-I Wu (corresponding author)
  1. State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, China
  2. State Key Laboratory of Genetic Resources and Evolution/Yunnan Key Laboratory of Biodiversity Information, Kunming Institute of Zoology, The Chinese Academy of Sciences, China
  3. GMU-GIBH Joint School of Life Sciences, Guangzhou Medical University, China
  4. CAS Key Laboratory of Quantitative Engineering Biology, Shenzhen Institute of Synthetic Biology, Institute of Advanced Technology, Chinese Academy of Sciences, China
  5. Innovation Center for Evolutionary Synthetic Biology, Sun Yat-sen University, China
  6. Department of Ecology and Evolution, University of Chicago, United States
14 figures, 5 tables and 2 additional files

Figures

Figure 1
Two modes of DNA sequence evolution.

(A) A hypothetical example of DNA sequences in organismal evolution. (B) Cancer evolution that experiences the same number of mutations as in (A) but with many short branches. (C) A common pattern …

Figure 2 with 1 supplement
The average Ai and Si values across different i ranges (X-axis).

(Top) The average of Ai and Si on a log scale. Colored lines: full data; gray lines: CpG sites removed. The dashed lines are linear extrapolations. (Bottom) The Ai / Si ratio as a function of i. The …

Figure 2—figure supplement 1
Mutation and context landscape across 12 cancer types.

(A) Single nucleotide changes within a local context of 3 bp across 12 cancer types. Mutations are grouped into 6 nucleotide change directions based on base complementarity, with colors representing …
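
The grouping of the 12 possible single-nucleotide changes into 6 directions by base complementarity can be sketched as follows. This is a minimal illustration, assuming the common convention of reporting each change from the strand on which the reference base is a pyrimidine (C or T); the caption itself only states that grouping is based on complementarity. The function names `canonical_change` and `canonical_context` are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def canonical_change(ref, alt):
    """Collapse a substitution onto the strand where the reference base is C or T."""
    if ref == alt:
        raise ValueError("ref and alt must differ")
    if ref in ("A", "G"):
        # A purine reference is reported via its complement on the other strand.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def canonical_context(trinuc, alt):
    """Return (3-bp context, change) with the mutated middle base as a pyrimidine."""
    if trinuc[1] in ("A", "G"):
        # Reverse-complement the context so the middle base becomes C or T.
        trinuc = "".join(COMPLEMENT[b] for b in reversed(trinuc))
        alt = COMPLEMENT[alt]
    return trinuc, f"{trinuc[1]}>{alt}"


bases = "ACGT"
classes = sorted({canonical_change(r, a) for r in bases for a in bases if r != a})
print(classes)  # the 6 change directions
print(canonical_context("AGT", "A"))  # G>A in context AGT, reported on the other strand
```

With a 3-bp context, each of the 6 directions has 16 possible flanking-base combinations, giving 96 context-specific categories.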

Figure 3
Site-level mutation rate variation obtained from Dig (Sherman et al., 2022), a published AI tool.

(A) Each dot represents the expected SNVs (Y-axis) at a site where missense mutations occurred i times in the corresponding cancer population. The boxplot shows the overall distribution of …

Figure 4 with 1 supplement
Conventional analyses of local contexts at recurrence sites.

(A) From top panel down: For the 64 (4³) 3-mer motifs, their mutational rates are shown on the X-axis. The ratio of the most mutable motif to the average mutability (α) is 4.69. For the 1024 (= 4⁵) 5-mer and …

Figure 4—figure supplement 1
Sliding window to explore the consensus sequences between recurrence sites.

The blue arrow indicates the positive strand of reference genome, with a mutated site highlighted by the red box. The green strip represents a sliding window covering the mutated site. With each …

Figure 5
Patient-level analysis of mutation load and mutational signatures.

(A) Boxplot depicting the distribution of mutation load among patients with recurrent mutations. The X-axis denotes the count of recurrent mutations, while the Y-axis depicts the normalized z-score …

Figure 6
i* values (Y-axis, log scale) against sample sizes (n, X-axis) for different shape parameters (k).

The Y-axis presents the i* values under the different sample sizes (n) shown on the X-axis, in log scale. Five shape parameters (k) of the gamma-Poisson model are used. In the literature on the evolution of …

Figure 7
Analysis of CDNs with the expanded sample set in GENIE.

(A) Schematic illustrating the impact of sample size expansion on the number of discovered CDNs. The two vertical lines show the cutoffs of i*/n at (3/1000) vs. (12/100,000). The Y axis shows that …

Appendix 1—figure 1
The trend of φi,k with each increase of recurrence (i, the x-axis) under different shape parameters of the gamma distribution (k, designated by different colors).
Appendix 1—figure 2
The gamma distribution of recurrences (i) under different shapes.

With E(u) = 5 × 10⁻⁶, we set the shape parameter k to 0.2, 1 and 5, represented by three distinct colors. The site number of synonymous recurrence i (Si) is indicated on the Y-axis. In the context of a …
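
How the shape parameter k changes the expected number of sites at each recurrence level can be illustrated with a gamma-Poisson (negative binomial) mixture. The sketch below is not the authors' code: it assumes the per-site rate follows a gamma distribution with mean E(u) and shape k, and that hits at a site across n patients are Poisson given that rate. The sample size `N_PATIENTS` and site count `N_SITES` are hypothetical values chosen only for illustration.

```python
import math


def gamma_poisson_pmf(i, mean, k):
    """P(i hits) when hits ~ Poisson(rate) and rate ~ Gamma(shape=k, mean=mean)."""
    log_p = (
        math.lgamma(i + k) - math.lgamma(k) - math.lgamma(i + 1)
        + k * math.log(k / (k + mean))
        + i * math.log(mean / (k + mean))
    )
    return math.exp(log_p)


E_U = 5e-6            # mean mutation rate per site per patient (from the caption)
N_PATIENTS = 1000     # ASSUMPTION: hypothetical sample size
N_SITES = 10_000_000  # ASSUMPTION: hypothetical number of synonymous sites

mean_hits = N_PATIENTS * E_U  # expected hits per site across all patients
for k in (0.2, 1, 5):
    # Expected number of sites with recurrence i = 0..5 (an S_i analogue)
    expected = [N_SITES * gamma_poisson_pmf(i, mean_hits, k) for i in range(6)]
    print(k, [f"{x:.3g}" for x in expected])
```

Smaller k means more rate variation among sites, so the expected number of sites with high recurrence falls off more slowly than under a near-uniform rate (large k).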

Author response image 1
5-prime.
Author response image 2
3-prime.
Author response image 3
5-prime shuffled.
Author response image 4
3-prime shuffled.
Author response image 5
random sequences from coding regions.

Tables

Table 1
An example of Ai and Si (from lung cancer, n=1035).
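i | All sites: Ai | All sites: Si | All sites: Ai / Si | CpG sites removed: Ai | CpG sites removed: Si | CpG sites removed: Ai / Si
0 | 22540623 | 7804281 | 2.89 | 21375384 | 7014012 | 3.04
1 | 195958 | 69393 | 2.82 | 168371 | 56821 | 2.96
2 | 2946 | 969 | 3.04 | 2188 | 643 | 3.4
3 | 99 | 21 | 4.71 | 68 | 16 | 4.25
4 | 23 | 1 | 23 | 17 | 1 | 17
5 | 16 | 0 | 16 : 0 | 9 | 0 | 9 : 0
6 | 10 | 0 | 10 : 0 | 6 | 0 | 6 : 0
7 | 5 | 0 | 5 : 0 | 5 | 0 | 5 : 0
8 | 8 | 0 | 8 : 0 | 6 | 0 | 6 : 0
9 | 4 | 0 | 4 : 0 | 3 | 0 | 3 : 0
≥3 | 178 | 22 | 8.09 | 122 | 17 | 7.18
≥4 | 79 | 1 | 79 | 54 | 1 | 54
[10-20] | 7 | 1 | 7 | 4 | 0 | 4 : 0
≥20 | 6 | 0 | 6 : 0 | 4 | 0 | 4 : 0
  1. Note: The ratio Ai / Si is provided as a measure of selection strength.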
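
The Ai / Si column of Table 1 is a plain ratio of counts, reported as "Ai : 0" when no synonymous site reaches that recurrence. A minimal sketch of that arithmetic, using a few "all sites" rows copied from the table (the helper name `ai_si_ratio` is hypothetical):

```python
# (A_i, S_i) for "all sites" in lung cancer, copied from Table 1
counts = {
    "0": (22540623, 7804281),
    "1": (195958, 69393),
    "2": (2946, 969),
    "3": (99, 21),
    "5": (16, 0),
    "≥3": (178, 22),
}


def ai_si_ratio(a, s):
    """Return A_i / S_i, or None when S_i is zero and the ratio is undefined."""
    return a / s if s else None


for i, (a, s) in counts.items():
    r = ai_si_ratio(a, s)
    # Undefined ratios are shown in the table's own "A : 0" notation.
    print(i, f"{r:.2f}" if r is not None else f"{a} : 0")
```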

Table 2
Summary for modeling outlier sites in six cancer types.
Cancer Type | S3 | p | α | S4 | S5
Lung* | -- | 0.0 | -- | -- | --
Breast | 0.12 | 8.75E-04 (8.21E-04) | 88.6 (32.0) | 0.102 (0.068) | 0.004 (0.004)
CNS | 0.02 | 2.73E-04 (1.09E-04) | 295.1 (57.0) | 0.448 (0.173) | 0.026 (0.015)
Kidney | 0.03 | 3.03E-05 (2.98E-05) | 304.1 (108.0) | 0.067 (0.056) | 0.005 (0.006)
Upper-AD tract | 0.47 | 0.002 (0.001) | 48.9 (10.7) | 0.174 (0.078) | 0.005 (0.003)
Large intestine | 1.03 | 0.009 (0.001) | 51.6 (1.4) | 0.998 (0.087) | 0.026 (0.003)
  1. Note: For each cancer type, p is the proportion of highly mutable sites, whose mutation rate is α-fold the average. S3 gives the expected number without mutable outliers (p = 0). S4 and S5 denote the expected numbers under the best-fitting (p, α) pair, with the standard deviation in parentheses. For lung cancer, S2 and S3 do not fit the outlier model (Table 2—source data 1); therefore, we set p = 0.
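
The idea of an outlier model, in which a small fraction p of sites mutates at α times the average rate, can be illustrated with a two-class Poisson mixture. This is only a sketch under stated assumptions and not the authors' fitting procedure: it assumes Poisson hits per site, assumes the non-outlier sites take whatever rate keeps the overall mean unchanged, and uses hypothetical values for the number of sites, patients and mean rate. Only the (p, α) pair is taken from the Breast row of Table 2.

```python
import math


def poisson_pmf(i, lam):
    """Poisson probability of exactly i events with mean lam."""
    if lam <= 0:
        return 1.0 if i == 0 else 0.0
    return math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))


def expected_sites(i, n_sites, n_patients, mean_rate, p=0.0, alpha=1.0):
    """Expected number of sites hit exactly i times under a two-class rate model.

    ASSUMPTION: a fraction p of sites mutates at alpha * mean_rate; the other
    sites share the background rate that keeps the overall mean at mean_rate.
    """
    if not 0 <= p < 1 or p * alpha >= 1:
        raise ValueError("need 0 <= p < 1 and p * alpha < 1")
    background = mean_rate * (1 - p * alpha) / (1 - p)
    lam_bg = n_patients * background
    lam_hot = n_patients * alpha * mean_rate
    mix = (1 - p) * poisson_pmf(i, lam_bg) + p * poisson_pmf(i, lam_hot)
    return n_sites * mix


# Hypothetical inputs for illustration only; (p, alpha) from the Breast row.
ARGS = dict(n_sites=10_000_000, n_patients=1000, mean_rate=5e-6)
for i in (3, 4, 5):
    no_outliers = expected_sites(i, **ARGS)
    with_outliers = expected_sites(i, p=8.75e-4, alpha=88.6, **ARGS)
    print(i, f"{no_outliers:.3g}", f"{with_outliers:.3g}")
```

Setting p = 0 recovers the plain Poisson expectation, which corresponds to the "without mutable outliers" case described in the note.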

Table 2—source data 1

The outlier model parameters and expected Si values for 6 cancer types analyzed.

‘pMinor’ and ‘alpha’ correspond to (p, α) as described in the main text. ‘Eu’ represents the average mutation rate per site per patient for the given cancer type (‘ccType’). ‘s2Expt’ ‘s3Expt’ ‘4’ …

https://cdn.elifesciences.org/articles/99340/elife-99340-table2-data1-v1.zip
Appendix 1—table 1
Literature support for CDN genes in breast cancer.
Gene Id | Gene Name | Support
AKT1 | v-akt murine thymoma viral oncogene homolog 1 | ① ② ③
CDC42BPA | CDC42 binding protein kinase alpha (DMPK-like) | Unbekandt and Olson, 2014; Collins et al., 2018; Kwa et al., 2021; Jiang et al., 2023
CDH1 | cadherin 1, type 1, E-cadherin (epithelial) | ① ② ③
ERBB2 | v-erb-b2 avian erythroblastic leukemia viral oncogene homolog 2 | ① ② ③
ERBB3 | v-erb-b2 avian erythroblastic leukemia viral oncogene homolog 3 | Holbro et al., 2003; Xue et al., 2006; Hamburger, 2008; Sithanandam and Anderson, 2008; Stern, 2008; Huang et al., 2010
FGFR2 | fibroblast growth factor receptor 2 | ① ② ③
FOXA1 | forkhead box A1 | ① ② ③
GATA3 | GATA binding protein 3 | ① ② ③
HIST1H3B | histone cluster 1, H3b | ① ② ③ Xie et al., 2019; Wang et al., 2023*
KIF1B | kinesin family member 1B | Munirajan et al., 2008; Yu and Feng, 2010; Liu et al., 2022
KRAS | Kirsten rat sarcoma viral oncogene homolog | ① ② ③
NUP93 | nucleoporin 93 kDa | Bersini et al., 2020; Nataraj et al., 2022
PIK3CA | phosphatidylinositol-4,5-bisphosphate 3-kinase, catalytic subunit alpha | ① ② ③
PTEN | phosphatase and tensin homolog | ① ② ③
RARS2 | arginyl-tRNA synthetase 2, mitochondrial | Wang et al., 2020*
SF3B1 | splicing factor 3b, subunit 1, 155 kDa | ① ② ③
TP53 | tumor protein p53 | ① ② ③
  1. The circled numbers indicate that the gene is included in the following driver gene lists: ① CGC Tier-1 list, ② IntOGen, ③ Bailey’s list.

  2. Inclusion requires that the gene is annotated as a cancer driver in breast cancer.

  3. * Ambiguous: the literature indicates an association between the candidate gene and breast cancer but lacks explicit experimental evidence.

Appendix 1—table 2
k estimated from 12 cancer types.
Cancer type | k
Breast | 5.05
CNS | 2.59
Endometrium | 5.49
Kidney | 7.70
Large intestine | 4.76
Liver | 5.23
Lung | 2.62
Ovary | 4.30
Prostate | 3.60
Stomach | 4.17
Upper-AD tract | 4.14
Urinary tract | 6.14
merged set* | 3.27
  1. Note: Estimation of k is derived from negative binomial regression, based on synonymous changes aggregated by the 3 bp local context at mutated sites across all coding genes. The estimation method is implemented in the package dndscv.

  2. * The merged set contains mutation information from all 12 cancer types.

Author response table 1
TRAIN accuracy | TEST accuracy
Random | random
Model, # layers, # parameters | ALL | donor | acceptor | CDS | ALL | donor | acceptor | CDS
resNet1463,721 | 0.77 | 0.77 | 0.76 | 0.79 | 0.71 | 0.72 | 0.69 | 0.72
resNet216564,445 | 0.96 | 0.96 | 0.95 | 0.95 | 0.76 | 0.77 | 0.77 | 0.73
deepGRU23144,925 | 0.86 | 0.85 | 0.84 | 0.89 | 0.79 | 0.79 | 0.76 | 0.8
deepLSTM624,445 | 0.85 | 0.82 | 0.87 | 0.86 | 0.79 | 0.76 | 0.82 | 0.78

Additional files

Supplementary file 1

All CDN sites with population allele frequency annotation.

‘CDN_sites.i ≥ 20’ presents CDN sites with total hits ≥20 in each cancer type. ‘CDN.Missense.thres_3’ provides all CDN sites analyzed in this study, ranked in decreasing order based on the highest recurrence across 12 cancer types. ‘CDN.Missense.gnomAD’ presents the gnomAD population allele frequency of all missense CDNs. ‘Synonymous_high_hits’ lists the synonymous mutations potentially under selection, while ‘Synonymous_high_hits.gnomAD’ provides their corresponding allele frequency annotations from gnomAD.

https://cdn.elifesciences.org/articles/99340/elife-99340-supp1-v1.xlsx
MDAR checklist
https://cdn.elifesciences.org/articles/99340/elife-99340-mdarchecklist1-v1.pdf
