Multimodal analysis of methylomics and fragmentomics in plasma cell-free DNA for multi-cancer early detection and localization

8 figures, 1 table and 2 additional files

Figures

Figure 1
Workflow of the SPOT-MAS (screening for the presence of tumor by methylation and size) assay for multi-cancer detection and localization.

There are three main steps in the SPOT-MAS assay. First, cell-free DNA (cfDNA) is isolated from peripheral blood, then subjected to bisulfite conversion and adapter ligation to make a whole-genome bisulfite cfDNA library. Second, the whole-genome bisulfite cfDNA library is hybridized with probes specific for 450 target regions to collect the target capture fraction. The whole-genome fraction is retrieved by collecting the ‘flow-through’ and hybridizing it with probes specific for the adapter sequences of the DNA library. Both the target capture and whole-genome fractions are subjected to massively parallel sequencing, and the resulting data are pre-processed into five different features of cfDNA: target methylation (TM), genome-wide methylation (GWM), fragment length profile (FLEN), DNA copy number (CNA), and end motif (EM). Finally, machine learning models and graph convolutional neural networks are used to classify cancer status and identify the tissue of origin.

Figure 2
Analysis of targeted methylation in cell-free DNA (cfDNA).

(A) Volcano plot shows log2 fold change (logFC) and significance (-log10 Benjamini-Hochberg adjusted p-value from Wilcoxon rank-sum test) of 450 target regions when comparing 499 cancer patients and 1076 healthy controls in the discovery cohort. There are 402 DMRs (p-value <0.05), color-coded by genomic locations. (B) Number of differentially methylated regions (DMRs) in the four genomic locations. (C) Kyoto Encyclopedia of Genes and Genomes (KEGG) and WikiPathway (WP) pathway enrichment analysis using g:Profiler for genes associated with the DMRs. A total of 36 pathways are enriched, suggesting a link between differences in methylation regions and tumorigenesis.
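The two axes of a volcano plot like the one in panel A can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the use of group means for the fold change, and the example p-values are all assumptions. Only the Benjamini-Hochberg adjustment and the log2 and -log10 transforms come from the caption.

```python
import math

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def volcano_point(mean_cancer, mean_healthy, adj_p):
    """Return (log2 fold change, -log10 adjusted p) for one region.

    Using group means for the fold change is an assumption of this sketch.
    """
    return math.log2(mean_cancer / mean_healthy), -math.log10(adj_p)

# Hypothetical raw p-values for four regions.
raw_p = [0.001, 0.02, 0.03, 0.8]
adj_p = benjamini_hochberg(raw_p)
```

A region would then be called a DMR when its adjusted p-value falls below the chosen cutoff (0.05 in the caption).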

Figure 3
Genome-wide methylation changes in cell-free DNA (cfDNA) of cancer patients.

(A) Density plot showing the distribution of genome-wide methylation ratio for all cancer patients (red curve, n=499) and healthy participants (blue curve, n=1076). The leftward shift in cancer samples indicates global hypomethylation in the cancer genome (p<0.0001, two-sample Kolmogorov-Smirnov test). (B) Log2 fold change of methylation ratio between cancer patients and healthy participants in each bin across 22 chromosomes. Each dot indicates a bin, identified as hypermethylated (red), hypomethylated (blue), or no significant change in methylation (gray).

Figure 4 with 1 supplement
Analysis of copy number aberration (CNA) in cell-free DNA (cfDNA).

(A) Log2 fold change of DNA copy number in each bin across 22 autosomes between 499 cancer patients and 1076 healthy participants in the discovery cohort. Each dot represents a bin identified as gain (red), loss (blue), or no change (gray) in copy number. (B) Proportions of different CNA bins in each autosome.

Figure 4—figure supplement 1
Association between methylation changes and copy number aberration (CNA).

(A) Box plot indicates the log2 fold change in CNA of hypomethylated bins and bins with unchanged methylation. (B) Box plot shows log2 fold change in methylation of bins with CNA gain, loss, or unchanged CNA. p-Value estimated by the one-tailed Mann-Whitney U test.

Figure 5 with 1 supplement
Analysis of fragment length patterns of circulating tumor DNA (ctDNA) in plasma.

(A) Density plot of fragment length for cancer patients (red, n=499) and healthy participants (blue, n=1076) in the discovery cohort. Inset corresponds to an x-axis expansion of short fragments (<150 bp). (B) Ratio of short to long fragments across 22 autosomes. Each dot indicates a mean ratio for each bin in cancer patients (red) and healthy participants (blue).
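A per-bin short-to-long ratio of the kind plotted in panel B could be computed along these lines. This is a sketch under stated assumptions: the 150 bp cutoff is borrowed from the inset in panel A, and the bin size, the input layout, and the function name are hypothetical rather than taken from the study.

```python
from collections import defaultdict

SHORT_MAX_BP = 150  # assumed cutoff, borrowed from the "<150 bp" inset

def short_long_ratio_per_bin(fragments, bin_size=1_000_000):
    """Compute the short/long fragment ratio for each genomic bin.

    fragments: iterable of (chrom, start, length) tuples.
    Returns {(chrom, bin_index): ratio}.
    """
    counts = defaultdict(lambda: [0, 0])  # [short count, long count]
    for chrom, start, length in fragments:
        key = (chrom, start // bin_size)
        if length < SHORT_MAX_BP:
            counts[key][0] += 1
        else:
            counts[key][1] += 1
    # Bins without long fragments are skipped to avoid division by zero.
    return {k: s / l for k, (s, l) in counts.items() if l > 0}
```

Averaging these per-bin ratios over all samples in a group would give one dot per bin and group, as in the figure.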

Figure 5—figure supplement 1
Correlations between bisulfite and non-bisulfite converted data.

Pearson’s correlation analysis shows correlations of fragment length patterns (A) or end motifs (B) between bisulfite and non-bisulfite-treated cell-free DNA (cfDNA) from controls (n=3) and cancer samples (n=9).

Figure 6
Differences in 4-mer end motif between cancer and healthy cell-free DNA (cfDNA).

(A) Heatmap shows log2 fold change of 256 4-mer end motifs in cancer patients (n=499) compared to healthy controls (n=1076). (B) Box plots showing the top 10 motifs with significant differences in frequency between cancer patients (red) and healthy controls (blue) using Wilcoxon rank-sum test with Bonferroni-adjusted p-value <0.0001.
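The 256 motifs are all possible 4-mers over A, C, G, and T (4^4 = 256). A minimal sketch of tallying their frequencies at fragment 5′ ends is shown below. The function name and input format are hypothetical, and the sketch ignores the handling that bisulfite-converted reads would need in practice.

```python
from collections import Counter
from itertools import product

# All 4-mers over the DNA alphabet: 4**4 = 256 motifs.
ALL_MOTIFS = ["".join(p) for p in product("ACGT", repeat=4)]

def end_motif_frequencies(fragment_sequences):
    """Return the frequency of each 4-mer found at the 5' end of fragments.

    Fragments shorter than 4 bases or with ambiguous bases are ignored.
    """
    counts = Counter()
    for seq in fragment_sequences:
        motif = seq[:4].upper()
        if len(motif) == 4 and set(motif) <= set("ACGT"):
            counts[motif] += 1
    total = sum(counts.values())
    return {m: (counts[m] / total if total else 0.0) for m in ALL_MOTIFS}
```

Comparing these per-sample frequency vectors between groups, motif by motif, gives the log2 fold changes shown in the heatmap.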

Figure 7 with 2 supplements
Model construction and performance validation for SPOT-MAS (screening for the presence of tumor by methylation and size).

(A) Two model construction strategies for cancer detection. (B, C) Receiver operating characteristic (ROC) curves comparing the performance of single-feature models and two combination models (concatenate and ensemble stacking) in the discovery (B) and validation cohorts (C). (D, E) Bar charts showing the specificity and sensitivity of single-feature models and two combination models (concatenate and ensemble stacking) in the discovery (D) and validation cohorts (E). (F, G) Dot plots showing the sensitivity of the SPOT-MAS assay in detection of five different cancer types in the discovery (F) and validation cohorts (G). The points and error bars represent the sensitivity and 95% confidence intervals. Feature abbreviations are as follows: TM – target methylation density, GWM – genome-wide methylation density, CNA – copy number aberration, EM – 4-mer end motif, FLEN – fragment length distribution, LONG – long fragment count, SHORT – short fragment count, TOTAL – all fragment count, RATIO – ratio of short/long fragment.
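For readers who want to reproduce quantities like those in panels D to G, sensitivity, specificity, and a 95% confidence interval can be sketched as below. The caption does not say which interval method was used. The Wilson score interval here is one common choice and is an assumption, as are the function names and the score threshold convention.

```python
import math

def sensitivity_specificity(scores, labels, threshold):
    """labels: 1 = cancer, 0 = healthy; a score >= threshold is called cancer."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half
```

For sensitivity, `successes` would be the number of detected cancer cases and `n` the number of cancer cases in that group.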

Figure 7—figure supplement 1
Exhaustive search for the optimal stacking ensemble model.

The red line indicates the area under the curve (AUC) ranking of 511 ensemble combinations. The inset shows the top 10 combinations with the highest AUC value.
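The count of 511 matches the number of non-empty subsets of nine feature sets (2^9 − 1 = 511), which suggests the search covered every combination of the nine features listed under Figure 7. That reading is an inference. The sketch below enumerates those subsets and ranks them with a caller-supplied scoring function, `score_fn`, which is a hypothetical stand-in for training a stacking ensemble and measuring its AUC.

```python
from itertools import combinations

# The nine feature sets named in the Figure 7 abbreviations.
FEATURES = ["TM", "GWM", "CNA", "EM", "FLEN", "LONG", "SHORT", "TOTAL", "RATIO"]

def nonempty_subsets(items):
    """Yield every non-empty combination of items, smallest subsets first."""
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)

def rank_combinations(items, score_fn):
    """Score every non-empty subset and sort from best to worst.

    score_fn is a placeholder: in practice it would fit a model on the
    chosen feature sets and return a validation metric such as AUC.
    """
    scored = [(score_fn(subset), subset) for subset in nonempty_subsets(items)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Taking the first ten entries of the ranked list would correspond to the inset of top combinations.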

Figure 7—figure supplement 2
The effects of age, gender, tumor diameter, and cancer stages on model performance.

(A, C) Box plots show probability scores of having cancer for male and female participants in the discovery (A) and validation cohort (C). (B, D) Box plots show probability scores of having cancer for male and female participants when breast cancer samples are separated from the other four cancer types in the discovery (B) and validation cohort (D). (E, F) Pearson’s correlation analysis shows no correlation between age and model prediction scores. (G, H) Box plots show prediction scores of patients with tumor diameter <3.5 cm versus those with tumor diameter >3.5 cm in the discovery (G) and validation cohort (H). (I, K) Receiver operating characteristic (ROC) curves show the classification performance of the stacking ensemble model on cancer patients with different stages (I, II, and IIIA) in the discovery (I) and validation cohort (K). (J, L) Dot plots show the sensitivity and 95% confidence intervals of the SPOT-MAS (screening for the presence of tumor by DNA methylation and size) assay in the detection of stage I, II, and IIIA cancer in the discovery (J) and validation cohort (L). (A–D, G–H) Boxes correspond to interquartile ranges (IQR), which include values between the 25th and 75th percentiles. The horizontal line inside the box indicates the median. The whiskers extend to the smallest or largest data points. The one-tailed Mann-Whitney U test was used to compare the prediction scores among different groups. ns: not significant; ****, p<0.0001.

Figure 8 with 2 supplements
The performance of SPOT-MAS (screening for the presence of tumor by methylation and size) assay in prediction of the tissue of origin.

(A) Model construction strategy to predict tissue of origin by combining nine sets of cell-free DNA (cfDNA) features using graph convolutional neural networks. (B) Heatmap shows feature importance scores of five cancer types. (C) Bar chart indicates the contribution of important features for classifying five different cancers. (D) Three-dimensional graph represents the classification of five cancer types. (E, F) Cross-tables show agreement between the prediction (x-axis) and the reference (y-axis) to predict tissue of origin in the discovery cohort (E) and validation cohort (F).

Figure 8—figure supplement 1
Construction of machine learning models for tissue of origin (TOO) identification.

(A) Model construction strategy. Random forest (RF), convolutional neural network (CNN), and graph convolutional neural network (GCNN) are used to classify the five cancer types from the concatenated input of nine sets of cell-free DNA (cfDNA) features. The performance of the constructed models was evaluated on the validation cohort. (B, C) Bar charts comparing the accuracy of the three models in the discovery (B) and validation cohort (C).

Figure 8—figure supplement 2
Comparison of accuracy for detecting five cancer types between single-feature model and stack model.

(A, B) The accuracy of single-feature models and stack model for detecting five cancer types in the discovery (A) and validation cohort (B). (C, D) The number of missed cases by the single-feature models and stack model in the discovery (C) and validation cohort (D).

Tables

Table 1
Summary of clinical features of 738 cancer patients and 1550 healthy controls in discovery and validation cohorts.
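Cohort sizes: discovery cohort N=1575; validation cohort N=713.

| Clinical feature | Category | Discovery cancer (N=499): N | Discovery cancer: % | Discovery healthy (N=1076): N | Discovery healthy: % | Discovery p-value (cancer vs healthy) | Validation cancer (N=239): N | Validation cancer: % | Validation healthy (N=474): N | Validation healthy: % | Validation p-value (cancer vs healthy) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Gender | Female | 279 | 55.9% | 599 | 55.7% | 0.9281* | 126 | 52.72% | 270 | 56.1% | 0.2818* |
| Gender | Male | 220 | 44.1% | 477 | 44.3% | | 113 | 47.28% | 204 | 43.9% | |
| Age | Median | 58 | | 47 | | <0.0001 | 59 | | 48 | | <0.0001 |
| Age | Min | 25 | | 18 | | | 28 | | 19 | | |
| Age | Max | 97 | | 84 | | | 92 | | 85 | | |
| Stage | I | 52 | 10.4% | | | | 23 | 9.6% | | | 0.4947* |
| Stage | II | 169 | 33.9% | | | | 69 | 28.9% | | | |
| Stage | IIIA | 150 | 30.1% | | | | 77 | 32.2% | | | |
| Stage | Non-metastasis with unknown staging information | 128 | 25.7% | | | | 70 | 29.3% | | | |

* p-Values from Chi-square test.
Unmarked p-values (age) are from the Mann-Whitney test.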

Additional files

MDAR checklist
https://cdn.elifesciences.org/articles/89083/elife-89083-mdarchecklist1-v1.docx
Supplementary file 1

Supplementary Tables.

Table S1. Detailed clinical information of all cancer and healthy subjects. Table S2. Summary of clinical information of patients with five cancer types. Table S3. Summary of sequencing quality metrics of samples in both cohorts. Table S4. List of 450 target regions. Table S5. List of significant pathways analyzed by g:Profiler. Table S6. List of 256 4-mer end motifs (EM). Table S7. Comparison of performance between single feature-based models and stacked model. Table S8. The sensitivity of SPOT-MAS (screening for the presence of tumor by methylation and size) model for detecting different cancer types and stages at a specificity of >95%. Table S9. The accuracy of random forest (RF), deep neural network (DNN), and graph convolutional neural network (GCNN) model for tissue of origin (TOO) identification. Table S10. List of significant features for TOO identification. Table S11. Overview of liquid biopsy assays for multi-cancer early detection described in recent publications.

https://cdn.elifesciences.org/articles/89083/elife-89083-supp1-v1.xlsx

Cite this article

  1. Van Thien Chi Nguyen
  2. Trong Hieu Nguyen
  3. Nhu Nhat Tan Doan
  4. Thi Mong Quynh Pham
  5. Giang Thi Huong Nguyen
  6. Thanh Dat Nguyen
  7. Thuy Thi Thu Tran
  8. Duy Long Vo
  9. Thanh Hai Phan
  10. Thanh Xuan Jasmine
  11. Van Chu Nguyen
  12. Huu Thinh Nguyen
  13. Trieu Vu Nguyen
  14. Thi Hue Hanh Nguyen
  15. Le Anh Khoa Huynh
  16. Trung Hieu Tran
  17. Quang Thong Dang
  18. Thuy Nguyen Doan
  19. Anh Minh Tran
  20. Viet Hai Nguyen
  21. Vu Tuan Anh Nguyen
  22. Le Minh Quoc Ho
  23. Quang Dat Tran
  24. Thi Thu Thuy Pham
  25. Tan Dat Ho
  26. Bao Toan Nguyen
  27. Thanh Nhan Vo Nguyen
  28. Thanh Dang Nguyen
  29. Dung Thai Bieu Phu
  30. Boi Hoan Huu Phan
  31. Thi Loan Vo
  32. Thi Huong Thoang Nai
  33. Thuy Trang Tran
  34. My Hoang Truong
  35. Ngan Chau Tran
  36. Trung Kien Le
  37. Thanh Huong Thi Tran
  38. Minh Long Duong
  39. Hoai Phuong Thi Bach
  40. Van Vu Kim
  41. The Anh Pham
  42. Duc Huy Tran
  43. Trinh Ngoc An Le
  44. Truong Vinh Ngoc Pham
  45. Minh Triet Le
  46. Dac Ho Vo
  47. Thi Minh Thu Tran
  48. Minh Nguyen Nguyen
  49. Thi Tuong Vi Van
  50. Anh Nhu Nguyen
  51. Thi Trang Tran
  52. Vu Uyen Tran
  53. Minh Phong Le
  54. Thi Thanh Do
  55. Thi Van Phan
  56. Hong-Dang Luu Nguyen
  57. Duy Sinh Nguyen
  58. Van Thinh Cao
  59. Thanh-Thuy Thi Do
  60. Dinh Kiet Truong
  61. Hung Sang Tang
  62. Hoa Giang
  63. Hoai-Nghia Nguyen
  64. Minh-Duy Phan
  65. Le Son Tran
(2023)
Multimodal analysis of methylomics and fragmentomics in plasma cell-free DNA for multi-cancer early detection and localization
eLife 12:RP89083.
https://doi.org/10.7554/eLife.89083.3