Ultra-low coverage fragmentomic model of cell-free DNA for cancer detection based on whole-exome regions

  1. Center of Multidisciplinary Technology for Advanced Medicine (CMUTEAM), Faculty of Medicine, Chiang Mai University, Chiang Mai, Thailand

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.


Editors

  • Reviewing Editor
    Hui Zhao
  • Senior Editor
    Tony Ng
    King's College London, London, United Kingdom

Reviewer #1 (Public Review):

Summary:

The authors set out to assess fragmentomic features using the DELFI method in exonic regions (exome sequencing). They argue that extracting this information from exome sequencing would make the test more cost-effective.

Strengths:

Well written and explained. Different ML approaches tried.

Weaknesses:

To assess fragmentomics in WES, it does not seem valid to downsample WGS. WES data are generated with a different library preparation, so answering this question would require testing the approach on actual WES samples. WES is also generally sequenced at much higher coverage, because this is necessary for mutation calling. The approach of combining the two therefore seems flawed, because the data were not generated by the same experiment.

The authors do not really show why their model includes longer fragment sizes that were excluded in the original DELFI publication.

As a proof of concept this is a good idea but really needs a bit of a rethink on the utility and impact.

Reviewer #2 (Public Review):

Apiwat Sangphukieo et al. have developed two machine learning models, exomeDELFI and xDELFI, trained on 4 public datasets comprising 721 cfDNA samples. They demonstrate that the exomeDELFI model, which uses DNA from whole-exome regions, achieves higher AUC values than the original DELFI model at equal whole-genome sequencing depth for distinguishing patients with and without cancer. Additionally, the xDELFI model, which integrates coverage of overall fragments, coverage of fragments within 3 fragment size thresholds (short, medium, long), and the fragment size distribution (FSD), for a total of 2,952 features, shows improved prediction performance. Furthermore, the authors have devised a multiclass machine learning model capable of classifying the tissue of origin for eight cancer types, using distinct tissue-specific fragmentomic patterns in cfDNA from whole-exome regions.

However, the conclusions drawn in this paper rely heavily on cross-validation of machine learning models constructed from hundreds of samples but employing thousands of features, posing a risk of overfitting. Thus, more rigorous validation is warranted.

(1) The claim in line 18 is misleading. The authors assert that the high cost of whole-genome sequencing (WGS) has limited the application of cfDNA in the clinic, and they therefore imply that their models are more cost-efficient because they use fewer DNA molecules, originating only from exonic regions. However, WGS is essential to their analysis. Instead of using whole-exome sequencing data, they extracted the DNA molecules from WGS data that fall within exonic regions for feature extraction and downstream analysis, so the cost of DNA sequencing is the same. In this regard, xDELFI, which selectively uses DNA from exonic regions, shows inferior performance compared with the DELFI model using all WGS data (AUC: 0.896 vs. 0.920), at the same cost and using the same WGS data.

(2) The use of WGS data from 4 distinct datasets (Jiang et al., 2015; Snyder et al., 2016; Cristiano et al., 2019; and Sun et al., 2019) raises concerns about potential batch effects arising from different DNA library preparation kits (Kapa Library Preparation Kit (Kapa Biosystems); ThruPLEX DNA-seq kits (Rubicon Genomics); NEBNext DNA Library Prep Kit for Illumina (New England Biolabs); and KAPA HTP Library Preparation Kit (Kapa Biosystems), respectively). Each kit may induce different pre-analytical effects on cfDNA fragmentomic features, as evidenced by differing size distribution profiles (e.g., in Fig. 4 of Jiang et al., 2015, the major peak of the cfDNA size distribution at ~166 bp has a frequency of ~3%, whereas in Fig. 1B of Snyder et al., 2016, the major peak at ~166 bp has a frequency of ~2%). To enhance the robustness of their models, the authors should develop a sophisticated normalization pipeline to mitigate batch effects, and should split the training and testing sets so that no dataset contributes samples to both. The authors should demonstrate that their model performs equally well on training and testing sets and across different datasets.
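The dataset-level split requested in point (2) could be sketched as a leave-one-dataset-out scheme, in which all samples from one source dataset are held out for testing in turn. This is only an illustration of the idea, not the authors' pipeline: the function name `leave_one_dataset_out`, the dataset labels, and the toy sample assignment are all assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def leave_one_dataset_out(
    dataset_labels: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_dataset, train_indices, test_indices), one split per dataset.

    dataset_labels[i] names the source dataset of sample i, so every sample
    from the held-out dataset lands in the test set and nowhere else.
    """
    by_dataset: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(dataset_labels):
        by_dataset[label].append(index)

    for held_out in sorted(by_dataset):
        test_idx = by_dataset[held_out]
        train_idx = sorted(
            i
            for label, indices in by_dataset.items()
            if label != held_out
            for i in indices
        )
        yield held_out, train_idx, test_idx


# Hypothetical sample-to-dataset assignment (toy labels, not the real cohort).
samples = [
    "Jiang2015", "Snyder2016", "Cristiano2019",
    "Sun2019", "Jiang2015", "Cristiano2019",
]
splits = list(leave_one_dataset_out(samples))

for held_out, train_idx, test_idx in splits:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

Comparing the score obtained on each held-out dataset with the pooled cross-validation score would give a direct indication of how much of the reported performance depends on dataset-specific effects.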

(3) The uneven distribution of cancer patients across datasets introduces another layer of complexity, potentially confounding the tissue-of-origin analysis. In line 300, the authors find that liver, colorectal, and lung cancers had the highest prediction accuracy in their models. However, the distribution of cancer patients is not even across datasets (e.g., liver cancer patients are all from Jiang et al., 2015; colorectal cancer patients are mostly from Sun et al., 2019, and Cristiano et al., 2019; and lung cancer patients are mainly from Cristiano et al., 2019). The potential pre-analytical differences between datasets, coupled with the predominance of particular cancer types in each dataset, underscore the importance of addressing these discrepancies to ensure the validity of the tissue-of-origin predictions.
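One simple way to quantify the confounding described in point (3) is to tabulate, for each cancer type, the fraction of its samples that come from its single largest source dataset. The sketch below assumes a list of (cancer type, dataset) pairs; the function name `dominant_dataset_share` and the toy records are hypothetical and do not reproduce the study's actual sample counts.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def dominant_dataset_share(
    records: Iterable[Tuple[str, str]],
) -> Dict[str, Tuple[str, float]]:
    """Map each cancer type to (largest source dataset, fraction of samples from it)."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for cancer_type, dataset in records:
        counts[cancer_type][dataset] += 1

    shares = {}
    for cancer_type, per_dataset in counts.items():
        dataset, n = per_dataset.most_common(1)[0]
        shares[cancer_type] = (dataset, n / sum(per_dataset.values()))
    return shares


# Toy (cancer type, dataset) records for illustration only.
records = [
    ("liver", "Jiang2015"),
    ("liver", "Jiang2015"),
    ("colorectal", "Sun2019"),
    ("colorectal", "Cristiano2019"),
    ("colorectal", "Sun2019"),
    ("lung", "Cristiano2019"),
]
shares = dominant_dataset_share(records)

for cancer_type, (dataset, fraction) in sorted(shares.items()):
    print(f"{cancer_type}: {fraction:.0%} of samples from {dataset}")
```

A share close to 1.0 means that a classifier could separate that cancer type by recognising the dataset rather than the tissue of origin, which is the concern raised above.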

(4) In line 145, the authors mention the selection of features used in the xDELFI model but do not specify how many features remain in each fragmentomic category after selection. Providing this information would enhance the transparency and reproducibility of their methodology.
