Unsupervised detection of fragment length signatures of circulating tumor DNA using non-negative matrix factorization

  1. Gabriel Renaud
  2. Maibritt Nørgaard
  3. Johan Lindberg
  4. Henrik Grönberg
  5. Bram De Laere
  6. Jørgen Bjerggaard Jensen
  7. Michael Borre
  8. Claus Lindbjerg Andersen
  9. Karina Dalsgaard Sørensen
  10. Lasse Maretty (corresponding author)
  11. Søren Besenbacher (corresponding author)
  1. Department of Health Technology, Section of Bioinformatics, Technical University of Denmark, Denmark
  2. Department of Molecular Medicine, Aarhus University, Denmark
  3. Department of Clinical Medicine, Aarhus University, Denmark
  4. Department of Medical Epidemiology and Biostatistics, Karolinska Institute, Sweden
  5. Cancer Research Institute Gent (CRIG), Ghent University, Belgium
  6. Department of Human Structure and Repair, Ghent University, Belgium
  7. Department of Urology, Regional Hospital of West Jutland, Denmark
  8. Department of Urology, Aarhus University Hospital, Denmark
  9. Bioinformatics Research Centre, Aarhus University, Denmark
3 figures and 5 additional files

Figures

Figure 1
Discovering fragment length signatures using non-negative matrix factorization.

The cell-free DNA (cfDNA) pool contains a mixture of fragments from different sources such as tumor cells and background (mainly cells of hematopoietic origin). After performing paired-end …
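
The idea in this caption can be illustrated with a minimal sketch that factorizes toy fragment length histograms into two non-negative signatures and per-sample weights. Everything in it is an assumption made for illustration: the toy `background` and `tumor` shapes, the 10 bp bins, the squared-error loss with multiplicative updates, and all helper names. The authors' own implementation is in the source code files listed under Additional files and may differ.

```python
import math
import random


def matmul(A, B):
    """Plain-Python matrix product of two lists of lists."""
    cols = list(zip(*B))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    return [list(row) for row in zip(*A)]


def frobenius_loss(V, W, H):
    """Sum of squared differences between V and the reconstruction W @ H."""
    WH = matmul(W, H)
    return sum((v - r) ** 2 for row_v, row_r in zip(V, WH) for v, r in zip(row_v, row_r))


def nmf(V, k, n_iter=200, seed=0, eps=1e-12):
    """Factorize non-negative V (samples x bins) as W (samples x k) @ H (k x bins).

    Uses multiplicative updates for the squared-error loss; this choice of
    algorithm is an assumption, not necessarily what the paper used.
    """
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # Update signatures: H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num = matmul(Wt, V)
        den = matmul(Wt, matmul(W, H))
        H = [[H[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)] for a in range(k)]
        # Update weights: W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num = matmul(V, Ht)
        den = matmul(matmul(W, H), Ht)
        W = [[W[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(n)]
    return W, H


# Hypothetical toy data: each sample mixes a shorter "tumor" shape with a
# longer "background" shape; real histograms come from paired-end alignments.
lengths = list(range(100, 221, 10))


def bump(center, width):
    raw = [math.exp(-(((x - center) / width) ** 2)) for x in lengths]
    total = sum(raw)
    return [r / total for r in raw]


background = bump(167, 20)
tumor = bump(145, 20)
fractions = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
V = [[f * t + (1 - f) * b for t, b in zip(tumor, background)] for f in fractions]

W, H = nmf(V, k=2)  # rows of H are signatures, rows of W are per-sample weights
```

Each row of `H` plays the role of a fragment length signature and each row of `W` holds one sample's signature weights. NMF solutions are only defined up to scaling and ordering of components, so the weights would typically be normalized per sample before being compared with ctDNA fractions.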

Figure 2 with 7 supplements
Non-negative matrix factorization (NMF) on shallow whole-genome sequencing (sWGS) and deep targeted sequencing of cell-free DNA (cfDNA) from prostate cancer patients.

(a) sWGS fragment length histograms for 86 prostate cancer patients; colors reflect ctDNA fractions estimated from driver variant allele fractions obtained from targeted sequencing performed on the …

Figure 2—figure supplement 1
Distribution of sample ctDNA% in the metastatic castration-resistant prostate cancer (mCRPC) cohort.

Circulating tumor DNA (ctDNA) fractions were determined either based on copy-number variants (‘ichorcna’) or using variant allele fractions (‘vaf’) of putative driver variants identified using deep, …

Figure 2—figure supplement 2
Optimal number of non-negative matrix factorization (NMF) components on sWGS data from the metastatic castration-resistant prostate cancer (mCRPC) cohort.

NMF was run with different numbers of components (x-axis). For each fitted model, the maximum correlation between the summed weights of any subset of components and the ctDNA% (VAF-based) was obtained.
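
The subset search described in this caption can be sketched as follows, assuming Pearson correlation and an exhaustive enumeration of non-empty component subsets. The function names and the toy weights are hypothetical, and the caption does not state which correlation measure was used.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def best_subset_correlation(weights, ctdna):
    """Return (best correlation, component subset) over all non-empty subsets.

    weights: one list of k component weights per sample.
    ctdna:   one ctDNA% estimate per sample.
    """
    k = len(weights[0])
    best_r, best_subset = -1.0, ()
    for size in range(1, k + 1):
        for subset in combinations(range(k), size):
            summed = [sum(row[j] for j in subset) for row in weights]
            r = pearson(summed, ctdna)
            if r > best_r:
                best_r, best_subset = r, subset
    return best_r, best_subset


# Hypothetical weights for four samples and three components; components 0
# and 2 together track the ctDNA fraction, component 1 does not.
ctdna = [0.1, 0.2, 0.4, 0.6]
weights = [
    [0.1, 0.5, 0.0],
    [0.0, 0.7, 0.2],
    [0.3, 0.2, 0.1],
    [0.2, 0.6, 0.4],
]
best_r, best_subset = best_subset_correlation(weights, ctdna)
```

The number of subsets grows as 2^k, so the exhaustive loop is only practical for a modest number of components.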

Figure 2—figure supplement 3
Ratio of short (100–150 bp) to long (151–220 bp) fragments vs ctDNA% (VAF-based) on shallow whole-genome sequencing (sWGS) data from the metastatic castration-resistant prostate cancer (mCRPC) cohort.
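
The short-to-long ratio named in this title can be computed directly from a fragment length histogram. The sketch below assumes the histogram is a mapping from fragment length in bp to read count and treats both ranges as inclusive; the function name and the toy counts are hypothetical.

```python
def short_long_ratio(hist, short_range=(100, 150), long_range=(151, 220)):
    """Ratio of fragment counts in the short range to counts in the long range.

    hist maps fragment length (bp) to count; both ranges are inclusive.
    """
    n_short = sum(c for length, c in hist.items() if short_range[0] <= length <= short_range[1])
    n_long = sum(c for length, c in hist.items() if long_range[0] <= length <= long_range[1])
    if n_long == 0:
        raise ValueError("no fragments in the long range")
    return n_short / n_long


# Hypothetical counts; the 250 bp bin falls outside both ranges and is ignored.
hist = {120: 10, 150: 20, 151: 30, 167: 100, 220: 20, 250: 5}
ratio = short_long_ratio(hist)
```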
Figure 2—figure supplement 4
ichorCNA ctDNA% estimates vs ctDNA% (VAF-based) on shallow whole-genome sequencing (sWGS) data from the metastatic castration-resistant prostate cancer (mCRPC) cohort.

Figure 2—figure supplement 5
Stability of NMF fragment length signatures.

Non-negative matrix factorization (NMF) signatures were estimated for random subsets of samples of different sizes on the shallow whole-genome sequencing (sWGS) data from the metastatic …

Figure 2—figure supplement 6
Correlation of non-negative matrix factorization (NMF) tumor signature weights with ctDNA% on hold-out data.

The shallow whole-genome sequencing (sWGS) dataset from the mCRPC cohort was randomly split into halves multiple times. For each partitioning, NMF trained on one half of the data (‘Train’) was used …
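
How the held-out half was scored is elided in the caption above. The sketch below assumes one plausible procedure: split the samples at random into two halves, then estimate non-negative weights for a held-out histogram while keeping signatures learned on the training half fixed. The function names, the toy signatures, and the multiplicative update rule are all assumptions for illustration.

```python
import random


def split_halves(sample_ids, seed):
    """Randomly partition sample ids into two halves (train, held-out)."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]


def fit_weights(v, H, n_iter=500, eps=1e-12):
    """Estimate non-negative weights w so that sum_a w[a] * H[a] approximates v.

    The signatures H (k x bins) stay fixed; only w is updated, using the
    multiplicative rule w <- w * (H v) / (H H^T w) for squared error.
    """
    k, m = len(H), len(v)
    w = [1.0 / k] * k
    Hv = [sum(H[a][j] * v[j] for j in range(m)) for a in range(k)]
    HHt = [[sum(H[a][j] * H[b][j] for j in range(m)) for b in range(k)] for a in range(k)]
    for _ in range(n_iter):
        den = [sum(HHt[a][b] * w[b] for b in range(k)) for a in range(k)]
        w = [w[a] * Hv[a] / (den[a] + eps) for a in range(k)]
    return w


# Hypothetical signatures over five length bins, standing in for signatures
# learned on the training half.
H = [
    [0.5, 0.3, 0.2, 0.0, 0.0],
    [0.0, 0.1, 0.2, 0.4, 0.3],
]
# A held-out histogram built as 30% of the first signature and 70% of the second.
v = [0.3 * a + 0.7 * b for a, b in zip(H[0], H[1])]
w = fit_weights(v, H)

train_ids, held_out_ids = split_halves(range(10), seed=1)
```

Repeating the split with different seeds gives the multiple partitionings mentioned in the caption.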

Figure 2—figure supplement 7
Overview of the analyses on metastatic castration-resistant prostate cancer (mCRPC) data presented in Figure 2.

The figure shows which datasets and analyses were used to produce each of the 6 panels in Figure 2.

Figure 3 with 5 supplements
Non-negative matrix factorization (NMF) on cell-free DNA (cfDNA) shallow whole-genome sequencing (sWGS) from the DELFI study.

(a) sWGS fragment length histograms for the 533 DELFI samples; colors indicate case-control status of the sample. (b) Fragment length signatures inferred using NMF with two components on the sWGS …

Figure 3—figure supplement 1
Shallow whole-genome sequencing (sWGS) fragment length histograms for the 533 DELFI samples stratified by stage; ‘H’ indicates healthy controls.

Figure 3—figure supplement 2
Using non-negative matrix factorization (NMF) with two fragment length signatures for classification on DELFI data stratified by cancer type.

(a) The distribution of Signature #1 weights for different cancer types. (b) ROC curve for cancer vs control classification using Signature #1.
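
Panel (b) amounts to ranking samples by a single score. A minimal sketch of the corresponding AUC follows, assuming that a higher Signature #1 weight indicates cancer (negate the scores if the direction is reversed). The function name and the toy scores are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen case (label 1) scores
    higher than a randomly chosen control (label 0); ties count as half.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = sum(
        1.0 if c > h else 0.5 if c == h else 0.0
        for c in cases
        for h in controls
    )
    return wins / (len(cases) * len(controls))


# Hypothetical signature weights for three cases and three controls.
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
```

The pairwise form is quadratic in the number of samples, which is fine for a few hundred samples; a rank-based formulation gives the same value for larger cohorts.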

Figure 3—figure supplement 3
Cancer type specific non-negative matrix factorization (NMF) models.

Two-component NMF models were trained separately for each cancer type by combining samples for that cancer type with healthy controls. (a) AUCs for discriminating cases vs controls across cancer …

Figure 3—figure supplement 4
Using a Support Vector Machine (SVM) trained on 30 non-negative matrix factorization (NMF) components to classify sequential samples.

Each subplot shows an individual with at least three samples. The days on the x-axis are relative to the operation date. Top facet shows the variant allele frequency of the EGFR or ERBB2 mutation …

Figure 3—figure supplement 5
Overview of the analyses on the DELFI dataset presented in Figure 3.

The figure shows which datasets and analyses were used to produce each of the 6 panels in Figure 3.

Additional files

Supplementary file 1

Fragment length distributions in mCRPC cohort.

Sheet 1 contains raw fragment length distributions from WGS data along with ctDNA% estimates. Sheet 2 contains raw fragment length distributions from targeted data.

https://cdn.elifesciences.org/articles/71569/elife-71569-supp1-v2.xlsx

Transparent reporting form

https://cdn.elifesciences.org/articles/71569/elife-71569-transrepform1-v2.pdf

Source code 1

Source code and data to produce Figure 2.

https://cdn.elifesciences.org/articles/71569/elife-71569-code1-v2.gz

Source code 2

Source code and data to produce Figure 3.

https://cdn.elifesciences.org/articles/71569/elife-71569-code2-v2.gz

Source code 3

Scripts used for the analysis of the DELFI data.

This includes a script to train NMF, a script to estimate the weights of NMF components, and a script to train and evaluate a linear SVM model.

https://cdn.elifesciences.org/articles/71569/elife-71569-code3-v2.gz
