Unsupervised detection of fragment length signatures of circulating tumor DNA using non-negative matrix factorization

  1. Gabriel Renaud
  2. Maibritt Nørgaard
  3. Johan Lindberg
  4. Henrik Grönberg
  5. Bram De Laere
  6. Jørgen Bjerggaard Jensen
  7. Michael Borre
  8. Claus Lindbjerg Andersen
  9. Karina Dalsgaard Sørensen
  10. Lasse Maretty  Is a corresponding author
  11. Søren Besenbacher  Is a corresponding author
  1. Department of Health Technology, Section of Bioinformatics, Technical University of Denmark, Denmark
  2. Department of Molecular Medicine, Aarhus University, Denmark
  3. Department of Clinical Medicine, Aarhus University, Denmark
  4. Department of Medical Epidemiology and Biostatistics, Karolinska Institute, Sweden
  5. Cancer Research Institute Gent (CRIG), Ghent University, Belgium
  6. Department of Human Structure and Repair, Ghent University, Belgium
  7. Department of Urology, Regional Hospital of West Jutland, Denmark
  8. Department of Urology, Aarhus University Hospital, Denmark
  9. Bioinformatics Research Centre, Aarhus University, Denmark

Abstract

Sequencing of cell-free DNA (cfDNA) is currently being used to detect cancer by searching both for mutational and non-mutational alterations. Recent work has shown that the length distribution of cfDNA fragments from a cancer patient can inform tumor load and type. Here, we propose non-negative matrix factorization (NMF) of fragment length distributions as a novel and completely unsupervised method for studying fragment length patterns in cfDNA. Using shallow whole-genome sequencing (sWGS) of cfDNA from a cohort of patients with metastatic castration-resistant prostate cancer (mCRPC), we demonstrate how NMF accurately infers the true tumor fragment length distribution as an NMF component - and that the sample weights of this component correlate with ctDNA levels (r=0.75). We further demonstrate how using several NMF components enables accurate cancer detection on data from various early stage cancers (AUC = 0.96). Finally, we show that NMF, when applied across genomic regions, can be used to discover fragment length signatures associated with open chromatin.

Editor's evaluation

The authors introduce non-negative matrix factorization to analyze shallow WGS sequencing data to detect cell-free DNA fragments diagnostic of cancer. This is an area of active research and the authors add a potentially very useful unsupervised approach to analyze such data.

https://doi.org/10.7554/eLife.71569.sa0

Introduction

Circulating cell-free DNA (cfDNA) is rapidly emerging as an important biomarker – most notably in cancer and pregnancy. In the cancer setting, the detection of tumor cell derived DNA fragments containing somatically acquired mutations can reveal the presence of cancer. Most approaches rely on the detection of mutations using deep, targeted sequencing of a few genomic regions known to harbor driver mutations for the cancer type of interest (Phallen et al., 2017). While deleterious or activating mutations in known driver genes are highly specific for cancer, the sensitivity of this approach is constrained as the cancer may not contain the mutation – or the mutations may not be detectable in the blood sample due to low concentration of circulating tumor DNA (ctDNA) (Bettegowda et al., 2014).

It is, however, possible to get more information out of cfDNA data than just genetic variants. A key difference between cfDNA data and ordinary sequence data is that cfDNA is fragmented in vivo by a combination of enzymatic and non-enzymatic processes. Most importantly, during apoptosis, DNA is enzymatically cut between nucleosomes, and hence the lengths and positions of cfDNA fragments reflect the epigenetic state of the cell-types of origin (Ulz et al., 2016; Snyder et al., 2016). Furthermore, other enzymatic and non-enzymatic fragmentation processes (e.g. oxidative stress) may further contribute to cancer associated fragmentation patterns (Heitzer et al., 2020). In contrast to mutations, these signals are expected to occur across the entire genome and suggest that a focus on sequencing width instead of depth can improve sensitivity.

The fragment length distribution has been a major focus of studies searching for non-mutational signals in cfDNA. Early studies used quantitative PCR with competitive primer sets targeting amplicons of different lengths to study the cfDNA fragment length distribution in cancer using either human-mouse xenografts or blood cfDNA representing different clinical states (Wang et al., 2003; Umetani et al., 2006; Chan et al., 2008; Mouliere et al., 2011). Yet, while cancer was clearly associated with changes in cfDNA fragmentation across all studies, conflicting results were obtained with respect to the direction of change (shortening or lengthening).

A more detailed picture of how cancer manifests in the fragment length distribution of cfDNA was later obtained using short-read sequencing based approaches. First, Jiang et al. used sequencing to determine that hepatocellular carcinoma is associated with a left shift in the cfDNA fragment length distribution (Jiang et al., 2015). Later on, Mouliere et al. used exome and shallow whole-genome sequencing (sWGS) to investigate cancer-specific cfDNA fragmentation patterns using human-mouse xenografts or cancer mutations to separate cancer fragments from background cfDNA (Mouliere et al., 2018). Again, they observed a number of cancer-specific distortions including left-shifting of both the mono- and di-nucleosome peaks, and a more prominent di-nucleosome peak, and used these to accurately discriminate cancer patients from healthy controls. Finally, Sanchez et al. also applied WGS to study differences in the fragment length distribution between cases and controls, and demonstrated significant differences in fragment lengths between single- and double-stranded cfDNA populations (Sanchez et al., 2021).

In their DELFI approach, Cristiano et al. added a genomic dimension to cfDNA fragment length analyses by computing the ratio of short (100–150 bp) to long (151–220 bp) fragments in 5 MB windows along the genome and showed how these capture nucleosomal distances, which in turn reflect chromatin state (Cristiano et al., 2019). Using these fragment length profiles as inputs to a machine learning classifier enabled accurate discrimination of earlier stage cancers from controls.

In this manuscript, we introduce a new computational method based on non-negative matrix factorization (NMF), an unsupervised learning method, for simultaneously determining the contributions of different cfDNA sources (e.g. background and tumor) to a sample along with the fragment length signatures of each source (Lee and Seung, 1999). The method is completely unsupervised and uses only fragment length histograms as input and hence provide estimates that do not require xenografting approaches, knowledge about genomic alterations (e.g. SNV, CNV) or sample information like disease state. Software for calculating fragment lengths and applying NMF is available under the MIT license (Renaud, 2022a; copy archived at swh:1:rev:cf9ed4240b74c866f62b3da2cdb4f0bbceb7f551).

Results

Discovering tumor fragment length signatures using non-negative matrix factorization

Our approach begins by computing cfDNA fragment length histograms in a series of samples based on paired-end read alignments and then uses these to construct a matrix with cfDNA fragment counts such that each row corresponds to a sample and each column to a specific fragment length (see Figure 1 for a schematic representation of the workflow). We then normalize the rows of this matrix such that they sum to one before performing NMF, where the input matrix is approximated as the product of two non-negative matrices – both smaller than the input. One of these matrices, the signature matrix, has as many columns as the original matrix and represents the preference of observing each fragment length for each cfDNA source. The other matrix, the weight matrix, has as many rows as the input matrix and represents the contributions of each cfDNA source to each sample. The number of cfDNA sources is a hyperparameter that needs to be set in advance.

Discovering fragment length signatures using non-negative matrix factorization.

The cell-free DNA (cfDNA) pool contains a mixture of fragments from different sources such as tumor cells and background (mainly cells of hematopoietic origin). After performing paired-end sequencing of cfDNA, we estimate fragment length histograms for each sample by aligning reads to the reference genome. We next generate a matrix with fragment length frequencies such that rows and columns represent samples and fragment lengths, respectively. After normalizing the rows of this matrix, we then factorize it into two non-negative matrices: (1) The signature matrix is aligned with columns and expresses the preference of each cfDNA source for different fragment lengths and (2) the weight matrix, which is aligned with rows, and contains the estimated contributions of each source to each sample.

We first tested our method on sWGS of cfDNA (coverage mean: 0.60 X, range: 0.36 X–0.93X, median #read-pairs: 19.9 M) in 142 plasma samples from 94 patients with metastatic castration resistant prostate cancer (mCRPC). Samples were collected either before the initiation of first line treatment (n=93) or at disease progression after first line treatment (n=34) or later treatments (n=15) with some individuals (n=36) sampled at multiple timepoints. The observed fragment length distributions of high and low ctDNA samples differed (Figure 2a) and resembled those of previous reports (e.g. Jiang et al., 2015; Mouliere et al., 2018; Sanchez et al., 2021). Assuming two cfDNA sources (tumor/non-tumor), we then estimated fragment length signatures and weights using NMF on the normalized table of fragment length frequencies. One of the signatures (signature#2) recapitulates key features of the previously described tumor signatures including left skew, increased 10 bp periodicity left of the main mode and an enlarged second peak suggesting that this source represents the tumor (Figure 2b).

Figure 2 with 7 supplements see all
Non-negativematrix factorization (NMF) on shallow whole-genome sequencing (sWGS) and deep targeted sequencing of cell-freeDNA (cfDNA) from prostate cancer patients.

(a) sWGS fragment length histograms for 86 prostate cancer patients; colors reflect ctDNA fractions estimated from driver variant allele fractions obtained from targeted sequencing performed on the same samples. (b) Fragment length signatures inferred using NMF with two components on the sWGS dataset. (c) Lengths of fragments containing a driver mutation (red dots), lengths of fragments overlapping the mutated position but not containing the mutation (green line) and tumor fragment length signature estimated by NMF (blue line). (d) ctDNA fractions estimated using driver allele frequencies from targeted data versus weights of the second NMF component estimated on sWGS data (signature#2 in panel b). (e) ctDNA fractions estimated using driver allele frequencies from targeted data versus weights of the second NMF component estimated on the same targeted data. (f) Correlation of tumor signature weights estimated using NMF and ichorCNA with ctDNA fractions for different levels of downsampling of the sWGS data.

To confirm the ability of NMF to separate tumor and background (i.e. non-tumor) sources, we performed targeted, deep sequencing (coverage mean: 647 X, range:152X–1198X, median #read-pairs cfDNA / PBMC: 35.0 M / 6.70 M) for a subset of 86 mCRPC patients using a panel of genes related to prostate cancer (5137 regions, ~1.2 MB) (Mayrhofer et al., 2018; Crippa et al., 2020). We then called somatic variants in this data and estimated NMF using only fragments that overlap a mutated position. The putative cancer signature could then be compared to the fragment length distribution of fragments with and without mutations (Figure 2c). The fragment length distribution of fragments containing mutations closely matched the suspected tumor signature estimated using NMF and hence confirms that our method is able to separate tumor and background cfDNA sources solely based on fragment length information.

We next sought to investigate whether the estimated tumor signature weights are related to the blood ctDNA fraction. We, therefore, compared the sample weights of the tumor fragment length signature against sample ctDNA fractions estimated based on driver variant allele fractions (VAFs) obtained from the targeted sequencing data (min, median, max: 0.009, 0.197, 0.895) as well as estimates obtained using ichorCNA (min, median, max: 0.030, 0.071, 0.728), which uses CNVs (Figure 2—figure supplement 1). The estimated weights correlate strongly with ctDNA fraction for both sWGS (r=0.75, Figure 2d) and targeted sequencing data (r=0.83, Figure 2e) and hence both confirm the tumor origin of the signatures and demonstrate that our method can be used to estimate ctDNA fraction in the absence of any variant (i.e. SNVs/indels/CNVs) or clinical information.

We also explored using more cfDNA sources in the NMF. More specifically, we ran NMF for up to four components and empirically tested adding the weights of different combinations of NMF components (i.e. assuming different tumor sources) by comparing them with the driver VAF-based estimates (Figure 2—figure supplement 2). Only marginal improvements were observed using the more complex models suggesting the simpler alternative with two cfDNA sources is preferable as it is more interpretable and leaves little room for overfitting. In resumé, our method is able to determine fragment length signatures and ctDNA fractions on both sWGS and panel sequencing data in a completely unsupervised manner.

The performance of our method was superior to using the ratio of short-to-long fragments (100–150 bp vs. 151–220 bp) proposed by Cristiano et al. and analogous to the PCR-based methods as judged against driver VAF-based ctDNA fraction estimates (r=0.68, Figure 2—figure supplement 3). ichorCNA, which is based on CNV signals, attained better performance than NMF (r=0.79, Figure 2—figure supplement 4). We speculated that our method might work better when the data is sparse as it leverages information across the entire genome rather than only regions affected by CNVs. We thus ran both NMF and ichorCNA on the sWGS data for different levels of subsampling (Figure 2f). We observed that, while ichorCNA correlated better with the driver VAF-based ctDNA fractions for higher depths, our method seemed markedly better at low depths. Intriguingly, our method attained correlations greater than 0.68 with the VAF-based estimates using as little as 1000 fragments per sample. We further wished to assess how many samples are required to obtain reliable estimates using NMF. We therefore repeatedly constructed new datasets by randomly selecting a subset of samples and then training NMF models on the reduced datasets. We then computed cosine similarities between signatures estimated on the reduced datasets and those estimated on the entire dataset. Using this approach, we found that the correct fragment length signatures could be estimated with as little as 20 samples (Figure 2—figure supplement 5). Finally, to determine our model’s ability to generalize on unseen data, we performed repeated experiments with training the model on half of the samples and predicting only the signature weights on the remaining samples. Here, we observed similar ctDNA% correlations between the samples used for training relative to those that were held out (Figure 2—figure supplement 6).

Detecting cancer using two fragment length signatures

To see if fragment length signatures can detect the presence of cancer fragments from different cancer types, we reanalyzed the data from the DELFI study (Cristiano et al., 2019). We obtained the raw sWGS data (mean coverage: 2.84X range: 0.71X–13.4X) from 498 samples from this study including 260 healthy controls and 238 cancers distributed across seven different cancer types. The fragment length distributions in these samples (see Figure 3a; Figure 3—figure supplement 1) were similar to the prostate cancer data, and as expected, we saw a general tendency for shorter fragment lengths in cases compared to controls. There were, however, also visible differences between the prostate data and the DELFI data set. We observed, for instance, fewer di-nucleosome fragments in both DELFI cases and controls compared to the prostate cancer data. Like the prostate analysis, we trained NMF on the matrix containing fragment length histograms for each sample, again assuming two cfDNA sources (see Figure 3b). We assumed that the signature with the lower mean fragment length was the cancer-related signature. This signature did indeed tend to have a higher weight in the cancer samples (Figure 3—figure supplement 2a), and it could differentiate between cancer and control samples with an AUC of 0.75 (Figure 3—figure supplement 2b). We also investigated if the AUC differed when restricting the input data to a specific cancer type. The AUC was highest for colorectal cancer patients (0.93) and lowest for gastric cancer (0.56). Full results are shown in Figure 3—figure supplement 3a. Furthermore, we sought to determine whether the inferred signatures differ from one cancer type to the next. The different inferred signatures were plotted (Figure 3—figure supplement 3b). Apart from gastric cancer, which happens to be the one with the lowest AUC, most fragment length signatures do not seem to change depending on the cancer type.

Figure 3 with 5 supplements see all
Non-negative matrix factorization (NMF) on cell-free DNA (cfDNA) shallow whole-genome sequencing (sWGS) from the DELFI study.

(a) sWGS fragment length histograms for the 533 DELFI samples; colors indicate case-control status of the sample. (b) Fragment length signatures inferred using NMF with two components on the sWGS dataset. (c) AUCs obtained when discriminating cases versus controls using a linear Support Vector Machine (SVM) on the sample component weights across different numbers of components in the NMF model. Boxplots are based on repeating the Cross Validation 50 times. (d) Chromatin state fragment length signatures estimated using fragment length histograms from 250 kb bins along the genome aggregated across all control samples. (e) Ratio of short (100–150 bp) to long (151–220 bp) fragments (‘DELFI ratio’) or weight of the first NMF component (signature#1 in panel d) versus ENCODE ATACseq from a Lymphoblastoid cell-line for 250 kb genomic bins. (f) AUCs obtained when discriminating cases versus controls using a linear SVM on ‘DELFI ratio’ or weight of the first NMF components from panel d (red) or panel b (blue) inferred in bins along the genome for different bin sizes.

Detecting cancer using more than two fragment length signatures

We then tested whether using more signatures would improve this classification. We used a linear Support Vector Machine (SVM) to see if we could separate cancer and control samples based on their signature weights for a given number of signatures. The results showed that adding more signatures significantly improved the AUC (see Figure 3c). The classification continued to get better until we reached ~30 signatures, and for 30+signatures, we got an AUC above 0.95. This AUC is comparable to the AUC of 0.94 the DELFI method achieves by using Gradient Boosting Machine on ~500 features per sample. When testing a SVM with 30 signatures on the sequential samples from the DELFI article a majority of the samples had a positive correlation between the predicted case probability and the tumor fraction estimated based on VAF (Figure 3—figure supplement 4).

Estimating chromatin state signatures using NMF

The DELFI method also uses fragment length information to detect cancer samples, but rather than looking at the total length distribution, it looks at fragment lengths in bins along the genome (Cristiano et al., 2019). The DELFI creators have shown that the length distribution of cfDNA fragments in a genomic region carries information about its chromatin state (open or closed) and that this information can be used to distinguish cancer samples from healthy controls. Specifically, the DELFI method uses the ratio of short (100–150 bp) to long (151–220 bp) fragments in 5 MB windows along the genome as input to a machine learning classifier. We wished to investigate whether we could use NMF fragment length signatures to better capture this chromatin state and hence yield better classification. First, we partitioned the genome into non-overlapping bins of 250 kb and computed the fragment length histograms for each bin in each sample. We then summed histograms for each bin across all healthy controls to yield a matrix with rows corresponding to genomic bins instead of samples as before, and ran NMF using two signatures. This resulted in the two bin-wise NMF signatures shown in Figure 3d. To compare, we also calculated the short-to-long ratio used by the DELFI method in each bin. Looking at chromatin accessibility in a lymphoblastoid cell-line measured by ATACseq for the 250 kb bins, we observed a slightly better Spearman correlation for the weight of the first signature than the DELFI ratio (Figure 3e).

Using regional signature weights for classification

We then wanted to test whether using the NMF weight in each bin as input features for a classification method would be superior to using the DELFI ratios. We estimated the first bin-wise signature’s weight in each bin in each sample and used these weights as input to an SVM and assessed classification performance across different bin sizes. The results in Figure 3f show that for small bin sizes, we get a slightly better result by using the bin-wise NMF weight for each bin instead of the short-to-long DELFI ratio as SVM input. But for larger bin sizes, the DELFI ratio is better than using the bin-wise NMF signature. Finally, we also tried inferring the weight of the sample-wise NMF signature from Figure 3b for each bin and using that as input to the SVM (blue line in Figure 3f). This turned out to give superior results compared to using the bin-wise NMF signature.

Discussion

In this article, we propose non-negative matrix factorization (NMF) of fragment length distributions as a new tool in the cfDNA-seq analysis toolbox. A key objective in cfDNA-seq analyses in the cancer setting is to determine the amount of circulating tumor DNA – most importantly whether any tumor DNA is present at all. This carries immense importance both for early detection of cancer in the screening setting – and for detection of relapse after treatment, due to residual disease after surgery or resistance to chemotherapy.

Most analyses have focused on using mutational signals for determining ctDNA load through estimation of variant allele fractions for SNVs in deep, targeted cfDNA-seq data (Phallen et al., 2017) – or CNVs by sWGS (Adalsteinsson et al., 2017). Yet non-mutational signals such as those shown by several studies to be manifested in the fragment length distribution are gaining traction as they may improve sensitivity by enabling aggregation of the cancer signal across the entire genome rather than just positions affected by mutations. So far, fragment length signals have been approached either using a simple summary statistic like the ratio of short to long fragments e.g. DELFI (Cristiano et al., 2019) or by manually curating features reflecting the distribution and then use these as input for a supervised machine learning algorithm (Mouliere et al., 2018). However, manual featurization may not make the best use of all relevant information contained in the distribution – and supervised learning with many features carries the risk of overfitting.

We propose NMF as a general analytical framework for working with cfDNA-seq fragment length distributions. NMF enables us to simultaneously estimate fragment length signatures and their weights in each sample. Using sWGS cfDNA-seq data from a cohort of patients with metastatic prostate cancer, we show that NMF estimated with two components discovers a signature, which accurately matches the true tumor fragment length distribution and exhibits many of the characteristics previously associated with ctDNA. The weights of this signature correlated strongly with ctDNA levels – nearly as good as ichorCNA - without using any information about variants or ctDNA levels. Importantly, similar results were obtained when using deep, targeted cfDNA-seq. Furthermore, subsampling experiments revealed that NMF was markedly more robust when less data is available than ichorCNA. This may indicate that NMF is more robust than ichorCNA at lower tumor fractions. However, the lack of low ctDNA samples is a limitation of the current study and further experiments either based on samples with lower tumor burdens, spike-ins or in silico dilution are needed to confirm this assertion as it could not be directly tested with the data available in this study. A lack of healthy controls for the mCRPC cohort data meant that we were not able to perform spike-in or in silico dilution experiments in this study. Finally, as the fragment length signal is likely orthogonal to any mutational signal, it may be possible to combine these lines of information to obtain a better, joint estimate of tumor load.

The unsupervised nature of NMF implies little risk of overfitting and the ability to inspect the fragment length profiles of each signature provides full transparency of the method in contrast to many other approaches such as the supervised learning by Random Forest strategy applied by Mouliere et al. for determining ctDNA levels. Transparency is important because using non-mutational information carries the risk of using information that is not directly linked to the presence of ctDNA in blood, but instead reflects, e.g. an ongoing immune response. This in turn may impact a models’ ability to generalize to unseen data and clinicians’ trust in the model. Using NMF, we were able to verify that the fragment length profile of the signature correlating with ctDNA levels does indeed match the length distribution of fragments containing mutations and hence that the model is directly measuring ctDNA load. Hence, the combination of unsupervised learning and transparency suggest that NMF constitutes a robust modeling framework for cfDNA-seq length spectra.

The data from the prostate cancer cohort generally contained patients with a high tumor burden and we wished to also test the models applicability in a screening context characterized by low ctDNA load and also look at different cancer types. We obtained access to the data from the DELFI study, which contains cfDNA-seq data from a range of cancer types and primarily from patients with early stage disease (Cristiano et al., 2019). The signatures estimated from high ctDNA load prostate cancers and those of DELFI cases shared features (e.g. left skew of the main mode), but also differed as for instance the second mode of the distribution was less pronounced in the DELFI data. These differences may reflect differences in sample processing (e.g. DNA extraction method) and sequencing technology rather than actual biological differences between the studies suggesting that transferring models trained on one dataset to another may be difficult although this assertion was not directly tested in the present study. Hence, this constitutes an important, potential limitation of ‘fragmentomics’ that will require further attention in the future. That said, using an unsupervised method such as NMF to learn the relevant fragment length signatures can help alleviate such transferability problems as the model can easily be retrained on, e.g. a new batch of data. To know which of the two signatures learned corresponds to cancer fragments, one could use a previous set of signatures trained on a labeled set as the starting point for the NMF optimization.

We note that a number of technical and biological factors in the DELFI study design may have inflated our classification performance. For instance, cases were generally older than controls and hence the observed differences in the fragment length distribution between cases and controls may partially reflect underlying comorbidities associated with higher age.

On the DELFI dataset, we furthermore demonstrated that using several NMF components enabled accurate cancer detection of early stage cancers – on par with the original DELFI results. We obtained the best classification results by using 30 or more signatures, and an even larger number of signatures could likely be relevant for larger or more heterogeneous datasets. Indeed, we speculate that the difference in gain from using more signatures between the mCRPC cohort, where two signatures worked well, to the DELFI study reflects a larger degree of heterogeneity in the multi-cancer DELFI study. Even when more components are required, using a supervised model with tens of parameters rather than hundreds as in the genomic window model makes overfitting less likely.

Finally, we investigated whether the NMF approach could improve upon the genomic bin-based approach proposed by Cristiano et al. We first showed how NMF can discover fragment length signatures of different chromatin states when trained across genomic bins, where the fragment length histogram in each bin has been aggregated across multiple samples (bin-wise training). The learned chromatin state signatures turned out to correlate better with open chromatin as measured using ATACseq than the DELFI ratio, but did not yield a clear improvement in classification performance over the DELFI method. We speculated that the lack of classification improvement could be due changes between cases and controls not related to chromatin status. We therefore investigated whether the bin-based approach could be improved by instead inferring the signature weights of the sample-wise trained NMF model in each genomic bin. To our surprise, this model outperformed both the bin-wise NMF and the DELFI ratio across all bins sizes, which may indicate that the DELFI classification signal is not purely a chromatin state signal but in part caused by CNVs in the tumors or other cancer-specific distorsions.

Conclusions

In resumé, we here demonstrate the use of NMF as a general and robust statistical approach for analyzing fragment length distributions from cfDNA-seq.

Materials and methods

Sample processing and DNA extraction

Request a detailed protocol

A total of 142 EDTA-blood samples were collected from 94 patients with metastatic castration-resistant prostate cancer at Aarhus University Hospital and Regional Hospital of West Jutland between April 2016 and August 2019. Samples were collected either before the initiation of first line treatment (n=93) or at disease progression after first line treatment (n=34) or later treatments (2nd– 4th lines, n=15) with 36 patients having multiple samples taken (of these: median 2 samples/patient, range: 2–4 samples/patient).

Blood samples were collected in BD Vacutainer K2 EDTA tubes (Beckton Dickinson) and processed within 2 hr (stored at 4°C until processing). To separate plasma from cellular components, EDTA blood samples were centrifuged at 2000–3000 g for 10 min (20°C) and plasma stored in cryo tubes (TPP) at –80°C until cfDNA extraction. Plasma samples (2.0–4.5 ml) were thawed at room temperature and centrifuged at 3000 g for 10 min (20°C). cfDNA was extracted on a QIAsymphony robot (Qiagen) using the QIAamp Circulating Nucleic Acids kit (Qiagen) as described by the manufacturer. Extracted cfDNA was stored in LoBind tubes (Eppendorf AG) at –80°C until further analysis (<1 month). cfDNA concentration was determined by droplet digital PCR (ddPCR) using a QX200 AutoDG Droplet Digital PCR System (Bio-Rad) according to the manufacturer’s instructions as previously described (Reinert et al., 2016). Germline DNA from buffy coats [peripheral blood mononuclear cells (PBMC)] was extracted on a QIAsymphony robot (Qiagen) using the QiaSymphony DSP DNA Mini Kit (Qiagen) following manufacturer’s instructions. DNA concentrations were determined using Qubit fluorometric quantification (Qubit dsDNA Broad range, Thermo Fisher Scientific). Prior to library preparation, the Covaris E220 Evolution ultrasonicator (Covaris) was used to shear germline DNA to shorter fragments (~250–350 bp).

Library preparation and shallow whole-genome sequencing

Request a detailed protocol

Libraries for next generation sequencing were prepared using the Kapa Hyper Library Preparation Kit (KAPA Biosystems) with 7.0–50.0 ng cfDNA (median 30.5 ng) or 50 ng for germline DNA as input. xGen CS-adapters – Tech Access (IDT-DNA) with Unique Molecular Identifiers (UMIs) on both strands and primers containing unique indexes on both strands were used to generate indexed libraries. Plasma libraries were pooled equimolarly and paired-end sequenced (2 × 151 bp) on an Illumina Novaseq instrument (S-prime flowcell), generating 12.66–34.30 million read pairs/ sample (median: 19.99) corresponding to coverages from 0.36 to 0.93X (median: 0.60X). Fastq files were demultiplexed using bcl2fastq (v2.20.0.422) and quality checked using fastQC and fastqScreen (Available from: http://www.bioinformatics.babraham.ac.uk/projects).

Estimation of fragment lengths

Request a detailed protocol

We used leeHom with the ‘--ancientdna’ option on the raw fastq files to strip adapters, and, where possible, to reconstruct DNA fragments by merging overlapping paired-end reads. UMIs were removed by trimming the first five nts of reach read (Renaud et al., 2014). Reads were then mapped to hg19 using BWA-MEM v.0.7.17 with seed length (‘-k’) set to 19 (Li, 2013). PCR and optical duplicates were removed using samtools rmdup using option ‘-s’ for single-end reads. Paired-end reads with mapping quality below 30, mapping within ENCODE excluded regions (ENCFF001TDO) or which contained soft or hard clips in either of the two reads were filtered away before computing fragment lengths. Fragment lengths were then directly obtained as the length of the reconstructed fragment when reconstruction was possible and otherwise obtained from the insert size calculated by the aligner based on distances on the reference sequence. Finally, fragment lengths less than 30 or greater than 700 were discarded and a matrix with fragment length counts constructed such that rows and columns corresponded to samples and fragment lengths, respectively.

Non-negative matrix factorization

Request a detailed protocol

The rows of the matrix with fragment length counts were first scaled such that they sum to one. NMF of the normalized matrix was then performed using scikit-learn (sklearn.decomposition.NMF) with random initialization, multiplicative updates and the Kullback–Leibler loss function. For the bin-wise NMF experiments, the NMF was repeated across 20 different random initializations and the fit achieving the lowest loss selected. The estimated fragment length signature and weight matrices were scaled to sum to one for each signature and sample, respectively. Software for calculating fragment lengths and applying NMF is available under the MIT license (Renaud, 2022a).

Subsampling experiments

Request a detailed protocol

To determine the sequence coverage required to estimate tumor fraction using NMF (Figure 2f) we subsampled input bam files using a custom C++program (https://github.com/grenaud/libbam/blob/master/subsamplebamFixedNumberSingleFirstMate.cpp; Renaud, 2022b; copy archived at swh:1:rev:c210c1ce1d4f0e1fe07fcbb4a438f9a7a2e3cf9b). Briefly, this program subsamples a specific number of fragments while taking into account paired-end information. The subsampled bam files were then used as input for NMF and ichorCNA and for each subsampling level, we calculated the average Pearson correlation between the NMF/ichorCNA tumor fraction estimates and the VAF based tumor fraction estimate.

To determine the number of samples required to reliably estimate fragment length signatures (Figure 2—figure supplement 5), we repeatedly sampled a given number of samples across a range of dataset sizes and trained NMF models on each of the generated datasets. For each trained model, we then calculated the maximum cosine similarity between signatures estimated on the smaller dataset with those estimated using the full dataset. For each number of samples, we conducted 100 replicates as to have an average behavior.

To determine accuracy on hold-out samples (Figure 2—figure supplement 6), we repeatedly divided the data into two partitions, a training and a test set of equal size (43 samples). For each pair, we then trained an NMF model on the training set and predicted the weights of each signature on the test set using the signatures learned from the training set. We then calculated the Pearson correlation between the estimated weights and the cfDNA% using both the weights from the training and test sets.

Analyzing fragment lengths in bins along the genome

Request a detailed protocol

We first divided the genome into bins of 250 kb and estimated the fragment length distributions for each sample in each bin. We then calculated the mean number fragments in each bin across the controls. To avoid bins with possible mapping problems, we excluded bins where the mean number of fragments in the controls were less than the median-2*IQR or greater than the median + 2*IQR. Comparison with ATACseq data was performed using ENCODE track ‘ENCFF603BJO’ (fold change over control, GM12878 cell line) lifted from hg38 to hg19.

Building support-vector machine classification models

Request a detailed protocol

The classification results based on multiple fragment length signatures and the classification results on chromatin state signatures and DELFI ratios were calculated using linear SVMs implemented in R using tidymodels and kernlab. Before training, the values for each sample was standardized by subtracting the mean and dividing with the standard deviation. Accuracy and AUC were calculated using repeated 10-fold cross-validation (50 repeats). Scripts used for the analysis of the DELFI data are available in Source code 3.

Deep targeted prostate-tailored sequencing

Request a detailed protocol

A total of 86 indexed libraries were subjected to deep targeted sequencing based on ichorCNA ctDNA% estimates from plasma cfDNA libraries. A previously designed gene panel that captures regions in the human genome commonly altered in PC was used for capture (Mayrhofer et al., 2018; Crippa et al., 2020). Libraries were pooled equimolarly (8-plex) for in-solution target enrichment using Twist Bioscience’s Custom Target Enrichment. Captured pools were paired-end sequenced (2 × 100 bp) on an Illumina Novaseq instrument (S1 flowcell). Fastq files were generated as for the sWGS data and adaptor trimming, mapping, duplicate removal, realignment, and quality score recalibration performed as described above for the ichorCNA workflow for sWGS data. The resulting depth of coverage for the targeted regions was on average 647X with a minimum of 152X and a maximum of 1198X. The average coverage for the targeted sequencing of germline DNA from buffy coats was 306X (range: 253–728X).

Variant calling and interpretation

Request a detailed protocol

Somatic variants were called using VarDict (v. 1.6) (Lai et al., 2016), Strelka2 Somatic (v. 2.9.10) (Kim et al., 2018), GATK mutect2 (v. 4.1.2.0) (Cibulskis et al., 2013), and VarScan2 (v. 2.4.2) (Koboldt et al., 2012). For each sample, a patient-matched germline sample was used as control. The impact of each variant was annotated using the Ensembl Variant Effect Predictor (ensemble-vep v. 96.0) (McLaren et al., 2016). Somatic variants were required to be supported by at least 10 reads, called by ≥3 callers, and annotated as either pathogenic or likely pathogenic in OncoKB or ClinVar (Chakravarty et al., 2017; Landrum et al., 2014) or introduce a premature stop or frameshift in the coding sequence. High-impact variants called by 2 callers were not discarded whereas low-frequency variants (VAFs <0.02) were discarded unless they had high impact, were called by all 4 callers, or were detected in another sample from the same individual. LOH-status was annotated for each variant using cfDNA copy number profiles and allele ratio of heterozygous SNPs. All variants were manually curated in IGV (v. 2.5.3) and we made sure that no alternative reads were seen in the germline DNA.

Running ichorCNA

Request a detailed protocol

Adapter sequences were trimmed using Cutadapt (v1.16) (Martin, 2011) and paired-end sequences mapped to the hg19 reference genome using BWA MEM (v0.7.15-r1140) (Li, 2013). PCR and optical duplicates were removed from each library independently using Samblaster (v0.1.24) (Faust and Hall, 2014) and the final bam files realigned (GATK v3.8.1.0) (DePristo et al., 2011). ichorCNA was run with default parameters (Adalsteinsson et al., 2017).

Estimation of ctDNA fractions from targeted seq data

Request a detailed protocol

CtDNA fractions were also estimated from the targeted sequencing data. First, the tumor cell purity (i.e. tumor cell fraction) was calculated from somatic mutations with moderate or high impact. In brief, for each sample the somatic mutation with the highest VAF was used (i.e. clear driver mutation). If multiple mutations had similar VAFs (±2%), the median VAF of these mutations were used instead. A total of 1–2 mutations per sample were used to estimate purity. Accounting for LOH status of the mutation(s), the tumor cell purity was estimated as follows: purity = 2*VAF (no LOH) or purity = 2/(1/VAF + 1) (LOH). Next, to obtain ctDNA fractions, the tumor cell purity was adjusted for tumor ploidy: ctDNA fraction = tumor cell purity * tumor ploidy/(tumor cell purity * tumor ploidy + normal cell purity * normal ploidy). Here, normal purity was set to 1-tumor cell purity and normal ploidy was set to 2. Tumor ploidy estimates were obtained from PureCN (Riester et al., 2016).

Data availability

Danish law requires ethical approval of any specific research aim and imposes restrictions on sharing of personal data. This means that the prostate cancer data used in this article cannot be uploaded to international databases. External researchers (academic or commercial) interested in analysing the prostate dataset (including any derivatives of it) will need to contact the Data Access Committee via email to kdso@clin.au.dk. The Data Access Committee is formed of co-authors Karina Dalsgaard Sø;rensen and Michael Borre, and Ole Halfdan Larsen (Department Head Consultant, Department of Clinical Medicine, Aarhus University). Due to Danish Law, for the authors to be allowed to share the data (pseudonymised) it will require prior approval from The Danish National Committee on Health Research Ethics (or similar) for the specific new research goal. The author (based in Denmark) has to submit the application for ethical approval, with the external researcher(s) as named collaborator(s). In addition to ethical approval, a Collaboration Agreement and a Data Processing Agreement is required, both of which must be approved by the legal office of the institution of the author (data owner) and the legal office of the institution of the external researcher (data processor). Raw fragment length distributions along with ctDNA% estimates are available in Supplementary file 1.

The following previously published data sets were used
    1. Karlijn L
    (2019) EGA
    ID EGAD00001005339. Genome-wide cell-free DNA fragmentation in patients with cancer.

References

    1. Wang BG
    2. Huang H-Y
    3. Chen Y-C
    4. Bristow RE
    5. Kassauei K
    6. Cheng C-C
    7. Roden R
    8. Sokoll LJ
    9. Chan DW
    10. Shih I-M
    (2003)
    Increased plasma DNA integrity in cancer patients
    Cancer Research 63:3966–3968.

Decision letter

  1. Detlef Weigel
    Senior and Reviewing Editor; Max Planck Institute for Biology Tübingen, Germany
  2. Alain Thierry
    Reviewer

Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "Discovering fragment length signatures of circulating tumor DNA using Non-negative Matrix Factorization" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, and the evaluation has been overseen by Y M Dennis Lo as the Senior/Reviewing Editor. The following individuals involved in review of your submission have agreed to reveal their identity: Alain Thierry (Reviewer #1).

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

1. More information on the source of cancer patient plasma, characteristics of the patients, especially in respect to treatments at time of blood sampling, should be included. A flow chart summarising the overall results might be useful.

2. The authors can consider citing some of the older work analyzing cfDNA fragment size distribution using PCR-based approaches.

3. The authors showed exactly the same illustrative fragment size profile in Figure 1 (normal and tumor signature) as those presented earlier by Sanchez et al. (JCI insight, 2021). Consequently, they should at least mention in a few sentences the observations made by Jiang et al. and by Sanchez et al. (JCI insight, 2021) in order to validate their assumption.

4. As one could argue that the signatures of cfDNA fragment length had been discovered through various previously published work, in light of the previous concerns, term 'discovery' is perhaps inappropriate. Hence, we would suggest replacing the word 'discovery' by another term such as 'detection' or 'assessment'.

5. In the introduction, it would be better to describe a bit more details about tumor DNA length characteristics that were reported previously, so that readers would have a more comprehensive understanding of what is already known in the prior art and why the analysis of tumor-specific length profile in this study is important.

6. The authors should compare whether an NMF component would give a better correlation with tumor DNA estimated by ichorCNA, compared with the short DNA proportion (i.e.% of plasma DNA < 150 bp) in the overall size profile.

7. It seems somewhat intriguing that the classification power between cases and controls improved as the number of components used in the NMF process. However, the authors did not see an improvement in the correlation between tumor DNA fraction and the weight of one component reflecting the tumor contribution. These observations should be discussed and clarified in a revised manuscript.

8. The AUC was calculated using repeated 10-fold cross-validation. It would be better to show the boxplot of AUC values for a given number of NMF components (Figure 3c).

9. It will be informative to estimate the minimum number of samples is required for performing an accurate NMF analysis.

10. The authors claim their model is directly measuring ctDNA load. Since an untargeted (unsupervised) model to assess the tumor fraction for monitoring purposes would be of great value, it would be interesting to see if changing levels of ctDNA can be captured by changing weights of the putative cancer Signature 1. Since their mCRPC cohort included patients with follow-up samples (at therapy start and progression) it is not clear why this has not been tested.

11. It should be mentioned that ichorCNA can only provide informative results at ctDNA fraction of 3% and higher. At higher tumor fractions the fragment signature signal might be relatively strong, explaining the good correlation, but what about samples with lower fractions? Please indicate the tumor fractions (median, min, max) assessed with ichorCNA and based on the VAFs of the identified mutations for all patients.

12. The authors claim that using NMF they were able to verify that the fragment length profile of the signature correlating with ctDNA levels does indeed match the length distribution of fragments containing mutations. However, although they included PBMCs in their mutation analysis the authors did not correct for clonal hematopoiesis (CH). Recent studies showed that up to 20% of identified mutation are related to CH and that such mutations are also found in driver genes for solid tumors. Since there might be a size difference of mutated fragment originating from the tumor or PBMCs, respectively, this statement is overrated. In case their sequencing approach was not sensitive enough to detected CH-related variants this should at least be mentioned.

13. In Figure 2c, the shift towards smaller fragment is only seen for the dinucelosomes peak but not for the mononucleosomal fragments. How do the authors explain this (if not by contamination of CH-related variants)?

14. The authors demonstrated that NMF was more robust when less data was available than the ichorCNA algorithm, but they lacked to provide convincing data that NMF as a quantitative measure is more sensitive with respect to tumor fraction. To this end, in silico dilution experiments where they computationally mixed healthy fragments and high ctDNA fraction sample(s) to a ctDNA fraction of their choice (with repetitions to achieve a more robust es-timation) should be performed.

15. If there are no CNA or tumor fraction >3%, ichorCNA can literally not be used to determine ctDNA fraction, so the feasibility of the dilution approach to benchmark ichorCNA is not clear to the reviewer (in particular for early-stage cases, in which most samples have tumor fractions >1%). Moreover, low depth can be overcome by additional sequencing whereas the tumor fraction cannot be enriched by more reads.

16. Using NMF the authors demonstrate that there are 2 major contributors to fragment length distribution and the addition of further signatures does not add anything. However, this statement is conflicting with results of the linear SVM classification case/control approach (Figure 3c). A plateau of classification accuracy is reached when using around 30 signatures, which suggests this would be a good trade-off for number of signatures to reach high accuracy (AUC).

17. Figure 3d: Interestingly, open chromatin is associated with a higher fraction of dinucleosomal fragments. How do the authors explain this? Figure 3: the reviewer is not sure what the authors want to show here with Pearson's correlation coefficient, this does not look like a linear relationship.

18. The paragraph on "epigenetic fragment length" could be better presented. I would re-phrase "epigenetic state". Both the DELFI ratio and the fragment length signature are surrogates for the chromatin state (open or closed). First the author talks about bin-wise extraction of two signatures. Why is it, that reducing bin size gives better classification results? ~50 kb bins seem to be best to predict ATACseq results. Is there literature that describes open/closed chromatin in this way? Then they correlate the weights from that with ATACseq results. And finally, they use their bin-wise signature extraction results/weights as input for cancer/control classification. Could be split in two parts for clarity: one that talks about correlation with ATACseq results and one that discusses classification results.

19. cfDNA fragment length profiles are particularly prone to pre-analytical (extraction, library prep) and analytical (clustering efficiency of various sequencings machines) factors. How do preanalyitcal/analytical factors affect the classification? The authors mention in the discussion that changes in the size profiles from their mCRPC cohort and the DELFI samples vary, which might be attributed due to such factors. Yet such confounding event will impact the robustness of such approaches and may prevent a widespread implementation of fragmentomics. This should be further discussed.

20. How can the authors exclude that the above-mentioned differences are attributed to higher/lower ctDNA fractions in the metastatic CRPC and the early stage DELFI samples? Even in Figure 2a it seems that samples with higher tumor fraction have more dinucleosomal fragments. Separate fragmentation profile plots for each stage should be shown; it would be interesting whether such profiles show any differences for stage I and stage II patients.

21. The authors could have checked in the DELFI cohort whether there is a (physiologic fragment length variation) between the seven cancers types? Would there always be the same two signatures coming up if the only consider one tumor type? Based on Supplementary Figure 4 this seems not to be case. In addition, there is a huge overlap of healthy individuals and cancer patients for signature 1? Please also present Supplementary Figure 4 with Signature 2. Can we really assume that this signature is the one related to cancer?

22. Abbreviations should be defined at the beginning for better readability.

23. Resolution of figures need to be improved. Comprehensive legends should also be included for Supplementary figures.

24. Matching colors for better readability are recommended.

25. Specify average read pairs for cfDNA samples and corresponding PBMCs.

26. Excel Supplementary file: Supplementary table 2 should be mentioned in the manuscript.

27. ctDNA (%) on the x-axis in Figures 2 is calculated from targeted mutation analysis, right? How was est.ctDNA % in the Supplementary figures 2&3 calculated?

28. What is the original h matrix in NMF in Figure 2f? There is a normal h matrix described as weight matrix, but I haven't seen the definition of what with "original" h matrix is meant.

29. Assumptions for linear correlation using Pearson are not mentioned, I don't know whether they are met at all. Looking at the figure it does not seem like they are met. Normality of the variables should at least be checked.

30. How is NMF performing on previously unseen data? This part is missing. The authors could have checked for overfitting in this way for example calculating the reconstruction error on a test dataset.

31. It is not clear whether cross validation was used only for the classification task or also to evaluate the NMF decomposition. Are those signatures representative?

32. It would be nice to see the position of the excluded bins based on +- 2* IQR. They already excluded the ENCODE blacklist regions. Other outliers may be real biological signal especially in some type of cancer where copy number alteration play an important role.

Reviewer #1 (Recommendations for the authors):

Renaud et al. aims at developing a new computational method based on NMF relying on the difference in the fragment size profiles of cfDNA extracted from plasma of healthy individuals and metastatic castration-resistant prostate cancer patients. The manuscript is clear and the drawings very helpful. While data are convincing, the reviewer has a few major concerns that should be necessarily addressed:

Major concerns:1. The authors generally poorly refer to the source of cancer patient plasma. Limitations of the work, especially method performance, is slightly described in particular about cancer stages, or other cancers. In addition, characteristics of the patients are missing, especially in respect to treatments at time of blood sampling. Lastly, a flow chart is missing, enabling to evaluate successful rates.

2. Citation of critical reports or issues are missing when introducing this work and placing it into the development of methods for pan cancer screening. Below are concerns that need to addressed:

– The authors are citing reports on cancer vs healthy cfDNA fragment size distribution published after 2016 years. Most of the pre-omics era studies on distinguishing cancer to healthy individuals by fragment size distribution were based on the q-PCR analysis, while this is not mentioned. The first observation was made by Mouliere et al. (PlosOne, 2011) who studied by fractional size distribution with nested Q-PCR systems "the size distribution profile of ctDNA fragments in plasma" and indicated that "the size distribution profile of ctDNA fragments can be used to discriminate between healthy and cancer patients". This team later showed that mutant fragments are shorter than wild type counterparts (Mouliere et al.; Transl.Oncol, 2013), Lo's group later confirmed this observation by showing strong but somewhat different differences in regards to size ranges Jiang et al. (PNAS, 2015) or more recently in Sun Kun et al. (Genome Res, 2019).

– Renaud et al. work as well than those cited in the manuscript such as Cristiano et al., relies on Mouliere et al. (PlosOne, 2011) first observation. Note, Cristiano et al. cited this report as well as Sanchez et al. (npgGenomic Medicine, 2018), Jiang et al., and Underhill (Plos Gen. 2015). Finally, Sanchez et al. (JCI insight, 2021) refined Mouliere et al. (2011) observation by using sWGS from double and single stranded DNA library preparations. The authors should necessarily add these citations.

– The authors showed exactly the same illustrative fragment size profile in Figure 1 (normal and tumor signature) as those presented earlier by Sanchez et al. (JCI insight, 2021). Consequently, they should at least mention in a few sentences the observations made by Jiang et al. and by Sanchez et al. (JCI insight, 2021) in order to validate their assumption.

3. In light of the previous concerns, the title should be corrected, as the term discovery is improper in regards to the study. The signatures were previously made; the method used here, solely allowed to combine fragment length markers in an optimal setting while developing a technology. This is already important since this method appears very performant, and the reviewer is impressed. Discovery should be replaced by another term such as detection or assessment.

Reviewer #2 (Recommendations for the authors):

Specific comments for authors to address:

1) In the introduction, it would be better to describe a bit more details about tumor DNA length characteristics that were reported previously such as https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1006162, https://www.pnas.org/content/112/11/E1317, https://www.pnas.org/content/115/46/E10925, https://stm.sciencemag.org/content/10/466/eaat4921.abstract, therefore the readers would have a more comprehensive understanding of what is already known in the prior art and why the analysis of tumor-specific length profile in this study is important.

2) The authors should compare whether an NMF component would give a better correlation with tumor DNA estimated by ichorCNA, compared with the short DNA proportion (i.e.% of plasma DNA < 150 bp) in the overall size profile.

3) It seems a bit confused to the reviewer that the classification power between cases and controls improved as the number of components used in the NMF process. But the authors did not see the improvement in the correlation between tumor DNA fraction and the weight of one component reflecting the tumor contribution. Such observation should be discussed and clarified.

4) The AUC was calculated using repeated 10-fold cross-validation. It would be better to show the boxplot of AUC values for a given number of NMF components (Figure 3c).

5) It will be informative to estimate the minimum number of samples is required for performing an accurate NMF analysis.

Reviewer #3 (Recommendations for the authors):

1) The authors claim their model is directly measuring ctDNA load. Since an untargeted (unsu-pervised) model to assess the tumor fraction for monitoring purposes would be of great value, it would be interesting to see if changing levels of ctDNA can be captured by changing weights of the putative cancer Signature 1. Since their mCRPC cohort included patients with follow-up samples (at therapy start and progression) it is not clear why this has not been tested.

2) It should be mentioned that ichorCNA can only provide informative results at ctDNA frac-tion of 3% and higher. At higher tumor fractions the fragment signature signal might be rel-atively strong, explaining the good correlation, but what about samples with lower frac-tions? Please indicate the tumor fractions (median, min, max) assessed with ichorCNA and based on the VAFs of the identified mutations for all patients.

3) The authors claim that using NMF they were able to verify that the fragment length profile of the signature correlating with ctDNA levels does indeed match the length distribution of fragments containing mutations. However, although they included PBMCs in their mutation analysis the authors did not correct for clonal hematopoiesis. Recent studies showed that up to 20% of identified mutation are related to CH and that such mutations are also found in driver genes for solid tumors. Since there might be a size difference of mutated fragment originating from the tumor or PBMCs, respectively, this statement is overrated. In case their sequencing approach was not sensitive enough to detected CH-related variants this should at least be mentioned.

4) In Figure 2c, the shift towards smaller fragment is only seen for the dinucelosomes peak but not for the mononucleosomal fragments. How do the authors explain this (if not by contamination of CH-related variants)?

5) The authors demonstrate that NMF was more robust when less data is available than the ichorCNA algorithm, but they lacked to provide convincing data that NMF as a quantitative measure is more sensitive with respect to tumor fraction. To this end, in silico dilution ex-periments where they computationally mix healthy fragments and high ctDNA fraction sample(s) to a ctDNA fraction of their choice (with repetitions to achieve a more robust es-timation) should be performed.

6) Also, if there are no CNA or tumor fraction >3%, ichorCNA can literally not be used to de-termine ctDNA fraction, so the feasibility of the dilution approach to benchmark ichorCNA is not clear to me (in particular for early stage cases, in which most samples have tumor fractions >1%). Moreover, low depth can be overcome by additional sequencing whereas the tumor fraction cannot be enriched by more reads.

7) Using NMF the authors demonstrate that there are 2 major contributors to fragment length distribution and the addition of further signatures does not add anything. However, this statement is conflicting with results of the linear SVM classification case/control ap-proach (Figure 3c). A plateau of classification accuracy is reached when using around 30 signatures, which suggests this would be a good trade-off for number of signatures to reach high accuracy (AUC).

8) Figure 3d: Interestingly, open chromatin is associated with a higher fraction of dinucleosomal fragments. How do the authors explain this? Figure 3: I am not sure what the authors want to show here with Pearson's correlation coefficient, this does not look like a linear relationship.

9) The paragraph on "epigenetic fragment length" could be better presented. I would rephrase "epigenetic state". Both the DELFI ratio and the fragment length signature are surrogates for the chromatin state (open or closed). First the author talks about bin-wise ex-traction of two signatures. Why is it, that reducing bin size gives better classification results? ~50 kb bins seem to be best to predict ATACseq results. Is there literature that de-scribes open/closed chromatin in this way? Then they correlate the weights from that with ATACseq results. And finally, they use their bin-wise signature extraction results/weights as input for cancer/control classification. Could be split in two parts for clarity: one that talks about correlation with ATACseq results and one that discusses classification results

10) cfDNA fragment length profiles are particularly prone to pre-analytical (extraction, library prep) and analytical (clustering efficiency of various sequencings machines) factors. How do preanalyitcal/analytical factors affect the classification? The authors mention in the discussion that changes in the size profiles from their mCRPC cohort and the DELFI samples vary, which might be attributed due to such factors. Yet such confounding event will impact the robustness of such approaches and may prevent a widespread implementation of frag-mentomics. This should be further discussed.

11) How can the authors exclude that the above-mentioned differences are attributed to high-er/lower ctDNA fractions in the metastatic CRPC and the early stage DELFI samples? Even in Figure 2a it seems that samples with higher tumor fraction have more dinucelosomal frag-ments. Separate fragmentation profile plots for each stage should be shown; it would be interesting whether such profiles show any differences for stage I and stage II patients.

12) The authors could have checked in the DELFI cohort whether there is a (physiologic fragment length variation) between the seven cancer types? Would there always be the same two signatures coming up if the only consider one tumor type? Based on Supplementary Figure 4 this seems not to be case. In addition, there is a huge overlap of healthy individuals and cancer patients for signature 1? Please also present Supplementary Figure 4 with Signature 2. Can we really assume that this signature is the one related to cancer?

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "Unsupervised discovery of fragment length signatures of circulating tumor DNA using Non-negative Matrix Factorization" for further consideration by eLife. Your revised article has been evaluated by Detlef Weigel (Senior Editor) and a Reviewing Editor.

The manuscript has been improved but there are some remaining issues that need to be addressed, as outlined below:

Reviewer #1 (Recommendations for the authors):

Essential revisions:

1. The authors partially answered to the concerns but did not add a flow chart to ease reading. In addition, they did not precise the pre-analytics such as delay between blood draw and plasma isolation, storage, centrifugation steps and speeds,… This would validate quantitative data accuracy and may influence fragment size.

2. Please replace "discovery" with "detection" in the title.

3. The authors should describe in more detail how they concluded that the NMF model can be reliably estimated (as well on which criteria) using as few as 20 samples.

4. The authors should indicate whether they have used a spiking strategy to evaluate whether their model is directly measuring ctDNA load, or whether they use another ctDNA techniques to compare. Otherwise, their evaluation is not complete, and this should be indicated in the limitations of the study in the discussion.

5. The authors answered to this concern by showing adequate data but omitting to discuss them. This is of importance since this cutoff appears as a critical limitation of the model.

6. Explanations from authors are reasonable. However, this discrepancy as well as these explanations should be inserted in the text.

7. The authors weakly addressed this concern. In particular, one of the major criticism of the Cristiano work is the selection of the healthy individuals: Are they age matched? Are they fully healthy? Are their plasma absolutely examined as those of cancer individuals? Any difference about pre-analytics or analytics between cancer types? Please discuss. Finally, please be cautious about fragment size cancer signature specificity.

https://doi.org/10.7554/eLife.71569.sa1

Author response

Essential revisions:

1. More information on the source of cancer patient plasma, characteristics of the patients, especially in respect to treatments at time of blood sampling, should be included. A flow chart summarising the overall results might be useful.

The timing of blood samples has been clarified in the results and methods sections of the manuscript. In brief, most samples were collected either before starting first line treatment or following relapse after first line treatment with a small minority of samples collected following relapse from later treatments.

2. The authors can consider citing some of the older work analyzing cfDNA fragment size distribution using PCR-based approaches.

We thank the referee for pointing out the lack of representation of the PCR-based studies of the cfDNA fragment length distribution in our introduction. We have now updated the introduction section to also cover four PCR-based studies.

3. The authors showed exactly the same illustrative fragment size profile in Figure 1 (normal and tumor signature) as those presented earlier by Sanchez et al. (JCI insight, 2021). Consequently, they should at least mention in a few sentences the observations made by Jiang et al. and by Sanchez et al. (JCI insight, 2021) in order to validate their assumption.

Again, we thank the referee for pointing out the missing references to two sequencing-based studies of cfDNA fragment length signals in cancer. We have updated the introduction and Results sections with references to both Jiang et al., 2016 and Sanchez et al., 2021.

4. As one could argue that the signatures of cfDNA fragment length had been discovered through various previously published work, in light of the previous concerns, term 'discovery' is perhaps inappropriate. Hence, we would suggest replacing the word 'discovery' by another term such as 'detection' or 'assessment'.

It was not our intention to indicate that we were the first to “discover” fragment length signatures of circulating tumor DNA. Instead, we wished to indicate that our method was able to discover tumor signature(s) without using e.g. mutations or xenografts for identifying tumor-derived cfDNA fragments. We have updated the title, introduction and Discussion sections to better reflect this.

5. In the introduction, it would be better to describe a bit more details about tumor DNA length characteristics that were reported previously, so that readers would have a more comprehensive understanding of what is already known in the prior art and why the analysis of tumor-specific length profile in this study is important.

We agree with the referee that prior art was insufficiently represented in the previous submission. We have included several additional references to both PCR and sequencing based studies of the cfDNA fragment length distribution in cancer. Finally, we have now included a more thorough motivation for our NMF model as an important contribution to the field. In brief, previous studies have relied either on xenografts or detailed knowledge about mutations e.g. from targeted sequencing to accurately determine the tumor fragment length distribution. NMF enables us to do this in a completely unsupervised manner and hence independently of mutational or clinical information.

6. The authors should compare whether an NMF component would give a better correlation with tumor DNA estimated by ichorCNA, compared with the short DNA proportion (i.e.% of plasma DNA < 150 bp) in the overall size profile.

We tested using the ratio of short-to-long fragments (100-150bp vs. 151-220bp) proposed by Cristiano et al. This yielded a correlation with the VAF based estimates of ctDNA fraction of 0.68 on the shallow WGS dataset and hence significantly lower than the NMF based estimate (r=0.75). We have added a comment to the main text and a supplementary figure (Figure 2 —figure supplement 3).

7. It seems somewhat intriguing that the classification power between cases and controls improved as the number of components used in the NMF process. However, the authors did not see an improvement in the correlation between tumor DNA fraction and the weight of one component reflecting the tumor contribution. These observations should be discussed and clarified in a revised manuscript.

On the mCRPC data, using more than two components does in fact improve the correlation – albeit only slightly (two components: 0.75, three components: 0.78, four components: 0.79; Figure 2 —figure supplement 2). Using more than two components involves a supervised element as we need to find out which components are tumor associated and sum up their contributions. In order to avoid this, and also to have a more interpretable model, we settled on the model with two components. On the DELFI data, two components were clearly insufficient. We attribute this to greater diversity in the DELFI data due to the many different cancer types included as well as potential technical differences. We have clarified this in both the results and Discussion sections of the manuscript.

8. The AUC was calculated using repeated 10-fold cross-validation. It would be better to show the boxplot of AUC values for a given number of NMF components (Figure 3c).

We now show boxplots to reveal the variation in AUC values between repeats.

9. It will be informative to estimate the minimum number of samples is required for performing an accurate NMF analysis.

We have now included tests of the robustness of our method to the number of samples. Across a range of dataset sizes, we repeatedly sampled a dataset with the appropriate number of samples and trained an NMF model. For each trained model, we then compared the estimated signatures with those estimated using the full datasets. Using this approach, we found that the NMF model can be reliably estimated using as little as 20 samples. These results have been incorporated in the Results section and a new supplementary figure (Figure 2 —figure supplement 5).

10. The authors claim their model is directly measuring ctDNA load. Since an untargeted (unsupervised) model to assess the tumor fraction for monitoring purposes would be of great value, it would be interesting to see if changing levels of ctDNA can be captured by changing weights of the putative cancer Signature 1. Since their mCRPC cohort included patients with follow-up samples (at therapy start and progression) it is not clear why this has not been tested.

We thank the referee for suggesting to use serially sampled data to assess our method. Unfortunately, all the samples from the mCRPC cohort were collected either at baseline (i.e. before treatment) or upon progression with no samples collected at response. We thus do not have a clear a priori expectation of the direction of change of the ctDNA fraction across time and hence cannot use serial information to evaluate our method. Serially collected samples are also available in the DELFI cohort. A small subset of the subjects in the DELFI cohort have serial samples that were also examined using targeted sequencing in a previous publication. We have now added a supplementary figure that compares the prediction probabilities obtained by our classification method using 30 fragment length signatures to the maximum driver mutation allele fraction observed in the targeted sequencing (Figure 3 —figure supplement 4).

11. It should be mentioned that ichorCNA can only provide informative results at ctDNA fraction of 3% and higher. At higher tumor fractions the fragment signature signal might be relatively strong, explaining the good correlation, but what about samples with lower fractions? Please indicate the tumor fractions (median, min, max) assessed with ichorCNA and based on the VAFs of the identified mutations for all patients.

We have added the requested information about ichorCNA and VAF-based ctDNA% estimates in the mCRPC dataset both as summary statistics (median, min, max) in the Results section and as violin plots in the supplement (Figure 2 —figure supplement 1).

12. The authors claim that using NMF they were able to verify that the fragment length profile of the signature correlating with ctDNA levels does indeed match the length distribution of fragments containing mutations. However, although they included PBMCs in their mutation analysis the authors did not correct for clonal hematopoiesis (CH). Recent studies showed that up to 20% of identified mutation are related to CH and that such mutations are also found in driver genes for solid tumors. Since there might be a size difference of mutated fragment originating from the tumor or PBMCs, respectively, this statement is overrated. In case their sequencing approach was not sensitive enough to detected CH-related variants this should at least be mentioned.

We do make sure that the ctDNA estimates from targeted sequencing are not biased by clonal hematopoiesis. During variant calling, we compare the cfDNA sequence data with sequence data from PBMCs. And after variant calling, we manually inspect all the driver mutations that are used to estimate ctDNA fraction and make sure that no alternative reads are seen in the PBMC data. The PBMC sequence data has a mean coverage of 306X (range: 253-728X). We have now added the mean PBMC sequence depth to the methods section and have specified that we check that no alternative reads are seen in the germline during the manual inspection of the driver variants.

13. In Figure 2c, the shift towards smaller fragment is only seen for the dinucelosomes peak but not for the mononucleosomal fragments. How do the authors explain this (if not by contamination of CH-related variants)?

The reviewers are correct that the shift towards lower fragment sizes is more pronounced for the nucleosomal peak but we do also see a small shift in the mononucleosome peak. This plot is only based on reads that overlap known driver mutations and is therefore based on a small number of fragments from a few positions in the genome, which explains the differences between the global fragment length distributions and the distributions seen in this plot. As explained in the response above, we are quite certain that the mutations we look at are not CH-related variants. A possible explanation for the differences in the fragment length distributions is that the global distribution could reflect that the ctDNA fragments tend to come from other parts of the genome than the non-cancer cfDNA fragments but in Figure 2c, we restrict the fragments to come from the same positions.

14. The authors demonstrated that NMF was more robust when less data was available than the ichorCNA algorithm, but they lacked to provide convincing data that NMF as a quantitative measure is more sensitive with respect to tumor fraction. To this end, in silico dilution experiments where they computationally mixed healthy fragments and high ctDNA fraction sample(s) to a ctDNA fraction of their choice (with repetitions to achieve a more robust es-timation) should be performed.

We agree with the referee that in silico dilution experiments would be informative regarding the robustness of NMF versus other methods at different ctDNA levels. Unfortunately, we do not have any healthy controls available for the mCRPC cohort and hence cannot perform the requested experiments. We have added a comment about this limitation to the discussion and have also made it clear that more experiments will be needed in order to determine which method, NMF or ichorCNA, is most sensitive.

15. If there are no CNA or tumor fraction >3%, ichorCNA can literally not be used to determine ctDNA fraction, so the feasibility of the dilution approach to benchmark ichorCNA is not clear to the reviewer (in particular for early-stage cases, in which most samples have tumor fractions >1%). Moreover, low depth can be overcome by additional sequencing whereas the tumor fraction cannot be enriched by more reads.

As the vast majority of samples have ctDNA levels above 3% as judged using driver VAFs and as our subsampling experiments do not change this, we believe our conclusions still stand. We agree with the referee that additional sequencing can overcome low depth but not low tumor fraction. A more ideal experiment would be to instead lower the ctDNA fractions by in silico dilution as also suggested by a referee in comment #14. Unfortunately, no healthy controls were available in the mCRPC cohort and hence we had instead to rely on subsampling experiments to prove the robustness of methods.

16. Using NMF the authors demonstrate that there are 2 major contributors to fragment length distribution and the addition of further signatures does not add anything. However, this statement is conflicting with results of the linear SVM classification case/control approach (Figure 3c). A plateau of classification accuracy is reached when using around 30 signatures, which suggests this would be a good trade-off for number of signatures to reach high accuracy (AUC).

This questions was also raised by another referee (comment #7) and we agree that our presentation of results regarding the optimal number of NMF components was not sufficiently clear in the previous version of the manuscript. This has now been clarified in the manuscript. In brief, using more than two components on the mCRPC data did not yield significant improvement in ctDNA correlation, whereas the gain was much bigger on the DELFI data. We speculate that this is due to higher heterogeneity (more cancer types, individuals) in the DELFI data.

17. Figure 3d: Interestingly, open chromatin is associated with a higher fraction of dinucleosomal fragments. How do the authors explain this? Figure 3: the reviewer is not sure what the authors want to show here with Pearson's correlation coefficient, this does not look like a linear relationship.

We have surveyed the literature to see if previous articles have mentioned that open chromatin is associated with a higher fraction of dinucleosomal fragments but have not found any previous reports of this. We agree that it is an interesting observation but unfortunately we do not have a good explanation for why it should be the case.

We agree that Figure 3e did not show a linear relationship and have now changed the figure and the text so that we use Spearman's rank correlation coefficient instead of Pearson’s correlation coefficient.

18. The paragraph on "epigenetic fragment length" could be better presented. I would re-phrase "epigenetic state". Both the DELFI ratio and the fragment length signature are surrogates for the chromatin state (open or closed). First the author talks about bin-wise extraction of two signatures. Why is it, that reducing bin size gives better classification results? ~50 kb bins seem to be best to predict ATACseq results. Is there literature that describes open/closed chromatin in this way? Then they correlate the weights from that with ATACseq results. And finally, they use their bin-wise signature extraction results/weights as input for cancer/control classification. Could be split in two parts for clarity: one that talks about correlation with ATACseq results and one that discusses classification results.

We appreciate the suggestions for improving the section and have tried to do so. Instead of writing about “epigenetic state” we now use “chromatin state” and we have split the section in two as suggested. With regards to why a smaller bin size leads to better classification results we can see that the ATACseq values do fluctuate to some degree from bin to bin even with the smaller bins sizes. So if the classification results are achieved through prediction of chromatin state then it makes sense that the smaller bins that better capture the small range changes in chromatin state would be best. We are not aware of any existing literature besides the 2019 DELFI article that deals with this issue.

19. cfDNA fragment length profiles are particularly prone to pre-analytical (extraction, library prep) and analytical (clustering efficiency of various sequencings machines) factors. How do preanalyitcal/analytical factors affect the classification? The authors mention in the discussion that changes in the size profiles from their mCRPC cohort and the DELFI samples vary, which might be attributed due to such factors. Yet such confounding event will impact the robustness of such approaches and may prevent a widespread implementation of fragmentomics. This should be further discussed.

We agree with the referee that technical factors may significantly impact the fragment length distribution and hence make it hard for fragment length based models to generalize to new data with different technical characteristics. We have now added an additional statement regarding this limitation to the discussion.

20. How can the authors exclude that the above-mentioned differences are attributed to higher/lower ctDNA fractions in the metastatic CRPC and the early stage DELFI samples? Even in Figure 2a it seems that samples with higher tumor fraction have more dinucleosomal fragments. Separate fragmentation profile plots for each stage should be shown; it would be interesting whether such profiles show any differences for stage I and stage II patients.

Our statements regarding potential technical differences between mCRPC and DELFI were based on comparing the estimated NMF signatures (e.g. Figure 2b vs Figure 3b), which – hopefully – eliminates the effect of different ctDNA levels. That said, it cannot be ruled out that the large differences in ctDNA levels between mCRPC and DELFI affects the signature estimates or that other, biological differences (e.g. stage effects) contribute. We have now clarified this in the manuscript and included a “spaghetti plot” showing fragment length distribution for the different stages in the DELFI data (Figure 3 —figure supplement 1). The stage plot confirms differences across stages, albeit subtle, which likely reflect increased ctDNA% at higher stages.

21. The authors could have checked in the DELFI cohort whether there is a (physiologic fragment length variation) between the seven cancers types? Would there always be the same two signatures coming up if the only consider one tumor type? Based on Supplementary Figure 4 this seems not to be case. In addition, there is a huge overlap of healthy individuals and cancer patients for signature 1? Please also present Supplementary Figure 4 with Signature 2. Can we really assume that this signature is the one related to cancer?

To address this, we reran NMF separately for each cancer type. The AUC was highest for colorectal cancer patients (0.93) and lowest for gastric cancer (0.56). Full results are shown in Figure 3 —figure supplement 3a. Furthermore, we sought to determine whether the inferred signatures differ from one cancer type to the next. The different inferred signatures were plotted (Figure 3 —figure supplement 3b). Apart from gastric cancer, which happens to be the one with the lowest AUC, most fragment length signatures do not seem to change depending on the cancer type.

22. Abbreviations should be defined at the beginning for better readability.

Done.

23. Resolution of figures need to be improved. Comprehensive legends should also be included for Supplementary figures.

We have now uploaded all figures as good quality pdf figures instead of just the low-quality inlined figures. Furthermore, we have improved the appearance of the supplementary figures and provided them with comprehensive legends.

24. Matching colors for better readability are recommended.

We have now tried to use the same color scheme in all figures.

25. Specify average read pairs for cfDNA samples and corresponding PBMCs.

We have now added a summary of the number of reads pairs for all samples sequencing in the mCRPC cohort.

26. Excel Supplementary file: Supplementary table 2 should be mentioned in the manuscript.

We thank the referee for pointing out the omission of these references. Supplementary tables 1 and 2 contain cfDNA fragment length counts for all samples for sWGS and targeted sequencing, respectively. Furthermore, VAF- and CNV-based estimates of the ctDNA% are included in both tables. References to the two tables have now been included in the manuscript.

27. ctDNA (%) on the x-axis in Figures 2 is calculated from targeted mutation analysis, right? How was est.ctDNA % in the Supplementary figures 2&3 calculated?

We apologize for not making this sufficiently clear. All ctDNA% estimates used for validation were based on VAFs from targeted mutation analyses unless it is explicitly specified that it is based on CNVs / ichorCNA. We have now clarified this in all figures.

28. What is the original h matrix in NMF in Figure 2f? There is a normal h matrix described as weight matrix, but I haven't seen the definition of what with "original" h matrix is meant.

“Original h matrix” referred to using signatures estimated on the full dataset to estimate the weights on the subsampled data. We agree with the referee that this was not clear in the previous version of the manuscript. We have now removed that line from Figure 2f in order to improve readability of the subsampling analysis.

29. Assumptions for linear correlation using Pearson are not mentioned, I don't know whether they are met at all. Looking at the figure it does not seem like they are met. Normality of the variables should at least be checked.

We thank the reviewers for bringing this omission to our attention. Since the results in Figure 3e do not show a linear relationship we have now changed the figure and the text so that we use Spearman's rank correlation coefficient instead of Pearson’s correlation coefficient.

30. How is NMF performing on previously unseen data? This part is missing. The authors could have checked for overfitting in this way for example calculating the reconstruction error on a test dataset.

To address this, we modified our subsampling analysis to repeatedly divide the data into two, a training and testing set. The average correlation of the weights estimated on the training set with ctDNA% was 0.75, whereas the average correlation on the testing set was 0.74 thus indicating that the NMF is robust to novel data (Figure 2 —figure supplement 6).

31. It is not clear whether cross validation was used only for the classification task or also to evaluate the NMF decomposition. Are those signatures representative?

We do not retrain the NMF components in each fold in the cross validation. The results mentioned in the response to question 9 and shown in Figure 2 —figure supplement 5 show that signatures inferred from 90% of the data are indistinguishable from signatures inferred from the whole data set.

32. It would be nice to see the position of the excluded bins based on +- 2* IQR. They already excluded the ENCODE blacklist regions. Other outliers may be real biological signal especially in some type of cancer where copy number alteration play an important role.

We exclude bins that are outliers with regards to depth in controls. It is the same bins that are removed in all samples and we do not look at the case samples when we determine which bins to remove. So biological signals from copy number alterations in the cancer samples do not affect which bins are removed.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

The manuscript has been improved but there are some remaining issues that need to be addressed, as outlined below:

Reviewer #1 (Recommendations for the authors):

Essential revisions:

1. The authors partially answered to the concerns but did not add a flow chart to ease reading. In addition, they did not precise the pre-analytics such as delay between blood draw and plasma isolation, storage, centrifugation steps and speeds,… This would validate quantitative data accuracy and may influence fragment size.

We agree with the referee that the technical description of cfDNA processing was not completely clear in the previous version of the manuscript and we have now added extra details regarding this to the methods section. The description in the “Sample processing and DNA extraction” subsection now includes all the requested details about the pre-analytics. Besides the main overview in Figure 1 we have now added a flow chart giving an overview of the different analyses and experiments presented in Figure 2 and Figure 3 as supplementary figures to Figure 2 and Figure 3 respectively.

2. Please replace "discovery" with "detection" in the title.

The title has been changed as requested.

3. The authors should describe in more detail how they concluded that the NMF model can be reliably estimated (as well on which criteria) using as few as 20 samples.

We agree with the referee that the subsampling procedure and criteria used for assessing reliability was not clear enough in the previous version of the manuscript. We have now added more thorough descriptions to both the results and methods sections. The methods section now includes a subsection called “Subsampling experiments” containing the details about the robustness experiments presented in Figure 2f, Figure 2 supp. 5 and Figure 2 supp.6.

4. The authors should indicate whether they have used a spiking strategy to evaluate whether their model is directly measuring ctDNA load, or whether they use another ctDNA techniques to compare. Otherwise, their evaluation is not complete, and this should be indicated in the limitations of the study in the discussion.

We agree with the referee that using spike-ins could have been a valuable add on to the present study, however this was not done due to lack of healthy controls for the mCRPC cohort. We have clarified this in the Discussion section of the manuscript. Please see the response to questions #11 for more details.

5. The authors answered to this concern by showing adequate data but omitting to discuss them. This is of importance since this cutoff appears as a critical limitation of the model.

We have added the following text to the discussion acknowledging the limitations of the current study:

“However, the lack of low ctDNA samples is a limitation of the current study and further experiments either based on samples with lower tumor burdens, spike-ins or in silico dilution are needed to confirm this assertion as it could not be directly tested with the data available in this study. A lack of healthy controls for the mCRPC cohort data means that we were not able to perform spike-in or in silico dilution experiments in this article”

6. Explanations from authors are reasonable. However, this discrepancy as well as these explanations should be inserted in the text.

We believe that the text added in response to the previous question covers these concerns as well.

7. The authors weakly addressed this concern. In particular, one of the major criticism of the Cristiano work is the selection of the healthy individuals: Are they age matched? Are they fully healthy? Are their plasma absolutely examined as those of cancer individuals? Any difference about pre-analytics or analytics between cancer types? Please discuss. Finally, please be cautious about fragment size cancer signature specificity.

We agree with the referee that technical or biological factors may have affected our results. We have now added the statement below to the discussion:

“A number of technical and biological factors in the DELFI study design may have inflated our classification performance. For instance, cases were generally older than controls and hence the observed differences in the fragment length distribution between cases and controls may partially reflect underlying comorbidities associated with higher age.”

We are unaware of any specific differences in how samples from cases and controls were processed in the DELFI study and hence feel unable to comment further on this in the manuscript.

https://doi.org/10.7554/eLife.71569.sa2

Article and author information

Author details

  1. Gabriel Renaud

    1. Department of Health Technology, Section of Bioinformatics, Technical University of Denmark, Kongens Lyngby, Denmark
    2. Department of Molecular Medicine, Aarhus University, Aarhus, Denmark
    3. Department of Clinical Medicine, Aarhus University, Aarhus, Denmark
    Contribution
    Conceptualization, Software, Formal analysis, Supervision, Funding acquisition, Investigation, Visualization, Methodology, Writing – original draft, Project administration, Writing – review and editing
    Competing interests
    No competing interests declared
  2. Maibritt Nørgaard

    1. Department of Molecular Medicine, Aarhus University, Aarhus, Denmark
    2. Department of Clinical Medicine, Aarhus University, Aarhus, Denmark
    Contribution
    Data curation, Software, Formal analysis, Investigation, Visualization, Writing – original draft, Writing – review and editing
    Competing interests
    No competing interests declared
  3. Johan Lindberg

    Department of Medical Epidemiology and Biostatistics, Karolinska Institute, Stockholm, Sweden
    Contribution
    Resources, Data curation, Formal analysis, Investigation, Methodology, Writing – review and editing
    Competing interests
    No competing interests declared
  4. Henrik Grönberg

    Department of Medical Epidemiology and Biostatistics, Karolinska Institute, Stockholm, Sweden
    Contribution
    Resources, Methodology
    Competing interests
    No competing interests declared
  5. Bram De Laere

    1. Department of Medical Epidemiology and Biostatistics, Karolinska Institute, Stockholm, Sweden
    2. Cancer Research Institute Gent (CRIG), Ghent University, Ghent, Belgium
    3. Department of Human Structure and Repair, Ghent University, Ghent, Belgium
    Contribution
    Resources, Methodology, Writing – review and editing
    Competing interests
    No competing interests declared
  6. Jørgen Bjerggaard Jensen

    Department of Urology, Regional Hospital of West Jutland, Holstebro, Denmark
    Contribution
    Resources, Methodology, Writing – review and editing
    Competing interests
    No competing interests declared
  7. Michael Borre

    Department of Urology, Aarhus University Hospital, Aarhus, Denmark
    Contribution
    Resources
    Competing interests
    No competing interests declared
  8. Claus Lindbjerg Andersen

    1. Department of Molecular Medicine, Aarhus University, Aarhus, Denmark
    2. Department of Clinical Medicine, Aarhus University, Aarhus, Denmark
    Contribution
    Resources, Writing – review and editing
    Competing interests
    No competing interests declared
  9. Karina Dalsgaard Sørensen

    1. Department of Molecular Medicine, Aarhus University, Aarhus, Denmark
    2. Department of Clinical Medicine, Aarhus University, Aarhus, Denmark
    Contribution
    Supervision, Funding acquisition, Methodology, Project administration, Writing – review and editing
    Competing interests
    No competing interests declared
  10. Lasse Maretty

    1. Department of Molecular Medicine, Aarhus University, Aarhus, Denmark
    2. Department of Clinical Medicine, Aarhus University, Aarhus, Denmark
    3. Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark
    Contribution
    Conceptualization, Formal analysis, Supervision, Funding acquisition, Investigation, Visualization, Methodology, Writing – original draft, Project administration, Writing – review and editing
    For correspondence
    lasse.maretty@clin.au.dk
    Competing interests
    No competing interests declared
  11. Søren Besenbacher

    1. Department of Molecular Medicine, Aarhus University, Aarhus, Denmark
    2. Department of Clinical Medicine, Aarhus University, Aarhus, Denmark
    3. Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark
    Contribution
    Conceptualization, Formal analysis, Supervision, Funding acquisition, Investigation, Visualization, Methodology, Writing – original draft, Project administration, Writing – review and editing
    For correspondence
    besenbacher@clin.au.dk
    Competing interests
    No competing interests declared
    ORCID icon "This ORCID iD identifies the author of this article:" 0000-0003-1455-1738

Funding

Independent Research Fund Denmark (Sapere Aude Research Leader)

  • Søren Besenbacher

Danish Cancer Society

  • Karina Dalsgaard Sørensen

The Central Denmark Region Health Fund

  • Karina Dalsgaard Sørensen

Aarhus Universitet (Graduate School of Health)

  • Maibritt Nørgaard

Direktør Emil C. Hertz og Hustru Inger Hertz Fond

  • Karina Dalsgaard Sørensen

KV Fonden

  • Karina Dalsgaard Sørensen

Raimond og Dagmar Ringgård-Bohns Fond

  • Karina Dalsgaard Sørensen

Beckett-Fonden

  • Karina Dalsgaard Sørensen

Snedkermester Sophus Jacobsen og Hustru Astrid Jacobsens Fond

  • Karina Dalsgaard Sørensen

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Acknowledgements

The authors wish to thank the staff at the departments of urology at Aarhus University Hospital and Regional Hospital West Jutland for patient recruitment and collection of clinical data. We would also like to thank lab technicians and clinical academics at the Department of Molecular Medicine (AUH, Denmark) and the Department of Medical Epidemiology and Biostatistics (Karolinska Institutet, Sweden) for excellent assistance throughout the project.

Ethics

Human subjects: The prostate study was approved by The National Committee on Health Research Ethics (#1901101) and notified to The Danish Data Protection Agency (#1-16-02-366-15). All patients provided written informed consent.

Senior and Reviewing Editor

  1. Detlef Weigel, Max Planck Institute for Biology Tübingen, Germany

Reviewer

  1. Alain Thierry

Publication history

  1. Preprint posted: June 10, 2021 (view preprint)
  2. Received: June 23, 2021
  3. Accepted: July 27, 2022
  4. Accepted Manuscript published: July 27, 2022 (version 1)
  5. Version of Record published: August 9, 2022 (version 2)

Copyright

© 2022, Renaud et al.

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Metrics

  • 821
    Page views
  • 317
    Downloads
  • 0
    Citations

Article citation count generated by polling the highest count across the following sources: Crossref, PubMed Central, Scopus.

Download links

A two-part list of links to download the article, or parts of the article, in various formats.

Downloads (link to download the article as PDF)

Open citations (links to open the citations from this article in various online reference manager services)

Cite this article (links to download the citations from this article in formats compatible with various reference manager tools)

  1. Gabriel Renaud
  2. Maibritt Nørgaard
  3. Johan Lindberg
  4. Henrik Grönberg
  5. Bram De Laere
  6. Jørgen Bjerggaard Jensen
  7. Michael Borre
  8. Claus Lindbjerg Andersen
  9. Karina Dalsgaard Sørensen
  10. Lasse Maretty
  11. Søren Besenbacher
(2022)
Unsupervised detection of fragment length signatures of circulating tumor DNA using non-negative matrix factorization
eLife 11:e71569.
https://doi.org/10.7554/eLife.71569

Further reading

    1. Cancer Biology
    James Boot, Gabriel Rosser ... Silvia Marino
    Research Article

    We describe a subset of glioblastoma, the most prevalent malignant adult brain tumour, harbouring a bias towards hypomethylation at defined differentially methylated regions. This epigenetic signature correlates with an enrichment for an astrocytic gene signature, which together with the identification of enriched predicted binding sites of transcription factors known to cause demethylation and to be involved in astrocytic/glial lineage specification, point to a shared ontogeny between these glioblastomas and astroglial progenitors. At functional level, increased invasiveness, at least in part mediated by SRPX2, and macrophage infiltration characterise this subset of glioblastoma.

    1. Biochemistry and Chemical Biology
    2. Cancer Biology
    Jacob M Winter, Heidi L Fresenius ... Jared Rutter
    Research Article

    The tumor suppressor gene PTEN is the second most commonly deleted gene in cancer. Such deletions often include portions of the chromosome 10q23 locus beyond the bounds of PTEN itself, which frequently disrupts adjacent genes. Coincidental loss of PTEN-adjacent genes might impose vulnerabilities that could either affect patient outcome basally or be exploited therapeutically. Here we describe how the loss of ATAD1, which is adjacent to and frequently co-deleted with PTEN, predisposes cancer cells to apoptosis triggered by proteasome dysfunction and correlates with improved survival in cancer patients. ATAD1 directly and specifically extracts the pro-apoptotic protein BIM from mitochondria to inactivate it. Cultured cells and mouse xenografts lacking ATAD1 are hypersensitive to clinically used proteasome inhibitors, which activate BIM and trigger apoptosis. This work furthers our understanding of mitochondrial protein homeostasis and could lead to new therapeutic options for the hundreds of thousands of cancer patients who have tumors with chromosome 10q23 deletion.