Alleviating cell-free DNA sequencing biases with optimal transport

  1. Dynamical Systems, Signal Processing and Data Analytics (STADIUS), KU Leuven, Leuven, Belgium
  2. Laboratory for Cytogenetics and Genome Research, Department of Human Genetics, KU Leuven, Leuven, Belgium
  3. Department of Gynaecology and Obstetrics, University Hospitals Leuven, Leuven, Belgium
  4. Division of Gynaecological Oncology, Leuven Cancer Institute, KU Leuven, Leuven, Belgium
  5. Department of Oncology, Laboratory of Tumor Immunology and Immunotherapy, Leuven Cancer Institute, Leuven, Belgium
  6. Center for Cancer Biology, VIB, Leuven, Belgium
  7. Laboratory for Translational Genetics, Department of Human Genetics, KU Leuven, Leuven, Belgium

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.


Editors

  • Reviewing Editor
    Yongliang Yang
    Dalian University of Technology, Dalian, China
  • Senior Editor
    Aleksandra Walczak
    École Normale Supérieure - PSL, Paris, France

Reviewer #1 (Public Review):

Summary:

The authors applied a domain adaptation method using the principle of optimal transport (OT) to superimpose read count data onto each other. While the title suggests that the presented method is independent of and performs better than other methods of bias correction, the presented work uses a self-implemented version of GC bias correction in addition to the OT domain adaptation. Performance comparisons were done both on normalized read counts and on copy number profiles, which together constitute the complete set of presented use cases. Results involving copy number profiles from ichorCNA were also subjected to the bias correction measures implemented in that tool. At many points it is not clear which correction method actually causes the observed performance.

Strengths:

The quality of superimposing distributions of normalized read counts (and copy number profiles) was sufficiently shown using p-values uniformly distributed in the interval from 0 to 1 for the healthy controls of D7 and D8, which differed in the choice of library preparation kit.

The ability to select a sample from the source domain for samples in the target domain was demonstrated.

Weaknesses:

Experiment Design:

The chosen bias correction methods are not explicitly designed for nor aimed at domain adaptation. Benchmarking against GC bias correction while performing GC bias correction as part of the OT procedure is probably the most striking flaw of the entire work. GC bias correction has the purpose of correcting GC biases, wherever present, NOT of correcting categorical pre-analytical variables of undefined character. A more thorough examination of the presented results should address why plain ichorCNA is the best-performing "domain adaptation" in some cases. Also, the extent to which the implemented GC bias correction contributes to the performance increase independently of the OT procedure should be assessed separately in each case (see the sketch below).
Moreover, center-and-scale standardization is probably not the most relevant competitor among available domain adaptation methods.
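For concreteness, a minimal sketch of the kind of per-bin LOESS GC correction that is standard for sWGS read counts; this is only an assumption about what the authors implemented, shown to illustrate the baseline under discussion. Running the full pipeline with and without this single step, outside of OT, would isolate its contribution.

```python
# Sketch of a common LOESS-based GC bias correction for per-bin read counts.
# This is an assumed stand-in for the authors' self-implemented correction.
import numpy as np
import statsmodels.api as sm

def loess_gc_correct(counts, gc, frac=0.3):
    """counts, gc: one value per 1 Mb bin; returns GC-corrected counts."""
    expected = sm.nonparametric.lowess(counts, gc, frac=frac, return_sorted=False)
    expected = np.clip(expected, np.finfo(float).eps, None)
    return counts / expected * np.median(counts)
```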

Comparison of cohorts (domains), especially the healthy controls from D7 and D8: it is not described which type of ChIP analysis was done for the healthy controls of the D7 domain. The utilized library preparation kit implies that D7 represents a subset of the available cfDNA in a plasma sample, obtained by precipitating only those cfDNA fragments to which an undisclosed type of protein was bound. Even if the type of protein turns out to be histones, the extracted subset of cfDNA should not be regarded as coming from the same distribution of cfDNAs. For example, fragments of sub-mononucleosomal length would be depleted in the ChIP-seq data set, while these could be extracted in an untargeted cfDNA sequencing data set (a sketch of how this could be checked follows below). It needs to be clarified why the authors deem the D7 and D8 healthy controls to be identical with regard to SCNA analysis. The protein targets of the D7 ChIP-seq samples would be the best starting point.
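A sketch (using pysam; the file names are hypothetical) of how the suspected fragment-length depletion could be checked by comparing insert-size histograms between a D7 ChIP-seq library and a D8 untargeted cfDNA library:

```python
# Compare fragment-length distributions; ChIP-derived libraries would be
# expected to deplete sub-mononucleosomal (< ~147 bp) fragments.
import numpy as np
import pysam

def fragment_length_hist(bam_path, max_len=500):
    hist = np.zeros(max_len + 1, dtype=float)
    with pysam.AlignmentFile(bam_path) as bam:
        for read in bam:
            if read.is_proper_pair and read.is_read1 and not read.is_duplicate:
                tlen = abs(read.template_length)
                if 0 < tlen <= max_len:
                    hist[tlen] += 1
    return hist / hist.sum()

# h7 = fragment_length_hist("d7_chip.bam")   # hypothetical paths
# h8 = fragment_length_hist("d8_cfdna.bam")
# print(h7[:147].sum(), h8[:147].sum())      # sub-mononucleosomal fractions
```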

From the Illumina TruSeq ChIP product description page:
"TruSeq ChIP Libary Preparation Kits provide a simple, cost-effective solution for generating chromatin immunoprecipitation sequencing (ChIP-Seq) libraries from ChIP-derived DNA. ChIP-seq leverages next-generation sequencing (NGS) to quickly and efficiently determine the distribution and abundance of DNA-bound protein targets of interest across the genome."

Redundancy:

Some passages of the results and discussion sections reappear in the methods. The description of the methodology should be concentrated in the methods section and only reiterated in a summarizing fashion where absolutely necessary.
Unnecessary repetition inflates the presented work, which is not appealing to the reader. Rather, include more details of the utilized materials and methods in the corresponding section.

Transparency:

At the time of review, the code was not available under the provided link.
Part of the healthy controls from D8 is not contained under the provided accession (367 healthy samples are available in the database vs. a sum of 499 healthy controls across D7 and D8).

Neither the paper nor reference 4 explains what was targeted with the ChIP-seq approach.

Consistency:

It is not evident why a ChIP-seq library prep kit was used (sample cohorts designated as D7). The DNA isolation procedure was not presented as having an immunoprecipitation step. Furthermore, it is not clear which DNA-bound proteins were targeted during ChIP-seq, if such an immunoprecipitation was actually carried out.

The authors self-implemented a GC bias correction procedure although they already mentioned other procedures earlier, like LIQUORICE. There also exist tools that can be used to correct GC bias, like deepTools (github.com/deeptools/deepTools). Other GC bias correction algorithms designed specifically for cfDNA would be Griffin (github.com/adoebley/Griffin) and GCparagon (github.com/BGSpiegl/GCparagon). When benchmarking against state-of-the-art cfDNA GC bias correction, these algorithms should appear somewhere other than the introduction, preferably in the results section. It should be shown that the chosen GC bias correction method performs best under the given circumstances.

Accuracy:

Use clear labels for each group of samples. The domain number is not sufficient to effectively distinguish sample groups; even the source name plus a simple enumeration would improve clarity at some points.

The healthy controls of D7 and D8 are described, but the numbers do not add up (257 healthy controls in line 227 vs. 260 healthy controls in line 389). Please double-check this and use representative sample cohort labels in the materials description for improved clarity!

Avoid statements like "the rest" when talking about a mixed set of samples; it is not clear how many samples from which domain are being addressed.

For optimal transport, knowledge about the destination is required ("where do I want to transport to?") and thus the proposed method can never be unsupervised. It is always necessary to know the labels of both the source and target domains (see the sketch below). In practice this is often not the case, and users might fall prey to the error of superimposing data that is actually separated by valid differences in some experimental variables.
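To make this point concrete, a minimal sketch (using the POT library; matrix names are placeholders) of entropic OT between per-bin count matrices, showing that the target-domain samples Xt must be chosen and supplied explicitly:

```python
# The transport plan cannot be computed without the target domain Xt, i.e.,
# the user must already know which cohort the data should be mapped onto.
import numpy as np
import ot  # POT: Python Optimal Transport

def transport_to_target(Xs, Xt, reg=0.1):
    """Map source samples Xs onto target domain Xt via a Sinkhorn plan."""
    a = np.full(len(Xs), 1.0 / len(Xs))  # uniform source weights
    b = np.full(len(Xt), 1.0 / len(Xt))  # uniform target weights
    M = ot.dist(Xs, Xt)                  # pairwise squared Euclidean costs
    G = ot.sinkhorn(a, b, M, reg)        # entropic transport plan
    return len(Xs) * G @ Xt              # barycentric projection of sources
```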

Seemingly arbitrary cutoff values are mentioned. For example, it is not clear whether choosing "the cutoff that produced the highest MCCs" is meant as a single cutoff across methods or a separate cutoff for each method (are the results reported for each method those obtained at the cutoff that also maximized the MCC for that method?). A sketch of the per-method reading follows below.
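For clarity, a sketch of what per-method cutoff selection would mean (scores and labels are placeholders); if this is what was done, it should be stated explicitly:

```python
# Sweep all candidate cutoffs on a method's scores and report the
# MCC-maximizing cutoff for that method separately.
import numpy as np
from sklearn.metrics import matthews_corrcoef

def best_cutoff(y_true, scores):
    cutoffs = np.unique(scores)
    mccs = [matthews_corrcoef(y_true, (scores >= c).astype(int)) for c in cutoffs]
    i = int(np.argmax(mccs))
    return cutoffs[i], mccs[i]
```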

The Euclidean metric for assessing the similarity of (normalized) read counts is questionable in a high-dimensional space: read counts are assessed for 1 Mb genomic intervals, which yields around 3,000 intervals (dimensions), depending on the number of excluded intervals (the exclusion criteria were not described in more detail). There might be more appropriate measures for this high-dimensional space; see the sketch below.
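As an illustration, a sketch (with placeholder count matrices) of how the sample-assignment distances could be recomputed under alternative metrics such as correlation or cosine distance, which are often preferred in high dimensions:

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X_source = rng.poisson(100, size=(50, 3000)).astype(float)  # placeholder bins
X_target = rng.poisson(100, size=(20, 3000)).astype(float)

D_euclid = cdist(X_target, X_source, metric="euclidean")
D_corr = cdist(X_target, X_source, metric="correlation")
# Nearest source sample per target sample under each metric:
print(D_euclid.argmin(axis=1))
print(D_corr.argmin(axis=1))
```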

It is sometimes not clear what data is actually presented. An example is the caption of Figure 2C: it suggests that all (320) ovarian cancer cases are shown in one copy number profile.

Furthermore, the authors do not distinguish between male and female samples. A clarification is needed as to why the authors think SCNAs of ovarian cancer samples should be called against a reference set that contains male controls.
The procedure would likely benefit from a strict separation of male and female cases, which would also allow chrX (and chrY) to be included in downstream analyses.

The GC bias and mappability correction implicitly done by ichorCNA for the SCNA profile comparison is presented as "no correction", which is highly misleading (for clarification, this is deemed not just inaccurate but also inappropriate).

Many of the interpretations of cases in which the presented procedure does not yield any significant improvement in the similarity of copy number profiles are off the mark and in many instances favor the OT procedure in an unscientific and highly inappropriate manner.

Apart from duplicate marking (which is not specified any further; provide the command(s)!), there is no information on which reads or read pairs were used (primary, secondary, supplementary, mapped in a proper pair, fragment-length restrictions, clipping restrictions, etc.; a sketch of such a specification follows below). The authors should also explain why base quality score recalibration was done, as this might be an unnecessary step if the base quality values are not used later on.
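For illustration, a sketch (pysam; all thresholds and the file path are hypothetical) of the kind of explicit read-pair filtering specification that should be reported:

```python
# Example of an explicit, reportable read filter; the authors' actual
# settings are unknown and would need to be stated in the methods.
import pysam

def keep(read, min_mapq=20, max_tlen=1000):
    return (read.is_proper_pair
            and not read.is_secondary
            and not read.is_supplementary
            and not read.is_duplicate
            and not read.is_unmapped
            and read.mapping_quality >= min_mapq
            and 0 < abs(read.template_length) <= max_tlen)

with pysam.AlignmentFile("sample.bam") as bam:  # hypothetical path
    kept = sum(keep(r) for r in bam)
```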

The adaptation method presented as "center-and-scale standardization" is inappropriate for unbalanced cancer profiles, since it assumes the presence of identical SCNAs in all samples belonging to the same cancer entity (see the sketch below).
Please explain why normalizing 1 Mb genomic intervals to the average copy number across different cancer samples should be valid, or use another domain adaptation method for the performance comparison.
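To spell out the objection, a sketch of what per-bin center-and-scale standardization amounts to (assuming this matches the authors' baseline): every bin is shifted by the cohort mean, so an SCNA present in only part of the cohort leaks into the correction applied to all samples.

```python
import numpy as np

def center_and_scale(X):
    """X: samples x bins matrix of normalized read counts."""
    mu = X.mean(axis=0)             # per-bin cohort mean absorbs cohort SCNAs
    sd = X.std(axis=0, ddof=1)
    return (X - mu) / np.where(sd > 0, sd, 1.0)
```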

Statements like the one in line 83 (unsupervised DA) are plainly wrong, because transport from one domain to another requires the selection of a target domain based on a label, e.g., health status, cancer entity, or similar.

Relevance and Appropriateness:

Many of the presented results are not relevant, or details of the procedure are incomprehensible or incomplete, e.g., the results presented in Table 2 (sample assignment). The Euclidean metric seems inappropriate for high-dimensional data. Also, the selection of the cutoff based on Euclidean distance seems to enable optimization in favor of the OT procedure. It is hypothesized that there might exist other cutoff values for which the selection of samples from the source domain would also work for other correction methods, but this is not described further. It could simply be the case that OT can assign a relationship between domains regardless of whether that relationship is biologically meaningful.

The statement that there are no continuous pre-analytical variables is wrong (line 304). The effect of target depth of coverage (DoC) was not analyzed, although this represents one of the most common (continuous) and difficult-to-control variables in NGS data analysis. The inclusion of multiple samples from a single patient in a cohort likely introduces a confounding factor ("contamination") into the model training procedure: samples taken from the same patient at different time points are correlated, which represents leakage of information. As far as can be told from the presented data, this potential bias has not been ruled out (e.g., by excluding all samples beyond the first from each patient, or alternatively by assigning all samples of a patient either to the training set or to the test set; see the sketch below).
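A sketch of the patient-level split suggested above (all arrays are placeholders): scikit-learn's GroupKFold keeps all samples of one patient in the same fold, which prevents this form of leakage.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.random.rand(100, 3000)              # placeholder per-bin features
y = np.random.randint(0, 2, 100)           # placeholder labels
patient_ids = np.repeat(np.arange(50), 2)  # e.g., two samples per patient

for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=patient_ids):
    pass  # fit on X[train_idx] only; evaluate on X[test_idx]
```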

Conscientiousness:

Statements like "good"/"best" on their own should be avoided. A clear description of why a certain procedure/methodology/algorithm performs better is preferable in scientific writing (e.g., "highest MCC values" instead of "best MCC values"). Otherwise, such statements represent mere opinions of the authors rather than an unbiased evaluation of the results.

The domain D8 of healthy controls seems to contain samples from multiple sources (some published, others in-house). Contrary to the data availability statement (line 533), not all healthy control samples of the HEMA data set are available from ArrayExpress.

Other Major Concerns:

Potential Irrelevance:

The manuscript represents a mere performance assessment of the proposed sWGS per-bin read-count fitting procedure and is thus a verification in character, not a validation (although the model training itself was "validated", this is to be viewed separately from the validity of the achieved correction in a biological context). A proper (biological) validation is missing.

It is of utmost importance that parameters of the adapted (transported) samples that lie outside of what has been optimized to be highly similar are checked, in order to actually validate the procedure. Especially biological signals and genome-wide parameters (e.g., the GC content distribution before/after transport; see the sketch below) need to be addressed, also in light of the authors' extensive criticism of GC bias correction. At no point in the manuscript is GC bias addressed properly, i.e., how much of an improvement can be expected from GC bias correction if there is no significant GC bias?
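A sketch of the suggested sanity check (inputs are placeholders, one value per bin): compare the count-vs-GC relationship before and after transport; similar, flat curves in both would indicate that the transformation neither absorbed nor introduced GC bias.

```python
import numpy as np

def gc_bias_curve(counts, gc, n_strata=20):
    """Median normalized count per GC stratum."""
    edges = np.quantile(gc, np.linspace(0, 1, n_strata + 1))
    strata = np.clip(np.digitize(gc, edges[1:-1]), 0, n_strata - 1)
    norm = counts / np.median(counts)
    return np.array([np.median(norm[strata == s]) for s in range(n_strata)])

# curve_before = gc_bias_curve(counts_raw, gc_per_bin)
# curve_after = gc_bias_curve(counts_transported, gc_per_bin)
```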

The (potential; so far not clearly demonstrated) ability to make ChIP-seq data look like cfDNA data (even if only the copy number profiles/SCNAs appear highly similar) raises the concern that future users of the tool may superimpose domains that should not be superimposed from a biological point of view, because the true domains the superimposed cohorts belong to are different. The ability to superimpose anything onto anything is troubling. There is no control mechanism that allows for failure in cases where the superposition is invalid.

Chromosome X was excluded, which could be avoided if the data sets were split according to biological sex.

The difference between the distributions was never attributed to GC bias; hence, the benchmark against GC bias correction tools might not be relevant in the first place.

Stability of OT data transformation:

The authors state that the straightforward choice of lambda resulted in many occasions where disruptions (of unspecified nature and amplitude) were introduced into the copy number profiles of the transformed data. It is not evident from the presented work to what extent this behavior was removed from the procedure, whether it can still occur, and how users could resolve such a problem on their own.

In summary, the presented work needs considerable adaptation and additions before it can be considered a valuable contribution to the liquid biopsy field.

Reviewer #2 (Public Review):

The authors present a computational methodology for de-biasing/denoising high-throughput genomic signals using optimal transport techniques, thus allowing disparate datasets to be merged and jointly analysed. They apply this methodology on liquid biopsy data and they demonstrate improved performance (compared to simpler bias-correcting approaches) for cancer detection using common machine learning algorithms. This is a theoretically interesting and potentially useful approach for addressing a very common practical problem in computational genomics.

I have the following recommendations:

(1) When comparing performance metrics between different approaches (e.g., Tables 3 and 4), 95% confidence intervals should also be provided, and a pairwise statistical test should be applied to establish whether the observed difference in each performance metric between the proposed method and the alternatives is statistically significant, thus justifying the claim that the proposed method offers an improvement over existing methodologies (a sketch follows after these points).

(2) The commonly used center-and-scale and GC debias approaches presented by the authors are fairly simple. How does their methodology compare to more elaborate approaches, such as tangent normalisation (https://academic.oup.com/bioinformatics/article/38/20/4677/6678978) and robust PCA (https://github.com/mskilab-org/dryclean)? (A sketch of tangent normalisation follows after these points.)

(3) What is the computational cost of the proposed methodology and how does it compare to the alternatives?

(4) The proposed approach relies on a reference dataset, against which a given dataset is adapted. What are the implications for cross-validation experiments (which are essential for assessing the out-of-sample error of any methodology), particularly with regard to the requirement to avoid information leakage between training and validation/test data sets? (A sketch of a leakage-safe protocol follows below.)
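Regarding point (1): a sketch (predictions and labels are placeholders) of a paired bootstrap over the shared test samples, yielding a 95% confidence interval for the difference in a metric (MCC here) between two methods:

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def paired_bootstrap_ci(y, pred_a, pred_b, n_boot=10000, seed=0):
    """95% CI for MCC(method A) - MCC(method B) on the same test samples."""
    rng = np.random.default_rng(seed)
    n = len(y)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, n)  # resample test samples with replacement
        diffs[i] = (matthews_corrcoef(y[idx], pred_a[idx])
                    - matthews_corrcoef(y[idx], pred_b[idx]))
    return np.percentile(diffs, [2.5, 97.5])
```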
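Regarding point (2): a minimal least-squares sketch of tangent normalisation as meant here; a sample is projected onto the subspace spanned by a panel of normals and the residual is kept (the cited method includes further refinements).

```python
import numpy as np

def tangent_normalize(sample, normals):
    """sample: (bins,); normals: (bins, n_normals) of log2 coverage."""
    coef, *_ = np.linalg.lstsq(normals, sample, rcond=None)
    return sample - normals @ coef  # residual after removing systematic noise
```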
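Regarding point (4): a sketch of the leakage-safe protocol implied by this question, in which any data-driven adaptation is fitted on the training fold only and applied, frozen, to the held-out fold (StandardScaler stands in for the more complex OT adaptation; all data are placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 300)      # placeholder features
y = np.random.randint(0, 2, 100)  # placeholder labels

for train_idx, test_idx in KFold(5, shuffle=True, random_state=0).split(X):
    adapter = StandardScaler().fit(X[train_idx])  # fit on training fold only
    model = LogisticRegression(max_iter=1000).fit(
        adapter.transform(X[train_idx]), y[train_idx])
    model.score(adapter.transform(X[test_idx]), y[test_idx])  # frozen transform
```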

In conclusion, this is an interesting and potentially useful paper and I would like to encourage the authors to address the above points, which hopefully will strengthen their case.
