Correction of the healthy samples from the HEMA data set.

(A) Kernel principal component analysis of coverage profiles from the two control cohorts (haematological cancer data set). (B) Two-sample Kolmogorov-Smirnov testing, for each bin, of the difference between the two cohorts. p-values are shown in both linear and log scales. (C) z-scores of the coverage profiles before and after GC-correction and domain adaptation. (D) Histogram depicting, for each patient of the target domain, the number of patients in the source domain for which the transport plan shows a relationship.
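The counts summarised in panel D can be sketched as follows. This is a minimal illustration, assuming the transport plan is available as a matrix with one row per target-domain patient and one column per source-domain patient; the tolerance `tol` and the function name `partners_per_target` are hypothetical choices, not taken from the study.

```python
from collections import Counter

def partners_per_target(plan, tol=1e-8):
    """For each target patient (row), count the source patients (columns)
    that receive more than `tol` of transported mass (assumed threshold)."""
    return [sum(1 for mass in row if mass > tol) for row in plan]

# Toy plan: 3 target patients x 3 source patients, each row summing to 1/3.
plan = [
    [1 / 3, 0.0, 0.0],
    [0.0, 1 / 6, 1 / 6],
    [0.0, 1 / 6, 1 / 6],
]
counts = partners_per_target(plan)
histogram = Counter(counts)  # number of target patients per partner count
```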

Performance assessment using paired samples from the OV data set.

Accuracy obtained after data correction when each sample from the target domain is assigned to its closest sample in the source domain under the Euclidean metric. In the first column, samples from domain D9 have been corrected toward D10; in the second column, samples from D10 have been corrected toward D9.
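A minimal sketch of this nearest-neighbour accuracy, assuming the corrected target profiles and the source profiles are stored so that `target[i]` and `source[i]` originate from the same biological sample. That layout and the name `paired_nn_accuracy` are illustrative assumptions.

```python
import math

def paired_nn_accuracy(target, source):
    """Fraction of target samples whose closest source sample
    (Euclidean distance) is their own paired sample."""
    hits = 0
    for i, profile in enumerate(target):
        nearest = min(range(len(source)),
                      key=lambda j: math.dist(profile, source[j]))
        hits += int(nearest == i)  # correct if the nearest sample is the pair
    return hits / len(target)

# Toy profiles with two bins each; the third target sample is mismatched.
source = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
target = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.1]]
accuracy = paired_nn_accuracy(target, source)
```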

Performance assessment using paired samples from the NIPT data set.

Accuracy obtained after data correction, on each of the 6 validation groups, when each sample from the target domain is assigned to its closest sample in the source domain under the Euclidean metric. Each validation group was designed to control for one preanalytical variable at a time.

Haematological cancer detection using supervised approaches.

Sensitivity, specificity, Matthews correlation coefficient (MCC), AUROC and AUPR obtained through validation of 3 supervised models. These models have been successively trained to distinguish Hodgkin lymphoma, DLBCL and multiple myeloma cases from healthy controls. Sensitivity, specificity and MCC were computed using the cutoff that maximises MCC.
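The cutoff selection can be sketched as below. This is an illustration under assumptions: each model is taken to output a continuous score per sample, every observed score is tried as a candidate cutoff, and a sample is called positive when its score is at or above the cutoff. The helper names are hypothetical.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def best_mcc_cutoff(scores, labels):
    """Return (cutoff, MCC, sensitivity, specificity) at the cutoff
    that maximises MCC; labels are 1 for cases and 0 for controls."""
    best = None
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
        value = mcc(tp, tn, fp, fn)
        if best is None or value > best[1]:
            best = (cutoff, value, tp / (tp + fn), tn / (tn + fp))
    return best

scores = [0.1, 0.2, 0.35, 0.8, 0.7, 0.4]
labels = [0, 0, 0, 1, 1, 1]
cutoff, best_value, sensitivity, specificity = best_mcc_cutoff(scores, labels)
```

Because the cutoff is tuned on the same scores it is evaluated on, the resulting sensitivity, specificity and MCC are best read as optimistic estimates at the chosen operating point.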

Ovarian carcinoma detection using supervised approaches.

Sensitivity, specificity, Matthews correlation coefficient (MCC), AUROC and AUPR obtained through validation of 3 supervised models. These models have been trained to distinguish ovarian carcinoma cases from healthy individuals. Sensitivity, specificity and MCC were computed using the cutoff that maximises MCC.

Quantitative assessment of the consistency of CNA calling.

ichorCNA results on ovarian carcinoma cases from D9 using the panel of controls from D9, compared to the same cases corrected (D9 → D10) by each method and using the panel of controls from D10. The upper part of the table reports per-bin metrics, namely the copy number in each bin, the presence of a CNA (copy number ≠ 2) in each bin, the SOV REFINE [55] score and the log-ratios. We used the SOV REFINE segmentation metric to measure the overlap between called CNAs. The metrics in the bottom section of the table are the average absolute errors on different model parameters estimated by ichorCNA.

Qualitative assessment of the consistency of CNA calling.

Comparison of the CNAs called by ichorCNA on a late-stage ovarian carcinoma case from D9, before and after domain adaptation. Green and red colouring correspond to deletions and gains, respectively. (A) Using D9 controls. (B) Using D10 controls. (C) D9 cancer cases (including the case shown) centered-and-scaled toward D10 and analysed with D10 controls. (D) D9 cancer cases OT-corrected toward D10 and analysed with D10 controls.

Tumour fractions before and after domain adaptation.

Fractions have been estimated with ichorCNA before and after adapting the D9 ovarian carcinoma cases toward the D10 cases. Results have been produced both with (Left) and without (Right) the panel of controls from domain D10. Fractions are shown in log-scale.

Effect of plasma separation delay on coverage profiles.

(A) GC-corrected normalised read counts of all samples from the NIPT data set for a specific bin (first 1 Mb bin of chromosome 6), namely the bin showing the strongest linear correlation with the plasma separation delay (Pearson’s r=0.1838, p-value=1.38e-5, two-sided test). (B) Distribution of the p-values computed in the same way for each 1 Mb bin, reported as a histogram. The histogram is shown in log-scale.
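The per-bin screening can be sketched as follows, assuming a samples-by-bins coverage matrix and one plasma separation delay per sample. The toy values and variable names are illustrative only. The sketch computes Pearson's r; a two-sided p-value such as the one reported above could be obtained with `scipy.stats.pearsonr`, which is not used here so that the example stays dependency-free.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Toy data: 4 samples (rows) x 2 bins (columns), and one delay per sample.
delays = [1.0, 2.0, 3.0, 4.0]
coverage = [
    [1.00, 0.9],
    [1.02, 1.1],
    [1.04, 0.8],
    [1.06, 1.0],
]
n_bins = len(coverage[0])
r_per_bin = [pearson_r(delays, [row[b] for row in coverage])
             for b in range(n_bins)]
# Bin with the strongest linear association, regardless of sign.
strongest_bin = max(range(n_bins), key=lambda b: abs(r_per_bin[b]))
```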

Coverage difference between domains as a function of GC-content.

Two-sample Kolmogorov-Smirnov testing, for each 1 Mb bin, of the difference between the two control sets from the HEMA data set processed with different library preparation kits. p-values are shown in log-scale, as a function of the GC-content of each bin. Dashed line corresponds to a 0.5 p-value and black markers to the median p-values per 0.5% GC stratum.
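As an illustration of the per-bin comparison, the sketch below computes the two-sample Kolmogorov-Smirnov statistic, that is, the largest gap between the two empirical distribution functions, for the coverage values of one bin in each control set. The function name and toy values are assumptions; the p-values shown in the figure could be obtained with `scipy.stats.ks_2samp`, which this dependency-free sketch does not call.

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum absolute difference between
    the empirical CDFs of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    for value in sorted(set(a + b)):
        cdf_a = bisect.bisect_right(a, value) / len(a)
        cdf_b = bisect.bisect_right(b, value) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap

# Toy coverage values of a single bin in the two control sets.
kit_1 = [0.98, 1.00, 1.02]
kit_2 = [1.10, 1.12, 1.15]
d_shifted = ks_statistic(kit_1, kit_2)   # fully separated distributions
d_same = ks_statistic(kit_1, kit_1)      # identical distributions
```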

Summary of the data sets used in this study.

Samples in sets marked with a ‘’ have been processed twice, allowing quantitative assessment of the different biases caused by the changes of sequencing protocols. Domains have been defined based on our experiments, as well as the protocol differences shown in the table. For clarity purposes, we systematically refer to these domains in the results section.

Illustrative summary of our methods.

(A) Given two cohorts of cfDNA samples differing by the sequencing pipeline that processed them, the model corrects the second cohort to match the distribution of the first one. After correction, the cost matrix for our OT problem is given by the pairwise Euclidean distances. (B) The solution of the OT problem, named transport plan, assigns patients from Domain 2 to similar patients in Domain 1. The model parameters are found by minimising the Wasserstein distance, as defined by the cost matrix and transport plan. (C) After inference, the two cohorts are merged and ready for downstream analysis. (D) Depiction of the validation procedure used for the purpose of this study.
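Panel B can be illustrated with a small sketch: a transport plan between two toy cohorts, computed from pairwise Euclidean distances. Note the assumptions: entropic regularisation with Sinkhorn iterations is swapped in here as a simple stand-in solver, the marginals are taken to be uniform, and `reg`, `n_iter` and all names are hypothetical. The solver used in the study may differ.

```python
import math

def sinkhorn_plan(cost, reg=0.1, n_iter=200):
    """Entropy-regularised transport plan with uniform marginals
    (an assumed stand-in for the OT solver)."""
    n, m = len(cost), len(cost[0])
    a, b = [1.0 / n] * n, [1.0 / m] * m
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

# Rows: Domain 2 patients (being corrected); columns: Domain 1 patients.
domain_1 = [[0.0, 0.0], [1.0, 1.0]]
domain_2 = [[0.1, 0.0], [0.9, 1.0]]
cost = [[math.dist(x, y) for y in domain_1] for x in domain_2]
plan = sinkhorn_plan(cost)
# Transport cost under this plan; approximates the Wasserstein distance.
transport_cost = sum(plan[i][j] * cost[i][j]
                     for i in range(len(cost)) for j in range(len(cost[0])))
```

In this toy case almost all mass stays on the diagonal, so each Domain 2 patient is matched to its nearby Domain 1 counterpart and the transport cost is close to the within-pair distance of 0.1.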

© 2024 BioRender Inc. Any parts of this image created with BioRender are not made available under the same license as the Reviewed Preprint, and are © 2024, BioRender Inc.