Figures and data in Improved clinical data imputation via classical and quantum determinantal point processes

7 figures, 8 tables and 1 additional file

Figures

Figure 1

Example of overall workflow for patient management through clinical data imputation and downstream classification.

Figure 2

Imputation and downstream classification procedure to benchmark the imputation method’s performance.

First, the imputer is trained on the whole observed dataset X, as shown in step (a). In step (b), the imputed data is split into three consecutive folds (holdout sets H1, H2, and H3); a classifier is then trained on each combination of two folds (development sets D1, D2, and D3), and the area under the receiver operating characteristic curve (AUC) is calculated on the corresponding holdout set.
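
The benchmarking loop in step (b) can be sketched as follows. This is a minimal illustration, not the authors' code: it takes an already imputed matrix as given, and the classifier (`fit_linear`, `score_linear`) is a hypothetical least-squares stand-in, since the caption does not name the classifier used. The AUC is computed from the rank-sum identity so that only NumPy is needed.

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC via the rank-sum (Mann-Whitney U) identity; ties get average ranks."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):  # replace tied ranks by their average
        mask = scores == s
        ranks[mask] = ranks[mask].mean()
    n_pos = int((y_true == 1).sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def fit_linear(X, y):
    """Placeholder classifier (assumption): least-squares fit with an intercept."""
    A = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)
    return w

def score_linear(w, X):
    """Scores of the placeholder classifier on new rows."""
    return np.column_stack([X, np.ones(len(X))]) @ w

def three_fold_auc(X_imputed, y, fit, predict_score):
    """Split rows into three consecutive folds H1..H3. For each k, train on the
    other two folds (development set Dk) and compute the AUC on holdout Hk."""
    folds = np.array_split(np.arange(len(y)), 3)
    aucs = []
    for k, holdout in enumerate(folds):
        dev = np.concatenate([f for j, f in enumerate(folds) if j != k])
        model = fit(X_imputed[dev], y[dev])
        aucs.append(auc_score(y[holdout], predict_score(model, X_imputed[holdout])))
    return aucs
```

Calling `three_fold_auc(X_imputed, y, fit_linear, score_linear)` returns one AUC per holdout set; repeating the whole imputation and classification run several times would give the spread reported as mean ± SD in the tables.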

Figure 3

IBM Hanoi 27-qubit quantum processor.

Figure 4

The sampling and training procedure for the DPP-Random Forest algorithm: the dataset is divided into batches of similar size, the DPP sampling algorithm is then applied to every batch in parallel, and the resulting samples are combined to form larger datasets used to train the decision trees.

Since the batches are fixed, DPP sampling can be easily parallelized, either classically or quantumly. DPP, determinantal point processes.
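
A classical sketch of the batch-wise sampling step is given below. It assumes the standard d-DPP definition, in which a set S of d rows of an n × d batch matrix A is drawn with probability proportional to det(A_S)², and it enumerates all subsets, which is only feasible for the small batch sizes considered here. The function names are illustrative and not taken from the paper.

```python
import itertools
import numpy as np

def ddpp_probabilities(A):
    """Return all d-subsets of rows of A and their d-DPP probabilities,
    assumed proportional to the squared determinant of the d x d submatrix."""
    n, d = A.shape
    subsets = list(itertools.combinations(range(n), d))
    weights = np.array([np.linalg.det(A[list(S)]) ** 2 for S in subsets])
    return subsets, weights / weights.sum()

def sample_batches(X, n_batches, rng):
    """Split rows into batches of similar size, draw one d-DPP sample per batch,
    and pool the selected row indices. Batches are independent, so this loop
    could run in parallel."""
    chosen = []
    for batch in np.array_split(np.arange(len(X)), n_batches):
        if len(batch) < X.shape[1]:
            continue  # too few rows to form a d x d submatrix
        subsets, probs = ddpp_probabilities(X[batch])
        S = subsets[rng.choice(len(subsets), p=probs)]
        chosen.extend(batch[list(S)].tolist())
    return np.array(chosen)
```

Repeating `sample_batches` with fresh randomness yields a different pooled dataset for each decision tree; subsets of near-collinear points have small determinants and are therefore rarely selected.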

Figure 5

Deterministic determinantal point processes (DPP) sampling procedure for training decision trees.

At each step, a decision tree is trained using the sample with the highest determinantal probability; that sample is then removed from the original batch before continuing to the next decision tree.
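
The deterministic variant can be sketched as a greedy loop. The sketch assumes that "highest determinantal probability" means the d rows whose d × d submatrix has the largest squared determinant, and it finds them by exhaustive search, which is only practical for small batches. `det_dpp_subsets` is a hypothetical name.

```python
import itertools
import numpy as np

def det_dpp_subsets(batch, n_trees):
    """For each tree, pick the d rows with the largest squared determinant among
    the rows still available, then remove them from the batch."""
    d = batch.shape[1]
    remaining = list(range(len(batch)))
    picks, shapes = [], []
    for _ in range(n_trees):
        if len(remaining) < d:
            break
        shapes.append((len(remaining), d))  # matrix size seen by this tree
        best = max(
            itertools.combinations(remaining, d),
            key=lambda S: np.linalg.det(batch[list(S)]) ** 2,
        )
        picks.append(best)
        remaining = [i for i in remaining if i not in best]
    return picks, shapes
```

For a batch of 10 points with d = 2 and four trees, `shapes` is `[(10, 2), (8, 2), (6, 2), (4, 2)]`, which matches the batch-size-10 row of Table 4. Because no randomness is involved, repeated runs give identical subsets.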

Figure 6

Types of data loaders.

Each horizontal line corresponds to a qubit. Each vertical line connecting two qubits corresponds to a reconfigurable beam splitter (RBS) gate. We also use $X$, $Z$, and $CZ$ gates. The depth of the first two loaders is linear in the number of qubits, and that of the last one is logarithmic.

Figure 7

Quantum determinant sampling circuit for an orthogonal matrix $A = (a^{1}, \dots, a^{d})$.

It uses the Clifford loader, which is a unitary quantum operator: $\mathcal{C}(x) = \sum_{i=1}^{n} x_{i} Z^{i-1} X I^{n-i}$, for $x \in \mathbb{R}^{n}$.
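
The operator can be checked numerically for small n. The sketch below assumes that the products in the formula are tensor products over the n qubits (Z on the first i − 1 qubits, X on qubit i, identity on the rest) and that x has unit norm, which is what makes the operator unitary. It builds the dense 2ⁿ × 2ⁿ matrix, so it is an illustration rather than a circuit construction.

```python
from functools import reduce
import numpy as np

IDENTITY = np.eye(2)
PAULI_X = np.array([[0.0, 1.0], [1.0, 0.0]])
PAULI_Z = np.array([[1.0, 0.0], [0.0, -1.0]])

def clifford_loader(x):
    """Dense matrix of C(x) = sum_i x_i Z^(i-1) X I^(n-i), read as tensor products."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    C = np.zeros((2 ** n, 2 ** n))
    for i in range(n):  # i is zero-based here, so Z appears i times
        factors = [PAULI_Z] * i + [PAULI_X] + [IDENTITY] * (n - i - 1)
        C += x[i] * reduce(np.kron, factors)
    return C

# Example with a unit-norm vector: the result is symmetric and squares to the identity.
x = np.array([1.0, 2.0, 2.0]) / 3.0
C = clifford_loader(x)
assert np.allclose(C, C.T)
assert np.allclose(C @ C, np.eye(8))
```

The individual terms anticommute pairwise and each squares to the identity, so C(x)² = ‖x‖² I; for a unit vector the operator is therefore both Hermitian and unitary.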

Tables

Table 1

AUC results for the SYNTH and MIMIC-III datasets, with MCAR and MNAR missingness, three holdout sets, and six different imputation methods.

Values are expressed as mean ± SD of 10 values for each experiment. DPP-MICE and detDPP-MICE are in bold when outperforming MICE and the underlined one is the best of the three. DPP-MissForest and detDPP-MissForest are in bold when outperforming MissForest and the underlined one is the best of the three.

Dataset	Missingness	Set	MICE	DPP-MICE	detDPP-MICE	MissForest	DPP-MissForest	detDPP-MissForest
SYNTH	MCAR	H1	0.8318 ± 0.0113	0.835 ± 0.0083	0.8352	0.8525 ± 0.0044	0.8552 ± 0.0049	0.8582
		H2	0.8316 ± 0.008	0.8369 ± 0.0128	0.84	0.8465 ± 0.0057	0.849 ± 0.003	0.8491
		H3	0.8205 ± 0.0127	0.8266 ± 0.0096	0.8272	0.8436 ± 0.0031	0.8452 ± 0.0048	0.855
	MNAR	H1	0.8903 ± 0.0046	0.8915 ± 0.007	0.8934	0.7133 ± 0.0063	0.7171 ± 0.01	0.7185
		H2	0.8755 ± 0.01	0.8745 ± 0.0072	0.8955	0.7052 ± 0.0036	0.7124 ± 0.0078	0.7167
		H3	0.9003 ± 0.0059	0.9005 ± 0.006	0.9041	0.769 ± 0.0103	0.7773 ± 0.0129	0.7905
MIMIC	MCAR	H1	0.7621 ± 0.0046	0.7628 ± 0.0049	0.7641	0.7687 ± 0.0012	0.77 ± 0.0013	0.771
		H2	0.7541 ± 0.0037	0.7532 ± 0.0047	0.7619	0.7649 ± 0.0019	0.777 ± 0.0019	0.7707
		H3	0.7365 ± 0.0055	0.7394 ± 0.0052	0.7471	0.7485 ± 0.001	0.7507 ± 0.0017	0.7515
	MNAR	H1	0.77 ± 0.0026	0.7717 ± 0.0036	0.7722	0.6616 ± 0.0065	0.6715 ± 0.07	0.6760
		H2	0.777 ± 0.0064	0.7818 ± 0.0029	0.7812	0.6748 ± 0.0045	0.6778 ± 0.0048	0.6798
		H3	0.7324 ± 0.0047	0.7363 ± 0.0031	0.7403	0.6368 ± 0.0034	0.64 ± 0.004	0.6419

AUC = area under the receiver operating characteristic curve; MCAR = missing completely at random; MNAR = missing not at random.

Table 2

AUC results on the different holdout sets after imputation using MICE, DPP-MICE, and detDPP-MICE.

In the case of MICE and DPP-MICE, the boxplots correspond to 10 AUC values for 10 iterations of the same imputation and classification algorithms, depicting the lower and upper quartiles as well as the median of these 10 values. The AUC values are the same for every iteration of the detDPP-MICE algorithm.

Table 3

AUC results on the different holdout sets after imputation using MissForest, DPP-MissForest, and detDPP-MissForest.

In the case of MissForest and DPP-MissForest, the boxplots correspond to 10 AUC values for 10 iterations of the same imputation and classification algorithm, depicting the lower and upper quartiles as well as the median of these 10 values. The AUC values are always the same for every iteration of the detDPP-MissForest algorithm.

Table 4

Data matrix sizes used by the quantum determinantal point processes (DPP) circuits to train each tree.

The number of rows corresponds to the number of data points and is equal to the number of qubits of every circuit.

Batch size	Tree 1	Tree 2	Tree 3	Tree 4
7	(7,2)	(5,2)	-	-
8	(8,2)	(6,2)	(4,2)	-
10	(10,2)	(8,2)	(6,2)	(4,2)

Table 5

Hardware results using the IBM quantum processor, depicting AUC results of the downstream classifier task after imputing missing values using DPP-MissForest.

In the case of MissForest and the quantum hardware DPP-MissForest implementations, the boxplots correspond to 10 AUC values for 10 iterations of the same imputation and classification algorithm, depicting the lower and upper quartiles as well as the median of these 10 values. The AUC values are the same for every iteration of the quantum DPP-MissForest algorithm using the simulator.

Table 6

Numerical quantum hardware results showing the AUC results of the downstream classifier task on reduced datasets.

Values are expressed as mean ± SD of 10 values for each experiment.

Dataset	Missingness	Batch size	Trees	MissForest	detDPP-MissForest (simulator)	detDPP-MissForest (hardware)
SYNTH	MCAR	7	2	0.868 ± 0.0302	0.9026	0.8598 ± 0.021
		8	3	0.8667 ± 0.0342	0.9256	0.8923 ± 0.027
		10	4	0.8725 ± 0.0275	0.9028	0.8902 ± 0.024
	MNAR	7	2	0.7122 ± 0.0264	0.78	0.7149 ± 0.02
		8	3	0.7153 ± 0.022	0.729	0.7036 ± 0.0167
		10	4	0.7258 ± 0.0157	0.7868	0.7082 ± 0.036
MIMIC	MCAR	7	2	0.7127 ± 0.038	0.7522	0.7117 ± 0.0315
		8	3	0.7136 ± 0.03	0.7728	0.7448 ± 0.0258
		10	4	0.6968 ± 0.03	0.7327	0.7262 ± 0.0299
	MNAR	7	2	0.7697 ± 0.0133	0.7794	0.7742 ± 0.0108
		8	3	0.7713 ± 0.0112	0.7943	0.767 ± 0.0125
		10	4	0.7712 ± 0.0116	0.7922	0.7675 ± 0.01545

AUC = area under the receiver operating characteristic curve; MCAR = missing completely at random; MNAR = missing not at random.

Table 7

Summary of the characteristics of the different quantum determinantal point processes (DPP) circuits.

NN = nearest neighbor connectivity.

Clifford loader	Hardware connectivity	Depth	# of RBS gates
Diagonal	NN	$2nd$	$2nd$
Semi-diagonal	NN	$nd$	$2nd$
Parallel	All-to-all	$4d \log(n)$	$2nd$
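
To compare the three loaders for a given problem size, the table entries can be evaluated directly. The sketch below reads each entry as a product of n (number of qubits) and d (number of columns), and it assumes a base-2 logarithm for the parallel loader, which the table does not state. The function name is hypothetical, and the values are the table's expressions rather than exact gate counts for any device.

```python
import math

def clifford_loader_resources(n, d):
    """Evaluate the depth and RBS-gate expressions of the table for n qubits
    and d columns. The log base (2) is an assumption."""
    gates = 2 * n * d
    return {
        "Diagonal": {"connectivity": "NN", "depth": 2 * n * d, "rbs_gates": gates},
        "Semi-diagonal": {"connectivity": "NN", "depth": n * d, "rbs_gates": gates},
        "Parallel": {
            "connectivity": "All-to-all",
            "depth": 4 * d * math.log2(n),
            "rbs_gates": gates,
        },
    }
```

For n = 8 and d = 2, the depths evaluate to 32 (diagonal), 16 (semi-diagonal), and 24 (parallel), each with 32 RBS gates, so at this size the logarithmic depth of the parallel loader does not yet beat the semi-diagonal one.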

Table 8

Complexity comparison of d-DPP sampling algorithms, both classical (Mahoney et al., 2019) and quantum (Kerenidis and Prakash, 2022).

The problem considered is DPP sampling of $d$ rows from an $n \times d$ matrix, where $n = O(d)$. For the quantum case, we provide both the depth and the size of the circuits.

	Classical	Quantum
Preprocessing	$O (d^{3})$	$O (d^{3})$
Sampling	$\tilde{O}(d^{3})$	$\tilde{O}(d)$ depth, $\tilde{O}(d^{2})$ gates

DPP = determinantal point processes.

Additional files

MDAR checklist: https://cdn.elifesciences.org/articles/89947/elife-89947-mdarchecklist1-v1.docx

Cite this article

Skander Kazdaghli, Iordanis Kerenidis, Jens Kieckbusch, Philip Teare (2024) Improved clinical data imputation via classical and quantum determinantal point processes. eLife 12:RP89947. https://doi.org/10.7554/eLife.89947.3
