Improved clinical data imputation via classical and quantum determinantal point processes

  1. Skander Kazdaghli (corresponding author)
  2. Iordanis Kerenidis
  3. Jens Kieckbusch
  4. Philip Teare
  1. QC Ware, France
  2. Université de Paris, CNRS, IRIF, France
  3. Emerging Innovations Unit, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca, United Kingdom
  4. Centre for AI, Data Science & AI, BioPharmaceuticals R&D, AstraZeneca, United Kingdom
7 figures, 8 tables and 1 additional file

Figures

Example of overall workflow for patient management through clinical data imputation and downstream classification.
Imputation and downstream classification procedure to benchmark the imputation method’s performance.

First, the imputer is trained on the whole observed dataset X as shown in step (a). In step (b), the imputed data is split into three consecutive folds (holdout sets H1, H2, and H3), then a classifier is trained on each combination of two holdout sets (development sets D1, D2, and D3) and the area under the receiver operating curve (AUC) is calculated for each holdout set.
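The two-step procedure above can be sketched in a few lines of scikit-learn. This is an illustrative stand-in, not the authors' exact pipeline: IterativeImputer plays the role of a MICE-style imputer, the classifier is a plain logistic regression, and the dataset is synthetic.

```python
# Sketch of the benchmarking loop: step (a) trains the imputer on the whole
# observed dataset; step (b) splits the imputed data into three consecutive
# folds, trains on each development set D_i, and scores AUC on each holdout H_i.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan          # inject MCAR missingness

# Step (a): train the imputer on the whole observed dataset.
X_imp = IterativeImputer(random_state=0).fit_transform(X)

# Step (b): three consecutive folds; each fold is a holdout set H_i and the
# remaining two folds form the development set D_i.
aucs = []
for dev_idx, hold_idx in KFold(n_splits=3).split(X_imp):
    clf = LogisticRegression(max_iter=1000).fit(X_imp[dev_idx], y[dev_idx])
    aucs.append(roc_auc_score(y[hold_idx],
                              clf.predict_proba(X_imp[hold_idx])[:, 1]))
print([round(a, 3) for a in aucs])
```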

IBM Hanoi 27-qubit quantum processor.
The sampling and training procedure for the DPP-Random Forest algorithm: the dataset is divided into batches of similar size, the DPP sampling algorithm is then applied to every batch in parallel, and the subsequent samples are then combined to form larger datasets used to train the decision trees.

Since the batches are fixed, DPP sampling can be easily parallelized, either classically or quantumly. DPP, determinantal point processes.
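As a minimal sketch of the per-batch sampling step, the snippet below draws, from each fixed n×d batch, a subset S of d rows with probability proportional to det(X_S)², by enumerating all subsets exactly. This brute-force enumeration is only feasible for small batches and stands in for the efficient DPP samplers the paper actually benchmarks; the batch sizes here are illustrative.

```python
# Per-batch projection d-DPP sampling: subsets of d rows are drawn with
# probability det(X_S)^2 / det(X^T X); the per-batch samples are then
# combined into one training set for a decision tree.
import numpy as np
from itertools import combinations

def sample_ddpp(X, rng):
    n, d = X.shape
    subsets = list(combinations(range(n), d))
    probs = np.array([np.linalg.det(X[list(S)]) ** 2 for S in subsets])
    probs /= probs.sum()  # by the Cauchy-Binet formula, the sum is det(X^T X)
    return list(subsets[rng.choice(len(subsets), p=probs)])

rng = np.random.default_rng(42)
data = rng.normal(size=(32, 2))
batches = np.split(data, 4)                    # fixed batches, sampled independently
rows = [batch[sample_ddpp(batch, rng)] for batch in batches]
train_set = np.vstack(rows)                    # combined sample for one tree
print(train_set.shape)                         # (8, 2): d rows from each batch
```

Because the batches are fixed, the list comprehension over `batches` could equally be dispatched to parallel workers, which is the parallelization opportunity the caption points out.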

Deterministic determinantal point processes (DPP) sampling procedure for training decision trees.

At each step, a decision tree is trained using the sample that corresponds to the highest determinantal probability; this sample is then removed from the original batch before training the next decision tree.
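The deterministic variant can be sketched as a greedy loop: pick the d-row subset with the largest det(X_S)², remove those rows, and repeat for the next tree. This toy implementation (with illustrative sizes matching the batch-size-7 row of Table 4) enumerates subsets by brute force, which is only practical for small batches.

```python
# Deterministic DPP selection: for each tree, take the d-row subset with the
# largest determinantal probability det(X_S)^2, then remove those rows from
# the batch before selecting for the next tree.
import numpy as np
from itertools import combinations

def det_dpp_batches(X, d, n_trees):
    remaining = list(range(len(X)))
    samples = []
    for _ in range(n_trees):
        best = max(combinations(remaining, d),
                   key=lambda S: np.linalg.det(X[list(S)]) ** 2)
        samples.append(X[list(best)])
        remaining = [i for i in remaining if i not in best]  # remove used rows
    return samples

rng = np.random.default_rng(1)
batch = rng.normal(size=(7, 2))                # batch of 7 points, 2 features
samples = det_dpp_batches(batch, d=2, n_trees=2)
print([s.shape for s in samples])              # [(2, 2), (2, 2)]
```

Because the argmax is unique for generic data, this procedure has no randomness, which is why the detDPP variants yield identical AUC values across repeated runs.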

Types of data loaders.

Each line corresponds to a qubit. Each vertical line connecting two qubits corresponds to a reconfigurable beam splitter (RBS) gate. We also use X, Z, and CZ gates. The depth of the first two loaders is linear, while that of the last one is logarithmic in the number of qubits.
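For reference, the RBS gate mentioned in the caption can be written out as a 4×4 unitary: it acts as the identity on |00⟩ and |11⟩ and as a planar rotation on the {|01⟩, |10⟩} subspace. This is a sketch of the standard definition (sign conventions vary between papers), not tied to any particular SDK.

```python
# RBS(theta) as a 4x4 real unitary: identity on the |00>, |11> sector and a
# rotation by theta on the span of |01> and |10>.
import numpy as np

def rbs(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1, 0,  0, 0],
                     [0, c,  s, 0],
                     [0, -s, c, 0],
                     [0, 0,  0, 1]])

U = rbs(0.3)
assert np.allclose(U @ U.T, np.eye(4))  # real orthogonal, hence unitary
```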

Quantum determinant sampling circuit for an orthogonal matrix $A=(a_1,\ldots,a_d)$.

It uses the Clifford loader, which is a unitary quantum operator: $\mathcal{C}(x)=\sum_{i=1}^{n} x_i\, Z^{\otimes (i-1)} \otimes X \otimes I^{\otimes (n-i)}$, for a unit vector $x \in \mathbb{R}^n$.
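The distribution this circuit samples from can be cross-checked classically: for a matrix A with orthonormal columns, measurement yields a size-d subset S of rows with probability det(A_S)², and by the Cauchy-Binet formula these probabilities sum to det(AᵀA) = 1. A small numpy check (with illustrative dimensions):

```python
# Classical sanity check of the determinantal distribution: for a 6x2 matrix
# with orthonormal columns, the squared subdeterminants over all 2-row
# subsets form a probability distribution (they sum to det(A^T A) = 1).
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
A, _ = np.linalg.qr(rng.normal(size=(6, 2)))   # 6x2, orthonormal columns
probs = [np.linalg.det(A[list(S)]) ** 2 for S in combinations(range(6), 2)]
print(round(sum(probs), 6))                    # 1.0
```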

Tables

Table 1
AUC results for the SYNTH and MIMIC-III datasets, with MCAR and MNAR missingness, three holdout sets, and six different imputation methods.

Values are expressed as mean ± SD over 10 values for each experiment. DPP-MICE and detDPP-MICE are shown in bold when they outperform MICE, with the best of the three underlined; likewise, DPP-MissForest and detDPP-MissForest are shown in bold when they outperform MissForest, with the best of the three underlined.

Dataset | Missingness | Set | MICE | DPP-MICE | detDPP-MICE | MissForest | DPP-MissForest | detDPP-MissForest
SYNTH | MCAR | H1 | 0.8318 ± 0.0113 | 0.835 ± 0.0083 | 0.8352 | 0.8525 ± 0.0044 | 0.8552 ± 0.0049 | 0.8582
SYNTH | MCAR | H2 | 0.8316 ± 0.008 | 0.8369 ± 0.0128 | 0.84 | 0.8465 ± 0.0057 | 0.849 ± 0.003 | 0.8491
SYNTH | MCAR | H3 | 0.8205 ± 0.0127 | 0.8266 ± 0.0096 | 0.8272 | 0.8436 ± 0.0031 | 0.8452 ± 0.0048 | 0.855
SYNTH | MNAR | H1 | 0.8903 ± 0.0046 | 0.8915 ± 0.007 | 0.8934 | 0.7133 ± 0.0063 | 0.7171 ± 0.01 | 0.7185
SYNTH | MNAR | H2 | 0.8755 ± 0.01 | 0.8745 ± 0.0072 | 0.8955 | 0.7052 ± 0.0036 | 0.7124 ± 0.0078 | 0.7167
SYNTH | MNAR | H3 | 0.9003 ± 0.0059 | 0.9005 ± 0.006 | 0.9041 | 0.769 ± 0.0103 | 0.7773 ± 0.0129 | 0.7905
MIMIC | MCAR | H1 | 0.7621 ± 0.0046 | 0.7628 ± 0.0049 | 0.7641 | 0.7687 ± 0.0012 | 0.77 ± 0.0013 | 0.771
MIMIC | MCAR | H2 | 0.7541 ± 0.0037 | 0.7532 ± 0.0047 | 0.7619 | 0.7649 ± 0.0019 | 0.777 ± 0.0019 | 0.7707
MIMIC | MCAR | H3 | 0.7365 ± 0.0055 | 0.7394 ± 0.0052 | 0.7471 | 0.7485 ± 0.001 | 0.7507 ± 0.0017 | 0.7515
MIMIC | MNAR | H1 | 0.77 ± 0.0026 | 0.7717 ± 0.0036 | 0.7722 | 0.6616 ± 0.0065 | 0.6715 ± 0.07 | 0.6760
MIMIC | MNAR | H2 | 0.777 ± 0.0064 | 0.7818 ± 0.0029 | 0.7812 | 0.6748 ± 0.0045 | 0.6778 ± 0.0048 | 0.6798
MIMIC | MNAR | H3 | 0.7324 ± 0.0047 | 0.7363 ± 0.0031 | 0.7403 | 0.6368 ± 0.0034 | 0.64 ± 0.004 | 0.6419
  1. AUC = area under the receiver operating curve; MCAR = missing completely at random; MNAR = missing not at random.

Table 2
AUC results on the different holdout sets after imputation using MICE, DPP-MICE, and detDPP-MICE.

In the case of MICE and DPP-MICE, the boxplots correspond to 10 AUC values for 10 iterations of the same imputation and classification algorithms, depicting the lower and upper quartiles as well as the median of these 10 values. The AUC values are the same for every iteration of the detDPP-MICE algorithm.

(Boxplot grid: rows SYNTH and MIMIC; columns MCAR and MNAR.)
  1. AUC = area under the receiver operating curve; MCAR = missing completely at random; MNAR = missing not at random.

Table 3
AUC results on the different holdout sets after imputation using MissForest, DPP-MissForest, and detDPP-MissForest.

In the case of MissForest and DPP-MissForest, the boxplots correspond to 10 AUC values for 10 iterations of the same imputation and classification algorithm, depicting the lower and upper quartiles as well as the median of these 10 values. The AUC values are always the same for every iteration of the detDPP-MissForest algorithm.

(Boxplot grid: rows SYNTH and MIMIC; columns MCAR and MNAR.)
  1. AUC = area under the receiver operating curve; MCAR = missing completely at random; MNAR = missing not at random.

Table 4
Data matrix sizes used by the quantum determinantal point processes (DPP) circuits to train each tree.

The number of rows corresponds to the number of data points and is equal to the number of qubits of every circuit.

Batch size | Tree 1 | Tree 2 | Tree 3 | Tree 4
7 | (7, 2) | (5, 2) | - | -
8 | (8, 2) | (6, 2) | (4, 2) | -
10 | (10, 2) | (8, 2) | (6, 2) | (4, 2)
Table 5
Hardware results using the IBM quantum processor, depicting AUC results of the downstream classifier task after imputing missing values using DPP-MissForest.

In the case of MissForest and the quantum hardware DPP-MissForest implementations, the boxplots correspond to 10 AUC values for 10 iterations of the same imputation and classification algorithm, depicting the lower and upper quartiles as well as the median of these 10 values. The AUC values are the same for every iteration of the quantum DPP-MissForest algorithm using the simulator.

(Boxplot grid: rows MCAR SYNTH, MCAR MIMIC, MNAR SYNTH, and MNAR MIMIC; columns batch size 7 with 2 trees, batch size 8 with 3 trees, and batch size 10 with 4 trees.)
  1. AUC = area under the receiver operating curve; MCAR = missing completely at random; MNAR = missing not at random.

Table 6
Numerical quantum hardware results showing the AUC results of the downstream classifier task on reduced datasets.

Values are expressed as mean ± SD over 10 values for each experiment.

Dataset | Missingness | Batch size | Trees | MissForest | detDPP-MissForest (simulator) | detDPP-MissForest (hardware)
SYNTH | MCAR | 7 | 2 | 0.868 ± 0.0302 | 0.9026 | 0.8598 ± 0.021
SYNTH | MCAR | 8 | 3 | 0.8667 ± 0.0342 | 0.9256 | 0.8923 ± 0.027
SYNTH | MCAR | 10 | 4 | 0.8725 ± 0.0275 | 0.9028 | 0.8902 ± 0.024
SYNTH | MNAR | 7 | 2 | 0.7122 ± 0.0264 | 0.78 | 0.7149 ± 0.02
SYNTH | MNAR | 8 | 3 | 0.7153 ± 0.022 | 0.729 | 0.7036 ± 0.0167
SYNTH | MNAR | 10 | 4 | 0.7258 ± 0.0157 | 0.7868 | 0.7082 ± 0.036
MIMIC | MCAR | 7 | 2 | 0.7127 ± 0.038 | 0.7522 | 0.7117 ± 0.0315
MIMIC | MCAR | 8 | 3 | 0.7136 ± 0.03 | 0.7728 | 0.7448 ± 0.0258
MIMIC | MCAR | 10 | 4 | 0.6968 ± 0.03 | 0.7327 | 0.7262 ± 0.0299
MIMIC | MNAR | 7 | 2 | 0.7697 ± 0.0133 | 0.7794 | 0.7742 ± 0.0108
MIMIC | MNAR | 8 | 3 | 0.7713 ± 0.0112 | 0.7943 | 0.767 ± 0.0125
MIMIC | MNAR | 10 | 4 | 0.7712 ± 0.0116 | 0.7922 | 0.7675 ± 0.01545
  1. AUC = area under the receiver operating curve; MCAR = missing completely at random; MNAR = missing not at random.

Table 7
Summary of the characteristics of the different quantum determinantal point processes (DPP) circuits.

NN = nearest neighbor connectivity.

Clifford loader | Hardware connectivity | Depth | # of RBS gates
Diagonal | NN | 2nd | 2nd
Semi-diagonal | NN | nd | 2nd
Parallel | All-to-all | 4d log(n) | 2nd
Table 8
Complexity comparison of d-DPP sampling algorithms, both classical (Mahoney et al., 2019) and quantum (Kerenidis and Prakash, 2022).

The problem considered is DPP sampling of d rows from an n×d matrix, where n=O(d). For the quantum case, we provide both the depth and the size of the circuits.

  | Classical | Quantum
Preprocessing | O(d³) | O(d³)
Sampling | Õ(d³) | Õ(d) depth, Õ(d²) gates
  1. DPP = determinantal point processes.

Additional files


  1. Skander Kazdaghli
  2. Iordanis Kerenidis
  3. Jens Kieckbusch
  4. Philip Teare
(2024)
Improved clinical data imputation via classical and quantum determinantal point processes
eLife 12:RP89947.
https://doi.org/10.7554/eLife.89947.3