Computational and Systems Biology

Improved clinical data imputation via classical and quantum determinantal point processes

Skander Kazdaghli
Iordanis Kerenidis
Jens Kieckbusch
Philip Teare

QC Ware, Palo Alto, USA and Paris, France
Université de Paris, CNRS, IRIF, 8 Place Aurélie Nemours, Paris 75013, France
Emerging Innovations Unit, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca, Cambridge, UK
Centre for AI, Data Science & AI, BioPharmaceuticals R&D, AstraZeneca, Cambridge, UK

https://doi.org/10.7554/eLife.89947.2

Figures and data

Example of overall workflow for patient management through clinical data imputation and downstream classification.

Imputation and downstream classification procedure to benchmark the imputation method’s performance.
First, the imputer is trained on the whole observed dataset X, as shown in step (a). In step (b), the imputed data is split into three consecutive folds (holdout sets H1, H2 and H3). A classifier is then trained on each combination of two holdout sets (development sets D1, D2 and D3), and the AUC is calculated for each holdout set.
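
The fold construction in step (b) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and it assumes that each development set D_i is simply the union of the two folds other than H_i.

```python
def consecutive_folds(n_samples, n_folds=3):
    """Split range(n_samples) into n_folds consecutive index blocks."""
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        size = base + (1 if i < extra else 0)  # spread any remainder over the first folds
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def development_holdout_pairs(n_samples, n_folds=3):
    """Yield (development, holdout) index lists, one pair per fold.

    Assumption: development set D_i is the union of all folds except H_i.
    """
    folds = consecutive_folds(n_samples, n_folds)
    for i, holdout in enumerate(folds):
        development = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield development, holdout


pairs = list(development_holdout_pairs(10))
```

A classifier would be fitted on the rows indexed by each development list and scored (for example by AUC) on the matching holdout list.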

AUC results for the SYNTH and MIMIC-III datasets, with MCAR and MNAR missingness, three holdout sets, and six different imputation methods.
Values are expressed as mean ± SD (standard deviation) of 10 values for each experiment. DPP-MICE and detDPP-MICE are shown in bold when they outperform MICE, and the best of the three is underlined. DPP-MissForest and detDPP-MissForest are shown in bold when they outperform MissForest, and the best of the three is underlined.

AUC results on the different holdout sets after imputation using MICE, DPP-MICE and detDPP-MICE.
In the case of MICE and DPP-MICE, the boxplots correspond to 10 AUC values for 10 iterations of the same imputation and classification algorithms, depicting the lower and upper quartiles as well as the median of these 10 values. The AUC values are the same for every iteration of the detDPP-MICE algorithm.

AUC results on the different holdout sets after imputation using MissForest, DPP-MissForest and detDPP-MissForest.
In the case of MissForest and DPP-MissForest, the boxplots correspond to 10 AUC values for 10 iterations of the same imputation and classification algorithm, depicting the lower and upper quartiles as well as the median of these 10 values. The AUC values are always the same for every iteration of the detDPP-MissForest algorithm.

IBM Hanoi 27-qubit quantum processor

Data matrix sizes used by the quantum DPP circuits to train each tree.
The number of rows corresponds to the number of data points and is equal to the number of qubits of every circuit.

Hardware results using the IBM quantum processor, depicting AUC results of the downstream classifier task after imputing missing values using DPP-MissForest.
In the case of MissForest and the quantum hardware DPP-MissForest implementations, the boxplots correspond to 10 AUC values for 10 iterations of the same imputation and classification algorithm, depicting the lower and upper quartiles as well as the median of these 10 values. The AUC values are the same for every iteration of the quantum DPP-MissForest algorithm using the simulator.

Numerical quantum hardware results showing the AUC results of the downstream classifier task on reduced datasets.
Values are expressed as mean ± SD (standard deviation) of 10 values for each experiment.

Deterministic k-DPP algorithm.

The sampling and training procedure for the DPP-Random Forest algorithm. The dataset is divided into batches of similar size, and the DPP sampling algorithm is applied to every batch in parallel. The resulting samples are then combined to form larger datasets used to train the decision trees. Since the batches are fixed, DPP sampling can be easily parallelized, either classically or on quantum hardware.
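
The batching step described in this caption can be sketched classically as below. This is an illustrative assumption-laden version, not the paper's implementation: the helper names are hypothetical, batches are taken as consecutive row blocks, and the DPP draw is done by brute-force enumeration of all k-subsets, which is only practical for very small batches.

```python
from itertools import combinations
import numpy as np


def sample_k_dpp(batch, k, rng):
    """Draw k row indices with probability proportional to det(B_S B_S^T)."""
    subsets = list(combinations(range(len(batch)), k))
    weights = np.array(
        [np.linalg.det(batch[list(s)] @ batch[list(s)].T) for s in subsets]
    )
    weights = np.clip(weights, 0.0, None)  # clear tiny negative round-off
    choice = rng.choice(len(subsets), p=weights / weights.sum())
    return list(subsets[choice])


def dpp_training_set(X, batch_size, k, rng):
    """Split X into consecutive batches, sample k rows per batch, combine."""
    picked = []
    for start in range(0, len(X), batch_size):
        batch = X[start:start + batch_size]
        local = sample_k_dpp(batch, min(k, len(batch)), rng)
        picked.extend(start + i for i in local)  # map back to global row indices
    return X[picked], picked


rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))
sample, picked = dpp_training_set(X, batch_size=4, k=2, rng=rng)
```

Because each batch is sampled independently, the loop body could be dispatched to separate workers; the combined `sample` would then be the training set for one decision tree.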

Deterministic DPP sampling procedure for training decision trees.
At each step, a decision tree is trained using the sample with the highest determinantal probability; that sample is then removed from the original batch before continuing to the next decision tree.
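
A small classical sketch of this deterministic procedure is given below. It assumes that "highest determinantal probability" means the k-subset S maximising det(B_S B_S^T), and it finds that subset by exhaustive search, which is feasible only for small batches. The function names are hypothetical and do not come from the paper.

```python
from itertools import combinations
import numpy as np


def most_probable_subset(batch, k):
    """Return the k row indices maximising det(B_S B_S^T) by brute force."""
    best = max(
        combinations(range(len(batch)), k),
        key=lambda s: np.linalg.det(batch[list(s)] @ batch[list(s)].T),
    )
    return list(best)


def deterministic_dpp_samples(batch, k, n_trees):
    """Pick one subset per tree, removing each chosen subset from the batch."""
    remaining = list(range(len(batch)))  # indices into the original batch
    samples = []
    for _ in range(n_trees):
        if len(remaining) < k:
            break  # not enough rows left for another sample
        local = most_probable_subset(batch[remaining], k)
        chosen = [remaining[i] for i in local]
        samples.append(chosen)
        remaining = [i for i in remaining if i not in chosen]
    return samples


batch = np.random.default_rng(1).normal(size=(8, 2))
samples = deterministic_dpp_samples(batch, k=2, n_trees=3)
```

Since no randomness enters the selection, repeated runs on the same batch return the same subsets, which is consistent with the identical AUC values reported across iterations for the deterministic variants.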

Types of data loaders. Each line corresponds to a qubit.
Each vertical line connecting two qubits corresponds to an RBS gate. We also use X, Z and CZ gates. The depth of the first two loaders is linear in the number of qubits, and that of the last one is logarithmic.
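
To illustrate why a chain of two-qubit gates gives linear depth, the sketch below computes rotation angles for one possible linear-depth layout: n − 1 gates applied to neighbouring qubits, loading a unit vector x into the amplitudes of the unary basis states. It assumes each RBS gate acts as a plane rotation sending |10⟩ to cos θ |10⟩ + sin θ |01⟩; the sign convention and the function names are assumptions made for illustration, and only the classical angle computation is shown.

```python
import math


def diagonal_loader_angles(x):
    """Angles for a chain of n-1 two-qubit rotations that loads unit vector x."""
    n = len(x)
    angles = []
    for i in range(n - 2):
        tail = math.sqrt(sum(v * v for v in x[i + 1:]))  # norm of the remaining entries
        angles.append(math.atan2(tail, x[i]))            # angle in [0, pi], so sin >= 0
    angles.append(math.atan2(x[-1], x[-2]))  # last angle carries the sign of x[-1]
    return angles


def amplitudes_from_angles(angles):
    """Unary amplitudes produced by applying the rotations in sequence."""
    amps, carry = [], 1.0
    for theta in angles:
        amps.append(carry * math.cos(theta))  # amplitude left on the current qubit
        carry *= math.sin(theta)              # amplitude passed to the next qubit
    amps.append(carry)
    return amps


x = [0.5, -0.5, 0.5, -0.5]
angles = diagonal_loader_angles(x)
recovered = amplitudes_from_angles(angles)
```

Each gate depends on the output of the previous one, so the n − 1 gates cannot be run simultaneously; a tree-shaped arrangement of the same kind of gate is what allows logarithmic depth.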

Summary of the characteristics of the different quantum DPP circuits.
NN = Nearest Neighbor connectivity.

Quantum Determinant Sampling circuit for an orthogonal matrix A = (a¹, …, a^d).
It uses the Clifford loader, which is a unitary quantum operator defined for x ∈ ℝⁿ.

Complexity comparison of d-DPP sampling algorithms, both classical [19] and quantum [12].
The problem considered is DPP sampling of d rows from an n × d matrix where n = O(d). For the quantum case we provide both the depth and the size of the circuits.
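
The sampling problem stated above can be checked numerically on a toy instance. The sketch assumes the usual determinant-sampling distribution, in which a subset S of d rows of an n × d matrix A with orthonormal columns is drawn with probability det(A_S)²; by the Cauchy–Binet formula these values sum to det(AᵀA) = 1. The sizes n = 6 and d = 3 are arbitrary choices for illustration.

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 3

# Reduced QR gives an n x d matrix with orthonormal columns.
A, _ = np.linalg.qr(rng.normal(size=(n, d)))

# Probability of each d-subset of rows under determinant sampling.
probs = {
    S: np.linalg.det(A[list(S)]) ** 2
    for S in combinations(range(n), d)
}
total = sum(probs.values())  # Cauchy-Binet: equals det(A.T @ A) = 1
```

Enumerating all subsets as done here scales combinatorially in n and d, which is why dedicated classical and quantum sampling algorithms are compared in the table.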
