Improved clinical data imputation via classical and quantum determinantal point processes
Abstract
Imputing data is a critical issue for machine learning practitioners, including in the life sciences domain, where missing clinical data is a typical situation and the reliability of the imputation is of great importance. Currently, there is no canonical approach for imputation of clinical data and widely used algorithms introduce variance in the downstream classification. Here we propose novel imputation methods based on determinantal point processes (DPP) that enhance popular techniques such as the multivariate imputation by chained equations and MissForest. Their advantages are twofold: improving the quality of the imputed data demonstrated by increased accuracy of the downstream classification and providing deterministic and reliable imputations that remove the variance from the classification results. We experimentally demonstrate the advantages of our methods by performing extensive imputations on synthetic and real clinical data. We also perform quantum hardware experiments by applying the quantum circuits for DPP sampling since such quantum algorithms provide a computational advantage with respect to classical ones. We demonstrate competitive results with up to 10 qubits for small-scale imputation tasks on a state-of-the-art IBM quantum processor. Our classical and quantum methods improve the effectiveness and robustness of clinical data prediction modeling by providing better and more reliable data imputations. These improvements can add significant value in settings demanding high precision, such as in pharmaceutical drug trials where our approach can provide higher confidence in the predictions made.
eLife assessment
The methods presented in this work provide modest yet consistent accuracy improvements for data classification tasks where certain data are missing. The authors also present a way to use quantum computers for this task. The methodology and results for the classical (non-quantum) case are solid, although evidence for the practical quantum advantage via their approach in 'next generation' quantum computers remains incomplete. The results are valuable and should interest data scientists, life scientists and anyone working in quantum computing.
https://doi.org/10.7554/eLife.89947.3.sa0
Introduction
Missing data is a recurring problem in machine learning and in particular for clinical datasets, where it is common that numerous feature values are not present for reasons including incomplete data collection and discrepancies in data formats and data corruption (Luo, 2022; Emmanuel et al., 2021; Pedersen et al., 2017; Myers, 2000). Machine learning is routinely used in life science and clinical research for prediction tasks, such as diagnostics (Qin et al., 2019) and prognostics (Booth et al., 2021), as well as estimation tasks, such as biomarker proxies (Wang et al., 2017) and digital biomarkers (Rendleman et al., 2019). Beyond the research setting, machine learning is becoming more commonplace as regulated Software as a Medical Device, where machine learning models are influencing – or making – clinical decisions that affect patient care.
Machine learning algorithms typically require complete datasets and missing values can significantly affect the quality of the machine learning models trained on such data. This is in large part due to the fact that there can be different underlying reasons for the missingness: for example, feature values can be missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR), each one with their own characteristics.
Despite its importance for clinical trials, there is no canonical approach for dealing with missingness and finding appropriate, effective and reproducible methods remains a challenge. A common way to deal with missing clinical data is to exclude subjects that do not have the complete set of feature values present. A drawback of this approach is that excluding subjects can in fact introduce significant biases in the final model. For example, it can result in the model being trained to be more effective for the type of subjects that are likely to have complete data than for those that do not. Moreover, the effectiveness and reliability of clinical trials are reduced when subjects with missing feature values are excluded from the clinical trial.
Data imputation is an alternative to the complete dataset approach, where subjects with missing feature values are not excluded. Instead, missing values are imputed to create a complete dataset that is then used for a classification task as shown in Figure 1. There are different ways to achieve this, including ‘filling’ missing values with zeros, or with the mean value of the feature across all subjects that have such a value present. These methods provide consistent imputation results, but there are important caveats for using such simple methods since they ignore possible correlations between features and can make the dataset appear more homogeneous than it really is. More advanced data imputation methods have been proposed in the literature: iterative methods include the multivariate imputation by chained equations (MICE) (Groothuis-Oudshoorn, 2011) and MissForest (Stekhoven and Bühlmann, 2012) algorithms, and deep learning methods include GAIN (generative adversarial imputation nets) (Yoon and Jordon, 2018) and MIWAE (missing data importance-weighted autoencoder) (Mattei and Frellsen, 2019). Recent results (Shadbahr et al., 2022) have shown that, for clinical data, two iterative imputation methods provide the best results: MiceRanger, which uses predictive mean matching, and MissForest, which uses Random Forests to predict the missing values of each feature using the other features. Both are used here as baselines.
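To make the homogeneity caveat concrete, the following minimal sketch applies mean imputation to a single column and compares the variance before and after. The column name and values are hypothetical and chosen only for illustration.

```python
import statistics

def mean_impute(column):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in column]

# Hypothetical feature column with three missing values (None).
heart_rate = [62.0, None, 88.0, 71.0, None, 95.0, 58.0, None]

imputed = mean_impute(heart_rate)
observed = [v for v in heart_rate if v is not None]

# Imputed entries sit exactly at the mean, so they add no spread:
# the variance shrinks by the fraction of observed entries (5/8 here).
var_observed = statistics.pvariance(observed)
var_imputed = statistics.pvariance(imputed)
print(var_observed, var_imputed)
```

The imputed column has the same mean but a smaller variance, which is one way in which simple filling makes the data look more homogeneous than they are.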
Several metrics are routinely used to quantify the quality of data imputation: pointwise discrepancy measures include root mean square error, mean absolute error, and coefficient of determination ($R^{2}$). Feature-wise discrepancy measures include Kullback–Leibler divergence, two-sample Kolmogorov–Smirnov statistic or (2-)Wasserstein distance. Ultimately, the quality and reliability of imputations can be measured by the performance of a downstream predictor, which is usually the area under the receiver operating characteristic curve (AUC) for a classification task. In practical terms, the performance of the downstream classifier is usually of highest importance for clinical datasets: for example, in one of our datasets, the classifier predicts a binary outcome of a critical care unit stay (e.g., survival) for each patient. Accordingly, we have used AUC for the classification task here on different holdout sets (see Figure 2) to assess the performance of our novel methods.
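For readers less familiar with the AUC, the sketch below computes it directly from its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The labels and scores are made-up values, and the quadratic loop is meant for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.1, 0.5]
auc = roc_auc(labels, scores)
print(auc)  # 7 of the 9 positive/negative pairs are ordered correctly
```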
In order to increase the resulting AUC, we combine the MissForest and MiceRanger imputation methods with determinantal sampling, based on determinantal point processes (DPP) (Dereziński and Mahoney, 2021; Kulesza and Taskar, 2011), which favors diverse samples and thus reduces the variance of the training of each decision tree, which in turn provides more accurate models. In essence, determinantal sampling picks subsets of data according to a distribution that gives more weight to subsets of data that contain diverse data points. More precisely, each subset of data points is picked according to the volume encapsulated by these data points. The determinantal distribution increases the attention given to uncommon or out-of-the-ordinary data points rather than biasing the learning process towards the more commonly found data, which can improve the overall prediction accuracy, in particular for unbalanced datasets, as is often the case for clinical data (Dereziński and Mahoney, 2021). Determinantal sampling for regression and classification tasks with full data has been proposed previously for linear regressors (Dereziński et al., 2018) and for Random Forest training for a financial data classification use case where it outperformed the standard Random Forest model (Thakkar et al., 2023). However, an inherent feature of standard Random Forest and determinantal sampling algorithms is randomness that produces data imputations that vary from one run of the algorithm to the next. This is often undesirable since the downstream classification performance can also be affected, which motivated us to apply a deterministic version of determinantal sampling (Schreurs and Suykens, 2021) within the Random Forests of the imputation methods to provide more robust and reliable imputations.
Through deterministic determinantal sampling, we address two challenges in data imputation: first, we provide improved data imputation methods that can increase the performance of the downstream classifier; and second, we remove the variance of the common stochastic and multiple imputation methods, thus ensuring reproducibility, easier integration in machine learning workflows, and compliance with healthcare regulations. While these improvements are of particular relevance for clinical data, our algorithms can also be advantageous for other imputation tasks where improving downstream classification and removing variance is of importance.
In order to demonstrate this improvement, we apply our methods to two classification datasets: a synthetic dataset and a public clinical dataset where the predicted outcome is the survival of the patient.
In addition, we explore the potential of quantum computing to speed up these novel imputation methods: we provide a quantum circuit implementation of the determinantal sampling algorithm that offers a computational advantage compared to its classical counterpart. The best classical algorithms for determinantal sampling take in practice cubic time in the number of features to provide a sample (Dereziński and Mahoney, 2021). In contrast, the quantum algorithm we present here, based on theoretical analysis in Kerenidis and Prakash, 2022, has running time that scales linearly with the number of features. We measure running time as the depth of the necessary quantum circuits, given that the quantum processing units that are being developed currently offer the possibility of performing parallel operations on disjoint qubits.
This suggests that with the advent of next-generation quantum computers with more and better qubits, one could also expect a computational speedup in performing determinantal sampling using a quantum computer. Here, we demonstrate competitive results with up to 10 qubits for small-scale imputation tasks on a state-of-the-art IBM quantum processor.
This work combines classical (Dereziński and Mahoney, 2021) and quantum (Kerenidis and Prakash, 2022) DPP algorithms with widely used data imputation methods, resulting in novel data imputation algorithms that can improve performance on classical computers while also having the potential of a quantum speedup in the future.
Results
In ‘Methods’, we provide a detailed description of our four imputation methods, DPP-MICE, DPP-MissForest, detDPP-MICE, and detDPP-MissForest. All of them are based on iterative imputation methods that use the observed values of every column to predict the missing values. The model used to fill missing values in each column is the Random Forest classifier. Our imputation methods replace the standard Random Forest used by the original miceRanger and MissForest imputers by the DPP-Random Forest model, for our first two imputers, and the detDPP-Random Forest for the latter two. The DPP-Random Forest model subsamples the data for each decision tree using determinantal sampling instead of uniform sampling, while the detDPP-Random Forest model deterministically picks for each decision tree the subset of data that has the maximum probability according to the determinantal distribution. We also demonstrate a computationally advantageous way to perform the determinantal sampling on quantum computers.
In order to benchmark the different imputation methods, we used two types of datasets with a categorical outcome variable. First, a synthetic dataset, created using the scikit-learn method make_classification. It consists of 2000 rows with 25 informative features. This is useful to study the imputation quality where features have equal importance. Second, the MIMIC-III dataset (Johnson et al., 2016): the Medical Information Mart for Intensive Care (MIMIC) dataset, which is a freely available clinical database. It comprises data for patients who stayed in critical care units at the Beth Israel Deaconess Medical Center between 2001 and 2012. It contains the data of 7214 patients with 14 features.
We also applied two types of missingness on these datasets: MCAR, where the missingness distribution is independent of any observed or unobserved variable, and MNAR, where the missingness distribution depends on the outcome variable. We expect similar results to hold for the MAR case as well, but it was not considered in this work.
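The two mechanisms can be illustrated with a short sketch. The drop rates, array sizes, and function names below are assumptions made for illustration only and are not the settings used in the experiments.

```python
import numpy as np

def mcar_mask(shape, rate, rng):
    """MCAR: every entry is dropped with the same probability `rate`."""
    return rng.random(shape) < rate

def mnar_mask(y, n_features, rate_pos, rate_neg, rng):
    """Outcome-dependent missingness: the drop rate depends on the label y."""
    rates = np.where(y == 1, rate_pos, rate_neg)   # one rate per row
    return rng.random((len(y), n_features)) < rates[:, None]

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))          # illustrative feature matrix
y = rng.integers(0, 2, size=2000)       # illustrative binary outcome

# Masked entries become NaN, the usual marker for missing values.
X_mcar = np.where(mcar_mask(X.shape, 0.2, rng), np.nan, X)
X_mnar = np.where(mnar_mask(y, X.shape[1], 0.4, 0.1, rng), np.nan, X)

print(np.isnan(X_mcar).mean())
print(np.isnan(X_mnar[y == 1]).mean(), np.isnan(X_mnar[y == 0]).mean())
```

Under MCAR the missing fraction is the same for both outcome classes, whereas in the outcome-dependent case the two classes lose different fractions of their values.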
We present the numerical results in terms of the AUC of the downstream classification task in Table 1 and provide graphs of the results in Tables 2 and 3. Each experiment was run 10 times with different random seeds to get the variance of the results.
Overall, DPP-MICE and DPP-MissForest provide improved results compared to their classical baseline MICE and MissForest. This is the case for both the synthetic and the MIMIC datasets and for both MCAR and MNAR missingness. Even more interestingly, the detDPP-MICE and detDPP-MissForest collapse the variance of the imputed data to 0 and moreover lead in most cases to even higher AUC than the expectation of the previous methods.
DPP-MICE and detDPP-MICE outperform MICE
We present the performance results of MICE-based methods in terms of the AUC of the downstream classification task using an XGBoost classifier, which has been shown to be the strongest classifier for such datasets (Shadbahr et al., 2022). We used the default parameters of the classifier since our focus is comparing the different imputation methods. In each case, the original dataset with induced missing values is imputed using MICE, DPP-MICE, or detDPP-MICE, then it is divided into three folds of development/holdout sets. The downstream classifier is then trained on each development set and its performance is measured by the AUC for the corresponding holdout set. The results are shown in Table 1 and in the figures in Table 2.
The imputation procedure is performed for a total of 10 iterations over all the columns and for each column, a (DPP) Random Forest regressor is trained using 10 trees. For each Random Forest training, the dataset is divided into batches of 150 points each and DPPs are used to sample from every batch.
The results show that, across the 12 dataset experiments, DPP-MICE outperforms MICE in expectation in 10, while detDPP-MICE provides a single deterministic imputation that outperforms the expected result from MICE in all 12 experiments and that from DPP-MICE in 11 out of 12.
DPP-MissForest and detDPP-MissForest outperform MissForest
Here we present the performance results of MissForest-based methods in terms of the AUC of the downstream classification task, again using an XGBoost classifier. In each case, the original dataset with induced missing values is imputed using MissForest, DPP-MissForest, or detDPP-MissForest, then it is divided into three folds of development/holdout sets. The downstream classifier is again trained on each development set and its performance is measured by the AUC for the corresponding holdout set. The results are shown in Table 1 and in the figures in Table 3. The specifics of the Random Forest training are the same as in the case of MICE.
The results show that DPP-MissForest outperforms MissForest in all 12 experiments, while detDPP-MissForest provides a single deterministic imputation that outperforms the expected result from MissForest in all 12 experiments and that from DPP-MissForest in 11 out of 12.
Quantum hardware implementation of DPP-MissForest results in competitive downstream classification
As we describe in ‘Methods’, quantum computers can in principle be used to offer a computational advantage in determinantal sampling. In order to better understand the state of the art of current quantum hardware, we used a currently available quantum computer to perform determinantal sampling within a DPP-MissForest imputation method for scaled-down versions of the synthetic and MIMIC datasets.
Reduced synthetic dataset: 100 points and three features, created using the sklearn method make_classification.
Reduced MIMIC dataset: 200 points and three features. The three features were chosen from the original dataset features based on low degree of missingness and their predictiveness of the downstream classifier, and they were ‘Oxygen saturation std’, ‘Oxygen saturation mean’, and ‘Diastolic blood pressure mean’.
For the purposes of our experiments, we used the ‘ibm_hanoi’ 27-qubit quantum processor shown in Figure 3. We implemented quantum circuits with up to 10 qubits. We also performed quantum simulations using the Qiskit noiseless simulator. The decision trees of the DPP-Random Forests used by the imputation models are trained using batches of decreasing sizes (see Table 4). For example, for the algorithm with batch size equal to 10, the algorithm first samples 2 out of the 10 data points to use for the first decision tree, then from the remaining 8 data points it picks another 2 for the second tree, then 2 from the remaining 6, and last 2 from the remaining 4. In other words, we train four different trees, and each time we use a quantum circuit with number of qubits equal to 10, 8, 6, and 4, to perform the respective determinantal sampling.
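The decreasing batch sizes in this example follow a simple rule, sketched below. The function name and its parameters are hypothetical; the only assumption is the one stated in the text, namely one qubit per remaining data point and removal of the sampled points after each tree.

```python
def qubit_schedule(batch_size, points_per_tree, n_trees):
    """Qubits needed for each successive tree when sampled points are removed."""
    schedule = []
    remaining = batch_size
    for _ in range(n_trees):
        if remaining < points_per_tree:
            break                        # not enough points left for another tree
        schedule.append(remaining)       # one qubit per remaining data point
        remaining -= points_per_tree     # sampled points leave the batch
    return schedule

print(qubit_schedule(batch_size=10, points_per_tree=2, n_trees=4))  # [10, 8, 6, 4]
```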
In the figures of Table 5 and in Table 6, we provide for the different dataset experiments the AUC for MissForest, the simulated results of the quantum version of DPP-MissForest, and the actual hardware experimental results of running the quantum version of DPP-MissForest. Even for these very small datasets, when simulating the quantum version of DPP-MissForest, we demonstrate an increase in the AUC compared to the MissForest algorithm. This further highlights the potential advantages of determinantal sampling within imputation methods. Of note, running our algorithms on current hardware introduces variance in the downstream classifier. Importantly, this variance is due to noise in the hardware rather than inherent to the algorithm.
Our quantum hardware results are competitive with standard methods and in many cases close to the values expected from the simulation. In some cases, we observed a clear deterioration of the AUC due to the noise and errors in the quantum hardware. The results are closer to the simulations when using MCAR missingness with larger batch sizes that use more trees both for synthetic and the MIMIC datasets. As explained above, even though the algorithm with batch size 10 means using a quantum circuit with 10 qubits, the fact that we use four trees overall with a decreasing number of data points each time, and thus a decreasing number of qubits (namely, 10, 8, 6, and 4), results in an overall more reliable imputation.
Discussion
Missing data is a critical issue for machine learning practitioners as complete datasets are usually required for training machine learning algorithms. To achieve complete datasets, missing values are usually imputed. In the case of clinical data, missing values and imputation can be a potential source of bias and can considerably influence the robustness and interpretability of results. Nevertheless, there is no canonical way to deal with missing data, which makes improvements in data imputation methods an attractive and impactful approach to increase the effectiveness and reliability of clinical trials. In this proof-of-concept study, we assessed the downstream consequences of implementing such improvements, focusing on MCAR and MNAR to assess the usefulness of our approach. MNAR and MCAR represent two extreme cases of missingness with importance for clinical data imputation applications.
Determinantal point process methods increase the diversity of the data picked to train the models, showcasing also that data gathering and preprocessing are important to remove biases related to over-representation of particular data types. This is more important when dealing with unbalanced datasets, as is often the case with clinical data. Determinantal sampling is an important tool not only for Random Forest models, but also for linear regression, where data diversity results in more robust and fair models (Dereziński et al., 2018). Moreover, such sampling methods based on DPP are computationally intensive and quantum computers are expected to be useful in this case: quantum computers offer an asymptotic speedup for performing this sampling, and it is expected that next-generation quantum computers will provide a speedup in practice.
We show that, as expected, the quantum version of detDPP-MissForest does not introduce any variance in the downstream classifier when simulated in the absence of hardware noise. While the AUC improvements achieved in our experiments may seem modest, it is the consistency of improvements we observed in our simulation results coupled with removal of variance that makes our approach attractive for clinical data applications where these characteristics are extremely desirable. When implemented on quantum hardware, we observed variance that is caused by the noise in the hardware itself. More precisely, the output of the quantum circuit is not a sample from the precise determinantal distribution but from a noisy version of it, and this noise depends on the particular quantum circuit implemented and the quality of the hardware. Thus when attempting to compute the highest probability element using samples from the quantum circuit on current hardware, the result is not deterministic. Importantly, unlike for standard MissForest, this variance is not inherent in the algorithm and is expected to reduce considerably with the advent of better quality quantum computers. The quantum circuits needed to efficiently perform determinantal sampling require a number of qubits equal to the batch size used for each decision tree within the Random Forest training and the depth of the quantum circuit is roughly proportional to the number of features. As an example, if we would like to perform the quantum version of the determinantal imputation methods we used for MIMIC-III, then we would need a quantum computer with 150 qubits (the batch size) that can be reliably used to perform a quantum circuit of depth around 400 (the depth is given by $4d\log n$, where $n=150$ is the batch size and $d=14$ is the number of features; Kerenidis and Prakash, 2022).
While quantum hardware with a few hundred qubits that can perform computations of a few hundred steps is not available right now, it seems quite possible that it will be available in the not-so-distant future. In the meantime, further optimization could also help reduce the quantum resources needed for such imputation methods.
While our DPP-based imputation methods can run classically on small datasets such as MIMIC-III, they are computationally intensive and are hard to parallelize due to the sequential nature of the algorithm. This results in less and less efficient imputation for larger datasets where DPP sampling is applied to bigger batches. For example, when a DPP-MICE imputation is run on a dataset of 200 features and batches of size 400, then the training is expected to take multiple hours on a single GPU. The quantum DPP algorithm therefore could provide a way to speed up the hardest part of the imputer using a next-generation quantum computer. For instance, if $d=200$, and batch size is 400, the number of qubits will be 400 and the depth of the quantum circuit would be ≈ 6400, whereas it would take $\sim 8\times 10^{6}$ classical steps for DPP sampling. These are of course simply illustrative calculations and will require more detailed analysis as these machines become available and will need to include parameters such as clock speeds and error correction overheads. Only then can it be experimentally proven that this theoretical asymptotic speedup can translate to a practical speedup for this particular algorithm.
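These back-of-the-envelope figures can be reproduced with a few lines. The sketch assumes a base-2 logarithm in the depth formula $4d\log n$, which the text does not state explicitly, and takes $d^{3}$ as the classical per-sample cost; the function names are illustrative.

```python
import math

def quantum_depth(n_batch, d_features):
    """Circuit depth estimate 4 * d * log2(n); the base-2 log is an assumption."""
    return 4 * d_features * math.log2(n_batch)

def classical_sampling_steps(d_features):
    """Order-of-magnitude classical cost d^3 per DPP sample."""
    return d_features ** 3

print(round(quantum_depth(150, 14)))     # 405, the MIMIC-III setting
print(round(quantum_depth(400, 200)))    # 6915, the larger illustrative setting
print(classical_sampling_steps(200))     # 8000000
```

Under this assumption the MIMIC-III setting gives a depth of roughly 400, as quoted. For $d=200$ and $n=400$ the formula gives about 6900, and the quoted figure of about 6400 is recovered if $\log n$ is rounded down to 8.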
In summary, here we propose novel data imputation methods that, first, improve the widely used iterative imputation methods – MiceRanger and MissForest – as measured by the AUC of a downstream classifier; second, remove the variance of the imputation methods, thus ensuring reproducibility and simpler integration into machine learning workflows; and third, become even more efficient when run on quantum computers. Based on our results, we anticipate an impact of our algorithms on the reliability of models in high-precision settings, including in pharmaceutical drug trials where they can provide higher confidence in the predictions made by eradicating the stochastic variance due to multiple imputations. In addition, tasks that are currently overwhelmed by the challenges of missingness, a common problem in real-world evidence investigations, become more tractable through the approaches introduced here, since detDPP-MICE and detDPP-MissForest can yield improved performance in the face of missingness.
Methods
Determinantal point processes (DPPs)
Given a set of items $\mathcal{Y}=\{y_{1},\dots,y_{N}\}$, a point process $\mathcal{P}$ is a probability distribution over all subsets of the set $\mathcal{Y}$. It is called a determinantal point process (DPP) if, for a random subset $Y$ drawn from $\mathcal{Y}$ according to $\mathcal{P}$ and every $T\subseteq\mathcal{Y}$, we have
$$\Pr(T\subseteq Y)=\det(K_{T,T}),$$
where $K$ is a real symmetric $N\times N$ matrix, and $K_{T,T}$ is its submatrix whose rows and columns are indexed by $T$. The matrix $K$ is called the marginal kernel of $Y$.
For an $n\times d$ data matrix $A$ and $L=AA^{T}$, we define the $L$-ensemble $\mathrm{DPP}_{L}(L)$ as the distribution where the probability of sampling $T$ is
$$\Pr(T)\propto\det(L_{T,T})=\mathrm{Vol}^{2}(\{a_{i}:i\in T\}),$$
where $\mathrm{Vol}(\{a_{i}:i\in T\})$ is the volume of the parallelepiped spanned by the rows of $A$ indexed by $T$.
According to this distribution, the probability of sampling points that are similar and thus form a smaller volume is reduced in favor of samples that are more diverse.
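A two-point example makes this concrete. The sketch below evaluates the determinant of the $2\times 2$ Gram matrix of two rows, which equals the squared area they span; the vectors are arbitrary illustrative values.

```python
def pair_weight(a, b):
    """det of the 2x2 Gram matrix of rows a and b: the squared area they span."""
    aa = sum(x * x for x in a)
    bb = sum(x * x for x in b)
    ab = sum(x * y for x, y in zip(a, b))
    return aa * bb - ab * ab

similar = pair_weight([1.0, 0.0], [0.9, 0.1])   # nearly parallel rows
diverse = pair_weight([1.0, 0.0], [0.0, 1.0])   # orthogonal rows
print(similar, diverse)   # about 0.01 versus 1.0
```

The nearly parallel pair receives a weight roughly one hundred times smaller than the orthogonal pair, so it is correspondingly less likely to be sampled together.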
An $L$-ensemble is a determinantal point process with marginal kernel $K=L(I+L)^{-1}$.
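The relation between the two kernels, $K=L(I+L)^{-1}$, can be checked numerically on a tiny example by enumerating every subset. The sketch below uses the standard normalization $\Pr(Y=S)=\det(L_{S,S})/\det(I+L)$ for an $L$-ensemble; the matrix sizes and the chosen subset are arbitrary.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 3))          # 4 items, 3 features (illustrative sizes)
L = A @ A.T
N = L.shape[0]
K = L @ np.linalg.inv(np.eye(N) + L)

def subset_det(M, T):
    """Determinant of the principal submatrix M[T, T]; 1 for the empty set."""
    T = list(T)
    return np.linalg.det(M[np.ix_(T, T)]) if T else 1.0

# L-ensemble: Pr(Y = S) = det(L_S) / det(I + L), enumerated over all subsets S.
subsets = [S for r in range(N + 1) for S in itertools.combinations(range(N), r)]
Z = np.linalg.det(np.eye(N) + L)
prob = {S: subset_det(L, S) / Z for S in subsets}

# Marginal probability that T is contained in Y, by brute force, versus det(K_T).
T = (0, 2)
marginal = sum(p for S, p in prob.items() if set(T) <= set(S))
print(marginal, subset_det(K, T))
```

The two printed numbers agree up to floating-point error, and the enumerated probabilities sum to one.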
Stochastic $k$-DPPs
The distribution $k\text{-}\mathrm{DPP}_{L}(L)$ is defined as an $L$-ensemble which is constrained to subsets of size $|T|=k$.
Different algorithms have been proposed in the literature to sample from $k$-DPPs, for example Kulesza and Taskar, 2012, where sampling $d$ rows from an $N\times d$ matrix takes $O(Nd^{2})$ time. There have been improvements over this initial proposal, as in Mahoney et al., 2019, where there is a preprocessing cost of $O(Nd^{2})$ and each DPP sample requires $O(d^{3})$ arithmetic operations.
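For very small problems, a $k$-DPP can be sampled exactly by enumerating all subsets of size $k$. The sketch below does this as a reference implementation; it is exponential in the worst case and is not one of the efficient algorithms cited above. Function names and matrix sizes are illustrative.

```python
import itertools
import numpy as np

def kdpp_distribution(A, k):
    """Exact k-DPP over rows of A: Pr(T) proportional to det((A A^T)[T, T]), |T| = k."""
    L = A @ A.T
    subsets = list(itertools.combinations(range(A.shape[0]), k))
    weights = np.array([np.linalg.det(L[np.ix_(T, T)]) for T in subsets])
    weights = np.clip(weights, 0.0, None)     # guard against tiny negative round-off
    return subsets, weights / weights.sum()

def sample_kdpp(A, k, rng):
    """Draw one subset of k row indices from the exact k-DPP."""
    subsets, probs = kdpp_distribution(A, k)
    return subsets[rng.choice(len(subsets), p=probs)]

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 3))       # 8 data points, 3 features
print(sample_kdpp(A, 2, rng))
```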
Deterministic $k$-DPPs
Stochastic DPP sampling may be efficient in practice; however, deterministic algorithms are important for different use cases since they are more interpretable, are less prone to errors, and have no failure probability, which is especially relevant for clinical data (El Shawi et al., 2019).
We use a deterministic version of DPP sampling as proposed in Schreurs and Suykens, 2021 (see Algorithm 1), which is a greedy maximum volume approach. For each deterministic $k$-DPP sample, elements with the highest probability are added iteratively. The complexity of the algorithm for selecting deterministically $d$ rows from an $N\times d$ matrix is $O(N^{2}d)$ for the preprocessing step and $O(Nd^{3})$ for the sampling step.
Algorithm 1 Deterministic $k$-DPP algorithm

Input: $N\times N$ kernel matrix $K\succ 0$, sample size $k$.
Initialization:
  $\mathcal{T}\leftarrow\varnothing$
  $V\in\mathbb{R}^{N\times k}$: first $k$ eigenvectors of $K$
  $P=VV^{\top}$
  $p_{0}(i)=\Vert V^{\top}e_{i}\Vert^{2},\quad i=1,\dots,N$
  $p\leftarrow p_{0}$ and $i\leftarrow 1$
while $i\le k$ do
  $t_{i}\in\arg\max p$
  $\mathcal{T}\leftarrow\mathcal{T}\cup\{t_{i}\}$
  $p(j)=p_{0}(j)-P_{\mathcal{T}j}^{\top}P_{\mathcal{T}\mathcal{T}}^{\dagger}P_{\mathcal{T}j},\quad j=1,\dots,N$
  $i\leftarrow i+1$
end while
Output: $\mathcal{T}$.
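A direct transcription of Algorithm 1 into NumPy might look as follows. This is a sketch under two assumptions: "first $k$ eigenvectors" is read as the eigenvectors of the $k$ largest eigenvalues, and already selected indices are explicitly excluded so that round-off cannot cause a repeat. The function name and test matrix are illustrative.

```python
import numpy as np

def deterministic_kdpp(K, k):
    """Greedy selection of k indices following the structure of Algorithm 1."""
    eigvals, eigvecs = np.linalg.eigh(K)              # ascending eigenvalues
    V = eigvecs[:, np.argsort(eigvals)[::-1][:k]]     # assumed: top-k eigenvectors
    P = V @ V.T                                       # projection kernel
    p0 = np.diag(P).copy()                            # p0(i) = ||V^T e_i||^2
    p = p0.copy()
    selected = []
    for _ in range(k):
        t = int(np.argmax(p))
        selected.append(t)
        P_Tj = P[selected, :]                         # rows in T, all columns j
        P_TT = P[np.ix_(selected, selected)]
        # p(j) = p0(j) - P_Tj^T pinv(P_TT) P_Tj, evaluated for every j at once
        p = p0 - np.einsum("tj,ts,sj->j", P_Tj, np.linalg.pinv(P_TT), P_Tj)
        p[selected] = -np.inf                         # never pick an index twice
    return selected

rng = np.random.default_rng(0)
A = rng.normal(size=(12, 4))
print(deterministic_kdpp(A @ A.T, k=3))
```

Because no random numbers are drawn inside the function, repeated calls on the same kernel return the same subset, which is the property exploited by the deterministic imputers.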
DPP-Random Forest and detDPP-Random Forest
The Random Forest is a widely used ensemble learning model for classification and regression problems. It trains a number of decision trees on different samples from the dataset, and the final prediction of the Random Forest is the average of the decision trees for regression tasks or the class predicted by the most decision trees for classification tasks.
The samples used to train each tree are drawn uniformly with replacement from the original dataset (bootstrapping). The DPP-Random Forest algorithm (see Figure 4) replaces the uniform sampling with DPP sampling without replacement.
The running time of the standard Random Forest training on an $N\times d$ matrix is $\tilde{O}(Nd)$, whereas the DPP-Random Forest algorithm takes $\tilde{O}(Nd^{2}+d^{3})$ steps to run. This shows that while for small $d$ the classical DPP-enhanced algorithms can still be efficient, they quickly become inefficient for larger feature spaces.
Determinantal sampling for regression and classification tasks with full data has been proposed previously for linear regressors (Dereziński et al., 2018) and for Random Forest training for a financial data classification use case where it outperformed the standard Random Forest model (Thakkar et al., 2023).
We can also use the deterministic version of DPP sampling for the Random Forest algorithm. This requires removing the sample used at each step (which is the one with the highest probability according to the determinantal distribution) in order to create a smaller dataset from which to sample for the next decision tree (see Figure 5). We call this new model detDPP-Random Forest.
Let us note that the distributions of the in-bag DPP samples, which are biased toward diversity, and the out-of-bag (OOB) samples, which reflect the original dataset’s distribution, may be different. This could lead to an inaccurate calculation of the OOB error, which can in fact be overestimated (Janitza and Hornung, 2018). In the DPP-Random Forest case, the batches are stratified according to the output variable, so that each batch follows the same distribution as the original dataset. Thus, sampling from different batches could bridge the gap between the in-bag and the OOB distributions. We leave these considerations for future work.
Quantum methods for DPPs
Quantum machine learning has been a rapidly developing field and many applications have been explored, including with biomedical data, both using quantum algorithms to speed up linear algebraic procedures and through quantum neural networks (Cerezo et al., 2022; Biamonte et al., 2017; Landman et al., 2022; Cherrat et al., 2022).
In Kerenidis and Prakash, 2022, it was shown that there exist quantum algorithms for performing the determinantal sampling with better computational complexity than the best known classical methods. We describe below the quantum circuits that are needed for performing this quantum algorithm on quantum hardware with different connectivity characteristics and provide a resource analysis for the number of qubits, the number of gates, and the depth of the quantum circuit.
First, we introduce an important component of the quantum DPP circuit, which is the Clifford loader. Given an input state $x\in {\mathbb{R}}^{n}$, it performs the following operation:
In other words, it encodes the vector $x$ as a sum of the mutually anticommuting operators generating the Clifford algebra.
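As an illustration of what such an encoding can look like, the sketch below builds one common choice of mutually anticommuting operators, the strings $Z\otimes\dots\otimes Z\otimes X\otimes I\otimes\dots\otimes I$. This specific choice is an assumption made here for illustration; the function names are hypothetical. The code checks numerically that the operators anticommute and that the weighted sum for a unit vector squares to the identity.

```python
import functools
import numpy as np

I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.array([[1.0, 0.0], [0.0, -1.0]])

def jw_operator(i, n):
    """Assumed generator: i copies of Z, then X, then identities (zero-based i)."""
    factors = [Z] * i + [X] + [I2] * (n - i - 1)
    return functools.reduce(np.kron, factors)

def clifford_loader_matrix(x):
    """Weighted sum of the generators, sum_i x_i P_i, for a real vector x."""
    n = len(x)
    return sum(xi * jw_operator(i, n) for i, xi in enumerate(x))

n = 3
P = [jw_operator(i, n) for i in range(n)]
x = np.array([0.6, 0.0, 0.8])                 # unit-norm example vector
C = clifford_loader_matrix(x)

print(np.allclose(P[0] @ P[1] + P[1] @ P[0], 0.0))   # generators anticommute
print(np.allclose(C @ C, np.eye(2 ** n)))            # C squares to identity
```

Anticommutation is what makes the cross terms cancel in $C^{2}$, leaving $\Vert x\Vert^{2}I$, so the sum is unitary whenever $x$ has unit norm.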
To implement this operation with an efficient quantum circuit, we use standard one- and two-qubit gates, such as the X, Z, and CZ gates, as well as a parameterized two-qubit gate called the reconfigurable beam splitter (RBS) gate, which performs the following operation:
We provide in Figure 6 three different versions of the Clifford loader that take advantage of the specific connectivity of the quantum hardware, for example, grid connectivity for superconducting qubits or all-to-all connectivity for trapped-ion qubits. These constructions are optimal (up to a constant factor) in the number of two-qubit gates. We provide the exact resource analysis in Table 7.
We can now use the Clifford loaders described above to perform $k$-DPP sampling, as described in Kerenidis and Prakash, 2022.
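As an illustration, the RBS gate is commonly written as a rotation by an angle $\theta$ within the subspace spanned by $|01\rangle$ and $|10\rangle$, leaving $|00\rangle$ and $|11\rangle$ unchanged. The sketch below assumes that definition with basis order $|00\rangle,|01\rangle,|10\rangle,|11\rangle$; the sign convention of the off-diagonal terms is an assumption and may differ from the one used by the authors.

```python
import numpy as np

def rbs(theta):
    """4x4 matrix of the RBS gate (assumed convention, basis |00>, |01>, |10>, |11>)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([
        [1, 0, 0, 0],
        [0, c, s, 0],
        [0, -s, c, 0],
        [0, 0, 0, 1],
    ], dtype=float)

theta = np.pi / 3
U = rbs(theta)

ket_00 = np.array([1.0, 0.0, 0.0, 0.0])
ket_10 = np.array([0.0, 0.0, 1.0, 0.0])

out_00 = U @ ket_00   # unchanged: the gate acts trivially outside the |01>, |10> subspace
out_10 = U @ ket_10   # sin(theta)|01> + cos(theta)|10>
```

The gate preserves the number of qubits in state 1, so a circuit built from RBS gates never leaves the subspace of fixed Hamming weight in which it starts.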
Given an orthogonal matrix $A=({a}^{1},\dots,{a}^{d})$, we can apply the qDPP circuit shown in Figure 7, which is just a sequential application of $d$ Clifford loaders, one for each column of the matrix, to the $|{0}^{n}\rangle$ state, and that leads to the following result:
Directly measuring at the end of the circuit provides a sample from the correct determinantal distribution.
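The distribution being sampled can be checked classically on a small instance. The sketch below assumes the standard form of the result in Kerenidis and Prakash, 2022, namely that the circuit prepares $\sum_{|S|=d}\det(A_{S})|S\rangle$, where $A_{S}$ is the $d\times d$ submatrix of rows indexed by $S$, so that a measurement returns $S$ with probability $\det(A_{S})^{2}$. The brute-force enumeration is for illustration only and scales exponentially.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, d = 5, 2

# Illustrative input: an n x d matrix with orthonormal columns, obtained via QR.
A, _ = np.linalg.qr(rng.normal(size=(n, d)))

# Enumerate all size-d subsets of rows and their determinantal probabilities.
subsets = list(combinations(range(n), d))
probs = np.array([np.linalg.det(A[list(S), :]) ** 2 for S in subsets])

# By the Cauchy-Binet formula the probabilities sum to det(A^T A) = 1.
total = probs.sum()

# One classical draw, mimicking a single measurement of the circuit.
sample = subsets[rng.choice(len(subsets), p=probs / total)]
```

Subsets whose rows are nearly parallel have a determinant close to zero and are therefore rarely drawn, which is the diversity property the imputation methods rely on.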
Both the classical and the quantum algorithms require a preprocessing step with a similar complexity (see Table 8), and the quantum method achieves a quadratic to cubic speedup in the sampling step. This speedup holds for $n=O(d)$, which is the case for our current implementation of DPP sampling from smaller batches (see Figure 4). In addition, the quantum DPP algorithm is efficient in terms of the number of measurements required, since one measurement is equivalent to generating one DPP sample.
Quantum versions of the imputation methods
It is now easy to define a quantum version of the DPP-MICE and DPP-MissForest algorithms, where we use the quantum circuit described above to sample from the corresponding DPP. We can also define a variant of the deterministic algorithms, though here we need to pay attention to the fact that the quantum circuit enables sampling from the determinantal distribution but does not efficiently give us a classical description of the entire distribution. Hence, one can instead sample many times from the quantum circuit and output the most frequent element. This provides a sample with less variance, but it only becomes deterministic in the limit of infinite measurements. In the experiments we performed, we used 1000 shots, and the samples from the quantum circuits were indeed, most of the time, the highest-probability elements. Of course, in the worst case, there exist distributions where, for example, the probabilities of the highest and second-highest elements are exponentially close to each other, in which case the quantum algorithm would need an exponential number of samples to output the highest element with high probability. Note, though, that the quantum imputation algorithm will still perform well even with few samples (any high-probability element provides the needed diversity of the inputs), though it will not be deterministic.
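The "most frequent element" rule can be sketched with a classical stand-in for the circuit. The outcome labels and probabilities below are made up for illustration, and `most_frequent_outcome` is a hypothetical helper; on hardware the draws would come from measurement shots rather than a pseudorandom generator.

```python
import random
from collections import Counter

def most_frequent_outcome(outcomes, probs, shots, seed=0):
    """Draw `shots` samples and return (most frequent outcome, its count)."""
    rng = random.Random(seed)
    draws = rng.choices(outcomes, weights=probs, k=shots)
    counts = Counter(draws)
    return counts.most_common(1)[0]

# Hypothetical distribution over three candidate subsets.
outcomes = ["S1", "S2", "S3"]
well_separated = [0.6, 0.3, 0.1]

mode, count = most_frequent_outcome(outcomes, well_separated, shots=1000)
```

When the top two probabilities are well separated, as here, 1000 shots identify the most likely element almost surely. As a rough guide under standard concentration arguments, separating two outcomes whose probabilities differ by $\delta$ takes on the order of $1/\delta^{2}$ shots, which is why exponentially small gaps require exponentially many samples.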
Availability of data and code
The code for the different DPP imputation methods is publicly available at github.com/AstraZeneca/dpp_imp (copy archived at AstraZeneca, 2023). The synthetic dataset can be generated using the make_classification method from scikit-learn. The MIMIC-III dataset (Johnson et al., 2016) is also publicly available.
Data availability
The synthetic dataset can be generated using the make_classification method from scikit-learn. The MIMIC-III dataset is publicly available at https://mimic.mit.edu/.
References

Challenges and opportunities in quantum machine learning. Nature Computational Science 2:567–576. https://doi.org/10.1038/s43588-022-00311-3

Determinantal point processes in randomized numerical linear algebra. Notices of the American Mathematical Society 68:34–45. https://doi.org/10.1090/noti2202

Interpretability in HealthCare A Comparative Study of Local Machine Learning Interpretability Techniques. 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS). https://doi.org/10.1109/CBMS.2019.00065

A survey on missing data in machine learning. Journal of Big Data 8:140. https://doi.org/10.1186/s40537-021-00516-9

mice: multivariate imputation by chained equations in R. Journal of Statistical Software 45:1–67. https://doi.org/10.18637/jss.v045.i03

MIMIC-III, a freely accessible critical care database. Scientific Data 3:160035. https://doi.org/10.1038/sdata.2016.35

Determinantal point processes for machine learning. Foundations and Trends in Machine Learning 5:123–286. https://doi.org/10.1561/2200000044

k-DPPs: fixed-size determinantal point processes. ICML’11: Proceedings of the 28th International Conference on International Conference on Machine Learning.

Evaluating the state of the art in missing data imputation for clinical data. Briefings in Bioinformatics 23:bbab489. https://doi.org/10.1093/bib/bbab489

Minimax experimental design: bridging the gap between statistical and worst-case approaches to least squares regression. Conference on Learning Theory.

Handling missing data in clinical trials: An overview. Drug Information Journal 34:525–533. https://doi.org/10.1177/009286150003400221

Missing data and multiple imputation in clinical epidemiological research. Clinical Epidemiology 9:157–166. https://doi.org/10.2147/CLEP.S129785

Towards deterministic diverse subset sampling. Artificial Intelligence and Machine Learning, pp. 137–151. https://doi.org/10.1007/978-3-030-65154-1

GAIN: missing data imputation using generative adversarial nets. Proceedings of the 35th International Conference on Machine Learning.
Article and author information
Funding
No external funding was received for this work.
Acknowledgements
This work is a collaboration between QC Ware and AstraZeneca. We acknowledge the use of IBM Quantum services for this work. The views expressed are those of the authors and do not reflect the official policy or position of IBM or the IBM Quantum team.
Version history
 Preprint posted: March 31, 2023
 Sent for peer review: June 9, 2023
 Preprint posted: August 23, 2023
 Preprint posted: March 18, 2024
 Version of Record published: May 9, 2024 (version 1)
Cite all versions
You can cite all versions using the DOI https://doi.org/10.7554/eLife.89947. This DOI represents all versions, and will always resolve to the latest one.
Copyright
© 2023, Kazdaghli et al.
This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.