Improved clinical data imputation via classical and quantum determinantal point processes

  1. QC Ware, Palo Alto, USA and Paris, France
  2. Université de Paris, CNRS, IRIF, 8 Place Aurélie Nemours, Paris 75013, France
  3. Emerging Innovations Unit, BioPharmaceuticals R&D, AstraZeneca, Cambridge, UK
  4. Centre for AI, BioPharmaceuticals R&D, AstraZeneca, Cambridge, UK

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.

Editors

  • Reviewing Editor
    Martin Graña
    Institut Pasteur de Montevideo, Montevideo, Uruguay
  • Senior Editor
    Aleksandra Walczak
    CNRS, Paris, France

Reviewer #1 (Public Review):

Summary:
The article by Kazdaghli et al. proposes modifications of imputation methods to better account for and exploit the variability of the data. The aim is to reduce the variability of the imputation results. The authors propose two methods: one that still includes some imputation variability but accounts for the distribution of the data points to improve the imputation, and another that uses determinantal sampling and presents no variation in the imputed data, or at least no variation in the classification task. As these methods quickly grow in computational requirements and running time, the authors also propose an algorithm to run them on quantum processors.

Strengths:
The sampling method for imputing missing values that accounts for the variability of the data seems to be accurate.

Weaknesses:
While the proposed method seems accurate and should improve the imputation task, I think the authors must explain some parts of the algorithm they are using a little better. Although I think the authors could have evaluated the imputations directly, as they mention in the introduction, I understand that the final goal of the task is to obtain a better classification. The problem is that they do not explain what the classification task is, or how the classifier is trained. In a real situation, there would be data used to train the algorithm, and then new data that needs to be imputed and classified. In this article, I do not see any training, test, or validation data. I wonder if there could be some interaction between the imputation and the classification methods that leads to overfitting the data, in particular when the deterministic DPP is used.

In its current state, I do not think this article brings much value to the community that could benefit from it. I did not find an implementation of the method available to other scientists, nor the data used to evaluate it (while one dataset is public, the simulated data is not available). This not only hinders the use of the method by the scientific community, but also makes it impossible to reproduce the results or test the algorithm on similar databases.

Reviewer #2 (Public Review):

In this work, the authors address the problem of missing data imputation in the life sciences domain and propose several new algorithms which improve on the current state-of-the-art. In particular, (i) they modify two existing Random Forest-based imputation methods -- MissForest and miceRanger -- to use either determinantal sampling or deterministic determinantal sampling, and show slightly improved classification performance on two datasets (one synthetic, one real); in addition, (ii) the authors present a quantum circuit for performing the determinantal sampling which scales asymptotically better than the best-known classical methods, and perform small-scale experiments using both a (noiseless) quantum simulator and a 10-qubit IBM quantum computer to validate that the approach works in principle.

The problem of data imputation is important in practice, and results that improve on existing methods should be of interest to those in the field. The use of determinantal sampling for applications beyond data imputation should also be of broader interest, and the connection to quantum computing warrants further investigation and analysis.

The use of classification accuracy (as measured by AUC) as a measure of success is well-motivated, and the authors use both real and synthetic datasets to evaluate their methods, which consistently (if only marginally) outperform the existing state-of-the-art. The results obtained here motivate the further study of this approach to a wider class of datasets, and to areas beyond life sciences.

As it stands, in my opinion, two points need addressing.

1. Additional clarity is required on what is novel:

While the application of determinantal and deterministic determinantal sampling to the specific case of data imputation appears to be novel, the authors should make it more clear that both of these methods themselves are not new, and have been directly lifted from the literature. As it stands, the current wording in the main body of the paper gives the impression that the deterministic determinantal algorithm is novel, e.g. "this motivated us to develop a deterministic version of determinantal sampling" (p.2), and it is only in the methods section that a reference is made to the paper of Schreurs et al. which proposed the algorithm.

Similarly, in the abstract and main body of the text, the wording gives the impression that the quantum circuits presented here are new (e.g., "We also develop quantum circuits for implementing determinantal point processes") whereas they have been previously proposed (although one of the authors of the current paper was also an author of the original paper proposing the quantum circuits for determinantal sampling).

2. Additional analysis is needed to support the claims of potential for quantum advantage:

The authors claim that the quantum algorithm for implementing determinantal point processes provides a computational advantage over classical ones, in that the quantum circuits scale linearly in the number of features compared with cubic scaling classically. While this may be true asymptotically, in my opinion, more discussion is required about the utility and feasibility of this method in practice, as well as the realistic prospects of this being a potential area of application for quantum computing.

For example, the authors mention that a quantum computer of 150 qubits capable of running circuits of depth 400 is needed to perform the determinantal sampling for the MIMIC-III dataset considered, and say "while [such hardware is] not available right now, it seems quite possible that they will be available in the not so far future". The authors also state "This suggests that with the advent of next-generation quantum computers... one could expect a computational speedup in performing determinantal sampling" and "it is expected that next-generation quantum computers will provide a speedup in practice". These are strong assertions (even if 'next generation' is not clearly defined), and in my opinion, are not sufficiently backed by evidence to date. Given that datasets of the size of MIMIC-III (and presumably much larger) can be handled efficiently classically, the authors should clarify whether one expects a quantum advantage by this approach in the "NISQ" (pre-error-corrected) era of quantum computing. This seems unlikely, and any argument that this is the case should include an analysis accounting for the absolute operation speeds and absolute times required to perform such computations, including any time required for inputting data, resetting quantum circuits etc. On the other hand, if by 'next generation' the authors mean quantum computers beyond the NISQ era (i.e., assuming fault-tolerant quantum computers and logical qubits), then the overhead costs of quantum error correction (both in terms of physical qubit numbers as well as computational time) should be analyzed, and the crossover regime (i.e., data size where a quantum computation takes less absolute time than classical) estimated in order to assess the prospects of a practical quantum advantage, especially in light of recent analyses e.g., [1,2] below.

[1] Hoefler, Haner, and Troyer. Communications of the ACM 66.5 (2023): 82-87
[2] Babbush et al. PRX Quantum 2.1 (2021): 010103

Other comments and suggestions:
The authors measure "running time [as] the depth of the necessary quantum circuits." While circuit depth may indeed correspond to wall-clock time, quantum circuit size (i.e., the number of gates) is the fairer complexity metric for comparison with classical running time. If depth is used, then a fair comparison would be against classical parallel processing time using N processors. If circuit size is used instead, then the quantum complexity is Nd, which contrasts with the classical value of Nd^2 (pre-processing) + d^3 (per sample). This yields a subquadratic quantum speedup over classical, as opposed to a cubic speedup.
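
For concreteness, the quantities quoted in this paragraph (N data points, d features; constants and lower-order terms omitted) can be set side by side as follows. This is only a schematic restatement of the review's own figures, not an analysis of the paper's circuits:

```latex
\underbrace{O(Nd)}_{\text{quantum gate count}}
\quad \text{vs.} \quad
\underbrace{O(Nd^{2})}_{\text{classical pre-processing}}
\;+\; \underbrace{O(d^{3})}_{\text{classical, per sample}}
```

Under the regime N = O(d) assumed later in the authors' response, this is roughly a factor-d (subquadratic) gap, whereas comparing the O(d) circuit depth alone against the O(d^3) classical per-sample cost would suggest the larger, cubic-to-linear gap.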

The results (e.g., Table 1) show that the new algorithms consistently outperform the original miceRanger and MissForest methods, although the degree of improvement is small, typically of order 1% or less. Some discussion is therefore warranted on the practical benefits of this method, and on any tradeoff in terms of efficiency. In particular, while Table 1 compares the classification accuracy (as measured by AUC) of the newly proposed methods against the existing state-of-the-art, a discussion of scalability and efficiency would be welcome. The determinantal sampling takes time Nd^2; how does this compare with the original methods? For what dataset and feature sizes are the determinantal methods feasible (which will determine the scale at which other approaches, e.g., those based on quantum computing, may be required)?

A discussion (or at least mention) of the algorithmic complexity of the classical deterministic determinantal sampling (which seems to also be Nd^2) in the main body of the text would be welcome.

The final paragraph of the Methods section discusses sampling many times from the quantum circuits to estimate the most probable outcome, and hence perform the deterministic determinantal sampling. A more careful analysis of the number of samples needed (for bounded variance/error) and of the impact on the running time or performance of the algorithm would be welcome, including whether one still expects an advantage over classical methods (although one must define some bounded-error version of the deterministic algorithm to do so).

A discussion on the absolute running time required for the quantum experiments performed (and how they compare to classical) would be interesting.

A mention of which quantum simulator was used would be welcome.

In the introduction, three kinds of data missingness (MCAR, MAR, MNAR) are mentioned, although experiments are only performed for MCAR and MNAR. Can some explanation for excluding MAR be given?

Reference 24 (Shadbar et al., the study that demonstrated the effectiveness of miceRanger and MissForest) used 4 datasets: MIMIC-III, Simulated, Breast Cancer, and NHSX COVID-19. Of these, MIMIC-III is used in the current paper, and Simulated appears similar (although with 1000 instead of 2000 rows) to the synthetic dataset of the current paper. An analysis of the determinantal sampling methods applied to the Breast Cancer and NHSX COVID-19 datasets (which have naturally occurring missingness), and a comparison to the results of Shadbar et al. would be interesting.

Author Response

The following is the authors’ response to the original reviews.

Reviewer #1

More details about the classification and how it is trained

We included a sentence in the introduction to clarify which data we are using: "In order to demonstrate this improvement, we apply our methods to two classification datasets: a synthetic dataset and a public clinical dataset where the predicted outcome is the survival of the patient"

We also clarified how the classifier is trained in the "Results" section: "we used the default parameters of the classifier, since our focus is comparing the different imputation methods"
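
For concreteness, the following is a minimal sketch of this kind of downstream evaluation: impute the missing values, train a classifier with default parameters, and report the AUC. The imputer, classifier, synthetic data, and missingness injection below are stand-ins chosen for illustration only; the response does not specify them here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer  # stand-in for the DPP-based imputers
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy data with roughly 20% of values missing completely at random (MCAR).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
imputer = SimpleImputer().fit(X_tr)           # replace with a DPP-based imputer
clf = RandomForestClassifier(random_state=0)  # default parameters, as stated above
clf.fit(imputer.transform(X_tr), y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(imputer.transform(X_te))[:, 1])
print(f"AUC: {auc:.3f}")
```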

Availability of the code

The code is now publicly available in a GitHub repository: https://github.com/AstraZeneca/dpp_imp/ (see the Availability of Data and Code section).

Reviewer #2

Clarifying that Determinantal Point Processes and their deterministic version have been introduced before but are applied for the first time for data imputation in this work:

We added an explanation in the 6th paragraph of the introduction stating that we use pre-existing DPP and deterministic-DPP algorithms for our imputation methods, and we include the corresponding references to avoid confusion.

We also added a paragraph at the end of the introduction to summarize this work's contribution
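
For readers who want to see what the pre-existing classical DPP algorithm referred to above looks like, here is a minimal sketch of the standard spectral sampling routine for an L-ensemble DPP (in the style of Hough et al. and Kulesza-Taskar). It is illustrative only and is not taken from the authors' repository; the function and variable names are ours.

```python
import numpy as np

def sample_dpp(L, rng=None):
    """Exact sampling from an L-ensemble DPP via the standard spectral algorithm."""
    rng = np.random.default_rng() if rng is None else rng
    lam, V = np.linalg.eigh(L)
    # Phase 1: keep each eigenvector independently with probability lambda/(1+lambda).
    V = V[:, rng.random(len(lam)) < lam / (1.0 + lam)]
    sample = []
    while V.shape[1] > 0:
        # Marginals of the elementary DPP defined by the kept eigenvectors.
        probs = np.sum(V**2, axis=1)
        probs /= probs.sum()
        i = rng.choice(len(probs), p=probs)
        sample.append(i)
        # Restrict V to the part of its span orthogonal to e_i, then re-orthonormalize.
        j = np.argmax(np.abs(V[i, :]))
        V = V - np.outer(V[:, j], V[i, :] / V[i, j])
        V = np.delete(V, j, axis=1)
        if V.shape[1] > 0:
            V, _ = np.linalg.qr(V)
    return sorted(sample)

# Example: sample a diverse subset of rows from a random feature matrix X.
X = np.random.default_rng(1).normal(size=(20, 5))
print(sample_dpp(X @ X.T))
```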

Explaining the claim about the computational advantage of using quantum determinantal point processes for the imputation methods:

In the fourth paragraph of the "Discussion" section (page 8), we give an imputation example that numerically compares the running times of the classical and quantum algorithms for DPP sampling, which shows the advantage of using the quantum algorithm.

Regarding running time for classical DPP and quantum DPP sampling algorithms:

We included Table VIII (page 13), which compares the preprocessing and sampling complexities of the classical and quantum DPP algorithms. We consider the case where we sample d rows from an (n, d) matrix with n = O(d), which is usually the case for our DPP-Random Forest algorithm.

We added some details regarding the quantum advantage in the first paragraph of page 12

Regarding the comment about the modest improvement of the DPP methods and questions about their practical benefit:

As mentioned in the third paragraph of the "Discussion" section, we point out that the consistency of the improvement and the removal of variance obtained by using the DPP and deterministic DPP methods make our methods very beneficial to use on clinical data. Further exploration with different datasets can result in a more complete understanding of the practical advantages of the methods.

Algorithmic complexity of the deterministic DPP algorithm:

This is detailed in the last sentence of the "Determinantal Point Processes" subsection of the "Methods" section: O(N^2 d) for the preprocessing step and O(Nd^3) for the sampling step.
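
To illustrate the flavour of the deterministic variant, the sketch below greedily grows a subset that maximizes the determinant of the corresponding kernel submatrix. This is only an illustration of the general idea: it recomputes determinants naively, the names are ours, and the actual algorithm of Schreurs et al. used in the paper relies on more efficient updates and may differ in its details.

```python
import numpy as np

def greedy_det_selection(X, k):
    """Illustrative deterministic selection of k rows of X: greedily grow the
    subset S that maximizes det(L[S, S]) for the kernel L = X X^T."""
    L = X @ X.T
    selected = []
    for _ in range(k):
        best_i, best_val = None, -np.inf
        for i in range(L.shape[0]):
            if i in selected:
                continue
            idx = selected + [i]
            val = np.linalg.det(L[np.ix_(idx, idx)])  # naive recomputation
            if val > best_val:
                best_i, best_val = i, val
        selected.append(best_i)
    return selected

# Example: deterministically pick 3 of 10 rows from a random 10 x 4 matrix.
rows = greedy_det_selection(np.random.default_rng(0).normal(size=(10, 4)), 3)
print(rows)
```

Because nothing is sampled, repeated runs on the same data select the same rows, which is the property that removes the imputation variance discussed above.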

Running time for the quantum deterministic DPP sampling and how it is done in practice:

While it is difficult to assess the real running time of the quantum detDPP algorithm for large circuits (100 or more qubits), due to the unavailability of such devices, we give more details about our practical implementation in the last paragraph of the "Methods" section. In our case (up to 10 qubits), we used 1000 shots to sample the highest-probability elements.
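
As a purely practical illustration of the shot-based procedure mentioned above (run the circuit many times and keep the most frequent outcome), here is a minimal sketch assuming the Qiskit Aer noiseless simulator. The small single-qubit-rotation circuit is only a placeholder for the actual detDPP circuit, which is not reproduced here.

```python
from qiskit import QuantumCircuit, transpile
from qiskit_aer import AerSimulator

# Placeholder circuit standing in for the (unshown) detDPP circuit.
num_qubits = 4
qc = QuantumCircuit(num_qubits)
for q in range(num_qubits):
    qc.ry(0.4, q)  # small rotations give a clearly peaked output distribution
qc.measure_all()

# Estimate the most probable outcome from 1000 shots, as described above.
sim = AerSimulator()
counts = sim.run(transpile(qc, sim), shots=1000).result().get_counts()
most_probable = max(counts, key=counts.get)
print(most_probable, counts[most_probable])
```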

Which quantum simulator was used

We point out in the first paragraph of page 5 that we employ the Qiskit noiseless simulator.
