Peer review process
Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a provisional response from the authors.
Read more about eLife's peer review process.
Editors
- Reviewing Editor: Martin Graña, Institut Pasteur de Montevideo, Montevideo, Uruguay
- Senior Editor: Aleksandra Walczak, CNRS, Paris, France
Reviewer #1 (Public Review):
Summary:
The article by Kazdaghli et al. proposes modifications of imputation methods that better account for, and exploit, the variability of the data, with the aim of reducing the variability of the imputation results. The authors propose two methods: one still includes some imputation variability but accounts for the distribution of the data points to improve the imputation; the other uses determinantal sampling, which produces no variation in the imputed data, or at least none in the downstream classification task. As these methods quickly grow in computational requirements and running time, the authors also propose an algorithm to run them on quantum processors.
Strengths:
The sampling method for imputing missing values that accounts for the variability of the data seems to be accurate.
Weaknesses:
While the proposed method seems accurate and should improve the imputation task, I think the authors must explain some parts of their algorithm better. Although I think the authors could have evaluated the imputations directly, as they mention in the introduction, I understand that the final goal is better classification. The problem is that they do not explain what the classifier is or how it is trained. In a real situation, there would be data used for training the algorithm, and then new data that needs to be imputed and classified. In this article, I do not see any split into training plus test or validation data. I wonder if there could be some interaction between the imputation and classification methods that leads to overfitting the data, in particular when the deterministic DPP is used.
In its current state, I do not think this article brings much value to the community that could benefit from it. I did not find an implementation of the method available to other scientists, nor the data used to evaluate it (while one data set is public, the simulated data is not available). This not only hinders the use of the method by the scientific community, but also makes it impossible to reproduce the results or test the algorithm on similar databases.
Reviewer #2 (Public Review):
In this work, the authors address the problem of missing data imputation in the life sciences domain and propose several new algorithms which improve on the current state-of-the-art. In particular (i) they modify two existing Random Forest-based imputation methods -- MissForest and miceRanger -- to use either determinantal sampling or deterministic determinantal sampling, and show slightly improved classification performance on two datasets (one synthetic, one real); in addition, (ii) the authors present a quantum circuit for performing the determinantal sampling which scales asymptotically better than the best-known classical methods, and perform small scale experiments using both a (noiseless) quantum simulator as well as a 10 qubit IBM quantum computer to validate that the approach works in principle.
The problem of data imputation is important in practice, and results that improve on existing methods should be of interest to those in the field. The use of determinantal sampling for applications beyond data imputation should also be of broader interest, and the connection to quantum computing warrants further investigation and analysis.
The use of classification accuracy (as measured by AUC) as a measure of success is well-motivated, and the authors use both real and synthetic datasets to evaluate their methods, which consistently (if only marginally) outperform the existing state-of-the-art. The results obtained here motivate the further study of this approach to a wider class of datasets, and to areas beyond life sciences.
As it stands, in my opinion, two points need addressing.
1. Additional clarity is required on what is novel:
While the application of determinantal and deterministic determinantal sampling to the specific case of data imputation appears to be novel, the authors should make it more clear that both of these methods themselves are not new, and have been directly lifted from the literature. As it stands, the current wording in the main body of the paper gives the impression that the deterministic determinantal algorithm is novel, e.g. "this motivated us to develop a deterministic version of determinantal sampling" (p.2), and it is only in the methods section that a reference is made to the paper of Schreurs et al. which proposed the algorithm.
Similarly, in the abstract and main body of the text, the wording gives the impression that the quantum circuits presented here are new (e.g., "We also develop quantum circuits for implementing determinantal point processes") whereas they have been previously proposed (although one of the authors of the current paper was also an author of the original paper proposing the quantum circuits for determinantal sampling).
2. Additional analysis is needed to support the claims of potential for quantum advantage:
The authors claim that the quantum algorithm for implementing determinantal point processes provides a computational advantage over classical ones, in that the quantum circuits scale linearly in the number of features compared with cubic scaling classically. While this may be true asymptotically, in my opinion, more discussion is required about the utility and feasibility of this method in practice, as well as the realistic prospects of this being a potential area of application for quantum computing.
For example, the authors mention that a quantum computer of 150 qubits capable of running circuits of depth 400 is needed to perform the determinantal sampling for the MIMIC-III dataset considered, and say "while [such hardware is] not available right now, it seems quite possible that they will be available in the not so far future". The authors also state "This suggests that with the advent of next-generation quantum computers... one could expect a computational speedup in performing determinantal sampling" and "it is expected that next-generation quantum computers will provide a speedup in practice". These are strong assertions (even if 'next generation' is not clearly defined), and in my opinion, are not sufficiently backed by evidence to date. Given that datasets of the size of MIMIC-III (and presumably much larger) can be handled efficiently classically, the authors should clarify whether one expects a quantum advantage by this approach in the "NISQ" (pre-error-corrected) era of quantum computing. This seems unlikely, and any argument that this is the case should include an analysis accounting for the absolute operation speeds and absolute times required to perform such computations, including any time required for inputting data, resetting quantum circuits etc. On the other hand, if by 'next generation' the authors mean quantum computers beyond the NISQ era (i.e., assuming fault-tolerant quantum computers and logical qubits), then the overhead costs of quantum error correction (both in terms of physical qubit numbers as well as computational time) should be analyzed, and the crossover regime (i.e., data size where a quantum computation takes less absolute time than classical) estimated in order to assess the prospects of a practical quantum advantage, especially in light of recent analyses e.g., [1,2] below.
[1] Hoefler, Häner, Troyer. Communications of the ACM 66.5 (2023): 82-87
[2] Babbush et al., PRX Quantum 2.1 (2021):010103
Other comments and suggestions:
The authors measure "running time [as] the depth of the necessary quantum circuits." While circuit depth may indeed correspond to wall-clock time, quantum circuit size (i.e., number of gates) is the fairer complexity metric for comparison with classical running time. If depth is used, then a fair comparison would be against classical parallel processing time using N processors. If circuit size is used instead, then the quantum complexity is Nd, which contrasts with the classical value of Nd^2 (pre-processing) + d^3 (per sample). This yields a subquadratic quantum speedup over classical, as opposed to a cubic speedup.
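To make the classical costs in this comparison concrete, the Nd^2/d^3 figures correspond to sequential sampling from a projection DPP. The following is only an illustrative sketch of such a sampler (my own naive version, not the authors' implementation; `sample_projection_dpp` and its interface are hypothetical, and the repeated QR step makes this version slower than optimized O(Nd^2)-per-sample implementations):

```python
import numpy as np

def sample_projection_dpp(V, rng=None):
    """Sample d row indices from the projection DPP defined by an
    N x d matrix V with orthonormal columns (sequential sampler)."""
    rng = np.random.default_rng(rng)
    V = np.array(V, dtype=float)
    N, d = V.shape
    chosen = []
    for t in range(d):
        # marginal probabilities are the squared row norms of V
        p = np.clip((V ** 2).sum(axis=1), 0.0, None)
        p /= p.sum()
        i = int(rng.choice(N, p=p))
        chosen.append(i)
        if t == d - 1:
            break
        # condition on item i: zero out row i by a column elimination,
        # then restore orthonormal columns with a QR step
        j = int(np.argmax(np.abs(V[i])))
        V = V - np.outer(V[:, j], V[i] / V[i, j])
        V = np.delete(V, j, axis=1)
        V, _ = np.linalg.qr(V)  # naive re-orthonormalization
    return sorted(chosen)
```

For example, when V consists of the first d columns of the identity, the projection DPP is deterministic and must return rows 0..d-1.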
The results (e.g., Table 1) show that the new algorithms consistently outperform the original miceRanger and MissForest methods, although the degree of improvement is small, typically of order 1% or less. Some discussion is therefore warranted on the practical benefits of this method, and any tradeoff in terms of efficiency. In particular, while Table 1 compares the classification accuracy (as measured by AUC) of the newly proposed methods vs the existing state-of-the-art, a discussion of scalability and efficiency would be welcome. The determinantal sampling takes time Nd^2; how does this compare with the original methods? For what dataset and feature sizes are the determinantal methods feasible? (This will determine the scale at which other approaches, e.g. those based on quantum computing, may be required.)
A discussion (or at least mention) of the algorithmic complexity of the classical deterministic determinantal sampling (which seems to also be Nd^2) in the main body of the text would be welcome.
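For reference, the general flavor of deterministic DPP selection can be illustrated with the textbook greedy MAP heuristic, which repeatedly adds the item giving the largest log-determinant gain. This is only a naive O(Nk^4) sketch of the idea, not the Schreurs et al. algorithm cited in the paper (the function name and interface are my own):

```python
import numpy as np

def greedy_map_dpp(L, k):
    """Naive greedy MAP selection for a DPP with kernel L: at each
    step, add the item that maximizes log det of the selected
    submatrix L_S."""
    N = L.shape[0]
    selected = []
    for _ in range(k):
        best_i, best_gain = None, -np.inf
        for i in range(N):
            if i in selected:
                continue
            S = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(S, S)])
            gain = logdet if sign > 0 else -np.inf
            if gain > best_gain:
                best_i, best_gain = i, gain
        selected.append(best_i)
    return selected
```

On a diagonal kernel this simply picks the k largest diagonal entries, which makes the behavior easy to sanity-check.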
The final paragraph of the Methods section discusses sampling many times from the quantum circuits to estimate the most probable outcome, and hence perform the deterministic determinantal sampling. A more careful analysis would be welcome of the number of samples needed (for bounded variance/error) and of the impact on the running time and performance of the algorithm, including whether one still expects an advantage over classical methods (although one must define some bounded-error version of the deterministic algorithm to do so).
A discussion on the absolute running time required for the quantum experiments performed (and how they compare to classical) would be interesting.
A mention of which quantum simulator was used would be welcome.
In the introduction, three kinds of data missingness (MCAR, MAR, MNAR) are mentioned, although experiments are only performed for MCAR and MNAR. Can some explanation for excluding MAR be given?
Reference 24 (Shadbahr et al., the study that demonstrated the effectiveness of miceRanger and MissForest) used 4 datasets: MIMIC-III, Simulated, Breast Cancer, and NHSX COVID-19. Of these, MIMIC-III is used in the current paper, and Simulated appears similar (although with 1000 instead of 2000 rows) to the synthetic dataset of the current paper. An analysis of the determinantal sampling methods applied to the Breast Cancer and NHSX COVID-19 datasets (which have naturally occurring missingness), and a comparison to the results of Shadbahr et al., would be interesting.