Establishing the foundations for a data-centric AI approach for virtual drug screening through a systematic assessment of the properties of chemical data

Allen Chong; Ser-Xian Phua; Yunzhi Xiao; Woon Yee Ng; Hoi Yeung Li; Wilson Wen Bin Goh

doi:10.7554/eLife.97821.1

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, and public reviews.

Reviewing Editor
Alan Talevi
Universidad Nacional de La Plata, La Plata, Argentina
Senior Editor
Aleksandra Walczak
CNRS, Paris, France

Reviewer #1 (Public Review):

Summary:

The work provides more evidence of the importance of data quality and representation for ligand-based virtual screening approaches. The authors have applied different machine learning (ML) algorithms and data representations to a new dataset of BRAF ligands. First, the authors evaluate the ML algorithms and demonstrate that, independently of the ML algorithm, predictive and robust models can be obtained on this BRAF dataset. Second, the authors investigate how the molecular representations can modify the prediction of the ML algorithm. They found that in this highly curated dataset the different molecular representations are adequate for the ML algorithms since almost all of them obtain high accuracy values, with Estate fingerprints yielding the worst-performing predictive models and ECFP6 fingerprints producing the best classification models. Third, the authors evaluate the performance of the models on subsets of the BRAF dataset of different composition and size. They found that, given a finite number of active compounds, increasing the number of inactive compounds worsens the recall and accuracy. Finally, the authors analyze whether the use of "less active" molecules affects the model's predictive performance, using "less active" molecules taken from the ChEMBL database or decoys from DUD-E. As a result, they found that the accuracy of the model falls as the number of "less active" examples in the training dataset increases, while the inclusion of decoys in the training set generates results as good as the original models or even better in some cases. However, the use of decoys in the training set worsens the predictive power on test sets that contain active and inactive molecules.

Strengths:

It is a very interesting topic in medicinal chemistry and drug discovery. This work is very well written and contains up-to-date references. The general structure of the work is adequate, allowing easy reading. The hypotheses are clear and were explored correctly. This work provides new evidence about the importance of inferring models from high-quality data and that, if such a condition is met, it is not necessary to use complex computational methods to obtain predictive models. The generated BRAF dataset is also a valuable benchmark dataset for medicinal chemists working in ligand-based virtual screening.

Weaknesses:

Leaving aside the new curated BRAF dataset, the work lacks novelty since it is a topic widely studied in chemoinformatics and medicinal chemistry. Furthermore, the conclusions drawn here correspond to the analysis of only one high-quality dataset where the similarity between the molecules is not quantitatively assessed (maybe active and inactive molecules are very dissimilar and any ML algorithm and fingerprint could obtain good results). To generalize the conclusions, it would be fundamental to repeat the analysis with other high-quality datasets.

Some key tasks are not clearly described. For example, there is no information about the new BRAF dataset (e.g., where the molecules were obtained from, or why the inactive molecules provide better results than the "less active" molecules from ChEMBL; what differentiates them?). The definition of an "inactive" compound is not clear. It is not stated whether global or balanced accuracy was used for the imbalanced datasets. When using decoys to evaluate the models, it is important to consider that decoys were generated to be topologically different from active compounds by comparing ECFP4 fingerprints using the Tanimoto coefficient. Therefore, it is quite obvious that when fingerprints are used to characterize molecules, the models will be able to discriminate them easily. It is important to note that this is not necessarily true for models based on other molecular descriptors, since they are not used in the generation of the decoys. In some cases, the differences between accuracies are very small and there are no statistical analyses to demonstrate whether they are significantly different or not.
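
To illustrate why the choice between global and balanced accuracy matters on imbalanced data, here is a minimal sketch in plain Python. The labels and predictions are hypothetical toy values, not data from the manuscript, and the function names are chosen only for this illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = active)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def global_accuracy(y_true, y_pred):
    """Fraction of all compounds classified correctly."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity; assumes both classes are present."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical imbalanced test set: 10 actives and 90 inactives.
y_true = [1] * 10 + [0] * 90
# A trivial model that labels every compound as inactive.
y_pred = [0] * 100

print(global_accuracy(y_true, y_pred))
print(balanced_accuracy(y_true, y_pred))
```

In this toy case the trivial model reaches a global accuracy of 0.9 while its balanced accuracy is 0.5, which is why the metric used on the imbalanced subsets should be stated explicitly.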

https://doi.org/10.7554/eLife.97821.1.sa2

Reviewer #2 (Public Review):

Summary:

The authors explored the importance of data quality and representation for ligand-based virtual screening approaches. I believe the results could be of potential benefit to the drug discovery community, especially to those scientists working in the field of machine learning applied to drug research. The in silico design is comprehensive and adequate for the proposed comparisons.

This manuscript by Chong A. et al. argues that it is not necessary to resort to sophisticated deep learning algorithms for virtual screening since, based on their results, conventional ML may perform exceptionally well if fed the right data and molecular representations.

The article is interesting and well-written. The overview of the field and the warning about dataset composition are very well thought-out and should be of interest to a broad segment of the AI in drug discovery readership. This article further highlights some of the considerations that need to be taken into account when implementing data-centric AI for computer-aided drug design methods.

Strengths:

This study contributes significantly to the field of machine learning and data curation in drug discovery. The paper is, in general, well-written and structured. However, I have some suggestions regarding certain aspects of the data analyses.

Weaknesses:

The conclusions drawn in the study are based on the analysis of a single dataset, and I am not sure they can be generalized. Therefore, in my opinion, the conclusions are only partially supported by the data. To generalize the conclusions, it is imperative to conduct a benchmark with diverse datasets, for different molecular targets.
The conclusions cannot be immediately extended to molecular descriptors or features different from the ones used in this study.
It is advisable to present statistical analyses to ascertain whether the observed differences in metrics hold statistical significance.
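
As one possible way to carry out the kind of statistical comparison suggested above, the sketch below applies an exact McNemar test to two classifiers evaluated on the same test compounds. This is an assumption about a suitable test, not a method prescribed in the reviews or used in the manuscript, and the labels and predictions are hypothetical.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value for two classifiers on the same test set.

    Only discordant compounds matter: those that one model classifies
    correctly and the other does not.
    """
    a_only = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a == t and b != t)
    b_only = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a != t and b == t)
    n = a_only + b_only
    if n == 0:
        return 1.0  # the models never disagree in correctness
    k = min(a_only, b_only)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical example: model A is correct on 8 compounds that model B misses,
# and B is never correct where A is wrong.
y_true = [1] * 8
pred_a = [1] * 8
pred_b = [0] * 8
print(mcnemar_exact(y_true, pred_a, pred_b))
```

A paired test of this kind uses the fact that both models are scored on identical compounds, which is generally more sensitive than comparing two accuracy values as if they came from independent samples.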

https://doi.org/10.7554/eLife.97821.1.sa1

Reviewer #3 (Public Review):

Summary:

The authors presented a data-centric ML approach for virtual ligand screening. They used BRAF as an example to demonstrate the predictive power of their approach.

Strengths:

The performance of the predictive models in this study is superior (nearly perfect) to that of existing methods.

Weaknesses:

I feel the training and testing datasets may not be rigorously constructed. If that is the case, the results would be significantly affected.

I have 3 major comments:

(1) The authors identified ~4100 BRAF actives, then randomly selected 3600 BRAF actives to be part of the training dataset, with the remaining 500 actives becoming part of the hold-out test set. The problem is that the authors did not evaluate the chemical similarity between the 3600 actives in the training set and the 500 actives in the test set. If some of them were similar, the testing results would be very good, but partially due to information leakage. The authors should carefully examine the chemical similarity between all pairs of training and test compounds before any conclusion is made.

(2) The authors tried to explore the role of dataset size in the performance, in particular, what would happen when the number of actives is reduced. However, the minimal number of actives used is 500, while the number of inactives ranges from 500 to 3600. This is quite different from real applications, where the number of expected actives in the screening library would be at most 1-2% of the whole database. The authors should further reduce the number of actives (e.g. 125, 25, 5, 1) and evaluate their model's performance.

(3) The authors chose BRAF as an example in this study. BRAF is a well-studied drug target with thousands of known actives. In real applications, the target may only have a handful of known actives. The authors should try to apply their approach to a couple of other targets that have fewer known actives than BRAF, to evaluate their method's transferability.
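
The train/test similarity check requested in comment (1) could be sketched as follows. This is a minimal illustration in plain Python, assuming each fingerprint is available as a set of on-bit indices (in practice these would be exported from a cheminformatics toolkit). The toy fingerprints and the 0.5 threshold are hypothetical and not taken from the manuscript.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def max_similarity_to_training(test_fps, train_fps):
    """For each test compound, the similarity to its nearest training compound."""
    return [max(tanimoto(t, tr) for tr in train_fps) for t in test_fps]


def fraction_above(similarities, threshold):
    """Share of test compounds whose nearest training neighbour meets the threshold."""
    return sum(1 for s in similarities if s >= threshold) / len(similarities)


# Hypothetical toy fingerprints (sets of on-bit indices).
train_fps = [{1, 2, 3, 4}, {10, 11, 12}]
test_fps = [{1, 2, 3, 5}, {20, 21}]

nearest = max_similarity_to_training(test_fps, train_fps)
print(nearest)
print(fraction_above(nearest, 0.5))
```

A large share of test compounds with a close training neighbour would suggest that a random split overstates performance, in which case a scaffold-based or cluster-based split may give a more realistic estimate.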

https://doi.org/10.7554/eLife.97821.1.sa0
