Establishing the foundations for a data-centric AI approach for virtual drug screening through a systematic assessment of the properties of chemical data

  1. Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore 636921
  2. School of Biological Science, Nanyang Technological University, Singapore 637551
  3. School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798
  4. Center for Biomedical Informatics, Nanyang Technological University, Singapore 636921
  5. Center for AI in Medicine, Nanyang Technological University, Singapore 636921
  6. Division of Neurology, Department of Brain Sciences, Faculty of Medicine, Imperial College London, London W12 0NN

Peer review process

Not revised: This Reviewed Preprint includes the authors’ original preprint (without revision), an eLife assessment, public reviews, and a response from the authors (if available).


Editors

  • Reviewing Editor
    Alan Talevi
    National University of La Plata, La Plata, Argentina
  • Senior Editor
    Aleksandra Walczak
    École Normale Supérieure - PSL, Paris, France

Reviewer #1 (Public Review):

Summary:

The work provides more evidence of the importance of data quality and representation for ligand-based virtual screening approaches. The authors have applied different machine learning (ML) algorithms and data representations to a new dataset of BRAF ligands. First, the authors evaluate the ML algorithms and demonstrate that, independently of the ML algorithm, predictive and robust models can be obtained with this BRAF dataset. Second, the authors investigate how the molecular representation affects the predictions of the ML algorithms. They found that in this highly curated dataset the different molecular representations are adequate for the ML algorithms, since almost all of them obtain high accuracy values, with E-state fingerprints giving the worst-performing predictive models and ECFP6 fingerprints producing the best classification models. Third, the authors evaluate the performance of the models on subsets of the BRAF dataset that differ in composition and size. They found that, given a finite number of active compounds, increasing the number of inactive compounds worsens the recall and accuracy. Finally, the authors analyze whether the use of "less active" molecules affects the model's predictive performance, using either "less active" molecules taken from the ChEMBL database or decoys from DUD-E. They found that the accuracy of the model falls as the number of "less active" examples in the training dataset increases, while including decoys in the training set gives results as good as the original models, or even better in some cases. However, the use of decoys in the training set worsens the predictive power on test sets that contain active and inactive molecules.

Strengths:

It is a very interesting topic in medicinal chemistry and drug discovery. This work is very well written and contains up-to-date references. The general structure of the work is adequate, allowing easy reading. The hypotheses are clear and were explored correctly. This work provides new evidence about the importance of inferring models from high-quality data and that, if such a condition is met, it is not necessary to use complex computational methods to obtain predictive models. The generated BRAF dataset is also a valuable benchmark dataset for medicinal chemists working in ligand-based virtual screening.

Weaknesses:

Leaving aside the new curated BRAF dataset, the work lacks novelty since it is a topic widely studied in chemoinformatics and medicinal chemistry. Furthermore, the conclusions drawn here correspond to the analysis of only one high-quality dataset where the similarity between the molecules is not quantitatively assessed (maybe active and inactive molecules are very dissimilar and any ML algorithm and fingerprint could obtain good results). To generalize the conclusions, it would be fundamental to repeat the analysis with other high-quality datasets.

Some key tasks are not clearly described. For example, there is no information about the new BRAF dataset (e.g., where the molecules were obtained from, or why the inactive molecules provide better results than the "less active" ones from ChEMBL; what differentiates them?). The definition of an "inactive" compound is not clear. It is not stated whether global or balanced accuracy was used for the imbalanced datasets. When using decoys to evaluate the models, it is important to consider that decoys were generated to be topologically different from active compounds, by comparing ECFP4 fingerprints using the Tanimoto coefficient. Therefore, it is quite obvious that when fingerprints are used to characterize molecules, the models will be able to discriminate them easily. It is important to note that this is not necessarily true for models based on other molecular descriptors, since those are not used in the generation of the decoys. In some cases, the differences between accuracies are very small, and there are no statistical analyses to demonstrate whether they are statistically significant.
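
To illustrate why the choice between global and balanced accuracy matters on an imbalanced dataset, the following is a minimal sketch, not the authors' code. The class counts (500 actives, 3,600 inactives) and the trivial "always inactive" classifier are hypothetical and chosen only for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 = active, 0 = inactive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def global_accuracy(y_true, y_pred):
    """Fraction of all compounds classified correctly."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on actives) and specificity (recall on inactives)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical imbalanced test set: 500 actives and 3,600 inactives.
y_true = [1] * 500 + [0] * 3600
# Hypothetical trivial classifier that labels every compound as inactive.
y_pred = [0] * len(y_true)

acc = global_accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print(f"global accuracy:   {acc:.3f}")
print(f"balanced accuracy: {bal_acc:.3f}")
```

In this toy case the global accuracy is close to 0.88 even though no active compound is recovered, whereas the balanced accuracy is 0.5, which is why the metric used should be reported explicitly.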

Reviewer #2 (Public Review):

Summary:

The authors explored the importance of data quality and representation for ligand-based virtual screening approaches. I believe the results could be of potential benefit to the drug discovery community, especially to those scientists working in the field of machine learning applied to drug research. The in silico design is comprehensive and adequate for the proposed comparisons.

This manuscript by Chong A. et al. argues that it is not necessary to resort to sophisticated deep learning algorithms for virtual screening, since, based on their results, conventional ML may perform exceptionally well if fed the right data and molecular representations.

The article is interesting and well-written. The overview of the field and the warning about dataset composition are very well thought-out and should be of interest to a broad segment of the AI in drug discovery readership. This article further highlights some of the considerations that need to be taken into account when implementing data-centric AI for computer-aided drug design methods.

Strengths:

This study contributes significantly to the field of machine learning and data curation in drug discovery. The paper is, in general, well-written and structured. However, I have some suggestions regarding certain aspects of the data analyses.

Weaknesses:

The conclusions drawn in the study are based on the analysis of a single dataset, and I am not sure they can be generalized. Therefore, in my opinion, the conclusions are only partially supported by the data. To generalize the conclusions, it is imperative to conduct a benchmark with diverse datasets, for different molecular targets.
The conclusions cannot be immediately extended to molecular descriptors or features different from the ones used in this study.
It is advisable to present statistical analyses to ascertain whether the observed differences in metrics hold statistical significance.
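
As an illustration of the kind of statistical analysis suggested above, the sketch below applies an exact McNemar test to two classifiers evaluated on the same test compounds. The choice of this particular test is an assumption made for illustration (the reviews do not name one), and the labels and predictions are toy data, not results from the manuscript.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test for two classifiers scored on the same samples.

    Returns (b, c, p_value), where b counts samples only model A gets right
    and c counts samples only model B gets right.
    """
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    n = b + c
    if n == 0:
        # The two models never disagree on correctness: no evidence of a difference.
        return b, c, 1.0
    k = min(b, c)
    # Under the null hypothesis, discordant pairs follow Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (hypothetical): 10 active compounds, model A correct on all,
# model B wrong on the first 6.
y_true = [1] * 10
pred_a = [1] * 10
pred_b = [0] * 6 + [1] * 4

b, c, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(f"discordant pairs: b={b}, c={c}, p={p_value:.5f}")
```

Because the test uses only the compounds on which the two models disagree, it is suited to comparing models on a shared hold-out set; comparisons across repeated splits would need a different procedure, such as a paired test over folds.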

Reviewer #3 (Public Review):

Summary:

The authors presented a data-centric ML approach for virtual ligand screening. They used BRAF as an example to demonstrate the predictive power of their approach.

Strengths:

The performance of predictive models in this study is superior (nearly perfect) with respect to existing methods.

Weaknesses:

I feel the training and testing datasets may not be rigorously constructed. If that is the case, the results would be significantly affected.

I have 3 major comments:

(1) The authors identified ~4100 BRAF actives, then randomly selected 3600 BRAF actives to be part of the training dataset, with the remaining 500 actives becoming part of the hold-out test set. The problem is that the authors did not evaluate the chemical similarity between the 3600 actives in the training set and the 500 actives in the testing set. If some of them were similar, the testing results would be very good, but partially due to information leakage. The authors should carefully examine the chemical similarity between all pairs of compounds in their training and testing datasets before drawing any conclusions.

(2) The authors tried to explore the role of dataset size in the performance, in particular what would happen when the number of actives is reduced. However, the minimal number of actives used is 500, while the number of inactives ranges from 500 to 3600. This is quite different from real applications, where the number of expected actives in the screening library would be at most 1-2% of the whole database. The authors should further reduce the number of actives (e.g. 125, 25, 5, 1) and evaluate their model's performance.

(3) The authors chose BRAF as an example in this study. BRAF is a well-studied drug target with thousands of known actives. In real applications, the target may only have a handful of known actives. The authors should try to apply their approach to a couple of other targets that have fewer known actives than BRAF, to evaluate their method's transferability.
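
The train/test similarity check requested in comment (1) could be sketched as follows. This is a minimal illustration under stated assumptions: fingerprints are represented as Python sets of on-bit indices (in practice they would come from a cheminformatics toolkit), the toy fingerprints are invented, and the 0.5 threshold is arbitrary rather than a recommended cut-off.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def nearest_train_similarity(train_fps, test_fps):
    """For each test compound, the highest Tanimoto similarity to any training compound."""
    return [max(tanimoto(test_fp, train_fp) for train_fp in train_fps)
            for test_fp in test_fps]


def leakage_fraction(train_fps, test_fps, threshold):
    """Fraction of test compounds with a training neighbour at or above the threshold."""
    nearest = nearest_train_similarity(train_fps, test_fps)
    return sum(1 for s in nearest if s >= threshold) / len(nearest)


# Toy fingerprints (hypothetical on-bit sets, not real molecules).
train_fps = [{1, 2, 3, 4}, {10, 11, 12}]
test_fps = [{1, 2, 3, 5}, {20, 21}]

nearest = nearest_train_similarity(train_fps, test_fps)
fraction = leakage_fraction(train_fps, test_fps, threshold=0.5)
print("nearest-neighbour similarities:", nearest)
print("fraction of test set above threshold:", fraction)
```

A high fraction of test compounds with a close training neighbour would suggest that a random split overestimates performance, in which case a scaffold-based or cluster-based split is a common alternative.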
