Establishing the foundations for a data-centric AI approach for virtual drug screening through a systematic assessment of the properties of chemical data

Allen Chong; Ser-Xian Phua; Yunzhi Xiao; Woon Yee Ng; Hoi Yeung Li; Wilson Wen Bin Goh

doi:10.7554/eLife.97821.1

Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore 636921
School of Biological Science, Nanyang Technological University, Singapore 637551
School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798
Center for Biomedical Informatics, Nanyang Technological University, Singapore 636921
Center for AI in Medicine, Nanyang Technological University, Singapore 636921
Division of Neurology, Department of Brain Sciences, Faculty of Medicine, Imperial College London, London W12 0NN

Figures and data

The top-performing standalone fingerprints for each of the 5 ML algorithms

The best- and worst-performing models using a merged fingerprint for all 5 ML algorithms

Accuracy (%) of models trained with an imbalanced training dataset where the number of BRAF actives is decreased but the number of BRAF inactives is maintained at a fixed number (3600)

Accuracy (%) of models trained with a balanced training dataset where the numbers of BRAF actives and BRAF inactives are both similarly decreased

Recall and precision (%) of models trained with an imbalanced training dataset where the number of BRAF actives is decreased but the number of BRAF inactives is maintained at a fixed number (3600)

Recall and precision (%) of models trained with a balanced training dataset where the numbers of BRAF actives and BRAF inactives are both similarly decreased

Average accuracy for the ‘spiked-in’ “less active”-trained models based on testing with 10 balanced BRAF actives and inactives hold-out test sets

Average accuracy for the ‘spiked-in’ decoy-trained models based on testing with 10 balanced BRAF actives and inactives hold-out test sets

Accuracy for the ‘spiked-in’ decoy-trained models based on testing with a balanced BRAF actives and decoys hold-out test set
