Method overview. a) The active learning loop. We treat the entire dataset as the pool and the oracle as the entity that holds the corresponding labels. In each iteration of the active learning loop, we select a batch of points from the pool and obtain their labels from the oracle. We then update the model with the selected batch and evaluate its performance on a holdout set. This process repeats until the desired level of performance is reached. b) Binding affinity prediction is the target function for the ChEMBL and Sanofi-Aventis datasets. c) Active learning batch selection. At the last layer of the model, we use either a Laplace approximation or Monte Carlo dropout to compute covariances (COVLAP and COVDROP, respectively), from which an ensemble of predictions is generated. Using the derived covariance matrix, we iteratively optimize batches based on their information content.
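The covariance-based batch selection in panel c can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes an ensemble of stochastic (Monte Carlo dropout) forward passes is already available as a prediction matrix, estimates a COVDROP-style empirical covariance over pool points, and then greedily picks the points with the largest remaining predictive variance, conditioning the covariance after each pick. The function names and the conditioning heuristic are illustrative choices.

```python
import numpy as np

def mc_dropout_covariance(preds):
    # preds: (T, N) matrix of T stochastic forward passes over N pool points.
    # Empirical covariance across pool points (N x N), a COVDROP-style estimate.
    return np.cov(preds, rowvar=False)

def select_batch(cov, k, jitter=1e-6):
    """Greedy batch selection sketch: repeatedly pick the pool point with
    the largest remaining predictive variance, then condition the covariance
    on that choice so redundant (highly correlated) points are down-weighted."""
    cov = cov.copy()
    chosen = []
    for _ in range(k):
        var = np.diag(cov).copy()
        var[chosen] = -np.inf                         # never re-select a point
        i = int(np.argmax(var))
        chosen.append(i)
        ci = cov[:, i : i + 1]
        cov = cov - ci @ ci.T / (cov[i, i] + jitter)  # condition on point i
    return chosen

# Toy usage: 50 dropout passes over a pool of 8 candidate compounds.
rng = np.random.default_rng(0)
preds = rng.normal(size=(50, 8))
batch = select_batch(mc_dropout_covariance(preds), k=3)
```

The conditioning step mirrors Gaussian posterior updates, so each new pick accounts for the information already gained from the points selected before it.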

Number of experiments required to reach a model error within 10% of the minimum RMSE computed on the whole training set, across different methods. NC: number of compounds in the dataset. Chron: batch selection follows the actual chronological order in which the data were profiled (only available for the internal data). % gain estimates the improvement over random selection, computed as (1 − NCOVDROP/NRandom) × 100, where NCOVDROP and NRandom denote the average number of experiments over 25 active learning cycles with a batch size of 32.
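The % gain formula above amounts to a one-line computation; the numbers below are illustrative placeholders, not values from the table:

```python
def percent_gain(n_covdrop, n_random):
    """% gain over random selection: (1 - N_COVDROP / N_Random) * 100.
    Inputs are average experiment counts over the active learning cycles."""
    return (1 - n_covdrop / n_random) * 100

# Hypothetical example: if COVDROP needs 12 experiments on average
# and random selection needs 16, the gain is 25%.
percent_gain(12, 16)  # -> 25.0
```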

Performance of different methods and batch selection strategies for optimizing models on ADMET- and affinity-related datasets: a) cell effective permeability; b) lipophilicity. Panels c–f show affinity measurements for various target proteins: c) GSK3β (ChEMBL); d) MMP3 (ChEMBL); e) Renin (Sanofi-Aventis); f) FXa (Sanofi-Aventis).