Indaba Grand Challenge: Curing Leishmaniasis by Deep Learning Indaba

3000 Zindi Points

The official competition of the Deep Learning Indaba 2020

321 data scientists enrolled, 24 on the leaderboard

Africa

29 June 2020—31 May 2021

337 days

Winning solution: Indaba Grand Challenge Curing Leishmaniasis by Deep Learning

First of all, i would like to thank Instadeep and Zindi for making such a competition possible. It's an honour to tackle one of Africas greatest problems.

My Solution:

The first approach was to perform Bayesian Optimization directly on the searchspace of possible valid molecules aka drug candidates. Therefore i calculated druglike properties of the compounds with the RDkit package ( molecular weight, the logP value, HDonors and HAcceptor values) and saved them to a Pandas dataframe. This dataframe served now as a template for any further operations. The molecules which were to examine, consisted of four databases and were all placed in the orignal repo of the Zindi competition in the deep_leishmaniasis/dataset folder. These are : drugbank_leishmania.sdf, drug_Central.sdf, in-trials.sdf and world.sdf and their according csv files. All together i then received a database of about 25522 entries, each one containing a drug/molecule.

A sidenote: Docking with autodock-vina on Google-Colab, with two CPU's provided, for 25 K compounds with an average docking time of about 2-5 (depending on the size of the molecules) minutes would take approximately 2080 hours! And these are only average values. In fact there are molecules which easily take up to 40 minutes or longer for evaluation in the docking procedure.

The next step was to identify the residues of the target proteins.

Since there is a correlation between the size of the target molecules and the number of possible targets/binding pockets but no correlation between the size of the molecule and the size ( in Angstrom) of the binding pockets, I wanted to consider only target molecules in a range of maximal 20 K of atoms. I downloaded the molecules from the Storage and focussed on targets out of the preferredTargets.unique.fasta file. Doing a quick literature research, I found that the Lanosterol 14-alpha-demethylase in the Leishmania donovani proteom was a good starting point for the investigation. I downloaded the E9BAU8 molecule, which contains the desired amino acid sequence and also downloaded the 14-alpha demethylase directly from RDC Uniprot, which can be found under /targets/3juv.pdb and /targets/5esh.pdb

Bayesian Optimization(BO):

Bayesian Optimization is an Optimzation algortihm and best suited for finding solutions within a large parameter space. BO is often used in hyperparameter tuning in Deep neural networks and shows generally good performance comparing to Random Search or Grid Search, which is nothing but brute-forcing the parameter space and hence inattractable for this project.

The basic idea is to formulate an surrogate model(often a gaussian process (GP)) from which to take samples from. Instead of a GP one could also use a Tree-of_Parzen-Estimators(TPE) or even a XG-Boost approach from which to sample. But the GP is the most mathematical consistent approach. Core of an GP is the kernel function in which prior and posterior beliefs about the designated funtion (black-box function) are incorporated. Then an acquisition function is set which guides the GP on where to search the parameter space. Finally a new point in the space gets evaluated and the GP is updated.

I used the Squared Exponential kernel (SE), the matern52 kernel (M52), which gave the best results, and Expected Improvement with constraints (EI) and also Probability of Improvement(PI) as acquisition functions. The hyperparameters of the SE were set to lengthscale l = 1 , sigmaf = 1.0 , sigman = 1e-12 and are quite standard values. The M52 hyperpamaters were set to v = 0.5 l =1.2 ,sigmaf=1, sigman=1e-6 and are also quite standard values according to literature.

I then focussed on data from the drugCentral.csv molecules to see if already approved drugs are worth investigating.

In the data_prep notebook, I read in the file, calculated the properties and saved them to a csv file, which should serve as a DataFrame to work on. This Dataframe consists of 4096 molecules and is used by the binding_affinity evaluation for autodock-vina and the Gaussian Process kernel and can be easily adapted/switched when a different dataset should be investigated

The Bayesian Optimization was then set to 12 initial evaluation before fitting the GP to the data and 20 evaluations, so that there was an overall of 32 evaluations from the black-box-functions(the autodock-vina binding-value).

This resulted in an 32-dimensional Gaussian Process. For the binding value, I calculated an average of the eight binding_poses and gave it back as binding_value. Finally these values were sorted in ascending order of the best binding_affinity values.

I then repeated this procedure for at least five times on every kernel/acquisition pair and saved the results to Bayes_Opt/results and named them bayes_sorted_bindings.csv

Observations:

While the EIC clearly preferred molecules in a smaller range with atomic weight of 700 to 1200 Daltons,the PI favoured molecules in the range of 1000- 1600 and 2000-2900 Daltons and logP's going from -3.0 to 15.0.

One could observe that both kernel/acquisition pairs often discoverd abarelix, ganirelix, evorolimus, buselin, vintafolide, acetyldigitoxin as the molecules where the BO assumed to minimize the black box functions.

Indeed these kind of molecules/drugs were those who gave me the best binding-values and accordingly the best score on the leaderboard.

abarelix/teverelix ~ -30 kcal/mol score: -105.098 vintafolide - 31.2 kcal/mol score: -89.6 evorolimus - 24.2 kcal/mol score: -74.345 acetyldigitoxin ~ -22.0 kcal/mol score: -51.41

In general I discoverd ca. 100 different drugs which all scored in the -40 to -100 range.

I then used Miltefosine as a reference molecule and submitted the poses/models of my found drugs which were quite close to the binding of Miltefosine where I finally contained the best scoring results.

Finally I tested the found molecules agains a variety of proteins from the preferred targets and the different types/specimens like leishmania major, infantum and donovani. The drugs average binding values never have fallen below the - 20. kcal/mol line and gave average scoring values of ~ -40.0.

To my best knowledge, I could not find any work related of doing Bayesian Optimization on direct drug properties and this seems to be quite a complete novel approach.

Finally, a lot of thanks goes out to all the participants, especially to @jahir and @hothman, which whom I had nice conversations during the competition.

There is still is a lot of work to do and I encourage anyone to work on this topic beyond the competition.

Best regards,

Philipp Pusch

Thanks for a detailed write up. Lots to learn

Great explanation. Thanks for sharing.

Great work and very interesting approach @ppusch ! Thanks for sharing

Well done Philipp!

Thank you for sharing. Great work!