A Comprehensive Database for Compound Property Prediction Using Deep Learning Technique

Bioinformatics Internship Presentation

Ying Dong (Mentor: Dr. Xiaohua Zhang, Lawrence Livermore National Laboratory)

August 29, 2017, 2:30pm, Room 341, Basic Science

On-target and off-target effects are among the most crucial phenomena in drug-protein interaction: on-target means a drug is 100% specific for its intended target, while off-target means a lack of specificity. Under most circumstances, however, off-target binding caused by mechanisms such as membrane transporters occurs and significantly decreases drug efficacy. The overall LLNL group project aims to construct a new model for drug-protein interaction using machine learning techniques, so that drug-protein interactions can be identified and the drugs with the best affinity for the desired targets can be selected. To achieve this goal, local ChEMBL and PDB databases were built and connected as the data source for proteins and small-molecule compounds. After the relational database was constructed, simulation and molecular docking were performed on new high performance computing (HPC) systems, and a group of small-molecule compounds of interest was selected for further research. After the drug-protein simulation, KEGG and SIDER databases were built and connected to serve as training and testing datasets for developing the prediction model via deep learning.

More specifically, my project focused on constructing and managing the relational database and on using deep learning to develop and test the prediction model. First, the PDB, KEGG and SIDER plain-text data were converted into MySQL databases, and the ChEMBL database was downloaded and installed on the database server. Then the KEGG, ChEMBL, PDB and SIDER MySQL databases were combined into a single database (named Mother of All Databases, MOAD) via ChEMBL ID, PDB ID, UniProt ID and other shared identifiers. Once the databases were connected, data sets of small-molecule compounds with physical properties, including log solubility, LogD and LogP, were retrieved from the ChEMBL database. Data for more than 10,000 small molecules were then used to train models with the Python module DeepChem; various deep learning models were built, and a reliable prediction model was selected and tested. Finally, I put the code, scripts and documents under version control and wrote README files explaining how other users can run my code.
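
The idea of connecting databases through shared identifiers and then retrieving compound properties can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the table and column names (chembl_compound, chembl_activity, pdb_entry) are hypothetical and do not reflect the actual MOAD or ChEMBL schema, the rows are made-up placeholders, and an in-memory SQLite database stands in for the MySQL server so that the example is self-contained.

```python
import sqlite3


def build_toy_moad():
    """Create a tiny in-memory database with hypothetical tables and placeholder rows."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    # Hypothetical tables; the real ChEMBL and PDB schemas are far larger.
    cur.execute("CREATE TABLE chembl_compound (chembl_id TEXT PRIMARY KEY, logp REAL)")
    cur.execute("CREATE TABLE chembl_activity (chembl_id TEXT, uniprot_id TEXT)")
    cur.execute("CREATE TABLE pdb_entry (pdb_id TEXT PRIMARY KEY, uniprot_id TEXT)")
    # Placeholder identifiers and values, not real database records.
    cur.executemany(
        "INSERT INTO chembl_compound VALUES (?, ?)",
        [("CMPD_A", 1.5), ("CMPD_B", 3.2), ("CMPD_C", None)],
    )
    cur.executemany(
        "INSERT INTO chembl_activity VALUES (?, ?)",
        [("CMPD_A", "PROT_1"), ("CMPD_B", "PROT_1"), ("CMPD_C", "PROT_2")],
    )
    cur.executemany(
        "INSERT INTO pdb_entry VALUES (?, ?)",
        [("XXX1", "PROT_1"), ("XXX2", "PROT_2")],
    )
    conn.commit()
    return conn


def compounds_with_logp(conn):
    """Join compounds to protein structures via shared IDs, keeping rows that have a LogP value."""
    query = """
        SELECT c.chembl_id, c.logp, p.pdb_id
        FROM chembl_compound AS c
        JOIN chembl_activity AS a ON a.chembl_id = c.chembl_id
        JOIN pdb_entry AS p ON p.uniprot_id = a.uniprot_id
        WHERE c.logp IS NOT NULL
        ORDER BY c.chembl_id
    """
    return conn.execute(query).fetchall()


rows = compounds_with_logp(build_toy_moad())
for chembl_id, logp, pdb_id in rows:
    print(chembl_id, logp, pdb_id)
```

The compound without a LogP value is filtered out by the query, which mirrors the step of retrieving only compounds that have the physical property needed for model training.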