Exploration of chemical space with partial labeled noisy student self-training and self-supervised graph embedding.

Title: Exploration of chemical space with partial labeled noisy student self-training and self-supervised graph embedding.
Publication Type: Journal Article
Year of Publication: 2022
Authors: Liu Y, Lim H, Xie L
Journal: BMC Bioinformatics
Volume: 23
Issue: Suppl 3
Pagination: 158
Date Published: 2022 May 02
ISSN: 1471-2105
Keywords: Algorithms, Humans, Machine Learning, Neural Networks, Computer, Quantitative Structure-Activity Relationship, Students
Abstract

BACKGROUND: Drug discovery is time-consuming and costly. Machine learning, especially deep learning, shows great potential in quantitative structure-activity relationship (QSAR) modeling to accelerate the drug discovery process and reduce its cost. A major challenge in developing robust and generalizable deep learning models for QSAR is the lack of large datasets with high-quality, balanced labels. To address this challenge, we developed a self-training method, Partially LAbeled Noisy Student (PLANS), and a novel self-supervised graph embedding, Graph-Isomorphism-Network Fingerprint (GINFP), which uses unlabeled data to learn chemical compound representations that carry substructure information. The representations can be used to predict chemical properties such as binding affinity and toxicity. PLANS-GINFP allows us to exploit millions of unlabeled chemical compounds, as well as labeled and partially labeled pharmacological data, to improve the generalizability of neural network models.
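
The abstract does not say how partially labeled records enter training. One common way to use such data, shown here only as an illustrative sketch and not as the authors' implementation, is to compute the loss over observed labels and skip missing ones. The function name `masked_bce` and the use of `None` for a missing measurement are assumptions made for this example.

```python
import math

def masked_bce(probs, labels):
    """Mean binary cross-entropy over observed labels only.

    probs:  rows of predicted probabilities, one column per task.
    labels: same shape; 0 or 1 where a measurement exists,
            None where it is missing (hypothetical convention).
    """
    total, count = 0.0, 0
    for p_row, y_row in zip(probs, labels):
        for p, y in zip(p_row, y_row):
            if y is None:
                # Missing label: contributes nothing to the loss.
                continue
            # Clip to avoid log(0).
            p = min(max(p, 1e-7), 1 - 1e-7)
            total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
            count += 1
    # With no observed labels there is nothing to average.
    return total / count if count else 0.0

# One compound, two tasks; only the first task has a measured label.
loss = masked_bce([[0.5, 0.9]], [[1, None]])
```

Here the prediction for the unmeasured second task has no effect on `loss`, so a compound with even one measured endpoint can still contribute to training.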

RESULTS: We evaluated PLANS-GINFP on predicting Cytochrome P450 (CYP450) binding activity in a CYP450 dataset and chemical toxicity in the Tox21 dataset. Extensive benchmark studies demonstrated that PLANS-GINFP improved performance by a large margin in both cases. Both PLANS-based self-training and GINFP-based self-supervised learning contributed to the improvement.

CONCLUSION: To better exploit chemical structures as input for machine learning algorithms, we proposed a self-supervised, graph neural network-based embedding method that encodes substructure information. Furthermore, we developed a model-agnostic self-training method, PLANS, that can be applied to any deep learning architecture to improve prediction accuracy. PLANS provides a way to better utilize partially labeled and unlabeled data. Comprehensive benchmark studies demonstrated the potential of both methods for predicting drug metabolism and toxicity profiles from sparse, noisy, and imbalanced data. PLANS-GINFP could serve as a general solution for improving predictive QSAR modeling.
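
The model-agnostic idea can be illustrated with a generic teacher-student loop. This is a minimal single-task sketch of noisy-student-style self-training in general, not the PLANS procedure itself, whose details the abstract does not give. All names (`self_train`, `fit_centroids`), the Gaussian input noise, and the toy centroid model are assumptions chosen to keep the example self-contained.

```python
import math
import random

def fit_centroids(examples):
    """Toy model (hypothetical): soft-label-weighted class centroids."""
    dim = len(examples[0][0])
    pos, neg = [0.0] * dim, [0.0] * dim
    w_pos = w_neg = 1e-9  # small constants avoid division by zero
    for x, y in examples:
        for i, v in enumerate(x):
            pos[i] += y * v
            neg[i] += (1 - y) * v
        w_pos += y
        w_neg += 1 - y
    pos = [v / w_pos for v in pos]
    neg = [v / w_neg for v in neg]

    def predict(x):
        d_pos = sum((a - b) ** 2 for a, b in zip(x, pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, neg))
        z = max(min(d_pos - d_neg, 50.0), -50.0)  # clamp for exp()
        return 1.0 / (1.0 + math.exp(z))

    return predict

def self_train(fit, labeled, unlabeled, rounds=2, noise=0.1, seed=0):
    """Teacher-student loop; `fit` can be any trainer returning a predictor."""
    rng = random.Random(seed)
    teacher = fit(labeled)  # the first teacher sees only real labels
    for _ in range(rounds):
        # The teacher assigns soft pseudo-labels to the unlabeled pool.
        pseudo = [(x, teacher(x)) for x in unlabeled]
        # The student trains on real plus pseudo labels with input noise,
        # a stand-in here for dropout or augmentation.
        noisy = [([v + rng.gauss(0.0, noise) for v in x], y)
                 for x, y in labeled + pseudo]
        teacher = fit(noisy)  # the student becomes the next teacher
    return teacher

labeled = [([0.0, 0.0], 0), ([2.0, 2.0], 1)]
unlabeled = [[0.1, 0.2], [1.9, 2.1], [0.3, 0.0], [2.2, 1.8]]
model = self_train(fit_centroids, labeled, unlabeled)
```

Because `self_train` only calls `fit` and the predictor it returns, the toy centroid model could in principle be swapped for any trainer with the same interface, which is the sense of "model-agnostic" illustrated here.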

DOI: 10.1186/s12859-022-04681-3
Alternate Journal: BMC Bioinformatics
PubMed ID: 35501680
PubMed Central ID: PMC9063120
Grant List: R01 AG057555 / AG / NIA NIH HHS / United States
R01 GM122845 / GM / NIGMS NIH HHS / United States