Exploration of chemical space with partial labeled noisy student self-training and self-supervised graph embedding. | Helen & Robert Appel Alzheimer’s Disease Research Institute

Title	Exploration of chemical space with partial labeled noisy student self-training and self-supervised graph embedding.
Publication Type	Journal Article
Year of Publication	2022
Authors	Liu Y, Lim H, Xie L
Journal	BMC Bioinformatics
Volume	23
Issue	Suppl 3
Pagination	158
Date Published	2022 May 02
ISSN	1471-2105
Keywords	Algorithms, Humans, Machine Learning, Neural Networks, Computer, Quantitative Structure-Activity Relationship, Students
Abstract	BACKGROUND: Drug discovery is time-consuming and costly. Machine learning, especially deep learning, shows great potential in quantitative structure-activity relationship (QSAR) modeling to accelerate the drug discovery process and reduce its cost. A major challenge in developing robust and generalizable deep learning models for QSAR is the lack of large datasets with high-quality, balanced labels. To address this challenge, we developed a self-training method, Partially LAbeled Noisy Student (PLANS), and a novel self-supervised graph embedding, Graph-Isomorphism-Network Fingerprint (GINFP), which learns chemical compound representations with substructure information from unlabeled data. The representations can be used to predict chemical properties such as binding affinity and toxicity. PLANS-GINFP allows us to exploit millions of unlabeled chemical compounds, as well as labeled and partially labeled pharmacological data, to improve the generalizability of neural network models. RESULTS: We evaluated the performance of PLANS-GINFP in predicting Cytochrome P450 (CYP450) binding activity in a CYP450 dataset and chemical toxicity in the Tox21 dataset. Extensive benchmark studies demonstrated that PLANS-GINFP improved performance by a large margin in both cases. Both PLANS-based self-training and GINFP-based self-supervised learning contributed to the improvement. CONCLUSION: To better exploit chemical structures as input for machine learning algorithms, we proposed a self-supervised, graph neural network-based embedding method that can encode substructure information. Furthermore, we developed a model-agnostic self-training method, PLANS, that can be applied to any deep learning architecture to improve prediction accuracy. PLANS provides a way to better utilize partially labeled and unlabeled data. Comprehensive benchmark studies demonstrated the potential of both methods in predicting drug metabolism and toxicity profiles from sparse, noisy, and imbalanced data. PLANS-GINFP could serve as a general solution for improving predictive QSAR modeling.
DOI	10.1186/s12859-022-04681-3
Alternate Journal	BMC Bioinformatics
PubMed ID	35501680
PubMed Central ID	PMC9063120
Grant List	R01 AG057555 / AG / NIA NIH HHS / United States; R01 GM122845 / GM / NIGMS NIH HHS / United States; R01GM122845 / GM / NIGMS NIH HHS / United States; R01AG057555 / AG / NIA NIH HHS / United States
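
The abstract describes self-training on partially labeled data only at a high level. The following is a minimal, generic sketch of two ingredients such a scheme typically needs, and it is not the authors' implementation: a loss that ignores unmeasured labels, and a step in which a teacher model's soft predictions fill in the missing labels before a student is trained. The function names `masked_bce` and `fill_with_pseudo_labels`, the use of `None` for a missing label, and the toy numbers are all assumptions made for illustration. The noise applied to the student and the GINFP embedding are omitted.

```python
import math


def masked_bce(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over observed labels only (None = missing)."""
    total, count = 0.0, 0
    for row_true, row_prob in zip(y_true, y_prob):
        for y, p in zip(row_true, row_prob):
            if y is None:
                continue  # unmeasured assay: contributes nothing to the loss
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
            count += 1
    if count == 0:
        raise ValueError("no observed labels")
    return total / count


def fill_with_pseudo_labels(y_partial, teacher_prob):
    """Keep observed labels; replace missing ones with the teacher's soft predictions."""
    return [
        [p if y is None else y for y, p in zip(row_true, row_prob)]
        for row_true, row_prob in zip(y_partial, teacher_prob)
    ]


# Hypothetical toy data: 3 compounds x 2 assays.
y_partial = [[1.0, None], [None, 0.0], [None, None]]
teacher_prob = [[0.9, 0.2], [0.6, 0.1], [0.3, 0.8]]

# Targets a student model would be trained on in the next round.
y_student = fill_with_pseudo_labels(y_partial, teacher_prob)
teacher_loss = masked_bce(y_partial, teacher_prob)
print(y_student)
print(teacher_loss)
```

In this sketch a fully unlabeled compound (the third row) receives teacher predictions for every assay, while a partially labeled compound keeps its measured values and only has the gaps filled. Whether the paper combines labels in this way is not stated in the abstract.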