Title | Exploration of chemical space with partial labeled noisy student self-training and self-supervised graph embedding. |
Publication Type | Journal Article |
Year of Publication | 2022 |
Authors | Liu Y, Lim H, Xie L |
Journal | BMC Bioinformatics |
Volume | 23 |
Issue | Suppl 3 |
Pagination | 158 |
Date Published | 2022 May 02 |
ISSN | 1471-2105 |
Keywords | Algorithms, Humans, Machine Learning, Neural Networks, Computer, Quantitative Structure-Activity Relationship, Students |
Abstract | BACKGROUND: Drug discovery is time-consuming and costly. Machine learning, especially deep learning, shows great potential in quantitative structure-activity relationship (QSAR) modeling to accelerate the drug discovery process and reduce its cost. A big challenge in developing robust and generalizable deep learning models for QSAR is the lack of a large amount of data with high-quality and balanced labels. To address this challenge, we developed a self-training method, Partially LAbeled Noisy Student (PLANS), and a novel self-supervised graph embedding, Graph-Isomorphism-Network Fingerprint (GINFP), for chemical compound representations with substructure information using unlabeled data. The representations can be used for predicting chemical properties such as binding affinity, toxicity, and others. PLANS-GINFP allows us to exploit millions of unlabeled chemical compounds as well as labeled and partially labeled pharmacological data to improve the generalizability of neural network models. RESULTS: We evaluated the performance of PLANS-GINFP for predicting Cytochrome P450 (CYP450) binding activity in a CYP450 dataset and chemical toxicity in the Tox21 dataset. The extensive benchmark studies demonstrated that PLANS-GINFP could significantly improve performance in both cases by a large margin. Both PLANS-based self-training and GINFP-based self-supervised learning contribute to the performance improvement. CONCLUSION: To better exploit chemical structures as an input for machine learning algorithms, we proposed a self-supervised graph neural network-based embedding method that can encode substructure information. Furthermore, we developed a model-agnostic self-training method, PLANS, that can be applied to any deep learning architecture to improve prediction accuracy. PLANS provides a way to better utilize partially labeled and unlabeled data. Comprehensive benchmark studies demonstrated their potential in predicting drug metabolism and toxicity profiles using sparse, noisy, and imbalanced data. PLANS-GINFP could serve as a general solution to improve predictive QSAR modeling. |
DOI | 10.1186/s12859-022-04681-3 |
Alternate Journal | BMC Bioinformatics |
PubMed ID | 35501680 |
PubMed Central ID | PMC9063120 |
Grant List | R01 AG057555 / AG / NIA NIH HHS / United States; R01 GM122845 / GM / NIGMS NIH HHS / United States |
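
The abstract states that PLANS makes use of partially labeled data but gives no procedural detail. As a rough illustration only, the sketch below shows one common way a self-training loop can handle a partially labeled multi-task label matrix: observed labels are kept, and missing entries are filled with a teacher model's predicted probabilities before a student is trained. The function name `fill_missing_labels`, the use of `None` for missing labels, and the example values are all assumptions made for this sketch; it is not the authors' implementation.

```python
from typing import List, Optional, Sequence


def fill_missing_labels(
    labels: Sequence[Sequence[Optional[int]]],
    teacher_probs: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Return training targets for a student model.

    Hypothetical helper: observed labels (0 or 1) are kept unchanged, and
    missing labels (None) are replaced by the teacher's predicted probability
    for that compound and task.
    """
    if len(labels) != len(teacher_probs):
        raise ValueError("labels and teacher_probs need the same number of rows")

    filled: List[List[float]] = []
    for row, probs in zip(labels, teacher_probs):
        if len(row) != len(probs):
            raise ValueError("each row needs one teacher probability per task")
        filled.append(
            [float(p) if y is None else float(y) for y, p in zip(row, probs)]
        )
    return filled


# Made-up example: three compounds, two tasks, some labels missing.
example_labels = [[1, None], [None, 0], [None, None]]
example_teacher_probs = [[0.9, 0.2], [0.7, 0.4], [0.1, 0.8]]
student_targets = fill_missing_labels(example_labels, example_teacher_probs)
```

In a full noisy-student style loop, the student trained on such targets would typically be trained with added noise and could then act as the teacher for the next round; those steps are left out here because the record does not describe them.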