Elucidation of genome-wide understudied proteins targeted by PROTAC-induced degradation using interpretable machine learning.

Title: Elucidation of genome-wide understudied proteins targeted by PROTAC-induced degradation using interpretable machine learning.
Publication Type: Journal Article
Year of Publication: 2023
Authors: Xie L, Xie L
Journal: PLoS Comput Biol
Date Published: 2023 Aug
Keywords: Alzheimer Disease, Amino Acid Sequence, Genome, Human, Humans, Machine Learning, Proteolysis Targeting Chimera, Ubiquitin-Protein Ligases

Proteolysis-targeting chimeras (PROTACs) are hetero-bifunctional molecules that induce the degradation of target proteins by recruiting an E3 ligase. PROTACs have the potential to inactivate disease-related genes that are considered undruggable by small molecules, making them a promising therapy for the treatment of incurable diseases. However, only a few hundred proteins have been experimentally tested for their amenability to PROTACs, and it remains unclear which other proteins in the entire human genome can be targeted by PROTACs. In this study, we have developed PrePROTAC, an interpretable machine learning model based on a transformer-based protein sequence descriptor and random forest classification. PrePROTAC predicts genome-wide targets that can be degraded by CRBN, one of the E3 ligases. In the benchmark studies, PrePROTAC achieved a ROC-AUC of 0.81, an average precision of 0.84, and over 40% sensitivity at a false positive rate of 0.05. When evaluated by an external test set which comprised proteins from different structural folds than those in the training set, the performance of PrePROTAC did not drop significantly, indicating its generalizability. Furthermore, we developed an embedding SHapley Additive exPlanations (eSHAP) method, which extends conventional SHAP analysis for original features to an embedding space through in silico mutagenesis. This method allowed us to identify key residues in the protein structure that play critical roles in PROTAC activity. The identified key residues were consistent with existing knowledge. Using PrePROTAC, we identified over 600 novel understudied proteins that are potentially degradable by CRBN and proposed PROTAC compounds for three novel drug targets associated with Alzheimer's disease.
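The abstract reports sensitivity at a fixed false positive rate of 0.05 alongside ROC-AUC and average precision. As a rough illustration of what that metric measures, the sketch below computes the highest true positive rate reachable while the false positive rate stays at or below a cap. It is not the authors' evaluation code: the function name `sensitivity_at_fpr`, the labels, and the scores are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
def sensitivity_at_fpr(labels, scores, max_fpr=0.05):
    """Return the highest true positive rate achievable while the
    false positive rate stays <= max_fpr.

    labels: iterable of 0/1 ground-truth values (1 = degradable, hypothetical)
    scores: iterable of classifier scores, higher = more likely positive
    """
    labels = list(labels)
    scores = list(scores)
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")

    # Sweep the decision threshold from the highest score downwards.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    best_tpr = 0.0
    i = 0
    while i < len(ranked):
        # Examples with identical scores cross the threshold together.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        if fp / negatives > max_fpr:
            break  # lowering the threshold further only adds false positives
        best_tpr = tp / positives
        i = j
    return best_tpr


if __name__ == "__main__":
    # Toy data (made up): 4 positives and 20 negatives, so a false positive
    # rate of 0.05 allows exactly one false positive.
    toy_labels = [1, 1, 0, 1, 0, 1] + [0] * 18
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4] + [0.1] * 18
    print(sensitivity_at_fpr(toy_labels, toy_scores, max_fpr=0.05))
```

On this toy data, three of the four positives rank above the second false positive, so the sensitivity at a false positive rate of 0.05 is 0.75. A low cap such as 0.05 emphasizes the top of the ranking, which matters when only a small number of predicted targets can be followed up experimentally.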

Alternate Journal: PLoS Comput Biol
PubMed ID: 37590332
PubMed Central ID: PMC10464998
Grant List:
R01 AG057555 / AG / NIA NIH HHS / United States
R01 GM122845 / GM / NIGMS NIH HHS / United States