Abstract
Semi-supervised learning has proved effective at exploiting extensive unlabeled data to reduce the need for large amounts of labeled data and to improve model performance. Despite its tremendous potential, semi-supervised learning has yet to be implemented in the field of drug discovery. Empirical testing and classification of drugs is costly and time-consuming. In contrast, predicting the therapeutic applications of drugs from their structural formulas using semi-supervised learning would reduce costs and time significantly. Herein, we employ a new multicontrastive-based semi-supervised learning algorithm, MultiCon, for classifying drugs into 12 categories, according to therapeutic applications, on the basis of image analyses of their structural formulas. Through rational use of data balancing, online augmentation of the drug image data during training, and the combined use of a multicontrastive loss with consistency regularization, MultiCon achieves better class prediction accuracies than state-of-the-art machine learning methods across a variety of existing semi-supervised learning benchmarks. In particular, it performs exceptionally well with a limited number of labeled examples. For instance, with just 5000 labeled drugs in a PubChem (D3) data set, MultiCon achieved a class prediction accuracy of 97.74%.
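The abstract names two ingredients, a contrastive loss and consistency regularization, without giving their formulas. The sketch below illustrates the general idea of combining such terms on two augmented views of the same unlabeled batch. It is not the MultiCon loss: the InfoNCE-style contrastive term, the mean-squared-error consistency term, the weighting factor `lam`, the temperature value, and all function names are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Convert a list of logits into class probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def consistency_loss(logits_a, logits_b):
    """Mean squared difference between the class probabilities
    predicted for two augmented views of the same inputs
    (assumed form of consistency regularization)."""
    total = 0.0
    for la, lb in zip(logits_a, logits_b):
        pa, pb = softmax(la), softmax(lb)
        total += sum((x - y) ** 2 for x, y in zip(pa, pb)) / len(pa)
    return total / len(logits_a)


def _normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def contrastive_loss(emb_a, emb_b, temperature=0.5):
    """InfoNCE-style term: the embedding of view i in emb_a should be
    more similar to view i in emb_b than to any other row of emb_b."""
    a = [_normalize(v) for v in emb_a]
    b = [_normalize(v) for v in emb_b]
    total = 0.0
    for i, ai in enumerate(a):
        sims = [sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b]
        m = max(sims)
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims))
        total += log_denom - sims[i]  # negative log-softmax of the positive pair
    return total / len(a)


def combined_loss(logits_a, logits_b, emb_a, emb_b, lam=1.0):
    """Unlabeled-data objective: contrastive term plus a weighted
    consistency term (lam is a hypothetical weighting factor)."""
    return contrastive_loss(emb_a, emb_b) + lam * consistency_loss(logits_a, logits_b)


if __name__ == "__main__":
    # Toy batch: two samples, two views each, 3 classes, 2-d embeddings.
    logits_view1 = [[2.0, 0.1, -1.0], [0.0, 1.5, 0.3]]
    logits_view2 = [[1.8, 0.2, -0.9], [0.1, 1.2, 0.4]]
    emb_view1 = [[1.0, 0.1], [0.0, 1.0]]
    emb_view2 = [[0.9, 0.0], [0.1, 1.1]]
    print(combined_loss(logits_view1, logits_view2, emb_view1, emb_view2))
```

In a full training loop a term of this kind would typically be added to an ordinary supervised cross-entropy loss on the labeled examples; how MultiCon weights and schedules its terms is described in the paper itself, not here.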
Original language | English (US) |
---|---|
Pages (from-to) | 5995-6006 |
Number of pages | 12 |
Journal | Journal of Chemical Information and Modeling |
Volume | 60 |
Issue number | 12 |
State | Published - Dec 28 2020 |
Funding
The authors thank Northwestern University and The University of Texas at Dallas for their continued support for this research. The research reported herein was supported in part by NSF awards (DMS-1737978, DGE-2039542, OAC-1828467, OAC-1931541, DGE-1906630) and an IBM faculty award (research).
ASJC Scopus subject areas
- General Chemical Engineering
- General Chemistry
- Library and Information Sciences
- Computer Science Applications