Semi-supervised learning has proved effective at exploiting extensive unlabeled data to reduce the need for large amounts of labeled data and to improve model performance. Despite its tremendous potential, semi-supervised learning has yet to be implemented in the field of drug discovery. Empirical testing and classification of drugs is costly and time-consuming. In contrast, predicting the therapeutic applications of drugs from their structural formulas using semi-supervised learning would reduce costs and time significantly. Herein, we employ a new multicontrastive-based semi-supervised learning algorithm, MultiCon, to classify drugs into 12 categories according to their therapeutic applications, on the basis of image analyses of their structural formulas. Through rational use of data balancing, online augmentation of the drug image data during training, and the combination of a multicontrastive loss with consistency regularization, MultiCon achieves higher class prediction accuracies than state-of-the-art machine learning methods across a variety of existing semi-supervised learning benchmarks. In particular, it performs exceptionally well when only a limited number of labeled examples is available. For instance, with just 5000 labeled drugs in a PubChem (D3) data set, MultiCon achieved a class prediction accuracy of 97.74%.
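The abstract does not define the multicontrastive loss or the exact form of the consistency term, so the following is only a generic sketch of how a contrastive objective and a consistency regularizer might be combined on unlabeled data. It is not MultiCon's actual formulation. The InfoNCE-style contrastive term, the squared-error consistency term, the function names, the `temperature` and `weight` parameters, and the toy inputs are all assumptions made for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(view_a, view_b, temperature=0.5):
    """InfoNCE-style term (assumed form): embedding i of one augmented view
    should be most similar to embedding i of the other augmented view."""
    n = len(view_a)
    loss = 0.0
    for i in range(n):
        sims = [cosine(view_a[i], view_b[j]) / temperature for j in range(n)]
        loss += -math.log(softmax(sims)[i])
    return loss / n


def consistency_loss(logits_weak, logits_strong):
    """Consistency term (assumed form): mean squared difference between the
    class probabilities predicted for two augmentations of the same image."""
    total = 0.0
    for lw, ls in zip(logits_weak, logits_strong):
        pw, ps = softmax(lw), softmax(ls)
        total += sum((a - b) ** 2 for a, b in zip(pw, ps)) / len(pw)
    return total / len(logits_weak)


def combined_loss(view_a, view_b, logits_weak, logits_strong, weight=1.0):
    """Weighted sum of the two unlabeled-data terms; `weight` is hypothetical."""
    return (contrastive_loss(view_a, view_b)
            + weight * consistency_loss(logits_weak, logits_strong))


# Toy inputs: two 2-D embeddings per view and 3-class logits per image.
emb_a = [[1.0, 0.0], [0.0, 1.0]]
emb_b = [[1.0, 0.0], [0.0, 1.0]]
weak = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]]
strong = [[1.8, 0.6, 0.2], [0.1, 1.7, 0.2]]
print(combined_loss(emb_a, emb_b, weak, strong))
```

In a full training loop, a term of this kind would typically be added to a supervised cross-entropy loss on the labeled examples, with the embeddings and logits produced by the network from augmented drug images rather than supplied by hand.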
ASJC Scopus subject areas
- Chemical Engineering(all)
- Computer Science Applications
- Library and Information Sciences