Abstract
Malware datasets inevitably contain incorrect labels due to the shortage of expertise and experience needed for sample labeling. Previous research demonstrated that a training dataset with incorrectly labeled samples results in inaccurate model learning. To address this problem, researchers have proposed various noise learning methods to offset the impact of incorrectly labeled samples; in image recognition and text mining applications, these methods have demonstrated great success. In this work, we apply both representative and state-of-the-art noise learning methods to real-world malware classification tasks. Surprisingly, we observe that none of the existing methods can minimize the impact of incorrect labels. Through a carefully designed experiment, we discover that the inefficacy mainly results from extreme data imbalance and the high percentage of incorrectly labeled data samples. We therefore propose a new noise learning method, which we name MORSE. Unlike existing methods, MORSE customizes and extends a state-of-the-art semi-supervised learning technique. It treats possibly incorrectly labeled data as unlabeled data and thus avoids their potential negative impact on model learning. MORSE also integrates a sample re-weighting method that balances the use of training data during model learning and thus handles the data imbalance challenge. We evaluate MORSE on both our synthesized and real-world datasets. We show that MORSE significantly outperforms existing noise learning methods and minimizes the impact of incorrectly labeled data.
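The two ideas named in the abstract, treating suspect samples as unlabeled and re-weighting the remaining samples to counter imbalance, can be illustrated with a minimal sketch. This is not the MORSE implementation. The abstract does not say how suspect samples are identified, so the per-sample loss threshold used here is an assumption, and the inverse-class-frequency weights are a common stand-in for the paper's re-weighting method. The function name `split_and_reweight` and all parameters are hypothetical.

```python
from collections import Counter


def split_and_reweight(labels, losses, loss_threshold):
    """Illustrative only; not the method from the paper.

    labels: noisy class label for each sample
    losses: per-sample training loss (assumed proxy for label noise)
    loss_threshold: samples with a loss above this are treated as suspect
    """
    labeled = []    # (index, label) pairs kept as labeled data
    unlabeled = []  # indices whose labels are discarded
    for i, (y, loss) in enumerate(zip(labels, losses)):
        if loss <= loss_threshold:
            labeled.append((i, y))
        else:
            # Assumption: a high loss suggests a wrong label, so the
            # sample joins the unlabeled pool instead of being dropped.
            unlabeled.append(i)

    # Inverse-class-frequency weights over the kept samples, so that
    # every class contributes the same total weight.
    counts = Counter(y for _, y in labeled)
    total, n_classes = len(labeled), len(counts)
    weights = {i: total / (n_classes * counts[y]) for i, y in labeled}
    return labeled, unlabeled, weights


# Toy data: three samples of family "A", two of family "B",
# one of which has a suspiciously high loss.
labels = ["A", "A", "A", "B", "B"]
losses = [0.1, 0.2, 0.1, 0.3, 2.5]
labeled, unlabeled, weights = split_and_reweight(labels, losses, 1.0)
```

In this toy run the high-loss sample is moved to the unlabeled pool, and the single remaining "B" sample receives a larger weight than each "A" sample, so both families carry equal total weight. A semi-supervised learner would then consume both pools; that step is omitted here.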
Original language | English (US) |
---|---|
Title of host publication | Proceedings - 44th IEEE Symposium on Security and Privacy, SP 2023 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 2602-2619 |
Number of pages | 18 |
ISBN (Electronic) | 9781665493369 |
State | Published - 2023 |
Event | 44th IEEE Symposium on Security and Privacy, SP 2023 - Hybrid, San Francisco, United States. Duration: May 22, 2023 → May 25, 2023 |
Publication series
Name | Proceedings - IEEE Symposium on Security and Privacy |
---|---|
Volume | 2023-May |
ISSN (Print) | 1081-6011 |
Conference
Conference | 44th IEEE Symposium on Security and Privacy, SP 2023 |
---|---|
Country/Territory | United States |
City | Hybrid, San Francisco |
Period | 5/22/23 → 5/25/23 |
Funding
We thank the anonymous reviewers for their helpful comments. This project was supported in part by NSF grants 2225234 and 2225225 and by the Amazon Research Award.
ASJC Scopus subject areas
- Safety, Risk, Reliability and Quality
- Software
- Computer Networks and Communications