From data breaches to ransomware infections, increasingly frequent and sophisticated attacks pose serious challenges to today's defense mechanisms. Machine learning is an attractive solution for its ability to identify hidden patterns that cannot be easily expressed by rules or signatures. Unfortunately, most learning-based security systems are trained under a "closed-world" assumption, expecting the testing data to roughly match the training data. When a model is deployed in the "open-world" environment, however, dynamic changes in both benign players and malicious attackers can easily shift the testing distribution, leading to concept drift and serious model failures.

Addressing concept drift requires labeling a large number of new samples for model re-training, a process that is extremely expensive in the security domain. Unlike labeling images or text documents (which can be effectively crowdsourced), labeling malware, for example, requires years of security training and practice. This high expertise requirement makes it difficult to scale up labeling efforts.

In this proposal, we ask one critical question: assuming we will never have representative labels, what can we do to significantly improve the adaptability and resilience of learning-based defenses with extremely limited labeling capacity? While the problem may look challenging, recent progress in self-supervised learning has shown great promise in performing complex learning tasks with limited labels. Self-supervision designs pretext learning tasks to better utilize unlabeled data, obtaining supervision from the data itself. While most existing efforts focus on computer vision and natural language processing, we believe some of the fundamental ideas can significantly benefit the security community in addressing the concept drift problem. Our preliminary analysis has also returned encouraging results.
In this proposal, we want to combine the idea of self-supervision with the domain-specific insights in malware detection to build new solutions to combat concept drift.
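To make the pretext-task idea concrete, the sketch below shows one generic form of self-supervision: masked-feature prediction, where a model learns to predict a hidden feature of each unlabeled sample from the remaining features, so the supervisory signal comes from the data itself. This is purely an illustrative example with synthetic data, not the proposal's actual method; all variable names and parameters are assumptions.

```python
import numpy as np

# Illustrative only: synthetic "unlabeled samples" standing in for
# feature vectors extracted from binaries (no real malware features).
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))                       # shared latent factor
X = base @ rng.normal(size=(1, 8)) + 0.1 * rng.normal(size=(200, 8))

# Pretext task: hide one feature column and predict it from the rest.
mask_idx = 3
inputs = np.delete(X, mask_idx, axis=1)
targets = X[:, mask_idx]

# A linear predictor fit by least squares -- the "labels" (the masked
# feature values) are derived from the data itself, so no human
# annotation is needed.
design = np.c_[inputs, np.ones(len(inputs))]           # add bias term
w, *_ = np.linalg.lstsq(design, targets, rcond=None)
preds = design @ w
mse = float(np.mean((preds - targets) ** 2))
print(f"pretext-task MSE on masked feature: {mse:.4f}")
```

In a realistic pipeline, the representation learned by solving such pretext tasks on abundant unlabeled samples would then be fine-tuned with the few available labels, which is the general recipe this proposal adapts to the security setting.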
Effective start/end date: 11/15/21 → 9/30/24
- National Science Foundation (CNS-2225225)