TY - GEN
T1 - Ensemble models for data-driven prediction of malware infections
AU - Kang, Chanhyun
AU - Park, Noseong
AU - Prakash, B. Aditya
AU - Serra, Edoardo
AU - Subrahmanian, V. S.
N1 - Funding Information:
This paper is based on work partially supported by the Maryland Procurement Office under Contract No. H98230-14-C-0137, by the NEH under Grant No. HG-229283-15, by ORNL under Task Order 4000143330, by the VT College of Engineering, and by a Facebook faculty gift. We thank Symantec for providing access to the WINE platform. Other researchers may reproduce and verify our results by analyzing the reference data set we recorded in WINE (WINE-2013-001) after signing a research agreement with Symantec.
Publisher Copyright:
© 2016 ACM.
PY - 2016/2/8
Y1 - 2016/2/8
AB - Given a history of detected malware attacks, can we predict the number of malware infections in a country? Can we do this for different malware and countries? This is an important question with numerous implications for cyber security, from designing better anti-virus software, to designing and implementing targeted patches, to more accurately measuring the economic impact of breaches. The problem is compounded by the fact that, as external observers, we can only detect a fraction of actual malware infections. In this paper we address this problem using data from Symantec covering more than 1.4 million hosts and 50 malware spread across 2 years and multiple countries. First, we carefully design domain-based features from both the malware and the machine-host perspectives. Second, inspired by epidemiological and information diffusion models, we design a novel temporal non-linear model for malware spread and detection. Finally, we present ESM, an ensemble-based approach that combines both of these methods to construct a more accurate algorithm. Using extensive experiments spanning multiple malware and countries, we show that ESM can effectively predict malware infection ratios over time (both the actual number and the trend) up to 4 times better than several baselines on various metrics. Furthermore, ESM's performance is stable and robust even when the number of detected infections is low.
KW - Anti-virus
KW - Cyber security
KW - Information diffusion
KW - Malware attacks
KW - Prediction model
UR - http://www.scopus.com/inward/record.url?scp=84964329255&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84964329255&partnerID=8YFLogxK
U2 - 10.1145/2835776.2835834
DO - 10.1145/2835776.2835834
M3 - Conference contribution
AN - SCOPUS:84964329255
T3 - WSDM 2016 - Proceedings of the 9th ACM International Conference on Web Search and Data Mining
SP - 583
EP - 592
BT - WSDM 2016 - Proceedings of the 9th ACM International Conference on Web Search and Data Mining
PB - Association for Computing Machinery, Inc
T2 - 9th ACM International Conference on Web Search and Data Mining, WSDM 2016
Y2 - 22 February 2016 through 25 February 2016
ER -