TY - JOUR
T1 - Defending unknown attacks on cyber-physical systems by semi-supervised approach and available unlabeled data
AU - Huda, Shamsul
AU - Miah, Suruz
AU - Mehedi Hassan, Mohammad
AU - Islam, Rafiqul
AU - Yearwood, John
AU - Alrubaian, Majed
AU - Almogren, Ahmad
N1 - Includes bibliographical references.
PY - 2017/2/10
Y1 - 2017/2/10
N2 - Cyber-physical systems (CPS) are increasingly used in modern industrial systems. These systems currently face a significant threat from malicious software that exploits the fact that the software of such industrial systems is integrated with hardware and network systems. Malicious code dynamically and continuously changes its internal structure and attack patterns using obfuscation techniques, such as polymorphism and metamorphism, in order to bypass and hide from conventional malware detection engines. Countering this requires continuously updating the database of the malware detection engine, which in turn requires periodic manual effort from experts and could limit the real-time protection of CPS. It also makes it challenging to preserve the availability and integrity of the services provided by CPS against malicious code, and there is a demand for specialized malware detection techniques for CPS. In this paper, we propose a semi-supervised approach that automatically integrates knowledge about unknown malware from already available, cheap unlabeled data into the detection system. The novelty of the proposed approach is that it does not require expert effort to update the database of the detection engine. Instead, the dynamic changes in malware attack patterns are extracted by unsupervised clustering from already available unlabeled data. The extracted geometric information about the intrinsic attack characteristics of the clusters is then integrated into the classification systems of the detection engine, which updates the detection system automatically. The proposed approach uses global K-means clustering with term frequency (TF), inverse document frequency (IDF), and cosine similarity as a distance measure to extract the cluster information and add it to a support vector machine (SVM) classification system. The proposed approach has been tested extensively on a real malware data set for both static and dynamic malware features. The experimental results show that the proposed semi-supervised approach achieves higher accuracy than the existing supervised approaches for all classifiers. We note that the static feature-based semi-supervised approach can improve detection accuracy significantly. When the proposed semi-supervised approach is applied with the run-time characteristics of dynamic feature analysis, the combined effect of dynamic analysis and the proposed approach further increases the detection accuracy of all classifiers, up to 100% for the SVM and random forest classifiers, thus exceeding the existing supervised approaches with similar features.
KW - API parameters
KW - Dynamic analysis
KW - Malware behavior selection
KW - String feature
UR - http://www.scopus.com/inward/record.url?scp=84996526106&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84996526106&partnerID=8YFLogxK
U2 - 10.1016/j.ins.2016.09.041
DO - 10.1016/j.ins.2016.09.041
M3 - Article
AN - SCOPUS:84996526106
SN - 0020-0255
VL - 379
SP - 211
EP - 228
JO - Information Sciences
JF - Information Sciences
ER -