An automated classification system based on the strings of trojan and virus families

Ronghua Tian, Lynn Batten, MD Rafiqul Islam, Steve Versteeg

Research output: Book chapter/Published conference paper › Conference paper › peer-review

87 Citations (Scopus)


Classifying malware correctly is an important research issue for anti-malware software producers. This paper presents an effective and efficient malware classification technique based on string information using several well-known classification algorithms. In our testing we extracted the printable strings from 1367 samples, including unpacked trojans and viruses and clean files. Information describing the printable strings contained in each sample was input to various classification algorithms, including tree-based classifiers, a nearest neighbour algorithm, statistical algorithms and AdaBoost. Using k-fold cross validation on the unpacked malware and clean files, we achieved a classification accuracy of 97%. Our results reveal that strings from library code (rather than malicious code itself) can be utilised to distinguish different malware families.
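The pipeline summarised in the abstract (extract printable strings, describe each sample by its strings, classify, and evaluate with k-fold cross validation) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The minimum string length of 4, the representation of a sample as a set of strings, the Jaccard distance, and every function name below are assumptions made for illustration; the abstract specifies only that printable strings were extracted and that a nearest neighbour algorithm was among the classifiers tested.

```python
import re
from typing import Sequence, Set, Tuple

Sample = Tuple[Set[str], str]  # (set of printable strings, family label)


def extract_strings(data: bytes, min_len: int = 4) -> Set[str]:
    """Return runs of at least min_len printable ASCII bytes (assumed threshold)."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return {m.group().decode("ascii") for m in pattern.finditer(data)}


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """Distance between two string sets; an assumed metric, not from the paper."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def nearest_neighbour(train: Sequence[Sample], query: Set[str]) -> str:
    """Label of the training sample whose string set is closest to the query."""
    return min(train, key=lambda item: jaccard_distance(item[0], query))[1]


def k_fold_accuracy(samples: Sequence[Sample], k: int) -> float:
    """Hold out every k-th sample in turn and classify it from the rest."""
    correct = 0
    for fold in range(k):
        test = [s for i, s in enumerate(samples) if i % k == fold]
        train = [s for i, s in enumerate(samples) if i % k != fold]
        for features, label in test:
            if nearest_neighbour(train, features) == label:
                correct += 1
    return correct / len(samples)
```

In this sketch, library names such as imported DLLs and API functions end up in the extracted string sets, which is the kind of library-code string the abstract reports as useful for separating families. A real evaluation would shuffle and stratify the folds rather than split by index.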
Original language: English
Title of host publication: MALWARE 2009
Subtitle of host publication: 4th proceedings
Place of Publication: United States
Publisher: Institute of Electrical and Electronics Engineers
Number of pages: 8
ISBN (Electronic): 9781424457878
Publication status: Published - 2009
Event: International Conference on Malicious and Unwanted Software (MALWARE) - Montreal, QC, Canada
Duration: 13 Oct 2009 to 14 Oct 2009


Conference: International Conference on Malicious and Unwanted Software (MALWARE)

