Context-Aware Natural Language Processing for Malware Detection
Publication Date
8-27-2025
Document Type
Conference Proceeding
Publication Title
2025 Silicon Valley Cybersecurity Conference, SVCC 2025
DOI
10.1109/SVCC65277.2025.11133638
Abstract
As malware continues to evolve and cyberattacks become increasingly prevalent, effective malware classification techniques are critical for detecting and preventing attacks. We approach malware classification from a natural language processing perspective and explore how different tokenization techniques applied to malware opcode features can improve classification accuracy and help counter malware obfuscation and evolution with a deep learning approach, a question that existing research has not addressed. We bridge this gap with extensive hyperparameter tuning experiments that examine the effects of five common tokenization methods (White Space Separation, Top Single Words, Byte-Pair Encoding, WordPiece, and Unigram) and a novel method that we introduce, called Top Word Pairs. We then transform the tokenized opcode sequences into feature vectors with four embedding methods: Word2Vec, HMM2Vec, BERT, and ELMo. Finally, we classify malware using three classification techniques: Support Vector Machines, Random Forest, and Multi-Layer Perceptron. Our results show that choosing a tokenization technique suited to the embedding and classification models in use can improve malware classification accuracy by up to 2% on our dataset.
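The tokenization stage named in the abstract can be illustrated with a minimal sketch. The abstract does not define Top Single Words or Top Word Pairs, so the code below rests on assumptions: it reads Top Single Words as keeping the k most frequent opcodes and replacing the rest with a placeholder, and Top Word Pairs as merging the k most frequent adjacent opcode pairs into single tokens. The function names, the `<unk>` placeholder, the `_` separator, and the sample opcode strings are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def whitespace_tokenize(opcode_text):
    """White Space Separation: one token per whitespace-delimited opcode."""
    return opcode_text.split()


def top_single_words(sequences, k, unk="<unk>"):
    """Assumed reading: keep the k most frequent opcodes, mask the rest."""
    counts = Counter(tok for seq in sequences for tok in seq)
    keep = {tok for tok, _ in counts.most_common(k)}
    return [[tok if tok in keep else unk for tok in seq] for seq in sequences]


def top_word_pairs(sequences, k, sep="_"):
    """Assumed reading: merge the k most frequent adjacent opcode pairs."""
    counts = Counter(pair for seq in sequences for pair in zip(seq, seq[1:]))
    keep = {pair for pair, _ in counts.most_common(k)}
    merged = []
    for seq in sequences:
        out, i = [], 0
        while i < len(seq):
            # Greedy left-to-right merge of a frequent pair into one token.
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) in keep:
                out.append(seq[i] + sep + seq[i + 1])
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged


# Hypothetical opcode strings, one per sample.
samples = ["mov push call mov push call ret", "mov push xor mov push ret"]
tokens = [whitespace_tokenize(s) for s in samples]
single = top_single_words(tokens, k=2)
pairs = top_word_pairs(tokens, k=1)
```

In a full pipeline of the kind the abstract describes, token lists like these would then be passed to an embedding model and a classifier; that part is omitted here because it depends on model choices and hyperparameters the abstract does not specify.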
Funding Number
2244597
Funding Sponsor
National Science Foundation
Keywords
Computer Security, Embedding Vectors, Machine Learning, Malware Classification, Malware Detection, Natural Language Processing, Tokenization
Department
Computer Science
Recommended Citation
Helen Liu, Summer Mccune, Quang Duy Tran, Fabio Di Troia, and Younghee Park. "Context-Aware Natural Language Processing for Malware Detection." 2025 Silicon Valley Cybersecurity Conference, SVCC 2025 (2025). https://doi.org/10.1109/SVCC65277.2025.11133638