Context-Aware Natural Language Processing for Malware Detection

Publication Date

8-27-2025

Document Type

Conference Proceeding

Publication Title

2025 Silicon Valley Cybersecurity Conference (SVCC 2025)

DOI

10.1109/SVCC65277.2025.11133638

Abstract

As malware continues to evolve and cyber attacks become increasingly prevalent, effective malware classification techniques are critical for detecting and preventing such attacks. We approach malware classification from a natural language processing perspective and explore how different tokenization techniques applied to malware opcode features can improve classification accuracy and counter malware obfuscation and evolution when combined with a deep learning approach, a question that existing research has not addressed. We bridge this gap by conducting extensive hyperparameter tuning experiments that examine the effects of five common tokenization methods (White Space Separation, Top Single Words, Byte-Pair Encoding, WordPiece, and Unigram) and a novel method that we introduce, called Top Word Pairs. We then transform the tokenized opcode sequences into feature vectors using four embedding methods: Word2Vec, HMM2Vec, BERT, and ELMo. Finally, we classify malware using three classification techniques: Support Vector Machines, Random Forest, and Multi-Layer Perceptron. Our results show that choosing the right tokenization technique for the embedding and classification models in use can improve malware classification accuracy by up to 2% on our dataset.
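
To make the tokenization stage concrete, the following is a minimal illustrative sketch, not the paper's implementation. The abstract does not define its tokenizers, so every function here is an assumption: `top_k_vocab` is one plausible reading of a frequency-based vocabulary such as Top Single Words, and `adjacent_pairs` is a hypothetical pair-based scheme that should not be taken as the authors' Top Word Pairs method. The opcode samples are invented for illustration.

```python
from collections import Counter


def whitespace_tokenize(opcode_text):
    """Split a raw opcode listing into tokens on whitespace."""
    return opcode_text.split()


def top_k_vocab(token_lists, k):
    """Return the k most frequent tokens across the corpus.

    Assumption: an illustrative reading of a frequency-based vocabulary,
    not a definition taken from the paper.
    """
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {tok for tok, _ in counts.most_common(k)}


def restrict_to_vocab(tokens, vocab, unk="<unk>"):
    """Replace every out-of-vocabulary token with a placeholder."""
    return [tok if tok in vocab else unk for tok in tokens]


def adjacent_pairs(tokens):
    """Join each pair of neighbouring opcodes into one token.

    Assumption: a hypothetical pair-based scheme for illustration only.
    """
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


# Invented opcode sequences standing in for disassembled samples.
samples = [
    "mov push call mov pop ret",
    "mov mov xor call jmp ret",
]
token_lists = [whitespace_tokenize(s) for s in samples]
vocab = top_k_vocab(token_lists, k=3)
restricted = [restrict_to_vocab(toks, vocab) for toks in token_lists]
paired = [adjacent_pairs(toks) for toks in token_lists]
```

In a full pipeline of the kind the abstract describes, token sequences like `restricted` or `paired` would then be passed to an embedding model and the resulting feature vectors to a classifier; those stages are omitted here.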

Funding Number

2244597

Funding Sponsor

National Science Foundation

Keywords

Computer Security, Embedding Vectors, Machine Learning, Malware Classification, Malware Detection, Natural Language Processing, Tokenization

Department

Computer Science