Publication Date

8-27-2025

Document Type

Conference Proceeding

Publication Title

2025 Silicon Valley Cybersecurity Conference, SVCC 2025

DOI

10.1109/SVCC65277.2025.11133638

Abstract

As malware continues to evolve and cyber attacks become increasingly prevalent, it is critical to develop effective malware classification techniques for the detection and prevention of such attacks. In our research, we approach malware classification from a natural language processing perspective, exploring how various tokenization techniques applied to malware opcode features can enhance classification accuracy and combat malware obfuscation and evolution techniques with a deep learning approach, a question that existing research has not addressed. We bridge this gap by conducting extensive hyperparameter tuning experiments that examine the effects of five common tokenization methods (White Space Separation, Top Single Words, Byte-Pair Encoding, WordPiece, and Unigram) and a novel method that we introduce, called Top Word Pairs. We then transform the tokenized opcode sequences into feature vectors with four embedding methods: Word2Vec, HMM2Vec, BERT, and ELMo. Finally, we classify malware using three classification techniques: Support Vector Machines, Random Forest, and Multi-Layer Perceptron. Our results show that choosing the tokenization technique to match the embedding and classification models can improve malware classification accuracy by up to 2% on our dataset.
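To make the first stage of this pipeline concrete, the following is a minimal sketch of two of the tokenization methods named above, White Space Separation and Top Single Words, applied to opcode sequences. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `<other>` placeholder token, the vocabulary size `k`, and the sample opcode strings are all hypothetical, and the abstract does not specify how Top Single Words treats opcodes outside the retained vocabulary.

```python
from collections import Counter


def whitespace_tokenize(opcode_text):
    """Split a raw opcode listing into tokens on whitespace."""
    return opcode_text.split()


def build_top_k_vocab(sequences, k):
    """Return the set of the k most frequent tokens across all sequences."""
    counts = Counter(tok for seq in sequences for tok in seq)
    return {tok for tok, _ in counts.most_common(k)}


def top_single_words(sequence, vocab, other="<other>"):
    """Keep tokens found in vocab; map every other token to a placeholder.

    Assumption: rare opcodes are collapsed into one shared token. The
    abstract does not say how the paper handles them.
    """
    return [tok if tok in vocab else other for tok in sequence]


# Hypothetical opcode listings, one string per malware sample.
samples = ["mov push mov call", "mov push pop ret"]

tokenized = [whitespace_tokenize(s) for s in samples]
vocab = build_top_k_vocab(tokenized, k=2)
reduced = [top_single_words(seq, vocab) for seq in tokenized]
```

In a full pipeline of the kind the abstract describes, sequences such as `reduced` would next be passed to an embedding model and the resulting feature vectors to a classifier; those stages are omitted here.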

Funding Number

2244597

Funding Sponsor

National Science Foundation

Keywords

Computer Security, Embedding Vectors, Machine Learning, Malware Classification, Malware Detection, Natural Language Processing, Tokenization

Comments

© 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Department

Computer Science