Context-Aware Natural Language Processing for Malware Detection
Abstract
As malware continues to evolve and cyber attacks become increasingly prevalent, it is critical to develop effective malware classification techniques for detecting and preventing such malicious attacks. In our research, we approach malware classification from a natural language processing perspective, exploring how different tokenization techniques applied to malware opcode features can improve classification accuracy and help counter malware obfuscation and evolution, an angle that existing deep learning research on malware has not addressed. We bridge this gap by conducting extensive hyperparameter tuning experiments that examine the effects of five common tokenization methods, namely White Space Separation, Top Single Words, Byte-Pair Encoding, WordPiece, and Unigram, as well as a novel method that we introduce, Top Word Pairs. We then transform the tokenized opcode sequences into feature vectors using four embedding methods: Word2Vec, HMM2Vec, BERT, and ELMo. Finally, we classify malware with three classification techniques: Support Vector Machines, Random Forest, and Multi-Layer Perceptron. Our results show that choosing the tokenization technique best suited to the embedding and classification models used can improve malware classification accuracy by up to 2% on our dataset.
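To make the tokenize-embed-classify pipeline summarized above concrete, the following is a minimal sketch, not the paper's implementation: it uses hypothetical toy opcode sequences and labels, White Space Separation as the tokenizer, Word2Vec as the embedding, and a linear SVM standing in for the SVM/Random Forest/MLP comparison; all hyperparameters are illustrative assumptions.

```python
# Minimal sketch of the pipeline described in the abstract. The opcode
# sequences and labels below are hypothetical placeholders, not the
# paper's dataset, and all hyperparameters are illustrative only.
import numpy as np
from gensim.models import Word2Vec
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical opcode sequences (one string per sample) and family labels.
samples = [
    "mov push call add mov pop ret",
    "push mov call sub jmp pop ret",
    "xor mov cmp jne call mov ret",
    "mov xor cmp je call pop ret",
] * 25
labels = [0, 0, 1, 1] * 25

# Tokenization: plain white-space separation (one of the five methods compared).
tokenized = [s.split() for s in samples]

# Embedding: train Word2Vec on the opcode tokens, then represent each sample
# as the mean of its token vectors.
w2v = Word2Vec(sentences=tokenized, vector_size=32, window=3, min_count=1, seed=1)
X = np.array([np.mean([w2v.wv[t] for t in seq], axis=0) for seq in tokenized])
y = np.array(labels)

# Classification: a linear SVM stands in for the classifier comparison.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)
clf = SVC(kernel="linear").fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```

Swapping the tokenizer (for example, Byte-Pair Encoding or Top Word Pairs), the embedding (HMM2Vec, BERT, ELMo), or the classifier (Random Forest, Multi-Layer Perceptron) at the marked stages is what the paper's experiments vary.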