Publication Date

Fall 2023

Degree Type

Master's Project

Degree Name

Master of Science in Computer Science (MSCS)


Department

Computer Science

First Advisor

Fabio Di Troia

Second Advisor

Faranak Abri

Third Advisor

Navrati Saxena


Keywords

N-grams, Opcodes, Static Analysis, Word2Vec, Doc2Vec, FastText, SVM, RF, kNN, CNN


Abstract

Malware is a serious risk to any software application, whether it runs standalone or over a network. Protecting computer systems therefore requires detecting and classifying malware effectively. Modern malware classification research focuses on machine learning and deep learning techniques to identify advanced malicious software. This project explores malware classification by combining two robust methods: n-grams and word embedding. By extracting opcode n-grams, we exploit the sequential nature of malware execution to identify local patterns within the malware executable.
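As a rough illustration of opcode n-gram extraction, the following minimal Python sketch slides a window of length n over an opcode sequence and counts the resulting n-grams. The function name `opcode_ngrams` and the sample opcode list are hypothetical and chosen only for illustration; they are not taken from the project itself.

```python
from collections import Counter
from typing import List, Tuple


def opcode_ngrams(opcodes: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous run of n opcodes, in original order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence shorter than n yields an empty list.
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


# Hypothetical opcode sequence, as might be obtained by disassembling one executable.
sample = ["push", "mov", "call", "push", "mov", "call", "ret"]

bigrams = opcode_ngrams(sample, 2)
counts = Counter(bigrams)  # frequency of each local opcode pattern
```

Because each n-gram keeps its opcodes in order, repeated short patterns such as `("push", "mov")` surface as high counts, which is the kind of local structure the abstract refers to.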

To improve our feature representation, we use word embedding methods such as Word2Vec, Doc2Vec, and FastText to produce dense vector representations of these opcode n-grams. Our experimental framework combines these feature extraction techniques with a variety of classifiers: Support Vector Machines (SVM), Random Forest (RF), k-Nearest Neighbors (k-NN), and Convolutional Neural Networks (CNN). These combinations let us investigate the advantages and disadvantages of each classifier for malware classification, and comparing the classifiers shows how well each works with different feature representations. Using this approach, we perform multi-class classification experiments. The findings indicate that combining opcode n-grams with word embedding is a promising approach to detecting and classifying real-world malware.
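One way to picture how embedded n-grams feed a classifier is the sketch below. It assumes an embedding table is already available (in practice a model such as Word2Vec or FastText would learn it; here it is a small hand-written dictionary), represents each sample as the mean of its n-gram vectors, and classifies with a plain k-NN majority vote. All names, vectors, and family labels are hypothetical, and this is not the project's actual pipeline.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def embed_sample(tokens: Sequence[str], table: Dict[str, Vector], dim: int) -> Vector:
    """Average the vectors of the tokens found in the table (zero vector if none)."""
    known = [table[t] for t in tokens if t in table]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def knn_predict(query: Vector, train: List[Tuple[Vector, str]], k: int) -> str:
    """Return the majority label among the k training vectors closest to the query."""
    nearest = sorted(train, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-dimensional embeddings for opcode bigram tokens.
table = {
    "push_mov": [1.0, 0.0],
    "mov_call": [0.9, 0.1],
    "xor_jmp": [0.0, 1.0],
    "jmp_xor": [0.1, 0.9],
}

# Hypothetical labeled samples, each a list of bigram tokens.
train = [
    (embed_sample(["push_mov", "push_mov"], table, 2), "family_a"),
    (embed_sample(["mov_call"], table, 2), "family_a"),
    (embed_sample(["xor_jmp"], table, 2), "family_b"),
    (embed_sample(["jmp_xor", "xor_jmp"], table, 2), "family_b"),
]

query = embed_sample(["push_mov", "mov_call", "unseen_token"], table, 2)
prediction = knn_predict(query, train, k=3)
```

Averaging is only one of several ways to turn token vectors into a fixed-length sample vector; Doc2Vec, for instance, learns a per-document vector directly, and a CNN can consume the token vectors as a sequence.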

Available for download on Friday, December 20, 2024