Master of Science (MS)
Fabio Di Troia
Word embeddings are often used in natural language processing as a means to quantify relationships between words. More generally, these same word embedding techniques can be used to quantify relationships between features. In this paper, we conduct a series of experiments designed to determine the effectiveness of word embedding in the context of malware classification. First, we conduct experiments in which hidden Markov models (HMMs) are applied directly to opcode sequences. These results establish a baseline for comparison with our subsequent word embedding experiments. We then experiment with word embedding vectors derived from HMMs, a technique that we refer to as HMM2Vec. In another set of experiments, we generate vector embeddings based on principal component analysis, which we refer to as PCA2Vec. For a third set of word embedding experiments, we consider the well-known neural-network-based technique Word2Vec. In each of these word embedding experiments, we derive feature embeddings based on opcode sequences for malware samples from a variety of different families. We show that, in most cases, feature embeddings yield improved classification accuracy compared to our baseline HMM experiments. These results provide strong evidence that word embedding techniques can play a useful role in feature engineering within the field of malware analysis.
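To make the idea of embedding opcodes as vectors concrete, the following is a minimal sketch and not the method used in the project. It is not HMM2Vec, PCA2Vec, or Word2Vec; it swaps in a simpler count-based stand-in in which each opcode is represented by how often every other opcode appears near it. The function names, the window size, and the toy opcode sequence are all assumptions made for illustration.

```python
from math import sqrt


def cooccurrence_embeddings(opcodes, window=1):
    """Return (vocab, vectors): each opcode maps to a vector of neighbor counts.

    Hypothetical helper for illustration only; a count-based stand-in,
    not one of the embedding techniques named in the abstract.
    """
    vocab = sorted(set(opcodes))
    index = {op: i for i, op in enumerate(vocab)}
    vectors = {op: [0] * len(vocab) for op in vocab}
    for pos, op in enumerate(opcodes):
        start = max(0, pos - window)
        stop = min(len(opcodes), pos + window + 1)
        for other in range(start, stop):
            if other != pos:
                # Count the neighboring opcode in this opcode's vector.
                vectors[op][index[opcodes[other]]] += 1
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy opcode sequence, made up for this example (not data from the project).
sample = ["push", "mov", "call", "pop", "push", "mov", "call", "ret"]
vocab, vectors = cooccurrence_embeddings(sample, window=1)

for op in vocab:
    print(op, vectors[op])
print("similarity(push, call) =", round(cosine(vectors["push"], vectors["call"]), 3))
```

In this toy sequence, opcodes that occur in similar contexts (here `push` and `call`, both adjacent to `mov` and `pop`) receive similar vectors. Vectors of this kind, rather than the raw opcode sequence, are what would be passed to a classifier as features.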
Chandak, Aniket, "Word Embedding Techniques for Malware Classification" (2020). Master's Projects. 926.
Available for download on Thursday, May 20, 2021