Publication Date
Spring 2020
Degree Type
Master's Project
Degree Name
Master of Science (MS)
Department
Computer Science
First Advisor
Mark Stamp
Second Advisor
Wendy Lee
Third Advisor
Fabio Di Troia
Keywords
Malware classification, HMM2Vec, Word2Vec, PCA2Vec
Abstract
Word embeddings are often used in natural language processing as a means to quantify relationships between words. More generally, these same word embedding techniques can be used to quantify relationships between features. In this paper, we conduct a series of experiments that are designed to determine the effectiveness of word embedding in the context of malware classification. First, we conduct experiments where hidden Markov models (HMM) are directly applied to opcode sequences. These results serve to establish a baseline for comparison with our subsequent word embedding experiments. We then experiment with word embedding vectors derived from HMMs— a technique that we refer to as HMM2Vec. In another set of experiments, we generate vector embeddings based on principal component analysis, which we refer to as PCA2Vec. And, for a third set of word embedding experiments, we consider the well- known neural network based technique, Word2Vec. In each of these word embedding experiments, we derive feature embeddings based on opcode sequences for malware samples from a variety of different families. We show that in most cases, we obtain improved classification accuracy using feature embeddings, as compared to our baseline HMM experiments. These results provide strong evidence that word embedding techniques can play a useful role in feature engineering within the field of malware analysis.
Recommended Citation
Chandak, Aniket, "Word Embedding Techniques for Malware Classification" (2020). Master's Projects. 926.
DOI: https://doi.org/10.31979/etd.yhdp-898b
https://scholarworks.sjsu.edu/etd_projects/926