Publication Date

Spring 2020

Degree Type

Master's Project

Degree Name

Master of Science (MS)

Department

Computer Science

First Advisor

Mark Stamp

Second Advisor

Wendy Lee

Third Advisor

Fabio Di Troia

Keywords

Malware classification, HMM2Vec, Word2Vec, PCA2Vec

Abstract

Word embeddings are often used in natural language processing as a means to quantify relationships between words. More generally, these same word embedding techniques can be used to quantify relationships between features. In this paper, we conduct a series of experiments designed to determine the effectiveness of word embedding in the context of malware classification. First, we conduct experiments where hidden Markov models (HMMs) are directly applied to opcode sequences. These results serve to establish a baseline for comparison with our subsequent word embedding experiments. We then experiment with word embedding vectors derived from HMMs, a technique that we refer to as HMM2Vec. In another set of experiments, we generate vector embeddings based on principal component analysis (PCA), which we refer to as PCA2Vec. For a third set of word embedding experiments, we consider the well-known neural-network-based technique Word2Vec. In each of these word embedding experiments, we derive feature embeddings based on opcode sequences for malware samples from a variety of different families. We show that in most cases, we obtain improved classification accuracy using feature embeddings, as compared to our baseline HMM experiments. These results provide strong evidence that word embedding techniques can play a useful role in feature engineering within the field of malware analysis.
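
To make the idea of embedding opcodes as vectors concrete, the following is a minimal sketch, not the method used in this project. It swaps in a plain co-occurrence count embedding as a stand-in for HMM2Vec, PCA2Vec, and Word2Vec. The function names, the window size, and the toy opcode sequences are all assumptions made for illustration.

```python
from math import sqrt


def cooccurrence_embeddings(sequences, window=2):
    """Map each opcode to a vector counting its neighbors within `window`.

    Illustrative stand-in only: a raw count-based embedding, not HMM2Vec,
    PCA2Vec, or Word2Vec as studied in the project.
    """
    vocab = sorted({op for seq in sequences for op in seq})
    index = {op: i for i, op in enumerate(vocab)}
    vectors = {op: [0] * len(vocab) for op in vocab}
    for seq in sequences:
        for i, op in enumerate(seq):
            lo = max(0, i - window)
            hi = min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    # Count seq[j] as a context opcode of seq[i].
                    vectors[op][index[seq[j]]] += 1
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Hypothetical opcode sequences, one list per sample.
    samples = [["mov", "push", "call"], ["mov", "pop", "call"]]
    vocab, vectors = cooccurrence_embeddings(samples, window=1)
    print(vocab)
    print(cosine(vectors["push"], vectors["pop"]))
```

Opcodes that appear in similar contexts receive similar vectors, which is the property a downstream classifier would exploit. The techniques named in the abstract derive their vectors differently (from trained HMMs, from PCA, or from a neural network), so this sketch only illustrates the general notion of a feature embedding.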

Recommended Citation

Chandak, Aniket, "Word Embedding Techniques for Malware Classification" (2020). Master's Projects. 926.
DOI: https://doi.org/10.31979/etd.yhdp-898b
https://scholarworks.sjsu.edu/etd_projects/926

Included in

Artificial Intelligence and Robotics Commons, Information Security Commons
