Publication Date

Spring 2023

Degree Type

Master's Project

Degree Name

Master of Science (MS)

Department

Computer Science

First Advisor

Fabio Di Troia

Second Advisor

William Andreopoulos

Third Advisor

Thomas Austin

Keywords

Graph Convolution Network, Graph Attention Network, GraphSAGE, Word2Vec

Abstract

Word embeddings are widely recognized as important in natural language processing for capturing semantic relationships between words. In this study, we conduct experiments to explore the effectiveness of word embedding techniques in classifying malware. Specifically, we evaluate the performance of Graph Neural Networks (GNNs) applied to knowledge graphs constructed from opcode sequences of malware files. In the first set of experiments, a Graph Convolution Network (GCN) is applied to knowledge graphs built with different word embedding techniques: Bag-of-Words, TF-IDF, and Word2Vec. Our results indicate that Word2Vec produces the most effective word embeddings, and it serves as the baseline for comparing three GNN models: GCN, Graph Attention Network (GAT), and GraphSAGE. In the next set of experiments, we generate vector embeddings of various lengths using Word2Vec and construct knowledge graphs with these embeddings as node features. By comparing the performance of the GNN models, we show that larger vector embeddings improve the models’ performance in classifying malware files into their respective families. Our experiments demonstrate that word embedding techniques can enhance feature engineering in malware analysis.
