Master of Science (MS)
Fabio Di Troia
Graph Convolution Network, Graph Attention Network, GraphSAGE, Word2Vec
Word embeddings are widely used in natural language processing to capture semantic relationships between words. In this study, we conduct experiments to explore the effectiveness of word embedding techniques in classifying malware. Specifically, we evaluate the performance of Graph Neural Networks (GNNs) applied to knowledge graphs constructed from opcode sequences of malware files. In the first set of experiments, a Graph Convolution Network (GCN) is applied to knowledge graphs built with different word embedding techniques: Bag-of-words, TF-IDF, and Word2Vec. Our results indicate that Word2Vec produces the most effective word embeddings, so we use it as the baseline for comparing three GNN models: Graph Convolution Network (GCN), Graph Attention Network (GAT), and GraphSAGE. In the next set of experiments, we generate vector embeddings of various lengths using Word2Vec and construct knowledge graphs with these embeddings as node features. By comparing the performance of the GNN models, we show that larger vector embeddings improve the models’ performance in classifying the malware files into their respective families. Our experiments demonstrate that word embedding techniques can enhance feature engineering in malware analysis.
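As a rough illustration of the pipeline described above, the following sketch builds a small graph from an opcode sequence and applies one GCN propagation step to per-opcode feature vectors. It is not the project's implementation. The edge rule (linking consecutive opcodes), the function names `build_opcode_graph` and `gcn_layer`, the toy opcode sequence, and the hand-written two-dimensional vectors standing in for Word2Vec embeddings are all assumptions made for illustration.

```python
import math


def build_opcode_graph(opcodes):
    """Return (nodes, adjacency); consecutive opcodes share an undirected edge.

    Assumption: the abstract does not say how edges are defined, so this
    adjacency-by-succession rule is only one plausible choice.
    """
    nodes = sorted(set(opcodes))
    index = {op: i for i, op in enumerate(nodes)}
    n = len(nodes)
    adj = [[0.0] * n for _ in range(n)]
    for a, b in zip(opcodes, opcodes[1:]):
        i, j = index[a], index[b]
        if i != j:
            adj[i][j] = adj[j][i] = 1.0
    return nodes, adj


def gcn_layer(adj, features, weights):
    """One GCN step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adj)
    in_dim, out_dim = len(weights), len(weights[0])
    # Add self-loops so each node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric degree normalization.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    # Aggregate neighbor features, then apply the linear map and ReLU.
    agg = [[sum(norm[i][k] * features[k][d] for k in range(n))
            for d in range(in_dim)] for i in range(n)]
    return [[max(0.0, sum(agg[i][d] * weights[d][h] for d in range(in_dim)))
             for h in range(out_dim)] for i in range(n)]


# Hypothetical opcode sequence extracted from one file.
opcodes = ["mov", "push", "call", "mov", "add"]
nodes, adj = build_opcode_graph(opcodes)

# Placeholder node features; in the study these would be Word2Vec vectors.
toy_embeddings = {
    "add": [0.1, 0.9],
    "call": [0.8, 0.2],
    "mov": [0.5, 0.5],
    "push": [0.7, 0.1],
}
features = [toy_embeddings[op] for op in nodes]

# Arbitrary fixed weights; a trained model would learn these.
weights = [[1.0, -0.5], [0.5, 1.0]]
hidden = gcn_layer(adj, features, weights)
```

Each row of `hidden` is the updated representation of one opcode node. A classifier would typically stack several such layers and pool the node representations into a single vector per file before predicting the malware family.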
Mananjaya, Manasa, "Malware Classification using Graph Neural Networks" (2023). Master's Projects. 1268.