Author

Atishay Jain

Publication Date

Spring 2025

Degree Type

Master's Project

Degree Name

Master of Science in Computer Science (MSCS)

Department

Computer Science

First Advisor

Fabio Di Troia

Second Advisor

William Andreopoulos

Third Advisor

Sayma Akther

Keywords

Word2vec, DistilBERT, ELMo, fastText, GloVe, WGAN-GP, Diffusion, SMOTE, Random Forest Classifier, Support Vector Classifier, Multilayer Perceptron, t-SNE, Agglomerative Clustering

Abstract

Malware is software designed to damage or disrupt computer systems and harm their users. Detecting malware and classifying it into malware families is a crucial problem for cybersecurity researchers. A major bottleneck in improving these systems is the shortage of good-quality labeled malware data, especially for malware families with few samples. To address this issue, researchers have used generative models to synthesize malware data. Malware embeddings encode patterns within a malware file that can be used to detect and classify malware, and encouraging results have recently been obtained in generating such embeddings with generative models. The experiments presented in this report aim to generate high-quality malware opcode embeddings and then evaluate their quality rigorously, so that the embeddings could be used to train malware detection and classification models.

Available for download on Monday, May 25, 2026
