Publication Date
Spring 2025
Degree Type
Master's Project
Degree Name
Master of Science in Computer Science (MSCS)
Department
Computer Science
First Advisor
Fabio Di Troia
Second Advisor
Navrati Saxena
Third Advisor
Sandeep Gundapu
Keywords
Malware detection, Data augmentation, Generative adversarial networks, Word embeddings, Opcode sequences, Class imbalance, Machine learning.
Abstract
Malware is growing in number and complexity, evading conventional signature- and anomaly-based defenses and worsening the extreme data sparsity and class imbalance problems that machine-learning-based detection faces. Generative models, specifically GANs conditioned on contextual embeddings such as BERT, have proved effective at augmenting training corpora to improve classifier accuracy, but these approaches have largely produced family-specific samples. In this paper, we propose a generalized augmentation scheme for generating robust malware embeddings across various families. We begin by extracting opcode sequences from 13 malware families and encoding them with three embedding methods: CountVectorizer, TF-IDF, and BERT's [CLS] vectors. We then train standard GANs and Wasserstein GANs to generate synthetic embeddings, testing specifically on eight held-out families not observed during GAN training. To validate utility, we create ten augmented training sets with synthetic ratios ranging from 100% down to 10% and compare four classifiers against baselines trained on 100% real data.
Our experiments show that classifiers trained on GAN-generated embeddings match the performance of models trained on real data. Most importantly, the GAN-generated samples generalize across families without overfitting, filling gaps in the data. In future work, we will test stronger embeddings (e.g., GloVe, Word2Vec, ELMo) and newer adversarial frameworks such as WGAN-GP to further enhance malware detection robustness.
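
As a rough illustration of the augmented-training-set step described in the abstract, the following minimal Python sketch mixes real and synthetic samples at ten synthetic ratios from 100% down to 10%. It is not the project's code: the function name `mix_training_set`, the fixed set size, the random sampling strategy, and the toy tagged tuples standing in for embedding vectors are all assumptions made for this example.

```python
import random


def mix_training_set(real, synthetic, synthetic_ratio, size, seed=0):
    """Return `size` samples, a `synthetic_ratio` fraction of them synthetic.

    Hypothetical helper: the abstract does not say how the sets were built.
    """
    if not 0.0 <= synthetic_ratio <= 1.0:
        raise ValueError("synthetic_ratio must be in [0, 1]")
    n_syn = round(size * synthetic_ratio)
    n_real = size - n_syn
    if n_syn > len(synthetic) or n_real > len(real):
        raise ValueError("not enough samples for the requested mix")
    rng = random.Random(seed)
    # Sample without replacement from each pool, then shuffle the result.
    mixed = rng.sample(synthetic, n_syn) + rng.sample(real, n_real)
    rng.shuffle(mixed)
    return mixed


# Toy stand-ins for embedding vectors: (source tag, index) pairs.
real = [("real", i) for i in range(100)]
synthetic = [("syn", i) for i in range(100)]

# Ten ratios from 100% down to 10% synthetic, as listed in the abstract.
ratios = [r / 10 for r in range(10, 0, -1)]
augmented_sets = {
    r: mix_training_set(real, synthetic, r, size=100) for r in ratios
}
```

Each entry of `augmented_sets` could then be passed to a classifier and compared against a baseline trained only on `real`. Holding the total size fixed is one possible design choice; adding synthetic samples on top of the full real set would be another.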
Recommended Citation
Chandana, Phanidhar Sai Sravan, "Synthetic Malware Generation using Generative AI" (2025). Master's Projects. 1551.
DOI: https://doi.org/10.31979/etd.3pj3-qvyk
https://scholarworks.sjsu.edu/etd_projects/1551