Publication Date

Spring 2025

Degree Type

Master's Project

Degree Name

Master of Science in Computer Science (MSCS)

Department

Computer Science

First Advisor

Fabio Di Troia

Second Advisor

Navrati Saxena

Third Advisor

Sandeep Gundapu

Keywords

Malware detection, Data augmentation, Generative adversarial networks, Word embeddings, Opcode sequences, Class imbalance, Machine learning.

Abstract

Malware is growing in both volume and complexity, evading conventional signature- and anomaly-based defenses and worsening the extreme data sparsity and class imbalance that hamper machine-learning-based detection. Generative models, specifically GANs conditioned on contextual embeddings such as BERT, have proved effective at augmenting training corpora to improve classifier accuracy, but these approaches have largely produced family-specific samples. In this paper, we propose a generalized augmentation scheme for generating robust malware embeddings across multiple families. We begin by extracting opcode sequences from 13 malware families and encoding them with three embedding methods: CountVectorizer, TF-IDF, and BERT's [CLS] vectors. We then train standard GANs and Wasserstein GANs to generate synthetic embeddings, specifically testing on eight held-out families not observed during GAN training. To validate utility, we create ten augmented training sets with synthetic ratios ranging from 100% down to 10% and compare four classifiers against baselines trained on 100% real data.

Our experiments show that classifiers trained on GAN-generated embeddings match the performance of models trained on real data. Most importantly, the GAN-generated samples generalize across families without overfitting, filling gaps in the data. In future work, we will test stronger embeddings (e.g., GloVe, Word2Vec, ELMo) and newer adversarial frameworks such as WGAN-GP to further improve the robustness of malware detection.
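
The construction of the ten augmented training sets mentioned in the abstract could be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual code: the function name mix_training_set, the toy placeholder samples, and the choice to keep each mixed set the same size as the real set are all hypothetical, since the abstract does not say how the synthetic ratio is defined.

```python
import random


def mix_training_set(real, synthetic, synthetic_ratio, seed=0):
    """Return a training set of len(real) items in which a fraction
    `synthetic_ratio` is drawn from `synthetic` and the rest from `real`.

    Assumption: the total size is held fixed at len(real); the source
    abstract does not specify this.
    """
    if not 0.0 <= synthetic_ratio <= 1.0:
        raise ValueError("synthetic_ratio must be in [0, 1]")
    total = len(real)
    n_synth = round(total * synthetic_ratio)
    if n_synth > len(synthetic):
        raise ValueError("not enough synthetic samples for this ratio")
    rng = random.Random(seed)
    # Sample without replacement from each pool, then shuffle the mix.
    mixed = rng.sample(synthetic, n_synth) + rng.sample(real, total - n_synth)
    rng.shuffle(mixed)
    return mixed


# Toy placeholders standing in for real and GAN-generated embeddings.
real = [("real", i) for i in range(100)]
synthetic = [("synthetic", i) for i in range(100)]

# Ten synthetic ratios: 1.0, 0.9, ..., 0.1 (100% down to 10%).
ratios = [r / 10 for r in range(10, 0, -1)]
augmented_sets = {ratio: mix_training_set(real, synthetic, ratio) for ratio in ratios}
```

Each entry in augmented_sets would then be used to train a classifier, with the result compared against a classifier trained on the real set alone.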

Available for download on Monday, May 25, 2026
