Embedding-Driven Synthetic Malware Generation with Autoencoders and Cluster-Tangent Diffusion
Abstract
Malware has grown increasingly sophisticated, and zero-day attacks are emerging at an alarming pace. Effective detection and analysis require real malware samples, which are expensive to obtain and demand specialized expertise to extract. Generating high-quality synthetic samples from scarce datasets is therefore an important way to strengthen detection software. This paper presents generation techniques that optimize the embedding space to produce high-quality synthetic samples even from constrained datasets. The dataset consists of 500 Windows malware API call samples, which were processed using embedding and Generative AI (Gen AI) techniques to generate synthetic malware. The paper makes two novel contributions. (1) Autoencoders are integrated with pretrained NLP models (BERT and ELMo) to enhance embedding quality: the autoencoders extract features and learn patterns from the data, yielding higher-quality embeddings than those generated using other techniques alone. (2) Cluster-Tangent Diffusion (CT-Diff) is a novel application of manifold diffusion, which improves on standard diffusion and other Gen AI techniques by generating samples along the distribution of the original data using structured noise instead of standard Gaussian noise. Together, these two contributions consistently outperform previous techniques. Furthermore, the results demonstrate the feasibility of generating reliable synthetic samples even in low-data scenarios.
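As a rough illustration of the structured-noise idea behind CT-Diff, the following Python sketch perturbs each embedding only along the dominant principal directions of its cluster, rather than with isotropic Gaussian noise. It is a minimal sketch under stated assumptions, not the implementation used in this paper: the function name `tangent_noise`, the use of per-cluster PCA as a tangent-space estimate, and the parameters `n_tangent` and `scale` are all hypothetical, and the cluster labels are assumed to come from a separate clustering step (for example k-means) applied beforehand.

```python
import numpy as np


def tangent_noise(embeddings, labels, n_tangent=2, scale=0.1, seed=0):
    """Add noise confined to each cluster's top principal directions.

    Hypothetical helper for illustration only.
    embeddings: (n_samples, dim) float array.
    labels:     (n_samples,) integer cluster assignment per row.
    """
    rng = np.random.default_rng(seed)
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    out = embeddings.copy()

    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        centered = embeddings[idx] - embeddings[idx].mean(axis=0)

        # Rows of vt are the cluster's principal directions, strongest first.
        _, s, vt = np.linalg.svd(centered, full_matrices=False)
        basis = vt[:n_tangent]  # (k, dim), k <= n_tangent

        # Per-direction standard deviation of the cluster along each basis row.
        std = s[:n_tangent] / np.sqrt(max(len(idx) - 1, 1))

        # Gaussian coefficients in the tangent basis, scaled by local spread.
        coeffs = rng.standard_normal((len(idx), basis.shape[0])) * std * scale
        out[idx] = embeddings[idx] + coeffs @ basis

    return out
```

Because the perturbation is a linear combination of the cluster's leading principal directions, the synthetic points stay close to the local shape of the data instead of drifting in arbitrary directions. A full diffusion model would apply such noise over many steps and learn to reverse it; this sketch shows only a single noising step.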