Faculty Research, Scholarly, and Creative Activity

Generating Synthetic Malware Samples Using Generative AI

Tiffany Bao, Department of Computer Science
Kylie Trousil, University of Wisconsin-La Crosse
Quang Duy Tran, San Jose State University
Fabio Di Troia, San Jose State University
Younghee Park, Silicon Valley Research Institute

Publication Date

1-1-2025

Document Type

Article

Publication Title

IEEE Access

Volume

DOI

10.1109/ACCESS.2025.3556704

First Page

59725

Last Page

59736

Abstract

Malware attacks have a significant negative impact on organizations of all sizes. Recently, malware researchers have increasingly turned to machine learning techniques to combat the sophisticated obfuscation methods used in malware. However, collecting a diverse set of malware samples with various obfuscation techniques is challenging and often takes years, especially for newly developed malware. This issue is compounded by a well-known limitation of machine learning models: their poor performance when training data is scarce. In this paper, we propose a new system for generating synthetic malware samples to augment imbalanced malware datasets. Our approach decomposes malware binary samples into mnemonic opcode sequences and leverages natural language processing to extract the contextual meaning behind malware opcode features. These features aid the learning of the generative AI (GenAI) models employed in this paper: Generative Adversarial Networks (GAN), Wasserstein Generative Adversarial Networks with Gradient Penalty (WGAN-GP), and a modified Diffusion model. The experimental results show that augmenting training data with Diffusion-based synthetic data significantly improves classification performance for minority classes by up to 60% on average. This enhancement ultimately leads to an overall malware classification performance of 96%, an 8% improvement. These findings demonstrate the high quality and fidelity of the synthetic data, its robustness, and its potential applications in malware analysis. Specifically, synthetic malware data proves effective in improving the classification of minority malware classes and detection rates, even when the amount of known malware data is very small.
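
As a rough illustration of the preprocessing step the abstract describes, mnemonic opcode sequences can be treated like sentences and mapped to integer tokens before any NLP-style embedding is learned. The sketch below is an assumption for illustration only, not the authors' pipeline: the function names `build_vocab` and `encode`, the `<unk>` token, and the sample opcode lists are all hypothetical.

```python
from collections import Counter

def build_vocab(sequences, min_count=1):
    """Map each opcode mnemonic to an integer id (hypothetical helper).

    Id 0 is reserved for opcodes not seen when the vocabulary was built.
    More frequent opcodes get smaller ids; ties are broken alphabetically.
    """
    counts = Counter(op for seq in sequences for op in seq)
    vocab = {"<unk>": 0}
    for op, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if n >= min_count:
            vocab[op] = len(vocab)
    return vocab

def encode(seq, vocab):
    """Convert one opcode sequence to a list of integer token ids."""
    return [vocab.get(op, 0) for op in seq]

# Toy opcode sequences standing in for disassembled samples (made up).
samples = [
    ["push", "mov", "call", "pop", "ret"],
    ["mov", "mov", "xor", "jmp"],
]
vocab = build_vocab(samples)
encoded = [encode(s, vocab) for s in samples]
```

Integer sequences like `encoded` are the usual input to an embedding layer, which is one plausible way the contextual meaning of opcodes could be learned before training a generative model.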

Keywords

data augmentation, Diffusion, GAN, generative AI, imbalanced datasets, machine learning, malware, natural language processing

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.

Department

Computer Science

Recommended Citation

Tiffany Bao, Kylie Trousil, Quang Duy Tran, Fabio Di Troia, and Younghee Park. "Generating Synthetic Malware Samples Using Generative AI" IEEE Access (2025): 59725-59736. https://doi.org/10.1109/ACCESS.2025.3556704
