Communications in Computer and Information Science
Signature-based and anomaly-based techniques are the fundamental methods for detecting malware. In recent years, however, malware has become more complex and sophisticated, making these techniques less effective. For this reason, researchers have turned to state-of-the-art machine learning techniques to combat threats to information security. Nevertheless, even with machine learning models in place, a shortage of training data still prevents these models from performing at their peak. Generative models have previously been found to be highly effective at generating image-like data that resemble the actual data distribution. In this paper, we apply generative modeling to opcode sequences and aim to generate malware samples by taking advantage of the contextualized embeddings from BERT. We obtained promising results when differentiating between real and generated samples. We observe that the generated malware has characteristics so similar to those of actual malware that classifiers have difficulty distinguishing between the two, falsely identifying the generated malware as actual malware almost of the time.
BERT, GAN, Malware, Malware detection, Word embedding
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Quang Duy Tran and Fabio Di Troia. "Word Embeddings for Fake Malware Generation." Communications in Computer and Information Science (2022): 22-37. https://doi.org/10.1007/978-3-031-24049-2_2