Publication Date

1-1-2022

Document Type

Conference Proceeding

Publication Title

Communications in Computer and Information Science

Volume

1683 CCIS

DOI

10.1007/978-3-031-24049-2_2

First Page

22

Last Page

37

Abstract

Signature-based and anomaly-based techniques are the fundamental methods for detecting malware. In recent years, however, malware has become more complex and sophisticated, making these techniques less effective. For this reason, researchers have turned to state-of-the-art machine learning techniques to combat threats to information security. Despite the adoption of machine learning models, a shortage of training data still prevents these models from performing at their peak. Generative models have previously been found to be highly effective at generating image-like data that resemble the actual data distribution. In this paper, we apply generative modeling to opcode sequences and aim to generate malware samples by taking advantage of contextualized embeddings from BERT. We obtained promising results when differentiating between real and generated samples. We observe that generated malware has characteristics so similar to actual malware that classifiers have difficulty distinguishing between the two: the classifiers falsely identify the generated malware as actual malware almost of the time.

Keywords

BERT, GAN, Malware, Malware detection, Word embedding

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Department

Computer Science
