Publication Date
7-1-2024
Document Type
Article
Publication Title
Applied Sciences (Switzerland)
Volume
14
Issue
13
DOI
10.3390/app14135731
Abstract
Malware classification, the task of sorting malware samples into discrete groups, is a crucial element of robust computer security. Machine learning has recently emerged as a well-suited approach to this problem: models can be trained on malware features such as opcodes and API calls to extract information useful for classification. In natural language processing, word embeddings represent words as vectors in which similar words lie close together, which makes the similarity between words quantifiable. This research presents a series of experiments with hybrid machine learning techniques. We derive word vectors from dynamic API call logs of malware and use them as features for a range of classifiers. We generate embeddings from the API call logs using Hidden Markov Models and Word2Vec, and we also employ BERT and ELMo, which are known for producing contextualized embeddings. The resulting vectors are fed into our classifiers, namely Support Vector Machines (SVMs), Random Forest (RF), k-Nearest Neighbors (kNN), and Convolutional Neural Networks (CNNs). In two distinct sets of experiments, we classify both malware families and malware categories. The results show that API call embeddings are an effective tool for malware classification, particularly for identifying malware families. The best combination was RF with word embeddings generated by Word2Vec, ELMo, and BERT, achieving an accuracy between 0.91 and 0.93. This result underscores the potential of our approach for classifying malware effectively.
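The pipeline described in the abstract (embed API calls, turn each log into a feature vector, pass it to a classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: windowed co-occurrence counts stand in for Word2Vec, averaging the per-call vectors is an assumed way of pooling a log into one vector, and a cosine-similarity kNN stands in for the classifiers evaluated in the article. The API call logs and the labels `ransomware` and `backdoor` are hypothetical toy data, not taken from the article's dataset.

```python
from collections import Counter, defaultdict
import math

# Hypothetical toy API-call logs with labels; real logs would come from dynamic analysis.
TRAIN = [
    (["CreateFileW", "WriteFile", "CryptEncrypt", "WriteFile", "DeleteFileW"], "ransomware"),
    (["CreateFileW", "CryptEncrypt", "WriteFile", "DeleteFileW", "CryptEncrypt"], "ransomware"),
    (["socket", "connect", "send", "recv", "send"], "backdoor"),
    (["socket", "connect", "recv", "send", "recv"], "backdoor"),
]


def build_embeddings(sequences, window=2):
    """Map each API call to a vector of windowed co-occurrence counts.

    Simple stand-in for a learned embedding such as Word2Vec.
    """
    vocab = sorted({call for seq in sequences for call in seq})
    counts = defaultdict(Counter)
    for seq in sequences:
        for i, call in enumerate(seq):
            lo, hi = max(0, i - window), min(len(seq), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[call][seq[j]] += 1
    return {call: [counts[call][other] for other in vocab] for call in vocab}


def embed_sequence(seq, embeddings):
    """Average the vectors of known calls (assumed pooling); skip unknown calls."""
    dim = len(next(iter(embeddings.values())))
    known = [embeddings[call] for call in seq if call in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_predict(seq, train, embeddings, k=1):
    """Label a log by majority vote among its k most similar training logs."""
    query = embed_sequence(seq, embeddings)
    scored = sorted(
        ((cosine(query, embed_sequence(s, embeddings)), label) for s, label in train),
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


embeddings = build_embeddings([seq for seq, _ in TRAIN])
print(knn_predict(["socket", "connect", "send", "send", "recv"], TRAIN, embeddings))
```

In practice, a trained Word2Vec, HMM-based, ELMo, or BERT model would replace `build_embeddings`, and a library classifier such as a Random Forest would replace `knn_predict`; the flow of data from log to vector to label stays the same.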
Keywords
API calls, BERT, CNN, dynamic analysis, ELMo, HMM2Vec, kNN, RF, SVM, word embeddings, Word2Vec
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Department
Computer Science
Recommended Citation
Sahil Aggarwal and Fabio Di Troia. "Malware Classification Using Dynamically Extracted API Call Embeddings" Applied Sciences (Switzerland) (2024). https://doi.org/10.3390/app14135731