Master of Science (MS)
Fabio Di Troia
Word Embeddings, Dynamic Analysis, API Calls, Hmm2Vec, Word2Vec, ELMo, BERT, SVM, RF, kNN, CNN
Malware classification is the process of sorting malware into recognizable categories and is an integral part of implementing computer security. In recent times, machine learning has emerged as one of the most suitable techniques for this task. Models can be trained on various malware features, such as opcodes and API calls, among many others, to extract information that helps with classification.
Word embeddings are a key part of natural language processing and can be seen as a representation of text in which similar words have closer representations. These embeddings can be used to obtain a quantifiable measure of similarity between words. In this research, we conduct a series of experiments using hybrid machine learning techniques, where we generate word vectors and use them as features with various classifiers. We use Hidden Markov Models and Word2Vec to generate embeddings based on dynamic API call logs of the malware. In addition, we use the popular BERT and ELMo models, which are known for generating contextualized embeddings. The resulting vectors are used as input for our classifiers, specifically Support Vector Machines (SVM), Random Forest (RF), k-Nearest Neighbors (kNN), and Convolutional Neural Networks (CNN). Using these, we conduct two distinct sets of experiments: one classifying the family of malware and one classifying the category of malware. The results show that embeddings of API calls can be a useful tool in malware classification, especially in the case of families.
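To make the idea of using API-call embeddings as classifier features concrete, the following is a minimal sketch in Python and is not the implementation used in this project. It assumes hand-picked toy 3-dimensional vectors in place of learned Word2Vec, HMM, BERT, or ELMo embeddings, assumes each API call log is summarized by averaging the vectors of its calls, and uses a small cosine-similarity kNN vote as the classifier. The API call names, the `family_A`/`family_B`/`family_C` labels, and the helper names (`cosine`, `sample_vector`, `knn_predict`) are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Toy, hand-picked vectors standing in for learned API-call embeddings.
# These values are hypothetical and are NOT results from the project.
EMBEDDINGS = {
    "CreateFileW":    [0.9, 0.1, 0.0],
    "WriteFile":      [0.8, 0.2, 0.1],
    "RegOpenKeyExW":  [0.2, 0.8, 0.1],
    "RegSetValueExW": [0.1, 0.9, 0.0],
    "connect":        [0.0, 0.1, 0.9],
    "send":           [0.1, 0.0, 0.8],
}


def cosine(u, v):
    """Cosine similarity: a quantifiable measure of closeness of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sample_vector(api_calls):
    """Summarize one API call log as the mean of its call embeddings.

    Averaging is an assumption made for this sketch; calls without an
    embedding are skipped.
    """
    vectors = [EMBEDDINGS[call] for call in api_calls if call in EMBEDDINGS]
    if not vectors:
        raise ValueError("no known API calls in this log")
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def knn_predict(train, query_calls, k=3):
    """Majority vote among the k training logs most similar to the query."""
    query = sample_vector(query_calls)
    ranked = sorted(
        train,
        key=lambda item: cosine(sample_vector(item[0]), query),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled logs: (sequence of API calls, family label).
TRAIN = [
    (["CreateFileW", "WriteFile", "WriteFile"], "family_A"),
    (["CreateFileW", "WriteFile"], "family_A"),
    (["RegOpenKeyExW", "RegSetValueExW"], "family_B"),
    (["RegSetValueExW", "RegSetValueExW", "RegOpenKeyExW"], "family_B"),
    (["connect", "send", "send"], "family_C"),
    (["connect", "send"], "family_C"),
]

if __name__ == "__main__":
    print(knn_predict(TRAIN, ["WriteFile", "CreateFileW", "CreateFileW"]))
```

In a realistic setting the embedding table would be learned from the API call logs themselves, and the kNN vote could be swapped for any of the other classifiers named above; the structure of the pipeline (calls to vectors, vectors to a per-sample feature, feature to a label) would stay the same.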
Aggarwal, Sahil, "Malware Classification using API Call Information and Word Embeddings" (2023). Master's Projects. 1267.