An evaluation of machine learning methods for domain name classification

Publication Date

12-10-2020

Document Type

Conference Proceeding

Publication Title

2020 IEEE International Conference on Big Data (Big Data)

Conference Location

Atlanta, GA

DOI

10.1109/BigData50022.2020.9377787

First Page

4577

Last Page

4585

Abstract

Researchers have long focused on the binary classification of domain names sent to DNS servers for resolution to IP addresses, where the objective is to distinguish malicious domains from legitimate ones in order to protect networks from attacks. For legitimate domains, an emerging interest is to classify them into content categories so that DNS servers deployed in an organization can monitor, and potentially block, the resolution of irrelevant domains. For example, a financial organization may want to flag gaming-related domains, and an elementary school may want to block suspicious adult domains. Classifying a domain by its name alone is a challenging task. Currently, no publicly available dataset includes an extensive mapping of domains to content categories, since this is usually proprietary information. Our focus in this work is three-fold: a) to develop a data collection methodology and create rich labelled datasets that are appropriate for training such predictive models, b) to share the datasets with the research community by making them publicly available, and c) to evaluate and identify appropriate machine learning and deep learning algorithms for this problem domain. We consider two different datasets. The first is created by following a SERP (Search Engine Results Page)-mining approach that takes a set of content categories as input. The second is an enhancement of the publicly available DMOZ dataset and takes both domains and category names as input. Beyond the input and methodology used to create them, the two datasets differ in the number and distribution of content categories, which yields different results in our analysis. Overall, we observe that the deep learning-based approach captures the key features of the input data and hence outperforms existing traditional machine learning pipelines, achieving 98.37% and 79.29% accuracy on the respective datasets.
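
The abstract does not say which features the evaluated models use. As a minimal sketch of what classifying a domain "by just the domain name" can start from, the Python snippet below counts character n-grams of a name. The helper `char_ngrams`, the boundary markers `^` and `$`, and the default n-gram length of 3 are all assumptions made for illustration, not the authors' method.

```python
from collections import Counter


def char_ngrams(domain: str, n: int = 3) -> Counter:
    """Count character n-grams of a domain name (hypothetical helper)."""
    # Normalise case and drop a leading/trailing dot such as the one in "example.com."
    name = domain.lower().strip(".")
    # Assumed boundary markers so that prefixes and suffixes stay distinguishable.
    padded = f"^{name}$"
    # Slide a window of length n over the padded name; names shorter than n yield no n-grams.
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


features = char_ngrams("pokergames.com")
print(features.most_common(5))
```

Counts like these could then be fed to any of the traditional classifiers or neural models that such an evaluation compares; which representation works best is an empirical question.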

Keywords

domain classification, DNS request classification, safe network, secure network, SERP-mining, content mining, machine learning data collection, feature engineering

Department

Computer Engineering
