![]() ![]() However, differently from prior efforts, we focus on passive flow level traces, and we are limited to little information. The problem of classifying domain names is not new various researchers studied it in the past . This problem falls in the supervised classification class, and we aim at training a classifier based on different machine learning approaches. #Domain graph manual#Given the flow sequence, each with the timestamp of when the request was observed, we want to automatically assign each domain a category, assuming to know only a small subset of domain categories (e.g., via manual labeling). Each flow contains the client identifier (e.g., the client IP address - properly anonymized in our data) and the domain name of the server (recovered directly from the HTTP request, or DNS and TLS negotiation when HTTPS is in place). We specifically consider logs collected by passively monitoring the network traffic, where a passive sniffer identifies TCP or UDP flows, and recovers the name of the servers serving such flows. In this paper focus, websites are URLs such as, whereas domains refer to the URL domain only, i.e., Our goal is to perform domain classification using large flow level logs only. This task is the focus of our paper and has several applications such as information integration , building efficient focused crawler or vertical search engines , helping to choose the appropriate model for extracting information from a web page , improving quality of search results , constructing and expanding web directories , web filtering and advertising . A common machine learning classification task is to assign a category to a domain (i.e., mapping to a category such as “News and Media”) . 1 With this ever-growing nature of the Web, researchers and practitioners resort more and more to automated approaches bases on machine learning to process and understand such vast variety. ![]() The latest estimations show that there are over 1.6 billion websites on the Internet, distributed over more than 268 million domains. Our work is the first to perform an extensive evaluation of domain name classification using only passive flow-level logs to the best of our knowledge. However, in this framework, classification scores are lower than those usually found when exploiting the page-specific content. ![]() Using graphs, we incorporate in the classifier aspects not strictly related to the labeled data, and we can classify most of the unlabeled domains. Using a large dataset with hundreds of thousands of domain names and 25 different categories, we show that semi-supervised learning methods are more suitable for this task than traditional supervised approaches. For this, we propose and evaluate different classification methods based on machine learning. We exploit the information carried by network logs, using just the name of the websites and the sequence of visited websites by users. Differently from prior efforts that need to crawl and use the web pages’ actual content, we rely only on traffic logs passively collected, observing traffic regularly flowing in the network, without the burden to crawl and parse web pages. Domain name classification is challenging due to the high number of class labels and the highly skewed class distributions. In this work, we tackle the problem of classifying websites domain names to a category, e.g., mapping bbc.com to the ”News and Media” class. ![]()
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |