Publication Date

Spring 2013

Degree Type

Thesis

Degree Name

Master of Science (MS)

Department

Computer Science

Advisor

Chris J. Pollett

Keywords

Active Learning, Classification, Logistic Regression, Machine Learning

Subject Areas

Computer science; Statistics

Abstract

This thesis project augments the Yioop search engine with a general facility for automatically assigning "class" meta words (e.g., "class:advertising") to web pages based on the output of a logistic regression text classifier. Users can create multiple classifiers using Yioop's web-based interface, each trained first on a small set of labeled documents drawn from previous crawls, then improved over repeated rounds of active learning using density-weighted pool-based sampling.

The classification system's accuracy when classifying new documents was found to be comparable to published results for a common dataset, approaching 82% for a corpus of advertisements to be filtered from content-providers' web pages. In agreement with previous work, logistic regression was found to provide greater accuracy than Naive Bayes for training sets consisting of more than two hundred documents. Active learning with density-weighted pool-based sampling was found to offer a small accuracy boost over random document sampling for training sets consisting of fewer than one hundred documents.

Overall, the system was shown to be effective for the proposed task of allowing users to create novel web page classifiers, but the active learning component will require more work if it is to provide users with a salient benefit over random sampling.
