Publication Date
Spring 2013
Degree Type
Thesis
Degree Name
Master of Science (MS)
Department
Computer Science
Advisor
Chris J. Pollett
Keywords
Active Learning, Classification, Logistic Regression, Machine Learning
Subject Areas
Computer science; Statistics
Abstract
This thesis project augments the Yioop search engine with a general facility for automatically assigning "class" meta words (e.g., "class:advertising") to web pages based on the output of a logistic regression text classifier. Users can create multiple classifiers using Yioop's web-based interface; each is trained first on a small set of labeled documents drawn from previous crawls, then improved over repeated rounds of active learning using density-weighted pool-based sampling.
The classification system's accuracy when classifying new documents was found to be comparable to published results for a common dataset, approaching 82% for a corpus of advertisements to be filtered from content-providers' web pages. In agreement with previous work, logistic regression was found to provide greater accuracy than Naive Bayes for training sets consisting of more than two hundred documents. Active learning with density-weighted pool-based sampling was found to offer a small accuracy boost over random document sampling for training sets consisting of fewer than one hundred documents.
Overall, the system was shown to be effective for the proposed task of allowing users to create novel web page classifiers, but the active learning component will require more work if it is to provide users with a salient benefit over random sampling.
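As a rough illustration of the approach named in the abstract, the following Python sketch trains a small logistic regression classifier and then picks the next document to label from an unlabeled pool. It is not the thesis's or Yioop's code. The scoring rule (prediction entropy multiplied by average cosine similarity to the rest of the pool, raised to a power `beta`) is one common formulation of density weighting and is an assumption here, as are all function names, the toy documents, and the training settings.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(weights, doc):
    # doc is a sparse feature vector {term: value}; "__bias__" is the intercept.
    z = weights.get("__bias__", 0.0)
    z += sum(weights.get(t, 0.0) * v for t, v in doc.items())
    return sigmoid(z)

def train(labeled, epochs=200, rate=0.5):
    # Stochastic gradient ascent on the log-likelihood (hypothetical settings).
    weights = {}
    for _ in range(epochs):
        for doc, label in labeled:
            error = label - predict(weights, doc)
            weights["__bias__"] = weights.get("__bias__", 0.0) + rate * error
            for t, v in doc.items():
                weights[t] = weights.get(t, 0.0) + rate * error * v
    return weights

def entropy(p):
    # Binary entropy in nats; highest when the classifier is least certain.
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_next(weights, pool, beta=1.0):
    # Assumed scoring rule: uncertainty weighted by how typical the
    # document is of the pool (average similarity to all pool documents).
    best_index, best_score = None, -1.0
    for i, doc in enumerate(pool):
        density = sum(cosine(doc, other) for other in pool) / len(pool)
        score = entropy(predict(weights, doc)) * density ** beta
        if score > best_score:
            best_index, best_score = i, score
    return best_index

# Toy data: label 1 marks an advertisement, label 0 marks other content.
labeled = [({"buy": 1, "sale": 1}, 1), ({"news": 1, "report": 1}, 0)]
pool = [
    {"buy": 1, "news": 1},
    {"buy": 1, "report": 1},
    {"sale": 1, "discount": 1},
    {"zebra": 1},
]
weights = train(labeled)
index = select_next(weights, pool)
print("Next document to label:", pool[index])
```

In this toy pool the outlier document (`zebra`) is as uncertain as the mixed documents, but its low similarity to the rest of the pool lowers its score. That is the intended effect of density weighting: labeling effort goes to uncertain documents that are also representative of the pool.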
Recommended Citation
Tice, Shawn Cameron, "Classification of Web Pages in Yioop with Active Learning" (2013). Master's Theses. 4318.
DOI: https://doi.org/10.31979/etd.zegm-nq79
https://scholarworks.sjsu.edu/etd_theses/4318