Publication Date

Spring 2017

Degree Type

Master's Project

Degree Name

Master of Science (MS)

Department

Computer Science

First Advisor

Leonard Wesley

Second Advisor

Robert Chun

Third Advisor

Robin James

Keywords

Machine Learning, Document Classification

Abstract

To classify documents algorithmically, the documents must be represented in a form that a machine learning classifier can process. This report discusses the types of feature vectors through which a document can be represented and subsequently classified. The project compares the Binary, Count, and TfIdf feature vectors and their impact on document classification. To test how well each of the three feature vectors performs, we used the 20-newsgroup dataset and converted its documents to all three representations. For each representation, we trained a Naïve Bayes classifier and then evaluated it on test documents. With stop words removed, TfIdf performed 4% better than the Count vectorizer and 6% better than the Binary vectorizer. With stop words retained, TfIdf performed 6% better than the Binary vectorizer and 11% better than the Count vectorizer. The Count vectorizer outperformed the Binary vectorizer by 2% when stop words were removed but lagged behind it by 5% when they were retained. We therefore conclude that TfIdf should be the preferred vectorizer for document representation and classification.
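
The three representations compared in the abstract can be illustrated with a minimal sketch. It is not the project's code: the toy corpus, the stop-word list, and the helper names (`tokenize`, `build_vocabulary`, `count_vector`, `binary_vector`, `tfidf_vectors`) are all hypothetical, and the TfIdf weighting shown (raw term count times ln(N / document frequency)) is only one common variant, assumed here because the abstract does not state which formula was used.

```python
import math
from collections import Counter

# Toy corpus (hypothetical; the project itself used the 20-newsgroup dataset).
docs = [
    "the rocket launch was delayed",
    "the team won the hockey game",
    "rocket engines and orbit",
]

# Tiny illustrative stop-word list (an assumption, not the project's list).
STOP_WORDS = {"the", "was", "and"}


def tokenize(text, remove_stop_words=True):
    """Lowercase, split on whitespace, and optionally drop stop words."""
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def build_vocabulary(token_lists):
    """Return the sorted list of distinct terms across all documents."""
    return sorted({t for tokens in token_lists for t in tokens})


def count_vector(tokens, vocab):
    """Count vector: how many times each vocabulary term occurs."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


def binary_vector(tokens, vocab):
    """Binary vector: 1 if the term occurs at least once, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocab]


def tfidf_vectors(token_lists, vocab):
    """TfIdf vectors using tf = raw count and idf = ln(N / df)."""
    n_docs = len(token_lists)
    # Document frequency: number of documents containing each term.
    df = Counter(t for tokens in token_lists for t in set(tokens))
    idf = [math.log(n_docs / df[term]) for term in vocab]
    vectors = []
    for tokens in token_lists:
        counts = count_vector(tokens, vocab)
        vectors.append([c * w for c, w in zip(counts, idf)])
    return vectors


if __name__ == "__main__":
    token_lists = [tokenize(d) for d in docs]
    vocab = build_vocabulary(token_lists)
    print(vocab)
    print(binary_vector(token_lists[1], vocab))
    print(count_vector(token_lists[1], vocab))
    print(tfidf_vectors(token_lists, vocab)[1])
```

Calling `tokenize` with `remove_stop_words=False` reproduces the "stop words retained" condition: frequent words such as "the" then receive large counts in the Count vector but are down-weighted by the idf factor in the TfIdf vector. In practice a library would normally be used instead; scikit-learn, for example, provides `CountVectorizer` (with `binary=True` for the Binary case), `TfidfVectorizer`, and `MultinomialNB`, though the abstract does not say which tools the project relied on.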

Recommended Citation

Basarkar, Ankit, "DOCUMENT CLASSIFICATION USING MACHINE LEARNING" (2017). Master's Projects. 531.
DOI: https://doi.org/10.31979/etd.6jmu-9xdt
https://scholarworks.sjsu.edu/etd_projects/531

Included in

Artificial Intelligence and Robotics Commons, Databases and Information Systems Commons
