Master of Science (MS)
text classification, feature size reduction
One challenge in text classification is that it is difficult to reduce the feature set based on the meaning of the features; an improper feature reduction may even worsen classification accuracy. Word2Vec, a word embedding method, has recently gained popularity because it captures semantic similarity between words with high precision at relatively low computational cost. However, little research has focused on feature reduction using Word2Vec. In this project, we developed a Word2Vec-based method that reduces the feature size while increasing classification accuracy. The feature reduction is achieved by loosely clustering similar features using graph search techniques: features are paired and clustered when their similarity exceeds a threshold of 0.5. Finally, we use Multinomial Naïve Bayes, Support Vector Machine, K-Nearest Neighbor, and Random Forest classifiers to evaluate the effect of our method. Four datasets, with up to 100,000 features and 400,000 documents, are used in the evaluation. The results show a feature reduction of around 4-10% together with a 1-4% improvement in classification accuracy, depending on the dataset and classifier. We also show that combining our method with classic feature reduction techniques such as chi-square and mutual information further improves both feature reduction and classification accuracy.
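The core idea described above, pairing features whose embedding similarity exceeds 0.5 and merging the resulting groups via graph search, can be sketched as follows. This is a minimal illustration, not the project's actual implementation: it uses hypothetical toy vectors in place of trained Word2Vec embeddings and connected components (via union-find) as the graph search.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cluster_features(vectors, threshold=0.5):
    """Loosely cluster features: connect every pair of features whose
    cosine similarity exceeds `threshold`, then take the connected
    components (union-find). Each component becomes one merged feature."""
    words = list(vectors)
    parent = {w: w for w in words}

    def find(w):
        # Find the component root, with path compression.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if cosine(vectors[a], vectors[b]) > threshold:
                parent[find(a)] = find(b)  # union the two components

    clusters = {}
    for w in words:
        clusters.setdefault(find(w), []).append(w)
    return list(clusters.values())

# Hypothetical 2-D toy embeddings standing in for Word2Vec vectors.
toy = {
    "car":   np.array([0.9, 0.1]),
    "auto":  np.array([0.85, 0.2]),
    "fruit": np.array([0.1, 0.95]),
}
print(cluster_features(toy, threshold=0.5))
# → [['car', 'auto'], ['fruit']]
```

With real Word2Vec embeddings (e.g. loaded through gensim), the same routine would merge near-synonymous vocabulary terms into single features before the document-term matrix is built, which is how the feature count shrinks without discarding meaning.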
Ge, Lihao, "Improving Text Classification with Word Embedding" (2017). Master's Projects. 541.