Master of Science (MS)
text classification, feature size reduction
One challenge in text classification is that it is difficult to reduce the feature set based on the meaning of the features; an improper feature reduction may even worsen classification accuracy. Word2Vec, a word embedding method, has recently gained popularity because it captures semantic similarity between words with high precision at relatively low computational cost. However, little research has focused on feature reduction using Word2Vec. In this project, we developed a Word2Vec-based method that reduces the feature size while increasing classification accuracy. The feature reduction is achieved by loosely clustering similar features using graph search techniques: features are paired and clustered when their similarity exceeds a threshold of 0.5. Finally, we use Multinomial Naïve Bayes, Support Vector Machine, K-Nearest Neighbor, and Random Forest classifiers to evaluate the effect of our method. Four datasets, with up to 100,000 features and 400,000 documents, are used in the evaluation. The results show a feature reduction of around 4-10% together with a 1-4% improvement in classification accuracy, depending on the dataset and classifier. We also show that combining our method with classic feature reduction techniques such as chi-square and mutual information further improves both feature reduction and classification accuracy.
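The core idea described above, pairing features whose embedding similarity exceeds 0.5 and merging the resulting groups via graph search, can be sketched as follows. This is a minimal illustration, not the project's actual implementation: it uses hypothetical toy vectors in place of trained Word2Vec embeddings and connected components (via union-find) as the graph search.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cluster_features(vectors, threshold=0.5):
    """Loosely cluster features: connect every pair of features whose
    cosine similarity exceeds `threshold`, then take the connected
    components (union-find). Each component becomes one merged feature."""
    words = list(vectors)
    parent = {w: w for w in words}

    def find(w):
        # Find the component root, with path compression.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if cosine(vectors[a], vectors[b]) > threshold:
                parent[find(a)] = find(b)  # union the two components

    clusters = {}
    for w in words:
        clusters.setdefault(find(w), []).append(w)
    return list(clusters.values())

# Hypothetical 2-D toy embeddings standing in for Word2Vec vectors.
toy = {
    "car":   np.array([0.9, 0.1]),
    "auto":  np.array([0.85, 0.2]),
    "fruit": np.array([0.1, 0.95]),
}
print(cluster_features(toy, threshold=0.5))
# → [['car', 'auto'], ['fruit']]
```

With real Word2Vec embeddings (e.g. loaded through gensim), the same routine would merge near-synonymous vocabulary terms into single features before the document-term matrix is built, which is how the feature count shrinks without discarding meaning.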
Ge, Lihao, "Improving Text Classification with Word Embedding" (2017). Master's Projects. 541.