Publication Date

Spring 5-20-2019

Degree Type

Master's Project

Degree Name

Master of Science (MS)


Computer Science

First Advisor

Mike Wu

Second Advisor

Robert Chun

Third Advisor

Katerina Potika


Representing words in text as vectors has been studied extensively, and many models have been proposed that vary in performance and in application. Text processing is used for content recommendation, sentiment analysis, plagiarism detection, content creation, and language translation, to name a few. Specifically, we look at the problem of topic detection in the text content of articles, blogs, and summaries. With the enormous amount of text published every minute on the internet, it is imperative to have good algorithms and approaches that can analyze this content and classify most of it with high confidence for further use.

The project works with unsupervised and supervised machine learning algorithms to tackle the topic detection problem. It targets unsupervised learning algorithms such as Word2vec, doc2vec, and LDA for corpus and language dictionary learning, producing a trained model that understands the semantics of text. The objective is to combine this unsupervised learning with supervised learning algorithms such as Support Vector Machines and deep learning methods, and to analyze and, ideally, improve the accuracy of topic detection. The project also aims to perform user interest-based modeling, which is orthogonal to topic modeling. The idea is to keep the model free of predefined categories.

The project results show that hybrid models classify text into a particular topic category with high accuracy. The project also concludes that user interest modeling can be achieved accurately alongside topic detection. These results are obtained without any meta information about the input text, based purely on the corpus of the input text. This makes the project framework robust, as it has no dependency on the source of the text, the length of the text, or any other meta information about the text content.
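
As a rough illustration of the hybrid idea described above, the following minimal sketch chains an unsupervised LDA step into a supervised linear Support Vector Machine using scikit-learn. It is not the project's actual implementation: the choice of scikit-learn, the toy corpus, the labels, the number of LDA components, and all variable names are assumptions made for illustration only.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus and labels, invented for this sketch.
train_texts = [
    "the team won the match after a late goal",
    "the striker scored twice and the coach praised the defense",
    "the league season ends with a final match this weekend",
    "the new processor improves laptop battery life",
    "the software update fixes a security bug in the browser",
    "developers released a faster compiler for the language",
]
train_labels = ["sports", "sports", "sports", "tech", "tech", "tech"]

# Unsupervised step: word counts -> LDA document-topic distributions.
# Supervised step: a linear SVM trained on those distributions.
hybrid_model = make_pipeline(
    CountVectorizer(stop_words="english"),
    LatentDirichletAllocation(n_components=2, random_state=0),
    LinearSVC(),
)
hybrid_model.fit(train_texts, train_labels)

# Classify unseen text using only its content, with no meta information.
new_texts = [
    "the coach changed the defense before the match",
    "the browser update improves battery life",
]
predictions = hybrid_model.predict(new_texts)
print(list(predictions))
```

With a corpus this small the predictions are not meaningful; the sketch only shows how the document-topic vectors learned without labels become the input features of the supervised classifier.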