Publication Date

Spring 2018

Degree Type

Master's Project

Degree Name

Master of Science (MS)


Computer Science


Yelp is a popular review platform that connects people to local businesses and helps customers decide which business to choose. It relies on crowdsourced plain-text reviews. Some facts, such as category and location, can be determined from a business's description, but a more detailed picture can be extracted from the reviews. Discovering latent topics and subtopics in Yelp reviews can help summarize them. For example, we can deduce that reviews in the Restaurant category tend to emphasize service, food, order, etc. Additionally, one can deduce positive or negative feedback on each topic and subtopic. In this project, we study the problem of content topic discovery in a Yelp dataset using probabilistic and other models. We performed various experiments to extract word features while trying to preserve the original context and sentence structure, using techniques such as document-to-bag-of-words conversion, Word Embedding, Parts of Speech (POS) tagging, and Term Frequency-Inverse Document Frequency (TF-IDF). We then discover topics in the Yelp corpus using three unsupervised Machine Learning techniques: Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), and K-Means. These techniques divide the corpus into latent topics that summarize the review text and highlight its insights. The methods are compared using the Coherence Model, and the resulting LDA model is visualized using pyLDAvis. Comparing the techniques, we conclude that K-Means using Word Embeddings on words with particular POS tags gives the best results but is time-consuming. In contrast, LDA applied to a cleaned corpus of POS-tagged words with TF-IDF is much faster, although its topics lose some context in comparison to K-Means.
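The general idea of turning reviews into weighted word features and grouping them into topics can be illustrated with a minimal sketch. It is not the project's implementation: the four reviews, the stop-word list, and all function names are hypothetical, and it clusters TF-IDF vectors with a plain-Python K-Means rather than using Word Embeddings, POS tagging, LDA, or LSA as the project does.

```python
import math
from collections import Counter

# Hypothetical toy reviews; the real Yelp dataset is not used here.
REVIEWS = [
    "the food was great and the pizza was tasty",
    "tasty pizza great food and fresh salad",
    "the service was slow and the waiter was rude",
    "rude waiter and slow service at the counter",
]
STOP_WORDS = {"the", "was", "and", "at"}  # assumed, deliberately tiny list


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one L2-normalised sparse TF-IDF vector (a dict) per document."""
    bags = [Counter(tokenize(d)) for d in docs]  # document -> bag of words
    n = len(bags)
    df = Counter(term for bag in bags for term in bag)  # document frequency
    vectors = []
    for bag in bags:
        total = sum(bag.values())
        vec = {t: (c / total) * (math.log(n / df[t]) + 1.0) for t, c in bag.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in set(a) | set(b))


def kmeans(vectors, k, iterations=10):
    """Plain K-Means with a deterministic farthest-point initialisation."""
    centroids = [dict(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(dict(farthest))
    labels = []
    for _ in range(iterations):
        # Assignment step: each document goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j])) for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if not members:
                continue  # keep the old centroid if a cluster is empty
            terms = set().union(*members)
            centroids[j] = {
                t: sum(m.get(t, 0.0) for m in members) / len(members) for t in terms
            }
    return labels, centroids


def top_terms(centroid, n=4):
    """The n highest-weighted terms of a centroid, ties broken alphabetically."""
    return sorted(centroid, key=lambda t: (-centroid[t], t))[:n]


labels, centroids = kmeans(tfidf_vectors(REVIEWS), k=2)
for j, centroid in enumerate(centroids):
    print(j, top_terms(centroid))
```

On this toy corpus the two clusters separate food-related reviews from service-related ones, and the top-weighted centroid terms serve as a rough topic summary. A real pipeline would more likely rely on established libraries for vectorization, clustering, and coherence scoring than on hand-written routines like these.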