Publication Date

Fall 2020

Degree Type

Thesis - Campus Access Only

Degree Name

Master of Science (MS)

Department

Computer Engineering

Advisor

Mahima Agumbe Suresh

Keywords

aspect based sentiment analysis, domain specific, machine learning, natural language processing, semantics, word embeddings

Subject Areas

Computer engineering; Computer science; Linguistics

Abstract

Customer reviews are a rich, abundant source of information that can help predict commercial success or failure. Product designers, in particular, benefit significantly if they can better understand customer requirements. Through aspect-based sentiment analysis, we can analyze large volumes of online reviews for customer sentiment toward specific product features or components. However, most such machine learning models require large amounts of aspect-annotated training data before commercial use becomes viable. Further, product design is a highly industry-specific process, and any algorithm attempting to learn a particular product's features must train on semantically similar data. These dependencies pose challenges, since domain-specific data for a particular product can be extremely hard to find. On the other hand, a machine learning practitioner may wonder whether gathering hard-to-come-by text data covering a limited set of topics is worth the time and resources it takes; after all, machine learning algorithms trained to generalize across different data distributions are more robust. In the interest of thoroughness, we gathered large amounts of text data from generic, domain-related, and topic-specific sources and conducted extensive experiments on model training. We then compared the results of models trained on these text data distributions across three different product categories. Our findings clearly show the advantage of gathering text data that is semantically similar to the data we ultimately analyze and evaluate, even when the gathered data cannot exactly match the domain of the evaluation data. We also gained valuable insights along the way that can help machine learning practitioners in this field make informed decisions when designing such systems.
