Publication Date

Spring 2020

Degree Type

Master's Project

Degree Name

Master of Science (MS)

Department

Computer Science

First Advisor

Chris Pollett

Second Advisor

Robert Chun

Third Advisor

Thomas Austin

Keywords

Hash2vec, Word2Vec, Sequence-to-Sequence Translation, Deep Learning, Recurrent Neural Network (RNN), Principal Component Analysis (PCA).

Abstract

Machine Translation is the study of computer translation of text written in one human language into text in a different language. Within this field, a word embedding is a mapping from terms in a language to low-dimensional vectors that can be processed using mathematical operations. Two traditional word embedding approaches are word2vec, which uses a neural network, and hash2vec, which is based on a simpler hashing algorithm. In this project, we explored the relative suitability of each approach for sequence-to-sequence text translation using a Recurrent Neural Network (RNN). We also carried out experiments to test whether we can directly compute a mapping from word embeddings in one language to word embeddings in another language using Linear Regression followed by Principal Component Analysis (PCA).

We trained the word2vec model for 24 hours on Google Colab with default settings. Applied to sentence translation, this word2vec model produced results with 85% accuracy. Surprisingly, the hash2vec model performed relatively well, with 60% accuracy. The hash2vec model required only 6 hours of processing time, a substantial saving over the time spent training the word2vec model. Further research could apply the hash2vec technique to larger datasets and to different machine learning models.
