Publication Date
Spring 2020
Degree Type
Master's Project
Degree Name
Master of Science (MS)
Department
Computer Science
First Advisor
Chris Pollett
Second Advisor
Robert Chun
Third Advisor
Thomas Austin
Keywords
Hash2vec, Word2Vec, Sequence-to-Sequence Translation, Deep Learning, Recurrent Neural Network (RNN), Principal Component Analysis (PCA).
Abstract
Machine Translation is the study of computer translation of text written in one human language into text in a different language. Within this field, a word embedding is a mapping from terms in a language to low-dimensional vectors that can be processed using mathematical operations. Two traditional word embedding approaches are word2vec, which uses a neural network, and hash2vec, which is based on a simpler hashing algorithm. In this project, we explored the relative suitability of each approach to sequence-to-sequence text translation using a Recurrent Neural Network (RNN). We also carried out experiments to test whether we can directly compute a mapping from word embeddings in one language to word embeddings in another language using Linear Regression followed by Principal Component Analysis (PCA).
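The hashing idea behind hash2vec can be illustrated with a minimal sketch: each context word is hashed to a fixed vector index and a sign, and a word's embedding is the signed count of its hashed context words accumulated in a single pass over the corpus, with no training. The dimension, window size, and md5-based sign hash below are illustrative assumptions, not the project's actual implementation.

```python
import hashlib
from collections import defaultdict

DIM = 8  # tiny embedding size for illustration; real runs use hundreds of dimensions

def bucket(word, dim=DIM):
    # deterministically hash a word to a vector index and a +/-1 sign
    # (md5 is an arbitrary choice here; any stable hash works)
    h = int(hashlib.md5(word.encode()).hexdigest(), 16)
    return h % dim, 1 if (h // dim) % 2 == 0 else -1

def hash2vec(tokens, window=2, dim=DIM):
    # accumulate signed hashed context counts: a single corpus pass, no training
    vecs = defaultdict(lambda: [0.0] * dim)
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j == i:
                continue
            idx, sign = bucket(tokens[j], dim)
            vecs[word][idx] += sign
    return dict(vecs)

corpus = "the cat sat on the mat the dog sat on the rug".split()
emb = hash2vec(corpus)
```

Because the embedding is a deterministic function of the corpus, identical runs produce identical vectors, which is what makes the approach so much cheaper than training word2vec.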
We trained the word2vec model for 24 hours on Google Colab with default settings. When applied to sentence translation, this word2vec model achieved 85% accuracy. Surprisingly, the hash2vec model performed relatively well, reaching 60% accuracy while requiring only 6 hours of processing time, a substantial saving over training word2vec. Further research could apply the hash2vec technique to larger datasets and to different machine learning models.
Recommended Citation
Gaikwad, Neha, "Comparison of Word2vec with Hash2vec for Machine Translation" (2020). Master's Projects. 919.
DOI: https://doi.org/10.31979/etd.e7pz-uhqh
https://scholarworks.sjsu.edu/etd_projects/919