Publication Date

Fall 2023

Degree Type

Master's Project

Degree Name

Master of Science in Computer Science (MSCS)

Department

Computer Science

First Advisor

Ching-Seh Wu

Second Advisor

Navrati Saxena

Third Advisor

Mrunmayi Deshpande

Keywords

Sign language translation, deep learning, transformers, multi-cue networks, temporal pooling, linear competing units, stochastic weights, weight compression, bilingual evaluation understudy

Abstract

Sign language is a form of visual language that uses facial expressions and hand gestures to communicate thoughts and concepts. The term refers to multiple visual languages that share some common visual cues but differ in their grammar and syntax. Sign language translation (SLT) is a crucial step in closing the communication gap between hearing and hearing-impaired people. The study of SLT using machine learning has received considerable interest over the last three years, but despite this progress, SLT research is still in its early phases. Most previous approaches first convert the signs to glosses and then convert the glosses into meaningful sentences. A gloss is a word associated with a sign, i.e., a label for a particular sign; it does not convey the sign's full meaning but only roughly captures its intent. Although such approaches achieve good accuracy, obtaining gloss-sequence ground truth is a tedious job, and it also increases memory requirements during translation. This research project proposes an ensemble transformer architecture that combines Spatial-Temporal Multi-Cue networks (STMC) with stochastic transformers to avoid the explicit use of glosses and directly perform video-to-natural-spoken-language conversion while reducing memory consumption during inference. The results show a 3% increase in translation accuracy in terms of BLEU-4 score compared to previous state-of-the-art research, along with a fivefold reduction in training time and a 70.7% reduction in memory consumption during inference.
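To make the gloss-free pipeline described above more concrete, the following is a minimal illustrative sketch in PyTorch of a video-to-text encoder-decoder that skips the intermediate gloss stage: per-frame features are temporally pooled and fed to a standard transformer that decodes spoken-language tokens directly. The module names (CueEncoder, GlossFreeSLT) and all dimensions are hypothetical stand-ins for exposition only, not the project's actual STMC or stochastic-transformer implementation.

```python
# Illustrative sketch only; assumes pre-extracted 512-dim frame features.
import torch
import torch.nn as nn


class CueEncoder(nn.Module):
    """Toy stand-in for a spatial-temporal multi-cue feature extractor.

    Each video frame feature is projected to the model dimension, then
    temporal pooling halves the sequence length before the transformer.
    """

    def __init__(self, frame_dim: int = 512, d_model: int = 256):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, d_model)
        self.temporal_pool = nn.AvgPool1d(kernel_size=2, stride=2)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, frame_dim) -> (batch, time // 2, d_model)
        x = self.frame_proj(frames)
        x = self.temporal_pool(x.transpose(1, 2))  # pool along the time axis
        return x.transpose(1, 2)


class GlossFreeSLT(nn.Module):
    """Encoder-decoder that maps pooled video features straight to text
    tokens, with no intermediate gloss sequence."""

    def __init__(self, vocab_size: int = 8000, d_model: int = 256):
        super().__init__()
        self.encoder = CueEncoder(d_model=d_model)
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, frames: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        memory_in = self.encoder(frames)        # video-side representation
        tgt = self.token_emb(tokens)            # shifted target tokens
        hidden = self.transformer(memory_in, tgt)
        return self.out(hidden)                 # (batch, tokens, vocab_size) logits


if __name__ == "__main__":
    model = GlossFreeSLT()
    video = torch.randn(2, 32, 512)             # 2 clips, 32 frames each
    text = torch.randint(0, 8000, (2, 10))      # 2 target sentences, 10 tokens each
    logits = model(video, text)
    print(logits.shape)                          # torch.Size([2, 10, 8000])
```

Translation quality for such a model is typically reported with BLEU-4, which scores n-gram overlap (up to 4-grams) between the generated sentences and reference translations, as in the results quoted above.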
