Publication Date
Spring 2024
Degree Type
Master's Project
Degree Name
Master of Science in Computer Science (MSCS)
Department
Computer Science
First Advisor
Sayma Akther
Second Advisor
Nada Attar
Third Advisor
William Andreopoulos
Keywords
Deep Learning, Convolutional Neural Networks, Machine Learning.
Abstract
Lip-reading, a field at the intersection of computer vision and speech processing, focuses on identifying the words a person speaks from their lip movements. This paper presents a streamlined lip-reading solution that employs machine learning and deep learning. First, our work uses Multi-Task Cascaded Convolutional Networks (MTCNN) to detect facial landmarks, including the face and lip regions, and then aligns the face. The aligned faces are segmented to obtain lip images. The lip images are preprocessed with the Real-Enhanced Super Resolution Generative Adversarial Network (Real-ESRGAN) to increase image resolution so that subtle lip movements in video frames can be identified, a critical aspect of lip-reading. Once the lip images have been preprocessed, they are fed into a CNN-based architecture that learns features from them. Feature extraction and lip-movement modeling are performed by a 3D convolutional network followed by time-distributed layers and a bidirectional LSTM. We train our model on the GRID corpus dataset and obtain a character error rate (CER) of 2.3% on seen speakers and 5.2% on unseen speakers.
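The character error rate reported above is conventionally defined as the character-level edit distance between the predicted and reference transcripts, divided by the reference length. The following is a minimal sketch of that standard definition, not the project's own evaluation code; the function names `edit_distance` and `character_error_rate` and the sample sentences are hypothetical and chosen only for illustration.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of characters in the reference."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Illustrative sentences only (assumed, not taken from the project's results).
reference = "bin blue at f two now"
hypothesis = "bin blue at f too now"
cer = character_error_rate(reference, hypothesis)
```

Under this definition, a CER of 2.3% corresponds to roughly one character error per 43 reference characters.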
Recommended Citation
Ambati, Srujith Rao, "Deciphering Speech through Vision: A Deep Learning Lip Reading System" (2024). Master's Projects. 1407.
DOI: https://doi.org/10.31979/etd.sfst-j66z
https://scholarworks.sjsu.edu/etd_projects/1407