Publication Date

Spring 2024

Degree Type

Master's Project

Degree Name

Master of Science in Computer Science (MSCS)

Department

Computer Science

First Advisor

Sayma Akther

Second Advisor

Nada Attar

Third Advisor

William Andreopoulos

Keywords

Deep Learning, Convolutional Neural Networks, Machine Learning.

Abstract

Lip-reading, a field at the intersection of computer vision and speech processing, focuses on identifying the words a person speaks from the movements of their lips. This paper presents a streamlined lip-reading solution that employs machine learning and deep learning. First, our work uses Multi-Task Cascaded Convolutional Networks to detect facial "landmarks," including the face and lip regions, and then aligns the face. The aligned faces are segmented to obtain the lip images. The lip images are preprocessed with the Real-Enhanced Super Resolution Generative Adversarial Network, which increases image resolution so that subtle lip movements in video frames can be identified: a critical aspect of lip-reading. Once the lip images have been preprocessed, they are fed into a CNN-based architecture that learns the features. Feature extraction and lip movement are learned by a 3D convolutional network that uses time-distributed layers with a bidirectional LSTM. We train our model on the GRID corpus and obtain a character error rate of 2.3% on seen speakers and 5.2% on unseen speakers.
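The character error rate reported above is conventionally defined as the character-level edit distance between the predicted and reference transcripts, divided by the total number of reference characters. The following is a minimal sketch of that standard definition, not the project's own evaluation code; the function names `edit_distance` and `character_error_rate` are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty string costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_error_rate(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Total character edits divided by total reference characters."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_edits / total_chars
```

For example, `character_error_rate(["abcd"], ["abed"])` returns `0.25`: one substitution over four reference characters. Under this definition, a rate of 2.3% corresponds to roughly one character edit per 43 reference characters.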

Available for download on Sunday, May 25, 2025