Deciphering Speech Through Vision: A Deep Learning Lip Reading System
Publication Date
9-29-2025
Document Type
Conference Proceeding
Publication Title
3rd IEEE International Conference on Data Science and Network Security (ICDSNS) 2025
DOI
10.1109/ICDSNS65743.2025.11168793
Abstract
Lip reading bridges computer vision and speech processing by recognizing spoken words from visual lip movements alone. This study presents a streamlined framework combining facial landmark detection, image enhancement, and deep spatiotemporal modeling. We use MTCNN to detect and align lip regions, which are then enhanced by Real-ESRGAN for higher resolution and finer detail. The enhanced images feed into a 3D CNN with time-distributed layers and a bidirectional LSTM, trained with CTC loss for effective spatiotemporal feature learning and alignment-free transcription. Evaluated on the GRID corpus, our model achieves a character error rate (CER) of 2.3% on seen speakers and 5.2% on unseen speakers. Overall, it delivers state-of-the-art performance with a 5.2% CER and 95.8% accuracy, improving CER by 18.8% over LipNet. Notably, for unseen speakers it reduces CER from LipNet's 9.4% to 5.2%, a 44.7% relative decrease, demonstrating superior generalization and robustness. These results show that combining super-resolution with deep temporal modeling substantially improves the accuracy and reliability of visual speech recognition.
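The abstract does not specify layer counts or dimensions, so the following is a minimal sketch of the described spatiotemporal stage only: a 3D CNN with time-distributed flattening feeding a bidirectional LSTM, trained with CTC loss for alignment-free transcription. The input shape, filter sizes, hidden width, and vocabulary below are illustrative assumptions, not the paper's reported configuration.

```python
# Sketch (Keras) of a 3D CNN + time-distributed layers + BiLSTM trained with CTC.
# All sizes here (T, H, W, filter counts, VOCAB) are assumptions for illustration.
import tensorflow as tf
from tensorflow.keras import layers, Model

T, H, W, C = 75, 50, 100, 3   # frames per clip and lip-crop size (assumed)
VOCAB = 28                    # e.g. 26 letters + space + CTC blank (assumed)

frames = layers.Input(shape=(T, H, W, C), name="lip_frames")

# Spatiotemporal feature extraction with 3D convolutions.
x = layers.Conv3D(32, (3, 5, 5), padding="same", activation="relu")(frames)
x = layers.MaxPooling3D((1, 2, 2))(x)
x = layers.Conv3D(64, (3, 5, 5), padding="same", activation="relu")(x)
x = layers.MaxPooling3D((1, 2, 2))(x)

# Collapse each frame's spatial map into a vector, keeping the time axis.
x = layers.TimeDistributed(layers.Flatten())(x)

# Bidirectional LSTM models temporal dependencies across frames.
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)

# Per-frame character logits; the last index serves as the CTC blank.
logits = layers.Dense(VOCAB)(x)
model = Model(frames, logits)

def ctc_loss(labels, logits, label_len, logit_len):
    """CTC loss: no frame-level alignment between video and text is needed."""
    return tf.reduce_mean(
        tf.nn.ctc_loss(
            labels=labels,
            logits=logits,
            label_length=label_len,
            logit_length=logit_len,
            logits_time_major=False,
            blank_index=VOCAB - 1,
        )
    )
```

CTC is the natural choice here because GRID transcripts give the spoken sentence but not which frames correspond to which characters; the loss marginalizes over all valid alignments, so the network can be trained end to end from sentence-level labels alone.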
Keywords
Convolutional Neural Network, Deep learning, Long Short-Term Memory, Multi-Task Cascaded Convolutional Networks, Super Resolution Generative Adversarial Network
Department
Computer Science
Recommended Citation
Srujith Rao Ambati and Sayma Akther. "Deciphering Speech Through Vision: A Deep Learning Lip Reading System." 3rd IEEE International Conference on Data Science and Network Security (ICDSNS) 2025 (2025). https://doi.org/10.1109/ICDSNS65743.2025.11168793