Deciphering Speech Through Vision: A Deep Learning Lip Reading System
Publication Date
9-29-2025
Document Type
Conference Proceeding
Publication Title
3rd IEEE International Conference on Data Science and Network Security (ICDSNS) 2025
DOI
10.1109/ICDSNS65743.2025.11168793
Abstract
Lip reading bridges computer vision and speech processing by recognizing spoken words from visual lip movements alone. This study presents a streamlined framework combining facial landmark detection, image enhancement, and deep spatiotemporal modeling. We use MTCNN to detect and align lip regions, which are then enhanced by Real-ESRGAN for higher resolution and finer detail. The enhanced images feed into a 3D CNN with time-distributed layers and a bidirectional LSTM, trained with CTC loss for effective spatiotemporal feature learning and alignment-free transcription. Evaluated on the GRID corpus, our model achieves a character error rate (CER) of 2.3% on seen speakers and 5.2% on unseen speakers. Overall, it delivers state-of-the-art performance with a 5.2% CER and 95.8% accuracy, improving CER by 18.8% over LipNet. Notably, for unseen speakers it reduces CER from LipNet's 9.4% to 5.2%, a 44.7% relative decrease, demonstrating superior generalization and robustness. These results show that combining super-resolution with deep temporal modeling substantially improves the accuracy and reliability of visual speech recognition.
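The abstract does not specify layer counts or dimensions, so the following is a minimal sketch of the described spatiotemporal stage only: a 3D CNN with time-distributed flattening feeding a bidirectional LSTM, trained with CTC loss for alignment-free transcription. The input shape, filter sizes, hidden width, and vocabulary below are illustrative assumptions, not the paper's reported configuration.

```python
# Sketch (Keras) of a 3D CNN + time-distributed layers + BiLSTM trained with CTC.
# All sizes here (T, H, W, filter counts, VOCAB) are assumptions for illustration.
import tensorflow as tf
from tensorflow.keras import layers, Model

T, H, W, C = 75, 50, 100, 3   # frames per clip and lip-crop size (assumed)
VOCAB = 28                    # e.g. 26 letters + space + CTC blank (assumed)

frames = layers.Input(shape=(T, H, W, C), name="lip_frames")

# Spatiotemporal feature extraction with 3D convolutions.
x = layers.Conv3D(32, (3, 5, 5), padding="same", activation="relu")(frames)
x = layers.MaxPooling3D((1, 2, 2))(x)
x = layers.Conv3D(64, (3, 5, 5), padding="same", activation="relu")(x)
x = layers.MaxPooling3D((1, 2, 2))(x)

# Collapse each frame's spatial map into a vector, keeping the time axis.
x = layers.TimeDistributed(layers.Flatten())(x)

# Bidirectional LSTM models temporal dependencies across frames.
x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)

# Per-frame character logits; the last index serves as the CTC blank.
logits = layers.Dense(VOCAB)(x)
model = Model(frames, logits)

def ctc_loss(labels, logits, label_len, logit_len):
    """CTC loss: no frame-level alignment between video and text is needed."""
    return tf.reduce_mean(
        tf.nn.ctc_loss(
            labels=labels,
            logits=logits,
            label_length=label_len,
            logit_length=logit_len,
            logits_time_major=False,
            blank_index=VOCAB - 1,
        )
    )
```

CTC is the natural choice here because GRID transcripts give the spoken sentence but not which frames correspond to which characters; the loss marginalizes over all valid alignments, so the network can be trained end to end from sentence-level labels alone.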
Keywords
Convolutional Neural Network, Deep learning, Long Short-Term Memory, Multi-Task Cascaded Convolutional Networks, Super Resolution Generative Adversarial Network
Department
Computer Science
Recommended Citation
Srujith Rao Ambati and Sayma Akther. "Deciphering Speech Through Vision: A Deep Learning Lip Reading System." 3rd IEEE International Conference on Data Science and Network Security (ICDSNS) 2025 (2025). https://doi.org/10.1109/ICDSNS65743.2025.11168793