Deciphering Speech Through Vision: A Deep Learning Lip Reading System
Abstract
Lip reading bridges computer vision and speech processing by recognizing spoken words from visual lip movements alone. This study presents a streamlined framework that combines facial landmark detection, image enhancement, and deep spatiotemporal modeling. MTCNN detects and aligns the lip region, and Real-ESRGAN enhances the cropped frames for higher resolution and finer detail. The enhanced frames feed a 3D CNN with time-distributed layers and a bidirectional LSTM, trained with CTC loss for spatial-temporal feature learning and alignment-free transcription. Evaluated on the GRID corpus, the model achieves a character error rate (CER) of 2.3% on seen speakers and 5.2% on unseen speakers. Overall, it delivers state-of-the-art performance, with 95.8% accuracy and an 18.8% relative CER improvement over LipNet. Notably, on unseen speakers it reduces CER from LipNet's 9.4% to 5.2%, a 44.7% relative decrease, demonstrating superior generalization and robustness. These results show that combining super-resolution with deep temporal modeling substantially enhances the accuracy and reliability of visual speech recognition.
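To make the described backbone concrete, the sketch below shows one plausible Keras realization of a 3D CNN with time-distributed flattening, a bidirectional LSTM, and CTC loss. All shapes, filter counts, and the vocabulary size are illustrative assumptions, not the paper's reported configuration, and the inputs are assumed to be fixed-size lip crops already produced by the MTCNN and Real-ESRGAN front end.

```python
# A minimal sketch of a 3D-CNN + BiLSTM + CTC lip-reading backbone,
# assuming TensorFlow/Keras. Shapes, filter counts, and vocabulary size
# are illustrative assumptions, not the paper's exact configuration.
import tensorflow as tf
from tensorflow.keras import layers, Model

T, H, W, C = 75, 50, 100, 3   # frames, height, width, channels (assumed)
VOCAB = 28                    # character set size (assumed); blank is extra

# Spatiotemporal feature extractor: 3D convolutions pool spatially while
# preserving the time axis, so each frame yields one feature vector.
inputs = layers.Input(shape=(T, H, W, C), name="lip_frames")
x = layers.Conv3D(32, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPool3D(pool_size=(1, 2, 2))(x)
x = layers.Conv3D(64, 3, padding="same", activation="relu")(x)
x = layers.MaxPool3D(pool_size=(1, 2, 2))(x)
x = layers.Conv3D(96, 3, padding="same", activation="relu")(x)
x = layers.MaxPool3D(pool_size=(1, 2, 2))(x)

# Time-distributed flattening keeps the sequence dimension for the RNN.
x = layers.TimeDistributed(layers.Flatten())(x)

# Bidirectional LSTM models temporal context in both directions.
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)

# Per-frame character logits; the extra class is the CTC blank.
logits = layers.Dense(VOCAB + 1, name="char_logits")(x)
model = Model(inputs, logits)

def ctc_loss(labels, logits, label_length, logit_length):
    """CTC loss for alignment-free transcription (blank index = VOCAB)."""
    return tf.reduce_mean(tf.nn.ctc_loss(
        labels=labels, logits=logits,
        label_length=label_length, logit_length=logit_length,
        logits_time_major=False, blank_index=VOCAB))
```

At inference time, the per-frame logits would be decoded greedily or with beam search (e.g., tf.nn.ctc_greedy_decoder over time-major logits) into character sequences, from which CER is computed.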