A Survey on Lip-Reading with Deep Learning

Ali Erbey; Necaattin Barışçı

doi:10.29137/umagd.1038899

Review

Year 2022, Volume: 14 Issue: 2, 844 - 860, 31.07.2022

References

Adeel, A., Gogate, M., & Hussain, A. (2020). Contextual deep learning-based audio-visual switching for speech enhancement in real-world environments. Information Fusion, 59, 163-170.
Afouras, T., Chung, J. S., & Zisserman, A. (2018). Deep lip reading: a comparison of models and an online application. arXiv preprint arXiv:1806.06053.
Afouras, T., Chung, J. S., & Zisserman, A. (2018). LRS3-TED: a large-scale dataset for visual speech recognition. arXiv preprint arXiv:1809.00496.
Akmese Ö.F., Erbay H., Kör H., (2019). Derin Ögrenme ile Görüntü Kümeleme. In: 5th International Management Information Systems Conference, Ankara.
Alpaydin, E. (2020). Introduction to machine learning. MIT press.
Amanullah, M. A., Habeeb, R. A. A., Nasaruddin, F. H., Gani, A., Ahmed, E., Nainar, A. S. M., ... & Imran, M. (2020). Deep learning and big data technologies for IoT security. Computer Communications, 151, 495-517.
Anina, I., Zhou, Z., Zhao, G., & Pietikäinen, M. (2015, May). Ouluvs2: A multi-view audiovisual database for non-rigid mouth motion analysis. In 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG) (Vol. 1, pp. 1-5). IEEE.
Arı, A., & Hanbay, D. (2019). Tumor detection in MR images of regional convolutional neural networks. Journal of the Faculty of Engineering and Architecture of Gazi University, 34(3), 1395-1408.
Bacciu, D., Micheli, A., & Podda, M. (2020). Edge-based sequential graph generation with recurrent neural networks. Neurocomputing, 416, 177-189.
Bayram, F. (2020). Derin öğrenme tabanlı otomatik plaka tanıma. Politeknik Dergisi, 23(4), 955-960.
Bear, H. L., & Harvey, R. (2017). Phoneme-to-viseme mappings: the good, the bad, and the ugly. Speech Communication, 95, 40-67.
Bi, C., Zhang, D., Yang, L., & Chen, P. (2019, November). An Lipreading Modle with DenseNet and E3D-LSTM. In 2019 6th International Conference on Systems and Informatics (ICSAI) (pp. 511-515). IEEE.
Bollier, D. (2017). Artificial intelligence comes of age. The promise and challenge of integrating AI into cars, healthcare and journalism. The Aspen Institute Communications and Society Program. Washington, DC.
Chen, L., Xu, G., Zhang, S., Yan, W., & Wu, Q. (2020). Health indicator construction of machinery based on end-to-end trainable convolution recurrent neural networks. Journal of Manufacturing Systems, 54, 1-11.
Chen, X., Du, J., & Zhang, H. (2020). Lipreading with DenseNet and resBi-LSTM. Signal, Image and Video Processing, 14(5), 981-989.
Chen, Y., Zhao, X., & Jia, X. (2015). Spectral–spatial classification of hyperspectral data based on deep belief network. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 8(6), 2381-2392.
Cheok, M. J., Omar, Z., & Jaward, M. H. (2019). A review of hand gesture and sign language recognition techniques. International Journal of Machine Learning and Cybernetics, 10(1), 131-153.
Chung, J. S., & Zisserman, A. (2016, November). Lip reading in the wild. In Asian conference on computer vision (pp. 87-103). Springer, Cham.
Chung, J. S., & Zisserman, A. P. (2017). Lip reading in profile.
Chung, J. S., Senior, A., Vinyals, O., & Zisserman, A. (2017, July). Lip reading sentences in the wild. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 3444-3453). IEEE.
Cooke, M., Barker, J., Cunningham, S., & Shao, X. (2006). An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America, 120(5), 2421-2424.
Cox, S. J., Harvey, R. W., Lan, Y., Newman, J. L., & Theobald, B. J. (2008, September). The challenge of multispeaker lip-reading. In AVSP (pp. 179-184).
Doğan, M., Nemli, O. N., Yüksel, O. M., Bayramoğlu, İ., & Kemaloğlu, Y. K. (2008). İşitme Kaybının Yaşam Kalitesine Etkisini İnceleyen Anket Çalışmalarına Ait Bir Derleme. Turkiye Klinikleri J Int Med Sci, 4, 33.
Dupont, S., & Luettin, J. (2000). Audio-visual speech modeling for continuous speech recognition. IEEE transactions on multimedia, 2(3), 141-151.
Erdoğan A.A., (2016). Hearing Loss and Approaches to Hearing Loss in Elderly, The Turkish Journal of Family Medicine and Primary Care, 10 (1): 25-33, (2016). doi:10.5455/tjfmpc.204524
Ergezer, H., Dikmen, M., & Özdemir, E. (2003). Yapay sinir ağları ve tanıma sistemleri. PiVOLKA, 2(6), 14-17.
Ertam, F., & Aydın, G. (2017, October). Data classification with deep learning using Tensorflow. In 2017 international conference on computer science and engineering (UBMK) (pp. 755-758). IEEE.
Esteva, A., Robicquet, A., Ramsundar, B., Kuleshov, V., DePristo, M., Chou, K., ... & Dean, J. (2019). A guide to deep learning in healthcare. Nature medicine, 25(1), 24-29.
Farsal, W., Anter, S., & Ramdani, M. (2018, October). Deep learning: An overview. In Proceedings of the 12th International Conference on Intelligent Systems: Theories and Applications (pp. 1-6).
Fayjie, A. R., Hossain, S., Oualid, D., & Lee, D. J. (2018, June). Driverless car: Autonomous driving using deep reinforcement learning in urban environment. In 2018 15th International Conference on Ubiquitous Robots (UR) (pp. 896-901). IEEE.
Feng, W., Guan, N., Li, Y., Zhang, X., & Luo, Z. (2017, May). Audio visual speech recognition with multimodal recurrent neural networks. In 2017 International Joint Conference on Neural Networks (IJCNN) (pp. 681-688). IEEE.
Fernandez-Lopez, A., & Sukno, F. M. (2017). Automatic viseme vocabulary construction to enhance continuous lip-reading. arXiv preprint arXiv:1704.08035.
Fernandez-Lopez, A., & Sukno, F. M. (2017, February). Optimizing Phoneme-to-Viseme Mapping for Continuous Lip-Reading in Spanish. In International Joint Conference on Computer Vision, Imaging and Computer Graphics (pp. 305-328). Springer, Cham.
Fernandez-Lopez, A., & Sukno, F. M. (2018). Survey on automatic lip-reading in the era of deep learning. Image and Vision Computing, 78, 53-72.
Fernandez-Lopez, A., Martinez, O., & Sukno, F. M. (2017, May). Towards estimating the upper bound of visual-speech recognition: The visual lip-reading feasibility database. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017) (pp. 208-215). IEEE.
Fook, C. Y., Hariharan, M., Yaacob, S., & Adom, A. H. (2012, February). A review: Malay speech recognition and audio visual speech recognition. In 2012 International Conference on Biomedical Engineering (ICoBE) (pp. 479-484). IEEE.
Fung, I., & Mak, B. (2018, April). End-to-end low-resource lip-reading with maxout CNN and LSTM. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2511-2515). IEEE.
Gogate, M., Dashtipour, K., Adeel, A., & Hussain, A. (2020). CochleaNet: A robust language-independent audio-visual model for real-time speech enhancement. Information Fusion, 63, 273-285.
Goh, Y. H., Lau, K. X., & Lee, Y. K. (2019, October). Audio-Visual Speech Recognition System Using Recurrent Neural Network. In 2019 4th International Conference on Information Technology (InCIT) (pp. 38-43). IEEE.
Hamurcu, M., Şener, B. M., Ataş, A., Atalay, R. B., Bora, F., & Yiğit, Ö. (2012). İşitme cihazı kullanan hastalarda memnuniyetin değerlendirilmesi.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the national academy of sciences, 81(10), 3088-3092.
Jang, D. W., Kim, H. I., Je, C., Park, R. H., & Park, H. M. (2019). Lip reading using committee networks with two different types of concatenated frame images. IEEE Access, 7, 90125-90131.
Kahveci, O. K., Miman, M. C., Okur, E., Ayçiçek, A., Sevinç, S., & Altuntaş, A. (2011). Hearing aid use and patient satisfaction. Kulak burun bogaz ihtisas dergisi: KBB= Journal of ear, nose, and throat, 21(3), 117-121.
Keyvanrad, M. A., & Homayounpour, M. M. (2014). A brief survey on deep belief networks and introducing a new object oriented toolbox (DeeBNet). arXiv preprint arXiv:1408.3264.
Koumparoulis, A., & Potamianos, G. (2018, December). Deep view2view mapping for view-invariant lipreading. In 2018 IEEE Spoken Language Technology Workshop (SLT) (pp. 588-594). IEEE.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 1097-1105.
Kumar, Y., Jain, R., Salik, M., ratn Shah, R., Zimmermann, R., & Yin, Y. (2018, December). Mylipper: A personalized system for speech reconstruction using multi-view visual feeds. In 2018 IEEE International Symposium on Multimedia (ISM) (pp. 159-166). IEEE.
Lan, Y., Theobald, B. J., & Harvey, R. (2012, July). View independent computer lip-reading. In 2012 IEEE International Conference on Multimedia and Expo (pp. 432-437). IEEE.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
Li, X., Neil, D., Delbruck, T., & Liu, S. C. (2019, May). Lip reading deep network exploiting multi-modal spiking visual and auditory sensors. In 2019 IEEE International Symposium on Circuits and Systems (ISCAS) (pp. 1-5). IEEE.
Lu, Y., & Yan, J. (2020). Automatic lip reading using convolution neural network and bidirectional long short-term memory. International Journal of Pattern Recognition and Artificial Intelligence, 34(01), 2054003.
Luo, M., Yang, S., Shan, S., & Chen, X. (2020, November). Pseudo-convolutional policy gradient for sequence-to-sequence lip-reading. In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020) (pp. 273-280). IEEE.
Lv, Z., & Qiao, L. (2020). Deep belief network and linear perceptron based cognitive computing for collaborative robots. Applied Soft Computing, 92, 106300.
Mamatha G., Roshan B.B.R., Vasudha S.R., (2020). Lip Reading to Text using Artificial Intelligence, International Journal of Engineering Research & Technology (IJERT), 9 (01): 483-484.
Martinez, B., Ma, P., Petridis, S., & Pantic, M. (2020, May). Lipreading using temporal convolutional networks. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6319-6323). IEEE.
Matthews, I., Cootes, T. F., Bangham, J. A., Cox, S., & Harvey, R. (2002). Extraction of visual features for lipreading. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2), 198-213.
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5(4), 115-133.
Mesbah, A., Berrahou, A., Hammouchi, H., Berbia, H., Qjidaa, H., & Daoudi, M. (2019). Lip reading with Hahn convolutional neural networks. Image and Vision Computing, 88, 76-83.
Minsky, M., & Papert, S. (1969). An introduction to computational geometry. Cambridge tiass., HIT.
Muljono, M., Saraswati, G., Winarsih, N., Rokhman, N., Supriyanto, C., & Pujiono, P. (2019). Developing BacaBicara: An Indonesian Lipreading System as an Independent Communication Learning for the Deaf and Hard-of-Hearing. International Journal of Emerging Technologies in Learning (iJET), 14(4), 44-57.
Mulrow, C. D., Aguilar, C., Endicott, J. E., Tuley, M. R., Velez, R., Charlip, W. S., ... & DeNino, L. A. (1990). Quality-of-life changes and hearing impairment: a randomized trial. Annals of internal medicine, 113(3), 188-194.
Mulrow, C. D., Aguilar, C., Endicott, J. E., Velez, R., Tuley, M. R., Charlip, W. S., & Hill, J. A. (1990). Association between hearing impairment and the quality of life of elderly individuals. Journal of the American Geriatrics Society, 38(1), 45-50.
Mulrow, C. D., Tuley, M. R., & Aguilar, C. (1992). Sustained benefits of hearing aids. Journal of Speech, Language, and Hearing Research, 35(6), 1402-1405.
Oliveira, D. A. B., Mattos, A. B., & da Silva Morais, E. (2019, May). Improving Viseme Recognition with GAN-based Muti-view Mapping. In 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019) (pp. 1-8). IEEE.
Olgun, N., Aslan, F. E., Yücel, N., Öntürk, Z. K., & Laçin, Z. (2013). Yaşlıların sağlık durumlarının değerlendirilmesi. Acıbadem Üniversitesi Sağlık Bilimleri Dergisi, (2), 72-78.
Ozcan, T., & Basturk, A. (2019). Lip reading using convolutional neural networks with and without pre-trained models. Balkan Journal of Electrical and Computer Engineering, 7(2), 195-201.
Pang, Z., Niu, F., & O’Neill, Z. (2020). Solar radiation prediction using recurrent neural network and artificial neural network: A case study with comparisons. Renewable Energy, 156, 279-289.
Patterson, E. K., Gurbuz, S., Tufekci, Z., & Gowdy, J. N. (2002). Moving-talker, speaker-independent feature study, and baseline results using the CUAVE multimodal speech corpus. EURASIP Journal on Advances in Signal Processing, 2002(11), 1-13.
Patterson, E. K., Gurbuz, S., Tufekci, Z., & Gowdy, J. N. (2002, May). CUAVE: A new audio-visual database for multimodal human-computer interface research. In 2002 IEEE International conference on acoustics, speech, and signal processing (Vol. 2, pp. II-2017). IEEE.
Petridis, S., Li, Z., & Pantic, M. (2017, March). End-to-end visual speech recognition with LSTMs. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2592-2596). IEEE.
Petridis, S., Shen, J., Cetin, D., & Pantic, M. (2018, April). Visual-only recognition of normal, whispered and silent speech. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6219-6223). IEEE.
Petridis, S., Stafylakis, T., Ma, P., Cai, F., Tzimiropoulos, G., & Pantic, M. (2018, April). End-to-end audiovisual speech recognition. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 6548-6552). IEEE.
Petridis, S., Wang, Y., Li, Z., & Pantic, M. (2017). End-to-end audiovisual fusion with LSTMs. arXiv preprint arXiv:1709.04343.
Petridis, S., Wang, Y., Li, Z., & Pantic, M. (2017). End-to-end multi-view lipreading. arXiv preprint arXiv:1709.00443.
Petridis, S., Wang, Y., Ma, P., Li, Z., & Pantic, M. (2020). End-to-end visual speech recognition for small-scale datasets. Pattern Recognition Letters, 131, 421-427.
Potamianos, G., Neti, C., Gravier, G., Garg, A., & Senior, A. W. (2003). Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE, 91(9), 1306-1326.
Potamianos, G., Neti, C., Luettin, J., & Matthews, I. (2004). Audio-visual automatic speech recognition: An overview. Issues in visual and audio-visual speech processing, 22, 23.
Qu, L., Weber, C., & Wermter, S. (2019, September). LipSound: Neural Mel-Spectrogram Reconstruction for Lip Reading. In INTERSPEECH (pp. 2768-2772).
Rahmani, M. H., & Almasganj, F. (2017, April). Lip-reading via a DNN-HMM hybrid system using combination of the image-based and model-based features. In 2017 3rd International Conference on Pattern Recognition and Image Analysis (IPRIA) (pp. 195-199). IEEE.
Rekik, A., Ben-Hamadou, A., & Mahdi, W. (2014, October). A new visual speech recognition approach for RGB-D cameras. In International conference image analysis and recognition (pp. 21-28). Springer, Cham.
Rosenbaltt, F. (1957). The perceptron–a perciving and recognizing automation. Cornell Aeronautical Laboratory.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. nature, 323(6088), 533-536.
Russell, S. J., & Norvig, P. Artificial intelligence: a modern approach. 2016: Malaysia.
Saif, D., El-Gokhy, S. M., & Sallam, E. (2018). Deep Belief Networks-based framework for malware detection in Android systems. Alexandria engineering journal, 57(4), 4049-4057.
Sak, H., Senior, A. W., & Beaufays, F. (2014). Long short-term memory recurrent neural network architectures for large scale acoustic modeling.
Sam, S. M., Kamardin, K., Sjarif, N. N. A., & Mohamed, N. (2019). Offline signature verification using deep learning convolutional neural network (CNN) architectures GoogLeNet Inception-v1 and Inception-v3. Procedia Computer Science, 161, 475-483.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Sindhura, P. V., Preethi, S. J., & Niranjana, K. B. (2018, December). Convolutional neural networks for predicting words: A lip-reading system. In 2018 International Conference on Electrical, Electronics, Communication, Computer, and Optimization Techniques (ICEECCOT) (pp. 929-933). IEEE.
Stafylakis, T., & Tzimiropoulos, G. (2017). Combining residual networks with LSTMs for lipreading. arXiv preprint arXiv:1703.04105.
Sui, C., Togneri, R., & Bennamoun, M. (2017). A cascade gray-stereo visual feature extraction method for visual and audio-visual speech recognition. Speech Communication, 90, 26-38.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1-9).
Thangthai, K., & Harvey, R. (2017, August). Improving computer lipreading via DNN sequence discriminative training techniques. ISCA.
Thangthai, K., Bear, H. L., & Harvey, R. (2018). Comparing phonemes and visemes with DNN-based lipreading. arXiv preprint arXiv:1805.02924.
Turing A.M., “Computing Machinery and Intelligence”, Mind Journal, 49: 433-460, (1950).
Uğur, A., & Kınacı, A. C. (2006). Yapay zeka teknikleri ve yapay sinir ağları kullanılarak web sayfalarının sınıflandırılması. XI. Türkiye'de İnternet Konferansı (inet-tr'06), Ankara, 1-4.
Wand, M., & Schmidhuber, J. (2017). Improving speaker-independent lipreading with domain-adversarial training. arXiv preprint arXiv:1708.01565.
Wand, M., Koutník, J., & Schmidhuber, J. (2016, March). Lipreading with long short-term memory. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6115-6119). IEEE.
Wand, M., Schmidhuber, J., & Vu, N. T. (2018, April). Investigations on end-to-end audiovisual fusion. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 3041-3045). IEEE.
Wang, J., Gao, Y., Zhang, J., Wei, J., & Dang, J. (2015). Lipreading using profile lips rebuilt by 3D data from the Kinect. Journal of Computational Information Systems, 11(7), 2429-2438.
Xiao, J., Yang, S., Zhang, Y., Shan, S., & Chen, X. (2020, November). Deformation flow based two-stream network for lip reading. In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020) (pp. 364-370). IEEE.
Xu, B., Wang, J., Lu, C., & Guo, Y. (2020). Watch to listen clearly: Visual speech enhancement driven multi-modality speech recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 1637-1646).
Xu, K., Li, D., Cassimatis, N., & Wang, X. (2018, May). LCANet: End-to-end lipreading with cascaded attention-CTC. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018) (pp. 548-555). IEEE.
Yang, R., Singh, S. K., Tavakkoli, M., Amiri, N., Yang, Y., Karami, M. A., & Rai, R. (2020). CNN-LSTM deep learning architecture for computer vision-based modal frequency detection. Mechanical Systems and signal processing, 144, 106885.
Yang, S., Zhang, Y., Feng, D., Yang, M., Wang, C., Xiao, J., ... & Chen, X. (2019, May). LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild. In 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019) (pp. 1-8). IEEE.
Yargıç, A., & Doğan, M. (2013, June). A lip reading application on MS Kinect camera. In 2013 IEEE INISTA (pp. 1-5). IEEE.
Yu, Y., Hu, C., Si, X., Zheng, J., & Zhang, J. (2020). Averaged Bi-LSTM networks for RUL prognostics with non-life-cycle labeled dataset. Neurocomputing, 402, 134-147.
Yueh, B., Shapiro, N., MacLean, C. H., & Shekelle, P. G. (2003). Screening and management of adult hearing loss in primary care: scientific review. Jama, 289(15), 1976-1985.
Zhao, X., Yang, S., Shan, S., & Chen, X. (2020, November). Mutual information maximization for effective lip reading. In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020) (pp. 420-427). IEEE.
Zhou, P., Yang, W., Chen, W., Wang, Y., & Jia, J. (2019, May). Modality attention for end-to-end audio-visual speech recognition. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6565-6569). IEEE.
Zhou, Z., Zhao, G., Hong, X., & Pietikäinen, M. (2014). A review of recent advances in visual speech decoding. Image and vision computing, 32(9), 590-605.

Abstract

Deep learning methods have produced very successful results in areas such as computer vision and voice recognition, and these successes have led to technologies that make people's lives easier. Voice recognition devices are one such technology. Research has shown that these devices perform well in quiet environments but poorly in noisy ones. With deep learning methods, voice recognition in noisy environments can be supported by visual signals: by using computer vision to analyze the speaker's lips and determine what is being said, the success of voice recognition devices can be increased. In this study, lip-reading studies using deep learning methods published between 2017 and 2020 are examined and the data sets they use are introduced. The review shows that CNN and LSTM architectures are the most heavily used in lip-reading studies, that hybrid models are increasingly preferred, and that success rates continue to rise. In this context, further academic work on lip reading can support the development of technologies that meet practical needs.

Keywords

Lipreading, Deep Learning, Convolutional Neural Networks, Artificial Neural Networks

References

Adeel, A., Gogate, M., & Hussain, A. (2020). Contextual deep learning-based audio-visual switching for speech enhancement in real-world environments. Information Fusion, 59, 163-170.
Afouras, T., Chung, J. S., & Zisserman, A. (2018). Deep lip reading: a comparison of models and an online application. arXiv preprint arXiv:1806.06053.
Afouras, T., Chung, J. S., & Zisserman, A. (2018). LRS3-TED: a large-scale dataset for visual speech recognition. arXiv preprint arXiv:1809.00496.
Akmese Ö.F., Erbay H., Kör H., (2019). Derin Ögrenme ile Görüntü Kümeleme. In: 5th International Management Information Systems Conference, Ankara.
Alpaydin, E. (2020). Introduction to machine learning. MIT press.
Amanullah, M. A., Habeeb, R. A. A., Nasaruddin, F. H., Gani, A., Ahmed, E., Nainar, A. S. M., ... & Imran, M. (2020). Deep learning and big data technologies for IoT security. Computer Communications, 151, 495-517.
Anina, I., Zhou, Z., Zhao, G., & Pietikäinen, M. (2015, May). Ouluvs2: A multi-view audiovisual database for non-rigid mouth motion analysis. In 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG) (Vol. 1, pp. 1-5). IEEE.
Arı, A., & Hanbay, D. (2019). Tumor detection in MR images of regional convolutional neural networks. Journal of the Faculty of Engineering and Architecture of Gazi University, 34(3), 1395-1408.
Bacciu, D., Micheli, A., & Podda, M. (2020). Edge-based sequential graph generation with recurrent neural networks. Neurocomputing, 416, 177-189.
Bayram, F. (2020). Derin öğrenme tabanlı otomatik plaka tanıma. Politeknik Dergisi, 23(4), 955-960.
Bear, H. L., & Harvey, R. (2017). Phoneme-to-viseme mappings: the good, the bad, and the ugly. Speech Communication, 95, 40-67.
Bi, C., Zhang, D., Yang, L., & Chen, P. (2019, November). An Lipreading Modle with DenseNet and E3D-LSTM. In 2019 6th International Conference on Systems and Informatics (ICSAI) (pp. 511-515). IEEE.
Bollier, D. (2017). Artificial intelligence comes of age. The promise and challenge of integrating AI into cars, healthcare and journalism. The Aspen Institute Communications and Society Program. Washington, DC.
Chen, L., Xu, G., Zhang, S., Yan, W., & Wu, Q. (2020). Health indicator construction of machinery based on end-to-end trainable convolution recurrent neural networks. Journal of Manufacturing Systems, 54, 1-11.
Chen, X., Du, J., & Zhang, H. (2020). Lipreading with DenseNet and resBi-LSTM. Signal, Image and Video Processing, 14(5), 981-989.
Chen, Y., Zhao, X., & Jia, X. (2015). Spectral–spatial classification of hyperspectral data based on deep belief network. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 8(6), 2381-2392.
Cheok, M. J., Omar, Z., & Jaward, M. H. (2019). A review of hand gesture and sign language recognition techniques. International Journal of Machine Learning and Cybernetics, 10(1), 131-153.
Chung, J. S., & Zisserman, A. (2016, November). Lip reading in the wild. In Asian conference on computer vision (pp. 87-103). Springer, Cham.
Chung, J. S., & Zisserman, A. P. (2017). Lip reading in profile.
Chung, J. S., Senior, A., Vinyals, O., & Zisserman, A. (2017, July). Lip reading sentences in the wild. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 3444-3453). IEEE.
Cooke, M., Barker, J., Cunningham, S., & Shao, X. (2006). An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America, 120(5), 2421-2424.
Cox, S. J., Harvey, R. W., Lan, Y., Newman, J. L., & Theobald, B. J. (2008, September). The challenge of multispeaker lip-reading. In AVSP (pp. 179-184).
Doğan, M., Nemli, O. N., Yüksel, O. M., Bayramoğlu, İ., & Kemaloğlu, Y. K. (2008). İşitme Kaybının Yaşam Kalitesine Etkisini İnceleyen Anket Çalışmalarına Ait Bir Derleme. Turkiye Klinikleri J Int Med Sci, 4, 33.
Dupont, S., & Luettin, J. (2000). Audio-visual speech modeling for continuous speech recognition. IEEE transactions on multimedia, 2(3), 141-151.
Erdoğan A.A., (2016). Hearing Loss and Approaches to Hearing Loss in Elderly, The Turkish Journal of Family Medicine and Primary Care, 10 (1): 25-33, (2016). doi:10.5455/tjfmpc.204524
Ergezer, H., Dikmen, M., & Özdemir, E. (2003). Yapay sinir ağları ve tanıma sistemleri. PiVOLKA, 2(6), 14-17.
Ertam, F., & Aydın, G. (2017, October). Data classification with deep learning using Tensorflow. In 2017 international conference on computer science and engineering (UBMK) (pp. 755-758). IEEE.
Esteva, A., Robicquet, A., Ramsundar, B., Kuleshov, V., DePristo, M., Chou, K., ... & Dean, J. (2019). A guide to deep learning in healthcare. Nature medicine, 25(1), 24-29.
Farsal, W., Anter, S., & Ramdani, M. (2018, October). Deep learning: An overview. In Proceedings of the 12th International Conference on Intelligent Systems: Theories and Applications (pp. 1-6).
Fayjie, A. R., Hossain, S., Oualid, D., & Lee, D. J. (2018, June). Driverless car: Autonomous driving using deep reinforcement learning in urban environment. In 2018 15th International Conference on Ubiquitous Robots (UR) (pp. 896-901). IEEE.
Feng, W., Guan, N., Li, Y., Zhang, X., & Luo, Z. (2017, May). Audio visual speech recognition with multimodal recurrent neural networks. In 2017 International Joint Conference on Neural Networks (IJCNN) (pp. 681-688). IEEE.
Fernandez-Lopez, A., & Sukno, F. M. (2017). Automatic viseme vocabulary construction to enhance continuous lip-reading. arXiv preprint arXiv:1704.08035.
Fernandez-Lopez, A., & Sukno, F. M. (2017, February). Optimizing Phoneme-to-Viseme Mapping for Continuous Lip-Reading in Spanish. In International Joint Conference on Computer Vision, Imaging and Computer Graphics (pp. 305-328). Springer, Cham.
Fernandez-Lopez, A., & Sukno, F. M. (2018). Survey on automatic lip-reading in the era of deep learning. Image and Vision Computing, 78, 53-72.
Fernandez-Lopez, A., Martinez, O., & Sukno, F. M. (2017, May). Towards estimating the upper bound of visual-speech recognition: The visual lip-reading feasibility database. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017) (pp. 208-215). IEEE.
Fook, C. Y., Hariharan, M., Yaacob, S., & Adom, A. H. (2012, February). A review: Malay speech recognition and audio visual speech recognition. In 2012 International Conference on Biomedical Engineering (ICoBE) (pp. 479-484). IEEE.
Fung, I., & Mak, B. (2018, April). End-to-end low-resource lip-reading with maxout CNN and LSTM. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2511-2515). IEEE.
Gogate, M., Dashtipour, K., Adeel, A., & Hussain, A. (2020). CochleaNet: A robust language-independent audio-visual model for real-time speech enhancement. Information Fusion, 63, 273-285.
Goh, Y. H., Lau, K. X., & Lee, Y. K. (2019, October). Audio-Visual Speech Recognition System Using Recurrent Neural Network. In 2019 4th International Conference on Information Technology (InCIT) (pp. 38-43). IEEE.
Hamurcu, M., Şener, B. M., Ataş, A., Atalay, R. B., Bora, F., & Yiğit, Ö. (2012). İşitme cihazı kullanan hastalarda memnuniyetin değerlendirilmesi.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the national academy of sciences, 81(10), 3088-3092.
Jang, D. W., Kim, H. I., Je, C., Park, R. H., & Park, H. M. (2019). Lip reading using committee networks with two different types of concatenated frame images. IEEE Access, 7, 90125-90131.
Kahveci, O. K., Miman, M. C., Okur, E., Ayçiçek, A., Sevinç, S., & Altuntaş, A. (2011). Hearing aid use and patient satisfaction. Kulak burun bogaz ihtisas dergisi: KBB= Journal of ear, nose, and throat, 21(3), 117-121.
Keyvanrad, M. A., & Homayounpour, M. M. (2014). A brief survey on deep belief networks and introducing a new object oriented toolbox (DeeBNet). arXiv preprint arXiv:1408.3264.
Koumparoulis, A., & Potamianos, G. (2018, December). Deep view2view mapping for view-invariant lipreading. In 2018 IEEE Spoken Language Technology Workshop (SLT) (pp. 588-594). IEEE.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 1097-1105.
Kumar, Y., Jain, R., Salik, M., ratn Shah, R., Zimmermann, R., & Yin, Y. (2018, December). Mylipper: A personalized system for speech reconstruction using multi-view visual feeds. In 2018 IEEE International Symposium on Multimedia (ISM) (pp. 159-166). IEEE.
Lan, Y., Theobald, B. J., & Harvey, R. (2012, July). View independent computer lip-reading. In 2012 IEEE International Conference on Multimedia and Expo (pp. 432-437). IEEE.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
Li, X., Neil, D., Delbruck, T., & Liu, S. C. (2019, May). Lip reading deep network exploiting multi-modal spiking visual and auditory sensors. In 2019 IEEE International Symposium on Circuits and Systems (ISCAS) (pp. 1-5). IEEE.
Lu, Y., & Yan, J. (2020). Automatic lip reading using convolution neural network and bidirectional long short-term memory. International Journal of Pattern Recognition and Artificial Intelligence, 34(01), 2054003.
Luo, M., Yang, S., Shan, S., & Chen, X. (2020, November). Pseudo-convolutional policy gradient for sequence-to-sequence lip-reading. In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020) (pp. 273-280). IEEE.
Lv, Z., & Qiao, L. (2020). Deep belief network and linear perceptron based cognitive computing for collaborative robots. Applied Soft Computing, 92, 106300.
Mamatha, G., Roshan, B. B. R., & Vasudha, S. R. (2020). Lip reading to text using artificial intelligence. International Journal of Engineering Research & Technology (IJERT), 9(01), 483-484.
Martinez, B., Ma, P., Petridis, S., & Pantic, M. (2020, May). Lipreading using temporal convolutional networks. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6319-6323). IEEE.
Matthews, I., Cootes, T. F., Bangham, J. A., Cox, S., & Harvey, R. (2002). Extraction of visual features for lipreading. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2), 198-213.
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4), 115-133.
Mesbah, A., Berrahou, A., Hammouchi, H., Berbia, H., Qjidaa, H., & Daoudi, M. (2019). Lip reading with Hahn convolutional neural networks. Image and Vision Computing, 88, 76-83.
Minsky, M., & Papert, S. (1969). Perceptrons: An introduction to computational geometry. Cambridge, MA: MIT Press.
Muljono, M., Saraswati, G., Winarsih, N., Rokhman, N., Supriyanto, C., & Pujiono, P. (2019). Developing BacaBicara: An Indonesian Lipreading System as an Independent Communication Learning for the Deaf and Hard-of-Hearing. International Journal of Emerging Technologies in Learning (iJET), 14(4), 44-57.
Mulrow, C. D., Aguilar, C., Endicott, J. E., Tuley, M. R., Velez, R., Charlip, W. S., ... & DeNino, L. A. (1990). Quality-of-life changes and hearing impairment: a randomized trial. Annals of Internal Medicine, 113(3), 188-194.
Mulrow, C. D., Aguilar, C., Endicott, J. E., Velez, R., Tuley, M. R., Charlip, W. S., & Hill, J. A. (1990). Association between hearing impairment and the quality of life of elderly individuals. Journal of the American Geriatrics Society, 38(1), 45-50.
Mulrow, C. D., Tuley, M. R., & Aguilar, C. (1992). Sustained benefits of hearing aids. Journal of Speech, Language, and Hearing Research, 35(6), 1402-1405.
Oliveira, D. A. B., Mattos, A. B., & da Silva Morais, E. (2019, May). Improving Viseme Recognition with GAN-based Muti-view Mapping. In 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019) (pp. 1-8). IEEE.
Olgun, N., Aslan, F. E., Yücel, N., Öntürk, Z. K., & Laçin, Z. (2013). Yaşlıların sağlık durumlarının değerlendirilmesi. Acıbadem Üniversitesi Sağlık Bilimleri Dergisi, (2), 72-78.
Ozcan, T., & Basturk, A. (2019). Lip reading using convolutional neural networks with and without pre-trained models. Balkan Journal of Electrical and Computer Engineering, 7(2), 195-201.
Pang, Z., Niu, F., & O’Neill, Z. (2020). Solar radiation prediction using recurrent neural network and artificial neural network: A case study with comparisons. Renewable Energy, 156, 279-289.
Patterson, E. K., Gurbuz, S., Tufekci, Z., & Gowdy, J. N. (2002). Moving-talker, speaker-independent feature study, and baseline results using the CUAVE multimodal speech corpus. EURASIP Journal on Advances in Signal Processing, 2002(11), 1-13.
Patterson, E. K., Gurbuz, S., Tufekci, Z., & Gowdy, J. N. (2002, May). CUAVE: A new audio-visual database for multimodal human-computer interface research. In 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing (Vol. 2, pp. II-2017). IEEE.
Petridis, S., Li, Z., & Pantic, M. (2017, March). End-to-end visual speech recognition with LSTMs. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2592-2596). IEEE.
Petridis, S., Shen, J., Cetin, D., & Pantic, M. (2018, April). Visual-only recognition of normal, whispered and silent speech. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6219-6223). IEEE.
Petridis, S., Stafylakis, T., Ma, P., Cai, F., Tzimiropoulos, G., & Pantic, M. (2018, April). End-to-end audiovisual speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6548-6552). IEEE.
Petridis, S., Wang, Y., Li, Z., & Pantic, M. (2017). End-to-end audiovisual fusion with LSTMs. arXiv preprint arXiv:1709.04343.
Petridis, S., Wang, Y., Li, Z., & Pantic, M. (2017). End-to-end multi-view lipreading. arXiv preprint arXiv:1709.00443.
Petridis, S., Wang, Y., Ma, P., Li, Z., & Pantic, M. (2020). End-to-end visual speech recognition for small-scale datasets. Pattern Recognition Letters, 131, 421-427.
Potamianos, G., Neti, C., Gravier, G., Garg, A., & Senior, A. W. (2003). Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE, 91(9), 1306-1326.
Potamianos, G., Neti, C., Luettin, J., & Matthews, I. (2004). Audio-visual automatic speech recognition: An overview. Issues in visual and audio-visual speech processing, 22, 23.
Qu, L., Weber, C., & Wermter, S. (2019, September). LipSound: Neural Mel-Spectrogram Reconstruction for Lip Reading. In INTERSPEECH (pp. 2768-2772).
Rahmani, M. H., & Almasganj, F. (2017, April). Lip-reading via a DNN-HMM hybrid system using combination of the image-based and model-based features. In 2017 3rd International Conference on Pattern Recognition and Image Analysis (IPRIA) (pp. 195-199). IEEE.
Rekik, A., Ben-Hamadou, A., & Mahdi, W. (2014, October). A new visual speech recognition approach for RGB-D cameras. In International Conference Image Analysis and Recognition (pp. 21-28). Springer, Cham.
Rosenblatt, F. (1957). The perceptron: a perceiving and recognizing automaton. Cornell Aeronautical Laboratory.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536.
Russell, S. J., & Norvig, P. (2016). Artificial intelligence: a modern approach. Malaysia.
Saif, D., El-Gokhy, S. M., & Sallam, E. (2018). Deep Belief Networks-based framework for malware detection in Android systems. Alexandria Engineering Journal, 57(4), 4049-4057.
Sak, H., Senior, A. W., & Beaufays, F. (2014). Long short-term memory recurrent neural network architectures for large scale acoustic modeling.
Sam, S. M., Kamardin, K., Sjarif, N. N. A., & Mohamed, N. (2019). Offline signature verification using deep learning convolutional neural network (CNN) architectures GoogLeNet Inception-v1 and Inception-v3. Procedia Computer Science, 161, 475-483.
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Sindhura, P. V., Preethi, S. J., & Niranjana, K. B. (2018, December). Convolutional neural networks for predicting words: A lip-reading system. In 2018 International Conference on Electrical, Electronics, Communication, Computer, and Optimization Techniques (ICEECCOT) (pp. 929-933). IEEE.
Stafylakis, T., & Tzimiropoulos, G. (2017). Combining residual networks with LSTMs for lipreading. arXiv preprint arXiv:1703.04105.
Sui, C., Togneri, R., & Bennamoun, M. (2017). A cascade gray-stereo visual feature extraction method for visual and audio-visual speech recognition. Speech Communication, 90, 26-38.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1-9).
Thangthai, K., & Harvey, R. (2017, August). Improving computer lipreading via DNN sequence discriminative training techniques. ISCA.
Thangthai, K., Bear, H. L., & Harvey, R. (2018). Comparing phonemes and visemes with DNN-based lipreading. arXiv preprint arXiv:1805.02924.
Turing, A. M. (1950). Computing machinery and intelligence. Mind, 59(236), 433-460.
Uğur, A., & Kınacı, A. C. (2006). Yapay zeka teknikleri ve yapay sinir ağları kullanılarak web sayfalarının sınıflandırılması. XI. Türkiye'de İnternet Konferansı (inet-tr'06), Ankara, 1-4.
Wand, M., & Schmidhuber, J. (2017). Improving speaker-independent lipreading with domain-adversarial training. arXiv preprint arXiv:1708.01565.
Wand, M., Koutník, J., & Schmidhuber, J. (2016, March). Lipreading with long short-term memory. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6115-6119). IEEE.
Wand, M., Schmidhuber, J., & Vu, N. T. (2018, April). Investigations on end-to-end audiovisual fusion. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 3041-3045). IEEE.
Wang, J., Gao, Y., Zhang, J., Wei, J., & Dang, J. (2015). Lipreading using profile lips rebuilt by 3D data from the Kinect. Journal of Computational Information Systems, 11(7), 2429-2438.
Xiao, J., Yang, S., Zhang, Y., Shan, S., & Chen, X. (2020, November). Deformation flow based two-stream network for lip reading. In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020) (pp. 364-370). IEEE.
Xu, B., Wang, J., Lu, C., & Guo, Y. (2020). Watch to listen clearly: Visual speech enhancement driven multi-modality speech recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 1637-1646).
Xu, K., Li, D., Cassimatis, N., & Wang, X. (2018, May). LCANet: End-to-end lipreading with cascaded attention-CTC. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018) (pp. 548-555). IEEE.
Yang, R., Singh, S. K., Tavakkoli, M., Amiri, N., Yang, Y., Karami, M. A., & Rai, R. (2020). CNN-LSTM deep learning architecture for computer vision-based modal frequency detection. Mechanical Systems and Signal Processing, 144, 106885.
Yang, S., Zhang, Y., Feng, D., Yang, M., Wang, C., Xiao, J., ... & Chen, X. (2019, May). LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild. In 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019) (pp. 1-8). IEEE.
Yargıç, A., & Doğan, M. (2013, June). A lip reading application on MS Kinect camera. In 2013 IEEE INISTA (pp. 1-5). IEEE.
Yu, Y., Hu, C., Si, X., Zheng, J., & Zhang, J. (2020). Averaged Bi-LSTM networks for RUL prognostics with non-life-cycle labeled dataset. Neurocomputing, 402, 134-147.
Yueh, B., Shapiro, N., MacLean, C. H., & Shekelle, P. G. (2003). Screening and management of adult hearing loss in primary care: scientific review. JAMA, 289(15), 1976-1985.
Zhao, X., Yang, S., Shan, S., & Chen, X. (2020, November). Mutual information maximization for effective lip reading. In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020) (pp. 420-427). IEEE.
Zhou, P., Yang, W., Chen, W., Wang, Y., & Jia, J. (2019, May). Modality attention for end-to-end audio-visual speech recognition. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6565-6569). IEEE.
Zhou, Z., Zhao, G., Hong, X., & Pietikäinen, M. (2014). A review of recent advances in visual speech decoding. Image and Vision Computing, 32(9), 590-605.

There are 112 citations in total.

Details

Primary Language	English
Subjects	Engineering
Journal Section	Articles
Authors	Ali Erbey 0000-0002-0930-4081 Necaattin Barışçı 0000-0002-8762-5091
Publication Date	July 31, 2022
Submission Date	December 22, 2021
Published in Issue	Year 2022 Volume: 14 Issue: 2

Cite

APA	Erbey, A., & Barışçı, N. (2022). A Survey on Lip-Reading with Deep Learning. International Journal of Engineering Research and Development, 14(2), 844-860. https://doi.org/10.29137/umagd.1038899

Cited By

Urdu Lip Reading Systems for Digits in Controlled and Uncontrolled Environment

IEEE Access

https://doi.org/10.1109/ACCESS.2025.3531640
