Emotion detection in speech using a convolutional LSTM model

Ömer Faruk Öztürk; Elham Pashaei

doi:10.24012/dumf.1001914

Research Article

Year 2021, Volume 12, Issue 4, 581 - 589, 29.09.2021

Cited By: 3

Abstract

Speech emotion recognition (SER) is the task of recognizing emotions from speech signals. Although humans perform this task efficiently as a natural part of communication, recognizing emotions with programmable devices remains an active area of research. Because machines that perceive emotions can appear and behave more like humans, SER plays an important role in advancing human-computer interaction. Various SER techniques have been developed over the past decade, but the problem has not yet been fully solved. This paper proposes a speech emotion recognition technique based on the combination of two deep learning architectures: the Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM). CNNs have proven effective at local feature selection, while LSTMs have been highly successful at the sequential processing of long texts. The proposed Convolutional LSTM (Co-LSTM) approach aims to provide an effective automatic emotion detection method for human-machine communication. In the proposed method, Mel Frequency Cepstral Coefficients (MFCC) are first used to extract an image-like feature matrix from the speech signal, and this matrix is then reduced to one dimension. Co-LSTM is then used as the feature selection and classification method for training the model. The experimental analyses classify all eight speech emotions using the RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) and TESS (Toronto Emotional Speech Set) databases. Using MFCC spectrogram features, Co-LSTM achieved an accuracy of 86.7%. Compared with previous studies and other well-known classifiers, the results convincingly demonstrate the effectiveness of the proposed algorithm.
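The pipeline described in the abstract (MFCC matrix, reduction to one dimension, convolutional local features, then sequential LSTM processing) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not state how the matrix is reduced, so averaging over time frames is an assumption here, and the function names, the toy matrix, the kernel, and the single-unit LSTM weights are all hypothetical. A real system would compute MFCCs from audio and train the weights.

```python
import math
from typing import Dict, List, Tuple


def reduce_to_1d(mfcc: List[List[float]]) -> List[float]:
    """Collapse a (coefficients x frames) matrix to one value per coefficient.

    Assumption: the reduction is a mean over time frames.
    """
    return [sum(row) / len(row) for row in mfcc]


def conv1d_valid(signal: List[float], kernel: List[float]) -> List[float]:
    """Valid-mode 1-D cross-correlation, the operation a CNN layer applies."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def relu_max_pool(values: List[float], size: int) -> List[float]:
    """Apply ReLU, then non-overlapping max pooling with the given window."""
    rectified = [max(0.0, v) for v in values]
    return [
        max(rectified[i:i + size])
        for i in range(0, len(rectified) - size + 1, size)
    ]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(
    x: float, h_prev: float, c_prev: float, w: Dict[str, float]
) -> Tuple[float, float]:
    """One time step of a single-unit LSTM with scalar input."""
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])    # input gate
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])    # forget gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate state
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


def co_lstm_summary(
    mfcc: List[List[float]],
    kernel: List[float],
    pool_size: int,
    w: Dict[str, float],
) -> float:
    """Convolve the reduced MFCC vector, pool it, and run the LSTM over it."""
    features = reduce_to_1d(mfcc)
    local = relu_max_pool(conv1d_valid(features, kernel), pool_size)
    h, c = 0.0, 0.0
    for x in local:
        h, c = lstm_step(x, h, c, w)
    return h  # final hidden state; a classifier layer would consume this


# Toy inputs (hypothetical): 5 coefficients x 2 frames, untrained weights.
WEIGHT_NAMES = ("wi", "ui", "bi", "wf", "uf", "bf",
                "wo", "uo", "bo", "wg", "ug", "bg")
toy_mfcc = [[1.0, 3.0], [2.0, 4.0], [0.0, 2.0], [5.0, 1.0], [4.0, 4.0]]
toy_kernel = [1.0, -1.0]
toy_weights = {name: 0.5 for name in WEIGHT_NAMES}

summary = co_lstm_summary(toy_mfcc, toy_kernel, 2, toy_weights)
print(summary)
```

In a full model the final hidden state would feed a softmax layer over the eight emotion classes; the sketch stops before that step because the abstract gives no layer sizes or training details.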

Keywords

Speech Emotion Recognition (SER), Long Short-Term Memory (LSTM), Recurrent Neural Network (RNN), Convolutional Neural Network (CNN), RAVDESS dataset

Details

Primary Language	Turkish
Section	Articles
Authors	Ömer Faruk Öztürk 0000-0003-1780-3152, Elham Pashaei 0000-0001-7401-4964
Publication Date	29 September 2021
Submission Date	5 July 2021
Published in Issue	Year 2021

Cite

IEEE	Ö. F. Öztürk and E. Pashaei, “Konuşmalardaki duygunun evrişimsel LSTM modeli ile tespiti”, DÜMF MD, vol. 12, no. 4, pp. 581–589, 2021, doi: 10.24012/dumf.1001914.

Cited By

Bir İnsan Bilgisayar Etkileşimi Örneği: Sesli Komutlar İle Veri Tabanı Sorgulama Uygulaması

Karadeniz Fen Bilimleri Dergisi

https://doi.org/10.31466/kfbd.1384401

CREMA-D: Improving Accuracy with BPSO-Based Feature Selection for Emotion Recognition Using Speech

Journal of Soft Computing and Artificial Intelligence

https://doi.org/10.55195/jscai.1214312

Konuşma Duygu Tanıma için Akustik Özelliklere Dayalı LSTM Tabanlı Bir Yaklaşım

Computer Science

https://doi.org/10.53070/bbd.1113379

All articles published by DUJE are licensed under the Creative Commons Attribution 4.0 International License. This permits anyone to copy, redistribute, remix, transmit, and adapt the work, provided the original work and source are appropriately credited.