Research Article

DigiHuman: A Conversational Digital Human with Facial Expressions

Year 2024, Volume: 19 Issue: 1, 25 - 37, 28.03.2024
https://doi.org/10.55525/tjst.1301324

Abstract

Recently, Artificial Intelligence (AI)-powered chatbots and virtual humans have assumed significant roles in various domains thanks to their ability to interact with users and perform tasks based on their intended purpose. Virtual humans have received considerable attention across industries due to their lifelike appearance and behaviour and their ability to convey emotions, especially when experienced in virtual reality. Chatbots, in turn, are used in a wide range of applications and represent one of the most promising forms of human-computer interaction because of how efficiently they communicate with people. This study therefore aims to develop a real-time chatbot that conveys emotions through facial expressions, enabling realistic and effective communication. To achieve this, several advanced AI models were employed for tasks including speech recognition, emotion synthesis, and response generation. The approach, the models and components used, and the results are explained in detail, together with the findings of the user study.
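
For illustration, the following is a minimal Python sketch of how the stages named above (speech recognition, response generation, emotion synthesis, and blendshape-driven facial expression) could be chained into a single conversational turn. It is not the authors' implementation: every function name and the blendshape weight table are hypothetical placeholders, and in a real system the stubs would be replaced by the models cited below (e.g. DeepSpeech for speech recognition, a Blender-style open-domain generator for responses).

```python
# Minimal, hypothetical sketch of the pipeline the abstract describes:
# speech recognition -> response generation -> emotion classification ->
# blendshape weights driving the avatar's facial expression.
# Every function and value below is a toy placeholder, not the authors' code.

from dataclasses import dataclass, field
from typing import Dict

# Hypothetical blendshape weights for a few Ekman-style basic emotions;
# a real system would map emotions onto the avatar's actual facial rig.
BLENDSHAPES: Dict[str, Dict[str, float]] = {
    "happy": {"mouthSmile": 0.8, "cheekSquint": 0.4},
    "sad": {"browInnerUp": 0.7, "mouthFrown": 0.6},
    "neutral": {},
}

@dataclass
class Turn:
    user_text: str
    reply_text: str
    emotion: str
    blendshape_weights: Dict[str, float] = field(default_factory=dict)

def recognize_speech(audio: bytes) -> str:
    """Stand-in for a speech-to-text model (the paper cites DeepSpeech)."""
    raise NotImplementedError("plug an actual speech recognizer in here")

def generate_reply(user_text: str) -> str:
    """Stand-in for an open-domain response generator (e.g. a Blender-style model)."""
    return "I see. Could you tell me more about that?"

def classify_emotion(text: str) -> str:
    """Toy keyword-based stand-in for a text emotion/sentiment classifier."""
    lowered = text.lower()
    if any(word in lowered for word in ("great", "glad", "thanks", "happy")):
        return "happy"
    if any(word in lowered for word in ("sorry", "sad", "unfortunately")):
        return "sad"
    return "neutral"

def handle_turn(user_text: str) -> Turn:
    """Run one conversational turn and attach expression weights to the reply."""
    reply = generate_reply(user_text)
    # The user's utterance is classified here so the avatar mirrors the user's
    # mood; a real system could instead analyze the generated reply.
    emotion = classify_emotion(user_text)
    return Turn(user_text, reply, emotion, BLENDSHAPES[emotion])

if __name__ == "__main__":
    turn = handle_turn("Thanks, the demo was great!")
    print(turn.reply_text, "| emotion:", turn.emotion, "| weights:", turn.blendshape_weights)
```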

References

  • Robert PH, König A, Amieva H, Andrieu S, Bremond F, Bullock R, Ceccaldi M, Dubois B, Gauthier S, Konigsberg PA, Nave S. Recommendations for the use of serious games in people with Alzheimer’s disease, related disorders and frailty. Frontiers in Aging Neuroscience. 2014; 6:54.
  • Xiong W, Wu L, Alleva F, Droppo J, Huang X and Stolcke A. The Microsoft 2017 conversational speech recognition system. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 15-20 April 2018; Calgary, AB, Canada. pp. 5934-5938.
  • Skerry-Ryan RJ, Battenberg E, Xiao Y, Wang Y, Stanton D, Shor J, Weiss R, Clark R, Saurous RA. Towards end-to-end prosody transfer for expressive speech synthesis with tacotron. In: International Conference on Machine Learning; 10-15 Jul 2018; Stockholm, Sweden. pp. 4693-4702.
  • Zhang WE, Sheng QZ, Alhazmi A, Li C. Adversarial attacks on deep-learning models in natural language processing: A survey. ACM Transactions on Intelligent Systems and Technology (TIST). 2020 Apr 1;11(3):1-41.
  • Hyneman W, Itokazu H, Williams L, Zhao X. Human face project. In: ACM SIGGRAPH 2005 Courses; 31 Jul 2005; Los Angeles, CA, USA. pp. 5-es.
  • Shawar BA, Atwell E. Chatbots: Are they Really Useful? Journal for Language Technology and Computational Linguistics 2007; 22(1):29-49.
  • Adamopoulou E, Moussiades L. An overview of chatbot technology. In: IFIP International Conference on Artificial Intelligence Applications and Innovations; 5–7 June 2020; Neos Marmaras, Greece. pp. 373-383.
  • Griol D, Sanchis A, Molina JM, Callejas Z. Developing enhanced conversational agents for social virtual worlds. Neurocomputing. 2019;354: 27-40.
  • Weizenbaum J. ELIZA—a computer program for the study of natural language communication between man and machine. Communications of the ACM. 1966; 9(1): 36-45.
  • Molnár G, Szüts Z. The role of chatbots in formal education. In: IEEE 16th International Symposium on Intelligent Systems and Informatics; 13-15 Sep 2018; Subotica, Serbia. pp. 197-202.
  • Balci K, Not E, Zancanaro M, Pianesi F. Xface open source project and SMIL-Agent scripting language for creating and animating embodied conversational agents. In: the 15th ACM International Conference on Multimedia; 25-29 Sep 2007; Augsburg, Germany. pp. 1013-1016.
  • Aneja D, McDuff D, Shah S. A high-fidelity open embodied avatar with lip syncing and expression capabilities. In: 2019 International Conference on Multimodal Interaction; 14-18 Oct 2019; Suzhou, China. pp. 69-73.
  • Stainer-Hochgatterer A, Wings-Kölgen C, Cereghetti D, Hanke S, Sandner E. Miraculous-life: An avatar-based virtual support partner to assist daily living. In: ISG 2016 World Conference of Gerontechnology; 28-30 Sep 2016; Nice, France. pp. 95-96.
  • Nijdam NA, Konstantas D. The CaMeLi framework—a multimodal virtual companion for older adults. In: Intelligent Systems and Applications (IntelliSys 2016): 21–22 September 2016; London, UK. pp. 196-217.
  • Don A, Brennan S, Laurel B, Shneiderman B. Anthropomorphism: from ELIZA to Terminator 2. In: the SIGCHI conference on Human Factors in Computing Systems; 1 Jun 1992; San Francisco, CA, USA. pp. 67-70.
  • Bartl A, Wenninger S, Wolf E, Botsch M, Latoschik ME. Affordable but not cheap: A case study of the effects of two 3D-reconstruction methods of virtual humans. Front. Virtual Real. 2021;2: 694617.
  • Komaritzan M, Wenninger S and Botsch M. Inside humans: creating a simple layered anatomical model from human surface scans. Front. Virtual Real. 2021; 2:694244.
  • Regateiro J, Volino M and Hilton A. Deep4D: a compact generative representation for volumetric video. Front. Virtual Real. 2021; 2:739010.
  • Liu Z, Shan Y, Zhang Z. Expressive expression mapping with ratio images. In: the 28th annual conference on Computer graphics and interactive techniques; 1 Aug 2001; Los Angeles, CA, USA. pp. 271-276.
  • Queiroz RB, Cohen M, Musse SR. An extensible framework for interactive facial animation with facial expressions, lip synchronization and eye behavior. Computers in Entertainment (CIE). 2010;7(4):1-20.
  • Lee M, Lee YK, Lim MT, Kang TK. Emotion recognition using convolutional neural network with selected statistical photoplethysmogram features. Applied Sciences. 2020;10(10):3501.
  • Kharde V, Sonawane P. Sentiment analysis of twitter data: a survey of techniques. International Journal of Computer Applications. 2016, 139(11):5-15.
  • Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT; 2-7 Jun 2019; Minneapolis, MN, USA. pp. 4171-4186.
  • Maiya AS. ktrain: A low-code library for augmented machine learning. The Journal of Machine Learning Research. 2022;23(1):7070-5.
  • Alammar J. The Illustrated Transformer. https://jalammar.github.io/illustrated-transformer. Accessed 20 April 2021.
  • Smith LN. A disciplined approach to neural network hyper-parameters: Part 1--learning rate, batch size, momentum, and weight decay. arXiv preprint arXiv:1803.09820. 2018.
  • Sun Y, Sebe N, Lew MS, Gevers T. Authentic emotion detection in real-time video. In: Computer Vision in Human-Computer Interaction: ECCV 2004 Workshop on HCI; 16 May 2004; Prague, Czech Republic. pp. 94-104.
  • Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L. ImageNet: A large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition; 20-25 Jun 2009; Miami, FL, USA. pp. 248-255.
  • Chollet F. Xception: Deep learning with depthwise separable convolutions. In: the IEEE conference on computer vision and pattern recognition; 21-26 Jul 2017; Honolulu, HI, USA. pp. 1251-1258.
  • Roller S, Dinan E, Goyal N, Ju D, Williamson M, Liu Y, Xu J, Ott M, Shuster K, Smith EM, Boureau YL. Recipes for building an open-domain chatbot. arXiv preprint arXiv:2004.13637. 2020.
  • Miller AH, Feng W, Fisch A, Lu J, Batra D, Bordes A, Parikh D, Weston J. ParlAI: A dialog research software platform. In: the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations; 9-11 September 2017; Copenhagen, Denmark. pp. 79-84.
  • Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. Advances in neural information processing systems. 2017;30.
  • Smith AP. Muscle-based facial animation using blendshapes in superposition. Doctoral dissertation, Texas A&M University, 2007.
  • Li T, Bolkart T, Black MJ, Li H, Romero J. Learning a model of facial shape and expression from 4D scans. ACM Trans. Graph. 2017;36(6):194-1.
  • Prince EB, Martin KB, Messinger DS, Allen M. Facial action coding system. 2015.
  • Anjyo K. Blendshape facial animation. In: Müller B, editor. Handbook of Human Motion. Cham: Springer; 2018. pp. 2145–2155.
  • Ekman P. An argument for basic emotions. Cognition & Emotion. 1992;6(3-4):169-200.
  • Hannun A, Case C, Casper J, Catanzaro B, Diamos G, Elsen E, Prenger R, Satheesh S, Sengupta S, Coates A, Ng AY. Deep Speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567. 2014.
  • Mozilla DeepSpeech. https://github.com/mozilla/DeepSpeech. Accessed 1 May 2021.
  • van den Oord A, Dieleman S, Zen H, Simonyan K, Vinyals O, Graves A, Kalchbrenner N, Senior A, Kavukcuoglu K. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499. 2016.

Details

Primary Language English
Subjects Natural Language Processing, Artificial Intelligence (Other), Engineering
Journal Section TJST
Authors

Kasım Özacar 0000-0001-7637-0620

Munya Alkhalıfa 0000-0003-0364-201X

Publication Date March 28, 2024
Submission Date May 24, 2023
Published in Issue Year 2024 Volume: 19 Issue: 1

Cite

APA Özacar, K., & Alkhalıfa, M. (2024). DigiHuman: A Conversational Digital Human with Facial Expressions. Turkish Journal of Science and Technology, 19(1), 25-37. https://doi.org/10.55525/tjst.1301324
AMA Özacar K, Alkhalıfa M. DigiHuman: A Conversational Digital Human with Facial Expressions. TJST. March 2024;19(1):25-37. doi:10.55525/tjst.1301324
Chicago Özacar, Kasım, and Munya Alkhalıfa. “DigiHuman: A Conversational Digital Human With Facial Expressions”. Turkish Journal of Science and Technology 19, no. 1 (March 2024): 25-37. https://doi.org/10.55525/tjst.1301324.
EndNote Özacar K, Alkhalıfa M (March 1, 2024) DigiHuman: A Conversational Digital Human with Facial Expressions. Turkish Journal of Science and Technology 19 1 25–37.
IEEE K. Özacar and M. Alkhalıfa, “DigiHuman: A Conversational Digital Human with Facial Expressions”, TJST, vol. 19, no. 1, pp. 25–37, 2024, doi: 10.55525/tjst.1301324.
ISNAD Özacar, Kasım - Alkhalıfa, Munya. “DigiHuman: A Conversational Digital Human With Facial Expressions”. Turkish Journal of Science and Technology 19/1 (March 2024), 25-37. https://doi.org/10.55525/tjst.1301324.
JAMA Özacar K, Alkhalıfa M. DigiHuman: A Conversational Digital Human with Facial Expressions. TJST. 2024;19:25–37.
MLA Özacar, Kasım and Munya Alkhalıfa. “DigiHuman: A Conversational Digital Human With Facial Expressions”. Turkish Journal of Science and Technology, vol. 19, no. 1, 2024, pp. 25-37, doi:10.55525/tjst.1301324.
Vancouver Özacar K, Alkhalıfa M. DigiHuman: A Conversational Digital Human with Facial Expressions. TJST. 2024;19(1):25-37.