Preparing A Balanced Corpus for Development of Turkish Speech Synthesis Systems

Mustafa Sami Cücen; Saadin Oyucu; Hüseyin Polat

doi:10.17671/gazibtd.1159289

Research Article

Year 2023, Volume: 16 Issue: 3, 237 - 249, 31.07.2023

Abstract

Speech synthesis (TTS: Text-to-Speech) systems are an important part of human-computer interaction. In TTS, a sequence of spectrograms corresponding to a sequence of text is predicted. The resulting spectrogram sequence is then converted into an audible speech waveform. Because development resources are scarce, TTS systems do not perform equally well across languages. Developing a TTS system efficiently requires a large, accessible speech dataset. For low-resource languages such as Turkish, the lack of speech datasets is one of the biggest obstacles to developing TTS systems. Preparing a large dataset is a time-consuming, difficult, and costly task. In this study, a dataset that can be used to develop Turkish TTS systems was prepared. Previously prepared text data was read aloud by a male speaker in Istanbul Turkish, in an emotion-neutral style. The text data contains 109,826 words. The recorded speech is approximately 12 hours 38 minutes 59 seconds long and was recorded at a sampling rate of 22,050 Hz. This Turkish dataset was compared with “The LJ Speech Dataset”, which was previously prepared for English and has yielded successful results, and suggestions for future work are presented. The dataset was prepared to encourage academic research on Turkish TTS. To observe how the prepared Turkish dataset performs, a GlowTTS model was trained on it, and a Turkish TTS system was developed with the trained model. Comparing speech synthesized by this system with natural speech yielded a MOS-LQO value of 2.12. These initial results indicate that the prepared dataset makes an effective contribution to the development of Turkish TTS systems.
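The figures quoted in the abstract can be turned into a few derived quantities that are useful when planning training runs. The sketch below is only illustrative arithmetic on the reported word count, duration, and sampling rate; the mono, 16-bit PCM storage format is an assumption and is not stated in the abstract.

```python
# Figures quoted in the abstract.
WORD_COUNT = 109_826
HOURS, MINUTES, SECONDS = 12, 38, 59
SAMPLE_RATE_HZ = 22_050

# Assumption (not stated in the abstract): mono, 16-bit PCM audio.
CHANNELS = 1
BYTES_PER_SAMPLE = 2

# Total recording length in seconds.
total_seconds = HOURS * 3600 + MINUTES * 60 + SECONDS

# Average speaking rate over the whole corpus, pauses included.
words_per_minute = WORD_COUNT / (total_seconds / 60)

# Number of audio samples implied by the duration and sampling rate.
total_samples = total_seconds * SAMPLE_RATE_HZ

# Approximate uncompressed size under the storage assumption above.
raw_size_gb = total_samples * CHANNELS * BYTES_PER_SAMPLE / 1e9

print(f"duration: {total_seconds} s")
print(f"speaking rate: {words_per_minute:.1f} words/min")
print(f"samples: {total_samples}")
print(f"raw size (assumed 16-bit mono): {raw_size_gb:.2f} GB")
```

Under these numbers the recording lasts 45,539 seconds, which corresponds to roughly 145 words per minute and about one billion samples.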

Keywords

Speech synthesis, Text-to-speech systems, Turkish speech synthesis, Deep learning

Supporting Institution

TÜBİTAK

Project Number

121E479

References

Y. Ning, S. He, Z. Wu, C. Xing, L. J. Zhang, “Review of deep learning based speech synthesis”, Appl. Sci., 9 (19), 1–16, 2019.
S. Lemmetty, Review of speech synthesis technology, Master's Thesis, Helsinki University of Technology, Department of Electrical and Communications Engineering, 1999.
H. Dudley, T. H. Tarnóczy, “The Speaking Machine of Wolfgang von Kempelen”, J. Acoust. Soc. Am., 22 (2), 151–166, 1949.
H. Dudley, “The Carrier Nature of Speech”, Bell Syst. Tech. J., 19 (4), 495–515, 1940.
N. Umeda, R. Teranishi, “The Parsing Program for Automatic Text-to-Speech Synthesis Developed at the Electrotechnical Laboratory in 1968”, IEEE Trans. Acoust., 23 (2), 183–188, 1975.
A. E. Yilmaz, “Türkçe Metinden Konuşma Sentezleme Uygulamaları İçin Bir Veri Sözlük Seti ve Yazılım Çerçevesi”, Gazi Üniversitesi Mühendislik Mimar. Fakültesi Derg., 24 (4), 735–744, 2009.
İ. Y. Özüm, A Speech Synthesis System for Turkish Language Based on the Concatenation of Phonemes Taken from Speaker, Master's Thesis, Middle East Technical University, Graduate School of Natural and Applied Sciences, 1993.
B. Eker, Turkish Text To Speech System, Master's Thesis, Bilkent University, The Department of Computer Engineering, 2002.
R. A. Khan, J. S. Chitode, “Concatenative Speech Synthesis: A Review”, Int. J. Comput. Appl., 136 (3), 1–6, 2016.
Y. Tabet, M. Boughazi, “Speech synthesis techniques. A survey”, 7th Int. Work. Syst. Signal Process. their Appl. WoSSPA 2011, 67–70, 2011.
M. Z. Rashad, H. M. El-Bakry, I. R. Isma’il, N. Mastorakis, “An overview of text-to-speech synthesis techniques”, Int. Conf. Commun. Inf. Technol. - Proc., 84–89, 2010.
D. Govind, S. R. M. Prasanna, “Expressive speech synthesis: A review”, Int. J. Speech Technol., 16 (2), 237–260, 2013.
R. Aşlıyan, K. Günel, “Türkçe metinler için hece tabanlı konuşma sentezleme sistemi”, Akademik Bilişim 2008, Çanakkale, Türkiye, 31–38, 2008.
M. Jalil, F. A. Butt, A. Malik, “A survey of different speech synthesis techniques”, 2013 Int. Conf. Technol. Adv. Electr. Electron. Comput. Eng. TAEECE 2013, 204–207, 2013.
İ. B. Uslu, “Metinden Konuşma Sentezleme”, TMMOB Elektrik Mühendisleri Odası Ankara Şubesi Haber Bülteni, 11–14, 2010.
A. Dunaev, A Text-to-Speech System Based on Deep Neural Networks, Bachelor's Thesis, KIT Department of Informatics, Institute for Anthropomatics and Robotics (IAR), Interactive Systems Labs (ISL), Karlsruhe Institute of Technology, 2019.
B. S. Gürler, Türkçe Konuşma Tanıma Sistemleri İçin Bir Konuşma Veritabanı, Master's Thesis, Gazi Üniversitesi, Elektronik-Bilgisayar Eğitimi Anabilim Dalı, 2014.
M. C. Orhan, C. Demiroğlu, “Konuşmacı Aradeğerlemeli SMM Tabanlı Metinden Konuşma Sentezleme Sistemi”, 2011 IEEE 19th Signal Processing and Communications Applications Conference (SIU 2011), 781–784, 2011.
Internet: Festvox, CMU_ARCTIC Databases, http://festvox.org/cmu_arctic/, 23.04.2022.
Internet: The LJ Speech Dataset, https://keithito.com/LJ-Speech-Dataset/, 23.04.2022.
Internet: Kaggle, The World English Bible, https://www.kaggle.com/datasets/bryanpark/the-world-english-bible-speech-dataset?select=transcript.txt, 23.04.2022.
E. Casanova et al., “TTS-Portuguese Corpus: a corpus for speech synthesis in Brazilian Portuguese”, Lang. Resour. Eval., 2022.
Internet: Papers With Code, KazakhTTS Dataset, https://paperswithcode.com/dataset/kazakhtts, 23.04.2022.
Internet: openslr.org, https://www.openslr.org/, 23.04.2022.
X. Li, D. Ma, B. Yin, “Advance research in agricultural text-to-speech: the word segmentation of analytic language and the deep learning-based end-to-end system”, Comput. Electron. Agric., 180, 1–10, 2021.
N. Halabi, Modern Standard Arabic Phonetics for Speech Synthesis, PhD Thesis, University of Southampton, Faculty of Physical Sciences and Engineering, School of Electronics and Computer Science, 2016.
D. Van Niekerk et al., “Rapid development of TTS corpora for four South African languages”, Proc. Annu. Conf. Int. Speech Commun. Assoc. INTERSPEECH, 2178–2182, 2017.
J. Kominek, A. W. Black, “The CMU Arctic Databases for Speech Synthesis”, Proc. ISCA Work. Speech Synth., 223–224, 2004.
I. Demirşahin, O. Kjartansson, A. Gutkin, C. Rivera, “Opensource Multispeaker Corpora of the English Accents in the British Isles”, Proc. 12th Language Resources and Evaluation Conference (LREC 2020), 6532–6541, 2020.
Internet: Meta-Share, Estonian Emotional Speech Corpus, https://metashare.ut.ee/repository/browse/estonian-emotional-speech-corpus/4d42d7a8463411e2a6e4005056b40024a19021a316b54b7fb707757d43d1a889/, 23.04.2022.
R. Altrov, H. Pajupuu, “Estonian Emotional Speech Corpus: theoretical base and implementation”, 4th International Workshop on Corpora for Research on Emotion Sentiment & Social Signals ES3 2012, 50–53, 2012.
Internet: T. Müller and D. Kreutz, Thorsten-Voice- ‘Thorsten-21.02-neutral’ Dataset, https://zenodo.org/record/5525342, 23.04.2022.
Internet: Papers With Code, JSUT Corpus Dataset, https://paperswithcode.com/dataset/jsut-corpus, 23.04.2022.
R. Sonobe, S. Takamichi, H. Saruwatari, “JSUT corpus: free large-scale Japanese speech corpus for end-to-end speech synthesis”, ICASSP2018, 2017.
S. Mussakhojayeva, A. Janaliyeva, A. Mirzakhmetov, Y. Khassanov, H. A. Varol, “KazakhTTS: An open-source Kazakh text-to-speech synthesis dataset”, Proc. Annu. Conf. Int. Speech Commun. Assoc. INTERSPEECH, 3511–3515, 2021.
N. Srivastava, R. Mukhopadhyay, K. R. Prajwal, C. V Jawahar, “IndicSpeech: Text-to-Speech Corpus for Indian Languages”, Proc. 12th Language Resources and Evaluation Conference (LREC 2020), 6417–6422, 2020.
E. Guner, C. Demiroglu, “A small footprint hybrid statistical/unit selection text-to-speech synthesis system for agglutinative languages”, ICASSP, IEEE Int. Conf. Acoust. Speech Signal Process. - Proc., 4537–4540, 2012.
R. Gokay, H. Yalcin, “Improving Low Resource Turkish Speech Recognition with Data Augmentation and TTS”, 16th Int. Multi-Conference Syst. Signals Devices, SSD 2019, 357–360, 2019.
İ. Sel, D. Hanbay, M. Karabatak, “Beyin Bilgisayar Arayüzleri İçin Türkçe Metinden Konuşma Sentezleme Sistemi”, Elektr. ve Bilgi. Sempozyumu 2011, 273–276, 2011.
J. Shen et al., “Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions”, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4779–4783, 2018.

Preparing A Balanced Corpus for Development of Turkish Speech Synthesis Systems

Abstract

Speech synthesis systems are an important part of human-computer interaction. In speech synthesis, a sequence of spectrograms corresponding to a sequence of text is predicted, and the resulting spectrograms are converted into a speech waveform that people can hear. Because development resources are scarce, speech synthesis systems do not perform equally well across languages. Training a speech synthesis system efficiently requires a large, accessible corpus. The lack of such a corpus for low-resource languages such as Turkish is one of the biggest obstacles to developing Turkish speech synthesis systems. Preparing a large corpus is a time-consuming, challenging, and costly task. This study describes the process of creating an accessible corpus for developing Turkish speech synthesis systems and improving their naturalness and intelligibility, along with the difficulties encountered. The previously compiled text data for the corpus was voiced by a male speaker in Istanbul Turkish, in an emotion-neutral style. The text data contains 109,826 words. The speech data is approximately 12 hours 38 minutes 59 seconds long and was recorded at a sampling rate of 22,050 Hz. The corpus was compared with “The LJ Speech Dataset”, which was previously prepared for English and has yielded successful results, and suggestions for future studies are presented. This corpus was developed to encourage Turkish speech synthesis studies at the academic level. In this way, we hope that a major deficiency in the development of Turkish speech synthesis systems will be eliminated.
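Because the corpus is compared with “The LJ Speech Dataset”, one way to inspect a corpus of this kind is through an LJ Speech-style transcript file. The sketch below assumes a pipe-delimited layout of `id|transcription|normalized transcription`, which is how LJ Speech organizes its `metadata.csv`; the abstract does not say that the Turkish corpus uses this layout, and the utterance ids and sentences are hypothetical.

```python
import csv
import io


def read_ljspeech_metadata(handle):
    """Parse LJ Speech-style records: id|transcription|normalized transcription."""
    # QUOTE_NONE keeps quotation marks and apostrophes in the text untouched.
    reader = csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
    rows = []
    for record in reader:
        if len(record) < 2:
            continue  # skip blank or malformed lines
        utt_id, text = record[0], record[1]
        # Fall back to the raw text when no normalized field is present.
        normalized = record[2] if len(record) > 2 else text
        rows.append((utt_id, text, normalized))
    return rows


def count_words(rows):
    """Count whitespace-separated words in the normalized transcriptions."""
    return sum(len(normalized.split()) for _, _, normalized in rows)


# Hypothetical sample records, for illustration only.
SAMPLE = (
    "utt_0001|Merhaba dünya.|Merhaba dünya.\n"
    "utt_0002|Saat 3'te gel.|Saat üçte gel.\n"
)

rows = read_ljspeech_metadata(io.StringIO(SAMPLE))
total_words = count_words(rows)
print(len(rows), "utterances,", total_words, "words")
```

Running the same two functions over a real transcript file, opened with `encoding="utf-8"`, would give an utterance count and a word total that can be checked against the figures reported for the corpus.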

Keywords

Text-to-speech system, Speech synthesis, Turkish speech synthesis, Deep learning

There are 40 citations in total.

Details

Primary Language	Turkish
Subjects	Engineering
Journal Section	Articles
Authors	Mustafa Sami Cücen (ORCID 0000-0002-3911-2566); Saadin Oyucu (ORCID 0000-0003-3880-3039); Hüseyin Polat (ORCID 0000-0003-4128-2625)
Project Number	121E479
Publication Date	July 31, 2023
Submission Date	August 8, 2022
Published in Issue	Year 2023, Volume: 16 Issue: 3

Cite

APA	Cücen, M. S., Oyucu, S., & Polat, H. (2023). Türkçe Konuşma Sentezleme Sistemlerinin Geliştirilmesi için Dengeli Bir Veri Kümesi Hazırlama. Bilişim Teknolojileri Dergisi, 16(3), 237-249. https://doi.org/10.17671/gazibtd.1159289