Research Article

Turkish Dialect Recognition Using Acoustic and Phonotactic Features in Deep Learning Architectures

Year 2020, Volume 13, Issue 3, 207–216, 31.07.2020
https://doi.org/10.17671/gazibtd.668023

Abstract

Dialects are local forms of speech that diverge to some degree from the standard language. Dialect recognition is a popular topic in speech recognition research; in particular, identifying the dialect of an utterance beforehand is desirable for improving the performance of large-scale speech recognition systems. Phonetic differences in speech can be detected by examining acoustic properties at the physical level, and features such as log mel-spectrograms are used for this purpose. The term phonotactics, in turn, refers to the rules by which phonemes are combined in a language or dialect: phoneme sequences and their frequencies vary from dialect to dialect, and such sequences are obtained with the help of phoneme recognizers. Another topic that has become popular in recent years is deep neural networks. Convolutional Neural Networks (CNNs), a special kind of deep neural network, are widely used in image and speech recognition, while Long Short-Term Memory (LSTM) networks are a deep learning model that produces better results than n-gram models in language modeling. This study addresses the classification of Turkish dialects with CNN- and LSTM-type neural networks in terms of their acoustic and phonotactic features; LSTM networks are also used for language modeling in the phonotactic approach. In the experimental study, the proposed approaches were tested and evaluated on the Turkish Dialects Dataset, which we collected ourselves. The results show that the approaches used achieve an accuracy of 85.1% for Turkish dialect recognition.
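
As a rough illustration of the acoustic pipeline described above, the sketch below extracts log mel-spectrogram features with librosa and feeds them to a small convolutional classifier built with Keras (both libraries are cited in the references). The sampling rate, mel-band count, patch length, and layer sizes are illustrative assumptions, not the configuration reported in the paper.

```python
# Illustrative acoustic front end and CNN classifier; all settings here
# (16 kHz audio, 40 mel bands, 300-frame patches, filter counts) are
# assumptions for the sketch, not the paper's reported configuration.
import librosa
from tensorflow.keras import layers, models

def log_mel_spectrogram(path, sr=16000, n_mels=40):
    y, sr = librosa.load(path, sr=sr)                 # load, resampling to 16 kHz
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)                   # log-compress mel energies

def build_cnn(n_dialects, n_mels=40, n_frames=300):
    # Treat the (mels x frames) spectrogram as a one-channel image.
    return models.Sequential([
        layers.Input(shape=(n_mels, n_frames, 1)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),                          # dropout regularization
        layers.Dense(n_dialects, activation="softmax"),
    ])
```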

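The phonotactic approach can be sketched in the same spirit: train one LSTM language model per dialect on phoneme-id sequences, then assign a test utterance to the dialect whose model gives it the highest log-likelihood. Everything below (inventory size, context length, network and training settings) is an illustrative assumption, and the phoneme recognizer front end that produces the sequences is not reproduced here.

```python
# Sketch of phonotactic scoring with per-dialect LSTM language models.
# N_PHONES and SEQ_LEN are assumed values, not the paper's settings.
import numpy as np
from tensorflow.keras import layers, models

N_PHONES = 40   # assumed phoneme inventory size
SEQ_LEN = 20    # context length for next-phoneme prediction

def build_phonotactic_lm():
    model = models.Sequential([
        layers.Embedding(N_PHONES, 32),
        layers.LSTM(128),
        layers.Dense(N_PHONES, activation="softmax"),  # next-phoneme distribution
    ])
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")
    return model  # train with model.fit on (context, next-phoneme) pairs

def log_likelihood(model, phones):
    """Sum of log P(next phoneme | context) over an utterance's phoneme ids."""
    total = 0.0
    for t in range(SEQ_LEN, len(phones)):
        context = np.array([phones[t - SEQ_LEN:t]])
        probs = model.predict(context, verbose=0)[0]
        total += np.log(probs[phones[t]] + 1e-12)
    return total

# Classification: argmax over per-dialect models of log_likelihood(model_d, phones)
```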
References

  • J. Zhao, H. Shu, L. Zhang, X. Wang, Q. Gong, P. Li, “Cortical competition during language discrimination”, Neuroimage, 43(3), 624–633, 2008.
  • F. Ramus, J. Mehler, “Language identification with suprasegmental cues: a study based on speech resynthesis”, J. Acoust. Soc. Am., 105(1), 512–521, 1999.
  • Y. K. Muthusamy, E. Barnard, R. A. Cole, “Reviewing Automatic Language Identification”, IEEE Signal Process. Mag., 11(4), 33–41, 1994.
  • M. A. Zissman, “Comparison of four approaches to automatic language identification of telephone speech”, IEEE Trans. Speech Audio Process., 4(1), 31–44, 1996.
  • N. Demir, “Ağız Terimi Üzerine”, Türkbilig, 105–116, 2002.
  • A. Etman, A. A. Louis, American dialect identification using phonotactic and prosodic features, in IntelliSys - Proc. 2015 SAI Intell. Syst. Conf., 963–970, 2015.
  • S. Safavi, M. Russell, P. Jančovič, “Automatic speaker, age-group and gender identification from children’s speech”, Computer Speech & Language, 50, 141–156, 2018.
  • F. Biadsy, Automatic Dialect and Accent Recognition and its Application to Speech Recognition, PhD Thesis, Columbia University, 2011.
  • N. Dehak, P. A. Torres-Carrasquillo, D. Reynolds, R. Dehak, Language recognition via i-vectors and dimensionality reduction, in Interspeech, 857–860, 2011.
  • A. Hanani, M. Russell, M. J. Carey, “Human and computer recognition of regional accents and ethnic groups from British English speech”, Comput. Speech Lang., 27(1), 59–74, 2013.
  • H. Soltau, L. Mangu, F. Biadsy, From modern standard Arabic to Levantine ASR: Leveraging GALE for dialects, in ASRU, 266–271, 2011.
  • M. Soufifar, S. Cumani, L. Burget, J. H. Černocký, Discriminative classifiers for phonotactic language recognition with i-vectors, in ICASSP, 4853–4856, 2012.
  • I. Lopez-Moreno, J. Gonzalez-Dominguez, O. Plchot, D. Martinez, J. Gonzalez-Rodriguez, P. Moreno, Automatic language identification using deep neural networks, in ICASSP, vol. 1, 2014.
  • C. Salamea, L. F. D’Haro, R. De Cordoba, R. San-Segundo, “On the use of phone-gram units in recurrent neural networks for language identification”, Proceedings of Odyssey, 117-123, 2016.
  • M. Jin, Y. Song, I. McLoughlin, L.-R. Dai, Z.-F. Ye, “LID-senone extraction via deep neural networks for end-to-end language identification”, Proceedings of Odyssey, 210–216, 2016.
  • A. Lozano-Diez, R. Zazo-Candil, J. Gonzalez-Dominguez, D.T. Toledano, J. Gonzalez-Rodriguez, An end-to-end approach to language identification in short utterances using convolutional neural networks, in Interspeech, 2015.
  • Y. Tian, L. He, Y. Liu, J. Liu, “Investigation of senone-based long short-term memory RNNs for spoken language recognition”, Proceedings of Odyssey, 89-93, 2016.
  • L. Deng, D. Yu, “Deep Learning: Methods and Applications”, Found. Trends Signal Process., 7, 2014.
  • T. N. Sainath, R. J. Weiss, A. W. Senior, K. W. Wilson, O. Vinyals, Learning the speech front-end with raw waveform CLDNNs, in Interspeech, 2015.
  • A.-R. Mohamed, G. E. Hinton, G. Penn, Understanding How Deep Belief Networks Perform Acoustic Modelling, in ICASSP, 2012.
  • D. C. Cireşan, U. Meier, J. Masci, L. M. Gambardella, J. Schmidhuber, Flexible, High Performance Convolutional Neural Networks for Image Classification, Proc. Twenty-Second Int. Jt. Conf. Artif. Intell., 2011.
  • O. Abdel-Hamid, A.-R. Mohamed, H. Jiang, L. Deng, G. Penn, D. Yu, “Convolutional Neural Networks for Speech Recognition”, IEEE/ACM Trans. Audio, Speech, Lang. Process., 22(10), 2014.
  • N. F. Chen, W. Shen, J. P. Campbell, A linguistically-informative approach to dialect recognition using dialect discriminating context-dependent phonetic models, in ICASSP, 5014–5017, 2010.
  • G. Işık, H. Artuner, A Dataset for Turkish Dialect Recognition and Classification with Deep Learning, in 26th IEEE Signal Processing and Communications Applications Conference (SIU), 2018.
  • M. Sundermeyer, R. Schlüter, H. Ney, LSTM Neural Networks for Language Modeling, Proc. Interspeech, 194–197, 2012.
  • T. Mikolov, M. Karafiát, L. Burget, J. Černocký, S. Khudanpur, Recurrent Neural Network based Language Model, in Interspeech, 1045–1048, 2010.
  • N. Demir, “Ağız Araştırmalarında Kaynak Kişi Meselesi”, Folk. Prof. Dr. Dursun Yıldırım Armağanı, 11, 1998.
  • Internet: P. Boersma, D. Weenink, Praat program, http://www.praat.org, 03.08.2019.
  • L. Karahan, Anadolu Ağızlarının Sınıflandırılması, Türk Dil Kurumu Yayınları, 1996.
  • P. Matějka, P. Schwarz, J. Černocký, P. Chytil, Phonotactic Language Identification using High Quality Phoneme Recognition, in Eurospeech, 2005.
  • Internet: P. Schwarz, P. Matejka, L. Burget, O. Glembek, Phoneme recognition based on long temporal context, http://speech.fit.vutbr.cz/software, 27.10.2019.
  • M. H. S. Segler, T. Kogej, C. Tyrchan, M. P. Waller, “Generating Focussed Molecule Libraries for Drug Discovery with Recurrent Neural Networks”, ACS Central Science, 4(1), 120-131, 2017.
  • Y. Goldberg, “A primer on neural network models for natural language processing”, J. Artif. Intell. Res., 57, 345–420, 2016.
  • A. Graves, J. Schmidhuber, “Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures”, Neural Networks, 18(5), 602–610, 2005.
  • H. Sak, A. Senior, F. Beaufays, Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling, Interspeech, 338–342, 2014.
  • Y. Bengio, P. Simard, P. Frasconi, “Learning long-term dependencies with gradient descent is difficult”, IEEE Trans. Neural Networks, 5, 157–166, 1994.
  • S. Hochreiter, J. Schmidhuber, “Long Short-Term Memory”, Neural Comput., 9(8), 1735–1780, 1997.
  • Internet: F. Chollet, Keras, https://github.com/fchollet/keras, 15.11.2019.
  • B. McFee, C. Raffel, D. Liang, D. P. W. Ellis, M. McVicar, E. Battenberg, O. Nieto, librosa: Audio and Music Signal Analysis in Python, Proc. 14th Python Sci. Conf. (SciPy), 1–7, 2015.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, J. Mach. Learn. Res., 15, 1929–1958, 2014.
  • V. Nair, G. E. Hinton, Rectified linear units improve restricted Boltzmann machines, Proc. 27th Int. Conf. Mach. Learn., 2010.
  • L. Bottou, Large-Scale Machine Learning with Stochastic Gradient Descent, Proc. COMPSTAT’2010, 177–186, 2010.
  • R. J. Williams, J. Peng, “An efficient gradient-based algorithm for online training of recurrent network trajectories”, Neural Comput., 4, 491–501, 1990.
  • L. Ferrer, Y. Lei, M. McLaren, N. Scheffer, “Study of Senone-Based Deep Neural Network Approaches for Spoken Language Recognition”, IEEE/ACM Trans. Audio, Speech, Lang. Process., 24(1), 105–116, 2016.
  • Z. Tang, D. Wang, Y. Chen, L. Li, A. Abel, “Phonetic temporal neural model for language identification”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(1), 134-144, 2017.


Details

Primary Language English
Subjects Computer Software, Engineering
Journal Section Articles
Authors

Gültekin Işık 0000-0003-3037-5586

Harun Artuner 0000-0002-6044-379X

Publication Date July 31, 2020
Submission Date December 31, 2019
Published in Issue Year 2020

Cite

APA Işık, G., & Artuner, H. (2020). Turkish Dialect Recognition Using Acoustic and Phonotactic Features in Deep Learning Architectures. Bilişim Teknolojileri Dergisi, 13(3), 207-216. https://doi.org/10.17671/gazibtd.668023