Comparison of Convolutional RNN+CTC, LSTM+CTC and GRU+CTC Models on a New Turkish Audiobook Dataset

Halil İbrahim Yalman; Zekeriya Tüfekci

doi:10.31590/ejosat.1082109

Research Article

Year 2022, Issue: 34, pp. 321-327, 31.03.2022

Abstract

Speech recognition is the process of converting human speech signals into text. Although many speech recognition models and datasets have been produced over the years, such resources remain scarce in Turkey. In this study, an original dataset consisting of audiobooks was developed for Turkish speech recognition systems. The dataset was prepared by segmenting existing audiobooks. On this dataset, the performance of Recurrent Neural Network (RNN), Long Short-Term Memory (LSTM), and Gated Recurrent Unit (GRU) models, each combined with Convolutional Neural Networks (CNN) and Connectionist Temporal Classification (CTC), was examined and compared. According to the results, the LSTM model achieved the highest performance, while the GRU model, which uses fewer parameters, reached a speech recognition rate close to that of the LSTM model.

Keywords

Speech Recognition, Deep Learning, Convolutional Neural Networks, Long Short-Term Memory, Simple Recurrent Networks, Gated Recurrent Units, Connectionist Temporal Classification, Turkish Audiobook Dataset.
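
The abstract's remark that the GRU model uses fewer parameters than the LSTM model can be illustrated with a rough count. The sketch below assumes the textbook formulation, in which each gate block has one input matrix, one recurrent matrix, and one bias vector; a simple RNN has one such block, a GRU three, and an LSTM four. The layer sizes are hypothetical and are not taken from the paper, and real frameworks may add extra bias terms, so the exact totals can differ.

```python
def recurrent_layer_params(input_size: int, hidden_size: int, gate_blocks: int) -> int:
    """Count weights and biases of one recurrent layer (textbook formulation).

    Each gate block holds an input matrix (hidden x input), a recurrent
    matrix (hidden x hidden), and one bias vector (hidden).
    """
    per_block = hidden_size * input_size + hidden_size * hidden_size + hidden_size
    return gate_blocks * per_block


# Hypothetical sizes, chosen only for illustration; the abstract does not
# state the layer sizes used in the study.
INPUT_SIZE = 128
HIDDEN_SIZE = 256

rnn_params = recurrent_layer_params(INPUT_SIZE, HIDDEN_SIZE, gate_blocks=1)
gru_params = recurrent_layer_params(INPUT_SIZE, HIDDEN_SIZE, gate_blocks=3)
lstm_params = recurrent_layer_params(INPUT_SIZE, HIDDEN_SIZE, gate_blocks=4)

print(f"RNN:  {rnn_params}")
print(f"GRU:  {gru_params}")
print(f"LSTM: {lstm_params}")
print(f"GRU/LSTM ratio: {gru_params / lstm_params:.2f}")
```

Under these assumptions a GRU layer carries three quarters of the parameters of an LSTM layer of the same width, whatever sizes are chosen.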

Details

Primary Language	Turkish
Subjects	Engineering
Section	Articles
Authors	Halil İbrahim Yalman (ORCID 0000-0003-0841-1309), Zekeriya Tüfekci (ORCID 0000-0001-7835-2741)
Early Pub Date	January 30, 2022
Publication Date	March 31, 2022
Published in Issue	Year 2022, Issue: 34

Cite

APA	Yalman, H. İ., & Tüfekci, Z. (2022). Yeni Bir Türkçe Sesli Kitap Veri Seti Üzerinde Convolutional RNN+CTC, LSTM+CTC ve GRU+CTC Modellerinin Karşılaştırılması [Comparison of Convolutional RNN+CTC, LSTM+CTC and GRU+CTC Models on a New Turkish Audiobook Dataset]. Avrupa Bilim ve Teknoloji Dergisi, (34), 321-327. https://doi.org/10.31590/ejosat.1082109

Avrupa Bilim ve Teknoloji Dergisi

Cited By

Performance Evaluation of BiRNN, BiLSTM and BiGRU Models Applied to Speech Recognition

European Journal of Science and Technology

https://doi.org/10.31590/ejosat.1111314