Research Article

Parallel Gated Recurrent Unit Networks as an Encoder for Speech Recognition

Year 2022, Issue: 36, 87 - 90, 31.05.2022
https://doi.org/10.31590/ejosat.1103714

Abstract

The Listen, Attend and Spell (LAS) network is an end-to-end approach to speech recognition that does not require an explicit language model. It consists of two parts: an encoder, which receives acoustic features as input, and a decoder, which produces one character per time step based on the encoder output and an attention mechanism. Multi-layer recurrent neural networks (RNNs) are used in both the encoder and the decoder. Hence, the LAS architecture can be simplified as one RNN for the decoder and another RNN for the encoder; their shapes and layer sizes can differ. In this work, we examined the performance of using multiple RNNs in the encoder. Our baseline LAS network uses a single RNN with a hidden size of 256. We compared it with encoders that use 2 RNNs with a hidden size of 128 and 4 RNNs with a hidden size of 64. The main idea behind the proposed approach is to focus the RNNs on different patterns in the data (phonemes in this case). At the encoder output, the outputs of the RNNs are concatenated and fed to the decoder. The TIMIT database is used to compare the networks, with phoneme error rate as the performance metric. The experimental results showed that the proposed approach can achieve better performance than the baseline network. However, increasing the number of RNNs does not guarantee further improvement.
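
The encoder layout described in the abstract can be illustrated with a minimal, untrained sketch in plain Python. It is not the authors' implementation: it assumes single-layer, unidirectional GRUs (following the title), assumes that every parallel GRU receives the same input frames, and uses a hypothetical 40-dimensional feature vector with random weights. The helper names (`make_gru`, `run_gru`, `parallel_encode`) are made up for this illustration. It only shows how the outputs of 1, 2 or 4 recurrent networks with hidden sizes of 256, 128 or 64 are concatenated into an encoder output of the same width.

```python
import math
import random


def make_gru(input_size, hidden_size, rng):
    """Return randomly initialised weights for one single-layer GRU."""
    bound = 1.0 / math.sqrt(hidden_size)

    def mat(rows, cols):
        return [[rng.uniform(-bound, bound) for _ in range(cols)]
                for _ in range(rows)]

    # One (W, U, b) triple per gate: update (z), reset (r), candidate (n).
    return {gate: (mat(hidden_size, input_size),
                   mat(hidden_size, hidden_size),
                   [0.0] * hidden_size)
            for gate in "zrn"}


def affine(params, x, h):
    """Compute W x + U h + b for one gate."""
    W, U, b = params
    return [sum(w * xi for w, xi in zip(w_row, x))
            + sum(u * hi for u, hi in zip(u_row, h))
            + bi
            for w_row, u_row, bi in zip(W, U, b)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(gru, x, h):
    """One GRU time step: gates first, then the new hidden state."""
    z = [sigmoid(v) for v in affine(gru["z"], x, h)]
    r = [sigmoid(v) for v in affine(gru["r"], x, h)]
    gated_h = [ri * hi for ri, hi in zip(r, h)]
    n = [math.tanh(v) for v in affine(gru["n"], x, gated_h)]
    return [zi * hi + (1.0 - zi) * ni for zi, hi, ni in zip(z, h, n)]


def run_gru(gru, frames):
    """Run one GRU over all frames and return the hidden state per frame."""
    h = [0.0] * len(gru["z"][2])  # hidden size taken from the bias length
    outputs = []
    for x in frames:
        h = gru_step(gru, x, h)
        outputs.append(h)
    return outputs


def parallel_encode(grus, frames):
    """Feed the same frames to every GRU and concatenate their outputs."""
    per_gru = [run_gru(g, frames) for g in grus]
    # Concatenate frame by frame: one flat vector per time step.
    return [[v for out in step for v in out] for step in zip(*per_gru)]


# Usage: 5 random frames with an assumed feature dimension of 40.
rng = random.Random(0)
frames = [[rng.uniform(-1.0, 1.0) for _ in range(40)] for _ in range(5)]
for n_rnns, hidden in [(1, 256), (2, 128), (4, 64)]:
    encoder = [make_gru(40, hidden, rng) for _ in range(n_rnns)]
    encoded = parallel_encode(encoder, frames)
    # Every configuration yields 5 frames of width n_rnns * hidden = 256.
    print(n_rnns, hidden, len(encoded), len(encoded[0]))
```

In this sketch the three configurations give the decoder an input of the same width, so only the internal structure of the encoder changes. Each smaller GRU has its own recurrent weights and no connections to the others.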

References

  • C. Kim et al., “A Review of On-Device Fully Neural End-to-End Automatic Speech Recognition Algorithms,” in 2020 54th Asilomar Conference on Signals, Systems, and Computers, 2020, pp. 277–283.
  • A. P. Varga and R. K. Moore, “Hidden Markov model decomposition of speech and noise,” in International Conference on Acoustics, Speech, and Signal Processing, 1990, pp. 845–848.
  • G. Hinton et al., “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups,” IEEE Signal Process. Mag., vol. 29, no. 6, pp. 82–97, Nov. 2012.
  • E. Yiğit, U. Özkaya, Ş. Öztürk, D. Singh, and H. Gritli, “Automatic detection of power quality disturbance using convolutional neural network structure with gated recurrent unit,” Mobile Information Systems, 2021.
  • A. Graves, A. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013, pp. 6645–6649.
  • W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4960–4964.
  • I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in 27th International Conference on Neural Information Processing Systems, 2014, pp. 3104–3112.
  • D. Bahdanau, K. Cho, and Y. Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” in 3rd International Conference on Learning Representations, 2015, pp. 1–15.
  • K. Cho et al., “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1724–1734.
  • M. T. S. Al-Kaltakchi, W. L. Woo, S. S. Dlay, and J. A. Chambers, “Comparison of I-vector and GMM-UBM approaches to speaker identification with TIMIT and NIST 2008 databases in challenging environments,” in 2017 25th European Signal Processing Conference (EUSIPCO), 2017, pp. 533–537.
  • T. Drugman, Y. Stylianou, Y. Kida, and M. Akamine, “Voice Activity Detection: Merging Source and Filter-based Information,” IEEE Signal Process. Lett., vol. 23, no. 2, pp. 252–256, Feb. 2016.
  • I. Cohen and B. Berdugo, “Speech enhancement for non-stationary noise environments,” Signal Processing, vol. 81, no. 11, pp. 2403–2418, Nov. 2001.
  • S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. Acoust., vol. 28, no. 4, pp. 357–366, Aug. 1980.
  • F. Seide, G. Li, X. Chen, and D. Yu, “Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription,” in 2011 IEEE Workshop on Automatic Speech Recognition & Understanding, 2011, pp. 24–29.
  • S. Yang, X. Yu, and Y. Zhou, “LSTM and GRU Neural Network Performance Comparison Study: Taking Yelp Review Dataset as an Example,” in 2020 International Workshop on Electronic Communication and Artificial Intelligence (IWECAI), 2020, pp. 98–101.
  • T. Luong, H. Pham, and C. D. Manning, “Effective Approaches to Attention-based Neural Machine Translation,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 1412–1421.
  • K.-F. Lee and H.-W. Hon, “Speaker-independent phone recognition using hidden Markov models,” IEEE Trans. Acoust., vol. 37, no. 11, pp. 1641–1648, 1989.
There are 17 citations in total.

Details

Primary Language: English
Subjects: Engineering
Journal Section: Articles
Authors:

Zekeriya Tüfekci (ORCID: 0000-0001-7835-2741)

Gökay Dişken (ORCID: 0000-0002-8680-0636)

Early Pub Date: April 11, 2022
Publication Date: May 31, 2022
Published in Issue: Year 2022, Issue 36

Cite

APA: Tüfekci, Z., & Dişken, G. (2022). Parallel Gated Recurrent Unit Networks as an Encoder for Speech Recognition. Avrupa Bilim ve Teknoloji Dergisi, (36), 87-90. https://doi.org/10.31590/ejosat.1103714