A Benchmark for Feature-injection Architectures in Image Captioning
Year 2021, Issue: 31, 461-468, 31.12.2021
Rumeysa Keskin, Özkan Çaylı, Özge Taylan Moral, Volkan Kılıç, Aytuğ Onan
Abstract
Describing an image with a grammatically and semantically correct sentence, known as image captioning, has improved significantly with recent advances in the computer vision (CV) and natural language processing (NLP) communities. The integration of these fields has led to the development of feature-injection architectures, which define how extracted features are used in captioning. In this paper, a benchmark of feature-injection architectures that utilize CV and NLP techniques is reported for encoder-decoder based captioning. The benchmark evaluations employ the Inception-v3 convolutional neural network to extract image features in the encoder, while the feature-injection architectures init-inject, pre-inject, par-inject, and merge are applied with a multi-layer gated recurrent unit (GRU) in the decoder to generate captions. The architectures have been evaluated extensively on the MSCOCO dataset across eight performance metrics. It is concluded that the init-inject architecture with a 3-layer GRU outperforms the other architectures in terms of captioning accuracy.
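To make the four injection points concrete, the sketch below shows, in PyTorch, where the image feature vector enters a GRU decoder under each architecture. This is a minimal illustration, not the authors' implementation: the class name, layer sizes, and the assumption of a pooled 2048-dimensional Inception-v3 feature vector are all hypothetical.

```python
import torch
import torch.nn as nn

class InjectCaptioner(nn.Module):
    """One decoder, four feature-injection modes: init, pre, par, merge."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512,
                 feat_dim=2048, num_layers=3, mode="init"):
        super().__init__()
        self.mode = mode
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Project image features to the size required by the injection point.
        proj_dim = embed_dim if mode in ("pre", "par") else hidden_dim
        self.feat_proj = nn.Linear(feat_dim, proj_dim)
        in_dim = embed_dim * 2 if mode == "par" else embed_dim
        self.gru = nn.GRU(in_dim, hidden_dim, num_layers, batch_first=True)
        out_dim = hidden_dim * 2 if mode == "merge" else hidden_dim
        self.out = nn.Linear(out_dim, vocab_size)

    def forward(self, feats, tokens):
        # feats: (B, 2048) pooled CNN features; tokens: (B, T) word ids
        x = self.embed(tokens)
        v = self.feat_proj(feats)
        h0 = None
        if self.mode == "init":
            # init-inject: features become the initial hidden state of the GRU
            h0 = v.unsqueeze(0).repeat(self.gru.num_layers, 1, 1)
        elif self.mode == "pre":
            # pre-inject: features are prepended as the first "word"
            x = torch.cat([v.unsqueeze(1), x], dim=1)
        elif self.mode == "par":
            # par-inject: features are concatenated to every word embedding
            x = torch.cat([x, v.unsqueeze(1).expand(-1, x.size(1), -1)], dim=-1)
        h, _ = self.gru(x, h0)
        if self.mode == "merge":
            # merge: the GRU never sees the image; fuse features afterwards
            h = torch.cat([h, v.unsqueeze(1).expand(-1, h.size(1), -1)], dim=-1)
        return self.out(h)  # (B, T', vocab_size) logits per time step

# Example: score a dummy batch under the init-inject architecture
model = InjectCaptioner(vocab_size=10000, mode="init")
logits = model(torch.randn(4, 2048), torch.randint(0, 10000, (4, 12)))
```

Under this sketch, init-inject conditions the entire recurrence on the image once via the initial hidden state, pre-inject and par-inject feed the features in through the input sequence, and merge keeps the GRU purely linguistic, combining vision and language only after the recurrence, following the taxonomy of Tanti et al. (2018).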
Supporting Institution
TÜBİTAK
Thanks
This research was supported by the Scientific and Technological Research Council of Turkey (TÜBİTAK) and the British Council under the Newton-Katip Çelebi Fund Institutional Links programme (Turkey-UK project: 120N995).
References
- Anderson, P., Fernando, B., Johnson, M., & Gould, S. (2016). SPICE: Semantic propositional image caption evaluation. Paper presented at the European Conference on Computer Vision.
- Banerjee, S., & Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Paper presented at the Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization.
- Baran, M., Moral, Ö. T., & Kılıç, V. (2021). Akıllı Telefonlar için Birleştirme Modeli Tabanlı Görüntü Altyazılama [Merge model based image captioning for smartphones]. Avrupa Bilim ve Teknoloji Dergisi, (26), 191-196.
- Çaylı, Ö., Makav, B., Kılıç, V., & Onan, A. (2020). Mobile Application Based Automatic Caption Generation for Visually Impaired. Paper presented at the International Conference on Intelligent and Fuzzy Systems.
- Chang, S.-F. (1995). Compressed-domain techniques for image/video indexing and manipulation. Paper presented at the Proceedings of the International Conference on Image Processing.
- Chiarella, D., Yarbrough, J., & Jackson, C. A.-L. (2020). Using alt text to make science Twitter more accessible for people with visual impairments. Nature Communications, 11(1), 1-3.
- Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
- Devlin, J., Cheng, H., Fang, H., Gupta, S., Deng, L., He, X., . . . Mitchell, M. (2015). Language models for image captioning: The quirks and what works. arXiv preprint arXiv:1505.01809.
- Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Gao, Y., & Glowacka, D. (2016). Deep gate recurrent neural network. Paper presented at the Asian conference on machine learning.
- Gers, F. A., & Schmidhuber, E. (2001). LSTM recurrent networks learn simple context-free and context-sensitive languages. IEEE Transactions on Neural Networks, 12(6), 1333-1340.
- Gurari, D., Li, Q., Stangl, A. J., Guo, A., Lin, C., Grauman, K., . . . Bigham, J. P. (2018). VizWiz grand challenge: Answering visual questions from blind people. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
- Hodosh, M., Young, P., & Hockenmaier, J. (2013). Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47, 853-899.
- Keskin, R., Moral, Ö. T., Kılıç, V., & Onan, A. (2021). Multi-GRU Based Automated Image Captioning for Smartphones. Paper presented at the 2021 29th Signal Processing and Communications Applications Conference (SIU).
- Kılıç, V. (2021). Deep Gated Recurrent Unit for Smartphone-Based Image Captioning. Sakarya University Journal of Computer and Information Sciences, 4(2), 181-191.
- Kulkarni, G., Premraj, V., Ordonez, V., Dhar, S., Li, S., Choi, Y., . . . Berg, T. L. (2013). Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12), 2891-2903.
- Lin, C.-Y. (2004). Rouge: A package for automatic evaluation of summaries. Paper presented at the Text Summarization Branches Out.
- Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., . . . Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. Paper presented at the European Conference on Computer Vision.
- Liu, S., Zhu, Z., Ye, N., Guadarrama, S., & Murphy, K. (2016). Optimization of image description metrics using policy gradient methods. arXiv preprint arXiv:1612.00370.
- Liu, X., Xu, Q., & Wang, N. (2019). A survey on deep neural network-based image captioning. The Visual Computer, 35(3), 445-470.
- Makav, B., & Kılıç, V. (2019a). A new image captioning approach for visually impaired people. Paper presented at the 2019 11th International Conference on Electrical and Electronics Engineering (ELECO).
- Makav, B., & Kılıç, V. (2019b). Smartphone-based image captioning for visually and hearing impaired. Paper presented at the 2019 11th International Conference on Electrical and Electronics Engineering (ELECO).
- Mao, J., Wei, X., Yang, Y., Wang, J., Huang, Z., & Yuille, A. L. (2015). Learning like a child: Fast novel visual concept learning from sentence descriptions of images. Paper presented at the Proceedings of the IEEE International Conference on Computer Vision.
- Nina, O., & Rodriguez, A. (2015). Simplified LSTM unit and search space probability exploration for image description. Paper presented at the 2015 10th International Conference on Information, Communications and Signal Processing (ICICS).
- Ordonez, V., Kulkarni, G., & Berg, T. (2011). Im2text: Describing images using 1 million captioned photographs. Advances in Neural Information Processing Systems, 24, 1143-1151.
- Ouyang, H., Zeng, J., Li, Y., & Luo, S. (2020). Fault detection and identification of blast furnace ironmaking process using the gated recurrent unit network. Processes, 8(4), 391.
- Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. Paper presented at the Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
- Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Tanti, M., Gatt, A., & Camilleri, K. P. (2018). Where to put the image in an image caption generator. Natural Language Engineering, 24(3), 467-489.
- Tao, Y., Wang, X., Sánchez, R.-V., Yang, S., & Bai, Y. (2019). Spur gear fault diagnosis using a multilayer gated recurrent unit approach with vibration signal. IEEE Access, 7, 56880-56889.
- Targ, S., Almeida, D., & Lyman, K. (2016). Resnet in Resnet: Generalizing residual architectures. arXiv preprint arXiv:1603.08029.
- Vedantam, R., Lawrence Zitnick, C., & Parikh, D. (2015). CIDEr: Consensus-based image description evaluation. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2016). Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 652-663.
- Yao, T., Pan, Y., Li, Y., Qiu, Z., & Mei, T. (2017). Boosting image captioning with attributes. Paper presented at the 2017 IEEE International Conference on Computer Vision (ICCV).
- Young, P., Lai, A., Hodosh, M., & Hockenmaier, J. (2014). From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2, 67-78.