Fusion of High-Level Visual Attributes for Image Captioning
Year: 2023, Issue: 52, Pages: 161-168, Published: 15.12.2023
Murat Kılcı, Özkan Çaylı, Volkan Kılıç
Abstract
Image captioning aims to generate a natural language description that accurately conveys the content of an image. Recently, deep learning models have been used to extract visual attributes from images, enhancing the accuracy of captions. However, it is essential to assess these visual attributes to ensure optimal performance and to avoid incorporating redundant or misleading information. In this study, we employ visual attributes obtained from semantic segmentation, object detection, instance segmentation, and keypoint detection, as well as their fusion. Experimental evaluations on the widely used VizWiz and MSCOCO Captions datasets demonstrate that fusing visual attributes improves the accuracy of caption generation. Furthermore, the image captioning model that utilizes the fusion of visual attributes has been embedded into our custom-designed Android application, named NObstacle, enabling captioning without the need for an internet connection.
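To make the fusion idea concrete, here is a minimal, illustrative sketch, not the authors' implementation: it assumes torchvision's off-the-shelf models (FCN, Faster R-CNN, Mask R-CNN, Keypoint R-CNN) stand in for the four attribute extractors, and that simple concatenation of per-task summary vectors stands in for the fusion step, since the abstract does not specify the actual backbones or fusion strategy.

```python
# Illustrative sketch only: torchvision models stand in for the paper's four
# attribute extractors, and concatenation stands in for its fusion step.
import torch
from torchvision.models.segmentation import fcn_resnet50
from torchvision.models.detection import (
    fasterrcnn_resnet50_fpn,
    maskrcnn_resnet50_fpn,
    keypointrcnn_resnet50_fpn,
)

NUM_COCO_CLASSES = 91  # label space of torchvision's COCO detection models


def label_histogram(pred: dict, threshold: float = 0.5) -> torch.Tensor:
    """Histogram of confidently predicted category labels for one image."""
    hist = torch.zeros(NUM_COCO_CLASSES)
    for label, score in zip(pred["labels"], pred["scores"]):
        if score > threshold:
            hist[label] += 1.0
    return hist


def extract_fused_attributes(image: torch.Tensor) -> torch.Tensor:
    """Fuse semantic-segmentation, detection, instance-segmentation, and
    keypoint attributes of one (3, H, W) image in [0, 1] into a single
    vector. Per-model input normalization is omitted for brevity."""
    semantic = fcn_resnet50(weights="DEFAULT").eval()
    detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
    instance = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()
    keypoint = keypointrcnn_resnet50_fpn(weights="DEFAULT").eval()

    with torch.no_grad():
        # Semantic segmentation: per-class peak confidence over the score map.
        scores = semantic(image.unsqueeze(0))["out"]         # (1, 21, H, W)
        sem_vec = scores.softmax(dim=1).amax(dim=(2, 3))[0]  # (21,)

        # The three detection-style models each return a list of dicts with
        # "labels" and "scores"; summarize each as a label histogram.
        det_vec = label_histogram(detector([image])[0])
        ins_vec = label_histogram(instance([image])[0])
        kpt_vec = label_histogram(keypoint([image])[0])

    # Fusion by concatenation; a caption decoder would consume this vector.
    return torch.cat([sem_vec, det_vec, ins_vec, kpt_vec])
```

For the offline use in NObstacle mentioned above, such a model would typically be exported for on-device inference (e.g., via TorchScript), though the abstract does not detail the conversion pipeline used.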
Supporting Institution
TUBITAK and İKCU BAP
Project Number
120N995, 2021-ÖDL-MÜMF-0006, 2022-TYL-FEBE-0012
Thanks
This research was supported by the Scientific and Technological Research Council of Turkey (TUBITAK) and the British Council (The Newton-Katip Celebi Fund Institutional Links, Turkey-UK project no. 120N995), and by the Scientific Research Projects Coordination Unit of Izmir Katip Celebi University (project nos. 2021-ÖDL-MÜMF-0006 and 2022-TYL-FEBE-0012).
References
- Akosman, Ş. A., Öktem, M., Moral, Ö. T., & Kılıç, V. (2021). Deep Learning-based Semantic Segmentation for Crack Detection on Marbles. 29th Signal Processing and Communications Applications Conference (SIU),
- Amit, Y., Felzenszwalb, P., & Girshick, R. (2020). Object detection. Computer Vision: A Reference Guide, 1-9.
- Anderson, P., Fernando, B., Johnson, M., & Gould, S. (2016). SPICE: Semantic propositional image caption evaluation. Computer Vision–ECCV: 14th European Conference, Amsterdam, The Netherlands, October 11-14, Proceedings, Part V 14,
- Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., & Zhang, L. (2018). Bottom-up and top-down attention for image captioning and visual question answering. Proceedings of the IEEE conference on computer vision and pattern recognition,
- Banerjee, S., & Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization,
- Baran, M., Moral, Ö. T., & Kılıç, V. (2021). Akıllı telefonlar için birleştirme modeli tabanlı görüntü altyazılama. Avrupa Bilim ve Teknoloji Dergisi(26), 191-196.
- Barroso-Laguna, A., Riba, E., Ponsa, D., & Mikolajczyk, K. (2019). Key.Net: Keypoint detection by handcrafted and learned CNN filters. Proceedings of the IEEE/CVF international conference on computer vision,
- Betül, U., Çaylı, Ö., Kılıç, V., & Onan, A. (2022). Resnet based deep gated recurrent unit for image captioning on smartphone. Avrupa Bilim ve Teknoloji Dergisi(35), 610-615.
- Aydın, S., Çaylı, Ö., Kılıç, V., & Onan, A. (2022). Sequence-to-sequence video captioning with residual connected gated recurrent units. Avrupa Bilim ve Teknoloji Dergisi(35), 380-386.
- Çaylı, Ö., Makav, B., Kılıç, V., & Onan, A. (2021). Mobile application based automatic caption generation for visually impaired. Intelligent and Fuzzy Techniques: Smart and Innovative Solutions: Proceedings of the INFUS Conference, Istanbul, Turkey, July 21-23,
- Chang, S.-F. (1995). Compressed-domain techniques for image/video indexing and manipulation. Proceedings of the International Conference on Image Processing,
- Chen, T., Zhang, Z., You, Q., Fang, C., Wang, Z., Jin, H., & Luo, J. (2018). "Factual" or "Emotional": Stylized Image Captioning with Adaptive Learning and Attention. Proceedings of the European conference on computer vision (ECCV),
- Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
- Çaylı, Ö., Kılıç, V., Onan, A., & Wang, W. (2022). Auxiliary classifier based residual rnn for image captioning. 30th European Signal Processing Conference (EUSIPCO),
- Çaylı, Ö., Liu, X., Kılıç, V., & Wang, W. (2023). Knowledge Distillation for Efficient Audio-Visual Video Captioning. arXiv preprint arXiv:2306.09947.
- Deselaers, T., Keysers, D., & Ney, H. (2004). Features for image retrieval: A quantitative comparison. Pattern Recognition: 26th DAGM Symposium, Tübingen, Germany, August 30-September 1. Proceedings 26,
- Doǧan, V., Isık, T., Kılıç, V., & Horzum, N. (2022). A field-deployable water quality monitoring with machine learning-based smartphone colorimetry. Analytical Methods, 14(35), 3458-3466.
- Farhadi, A., Hejrati, M., Sadeghi, M. A., Young, P., Rashtchian, C., Hockenmaier, J., & Forsyth, D. (2010). Every picture tells a story: Generating sentences from images. Computer Vision–ECCV: 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, Proceedings, Part IV 11,
- Fetiler, B., Çaylı, Ö., Moral, Ö. T., Kılıç, V., & Onan, A. (2021). Video captioning based on multi-layer gated recurrent unit for smartphones. Avrupa Bilim ve Teknoloji Dergisi(32), 221-226.
- Ganesan, K. (2018). Rouge 2.0: Updated and improved measures for evaluation of summarization tasks. arXiv preprint arXiv:1803.01937.
- Guo, Y., Liu, Y., Georgiou, T., & Lew, M. S. (2018). A review of semantic segmentation using deep neural networks. International journal of multimedia information retrieval, 7, 87-93.
- Gurari, D., Zhao, Y., Zhang, M., & Bhattacharya, N. (2020). Captioning images taken by people who are blind. Computer Vision–ECCV: 16th European Conference, Glasgow, UK, August 23–28, Proceedings, Part XVII 16,
- He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. Proceedings of the IEEE international conference on computer vision,
- Ibarra, F. F., Kardan, O., Hunter, M. R., Kotabe, H. P., Meyer, F. A., & Berman, M. G. (2017). Image feature types and their predictions of aesthetic preference and naturalness. Frontiers in Psychology, 8, 632.
- Jiang, W., Ma, L., Chen, X., Zhang, H., & Liu, W. (2018). Learning to guide decoding for image captioning. Proceedings of the AAAI Conference on Artificial Intelligence,
- Keskin, R., Çaylı, Ö., Moral, Ö. T., Kılıç, V., & Onan, A. (2021). A benchmark for feature-injection architectures in image captioning. Avrupa Bilim ve Teknoloji Dergisi(31), 461-468.
- Kılıç, V. (2021). Deep gated recurrent unit for smartphone-based image captioning. Sakarya University Journal of Computer and Information Sciences, 4(2), 181-191.
- Kılıç, V., Mercan, Ö. B., Tetik, M., Kap, Ö., & Horzum, N. (2022). Non-enzymatic colorimetric glucose detection based on Au/Ag nanoparticles using smartphone and machine learning. Analytical Sciences, 38(2), 347-358.
- Kılıç, V., Zhong, X., Barnard, M., Wang, W., & Kittler, J. (2014). Audio-visual tracking of a variable number of speakers with a random finite set approach. 17th International Conference on Information Fusion (FUSION),
- Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. Computer Vision–ECCV: 13th European Conference, Zurich, Switzerland, September 6-12, Proceedings, Part V 13,
- Liu, S., Qi, L., Qin, H., Shi, J., & Jia, J. (2018). Path aggregation network for instance segmentation. Proceedings of the IEEE conference on computer vision and pattern recognition,
- Makav, B., & Kılıç, V. (2019). Smartphone-based image captioning for visually and hearing impaired. 11th international conference on electrical and electronics engineering (ELECO),
- Mercan, Ö. B., Doğan, V., & Kılıç, V. (2020). Time Series Analysis based Machine Learning Classification for Blood Sugar Levels. Medical Technologies Congress (TIPTEKNO),
- Mercan, Ö. B., & Kılıç, V. (2020). Deep learning based colorimetric classification of glucose with au-ag nanoparticles using smartphone. Medical Technologies Congress (TIPTEKNO),
- Moral, Ö. T., Kılıç, V., Onan, A., & Wang, W. (2022). Automated Image Captioning with Multi-layer Gated Recurrent Unit. 30th European Signal Processing Conference (EUSIPCO),
- Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). Bleu: a method for automatic evaluation of machine translation. Proceedings of the 40th annual meeting of the Association for Computational Linguistics,
- Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., & Antiga, L. (2019). Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32.
- Pu, B., Liu, Y., Zhu, N., Li, K., & Li, K. (2020). ED-ACNN: Novel attention convolutional neural network based on encoder–decoder framework for human traffic prediction. Applied Soft Computing, 97, 106688.
- Sayraci, B., Ağralı, M., & Kılıç, V. (2023). Artificial Intelligence Based Instance-Aware Semantic Lobe Segmentation on Chest Computed Tomography Images. Avrupa Bilim ve Teknoloji Dergisi(46), 109-115.
- Sun, X., Xiao, B., Wei, F., Liang, S., & Wei, Y. (2018). Integral human pose regression. Proceedings of the European conference on computer vision (ECCV),
- Tahir, H., Iftikhar, A., & Mumraiz, M. (2021). Forecasting COVID-19 via registration slips of patients using ResNet-101 and performance analysis and comparison of prediction for COVID-19 using Faster R-CNN, Mask R-CNN, and ResNet-50. International Conference on Advances in Electrical, Computing, Communication and Sustainable Technologies (ICAECT),
- Toshev, A., & Szegedy, C. (2014). Deeppose: Human pose estimation via deep neural networks. Proceedings of the IEEE conference on computer vision and pattern recognition,
- Vedantam, R., Lawrence Zitnick, C., & Parikh, D. (2015). Cider: Consensus-based image description evaluation. Proceedings of the IEEE conference on computer vision and pattern recognition,
- Wang, P., Chen, P., Yuan, Y., Liu, D., Huang, Z., Hou, X., & Cottrell, G. (2018). Understanding convolution for semantic segmentation. IEEE winter conference on applications of computer vision (WACV),
- Xi, D., Qin, Y., Luo, J., Pu, H., & Wang, Z. (2021). Multipath fusion mask R-CNN with double attention and its application into gear pitting detection. IEEE Transactions on Instrumentation and Measurement, 70, 1-11.
- Yang, M., Liu, J., Shen, Y., Zhao, Z., Chen, X., Wu, Q., & Li, C. (2020). An ensemble of generation-and retrieval-based image captioning with dual generator generative adversarial network. IEEE Transactions on Image Processing, 29, 9627-9640.
- You, Q., Jin, H., & Luo, J. (2018). Image captioning at will: A versatile scheme for effectively injecting sentiments into image descriptions. arXiv preprint arXiv:1801.10121.
- Yu, J., Li, J., Yu, Z., & Huang, Q. (2019). Multimodal transformer with multi-view visual representation for image captioning. IEEE transactions on circuits and systems for video technology, 30(12), 4467-4480.