A COMPREHENSIVE REVIEW ON USING OF DEEP LEARNING APPROACHES IN VIDEO CAPTIONING APPLICATIONS
Year 2020,
Volume: 8 Issue: 5, 271 - 289, 29.12.2020
Özlem Alpay
,
M. Ali Akcayol
Abstract
Video captioning is defined as creating captions for videos by automatically. It is an area of increasing interest as it includes both computer vision and natural language approaches. Producing expressions in natural language and combining them with image frames is a difficult process. Many approaches have been developed to solve this problem. In this study, a literature study about the developments in video captioning research is presented. The studies examined are examined in different categories according to the methods used. The methods are summarized and their strengths and limitations are analyzed. The methods are summarized and their strengths and limitations are analyzed. Deep learning is one of the most common methods used in this regard. Research has been conducted on the applicability of deep learning approaches in video labeling systems. Used datasets and performance evaluation criteria are analyzed. The developments in deep learning methods have provided new approaches to video captioning. The successful results have been observed with the use of deep learning methods in studies on video captioning.
References
- Aafaq, N., Akhtar, N., Liu, W., Mian, A., 2019. Empirical Autopsy of Deep Video Captioning Frameworks, arxiv.org/pdf/1911.09345v1
- Amaresh, M. and Chitrakala, S., 2019. Video Captioning using Deep Learning: An Overview of Methods, Datasets and Metrics, International Conference on Communication and Signal Processing, India
- Ayers, D. and Shah, M. 2001. Monitoring human behavior from video taken in an office environment. Image and Vision Computing, 19(12),833–846.
- Baraldi, L., Grana, C. and Cucchiara, R., 2017. Hierarchical Boundary-Aware Neural Encoder for Video Captioning, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA
- Carreira, J., Zisserman, A., 2017. Quo Vadis, Action Recognition? A New Model and The Kinetics Dataset. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA
- Chen, D., Dolan, W., 2011, Collecting Highly Parallel Data For Paraphrase Evaluation. In ACL: Human Language Technologies, 1, 190-200.
- Chen, Y., Zhang, W., Wang, S., Li, L., Huang, Q., 2018. Saliency-Based Spatiotemporal Attention for Video Captioning, International Conference on Multimedia Big Data (BigMM), Xi'an, China
- Cho, K.,Van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y., 2014. Learning Phrase Representations Using Rnn Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078,
- Çtamak,B., Kuyu, M., Erdem, A., Erdem, E., 2019. MSVD-Turkish: A Large-Scale Dataset for Video Captioning in Turkish, 27th Signal Processing and Communications Applications Conference (SIU), Sivas, Türkiye
- Das, P., Xu, C., Doell, R. F., Corso. and J. J., 2013. A Thousand Frames in Just a Few Words: Lingual Description of Videos Through Latent Topics and Sparse Object Stitching. 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA
- Ding, S., Qu, S., Xi, Y., Wan, S., 2019. A Long Video Captioning Generation Algorithm for Big Video Data Retrieval, Future Generation Computer Systems 93, 583–595
- Elman, J. L., 1990. Finding Structure in time. Cognitive Science, 14(2), 179–211
- Gan, Z., Gan, C., Hez, X., Puy, Y., Tranz, K., Gaoz, J.,Cariny, L., Dengz, L., 2017. Semantic Compositional Networks for Visual Captioning, arXiv:1611.08002v2,
- Gao, L., Guo, Z., Zhang, H., Xu, X., Shen, H., 2017. Video Captioning With Attention-Based LSTM And Semantic Consistency, IEEE Transactions Multimedia, 19(9), 2045–2055
- Gella, S., Lewis, M. and Rohrbach. M., 2018. A Dataset for Telling the Stories of Social Media Videos. In Proc of the 2018 Conference on Empirical Methods in Natural Language Processing. 968-974
- Gers, F., Long Short-Term Memory in Recurrent Neural Networks, Ph.D. dissertation, Dept. Comput. Sci., Univ. Hannover, Hannover, Germany, 2001.
- Graves, A., Jaitly, N., 2014. Towards End-To-End Speech Recognition With Recurrent Neural Networks.31st International Conference on Machine Learning (ICML-14). 1764- 1772.
- Hochreiter S. and Schmidhuber, J., 1997. Long Short-Term Memory,Neural Computer, 9(8), 1735–1780.
- Jegham, I., Khalifa, A.B., Alouani, I.,Mahjoub, M.A., 2020. Vision-Based Human Action Recognition: An Overview and Real World Challenges, Forensic Science International: Digital Investigation, 32, 200901
- Kiros, R., Salakhutdinov, R., Zemel, R., 2014. Multimodal Neural Language Models, Proceedings of the 31st International Conference on Machine Learning, PMLR, 32(2), 595-603
- Krishna, R., Hata, K., Ren, F., Fei-Fei, L. and Niebles. J. C., 2017.Dense-Captioning Events in Videos. In Proceedings of the IEEE International Conference on Computer Vision , Venice, Italy
- Kulkarni, G., Premraj, V., Ordonez, V., Dhar, S., Li, S.,Choi, Y., 2013. Babytalk: Understanding and Generating Simple Image Descriptions, IEEE Transactions On Pattern Analysis and Machine Intelligence, 35 (12)2891–2903.
- Kuznetsova, P.,Ordonez, V., Berg, T.L., Choi, Y., 2014. TREETALK: Composition and Compression of Trees for Image Descriptions, Transactions of the Association for Computational Linguistics 2 (1), 351–362.
- Kojima A., and Tamura, T.,2002, Natural Language Description of Human Activities from Video Images Based on Concept Hierarchy of Actions, International Journal of Computer Vision 50(2), 171–184
- Lavie, A., Agarwall, A., 2007. Meteor: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments,2007, Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic
- Li, H., Song, D., Liao, L., Peng, C., 2019. Revnet: Brıng Revıewıng Into Vıdeo Captioning for a Better Descrıptıon, IEEE International Conference on Multimedia and Expo (ICME) Chiana
- Li, S., Tao, Z., Li, K., Fu, Y., 2019. Visual to Text: Survey of Image and Video Captioning, IEEE Transactions on Emerging Topics in Computational Intelligence, 3(4), 297-312.
- Li, W., Guo, D., Fang, X., (2018). Multimodal Architecture for Video Captioning with Memory Networks and an Attention Mechanism, Pattern Recognition Letters 105, 23–29
- Lin, C.Y., 2004. ROUGE: A Package for Automatic Evaluation of Summaries, In Proceedings of Workshop on Text Summarization Branches Out, Post-Conference Workshop of ACL 2004, Barcelona, Spain
- Liu, J., Wang, Z., Liu, H., 2020. HDS-SP: A Novel Descriptor For Skeleton-Based Human Action Recognition, Neurocomputing, 385,22-32
- Ma, M., Wang, B., 2017. A Grey Relational Analysis based Evaluation Metric for Image Captioning and Video Captioning, 2017 International Conference on Grey Systems and Intelligent Services (GSIS), Stockholm, Sweden
- Nabati, M., Behrad, A., 2020. Video Captioning Using Boosted And Parallel Long Short-Term Memory Networks, Computer Vision and Image Understanding, 190, 102840.
- Nan, W., Zhigang, Z., Huan, L., Jingqi, M., Jiajun, Z., Guangxue, D., 2019. Gesture Recognition Based on Deep Learning in Complex Scenes, 2019 Chinese Control And Decision Conference (CCDC). Nanchang, China, China
- Özer E.G., Karapınar İ.N., Başbuğ S., Turan S., Utku A., Akcayol M.A., 2020. Deep learning based new model for video captioning, International Journal of Advanced Computer Science and Applications, 11(3), 1-6.
- Pan, P., Xu, Z., Yang, Y., Wu, F.,Zhuang, Y., 2016. Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA
- Park, J., Song, C., Han. J-H., (2017), A Study of Evaluation Metrics and Datasets for Video Captioning. International Conference Intelligent Informatics and Biomedical Sciences (ICIIBMS), Okinawa, Japan
- Regneri, M., Rohrbach, M., Wetzel, D., Thater, S., Schiele, B. and Pinkal. M., 2013. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics (TACL) 1, 25–36,
- Rohrbach, A., Rohrbach, M., Qiu, W., Friedrich, A., Pinkal, M., Schiele, B., 2014. Coherent Multi-Sentence Video Description with Variable Level of Detail. Pattern Recognition, 184-195, Germany
- Rohrbach, A., Rohrbach, M., Tandon, N. and Schiele. B., 2015. A Dataset for Movie Description. arXiv.org/abs/1501.02530
- Rohrbach, M., Regneri, M., Andriluka, M., Amin, S., Pinkal, M. and Schiele. B., 2012. Script Data for Attribute-Based Recognition of Composite Activities. In European Conference on Computer Vision, 144-157, Springer
- Rohrbach, M., Qiu, W., Titov, I., 2013. Translating Video Content to Natural Language Descriptions, 2013. IEEE International Conference on Computer Vision, Sydney, NSW, Australia
- Sigurdsson, G. A., Varol, G., Wang, X., Farhadi, A., Laptev, I., Gupta, A., 2016. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding. arxiv.org/abs/1604.01753
- Smirnov, E.A., Timoshenko, D.M., Andrianov, S.N., 2014. Comparison of Regularization Methods for ImageNet Classification with Deep Convolutional Neural Networks, AASRI Procedia, 6,89-94
- Song, J., Guo, Y., Gao, L.,Li, X., Hanjalic, A.,Shen, H.T., (2015), From Deterministic to Generative: Multimodal Stochastic RNNs for Video Captioning, Journal Of Latex Class Files, 14(8),1-10.
- Su, J., 2018. Study of Video Captioning Problem
- Szegedy, C., Ioffe, S., Vanhoucke, S. Alemi, A., 2016. Inception-v4, Inception-Resnet And The Impact Of Residual Connections on Learning. /arxiv.org/abs/1602.07261
- Şeker, A., Diri, B., Balık, H.H., 2017. Derin Öğrenme Yöntemleri ve Uygulamaları Hakkında Bir İnceleme, Gazi Mühendislik Bilimleri Dergisi, 3(3). 47-64
- Tang, P., Wang, H., Kwong, S., 2017. G- MS2F: Googlenet Based Multi-Stage Feature Fusion Of Deep CNN For Scene Recognition, Neurocomputing, 225,188-197
- Torabi, A., Pal, C., Larochelle, H. and Courville. A., 2015. Using Descriptive Video Services to Create a Large Data Source for Video Annotation Research. arXiv:1503.01070
- Trabelsi, A., Elouedi, Z., Lefevre, E., 2019. Decision Tree Classifiers For Evidential Attribute Values And Class Labels, Fuzzy Sets and Systems, 366,46-62
- Tran, D., Bourdev, L., Fergus, R., Torresani,L., Paluri, M., 2015. Learning Spatiotemporal Features with 3D Convolutional Networks. arxiv.org/abs/1412.0767
- Unal, ME, Citamak, B., Yagcioglu, S., Erdem, A., Erdem, E., Cinbis, N.I., Cakici, R., 2016. Tasviret: A Benchmark Dataset for Automatic Turkish Description Generation From Images, 2016 24th Signal Processing and Communication Application Conference (SIU), Zonguldak, Turkey
- Vedantam, R., Zitnick, C.L., Parikh, D., 2015. Cider: Consensus-Based İmage Description Evaluation, IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Boston, MA, USA.
- Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., Saenko, K., 2015. Sequence To Sequence—Video To Text, 2015 IEEE International Conference Computer Vision, Santiago, Chile
- Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R. and K. Saenko. 2014. Translating Videos to Natural Language Using Deep Recurrent Neural Networks. arXiv preprint arXiv:1412.4729, 2014
- Vinyals, O., Toshev, A., Bengio, S., Erhan, D., 2015. Show and Tell: A Neural Image Captioning Generator, 28th IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA
- Wang, B., Ma, L., Zhang, W., Liu, W., 2018. Reconstruction Network for Video Captioning, IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City USA
- Wang, H., Gao, C., Han, Y., (2018). Sequence in Sequence for Video Captioning, Pattern Recognition Letters, 130, 327-334
- Wu, A., Han, Y., Yang, Y., Hu, Q., Wu, F., 2019. Convolutional Reconstruction-to-Sequence for Video Captioning, IEEE Transactions on Circuits and Systems for Video Technology, 30(11), 4299 - 4308
- Wu, X., Sahoo, D., Hoi, S.C.H., 2020. Recent Advances in Deep Learning For Object Detection, Neurocomputing, 396,39-64
- Wu, Z., Yao,T., Fu, Y., Jiang, Y.-G., 2016. Deep Learning for Video Classification and Captioning, Frontiers of Multimedia Research 3-29
- Xiao, H., Shi, J., 2019. Video Captioning with Adaptive Attention and Mixed Loss Optimization, IEEE Access, 7, 135757-13769.
- Xu, J., Mei, T., Yao, T., Rui, Y., 2016. Msr-vtt: A Large Video Description Dataset for Bridging Video and Language. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA
- Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R. and Bengio, Y., 2016. Show, Attend and Tell: Neural Image Captioning Generation with Visual Attention, Proceedings of the 32nd International Conference on Machine Learning, (PMLR) 37, 2048-2057,
- Xu, N., Liu, A., 2018. Dual-Stream Recurrent Neural Network for Video Captioning, IEEE Transactions On Circuits And Systems For Video Technology, 29(8), 2482-2493
- Yang, Y., Zhou, J., Jiangbo A., Bin, Y., Hanjalic, A., Shen, H.T., Ji, Y., 2018. Video Captioning by Adversarial LSTM, IEEE Transactions on Image Processing, 27(11), 5600-5611
- Yang, Z., Yuan, Y., Wu, Y., Salakhutdinov, R., Cohen, W. W., 2016. Review Networks for Caption Generation, 30th International Conference on Neural Information Processing, Barcelona, SPAIN
- Yang, Z., Yue, J., Li, Z., Zhu, L., 2018. Vegetable Image Retrieval with Fine-tuning VGG Model and Image Hash, IFAC-PapersOnLine, 51(17), 280-285.
- Yao, L., Cho, K., Ballas, N., Pa´ı, C., Courville, A., 2015. Describing Videos By Exploiting Temporal Structure, International Conference on Computer Vision (ICCV), Santiago, Chile
- Yao, L., Torabi, A., Cho, K., Ballas, N., Pal, C., Larochelle, H., Courville, A., 2015. Describing Videos By Exploiting Temporal Structure, 2015 IEEE International Conference Computer Vision, Santiago, Chile
- Yao, T., Pan, Y., Li, Y., Qiu, Z., Mei, T., 2016. Boosting İmage Captioning with Attributes, in Proc. IEEE Int. Conference Computer Vision, Venice, Italy
- Yingwei, P., Mei, T., Yao,T., Li, H., Rui. Y., 2015. Jointly Modeling Embedding and Translation to Bridge Video and Language. arxiv.org/abs/1505.01861
- Yingwei, P., Yao, T., Li, H., Mei. T., 2016. Video Captioning with Transferred Semantic Attributes. arxiv.org/abs/1611.07675
- You, Q., Jin, H., Wang, Z., Fang, C., Luo J., 2016. Image Captioning with Semantic Attention, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA
- Yu, Y., Choi, J., Kim, Y., Yoo, K., Lee, S.-H., Kim, G., 2017. Supervising Neural Attention Models for Video Captioning by Human Gaze Data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017). Honolulu, Hawaii
- Yuan, J., Xiong, H-C., Xiao, Y., Guan, W., Wang, M., Hong, R., Li, Z.Y., 2019. Gated CNN: Integrating Multi-Scale Feature Layers For Object Detection, Pattern Recognition 105, 107131
- Zeng, K., Chen, T., Niebles, J. C., Sun, M., 2016. Title Generation for User Generated Videos. arxiv.org/abs/1608.07068
- Zhao, H., Li, X., 2017. A Cost Sensitive Decision Tree Algorithm Based On Weighted Class Distribution With Batch Deleting Attribute Mechanism, Information Sciences, 378, 303-316
- Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., Oliva, A., 2014. Learning Deep Features For Scene Recognition Using Places Database, Proceedings of the Advances in Neural Information Processing Systems (NIPS). 487–495.
- Zhou, L., Kalantidis, Y., Chen, X., Corso, J. J., Rohrbach, M., 2018. Grounded video description. arxiv.org/abs/1812.06587
VİDEO ETİKETLEME UYGULAMALARINDA DERİN ÖĞRENME YAKLAŞIMLARININ KULLANILMASI ÜZERİNE KAPSAMLI BİR İNCELEME
Year 2020,
Volume: 8 Issue: 5, 271 - 289, 29.12.2020
Özlem Alpay
,
M. Ali Akcayol
Abstract
Video etiketleme, otomatik bir şekilde videolar için etiket oluşturma olarak tanımlanmaktadır. Hem bilgisayar görmesi hem de doğal dil yaklaşımlarını birlikte içerdiği için gittikçe ilgi çeken bir alan olmaktadır İfadeleri doğal dilde üretip ve onları görüntü çerçeveleri ile birleştirmek zorlu bir süreçtir. Bu sorunu çözmek için çeşitli yaklaşımlar geliştirilmiştir. Bu çalışmada, video etiketleme araştırmalarındaki gelişmeler hakkında bir literatür çalışması sunulmuştur. İncelenen çalışmalar kullanılan yöntemlere göre farklı kategorilerde incelenmiştir. Yöntemler özetlenmiş, güçlü ve sınırlı yönleri analiz edilmiştir. Derin öğrenme, bu konuda kullanılan en yaygın yöntemlerden biridir. Video etiketleme sistemlerinde derin öğrenme yaklaşımlarının uygulanabilirliği üzerine araştırmalar yapılmıştır. Bu konuda kullanılan veri setleri, performans değerlendirme kriterleri karşılaştırılarak analiz edilmiştir. Derin öğrenme yöntemlerindeki gelişmeler video etiketleme konusunda yeni yaklaşımlar sağlamıştır. Video etiketleme konusunda yapılan çalışmalarda derin öğrenme yöntemlerinin kullanılması ile başarılı sonuçlar elde edilmiştir
References
- Aafaq, N., Akhtar, N., Liu, W., Mian, A., 2019. Empirical Autopsy of Deep Video Captioning Frameworks, arxiv.org/pdf/1911.09345v1
- Amaresh, M. and Chitrakala, S., 2019. Video Captioning using Deep Learning: An Overview of Methods, Datasets and Metrics, International Conference on Communication and Signal Processing, India
- Ayers, D. and Shah, M. 2001. Monitoring human behavior from video taken in an office environment. Image and Vision Computing, 19(12),833–846.
- Baraldi, L., Grana, C. and Cucchiara, R., 2017. Hierarchical Boundary-Aware Neural Encoder for Video Captioning, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA
- Carreira, J., Zisserman, A., 2017. Quo Vadis, Action Recognition? A New Model and The Kinetics Dataset. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA
- Chen, D., Dolan, W., 2011, Collecting Highly Parallel Data For Paraphrase Evaluation. In ACL: Human Language Technologies, 1, 190-200.
- Chen, Y., Zhang, W., Wang, S., Li, L., Huang, Q., 2018. Saliency-Based Spatiotemporal Attention for Video Captioning, International Conference on Multimedia Big Data (BigMM), Xi'an, China
- Cho, K.,Van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y., 2014. Learning Phrase Representations Using Rnn Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078,
- Çtamak,B., Kuyu, M., Erdem, A., Erdem, E., 2019. MSVD-Turkish: A Large-Scale Dataset for Video Captioning in Turkish, 27th Signal Processing and Communications Applications Conference (SIU), Sivas, Türkiye
- Das, P., Xu, C., Doell, R. F., Corso. and J. J., 2013. A Thousand Frames in Just a Few Words: Lingual Description of Videos Through Latent Topics and Sparse Object Stitching. 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA
- Ding, S., Qu, S., Xi, Y., Wan, S., 2019. A Long Video Captioning Generation Algorithm for Big Video Data Retrieval, Future Generation Computer Systems 93, 583–595
- Elman, J. L., 1990. Finding Structure in time. Cognitive Science, 14(2), 179–211
- Gan, Z., Gan, C., Hez, X., Puy, Y., Tranz, K., Gaoz, J.,Cariny, L., Dengz, L., 2017. Semantic Compositional Networks for Visual Captioning, arXiv:1611.08002v2,
- Gao, L., Guo, Z., Zhang, H., Xu, X., Shen, H., 2017. Video Captioning With Attention-Based LSTM And Semantic Consistency, IEEE Transactions Multimedia, 19(9), 2045–2055
- Gella, S., Lewis, M. and Rohrbach. M., 2018. A Dataset for Telling the Stories of Social Media Videos. In Proc of the 2018 Conference on Empirical Methods in Natural Language Processing. 968-974
- Gers, F., Long Short-Term Memory in Recurrent Neural Networks, Ph.D. dissertation, Dept. Comput. Sci., Univ. Hannover, Hannover, Germany, 2001.
- Graves, A., Jaitly, N., 2014. Towards End-To-End Speech Recognition With Recurrent Neural Networks.31st International Conference on Machine Learning (ICML-14). 1764- 1772.
- Hochreiter S. and Schmidhuber, J., 1997. Long Short-Term Memory,Neural Computer, 9(8), 1735–1780.
- Jegham, I., Khalifa, A.B., Alouani, I.,Mahjoub, M.A., 2020. Vision-Based Human Action Recognition: An Overview and Real World Challenges, Forensic Science International: Digital Investigation, 32, 200901
- Kiros, R., Salakhutdinov, R., Zemel, R., 2014. Multimodal Neural Language Models, Proceedings of the 31st International Conference on Machine Learning, PMLR, 32(2), 595-603
- Krishna, R., Hata, K., Ren, F., Fei-Fei, L. and Niebles. J. C., 2017.Dense-Captioning Events in Videos. In Proceedings of the IEEE International Conference on Computer Vision , Venice, Italy
- Kulkarni, G., Premraj, V., Ordonez, V., Dhar, S., Li, S.,Choi, Y., 2013. Babytalk: Understanding and Generating Simple Image Descriptions, IEEE Transactions On Pattern Analysis and Machine Intelligence, 35 (12)2891–2903.
- Kuznetsova, P.,Ordonez, V., Berg, T.L., Choi, Y., 2014. TREETALK: Composition and Compression of Trees for Image Descriptions, Transactions of the Association for Computational Linguistics 2 (1), 351–362.
- Kojima A., and Tamura, T.,2002, Natural Language Description of Human Activities from Video Images Based on Concept Hierarchy of Actions, International Journal of Computer Vision 50(2), 171–184
- Lavie, A., Agarwall, A., 2007. Meteor: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments,2007, Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic
- Li, H., Song, D., Liao, L., Peng, C., 2019. Revnet: Brıng Revıewıng Into Vıdeo Captioning for a Better Descrıptıon, IEEE International Conference on Multimedia and Expo (ICME) Chiana
- Li, S., Tao, Z., Li, K., Fu, Y., 2019. Visual to Text: Survey of Image and Video Captioning, IEEE Transactions on Emerging Topics in Computational Intelligence, 3(4), 297-312.
- Li, W., Guo, D., Fang, X., (2018). Multimodal Architecture for Video Captioning with Memory Networks and an Attention Mechanism, Pattern Recognition Letters 105, 23–29
- Lin, C.Y., 2004. ROUGE: A Package for Automatic Evaluation of Summaries, In Proceedings of Workshop on Text Summarization Branches Out, Post-Conference Workshop of ACL 2004, Barcelona, Spain
- Liu, J., Wang, Z., Liu, H., 2020. HDS-SP: A Novel Descriptor For Skeleton-Based Human Action Recognition, Neurocomputing, 385,22-32
- Ma, M., Wang, B., 2017. A Grey Relational Analysis based Evaluation Metric for Image Captioning and Video Captioning, 2017 International Conference on Grey Systems and Intelligent Services (GSIS), Stockholm, Sweden
- Nabati, M., Behrad, A., 2020. Video Captioning Using Boosted And Parallel Long Short-Term Memory Networks, Computer Vision and Image Understanding, 190, 102840.
- Nan, W., Zhigang, Z., Huan, L., Jingqi, M., Jiajun, Z., Guangxue, D., 2019. Gesture Recognition Based on Deep Learning in Complex Scenes, 2019 Chinese Control And Decision Conference (CCDC). Nanchang, China, China
- Özer E.G., Karapınar İ.N., Başbuğ S., Turan S., Utku A., Akcayol M.A., 2020. Deep learning based new model for video captioning, International Journal of Advanced Computer Science and Applications, 11(3), 1-6.
- Pan, P., Xu, Z., Yang, Y., Wu, F.,Zhuang, Y., 2016. Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA
- Park, J., Song, C., Han. J-H., (2017), A Study of Evaluation Metrics and Datasets for Video Captioning. International Conference Intelligent Informatics and Biomedical Sciences (ICIIBMS), Okinawa, Japan
- Regneri, M., Rohrbach, M., Wetzel, D., Thater, S., Schiele, B. and Pinkal. M., 2013. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics (TACL) 1, 25–36,
- Rohrbach, A., Rohrbach, M., Qiu, W., Friedrich, A., Pinkal, M., Schiele, B., 2014. Coherent Multi-Sentence Video Description with Variable Level of Detail. Pattern Recognition, 184-195, Germany
- Rohrbach, A., Rohrbach, M., Tandon, N. and Schiele. B., 2015. A Dataset for Movie Description. arXiv.org/abs/1501.02530
- Rohrbach, M., Regneri, M., Andriluka, M., Amin, S., Pinkal, M. and Schiele. B., 2012. Script Data for Attribute-Based Recognition of Composite Activities. In European Conference on Computer Vision, 144-157, Springer
- Rohrbach, M., Qiu, W., Titov, I., 2013. Translating Video Content to Natural Language Descriptions, 2013. IEEE International Conference on Computer Vision, Sydney, NSW, Australia
- Sigurdsson, G. A., Varol, G., Wang, X., Farhadi, A., Laptev, I., Gupta, A., 2016. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding. arxiv.org/abs/1604.01753
- Smirnov, E.A., Timoshenko, D.M., Andrianov, S.N., 2014. Comparison of Regularization Methods for ImageNet Classification with Deep Convolutional Neural Networks, AASRI Procedia, 6,89-94
- Song, J., Guo, Y., Gao, L.,Li, X., Hanjalic, A.,Shen, H.T., (2015), From Deterministic to Generative: Multimodal Stochastic RNNs for Video Captioning, Journal Of Latex Class Files, 14(8),1-10.
- Su, J., 2018. Study of Video Captioning Problem
- Szegedy, C., Ioffe, S., Vanhoucke, S. Alemi, A., 2016. Inception-v4, Inception-Resnet And The Impact Of Residual Connections on Learning. /arxiv.org/abs/1602.07261
- Şeker, A., Diri, B., Balık, H.H., 2017. Derin Öğrenme Yöntemleri ve Uygulamaları Hakkında Bir İnceleme, Gazi Mühendislik Bilimleri Dergisi, 3(3). 47-64
- Tang, P., Wang, H., Kwong, S., 2017. G- MS2F: Googlenet Based Multi-Stage Feature Fusion Of Deep CNN For Scene Recognition, Neurocomputing, 225,188-197
- Torabi, A., Pal, C., Larochelle, H. and Courville. A., 2015. Using Descriptive Video Services to Create a Large Data Source for Video Annotation Research. arXiv:1503.01070
- Trabelsi, A., Elouedi, Z., Lefevre, E., 2019. Decision Tree Classifiers For Evidential Attribute Values And Class Labels, Fuzzy Sets and Systems, 366,46-62
- Tran, D., Bourdev, L., Fergus, R., Torresani,L., Paluri, M., 2015. Learning Spatiotemporal Features with 3D Convolutional Networks. arxiv.org/abs/1412.0767
- Unal, ME, Citamak, B., Yagcioglu, S., Erdem, A., Erdem, E., Cinbis, N.I., Cakici, R., 2016. Tasviret: A Benchmark Dataset for Automatic Turkish Description Generation From Images, 2016 24th Signal Processing and Communication Application Conference (SIU), Zonguldak, Turkey
- Vedantam, R., Zitnick, C.L., Parikh, D., 2015. Cider: Consensus-Based İmage Description Evaluation, IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Boston, MA, USA.
- Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., Saenko, K., 2015. Sequence To Sequence—Video To Text, 2015 IEEE International Conference Computer Vision, Santiago, Chile
- Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R. and K. Saenko. 2014. Translating Videos to Natural Language Using Deep Recurrent Neural Networks. arXiv preprint arXiv:1412.4729, 2014
- Vinyals, O., Toshev, A., Bengio, S., Erhan, D., 2015. Show and Tell: A Neural Image Captioning Generator, 28th IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA
- Wang, B., Ma, L., Zhang, W., Liu, W., 2018. Reconstruction Network for Video Captioning, IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City USA
- Wang, H., Gao, C., Han, Y., (2018). Sequence in Sequence for Video Captioning, Pattern Recognition Letters, 130, 327-334
- Wu, A., Han, Y., Yang, Y., Hu, Q., Wu, F., 2019. Convolutional Reconstruction-to-Sequence for Video Captioning, IEEE Transactions on Circuits and Systems for Video Technology, 30(11), 4299 - 4308
- Wu, X., Sahoo, D., Hoi, S.C.H., 2020. Recent Advances in Deep Learning For Object Detection, Neurocomputing, 396,39-64
- Wu, Z., Yao,T., Fu, Y., Jiang, Y.-G., 2016. Deep Learning for Video Classification and Captioning, Frontiers of Multimedia Research 3-29
- Xiao, H., Shi, J., 2019. Video Captioning with Adaptive Attention and Mixed Loss Optimization, IEEE Access, 7, 135757-13769.
- Xu, J., Mei, T., Yao, T., Rui, Y., 2016. Msr-vtt: A Large Video Description Dataset for Bridging Video and Language. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA
- Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R. and Bengio, Y., 2016. Show, Attend and Tell: Neural Image Captioning Generation with Visual Attention, Proceedings of the 32nd International Conference on Machine Learning, (PMLR) 37, 2048-2057,
- Xu, N., Liu, A., 2018. Dual-Stream Recurrent Neural Network for Video Captioning, IEEE Transactions On Circuits And Systems For Video Technology, 29(8), 2482-2493
- Yang, Y., Zhou, J., Jiangbo A., Bin, Y., Hanjalic, A., Shen, H.T., Ji, Y., 2018. Video Captioning by Adversarial LSTM, IEEE Transactions on Image Processing, 27(11), 5600-5611
- Yang, Z., Yuan, Y., Wu, Y., Salakhutdinov, R., Cohen, W. W., 2016. Review Networks for Caption Generation, 30th International Conference on Neural Information Processing, Barcelona, SPAIN
- Yang, Z., Yue, J., Li, Z., Zhu, L., 2018. Vegetable Image Retrieval with Fine-tuning VGG Model and Image Hash, IFAC-PapersOnLine, 51(17), 280-285.
- Yao, L., Cho, K., Ballas, N., Pa´ı, C., Courville, A., 2015. Describing Videos By Exploiting Temporal Structure, International Conference on Computer Vision (ICCV), Santiago, Chile
- Yao, L., Torabi, A., Cho, K., Ballas, N., Pal, C., Larochelle, H., Courville, A., 2015. Describing Videos By Exploiting Temporal Structure, 2015 IEEE International Conference Computer Vision, Santiago, Chile
- Yao, T., Pan, Y., Li, Y., Qiu, Z., Mei, T., 2016. Boosting İmage Captioning with Attributes, in Proc. IEEE Int. Conference Computer Vision, Venice, Italy
- Yingwei, P., Mei, T., Yao,T., Li, H., Rui. Y., 2015. Jointly Modeling Embedding and Translation to Bridge Video and Language. arxiv.org/abs/1505.01861
- Yingwei, P., Yao, T., Li, H., Mei. T., 2016. Video Captioning with Transferred Semantic Attributes. arxiv.org/abs/1611.07675
- You, Q., Jin, H., Wang, Z., Fang, C., Luo J., 2016. Image Captioning with Semantic Attention, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA
- Yu, Y., Choi, J., Kim, Y., Yoo, K., Lee, S.-H., Kim, G., 2017. Supervising Neural Attention Models for Video Captioning by Human Gaze Data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017). Honolulu, Hawaii
- Yuan, J., Xiong, H-C., Xiao, Y., Guan, W., Wang, M., Hong, R., Li, Z.Y., 2019. Gated CNN: Integrating Multi-Scale Feature Layers For Object Detection, Pattern Recognition 105, 107131
- Zeng, K., Chen, T., Niebles, J. C., Sun, M., 2016. Title Generation for User Generated Videos. arxiv.org/abs/1608.07068
- Zhao, H., Li, X., 2017. A Cost Sensitive Decision Tree Algorithm Based On Weighted Class Distribution With Batch Deleting Attribute Mechanism, Information Sciences, 378, 303-316
- Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., Oliva, A., 2014. Learning Deep Features For Scene Recognition Using Places Database, Proceedings of the Advances in Neural Information Processing Systems (NIPS). 487–495.
- Zhou, L., Kalantidis, Y., Chen, X., Corso, J. J., Rohrbach, M., 2018. Grounded video description. arxiv.org/abs/1812.06587