Code Clone Detection with Convolutional Neural Networks

Harun Dişli; Ayşe Tosun

doi:10.17671/gazibtd.541476

Research Article

Year 2020, Volume 13, Issue 1, 1 - 12, 31.01.2020

Cited By: 1

Abstract

Similar or identical code fragments that arise from copying and reusing code within a code base are called code clones. Although many studies have addressed clone detection, most rely on string comparison techniques, and very few take advantage of learning-based approaches such as deep learning. This paper proposes a new approach based on the convolutional neural network, a popular and successful image classification technique. The approach tokenizes each candidate clone pair to generate image files, and a convolutional neural network then classifies these images with the labels “clone” and “not clone”. To train and test the network, clone and non-clone pairs are chosen from a public database of six million methods. The approach achieves 99% accuracy and detects clones and non-clones effectively, with false alarm rates of 2-5% at method granularity.
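The tokenize-then-render step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how tokens are turned into pixels, so everything below is an assumption made for illustration: the regex tokenizer, the CRC32-based mapping of tokens to grayscale values, the 8x4 grid size, and the function names `tokenize`, `token_to_pixel`, `method_to_rows`, and `pair_to_image` are all hypothetical and are not the authors' implementation.

```python
import re
import zlib

# Hypothetical tokenizer: identifiers, integer literals, or any single symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(source):
    """Split source text into a flat list of lexical tokens."""
    return TOKEN_RE.findall(source)


def token_to_pixel(token):
    """Map a token to a grayscale value in 0..255 (assumed encoding).

    CRC32 is used instead of hash() so the value is stable across runs.
    """
    return zlib.crc32(token.encode("utf-8")) % 256


def method_to_rows(source, width, height):
    """Encode one method as a height x width grid, truncated or zero-padded."""
    pixels = [token_to_pixel(t) for t in tokenize(source)][: width * height]
    pixels += [0] * (width * height - len(pixels))
    return [pixels[r * width:(r + 1) * width] for r in range(height)]


def pair_to_image(method_a, method_b, width=8, height=4):
    """Stack the grids of both methods vertically into one image-like grid."""
    return (method_to_rows(method_a, width, height)
            + method_to_rows(method_b, width, height))


# Two small methods that differ only in identifier names.
a = "int add(int x, int y) { return x + y; }"
b = "int sum(int p, int q) { return p + q; }"
image = pair_to_image(a, b)
print(len(image), "rows x", len(image[0]), "columns")
```

A grid like this could be written to an image file or passed directly to a convolutional classifier as a single-channel input. Stacking the two methods vertically is only one possible layout; side-by-side placement or separate channels would be equally plausible under the abstract's description.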

Keywords

code clone detection, deep learning, convolutional neural network

References

C. K. Roy and J. R. Cordy, “A Mutation / Injection-based Automatic Framework for Evaluating Code Clone Detection Tools”, 4th International Workshop on Mutation Analysis (MUTATION) in 2nd International Conference on Software Testing, Verification, and Validation Workshops. Denver, Colorado: IEEE Computer Society, 157–166, 1-4 April 2009.
A. Sheneamer and J. Kalita, “A survey of software clone detection techniques,” International Journal of Computer Applications, 137 (10), 1–21, 2016
Y. Jia, D. Binkley, M. Harman, J. Krinke, and M. Matsushita, “KClone: a proposed approach to fast precise code clone detection”, 3rd International Workshop on Software Clones (IWSC), 2009
C. K. Roy, J. R. Cordy, and R. Koschke. “Comparison and Evaluation of Code Clone Detection Techniques and Tools: A Qualitative Approach”, Sci. Comput. Program., 74(7), 470–495, 2009.
B. Lague, E. M. Merlo, J. Mayrand, J. Hudepohl, “Assessing the Benefits of Incorporating Function Clone Detection in a Development Process”, IEEE International Conference on Software Maintenance (ICSM), 314-321, Oct. 1997.
J. Johnson, “Visualizing textual redundancy in legacy source”, Conference of the Centre for advanced Studies on Collaborative research (CASCON), 171-183, 1994.
S. Ducasse, M. Rieger, S. Demeyer, “A language independent approach for detecting duplicated code”, 15th International Conference on Software Maintenance (ICSM), 109-118, 1999.
C.K. Roy, J.R. Cordy, “An empirical study of function clones in open source software systems”, 15th Working Conference on Reverse Engineering (WCRE), 81-90, 2008.
B. Baker, “A program for identifying duplicated code”, 24th Symposium on the Interface, Computing Science and Statistics, 49-57, 1992.
T. Kamiya, S. Kusumoto, K. Inoue, “CCFinder: A multilinguistic token-based code clone detection system for large scale source code”, IEEE Transactions on Software Engineering, 28(7), 654-670, 2002.
Z. Li, S. Lu, S. Myagmar, Y. Zhou, “CP-Miner: Finding copy-paste and related bugs in large-scale software code”, IEEE Transactions on Software Engineering, 32(3), 176-192, 2006.
T. Yamashina, H. Uwano, K. Fushida, Y. Kamei, M. Nagura, S. Kawaguchi, H. Iida, “SHINOBI: A real-time code clone detection tool for software maintenance”, Technical Report: NAIST-IS-TR2007011, Graduate School of Information Science, Nara Institute of Science and Technology, 2008.
I. Baxter, A. Yahin, L. Moura, M. Anna, “Clone detection using abstract syntax trees”, 14th International Conference on Software Maintenance (ICSM), 368-377, 1998.
L. Jiang, G. Misherghi, Z. Su, S. Glondu, “DECKARD: Scalable and accurate tree-based detection of code clones”, 29th International Conference on Software Engineering (ICSE), 96-105, 2007.
S. Ducasse, M. Rieger, S. Demeyer, “A language independent approach for detecting duplicated code”, 15th International Conference on Software Maintenance (ICSM), 109-118, 2009.
B. Baker, “On finding duplication and near-duplication in large software systems”, 2nd Working Conference on Reverse Engineering, 86-95, 1995.
R. Wettel, R. Marinescu, “Archeology of code duplication: Recovering duplication chains from small duplication fragments”, 7th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, 8, 2005.
K. Kontogiannis, “Evaluation experiments on the detection of programming patterns using software metrics”, 3rd Working Conference on Reverse Engineering, 44-54, 1997.
M. White, C. Vendome, M. Linares-Vásquez, D. Poshyvanyk, “Toward deep learning software repositories”, IEEE/ACM 12th Working Conference on Mining Software Repositories (MSR), 334–345, 2015.
B. Can, “LSTM Ağları ile Türkçe Kök Bulma”, Bilişim Teknolojileri Dergisi, 12(3), 183-193, 2019.
H.K. Dam, T. Tran, T. Pham, “A deep language model for software code”, arXiv preprint:1608.02715, 2016.
L. Li, H. Feng, W. Zhuang, N. Meng, B. Ryder, “CCLearner: A Deep Learning-Based Clone Detection Approach”, International Conference on Software Maintenance and Evolution (ICSME), 249–260, 2017.
C.K. Roy, J.R. Cordy, “Near-miss function clones in open source software: an empirical study”, Journal of Software: Evolution and Process, 22(3), 165–189, 2010.
T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean, “Distributed Representations of Words and Phrases and their Compositionality”, 26th International Conference on Neural Information Processing Systems, Nevada, USA, 3111-3119, 2013.
J. Svajlenko, J.F. Islam, I. Keivanloo, C.K. Roy, M.M. Mia, "Towards a Big Data Curated Benchmark of Inter-Project Code Clones", Early Research Achievements track of the 30th International Conference on Software Maintenance and Evolution (ICSME) Victoria, Canada, 2014.
Internet: F. Li, J. Johnson and S. Yeung, “Convolutional Neural Networks for Visual Recognition”, class at Stanford University, 2018, http://cs231n.github.io/convolutional-networks/
N. Davey, P. Barson, S. Field, R. Frank, “The development of a software clone detector”, International Journal of Applied Software Technology, 1(3/4), 219-236, 1995.
R. Komondoor, S. Horwitz, “Using slicing to identify duplication in source code”, 8th International Symposium on Static Analysis (SAS), 40-56, 2001.
M. White, M. Tufano, C. Vendome, and D. Poshyvanyk, “Deep learning code fragments for code clone detection,” 31st IEEE/ACM International Conference on Automated Software Engineering, 2016
Internet: ANTLR, http://www.antlr.org
A. Krizhevsky, I. Sutskever, G.E. Hinton, “ImageNet classification with deep convolutional neural networks”, International Conference on Neural Information Processing Systems (NIPS), 1106–1114, 2012
K. Simonyan, A. Zisserman, “Very deep convolutional networks for large-scale image recognition”, International Conference on Learning Representations, 2014.
S.E. Sahin, A. Tosun, “A Conceptual Replication on Predicting the Severity of Software Vulnerabilities”, International Conference on Evaluation and Assessment in Software Engineering (EASE), Copenhagen, 2019.
J. Rokui, “Autoassociative Signature Authentication Based on Recurrent Neural Network”, Artificial Intelligence and Soft Computing, Editors: L. Rutkowski, R. Scherer, M. Korytkowski, W. Pedrycz, R. Tadeusiewicz, J.M. Zurada, Springer, 88-96, 2018.
S. Agarwal, H.S. Sikchi, S. Rooj, S. Bhattacharya, A. Routray, “Illumination-Invariant Face Recognition by Fusing Thermal and Visual Images via Gradient Transfer”, Advances in Computer Vision, Editors: K. Arai and S. Kapoor, 658-670, 2020.
Internet: Y. LeCun, “Lenet, convolutional neural networks,” 2015, Available: http://yann.lecun.com/exdb/lenet/
Y. Bengio, X. Glorot, “Understanding the difficulty of training deep feedforward neural networks”, 13th International Conference on Artificial Intelligence and Statistics (AISTATS), 249–256, May 2010.
D. Kingma and J. Ba. “Adam: A method for stochastic optimization”, International Conference on Learning Representations, 2015.
M. Kızrak, B. Bolat, “Derin Öğrenme ile Kalabalık Analizi Üzerine Detaylı Bir Araştırma”, Bilişim Teknolojileri Dergisi, 11(3), 263-286, 2018.
C. Acı, A. Çırak, “Türkçe Haber Metinlerinin Konvolüsyonel Sinir Ağları ve Word2Vec Kullanılarak Sınıflandırılması”, Bilişim Teknolojileri Dergisi, 12(3), 219-228, 2019.

Konvolüsyonel Sinir Ağları İle Kod Klonlarının Tespiti (Turkish title; in English: Detection of Code Clones with Convolutional Neural Networks)

Abstract (translated from Turkish)

Similar or identical code fragments created by copying and reusing code during software development are called code clones. Although many studies have been conducted to detect these clones, they generally use string comparison techniques, and very few take advantage of deep learning, a popular research area. This paper proposes a new approach based on a popular and successful image classification method called the convolutional neural network. The method tokenizes each candidate clone pair to generate image files. The convolutional neural network is then used to classify these images with the labels “clone” or “not clone”. To train and test the network, samples were selected from a database containing six million Java methods. As a result, the approach effectively detects method-level clones with 95% accuracy.
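The accuracy and false alarm figures quoted in the abstracts can be related to a confusion matrix. The sketch below is only an illustration: it assumes that "false alarm rate" means the false positive rate FP / (FP + TN), which the abstracts do not define, and the counts are made-up values chosen for easy arithmetic, not results from the paper.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all pairs that were labeled correctly."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


def false_alarm_rate(fp, tn):
    """Fraction of true non-clone pairs wrongly flagged as clones.

    Assumes 'false alarm rate' means the false positive rate.
    """
    negatives = fp + tn
    if negatives == 0:
        raise ValueError("no negative examples")
    return fp / negatives


# Illustrative counts only (hypothetical, not taken from the paper).
tp, tn, fp, fn = 480, 470, 30, 20

acc = accuracy(tp, tn, fp, fn)      # (480 + 470) / 1000
far = false_alarm_rate(fp, tn)      # 30 / (30 + 470)
print(f"accuracy={acc:.2%} false_alarm_rate={far:.2%}")
```

Reporting both numbers matters because accuracy alone can look high on an imbalanced set of pairs even when many non-clones are flagged as clones.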

Keywords

code clone detection, deep learning, convolutional neural network

There are 40 citations in total.

Details

Primary Language	English
Subjects	Computer Software
Journal Section	Articles
Authors	Harun Dişli, Ayşe Tosun
Publication Date	January 31, 2020
Submission Date	March 18, 2019
Published in Issue	Year 2020

Cite

APA	Dişli, H., & Tosun, A. (2020). Code Clone Detection with Convolutional Neural Networks. Bilişim Teknolojileri Dergisi, 13(1), 1-12. https://doi.org/10.17671/gazibtd.541476

Cited By

Deep learning approaches for bad smell detection: a systematic literature review

Empirical Software Engineering

https://doi.org/10.1007/s10664-023-10312-z
