Comparison of Serial and Parallel Programming Performance in Outlier Detection with DBSCAN Algorithm

Hüseyin Yaşar; Mehmet Albayrak

doi:10.35193/bseufbd.649539

Research Article

Standart Dışı Değerlerin Tespitinde DBSCAN Algoritması ile Seri ve Paralel Programlama Performansının Karşılaştırılması

Year 2020, Volume: 7 Issue: 1, 129 - 140, 28.06.2020

Cited By: 1

Abstract

As computers have entered our lives, the size of digital data has grown steadily. The data produced in the digital world can contain non-standard values (outliers) that behave differently from similar values. Detecting these values, especially in big data sets, is of great importance in fields such as security, insurance, finance, medicine and genetics. Clustering techniques from data mining are frequently used to detect outliers in big data sets. Among clustering algorithms that are sensitive to noisy and outlying values, the density-based DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm is one of the most important methods for outlier detection. In this study, an application for detecting outliers was developed in the C# programming language using the DBSCAN algorithm. Two data sets of different sizes were examined and analyzed in the application. To reduce the analysis time as far as possible, serial and parallel programming techniques were each used separately. To shorten the analysis time for big data sets, members of the Parallel class in the TPL (Task Parallel Library), introduced with .NET 4.0, were used. The analyses showed that DBSCAN detects outliers with higher accuracy than the other selected algorithms and is practical to use. In terms of computing performance, it was concluded that parallel programming can become more efficient as the number of data points increases.

Keywords

Outlier, Clustering, DBSCAN, Parallel programming

Supporting Institution

Süleyman Demirel University Scientific Research Projects Coordination Unit

Project Number

4199-YL1-14

Acknowledgments

We thank the Süleyman Demirel University Scientific Research Projects Coordination Unit for its support.

Abstract

As computers have become part of everyday life, the volume of digital data has grown steadily. This data can contain non-standard values (outliers) that behave differently from the rest. Detecting these values, especially in big data sets, is of great importance in fields such as security, insurance, finance, medicine and genetics. Clustering techniques from data mining are frequently used for outlier detection in big data sets. Among clustering algorithms that are sensitive to noisy and outlying values, the density-based DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm is one of the most important methods for outlier detection. In this study, an application for detecting outliers was developed in the C# programming language using the DBSCAN algorithm. Two data sets of different sizes were examined and analyzed with the application. To keep the analysis time as short as possible, serial and parallel programming techniques were each used separately. To shorten the analysis time for big data sets, members of the Parallel class in the TPL (Task Parallel Library), introduced with .NET 4.0, were used. Across the analyses, DBSCAN detected outliers more accurately than the other selected algorithms and proved practical to use. In terms of computing performance, parallel programming became more efficient as the number of data points increased.
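The idea summarized in the abstract can be illustrated with a minimal sketch. It is written in Python rather than the C# used in the study, and it is not the authors' implementation: the function names, the sample points, and the `eps` and `min_pts` values are all hypothetical. Points that DBSCAN leaves in no cluster are reported as outliers. The neighborhood search, which dominates the running time, is computed either in a plain loop or on a thread pool, loosely in the spirit of a parallel loop from the TPL. Which part of the algorithm the study actually parallelized is an assumption here, and threads in CPython will not necessarily reproduce the speedups the abstract reports; the sketch only shows that both paths yield the same labels.

```python
from concurrent.futures import ThreadPoolExecutor
from math import dist

NOISE = -1  # label for points that belong to no cluster (reported as outliers)


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def neighborhoods(points, eps, parallel=False):
    """Compute every point's eps-neighborhood, serially or on a thread pool."""
    indices = range(len(points))
    if not parallel:
        return [region_query(points, i, eps) for i in indices]
    with ThreadPoolExecutor() as pool:
        # pool.map preserves input order, so results line up with indices.
        return list(pool.map(lambda i: region_query(points, i, eps), indices))


def dbscan(points, eps, min_pts, parallel=False):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    neighbors = neighborhoods(points, eps, parallel)
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels


# Hypothetical data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels_serial = dbscan(points, eps=1.5, min_pts=3)
labels_parallel = dbscan(points, eps=1.5, min_pts=3, parallel=True)
outliers = [p for p, label in zip(points, labels_serial) if label == NOISE]
print(outliers)  # [(50, 50)]
```

Precomputing all neighborhoods keeps the cluster-expansion step identical in both modes, so any timing difference between the two calls comes from the neighborhood search alone.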

Keywords

Outlier, Clustering, DBSCAN, Parallel Programming

Details

Primary Language	English
Subjects	Engineering
Section	Articles
Authors	Hüseyin Yaşar 0000-0003-2715-9313 Mehmet Albayrak 0000-0002-7089-122X
Project Number	4199-YL1-14
Publication Date	June 28, 2020
Submission Date	November 21, 2019
Acceptance Date	April 26, 2020
Published in Issue	Year 2020 Volume: 7 Issue: 1

Cite

APA	Yaşar, H., & Albayrak, M. (2020). Comparison of Serial and Parallel Programming Performance in Outlier Detection with DBSCAN Algorithm. Bilecik Şeyh Edebali Üniversitesi Fen Bilimleri Dergisi, 7(1), 129-140. https://doi.org/10.35193/bseufbd.649539

Cited By

Balanced DATA by DBSCAN and Weighted Arithmetic Mean to Improve Performance of Machine Learning Algorithms

Bitlis Eren Üniversitesi Fen Bilimleri Dergisi

https://doi.org/10.17798/bitlisfen.985519
