Clustering Techniques in Data Mining: A Survey of Methods, Challenges, and Applications

Tasnim Alasalı; Yasin Ortakcı

doi:10.53070/bbd.1421527

Theoretical Article

Year 2024, 32-50, 06.06.2024

References

Abernathy, A., & Celebi, M. E. (2022). The incremental online k-means clustering algorithm and its application to color quantization. Expert Systems with Applications, 207, 117927.
Açmalı, Ş. S., & Ortakcı, Y. (2021). Clustering Performance Analysis of Traditional and New-Generation Meta-Heuristic Algorithms. Manchester Journal of Artificial Intelligence and Applied Sciences, 2(2).
Ahmed, N., Barczak, A. L. C., Susnjak, T., & Rashid, M. A. (2020). A comprehensive performance analysis of Apache Hadoop and Apache Spark for large scale data sets using HiBench. Journal of Big Data, 7(1), 1–18.
Ahmed, S. R. A., Al Barazanchi, I., Jaaz, Z. A., & Abdulshaheed, H. R. (2019). Clustering algorithms subjected to K-mean and gaussian mixture model on multidimensional data set. Periodicals of Engineering and Natural Sciences, 7(2), 448–457.
ALASALI, T., & DAKKAK, O. (2023). EXPLORING THE LANDSCAPE OF SDN-BASED DDOS DEFENSE: A HOLISTIC EXAMINATION OF DETECTION AND MITIGATION APPROACHES, RESEARCH GAPS AND PROMISING AVENUES FOR FUTURE EXPLORATION. International Journal of Advanced Natural Sciences and Engineering Researches, 7(4), 327–349.
Ali, H. H., & Kadhum, L. E. (2017). K-means clustering algorithm applications in data mining and pattern recognition. International Journal of Science and Research (IJSR), 6(8), 1577–1584.
Alomari, H. W., Al-Badarneh, A. F., Al-Alaj, A., & Khamaiseh, S. Y. (2023). Enhanced Approach for Agglomerative Clustering Using Topological Relations. IEEE Access, 11, 21945–21967.
Ambikesh, G., Rao, S. S., & Chandrasekaran, K. (2023). A grasshopper optimization algorithm-based movie recommender system. Multimedia Tools and Applications, 1–22.
Amirizadeh, E., & Boostani, R. (2021). CDEC: a constrained deep embedded clustering. International Journal of Intelligent Computing and Cybernetics, 14(4), 686–701.
Anam, S., Fitriah, Z., Hidayat, N., & Maulana, M. H. A. A. (2023). Classification Model for Diabetes Mellitus Diagnosis based on K-Means Clustering Algorithm Optimized with Bat Algorithm. International Journal of Advanced Computer Science and Applications, 14(1).
Ayesha, S., Hanif, M. K., & Talib, R. (2020a). Overview and comparative study of dimensionality reduction techniques for high dimensional data. Information Fusion, 59, 44–58.
Ayesha, S., Hanif, M. K., & Talib, R. (2020b). Overview and comparative study of dimensionality reduction techniques for high dimensional data. Information Fusion, 59, 44–58.
Azhir, E., Navimipour, N. J., Hosseinzadeh, M., Sharifi, A., & Darwesh, A. (2021). An efficient automated incremental density-based algorithm for clustering and classification. Future Generation Computer Systems, 114, 665–678.
Bahadori, S., & Charkari, N. M. (2018). Increasing Efficiency of Time Series Clustering by Dimension Reduction Techniques. INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND NETWORK SECURITY, 18(5), 164–170.
Bansal, A., Sharma, M., & Goel, S. (2017). Improved k-mean clustering algorithm for prediction analysis using classification technique in data mining. International Journal of Computer Applications, 157(6), 975–8887.
Bechini, A., Marcelloni, F., & Renda, A. (2020). TSF-DBSCAN: A novel fuzzy density-based approach for clustering unbounded data streams. IEEE Transactions on Fuzzy Systems, 30(3), 623–637.
Bhattacharjee, P., & Mitra, P. (2020). BISDBx: towards batch-incremental clustering for dynamic datasets using SNN-DBSCAN. Pattern Analysis and Applications, 23(2), 975–1009.
CERNIAN, A., CARSTOIU, D., & OLTEANU, A. (2011). Clustering Heterogeneous Web Data using Clustering by Compression. Cluster Validity, 13th Intl. Symp. on Symbolic and Numeric Algorithms for Scientific Computing.
Chadebec, C., Thibeau-Sutre, E., Burgos, N., & Allassonnière, S. (2022). Data augmentation in high dimensional low sample size setting using a geometry-based variational autoencoder. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3), 2879–2896.
Chakraborty, S., & Das, S. (2020). Detecting meaningful clusters from high-dimensional data: A strongly consistent sparse center-based clustering approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(6), 2894–2908.
Chakraborty, S., & Nagwani, N. K. (2014). Analysis and study of Incremental DBSCAN clustering algorithm. ArXiv Preprint ArXiv:1406.4754.
Chalapathi, M. M., Kumar, M. R., Sharma, N., & Shitharth, S. (2022). Ensemble Learning by High-Dimensional Acoustic Features for Emotion Recognition from Speech Audio Signal. Security and Communication Networks, 2022.
Chatterjee, S., & Das, A. (2023). An ensemble algorithm using quantum evolutionary optimization of weighted type-II fuzzy system and staged Pegasos Quantum Support Vector Classifier with multi-criteria decision making system for diagnosis and grading of breast cancer. Soft Computing, 27(11), 7147–7178.
Chen, H., Cai, Y., Ji, C., Selvaraj, G., Wei, D., & Wu, H. (2023). AdaPPI: identification of novel protein functional modules via adaptive graph convolution networks in a protein–protein interaction network. Briefings in Bioinformatics, 24(1), bbac523.
Chen, J., Li, D., Huang, R., Chen, Z., & Li, W. (2023). Aero-engine remaining useful life prediction method with self-adaptive multimodal data fusion and cluster-ensemble transfer regression. Reliability Engineering & System Safety, 234, 109151.
Chen, M.-S., Lin, J.-Q., Li, X.-L., Liu, B.-Y., Wang, C.-D., Huang, D., & Lai, J.-H. (2022). Representation learning in multi-view clustering: A literature review. Data Science and Engineering, 7(3), 225–241.
Choudhary, C., Singh, I., & Kumar, M. (2023). Community detection algorithms for recommendation systems: techniques and metrics. Computing, 105(2), 417–453.
Curiskis, S. A., Drake, B., Osborn, T. R., & Kennedy, P. J. (2020). An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit. Information Processing & Management, 57(2), 102034.
da Silva, L. E. B., Rayapati, N., & Wunsch, D. C. (2022). iCVI-ARTMAP: Using incremental cluster validity indices and adaptive resonance theory reset mechanism to accelerate validation and achieve multiprototype unsupervised representations. IEEE Transactions on Neural Networks and Learning Systems.
Dakkak, O., Arif, S., & Nor, S. A. (2015). Resource allocation mechanisms in computational grid: A survey. Asian Research Publishing Network (ARPN), 10.
Dakkak, O., Fazea, Y., Nor, S. A., & Arif, S. (2021). Towards accommodating deadline driven jobs on high performance computing platforms in grid computing environment. Journal of Computational Science, 54, 101439. De Weerdt, J., Vanden Broucke, S., Vanthienen, J., & Baesens, B. (2013). Active trace clustering for improved process discovery. IEEE Transactions on Knowledge and Data Engineering, 25(12), 2708–2720.
Deng, M., Liu, Q., Cheng, T., & Shi, Y. (2011). An adaptive spatial clustering algorithm based on Delaunay triangulation. Computers, Environment and Urban Systems, 35(4), 320–332.
Dhas, C. S. G., Yuvaraj, N., Kousik, N. V, & Geleto, T. D. (2022). D-PPSOK clustering algorithm with data sampling for clustering big data analysis. In System Assurances (pp. 503–512). Elsevier.
Diallo, B., Hu, J., Li, T., Khan, G. A., Liang, X., & Zhao, Y. (2021). Deep embedding clustering based on contractive autoencoder. Neurocomputing, 433, 96–107.
Duan, Y., Liu, C., Li, S., Guo, X., & Yang, C. (2023a). An automatic affinity propagation clustering based on improved equilibrium optimizer and t-SNE for high-dimensional data. Information Sciences, 623, 434–454.
Duan, Y., Liu, C., Li, S., Guo, X., & Yang, C. (2023b). An automatic affinity propagation clustering based on improved equilibrium optimizer and t-SNE for high-dimensional data. Information Sciences, 623, 434–454.
Elgarhy, I., Badr, M. M., Mahmoud, M., Fouda, M. M., Alsabaan, M., & Kholidy, H. A. (2023). Clustering and Ensemble Based Approach For Securing Electricity Theft Detectors Against Evasion Attacks. IEEE Access.
Ezugwu, A. E., Ikotun, A. M., Oyelade, O. O., Abualigah, L., Agushaka, J. O., Eke, C. I., & Akinyelu, A. A. (2022a). A comprehensive survey of clustering algorithms: State-of-the-art machine learning applications, taxonomy, challenges, and future research prospects. Engineering Applications of Artificial Intelligence, 110, 104743.
Ezugwu, A. E., Ikotun, A. M., Oyelade, O. O., Abualigah, L., Agushaka, J. O., Eke, C. I., & Akinyelu, A. A. (2022b). A comprehensive survey of clustering algorithms: State-of-the-art machine learning applications, taxonomy, challenges, and future research prospects. Engineering Applications of Artificial Intelligence, 110, 104743.
Fahad, A., Alshatri, N., Tari, Z., Alamri, A., Khalil, I., Zomaya, A. Y., Foufou, S., & Bouras, A. (2014). A survey of clustering algorithms for big data: Taxonomy and empirical analysis. IEEE Transactions on Emerging Topics in Computing, 2(3), 267–279.
Fakir, Y., & El Iklil, J. (2021). Clustering techniques for big data mining. International Conference on Business Intelligence, 183–200.
Faroughi, A., Boostani, R., Tajalizadeh, H., & Javidan, R. (2023). ARD-Stream: An adaptive radius density-based stream clustering. Future Generation Computer Systems, 149, 416–431.
Fu, X., Yuan, Y., Qiu, H., Suo, H., Song, Y., Li, A., Zhang, Y., Xiao, C., Li, Y., & Dou, L. (2024). AGF-PPIS: A protein–protein interaction site predictor based on an attention mechanism and graph convolutional networks. Methods.
Gao, L., Song, J., Liu, X., Shao, J., Liu, J., & Shao, J. (2017). Learning in high-dimensional multimedia data: the state of the art. Multimedia Systems, 23, 303–313.
Ghazal, T. M. (2021). Performances of K-means clustering algorithm with different distance metrics. Intelligent Automation & Soft Computing, 30(2), 735–742.
Ghosal, A., Nandy, A., Das, A. K., Goswami, S., & Panday, M. (2020). A short review on different clustering techniques and their applications. Emerging Technology in Modelling and Graphics: Proceedings of IEM Graph 2018, 69–83.
Gu, B., & Sheng, V. S. (2013). Feasibility and finite convergence analysis for accurate on-line $\nu $-Support vector machine. IEEE Transactions on Neural Networks and Learning Systems, 24(8), 1304–1315.
Guo, T., Yu, K., Aloqaily, M., & Wan, S. (2022). Constructing a prior-dependent graph for data clustering and dimension reduction in the edge of AIoT. Future Generation Computer Systems, 128, 381–394.
Han, X., Quan, L., Xiong, X., Almeter, M., Xiang, J., & Lan, Y. (2017). A novel data clustering algorithm based on modified gravitational search algorithm. Engineering Applications of Artificial Intelligence, 61, 1–7.
Hao, Z., Lu, Z., Li, G., Nie, F., Wang, R., & Li, X. (2023). Ensemble clustering with attentional representation. IEEE Transactions on Knowledge and Data Engineering.
Haris, M., Yusoff, Y., Zain, A. M., Khattak, A. S., & Hussain, S. F. (2024). Breaking down multi-view clustering: A comprehensive review of multi-view approaches for complex data structures. Engineering Applications of Artificial Intelligence, 132, 107857.
Hassan, Z. F., Al-Shareefi, F., & Gheni, H. Q. (2023). A Coloured Image Watermarking Based on Genetic K-Means Clustering Methodology. Journal of Advances in Information Technology, 14(2).
He, G., Jiang, W., Peng, R., Yin, M., & Han, M. (2022). Soft Subspace Based Ensemble Clustering for Multivariate Time Series Data. IEEE Transactions on Neural Networks and Learning Systems.
He, M., & Chen, H. (2024). Anomaly Detection in Species Distribution Patterns: A Spatio-Temporal Approach for Biodiversity Conservation. Journal of Biobased Materials and Bioenergy, 18(1), 39–50.
Hossain, M. Z., Akhtar, M. N., Ahmad, R. B., & Rahman, M. (2019). A dynamic K-means clustering for data mining. Indonesian Journal of Electrical Engineering and Computer Science, 13(2), 521–526.
Huang, Q., Gao, R., & Akhavan, H. (2023). An ensemble hierarchical clustering algorithm based on merits at cluster and partition levels. Pattern Recognition, 136, 109255.
Iam-On, N., & Boongoen, T. (2015). Diversity-driven generation of link-based cluster ensemble and application to data classification. Expert Systems with Applications, 42(21), 8259–8273.
Ikotun, A. M., Ezugwu, A. E., Abualigah, L., Abuhaija, B., & Heming, J. (2023a). K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data. Information Sciences, 622, 178–210.
Ikotun, A. M., Ezugwu, A. E., Abualigah, L., Abuhaija, B., & Heming, J. (2023b). K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data. Information Sciences, 622, 178–210.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: a review. ACM Computing Surveys (CSUR), 31(3), 264–323.
Jain, P. K., & Pamula, R. (2019). Two-step anomaly detection approach using clustering algorithm. International Conference on Advanced Computing Networking and Informatics: ICANI-2018, 513–520.
JayaLakshmi, A. N. M., & Kishore, K. V. K. (2022). Performance evaluation of DNN with other machine learning techniques in a cluster using Apache Spark and MLlib. Journal of King Saud University-Computer and Information Sciences, 34(1), 1311–1319.
Jeong, S., Park, J., & Lim, S. (2023). mr2vec: Multiple role-based social network embedding. Pattern Recognition Letters, 176, 140–146.
Kadiravan, G., Sujatha, P., Asvany, T., Punithavathi, R., Elhoseny, M., Pustokhina, I. V, Pustokhin, D. A., & Shankar, K. (2021). Metaheuristic Clustering Protocol for Healthcare Data Collection in Mobile Wireless Multimedia Sensor Networks. Computers, Materials & Continua, 66(3).
Kannout, E., Grodzki, M., & Grzegorowski, M. (2023). Towards addressing item cold-start problem in collaborative filtering by embedding agglomerative clustering and FP-growth into the recommendation system. Computer Science and Information Systems, 00, 52.
Karthikeyan, B., George, D. J., Manikandan, G., & Thomas, T. (2020). A comparative study on k-means clustering and agglomerative hierarchical clustering. International Journal of Emerging Trends in Engineering Research, 8(5).
Kaya, M.-F., & Schoop, M. (2022). Analytical comparison of clustering techniques for the recognition of communication patterns. Group Decision and Negotiation, 31(3), 555–589.
Kharchenko, P. V. (2021). The triumphs and limitations of computational methods for scRNA-seq. Nature Methods, 18(7), 723–732.
Kim, S., Cha, J., Kim, D., & Park, E. (2023). Understanding Mental Health Issues in Different Subdomains of Social Networking Services: Computational Analysis of Text-Based Reddit Posts. Journal of Medical Internet Research, 25, e49074.
Krishnaswamy, R., Subramaniam, K., Nandini, V., Vijayalakshmi, K., Kadry, S., & Nam, Y. (2023). Metaheuristic Based Clustering with Deep Learning Model for Big Data Classification. Comput. Syst. Sci. Eng., 44(1), 391–406.
Kuo, R. J., Chang, C. K., Nguyen, T. P. Q., & Liao, T. W. (2021). Application of genetic algorithm-based intuitionistic fuzzy weighted c-ordered-means algorithm to cluster analysis. Knowledge and Information Systems, 63, 1935–1959.
Kuwil, F. H., Shaar, F., Topcu, A. E., & Murtagh, F. (2019). A new data clustering algorithm based on critical distance methodology. Expert Systems with Applications, 129, 296–310.
lahmood HAMEED, F., & DAKKAK, O. (2022). Brain Tumor Detection and Classification Using Convolutional Neural Network (CNN). 2022 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), 1–7.
Laohakiat, S., & Sa-Ing, V. (2021). An incremental density-based clustering framework using fuzzy local clustering. Information Sciences, 547, 404–426.
Lee, Y., Park, C., & Kang, S. (2022). Deep Embedded Clustering Framework for Mixed Data. IEEE Access, 11, 33–40.
Li, X., Chen, X., & Rezaeipanah, A. (2023). Automatic breast cancer diagnosis based on hybrid dimensionality reduction technique and ensemble classification. Journal of Cancer Research and Clinical Oncology, 1–19.
Liu, C., Nie, F., Wang, R., & Li, X. (2022). Scalable fuzzy clustering with anchor graph. IEEE Transactions on Knowledge and Data Engineering.
Liu, H., Yang, J., Ye, M., James, S. C., Tang, Z., Dong, J., & Xing, T. (2021). Using t-distributed Stochastic Neighbor Embedding (t-SNE) for cluster analysis and spatial zone delineation of groundwater geochemistry data. Journal of Hydrology, 597, 126146.
Liu, R., Ren, R., Liu, J., & Liu, J. (2020). A clustering and dimensionality reduction based evolutionary algorithm for large-scale multi-objective problems. Applied Soft Computing, 89, 106120.
Lv, Y., Ma, T., Tang, M., Cao, J., Tian, Y., Al-Dhelaan, A., & Al-Rodhaan, M. (2016). An efficient and scalable density-based clustering algorithm for datasets with complex structures. Neurocomputing, 171, 9–22.
Lydia, E. L., Moses, G. J., Varadarajan, V., Nonyelu, F., Maseleno, A., Perumal, E., & Shankar, K. (2020). Clustering and indexing of multiple documents using feature extraction through apache hadoop on big data. Malaysian Journal of Computer Science, 108–123.
Maia, J., Junior, C. A. S., Guimarães, F. G., de Castro, C. L., Lemos, A. P., Galindo, J. C. F., & Cohen, M. W. (2020). Evolving clustering algorithm based on mixture of typicalities for stream data mining. Future Generation Computer Systems, 106, 672–684.
Marqués-Sánchez, P., Martínez-Fernández, M. C., Benítez-Andrades, J. A., Quiroga-Sánchez, E., García-Ordás, M. T., & Arias-Ramos, N. (2023). Adolescent relational behaviour and the obesity pandemic: A descriptive study applying social network analysis and machine learning techniques. PloS One, 18(8), e0289553.
Mayanglambam, S. D., Horng, S.-J., & Pamula, R. (2023). PSO clustering and pruning-based KNN for outlier detection. Soft Computing, 1–17.
Mohammadi, M., Shokrollahi, A., Reisi, M., Abdollahpouri, A., & Moradi, P. (2023). Scalable and robust big data clustering with adaptive local feature weighting based on the Map-Reduce and Hadoop.
Mortensen, K. O., Zardbani, F., Haque, M. A., Agustsson, S. Y., Mottin, D., Hofmann, P., & Karras, P. (2023). Marigold: Efficient k-Means Clustering in High Dimensions. Proceedings of the VLDB Endowment, 16(7), 1740–1748.
Mrukwa, G., & Polanska, J. (2022). DiviK: divisive intelligent K-means for hands-free unsupervised clustering in big biological data. BMC Bioinformatics, 23(1), 1–24.
Mussabayev, R., Mladenovic, N., Jarboui, B., & Mussabayev, R. (2023). How to use K-means for big data clustering? Pattern Recognition, 137, 109269.
Nie, X., Qin, D., Zhou, X., Duo, H., Hao, Y., Li, B., & Liang, G. (2023). Clustering ensemble in scRNA-seq data analysis: Methods, applications and challenges. Computers in Biology and Medicine, 106939.
Nozari, H., & Sadeghi, M. E. (2021). Artificial intelligence and Machine Learning for Real-world problems (A survey). International Journal of Innovation in Engineering, 1(3), 38–47.
Ollagnier, A., Cabrio, E., & Villata, S. (2023). Unsupervised fine-grained hate speech target community detection and characterisation on social media. Social Network Analysis and Mining, 13(1), 58.
Omar, N., Nazirun, N. N., Vijayam, B., Wahab, A. A., & Bahuri, H. A. (2023). Diabetes subtypes classification for personalized health care: A review. Artificial Intelligence Review, 56(3), 2697–2721.
Ortakci, Y. (2017). Parallel particle swarm optimization in data clustering. Int. J Soft Comput. Artif. Intell.(IJSCAI), 5(1), 10–14.
Oskouei, A. G., Balafar, M. A., & Motamed, C. (2021). FKMAWCW: categorical fuzzy k-modes clustering with automated attribute-weight and cluster-weight learning. Chaos, Solitons & Fractals, 153, 111494.
Pareek, J., & Jacob, J. (2021). Data compression and visualization using PCA and T-SNE. Advances in Information Communication Technology and Computing: Proceedings of AICTC 2019, 327–337.
Patel, D., Modi, R., & Sarvakar, K. (2014). A comparative study of clustering data mining: Techniques and research challenges. International Journal of Latest Technology in Engineering, Management & Applied Science, 3(9), 67–70.
Pérez-Ortega, J., Rey-Figueroa, C. D., Roblero-Aguilar, S. S., Almanza-Ortega, N. N., Zavala-Díaz, C., García-Paredes, S., & Landero-Nájera, V. (2023). POFCM: A Parallel Fuzzy Clustering Algorithm for Large Datasets. Mathematics, 11(8), 1920.
Pham, N. D., Le, T. D., Park, K., & Choo, H. (2010). SCCS: Spatiotemporal clustering and compressing schemes for efficient data collection applications in WSNs. International Journal of Communication Systems, 23(11), 1311–1333.
Phan, H. T., & Nguyen, N. T. (2024). A Fuzzy Graph Convolutional Network Model for Sentence-Level Sentiment Analysis. IEEE Transactions on Fuzzy Systems.
Phan, H. T., Nguyen, N. T., & Hwang, D. (2023). Aspect-level sentiment analysis: A survey of graph convolutional network methods. Information Fusion, 91, 149–172.
Price, M. A., McEwen, J. D., Cai, X., Kitching, T. D., Wallis, C. G. R., & Collaboration), L. D. E. S. (2021). Sparse Bayesian mass mapping with uncertainties: hypothesis testing of structure. Monthly Notices of the Royal Astronomical Society, 506(3), 3678–3690.
Purwandari, K., Sigalingging, J. W. C., Fhadli, M., Arizky, S. N., & Pardamean, B. (2020). Data mining for predicting customer satisfaction using clustering techniques. 2020 International Conference on Information Management and Technology (ICIMTech), 223–227.
Qoku, A., & Buettner, F. (2023). Encoding Domain Knowledge in Multi-view Latent Variable Models: A Bayesian Approach with Structured Sparsity. International Conference on Artificial Intelligence and Statistics, 11545–11562.
Qu, W., Xiu, X., Chen, H., & Kong, L. (2023). A Survey on High-Dimensional Subspace Clustering. Mathematics, 11(2), 436.
Rahayu, K., Novianti, L., & Kusnandar, M. (2020). Implementation data mining with K-Means algorithm for clustering distribution rabies case area in Palembang City. Journal of Physics: Conference Series, 1500(1), 012121.
Ran, X., Xi, Y., Lu, Y., Wang, X., & Lu, Z. (2023). Comprehensive survey on hierarchical clustering algorithms and the recent developments. Artificial Intelligence Review, 56(8), 8219–8264.
Ray, P., Reddy, S. S., & Banerjee, T. (2021). Various dimension reduction techniques for high dimensional data analysis: a review. Artificial Intelligence Review, 54, 3473–3515.
Reddy, G. T., Reddy, M. P. K., Lakshmanna, K., Kaluri, R., Rajput, D. S., Srivastava, G., & Baker, T. (2020). Analysis of dimensionality reduction techniques on big data. Ieee Access, 8, 54776–54788.
Rehman, M. U., & Khan, D. M. (2021). A novel density-based technique for outlier detection of high dimensional data utilizing full feature space. Information Technology and Control, 50(1), 138–152.
Richards, J. A., & Richards, J. A. (2022). Remote sensing digital image analysis (Vol. 5). Springer.
Rubarth, K., Sattler, P., Zimmermann, H. G., & Konietschke, F. (2021). Estimation and testing of Wilcoxon–Mann–Whitney effects in factorial clustered data designs. Symmetry, 14(2), 244.
Sabitha, A. S., & Bansal, A. (2017). Climate change analysis to study land surface temparature trends. 2017 3rd International Conference on Computational Intelligence & Communication Technology (CICT), 1–8.
Sahoo, S. K., Pattanaik, P., Mohanty, M. N., & Mishra, D. K. (2023). Opposition Learning Based Improved Bee Colony Optimization (OLIBCO) Algorithm for Data Clustering. International Journal of Advanced Computer Science and Applications, 14(4).
Saklani, R., Purohit, K., Vats, S., Sharma, V., Kukreja, V., & Yadav, S. P. (2023). Multicore Implementation of K-Means Clustering Algorithm. 2023 2nd International Conference on Applied Artificial Intelligence and Computing (ICAAIC), 171–175.
Samoilenko, S., & Osei-Bryson, K.-M. (2019). Representation matters: An exploration of the socio-economic impacts of ICT-enabled public value in the context of sub-Saharan economies. International Journal of Information Management, 49, 69–85.
Saxena, A., Prasad, M., Gupta, A., Bharill, N., Patel, O. P., Tiwari, A., Er, M. J., Ding, W., & Lin, C.-T. (2017a). A review of clustering techniques and developments. Neurocomputing, 267, 664–681.
Saxena, A., Prasad, M., Gupta, A., Bharill, N., Patel, O. P., Tiwari, A., Er, M. J., Ding, W., & Lin, C.-T. (2017b). A review of clustering techniques and developments. Neurocomputing, 267, 664–681.
Shah, N. H., Priamvada, A., & Shukla, B. P. (2023). Decoding spatial precipitation patterns using artificial intelligence. Spatial Information Research, 1–12.
Sharma, S., Agrawal, J., Agarwal, S., & Sharma, S. (2013). Machine learning techniques for data mining: A survey. 2013 IEEE International Conference on Computational Intelligence and Computing Research, 1–6.
Sheng, G., Wang, Q., Pei, C., & Gao, Q. (2022). Contrastive deep embedded clustering. Neurocomputing, 514, 13–20.
Shi, Y., Yang, K., Yu, Z., Chen, C. L. P., & Zeng, H. (2023). Adaptive Ensemble Clustering With Boosting BLS-Based Autoencoder. IEEE Transactions on Knowledge and Data Engineering.
Shrifan, N. H. M. M., Akbar, M. F., & Isa, N. A. M. (2022). An adaptive outlier removal aided k-means clustering algorithm. Journal of King Saud University-Computer and Information Sciences, 34(8), 6365–6376.
Sinaga, K. P., Hussain, I., & Yang, M.-S. (2021). Entropy K-means clustering with feature reduction under unknown number of clusters. IEEE Access, 9, 67736–67751.
Souiden, I., Omri, M. N., & Brahmi, Z. (2022). A survey of outlier detection in high dimensional data streams. Computer Science Review, 44, 100463.
Sun, L., Zhang, J., Ding, W., & Xu, J. (2022). Feature reduction for imbalanced data classification using similarity-based feature clustering with adaptive weighted K-nearest neighbors. Information Sciences, 593, 591–613.
Tejasree, S., & Chandra Mohan, B. (2023). An improved differential bond energy algorithm with fuzzy merging method to improve the document clustering for information mining. Expert Systems, e13261.
Thrun, M. C., & Ultsch, A. (2021). Using projection-based clustering to find distance-and density-based clusters in high-dimensional data. Journal of Classification, 38, 280–312.
Thudumu, S., Branch, P., Jin, J., & Singh, J. (2020). A comprehensive survey of anomaly detection techniques for high dimensional big data. Journal of Big Data, 7, 1–30.
Tiwari, A. (2021). Enhancing k-means algorithm clustering performance with improved time complexity. National Conference on “Unprecedented and Advanced Concepts of Computer Vision” NCUACC, 11(12).
Ukey, N., Yang, Z., Li, B., Zhang, G., Hu, Y., & Zhang, W. (2023). Survey on exact knn queries over high-dimensional data space. Sensors, 23(2), 629.
Utku, A., Can, U., & Aslan, S. (2023). Detection of hateful twitter users with graph convolutional network model. Earth Science Informatics, 16(1), 329–343.
Vandhana, S., & Anuradha, J. (2021). Environmental air pollution clustering using enhanced ensemble clustering methodology. Environmental Science and Pollution Research, 28, 40746–40755.
Wang, C., Danilevsky, M., Desai, N., Zhang, Y., Nguyen, P., Taula, T., & Han, J. (2013). A phrase mining framework for recursive construction of a topical hierarchy. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 437–445.
Wang, F., Zheng, Z., Zhang, Y., Li, Y., Yang, K., & Zhu, C. (2023). To see further: Knowledge graph-aware deep graph convolutional network for recommender systems. Information Sciences, 647, 119465.
Wang, L., Wang, Y., Deng, H., & Chen, H. (2023). Attention reweighted sparse subspace clustering. Pattern Recognition, 139, 109438.
Wickramasinghe, C. S., Marino, D. L., & Manic, M. (2021). ResNet autoencoders for unsupervised feature learning from high-dimensional data: Deep models resistant to performance degradation. IEEE Access, 9, 40511–40520.
Wright, J., & Ma, Y. (2022). High-dimensional data analysis with low-dimensional models: Principles, computation, and applications. Cambridge University Press.
Xie, J., Xu, X., Lan, Y., Shi, X., Yong, Y., & Wu, D. (2023). Automatic velocity picking with restricted weighted k-means clustering using prior information. Frontiers in Earth Science, 10, 1076999.
Xie, W.-B., Lee, Y.-L., Wang, C., Chen, D.-B., & Zhou, T. (2020). Hierarchical clustering supported by reciprocal nearest neighbors. Information Sciences, 527, 279–292.
Xie, Z., Nie, M., & Wang, T. (2009). Clustering Based Compress Data Cube Algorithm. 2009 WRI World Congress on Software Engineering, 4, 429–433.
Xu, D., & Tian, Y. (2015). A comprehensive survey of clustering algorithms. Annals of Data Science, 2, 165–193. Yedla, M., Pathakota, S. R., & Srinivasa, T. M. (2010). Enhancing K-means clustering algorithm with improved initial center. International Journal of Computer Science and Information Technologies, 1(2), 121–125.
Yu, T.-T., Chen, C.-Y., Wu, T.-H., & Chang, Y.-C. (2023). Application of high-dimensional uniform manifold approximation and projection (UMAP) to cluster existing landfills on the basis of geographical and environmental features. Science of The Total Environment, 904, 167013.
Yuan, C., & Yang, H. (2019). Research on K-value selection method of K-means clustering algorithm. J, 2(2), 226–235.
Yue, G., Deng, A., Qu, Y., Cui, H., & Liu, J. (n.d.). Fuzzy-Rough induced spectral ensemble clustering. Journal of Intelligent & Fuzzy Systems, Preprint, 1–18.
Zhong, L., Yang, J., Chen, Z., & Wang, S. (2023). Contrastive Graph Convolutional Networks With Generative Adjacency Matrix. IEEE Transactions on Signal Processing, 71, 772–785.

Abstract

Clustering is a core technique in both the research and the practical application of data mining. It organizes unlabeled data into groups so that meaningful insights can be extracted. Because clustering problems are inherently complex, many clustering algorithms have been developed, each tailored to specific data clustering scenarios. This paper analyzes clustering techniques in data mining, including their challenges and their applications in various domains. It examines the strengths and limitations of distinct clustering methodologies, covering distance-based, hierarchical, grid-based, and density-based algorithms. It also presents examples of clustering algorithms and their empirical results in domains including, but not limited to, healthcare, image processing, text and document clustering, and big data analytics.

Keywords

Clustering, hierarchical, distance-based, grid-based, density-based, data mining

References

Abernathy, A., & Celebi, M. E. (2022). The incremental online k-means clustering algorithm and its application to color quantization. Expert Systems with Applications, 207, 117927.
Açmalı, Ş. S., & Ortakcı, Y. (2021). Clustering Performance Analysis of Traditional and New-Generation Meta-Heuristic Algorithms. Manchester Journal of Artificial Intelligence and Applied Sciences, 2(2).
Ahmed, N., Barczak, A. L. C., Susnjak, T., & Rashid, M. A. (2020). A comprehensive performance analysis of Apache Hadoop and Apache Spark for large scale data sets using HiBench. Journal of Big Data, 7(1), 1–18.
Ahmed, S. R. A., Al Barazanchi, I., Jaaz, Z. A., & Abdulshaheed, H. R. (2019). Clustering algorithms subjected to K-mean and gaussian mixture model on multidimensional data set. Periodicals of Engineering and Natural Sciences, 7(2), 448–457.
ALASALI, T., & DAKKAK, O. (2023). EXPLORING THE LANDSCAPE OF SDN-BASED DDOS DEFENSE: A HOLISTIC EXAMINATION OF DETECTION AND MITIGATION APPROACHES, RESEARCH GAPS AND PROMISING AVENUES FOR FUTURE EXPLORATION. International Journal of Advanced Natural Sciences and Engineering Researches, 7(4), 327–349.
Ali, H. H., & Kadhum, L. E. (2017). K-means clustering algorithm applications in data mining and pattern recognition. International Journal of Science and Research (IJSR), 6(8), 1577–1584.
Alomari, H. W., Al-Badarneh, A. F., Al-Alaj, A., & Khamaiseh, S. Y. (2023). Enhanced Approach for Agglomerative Clustering Using Topological Relations. IEEE Access, 11, 21945–21967.
Ambikesh, G., Rao, S. S., & Chandrasekaran, K. (2023). A grasshopper optimization algorithm-based movie recommender system. Multimedia Tools and Applications, 1–22.
Amirizadeh, E., & Boostani, R. (2021). CDEC: a constrained deep embedded clustering. International Journal of Intelligent Computing and Cybernetics, 14(4), 686–701.
Anam, S., Fitriah, Z., Hidayat, N., & Maulana, M. H. A. A. (2023). Classification Model for Diabetes Mellitus Diagnosis based on K-Means Clustering Algorithm Optimized with Bat Algorithm. International Journal of Advanced Computer Science and Applications, 14(1).
Ayesha, S., Hanif, M. K., & Talib, R. (2020a). Overview and comparative study of dimensionality reduction techniques for high dimensional data. Information Fusion, 59, 44–58.
Ayesha, S., Hanif, M. K., & Talib, R. (2020b). Overview and comparative study of dimensionality reduction techniques for high dimensional data. Information Fusion, 59, 44–58.
Azhir, E., Navimipour, N. J., Hosseinzadeh, M., Sharifi, A., & Darwesh, A. (2021). An efficient automated incremental density-based algorithm for clustering and classification. Future Generation Computer Systems, 114, 665–678.
Bahadori, S., & Charkari, N. M. (2018). Increasing Efficiency of Time Series Clustering by Dimension Reduction Techniques. INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND NETWORK SECURITY, 18(5), 164–170.
Bansal, A., Sharma, M., & Goel, S. (2017). Improved k-mean clustering algorithm for prediction analysis using classification technique in data mining. International Journal of Computer Applications, 157(6), 975–8887.
Bechini, A., Marcelloni, F., & Renda, A. (2020). TSF-DBSCAN: A novel fuzzy density-based approach for clustering unbounded data streams. IEEE Transactions on Fuzzy Systems, 30(3), 623–637.
Bhattacharjee, P., & Mitra, P. (2020). BISDBx: towards batch-incremental clustering for dynamic datasets using SNN-DBSCAN. Pattern Analysis and Applications, 23(2), 975–1009.
CERNIAN, A., CARSTOIU, D., & OLTEANU, A. (2011). Clustering Heterogeneous Web Data using Clustering by Compression. Cluster Validity, 13th Intl. Symp. on Symbolic and Numeric Algorithms for Scientific Computing.
Chadebec, C., Thibeau-Sutre, E., Burgos, N., & Allassonnière, S. (2022). Data augmentation in high dimensional low sample size setting using a geometry-based variational autoencoder. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3), 2879–2896.
Chakraborty, S., & Das, S. (2020). Detecting meaningful clusters from high-dimensional data: A strongly consistent sparse center-based clustering approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(6), 2894–2908.
Chakraborty, S., & Nagwani, N. K. (2014). Analysis and study of Incremental DBSCAN clustering algorithm. ArXiv Preprint ArXiv:1406.4754.
Chalapathi, M. M., Kumar, M. R., Sharma, N., & Shitharth, S. (2022). Ensemble Learning by High-Dimensional Acoustic Features for Emotion Recognition from Speech Audio Signal. Security and Communication Networks, 2022.
Chatterjee, S., & Das, A. (2023). An ensemble algorithm using quantum evolutionary optimization of weighted type-II fuzzy system and staged Pegasos Quantum Support Vector Classifier with multi-criteria decision making system for diagnosis and grading of breast cancer. Soft Computing, 27(11), 7147–7178.
Chen, H., Cai, Y., Ji, C., Selvaraj, G., Wei, D., & Wu, H. (2023). AdaPPI: identification of novel protein functional modules via adaptive graph convolution networks in a protein–protein interaction network. Briefings in Bioinformatics, 24(1), bbac523.
Chen, J., Li, D., Huang, R., Chen, Z., & Li, W. (2023). Aero-engine remaining useful life prediction method with self-adaptive multimodal data fusion and cluster-ensemble transfer regression. Reliability Engineering & System Safety, 234, 109151.
Chen, M.-S., Lin, J.-Q., Li, X.-L., Liu, B.-Y., Wang, C.-D., Huang, D., & Lai, J.-H. (2022). Representation learning in multi-view clustering: A literature review. Data Science and Engineering, 7(3), 225–241.
Choudhary, C., Singh, I., & Kumar, M. (2023). Community detection algorithms for recommendation systems: techniques and metrics. Computing, 105(2), 417–453.
Curiskis, S. A., Drake, B., Osborn, T. R., & Kennedy, P. J. (2020). An evaluation of document clustering and topic modelling in two online social networks: Twitter and Reddit. Information Processing & Management, 57(2), 102034.
da Silva, L. E. B., Rayapati, N., & Wunsch, D. C. (2022). iCVI-ARTMAP: Using incremental cluster validity indices and adaptive resonance theory reset mechanism to accelerate validation and achieve multiprototype unsupervised representations. IEEE Transactions on Neural Networks and Learning Systems.
Dakkak, O., Arif, S., & Nor, S. A. (2015). Resource allocation mechanisms in computational grid: A survey. Asian Research Publishing Network (ARPN), 10.
Dakkak, O., Fazea, Y., Nor, S. A., & Arif, S. (2021). Towards accommodating deadline driven jobs on high performance computing platforms in grid computing environment. Journal of Computational Science, 54, 101439. De Weerdt, J., Vanden Broucke, S., Vanthienen, J., & Baesens, B. (2013). Active trace clustering for improved process discovery. IEEE Transactions on Knowledge and Data Engineering, 25(12), 2708–2720.
Deng, M., Liu, Q., Cheng, T., & Shi, Y. (2011). An adaptive spatial clustering algorithm based on Delaunay triangulation. Computers, Environment and Urban Systems, 35(4), 320–332.
Dhas, C. S. G., Yuvaraj, N., Kousik, N. V, & Geleto, T. D. (2022). D-PPSOK clustering algorithm with data sampling for clustering big data analysis. In System Assurances (pp. 503–512). Elsevier.
Diallo, B., Hu, J., Li, T., Khan, G. A., Liang, X., & Zhao, Y. (2021). Deep embedding clustering based on contractive autoencoder. Neurocomputing, 433, 96–107.
Duan, Y., Liu, C., Li, S., Guo, X., & Yang, C. (2023a). An automatic affinity propagation clustering based on improved equilibrium optimizer and t-SNE for high-dimensional data. Information Sciences, 623, 434–454.
Duan, Y., Liu, C., Li, S., Guo, X., & Yang, C. (2023b). An automatic affinity propagation clustering based on improved equilibrium optimizer and t-SNE for high-dimensional data. Information Sciences, 623, 434–454.
Elgarhy, I., Badr, M. M., Mahmoud, M., Fouda, M. M., Alsabaan, M., & Kholidy, H. A. (2023). Clustering and Ensemble Based Approach For Securing Electricity Theft Detectors Against Evasion Attacks. IEEE Access.
Ezugwu, A. E., Ikotun, A. M., Oyelade, O. O., Abualigah, L., Agushaka, J. O., Eke, C. I., & Akinyelu, A. A. (2022a). A comprehensive survey of clustering algorithms: State-of-the-art machine learning applications, taxonomy, challenges, and future research prospects. Engineering Applications of Artificial Intelligence, 110, 104743.
Ezugwu, A. E., Ikotun, A. M., Oyelade, O. O., Abualigah, L., Agushaka, J. O., Eke, C. I., & Akinyelu, A. A. (2022b). A comprehensive survey of clustering algorithms: State-of-the-art machine learning applications, taxonomy, challenges, and future research prospects. Engineering Applications of Artificial Intelligence, 110, 104743.
Fahad, A., Alshatri, N., Tari, Z., Alamri, A., Khalil, I., Zomaya, A. Y., Foufou, S., & Bouras, A. (2014). A survey of clustering algorithms for big data: Taxonomy and empirical analysis. IEEE Transactions on Emerging Topics in Computing, 2(3), 267–279.
Fakir, Y., & El Iklil, J. (2021). Clustering techniques for big data mining. International Conference on Business Intelligence, 183–200.
Faroughi, A., Boostani, R., Tajalizadeh, H., & Javidan, R. (2023). ARD-Stream: An adaptive radius density-based stream clustering. Future Generation Computer Systems, 149, 416–431.
Fu, X., Yuan, Y., Qiu, H., Suo, H., Song, Y., Li, A., Zhang, Y., Xiao, C., Li, Y., & Dou, L. (2024). AGF-PPIS: A protein–protein interaction site predictor based on an attention mechanism and graph convolutional networks. Methods.
Gao, L., Song, J., Liu, X., Shao, J., Liu, J., & Shao, J. (2017). Learning in high-dimensional multimedia data: the state of the art. Multimedia Systems, 23, 303–313.
Ghazal, T. M. (2021). Performances of K-means clustering algorithm with different distance metrics. Intelligent Automation & Soft Computing, 30(2), 735–742.
Ghosal, A., Nandy, A., Das, A. K., Goswami, S., & Panday, M. (2020). A short review on different clustering techniques and their applications. Emerging Technology in Modelling and Graphics: Proceedings of IEM Graph 2018, 69–83.
Gu, B., & Sheng, V. S. (2013). Feasibility and finite convergence analysis for accurate on-line $\nu $-Support vector machine. IEEE Transactions on Neural Networks and Learning Systems, 24(8), 1304–1315.
Guo, T., Yu, K., Aloqaily, M., & Wan, S. (2022). Constructing a prior-dependent graph for data clustering and dimension reduction in the edge of AIoT. Future Generation Computer Systems, 128, 381–394.
Han, X., Quan, L., Xiong, X., Almeter, M., Xiang, J., & Lan, Y. (2017). A novel data clustering algorithm based on modified gravitational search algorithm. Engineering Applications of Artificial Intelligence, 61, 1–7.
Hao, Z., Lu, Z., Li, G., Nie, F., Wang, R., & Li, X. (2023). Ensemble clustering with attentional representation. IEEE Transactions on Knowledge and Data Engineering.
Haris, M., Yusoff, Y., Zain, A. M., Khattak, A. S., & Hussain, S. F. (2024). Breaking down multi-view clustering: A comprehensive review of multi-view approaches for complex data structures. Engineering Applications of Artificial Intelligence, 132, 107857.
Hassan, Z. F., Al-Shareefi, F., & Gheni, H. Q. (2023). A Coloured Image Watermarking Based on Genetic K-Means Clustering Methodology. Journal of Advances in Information Technology, 14(2).
He, G., Jiang, W., Peng, R., Yin, M., & Han, M. (2022). Soft Subspace Based Ensemble Clustering for Multivariate Time Series Data. IEEE Transactions on Neural Networks and Learning Systems.
He, M., & Chen, H. (2024). Anomaly Detection in Species Distribution Patterns: A Spatio-Temporal Approach for Biodiversity Conservation. Journal of Biobased Materials and Bioenergy, 18(1), 39–50.
Hossain, M. Z., Akhtar, M. N., Ahmad, R. B., & Rahman, M. (2019). A dynamic K-means clustering for data mining. Indonesian Journal of Electrical Engineering and Computer Science, 13(2), 521–526.
Huang, Q., Gao, R., & Akhavan, H. (2023). An ensemble hierarchical clustering algorithm based on merits at cluster and partition levels. Pattern Recognition, 136, 109255.
Iam-On, N., & Boongoen, T. (2015). Diversity-driven generation of link-based cluster ensemble and application to data classification. Expert Systems with Applications, 42(21), 8259–8273.
Ikotun, A. M., Ezugwu, A. E., Abualigah, L., Abuhaija, B., & Heming, J. (2023a). K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data. Information Sciences, 622, 178–210.
Ikotun, A. M., Ezugwu, A. E., Abualigah, L., Abuhaija, B., & Heming, J. (2023b). K-means clustering algorithms: A comprehensive review, variants analysis, and advances in the era of big data. Information Sciences, 622, 178–210.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: a review. ACM Computing Surveys (CSUR), 31(3), 264–323.
Jain, P. K., & Pamula, R. (2019). Two-step anomaly detection approach using clustering algorithm. International Conference on Advanced Computing Networking and Informatics: ICANI-2018, 513–520.
JayaLakshmi, A. N. M., & Kishore, K. V. K. (2022). Performance evaluation of DNN with other machine learning techniques in a cluster using Apache Spark and MLlib. Journal of King Saud University-Computer and Information Sciences, 34(1), 1311–1319.
Jeong, S., Park, J., & Lim, S. (2023). mr2vec: Multiple role-based social network embedding. Pattern Recognition Letters, 176, 140–146.
Kadiravan, G., Sujatha, P., Asvany, T., Punithavathi, R., Elhoseny, M., Pustokhina, I. V, Pustokhin, D. A., & Shankar, K. (2021). Metaheuristic Clustering Protocol for Healthcare Data Collection in Mobile Wireless Multimedia Sensor Networks. Computers, Materials & Continua, 66(3).
Kannout, E., Grodzki, M., & Grzegorowski, M. (2023). Towards addressing item cold-start problem in collaborative filtering by embedding agglomerative clustering and FP-growth into the recommendation system. Computer Science and Information Systems, 00, 52.
Karthikeyan, B., George, D. J., Manikandan, G., & Thomas, T. (2020). A comparative study on k-means clustering and agglomerative hierarchical clustering. International Journal of Emerging Trends in Engineering Research, 8(5).
Kaya, M.-F., & Schoop, M. (2022). Analytical comparison of clustering techniques for the recognition of communication patterns. Group Decision and Negotiation, 31(3), 555–589.
Kharchenko, P. V. (2021). The triumphs and limitations of computational methods for scRNA-seq. Nature Methods, 18(7), 723–732.
Kim, S., Cha, J., Kim, D., & Park, E. (2023). Understanding Mental Health Issues in Different Subdomains of Social Networking Services: Computational Analysis of Text-Based Reddit Posts. Journal of Medical Internet Research, 25, e49074.
Krishnaswamy, R., Subramaniam, K., Nandini, V., Vijayalakshmi, K., Kadry, S., & Nam, Y. (2023). Metaheuristic Based Clustering with Deep Learning Model for Big Data Classification. Comput. Syst. Sci. Eng., 44(1), 391–406.
Kuo, R. J., Chang, C. K., Nguyen, T. P. Q., & Liao, T. W. (2021). Application of genetic algorithm-based intuitionistic fuzzy weighted c-ordered-means algorithm to cluster analysis. Knowledge and Information Systems, 63, 1935–1959.
Kuwil, F. H., Shaar, F., Topcu, A. E., & Murtagh, F. (2019). A new data clustering algorithm based on critical distance methodology. Expert Systems with Applications, 129, 296–310.
lahmood HAMEED, F., & DAKKAK, O. (2022). Brain Tumor Detection and Classification Using Convolutional Neural Network (CNN). 2022 International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), 1–7.
Laohakiat, S., & Sa-Ing, V. (2021). An incremental density-based clustering framework using fuzzy local clustering. Information Sciences, 547, 404–426.
Lee, Y., Park, C., & Kang, S. (2022). Deep Embedded Clustering Framework for Mixed Data. IEEE Access, 11, 33–40.
Li, X., Chen, X., & Rezaeipanah, A. (2023). Automatic breast cancer diagnosis based on hybrid dimensionality reduction technique and ensemble classification. Journal of Cancer Research and Clinical Oncology, 1–19.
Liu, C., Nie, F., Wang, R., & Li, X. (2022). Scalable fuzzy clustering with anchor graph. IEEE Transactions on Knowledge and Data Engineering.
Liu, H., Yang, J., Ye, M., James, S. C., Tang, Z., Dong, J., & Xing, T. (2021). Using t-distributed Stochastic Neighbor Embedding (t-SNE) for cluster analysis and spatial zone delineation of groundwater geochemistry data. Journal of Hydrology, 597, 126146.
Liu, R., Ren, R., Liu, J., & Liu, J. (2020). A clustering and dimensionality reduction based evolutionary algorithm for large-scale multi-objective problems. Applied Soft Computing, 89, 106120.
Lv, Y., Ma, T., Tang, M., Cao, J., Tian, Y., Al-Dhelaan, A., & Al-Rodhaan, M. (2016). An efficient and scalable density-based clustering algorithm for datasets with complex structures. Neurocomputing, 171, 9–22.
Lydia, E. L., Moses, G. J., Varadarajan, V., Nonyelu, F., Maseleno, A., Perumal, E., & Shankar, K. (2020). Clustering and indexing of multiple documents using feature extraction through apache hadoop on big data. Malaysian Journal of Computer Science, 108–123.
Maia, J., Junior, C. A. S., Guimarães, F. G., de Castro, C. L., Lemos, A. P., Galindo, J. C. F., & Cohen, M. W. (2020). Evolving clustering algorithm based on mixture of typicalities for stream data mining. Future Generation Computer Systems, 106, 672–684.
Marqués-Sánchez, P., Martínez-Fernández, M. C., Benítez-Andrades, J. A., Quiroga-Sánchez, E., García-Ordás, M. T., & Arias-Ramos, N. (2023). Adolescent relational behaviour and the obesity pandemic: A descriptive study applying social network analysis and machine learning techniques. PloS One, 18(8), e0289553.
Mayanglambam, S. D., Horng, S.-J., & Pamula, R. (2023). PSO clustering and pruning-based KNN for outlier detection. Soft Computing, 1–17.
Mohammadi, M., Shokrollahi, A., Reisi, M., Abdollahpouri, A., & Moradi, P. (2023). Scalable and robust big data clustering with adaptive local feature weighting based on the Map-Reduce and Hadoop.
Mortensen, K. O., Zardbani, F., Haque, M. A., Agustsson, S. Y., Mottin, D., Hofmann, P., & Karras, P. (2023). Marigold: Efficient k-Means Clustering in High Dimensions. Proceedings of the VLDB Endowment, 16(7), 1740–1748.
Mrukwa, G., & Polanska, J. (2022). DiviK: divisive intelligent K-means for hands-free unsupervised clustering in big biological data. BMC Bioinformatics, 23(1), 1–24.
Mussabayev, R., Mladenovic, N., Jarboui, B., & Mussabayev, R. (2023). How to use K-means for big data clustering? Pattern Recognition, 137, 109269.
Nie, X., Qin, D., Zhou, X., Duo, H., Hao, Y., Li, B., & Liang, G. (2023). Clustering ensemble in scRNA-seq data analysis: Methods, applications and challenges. Computers in Biology and Medicine, 106939.
Nozari, H., & Sadeghi, M. E. (2021). Artificial intelligence and Machine Learning for Real-world problems (A survey). International Journal of Innovation in Engineering, 1(3), 38–47.
Ollagnier, A., Cabrio, E., & Villata, S. (2023). Unsupervised fine-grained hate speech target community detection and characterisation on social media. Social Network Analysis and Mining, 13(1), 58.
Omar, N., Nazirun, N. N., Vijayam, B., Wahab, A. A., & Bahuri, H. A. (2023). Diabetes subtypes classification for personalized health care: A review. Artificial Intelligence Review, 56(3), 2697–2721.
Ortakci, Y. (2017). Parallel particle swarm optimization in data clustering. Int. J Soft Comput. Artif. Intell.(IJSCAI), 5(1), 10–14.
Oskouei, A. G., Balafar, M. A., & Motamed, C. (2021). FKMAWCW: categorical fuzzy k-modes clustering with automated attribute-weight and cluster-weight learning. Chaos, Solitons & Fractals, 153, 111494.
Pareek, J., & Jacob, J. (2021). Data compression and visualization using PCA and T-SNE. Advances in Information Communication Technology and Computing: Proceedings of AICTC 2019, 327–337.
Patel, D., Modi, R., & Sarvakar, K. (2014). A comparative study of clustering data mining: Techniques and research challenges. International Journal of Latest Technology in Engineering, Management & Applied Science, 3(9), 67–70.
Pérez-Ortega, J., Rey-Figueroa, C. D., Roblero-Aguilar, S. S., Almanza-Ortega, N. N., Zavala-Díaz, C., García-Paredes, S., & Landero-Nájera, V. (2023). POFCM: A Parallel Fuzzy Clustering Algorithm for Large Datasets. Mathematics, 11(8), 1920.
Pham, N. D., Le, T. D., Park, K., & Choo, H. (2010). SCCS: Spatiotemporal clustering and compressing schemes for efficient data collection applications in WSNs. International Journal of Communication Systems, 23(11), 1311–1333.
Phan, H. T., & Nguyen, N. T. (2024). A Fuzzy Graph Convolutional Network Model for Sentence-Level Sentiment Analysis. IEEE Transactions on Fuzzy Systems.
Phan, H. T., Nguyen, N. T., & Hwang, D. (2023). Aspect-level sentiment analysis: A survey of graph convolutional network methods. Information Fusion, 91, 149–172.
Price, M. A., McEwen, J. D., Cai, X., Kitching, T. D., Wallis, C. G. R., & Collaboration), L. D. E. S. (2021). Sparse Bayesian mass mapping with uncertainties: hypothesis testing of structure. Monthly Notices of the Royal Astronomical Society, 506(3), 3678–3690.
Purwandari, K., Sigalingging, J. W. C., Fhadli, M., Arizky, S. N., & Pardamean, B. (2020). Data mining for predicting customer satisfaction using clustering techniques. 2020 International Conference on Information Management and Technology (ICIMTech), 223–227.
Qoku, A., & Buettner, F. (2023). Encoding Domain Knowledge in Multi-view Latent Variable Models: A Bayesian Approach with Structured Sparsity. International Conference on Artificial Intelligence and Statistics, 11545–11562.
Qu, W., Xiu, X., Chen, H., & Kong, L. (2023). A Survey on High-Dimensional Subspace Clustering. Mathematics, 11(2), 436.
Rahayu, K., Novianti, L., & Kusnandar, M. (2020). Implementation data mining with K-Means algorithm for clustering distribution rabies case area in Palembang City. Journal of Physics: Conference Series, 1500(1), 012121.
Ran, X., Xi, Y., Lu, Y., Wang, X., & Lu, Z. (2023). Comprehensive survey on hierarchical clustering algorithms and the recent developments. Artificial Intelligence Review, 56(8), 8219–8264.
Ray, P., Reddy, S. S., & Banerjee, T. (2021). Various dimension reduction techniques for high dimensional data analysis: a review. Artificial Intelligence Review, 54, 3473–3515.
Reddy, G. T., Reddy, M. P. K., Lakshmanna, K., Kaluri, R., Rajput, D. S., Srivastava, G., & Baker, T. (2020). Analysis of dimensionality reduction techniques on big data. Ieee Access, 8, 54776–54788.
Rehman, M. U., & Khan, D. M. (2021). A novel density-based technique for outlier detection of high dimensional data utilizing full feature space. Information Technology and Control, 50(1), 138–152.
Richards, J. A., & Richards, J. A. (2022). Remote sensing digital image analysis (Vol. 5). Springer.
Rubarth, K., Sattler, P., Zimmermann, H. G., & Konietschke, F. (2021). Estimation and testing of Wilcoxon–Mann–Whitney effects in factorial clustered data designs. Symmetry, 14(2), 244.
Sabitha, A. S., & Bansal, A. (2017). Climate change analysis to study land surface temparature trends. 2017 3rd International Conference on Computational Intelligence & Communication Technology (CICT), 1–8.
Sahoo, S. K., Pattanaik, P., Mohanty, M. N., & Mishra, D. K. (2023). Opposition Learning Based Improved Bee Colony Optimization (OLIBCO) Algorithm for Data Clustering. International Journal of Advanced Computer Science and Applications, 14(4).
Saklani, R., Purohit, K., Vats, S., Sharma, V., Kukreja, V., & Yadav, S. P. (2023). Multicore Implementation of K-Means Clustering Algorithm. 2023 2nd International Conference on Applied Artificial Intelligence and Computing (ICAAIC), 171–175.
Samoilenko, S., & Osei-Bryson, K.-M. (2019). Representation matters: An exploration of the socio-economic impacts of ICT-enabled public value in the context of sub-Saharan economies. International Journal of Information Management, 49, 69–85.
Saxena, A., Prasad, M., Gupta, A., Bharill, N., Patel, O. P., Tiwari, A., Er, M. J., Ding, W., & Lin, C.-T. (2017a). A review of clustering techniques and developments. Neurocomputing, 267, 664–681.
Saxena, A., Prasad, M., Gupta, A., Bharill, N., Patel, O. P., Tiwari, A., Er, M. J., Ding, W., & Lin, C.-T. (2017b). A review of clustering techniques and developments. Neurocomputing, 267, 664–681.
Shah, N. H., Priamvada, A., & Shukla, B. P. (2023). Decoding spatial precipitation patterns using artificial intelligence. Spatial Information Research, 1–12.
Sharma, S., Agrawal, J., Agarwal, S., & Sharma, S. (2013). Machine learning techniques for data mining: A survey. 2013 IEEE International Conference on Computational Intelligence and Computing Research, 1–6.
Sheng, G., Wang, Q., Pei, C., & Gao, Q. (2022). Contrastive deep embedded clustering. Neurocomputing, 514, 13–20.
Shi, Y., Yang, K., Yu, Z., Chen, C. L. P., & Zeng, H. (2023). Adaptive Ensemble Clustering With Boosting BLS-Based Autoencoder. IEEE Transactions on Knowledge and Data Engineering.
Shrifan, N. H. M. M., Akbar, M. F., & Isa, N. A. M. (2022). An adaptive outlier removal aided k-means clustering algorithm. Journal of King Saud University-Computer and Information Sciences, 34(8), 6365–6376.
Sinaga, K. P., Hussain, I., & Yang, M.-S. (2021). Entropy K-means clustering with feature reduction under unknown number of clusters. IEEE Access, 9, 67736–67751.
Souiden, I., Omri, M. N., & Brahmi, Z. (2022). A survey of outlier detection in high dimensional data streams. Computer Science Review, 44, 100463.
Sun, L., Zhang, J., Ding, W., & Xu, J. (2022). Feature reduction for imbalanced data classification using similarity-based feature clustering with adaptive weighted K-nearest neighbors. Information Sciences, 593, 591–613.
Tejasree, S., & Chandra Mohan, B. (2023). An improved differential bond energy algorithm with fuzzy merging method to improve the document clustering for information mining. Expert Systems, e13261.
Thrun, M. C., & Ultsch, A. (2021). Using projection-based clustering to find distance-and density-based clusters in high-dimensional data. Journal of Classification, 38, 280–312.
Thudumu, S., Branch, P., Jin, J., & Singh, J. (2020). A comprehensive survey of anomaly detection techniques for high dimensional big data. Journal of Big Data, 7, 1–30.
Tiwari, A. (2021). Enhancing k-means algorithm clustering performance with improved time complexity. National Conference on “Unprecedented and Advanced Concepts of Computer Vision” NCUACC, 11(12).
Ukey, N., Yang, Z., Li, B., Zhang, G., Hu, Y., & Zhang, W. (2023). Survey on exact knn queries over high-dimensional data space. Sensors, 23(2), 629.
Utku, A., Can, U., & Aslan, S. (2023). Detection of hateful twitter users with graph convolutional network model. Earth Science Informatics, 16(1), 329–343.
Vandhana, S., & Anuradha, J. (2021). Environmental air pollution clustering using enhanced ensemble clustering methodology. Environmental Science and Pollution Research, 28, 40746–40755.
Wang, C., Danilevsky, M., Desai, N., Zhang, Y., Nguyen, P., Taula, T., & Han, J. (2013). A phrase mining framework for recursive construction of a topical hierarchy. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 437–445.
Wang, F., Zheng, Z., Zhang, Y., Li, Y., Yang, K., & Zhu, C. (2023). To see further: Knowledge graph-aware deep graph convolutional network for recommender systems. Information Sciences, 647, 119465.
Wang, L., Wang, Y., Deng, H., & Chen, H. (2023). Attention reweighted sparse subspace clustering. Pattern Recognition, 139, 109438.
Wickramasinghe, C. S., Marino, D. L., & Manic, M. (2021). ResNet autoencoders for unsupervised feature learning from high-dimensional data: Deep models resistant to performance degradation. IEEE Access, 9, 40511–40520.
Wright, J., & Ma, Y. (2022). High-dimensional data analysis with low-dimensional models: Principles, computation, and applications. Cambridge University Press.
Xie, J., Xu, X., Lan, Y., Shi, X., Yong, Y., & Wu, D. (2023). Automatic velocity picking with restricted weighted k-means clustering using prior information. Frontiers in Earth Science, 10, 1076999.
Xie, W.-B., Lee, Y.-L., Wang, C., Chen, D.-B., & Zhou, T. (2020). Hierarchical clustering supported by reciprocal nearest neighbors. Information Sciences, 527, 279–292.
Xie, Z., Nie, M., & Wang, T. (2009). Clustering Based Compress Data Cube Algorithm. 2009 WRI World Congress on Software Engineering, 4, 429–433.
Xu, D., & Tian, Y. (2015). A comprehensive survey of clustering algorithms. Annals of Data Science, 2, 165–193. Yedla, M., Pathakota, S. R., & Srinivasa, T. M. (2010). Enhancing K-means clustering algorithm with improved initial center. International Journal of Computer Science and Information Technologies, 1(2), 121–125.
Yu, T.-T., Chen, C.-Y., Wu, T.-H., & Chang, Y.-C. (2023). Application of high-dimensional uniform manifold approximation and projection (UMAP) to cluster existing landfills on the basis of geographical and environmental features. Science of The Total Environment, 904, 167013.
Yuan, C., & Yang, H. (2019). Research on K-value selection method of K-means clustering algorithm. J, 2(2), 226–235.
Yue, G., Deng, A., Qu, Y., Cui, H., & Liu, J. (n.d.). Fuzzy-Rough induced spectral ensemble clustering. Journal of Intelligent & Fuzzy Systems, Preprint, 1–18.
Zhong, L., Yang, J., Chen, Z., & Wang, S. (2023). Contrastive Graph Convolutional Networks With Generative Adjacency Matrix. IEEE Transactions on Signal Processing, 71, 772–785.

There are 145 citations in total.

Details

Primary Language	English
Subjects	Data Mining and Knowledge Discovery
Journal Section	PAPERS
Authors	Tasnim Alasalı (ORCID 0009-0009-4780-5088); Yasin Ortakcı (ORCID 0000-0002-0683-2049)
Publication Date	June 6, 2024
Submission Date	January 17, 2024
Acceptance Date	March 5, 2024
Published in Issue	Year 2024

Cite

APA	Alasalı, T., & Ortakcı, Y. (2024). Clustering Techniques in Data Mining: A Survey of Methods, Challenges, and Applications. Computer Science, 9(1), 32–50. https://doi.org/10.53070/bbd.1421527

The Creative Commons Attribution 4.0 International License is applied to all research papers published by JCS, and a Digital Object Identifier (DOI) is assigned to each published paper.