Comparative analysis of machine learning techniques for detecting potability of water

Vahid Sinap

doi:10.59313/jsr-a.1416015

Research Article

Year 2024, pp. 135-161, 29.09.2024

Abstract

This research evaluates the effectiveness of machine learning algorithms in determining the potability of water. A total of 3276 water samples were analyzed across 10 features that determine potability. The study also evaluates the impact of the trimming, IQR, and percentile methods on the performance of these algorithms. Models were built using nine classification algorithms (Logistic Regression, Decision Trees, Random Forest, XGBoost, Naive Bayes, K-Nearest Neighbors, Support Vector Machine, AdaBoost, and Bagging Classifier). According to the results, filling missing data with the population mean and handling outliers with the trimming and IQR methods improved model performance. The Random Forest and Decision Tree algorithms were the most accurate in determining potability. The findings are important for sustainable water resource management and can inform decision-making on water quality. The study also serves as an example for researchers working with datasets that contain missing values and outliers.
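The two preprocessing steps the abstract credits with improving performance, mean imputation and IQR-based outlier handling, can be illustrated with a minimal sketch. It is not the article's implementation. The function names, the toy values, the conventional 1.5 multiplier, and the choice to drop (rather than cap) values outside the IQR fences are all assumptions made for illustration, since the abstract does not define the methods in detail.

```python
from statistics import mean, quantiles


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(values, k=1.5):
    """Keep only the values that lie inside the IQR fences (assumed rule)."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]


# Hypothetical pH-like column: one missing reading and one extreme value.
ph = [7.0, 6.8, None, 7.2, 7.1, 14.0, 6.9]

filled = impute_mean(ph)         # the None entry becomes the column mean
cleaned = drop_outliers(filled)  # the extreme reading is removed

print(filled)
print(cleaned)
```

In this sketch the mean is computed before outliers are handled, so the extreme reading inflates the imputed value. Whether the article imputes before or after outlier handling is not stated in the abstract, and the order is worth checking when reproducing the comparison.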

Keywords

water quality, potability analysis, machine learning, classification, data processing
