A k-mer based metaheuristic approach for detecting COVID-19 variants

Hilal Arslan

doi:10.24012/dumf.1195600

Research Article

Year 2023, Volume 14, Issue 1, pp. 17–26, 23.03.2023

Cited By: 1

Abstract

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) belongs to the Coronaviridae family. A change in the genetic sequence of SARS-CoV-2 is called a mutation, and mutations give rise to variants of SARS-CoV-2. In this paper, we propose a novel and efficient method to predict SARS-CoV-2 variants of concern from whole-genome sequences of SARS-CoV-2 collected from human hosts. The method uses 16 dinucleotide (2-mer) and 64 trinucleotide (3-mer) features to differentiate SARS-CoV-2 variants of concern. The efficacy of the proposed features is demonstrated using four classifiers: k-nearest neighbor, support vector machine (SVM), multilayer perceptron, and random forest. The method is evaluated on a dataset of 223,326 complete genome sequences covering the designated variants of concern: Alpha, Beta, Gamma, Delta, and Omicron. Experimental results show that the overall accuracy of detecting SARS-CoV-2 variants of concern increases markedly when trinucleotide features are used instead of dinucleotide features. Furthermore, we use the whale optimization algorithm, a state-of-the-art metaheuristic, to reduce the number of features and choose the most relevant ones. With it, we select 44 of the 64 trinucleotide features while still differentiating SARS-CoV-2 variants with acceptable accuracy. Experimental results indicate that the SVM classifier with the selected features achieves, on average, about 99% accuracy, sensitivity, specificity, and precision. The proposed method therefore performs strongly in detecting SARS-CoV-2 variants.
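The dinucleotide and trinucleotide features mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say whether the features are raw counts or normalized frequencies, nor how ambiguous bases are handled, so the choices below (overlapping windows, normalized frequencies, skipping windows that contain non-ACGT characters) are assumptions, and the function name `kmer_features` is hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict

ALPHABET = "ACGT"


def kmer_features(sequence: str, k: int) -> Dict[str, float]:
    """Return normalized frequencies of all 4**k k-mers over A, C, G, T.

    Assumption: overlapping windows are counted, and any window that
    contains an ambiguous base (for example N) is skipped.
    """
    seq = sequence.upper()
    # Enumerate every possible k-mer so the feature vector has a fixed
    # length: 16 entries for k=2 and 64 entries for k=3.
    all_kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts: Counter = Counter()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if all(base in ALPHABET for base in window):
            counts[window] += 1
    total = sum(counts.values())
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


if __name__ == "__main__":
    toy_genome = "ACGTACGTNNACGT"  # toy string, not real sequence data
    di = kmer_features(toy_genome, 2)
    tri = kmer_features(toy_genome, 3)
    print(len(di), len(tri))
```

Each genome then maps to a fixed-length numeric vector (16 or 64 values) that can be passed to any of the four classifiers named above.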

Keywords

COVID-19, SARS-CoV-2, Whale Optimization Algorithm, Classifiers, Feature Selection, Machine Learning

A k-mer based metaheuristic approach for detecting COVID-19 variants

Abstract

The emergence of SARS-CoV-2 variants threatens public health and markedly prolongs the COVID-19 pandemic. Rapid and accurate detection of SARS-CoV-2 variants is crucial to track mutations, monitor changes, measure the efficacy of current vaccines, assess the evolution of SARS-CoV-2, and prevent its spread. In this paper, we propose a novel and efficient method to predict SARS-CoV-2 variants of concern from whole-genome sequences of SARS-CoV-2 collected from human hosts. The method uses 16 dinucleotide (2-mer) and 64 trinucleotide (3-mer) features to differentiate SARS-CoV-2 variants of concern. The efficacy of the proposed features is demonstrated using four classifiers: k-nearest neighbor, support vector machine (SVM), multilayer perceptron, and random forest. The method is evaluated on a dataset of 223,326 complete genome sequences covering the designated variants of concern: Alpha, Beta, Gamma, Delta, and Omicron. Experimental results show that the overall accuracy of detecting SARS-CoV-2 variants of concern increases markedly when trinucleotide features are used instead of dinucleotide features. Furthermore, we use the whale optimization algorithm, a state-of-the-art metaheuristic, to reduce the number of features and choose the most relevant ones. With it, we select 44 of the 64 trinucleotide features while still differentiating SARS-CoV-2 variants with acceptable accuracy. Experimental results indicate that the SVM classifier with the selected features achieves, on average, about 99% accuracy, sensitivity, specificity, and precision. The proposed method therefore performs strongly in detecting SARS-CoV-2 variants.
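The feature-selection step can be sketched as a wrapper search driven by the whale optimization algorithm. This is a simplified illustration rather than the paper's implementation: positions are kept in [0, 1] and thresholded at 0.5 to obtain a feature mask, which is only one of several possible binarization schemes, and `toy_fitness` is a hypothetical stand-in for what would typically be a combination of classifier error and subset size. The names `woa_select` and `toy_fitness` are assumptions introduced here.

```python
import math
import random
from typing import Callable, List, Tuple

Mask = List[bool]


def woa_select(fitness: Callable[[Mask], float], n_features: int,
               n_whales: int = 10, n_iter: int = 30,
               seed: int = 0) -> Tuple[Mask, float]:
    """Minimize `fitness` over feature masks with a basic whale optimizer."""
    rng = random.Random(seed)

    def to_mask(pos: List[float]) -> Mask:
        # Assumption: a feature is selected when its coordinate exceeds 0.5.
        return [x > 0.5 for x in pos]

    whales = [[rng.random() for _ in range(n_features)] for _ in range(n_whales)]
    best_pos = min(whales, key=lambda w: fitness(to_mask(w)))[:]
    best_fit = fitness(to_mask(best_pos))
    b = 1.0  # shape constant of the logarithmic spiral

    for t in range(n_iter):
        a = 2.0 * (1 - t / n_iter)  # decreases linearly from 2 towards 0
        for i, w in enumerate(whales):
            A = 2 * a * rng.random() - a
            C = 2 * rng.random()
            p = rng.random()
            l = rng.uniform(-1, 1)
            if p < 0.5:
                # |A| < 1: encircle the best whale; otherwise explore
                # around a randomly chosen whale.
                ref = best_pos if abs(A) < 1 else whales[rng.randrange(n_whales)]
                new = [r - A * abs(C * r - x) for r, x in zip(ref, w)]
            else:
                # Spiral update around the best whale found so far.
                new = [abs(bp - x) * math.exp(b * l) * math.cos(2 * math.pi * l) + bp
                       for bp, x in zip(best_pos, w)]
            whales[i] = [min(1.0, max(0.0, v)) for v in new]  # clamp to [0, 1]
            fit = fitness(to_mask(whales[i]))
            if fit < best_fit:
                best_fit, best_pos = fit, whales[i][:]
    return to_mask(best_pos), best_fit


def toy_fitness(mask: Mask) -> float:
    """Hypothetical objective: even-indexed features are 'informative'."""
    missed = sum(1 for j in range(0, len(mask), 2) if not mask[j])
    extra = sum(1 for j in range(1, len(mask), 2) if mask[j])
    return missed + 0.1 * extra


if __name__ == "__main__":
    mask, fit = woa_select(toy_fitness, n_features=64)
    print(sum(mask), "features selected, fitness", fit)
```

In a real wrapper setup, `toy_fitness` would be replaced by a function that trains and validates a classifier on the masked feature columns, which is far more expensive per evaluation and is the main reason population size and iteration count are kept small.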

Keywords

COVID-19, SARS-CoV-2, Classifiers, Feature Selection, Machine Learning

There are 34 references in total.

Details

Primary Language	English
Section	Articles
Authors	Hilal Arslan (ORCID: 0000-0002-6449-6952)
Publication Date	March 23, 2023
Submission Date	October 27, 2022
Published in Issue	Year 2023

Cite

IEEE	H. Arslan, “A k-mer based metaheuristic approach for detecting COVID-19 variants”, DÜMF MD, vol. 14, no. 1, pp. 17–26, 2023, doi: 10.24012/dumf.1195600.

Cited By

A Parallel Algorithm for Designing Primer and Probe for Accurate Detection of Severe Acute Respiratory Syndrome Coronavirus

Black Sea Journal of Engineering and Science

https://doi.org/10.34248/bsengineering.1324890

All articles published by DUJE are licensed under the Creative Commons Attribution 4.0 International License. This permits anyone to copy, redistribute, remix, transmit, and adapt the work, provided that the original work and source are appropriately cited.