High Performance Classification of Cancer Types with Gene Microarray Datasets: Hybrid Approach

Yılmaz Atay; Muhterem Oğuzhan Yıldırım; Cuma Umur Doğan

doi:10.29109/gujsc.1000926

Research Article

Year 2021, Volume: 9, Issue: 4, 811-827, 29.12.2021

Abstract

Detecting biologically meaningful patterns in gene microarray datasets is now used effectively in many areas, such as disease diagnosis and the differentiation of cancer types. However, because microarray technology measures gene expression profiles collectively, the number of features in the resulting datasets can be very high. The small number of samples, the large number of features, and the noise in the data significantly complicate the preprocessing of these datasets. For machine learning models to classify successfully, the number of features, that is, the dimensionality of the dataset, should be reduced. In the proposed method, gene microarray data is taken as input and the Information Gain, Fisher Correlation Scoring, ReliefF, and Chi-Square methods are applied separately for feature selection. This stage yields a sub-dataset of selected genes, from which the gene pool for the Genetic Algorithm is created. After the algorithm has been run repeatedly for an appropriate number of steps, a Naive Bayes classifier is trained on the sub-dataset formed from the genes of the most successful chromosome, which completes the classification of the cancer data. The proposed model was applied to datasets frequently used in the literature and achieved high classification success rates. In summary, appropriate feature selection methods combined with the Genetic Algorithm-based hybrid method generally produced the most suitable results on all test data.
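The filter, Genetic Algorithm, and Naive Bayes stages described above can be illustrated with a small sketch. It is not the authors' implementation: it uses only a two-class Fisher score as the filter (the paper also applies Information Gain, ReliefF, and Chi-Square), and the function names, the leave-one-out fitness, the crossover and mutation operators, the chromosome length, and the synthetic data are all assumptions made for illustration.

```python
import math
import random

def fisher_score(column, labels):
    """Two-class Fisher score: (mean0 - mean1)^2 / (var0 + var1)."""
    groups = {0: [], 1: []}
    for value, label in zip(column, labels):
        groups[label].append(value)
    mean = {c: sum(v) / len(v) for c, v in groups.items()}
    var = {c: sum((x - mean[c]) ** 2 for x in v) / len(v) for c, v in groups.items()}
    return (mean[0] - mean[1]) ** 2 / (var[0] + var[1] + 1e-9)

def top_genes(X, y, k):
    """Filter step: indices of the k genes with the highest Fisher score."""
    scores = [fisher_score([row[g] for row in X], y) for g in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)[:k]

def nb_fit(X, y, genes):
    """Gaussian Naive Bayes: class prior plus per-gene (mean, variance)."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        stats = []
        for g in genes:
            vals = [r[g] for r in rows]
            mu = sum(vals) / len(vals)
            stats.append((mu, sum((v - mu) ** 2 for v in vals) / len(vals) + 1e-6))
        model[c] = (len(rows) / len(y), stats)
    return model

def nb_predict(model, x, genes):
    """Return the class with the highest log-posterior for sample x."""
    def log_posterior(c):
        prior, stats = model[c]
        return math.log(prior) + sum(
            -0.5 * math.log(2 * math.pi * var) - (x[g] - mu) ** 2 / (2 * var)
            for g, (mu, var) in zip(genes, stats))
    return max(model, key=log_posterior)

def fitness(X, y, genes):
    """Chromosome fitness: leave-one-out accuracy of Naive Bayes (assumed)."""
    hits = 0
    for i in range(len(X)):
        model = nb_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:], genes)
        hits += nb_predict(model, X[i], genes) == y[i]
    return hits / len(X)

def genetic_search(X, y, pool, size=3, pop=12, generations=15, seed=0):
    """Evolve fixed-length gene subsets drawn from the filtered pool."""
    rng = random.Random(seed)
    population = [rng.sample(pool, size) for _ in range(pop)]
    for _ in range(generations):
        population.sort(key=lambda ch: fitness(X, y, ch), reverse=True)
        parents = population[: pop // 2]  # keep the fitter half
        children = []
        while len(parents) + len(children) < pop:
            a, b = rng.sample(parents, 2)
            # Crossover: head of a, then genes of b, duplicates dropped.
            child = list(dict.fromkeys(a[: size // 2] + b))[:size]
            unused = [g for g in pool if g not in child]
            if unused and rng.random() < 0.2:  # mutation: swap in an unused pool gene
                child[rng.randrange(size)] = rng.choice(unused)
            children.append(child)
        population = parents + children
    return max(population, key=lambda ch: fitness(X, y, ch))

# Synthetic data: 20 samples, 30 genes; only genes 0 and 1 differ by class.
rng = random.Random(42)
y = [0] * 10 + [1] * 10
X = [[rng.gauss(4.0 * label if g < 2 else 0.0, 1.0) for g in range(30)] for label in y]

pool = top_genes(X, y, k=8)        # filter stage builds the gene pool
best = genetic_search(X, y, pool)  # GA selects the best chromosome
print(sorted(best), fitness(X, y, best))
```

Restricting the Genetic Algorithm to a filtered pool keeps the search space small, which matters when a dataset has thousands of genes but only a few dozen samples. In a real study the fitness would be estimated with a validation scheme that keeps the final test data out of the gene selection loop.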

Keywords

ensemble method, genetic algorithm, cancer, microarray, naive bayes, feature selection, classification

Turkish title: Gen Mikrodizi Veri Setleriyle Kanser Türlerinin Yüksek Başarımlı Sınıflandırılması: Hibrit Yaklaşım

The article lists 27 references in total.

Details

Primary Language	English
Subjects	Engineering
Section	Design and Technology
Authors	Yılmaz Atay (ORCID 0000-0002-3298-3334), Muhterem Oğuzhan Yıldırım (ORCID 0000-0003-1288-0861), Cuma Umur Doğan (ORCID 0000-0003-1792-7294)
Publication Date	December 29, 2021
Submission Date	September 26, 2021
Published in Issue	Year 2021, Volume: 9, Issue: 4

Cite

APA	Atay, Y., Yıldırım, M. O., & Doğan, C. U. (2021). High Performance Classification of Cancer Types with Gene Microarray Datasets: Hybrid Approach. Gazi Üniversitesi Fen Bilimleri Dergisi Part C: Tasarım Ve Teknoloji, 9(4), 811-827. https://doi.org/10.29109/gujsc.1000926

e-ISSN:2147-9526