A new content-free approach to identification of document language: Angle Patterns

Tuba Noyan; Fatma Kuncan; Ramazan Tekin; Yılmaz Kaya

doi:10.17341/gazimmfd.844700

Research Article

Year 2022, Volume: 37 Issue: 3, 1277 - 1292, 28.02.2022

Cited By: 2

Acknowledgments

This study was carried out at the MaVi Laboratory of the Siirt University Faculty of Engineering. The authors thank the MaVi Laboratory staff for their support.

Abstract

In text mining, language identification (LI) is the task of detecting the natural language in which a document, or part of one, is written. This study proposes a new approach to language identification from text that uses the angle information between the UTF-8 values of characters. The proposed angle method is used to extract features from texts; the angle patterns method is a statistical approach. Four data sets, created in different ways, were used to test the approach. The extracted features were classified with several methods: Random Forest (RF), Support Vector Machine (SVM), Linear Discriminant Analysis (LDA), Naive Bayes (NB), and k-nearest neighbors (kNN). The LI performance results obtained on the four data sets were 96.81%, 99.39%, 93.31%, and 98.60%, respectively. These results indicate that the proposed angle patterns method provides important discriminative information for LI.
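
The abstract does not spell out how the angle between character values is computed, so the following Python sketch is only one possible reading and not the authors' implementation. It assumes each character is placed at the point (position, code point), measures the angle at the middle character of every three consecutive characters, and bins those angles into a normalized histogram that could serve as a feature vector for the classifiers listed above. The function name `angle_pattern_histogram`, the three-character window, the 18-bin histogram, and the use of `ord()` (Unicode code points rather than UTF-8 byte values) are all assumptions made for illustration.

```python
import math
from typing import List


def angle_pattern_histogram(text: str, bins: int = 18) -> List[float]:
    """Hypothetical angle feature (an assumption, not the paper's definition).

    Each character i is treated as the 2-D point (i, ord(char)). For every
    middle character of three consecutive characters, the angle between the
    vectors to its left and right neighbours is computed in degrees (0..180)
    and counted in a histogram, which is then normalized to sum to 1.
    """
    codes = [ord(ch) for ch in text]
    hist = [0] * bins

    for i in range(1, len(codes) - 1):
        # Vectors from the middle point to its left and right neighbours.
        left = (-1.0, float(codes[i - 1] - codes[i]))
        right = (1.0, float(codes[i + 1] - codes[i]))

        dot = left[0] * right[0] + left[1] * right[1]
        # The x-components are always +/-1, so the norm is never zero.
        norm = math.hypot(*left) * math.hypot(*right)

        # Clamp to guard against tiny floating-point overshoot.
        cosine = max(-1.0, min(1.0, dot / norm))
        angle = math.degrees(math.acos(cosine))

        # Map 0..180 degrees onto bin indices; 180 falls into the last bin.
        index = min(int(angle / 180.0 * bins), bins - 1)
        hist[index] += 1

    total = sum(hist)
    if total == 0:
        # Fewer than three characters: no angle can be formed.
        return [0.0] * bins
    return [count / total for count in hist]


if __name__ == "__main__":
    features = angle_pattern_histogram("Language identification from text")
    print(len(features), round(sum(features), 6))
```

Because the histogram depends only on character codes and not on a vocabulary, the same function can be applied to text in any script, which is in the spirit of a content-free feature. Under these assumptions, three identical characters lie on a straight line, so `angle_pattern_histogram("aaa")` puts all of its mass in the last (180-degree) bin.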

Keywords

Text-based language identification, Natural language processing, Angle patterns, Feature extraction

Details

Primary Language	Turkish
Subjects	Engineering
Section	Articles
Authors	Tuba Noyan 0000-0002-3359-2570; Fatma Kuncan 0000-0003-0712-6426; Ramazan Tekin 0000-0003-4325-6922; Yılmaz Kaya 0000-0001-5167-1101
Publication Date	February 28, 2022
Submission Date	December 21, 2020
Acceptance Date	September 25, 2021
Published in Issue	Year 2022 Volume: 37 Issue: 3

Cite

APA	Noyan, T., Kuncan, F., Tekin, R., Kaya, Y. (2022). Döküman dili tanıma için içerik bağımsız yeni bir yaklaşım: Açı Örüntüler. Gazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi, 37(3), 1277-1292. https://doi.org/10.17341/gazimmfd.844700
AMA	Noyan T, Kuncan F, Tekin R, Kaya Y. Döküman dili tanıma için içerik bağımsız yeni bir yaklaşım: Açı Örüntüler. GUMMFD. February 2022;37(3):1277-1292. doi:10.17341/gazimmfd.844700
Chicago	Noyan, Tuba, Fatma Kuncan, Ramazan Tekin, and Yılmaz Kaya. “Döküman dili tanıma için içerik bağımsız yeni bir yaklaşım: Açı Örüntüler”. Gazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi 37, no. 3 (February 2022): 1277-92. https://doi.org/10.17341/gazimmfd.844700.
EndNote	Noyan T, Kuncan F, Tekin R, Kaya Y (February 1, 2022) Döküman dili tanıma için içerik bağımsız yeni bir yaklaşım: Açı Örüntüler. Gazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi 37 3 1277–1292.
IEEE	T. Noyan, F. Kuncan, R. Tekin, and Y. Kaya, “Döküman dili tanıma için içerik bağımsız yeni bir yaklaşım: Açı Örüntüler”, GUMMFD, vol. 37, no. 3, pp. 1277–1292, 2022, doi: 10.17341/gazimmfd.844700.
ISNAD	Noyan, Tuba et al. “Döküman dili tanıma için içerik bağımsız yeni bir yaklaşım: Açı Örüntüler”. Gazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi 37/3 (February 2022), 1277-1292. https://doi.org/10.17341/gazimmfd.844700.
JAMA	Noyan T, Kuncan F, Tekin R, Kaya Y. Döküman dili tanıma için içerik bağımsız yeni bir yaklaşım: Açı Örüntüler. GUMMFD. 2022;37:1277–1292.
MLA	Noyan, Tuba et al. “Döküman dili tanıma için içerik bağımsız yeni bir yaklaşım: Açı Örüntüler”. Gazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi, vol. 37, no. 3, 2022, pp. 1277-92, doi:10.17341/gazimmfd.844700.
Vancouver	Noyan T, Kuncan F, Tekin R, Kaya Y. Döküman dili tanıma için içerik bağımsız yeni bir yaklaşım: Açı Örüntüler. GUMMFD. 2022;37(3):1277-92.

Cited By

Kodlayıcı kod çözücü ve dikkat algoritmaları kullanılarak karakter tabanlı kelime üretimi

Gazi Üniversitesi Mühendislik Mimarlık Fakültesi Dergisi

https://doi.org/10.17341/gazimmfd.1206277

A Hybrid Model Based on Deep Features and Ensemble Learning for the Diagnosis of COVID-19: DeepFeat-E

Turkish Journal of Science and Technology

https://doi.org/10.55525/tjst.1237103
