Gözetimli Makine Öğrenmesiyle Noktalama ve Etkisiz Kelime Sıklıkları Kullanarak Yazar Tanıma

Tevfik Uyar; Kübra Karacan Uyar; Emre Yağlı

doi:10.17671/gazibtd.623629

Research Article

Columnist Identification with Supervised Machine Learning using Punctuation and Stop Word Frequencies

Year 2021, Volume: 14 Issue: 2, 183 - 190, 30.04.2021

Cited By: 2

Abstract

This study argues that simple features, such as the frequencies of stop words and punctuation marks, are sufficient to identify the author of column-length texts. Six columnists who write regularly for the newspaper Cumhuriyet were selected, and 120 columns were collected from each. Nine features based on the frequencies of particular stop words and punctuation marks were extracted. Eight supervised machine learning algorithms were trained on this feature set, and the author identification performance of each was measured. The effects of scaling and dimension reduction on each algorithm were also examined. With these procedures, accuracy ranged from a minimum of 82% to a maximum of 92%. Neither scaling alone nor dimension reduction with principal component analysis (PCA) alone made a significant difference to accuracy, whereas scaling combined with linear discriminant analysis (LDA) significantly increased the validation scores of some algorithms: support vector machines (SVM, p<0.05), and Gaussian Naïve Bayes (GNB) and k-nearest neighbour (kNN), both at p<0.001. Moreover, a feature importance analysis of the random forest (RF) algorithm found the average number of words per sentence and the comma frequency to be the most important features for distinguishing the authors.
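To make this kind of feature concrete, here is a minimal Python sketch. It is not the authors' code. It computes the two features the abstract names (average words per sentence and comma frequency) plus the relative frequency of a few stop words. The stop word list, the function name `extract_features`, the per-word normalization, and the sentence-splitting rule are all assumptions made for illustration; the abstract does not specify them.

```python
import re

# Hypothetical stop word list, for illustration only; the abstract does not
# name the stop words the authors actually used.
STOP_WORDS = ("ve", "bir", "bu", "de", "da")


def extract_features(text):
    """Return a dict of simple stylometric features for one text."""
    # Split into sentences on '.', '!' or '?' and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of word characters; \w is Unicode-aware in Python 3.
    words = re.findall(r"\w+", text.lower())
    n_words = len(words)
    if n_words == 0 or not sentences:
        raise ValueError("text must contain at least one word")
    features = {
        "avg_words_per_sentence": n_words / len(sentences),
        # Normalizing by word count is an assumption of this sketch.
        "comma_per_word": text.count(",") / n_words,
    }
    for sw in STOP_WORDS:
        features["freq_" + sw] = words.count(sw) / n_words
    return features


sample = "Bu bir deneme, ve kısa bir yazı. Virgül de var, nokta da var!"
feats = extract_features(sample)
print(feats["avg_words_per_sentence"])
```

Each column would yield one such feature vector, and the author label would be the classification target. Features of this kind depend only on function words and punctuation, not on topic vocabulary.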

Keywords

artificial learning, author identification, classification algorithms, supervised learning

Gözetimli Makine Öğrenmesiyle Noktalama ve Etkisiz Kelime Sıklıkları Kullanarak Yazar Tanıma

Abstract

This study shows that simple features such as the usage frequencies of punctuation marks and stop words are sufficient for author identification in column-length texts. Six frequently publishing columnists of the newspaper Cumhuriyet were selected, and the last 120 columns of each, counting back from the date the study began, were collected. For each column, nine features based on the usage frequencies of a set of stop words and punctuation marks were extracted. After eight supervised machine learning algorithms were trained, their success in identifying the author of a column was measured separately on the unpreprocessed and the preprocessed datasets, and highly accurate results were obtained, with a minimum of 82% and a maximum of 92%. Scaling and principal component analysis (PCA) did not change performance significantly. However, using scaling together with linear discriminant analysis (LDA) as the dimension reduction method produced a highly significant difference (p<0.001) in the performance of the k-nearest neighbour (kNN) and Gaussian Naive Bayes (GNB) algorithms and a significant difference (p<0.05) in the performance of the support vector machine (SVM) algorithm. In addition, a feature importance analysis of the decision-tree-based random forest (RF) algorithm identified the average number of words per sentence and the frequency of comma use as the most discriminative features.
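Distance-based classifiers such as kNN compare features numerically, so features on very different ranges (words per sentence versus commas per word, say) can dominate one another. The sketch below illustrates the preprocessing idea with z-score scaling followed by a plain kNN vote, using only the Python standard library. The LDA step is omitted to keep the example short. The data are invented, the function names are hypothetical, and the outcome on this toy set says nothing about the study's own data, where scaling alone did not change accuracy significantly.

```python
import math
from collections import Counter


def zscore_fit(rows):
    """Return per-column (mean, std) computed from the training rows."""
    stats = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        var = sum((x - mean) ** 2 for x in col) / len(col)
        stats.append((mean, math.sqrt(var) or 1.0))  # guard constant columns
    return stats


def zscore_apply(row, stats):
    """Scale one row with previously fitted (mean, std) pairs."""
    return [(x - m) / s for x, (m, s) in zip(row, stats)]


def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k nearest training rows (Euclidean distance)."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Toy data, invented for illustration:
# [average words per sentence, commas per word]
X = [[12.0, 0.05], [12.5, 0.06], [11.5, 0.04],
     [12.2, 0.15], [11.8, 0.16], [12.4, 0.14]]
y = ["A", "A", "A", "B", "B", "B"]
query = [11.8, 0.15]

raw = knn_predict(X, y, query)
stats = zscore_fit(X)
scaled = knn_predict([zscore_apply(r, stats) for r in X], y,
                     zscore_apply(query, stats))
print(raw, scaled)
```

The scaling statistics are fitted on the training rows only and then applied to the query, which mirrors how preprocessing is normally kept separate from held-out data during validation.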

Keywords

supervised learning, classification algorithms, artificial learning, author identification

There are 40 citations in total.

Details

Primary Language	Turkish
Subjects	Computer Software
Journal Section	Articles
Authors	Tevfik Uyar 0000-0003-0124-6910; Kübra Karacan Uyar 0000-0002-2109-286X; Emre Yağlı 0000-0002-1044-9018
Publication Date	April 30, 2021
Submission Date	September 26, 2019
Published in Issue	Year 2021 Volume: 14 Issue: 2

Cite

APA	Uyar, T., Karacan Uyar, K., & Yağlı, E. (2021). Gözetimli Makine Öğrenmesiyle Noktalama ve Etkisiz Kelime Sıklıkları Kullanarak Yazar Tanıma. Bilişim Teknolojileri Dergisi, 14(2), 183-190. https://doi.org/10.17671/gazibtd.623629

Cited By

KÖŞE YAZILARININ NİCEL ÖZELLİKLERİ İLE OKUNABİLİRLİKLERİ ÜZERİNE BİR İNCELEME

Bingöl Üniversitesi Sosyal Bilimler Enstitüsü Dergisi

https://doi.org/10.29029/busbed.1251786

Text Authorship Identification Based On Ensemble Learning and Genetic Algorithm Combination in Turkish Text

Journal of Polytechnic

https://doi.org/10.2339/politeknik.992493
