Research Article

Columnist Identification with Supervised Machine Learning using Punctuation and Stop Word Frequencies

Year: 2021, Volume: 14, Issue: 2, Pages: 183-190, Published: 30.04.2021
https://doi.org/10.17671/gazibtd.623629

Abstract

This study argues that simple features such as the frequencies of stop words and punctuation marks are sufficient for identifying the author of column-length texts. Six columnists who write regularly for the newspaper Cumhuriyet were selected, and 120 columns were collected from each. Nine features based on the frequencies of particular stop words and punctuation marks were extracted. Eight supervised machine learning algorithms were trained on the extracted feature set, and the author identification performance of each was measured. The effects of dimension reduction and scaling on each algorithm were also examined. Accuracy ranged from a minimum of 82% to a maximum of 92%. Neither scaling nor dimension reduction with principal component analysis (PCA) alone made a significant difference in accuracy, whereas scaling combined with linear discriminant analysis (LDA) significantly increased the validation scores of some algorithms: support vector machines (p<0.05), Gaussian Naïve Bayes, and k-nearest neighbour (p<0.001). Moreover, an analysis of random forest feature importance showed that the average word count per sentence and the comma frequency were the most important features for identifying the authors.
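The kind of feature extraction described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not list the nine features, so the stop-word list, the punctuation list, the sentence-splitting rule, and the function name `extract_features` are all assumptions chosen only to yield nine values per text.

```python
import re

# Hypothetical lists for illustration only; the abstract does not say
# which stop words or punctuation marks the study actually used.
STOP_WORDS = ("ve", "bir", "bu", "de")
PUNCTUATION = (",", ";", ":", "!")


def extract_features(text):
    """Return nine simple stylometric features for one column."""
    # Assumed sentence rule: split on runs of '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text.lower())
    n_words = len(words) or 1  # avoid division by zero on empty input

    features = {
        "avg_words_per_sentence": len(words) / max(len(sentences), 1),
    }
    # Relative frequency of each stop word (count per word in the text).
    for word in STOP_WORDS:
        features[f"stop_{word}"] = words.count(word) / n_words
    # Relative frequency of each punctuation mark (count per word).
    for mark in PUNCTUATION:
        features[f"punct_{mark}"] = text.count(mark) / n_words
    return features


sample = "Bu bir deneme, ve bu kısa. Bir cümle daha!"
for name, value in extract_features(sample).items():
    print(f"{name}: {value:.3f}")
```

Normalising each count by the number of words keeps the features comparable across columns of different lengths, which is one plausible reading of "frequency" here.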

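Turkish title: Gözetimli Makine Öğrenmesiyle Noktalama ve Etkisiz Kelime Sıklıkları Kullanarak Yazar Tanıma (Author Identification Using Punctuation and Stop Word Frequencies with Supervised Machine Learning)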

Abstract

This study shows that simple features such as the usage frequencies of punctuation marks and stop words are sufficient for author identification in column-length texts. Six columnists who write frequently for the newspaper Cumhuriyet were selected, and the most recent 120 columns of each, counting back from the date the study began, were collected. For each column, nine features based on the usage frequencies of certain stop words and punctuation marks were extracted. After eight supervised machine learning algorithms were trained, their success in identifying the author of a column was measured separately on the raw and the preprocessed data sets, yielding high accuracy, with a minimum of 82% and a maximum of 92%. Scaling and principal component analysis (PCA) did not change performance significantly. However, combining scaling with linear discriminant analysis (LDA) as the dimension reduction method made a highly significant difference (p<0.001) in the performance of the k-nearest neighbour (kNN) and Gaussian Naive Bayes (GNB) algorithms, and a significant difference (p<0.05) in the performance of the support vector machine (SVM) algorithm. In addition, a feature importance analysis of the decision-tree-based random forest (RF) algorithm identified the average number of words per sentence and the comma frequency as the most discriminative features.
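To illustrate why preprocessing can matter for a distance-based classifier such as kNN, the sketch below standardises two features and classifies a query with a plain k-nearest-neighbour vote. It is an assumption-laden toy, not the study's pipeline: the LDA step is omitted, the helper names (`fit_scaler`, `scale`, `knn_predict`) are hypothetical, and the six training rows are made-up values constructed so that the effect is visible. It does not reproduce the reported results, in which scaling alone made no significant difference.

```python
import math
from collections import Counter

# Made-up toy data: (average words per sentence, comma frequency).
TRAIN_ROWS = [(10, 0.02), (12, 0.03), (14, 0.02),
              (20, 0.10), (22, 0.11), (24, 0.10)]
TRAIN_LABELS = ["A", "A", "A", "B", "B", "B"]


def fit_scaler(rows):
    """Column-wise mean and population standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return means, stds


def scale(row, means, stds):
    """Standardise one row to zero mean and unit variance per column."""
    return [(v - m) / s for v, m, s in zip(row, means, stds)]


def knn_predict(train_rows, train_labels, query, k=3):
    """Majority label among the k training rows closest to the query."""
    nearest = sorted(zip(train_rows, train_labels),
                     key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


query = (15, 0.10)  # short sentences, but a comma rate typical of "B"

# Unscaled: sentence length dominates the Euclidean distance.
print(knn_predict(TRAIN_ROWS, TRAIN_LABELS, query))

# Scaled: both features contribute on a comparable footing.
means, stds = fit_scaler(TRAIN_ROWS)
scaled_rows = [scale(r, means, stds) for r in TRAIN_ROWS]
print(knn_predict(scaled_rows, TRAIN_LABELS, scale(query, means, stds)))
```

On this toy data the unscaled vote follows the feature with the larger numeric range, while the scaled vote follows the comma frequency. The scaler is fitted on the training rows only and then applied to the query, which is the usual way to avoid leaking information from the data being classified.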

Details

Primary Language: Turkish
Subjects: Computer Software
Journal Section: Articles
Authors:

Tevfik Uyar (ORCID: 0000-0003-0124-6910)

Kübra Karacan Uyar (ORCID: 0000-0002-2109-286X)

Emre Yağlı (ORCID: 0000-0002-1044-9018)

Publication Date: April 30, 2021
Submission Date: September 26, 2019
Published in Issue: Year 2021, Volume 14, Issue 2

Cite

APA Uyar, T., Karacan Uyar, K., & Yağlı, E. (2021). Gözetimli Makine Öğrenmesiyle Noktalama ve Etkisiz Kelime Sıklıkları Kullanarak Yazar Tanıma. Bilişim Teknolojileri Dergisi, 14(2), 183-190. https://doi.org/10.17671/gazibtd.623629