THE IMPACT OF TEXT REPRESENTATION AND PREPROCESSING ON AUTHOR IDENTIFICATION

Muhammet Yasin Pak; Serkan Gunal

doi:10.18038/aubtda.270276

Research Article

Year 2017, Volume: 18 Issue: 1, 218 - 224, 31.03.2017

Abstract

Author identification, a popular topic in text classification and natural language processing, aims to determine the author of a given text through various analyses. The literature considers different text representation approaches and preprocessing steps for the author identification problem. This paper comprehensively examines the impact of text representation and preprocessing steps on author identification, specifically for the Turkish language. To this end, it investigates how all possible combinations of text representation approaches (unigram and bigram) and preprocessing tasks (stemming and stop-word removal) contribute to author identification performance. A new dataset was constructed for the experimental evaluation, and two classification algorithms, Multinomial Naive Bayes and Sequential Minimal Optimization, were employed. The experimental results reveal that using bigram features alone should be avoided. They also show that stop-words should be kept in the text, while stemming may be preferred depending on the classification algorithm, so that higher author identification performance can be achieved.
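
The experimental grid described above can be illustrated with a minimal sketch. It is not the authors' implementation: the stop-word set, the `toy_stem` function, the `extract_features` helper, and the inclusion of a combined "unigram+bigram" representation are all assumptions made for illustration (the abstract's remark about bigrams "alone" suggests a combined setting, but does not state it). No classifier is trained here; the sketch only shows how each configuration changes the features extracted from a text.

```python
from itertools import product

# Hypothetical stand-ins: the paper's actual stop-word list and stemmer
# are not given in this abstract.
STOP_WORDS = {"ve", "bir", "bu"}  # tiny illustrative Turkish stop-word set


def toy_stem(token):
    """Crude placeholder: strips a plural suffix (not a real Turkish stemmer)."""
    for suffix in ("ler", "lar"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def extract_features(text, representation, stem, remove_stop_words):
    """Return the list of features for one text under one configuration."""
    tokens = text.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [toy_stem(t) for t in tokens]
    unigrams = tokens
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    if representation == "unigram":
        return unigrams
    if representation == "bigram":
        return bigrams
    return unigrams + bigrams  # assumed combined "unigram+bigram" setting


# Every combination of representation x stemming x stop-word removal.
REPRESENTATIONS = ("unigram", "bigram", "unigram+bigram")
CONFIGS = list(product(REPRESENTATIONS, (False, True), (False, True)))

if __name__ == "__main__":
    sample = "kitaplar ve kalemler"
    for representation, stem, remove_stop_words in CONFIGS:
        features = extract_features(sample, representation, stem, remove_stop_words)
        print(representation, stem, remove_stop_words, features)
```

In a full experiment, each configuration's features would be turned into document vectors and passed to the two classifiers named in the abstract; that step is omitted because the abstract gives no details about it.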

Keywords

Author identification, text classification, text preprocessing, text representation

References

Aslantürk O. Turkish authorship analysis with an incremental and adaptive model. MSc Dissertation, Hacettepe University, Ankara, Turkey, 2014.
Diri B, Amasyalı MF. Automatic author detection for Turkish texts. Artificial Neural Networks and Neural Information Processing 2003; 138-141.
Amasyalı MF, Diri B. Automatic Turkish text categorization in terms of author, genre and gender. In: NLDB 11th International Conference on Applications of Natural Language to Information Systems; 2006; Klagenfurt, Austria. pp. 221-226.
Amasyalı MF, Diri B, Türkoğlu F. Farklı özellik vektörleri ile Türkçe dokümanların yazarlarının belirlenmesi. In: The 15th Turkish Symposium on Artificial Intelligence and Neural Networks (TAINN); 21-24 June 2006; Muğla, Turkey.
Türkoğlu F, Diri B, Amasyalı MF. Author attribution of Turkish texts by feature mining. In: The 3rd International Conference on Intelligent Computing: Advanced Intelligent Computing Theories and Applications with Aspects of Artificial Intelligence; 2007; Qingdao, China. pp. 1086–1093.
Kaban Z, Diri B. Genre and author detection in Turkish texts using artificial immune recognition systems. In: IEEE 16th Signal Processing, Communication and Applications Conference; April 2008. pp. 1-4.
Orucu F. Turkish language characteristics and author identification. MSc Dissertation, Dokuz Eylül University, İzmir, 2009.
Bay Y, Çelebi E. Feature selection for enhanced author identification of Turkish text. In: The 30th International Symposium on Computer and Information Sciences; 2015. pp. 371-379.
Stamatatos E. A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology 2009; 60(3): 538-556.
Joachims T. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Carnegie-Mellon Univ. Pittsburgh PA Dept. of Computer Science 1996.
Gunal S. Hybrid feature selection for text classification. Turkish Journal of Electrical Engineering & Computer Sciences 2012; 20(sup.2): 1296-1311.
Uysal AK, Gunal S, Ergin S, Sora Gunal E. The impact of feature extraction and selection on SMS spam filtering. Elektronika ir Elektrotechnika 2013; 19(5): 67-72.
Pak MY, Gunal S. Sentiment classification based on domain prediction. Elektronika ir Elektrotechnika 2016; 22(2): 96-99.
Manning CD, Raghavan P, Schütze H. Introduction to Information Retrieval. New York, USA: Cambridge University Press, 2008.
Uysal AK, Gunal S. The impact of preprocessing on text classification. Information Processing & Management 2014; 50(1): 104-112.
Can F, Kocberber S, Balcik E, Kaynak C, Ocalan HC, Vursavas OM. Information retrieval on Turkish texts. Journal of the American Society for Information Science and Technology 2008; 59: 407–421.
Zemberek. <http://code.google.com/p/zemberek/> (Accessed October 2016).
Gunal S, Edizkan R. Subspace based feature selection for pattern recognition. Information Sciences 2008; 178(19): 3716-3726.
McCallum A, Nigam K. A comparison of event models for naïve Bayes text classification. AAAI-98 Workshop on Learning for Text Categorization 1998; 752: 41-48.
Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter 2009; 11(1): 10-18.
Platt JC. Fast training of support vector machines using sequential minimal optimization. Advances in Kernel Methods 1999; 185-208.

There are 21 citations in total.

Details

Subjects	Engineering
Journal Section	Articles
Authors	Muhammet Yasin Pak, Serkan Gunal
Publication Date	March 31, 2017
Published in Issue	Year 2017 Volume: 18 Issue: 1

Cite

APA	Pak, M. Y., & Gunal, S. (2017). THE IMPACT OF TEXT REPRESENTATION AND PREPROCESSING ON AUTHOR IDENTIFICATION. Anadolu University Journal of Science and Technology A - Applied Sciences and Engineering, 18(1), 218-224. https://doi.org/10.18038/aubtda.270276
AMA	Pak MY, Gunal S. THE IMPACT OF TEXT REPRESENTATION AND PREPROCESSING ON AUTHOR IDENTIFICATION. AUJST-A. March 2017;18(1):218-224. doi:10.18038/aubtda.270276
Chicago	Pak, Muhammet Yasin, and Serkan Gunal. “THE IMPACT OF TEXT REPRESENTATION AND PREPROCESSING ON AUTHOR IDENTIFICATION”. Anadolu University Journal of Science and Technology A - Applied Sciences and Engineering 18, no. 1 (March 2017): 218-24. https://doi.org/10.18038/aubtda.270276.
EndNote	Pak MY, Gunal S (March 1, 2017) THE IMPACT OF TEXT REPRESENTATION AND PREPROCESSING ON AUTHOR IDENTIFICATION. Anadolu University Journal of Science and Technology A - Applied Sciences and Engineering 18 1 218–224.
IEEE	M. Y. Pak and S. Gunal, “THE IMPACT OF TEXT REPRESENTATION AND PREPROCESSING ON AUTHOR IDENTIFICATION”, AUJST-A, vol. 18, no. 1, pp. 218–224, 2017, doi: 10.18038/aubtda.270276.
ISNAD	Pak, Muhammet Yasin - Gunal, Serkan. “THE IMPACT OF TEXT REPRESENTATION AND PREPROCESSING ON AUTHOR IDENTIFICATION”. Anadolu University Journal of Science and Technology A - Applied Sciences and Engineering 18/1 (March 2017), 218-224. https://doi.org/10.18038/aubtda.270276.
JAMA	Pak MY, Gunal S. THE IMPACT OF TEXT REPRESENTATION AND PREPROCESSING ON AUTHOR IDENTIFICATION. AUJST-A. 2017;18:218–224.
MLA	Pak, Muhammet Yasin and Serkan Gunal. “THE IMPACT OF TEXT REPRESENTATION AND PREPROCESSING ON AUTHOR IDENTIFICATION”. Anadolu University Journal of Science and Technology A - Applied Sciences and Engineering, vol. 18, no. 1, 2017, pp. 218-24, doi:10.18038/aubtda.270276.
Vancouver	Pak MY, Gunal S. THE IMPACT OF TEXT REPRESENTATION AND PREPROCESSING ON AUTHOR IDENTIFICATION. AUJST-A. 2017;18(1):218-24.
