Author identification, a popular topic in text classification and natural language
processing, aims to determine the author of a given text through various
analyses. In the literature, different text representation approaches and
preprocessing steps have been considered for the author identification problem. This
paper comprehensively examines the impact of text representation and preprocessing
steps on author identification, specifically for the Turkish language. For this
purpose, the contributions of all possible combinations of different text
representation approaches, namely unigram and bigram, together with the
preprocessing tasks, including stemming and stop-word removal, to the
performance of author identification are investigated. For the experimental
evaluation, a new dataset was constructed. In addition, two
classification algorithms, namely Multinomial Naive Bayes and Sequential
Minimal Optimization, are employed. The results of the experimental analysis
reveal that using bigram features alone should be avoided. Furthermore, it is shown
that stop-words should be kept in the text, while stemming may be preferable
depending on the classification algorithm, so that higher performance can be
achieved for author identification.
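
The experimental grid described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes that "all possible combinations" means three representations (unigram, bigram, unigram plus bigram) crossed with stemming on or off and stop-word removal on or off. The stop-word set, the suffix list, and the names `toy_stem`, `extract_features`, and `all_configurations` are hypothetical placeholders, not taken from the paper.

```python
from collections import Counter
from itertools import product

# Hypothetical toy resources; NOT the paper's stop-word list or stemmer.
STOP_WORDS = {"ve", "bir", "bu"}
SUFFIXES = ("lar", "ler")

# Assumed representation options: which n-gram orders each one uses.
REPRESENTATIONS = {"unigram": (1,), "bigram": (2,), "unigram+bigram": (1, 2)}


def toy_stem(token):
    """Strip one illustrative plural suffix; a stand-in for a real Turkish stemmer."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            return token[: -len(suffix)]
    return token


def extract_features(text, ngrams, stem, remove_stop_words):
    """Return n-gram counts for one text under one preprocessing setting."""
    # str.lower() is not Turkish-locale aware (dotted/dotless i); fine for this toy input.
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [toy_stem(t) for t in tokens]
    features = Counter()
    for n in ngrams:
        for i in range(len(tokens) - n + 1):
            features[" ".join(tokens[i:i + n])] += 1
    return features


def all_configurations():
    """Enumerate every representation x stemming x stop-word-removal setting."""
    return [
        {"representation": name, "stem": stem, "remove_stop_words": rsw}
        for name, stem, rsw in product(REPRESENTATIONS, (False, True), (False, True))
    ]


if __name__ == "__main__":
    sample = "bu kitaplar ve kalemler"
    for cfg in all_configurations():
        feats = extract_features(
            sample,
            REPRESENTATIONS[cfg["representation"]],
            cfg["stem"],
            cfg["remove_stop_words"],
        )
        print(cfg, dict(feats))
```

In a full experiment, each resulting feature dictionary would be vectorized and passed to the two classifiers named in the abstract, with the scores compared across configurations.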
Subjects | Engineering |
---|---|
Journal Section | Articles |
Authors | |
Publication Date | March 31, 2017 |
Published in Issue | Year 2017 Volume: 18 Issue: 1 |