Deep Feature Generation for Author Identification

Şükrü Ozan; Davut Emre Taşar; Umut Özdil

doi:10.18466/cbayarfbe.846016

Research Article

Year 2021, Volume: 17 Issue: 2, 137 - 143, 28.06.2021

Abstract

Identifying the author of a given text is a well-studied yet complicated task. It requires thorough knowledge of different authors’ writing styles and the ability to discriminate between them. As the main contribution of this paper, we propose to perform this task using machine learning and deep learning methods, which are the state-of-the-art approaches to numerous complex Natural Language Processing (NLP) problems. We used a text corpus of daily newspaper columns written by thirty authors in our experiments. The experimental results show that document embeddings trained via a neural network architecture achieve cutting-edge accuracy in learning writing styles and identifying the authors of given writings, even though the dataset has a considerably unbalanced distribution. We present our experimental results and share our code as a GitHub repository so that interested readers and NLP enthusiasts can reproduce and confirm the results and modify the code according to their own needs.
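The pipeline outlined in the abstract (turn each document into a vector, then train a classifier to predict the author) can be illustrated with a minimal, dependency-free sketch. Everything below is an assumption made for illustration: a normalised bag-of-words vector stands in for the learned document embeddings, the classifier is a plain multinomial logistic regression without an intercept, and the toy corpus, author labels, and function names are hypothetical. It is not the authors' implementation, which is available in their GitHub repository.

```python
import math


def build_vocab(docs):
    """Map every token seen in the training documents to a column index."""
    vocab = {}
    for doc in docs:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def embed(text, vocab):
    """Stand-in for a learned document embedding: L2-normalised word counts."""
    vec = [0.0] * len(vocab)
    for token in text.lower().split():
        if token in vocab:  # tokens unseen during training are ignored
            vec[vocab[token]] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(docs, labels, epochs=100, lr=0.5):
    """Fit multinomial logistic regression (no intercept) by gradient descent."""
    vocab = build_vocab(docs)
    authors = sorted(set(labels))
    vectors = [embed(doc, vocab) for doc in docs]
    weights = [[0.0] * len(vocab) for _ in authors]
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            probs = softmax([dot(w, x) for w in weights])
            for k, author in enumerate(authors):
                # Gradient of the cross-entropy loss w.r.t. the class score.
                grad = probs[k] - (1.0 if author == y else 0.0)
                for j, xj in enumerate(x):
                    weights[k][j] -= lr * grad * xj
    return vocab, authors, weights


def predict(model, text):
    """Return the author whose class score is highest for the given text."""
    vocab, authors, weights = model
    x = embed(text, vocab)
    scores = [dot(w, x) for w in weights]
    return authors[scores.index(max(scores))]


# Hypothetical toy corpus: two "authors" with different vocabularies.
docs = [
    "the economy grows while markets rally",
    "markets fall and the economy slows",
    "the team won the match last night",
    "a late goal decided the match",
]
labels = ["author_a", "author_a", "author_b", "author_b"]

model = train(docs, labels)
print(predict(model, "economy markets"))
print(predict(model, "match goal"))
```

In a real experiment the `embed` step would be replaced by embeddings learned from the corpus, and the classifier would be evaluated on held-out columns, with per-author metrics reported because the class distribution is unbalanced.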

Keywords

Natural Language Processing, Document Embeddings, Logistic Regression, Support Vector Machines, Author Identification

Supporting Institution

TÜBİTAK

Project Number

3190585

Acknowledgements

This work is part of the project “General Purpose Chatbot Application That Can Produce Meaningful Dialog via Machine Learning Algorithms”, supported by the Scientific and Technological Research Council of Turkey (TUBITAK) TEYDEB-1501 program under project no. 3190585.

References

Stamatatos, E., Fakotakis, N., Kokkinakis, G. 2000. Automatic text categorization in terms of genre and author. Comput. Linguist. 26(4): 471–495.
Sebastiani, F. 2002. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1): 1-47.
Zheng, Rong, et al. 2006. “A framework for authorship identification of online messages: Writing-style features and classification techniques.” Journal of the American Society for Information Science and Technology 57.3: 378-393.
Burrows, J.F. 1987. Word Patterns and Story Shapes: The Statistical Analysis of Narrative Style. Literary and Linguistic Computing 2: 61-70.
Diederich, J., J. Kindermann, E. Leopold, and G. Paass. 2003. Authorship Attribution with Support Vector Machines. Applied Intelligence 19(1/2): 109-123.
Luyckx, K., Daelemans, W. 2011. The effect of author set size and data size in authorship attribution. Literary Linguist. Comput. 26(1): 35–55.
Abbasi, Ahmed, and Hsinchun Chen. 2008. “Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace.” ACM Transactions on Information Systems (TOIS) 26.2: 1-29.
Holmes, D. 1998. The Evolution of Stylometry in Humanities Scholarship. Literary and Linguistic Computing, 13(3): 111-117.
Mikolov, Tomas, et al. 2013. “Efficient estimation of word representations in vector space.” arXiv preprint arXiv:1301.3781.
Mikolov, Tomas, et al. 2013. “Distributed representations of words and phrases and their compositionality.” Advances in Neural Information Processing Systems 26: 3111-3119.
Cortes, Corinna, and Vladimir Vapnik. 1995. “Support-vector networks.” Machine Learning 20.3: 273-297.
Li, J., Huang, G., Fan, C., Sun, Z., & Zhu, H. 2019. Keyword extraction for short text via word2vec, doc2vec, and textrank. Turkish Journal of Electrical Engineering & Computer Sciences, 27(3): 1794-1805.
Rehurek, R., Sojka, P. 2010. Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, ELRA, pp. 45–50.
Kim, Donghwa, et al. 2019. “Multi-co-training for document classification using various document representations: TF–IDF, LDA, and Doc2Vec.” Information Sciences 477: 15-29.
Peng, Chao-Ying Joanne, Kuk Lida Lee, and Gary M. Ingersoll. 2002. “An introduction to logistic regression analysis and reporting.” The Journal of Educational Research 96.1: 3-14.
Kwak, Chanyeong, and Alan Clayton-Matthews. 2002. “Multinomial logistic regression.” Nursing Research 51.6: 404-410.
Becht, Etienne, et al. 2019. “Dimensionality reduction for visualizing single-cell data using UMAP.” Nature Biotechnology 37.1: 38-44.
Zhang, Ye, Stephen Roller, and Byron Wallace. 2016. “MGNC-CNN: A simple approach to exploiting multiple word embeddings for sentence classification.” arXiv preprint arXiv:1603.00968.
Radford, Alec, Luke Metz, and Soumith Chintala. 2015. “Unsupervised representation learning with deep convolutional generative adversarial networks.” arXiv preprint arXiv:1511.06434.
“Deep Feature Generation for Author Identification.” https://github.com/adresgezgini/DFG4AI/

There are 20 references in total.

Details

Primary Language	English
Subjects	Engineering
Section	Articles
Authors	Şükrü Ozan (ORCID: 0000-0002-3227-348X), Davut Emre Taşar, Umut Özdil
Project Number	3190585
Publication Date	June 28, 2021
Published in Issue	Year 2021 Volume: 17 Issue: 2

Cite This Article

APA	Ozan, Ş., Taşar, D. E., & Özdil, U. (2021). Deep Feature Generation for Author Identification. Celal Bayar University Journal of Science, 17(2), 137-143. https://doi.org/10.18466/cbayarfbe.846016
AMA	Ozan Ş, Taşar DE, Özdil U. Deep Feature Generation for Author Identification. CBUJOS. June 2021;17(2):137-143. doi:10.18466/cbayarfbe.846016
Chicago	Ozan, Şükrü, Davut Emre Taşar, and Umut Özdil. “Deep Feature Generation for Author Identification”. Celal Bayar University Journal of Science 17, no. 2 (June 2021): 137-43. https://doi.org/10.18466/cbayarfbe.846016.
EndNote	Ozan Ş, Taşar DE, Özdil U (01 June 2021) Deep Feature Generation for Author Identification. Celal Bayar University Journal of Science 17 2 137–143.
IEEE	Ş. Ozan, D. E. Taşar, and U. Özdil, “Deep Feature Generation for Author Identification”, CBUJOS, vol. 17, no. 2, pp. 137–143, 2021, doi: 10.18466/cbayarfbe.846016.
ISNAD	Ozan, Şükrü et al. “Deep Feature Generation for Author Identification”. Celal Bayar University Journal of Science 17/2 (June 2021), 137-143. https://doi.org/10.18466/cbayarfbe.846016.
JAMA	Ozan Ş, Taşar DE, Özdil U. Deep Feature Generation for Author Identification. CBUJOS. 2021;17:137–143.
MLA	Ozan, Şükrü et al. “Deep Feature Generation for Author Identification”. Celal Bayar University Journal of Science, vol. 17, no. 2, 2021, pp. 137-43, doi:10.18466/cbayarfbe.846016.
Vancouver	Ozan Ş, Taşar DE, Özdil U. Deep Feature Generation for Author Identification. CBUJOS. 2021;17(2):137-43.
