Research Article

Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti

Year 2021, Volume 33, Issue 2, 675 - 684, 15.09.2021
https://doi.org/10.35234/fumbd.929133

Abstract

Author profiling is the analysis of a set of texts to uncover various characteristics of an author from the style and content of the writing. These characteristics include age, gender, personality traits, and even profession. Gender identification is one of the subfields of author profiling. It is important in forensic matters, chiefly cybercrime and cases such as the spread of fake news, as well as in marketing (advertising) and in the study of sociological and psychological phenomena. Twitter posts are widely used for the gender identification task even though they may ignore language rules and contain abbreviated words and meaningless sentence structures. This study attempts to identify gender from Turkish Twitter posts. The problem is treated as a classification task. Machine learning methods (TF-IDF + SVM), deep learning methods (LSTM, CNN), and pre-trained language models for Turkish (BERT, DistilBERT, ELECTRA) were used. In the experiments, the BERT model with a vocabulary size of 128k achieved the highest performance (80.1%). The study can also serve as a detailed reference for other text classification tasks.

Gender Identification from Turkish Tweets Using Pre-Trained Language Models

Abstract

Author profiling is the analysis of a body of text to infer characteristics of its author from the style and content of the writing. These characteristics include age, gender, personality traits, and even profession. Gender identification is a subfield of author profiling. It matters in forensic contexts, chiefly cybercrime and cases such as the spread of fake news, as well as in marketing (advertising) and in sociological and psychological research. Twitter posts are widely used for gender identification even though they may break grammatical rules and contain abbreviated words and meaningless sentence structures. This study attempts to identify gender from Turkish Twitter posts and treats the problem as a classification task. Machine learning methods (TF-IDF + SVM), deep learning methods (LSTM, CNN), and pre-trained language models for Turkish (BERT, DistilBERT, ELECTRA) were used. In the experiments, the BERT model with a vocabulary size of 128k achieved the highest performance (80.1%). The study can also serve as a detailed reference for other text classification tasks.
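
To make the classical baseline named in the abstract (TF-IDF + SVM) concrete, the following is a minimal pure-Python sketch of that kind of pipeline. It is not the authors' implementation. The whitespace tokenizer, the smoothed IDF formula, the hyperparameters (epochs, lr, lam), the omission of a bias term, and the synthetic placeholder texts and +1/-1 labels are all assumptions made for illustration only.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an illustrative simplification)."""
    return text.lower().split()


def fit_tfidf(docs):
    """Learn the vocabulary and smoothed inverse document frequencies."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    vocab = {term: i for i, term in enumerate(sorted(doc_freq))}
    n_docs = len(docs)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return vocab, idf


def transform(doc, vocab, idf):
    """Map one document to an L2-normalised sparse TF-IDF vector (index -> weight)."""
    counts = Counter(t for t in tokenize(doc) if t in vocab)
    vec = {vocab[t]: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {i: v / norm for i, v in vec.items()} if norm else {}


def train_linear_svm(vectors, labels, dim, epochs=20, lr=0.1, lam=0.01):
    """Sub-gradient descent on the L2-regularised hinge loss; labels are +1 / -1."""
    w = [0.0] * dim  # no bias term in this sketch
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            margin = y * sum(w[i] * v for i, v in x.items())
            w = [wi * (1.0 - lr * lam) for wi in w]  # regularisation shrinkage
            if margin < 1.0:  # sample lies inside the margin: move towards it
                for i, v in x.items():
                    w[i] += lr * y * v
    return w


def predict(w, x):
    """Return +1 or -1 from the sign of the linear score."""
    score = sum(w[i] * v for i, v in x.items())
    return 1 if score >= 0 else -1


# Synthetic placeholder "tweets"; +1 / -1 stand in for the two class labels.
train_texts = [
    "alpha beta gamma",
    "beta gamma delta",
    "omega sigma theta",
    "sigma theta kappa",
]
train_labels = [1, 1, -1, -1]

vocab, idf = fit_tfidf(train_texts)
X = [transform(t, vocab, idf) for t in train_texts]
w = train_linear_svm(X, train_labels, len(vocab))
predictions = [predict(w, x) for x in X]
new_prediction = predict(w, transform("gamma alpha", vocab, idf))
```

On real tweets the tokenizer would need to handle hashtags, mentions, and Turkish-specific casing, and performance should be measured on held-out data rather than on the training texts as in this toy run.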

Details

Primary Language Turkish
Subjects Engineering
Journal Section MBD
Authors

İlhami Sel 0000-0003-0222-7017

Davut Hanbay 0000-0003-2271-7865

Publication Date September 15, 2021
Submission Date April 28, 2021
Published in Issue Year 2021

Cite

APA Sel, İ., & Hanbay, D. (2021). Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti. Fırat Üniversitesi Mühendislik Bilimleri Dergisi, 33(2), 675-684. https://doi.org/10.35234/fumbd.929133
AMA Sel İ, Hanbay D. Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti. Fırat Üniversitesi Mühendislik Bilimleri Dergisi. September 2021;33(2):675-684. doi:10.35234/fumbd.929133
Chicago Sel, İlhami, and Davut Hanbay. “Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti”. Fırat Üniversitesi Mühendislik Bilimleri Dergisi 33, no. 2 (September 2021): 675-84. https://doi.org/10.35234/fumbd.929133.
EndNote Sel İ, Hanbay D (September 1, 2021) Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti. Fırat Üniversitesi Mühendislik Bilimleri Dergisi 33 2 675–684.
IEEE İ. Sel and D. Hanbay, “Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti”, Fırat Üniversitesi Mühendislik Bilimleri Dergisi, vol. 33, no. 2, pp. 675–684, 2021, doi: 10.35234/fumbd.929133.
ISNAD Sel, İlhami - Hanbay, Davut. “Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti”. Fırat Üniversitesi Mühendislik Bilimleri Dergisi 33/2 (September 2021), 675-684. https://doi.org/10.35234/fumbd.929133.
JAMA Sel İ, Hanbay D. Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti. Fırat Üniversitesi Mühendislik Bilimleri Dergisi. 2021;33:675–684.
MLA Sel, İlhami and Davut Hanbay. “Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti”. Fırat Üniversitesi Mühendislik Bilimleri Dergisi, vol. 33, no. 2, 2021, pp. 675-84, doi:10.35234/fumbd.929133.
Vancouver Sel İ, Hanbay D. Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti. Fırat Üniversitesi Mühendislik Bilimleri Dergisi. 2021;33(2):675-84.