Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti

İlhami Sel; Davut Hanbay

doi:10.35234/fumbd.929133

Research Article

Year 2021, Volume 33, Issue 2, 675 - 684, 15.09.2021

Cited By: 3

Gender Identification from Turkish Tweets Using Pre-Trained Language Models

Abstract

Author profiling is the analysis of a body of text to infer characteristics of its author from the style and content of the writing. These characteristics include age, gender, personality traits, and even profession. Gender identification is one of the subfields of author profiling. It is important in forensic contexts, chiefly cybercrime and cases such as the spreading of fake news, as well as in marketing (advertising) and in sociological and psychological research. Twitter posts are widely used for gender identification even though they may contain ungrammatical constructions, abbreviated words, and meaningless sentences. This study attempts to identify gender from Turkish Twitter posts, treating the problem as a classification task. Machine learning methods (TF-IDF + SVM), deep learning methods (LSTM, CNN), and pre-trained language models for Turkish (BERT, DistilBERT, ELECTRA) were used. In the experiments, the BERT model with a vocabulary size of 128k achieved the highest performance (80.1%). The study also serves as a detailed reference for other text classification tasks.
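To make the TF-IDF + SVM baseline named in the abstract concrete, the following is a minimal, dependency-free sketch of that kind of pipeline: TF-IDF weighting followed by a linear SVM trained by subgradient descent on the hinge loss. It is an illustration under stated assumptions, not the authors' implementation. The tokenizer, the smoothed IDF formula, the hyperparameters (`lam`, `lr`, `epochs`), the placeholder texts, and the +1/-1 label encoding are all hypothetical, since the abstract does not specify them.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace tokenization; the paper's preprocessing is not given here.
    return text.lower().split()


def fit_tfidf(docs):
    """Learn a vocabulary and smoothed IDF weights from tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vocab = {term: i for i, term in enumerate(sorted(df))}
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1 (one common variant, chosen for illustration).
    idf = {term: math.log((1 + n) / (1 + df[term])) + 1.0 for term in vocab}
    return vocab, idf


def transform(doc, vocab, idf):
    """Map one tokenized document to an L2-normalized TF-IDF vector."""
    vec = [0.0] * len(vocab)
    for term, count in Counter(doc).items():
        if term in vocab:  # terms unseen during fitting are ignored
            vec[vocab[term]] = count * idf[term]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Full-batch subgradient descent on the L2-regularized hinge loss; y in {-1, +1}."""
    n = len(X)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # only margin violators contribute to the hinge subgradient
                for j, xj in enumerate(xi):
                    grad_w[j] -= yi * xj / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(x, w, b):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Hypothetical placeholder "tweets" and labels, for illustration only.
train_texts = [
    "alpha beta gamma",
    "alpha gamma delta",
    "omega sigma tau",
    "sigma tau upsilon",
]
y = [1, 1, -1, -1]

docs = [tokenize(t) for t in train_texts]
vocab, idf = fit_tfidf(docs)
X = [transform(d, vocab, idf) for d in docs]
w, b = train_linear_svm(X, y)

print(predict(transform(tokenize("alpha beta"), vocab, idf), w, b))
print(predict(transform(tokenize("tau omega"), vocab, idf), w, b))
```

In practice such a baseline would more likely be built with a library (for example scikit-learn's `TfidfVectorizer` and `LinearSVC`); the hand-written version above is only meant to expose the two steps.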

Keywords

Author profiling, gender identification, natural language processing, language models, text classification

Details

Primary Language	Turkish
Subjects	Engineering
Journal Section	MBD
Authors	İlhami Sel (ORCID 0000-0003-0222-7017); Davut Hanbay (ORCID 0000-0003-2271-7865)
Publication Date	September 15, 2021
Submission Date	April 28, 2021
Published in Issue	Year 2021, Volume 33, Issue 2

Cite

APA	Sel, İ., & Hanbay, D. (2021). Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti. Fırat Üniversitesi Mühendislik Bilimleri Dergisi, 33(2), 675-684. https://doi.org/10.35234/fumbd.929133
AMA	Sel İ, Hanbay D. Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti. Fırat Üniversitesi Mühendislik Bilimleri Dergisi. September 2021;33(2):675-684. doi:10.35234/fumbd.929133
Chicago	Sel, İlhami, and Davut Hanbay. “Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti”. Fırat Üniversitesi Mühendislik Bilimleri Dergisi 33, no. 2 (September 2021): 675-84. https://doi.org/10.35234/fumbd.929133.
EndNote	Sel İ, Hanbay D (September 1, 2021) Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti. Fırat Üniversitesi Mühendislik Bilimleri Dergisi 33 2 675–684.
IEEE	İ. Sel and D. Hanbay, “Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti”, Fırat Üniversitesi Mühendislik Bilimleri Dergisi, vol. 33, no. 2, pp. 675–684, 2021, doi: 10.35234/fumbd.929133.
ISNAD	Sel, İlhami - Hanbay, Davut. “Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti”. Fırat Üniversitesi Mühendislik Bilimleri Dergisi 33/2 (September 2021), 675-684. https://doi.org/10.35234/fumbd.929133.
JAMA	Sel İ, Hanbay D. Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti. Fırat Üniversitesi Mühendislik Bilimleri Dergisi. 2021;33:675–684.
MLA	Sel, İlhami and Davut Hanbay. “Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti”. Fırat Üniversitesi Mühendislik Bilimleri Dergisi, vol. 33, no. 2, 2021, pp. 675-84, doi:10.35234/fumbd.929133.
Vancouver	Sel İ, Hanbay D. Ön Eğitimli Dil Modelleri Kullanarak Türkçe Tweetlerden Cinsiyet Tespiti. Fırat Üniversitesi Mühendislik Bilimleri Dergisi. 2021;33(2):675-84.

Cited By

ÖN EĞİTİMLİ DİL MODELLERİYLE DUYGU ANALİZİ

İstanbul Sabahattin Zaim Üniversitesi Fen Bilimleri Enstitüsü Dergisi

https://doi.org/10.47769/izufbed.1312032

Türkçe Sosyal Medya Mesajlarından Kullanıcıların Yaş ve Cinsiyetini Tahmin Etme

Ömer Halisdemir Üniversitesi Mühendislik Bilimleri Dergisi

https://doi.org/10.28948/ngumuh.1191719

Naive Bayes Sınıflandırıcısı Kullanılarak YouTube Verileri Üzerinden Çok Dilli Duygu Analizi

Bilişim Teknolojileri Dergisi

https://doi.org/10.17671/gazibtd.999960