Benchmark Effect of Web Search Engines on Text Mining

Ahmet Toprak; Metin Turan

Research Article

Year 2021, Volume: 4 Issue: 1, 84 - 92, 15.01.2021

Abstract

Many studies have addressed dictionary creation, using a range of methods and analyses. With the emergence of the World Wide Web, efforts to create dictionaries from up-to-date data have gained importance. Consequently, the performance of a web search engine directly affects any model that uses web documents for automatic dictionary creation. In this research, web search engines were evaluated in terms of how related their suggested documents are to the query. For this purpose, a model that automatically creates a dictionary from web documents was developed. First, the topic seed words are determined from the documents initially presented to the system, and the first search is executed with these seed words. The TF-IDF metric is then used to select meaningful words from the first returned document: the n words with the highest TF-IDF values are selected, where n was determined experimentally. These words are added to the dictionary and used to search the web again, and the search engine suggests new documents. By repeating this process, experimental dictionaries of a certain size were obtained. Because the documents suggested by each search engine are generally different, the similarity of the dictionary produced from the top suggested documents can measure a search engine's performance in selecting related documents. Hash similarity was used to evaluate dictionary performance. According to the results, the highest dictionary similarity was 73.9% for the Google search engine, 68.7% for the Bing search engine, and 60.5% for the Yandex search engine.
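The two measurable steps of the abstract, selecting the top n words by TF-IDF and comparing dictionaries by hash similarity, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the TF-IDF variant, the hash function, or the tokenization, so the smoothed IDF, the MD5-based 64-bit SimHash-style fingerprint, and the function names `tokenize`, `top_n_tfidf`, `simhash`, and `hash_similarity` are all assumptions made for illustration.

```python
import hashlib
import math
import re
from collections import Counter

BITS = 64  # fingerprint width; an assumed value, not taken from the paper


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_n_tfidf(target, background, n):
    """Return the n words of `target` with the highest TF-IDF scores.

    IDF is computed over `background` plus `target`. The +1 smoothing
    is an illustrative choice, not the variant used in the paper.
    """
    docs = [tokenize(d) for d in background + [target]]
    doc_freq = Counter(word for doc in docs for word in set(doc))
    target_tokens = docs[-1]
    tf = Counter(target_tokens)
    scores = {
        word: (count / len(target_tokens))
        * (math.log((1 + len(docs)) / (1 + doc_freq[word])) + 1)
        for word, count in tf.items()
    }
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


def simhash(words):
    """Compute a 64-bit SimHash-style fingerprint of a word collection."""
    weights = [0] * BITS
    for word in words:
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(BITS) if weights[i] > 0)


def hash_similarity(words_a, words_b):
    """Fraction of fingerprint bits on which two dictionaries agree."""
    distance = bin(simhash(words_a) ^ simhash(words_b)).count("1")
    return 1 - distance / BITS


if __name__ == "__main__":
    # Hypothetical documents standing in for search-engine results.
    returned = "search engine search ranking"
    others = ["cooking recipes engine", "engine repair manual"]
    new_words = top_n_tfidf(returned, others, 2)
    print(new_words)
    print(hash_similarity(new_words, ["search", "ranking", "query"]))
```

In this sketch the word "engine" occurs in every document, so its IDF is low and it is not selected, which is the behavior the TF-IDF step relies on. The similarity is 1.0 for identical word sets and falls as the fingerprints diverge.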

Keywords

Automatic Dictionary Creation, Hash Similarity, Natural Language Processing, Performance of Web, TF-IDF Metric

References

V.Z. Kepuska and P. Rojanasthie, “Speech corpus generation from DVDs of movies and tv series,” Journal of International Technology and Information Management, vol. 20(1), pp. 49-82, 2011.
R. Ellen, “Automatically constructing a dictionary for information extraction tasks,” Proceedings of the Eleventh National Conference on Artificial Intelligence, pp. 811-816, 1993.
S. Koeva, I. Stoyanova, M. Todorova and S. Leseva, “Semi-automatic compilation of the dictionary of Bulgarian multiword expressions,” Proceedings of GLOBALEX 2016, pp. 86-95, 2016. https://doi.org/10.5281/zenodo.1469527
K.E. Silverman, V. Anderson, J.R. Bellegarda, K.A. Lenzo and D. Naik, “Design and collection of corpus of polyphones and prosodic contexts for speech synthesis research and development,” Sixth European Conference on Speech Communication and Technology, pp. 5-9, 1999.
A. Toprak, “Creating English dictionary with natural language processing,” Published Master Thesis, Istanbul Commerce University Institute of Science, Istanbul, 2019.
C. Caldera, R. Berndt, E. Eggeling, M. Schröttner and D.W. Fellner, “PRIMA-towards an automatic review / paper matching score calculation,” The Sixth International Conference on Creative Content Technologies (CONTENT 2014), pp. 70-75, 2014.
A. Mishra and S. Vishwakarma, “Analysis of TF-IDF model and its variant for document retrieval,” International Conference on Computational Intelligence and Communication Networks (CICN), pp. 772-776, 2015. https://www.doi.org/10.1109/CICN.2015.157
J. Lavid, H.J. Arús, B. Clerck and V. Hoste, “Creation of a high-quality, register-diversified parallel (English-Spanish) corpus for linguistic and computational investigations,” 7th International Conference on Corpus Linguistics (CILC2015), vol. 198, pp. 249-256, 2015. https://doi.org/10.1016/j.sbspro.2015.07.443
S.H. Sarkar and K. Mumit, “Automatic bangla corpus creation,” PAN Localization Working Papers, vol. 3(1), pp. 22-26, 2010.
B. Megyesi, J. Nasman and A. Palmer, “The uppsala corpus of student writings: corpus creation, annotation, and analysis,” Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pp. 3192-3199, 2016.
F. Famili, W. Shen, R. Weber and E. Simoudis, “Data preprocessing and intelligent data analysis,” Intell. Data Anal, vol. 1(4), pp. 3-23, 1997. https://doi.org/10.1016/S1088-467X(98)00007-9
V. Agarwal, “Research on data preprocessing and categorization technique for smartphone review analysis,” International Journal of Computer Applications, vol. 131(4), pp. 30-36, 2015. https://www.doi.org/10.5120/ijca2015907309
C. Moral, A. Antonio, R. Imbert and J. Ramirez, “A survey of stemming algorithms in information retrieval,” Information Research: An International Electronic Journal, vol. 19(1), pp. 76-80, 2014.
R. Khoury, L. Shi and A. Hamou-Lhadj, “Key elements extraction and traces comprehension using Gestalt Theory and the Helmholtz Principle,” 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 478-482, 2016. https://www.doi.org/10.1109/ICSME.2016.24
B. Dadachev, A. Balinsky, H. Balinsky and S. Simske, “On the Helmholtz Principle for data mining,” Third International Conference on Emerging Security Technologies, pp. 99-102, 2012. https://www.doi.org/10.1109/EST.2012.11
S. Jabri, A. Dahbi, T. Gadi and A. Bassir, “Ranking of text documents using TF-IDF weighting and association rules mining,” 2018 4th International Conference on Optimization and Applications (ICOA), pp. 1-6, 2018. https://www.doi.org/10.1109/ICOA.2018.837057
A.G. Jivani, “A comparative study of stemming algorithms,” Int. J. Comp. Tech. Appl, vol. 2(6), pp. 1930-1938, 2011.
M.S. Charikar, “Similarity estimation techniques from rounding algorithms,” In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, pp. 380-388, 2002. https://www.doi.org/10.1145/509907.509965
Y. Li, F. Liu, Z. Du and D. Zhang, “A simhash-based integrative features extraction algorithm for malware detection,” Algorithms-Open Access Journal, vol. 11(8), pp. 1-13, 2018. https://doi.org/10.3390/a11080124
Y. Zhang, Z. Jin, W. Mu and W. Wang, “Research of distinct algorithm of short text based on simhash,” DEStech Transactions on Engineering and Technology Research, pp. 120-126, 2017. https://www.doi.org/10.12783/dtetr/oect2017/16127
Q. Jiang and M. Sun, “Semi-supervised simhash for efficient document similarity search,” The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, vol. 1, pp. 93-101, 2011.
B. Pi, S. Fu, W. Wang and S. Han, “SimHash-based effective and efficient detecting of near duplicate short messages,” Proceedings of the Second Symposium International Computer Science and Computational Technology (ISCSCT ’09), pp. 20-25, 2009.
M. Turan and S. Ogtelik, “İngilizce dokümanlarda tema ve alt kavramlar tespit modeli,” Düzce Üniversitesi Bilim ve Teknoloji Dergisi, vol. 6(4), pp. 754-764, 2018. https://doi.org/10.29130/dubited.420104

There are 23 citations in total.

Details

Primary Language	English
Subjects	Engineering
Journal Section	Articles
Authors	Ahmet Toprak Metin Turan
Publication Date	January 15, 2021
Published in Issue	Year 2021 Volume: 4 Issue: 1