Research Article

Web Veri Çıkarımda Çıkarım Kurallarının İncelenmesi

Year 2018, Volume: 1 Issue: 2, 72 - 77, 30.12.2018

Abstract

Extracting the required data from a web page is important for applications in data mining and information retrieval. DOM-based methods or regular expressions can be used to extract data from a web page. For this extraction task, more than one extraction rule can be prepared for both DOM-based methods and regular expressions. This study examines how effective it is to obtain multiple data items with extraction rules. Fifteen websites in the news, film, and shopping domains were selected as the data set. Extraction rule files were created for these websites using different extraction techniques. The focus was on repeating data in the websites, such as comments. The experiments showed that regular expressions, which are more laborious and time-consuming to create, give much better results than DOM-based methods. Among the DOM-based methods, the lxml parser library gave the best results, as expected. The experiments also show that the extraction rules prepared by a developer affect the extraction time. In conclusion, with well-prepared regular expressions it is possible to reach the desired data in web pages much faster.


Examination of Extraction Rules in Web Data Extraction


Abstract

Extracting the desired data from a web page is an important task for applications in data mining and information retrieval. DOM-based methods or regular expressions can be used to extract data from a web page. For either approach, multiple extraction rules can be prepared. This study investigates how effectively extraction rules retrieve more than one data item. The data set consists of fifteen websites in the fields of news, film, and shopping. Extraction rule files were created for these websites using different extraction techniques. The study focuses mainly on repetitive data in these websites, such as reviews. Experiments show that regular expressions, although more laborious and time-consuming to create, give better results than DOM-based methods. Among the DOM-based methods, the lxml parser library provided the best results, as expected. The experiments also indicate that the extraction rules prepared by a developer affect the extraction time. As a result, well-prepared regular expressions make it possible to extract the desired data from web pages much faster.
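The two families of extraction rules contrasted in the abstract can be illustrated with a minimal sketch. It is not the rule files used in the study. The page fragment, the class names (`review`, `user`), and the function names are hypothetical. The standard-library `xml.etree.ElementTree` stands in for lxml, whose `lxml.etree` module follows a similar API, so that the example runs without extra dependencies.

```python
import re
import timeit
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment with repeated review blocks.
# Real pages are rarely this tidy and would need a tolerant HTML parser.
PAGE = (
    "<html><body>"
    "<div class='review'><span class='user'>ayse</span><p>Great film.</p></div>"
    "<div class='review'><span class='user'>mehmet</span><p>Too long.</p></div>"
    "<div class='ad'><p>Buy now</p></div>"
    "</body></html>"
)

# Regex-based rule: one pattern captures both fields of every review block.
REVIEW_RE = re.compile(
    r"<div class='review'><span class='user'>(.*?)</span><p>(.*?)</p></div>"
)


def extract_with_regex(page):
    """Return (user, text) pairs by scanning the raw markup once."""
    return REVIEW_RE.findall(page)


def extract_with_dom(page):
    """Return (user, text) pairs by building a tree and querying it."""
    root = ET.fromstring(page)
    reviews = []
    for div in root.findall(".//div[@class='review']"):
        user = div.find("span[@class='user']").text
        text = div.find("p").text
        reviews.append((user, text))
    return reviews


regex_reviews = extract_with_regex(PAGE)
dom_reviews = extract_with_dom(PAGE)

# Rough timing of both rules on this toy fragment (illustrative only).
regex_time = timeit.timeit(lambda: extract_with_regex(PAGE), number=1000)
dom_time = timeit.timeit(lambda: extract_with_dom(PAGE), number=1000)
print(f"regex: {regex_time:.4f}s  dom: {dom_time:.4f}s")
```

Both functions return the same pairs here. The regex rule is tied to the exact markup and breaks if attribute order or quoting changes, which is one reason such rules take more effort to prepare. Timings on a fragment this small say nothing about the fifteen websites in the study.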

There is 1 citation in total.

Details

Primary Language Turkish
Subjects Engineering
Journal Section Research Articles
Authors

Erdinç Uzun 0000-0003-4351-2244

Tarık Yerlikaya

Oğuz Kırat

Publication Date December 30, 2018
Submission Date November 21, 2018
Published in Issue Year 2018 Volume: 1 Issue: 2