CRAWLING OF JAPANESE REAL-ESTATE WEBSITES USING SCRAPY
Abstract
A web crawler is a program that downloads data from websites. This paper implements a web crawler using Scrapy, a Python web crawling framework. The crawler mainly targets major real-estate websites in Japan. Scrapy was chosen for its crawling speed, for the data filters that can be applied to the crawled data, and for the wide library support of the Python programming language.
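As an illustration of the data filters mentioned above, the following is a minimal sketch and not the paper's implementation. It is written in plain Python so that it runs without Scrapy installed, but the class follows the shape of a Scrapy item pipeline, whose `process_item(item, spider)` method either returns an item or drops it. The names `RentRangePipeline`, `DropListing`, `parse_yen`, and `filter_listings`, the `rent` field, the rent bounds, and the assumption that rents appear as strings such as `8.5万円` or `85,000円` are all hypothetical.

```python
import re


class DropListing(Exception):
    """Stand-in for scrapy.exceptions.DropItem so the sketch runs without Scrapy."""


def parse_yen(text):
    """Convert an assumed rent string ('8.5万円' or '85,000円') to integer yen.

    Returns None when the text does not match either assumed format.
    """
    text = text.replace(",", "").strip()
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(万)?円", text)
    if match is None:
        return None
    value = float(match.group(1))
    if match.group(2):  # '万' means units of 10,000 yen
        value *= 10_000
    return round(value)


class RentRangePipeline:
    """Hypothetical filter shaped like a Scrapy item pipeline."""

    def __init__(self, min_rent=30_000, max_rent=150_000):
        # Bounds are illustrative assumptions, not values from the paper.
        self.min_rent = min_rent
        self.max_rent = max_rent

    def process_item(self, item, spider=None):
        rent = parse_yen(item.get("rent", ""))
        if rent is None:
            raise DropListing(f"unparseable rent: {item.get('rent')!r}")
        if not (self.min_rent <= rent <= self.max_rent):
            raise DropListing(f"rent {rent} outside range")
        # Keep the original fields and add the normalized rent.
        return {**item, "rent_yen": rent}


def filter_listings(items, pipeline):
    """Run each item through the pipeline and keep those that are not dropped."""
    kept = []
    for item in items:
        try:
            kept.append(pipeline.process_item(item))
        except DropListing:
            continue
    return kept


if __name__ == "__main__":
    sample = [
        {"name": "A", "rent": "8.5万円"},
        {"name": "B", "rent": "20万円"},
        {"name": "C", "rent": "要相談"},
    ]
    for listing in filter_listings(sample, RentRangePipeline()):
        print(listing)
```

In an actual Scrapy project, a class like this would be registered under the `ITEM_PIPELINES` setting and would raise `scrapy.exceptions.DropItem` instead of the local stand-in exception.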