CRAWLING OF JAPANESE REAL-ESTATE WEBSITES USING SCRAPY

Bassam Farooq; Dr. Mohd. Shahid Husain; Mohammad Suaib

doi:10.26483/ijarcs.v9i0.6139

Published: Jul 16, 2018

Keywords:

Japan, Framework, Python, Real-estate, Scrapy, Web Crawler

Abstract

A web crawler is a software program that downloads data from websites. This paper implements a crawler using Scrapy, a Python web crawling framework. The crawler mainly focuses on major real-estate websites of Japan. The motivation for using Scrapy was the crawling speed the framework provides, the data filters that can be applied, and the wide library support for the Python programming language.

Issue

2018: Volume 9 Special Issue No. 2, April 2018

Section

Articles

COPYRIGHT

Submission of a manuscript implies that the work described has not been published before; that it is not under consideration for publication elsewhere; and that, if and when the manuscript is accepted for publication, the authors agree to automatic transfer of the copyright to the publisher.

Authors who publish with this journal agree to the following terms:

Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.
Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this journal.
Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.
The journal allows the author(s) to retain publishing rights without restrictions.
The journal allows the author(s) to hold the copyright without restrictions.

