An Optimal Solution for the Small File Problem in Hadoop

Mansoor Ahmad Mir

Abstract

Hadoop is an open-source Apache project and software framework for the distributed processing of large datasets across clusters of commodity hardware, used mainly for Big Data processing. Its two main components are HDFS (the Hadoop Distributed File System) and MapReduce (a programming model). Hadoop does not perform well when processing a large number of small files (on the order of hundreds of KBs or a few MBs): such files place a heavy burden on the HDFS NameNode and increase MapReduce execution time. Because Hadoop is designed to handle large files, it pays a penalty when handling small ones. This work provides an introduction to Hadoop and Big Data, reviews the existing work, and proposes a more efficient technique for handling the small file problem in Hadoop, based on file merging, hashing, and caching. The proposed technique saves memory at the NameNode, reduces average memory usage at the DataNodes, and improves access efficiency compared to existing techniques.
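
The merging component can be illustrated with Hadoop's standard SequenceFile API. The sketch below is a minimal example under assumptions, not the paper's implementation: the class name SmallFileMerger and its command-line arguments are hypothetical. It packs a directory of small files into a single SequenceFile keyed by original file name, so the NameNode holds metadata for one large file instead of one entry per small file.

import java.io.ByteArrayOutputStream;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Hypothetical merger: packs every file in an input directory into one
// SequenceFile (key = original file name, value = raw file bytes).
public class SmallFileMerger {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path(args[0]);   // directory of small files
        Path mergedFile = new Path(args[1]); // output SequenceFile

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(mergedFile),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isDirectory()) continue;
                // writer.getLength() here would give the record's byte
                // offset, which a hash index (name -> offset) could store
                // for later random access.
                byte[] bytes = readAll(fs, status.getPath());
                writer.append(new Text(status.getPath().getName()),
                              new BytesWritable(bytes));
            }
        }
    }

    // Reads a whole (small) file from HDFS into memory.
    private static byte[] readAll(FileSystem fs, Path file) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, out, 4096, false);
        }
        return out.toByteArray();
    }
}

The hashing and caching components could be layered on top of such a merger: a hash map from file name to byte offset, kept in a cache, would let a read become a single seek with SequenceFile.Reader rather than a NameNode lookup per small file.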

Author Biography

Mansoor Ahmad Mir, SEST, Jamia Hamdard, New Delhi, India

M.Tech. Computer Science Engineering student at Jamia Hamdard.

References

Shvachko, Konstantin, et al. "The Hadoop Distributed File System." Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on. IEEE, 2010.

Yuan, Yu, et al. "Performance analysis of Hadoop for handling small files in single node." Jisuanji Gongcheng yu Yingyong (Computer Engineering and Applications) 49.3 (2013): 57-60.

White, Tom. "The Small Files Problem." Cloudera Blog, blog.cloudera.com/blog/2009/02/the-small-files-problem (2009).

Dong, Bo, et al. "A novel approach to improving the efficiency of storing and accessing small files on Hadoop: a case study by PowerPoint files." Services Computing (SCC), 2010 IEEE International Conference on. IEEE, 2010.

White, Tom. Hadoop: The Definitive Guide. 2nd ed. O'Reilly Media, Sebastopol, CA, 2010. 41-45.

Jiang, Liu, Bing Li, and Meina Song. "The optimization of HDFS based on small files." Broadband Network and Multimedia Technology (IC-BNMT), 2010 3rd IEEE International Conference on. IEEE, 2010.

Mackey, Grant, Saba Sehrish, and Jun Wang. "Improving metadata management for small files in HDFS." Cluster Computing and Workshops, 2009. CLUSTER'09. IEEE International Conference on. IEEE, 2009.

Luo, Min, and Haruo Yokota. "Comparing Hadoop and Fat-Btree based access method for small file I/O applications." International Conference on Web-Age Information Management. Springer Berlin Heidelberg, 2010.

Shen, Chunhui, et al. "A digital library architecture supporting massive small files and efficient replica maintenance." Proceedings of the 10th annual joint conference on Digital libraries. ACM, 2010.

Liu, Xuhui, et al. "Implementing WebGIS on Hadoop: A case study of improving small file I/O performance on HDFS." Cluster Computing and Workshops, 2009. CLUSTER'09. IEEE International Conference on. IEEE, 2009.

Shvachko, Konstantin. "Name-node memory size estimates and optimization proposal." Apache Hadoop Common Issues, HADOOP-1687 (2007).

Dong, Bo, et al. "An optimized approach for storing and accessing small files on cloud storage." Journal of Network and Computer Applications 35.6 (2012): 1847-1862.

Gupta, B., R. Nath, and G. Gopal. "A Novel Techniques to Handle Small Files with Big Data Technology." Proceedings of Vivechana: A National Conference on Advances in Computer Science and Engineering (ACSE), Department of Computer Science & Applications, Kurukshetra University, Kurukshetra, Haryana, India, 29-30 April 2016.