An Optimal Solution for the Small File Problem in Hadoop

Mansoor Ahmad Mir


Hadoop is an open-source Apache project and software framework for the distributed processing of large datasets across clusters of commodity hardware, used mainly for processing Big Data. Its two main components are HDFS (the Hadoop Distributed File System) and MapReduce (a programming model). Hadoop does not perform well when processing a large number of small files (on the order of hundreds of KBs or a few MBs): such files place a heavy burden on the NameNode of HDFS and increase the execution time of MapReduce jobs. Because Hadoop is designed to handle large files, it pays a penalty when handling small ones. This work provides an introduction to Hadoop and Big Data, reviews the existing work, and proposes a more efficient technique for handling the small file problem in Hadoop based on file merging, hashing, and caching. The technique saves memory at the NameNode, reduces average memory usage at DataNodes, and improves access efficiency compared with existing techniques.
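The core idea behind file-merging approaches like the one described above can be illustrated with a small sketch. The snippet below is not the paper's implementation; it is a hypothetical, simplified model in which many small files are packed into one container blob while a hash-based index records each file's offset and length, so the cluster stores one large file (cheap for the NameNode) yet can still retrieve any individual small file:

```python
import io

def merge_files(files):
    """Illustrative only: pack small files into one blob.

    files: dict of {name: bytes}.
    Returns (blob, index), where index maps name -> (offset, length).
    The index plays the role of the hash structure: O(1) lookup of a
    small file's location inside the merged container.
    """
    index = {}
    buf = io.BytesIO()
    for name, data in files.items():
        index[name] = (buf.tell(), len(data))  # record position before writing
        buf.write(data)
    return buf.getvalue(), index

def read_file(blob, index, name):
    """Fetch one logical small file back out of the merged blob."""
    offset, length = index[name]
    return blob[offset:offset + length]

# Usage: two tiny files become one 9-byte blob plus an index.
files = {"a.txt": b"alpha", "b.txt": b"beta"}
blob, index = merge_files(files)
print(read_file(blob, index, "b.txt"))  # b'beta'
```

In an HDFS setting, the merged blob would be the large file stored in blocks, and the index (possibly cached at the client for prefetching) replaces per-file metadata at the NameNode.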


NameNode, DataNode, Amazon EC2, Merging, Prefetching








Copyright (c) 2017 International Journal of Advanced Research in Computer Science