
Assessment of Various Chunking Techniques for De-duplication in Big Data

Rahul Rawat


Over the past fifteen years, data volumes have grown very rapidly; data at this scale is termed big data, and its size can reach terabytes (TB) or petabytes (PB). The main challenges in big data are handling duplicate data and extracting useful information. This paper presents an assessment of various chunking techniques, including frequency based chunking (FBC), content based chunking, and byte level chunking. The assessment maps each chunking technique to big data applications and to de-duplication performance metrics. It shows that, among the techniques compared, FBC is the best suited for data de-duplication.
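As a rough illustration of why the choice of chunking technique affects de-duplication, the Python sketch below compares fixed-size chunking with a simplified content-defined chunker. It is not the FBC algorithm assessed in the paper. The function names, the CRC32-based boundary test, and the window, mask, and size limits are all assumptions chosen for the example.

```python
import hashlib
import random
import zlib


def fixed_chunks(data: bytes, size: int = 64):
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes, window: int = 16, mask: int = 0x3F,
                           min_size: int = 16, max_size: int = 256):
    """Cut where a hash of the trailing `window` bytes matches a bit pattern.

    Illustrative only: the hash is recomputed at every position instead of
    being rolled, and min_size must be at least as large as window.
    """
    chunks, start = [], 0
    for i in range(len(data)):
        length = i - start + 1
        if length < min_size:
            continue
        h = zlib.crc32(data[i + 1 - window:i + 1])
        if (h & mask) == 0 or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


def dedup_ratio(chunks):
    """Total bytes divided by bytes kept after dropping duplicate chunks."""
    unique = {}
    for c in chunks:
        # Identify each chunk by its SHA-256 fingerprint.
        unique.setdefault(hashlib.sha256(c).digest(), len(c))
    total = sum(len(c) for c in chunks)
    return total / sum(unique.values())


# Two copies of the same block, with one byte inserted before the second copy.
rng = random.Random(0)
block = bytes(rng.randrange(256) for _ in range(4096))
stream = block + b"X" + block

print("fixed-size ratio:     ", dedup_ratio(fixed_chunks(stream)))
print("content-defined ratio:", dedup_ratio(content_defined_chunks(stream)))
```

In this sketch the inserted byte shifts every fixed-size boundary in the second copy, so few or no chunks repeat. The content-defined boundaries depend only on nearby bytes, so they can realign after the insertion and the repeated chunks can be detected.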

Cite this Article
Rahul Rawat. Assessment of Various Chunking Techniques for De-duplication in Big Data. Recent Trends in Programming Languages. 2016; 3(2): 7–12p.


Big data, chunking, data de-duplication (De-dup), FBC (Frequency based Chunking), CDC (Content Defined Chunking)





