Data De-Duplication with Adaptive Chunking and Accelerated Modification Identifying

Xingjun Zhang

Department of Computer Science and Technology
Xi'an Jiaotong University, Xi'an, 710049, China
Guofeng Zhu

Department of Computer Science and Technology
Xi'an Jiaotong University, Xi'an, 710049, China
Endong Wang

Inspur (Beijing) Electronic Information Industry Co. Ltd.
100085, Beijing, China
Scott Fowler

Department of Science and Technology
Linkoping University, Campus Norrkoping, SE-601 74, Sweden
Xiaoshe Dong

Department of Computer Science and Technology
Xi'an Jiaotong University, Xi'an, 710049, China

keywords: Data de-duplication, self-adaptive, FastCDC

A data de-duplication system pursues not only a high de-duplication rate, i.e., the aggregate reduction in storage requirements gained from de-duplication, but also a high de-duplication speed. To address the arbitrary parameter setting of Content Defined Chunking (CDC), a self-adaptive data chunking algorithm is proposed. The algorithm improves the de-duplication rate by running a pre-processing de-duplication pass on samples of the classified files and then selecting appropriate algorithm parameters. Meanwhile, FastCDC, a content-based fast data chunking algorithm, is adopted to address the low de-duplication speed of CDC. FastCDC introduces a de-duplication factor and an acceleration factor; by adjusting these two parameters, it can significantly boost de-duplication speed while largely preserving the de-duplication rate. Experimental results demonstrate that the proposed method improves the de-duplication rate by about 5 %, while FastCDC increases de-duplication speed by 50 % to 200 % at the expense of less than 3 % loss in de-duplication rate.
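
To make the terms above concrete, the following is a minimal sketch of generic content-defined chunking and of sample-based parameter selection. It is not the authors' algorithm or their FastCDC variant: the Gear-style rolling hash, the names `cdc_chunks`, `dedup_rate` and `pick_mask_bits`, the parameters `min_size`, `mask_bits` and `max_size`, and the definition of the rate as the fraction of bytes saved are all assumptions made for illustration.

```python
import hashlib
import random

# Assumed 256-entry table of pseudo-random 32-bit values for a
# Gear-style rolling hash (illustrative, not taken from the paper).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def cdc_chunks(data, min_size=64, mask_bits=8, max_size=4096):
    """Split data at boundaries chosen by content rather than by offset."""
    mask = (1 << mask_bits) - 1
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        # Older bytes are shifted out, so the low bits depend only on
        # the most recent bytes of content.
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk without a boundary
    return chunks


def dedup_rate(streams, **params):
    """Fraction of bytes saved when each unique chunk is stored once."""
    seen, total, stored = set(), 0, 0
    for data in streams:
        for chunk in cdc_chunks(data, **params):
            total += len(chunk)
            fingerprint = hashlib.sha1(chunk).digest()
            if fingerprint not in seen:
                seen.add(fingerprint)
                stored += len(chunk)
    return 1 - stored / total if total else 0.0


def pick_mask_bits(samples, candidates=(6, 8, 10)):
    """Return the candidate mask_bits with the best rate on the samples."""
    return max(candidates, key=lambda bits: dedup_rate(samples, mask_bits=bits))
```

Because boundaries depend on content, inserting a byte near the start of a stream should disturb only the chunks near the insertion, so `dedup_rate([data, b"X" + data])` stays well above zero where fixed-size chunking would find almost no duplicates. `pick_mask_bits` mirrors, in a very simplified form, the idea of de-duplicating samples first and then choosing parameters.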

reference: Vol. 35, 2016, No. 3, pp. 586–614

Computing and Informatics

formerly Computers and Artificial Intelligence
