Whole-File Chunk-Based Deduplication Using Reinforcement Learning for Cloud Storage

Publication Date

1-1-2022

Document Type

Conference Proceeding

Publication Title

Proceedings of the 2022 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2022

DOI

10.1109/ASONAM55673.2022.10068661

First Page

269

Last Page

276

Abstract

Deduplication is the process of removing replicated data content from storage facilities such as online databases, cloud datastores, and local file systems. It is commonly performed as part of data preprocessing to eliminate redundant data that consumes extra storage space and computing power, and it is crucial for data storage management in cloud computing. Deduplication is essential for file backup systems, since duplicated files consume additional storage space, especially when the backup period is short (for example, daily). A common technique in this field involves splitting files into chunks whose hashes can be compared using data structures or techniques such as clustering. This paper explores the possibility of performing such file chunk deduplication with an innovative reinforcement learning approach to achieve a high deduplication ratio. The proposed system, named SegDup, achieves a 13% higher deduplication ratio than Extreme Binning, a state-of-the-art deduplication algorithm.
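
The chunk-and-hash technique mentioned in the abstract can be illustrated with a minimal Python sketch. This is not SegDup or Extreme Binning and involves no reinforcement learning; it only shows the baseline idea of storing each distinct chunk once. The fixed chunk size, the use of SHA-256, the helper names (`chunk_bytes`, `deduplicate`, `dedup_ratio`, `restore`), the sample file names, and the definition of the deduplication ratio as logical bytes divided by stored bytes are all assumptions made for illustration; the record above does not specify them.

```python
import hashlib

# Assumption: a tiny fixed chunk size so the example is easy to follow.
CHUNK_SIZE = 4  # bytes


def chunk_bytes(data: bytes, size: int = CHUNK_SIZE):
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def deduplicate(files):
    """Store each distinct chunk once, keyed by its SHA-256 digest.

    Returns (store, recipes):
      store   maps digest -> chunk bytes
      recipes maps file name -> ordered list of digests to rebuild the file
    """
    store = {}
    recipes = {}
    for name, data in files.items():
        digests = []
        for chunk in chunk_bytes(data):
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)  # keep only the first copy
            digests.append(digest)
        recipes[name] = digests
    return store, recipes


def dedup_ratio(files, store):
    """Assumed definition: logical bytes divided by bytes actually stored."""
    logical = sum(len(data) for data in files.values())
    stored = sum(len(chunk) for chunk in store.values())
    return logical / stored if stored else 1.0


def restore(name, store, recipes):
    """Rebuild a file from its list of chunk digests."""
    return b"".join(store[digest] for digest in recipes[name])


# Hypothetical daily backups that share their first two chunks.
files = {
    "monday.bak": b"AAAABBBBCCCC",
    "tuesday.bak": b"AAAABBBBDDDD",
}
store, recipes = deduplicate(files)
ratio = dedup_ratio(files, store)
```

In this toy input the two backups share two of their three chunks, so only four distinct chunks are stored for six logical ones. Real systems typically use much larger chunks and often content-defined chunk boundaries rather than fixed offsets.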

Keywords

Bloom Filter, Cloud Storage, Deduplication, Deep Q-Network, Reinforcement Learning

Department

Computer Science
