Whole-File Chunk-Based Deduplication Using Reinforcement Learning for Cloud Storage

Publication Date


Document Type

Conference Proceeding

Publication Title

Proceedings of the 2022 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2022



First Page


Last Page



Deduplication is the process of removing replicated data content from storage facilities like online databases, cloud datastore, local file systems, etc. It is commonly performed as part of data preprocessing to eliminate redundant data that requires extra storage spaces and computing power and is crucial for data storage management in cloud computing. Deduplication is essential for file backup systems since duplicated files will presumably consume more storage space, especially with a short backup period such as daily. A common technique in this field involves splitting files into chunks whose hashes can be compared using data structures or techniques like clustering. This paper explores the possibility of performing such file chunk deduplication leveraging an innovative reinforcement learning approach to achieve a high deduplication ratio. The proposed system is named SegDup, which achieves 13% higher deduplication ratio than Extreme Binning, a state-of-the art deduplication algorithm.


Bloom Filter, Cloud Storage, Deduplication, Deep Q-Network, Reinforcement Learning


Computer Science