Master of Science (MS)
Reinforcement Learning, Deduplication, File Chunk
Deduplication is the process of removing replicated data content from storage facilities such as online databases, cloud datastores, and local file systems. It is commonly performed as part of data preprocessing to eliminate redundant data that consumes unnecessary storage space and computing power. Deduplication is especially important for file backup systems, since duplicated files consume more storage space, particularly when the backup period is short, such as daily. A common technique in this field involves splitting files into chunks whose hashes can be compared using data structures or techniques like clustering. In this project, we explore the possibility of performing such file chunk deduplication using an innovative reinforcement learning approach to achieve a high deduplication ratio. We name the proposed system SegDup. It achieves a 13% higher deduplication ratio than a state-of-the-art deduplication algorithm named Extreme Binning.
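As a rough illustration of the hash-based chunk comparison mentioned above, the following minimal sketch splits byte strings into chunks, indexes them by hash, and reports a deduplication ratio. It is not SegDup or Extreme Binning and involves no reinforcement learning. The fixed 4096-byte chunk size, the choice of SHA-256, the ratio definition (bytes before deduplication divided by bytes stored after), and the names `chunk_bytes` and `deduplicate` are all assumptions made for illustration.

```python
import hashlib


def chunk_bytes(data: bytes, chunk_size: int = 4096):
    """Yield fixed-size chunks of data (assumed scheme; the last chunk may be shorter)."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


def deduplicate(files, chunk_size: int = 4096):
    """Store each distinct chunk once, keyed by its SHA-256 digest.

    Returns the chunk store and an assumed deduplication ratio:
    total input bytes divided by bytes actually stored.
    """
    store = {}        # hex digest -> chunk bytes
    total_bytes = 0   # bytes seen before deduplication
    for data in files:
        for chunk in chunk_bytes(data, chunk_size):
            total_bytes += len(chunk)
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)  # keep only the first copy
    stored_bytes = sum(len(chunk) for chunk in store.values())
    ratio = total_bytes / stored_bytes if stored_bytes else 1.0
    return store, ratio


# Two hypothetical daily backups that share their first chunk.
backup_day1 = b"A" * 4096 + b"B" * 4096
backup_day2 = b"A" * 4096 + b"C" * 4096
store, ratio = deduplicate([backup_day1, backup_day2])
print(len(store), round(ratio, 3))
```

In this toy input, four chunks are read but only three distinct chunks are stored, so the ratio under the assumed definition is 4/3.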
Yuan, Xincheng, "Whole File Chunk Based Deduplication Using Reinforcement Learning" (2022). Master's Projects. 1080.
Available for download on Friday, May 26, 2023