Publication Date

Spring 2022

Degree Type

Master's Project

Degree Name

Master of Science (MS)

Department


Computer Science

First Advisor

Teng Moh

Second Advisor

Melody Moh

Third Advisor

Chris Pollett

Keywords


Reinforcement Learning, Deduplication, File Chunk

Abstract


Deduplication is the process of removing replicated data content from storage facilities such as online databases, cloud datastores, and local file systems. It is commonly performed as part of data preprocessing to eliminate redundant data that would otherwise consume unnecessary storage space and computing power. Deduplication is especially important for file backup systems, where duplicated files consume more storage space as backups accumulate, particularly with a short backup period such as daily [8]. A common technique in this field splits files into chunks and compares the chunks' hashes using data structures or techniques such as clustering. In this project we explore performing file chunk deduplication with a reinforcement learning approach in order to achieve a high deduplication ratio. We name the proposed system SegDup. It achieves a 13% higher deduplication ratio than Extreme Binning, a state-of-the-art deduplication algorithm.
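The chunk-and-hash technique mentioned in the abstract can be illustrated with a minimal sketch. This is not SegDup and not Extreme Binning: it assumes fixed-size chunks, SHA-256 hashes, and a plain in-memory dictionary as the chunk index, and it assumes the deduplication ratio is defined as bytes before deduplication divided by bytes actually stored. All function names here are hypothetical.

```python
import hashlib


def chunk_bytes(data: bytes, chunk_size: int = 4096):
    """Split data into fixed-size chunks (the last chunk may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def deduplicate(files: dict, chunk_size: int = 4096):
    """Store each distinct chunk once, keyed by its SHA-256 hash.

    Returns the chunk store, a per-file recipe (ordered list of chunk
    hashes), and the deduplication ratio under the assumed definition:
    total input bytes / bytes stored after deduplication.
    """
    store = {}    # chunk hash -> chunk bytes
    recipes = {}  # file name -> ordered list of chunk hashes
    total_bytes = 0
    for name, data in files.items():
        recipe = []
        for chunk in chunk_bytes(data, chunk_size):
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)  # keep only the first copy
            recipe.append(digest)
            total_bytes += len(chunk)
        recipes[name] = recipe
    stored_bytes = sum(len(chunk) for chunk in store.values())
    ratio = total_bytes / stored_bytes if stored_bytes else 1.0
    return store, recipes, ratio


def restore(store: dict, recipe: list) -> bytes:
    """Rebuild a file from its recipe of chunk hashes."""
    return b"".join(store[digest] for digest in recipe)


if __name__ == "__main__":
    # Two toy "backups" that share one 4 KiB chunk of identical content.
    files = {
        "monday.bak": b"A" * 8192 + b"B" * 4096,
        "tuesday.bak": b"A" * 4096 + b"C" * 4096,
    }
    store, recipes, ratio = deduplicate(files)
    print(f"unique chunks: {len(store)}, deduplication ratio: {ratio:.2f}")
```

Under these assumptions, a higher ratio means more redundant chunks were eliminated. Practical systems typically replace the fixed-size split with content-defined chunking and the in-memory dictionary with a scalable index, which is where approaches such as the one described above differ.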