Publication Date
Spring 2022
Degree Type
Master's Project
Degree Name
Master of Science (MS)
Department
Computer Science
First Advisor
Teng Moh
Second Advisor
Melody Moh
Third Advisor
Chris Pollett
Keywords
Reinforcement Learning, Deduplication, File Chunk
Abstract
Deduplication is the process of removing replicated data content from storage facilities such as online databases, cloud datastores, and local file systems. It is commonly performed during data preprocessing to eliminate redundant data that would otherwise consume unnecessary storage space and computing power. Deduplication is especially essential for file backup systems, since duplicated files consume additional storage space with every backup, particularly under a short backup period such as daily backups [8]. A common technique in this field splits files into chunks whose hashes can be compared using data structures or techniques such as clustering. In this project we explore performing such file chunk deduplication with an innovative reinforcement learning approach to achieve a high deduplication ratio. We name the proposed system SegDup. It achieves a 13% higher deduplication ratio than Extreme Binning, a state-of-the-art deduplication algorithm.
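The chunk-and-hash comparison the abstract describes can be sketched minimally as follows. This is an illustrative example, not SegDup itself: it assumes a hypothetical fixed chunk size and SHA-256 fingerprints, whereas real systems (including those compared here) typically use content-defined chunking and more elaborate index structures.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size; production systems often use content-defined chunking

def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Split a byte stream into fixed-size chunks and fingerprint each chunk."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]

def dedup_ratio(files: list[bytes]) -> float:
    """Fraction of chunks across all files that are duplicates of an already-seen chunk."""
    seen: set[str] = set()
    total = 0
    for data in files:
        for h in chunk_hashes(data):
            total += 1
            seen.add(h)
    return 1 - len(seen) / total if total else 0.0
```

With this sketch, backing up the same 4 KB file twice yields a deduplication ratio of 0.5, since the second copy's single chunk is already indexed.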
Recommended Citation
Yuan, Xincheng, "Whole File Chunk Based Deduplication Using Reinforcement Learning" (2022). Master's Projects. 1080.
DOI: https://doi.org/10.31979/etd.xdv4-q8f8
https://scholarworks.sjsu.edu/etd_projects/1080