Publication Date

Spring 2022

Degree Type

Master's Project

Degree Name

Master of Science (MS)

Department

Computer Science

First Advisor

Teng Moh

Second Advisor

Melody Moh

Third Advisor

Chris Pollett

Keywords

Reinforcement Learning, Deduplication, File Chunk

Abstract

Deduplication is the process of removing replicated data from storage systems such as online databases, cloud datastores, and local file systems. It is commonly performed as part of data preprocessing to eliminate redundant data that consumes unnecessary storage space and computing power. Deduplication is especially important for file backup systems, since duplicated files consume more storage space, particularly when the backup period is short, such as daily [8]. A common technique in this field involves splitting files into chunks whose hashes can be compared using data structures or techniques like clustering. In this project, we explore the possibility of performing such file chunk deduplication with an innovative reinforcement learning approach to achieve a high deduplication ratio. We name the proposed system SegDup. It achieves a 13% higher deduplication ratio than a state-of-the-art deduplication algorithm named Extreme Binning.
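
The chunk-and-hash technique mentioned in the abstract can be illustrated with a minimal sketch. This is not SegDup and not Extreme Binning: it assumes fixed-size chunks, SHA-256 digests, and a plain in-memory dictionary as the chunk index, and it assumes the deduplication ratio is defined as logical bytes divided by stored bytes. The names `chunk_bytes`, `deduplicate`, and `restore` are hypothetical and chosen only for illustration.

```python
import hashlib


def chunk_bytes(data: bytes, chunk_size: int = 4) -> list[bytes]:
    """Split data into fixed-size chunks (the last chunk may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def deduplicate(files: list[bytes], chunk_size: int = 4):
    """Store each distinct chunk once, keyed by its SHA-256 digest.

    Returns the chunk store, one recipe (list of digests) per file, and
    the deduplication ratio, assumed here to be logical bytes / stored bytes.
    """
    store: dict[str, bytes] = {}       # digest -> chunk content
    recipes: list[list[str]] = []      # per-file ordered digests
    logical_bytes = 0

    for data in files:
        recipe = []
        for chunk in chunk_bytes(data, chunk_size):
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)  # keep only the first copy
            recipe.append(digest)
            logical_bytes += len(chunk)
        recipes.append(recipe)

    stored_bytes = sum(len(chunk) for chunk in store.values())
    ratio = logical_bytes / stored_bytes if stored_bytes else 1.0
    return store, recipes, ratio


def restore(store: dict[str, bytes], recipe: list[str]) -> bytes:
    """Rebuild a file from its recipe of chunk digests."""
    return b"".join(store[digest] for digest in recipe)


if __name__ == "__main__":
    backups = [b"AAAABBBBCCCC", b"AAAABBBBDDDD"]
    store, recipes, ratio = deduplicate(backups, chunk_size=4)
    print(len(store), ratio)
```

In this toy input, two 12-byte files share two of their three chunks, so only four distinct chunks are stored. A real system would typically use much larger chunks and a persistent index; those details are outside what the abstract specifies.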

Recommended Citation

Yuan, Xincheng, "Whole File Chunk Based Deduplication Using Reinforcement Learning" (2022). Master's Projects. 1080.
DOI: https://doi.org/10.31979/etd.xdv4-q8f8
https://scholarworks.sjsu.edu/etd_projects/1080

Included in

Artificial Intelligence and Robotics Commons, Databases and Information Systems Commons
