Publication Date

Summer 2021

Degree Type

Master's Project

Degree Name

Master of Science in Computer Science (MSCS)


Computer Science

First Advisor

Teng Moh


inline deduplication, block similarity, cache eviction, data fragmentation


Data deduplication saves storage space by physically storing a single instance of data and eliminating redundant copies. Its adoption in actively accessed primary storage is minimal because of complexities such as random data access patterns and the need for fast request response times. Most solutions designed for primary storage are offline and depend on locality. This paper proposes an inline deduplication system with a Machine Learning based cache eviction policy that reduces the metadata overhead of the deduplication process, eliminates redundant writes, and improves overall throughput in latency-sensitive storage workloads. The system’s major components are superblocking, superblock categorization, similarity detection, and deduplication, supported by an efficient caching mechanism. It categorizes identical sequences of blocks based on the minimal fingerprint value of the superblock. Caching of fingerprints plays a vital role in deduplication performance. A novel Machine Learning model for cache eviction is built on the recency, frequency, and category of a block. The experimental results show that more than 33% of redundant writes are eliminated with smaller superblocks, and that metadata overhead is reduced by at least 54.5% by categorizing similar superblocks. With a metadata cache allocation of 10% of the average metadata stream size, the cache hit rates of the workload-dependent Machine Learning model are 5.43% and 10.36% higher than those of systems with LRU and LFU eviction policies, respectively, resulting in 14.4% better throughput than a system with a traditional cache eviction policy. The cache system learns from the I/O statistics of previously evicted blocks and refines itself when choosing an eviction candidate. The system has shown satisfactory performance on all the real-world I/O traces considered in the experiments.
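
The idea of categorizing superblocks by their minimal fingerprint can be illustrated with a minimal Python sketch. It is not the project's implementation: the block size, the superblock length, the use of SHA-1, and the names `fingerprint`, `superblocks`, and `DedupStore` are all assumptions made for illustration, and the Machine Learning based cache eviction policy is not modeled at all.

```python
import hashlib
from collections import defaultdict

BLOCK_SIZE = 4096          # assumed block size in bytes (hypothetical)
BLOCKS_PER_SUPERBLOCK = 4  # assumed superblock length (hypothetical)


def fingerprint(block: bytes) -> str:
    """Return a content fingerprint for one block (SHA-1 is an assumption)."""
    return hashlib.sha1(block).hexdigest()


def superblocks(data: bytes):
    """Yield lists of consecutive fixed-size blocks, one list per superblock."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for i in range(0, len(blocks), BLOCKS_PER_SUPERBLOCK):
        yield blocks[i:i + BLOCKS_PER_SUPERBLOCK]


class DedupStore:
    """Toy inline deduplication store keyed by superblock category."""

    def __init__(self):
        # category (minimal fingerprint) -> {block fingerprint: block data}
        self.categories = defaultdict(dict)
        self.writes_total = 0
        self.writes_eliminated = 0

    def write(self, data: bytes) -> None:
        for sb in superblocks(data):
            fps = [fingerprint(b) for b in sb]
            # The superblock's category is its smallest block fingerprint,
            # so identical block sequences always land in the same category.
            bucket = self.categories[min(fps)]
            for fp, block in zip(fps, sb):
                self.writes_total += 1
                if fp in bucket:
                    self.writes_eliminated += 1  # duplicate: skip the write
                else:
                    bucket[fp] = block           # new block: store it once
```

In this sketch a lookup only searches the fingerprints of one category rather than a global index, which is one way a smaller metadata working set could arise. The trade-off is that a duplicate block that falls into a different category goes undetected here.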