Performance Centric Primary Storage Deduplication Systems Exploiting Caching and Block Similarity

Publication Date


Document Type

Conference Proceeding

Publication Title

Proceedings of the 2022 16th International Conference on Ubiquitous Information Management and Communication, IMCOM 2022




Data deduplication physically stores a single instance of data and eliminates redundant copies, identified by matching strong data hashes (e.g., fingerprints), to save storage space. The adoption of deduplication for primary storage has been hampered by its complexities, such as random access patterns and the need for fast request response times. Most solutions designed for primary storage are offline and depend on locality. This paper proposes an inline deduplication system with a Machine Learning based cache eviction policy that reduces the metadata overhead of the deduplication process, eliminates redundant writes, and improves overall throughput for latency-sensitive storage workloads. Caching of fingerprints plays a vital role in deduplication performance. A novel Machine Learning model for cache eviction is built on the recency, frequency, Logical Block Address, and category of a data block. The cache system learns from the I/O statistics of previously evicted blocks and refines itself when choosing an eviction candidate. Experimental results on real-world I/O traces show that 33% of redundant writes are eliminated, metadata overhead is reduced by 54.5% by exploiting block similarity, and metadata cache hit rates with the Machine Learning model are 5.43% and 10.36% higher than those of systems with Least Recently Used and Least Frequently Used eviction policies, respectively. The workload-dependent Machine Learning based cache eviction policy achieves 14.4% better throughput than a system with a traditional cache eviction policy.
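
The inline write path described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The class name DedupStore, its fields, and the scoring formula are hypothetical. In particular, the _score method is a hand-written stand-in that uses only recency and frequency; the paper's learned model also uses the Logical Block Address and the category of a block, and the block-similarity (superblock) mechanism is not modeled here. SHA-256 is assumed as the fingerprint function.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class CacheEntry:
    physical_addr: int   # where the unique block lives in the block store
    last_access: int     # logical clock tick of the most recent access
    hits: int = 1        # how often this fingerprint has been seen while cached


class DedupStore:
    """Toy inline deduplication store with a bounded fingerprint cache."""

    def __init__(self, cache_capacity: int):
        self.cache_capacity = cache_capacity
        self.cache = {}      # fingerprint -> CacheEntry (in-memory metadata cache)
        self.index = {}      # fingerprint -> physical address (full index)
        self.lba_map = {}    # logical block address -> physical address
        self.blocks = []     # physical block store
        self.clock = 0
        self.cache_hits = 0
        self.writes_eliminated = 0

    def _score(self, entry: CacheEntry) -> float:
        # Hypothetical stand-in for a learned eviction model:
        # frequently and recently used entries score higher.
        age = self.clock - entry.last_access
        return entry.hits / (1 + age)

    def _cache_put(self, fp: str, addr: int) -> None:
        # Evict the lowest-scoring entry when the cache is full.
        if len(self.cache) >= self.cache_capacity:
            victim = min(self.cache, key=lambda k: self._score(self.cache[k]))
            del self.cache[victim]
        self.cache[fp] = CacheEntry(addr, self.clock)

    def write(self, lba: int, data: bytes) -> int:
        self.clock += 1
        fp = hashlib.sha256(data).hexdigest()
        entry = self.cache.get(fp)
        if entry is not None:
            # Cache hit: duplicate detected without consulting the full index.
            entry.hits += 1
            entry.last_access = self.clock
            self.cache_hits += 1
            self.writes_eliminated += 1
            addr = entry.physical_addr
        elif fp in self.index:
            # Cache miss, but the block is a known duplicate.
            addr = self.index[fp]
            self.writes_eliminated += 1
            self._cache_put(fp, addr)
        else:
            # Unique block: store it once and record its fingerprint.
            addr = len(self.blocks)
            self.blocks.append(data)
            self.index[fp] = addr
            self._cache_put(fp, addr)
        self.lba_map[lba] = addr
        return addr

    def read(self, lba: int) -> bytes:
        return self.blocks[self.lba_map[lba]]


store = DedupStore(cache_capacity=2)
store.write(0, b"A" * 4096)
store.write(1, b"B" * 4096)
store.write(2, b"A" * 4096)   # duplicate of LBA 0, no new physical block
```

In this sketch a duplicate write only updates the logical-to-physical mapping, which is how redundant writes are avoided. Swapping _score for a trained model would change which fingerprints stay cached without touching the write path.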


inline deduplication, machine learning based cache eviction, primary storage, superblocks


Computer Science