Faculty Research, Scholarly, and Creative Activity

Performance Centric Primary Storage Deduplication Systems Exploiting Caching and Block Similarity

Rakesh Gururaj, San Jose State University
Melody Moh, San Jose State University
Teng Sheng Moh, San Jose State University
Philip Shilane, Dell Technologies
Bhimsen Bhanjois, Dell Technologies

Publication Date

1-1-2022

Document Type

Conference Proceeding

Publication Title

Proceedings of the 2022 16th International Conference on Ubiquitous Information Management and Communication, IMCOM 2022

DOI

10.1109/IMCOM53663.2022.9721761

Abstract

Data deduplication saves storage space by physically storing a single instance of data and eliminating redundant copies, which are identified by matching strong data hashes (e.g., fingerprints). The adoption of deduplication for primary storage has been hampered by its complexities, such as random-access patterns to data and the need for fast request response times. Most solutions designed for primary storage are offline and depend on locality. This paper proposes an inline deduplication system with a Machine Learning-based cache eviction policy to reduce metadata overhead in the deduplication process, eliminate redundant writes, and improve overall throughput in latency-sensitive storage workloads. Caching of fingerprints plays a vital role in improving performance during deduplication. A novel Machine Learning model for cache eviction is built on the recency, frequency, Logical Block Address, and category of a data block. The experimental results show that 33% of redundant writes are eliminated, metadata overhead is reduced by 54.5% by exploiting block similarity, and metadata cache hit rates with the Machine Learning model are 5.43% and 10.36% higher than those of systems with Least Recently Used and Least Frequently Used eviction policies, respectively. We achieved 14.4% better throughput with a workload-dependent Machine Learning-based cache eviction policy than with a traditional cache eviction policy. The cache system learns from the I/O statistics of previously evicted blocks and refines itself when choosing an eviction candidate. Our system was evaluated on real-world I/O traces.

Keywords

inline deduplication, machine learning based cache eviction, primary storage, superblocks

Department

Computer Science

Recommended Citation

Rakesh Gururaj, Melody Moh, Teng Sheng Moh, Philip Shilane, and Bhimsen Bhanjois. "Performance Centric Primary Storage Deduplication Systems Exploiting Caching and Block Similarity" Proceedings of the 2022 16th International Conference on Ubiquitous Information Management and Communication, IMCOM 2022 (2022). https://doi.org/10.1109/IMCOM53663.2022.9721761
