The dilemma between deduplication and locality: Can both be achieved?

X Zou, J Yuan, P Shilane, W Xia, H Zhang… - … USENIX conference on …, 2021 - usenix.org
Data deduplication is widely used to reduce the size of backup workloads, but it has the
known disadvantage of causing poor data locality, also referred to as the fragmentation …

Rejection sampling for weighted jaccard similarity revisited

X Li, P Li - Proceedings of the AAAI Conference on Artificial …, 2021 - ojs.aaai.org
Efficiently computing the weighted Jaccard similarity has become an active research topic in
machine learning and theory. For sparse data, the standard technique is based on the …

The logarithmic dynamic cuckoo filter

F Zhang, H Chen, H Jin… - 2021 IEEE 37th …, 2021 - ieeexplore.ieee.org
The emergence of big data applications makes efficient representation for large-scale
dynamic data sets a challenge. The state-of-the-art design, i.e., the dynamic cuckoo filter …

Odess: Speeding up resemblance detection for redundancy elimination by fast content-defined sampling

X Zou, C Deng, W Xia, P Shilane, H Tan… - 2021 IEEE 37th …, 2021 - ieeexplore.ieee.org
Multiple data reduction techniques have been investigated to lower storage costs for a wide
variety of customers. In this work, we focus on similarity-based delta compression, which …

Improving the performance of deduplication-based backup systems via container utilization based hot fingerprint entry distilling

D Zhang, Y Deng, Y Zhou, Y Zhu, X Qin - ACM Transactions on Storage …, 2021 - dl.acm.org
Data deduplication techniques construct an index consisting of fingerprint entries to identify
and eliminate duplicated copies of repeating data. The bottleneck of disk-based index …

Consistent sampling through extremal process

P Li, X Li, G Samorodnitsky, W Zhao - Proceedings of the Web …, 2021 - dl.acm.org
The Jaccard similarity has been widely used in search and machine learning, especially in
industrial practice. For binary (0/1) data, the Jaccard similarity is often called the …

GoSeed: Optimal seeding plan for deduplicated storage

A Nachman, S Sheinvald, A Kolikant… - ACM Transactions on …, 2021 - dl.acm.org
Deduplication decreases the physical occupancy of files in a storage volume by removing
duplicate copies of data chunks, but creates data-sharing dependencies that complicate …

Slimstore: A cloud-based deduplication system for multi-version backups

Z Zhang, H Hu, Z Xue, C Chen, Y Yu… - 2021 IEEE 37th …, 2021 - ieeexplore.ieee.org
Cloud backup is becoming the preferred way for users to support disaster recovery. In
addition to its convenience, users are deeply concerned about reducing storage costs in the …

Dynamic prime chunking algorithm for data deduplication in cloud storage

M Ellappan, S Abirami - … on Internet and Information Systems (TIIS), 2021 - koreascience.kr
The data deduplication technique identifies the duplicates and minimizes the redundant
storage data in the backup server. The chunk level deduplication plays a significant role in …

Accelerating ML/DL applications with hierarchical caching on deduplication storage clusters

P Hamandawana, A Khan, J Kim… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Large scale machine learning (ML) and deep learning (DL) platforms face challenges when
integrated with deduplication enabled storage clusters. In the quest to achieve smart and …