Authors
Ingo Müller, Cornelius Ratsch, Franz Faerber
Publication date
2014/4
Conference
17th International Conference on Extending Database Technology (EDBT) – 2014
Pages
283–294
Description
Domain encoding is a common technique to compress the columns of a column store and to accelerate many types of queries at the same time. It is based on the assumption that most columns contain a relatively small set of distinct values, in particular string columns. In this paper, we argue that domain encoding is not the end of the story. In real-world systems, we observe that a substantial number of the columns are of string types. Moreover, most of the memory space is consumed by only a small fraction of these columns. To address this issue, we make three main contributions: First, we survey several approaches and variants for dictionary compression, i.e., data structures that store the dictionary of domain encoding in a compressed way. As expected, there is a trade-off between the size of the data structure and its access performance. This observation can be used to compress rarely accessed data more than frequently accessed data. Furthermore, the question of which approach has the best compression ratio for a certain column heavily depends on specific characteristics of its content. Consequently, as a second contribution, we present non-trivial sampling schemes for all our dictionary formats, enabling us to estimate their size for a given column. This way, it is possible to identify compression schemes specialized for the content of a specific column.
Third, we draft how to fully automate the decision of the dictionary format. We sketch a compression manager that selects the most appropriate dictionary format based on column access and update patterns, characteristics of the underlying data, and costs for set-up and access of the different data …
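For readers unfamiliar with domain encoding, the following is a minimal Python sketch of the basic idea the description starts from: each distinct value is stored once in a sorted dictionary, and the column itself holds only small integer codes. It is an illustration under simplifying assumptions, not the authors' implementation; the function names (domain_encode, decode, rows_equal_to) and the sample column are hypothetical, and the dictionary here is an uncompressed Python list rather than one of the compressed formats the paper surveys.

```python
from bisect import bisect_left


def domain_encode(column):
    """Return (dictionary, codes): sorted distinct values and one integer code per row."""
    dictionary = sorted(set(column))
    code_of = {value: code for code, value in enumerate(dictionary)}
    codes = [code_of[value] for value in column]
    return dictionary, codes


def decode(dictionary, codes):
    """Rebuild the original column by looking each code up in the dictionary."""
    return [dictionary[code] for code in codes]


def rows_equal_to(dictionary, codes, value):
    """Evaluate `column = value` by comparing integer codes instead of strings."""
    pos = bisect_left(dictionary, value)  # binary search in the sorted dictionary
    if pos == len(dictionary) or dictionary[pos] != value:
        return []  # value does not occur in the column
    return [row for row, code in enumerate(codes) if code == pos]


# Hypothetical sample column with few distinct values.
column = ["Berlin", "Paris", "Berlin", "Rome", "Paris", "Berlin"]
dictionary, codes = domain_encode(column)
```

In this sketch the string "Berlin" is stored once even though it occurs three times, and an equality predicate needs one dictionary lookup followed by integer comparisons. When a column has many long distinct strings, the dictionary itself dominates the memory footprint, which is the case the description singles out.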
Total citations
[Chart: citations per year, 2013–2024]