CS*: Approximate Query Processing on Big Data using Scalable Join Correlated Sample Synopsis

F Yu, WC Hou
2019 IEEE International Conference on Big Data (Big Data), 2019 - ieeexplore.ieee.org
Complex join queries are expensive to process on big data. Providing fast and accurate approximations to join queries with common aggregate functions can bring tremendous benefits in many fields, such as data management, data mining, and machine learning. State-of-the-art methods mainly focus on generating non-reusable samples at query time, which can be costly for big data applications. In this research, we develop a scalable sample-based synopsis, called Scalable Join Correlated Sample Synopsis (or CS*), which can be pre-computed and does not rely on any index structure. CS* needs to be generated only once and can be used to answer all future queries on the same database. It efficiently maintains join relationships between sampled tuples through the introduced scheme of scalable join correlated sampling and a unique numerical value called the join ratio (or JR). We further introduce two novel data structures, namely the count trace and the join correlated histogram, to optimize the calculation of JR values in MapReduce. For query estimation, multiple unbiased estimators are developed on CS* to provide fast and accurate approximations for join queries with common aggregate functions, acyclic or cyclic join graphs, and dangling tuples. The experimental study on large datasets demonstrates that CS* can be generated efficiently and provides accurate join query estimates with small sampling fractions.