A deep dive into common open formats for analytical DBMSs

C Liu, A Pavlenko, M Interlandi, B Haynes - Proceedings of the VLDB …, 2023 - dl.acm.org
This paper evaluates the suitability of Apache Arrow, Parquet, and ORC as formats for
subsumption in an analytical DBMS. We systematically identify and explore the high-level …

Is big data performance reproducible in modern cloud networks?

A Uta, A Custura, D Duplyakin, I Jimenez… - … USENIX symposium on …, 2020 - usenix.org
Performance variability has been acknowledged as a problem for over a decade by cloud
practitioners and performance engineers. Yet, our survey of top systems conferences …

An empirical evaluation of columnar storage formats

X Zeng, Y Hui, J Shen, A Pavlo, W McKinney… - arXiv preprint arXiv …, 2023 - arxiv.org
Columnar storage is a core component of a modern data analytics system. Although many
database management systems (DBMSs) have proprietary storage formats, most provide …

Jumpgate: In-Network Processing as a Service for Data Analytics

C Mustard, F Ruffy, A Gakhokidze… - 11th USENIX Workshop …, 2019 - usenix.org
In-network processing, where data is processed by special-purpose devices as it passes
over the network, is showing great promise at improving application performance, in …

The impact of columnar file formats on SQL‐on‐Hadoop engine performance: A study on ORC and Parquet

T Ivanov, M Pergolesi - Concurrency and Computation …, 2020 - Wiley Online Library
Columnar file formats provide an efficient way to store data to be queried by SQL‐on‐
Hadoop engines. Related works consider the performance of processing engine and file …

Sharing and caring of data at the edge

A Trivedi, L Wang, H Bal, A Iosup - 3rd USENIX Workshop on Hot Topics …, 2020 - usenix.org
Edge computing is an emerging computing paradigm where data is generated and
processed in the field using distributed computing devices. Many applications such as real …

Unification of Temporary Storage in the NodeKernel Architecture

P Stuedi, A Trivedi, J Pfefferle, A Klimovic… - 2019 USENIX Annual …, 2019 - usenix.org
Efficiently exchanging temporary data between tasks is critical to the end-to-end
performance of many data processing frameworks and applications. Unfortunately, the …

ReCG: Bottom-up JSON Schema Discovery Using a Repetitive Cluster-and-Generalize Framework

J Yun, B Tak, WS Han - Proceedings of the VLDB Endowment, 2024 - dl.acm.org
The schemalessness, one of the major advantages of JSON representation format, comes
with high penalties in querying and operations by denying various critical functions such as …

Scaling large production clusters with partitioned synchronization

Y Feng, Z Liu, Y Zhao, T Jin, Y Wu, Y Zhang… - 2021 USENIX Annual …, 2021 - usenix.org
The scale of computer clusters has grown significantly in recent years. Today, a cluster may
have 100 thousand machines and execute billions of tasks, especially short tasks, each day …

Skyhook: towards an Arrow-native storage system

J Chakraborty, I Jimenez, SA Rodriguez… - 2022 22nd IEEE …, 2022 - ieeexplore.ieee.org
With the ever-increasing dataset sizes, several file formats such as Parquet, ORC, and Avro
have been developed to store data efficiently, save the network, and interconnect bandwidth …