Optimization of the join between large tables in the Spark distributed framework

X Wu, Y He - Applied Sciences, 2023 - mdpi.com
The Join task between Spark large tables takes a long time to run and produces a lot of disk
I/O, network I/O and disk occupation in the Shuffle process. This paper proposes a …

A theoretical and experimental comparison of large-scale join algorithms in Spark

AC Phan, TC Phan, TN Trieu, TTQ Tran - SN Computer Science, 2021 - Springer
Currently, the estimated amount of data created daily have reached the threshold of
petabytes or even zettabytes globally. It is no wonder that traditional data processing …

A comparative study of join algorithms in Spark

AC Phan, TC Phan, TN Trieu - Future Data and Security Engineering: 7th …, 2020 - Springer
In the era of information explosion, the amount of data generated is increasing day by day,
reached the threshold of petabytes or even zettabytes. In order to extract useful information …

Join algorithms under Apache Spark: revisited

A Al-Badarneh - Proceedings of the 2019 5th International Conference …, 2019 - dl.acm.org
Currently, we are dealing with large scale applications, which in turn generate massive
amount of data and information. Large amount of data often requires processing algorithms …

A Spark Join Algorithm Based on Memory Monitoring and Batch Processing

C Kefei, L Zhao, Z Ke, D Xianjun… - 2018 IEEE 9th …, 2018 - ieeexplore.ieee.org
In recent years, the Spark memory computing framework has risen rapidly, and the data
processing speed has been greatly improved. However, the upper limit of speed is limited by …

Optimization of data distribution strategy in theta-join process based on Spark

S Cao, E Haihong, M Song, K Zhang - Proceedings of the 2nd …, 2018 - dl.acm.org
The theta-join between tables is a common operation in the data query and statistical
analysis. When dealing with large amounts of data, it will produce a great deal of cost. The …

Approximate distributed joins in Apache Spark

DL Quoc, IE Akkus, P Bhatotia, S Blanas… - arXiv preprint arXiv …, 2018 - arxiv.org
The join operation is a fundamental building block of parallel data processing. Unfortunately,
it is very resource-intensive to compute an equi-join across massive datasets. The …

Utilizing page-level join index for optimization in parallel join execution

C Lee, ZA Chang - IEEE Transactions on Knowledge and Data …, 1995 - ieeexplore.ieee.org
This paper presents a methodology for the optimization of parallel join execution. Past
research on parallel join methods mostly focused on the design of algorithms for partitioning …

PI-Join: Efficiently processing join queries on massive data

X Han, J Li, D Yang - Knowledge and Information Systems, 2012 - Springer
The ratio of disk capacity to disk transfer rate typically increases by 10× per decade. As a
result, disk is becoming slower from the view of applications because of the much larger …

Distributed join processing between streaming and stored big data under the micro-batch model

YH Jeon, KH Lee, HJ Kim - IEEE Access, 2019 - ieeexplore.ieee.org
In order to interpret, enrich, and analyze the streaming data, stream applications often
access the data stored in an external database. Although there has been a lot of studies on …