Effective data management strategy and RDD weight cache replacement strategy in Spark

K Jiang, S Du, F Zhao, Y Huang, C Li, Y Luo - Computer Communications, 2022 - Elsevier
With the dramatic increase in internet users and their demand for real-time network
performance, the Spark distributed computing environment has emerged. It is widely used …
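The entry above concerns weight-based RDD cache replacement. As a minimal illustration of the general idea (not the paper's actual algorithm), a cached partition can be scored by a hypothetical weight combining recomputation cost, expected reuse, and size, evicting the lowest-weight entries first; all names and the weight formula here are assumptions for illustration.

```python
# Sketch of weight-based RDD cache replacement (illustrative only, not the
# paper's method): each cached RDD gets a weight from its computation cost,
# reuse frequency, and size; lowest-weight entries are evicted first.
from dataclasses import dataclass

@dataclass
class CachedRDD:
    name: str
    compute_cost: float   # time to recompute the RDD (seconds)
    ref_count: int        # expected future references
    size_mb: float        # memory footprint in MB

def weight(rdd: CachedRDD) -> float:
    # Hypothetical weight: expensive-to-recompute, frequently reused,
    # small RDDs are the most valuable to keep in memory.
    return rdd.compute_cost * rdd.ref_count / rdd.size_mb

def evict_until_fits(cache: list[CachedRDD], capacity_mb: float) -> list[CachedRDD]:
    """Drop the lowest-weight entries until the cache fits in capacity_mb."""
    kept = sorted(cache, key=weight, reverse=True)
    while kept and sum(r.size_mb for r in kept) > capacity_mb:
        kept.pop()  # lowest weight sits at the end of the sorted list
    return kept
```

For example, with a 160 MB budget, an RDD that is cheap to recompute and rarely reused is evicted before a small, expensive, frequently reused one.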

Intermediate data placement and cache replacement strategy under Spark platform

C Li, Y Zhang, Y Luo - Journal of Parallel and Distributed Computing, 2022 - Elsevier
Spark is widely used for its high-performance caching mechanism and high scalability,
yet it still suffers from uneven workloads and produces useless intermediate caching results …

LPW: an efficient data-aware cache replacement strategy for Apache Spark

H Li, S Ji, H Zhong, W Wang, L Xu, Z Tang… - Science China …, 2023 - Springer
Caching is one of the most important techniques for the popular distributed big data
processing framework Spark. For this big data parallel computing framework, which is …

A Dynamic Memory Allocation Optimization Mechanism Based on Spark

S Wang, S Geng, Z Zhang, A Ye… - Computers …, 2019 - search.ebscohost.com
Spark is a memory-based distributed data processing framework. Memory allocation is a
central question in Spark research. A good memory allocation scheme can effectively improve …

A memory-aware spark cache replacement strategy

J Zhang, R Zhang, O Alfarraj, A Tolba… - Journal of Internet …, 2022 - jit.ndhu.edu.tw
Spark is currently the most widely used distributed computing framework, and its key data
abstraction concept, Resilient Distributed Dataset (RDD), brings significant performance …

Dynamic data replacement and adaptive scheduling policies in spark

C Li, Q Cai, Y Luo - Cluster Computing, 2022 - Springer
Improper data replacement and inappropriate selection of job scheduling policy are
important causes of degraded Spark execution speed, which directly …

Adaptive Control of Apache Spark's Data Caching Mechanism Based on Workload Characteristics

H Inagaki, T Fujii, R Kawashima… - 2018 6th International …, 2018 - ieeexplore.ieee.org
Apache Spark caches reusable data into memory/disk. From our preliminary evaluation, we
have found that memory-and-disk caching is ineffective compared to disk-only caching …
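The finding above suggests choosing a storage level from workload characteristics rather than always using memory-and-disk. The sketch below is a toy heuristic, not the paper's controller: when most of an RDD would spill to disk anyway, caching it disk-only may avoid the overhead of partial in-memory caching. The threshold and function are assumptions for illustration.

```python
# Toy heuristic (illustrative only) for picking a Spark storage level based
# on how much of the RDD fits in free memory: if the overflow fraction is
# large, most partitions would spill and be re-read at disk speed anyway,
# so disk-only caching sidesteps the partial in-memory bookkeeping.
def choose_storage_level(rdd_size_mb: float, free_memory_mb: float,
                         overflow_threshold: float = 0.5) -> str:
    """Return a storage-level name from the fraction of the RDD that overflows."""
    if rdd_size_mb <= free_memory_mb:
        return "MEMORY_ONLY"
    overflow = (rdd_size_mb - free_memory_mb) / rdd_size_mb
    return "DISK_ONLY" if overflow > overflow_threshold else "MEMORY_AND_DISK"
```

The returned name corresponds to the levels Spark exposes via `StorageLevel` when calling `rdd.persist(...)`.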

Data balancing-based intermediate data partitioning and check point-based cache recovery in Spark environment

C Li, Q Cai, Y Luo - The Journal of Supercomputing, 2022 - Springer
Both data shuffling and cache recovery are essential parts of the Spark system, and they
directly affect Spark parallel computing performance. Existing dynamic partitioning schemes …

Memory management approaches in apache spark: A review

M Dessokey, SM Saif, S Salem, E Saad… - … Conference on Advanced …, 2020 - Springer
In the era of Big Data, processing large amounts of data through data-intensive applications
presents a challenge. Apache Spark, an in-memory distributed computing system, is …

Handling data skew at reduce stage in Spark by ReducePartition

W Guo, C Huang, W Tian - Concurrency and Computation …, 2020 - Wiley Online Library
As a typical representative of distributed computing frameworks, Spark has been continuously
developed and popularized. It reduces data transmission time through efficient memory …
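The entry above addresses reduce-stage data skew. The snippet below only illustrates the problem the paper targets (not its ReducePartition method): with a plain hash partitioner, all records sharing one hot key land on a single reduce partition, so that reducer dominates the stage's runtime.

```python
# Illustration of reduce-stage skew under hash partitioning: a hot key is
# hashed to exactly one partition, concentrating 90% of the records there.
from collections import Counter

def partition(key: str, num_partitions: int = 4) -> int:
    # Plain hash partitioning, as a default shuffle would do.
    return hash(key) % num_partitions

keys = ["hot"] * 90 + [f"k{i}" for i in range(10)]  # 90% of records share one key
loads = Counter(partition(k) for k in keys)
# The partition holding "hot" receives at least 90 of the 100 records.
```

Skew-mitigation schemes (key splitting, range repartitioning, and the like) redistribute such hot keys across reducers instead.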