A comprehensive survey on coded distributed computing: Fundamentals, challenges, and networking applications

JS Ng, WYB Lim, NC Luong, Z Xiong… - … Surveys & Tutorials, 2021 - ieeexplore.ieee.org
Distributed computing has become a common approach for large-scale computation tasks
due to benefits such as high reliability, scalability, computation speed, and cost …

A survey on Spark ecosystem: Big data processing infrastructure, machine learning, and applications

S Tang, B He, C Yu, Y Li, K Li - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
With the explosive increase of big data in industry and academic fields, it is important to
apply large-scale data processing systems to analyze Big Data. Arguably, Spark is the state …

Intermediate data placement and cache replacement strategy under Spark platform

C Li, Y Zhang, Y Luo - Journal of Parallel and Distributed Computing, 2022 - Elsevier
Spark is widely used due to its high performance caching mechanism and high scalability,
which still causes uneven workloads and produces useless intermediate caching results …

A novel hybrid approach for multi-dimensional data anonymization for Apache Spark

SU Bazai, J Jang-Jaccard, H Alavizadeh - ACM Transactions on Privacy …, 2021 - dl.acm.org
Multi-dimensional data anonymization approaches (e.g., Mondrian) ensure more fine-grained
data privacy by providing a different anonymization strategy applied for each attribute. Many …

A survey of coded distributed computing

JS Ng, WYB Lim, NC Luong, Z Xiong… - arXiv preprint arXiv …, 2020 - arxiv.org
Distributed computing has become a common approach for large-scale computation of tasks
due to benefits such as high reliability, scalability, computation speed, and cost-effectiveness …

Performance model of MapReduce iterative applications for hybrid cloud bursting

FJ Clemente-Castelló, B Nicolae… - … on Parallel and …, 2018 - ieeexplore.ieee.org
Hybrid cloud bursting (i.e., leasing temporary off-premise cloud resources to boost the overall
capacity during peak utilization) can be a cost-effective way to deal with the increasing …

Toward high-performance computing and big data analytics convergence: The case of Spark-DIY

S Caino-Lores, J Carretero, B Nicolae, O Yildiz… - IEEE …, 2019 - ieeexplore.ieee.org
Convergence between high-performance computing (HPC) and big data analytics (BDA) is
currently an established research area that has spawned new opportunities for unifying the …

Spark-DIY: A framework for interoperable Spark operations with high performance block-based data models

S Caíno-Lores, J Carretero, B Nicolae… - 2018 IEEE/ACM 5th …, 2018 - ieeexplore.ieee.org
Today's scientific applications are increasingly relying on a variety of data sources, storage
facilities, and computing infrastructures, and there is a growing demand for data analysis …

A performance study of big data workloads in cloud datacenters with network variability

A Uta, H Obaseki - Companion of the 2018 ACM/SPEC International …, 2018 - dl.acm.org
Public cloud computing platforms are a cost-effective solution for individuals and
organizations to deploy various types of workloads, ranging from scientific applications …

Improving the robustness and performance of parallel joins over distributed systems

L Cheng, S Kotoulas, TE Ward… - Journal of Parallel and …, 2017 - Elsevier
High-performance data processing systems typically utilize numerous servers with large
amounts of memory. An essential operation in such environments is the parallel join, the …