A survey on automatic parameter tuning for big data processing systems

H Herodotou, Y Chen, J Lu - ACM Computing Surveys (CSUR), 2020 - dl.acm.org
Big data processing systems (e.g., Hadoop, Spark, Storm) contain a vast number of
configuration parameters controlling parallelism, I/O behavior, memory settings, and …

Ernest: Efficient performance prediction for Large-Scale advanced analytics

S Venkataraman, Z Yang, M Franklin, B Recht… - … USENIX Symposium on …, 2016 - usenix.org
Recent workload trends indicate rapid growth in the deployment of machine learning,
genomics and scientific workloads on cloud computing infrastructure. However, efficiently …

Clash of the titans: MapReduce vs. Spark for large scale data analytics

J Shi, Y Qiu, UF Minhas, L Jiao, C Wang… - Proceedings of the …, 2015 - dl.acm.org
MapReduce and Spark are two very popular open source cluster computing frameworks for
large scale data analytics. These frameworks hide the complexity of task parallelism and …

The many faces of data-centric workflow optimization: a survey

G Kougka, A Gounaris, A Simitsis - … Journal of Data Science and Analytics, 2018 - Springer
Workflow technology is rapidly evolving and, rather than being limited to modeling the
control flow in business processes, is becoming a key mechanism to perform advanced data …

Black or white? how to develop an autotuner for memory-based analytics

M Kunjir, S Babu - Proceedings of the 2020 ACM SIGMOD International …, 2020 - dl.acm.org
There is a lot of interest today in building autonomous (or, self-driving) data processing
systems. An emerging school of thought is to leverage AI-driven "black box" algorithms for …

Speedup your analytics: Automatic parameter tuning for databases and big data systems

J Lu, Y Chen, H Herodotou, S Babu - Proceedings of the VLDB …, 2019 - dl.acm.org
Database and big data analytics systems such as Hadoop and Spark have a large number
of configuration parameters that control memory distribution, I/O optimization, parallelism …

Memtune: Dynamic memory management for in-memory data analytic platforms

L Xu, M Li, L Zhang, AR Butt, Y Wang… - 2016 IEEE international …, 2016 - ieeexplore.ieee.org
Memory is a crucial resource for big data processing frameworks such as Spark and M3R,
where the memory is used both for computation and for caching intermediate storage data …

Dynamic configuration of partitioning in Spark applications

A Gounaris, G Kougka, R Tous… - IEEE Transactions on …, 2017 - ieeexplore.ieee.org
Spark has become one of the main options for large-scale analytics running on top of shared-
nothing clusters. This work aims to make a deep dive into the parallelism configuration and …

Resource elasticity for large-scale machine learning

B Huang, M Boehm, Y Tian, B Reinwald… - Proceedings of the …, 2015 - dl.acm.org
Declarative large-scale machine learning (ML) aims at flexible specification of ML algorithms
and automatic generation of hybrid runtime plans ranging from single node, in-memory …

Learning-based automatic parameter tuning for big data analytics frameworks

L Bao, X Liu, W Chen - … Conference on Big Data (Big Data), 2018 - ieeexplore.ieee.org
Big data analytics frameworks (BDAFs) have been widely used for data processing
applications. These frameworks provide a large number of configuration parameters to …