A survey on automatic parameter tuning for big data processing systems

H Herodotou, Y Chen, J Lu - ACM Computing Surveys (CSUR), 2020 - dl.acm.org
Big data processing systems (e.g., Hadoop, Spark, Storm) contain a vast number of
configuration parameters controlling parallelism, I/O behavior, memory settings, and …

Distributed data management using MapReduce

F Li, BC Ooi, MT Özsu, S Wu - ACM Computing Surveys (CSUR), 2014 - dl.acm.org
MapReduce is a framework for processing and managing large-scale datasets in a
distributed cluster, which has been used for applications such as generating search indexes …

Aria: automatic resource inference and allocation for MapReduce environments

A Verma, L Cherkasova, RH Campbell - Proceedings of the 8th ACM …, 2011 - dl.acm.org
MapReduce and Hadoop represent an economically compelling alternative for efficient
large scale data processing and advanced analytics in the enterprise. A key challenge in …

The case for evaluating MapReduce performance using workload suites

Y Chen, A Ganapathi, R Griffith… - 2011 IEEE 19th annual …, 2011 - ieeexplore.ieee.org
MapReduce systems face enormous challenges due to increasing growth, diversity, and
consolidation of the data and computation involved. Provisioning, configuring, and …

Profiling, what-if analysis, and cost-based optimization of MapReduce programs

H Herodotou, S Babu - Proceedings of the VLDB Endowment, 2011 - dl.acm.org
MapReduce has emerged as a viable competitor to database systems in big data analytics.
MapReduce programs are being written for a wide variety of application domains including …

Network-aware scheduling for data-parallel jobs: Plan when you can

V Jalaparti, P Bodik, I Menache, S Rao… - ACM SIGCOMM …, 2015 - dl.acm.org
To reduce the impact of network congestion on big data jobs, cluster management
frameworks use various heuristics to schedule compute tasks and/or network flows. Most of …

An analysis of traces from a production MapReduce cluster

S Kavulya, J Tan, R Gandhi… - 2010 10th IEEE/ACM …, 2010 - ieeexplore.ieee.org
MapReduce is a programming paradigm for parallel processing that is increasingly being
used for data-intensive applications in cloud computing environments. An understanding of …

No one (cluster) size fits all: automatic cluster sizing for data-intensive analytics

H Herodotou, F Dong, S Babu - … of the 2nd ACM Symposium on Cloud …, 2011 - dl.acm.org
Infrastructure-as-a-Service (IaaS) cloud platforms have brought two unprecedented changes
to cluster provisioning practices. First, any (nonexpert) user can provision a cluster of any …

Aroma: automated resource allocation and configuration of MapReduce environment in the cloud

P Lama, X Zhou - Proceedings of the 9th international conference on …, 2012 - dl.acm.org
Distributed data processing framework MapReduce is increasingly deployed in Clouds to
leverage the pay-per-usage cloud computing model. Popular Hadoop MapReduce …

Purlieus: locality-aware resource allocation for MapReduce in a cloud

B Palanisamy, A Singh, L Liu, B Jain - Proceedings of 2011 international …, 2011 - dl.acm.org
We present Purlieus, a MapReduce resource allocation system aimed at enhancing the
performance of MapReduce jobs in the cloud. Purlieus provisions virtual MapReduce …