A data skew oriented reduce placement algorithm based on sampling

Z Tang, W Ma, K Li, K Li
IEEE Transactions on Cloud Computing, 2016 - ieeexplore.ieee.org
Because of frequent disk I/O and large data transmissions among different racks and physical nodes, intermediate data communication has become the most important performance bottleneck in most running Hadoop systems. This paper proposes a reduce placement algorithm called CORP that schedules related map and reduce tasks on nearby nodes within clusters or racks for data locality. Because the number of keys cannot be counted until the input data are processed by map tasks, the paper applies a reservoir algorithm to sample the input data, which yields a key/value distribution that closely approximates that of the original data. Based on the distribution matrix of the intermediate results in each partition, and by calculating distance and cost matrices for cross-node communication, related map and reduce tasks can be scheduled to relatively nearby physical nodes for data locality. The authors implement CORP in Hadoop 2.4.0 and evaluate its performance using three widely used benchmarks: Sort, Grep, and Join. In these experiments, an evaluation model is proposed for selecting appropriate sample rates; it jointly weighs the cost, effect, and variance of sampling. Experimental results show that CORP not only improves the balance of reduce tasks effectively but also decreases job execution time by lowering internal data communication. Compared with some other reduce scheduling algorithms, the average data transmission of the entire system on the core switch is reduced substantially.
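
The two ideas named in the abstract, reservoir sampling of the input and cost-based placement of a reduce task, can be illustrated with a minimal Python sketch. This is not CORP's implementation: the function names (`reservoir_sample`, `estimate_key_shares`, `place_reducer`), the use of the classic Algorithm R, and the cost definition (bytes held on each node multiplied by a node-to-node distance) are all assumptions made for illustration; the paper's actual distribution, distance, and cost matrices may be defined differently.

```python
import random
from collections import Counter


def reservoir_sample(stream, k, rng=None):
    """Uniformly sample up to k items from a stream of unknown length
    (classic Algorithm R; assumed here, the paper's variant may differ)."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def estimate_key_shares(sample):
    """Estimate the fraction of records per key from a sample of keys."""
    counts = Counter(sample)
    total = len(sample)
    return {key: n / total for key, n in counts.items()}


def place_reducer(partition_bytes, distance):
    """Hypothetical placement rule: choose the node n that minimizes
    sum_j partition_bytes[j] * distance[j][n], where partition_bytes[j]
    is the amount of one partition's intermediate data held on node j."""
    nodes = range(len(distance))
    costs = [
        sum(partition_bytes[j] * distance[j][n] for j in nodes)
        for n in nodes
    ]
    best = min(nodes, key=lambda n: costs[n])
    return best, costs
```

For instance, with 100 units of a partition on node 0, 10 units on each of nodes 1 and 2, and a distance matrix `[[0, 1, 4], [1, 0, 4], [4, 4, 0]]` (nodes 0 and 1 in the same rack, node 2 in another), `place_reducer([100, 10, 10], ...)` returns `(0, [50, 140, 440])`: the reducer goes where most of its input already resides.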