Authors
Andrew D Ferguson, Rodrigo Fonseca
Publication date
2010
Journal
Proceedings of the USENIX Annual Technical Conference
Description
The Hadoop platform for MapReduce [1] is an increasingly popular method for executing distributed computations, driven by free availability, an adaptable model, and support for very large data sets. In order to support such data sets efficiently, Hadoop executes most computations near the data, rather than transferring the data over the network. As a result, Hadoop’s performance is directly affected by the distribution of data in the Hadoop Distributed Filesystem (HDFS). In this work, we investigate the placement of blocks in HDFS and show that it exhibits surprising non-uniformity. When blocks are placed non-uniformly in the distributed filesystem, network transfers must occur during job execution in order to bring input data to available computational cores. Because cross-rack network bandwidth is one of the most limited resources in the cluster, these unnecessary transfers can degrade performance. The locations of file blocks read by a MapReduce job are collectively called the input split. In order to achieve best performance, the input split should intuitively consist of an equal number of file blocks on each node in the cluster. We show that under Hadoop’s default block placement strategy, the number of blocks on each node in the cluster is instead binomially distributed. In order to visualize the existing file placement strategy and its effect on task performance, we have developed a real-time “heatmap” which illustrates how “hot” or “cold” each host in the cluster is. A node is considered “hot” if it is carrying at least one standard deviation above the expected number of input splits. A node is “cold” if it supports less than one standard deviation below …
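As a rough illustration (not the authors' tool), the sketch below simulates uniformly random block placement across a cluster, which is why the per-node block count follows a binomial distribution, and applies the abstract's one-standard-deviation "hot"/"cold" rule. The helper names (place_blocks, classify_nodes) and the node/block counts are hypothetical, and the model deliberately ignores HDFS replication and rack awareness.

```python
import math
import random
from collections import Counter

def place_blocks(num_blocks: int, num_nodes: int, seed: int = 0) -> Counter:
    """Place each block on a node chosen uniformly at random,
    a simplified stand-in for Hadoop's default placement."""
    rng = random.Random(seed)
    counts = Counter({node: 0 for node in range(num_nodes)})
    for _ in range(num_blocks):
        counts[rng.randrange(num_nodes)] += 1
    return counts

def classify_nodes(counts: Counter, num_blocks: int, num_nodes: int) -> dict:
    """Label each node 'hot', 'cold', or 'ok' relative to one standard
    deviation around the binomial mean, as in the paper's heatmap."""
    p = 1.0 / num_nodes
    mean = num_blocks * p                      # expected blocks per node
    std = math.sqrt(num_blocks * p * (1 - p))  # binomial standard deviation
    labels = {}
    for node, c in counts.items():
        if c >= mean + std:
            labels[node] = "hot"
        elif c <= mean - std:
            labels[node] = "cold"
        else:
            labels[node] = "ok"
    return labels

if __name__ == "__main__":
    B, N = 10_000, 100  # hypothetical block and node counts
    counts = place_blocks(B, N)
    labels = classify_nodes(counts, B, N)
    hot = sum(1 for v in labels.values() if v == "hot")
    cold = sum(1 for v in labels.values() if v == "cold")
    print(f"mean={B / N:.1f} blocks/node, hot nodes={hot}, cold nodes={cold}")
```

Even with perfectly random placement, a sizable fraction of nodes lands outside the one-standard-deviation band, which is the non-uniformity the heatmap visualizes.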
Total citations
[Citations-per-year chart, 2013–2018]
Scholar articles
AD Ferguson, R Fonseca - Proceedings of the USENIX Annual Technical …, 2010