Authors
Andrew D Ferguson, Rodrigo Fonseca
Publication date
2010
Journal
Proceedings of the USENIX Annual Technical Conference
Description
The Hadoop platform for MapReduce [1] is an increasingly popular method for executing distributed computations, driven by free availability, an adaptable model, and support for very large data sets. In order to support such data sets efficiently, Hadoop executes most computations near the data, rather than transferring the data over the network. As a result, Hadoop’s performance is directly affected by the distribution of data in the Hadoop Distributed Filesystem (HDFS). In this work, we investigate the placement of blocks in HDFS and show that it exhibits surprising non-uniformity. When blocks are placed non-uniformly in the distributed filesystem, network transfers must occur during job execution in order to bring input data to available computational cores. Because cross-rack network bandwidth is one of the most limited resources in the cluster, these unnecessary transfers can degrade performance. The locations of file blocks read by a MapReduce job are collectively called the input split. In order to achieve best performance, the input split should intuitively consist of an equal number of file blocks on each node in the cluster. We show that under Hadoop’s default block placement strategy, the number of blocks on each node in the cluster is instead binomially distributed. In order to visualize the existing file placement strategy and its effect on task performance, we have developed a real-time “heatmap” which illustrates how “hot” or “cold” each host in the cluster is. A node is considered “hot” if it is carrying at least one standard deviation above the expected number of input splits. A node is “cold” if it supports less than one standard deviation below …
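As a rough illustration (not the authors' tool), the sketch below simulates uniformly random block placement across a cluster, which is why the per-node block count follows a binomial distribution, and applies the abstract's one-standard-deviation "hot"/"cold" rule. The helper names (place_blocks, classify_nodes) and the node/block counts are hypothetical, and the model deliberately ignores HDFS replication and rack awareness.

```python
import math
import random
from collections import Counter

def place_blocks(num_blocks: int, num_nodes: int, seed: int = 0) -> Counter:
    """Place each block on a node chosen uniformly at random,
    a simplified stand-in for Hadoop's default placement."""
    rng = random.Random(seed)
    counts = Counter({node: 0 for node in range(num_nodes)})
    for _ in range(num_blocks):
        counts[rng.randrange(num_nodes)] += 1
    return counts

def classify_nodes(counts: Counter, num_blocks: int, num_nodes: int) -> dict:
    """Label each node 'hot', 'cold', or 'ok' relative to one standard
    deviation around the binomial mean, as in the paper's heatmap."""
    p = 1.0 / num_nodes
    mean = num_blocks * p                      # expected blocks per node
    std = math.sqrt(num_blocks * p * (1 - p))  # binomial standard deviation
    labels = {}
    for node, c in counts.items():
        if c >= mean + std:
            labels[node] = "hot"
        elif c <= mean - std:
            labels[node] = "cold"
        else:
            labels[node] = "ok"
    return labels

if __name__ == "__main__":
    B, N = 10_000, 100  # hypothetical block and node counts
    counts = place_blocks(B, N)
    labels = classify_nodes(counts, B, N)
    hot = sum(1 for v in labels.values() if v == "hot")
    cold = sum(1 for v in labels.values() if v == "cold")
    print(f"mean={B / N:.1f} blocks/node, hot nodes={hot}, cold nodes={cold}")
```

Even with perfectly random placement, a sizable fraction of nodes lands outside the one-standard-deviation band, which is the non-uniformity the heatmap visualizes.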
Total citations
[Citations-per-year chart, 2013–2018]
Scholar articles
AD Ferguson, R Fonseca - Proceedings of the USENIX Annual Technical …, 2010