Towards fault-tolerant energy-efficient high performance computing in the cloud

KL Keville, R Garg, DJ Yates, K Arya… - … on Cluster Computing, 2012 - ieeexplore.ieee.org
2012 IEEE International Conference on Cluster Computing, 2012
In cluster computing, power and cooling represent a significant cost compared to the hardware itself. This is of special concern in the cloud, which provides access to large numbers of computers. We examine the use of ARM-based clusters for low-power, high performance computing. This work examines two likely use modes: (i) a standard dedicated cluster, and (ii) a cluster of pre-configured virtual machines in the cloud. A 40-node department-level cluster based on an ARM Cortex-A9 is compared against a similar cluster based on an Intel Core2 Duo, in contrast to a recent similar study on just a 4-node cluster. For the NAS benchmarks on 32-node clusters, ARM was found to have a power efficiency ranging from 1.3 to 6.2 times greater than that of Intel. This is despite Intel's approximately five times greater performance. The particular efficiency ratio depends primarily on the size of the working set relative to the L2 cache. In addition to energy-efficient computing, this study also emphasizes fault tolerance, an important ingredient in high performance computing. It relies on two recent extensions to the DMTCP checkpoint-restart package. DMTCP was extended (i) to support ARM CPUs, and (ii) to support checkpointing of the QEMU virtual machine in user mode. DMTCP is used both to checkpoint native distributed applications and to checkpoint a network of virtual machines. This latter case demonstrates the ability to deploy pre-configured software in virtual machines hosted in the cloud, and further to migrate cluster computation between hosts in the cloud.
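The relationship between the two ratios quoted in the abstract can be illustrated with a little arithmetic. The sketch below is not taken from the paper: it assumes that "power efficiency" means performance per watt, and it treats the "approximately five times" performance figure as if it applied uniformly across benchmarks, which the abstract does not claim. The function and constant names are hypothetical. Under those assumptions, the Intel cluster's implied power draw relative to the ARM cluster is the product of the two ratios.

```python
def implied_power_ratio(perf_ratio: float, efficiency_ratio: float) -> float:
    """Return the Intel-to-ARM power ratio implied by two quoted ratios.

    Assumption (not stated in the abstract): efficiency = performance / power.
      perf_ratio       = perf_intel / perf_arm
      efficiency_ratio = (perf_arm / power_arm) / (perf_intel / power_intel)
    Multiplying the two cancels the performance terms and leaves
      power_intel / power_arm.
    """
    return perf_ratio * efficiency_ratio


# Figures quoted in the abstract; applying the 5x figure to both ends
# of the efficiency range is a simplifying assumption.
INTEL_PERF_ADVANTAGE = 5.0
ARM_EFFICIENCY_RANGE = (1.3, 6.2)

low = implied_power_ratio(INTEL_PERF_ADVANTAGE, ARM_EFFICIENCY_RANGE[0])
high = implied_power_ratio(INTEL_PERF_ADVANTAGE, ARM_EFFICIENCY_RANGE[1])

print(f"Implied Intel/ARM power ratio: {low:.1f}x to {high:.1f}x")
```

Read this way, an efficiency advantage for ARM despite lower raw performance means the ARM cluster must draw several times less power for the same benchmark; the exact figure per benchmark would have to come from the paper's measurements.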