Retrospective: A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services

A Putnam, AM Caulfield, ES Chung, L Adams, K Constantinides, J Demme, D Firestone, … - sites.coecis.cornell.edu
Hardware specialization can improve performance and energy efficiency by several orders of magnitude over conventional CPUs [7]. However, the wide variety of cloud applications, their rapid rate of change, and the need to support multiple geographical regions and generations simultaneously make scalable custom hardware for the cloud challenging.

The Catapult program began in 2010, one year after the launch of Microsoft Bing and two years after the launch of Red Dog, the predecessor to Azure. Bing approached Microsoft Research (MSR) to find ways to give Bing a competitive edge. Many at Microsoft and in the architecture research community envisioned 1000-core "manycore" multiprocessors as the future for application acceleration, but our team looked to hardware specialization as a better approach for Bing. The short timeline of the request (as soon as possible) combined with the low budget quickly ruled out custom hardware solutions. We looked at both GPUs and FPGAs and decided that FPGAs could cover a broader range of workloads; moreover, the SIMD execution model of GPUs did not match the latency-sensitive Bing workload, which made batching requests impractical. Accordingly, MSR developed an accelerator platform based on FPGAs, with the vision of eventually creating a fully-custom solution. Under the codename Project Catapult, the 2014 paper focused on the architecture and deployment of FPGA hardware in Microsoft's cloud and the hardware/software co-design that doubled Bing's ranking throughput and reduced latency by 30% at true production scale.
It is worth noting that the 1,632 servers in the paper was not a random number. This was the smallest number of servers that could run one Bing instance. With workload performance measured in 99%+ tail latencies and heavily dependent on IO performance, engineering and deploying at-scale systems is the only way to do realistic evaluations. Supporting specialized hardware at scale was more challenging than initially envisioned. We went through three design iterations. First, we designed a "mega-board" with six large FPGAs, four of which were placed in a special server per rack. However, datacenters prefer homogeneous racks to simplify power/cooling and to limit the blast radius of failures. In addition, network designs that concentrate traffic at one node …