Exploiting half precision arithmetic in Nvidia GPUs

NM Ho, WF Wong - 2017 IEEE High Performance Extreme …, 2017 - ieeexplore.ieee.org
With the growing importance of deep learning and energy-saving approximate computing,
half precision floating point arithmetic (FP16) is fast gaining popularity. Nvidia's recent …

SIMD parallelization of applications that traverse irregular data structures

B Ren, G Agrawal, JR Larus… - Proceedings of the …, 2013 - ieeexplore.ieee.org
Fine-grained data parallelism is increasingly common in mainstream processors in the form
of longer vectors and on-chip GPUs. This paper develops support for exploiting such data …

MicroSpec: Speculation-centric fine-grained parallelization for FSM computations

J Qiu, Z Zhao, B Ren - … of the 2016 International Conference on Parallel …, 2016 - dl.acm.org
Finite state machines (FSMs) are basic computation models that play essential roles in many
applications. Enabling efficient parallel FSM execution is critical to the performance of these …

Combining SIMD and Many/Multi-core parallelism for finite state machines with enumerative speculation

P Jiang, G Agrawal - Proceedings of the 22nd ACM SIGPLAN …, 2017 - dl.acm.org
Finite State Machine (FSM) is the key kernel behind many popular applications, including
regular expression matching, text tokenization, and Huffman decoding. Parallelizing FSMs is …

Optimizing and scaling HPCG on Tianhe-2: early experience

X Zhang, C Yang, F Liu, Y Liu, Y Lu - … 2014, Dalian, China, August 24-27 …, 2014 - Springer
In this paper, a first attempt has been made on optimizing and scaling HPCG on the world's
largest supercomputer, Tianhe-2. This early work focuses on the optimization of the CPU …

Accelerating HPCG on Tianhe-2: a hybrid CPU-MIC algorithm

Y Liu, X Zhang, C Yang, F Liu… - 2014 20th IEEE …, 2014 - ieeexplore.ieee.org
In this paper, we propose a hybrid algorithm to enable and accelerate the High Performance
Conjugate Gradient (HPCG) benchmark on a heterogeneous node with an arbitrary number …

A portable optimization engine for accelerating irregular data-traversal applications on SIMD architectures

B Ren, T Mytkowicz, G Agrawal - ACM Transactions on Architecture and …, 2014 - dl.acm.org
Fine-grained data parallelism is increasingly common in the form of longer vectors
integrated with mainstream processors (SSE, AVX) and various GPU architectures. This …

Efficient scheduling of recursive control flow on GPUs

X Huo, S Krishnamoorthy, G Agrawal - Proceedings of the 27th …, 2013 - dl.acm.org
Graphics processing units (GPUs) have rapidly emerged as a very significant player in high
performance computing. Single instruction multiple thread (SIMT) pipelines are typically …

Planning and composition of Web services with dynamic constraints using situation calculus

K Nariai, I Paik, M Shinozawa - The Fifth International …, 2005 - ieeexplore.ieee.org
Web service composition enables the creation of new and more valuable services to
combine and link existing services. However, the treatment of user constraints (as user …

Combining SIMD and many/multi-core parallelism for finite-state machines with enumerative speculation

P Jiang, Y Xia, G Agrawal - ACM Transactions on Parallel Computing …, 2020 - dl.acm.org
Finite-state Machine (FSM) is the key kernel behind many popular applications, including
regular expression matching, text tokenization, and Huffman decoding. Parallelizing FSMs is …