Exploiting half precision arithmetic in Nvidia GPUs
With the growing importance of deep learning and energy-saving approximate computing,
half precision floating point arithmetic (FP16) is fast gaining popularity. Nvidia's recent …
SIMD parallelization of applications that traverse irregular data structures
Fine-grained data parallelism is increasingly common in mainstream processors in the form
of longer vectors and on-chip GPUs. This paper develops support for exploiting such data …
Microspec: Speculation-centric fine-grained parallelization for FSM computations
Finite state machines (FSMs) are basic computation models that play essential roles in many
applications. Enabling efficient parallel FSM execution is critical to the performance of these …
Combining SIMD and Many/Multi-core parallelism for finite state machines with enumerative speculation
P Jiang, G Agrawal - Proceedings of the 22nd ACM SIGPLAN …, 2017 - dl.acm.org
Finite State Machine (FSM) is the key kernel behind many popular applications, including
regular expression matching, text tokenization, and Huffman decoding. Parallelizing FSMs is …
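The sketch below shows the general enumerative idea behind speculative FSM parallelization. It is not Jiang and Agrawal's SIMD/multi-core implementation. The input is split into chunks. Each chunk is run from every possible start state, which makes the chunks independent of one another. A cheap sequential pass then chains the per-chunk state maps together. The helper names (`run_fsm`, `chunk_maps`, `enumerative_run`) and the mod-3 example machine are hypothetical.

```python
def run_fsm(table, state, symbols):
    """Sequentially step a DFA given as table[state][symbol] -> next state."""
    for s in symbols:
        state = table[state][s]
    return state


def chunk_maps(table, num_states, chunk):
    """Enumerate every start state: entry q is the end state after `chunk` from q."""
    return [run_fsm(table, q, chunk) for q in range(num_states)]


def enumerative_run(table, num_states, start, symbols, num_chunks):
    size = max(1, -(-len(symbols) // num_chunks))  # ceiling division
    chunks = [symbols[i:i + size] for i in range(0, len(symbols), size)]
    # Phase 1: each chunk is independent, so this loop could be spread over
    # cores or SIMD lanes; it is written sequentially here for clarity.
    maps = [chunk_maps(table, num_states, c) for c in chunks]
    # Phase 2: compose the per-chunk maps, starting from the real start state.
    state = start
    for m in maps:
        state = m[state]
    return state


# Hypothetical example DFA: reads a binary number MSB-first and tracks value mod 3.
table = [[(2 * r + b) % 3 for b in (0, 1)] for r in range(3)]
bits = [int(c) for c in format(12345, "b")]
print(enumerative_run(table, 3, 0, bits, 4))  # 12345 is divisible by 3, so prints 0
```

This scheme trades extra work (one pass per state per chunk) for independence between chunks. That trade-off is why the number of FSM states matters so much in this line of work.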
Optimizing and scaling HPCG on Tianhe-2: early experience
X Zhang, C Yang, F Liu, Y Liu, Y Lu - … 2014, Dalian, China, August 24-27 …, 2014 - Springer
In this paper, a first attempt has been made at optimizing and scaling HPCG on the world's
largest supercomputer, Tianhe-2. This early work focuses on the optimization of the CPU …
Accelerating HPCG on Tianhe-2: a hybrid CPU-MIC algorithm
Y Liu, X Zhang, C Yang, F Liu… - 2014 20th IEEE …, 2014 - ieeexplore.ieee.org
In this paper, we propose a hybrid algorithm to enable and accelerate the High Performance
Conjugate Gradient (HPCG) benchmark on a heterogeneous node with an arbitrary number …
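HPCG times a preconditioned conjugate gradient (CG) solve on a large sparse system. The sketch below shows only the plain, unpreconditioned CG iteration on a tiny dense symmetric positive-definite matrix, as a reminder of the kernels involved: matrix-vector products, dot products, and vector updates. It says nothing about the hybrid CPU-MIC scheme in the paper, and the function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(A, x):
    return [dot(row, x) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A (no preconditioner)."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x, with x = 0 initially
    p = list(r)          # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = matvec(A, p)                 # the sparse matvec dominates in HPCG-like runs
        alpha = rs_old / dot(p, Ap)       # step length along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        beta = rs_new / rs_old            # keeps directions A-conjugate
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
print(conjugate_gradient(A, b))  # approximately [1/11, 7/11]
```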
A portable optimization engine for accelerating irregular data-traversal applications on SIMD architectures
Fine-grained data parallelism is increasingly common in the form of longer vectors
integrated with mainstream processors (SSE, AVX) and various GPU architectures. This …
Efficient scheduling of recursive control flow on GPUs
Graphics processing units (GPUs) have rapidly emerged as a very significant player in high
performance computing. Single instruction multiple thread (SIMT) pipelines are typically …
Planning and composition of Web services with dynamic constraints using situation calculus
K Nariai, I Paik, M Shinozawa - The Fifth International …, 2005 - ieeexplore.ieee.org
Web service composition enables the creation of new, more valuable services by
combining and linking existing services. However, the treatment of user constraints (as user …
Combining SIMD and many/multi-core parallelism for finite-state machines with enumerative speculation
Finite-state Machine (FSM) is the key kernel behind many popular applications, including
regular expression matching, text tokenization, and Huffman decoding. Parallelizing FSMs is …