The EDGE language: Extended general einsums for graph algorithms

TO Odemuyiwa, JS Emer, JD Owens - arXiv preprint arXiv:2404.11591, 2024 - arxiv.org
In this work, we propose a unified abstraction for graph algorithms: the Extended General
Einsums language, or EDGE. The EDGE language expresses graph algorithms in the …

Lazy release consistency for GPUs

J Alsop, MS Orr, BM Beckmann… - 2016 49th Annual IEEE …, 2016 - ieeexplore.ieee.org
The heterogeneous-race-free (HRF) memory model has been embraced by the
Heterogeneous System Architecture (HSA) Foundation and OpenCL™ because it clearly …

A case for work-stealing on FPGAs with OpenCL atomics

N Ramanathan, J Wickerson, F Winterstein… - Proceedings of the …, 2016 - dl.acm.org
We provide a case study of work-stealing, a popular method for run-time load balancing, on
FPGAs. Following the Cederman-Tsigas implementation for GPUs, we synchronize work …

Synchronization using remote-scope promotion

MS Orr, S Che, A Yilmazer, BM Beckmann… - ACM SIGARCH …, 2015 - dl.acm.org
Heterogeneous system architecture (HSA) and OpenCL define scoped synchronization to
facilitate low overhead communication across a subset of threads. Scoped synchronization …

A GPU task-parallel model with dependency resolution

S Tzeng, B Lloyd, JD Owens - IEEE Computer, 2012 - escholarship.org
We present a task-parallel programming model for the GPU. Our task model is robust
enough to handle irregular workloads that contain dependencies. We present two …

Remote-scope promotion: clarified, rectified, and verified

J Wickerson, M Batty, BM Beckmann… - Proceedings of the 2015 …, 2015 - dl.acm.org
Modern accelerator programming frameworks, such as OpenCL, organise threads into
work-groups. Remote-scope promotion (RSP) is a language extension recently proposed by AMD …

HQL: A scalable synchronization mechanism for GPUs

A Yilmazer, D Kaeli - 2013 IEEE 27th International Symposium …, 2013 - ieeexplore.ieee.org
Modern GPUs rely on atomic operations to perform global communication. These atomic
operations can be used to construct finer-grained locks to provide support for mutual …

Scalable collision detection using p-partition fronts on many-core processors

X Zhang, YJ Kim - IEEE Transactions on Visualization and …, 2013 - ieeexplore.ieee.org
We present a new parallel algorithm for collision detection using many-core computing
platforms of CPUs or GPUs. Based on the notion of a p-partition front, our algorithm is able to …

Lock‐Free Concurrent Data Structures

D Cederman, A Gidenstam, P Ha… - … multi‐core and …, 2017 - Wiley Online Library
Concurrent data structures are the data sharing side of parallel programming. An
implementation of a data structure is called lock‐free, if it allows multiple processes/threads …

SCALE: A hybrid MPI and multithreading based work stealing approach for massive contingency analysis in power systems

SK Khaitan, JD McCalley - Electric Power Systems Research, 2014 - Elsevier
In this paper, we present SCALE, a hybrid message passing interface (MPI) and
multithreading based work Stealing approach for massive Contingency AnaLysis in powEr …