High-performance implementation of regular and easily scalable sorting networks on an FPGA

V Sklyarov, I Skliarova - Microprocessors and Microsystems, 2014 - Elsevier
The paper is dedicated to fast FPGA-based hardware accelerators that implement sorting
networks. The primary emphasis is on the uniformity of core components, feasible …

On how to accelerate iterative stencil loops: a scalable streaming-based approach

R Cattaneo, G Natale, C Sicignano, D Sciuto… - ACM Transactions on …, 2015 - dl.acm.org
In high-performance systems, stencil computations play a crucial role as they appear in a
variety of different fields of application, ranging from partial differential equation solving, to …

Bridging high-level synthesis and application-specific arithmetic: The case study of floating-point summations

Y Uguen, F de Dinechin… - 2017 27th International …, 2017 - ieeexplore.ieee.org
FPGAs are well known for their ability to perform non-standard computations not supported
by classical microprocessors. Many libraries of highly customizable application-specific IPs …

Towards scalable and efficient FPGA stencil accelerators

G Deest, N Estibals, T Yuki, S Derrien… - IMPACT'16-6th …, 2016 - inria.hal.science
In this paper we propose a design template for stencil computations targeting FPGA-based
accelerators. The goal for our design is to provide scalable high throughput designs that can …

Data-aware process networks

C Alias, A Plesco - Proceedings of the 30th ACM SIGPLAN International …, 2021 - dl.acm.org
With the emergence of reconfigurable FPGA circuits as a credible alternative to GPUs for
HPC acceleration, new compilation paradigms are required to map high-level algorithmic …

C-to-CoRAM: Compiling perfect loop nests to the portable CoRAM abstraction

G Weisz, JC Hoe - Proceedings of the ACM/SIGDA international …, 2013 - dl.acm.org
This paper presents initial work on developing a C compiler for the CoRAM FPGA computing
abstraction. The presented effort focuses on compiling fixed-bound perfect loop nests that …

Low‐precision DSP‐based floating‐point multiply‐add fused for Field Programmable Gate Arrays

A Amaricai, O Boncalo… - IET Computers & Digital …, 2014 - Wiley Online Library
Floating‐point (FP) multiply‐add fused (F1*F2±F3) and multiply‐accumulate represent the
most common arithmetic operation in a wide range of applications, such as graphic …

Processor arrays generation for matrix algorithms used in embedded platforms implemented on FPGAs

R Perez-Andrade, C Torres-Huitzil… - Microprocessors and …, 2015 - Elsevier
Matrix algorithms are an important part of many digital signal processing applications as
they are core kernels that are usually required to be applied many times while computing …

High speed half-precision floating-point fused multiply and add unit using DSP blocks

SS Ganesh, JJJ Nesam… - 2020 First International …, 2020 - ieeexplore.ieee.org
Necessity of multiplication followed by the addition in numerous digital signal processing
applications demands Fused Multiply and Add (FMA) unit for computations. This FMA design …

Scalable Trace-based Compile-Time Memory Allocation

PPR CLAUSS - 2024 - perso.ens-lyon.fr
High-Level Synthesis (HLS)[23, 11, 21, 7] consists in compiling a circuit from a high-level
program. With HLS, there is no runtime, every scheduling and allocation decision from high …