Power-efficient predication techniques for acceleration of control flow execution on CGRA

K Han, J Ahn, K Choi - ACM Transactions on Architecture and Code …, 2013 - dl.acm.org
Coarse-grained reconfigurable architecture typically has an array of processing elements
which are controlled by a centralized unit. This makes it difficult to execute programs having …

Libra: Tailoring SIMD execution using heterogeneous hardware and dynamic configurability

Y Park, JJK Park, H Park… - 2012 45th Annual IEEE …, 2012 - ieeexplore.ieee.org
Mobile computing as exemplified by the smart phone has become an integral part of our
daily lives. The next generation of these devices will be driven by providing an even richer …

NoMAD-Attention: Efficient LLM inference on CPUs through multiply-add-free attention

T Zhang, JW Yi, B Yao, Z Xu, A Shrivastava - arXiv preprint arXiv …, 2024 - arxiv.org
Large language model inference on Central Processing Units (CPU) is challenging due to
the vast quantities of expensive Multiply-Add (MAD) matrix operations in the attention …

Occamy: Elastically sharing a SIMD co-processor across multiple CPU cores

Z Zhang, Y Ou, Y Liu, C Wang, Y Zhou… - Proceedings of the 28th …, 2023 - dl.acm.org
SIMD extensions are widely adopted in multi-core processors to exploit data-level
parallelism. However, when co-running workloads on different cores, compute-intensive …

Exploiting tightly-coupled cores

D Bates, A Bradbury, A Koltes, R Mullins - Journal of Signal Processing …, 2015 - Springer
The individual processors of a chip-multiprocessor traditionally have rigid boundaries. Inter-
core communication is only possible via memory, and control over a core's resources is …

Construction and exploitation of VLIW ASIPs with heterogeneous vector-widths

E Diken, R Jordans, R Corvino, L Jóźwiak… - Microprocessors and …, 2014 - Elsevier
Numerous applications in important domains, such as communication and multimedia, show
a significant data-level parallelism (DLP). A large part of the DLP is usually exploited …

Exploiting both pipelining and data parallelism with SIMD reconfigurable architecture

Y Kim, J Lee, J Lee, TX Mai, I Heo, Y Paek - International Symposium on …, 2012 - Springer
Reconfigurable Architecture (RA), which provides extremely high energy efficiency for
certain domains of applications, has one problem that current mapping algorithms for it do …

LRMDCR: A learner's role-based multi dimensional collaborative recommendation for group learning support

X Wan, T Ninomiya, T Okamoto - 2008 Eighth IEEE …, 2008 - ieeexplore.ieee.org
In order to improve the "educational provision" to implement the e-learning
recommender system, we propose a new recommendation approach which has been …

[CITATION][C] Overview of SIMD automatic vectorization compiler optimization

高伟, 赵荣彩, 韩林, 庞建民, 丁锐 - 软件学报 (Journal of Software), 2015

Dual-Core Framework: Eliminating the Bottleneck Effect of Scalar Kernels on SIMD Architectures

Y Wang, S Chen, H Chen, J Wan… - … on Information and …, 2013 - search.ieice.org
The efficiency of ubiquitous SIMD (Single Instruction Multiple Data) media processors is
seriously limited by the bottleneck effect of the scalar kernels in media applications. To solve …