LIBSHALOM: Optimizing small and irregular-shaped matrix multiplications on ARMv8 multi-cores

W Yang, J Fang, D Dong, X Su, Z Wang - Proceedings of the …, 2021 - dl.acm.org
General Matrix Multiplication (GEMM) is a key subroutine in high-performance computing.
While the mainstream linear algebra libraries can deliver high performance on large and …

Evaluating FFT-based algorithms for strided convolutions on ARMv8 architectures

X Huang, Q Wang, S Lu, R Hao, S Mei… - ACM SIGMETRICS …, 2022 - dl.acm.org
Convolutional Neural Networks (CNNs) have been widely adopted in all kinds of artificial
intelligence applications. Most of the computational overhead of CNNs is mainly spent on …

Automatic code generation and optimization of large-scale stencil computation on many-core processors

M Li, Y Liu, H Yang, Y Hu, Q Sun, B Chen… - Proceedings of the 50th …, 2021 - dl.acm.org
Stencil computation is an indispensable building block of many scientific applications and is
widely used by the numerical solvers of partial differential equations (PDEs). Due to the …

Performance evaluation of memory-centric ARMv8 many-core architectures: A case study with Phytium 2000+

JB Fang, XK Liao, C Huang, DZ Dong - Journal of Computer Science and …, 2021 - Springer
This article presents a comprehensive performance evaluation of Phytium 2000+, an ARMv8-
based 64-core architecture. We focus on the cache and memory subsystems, analyzing the …

Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system

J Wei, X Zhang, Z Ji, J Li, Z Wei - Scientific Reports, 2021 - nature.com
Due to the increase in computing power, it is possible to improve the feature extraction and
data fitting capabilities of DNN networks by increasing their depth and model complexity …

Optimizing Attention by Exploiting Data Reuse on ARM Multi-core CPUs

X Fu, W Yang, D Dong, X Su - Proceedings of the 38th ACM International …, 2024 - dl.acm.org
Transformers reign supreme in natural language processing, representing a milestone
innovation in deep learning. For high-performance model inference, optimizing the time …

Performance evaluation of convolutional neural network on Tianhe-3 prototype

W Chen, X Dong, H Chen, Q Wang, X Yu… - The Journal of …, 2021 - Springer
Exascale supercomputers will greatly support the expanding computational resource
demand of convolutional neural networks (CNNs). At present, the prototype cluster of Tianhe …

Characterizing small-scale matrix multiplications on ARMv8-based many-core architectures

W Yang, J Fang, D Dong - 2021 IEEE International Parallel and …, 2021 - ieeexplore.ieee.org
General Matrix Multiplication (GEMM) is a key subroutine in high-performance computing.
There is a large body of work on evaluating and optimizing large-scale matrix multiplication …

GraphCube: Interconnection Hierarchy-aware Graph Processing

X Gan, G Wu, S Qiu, F Xiong, J Si, J Fang… - Proceedings of the 29th …, 2024 - dl.acm.org
Processing large-scale graphs with billions to trillions of edges requires efficiently utilizing
parallel systems. However, current graph processing engines do not scale well beyond a …

An empirical study of HPC workloads on Huawei Kunpeng 916 processor

YC Wang, JK Chen, BR Li, SC Zuo… - 2019 IEEE 25th …, 2019 - ieeexplore.ieee.org
The ARM-based server processors have been gaining momentum in high performance
computing (HPC). While not designed specifically for HPC, Huawei Kunpeng 916 processor …