Performance evaluation and analysis of linear algebra kernels in the prototype Tianhe-3 cluster

W Yang, J Fang, D Dong, X Su, Z Wang - Proceedings of the …, 2021 - dl.acm.org

General Matrix Multiplication (GEMM) is a key subroutine in high-performance computing.
While the mainstream linear algebra libraries can deliver high performance on large and …

Cited by 27 · Related articles · All 6 versions

Evaluating FFT-based algorithms for strided convolutions on ARMv8 architectures

X Huang, Q Wang, S Lu, R Hao, S Mei… - ACM SIGMETRICS …, 2022 - dl.acm.org

Convolutional Neural Networks (CNNs) have been widely adopted in all kinds of artificial
intelligence applications. Most of the computational overhead of CNNs is mainly spent on …

Cited by 18 · Related articles · All 4 versions

Automatic code generation and optimization of large-scale stencil computation on many-core processors

M Li, Y Liu, H Yang, Y Hu, Q Sun, B Chen… - Proceedings of the 50th …, 2021 - dl.acm.org

Stencil computation is an indispensable building block of many scientific applications and is
widely used by the numerical solvers of partial differential equations (PDEs). Due to the …

Cited by 16 · Related articles · All 2 versions

Performance evaluation of memory-centric ARMv8 many-core architectures: A case study with Phytium 2000+

JB Fang, XK Liao, C Huang, DZ Dong - Journal of Computer Science and …, 2021 - Springer

This article presents a comprehensive performance evaluation of Phytium 2000+, an ARMv8-
based 64-core architecture. We focus on the cache and memory subsystems, analyzing the …

Cited by 18 · Related articles · All 7 versions

Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system

J Wei, X Zhang, Z Ji, J Li, Z Wei - Scientific Reports, 2021 - nature.com

Due to the increase in computing power, it is possible to improve the feature extraction and
data fitting capabilities of DNN networks by increasing their depth and model complexity …

Cited by 5 · Related articles · All 6 versions

Optimizing Attention by Exploiting Data Reuse on ARM Multi-core CPUs

X Fu, W Yang, D Dong, X Su - Proceedings of the 38th ACM International …, 2024 - dl.acm.org

Transformers reign supreme in natural language processing, representing a milestone
innovation in deep learning. For high-performance model inference, optimizing the time …

Cited by 1 · Related articles

Performance evaluation of convolutional neural network on Tianhe-3 prototype

W Chen, X Dong, H Chen, Q Wang, X Yu… - The Journal of …, 2021 - Springer

Exascale supercomputers will greatly support the expanding computational resource
demand of convolutional neural networks (CNNs). At present, the prototype cluster of Tianhe …

Cited by 6 · Related articles · All 3 versions

Characterizing small-scale matrix multiplications on ARMv8-based many-core architectures

W Yang, J Fang, D Dong - 2021 IEEE International Parallel and …, 2021 - ieeexplore.ieee.org

General Matrix Multiplication (GEMM) is a key subroutine in high-performance computing.
There is a large body of work on evaluating and optimizing large-scale matrix multiplication …

Cited by 5 · Related articles · All 3 versions

GraphCube: Interconnection Hierarchy-aware Graph Processing

X Gan, G Wu, S Qiu, F Xiong, J Si, J Fang… - Proceedings of the 29th …, 2024 - dl.acm.org

Processing large-scale graphs with billions to trillions of edges requires efficiently utilizing
parallel systems. However, current graph processing engines do not scale well beyond a …

Cited by 1 · Related articles

An empirical study of HPC workloads on Huawei Kunpeng 916 processor

YC Wang, JK Chen, BR Li, SC Zuo… - 2019 IEEE 25th …, 2019 - ieeexplore.ieee.org

The ARM-based server processors have been gaining momentum in high performance
computing (HPC). While not designed specifically for HPC, Huawei Kunpeng 916 processor …

Cited by 10 · Related articles · All 3 versions
