TSM2: optimizing tall-and-skinny matrix-matrix multiplication on GPUs

J Chen, N Xiong, X Liang, D Tao, S Li… - Proceedings of the …, 2019 - dl.acm.org
Linear algebra operations have been widely used in big data analytics and scientific
computations. Many works have been done on optimizing linear algebra operations on …

FT-ScaLAPACK: Correcting soft errors on-line for ScaLAPACK Cholesky, QR, and LU factorization routines

P Wu, Z Chen - Proceedings of the 23rd international symposium on …, 2014 - dl.acm.org
It is well known that soft errors in linear algebra operations can be detected off-line at the
end of the computation using algorithm-based fault tolerance (ABFT). However, traditional …

Correcting soft errors online in fast Fourier transform

X Liang, J Chen, D Tao, S Li, P Wu, H Li… - Proceedings of the …, 2017 - dl.acm.org
While many algorithm-based fault tolerance (ABFT) schemes have been proposed to detect
soft errors offline in the fast Fourier transform (FFT) after computation finishes, none of the …

Algorithm-directed data placement in explicitly managed non-volatile memory

P Wu, D Li, Z Chen, JS Vetter, S Mittal - Proceedings of the 25th ACM …, 2016 - dl.acm.org
The emergence of many non-volatile memory (NVM) techniques is poised to revolutionize
main memory systems because of the relatively high capacity and low lifetime power …

Online algorithm-based fault tolerance for Cholesky decomposition on heterogeneous systems with GPUs

J Chen, X Liang, Z Chen - 2016 IEEE International Parallel and …, 2016 - ieeexplore.ieee.org
Extensive researches have been done on developing and optimizing algorithm-based fault
tolerance (ABFT) schemes for systolic arrays and general purpose microprocessors …

FT K-Means: A High-Performance K-Means on GPU with Fault Tolerance

S Wu, Y Ding, Y Zhai, J Liu, J Huang… - 2024 IEEE …, 2024 - ieeexplore.ieee.org
K-means is a widely used algorithm in clustering, however, its efficiency is primarily
constrained by the computational cost of distance computing. Existing implementations …

Rethinking algorithm-based fault tolerance with a cooperative software-hardware approach

D Li, Z Chen, P Wu, JS Vetter - … of the International Conference on High …, 2013 - dl.acm.org
Algorithm-based fault tolerance (ABFT) is a highly efficient resilience solution for many
widely-used scientific computing kernels. However, in the context of the resilience …

TSM2X: High-performance tall-and-skinny matrix–matrix multiplication on GPUs

C Rivera, J Chen, N Xiong, J Zhang, SL Song… - Journal of Parallel and …, 2021 - Elsevier
Linear algebra operations have been widely used in big data analytics and scientific
computations. Many works have been done on optimizing linear algebra operations on …

Fault tolerant one-sided matrix decompositions on heterogeneous systems with GPUs

J Chen, H Li, S Li, X Liang, P Wu, D Tao… - … Conference for High …, 2018 - ieeexplore.ieee.org
Current algorithm-based fault tolerance (ABFT) approach for one-sided matrix
decomposition on heterogeneous systems with GPUs have following limitations: (1) they do …

Software approaches for resilience of high performance computing systems: a survey

J Jia, Y Liu, G Zhang, Y Gao, D Qian - Frontiers of Computer Science, 2023 - Springer
With the scaling up of high-performance computing systems in recent years, their reliability
has been descending continuously. Therefore, system resilience has been regarded as one …