DOI: 10.1145/3203217.3203240
research-article

Comparative analysis of soft-error detection strategies: a case study with iterative methods

Published: 08 May 2018

    Abstract

    Undetected soft errors caused by transient bit flips can lead to silent data corruption (SDC), an undesirable outcome in which invalid results pass for valid ones. This has motivated the design of soft-error detectors to minimize SDCs. However, the detectors have been studied in different contexts, making comparative evaluation difficult. In this paper, we present the first comprehensive evaluation of four online soft-error detection techniques in detecting the adverse impact of soft errors on iterative methods. We observe that, across five iterative methods, the detectors studied achieve high but not perfect detection rates. To understand the potential for improved detection, we evaluate a machine-learning-based detector whose input features are the runtime features that the individual detectors observe to reach their conclusions. Our evaluation demonstrates improved but still far from perfect detection accuracy for the machine-learning-based detectors. This extensive evaluation demonstrates the need for error detectors designed to handle the evolutionary behavior exhibited by iterative solvers.


    Published In

    CF '18: Proceedings of the 15th ACM International Conference on Computing Frontiers
    May 2018
    401 pages
    ISBN: 9781450357616
    DOI: 10.1145/3203217
    © 2018 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • Research-article

    Funding Sources

    • European Union

    Conference

    CF '18: Computing Frontiers Conference
    May 8 - 10, 2018
    Ischia, Italy

    Acceptance Rates

    Overall Acceptance Rate 273 of 785 submissions, 35%

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 14
    • Downloads (Last 6 weeks): 0
    Reflects downloads up to 27 Jul 2024

    Cited By

    • (2023) Micro-Architectural features as soft-error markers in embedded safety-critical systems: preliminary study. 2023 IEEE European Test Symposium (ETS), pp. 1-5. DOI: 10.1109/ETS56758.2023.10174219. Online publication date: 22-May-2023
    • (2022) Work-in-Progress: Accuracy-Area Efficient Online Fault Detection for Robust Neural Network Software-Embedded Microcontrollers. 2022 International Conference on Embedded Software (EMSOFT), pp. 1-2. DOI: 10.1109/EMSOFT55006.2022.00008. Online publication date: Oct-2022
    • (2022) Detection and correction of silent errors in the conjugate gradient algorithm. Numerical Algorithms 92:1, pp. 869-891. DOI: 10.1007/s11075-022-01380-1. Online publication date: 29-Jul-2022
    • (2021) Error resilience of three GMRES implementations under fault injection. The Journal of Supercomputing 78:5, pp. 7158-7185. DOI: 10.1007/s11227-021-04148-x. Online publication date: 5-Nov-2021
    • (2020) FPDetect. ACM Transactions on Architecture and Code Optimization 17:3, pp. 1-27. DOI: 10.1145/3402451. Online publication date: 17-Aug-2020
    • (2020) Towards End-to-end SDC Detection for HPC Applications Equipped with Lossy Compression. 2020 IEEE International Conference on Cluster Computing (CLUSTER), pp. 326-336. DOI: 10.1109/CLUSTER49012.2020.00043. Online publication date: Sep-2020
    • (2019) Ground-Truth Prediction to Accelerate Soft-Error Impact Analysis for Iterative Methods. 2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC), pp. 333-344. DOI: 10.1109/HiPC.2019.00048. Online publication date: Dec-2019
