Microprocessors have advanced to a point where they are no longer entirely reliable, being affected by silent computation errors (SCEs) that can corrupt the validity of application results without the user being able to notice. These errors compound for large scale high performance computing (HPC) applications that run in deployments with hundreds or even thousands of nodes and while there exist techniques and research on using runtime monitoring to detect such errors, they suffer from high-overhead, they are limited in the errors they detect, and they are all best-effort, without any guarantees that the errors will be detected. This project aims to address this problem by developing novel, end-to-end, modular, efficient, and verifiable SCE detection runtime monitoring systems for HPC applications. The project’s novelties are formal foundations for reasoning about SCEs, abstractions for decomposing HPC applications in easy to monitor fragments, and proof techniques to formally verify the correctness of the SCE detection mechanisms. The project’s impacts are helping scientists and other HPC practitioners to trust the results of their applications and draw robust conclusions, powering scientific discovery. The goal of this project is to improve SCE detection for HPC applications in two ways: (1) reduce the overhead, and (2) provide formal guarantees that SCE detection is correct. A key insight that enables both is that HPC applications can be decomposed into fragments with special properties that can be used to modularly detect SCEs in each fragment. The project can be separated into four research threads. First, developing formal foundations for reasoning about CPU SCEs together with proof techniques and mechanized proofs that existing SCE detection mechanisms are correct. Second, developing control-path abstractions for HPC applications and use them to develop a verifiable control-path aware SCE detection mechanism. Third, developing a verified library of HPC computation kernels and their validation formulae; using them to develop a verifiable validation-aware SCE detection mechanism. Finally, developing foundations for reasoning about GPU SCEs, together with GPU specific abstractions and proof techniques for GPU SCE detection mechanisms. This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria. NSF Award ID: 2546642 | Program: 01002627DB NSF RESEARCH & RELATED ACTIVIT | Principal Investigator: Konstantinos Kallas | Institution: University of California-Los Angeles, LOS ANGELES, CA | Award Amount: $800,000 View on NSF Award Search: https://www.nsf.gov/awardsearch/show-award/?AWD_ID=2546642 View on Research.gov: https://www.research.gov/awardapi-service/v1/awards/2546642.html

Collaborative Research: CS2: A Formal Foundation and System Support for Correctly Detecting Silent Computation Errors in Scientific HPC Applications

Description

Interested in this grant?

Grant Details

External Links

Get personalized grant matches