Abstract

Supercomputers continue to increase in scale and complexity to meet the demands of science and engineering. Exascale systems face high error rates due to increasing scale (10^9 cores), software complexity, and rising memory error rates. Increasingly, errors escape immediate hardware-level detection and silently corrupt application state. Such latent errors can often be detected by application-level tests, but typically only after long latencies. The challenges posed by latent errors include determining when the error occurred, what data was corrupted, and how to recover efficiently. The predicted high error rates and latent errors are a critical problem that will increase the cost, and may ultimately limit the scale, of application science. However, existing fault tolerance approaches lack support for latent errors, and there is no general guidance for designing latent error resilience.

This dissertation proposes a new approach called Application-Based Focused Recovery (ABFR) that lets high-performance applications execute efficiently in an environment with high error rates and latent errors. The approach exploits application knowledge to focus recovery on only the potentially corrupted data, achieving efficient and scalable latent error resilience. The two key ideas of ABFR are to (1) clearly define the application knowledge needed for latent error recovery, as embodied in the four ABFR operators; and (2) provide powerful runtime support that manages the complex recovery procedures using those four operators, without any other application programmer effort.

ABFR is a well-defined resilience framework that allows the application to pursue strategies exploiting a range of application semantics. Application designers can express their knowledge flexibly in the four ABFR operators. ABFR is also an application-system partnership that provides a clear separation between application knowledge and the underlying system. Application designers implement the four operators without concern for the underlying architecture and system details. The ABFR runtime implements the complex recovery procedure, including triggering and composing the operators, exploiting parallelism, and achieving load balance. Together, these properties support flexible application-based resilience.

To demonstrate ABFR's generality, we apply it to three varied scientific computation archetypes: stencil, N-Body tree, and Monte Carlo particle transport. We design ABFR operators for each computation and measure latent error resilience performance under varied error rates. Results indicate that ABFR significantly improves recovery performance. Specifically, ABFR reduces error recovery cost by 2.4x to 367x, recovery latency by 2.2x to 24x, and I/O cost by up to 1000x. ABFR achieves efficient and scalable recovery at scale with high latent error rates for all three computation archetypes. Note that these results may be improved further by more sophisticated application ABFR operators.

Overall, this dissertation demonstrates a new approach for efficient, scalable latent error recovery on large-scale systems. ABFR enables flexible application-based error resilience and provides sophisticated runtime support. As a result, applications are able to tolerate higher error rates and latent errors.
