(W 11) Resilience
12th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids 


Tuesday 27.08.2019, 14:00 - 17:30


Heyne-Haus, Papendieck 16, 37073 Göttingen, room 2/right


Resilience is a critical challenge as high performance computing (HPC) systems continue to increase component counts, individual component reliability decreases (for example, due to shrinking process technology and near-threshold voltage (NTV) operation), software complexity increases, and architectures become more heterogeneous. Application correctness and execution efficiency, in spite of frequent faults, errors, and failures, are essential to the success of extreme-scale HPC systems, cluster computing environments, Grid computing infrastructures, and Cloud computing services.

Resilience for HPC systems encompasses a wide spectrum of fundamental and applied research and development, including theoretical foundations, fault detection and prediction, monitoring and control, end-to-end data integrity, enabling infrastructure, and resilient solvers and algorithm-based fault tolerance. This workshop brings together experts in the community to further research and development in HPC resilience and to facilitate exchanges across the computational paradigms of extreme-scale HPC, cluster computing, Grid computing, and Cloud computing.

Workshop Chairs

  • Stephen L. Scott, Tennessee Tech University and Oak Ridge National Laboratory, Systems Research Team, Cookeville, TN, USA
  • Chokchai (Box) Leangsuksun, Louisiana Tech University, SWEPCO Endowed Professor, Ruston, LA, USA
  • Patrick G. Bridges, University of New Mexico, Albuquerque, NM, USA
  • Christian Engelmann, Oak Ridge National Laboratory, Oak Ridge, TN, USA


Session 1

14:00 - 14:30
Opening: Resilience Workshop Organizers

14:30 - 15:00
Scott Levy and Kurt Ferreira
Space-Efficient Reed-Solomon Encoding to Detect and Correct Pointer Corruption

15:00 - 15:30
Maher Salloum, Jackson Mayo and Robert Armstrong
Physics-Based Checksums for Silent-Error Detection in PDE Solvers

15:30 - 16:00 Coffee Break

Session 2

16:00 - 16:30
Max Baird, Sven-Bodo Scholz, Artjoms Sinkarovs and Leonardo Bautista-Gomez
Checkpointing Kernel Executions of MPI+CUDA Applications

16:30 - 17:00
Carlos E. Gomez, Jaime Chavarriaga, Harold E. Castro and Andrei Tchernykh
Improving Reliability for Provisioning of Virtual Machines in Desktop Clouds

17:00 - 17:30
Closing: Resilience Workshop Organizers




