(W 11) Resilience
12th Workshop on Resiliency in High Performance Computing in Clusters, Clouds, and Grids 

Date

Tuesday 27.08.2019, 14:00 - 17:30

Location

Heyne-Haus, Papendieck 16, 37073 Göttingen, room 2/right

Scope

Resilience is a critical challenge as high performance computing (HPC) systems continue to increase component counts, individual component reliability decreases (for example, due to shrinking process technology and near-threshold voltage (NTV) operation), software complexity increases, and architectures become more heterogeneous. Application correctness and execution efficiency, in spite of frequent faults, errors, and failures, are essential to the success of extreme-scale HPC systems, cluster computing environments, Grid computing infrastructures, and Cloud computing services.

Resilience for HPC systems encompasses a wide spectrum of fundamental and applied research and development, including theoretical foundations, fault detection and prediction, monitoring and control, end-to-end data integrity, enabling infrastructure, and resilient solvers and algorithm-based fault tolerance. This workshop brings together experts in the community to further research and development in HPC resilience and to facilitate exchanges across the computational paradigms of extreme-scale HPC, cluster computing, Grid computing, and Cloud computing.

Workshop Chairs

  • Stephen L. Scott, Tennessee Tech University and Oak Ridge National Laboratory, Systems Research Team, Cookeville, TN, USA
  • Chokchai (Box) Leangsuksun, Louisiana Tech University, SWEPCO Endowed Professor, Ruston, LA, USA
  • Patrick G. Bridges, University of New Mexico, Albuquerque, New Mexico, USA
  • Christian Engelmann, Oak Ridge National Laboratory, Oak Ridge, TN, USA

Agenda

Session 1

14:00 - 14:30
Opening: Resilience Workshop Organizers

14:30 - 15:00
Scott Levy and Kurt Ferreira
Space-Efficient Reed-Solomon Encoding to Detect and Correct Pointer Corruption

15:00 - 15:30
Maher Salloum, Jackson Mayo and Robert Armstrong
Physics-Based Checksums for Silent-Error Detection in PDE Solvers

15:30 - 16:00 Coffee Break

Session 2

16:00 - 16:30
Max Baird, Sven-Bodo Scholz, Artjoms Sinkarovs and Leonardo Bautista-Gomez
Checkpointing Kernel Executions of MPI+CUDA Applications

16:30 - 17:00
Carlos E. Gomez, Jaime Chavarriaga, Harold E. Castro and Andrei Tchernykh
Improving Reliability for Provisioning of Virtual Machines in Desktop Clouds

17:00 - 17:30
Closing: Resilience Workshop Organizers
