Resilience is a new approach to thinking about the growing failure rates of HPC systems. While fault tolerance addresses the problem of keeping the platform (or application) running in spite of failures in individual platform (or application) components, resilience focuses on the problem of keeping the application running to a correct solution in a timely and resource-efficient manner in the presence of degradations and failures in individual platform components. While fault tolerance is resigned to the notion that platform failures lead to application interrupts, resilience attempts to avoid application interrupts by anticipating and circumventing platform failures. We will discuss the challenges faced by traditional fault tolerance and suggest that resilience can be a more effective solution for keeping the application running to a correct solution.

Bio: John T. Daly is a computer systems researcher and resilience thrust lead for the Advanced Computing Systems (ACS) Program at the Center for Exceptional Computing (CEC). He is responsible for stimulating and directing collaborative research efforts in industry, academia, and government that are focused on the problem of keeping supercomputer applications running toward a correct solution in a timely and efficient manner in the presence of system degradations and failures. His research interests include mathematical modeling and analysis of failure, reliability, fault tolerance, calculational correctness, and throughput for applications at extreme scale. Prior to working at the CEC, John was a scientist and resilience researcher in the High Performance Computing (HPC) division at Los Alamos National Laboratory and a software engineer and application analyst for Raytheon Intelligence and Information Systems. He is a nationally recognized expert in resilience with more than 20 years of experience developing, porting, and running applications as an early adopter of many of the world's fastest supercomputers.
Host: Nathan DeBardeleben, HPC-5