Center for Nonlinear Studies
Wednesday, May 13, 2009
09:00 AM - 10:00 AM
CNLS Conference Room (TA-3, Bldg 1690)


Resilience for Advanced Computing Systems: Challenge & Opportunity

John T. Daly
Center for Exceptional Computing

Resilience is a new approach to thinking about the growing failure rates of HPC systems. While fault tolerance addresses the problem of keeping the platform (or application) running in spite of failures in individual platform (or application) components, resilience focuses on the problem of keeping the application running to a correct solution in a timely and resource-efficient manner in the presence of degradations and failures in individual platform components. While fault tolerance is resigned to the notion that platform failures lead to application interrupts, resilience attempts to avoid application interrupts by anticipating and circumventing platform failures. We will discuss the challenges faced by traditional fault tolerance and suggest that resilience can be a more effective solution for keeping the application running to a correct solution.

Bio: John T. Daly is a computer systems researcher and resilience thrust lead for the Advanced Computing Systems (ACS) Program at the Center for Exceptional Computing (CEC). He is responsible for stimulating and directing collaborative research efforts in industry, academia, and government that are focused on the problem of keeping supercomputer applications running toward a correct solution in a timely and efficient manner in the presence of system degradations and failures. His research interests include mathematical modeling and analysis of failure, reliability, fault tolerance, calculational correctness, and throughput for applications at extreme scale. Prior to working at the CEC, John was a scientist and resilience researcher in the High Performance Computing (HPC) division at Los Alamos National Laboratory and a software engineer and application analyst for Raytheon Intelligence and Information Systems. He is a nationally recognized expert in resilience with more than 20 years of experience developing, porting, and running applications as an early adopter of many of the world's fastest supercomputers.

Host: Nathan DeBardeleben, HPC-5