Center for Nonlinear Studies
 
Wednesday, May 13, 2009
09:00 AM - 10:00 AM
CNLS Conference Room (TA-3, Bldg 1690)

Seminar

Resilience for Advanced Computing Systems: Challenge & Opportunity

John T. Daly
Center for Exceptional Computing

Resilience is a new approach to thinking about the growing failure rates of HPC systems. Fault tolerance addresses the problem of keeping the platform (or application) running in spite of failures in individual platform (or application) components. Resilience instead focuses on keeping the application running to a correct solution in a timely and resource-efficient manner in the presence of degradations and failures in individual platform components. While fault tolerance is resigned to the notion that platform failures lead to application interrupts, resilience attempts to avoid application interrupts by anticipating and circumventing platform failures. We will discuss the challenges faced by traditional fault tolerance and suggest that resilience can be a more effective solution for keeping the application running to a correct solution.

Bio: John T. Daly is a computer systems researcher and resilience thrust lead for the Advanced Computing Systems (ACS) Program at the Center for Exceptional Computing (CEC). He is responsible for stimulating and directing collaborative research efforts in industry, academia, and government that are focused on the problem of keeping supercomputer applications running toward a correct solution in a timely and efficient manner in the presence of system degradations and failures. His research interests include mathematical modeling and analysis of failure, reliability, fault tolerance, calculational correctness, and throughput for applications at extreme scale. Prior to working at the CEC, John was a scientist and resilience researcher in the High Performance Computing (HPC) division at Los Alamos National Laboratory and a software engineer and application analyst for Raytheon Intelligence and Information Systems. He is a nationally recognized expert in resilience with more than 20 years of experience developing, porting, and running applications as an early adopter of many of the world's fastest supercomputers.

Host: Nathan DeBardeleben, HPC-5