Center for Nonlinear Studies
 
Monday, August 08, 2022
11:00 AM - 12:00 PM
CNLS Conference Room (TA-3, Bldg 1690)

Seminar

Co-Design Summer School 2022 Exit Talk: Transformations for Energy Efficient Accelerated Chain Matrix Multiplication (TEE-ACM2)

Maxim Moraru, Ph.D. Student, and Mina Warnet, Master's Student
University of Reims Champagne-Ardenne

Chain matrix multiplication plays a key role in the training of deep learning models, as well as in physics, computer graphics, and other domains. Matrix multiplications often become a performance and energy bottleneck because of their heavy computational and memory costs. While their runtime performance has been studied for years, significantly less effort has been spent on optimizing their energy efficiency, so reducing the energy cost of these computations is a major challenge. GPU power consumption is heavily affected by the number of data transfers performed: a data transfer from global memory requires about one thousand times more energy than a double-precision arithmetic operation. Minimizing data transfers is therefore key to reducing energy consumption.
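
As a rough illustration of why off-chip traffic dominates, the sketch below plugs the abstract's thousand-to-one energy ratio into a toy cost model for a single N x N product. The per-flop energy value and the on-chip reuse factor are hypothetical placeholders, not figures from the talk.

// Toy energy model (host-side C++): a global-memory load is assumed to cost
// ~1000x a double-precision flop, per the ratio quoted above. E_FLOP and REUSE
// are illustrative placeholders, not measured values.
#include <cstdio>

int main() {
    const double N      = 4096;           // square matrix dimension
    const double E_FLOP = 1e-12;          // assumed energy per flop (placeholder)
    const double E_LOAD = 1000 * E_FLOP;  // global-memory load, per the 1000x figure
    const double REUSE  = 32;             // assumed on-chip reuse factor from blocking

    double flops       = 2 * N * N * N;          // multiply-adds for one N x N product
    double loads_naive = 2 * N * N * N;          // no reuse: every operand refetched
    double loads_block = 2 * N * N * N / REUSE;  // each fetched operand reused on chip

    printf("naive  : %.1f J\n", flops * E_FLOP + loads_naive * E_LOAD);
    printf("blocked: %.1f J\n", flops * E_FLOP + loads_block * E_LOAD);
    return 0;
}

Even in this crude model the load term dwarfs the arithmetic term, which is why the approach described below concentrates on cutting global-memory traffic.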

In this talk, we present an energy-efficient solution for matrix chain multiplication on GPUs that minimizes both computation and off-chip data transfers. We focus on improving three different aspects of matrix chain multiplication. For a single matrix multiplication, we use a blocking strategy that achieves the minimum number of global memory loads for a given amount of shared memory. We then extend our approach to products of three matrices, further decreasing the number of data transfers. Finally, we propose a parenthesizing algorithm that minimizes the number of memory transfers for a whole sequence of matrices.
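
The abstract does not spell out the exact blocking scheme, so the kernel below is only the familiar shared-memory tiling baseline in CUDA, not the speakers' TEE-ACM2 implementation. It shows the basic idea: staging TILE x TILE sub-blocks of A and B in shared memory lets each global element be reused TILE times, cutting global-memory loads by roughly a factor of TILE per operand.

// Minimal shared-memory tiled matrix multiply, C = A * B, square N x N matrices.
// Assumes N is a multiple of TILE and omits error checking for brevity.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

constexpr int TILE = 32;  // tile edge; one thread block computes a TILE x TILE tile of C

__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // One global load per thread per tile for A and for B: this is where
        // blocking cuts off-chip traffic relative to a naive kernel.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}

int main() {
    const int N = 1024;
    std::vector<float> hA(N * N, 1.0f), hB(N * N, 1.0f), hC(N * N, 0.0f);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, N * N * sizeof(float));
    cudaMalloc(&dB, N * N * sizeof(float));
    cudaMalloc(&dC, N * N * sizeof(float));
    cudaMemcpy(dA, hA.data(), N * N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), N * N * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid(N / TILE, N / TILE);
    matmul_tiled<<<grid, block>>>(dA, dB, dC, N);
    cudaMemcpy(hC.data(), dC, N * N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("C[0] = %.0f (expect %d)\n", hC[0], N);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}

The three-matrix fusion and the transfer-minimizing parenthesization mentioned above build on the same accounting of per-tile loads; they are not reproduced here.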

Host: Julien Loiseau