Chain matrix multiplication plays a key role in the training of deep learning models, but also in physics, computer graphics, and other fields. Matrix multiplications are often a performance and energy bottleneck because of their heavy cost in computation and memory operations. While their runtime performance has been studied for years, significantly less effort has been expended on optimizing their energy efficiency. Reducing the energy cost of these computations is therefore a major challenge.

GPU power consumption is heavily impacted by the number of data transfers performed. In fact, a data transfer from global memory requires roughly one thousand times more energy than a double-precision arithmetic operation. Minimizing data transfers is therefore key to reducing energy consumption.

In this talk, we present an energy-efficient solution for matrix chain multiplication on GPUs that minimizes computation and off-chip data transfers. We focus on improving three aspects of matrix chain multiplication. For a single matrix multiplication, we use a blocking strategy that achieves the minimum number of global memory loads for a given amount of shared memory. We then extend this approach to three matrices, further decreasing the number of data transfers performed. Finally, we propose a parenthesization algorithm that minimizes the number of memory transfers for a whole sequence of matrices.
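
The blocking strategy mentioned above follows the familiar shared-memory tiling pattern for GPU matrix multiplication: each thread block stages tiles of the inputs in shared memory so that global memory elements are not reloaded for every output element. The CUDA kernel below is a minimal sketch of that pattern; the tile size, kernel name, and indexing scheme are illustrative assumptions, not the implementation presented in the talk.

    // Minimal sketch of a shared-memory tiled (blocked) matrix multiplication
    // kernel for N x N row-major matrices. TILE and matmul_tiled are
    // illustrative names, not the talk's implementation.
    #define TILE 32

    __global__ void matmul_tiled(const float *A, const float *B, float *C, int N) {
        // Each thread block computes one TILE x TILE tile of C, staging the
        // needed tiles of A and B in shared memory to reduce global loads.
        __shared__ float sA[TILE][TILE];
        __shared__ float sB[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < N; t += TILE) {
            // Cooperative load of one tile of A and one tile of B,
            // padding with zeros at the matrix boundary.
            sA[threadIdx.y][threadIdx.x] =
                (row < N && t + threadIdx.x < N) ? A[row * N + t + threadIdx.x] : 0.0f;
            sB[threadIdx.y][threadIdx.x] =
                (t + threadIdx.y < N && col < N) ? B[(t + threadIdx.y) * N + col] : 0.0f;
            __syncthreads();

            // Accumulate the partial dot product from shared memory.
            for (int k = 0; k < TILE; ++k)
                acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
            __syncthreads();
        }

        if (row < N && col < N)
            C[row * N + col] = acc;
    }

    // Example launch (dA, dB, dC are device pointers):
    //   dim3 block(TILE, TILE);
    //   dim3 grid((N + TILE - 1) / TILE, (N + TILE - 1) / TILE);
    //   matmul_tiled<<<grid, block>>>(dA, dB, dC, N);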
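
For the whole chain, the parenthesization step has the same structure as the textbook matrix-chain-order dynamic program; the difference lies in the cost model, which the talk bases on memory transfers rather than scalar multiplications. The host-side sketch below keeps the standard O(n^3) recurrence and plugs the cost in through a placeholder transfer_cost() function; both that function and its flop-count body are assumptions for illustration, not the talk's cost model.

    #include <vector>
    #include <cstdint>
    #include <limits>

    // Placeholder cost of multiplying a (p x q) matrix by a (q x r) matrix.
    // A transfer-aware model, as in the talk, would replace this body.
    static uint64_t transfer_cost(uint64_t p, uint64_t q, uint64_t r) {
        return p * q * r;
    }

    // dims has n+1 entries: matrix i is dims[i] x dims[i+1].
    // Returns the minimum total cost over all parenthesizations of the chain.
    uint64_t best_parenthesization(const std::vector<uint64_t>& dims) {
        int n = static_cast<int>(dims.size()) - 1;
        std::vector<std::vector<uint64_t>> dp(n, std::vector<uint64_t>(n, 0));

        for (int len = 2; len <= n; ++len) {            // chain length
            for (int i = 0; i + len - 1 < n; ++i) {     // chain [i, j]
                int j = i + len - 1;
                dp[i][j] = std::numeric_limits<uint64_t>::max();
                for (int k = i; k < j; ++k) {           // split point
                    uint64_t c = dp[i][k] + dp[k + 1][j]
                               + transfer_cost(dims[i], dims[k + 1], dims[j + 1]);
                    if (c < dp[i][j]) dp[i][j] = c;
                }
            }
        }
        return dp[0][n - 1];
    }

Swapping in a transfer-aware cost model only requires changing transfer_cost(); the dynamic program itself is unchanged.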