Monday, November 06, 2023, 11:00 AM - 12:00 PM, CNLS Conference Room (TA-3, Bldg 1690)
Stochastic gradient descent (SGD): a unified algorithmic overview
Paul Rodriguez, Pontificia Universidad Catolica del Peru
Gradient descent (GD) is a well-known first-order optimization method that uses the gradient of the loss function, along with a step size (or learning rate), to iteratively update the solution. When the loss (cost) function depends on datasets of large cardinality, as is typical in deep learning (DL), GD becomes impractical.
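The GD iteration described above can be sketched in a few lines; the least-squares loss, data, step size, and iteration count below are illustrative choices, not taken from the talk:

```python
import numpy as np

# Minimal gradient-descent sketch on a least-squares loss f(w) = ||Aw - b||^2 / (2n).
# The problem data and step size are illustrative assumptions.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 3))
w_true = np.array([1.0, -2.0, 0.5])
b = A @ w_true

w = np.zeros(3)
lr = 0.5  # step size (learning rate)
for _ in range(500):
    grad = A.T @ (A @ w - b) / len(b)  # full-batch gradient of the loss
    w -= lr * grad  # GD update: move against the gradient
```

Each iteration touches the entire dataset through the full-batch gradient, which is exactly the cost that becomes prohibitive at DL scale.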
In this scenario, stochastic GD (SGD), which uses a noisy gradient approximation (computed over a random fraction of the dataset), has become crucial. There exist several variants of and improvements over the "vanilla" SGD, such as SGD+momentum, Adagrad, RMSprop, Adadelta, Adam, Nadam, and AdaBelief, which most DL libraries (TensorFlow, PyTorch, etc.) provide as black boxes.
The primary objective of this talk is to open these black boxes by explaining their "evolutionary path," in which each SGD variant may be understood as a set of add-on features over vanilla SGD. Furthermore, since the hyper-parameters associated with each SGD variant directly influence its performance, they will also be assessed from a theoretical and computational point of view.
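The "add-on feature" view can be illustrated with one such feature, momentum, layered onto vanilla SGD; the quadratic loss, data, and hyper-parameters below are illustrative assumptions rather than the talk's own examples:

```python
import numpy as np

# Sketch: vanilla SGD vs. SGD+momentum, with momentum as an "add-on feature".
# The least-squares problem and hyper-parameters are illustrative choices.
rng = np.random.default_rng(1)
A = rng.standard_normal((256, 4))
w_true = np.array([2.0, -1.0, 0.0, 3.0])
b = A @ w_true

def sgd(momentum=0.0, lr=0.1, batch=32, steps=300):
    w = np.zeros(4)
    v = np.zeros(4)  # velocity buffer; stays zero when momentum == 0
    for _ in range(steps):
        idx = rng.choice(len(b), size=batch, replace=False)
        # noisy gradient approximation over a random minibatch
        grad = A[idx].T @ (A[idx] @ w - b[idx]) / batch
        v = momentum * v + grad  # add-on: accumulate past gradients
        w -= lr * v              # vanilla SGD recovered when momentum == 0
    return w

w_vanilla = sgd(momentum=0.0)
w_momentum = sgd(momentum=0.9)
```

Adagrad, RMSprop, Adam, and the rest follow the same pattern, each adding further state (per-coordinate step-size scaling, bias correction, etc.) on top of this update.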
Host: Brendt Wohlberg, T5