Optimization Tricks: momentum, adaptive methods, batch-norm, and more...
March 20, 2018
Transcript of the video
Deep learning is closely related to mathematical optimization. What people usually mean by optimization is to find a set of parameters that minimize or maximize a function. In the context of neural networks, this usually means minimizing a cost function by iteratively tuning the trainable parameters.
Perhaps the biggest difference between pure mathematical optimization and optimization in deep learning is that in deep learning, we do not optimize for maximum performance directly. Instead, we minimize an easier-to-optimize cost function on a training set and hope that doing so improves performance on a separate test set.
We already talked about how optimization works in its simplest form in the earlier videos. We pick a loss function that we want to minimize, do a forward pass given a mini-batch of inputs, take the derivative of the loss function with respect to the weights, update the weights, and iterate. This is how vanilla stochastic gradient descent works. There are some tricks we can use to make this process more efficient.
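To make this concrete, here is a minimal sketch of a single vanilla SGD step in NumPy. The toy loss and its gradient are hypothetical stand-ins for whatever model you are actually training.

```python
import numpy as np

def sgd_step(weights, grad, lr=0.01):
    """One vanilla SGD update: take a step against the gradient."""
    return weights - lr * grad

# Toy example: minimize f(w) = ||w||^2, whose gradient is 2w.
w = np.array([1.0, -2.0, 3.0])
for _ in range(100):
    grad = 2 * w                  # gradient of the toy loss
    w = sgd_step(w, grad, lr=0.1)
print(w)                          # close to the minimum at [0, 0, 0]
```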
First, let's focus on the update rule we had. This update rule tells us to take a step towards a lower loss value guided by the gradient. This usually works fine but can be a little slow to converge. One useful trick that we can do is to add a velocity term to this update rule, which helps our updates gain momentum towards minima at every update. This is called the Momentum algorithm.
The momentum algorithm is simple and straightforward. We define a velocity variable that accumulates a weighted average of the past gradients, then use it in our update rule. In other words, we add a weighted average of past updates to our current update. The value of the momentum parameter determines the weight of the past updates in this weighted average.
As a result of accumulating the gradients, the updates get larger when the algorithm repeatedly gets gradients having a similar direction. You can imagine the current state of the weights as a ball moving down on a surface defined by the cost function. The ball gains momentum as it moves towards the basin even if it hits some small pits on the way. This usually gives a quicker path to a solution as compared to plain stochastic gradient descent.
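Here is a minimal sketch of one common formulation of the momentum update, building on the SGD step above; beta is the momentum parameter, and 0.9 is a typical but not mandatory choice.

```python
import numpy as np

def momentum_step(weights, grad, velocity, lr=0.01, beta=0.9):
    """Momentum update: accumulate a decaying average of past gradients."""
    velocity = beta * velocity + grad   # past updates keep contributing
    return weights - lr * velocity, velocity

# Same toy loss f(w) = ||w||^2 as before.
w = np.array([1.0, -2.0, 3.0])
v = np.zeros_like(w)
for _ in range(100):
    grad = 2 * w
    w, v = momentum_step(w, grad, v, lr=0.05, beta=0.9)
print(w)
```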
As you may recall, another factor that determines how big the step size will be is the learning rate. It's usually a good strategy to start with bigger steps and decrease the step size as we get closer to our target. You might think the magnitude of the gradient should shrink over time anyway, but that doesn't happen in many cases. The norm of the gradient might even increase, while the loss still keeps decreasing. So, it's common to set up a schedule, such as exponential decay, to decrease the learning rate over time either in a continuous way or by taking discrete steps.
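As an illustration, a simple exponential decay schedule might look like the sketch below; the initial rate, decay factor, and decay interval are arbitrary example values.

```python
def exponential_decay(initial_lr, step, decay_rate=0.5, decay_steps=1000):
    """Continuously shrink the learning rate as training progresses."""
    return initial_lr * decay_rate ** (step / decay_steps)

# Learning rate at a few points during training:
for step in (0, 1000, 5000):
    print(step, exponential_decay(0.1, step))
```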
In the plain version of stochastic gradient descent, the choice of learning rate might have a crucial impact on the performance. There are several methods that set a separate learning rate for each trainable parameter and adaptively adjust the learning rate to decrease a model's sensitivity to the initial learning rate.
The AdaGrad algorithm decreases the learning rate faster for the parameters that have large gradient components and more slowly for the ones that have smaller gradients.
The RMSProp algorithm also adaptively tunes the learning rate for each parameter in a similar way, but it uses an exponentially decaying moving average of the squared gradients, which makes it more suitable for optimizing non-convex cost functions.
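The main difference between the two is how they accumulate squared gradients: AdaGrad keeps a running sum over all of training, while RMSProp keeps a decaying moving average. A rough sketch, with typical default values for the decay rate and the small epsilon that prevents division by zero:

```python
import numpy as np

def adagrad_step(w, grad, accum, lr=0.01, eps=1e-8):
    accum = accum + grad ** 2                 # sum of all squared gradients so far
    return w - lr * grad / (np.sqrt(accum) + eps), accum

def rmsprop_step(w, grad, avg, lr=0.001, rho=0.9, eps=1e-8):
    avg = rho * avg + (1 - rho) * grad ** 2   # decaying average of squared gradients
    return w - lr * grad / (np.sqrt(avg) + eps), avg
```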
Another algorithm that uses adaptive learning rates is the Adam optimizer. Adam stands for adaptive moment estimation. It attempts to combine the best parts of RMSProp and momentum optimizers. In practice, Adam and RMSProp both work well.
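A sketch of the Adam update: a momentum-like estimate of the first moment of the gradients, an RMSProp-like estimate of the second moment, and a bias correction for both since they start at zero. The hyperparameter values below are the commonly used defaults.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)              # correct the bias toward zero
    v_hat = v / (1 - beta2 ** t)              # (t is the 1-based step count)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```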
In addition to the optimization algorithm, the model architecture also has a big impact on how easy it is to optimize a model. Many successful models owe their performance to their architecture rather than the choice of the optimization algorithm. You can check out my earlier video on designing neural networks to learn more about how model architecture can facilitate the optimization procedure. You can find it in the Deep Learning Crash Course playlist in the description below.
One challenge in optimizing deep models is the internal covariate shift problem. When we update the weights of a layer in a deep neural network, we do so assuming that its inputs will stay the same. However, the distribution of those inputs might change every time we update the weights, because the parameters of the previous layers are updated as well.
In deep models, even small changes in the early layers get amplified through the network and cause a shift in the distributions of the later layers. Changes in the input distributions make it harder for the following layers to adapt. This problem is called the internal covariate shift problem.
A technique called batch-normalization makes it easier to optimize deep models by normalizing the outputs of the hidden nodes right before they are fed into an activation function.
The first step of batch normalization is to subtract the batch mean from every output value and divide the result by the batch standard deviation. This gives us a zero-mean, unit-variance distribution at the output.
The second step uses scaling and shifting parameters to allow the variables to have any mean and standard deviation. These scaling and shifting parameters are trainable and learned during training. Essentially, the second step can undo what the first step does.
You might ask, what's the point of normalization then? The answer is that in practice, the second step doesn't really undo the first one. It's true that the variables are allowed to have an arbitrary mean and standard deviation, both with and without batch normalization. The difference is that when batch normalization is not used, the distributions are determined by a cascade of parameters. On the other hand, batch normalization parametrizes the mean and standard deviation as trainable parameters, which makes the distribution shifts manageable during training, resulting in faster convergence.
Once the training is complete, global statistics that are computed during training are used to normalize the activations rather than the batch statistics. In this way, the inference becomes independent of the input batch during test time, and we don't need a batch of samples to run inference on a single sample.
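Putting the pieces together, a simplified batch-norm layer might look like the sketch below. It normalizes each feature over the batch, applies the learnable scale and shift, and tracks running statistics for inference; the momentum and epsilon values are typical choices, and a full implementation would also need the backward pass.

```python
import numpy as np

def batch_norm(x, gamma, beta, running_mean, running_var,
               training=True, momentum=0.9, eps=1e-5):
    """Normalize a batch of activations, then scale and shift."""
    if training:
        mean, var = x.mean(axis=0), x.var(axis=0)
        # Accumulate global statistics for use at test time.
        running_mean = momentum * running_mean + (1 - momentum) * mean
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mean, var = running_mean, running_var   # batch-independent at inference
    x_hat = (x - mean) / np.sqrt(var + eps)     # step 1: zero mean, unit variance
    return gamma * x_hat + beta, running_mean, running_var  # step 2: scale and shift
```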
Optimizing deep models involves iterative processes that require some sort of parameter initialization. The way we initialize the parameters can have a big impact on the solution that the learning algorithm reaches. For example, if the cost function, projected onto a single dimension, has a hill like this and the weights are initialized on the wrong side of it, then the model would converge to a local minimum even though there are much better solutions on the other side of the hill.
In practice, though, cost functions like this are very rare. In higher dimensional space, local minima are not very common, and it's likely that there is a way around hills like these.
Why are local minima rare? Think of it this way: for a point to be a local minimum, it needs to have a smaller value than its neighbors along every axis. If we have a single dimension, the odds of observing such a structure are not very low. But what if we have a million parameters? Then, for a point to be a local minimum, it needs to have a smaller value than all of its neighbors in all one million directions. How likely is that? And if it does happen, it's likely that the point already has a very small value that can be considered an acceptable solution. In deep learning, we usually care about finding a good solution rather than finding the global minimum.
Another type of critical point is a saddle point, where the cost function has a minimum in some directions and a maximum in others. Saddle points are more likely to be encountered than local minima, since it is much harder for a point to have a smaller value than its neighbors in every direction than to be a minimum along only some of them. If the norm of the gradient becomes very small during training, the culprit is more likely to be a saddle point than a local minimum.
Let's go back to weight initialization. The initial state of a neural network can have an effect on how fast the model converges, how good the converged point is, or if the model converges at all. So how should we choose the initial values? There's no definite answer, but there are some heuristics that might help. Initializing the biases is usually easier. It's usually safe to initialize them to zero or a small number.
However, initializing the rest of the parameters to zero or another constant is not a good idea. If we initialize all parameters to the same value, then they will all get the same updates during training and end up learning the same features. Ideally, we would prefer each neuron to learn something different to be useful. To do that, we need to initialize the weights in a way that breaks the symmetry so that each neuron gets a different update during training. Initializing the weights randomly usually works fine, although it doesn't provide an absolute guarantee of asymmetry.
A very common initialization algorithm is the Glorot initializer, also known as the Xavier initializer. The Glorot initializer randomly samples the weights from a uniform distribution whose range is determined by the number of inputs and outputs of the layer, in order to keep the activation and gradient variances under control. This is the default initialization method in some frameworks.
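For a single weight matrix, the uniform variant of the scheme samples from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)). A minimal sketch:

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng=None):
    """Sample a weight matrix whose scale depends on the layer widths."""
    if rng is None:
        rng = np.random.default_rng()
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = glorot_uniform(784, 256)   # e.g. a 784-to-256 fully connected layer
print(W.shape, W.std())
```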
The scale of the initial values matters because if they are picked from a very narrow range of small numbers, then the units might not be different enough to learn different features. If the scale is too big, then the gradients might grow exponentially as the weights get multiplied from one layer to another. This problem is called exploding gradients and is more prevalent in recurrent neural networks than the types of neural networks that we have discussed so far.
So far, we have focused only on feedforward neural networks. In the next video, we will discuss what recurrent neural networks are and how they work. That's all for today. Thanks for watching, stay tuned, and see you next time.
Further reading:
- Deep Learning by Ian Goodfellow
- Stochastic Gradient Descent (Wikipedia)
- Adaptive subgradient methods for online learning and stochastic optimization
- RMSProp Lecture Notes by Geoffrey Hinton
- Adam: A method for stochastic optimization
- Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
- Saddle point (Wikipedia)
- Understanding the difficulty of training deep feedforward neural networks