Transfer Learning
February 27, 2018
Transcript of the video
Let's talk about the fastest and easiest way you can build a deep learning model, without worrying too much about how much data you have. Training a deep model may require a lot of data and computational resources, but luckily there's transfer learning.
In many cases, you can just take a model off the shelf and adapt it to your problem. This is called transfer learning. For example, you can take a model that has been trained for one task, such as classifying objects, then fine-tune it to accomplish another task such as classifying scenes.
Transfer learning is particularly prevalent in computer vision-related tasks. Studies have shown that features learned from very large image sets, such as ImageNet, are highly transferable to a variety of image recognition tasks.
There are several ways we can transfer knowledge from one model to another. Perhaps the easiest is to chop off the top layer of the already trained model and replace it with a randomly initialized one, then train only the parameters in the new top layer for the new task while all other parameters remain fixed. This method can be thought of as using the pre-trained model as a feature extractor: the fixed part extracts features, and the top layer acts like a traditional fully connected classifier. This approach works best if our data and task are similar to the data and task the original model was trained on. When there isn't much data for the target task, this type of transfer learning might be our only option to train a model without overfitting, since having fewer parameters to train also reduces the risk of overfitting.
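As a minimal sketch of this feature-extractor approach, here's how it might look in PyTorch with a torchvision ResNet-18 standing in for the pre-trained model. The number of target classes (NUM_CLASSES) and the learning rate are placeholders you would replace with your own values.

```python
# Feature-extractor style transfer learning: freeze the pre-trained network
# and train only a new, randomly initialized top layer.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 10  # hypothetical number of classes in the new task

model = models.resnet18(pretrained=True)

# Freeze every parameter of the pre-trained network.
for param in model.parameters():
    param.requires_grad = False

# Replace the top (classification) layer with a randomly initialized one.
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Only the new layer's parameters are passed to the optimizer, so the
# frozen part of the network acts as a fixed feature extractor.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)
```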
If we have more data, we can unfreeze these transferred parameters and train the entire network. In this setting, what we transfer is basically the initial values of the parameters. Initializing the weights using a pre-trained model instead of initializing them randomly can give our model a warm start and speed up the convergence. To preserve the initialization from pre-training, it's a common practice to lower the learning rate by an order of magnitude. To prevent changing the transferred parameters too early, it's also common to start with frozen parameters, train only the randomly initialized layers until they converge, then unfreeze all parameters and fine-tune the entire network.
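A rough sketch of that two-stage schedule, continuing the ResNet-18 example above, could look like the following. The learning rates and NUM_CLASSES are illustrative assumptions, not recommendations.

```python
# Two-stage fine-tuning: warm up the new top layer first, then unfreeze
# everything and fine-tune with a lower learning rate.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 10  # hypothetical number of classes in the new task
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Stage 1: freeze the transferred layers and train only the new top layer.
for param in model.parameters():
    param.requires_grad = False
for param in model.fc.parameters():
    param.requires_grad = True
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)
# ... run the usual training loop until the new layer converges ...

# Stage 2: unfreeze everything and fine-tune the entire network with a
# learning rate roughly an order of magnitude lower than in stage 1.
for param in model.parameters():
    param.requires_grad = True
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
# ... continue training until validation performance stops improving ...
```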
Transfer learning is particularly useful when we have a limited amount of data for one task but a large volume of data for a similar task, or a model that has already been trained on such data. However, even if we have enough data to train a model from scratch and the tasks are not that closely related, initializing the parameters with a pre-trained model can still be better than random initialization.
You might wonder why transfer learning even works. What type of useful information can a model offer towards accomplishing a task that it has not been trained on?
An interesting property of deep neural networks is that when they are trained on a large set of images, the early layer parameters resemble each other regardless of the specific task they have been trained on. For example, convolutional neural networks tend to learn edges, textures, and patterns in the first layers.
These layers seem to capture features that are broadly useful for analyzing natural images. Features that detect edges, corners, shapes, textures, and different lighting conditions can be considered generic feature extractors and used in many different settings.
The closer we get to the output, the more specific features the layers tend to learn, such as object parts and objects. For example, the last layer in a network that has been trained to do classification would be very specific to that classification task. If the model is trained to classify breeds of dogs, one unit would respond only to pictures of a specific breed.
Transferring all layers except the top layer is probably the most common type of transfer learning, although not the only one. In general, it's possible to transfer the first n layers from a pre-trained model to a target network and randomly initialize the rest.
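As a rough sketch of transferring only the first n layers, one could copy the early part of a pre-trained torchvision ResNet-18 into a freshly initialized network with the same architecture. The choice of which layers to transfer ("layer2" and everything before it here) is purely illustrative.

```python
# Copy the early layers of a pre-trained network into a target network
# and leave the remaining layers randomly initialized.
from torchvision import models

pretrained = models.resnet18(pretrained=True)
target = models.resnet18(pretrained=False)  # same architecture, random weights

# Transfer everything up to and including "layer2"; keep the rest random.
transferred_prefixes = ("conv1", "bn1", "layer1", "layer2")
state = {k: v for k, v in pretrained.state_dict().items()
         if k.startswith(transferred_prefixes)}
target.load_state_dict(state, strict=False)
```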
Technically, the transferred part doesn't even have to be the first layers. If the tasks are the same but the type of input data is a little different, it's possible to transfer the last layers instead. For example, say we have a face recognition model that has been trained on RGB images, and our target is to build a face recognition model whose input images have a depth channel in addition to the RGB data. Given that we don't have a lot of data to train our new model from scratch, it might be worth transferring the later layers and re-training the early ones.
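As a rough illustration (not the exact setup described here), one might swap in a new first convolution that accepts a fourth depth channel while keeping the pre-trained later layers. I'm using torchvision's ResNet-18 as a stand-in for the face recognition network; the layer names follow that implementation.

```python
# Keep the pre-trained later layers, but replace and re-train the early part
# so the network can accept RGB + depth (4-channel) input.
import torch.nn as nn
from torchvision import models

model = models.resnet18(pretrained=True)

# New first layer: 4 input channels (RGB + depth), randomly initialized.
# The remaining arguments mirror torchvision's original conv1.
model.conv1 = nn.Conv2d(4, 64, kernel_size=7, stride=2, padding=3, bias=False)

# Optionally freeze the transferred later layers and train only the early part.
for name, param in model.named_parameters():
    if not name.startswith(("conv1", "bn1", "layer1")):
        param.requires_grad = False
```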
If transfer learning is so great, then why don't we use it for every problem we can think of? Well, we do use it for many tasks, but it's not always possible to transfer anything useful from one model to another. One such scenario is when both the type of data and the task are vastly different. For example, I highly doubt one can transfer anything useful from a model trained on ImageNet to a model that is supposed to learn how to allocate funds to maintain a profitable stock portfolio.
Another scenario where transfer learning might not be applicable is when the architecture of the model that we want to transfer features from doesn't meet our needs. A novel model architecture won't be able to benefit from existing pre-trained models. However, if the amount of data is the issue, you can always train your custom architecture on a large dataset for a generic task first, then fine-tune it for your specific task.
Another technique that addresses transferring knowledge between models with different architectures is model distillation. To distill a pre-trained model, you train the new model to mimic the outputs of the pre-trained model rather than training it directly on the data. This approach works particularly well if you are trying to train a model that is smaller than the source model.
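A minimal sketch of the distillation idea is below: a small "student" network is trained to match the softened outputs of a large pre-trained "teacher". The helper name distillation_loss, the temperature T, and the teacher/student models are hypothetical choices for illustration.

```python
# Distillation loss: KL divergence between the teacher's and student's
# temperature-softened output distributions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=4.0):
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_probs = F.log_softmax(student_logits / T, dim=1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * (T * T)

# In the training loop (teacher frozen, student trainable):
# with torch.no_grad():
#     teacher_logits = teacher(images)
# loss = distillation_loss(student(images), teacher_logits)
```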
Transferring features from a pre-trained model certainly reduces the time needed to train a model, and so does an efficient optimization procedure. The next video will focus on optimization, where we will talk about some tips and tricks that make training deep models easier and speed up convergence. That's all for today. Feel free to ask questions and provide feedback in the comments section. Subscribe to my channel for more videos like this. You can find the links to additional resources in the description below. Thanks for watching, stay tuned, and see you next time.
Further reading:
- CS231n: Convolutional Neural Networks for Visual Recognition
- Best Practices for Fine-tuning Visual Classifiers to New Domains
- How transferable are features in deep neural networks?
- What makes ImageNet good for transfer learning?
- ImageNet
- How neural networks build up their understanding of images
- Distilling the Knowledge in a Neural Network