How to Design a Neural Network in 2020

January 5, 2020

Transcript

Today I realized that it's been a long time since I made the 'how to design a convolutional neural network' video. A lot has changed since then, but a lot has also stayed the same. So, in this video, we will talk about how to design a neural network in 2020, covering some of the useful techniques that came out or were popularized between 2018 and 2020. At the end of the video, I will also go through some of our recent papers and explain how my colleagues and I designed neural networks for constrained environments. Alright, let's get started!

My first piece of advice is still the same: you don't really need to spend too much time trying to design a neural network. You can pick something that worked for a similar problem and use it. But what if your model needs to have some special properties that mainstream models don't provide? What if you have some unique concerns that others typically disregard or haven't researched yet, such as minimizing the hardware footprint of a model or dealing with very large multispectral images? In many of those cases, you still don't have to start from scratch. There are many good practices and useful design patterns that you can use in your model architecture. That's what we will be covering in this video.

Let's first talk about efficient model architectures. Despite the recent advances in automated network architecture search, hand-designed models are still relevant, especially when it comes to designing efficient models.

For example, the ShuffleNetV2 paper argues that automatically found network architectures can be much slower in practice even when they require fewer operations to run. Their paper reports that MobileNetV2, a hand-designed model, is much faster than NASNet-A, which was the result of an automated network architecture search process.

This is because the speed of a model depends not only on the number of floating-point operations but also on memory access costs and platform characteristics. Keeping those in mind, the ShuffleNetV2 paper proposes some guidelines to design efficient model architectures, optimized for inference speed.

One of their guidelines is based on the observation that equal channel width at both ends of a layer minimizes memory access cost. So it's a good practice not to change the number of input and output channels too frequently. Models that make use of bottlenecks and expand-and-contract modules, such as SqueezeNet and MobileNetV2, violate this guideline. It doesn't mean you should never use them, though.
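
If you want to see why, here is a quick back-of-the-envelope sketch based on the 1x1-convolution cost analysis in the ShuffleNetV2 paper, as I understand it; the feature map size and channel counts are made-up examples.

```python
# Memory access cost (MAC) of a 1x1 convolution on an h x w feature map,
# following the ShuffleNetV2 analysis: MAC = hw(c_in + c_out) + c_in * c_out,
# while FLOPs are proportional to hw * c_in * c_out.

def conv1x1_cost(h, w, c_in, c_out):
    flops = h * w * c_in * c_out                 # multiply-accumulates
    mac = h * w * (c_in + c_out) + c_in * c_out  # input maps + output maps + weights
    return flops, mac

# Two layers with identical FLOPs but different channel ratios.
print(conv1x1_cost(56, 56, 128, 128))  # balanced:   (51380224, 819200)
print(conv1x1_cost(56, 56, 64, 256))   # unbalanced: (51380224, 1019904)
```

The balanced layer touches noticeably less memory for exactly the same amount of computation.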

Speaking of bottlenecks, something I have observed is that extremely narrow bottlenecks also hurt training stability: a few dead neurons and the entire model collapses. So, if you really need to use layers with very few filters, say 8 or fewer, using a linear activation instead of ReLU at the end of those layers will at least prevent dead neurons. MobileNetV2 also does something similar by using ReLU activations in the expansion layers while keeping the bottlenecks linear.
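
As a rough illustration of what I mean by keeping a narrow bottleneck linear, here is a minimal PyTorch-style sketch; the layer sizes are arbitrary placeholders, not taken from any particular model.

```python
import torch.nn as nn

# A toy expand -> depthwise -> narrow-bottleneck block. The final projection
# down to very few channels is left linear (no ReLU), in the spirit of the
# linear bottlenecks in MobileNetV2, so those few neurons can't die.
def narrow_bottleneck_block(in_ch=32, expand_ch=128, bottleneck_ch=8):
    return nn.Sequential(
        nn.Conv2d(in_ch, expand_ch, kernel_size=1, bias=False),   # expansion
        nn.BatchNorm2d(expand_ch),
        nn.ReLU(inplace=True),                                     # ReLU is fine here
        nn.Conv2d(expand_ch, expand_ch, kernel_size=3, padding=1,
                  groups=expand_ch, bias=False),                   # depthwise 3x3
        nn.BatchNorm2d(expand_ch),
        nn.ReLU(inplace=True),
        nn.Conv2d(expand_ch, bottleneck_ch, kernel_size=1, bias=False),  # projection
        nn.BatchNorm2d(bottleneck_ch),
        # no activation here: the 8-channel bottleneck stays linear
    )
```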

Another guideline is that network fragmentation reduces the degree of parallelism. Using a lot of small operations instead of a few large ones decreases efficiency since they are not very GPU-friendly, and they introduce extra overheads. This is a well-known phenomenon and is one of the reasons why some automatically designed models run slower. Network architecture search algorithms may result in heavily fragmented architectures when accuracy and the number of operations are the only search criteria.

The paper also points out that element-wise operations have a non-negligible cost. Point-wise operations such as ReLU and 1x1 convolutions require only a small number of floating-point operations, but their memory access cost is not negligible. Therefore, you shouldn't treat them as free when designing a model architecture.

As you may know, my work mostly focuses on models that operate on image data, and I haven't worked on anything related to natural language processing for the past few years. However, it's hard not to see how successful the Transformer model has been in the field of natural language processing. This new type of architecture relies on what are called 'attention mechanisms.' At a very high level, an attention mechanism tells a model where to look, or which parts of the input signal are more relevant. You can think of it as a module that generates a weight vector given a query. For example, to resolve what 'it' refers to in "This video is very interesting. I liked it.", an attention vector would put a higher weight on the words "this" and "video." This type of attention mechanism is called self-attention.
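
To make 'a weight vector given a query' a bit more concrete, here is a bare-bones sketch of dot-product attention; the tensor shapes and names are arbitrary, and real Transformer blocks add learned projections and multiple heads on top of this.

```python
import torch
import torch.nn.functional as F

def dot_product_attention(query, keys, values):
    """query: (d,), keys/values: (n, d). Returns a weighted sum of the values."""
    scores = keys @ query / keys.shape[-1] ** 0.5  # similarity of the query to each token
    weights = F.softmax(scores, dim=0)             # the attention (weight) vector
    return weights @ values, weights

# Example: 5 tokens with 16-dimensional embeddings.
keys = values = torch.randn(5, 16)
query = torch.randn(16)
summary, weights = dot_product_attention(query, keys, values)
print(weights)  # higher weights ~ more relevant tokens
```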

It's straightforward to see how attention mechanisms help in this example, but can we apply this type of self-attention also to models that deal with computer vision problems? Yes, we can. We don't have words and sentences in images, but we certainly can design mechanisms to shift the attention towards particular spatial locations or feature maps.

For example, this paper, titled "Squeeze-and-Excitation Networks," proposes a self-attention mechanism that assigns weights to feature maps based on their relevance for a given input.

The way they do it is quite simple. For a given layer, they first squeeze the feature maps into a global description vector by averaging over the spatial axes. This is basically a global average pooling operator. Then, they feed that information into a mini fully-connected neural network that outputs an attention vector. Finally, they take those weights in the attention vector and multiply them with their corresponding feature maps at the input. This process essentially puts the attention on more relevant feature maps by recalibrating their channels.
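
Here is a minimal PyTorch sketch of a squeeze-and-excitation style channel attention block, based on my reading of the paper; the reduction ratio of 16 is a typical but arbitrary choice.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel attention in the style of Squeeze-and-Excitation Networks."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)           # squeeze: global average pooling
        self.fc = nn.Sequential(                      # excitation: tiny FC network
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                             # per-channel weights in [0, 1]
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)
        return x * w                                  # recalibrate the feature maps
```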

This method can easily be applied to many types of convolutional neural network architectures, such as ResNet. MobileNetV3 also makes use of similar attention modules, combining manual design practices with an automated network architecture search approach. For many computer vision tasks, it seems that it's a good strategy to use self-attention mechanisms in convolutional neural network architectures.

At the beginning of the video, I mentioned that using a well-known model architecture off the shelf would be sufficient in many cases. But if you are trying to solve a problem that has some specific requirements or limitations, then you may need to make some task-specific design choices to adapt your model to the targeted application. Let's go through two such examples.

The first one is a pixel-wise segmentation model that I designed to handle very large input images. The goal was to create surface water maps from satellite imagery. This is a typical semantic segmentation task that any mainstream pixel-wise prediction model would be expected to perform well on. However, satellite images can be much larger, like orders of magnitude larger, than the images that mainstream models are designed to deal with. So, I needed to make some design choices to process large inputs in one shot, without dividing them into tiles, given a certain memory budget.

It's a common practice to double the number of feature maps whenever you downscale the input by two, and vice versa. In this setting, higher-resolution layers get a much larger memory allocation than the coarser-scale ones, because downscaling the feature maps by two along both spatial axes while doubling the number of channels still halves the overall feature map volume.

One design choice I made was to quadruple the number of channels whenever the feature maps are spatially downscaled, so that the model uses a constant amount of memory throughout the network and layers at different levels of abstraction get their fair share of the memory budget. This approach also has some downsides, but for this particular task, it worked very well. I recently published a paper on this in IEEE Geoscience and Remote Sensing Letters. There's a lot more in the paper, so check it out to learn more.
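
To make the arithmetic concrete, here is a small back-of-the-envelope comparison of the two channel-scaling schemes; the input resolution and base channel count are made up for the example.

```python
# Feature map volume (number of activations) per scale for the two schemes.
h, w, c = 4096, 4096, 16   # hypothetical large input resolution and base channel count

for level in range(5):
    scale = 2 ** level
    doubled = (h // scale) * (w // scale) * (c * 2 ** level)     # common practice
    quadrupled = (h // scale) * (w // scale) * (c * 4 ** level)  # constant-memory choice
    print(level, doubled, quadrupled)

# With doubling, each level holds half the activations of the previous one;
# with quadrupling, every level holds the same number of activations.
```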

Let's move on to the second custom model design example, in which our goal was to minimize the latency and hardware footprint of our model. We used several tricks to do that. The main innovation in our model was the use of 3-way separable FIR-IIR filters to minimize line buffering, and I'll explain what that means.

The concept of separable convolutions is nothing new. Many efficient model architectures use depthwise separable convolutions. However, as I mentioned in one of my earlier videos, 3-way separable convolutions with vertical, horizontal, and depthwise components are fairly uncommon. There is a reason for that: convolutional neural networks typically use very small kernel sizes, such as 3x3, so breaking a spatial convolution down into column and row convolutions doesn't really save much. I'm guessing that's also why popular models don't use depthwise separable convolutions in the first layer, although I haven't seen it stated explicitly in the papers: since the input is usually a 3-channel RGB image, there are too few channels for depthwise separability to be worth it.
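
For reference, here is roughly what a 3-way separable convolution could look like in PyTorch: a depthwise column convolution, a depthwise row convolution, and a pointwise convolution to mix the channels. The kernel size and channel counts are placeholders, not the ones we used.

```python
import torch.nn as nn

def three_way_separable_conv(channels, out_channels, k=5):
    """Depthwise kx1 (vertical) + depthwise 1xk (horizontal) + 1x1 pointwise conv."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=(k, 1), padding=(k // 2, 0),
                  groups=channels, bias=False),   # vertical (column) depthwise conv
        nn.Conv2d(channels, channels, kernel_size=(1, k), padding=(0, k // 2),
                  groups=channels, bias=False),   # horizontal (row) depthwise conv
        nn.Conv2d(channels, out_channels, kernel_size=1, bias=False),  # pointwise mix
    )
```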

In our case, spatial separability was very useful for factoring out the dominant hardware cost. The cost of vertical convolutions was disproportionately high because of the number of lines that need to be buffered: the hardware acquires images line by line, so a column convolution requires more elements to be buffered. For example, a 5x1 convolution needs 4 full lines of data to be buffered, whereas a 1x5 convolution needs only 4 elements.

We addressed this problem by replacing the vertical convolutions in a 3-way separable convolution layer with infinite impulse response filters. You can think of those IIR filters as recurrent neural network modules that summarize pixels in the vertical direction.
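
As a loose illustration, and not our exact formulation, here is what a simple first-order IIR filter running down the rows of a feature map could look like; the single per-channel decay coefficient is a simplification.

```python
import torch
import torch.nn as nn

class VerticalIIR(nn.Module):
    """First-order recursive filter over the height axis: y[i] = a*y[i-1] + (1-a)*x[i]."""
    def __init__(self, channels):
        super().__init__()
        self.a = nn.Parameter(torch.full((1, channels, 1), 0.5))  # per-channel decay

    def forward(self, x):                     # x: (n, c, h, w)
        a = torch.sigmoid(self.a)             # keep the filter stable (0 < a < 1)
        state = torch.zeros_like(x[:, :, 0])  # running summary, shape (n, c, w)
        rows = []
        for i in range(x.shape[2]):           # stream over rows as they arrive
            state = a * state + (1 - a) * x[:, :, i]
            rows.append(state)
        return torch.stack(rows, dim=2)
```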

Unlike fixed-window convolutions, our separable FIR-IIR filters start processing their input as soon as the pixels arrive, without having to buffer the lines that would be spanned by a fixed-size window. This reduces latency and the size of the line buffers, leading to significant savings in silicon area. You can find our paper in the description below.

My final tip in this video is not to follow any advice or guideline religiously, including my own. Things change fast, especially in this field. Guidelines, rules of thumb, and design patterns are practical, but they don't always work well. Things change, and so does our understanding of them. So it's better to keep an open mind.

Alright, that's all for today. I hope you liked it. Links are in the description, as always. Subscribe for more videos. Happy belated new year. Thanks for watching, stay tuned, and see you next time.

References in the Video