Deep Learning for Computer Vision
Convolutional Network
- Problem: So far our classifiers don't respect the spatial structure of images!
- Solution: Define new computational nodes that operate on images!
Convolutional Layer
- There can be many filters generating many activation maps.
- Each filter must have the same depth (number of channels) as the input.
- Each filter also has a bias.
- Each position in the activation maps gives a feature vector (one value per filter), telling us about the local structure of the input at that position.
- The input is often a batch of \(N\) images.
- The number of output channels \(C_{out}\) usually differs from the number of input channels \(C_{in}\) (see the sketch below).
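A minimal PyTorch sketch (the sizes here are illustrative, not from any particular network) showing that each filter spans the full input depth, has its own bias, and produces one activation map:

```python
import torch
import torch.nn as nn

# Batch of N=8 RGB images (C_in=3), 32x32 pixels.
x = torch.randn(8, 3, 32, 32)

# 16 filters, each of shape 3x5x5: depth matches C_in, and each filter has its own bias.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, padding=2)

y = conv(x)
print(y.shape)            # torch.Size([8, 16, 32, 32]) -- one activation map per filter
print(conv.weight.shape)  # torch.Size([16, 3, 5, 5])
print(conv.bias.shape)    # torch.Size([16])
```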
Pooling Layer
- Compute a single number (e.g., the max or the average) to summarize each region, downsampling the spatial dimensions; see the sketch below.
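A minimal sketch with illustrative sizes: 2x2 max pooling with stride 2 keeps one number (the max) per region, halving the spatial dimensions without adding any learnable parameters.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 16, 32, 32)            # N x C x H x W
pool = nn.MaxPool2d(kernel_size=2, stride=2)

y = pool(x)
print(y.shape)                                     # torch.Size([8, 16, 16, 16]) -- channels unchanged, H and W halved
print(sum(p.numel() for p in pool.parameters()))   # 0 -- pooling has no learnable parameters
```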
Convolutional Network Architecture
Normalization
- Problem: Deep networks are hard to train.
- Normalization is introduced to accelerate optimization.
- Batch normalization gets its name because it normalizes over the batch dimension \(N\).
- Forcing layers to produce exactly zero mean and unit variance is too rigid a constraint, so in practice we add a learnable scale and shift \(\gamma,\beta\) that let the network learn its own mean and variance.
- It is odd that the result for one image depends on the other images in the batch, so the layer behaves differently at training time and at test time:
- Training time: compute the mean and variance empirically over the current batch.
- Test time: normalize with running averages of the means and variances computed during training (see the sketch below).
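A minimal sketch of the idea for a fully connected input, using the standard exponential running average for the test-time statistics; the shapes and the momentum value are illustrative:

```python
import torch

def batchnorm(x, gamma, beta, running_mean, running_var,
              training, momentum=0.1, eps=1e-5):
    # x: (N, D) batch; gamma, beta: learnable scale and shift of shape (D,).
    if training:
        mean = x.mean(dim=0)                    # statistics over the batch dimension N
        var = x.var(dim=0, unbiased=False)
        # Update running averages for use at test time.
        running_mean.mul_(1 - momentum).add_(momentum * mean)
        running_var.mul_(1 - momentum).add_(momentum * var)
    else:
        # Test time: no dependence on the other images in the batch.
        mean, var = running_mean, running_var
    x_hat = (x - mean) / torch.sqrt(var + eps)  # normalize to zero mean, unit variance
    return gamma * x_hat + beta                 # learnable scale and shift

x = torch.randn(32, 64)
gamma, beta = torch.ones(64), torch.zeros(64)
running_mean, running_var = torch.zeros(64), torch.ones(64)
y_train = batchnorm(x, gamma, beta, running_mean, running_var, training=True)
y_test = batchnorm(x, gamma, beta, running_mean, running_var, training=False)
```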
CNN Architecture
AlexNet
- First conv layer: input \(3\times 227\times 227\), \(64\) filters of size \(11\times 11\), stride \(4\), pad \(2\).
- Output channels \(=\) number of filters \(=64\)
- Output size \(W_{out}=(W_{in}-K+2P)/S+1=56\)
- Memory consumed by the output features:
- Number of output elements \(=C_{out}\times H_{out}\times W_{out}=200704\)
- Bytes per element \(=4\) (for 32-bit floating point)
- KB \(=\) (number of elements)\(\times\) (bytes per element)/1024=784
- Number of learnable parameters
- Weight shape \(=C_{out}\times C_{in}\times K \times K\)
- Bias shape \(=C_{out}=64\)
- Number of parameters \(=\) weights \(+\) biases \(=23296\)
- Flattened output size \(=C_{in}\times H\times W\)
- FC params \(=C_{in}\times C_{out}+C_{out}\)
- FC FLOPs \(=C_{in}\times C_{out}\) (the arithmetic above is checked in the script below)
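A short script to check the numbers above; the conv values follow the first-layer bullets, while the FC sizes are illustrative choices for the sketch:

```python
# AlexNet's first conv layer: 3x227x227 input, 64 filters of 11x11, stride 4, pad 2.
C_in, H_in, W_in = 3, 227, 227
C_out, K, S, P = 64, 11, 4, 2

W_out = (W_in - K + 2 * P) // S + 1   # = 56
H_out = (H_in - K + 2 * P) // S + 1   # = 56

num_outputs = C_out * H_out * W_out   # = 200704
memory_kb = num_outputs * 4 / 1024    # = 784.0 KB at 4 bytes per float32 element

weight_params = C_out * C_in * K * K  # = 23232
params = weight_params + C_out        # = 23296 (weights + biases)

# Fully connected layer after flattening a C x H x W feature map (illustrative sizes).
fc_C, fc_H, fc_W = 256, 6, 6
fc_in = fc_C * fc_H * fc_W            # flattened input size
fc_out = 4096                         # illustrative output size
fc_params = fc_in * fc_out + fc_out   # weight matrix + bias vector
fc_flops = fc_in * fc_out             # one multiply-add per weight

print(W_out, num_outputs, memory_kb, params, fc_params, fc_flops)
```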
VGG
- By stacking 3x3 convs, we remove kernel size as a hyperparameter and only need to decide how many 3x3 convs to stack.
- We can also insert ReLU between these 3x3 convs, allowing more nonlinear computation; two stacked 3x3 convs cover the same 5x5 receptive field as a single 5x5 conv while using fewer parameters (see the comparison below).
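A rough comparison (the channel count is an arbitrary choice for the sketch) of a single 5x5 conv against two stacked 3x3 convs with the same receptive field:

```python
import torch.nn as nn

C = 64  # illustrative channel count, kept the same for input and output

# Single 5x5 conv: receptive field 5x5, C*C*5*5 + C parameters.
conv5 = nn.Conv2d(C, C, kernel_size=5, padding=2)

# Two stacked 3x3 convs with a ReLU between them: same 5x5 receptive field,
# 2 * (C*C*3*3 + C) parameters, plus extra nonlinearity.
stack3 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=3, padding=1),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(conv5))   # 102464
print(count(stack3))  # 73856
```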
GoogLeNet
- Many innovations aimed at efficiency: reducing parameter count, memory usage, and computation (one example of the savings is sketched below).
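One of GoogLeNet's efficiency ideas is using 1x1 convolutions to shrink the channel count before an expensive spatial convolution. A rough sketch of the savings; the channel sizes here are illustrative, not the exact Inception configuration:

```python
import torch.nn as nn

# Expensive: a 3x3 conv applied directly to 256 input channels.
direct = nn.Conv2d(256, 128, kernel_size=3, padding=1)

# Cheaper: first reduce 256 -> 64 channels with a 1x1 conv, then apply the 3x3 conv.
reduced = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),
    nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(direct))   # 295040
print(count(reduced))  # 90304
```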
Residual Network
When training deeper models, we find that the deeper model does worse than the shallower one. The initial guess is that the deep model is overfitting, since it is much bigger than the shallower model; in fact the deeper model also does worse on the training set, so the real problem is optimization rather than overfitting (see the residual block sketch below).
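ResNet's answer is the residual block, which adds the block's input back to its output so that extra layers can easily behave like the identity. A minimal sketch (channel count illustrative; the real blocks also include batch normalization):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x: the block learns a residual F(x) on top of the identity."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)  # shortcut: if F(x) -> 0, the block is the identity

x = torch.randn(8, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([8, 64, 56, 56])
```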