Deep Learning for Computer Vision
Optimization
- goal: find \(w^{*}=\arg\min_{w}L(w)\)
How to evaluate the gradient?
- Numeric gradient
- We can approximate the gradient from the definition of the derivative: increase one entry of the weight matrix by a small step, recompute the loss, and use the finite-difference quotient as an approximation of that entry of the gradient.
- Slow (one loss evaluation per entry) and only approximate.
- Analytic gradient
- exact, fast, error-prone
- In practice, always use the analytic gradient, but check the implementation against the numerical gradient. This is called a gradient check (see the sketch after this list).
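A minimal sketch of a numeric gradient and a gradient check, assuming a NumPy setting and a toy loss \(L(w)=\sum w^{2}\) (both illustrative assumptions, not the course's code):

```python
import numpy as np

def numeric_gradient(f, w, h=1e-5):
    """Approximate dL/dw entry by entry with central differences."""
    grad = np.zeros_like(w)
    it = np.nditer(w, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = w[idx]
        w[idx] = old + h
        f_plus = f(w)
        w[idx] = old - h
        f_minus = f(w)
        w[idx] = old                       # restore the original entry
        grad[idx] = (f_plus - f_minus) / (2 * h)
        it.iternext()
    return grad

# Gradient check: compare against the analytic gradient of L(w) = sum(w**2).
w = np.random.randn(3, 4)
analytic = 2 * w                           # dL/dw derived by hand
numeric = numeric_gradient(lambda w: np.sum(w ** 2), w)
print(np.max(np.abs(analytic - numeric)))  # should be ~1e-9 or smaller
```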
Gradient Descent
Stochastic Gradient Descent
- We don't use the full training set but small subsamples of it to approximate the loss function and gradient, since computing them on the whole set is expensive. These small subsamples are called minibatches.
- SGD + Momentum may overshoot the bottom of the loss surface because of its accumulated velocity, then come back toward the minimum.
- With per-parameter adaptive learning rates (AdaGrad), progress along "steep" directions is damped and progress along "flat" directions is accelerated.
- The effective learning rate continually decays, since `grad_squared` only accumulates and gets larger and larger.
- In Adam we initialize `moment2` to 0, and since `beta2` is chosen close to 1, `moment2` stays close to 0 after the first update, which may make our first gradient step very large. This could cause bad results; bias correction addresses it (see the sketch after this list).
- Second-order optimization is more practical in low dimensions, since forming and inverting the Hessian becomes too expensive in high dimensions.
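A minimal sketch of the update rules discussed above (SGD + Momentum, AdaGrad, and Adam with bias correction); the toy loss \(L(w)=\|w\|^{2}\) and the specific hyperparameter values are illustrative assumptions:

```python
import numpy as np

# Toy setting: minimize L(w) = ||w||^2, so dL/dw = 2w.
def compute_gradient(w):
    return 2 * w

learning_rate, num_steps = 1e-2, 500

# --- SGD + Momentum: accumulate a velocity and step along it ---
w = np.random.randn(10)
v, rho = np.zeros_like(w), 0.9
for t in range(num_steps):
    dw = compute_gradient(w)
    v = rho * v + dw
    w -= learning_rate * v

# --- AdaGrad: grad_squared only grows, so the effective step keeps shrinking ---
w = np.random.randn(10)
grad_squared = np.zeros_like(w)
for t in range(num_steps):
    dw = compute_gradient(w)
    grad_squared += dw * dw
    w -= learning_rate * dw / (np.sqrt(grad_squared) + 1e-7)

# --- Adam: momentum + adaptive scaling + bias correction ---
w = np.random.randn(10)
moment1, moment2 = np.zeros_like(w), np.zeros_like(w)
beta1, beta2 = 0.9, 0.999
for t in range(1, num_steps + 1):
    dw = compute_gradient(w)
    moment1 = beta1 * moment1 + (1 - beta1) * dw
    moment2 = beta2 * moment2 + (1 - beta2) * dw * dw
    m1_hat = moment1 / (1 - beta1 ** t)    # bias correction: prevents the
    m2_hat = moment2 / (1 - beta2 ** t)    # huge first steps noted above
    w -= learning_rate * m1_hat / (np.sqrt(m2_hat) + 1e-7)
```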
Neural Network
To overcome the limitations of a linear classifier, we can apply a feature transform to the inputs before the linear layer.
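For example (an illustrative sketch, not the course's code): points that are not linearly separable in Cartesian coordinates can become separable after a polar-coordinate feature transform.

```python
import numpy as np

def polar_features(x):
    """Map 2-D points (x1, x2) to polar features (r, theta)."""
    r = np.sqrt(x[:, 0] ** 2 + x[:, 1] ** 2)
    theta = np.arctan2(x[:, 1], x[:, 0])
    return np.stack([r, theta], axis=1)

# A linear classifier on polar_features(x) can now separate, e.g., an inner
# cluster of points from an outer ring simply by thresholding r.
x = np.random.randn(100, 2)
feats = polar_features(x)                      # shape (100, 2)
scores = feats @ np.array([1.0, 0.0]) - 1.0    # "is r > 1?" as a linear rule
```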
Activation function
Neuron
Space Warping
The more layers and hidden units, the more complex the model; if it overfits, adjust the regularization strength to control it.
Universal Approximation
A two-layer neural network can approximate any continuous function, but it may need a very large hidden layer to reach high fidelity.
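A minimal sketch of such a two-layer network (the sizes and the ReLU nonlinearity are illustrative assumptions): with enough hidden units, a sum of shifted and scaled ReLU bumps can approximate a 1-D function.

```python
import numpy as np

D_in, H, D_out = 1, 256, 1          # large hidden width for higher fidelity
W1, b1 = np.random.randn(D_in, H), np.random.randn(H)
W2, b2 = np.random.randn(H, D_out), np.zeros(D_out)

def two_layer_net(x):
    h = np.maximum(0, x @ W1 + b1)  # first layer + ReLU
    return h @ W2 + b2              # second (linear) layer

y = two_layer_net(np.linspace(-3, 3, 100).reshape(-1, 1))  # shape (100, 1)
```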
Convex Functions
For any two points in the domain, the secant line between them lies on or above the function.
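Formally (the standard definition), \(f\) is convex if for all \(x_{1}, x_{2}\) in the domain and \(t\in[0,1]\):
\[
f\big(t\,x_{1} + (1-t)\,x_{2}\big) \le t\,f(x_{1}) + (1-t)\,f(x_{2})
\]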
However, optimizing most neural networks requires nonconvex optimization.
Backpropagation
Computation Graph
Backprop Implementation
- You can define your own node object in the computational graph using the PyTorch API (see the sketch below).
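For example, a custom node can be defined by subclassing `torch.autograd.Function` and implementing `forward`/`backward`. A minimal sketch using ReLU; the class name is arbitrary:

```python
import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)          # cache input for the backward pass
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        grad_x = grad_output.clone()
        grad_x[x < 0] = 0                 # local gradient of ReLU
        return grad_x

x = torch.randn(5, requires_grad=True)
loss = MyReLU.apply(x).sum()
loss.backward()                           # populates x.grad via our backward()
```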
Backprop with Vectors
Backprop with Matrices
- \(dL/dx\) must have the same shape as \(x\), since the loss \(L\) is a scalar
Assume \(y=xw\) and we want to derive \(dL/dx_{i,j}\):
- Note that only the \(i\)th row of \(y\) depends on \(x_{i,j}\), and the corresponding coefficients are the \(j\)th row of \(w\).
- So \(dL/dx_{i,j}\) is just the inner product of the \(i\)th row of \(dL/dy\) and the \(j\)th column of \(w^{T}\), which leads to the result \(dL/dx=(dL/dy)\,w^{T}\) (see the check below).
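A small NumPy check of this result; the shapes and the stand-in loss are illustrative assumptions:

```python
import numpy as np

N, D, M = 3, 4, 5
x = np.random.randn(N, D)
w = np.random.randn(D, M)
dL_dy = np.random.randn(N, M)            # pretend upstream gradient

# Analytic result derived above: dL/dx = (dL/dy) w^T, same shape as x.
dL_dx = dL_dy @ w.T                      # (N, M) @ (M, D) -> (N, D)

# Numeric check of one entry, using L = sum(dL_dy * (x @ w)) as a stand-in
# loss whose gradient with respect to y is exactly dL_dy.
i, j, h = 1, 2, 1e-6
x_pert = x.copy(); x_pert[i, j] += h
L0 = np.sum(dL_dy * (x @ w))
L1 = np.sum(dL_dy * (x_pert @ w))
print(dL_dx[i, j], (L1 - L0) / h)        # the two numbers should match
```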
High-Order Derivatives
\(\frac{\partial^{2}L}{\partial x_{0}^{2}}\) is a \(D_{0}\times D_{0}\) matrix (the Hessian), since it is the derivative of \(\frac{\partial L}{\partial x_{0}}\), which is a \(D_{0}\)-dimensional vector.
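A Hessian-vector product \(\frac{\partial^{2}L}{\partial x_{0}^{2}}v\) can be computed without forming the full \(D_{0}\times D_{0}\) matrix by backpropagating through the backward pass. A minimal PyTorch sketch; the toy loss is an assumption:

```python
import torch

x0 = torch.randn(3, requires_grad=True)
L = (x0 ** 4).sum()                       # toy scalar loss

g, = torch.autograd.grad(L, x0, create_graph=True)   # dL/dx0, shape (3,)
v = torch.randn(3)
Hv, = torch.autograd.grad(g @ v, x0)      # d/dx0 of (g . v) = Hessian @ v
print(Hv)                                 # shape (3,): Hessian-vector product
```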