Neural Networks and Deep Learning
Introduction
Neural Network
- Use the given input $x$ to predict the output $y$. The inputs form the input layer.
- The circles are called hidden units; each one takes the input features and computes a new feature. The units stacked in one column form a hidden layer.
Supervised Learning
- Structured Data: Basically databases of data. Each of the features has a defined meaning.
- Unstructured Data: Data like audio, image and text.
Drivers behind deep learning
- Scale: a large NN performs much better when trained on a large amount of labeled data.
- Algorithms: e.g. switching from sigmoid to ReLU activations makes gradient descent converge much faster.
- Computation
Neural Network Basics
Binary Classification
Learn a classifier that takes a feature vector $x$ as input and predicts whether the corresponding label $y$ is 1 or 0.
Notation
- $(x, y)$: a single training example, where $x \in \mathbb{R}^{n_x}$ is an $n_x$-dimensional feature vector and $y$, the label, is either 0 or 1.
- $m$: number of training examples. Training examples are denoted from $(x^{(1)}, y^{(1)})$ to $(x^{(m)}, y^{(m)})$.
- The matrix $X = \begin{bmatrix} x^{(1)} & x^{(2)} & \cdots & x^{(m)} \end{bmatrix}$ represents the training examples in a compact way by stacking them in columns. The width and height of the matrix are $m$ and $n_x$, i.e. $X \in \mathbb{R}^{n_x \times m}$.
- Labels can also be stacked in columns: $Y = \begin{bmatrix} y^{(1)} & y^{(2)} & \cdots & y^{(m)} \end{bmatrix} \in \mathbb{R}^{1 \times m}$ (a quick shape check is sketched below).
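As a quick sanity check, here is a minimal NumPy sketch (with made-up dimensions) that stacks examples into $X$ and $Y$ and verifies their shapes:

import numpy as np

n_x, m = 4, 100                                 # assumed feature dimension and number of examples
X = np.random.randn(n_x, m)                     # each column is one training example x^(i)
Y = (np.random.rand(1, m) > 0.5).astype(int)    # labels stacked in a 1 x m row vector

assert X.shape == (n_x, m)
assert Y.shape == (1, m)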
Logistic Regression
Given an input feature vector $x$, we want $\hat{y} = P(y = 1 \mid x)$, with $0 \le \hat{y} \le 1$. Logistic regression computes $\hat{y} = \sigma(w^T x + b)$, where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.
When implementing logistic regression, our job is to learn the parameters $w \in \mathbb{R}^{n_x}$ and $b \in \mathbb{R}$ so that $\hat{y}$ becomes a good estimate of the probability that $y = 1$.
Logistic Regression cost function
- Denote $\hat{y} = \sigma(w^T x + b)$, where $\sigma(z) = \frac{1}{1 + e^{-z}}$.
- Goal: Given the training set $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$, we want $\hat{y}^{(i)} \approx y^{(i)}$.
- Loss (error) function: a function that measures how good our prediction $\hat{y}$ is on a single example. The smaller the loss value, the better the prediction. Denoted by $\mathcal{L}(\hat{y}, y)$. The squared error $\frac{1}{2}(\hat{y} - y)^2$ is a common choice elsewhere, but it makes the optimization non-convex here, so logistic regression uses $\mathcal{L}(\hat{y}, y) = -\big(y \log \hat{y} + (1 - y) \log(1 - \hat{y})\big)$.
- Cost function: measures how well you are doing on the entire training set: $J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})$. This function is convex, so it has a single global optimum (see the sketch below).
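As a small illustration (not from the original notes), here is a NumPy sketch that evaluates the loss and the cost on a toy training set, with made-up data and zero-initialized parameters:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(y_hat, y):
    # cross-entropy loss on a single prediction
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# toy data: n_x = 2 features, m = 4 examples
X = np.array([[1.0, -0.5, 2.0, 0.3],
              [0.2,  1.5, -1.0, 0.8]])
Y = np.array([[1, 0, 1, 0]])
w, b = np.zeros((2, 1)), 0.0

Y_hat = sigmoid(np.dot(w.T, X) + b)     # predictions for all m examples at once
J = np.mean(loss(Y_hat, Y))             # cost = average loss over the training set
print(J)                                # log(2) ≈ 0.693 for zero-initialized parameters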
Gradient Descent
- Gradient Descent: initialize the parameters with some values and repeatedly update them in the direction opposite to the gradient. Taking $w$ and $b$ for example:
$$ w := w - \alpha \frac{\partial J(w, b)}{\partial w}, \qquad b := b - \alpha \frac{\partial J(w, b)}{\partial b} $$
- $\alpha$: the learning rate, which controls how big a step we take on each iteration.
Derivatives with a Computation Graph
- Computation Graph: assume we want to compute $J(a, b, c) = 3(a + bc)$. We break the computation into steps $u = bc$, $v = a + u$, $J = 3v$ and organize them left to right in a graph; derivatives are then computed right to left with the chain rule, e.g. $\frac{dJ}{da} = \frac{dJ}{dv}\frac{dv}{da} = 3$.
- In programming, we use dx to represent $\frac{dJ}{dx}$, where x can be any variable here such as a, z and so on (da, dz, dw1, dw2, db, ...).
- Logistic regression derivatives, for a single example with $z = w_1 x_1 + w_2 x_2 + b$ and $a = \sigma(z)$:
$$ da = -\frac{y}{a} + \frac{1 - y}{1 - a}, \quad dz = a - y, \quad dw_1 = x_1\,dz, \quad dw_2 = x_2\,dz, \quad db = dz $$
Logistic regression on m examples
We know that
$$ J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(a^{(i)}, y^{(i)}), \qquad a^{(i)} = \sigma(w^T x^{(i)} + b) $$
So when computing the derivatives, dw1 and db (and likewise dw2) are simply the averages of the per-example derivatives:
$$ dw_1 = \frac{1}{m} \sum_{i=1}^{m} dw_1^{(i)}, \qquad db = \frac{1}{m} \sum_{i=1}^{m} dz^{(i)} $$
The algorithm is shown below:
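The original listing is missing here, so the following is a minimal sketch of one gradient-descent step written with explicit loops, assuming two features (w1, w2) as in the derivation above; the names and shapes are illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_step(X, Y, w1, w2, b, alpha):
    """One loop-based gradient-descent step. X: (2, m) features, Y: (1, m) labels."""
    m = X.shape[1]
    J = dw1 = dw2 = db = 0.0
    for i in range(m):                       # explicit loop over the m examples
        z = w1 * X[0, i] + w2 * X[1, i] + b
        a = sigmoid(z)
        J += -(Y[0, i] * np.log(a) + (1 - Y[0, i]) * np.log(1 - a))
        dz = a - Y[0, i]
        dw1 += X[0, i] * dz
        dw2 += X[1, i] * dz
        db += dz
    J, dw1, dw2, db = J / m, dw1 / m, dw2 / m, db / m
    w1, w2, b = w1 - alpha * dw1, w2 - alpha * dw2, b - alpha * db
    return w1, w2, b, J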
Python and Vectorization
Vectorization
Vectorization can significantly speed up calculation compared to using explicit for loops.
If we use the time() function from the time library to measure the elapsed time, we will find that the vectorized version is hundreds of times faster than the for loop (illustrated in the sketch below).
A good rule of thumb: whenever possible, use built-in Python or NumPy functions instead of explicit for loops.
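A small sketch along the lines of that timing experiment, comparing a vectorized dot product with an explicit loop:

import time
import numpy as np

a = np.random.rand(1000000)
b = np.random.rand(1000000)

tic = time.time()
c = np.dot(a, b)                           # vectorized dot product
print("vectorized:", 1000 * (time.time() - tic), "ms")

tic = time.time()
c = 0.0
for i in range(1000000):                   # explicit for loop
    c += a[i] * b[i]
print("for loop:  ", 1000 * (time.time() - tic), "ms")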
Now we can apply vectorization to logistic regression.
When it comes to computing the predictions, recall the definition of the matrix $X$: its columns are the training examples $x^{(i)}$.
Here we define the matrices $Z = \begin{bmatrix} z^{(1)} & \cdots & z^{(m)} \end{bmatrix} = w^T X + b$ and $A = \begin{bmatrix} a^{(1)} & \cdots & a^{(m)} \end{bmatrix} = \sigma(Z)$, so all $m$ predictions are computed at once.
Written in Python, one iteration of gradient descent becomes:
Z = np.dot(w.T, X) + b        # z^(i) for all examples at once, shape (1, m)
A = sigmoid(Z)                # predictions a^(i)
dZ = A - Y                    # dz^(i) = a^(i) - y^(i)
dw = np.dot(X, dZ.T) / m      # average gradient w.r.t. w, shape (n_x, 1)
db = np.sum(dZ) / m           # average gradient w.r.t. b
w = w - alpha * dw            # gradient descent update
b = b - alpha * db
Shallow Neural Networks
Neural Network Representation
* Each column is called a layer. There are three columns here: the input layer, a hidden layer, and the output layer (by convention the input layer is not counted, so this is a 2-layer NN).
* The input layer activations are $a^{[0]} = x$, the hidden layer activations are $a^{[1]}$, and the output layer produces $a^{[2]} = \hat{y}$:
$$
a^{[0]}=
\begin{bmatrix}
x_{1} \\
x_{2} \\
x_{3}
\end{bmatrix}
\quad
a^{[1]}=
\begin{bmatrix}
a_{1}^{[1]} \\
a_{2}^{[1]} \\
a_{3}^{[1]} \\
a_{4}^{[1]}
\end{bmatrix}
\quad
a^{[2]}=\hat{y}
$$
* The hidden layer and the output layer have parameters associated with them: $W^{[1]}, b^{[1]}$ for layer 1 and $W^{[2]}, b^{[2]}$ for layer 2.
Computation
Each node in the hidden layer repeats the same two-step computation as logistic regression: $z = w^T x + b$ followed by $a = \sigma(z)$.
For the entire network, node $i$ in layer 1 computes $z_i^{[1]} = w_i^{[1]T} x + b_i^{[1]}$ and $a_i^{[1]} = \sigma(z_i^{[1]})$.
We can also vectorize it over all the units in a layer:
$$ z^{[1]} = W^{[1]} x + b^{[1]}, \quad a^{[1]} = \sigma(z^{[1]}), \quad z^{[2]} = W^{[2]} a^{[1]} + b^{[2]}, \quad a^{[2]} = \sigma(z^{[2]}) $$
We can extend vectorization to multiple examples by stacking the $m$ columns:
$$ Z^{[1]} = W^{[1]} X + b^{[1]}, \quad A^{[1]} = \sigma(Z^{[1]}), \quad Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}, \quad A^{[2]} = \sigma(Z^{[2]}) $$
Note that actually $X = A^{[0]}$, so both layers follow exactly the same pattern.
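A minimal NumPy sketch of this vectorized forward pass for the 2-layer network above; the layer sizes are assumptions for illustration, and sigmoid is used in both layers to match the formulas (in practice the hidden layer would use tanh or ReLU, see the next section):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_x, n_1, m = 3, 4, 10                          # 3 inputs, 4 hidden units, 10 examples (made up)
X = np.random.randn(n_x, m)                     # A^[0] = X
W1, b1 = np.random.randn(n_1, n_x) * 0.01, np.zeros((n_1, 1))
W2, b2 = np.random.randn(1, n_1) * 0.01, np.zeros((1, 1))

Z1 = np.dot(W1, X) + b1                         # shape (n_1, m)
A1 = sigmoid(Z1)
Z2 = np.dot(W2, A1) + b2                        # shape (1, m)
A2 = sigmoid(Z2)                                # predictions y_hat for all m examples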
Activation functions
There are different activation functions you can use in a NN, and different layers may have different activation functions. We therefore write the activation as $a = g(z)$, where $g$ can be:
- tanh function: $g(z) = \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$. It almost always works better than sigmoid for hidden layers. A problem is that when $|z|$ is large, the gradient of both sigmoid and tanh goes to 0, which slows learning.
- ReLU function: $g(z) = \max(0, z)$. When $z$ is negative the gradient is 0; the gradient is 1 otherwise. Widely used as the default choice.
- Leaky ReLU: $g(z) = \max(0.01z, z)$, which keeps a small non-zero gradient when $z$ is negative.
Why is a Non-Linear Activation Function needed?
If we use a linear function as the activation function, or simply use no activation function at all, the output $\hat{y}$ is just a linear function of the input no matter how many hidden layers the network has, so the hidden layers contribute nothing.
Derivatives
- sigmoid: $g(z) = \sigma(z)$, so $g'(z) = \sigma(z)\big(1 - \sigma(z)\big) = a(1 - a)$
- tanh: $g(z) = \tanh(z)$, so $g'(z) = 1 - \tanh^2(z) = 1 - a^2$
- ReLU: $g(z) = \max(0, z)$, so $g'(z) = 0$ for $z < 0$ and $1$ for $z > 0$ (either value works at $z = 0$); see the code sketch below
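A small NumPy sketch of these activations and their derivatives, expressed in terms of $z$ (illustrative, not from the original notes):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    a = sigmoid(z)
    return a * (1 - a)            # g'(z) = a(1 - a)

def tanh_grad(z):
    a = np.tanh(z)
    return 1 - a ** 2             # g'(z) = 1 - a^2

def relu(z):
    return np.maximum(0, z)

def relu_grad(z):
    return (z > 0).astype(float)  # 0 for z < 0, 1 for z > 0

def leaky_relu(z):
    return np.maximum(0.01 * z, z)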
Neural network gradients
Similar to logistic regression, the gradients for the 2-layer network are:
$$
\begin{aligned}
dz^{[2]} &= a^{[2]} - y & dZ^{[2]} &= A^{[2]} - Y \\
dW^{[2]} &= dz^{[2]} a^{[1]T} & dW^{[2]} &= \tfrac{1}{m}\, dZ^{[2]} A^{[1]T} \\
db^{[2]} &= dz^{[2]} & db^{[2]} &= \tfrac{1}{m} \textstyle\sum_i dZ^{[2](i)} \\
dz^{[1]} &= W^{[2]T} dz^{[2]} * g^{[1]\prime}(z^{[1]}) & dZ^{[1]} &= W^{[2]T} dZ^{[2]} * g^{[1]\prime}(Z^{[1]}) \\
dW^{[1]} &= dz^{[1]} x^{T} & dW^{[1]} &= \tfrac{1}{m}\, dZ^{[1]} X^{T} \\
db^{[1]} &= dz^{[1]} & db^{[1]} &= \tfrac{1}{m} \textstyle\sum_i dZ^{[1](i)}
\end{aligned}
$$
The left column is the formula for a single training example and the right column is for the entire training set. Note the element-wise product $*$ with $g^{[1]\prime}(Z^{[1]})$ when computing $dZ^{[1]}$; a code sketch follows below.
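Continuing the forward-pass sketch above (sigmoid hidden layer, variable names assumed), the vectorized gradients might be computed as:

Y = (np.random.rand(1, m) > 0.5).astype(float)    # made-up labels, shape (1, m)

dZ2 = A2 - Y
dW2 = np.dot(dZ2, A1.T) / m
db2 = np.sum(dZ2, axis=1, keepdims=True) / m
dZ1 = np.dot(W2.T, dZ2) * (A1 * (1 - A1))         # element-wise product with sigma'(Z1)
dW1 = np.dot(dZ1, X.T) / m
db1 = np.sum(dZ1, axis=1, keepdims=True) / m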
Random initialization
Take a NN with 2 input features, 2 hidden units, and 1 output unit for example: if we initialize all the weights $W$ to zero, every hidden unit computes exactly the same function, and they remain symmetric after every gradient-descent update, so having multiple hidden units is pointless. (Initializing $b$ to zero is fine.)
Instead, we initialize the weights randomly with small values:
W1 = np.random.randn(2, 2) * 0.01   # small random values break the symmetry
b1 = np.zeros((2, 1))               # biases can safely start at zero
W2 = np.random.randn(1, 2) * 0.01   # * 0.01 keeps z small so sigmoid/tanh gradients stay large
b2 = np.zeros((1, 1))
Deep neural network
A deep neural network simply has more hidden layers.
- $L$ denotes the number of layers.
- $n^{[l]}$ denotes the number of units in layer $l$ (with $n^{[0]} = n_x$, the input dimension).
Forward propagation
Forward propagation in layer $l$ computes $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$ and $A^{[l]} = g^{[l]}(Z^{[l]})$, with $A^{[0]} = X$. Here we store $Z^{[l]}$ in a cache because it will be needed again during backward propagation.
The dimension of $W^{[l]}$ is $(n^{[l]}, n^{[l-1]})$, $b^{[l]}$ is $(n^{[l]}, 1)$, and $Z^{[l]}$ and $A^{[l]}$ are $(n^{[l]}, m)$.
Although we always try to get rid of explicit for loops, a for loop over the layers $l = 1, \ldots, L$ is still necessary, as in the sketch below.
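A minimal sketch of that loop, assuming a dictionary params holding W1, b1, ..., WL, bL, ReLU hidden activations, and a sigmoid output (all names here are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params, L):
    """Forward propagation through L layers; returns predictions and caches."""
    caches = []
    A = X                                                 # A^[0] = X
    for l in range(1, L + 1):                             # explicit loop over the layers
        W, b = params["W" + str(l)], params["b" + str(l)]
        Z = np.dot(W, A) + b                              # Z^[l] = W^[l] A^[l-1] + b^[l]
        caches.append((A, Z))                             # stored for backward propagation
        A = sigmoid(Z) if l == L else np.maximum(0, Z)    # ReLU hidden, sigmoid output
    return A, caches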
Backward propagation
In one iteration, we first use forward propagation to get the prediction $\hat{y} = A^{[L]}$ and the cached values, then use backward propagation to compute the gradients. Layer $l$ receives $dA^{[l]}$ and computes $dZ^{[l]} = dA^{[l]} * g^{[l]\prime}(Z^{[l]})$, $dW^{[l]} = \frac{1}{m} dZ^{[l]} A^{[l-1]T}$, $db^{[l]} = \frac{1}{m} \sum_i dZ^{[l](i)}$, and $dA^{[l-1]} = W^{[l]T} dZ^{[l]}$; finally, every $W^{[l]}$ and $b^{[l]}$ is updated by gradient descent.
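Continuing the forward-propagation sketch above, a hedged sketch of the backward loop (sigmoid output with cross-entropy loss, ReLU hidden layers, illustrative names):

def backward(AL, Y, caches, params, L):
    """Backward propagation; returns gradients dW^[l], db^[l] for every layer."""
    m = Y.shape[1]
    grads = {}
    dZ = AL - Y                                   # dZ^[L] for sigmoid + cross-entropy
    for l in range(L, 0, -1):
        A_prev, Z = caches[l - 1]
        grads["dW" + str(l)] = np.dot(dZ, A_prev.T) / m
        grads["db" + str(l)] = np.sum(dZ, axis=1, keepdims=True) / m
        if l > 1:                                 # propagate to the previous layer
            dA_prev = np.dot(params["W" + str(l)].T, dZ)
            _, Z_prev = caches[l - 2]
            dZ = dA_prev * (Z_prev > 0)           # ReLU derivative
    return grads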
Hyperparameters
- Parameters: $W^{[l]}$ and $b^{[l]}$, the values the network actually learns.
- Hyperparameters: settings that control the final values of those parameters, such as:
  - learning rate $\alpha$
  - number of iterations
  - number of hidden layers $L$
  - number of hidden units $n^{[l]}$
  - choice of activation functions