Deep Learning vs. Machine Learning
The major difference between Deep Learning and Machine Learning techniques is the problem-solving approach. Deep Learning techniques tend to solve a problem end to end, whereas Machine Learning techniques require the problem to be broken down into parts, solved separately, and then combined at the final stage.
Forward Propagation
The general procedure is the following:
$$ \begin{aligned} a^{(1)}(x) &= w^{(1)} \cdot x + b^{(1)} \\ h^{(1)}(x) &= g_1(a^{(1)}(x)) \\ a^{(2)}(x) &= w^{(2)} \cdot h^{(1)}(x) + b^{(2)} \\ h^{(2)}(x) &= g_2(a^{(2)}(x)) \\ &\;\;\vdots \\ a^{(L+1)}(x) &= w^{(L+1)} \cdot h^{(L)}(x) + b^{(L+1)} \\ h^{(L+1)}(x) &= g_{L+1}(a^{(L+1)}(x)) \end{aligned} $$
Note:
- $w^{(i)}$ has dimension: (# of (hidden) units in layer $i$) $\times$ (# of (hidden) units in layer $i-1$), so no transpose is needed in the products above.
- $b^{(i)}, a^{(i)}, h^{(i)}$ have the same dimension: (# of (hidden) units in layer $i$, $1$).
- $g_{i}$ is an activation function; sigmoid, tanh, and ReLU are common choices. In the last layer, the choice of activation function $g_{L+1}$ depends on the problem, and is usually sigmoid or softmax.
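To make the procedure above concrete, here is a minimal NumPy sketch of the forward pass (the tanh hidden activations and softmax output are illustrative choices, not prescribed by the formulas):

```python
import numpy as np

def softmax(a):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(x, weights, biases):
    """Forward pass. weights[k] has shape (# units in layer k+1, # units in
    layer k), matching the dimension note above; x plays the role of h^{(0)}(x)."""
    activations, hiddens = [], [x]
    for k, (w, b) in enumerate(zip(weights, biases)):
        a = w @ hiddens[-1] + b                             # a^{(k+1)}(x)
        g = softmax if k == len(weights) - 1 else np.tanh   # illustrative g_k
        activations.append(a)
        hiddens.append(g(a))                                # h^{(k+1)}(x)
    return activations, hiddens

# Tiny usage example: a 3-2-4 network with random parameters.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((2, 3)), rng.standard_normal((4, 2))]
biases = [rng.standard_normal(2), rng.standard_normal(4)]
_, hiddens = forward(rng.standard_normal(3), weights, biases)
print(hiddens[-1].sum())  # softmax output sums to 1
```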
Loss Functions of Neural Networks
Common loss functions are the cross-entropy loss, hinge loss, triplet loss, etc. In fact, depending on the specific problem, we can define arbitrary loss functions; we can even use AUC as a loss.
In this post, we focus on the cross-entropy loss.
For example, let the output of the neural network be $(0.6, 0.2, 0.1, 0.1)$ and the true label be $(0, 1, 0, 0)$; then we can write the cross-entropy loss as
$$ \ell \left(\left[\begin{array}{l} 0.6 \\ 0.2 \\ 0.1 \\ 0.1 \end{array}\right],\left[\begin{array}{l} 0 \\ 1 \\ 0 \\ 0 \end{array}\right]\right) = -(0 \cdot \log 0.6 + 1 \cdot \log 0.2 + 0 \cdot \log 0.1 + 0 \cdot \log 0.1) = -\log 0.2 $$
From now on, we call the output of the neural network $f(x)$, and the true label w.r.t. $x$ is $y$; then the cross-entropy loss is written as
$$\ell \left( f(x), y\right) = - \log f(x)_y$$
where $f(x)_y$ is the $y$-th entry of $f(x)$.
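As a quick NumPy sketch (the function name is ours), the example above evaluates to $-\log 0.2 \approx 1.609$:

```python
import numpy as np

def cross_entropy(f_x, y):
    """Cross-entropy loss -log f(x)_y, with y the index of the true class."""
    return -np.log(f_x[y])

f_x = np.array([0.6, 0.2, 0.1, 0.1])
print(cross_entropy(f_x, y=1))  # -log 0.2 ≈ 1.609
```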
Back-propagation
In fitting a neural network, backpropagation computes the gradient of the loss function with respect to the weights of the network. The backpropagation algorithm works by computing the gradient of the loss function with respect to each weight by the chain rule, computing the gradient one layer at a time, iterating backward from the last layer to avoid redundant calculations of intermediate terms in the chain rule.
In the following, we compute $\frac{\partial \ell}{\partial f(x)}, \frac{\partial \ell}{\partial a^{(L+1)}(x)}, \frac{\partial \ell}{\partial h^{(k)}(x)}, \frac{\partial \ell}{\partial a^{(k)}(x)}, \frac{\partial \ell}{\partial w^{(k)}}, \frac{\partial \ell}{\partial b^{(k)}}$, and then summarize the back-propagation procedure.
compute $\frac{\partial \ell}{\partial f(x)}$
First consider a single element: $$ \begin{aligned} \frac{\partial \ell}{\partial f(x)_j} &= \frac{\partial -\log f(x)_y}{\partial f(x)_j} \\ &= \frac{-1}{f(x)_y} \cdot \frac{\partial f(x)_y}{\partial f(x)_j} \\ &= \frac{-1\cdot I(y=j)}{f(x)_y} \end{aligned} $$
Put it into vector form:
$$ \frac{\partial \ell}{\partial f(x)} = \frac{\partial -\log f(x)_y}{\partial f(x)} = \frac{-1}{f(x)_y} \cdot\left(\begin{array}{c} I(y=1) \\ I(y=2) \\ \vdots \\ I(y=n) \end{array}\right) = \frac{-e(y)}{f(x)_y} \tag 1 $$
where $e(y)$ is the one-hot encoding of label $y$.
compute $\frac{\partial \ell}{\partial a^{(L+1)}(x)}$
First consider a single element:
$$ \begin{aligned} \frac{\partial \ell}{\partial a^{(L+1)}(x)_j} &= \frac{\partial -\log f(x)_y}{\partial a^{(L+1)}(x)_j} \\ &= \frac{-1}{f(x)_y} \cdot \frac{\partial f(x)_y}{\partial a^{(L+1)}(x)_j}\\ &= \frac{-1}{f(x)_y} \cdot \frac{\partial}{\partial a^{(L+1)}(x)_j} \frac{\exp (a^{(L+1)}(x)_y)}{\sum_{j'} \exp (a^{(L+1)}(x)_{j'})} \\ &= f(x)_j - I(y=j) \end{aligned} $$
Put it into vector form:
$$\frac{\partial \ell}{\partial a^{(L+1)}(x)} = \left(\begin{array}{c} f(x)_1 - I(y=1) \\ f(x)_2 - I(y=2) \\ \vdots \\ f(x)_n - I(y=n) \end{array}\right) = f(x) - e(y) \tag 2 $$
Note that here, we assume the last activation function is softmax.
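Equation (2) is easy to verify numerically; here is a minimal NumPy sketch (the test values and `eps` are illustrative):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

a = np.array([1.0, 2.0, 0.5, -1.0])        # illustrative a^{(L+1)}(x)
y = 1
analytic = softmax(a) - np.eye(len(a))[y]  # f(x) - e(y), as in (2)

# Centered finite differences of -log softmax(a)_y, one coordinate at a time.
eps = 1e-6
numeric = np.zeros_like(a)
for j in range(len(a)):
    d = np.zeros_like(a)
    d[j] = eps
    numeric[j] = (-np.log(softmax(a + d)[y]) + np.log(softmax(a - d)[y])) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))  # True
```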
compute $\frac{\partial \ell}{\partial h^{(k)}(x)}$
$$ \begin{aligned} \frac{\partial \ell}{\partial h^{(k)}(x)_j} &= \sum_i \frac{\partial \ell}{\partial a^{(k+1)}(x)_i} \cdot \frac{\partial a^{(k+1)}(x)_i}{\partial h^{(k)}(x)_j} \\ &= \sum_i \frac{\partial \ell}{\partial a^{(k+1)}(x)_i} \cdot \frac{\partial \left( \sum_{j'} w_{ij'}^{(k+1)}h^{(k)}(x)_{j'} + b^{(k+1)}_i \right)}{\partial h^{(k)}(x)_j} \\ &= \sum_i \frac{\partial \ell}{\partial a^{(k+1)}(x)_i} \cdot w_{ij}^{(k+1)} \\ &= \left( w^{(k+1)^T} \cdot \frac{\partial \ell}{\partial a^{(k+1)}(x)} \right)_j \end{aligned} $$
where $a^{(k+1)}(x) = w^{(k+1)}h^{(k)}(x) + b^{(k+1)}$.
Put it into vector form:
$$ \frac{\partial \ell}{\partial h^{(k)}(x)} = w^{(k+1)^T} \cdot \frac{\partial -\log f(x)_y}{\partial a^{(k+1)}(x)} \tag 3$$
compute $\frac{\partial \ell}{\partial a^{(k)}(x)}$
$$ \begin{aligned} \frac{\partial \ell}{\partial a^{(k)}(x)_j} &= \frac{\partial \ell}{\partial h^{(k)}(x)_j} \cdot \frac{\partial h^{(k)}(x)_j}{\partial a^{(k)}(x)_j} \\ &= \frac{\partial \ell}{\partial h^{(k)}(x)_j} \cdot g'_k(a^{(k)}(x)_j) \end{aligned} $$
where $ h^{(k)}(x) = g_k(a^{(k)}(x))$. Now put it into vector form:
$$ \begin{aligned} \frac{\partial \ell}{\partial a^{(k)}(x)} &= \left( \frac{\partial \ell}{\partial h^{(k)}(x)_1} \cdot g'_k(a^{(k)}(x)_1), \ \dots \ ,\ \frac{\partial \ell}{\partial h^{(k)}(x)_n} \cdot g'_k(a^{(k)}(x)_n) \right) \\ &= \left(\begin{array}{c} \frac{\partial \ell}{\partial h^{(k)}(x)_1} \\ \vdots \\ \frac{\partial \ell}{\partial h^{(k)}(x)_n} \end{array}\right) \odot \left(\begin{array}{c} g'_k(a^{(k)}(x)_1) \\ \vdots \\ g'_k(a^{(k)}(x)_n) \end{array}\right) \\ &= \frac{\partial \ell}{\partial h^{(k)}(x)} \odot g'_k(a^{(k)}(x)) \end{aligned} \tag 4 $$
where “$\odot$” means the element-wise product.
compute $\frac{\partial \ell}{\partial w^{(k)}}$
$$ \begin{aligned} \frac{\partial \ell}{\partial w_{ij}^{(k)}} &= \frac{\partial \ell}{\partial a^{(k)}(x)_i} \cdot \frac{\partial a^{(k)}(x)_i}{\partial w_{ij}^{(k)}} \\ &= \frac{\partial \ell}{\partial a^{(k)}(x)_i} \cdot \frac{\partial \left( \sum_{j'} w_{ij'}^{(k)}h^{(k-1)}(x)_{j'} + b^{(k)}_i \right)}{\partial w_{ij}^{(k)}} \\ &= \frac{\partial \ell}{\partial a^{(k)}(x)_i} \cdot h^{(k-1)}(x)_j \end{aligned} $$
Put it into vector form:
$$ \begin{aligned} \frac{\partial \ell}{\partial w^{(k)}} &= \frac{\partial -\log f(x)_y}{\partial a^{(k)}(x)} \cdot ( h^{(k-1)}(x))^T \\ &= \left(\begin{array}{c} \frac{\partial -\log f(x)_y}{\partial a^{(k)}(x)_1} \\ \vdots \\ \frac{\partial -\log f(x)_y}{\partial a^{(k)}(x)_m} \end{array}\right) \cdot \left( h^{(k-1)}(x)_1, \dots, h^{(k-1)}(x)_n \right) \end{aligned} \tag 5 $$
compute $\frac{\partial \ell}{\partial b^{(k)}}$
$$ \begin{aligned} \frac{\partial \ell}{\partial b_{i}^{(k)}} &= \frac{\partial \ell}{\partial a^{(k)}(x)_i} \cdot \frac{\partial a^{(k)}(x)_i}{\partial b_i^{(k)}} \\ &= \frac{\partial \ell}{\partial a^{(k)}(x)_i} \cdot 1 \\ &= \frac{\partial \ell}{\partial a^{(k)}(x)_i} \end{aligned} $$
Put it into vector form:
$$ \frac{\partial \ell}{\partial b^{(k)}} = \frac{\partial \ell}{\partial a^{(k)}(x)} = \frac{\partial -\log f(x)_y}{\partial a^{(k)}(x)} \tag 6 $$
Back-propagation Procedure (using SGD) Summary
For each training example $(x,y)$:
- Forward propagation: compute $a^{(k)}(x)$, $h^{(k)}(x)$, and the loss.
- Back propagation:
- Compute output gradient: $$ \nabla_{a^{(L+1)}(x)} -\log f(x)_y = f(x) - e(y)$$
- for $k = L+1$ to $1$:
- Compute the gradients of parameters: $$ \begin{aligned} \nabla_{w^{(k)}} -\log f(x)_y &= (\nabla_{a^{(k)}(x)} -\log f(x)_y) \cdot (h^{(k-1)}(x))^T \\ \nabla_{b^{(k)}} -\log f(x)_y &= \nabla_{a^{(k)}(x)} -\log f(x)_y \end{aligned} $$
- Compute the gradients of hidden layers: $$ \begin{aligned} \nabla_{h^{(k-1)}(x)} -\log f(x)_y &= (w^{(k)})^T \cdot (\nabla_{a^{(k)}(x)} -\log f(x)_y) \\ \nabla_{a^{(k-1)}(x)} -\log f(x)_y &= (\nabla_{h^{(k-1)}(x)} -\log f(x)_y) \odot g'_{k-1}(a^{(k-1)}(x)) \end{aligned} $$
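Putting equations (1)-(6) together, here is a minimal NumPy sketch of one SGD step (tanh hidden layers, softmax output, and the learning rate `lr` are illustrative assumptions):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def sgd_step(x, y, weights, biases, lr=0.1):
    """One SGD step on a single example (x, y); y is the true class index."""
    # Forward propagation, caching h^{(0)}(x) = x, ..., h^{(L+1)}(x) = f(x).
    hiddens = [x]
    for k, (w, b) in enumerate(zip(weights, biases)):
        a = w @ hiddens[-1] + b
        hiddens.append(softmax(a) if k == len(weights) - 1 else np.tanh(a))

    # Output gradient: f(x) - e(y), as in the first step of the summary.
    grad_a = hiddens[-1] - np.eye(len(hiddens[-1]))[y]

    # Backward loop, k = L+1 down to 1 (0-based indices here).
    for k in reversed(range(len(weights))):
        grad_w = np.outer(grad_a, hiddens[k])         # gradient w.r.t. w^{(k)}, eq. (5)
        grad_b = grad_a                               # gradient w.r.t. b^{(k)}, eq. (6)
        if k > 0:
            grad_h = weights[k].T @ grad_a            # gradient w.r.t. h^{(k-1)}, eq. (3)
            grad_a = grad_h * (1.0 - hiddens[k] ** 2) # eq. (4): tanh' = 1 - tanh^2
        weights[k] -= lr * grad_w
        biases[k] -= lr * grad_b
```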
Debugging: Gradient Checking
Gradient computation can be notoriously difficult to debug and get right; even with a buggy implementation, it may not be at all apparent that anything is amiss. Gradient checking is a method for numerically checking the derivatives computed by your code, to make sure that your implementation is correct.
Recall the mathematical definition of the derivative as:
$$ \frac{d}{d\theta}J(\theta) = \lim_{\epsilon \rightarrow 0} \frac{J(\theta+ \epsilon) - J(\theta-\epsilon)}{2 \epsilon}$$
Suppose we have a function $g_i(\theta)$ that computes $\textstyle \frac{\partial}{\partial \theta_i} J(\theta)$; we'd like to check that $g_i$ outputs correct derivative values. Let $e_i$ be the $i$-th standard basis vector; we choose a very small $\epsilon$ and a threshold $\delta$, and check whether the following is true:
$$ \left| \frac{J(\theta+ \epsilon e_i) - J(\theta-\epsilon e_i)}{2 \epsilon} - g_i(\theta) \right| < \delta$$
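A minimal NumPy sketch of this check (the `eps` and `delta` defaults and the quadratic test function are illustrative):

```python
import numpy as np

def gradient_check(J, grad, theta, eps=1e-4, delta=1e-6):
    """Compare the analytic gradient grad(theta) against centered
    differences of J, one coordinate at a time."""
    g = grad(theta)
    for i in range(len(theta)):
        e_i = np.zeros_like(theta)
        e_i[i] = eps
        numeric = (J(theta + e_i) - J(theta - e_i)) / (2 * eps)
        if abs(numeric - g[i]) >= delta:
            return False
    return True

# Example: J(theta) = sum(theta^2) has gradient 2 * theta.
theta = np.array([1.0, -2.0, 3.0])
print(gradient_check(lambda t: np.sum(t ** 2), lambda t: 2 * t, theta))  # True
```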
Dropout
Dropout refers to ignoring a randomly chosen set of units during the training phase: "ignoring" means these units are not considered during a particular forward or backward pass. More technically, at each training stage, individual nodes are either dropped out of the net with probability $1-p$ or kept with probability $p$, so that a reduced network is left; incoming and outgoing edges of a dropped-out node are also removed.
Dropout is a regularization technique for neural networks that helps reduce interdependent learning (co-adaptation) among neurons.
Training Phase & Testing Phase:
For each hidden layer, for each training sample, and for each iteration, ignore (zero out) a random fraction of the nodes (and the corresponding activations); with the convention above, each node is dropped with probability $1-p$.
In the testing phase, we do not use dropout; to keep expected activations consistent, the kept activations are scaled (by $p$ at test time, or equivalently by $1/p$ at training time in the "inverted dropout" variant).
Note: Dropout roughly doubles the number of iterations required to converge. However, training time for each epoch is less.
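A minimal NumPy sketch of the training-phase masking, using the "inverted dropout" variant mentioned above (scaling by $1/p$ at training time so the test-phase forward pass needs no change; `p` is the keep probability):

```python
import numpy as np

def dropout_forward(h, p=0.8, train=True):
    """Inverted dropout on activations h: each unit is kept with probability p
    and scaled by 1/p, so expected activations are unchanged."""
    if not train:
        return h                               # no dropout in the testing phase
    mask = (np.random.rand(*h.shape) < p) / p  # 0 for dropped units, 1/p for kept
    return h * mask
```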
Early Stopping
Early stopping is a form of regularization used to avoid overfitting when training a learner with an iterative method, such as gradient descent. Such methods update the learner so as to make it better fit the training data with each iteration. Up to a point, this improves the learner’s performance on data outside of the training set. Past that point, however, improving the learner’s fit to the training data comes at the expense of increased generalization error. Early stopping rules provide guidance as to how many iterations can be run before the learner begins to over-fit.
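A minimal sketch of a patience-based early-stopping rule (illustrative; `train_step` and `validation_loss` are caller-supplied placeholders for one epoch of training and the loss on held-out data):

```python
def train_with_early_stopping(train_step, validation_loss, max_epochs=1000, patience=10):
    """Run train_step() once per epoch; stop when validation_loss() has not
    improved for `patience` consecutive epochs."""
    best_loss, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_step()
        loss = validation_loss()
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # still generalizing better
        elif epoch - best_epoch >= patience:
            break                                # past the point of overfitting
    return best_loss, best_epoch
```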
SGD Convergence
Theoretically, SGD converges if the learning-rate schedule satisfies
$$ \sum_{t=1}^{+\infty} \eta_t = +\infty \tag 7$$ $$\sum_{t=1}^{+\infty} \eta_t^2 < \infty \tag 8$$
where $\eta_t$ is the learning rate at step $t$. For example, the schedule $\eta_t = \eta_0 / t$ satisfies both conditions, since $\sum_t 1/t$ diverges while $\sum_t 1/t^2$ converges.
References:
- https://en.wikipedia.org/wiki/Backpropagation
- http://deeplearning.stanford.edu/tutorial/supervised/DebuggingGradientChecking/
- https://medium.com/@amarbudhiraja/https-medium-com-amarbudhiraja-learning-less-to-learn-better-dropout-in-deep-machine-learning-74334da4bfc5
- https://en.wikipedia.org/wiki/Early_stopping
- https://www.researchgate.net/figure/Cross-validation-Early-stopping-Principe-2000_fig4_228469923
- https://towardsdatascience.com/why-deep-learning-is-needed-over-traditional-machine-learning-1b6a99177063