Why deeper, not fatter?
Analogy: using multiple layers of neurons (like logic gates), some functions can be represented much more simply.
Deeper layers -> modularization (more kinds of modules can be composed, each needing less training data)
Lower layers = basic modules = shared by many higher-level modules, so each gets enough training data
Bad training results?
Choosing a proper loss
softmax output layer -> cross entropy
Its partial derivative with respect to $w_j$ is:
$\begin{eqnarray} \frac{\partial C}{\partial w_j} & = & \frac{1}{n} \sum_x \frac{\sigma'(z) x_j}{\sigma(z) (1-\sigma(z))} (\sigma(z)-y). \tag{60}\end{eqnarray}$
Because $\sigma'(z) = \sigma(z)(1-\sigma(z))$ for the sigmoid, the terms cancel and we are left with
$\begin{eqnarray} \frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j(\sigma(z)-y). \tag{61}\end{eqnarray}$
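As a quick sanity check, here is a minimal NumPy sketch (variable names are my own) that verifies Eq. (61) for a single sigmoid neuron and a single training example: the analytic gradient $x_j(\sigma(z)-y)$ matches a finite-difference estimate of the cross-entropy cost.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, x, y):
    # C = -[y ln a + (1 - y) ln(1 - a)] for a single neuron a = sigmoid(w . x)
    a = sigmoid(np.dot(w, x))
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

np.random.seed(0)
w, x, y = np.random.randn(3), np.random.randn(3), 1.0

# Analytic gradient from Eq. (61): x_j * (sigmoid(z) - y)
grad_analytic = x * (sigmoid(np.dot(w, x)) - y)

# Central finite-difference estimate of dC/dw_j
eps = 1e-6
grad_numeric = np.array([(cross_entropy(w + eps * e, x, y) -
                          cross_entropy(w - eps * e, x, y)) / (2 * eps)
                         for e in np.eye(3)])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))  # True
```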
mini-batch
- Split the training data into mini-batches (e.g. n = 1000 examples each) and update the parameters once per mini-batch; each update uses only that mini-batch's loss, not the loss over the whole training set.
- If there are 20 mini-batches, the parameters are updated 20 times in one epoch. (Shuffle the training examples at the start of each epoch; see the sketch below.)
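A sketch of this update schedule; the gradient function is passed in as a placeholder, and none of these names come from the notes.

```python
import numpy as np

def minibatch_sgd(X, Y, params, grad_fn, lr=0.01, batch_size=1000, epochs=10):
    """grad_fn(params, X_batch, Y_batch) returns the gradient of the mini-batch loss."""
    n = len(X)
    for epoch in range(epochs):
        order = np.random.permutation(n)                 # shuffle every epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            g = grad_fn(params, X[batch], Y[batch])      # loss of this mini-batch only
            params = params - lr * g                     # one update per mini-batch
    return params
```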
Vanishing Gradient Problem
Layers near the input: smaller gradients -> learn very slowly -> still almost random
Layers near the output: larger gradients -> learn very fast -> already converged
ReLU
- Fast to compute
- Biological reason
- Infinite sigmoid with different biases
- Addresses the vanishing gradient problem:
$\dfrac{\partial{C}}{\partial{b_{1}}}=\sigma^{\prime}(z_{1})\,w_{2}\,\sigma^{\prime}(z_{2})\,w_{3}\,\sigma^{\prime}(z_{3})\,w_{4}\,\sigma^{\prime}(z_{4})\,\dfrac{\partial{C}}{\partial{a_{4}}}$
where $\sigma^{\prime}(z)$ is the derivative of the activation function at each layer. For the sigmoid, $\sigma^{\prime}(z) \le 1/4$, so the product shrinks as more layers are added; for ReLU, $\sigma^{\prime}(z) = 1$ in the active region, so the gradient is passed back without shrinking.
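A toy numeric illustration of that product (the pre-activations and unit weights are arbitrary choices of mine): with sigmoid, every $\sigma'(z) \le 0.25$, so the gradient reaching $b_1$ shrinks with each extra layer; with ReLU in its active region, $\sigma'(z)=1$ and the product stays intact.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)                  # at most 0.25, anywhere

def relu_grad(z):
    return (z > 0).astype(float)        # exactly 1 in the active region

zs = np.array([0.5, 1.0, 2.0, 0.3])     # toy pre-activations z_1..z_4
ws = np.ones(4)                         # keep the weights at 1 to isolate sigma'

print(np.prod(ws * sigmoid_grad(zs)))   # ~1.2e-3: the gradient at b_1 is tiny
print(np.prod(ws * relu_grad(zs)))      # 1.0: no shrinkage through the chain
```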
Adaptive Learning Rate
If the learning rate is too large, the total loss may not decrease after each update.
If the learning rate is too small, training will be too slow.
A popular and simple idea: reduce the learning rate by some factor every few epochs, e.g. 1/t decay, $\eta^{t} = \eta / \sqrt{t+1}$.
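Two common schedules as a minimal sketch; the 1/t form and the step factors here are common defaults, not prescribed by the notes.

```python
import numpy as np

def one_over_t_decay(lr0, t):
    # eta_t = eta_0 / sqrt(t + 1): the learning rate shrinks as training proceeds
    return lr0 / np.sqrt(t + 1)

def step_decay(lr0, epoch, factor=0.5, every=10):
    # cut the learning rate by `factor` every `every` epochs
    return lr0 * (factor ** (epoch // every))
```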
Adagrad
The learning rate gets smaller and smaller for every parameter, because the accumulated squared gradients only grow.
Parameters with smaller past derivatives get a larger effective learning rate, and vice versa.
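A minimal Adagrad sketch matching the two points above (names are mine): the per-parameter cache of squared gradients only grows, so every effective learning rate shrinks over time, while parameters whose past derivatives were small keep a comparatively larger one.

```python
import numpy as np

def adagrad_update(w, grad, cache, lr=0.01, eps=1e-8):
    cache = cache + grad ** 2                      # accumulated squared gradients, per parameter
    w = w - lr * grad / (np.sqrt(cache) + eps)     # small past gradients -> larger effective lr
    return w, cache
```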
Momentum
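Momentum keeps a velocity term so that past gradients keep pushing the update in a consistent direction, which helps roll through plateaus and small local dips. A sketch; the 0.9 coefficient is a common default of mine, not taken from the notes.

```python
def momentum_update(w, grad, velocity, lr=0.01, beta=0.9):
    velocity = beta * velocity - lr * grad   # remember the previous movement direction
    w = w + velocity                         # move by accumulated momentum + current step
    return w, velocity
```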
Bad testing results? Overfitting?
Early Stopping
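A sketch of early stopping with a patience counter; `train_one_epoch` and `validation_loss` are placeholder callables supplied by the user.

```python
def train_with_early_stopping(train_one_epoch, validation_loss, patience=5, max_epochs=100):
    """Stop once the validation loss has not improved for `patience` epochs."""
    best_loss, best_epoch, wait = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_one_epoch()                     # one pass over the training set
        loss = validation_loss()              # loss on a held-out validation set
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0   # new best: checkpoint here
        else:
            wait += 1
            if wait >= patience:
                break
    return best_epoch, best_loss
```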
L2 regularization
Take L2 regularization as an example. Machine learning reduces the error by adjusting the parameters theta, but while reducing the error, the strongly non-linear parameters, such as the theta 4 sitting next to x^3, get modified the most, because strongly non-linear parameters make the curve more flexible and thus fit the scattered data points better. Theta 4 says: "Look how capable I am, let me reshape the model and fit all the data by myself." But this attitude provokes a strong pushback from the (regularized) error function: "No, no, no, we are a team. You may be powerful, but we cannot rely on you alone; if you happen to be wrong, the whole team's performance suddenly drops. I have to hold back anyone on the team who tries to steal the show." That is the core idea behind regularization. So what is the difference between L1 and L2 regularization?
L1:
Encourages the weights to be as sparse as possible; similar in spirit to dropout.
Weight Decay
$w \leftarrow 0.99\,w - \eta \cdot \text{gradient}$
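The weight-decay update above is exactly what gradient descent gives when an L2 penalty is added to the loss: the extra term $\lambda w$ shrinks every weight by a constant factor each step, whereas L1's extra term $\lambda\,\mathrm{sign}(w)$ pushes weights toward exactly zero (sparsity). A sketch with placeholder hyperparameters:

```python
import numpy as np

def l2_update(w, grad, lr=0.01, lam=0.01):
    # gradient of (loss + lam/2 * ||w||^2) adds lam * w:
    # w <- (1 - lr*lam) * w - lr * grad, i.e. the 0.99*w weight-decay form when lr*lam = 0.01
    return w - lr * (grad + lam * w)

def l1_update(w, grad, lr=0.01, lam=0.01):
    # gradient of (loss + lam * ||w||_1) adds lam * sign(w): drives weights to exactly zero
    return w - lr * (grad + lam * np.sign(w))
```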
Dropout
Each time before updating the parameters, each neuron has a p% chance of being dropped out.
The structure of the network is thereby changed.
Use the new, thinned network for training.
For each mini-batch, we resample the dropout neurons
If the dropout rate during training is p%, multiply all the weights by (1-p)% at testing time.
Reason
When working in a team, if everyone expects their partners to do the work, nothing gets done in the end.
However, if you know your partner might drop out, you will do better yourself.
At testing time, no one actually drops out, so the results end up even better.
Dropout is a kind of ensemble
Train a bunch of networks with different structures
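A minimal sketch of the scheme above: during training each unit is dropped with probability p and a fresh mask is sampled for every mini-batch; at testing time nothing is dropped and the activations are scaled by (1-p), which is equivalent to scaling the outgoing weights by (1-p).

```python
import numpy as np

def dropout_forward(a, p=0.5, training=True):
    if training:
        mask = (np.random.rand(*a.shape) >= p)   # each unit dropped with probability p
        return a * mask                          # resample this mask for every mini-batch
    # testing: keep every unit, scale by (1 - p) to match the expected training-time output
    return a * (1 - p)
```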