Why deeper, not fatter?
Analogy: using multiple layers of neurons (like logic gates), some functions can be represented much more simply.
Deeper layers -> modularization (more kinds of modules can be composed, each needing less training data)
Lower layers = basic modules = shared by many higher-level modules, so each gets enough training data
Bad training results?
Choosing a proper loss
softmax output layer -> cross entropy
Its partial derivative with respect to $w_j$ is:
$\begin{eqnarray} \frac{\partial C}{\partial w_j} & = & \frac{1}{n} \sum_x \frac{\sigma'(z) x_j}{\sigma(z) (1-\sigma(z))} (\sigma(z)-y). \tag{60}\end{eqnarray}$
Because $\sigma'(z) = \sigma(z)(1-\sigma(z))$ for the sigmoid, the terms cancel and we are left with
$\begin{eqnarray} \frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j(\sigma(z)-y). \tag{61}\end{eqnarray}$
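As a quick sanity check, here is a minimal NumPy sketch (variable names are my own) that verifies Eq. (61) for a single sigmoid neuron and a single training example: the analytic gradient $x_j(\sigma(z)-y)$ matches a finite-difference estimate of the cross-entropy cost.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, x, y):
    # C = -[y ln a + (1 - y) ln(1 - a)] for a single neuron a = sigmoid(w . x)
    a = sigmoid(np.dot(w, x))
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

np.random.seed(0)
w, x, y = np.random.randn(3), np.random.randn(3), 1.0

# Analytic gradient from Eq. (61): x_j * (sigmoid(z) - y)
grad_analytic = x * (sigmoid(np.dot(w, x)) - y)

# Central finite-difference estimate of dC/dw_j
eps = 1e-6
grad_numeric = np.array([(cross_entropy(w + eps * e, x, y) -
                          cross_entropy(w - eps * e, x, y)) / (2 * eps)
                         for e in np.eye(3)])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))  # True
```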
mini-batch
- Split the training data into mini-batches (e.g. n = 1000 examples each) and update the parameters once per mini-batch; each update uses only that mini-batch's loss, not the loss over the whole training set.
- If there are 20 mini-batches, the parameters are updated 20 times in one epoch. (Shuffle the training examples at the start of each epoch; see the sketch below.)
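A sketch of this update schedule; the gradient function is passed in as a placeholder, and none of these names come from the notes.

```python
import numpy as np

def minibatch_sgd(X, Y, params, grad_fn, lr=0.01, batch_size=1000, epochs=10):
    """grad_fn(params, X_batch, Y_batch) returns the gradient of the mini-batch loss."""
    n = len(X)
    for epoch in range(epochs):
        order = np.random.permutation(n)                 # shuffle every epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            g = grad_fn(params, X[batch], Y[batch])      # loss of this mini-batch only
            params = params - lr * g                     # one update per mini-batch
    return params
```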
Vanishing Gradient Problem
Layers near the input: smaller gradients -> learn very slowly -> still almost random
Layers near the output: larger gradients -> learn very fast -> already converged
ReLU
- Fast to compute
- Biological reason
- Infinite sigmoid with different biases
- Addresses the vanishing gradient problem:
$\dfrac{\partial{C}}{\partial{b_{1}}}=\sigma^{\prime}(z_{1})\,w_{2}\,\sigma^{\prime}(z_{2})\,w_{3}\,\sigma^{\prime}(z_{3})\,w_{4}\,\sigma^{\prime}(z_{4})\,\dfrac{\partial{C}}{\partial{a_{4}}}$
where $\sigma^{\prime}(z)$ is the derivative of the activation function at each layer. For the sigmoid, $\sigma^{\prime}(z) \le 1/4$, so the product shrinks as more layers are added; for ReLU, $\sigma^{\prime}(z) = 1$ in the active region, so the gradient is passed back without shrinking.
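A toy numeric illustration of that product (the pre-activations and unit weights are arbitrary choices of mine): with sigmoid, every $\sigma'(z) \le 0.25$, so the gradient reaching $b_1$ shrinks with each extra layer; with ReLU in its active region, $\sigma'(z)=1$ and the product stays intact.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1 - s)                  # at most 0.25, anywhere

def relu_grad(z):
    return (z > 0).astype(float)        # exactly 1 in the active region

zs = np.array([0.5, 1.0, 2.0, 0.3])     # toy pre-activations z_1..z_4
ws = np.ones(4)                         # keep the weights at 1 to isolate sigma'

print(np.prod(ws * sigmoid_grad(zs)))   # ~1.2e-3: the gradient at b_1 is tiny
print(np.prod(ws * relu_grad(zs)))      # 1.0: no shrinkage through the chain
```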
Adaptive Learning Rate
If the learning rate is too large, the total loss may not decrease after each update.
If the learning rate is too small, training will be too slow.
A popular and simple idea: reduce the learning rate by some factor every few epochs, e.g. 1/t decay, $\eta^{t} = \eta / \sqrt{t+1}$.
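Two common schedules as a minimal sketch; the 1/t form and the step factors here are common defaults, not prescribed by the notes.

```python
import numpy as np

def one_over_t_decay(lr0, t):
    # eta_t = eta_0 / sqrt(t + 1): the learning rate shrinks as training proceeds
    return lr0 / np.sqrt(t + 1)

def step_decay(lr0, epoch, factor=0.5, every=10):
    # cut the learning rate by `factor` every `every` epochs
    return lr0 * (factor ** (epoch // every))
```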
Adagrad
The learning rate gets smaller and smaller for every parameter, because the accumulated squared gradients only grow.
Parameters with smaller past derivatives get a larger effective learning rate, and vice versa.
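A minimal Adagrad sketch matching the two points above (names are mine): the per-parameter cache of squared gradients only grows, so every effective learning rate shrinks over time, while parameters whose past derivatives were small keep a comparatively larger one.

```python
import numpy as np

def adagrad_update(w, grad, cache, lr=0.01, eps=1e-8):
    cache = cache + grad ** 2                      # accumulated squared gradients, per parameter
    w = w - lr * grad / (np.sqrt(cache) + eps)     # small past gradients -> larger effective lr
    return w, cache
```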
Momentum
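Momentum keeps a velocity term so that past gradients keep pushing the update in a consistent direction, which helps roll through plateaus and small local dips. A sketch; the 0.9 coefficient is a common default of mine, not taken from the notes.

```python
def momentum_update(w, grad, velocity, lr=0.01, beta=0.9):
    velocity = beta * velocity - lr * grad   # remember the previous movement direction
    w = w + velocity                         # move by accumulated momentum + current step
    return w, velocity
```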
Bad testing results? Overfitting?
Early Stopping
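A sketch of early stopping with a patience counter; `train_one_epoch` and `validation_loss` are placeholder callables supplied by the user.

```python
def train_with_early_stopping(train_one_epoch, validation_loss, patience=5, max_epochs=100):
    """Stop once the validation loss has not improved for `patience` epochs."""
    best_loss, best_epoch, wait = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_one_epoch()                     # one pass over the training set
        loss = validation_loss()              # loss on a held-out validation set
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0   # new best: checkpoint here
        else:
            wait += 1
            if wait >= patience:
                break
    return best_epoch, best_loss
```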
L2 regularization
Take L2 regularization as an example. Machine learning reduces the error by adjusting the parameters theta, but while reducing the error, the strongly non-linear parameters, such as the theta 4 sitting next to x^3, get modified the most, because strongly non-linear parameters make the curve more flexible and thus fit the scattered data points better. Theta 4 says: "Look how capable I am, let me reshape the model and fit all the data by myself." But this attitude provokes a strong pushback from the (regularized) error function: "No, no, no, we are a team. You may be powerful, but we cannot rely on you alone; if you happen to be wrong, the whole team's performance suddenly drops. I have to hold back anyone on the team who tries to steal the show." That is the core idea behind regularization. So what is the difference between L1 and L2 regularization?
L1:
Encourages the weights to be as sparse as possible; similar in spirit to dropout.
Weight Decay
$w \leftarrow 0.99\,w - \eta \cdot \text{gradient}$
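The weight-decay update above is exactly what gradient descent gives when an L2 penalty is added to the loss: the extra term $\lambda w$ shrinks every weight by a constant factor each step, whereas L1's extra term $\lambda\,\mathrm{sign}(w)$ pushes weights toward exactly zero (sparsity). A sketch with placeholder hyperparameters:

```python
import numpy as np

def l2_update(w, grad, lr=0.01, lam=0.01):
    # gradient of (loss + lam/2 * ||w||^2) adds lam * w:
    # w <- (1 - lr*lam) * w - lr * grad, i.e. the 0.99*w weight-decay form when lr*lam = 0.01
    return w - lr * (grad + lam * w)

def l1_update(w, grad, lr=0.01, lam=0.01):
    # gradient of (loss + lam * ||w||_1) adds lam * sign(w): drives weights to exactly zero
    return w - lr * (grad + lam * np.sign(w))
```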
Dropout
Each time before updating the parameters, each neuron has a p% chance of being dropped out.
The structure of the network is thereby changed.
Use the new, thinned network for training.
For each mini-batch, we resample the dropout neurons
If the dropout rate during training is p%, multiply all the weights by (1-p)% at testing time.
Reason
When working in a team, if everyone expects their partners to do the work, nothing gets done in the end.
However, if you know your partner might drop out, you will do better yourself.
At testing time, no one actually drops out, so the results end up even better.
Dropout is a kind of ensemble
Train a bunch of networks with different structures
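A minimal sketch of the scheme above: during training each unit is dropped with probability p and a fresh mask is sampled for every mini-batch; at testing time nothing is dropped and the activations are scaled by (1-p), which is equivalent to scaling the outgoing weights by (1-p).

```python
import numpy as np

def dropout_forward(a, p=0.5, training=True):
    if training:
        mask = (np.random.rand(*a.shape) >= p)   # each unit dropped with probability p
        return a * mask                          # resample this mask for every mini-batch
    # testing: keep every unit, scale by (1 - p) to match the expected training-time output
    return a * (1 - p)
```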