Perceptron
A perceptron can only represent linearly separable functions, so it cannot represent XOR.
Multilayer Perceptron
Can represent any decision boundary, by stacking and combining multiple simpler decision boundaries across layers (see the XOR sketch below).
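A minimal sketch of this for XOR (assuming NumPy): a two-layer network with hand-picked weights, not learned ones, computes the function a single perceptron cannot:

```python
import numpy as np

def step(x):
    # Heaviside step activation used by the classic perceptron.
    return (x > 0).astype(float)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hidden layer: h1 fires for "at least one input on",
#               h2 fires for "both inputs on".
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])
h = step(X @ W1 + b1)

# Output layer: XOR = h1 AND NOT h2.
W2 = np.array([1.0, -1.0])
b2 = -0.5
y = step(h @ W2 + b2)

print(y)  # [0. 1. 1. 0.] -- the XOR truth table
```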
Activation Function
Sigmoid
Range (0, 1), interpretable as a probability; relatively small gradient (saturates for large |x|)
Hyperbolic Tangent (tanh)
Range (-1, 1), zero-centered; relatively large saturation region
Rectified Linear Unit (ReLU)
Computationally efficient and converges faster, but neurons are prone to "dying" (a unit whose input is always negative outputs 0 and receives no gradient updates).
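A minimal sketch of the three activations and their gradients (assuming NumPy), which makes the trade-offs above concrete; the helper names are my own:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)            # peaks at only 0.25; tiny for large |x| (saturation)

def tanh_grad(x):
    return 1 - np.tanh(x) ** 2    # peaks at 1.0, but still saturates for large |x|

def relu(x):
    return np.maximum(0, x)

def relu_grad(x):
    return (x > 0).astype(float)  # exactly 0 for x < 0: a "dead" unit gets no update

x = np.array([-5.0, 0.0, 5.0])
print(sigmoid_grad(x), tanh_grad(x), relu_grad(x))
```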
The exponential function exp
Overflows easily for large inputs, so it needs special handling, e.g., shifting the inputs before exponentiating:
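A common fix, sketched below (assuming NumPy): softmax is invariant to shifting its logits, so subtracting the maximum first keeps every exponent at most 0:

```python
import numpy as np

def softmax(logits):
    # Shift so the largest exponent is 0; the result is unchanged mathematically.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

print(softmax(np.array([1000.0, 1001.0, 1002.0])))  # naive exp(1000) would overflow
```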
Entropy
Entropy
Amount of bits needed to encode the information (uncertainty) in q: H(q) = -Σ_x q(x) log q(x)
Relative Entropy
Amount of extra bits needed to encode information in p given (a code optimized for) q: D_KL(p ‖ q) = Σ_x p(x) log (p(x) / q(x))
Cross Entropy
Cross Entropy = Relative Entropy + Entropy: H(p, q) = -Σ_x p(x) log q(x) = D_KL(p ‖ q) + H(p)
Cross-entropy loss as a cost function: for one-hot labels y and predicted probabilities ŷ, L = -Σ_i y_i log ŷ_i
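A minimal numeric check of the identity above (assuming NumPy); p and q are arbitrary example distributions:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.5, 0.3, 0.2])   # model distribution

entropy = -np.sum(p * np.log(p))          # H(p)
kl = np.sum(p * np.log(p / q))            # D_KL(p || q)
cross_entropy = -np.sum(p * np.log(q))    # H(p, q)

assert np.isclose(cross_entropy, entropy + kl)
print(entropy, kl, cross_entropy)
```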
Backpropagation
Automatic Differentiation
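A minimal by-hand sketch of what backpropagation computes, for a two-layer network with a sigmoid hidden layer and squared-error loss (autodiff frameworks automate exactly this chain-rule bookkeeping); the sizes and loss are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # batch of 4, input dim 3
t = rng.normal(size=(4, 1))          # targets
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)

# Forward pass, keeping intermediates for the backward pass.
z1 = x @ W1 + b1
h = 1 / (1 + np.exp(-z1))            # sigmoid hidden layer
y = h @ W2 + b2
loss = 0.5 * np.mean((y - t) ** 2)

# Backward pass: apply the chain rule layer by layer, from loss back to inputs.
dy = (y - t) / len(x)                # dL/dy
dW2 = h.T @ dy
db2 = dy.sum(axis=0)
dh = dy @ W2.T                       # dL/dh
dz1 = dh * h * (1 - h)               # chain through the sigmoid
dW1 = x.T @ dz1
db1 = dz1.sum(axis=0)
print(loss, dW1.shape, dW2.shape)
```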
Practical Training Strategies
Stochastic Gradient Descent
SGD with momentum
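A minimal sketch contrasting the two update rules (assuming NumPy); the hyperparameter values and the toy objective f(w) = w² are illustrative:

```python
import numpy as np

def sgd_step(w, grad, lr=0.1):
    # Plain SGD: step directly along the negative gradient.
    return w - lr * grad

def momentum_step(w, v, grad, lr=0.1, mu=0.9):
    # Velocity is a decaying sum of past gradients, which damps
    # oscillation and speeds up progress along consistent directions.
    v = mu * v - lr * grad
    return w + v, v

w, v = np.array([5.0]), np.zeros(1)
for _ in range(100):
    grad = 2 * w                      # gradient of the toy objective f(w) = w^2
    w, v = momentum_step(w, v, grad)
print(w)                              # spirals in toward the minimum at 0
```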
Learning rate decay
Although some algorithms can adjust the learning rate adaptively, a good choice of learning rate η can result in better performance. To make the network converge stably and quickly, we can set a learning rate that decays over time.
Exponential decay strategy: η_t = η_0 e^(-kt)
1/t decay strategy: η_t = η_0 / (1 + kt)
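A minimal sketch of both schedules (assuming NumPy); eta0 and k are illustrative hyperparameters:

```python
import numpy as np

def exponential_decay(eta0, k, t):
    return eta0 * np.exp(-k * t)      # eta_t = eta_0 * e^(-k t)

def inverse_time_decay(eta0, k, t):
    return eta0 / (1.0 + k * t)       # eta_t = eta_0 / (1 + k t)

t = np.arange(5)
print(exponential_decay(0.1, 0.5, t))
print(inverse_time_decay(0.1, 0.5, t))
```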
Weight decay
L1 Regularization
L2 Regularization
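A minimal sketch of how each penalty changes the gradient (assuming NumPy): L1 adds λ·sign(w), pushing weights toward exact zeros (sparsity), while L2 adds λ·w, shrinking all weights proportionally; λ is illustrative:

```python
import numpy as np

def l1_grad(w, lam=1e-4):
    # d/dw of lam * |w|: a constant-magnitude push toward 0 -> sparse weights.
    return lam * np.sign(w)

def l2_grad(w, lam=1e-4):
    # d/dw of (lam/2) * w^2: a push proportional to w -> small, diffuse weights.
    return lam * w

w = np.array([-0.5, 0.01, 2.0])
grad_from_data = np.zeros_like(w)     # placeholder for the data-loss gradient
grad = grad_from_data + l2_grad(w)    # add the regularization term to the update
print(grad)
```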
Dropout
Randomly zero a fraction of the units during training to prevent co-adaptation of features; at test time all units are kept, with activations scaled so their expected magnitude matches training.
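A minimal sketch of the common "inverted dropout" variant (assuming NumPy), where the rescaling happens at training time so the test-time forward pass is unchanged; the drop probability p is illustrative:

```python
import numpy as np

def dropout(h, p=0.5, training=True):
    if not training:
        return h                       # no-op at test time
    # Zero units with probability p; scale survivors by 1/(1-p)
    # so the expected activation is unchanged.
    mask = (np.random.rand(*h.shape) >= p) / (1.0 - p)
    return h * mask

h = np.ones((2, 4))
print(dropout(h))                      # roughly half zeroed, survivors scaled by 2
print(dropout(h, training=False))      # unchanged
```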
Weight Initialization
Proper initialization prevents the magnitudes of signals from shrinking or growing exponentially as they propagate through the layers.
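A minimal sketch (assuming NumPy) of two standard scale-preserving schemes, Xavier initialization (for tanh/sigmoid layers) and He initialization (for ReLU layers); the layer sizes and depth are illustrative:

```python
import numpy as np

def xavier_init(fan_in, fan_out):
    # Var(w) = 1 / fan_in keeps the variance of pre-activations stable.
    return np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

def he_init(fan_in, fan_out):
    # ReLU zeroes half the inputs, so double the variance: Var(w) = 2 / fan_in.
    return np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)

x = np.random.randn(1000, 512)
for _ in range(10):                      # 10 ReLU layers deep
    x = np.maximum(0, x @ he_init(512, 512))
print(x.std())                           # stays O(1) instead of vanishing/exploding
```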
Babysitting the Learning Process
First, overfit a small subset of the data, as a sanity check that the model and loss are implemented correctly.
- Training error fluctuating? Decrease the learning rate.
- Some of the units saturated? Scale down the initialization and properly normalize the inputs.
Expressiveness of DL
Shallow MLP
Universal approximation: a two-layer MLP can represent any decision boundary (Hornik, 1991).
However, a two-layer MLP may require an exponentially large number of hidden units (K^N → ∞).
Space Folding
Illustrates intuitively how a nonlinear map from a low-dimensional space into a higher-dimensional space can make the data linearly separable.
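A minimal illustration of that idea (assuming NumPy): points on a line with labels 1, 0, 1 are not linearly separable in 1-D, but the nonlinear lift x → (x, x²) makes them separable by a single line; the data and threshold are made up for illustration:

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([1, 0, 0, 0, 1])            # outer points vs. inner points

lifted = np.stack([x, x ** 2], axis=1)   # nonlinear feature map into 2-D

# In the lifted space the classes are separated by the line x^2 = 2.5.
pred = (lifted[:, 1] > 2.5).astype(int)
print(pred)  # matches y
```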