RNN

2019/07/15

Recurrent Neural Network

  1. Local dependency assumption: the sequential information of all previous timestamps can be encoded into one hidden representation; this is a 2nd-order Markov assumption with a temporal receptive field of 2 (each update sees only the previous hidden state and the current input).

  2. Parameter sharing assumption: if a feature is useful at time $$t$$, then it should also be useful at all timestamps.

In practice, however, neither assumption strictly holds for real problems.

The hidden state $$h_t$$ is used to encode the information from all past timesteps.

Stationary assumption

We use the same parameters $$W$$ and $$U$$ at every timestep.
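A minimal NumPy sketch of this recurrence (a vanilla tanh RNN; the names `W`, `U`, `b` and the sizes are illustrative assumptions, not from the original post):

```python
import numpy as np

def rnn_forward(x_seq, h0, W, U, b):
    """Vanilla RNN: h_t = tanh(W @ h_{t-1} + U @ x_t + b).

    The same W, U, b are reused at every timestep (parameter sharing),
    and h_t is the only carrier of information about x_1..x_t.
    """
    h = h0
    hs = []
    for x_t in x_seq:                      # iterate over timesteps
        h = np.tanh(W @ h + U @ x_t + b)   # shared parameters at each step
        hs.append(h)
    return np.stack(hs), h

# toy usage: hidden size 4, input size 3, sequence length 5
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 4))
U = rng.normal(scale=0.1, size=(4, 3))
b = np.zeros(4)
x_seq = rng.normal(size=(5, 3))
hs, h_last = rnn_forward(x_seq, np.zeros(4), W, U, b)
```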

Bidirectional RNN

Deep RNN

RNN for Language Model (LM)

  • Advantages:
    • RNNs can represent unbounded temporal dependencies
    • RNNs encode histories of words into a fixed size hidden vector
    • Parameter size does not grow with the length of dependencies
  • Disadvantages:
    • It is hard for RNNs to learn long-range dependencies present in the data

Back-Propagation Through Time (BPTT)

The chain of Jacobians linking $$h_s$$ to $$h_t$$ gets longer and longer as the time gap $$(t-s)$$ gets bigger! (See the expansion below.)
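For reference, with the usual recurrence $$h_k = f(W h_{k-1} + U x_k)$$, the BPTT gradient with respect to an earlier hidden state expands into a product of Jacobians:

$$\frac{\partial L_t}{\partial h_s} = \frac{\partial L_t}{\partial h_t} \prod_{k=s+1}^{t} \frac{\partial h_k}{\partial h_{k-1}}, \qquad \frac{\partial h_k}{\partial h_{k-1}} = \operatorname{diag}\left( f^{\prime}(W h_{k-1} + U x_k) \right) W$$

Every factor contains the same $$W$$, which is what makes the product shrink or explode.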

It has to learn the same transformation for all time steps.

Truncated BPTT

Vanilla BPTT over the whole sequence:

  1. Forward through entire sequence to compute loss

  2. Backward through entire sequence to compute gradient

This is too computationally expensive for long sequences.

Truncated BPTT:

Run forward and backward through chunks of the sequence instead of whole sequence

Truncation: carry hidden states forward in time forever, but only back-propagate for some smaller number of steps
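A minimal PyTorch-style sketch of truncated BPTT for a toy LSTM language model; the model, sizes, and chunk length `k=35` are illustrative assumptions, and the key step is detaching the carried hidden state at each chunk boundary:

```python
import torch
import torch.nn as nn

# toy setup: one-hot inputs over a vocab of 100, hidden size 128
model = nn.LSTM(input_size=100, hidden_size=128, batch_first=True)
readout = nn.Linear(128, 100)
optimizer = torch.optim.Adam(list(model.parameters()) + list(readout.parameters()))
criterion = nn.CrossEntropyLoss()

def truncated_bptt(inputs, targets, k=35):
    """inputs: (batch, T, 100) floats, targets: (batch, T) int labels.

    The hidden state is carried forward across chunks ("forever"),
    but detached so back-propagation spans at most k steps.
    """
    state = None
    T = inputs.size(1)
    for start in range(0, T, k):
        chunk_x = inputs[:, start:start + k]
        chunk_y = targets[:, start:start + k]
        out, state = model(chunk_x, state)
        # detach: keep the values, cut the graph at the chunk boundary
        state = tuple(s.detach() for s in state)
        logits = readout(out)                        # (batch, chunk, vocab)
        loss = criterion(logits.reshape(-1, 100), chunk_y.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```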

Learning Difficulty

Vanishing or Exploding gradient

$$\left\| \frac{\partial h_t}{\partial h_{t-1}} \right\| \leq \left\| W \right\| \left\| \operatorname{diag}\left( f^{\prime}(W h_{t-1} + U x_t) \right) \right\| \leq \lambda \gamma$$

where

  • $$\lambda$$ is the largest singular value of $$W$$
  • $$\gamma$$ is the upper bound of $$\left\| \operatorname{diag}\left( f^{\prime}(\cdot) \right) \right\|$$
  • $$\gamma$$ depends on the activation function $$f$$, e.g. $$\left| \tanh^{\prime}(x) \right| \leq 1, \left| \sigma^{\prime}(x) \right| \leq \frac{1}{4}$$

Conclusion:

  • RNNs are particularly unstable due to the repeated multiplication by the same weight matrix (illustrated numerically below)
  • Gradients can become very small or very large very quickly
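A small NumPy illustration of this repeated multiplication; the sizes, scales, and 50-step horizon are arbitrary choices for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
v = rng.normal(size=16)

for scale, label in [(0.9, "largest singular value < 1"),
                     (1.1, "largest singular value > 1")]:
    A = rng.normal(size=(16, 16))
    W = (A + A.T) / 2                      # symmetric, so spectral norm = spectral radius
    W *= scale / np.linalg.norm(W, 2)      # set the largest singular value to `scale`
    x = v.copy()
    for _ in range(50):                    # 50 "timesteps" of h <- W h
        x = W @ x
    print(f"{label}: ||W^50 v|| = {np.linalg.norm(x):.2e}")
# the norm collapses towards 0 in the first case and blows up in the second
```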

Vanishing Gradient

Why is a vanishing gradient a problem?

  • The model cannot learn long-term dependencies; in a language model, this means a word far in the past has essentially no influence on the prediction of the next word.

Tricks for mitigating vanishing gradients:

  • Orthogonal Initialization:
    • Initialize the parameter $$W$$ as a random orthogonal matrix (i.e. $$W^{\top} W = I$$), so repeated multiplication does not change the norm, because every singular value of an orthogonal matrix equals 1 (see the sketch below)
    • If instead $$W$$ is initialized with small random entries, its largest singular value is small and the repeated product shrinks the gradient norm
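A minimal NumPy sketch of orthogonal initialization via QR decomposition (function name and sizes are illustrative):

```python
import numpy as np

def orthogonal_init(n, rng):
    """Return a random n x n orthogonal matrix via QR decomposition."""
    A = rng.normal(size=(n, n))
    Q, R = np.linalg.qr(A)
    # sign correction on the columns so the draw is closer to uniform
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(0)
W = orthogonal_init(64, rng)
v = rng.normal(size=64)
print(np.linalg.norm(v), np.linalg.norm(np.linalg.matrix_power(W, 100) @ v))
# the two norms match (up to floating point), since ||Wx|| = ||x|| for orthogonal W
```

In PyTorch, `torch.nn.init.orthogonal_` provides the same kind of initialization.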

Long Short-Term Memory (LSTM)

As long as the forget gates are open (close to 1), gradients can flow through the cell state over long time gaps.

  • Saturation of the forget gates (= 1) doesn't stop gradient flow
  • A better initialization of the forget gate bias is a positive value, e.g. 1 (see the sketch below)
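A minimal PyTorch sketch of this bias initialization, assuming PyTorch's `nn.LSTM` gate ordering (input, forget, cell, output) within each bias vector:

```python
import torch
import torch.nn as nn

hidden_size = 128
lstm = nn.LSTM(input_size=64, hidden_size=hidden_size, num_layers=1)

# PyTorch packs the four gate biases as [input | forget | cell | output],
# each of length hidden_size; set the forget-gate slice to 1.
with torch.no_grad():
    for name, param in lstm.named_parameters():
        if "bias" in name:                       # bias_ih_l0 and bias_hh_l0
            param[hidden_size:2 * hidden_size].fill_(1.0)
```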

Add ‘peephole’ connections to the gates, i.e. let the gates also see the cell state (a sketch of the gate equations follows the variant list below):

  • Vanilla LSTM

  • Peephole LSTM
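For reference, a sketch of how the gate pre-activations differ between the two variants (notation assumed: input $$x_t$$, previous hidden state $$h_{t-1}$$, cell state $$c$$; peephole weights $$p_i, p_f, p_o$$ act element-wise):

$$\text{Vanilla:} \quad i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \quad f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \quad o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$

$$\text{Peephole:} \quad i_t = \sigma(W_i x_t + U_i h_{t-1} + p_i \odot c_{t-1} + b_i), \quad f_t = \sigma(W_f x_t + U_f h_{t-1} + p_f \odot c_{t-1} + b_f), \quad o_t = \sigma(W_o x_t + U_o h_{t-1} + p_o \odot c_t + b_o)$$

In the peephole variant the input and forget gates look at $$c_{t-1}$$, while the output gate looks at the updated cell $$c_t$$.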

There is no significant performance improvement from peephole connections.

None of the variants can improve upon the standard LSTM significantly

The forget gate and the output activation function (when the cell state is unbounded) turn out to be its most critical components.

Thus, we can couple the input and forget gates ($$i_t = 1 - f_t$$) and it still works well -> GRU.

Gated Recurrent Unit (GRU)

GRU and LSTM yield similar accuracy, but GRU converges faster than LSTM
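A minimal NumPy sketch of one GRU step (biases omitted; the weight names and the convention $$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$ are one common formulation, assumed here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step:
        z_t = sigma(Wz x_t + Uz h_{t-1})            # update gate
        r_t = sigma(Wr x_t + Ur h_{t-1})            # reset gate
        h~_t = tanh(Wh x_t + Uh (r_t * h_{t-1}))    # candidate state
        h_t = (1 - z_t) * h_{t-1} + z_t * h~_t      # interpolation
    """
    z = sigmoid(Wz @ x_t + Uz @ h_prev)
    r = sigmoid(Wr @ x_t + Ur @ h_prev)
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev))
    return (1.0 - z) * h_prev + z * h_tilde
```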

Exploding Gradient

L1 or L2 penalty on the recurrent weight matrix (a minimal sketch follows):
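A minimal PyTorch-style sketch of an L2 penalty on the recurrent (hidden-to-hidden) weight only; `reg_lambda` is an illustrative coefficient and `weight_hh_l0` is the hidden-to-hidden weight of a single-layer `nn.RNN`:

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=32, hidden_size=64, batch_first=True)
criterion = nn.MSELoss()
reg_lambda = 1e-4          # illustrative penalty strength

def loss_with_penalty(pred, target):
    # L2 penalty on the recurrent weight matrix only
    W_hh = rnn.weight_hh_l0
    return criterion(pred, target) + reg_lambda * (W_hh ** 2).sum()
```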

Regularize W to ensure that the spectral radius of W does not exceed 1.

Tricks for exploding gradients:

  • Gradient Clipping: if $$\| g \| > \theta$$, then $$g \leftarrow \frac{\theta}{\| g \|} g$$ (see the sketch below)
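A minimal NumPy sketch of this clipping rule (the threshold value is illustrative):

```python
import numpy as np

def clip_gradient(g, theta=5.0):
    """Rescale g so that its norm never exceeds theta."""
    norm = np.linalg.norm(g)
    if norm > theta:
        g = (theta / norm) * g
    return g
```

In PyTorch, `torch.nn.utils.clip_grad_norm_(model.parameters(), theta)` applies the same rule over all parameters.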

What if Batch Normalization?

Batch normalization makes the optimization landscape much easier to navigate and smooths the gradients!

Problems when applied to RNNs:

  • Needs different statistics for different time-steps
    • problematic if a test sequence is longer than any training sequence
  • Needs a large batch size
    • problematic because RNNs usually run with small batches (recurrent structures, e.g. LSTM, usually require more memory)

Layer Normalization

  • g (gain) and b (bias) are shared over all time-steps; the statistics are computed across the features and are independent of the batch size
  • Applied after multiplying the inputs and the hidden state by the weights, but right before the nonlinearity (see the sketch below)
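A minimal NumPy sketch of layer normalization inside a vanilla RNN step; the names `g`, `b`, `eps` are illustrative:

```python
import numpy as np

def layer_norm(a, g, b, eps=1e-5):
    """Normalize across the feature dimension, then rescale with gain g and bias b."""
    mean = a.mean()
    var = a.var()
    return g * (a - mean) / np.sqrt(var + eps) + b

def rnn_step_ln(x_t, h_prev, W, U, g, b):
    a = W @ h_prev + U @ x_t                 # weights applied first
    return np.tanh(layer_norm(a, g, b))      # normalize right before the nonlinearity
```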

Weight Normalization

  • Lower computational complexity and less memory usage, much faster than BN
  • Does not rely on batch statistics, so it introduces no extra noise from the data (see the sketch below)
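A minimal NumPy sketch of the weight normalization reparameterization $$w = \frac{g}{\| v \|} v$$ (names are illustrative):

```python
import numpy as np

def weight_norm(v, g):
    """Reparameterize a weight vector: direction from v, scale from the scalar g."""
    return (g / np.linalg.norm(v)) * v

# during training, gradients are taken with respect to v and g instead of w
v = np.random.default_rng(0).normal(size=64)
g = 1.0
w = weight_norm(v, g)
```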

Practical Training Strategies

  • Mismatch in training and inference
  • Curriculum learning
  • Scheduled sampling
  • Variational dropout
