Reinforcement Learning

2019/07/21

Markov Decision Process (MDP)

Policy

A policy π specifies what action to take in each state.

  • Deterministic policy: a = π(s) (for a given state, exactly one action is taken)
  • Stochastic policy: π(a|s), a probability distribution over actions (the action is sampled from it)

Objective: find the optimal policy π* that maximizes the expected sum of rewards, discounted by a factor γ.
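In standard MDP notation, this objective can be written as

$$
J(\pi) = \mathbb{E}\!\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, \pi\right], \qquad \pi^* = \arg\max_{\pi} J(\pi)
$$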

Value Function

  • The (state) value function V^π(s) is a prediction of the future reward

    • How much reward will I get from state s under policy π?
  • The (action) Q-value function Q^π(s, a) (quality) is a prediction of the future reward

    • starting from state s, taking action a, and then following policy π
  • The optimal Q-value function Q*(s, a) is the maximum value achievable under the optimal policy
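In standard notation, with discount factor γ:

$$
V^{\pi}(s) = \mathbb{E}\!\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, s_0 = s, \pi\right]
$$

$$
Q^{\pi}(s, a) = \mathbb{E}\!\left[\sum_{t \ge 0} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a, \pi\right]
$$

$$
Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a)
$$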

Bellman Equation

The Q-value function can be decomposed into a Bellman equation:

$$
Q^{\pi}(s, a) = \mathbb{E}\!\left[r + \gamma\, Q^{\pi}(s', a') \mid s, a\right]
$$

The optimal Q-value function also decomposes into a Bellman equation:

$$
Q^*(s, a) = \mathbb{E}\!\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]
$$

Value-based RL

  • Estimate the optimal Q-value function
  • This is the maximum value achievable under any policy

Q-Learning (DQN)

Represent the Q-value function by a Q-network with weights w: Q(s, a; w) ≈ Q*(s, a).

Lookup Table

Optimal Q-values should obey the Bellman equation

$$
Q^*(s, a) = \mathbb{E}\!\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]
$$

Treat the right-hand side, r + γ max_{a'} Q(s', a'; w), as a target.

Minimize the MSE loss between Q(s, a; w) and this target by SGD:

$$
L(w) = \left(r + \gamma \max_{a'} Q(s', a'; w) - Q(s, a; w)\right)^2
$$

  • Converges to Q* when using a table-lookup representation (a Q-table)

  • But diverges when using a neural network, due to:

    • Correlations between samples: to remove correlations, build a dataset (replay memory) from the agent’s own experience and sample mini-batches from it
    • Non-stationary targets: to deal with non-stationarity, the target network parameters w’ are held fixed and only updated periodically (see the sketch after this list)
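A minimal sketch of these two fixes (experience replay plus a frozen target network) in PyTorch, assuming a gym-style environment with a small discrete action space; the network sizes, buffer, and `train_step` helper are illustrative, not the original post's code:

```python
import random
from collections import deque

import torch
import torch.nn as nn

# Hypothetical sizes for illustration (e.g. a CartPole-like task).
STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net.load_state_dict(q_net.state_dict())  # target weights w' start as a copy of w
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

replay = deque(maxlen=10_000)  # experience replay buffer of (s, a, r, s', done) tuples

def train_step(batch_size=32):
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)   # random sampling breaks correlations
    s, a, r, s2, done = zip(*batch)
    s = torch.tensor(s, dtype=torch.float32)
    s2 = torch.tensor(s2, dtype=torch.float32)
    a = torch.tensor(a, dtype=torch.int64)
    r = torch.tensor(r, dtype=torch.float32)
    done = torch.tensor(done, dtype=torch.float32)

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a; w)
    with torch.no_grad():                                  # fixed target: no gradient through w'
        target = r + GAMMA * (1 - done) * target_net(s2).max(dim=1).values

    loss = nn.functional.mse_loss(q_sa, target)            # MSE to the Bellman target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every C steps, refresh the frozen target network:
#     target_net.load_state_dict(q_net.state_dict())
```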

Improvements to DQN

  • Double DQN (target computation sketched below)
  • Prioritized replay
  • Dueling network
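As an example, Double DQN only changes how the target is computed: the online network selects the greedy action and the target network evaluates it. A hedged sketch (the function name and signature are made up for illustration):

```python
import torch

def double_dqn_target(q_net, target_net, r, s2, done, gamma=0.99):
    """Double DQN target: the online network (w) picks the greedy action,
    the target network (w') evaluates it, reducing over-estimation bias."""
    with torch.no_grad():
        best_a = q_net(s2).argmax(dim=1, keepdim=True)        # action selection with w
        q_eval = target_net(s2).gather(1, best_a).squeeze(1)  # action evaluation with w'
        return r + gamma * (1 - done) * q_eval
```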

Policy-based RL

  • Search directly for the optimal policy
  • This is the policy achieving maximum future reward

Policy Network

The policy network directly outputs the probability of each action, without learning a Q-value function. Advantages (a small network sketch follows this list):

  • Guaranteed convergence to a (local) optimum
  • Works in high-dimensional or continuous action spaces
  • Can learn stochastic policies (useful for balancing exploration and exploitation)
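A minimal policy-network sketch in PyTorch, assuming a small discrete action space; the layer sizes and the `choose_action` helper are illustrative:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration.
STATE_DIM, N_ACTIONS = 4, 2

# The policy network maps a state directly to a distribution over actions;
# no Q-value function is learned.
policy_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, N_ACTIONS),
    nn.Softmax(dim=-1),
)

def choose_action(observation):
    # Stochastic policy: sample from the output distribution,
    # which gives exploration for free.
    probs = policy_net(torch.as_tensor(observation, dtype=torch.float32))
    return torch.distributions.Categorical(probs).sample().item()
```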

Policy Gradient (REINFORCE)

Policy gradient training is just like supervised learning, except:

  • there is no correct label
    • use a fake label: the action sampled from the policy
  • training happens only once an episode is finished
  • the loss is scaled by the episode reward
    • increase the log probability of the actions that worked (the gradient is written out after this list)
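In standard REINFORCE notation, the resulting gradient estimator is

$$
\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi_{\theta}}\!\left[\sum_{t \ge 0} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, G_t\right]
$$

where G_t is the (discounted) return collected from timestep t onward.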

Training protocol

```python
for episode in range(max_episodes):
    observation = env.reset()
    while True:  # for each timestep
        action = choose_action(observation)
        observation_, reward, done = env.step(action)
        store(observation, action, reward)   # remember the whole episode
        if done:
            # episode finished: feed the stored episode through the policy
            # network and maximize sum_t log pi(a_t | s_t) * G_t
            learn()
            break
        observation = observation_
```
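A possible implementation of the `learn()` step above, assuming the episode's observations, actions, and rewards were stored as lists and the policy network comes from the earlier sketch; the discounted-return computation and return normalization are common REINFORCE choices rather than the original author's exact code:

```python
import torch

def learn(policy_net, optimizer, observations, actions, rewards, gamma=0.99):
    # Discounted return G_t for every timestep of the finished episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns, dtype=torch.float32)
    # Normalizing the returns is a common variance-reduction trick.
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    obs = torch.as_tensor(observations, dtype=torch.float32)
    acts = torch.as_tensor(actions)
    log_probs = torch.distributions.Categorical(policy_net(obs)).log_prob(acts)

    # REINFORCE: maximize sum_t log pi(a_t | s_t) * G_t, i.e. minimize the negative.
    loss = -(log_probs * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```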

Q-Learning vs. Policy Gradient

  • Policy Gradient
    • Very general, but suffers from high variance, so it requires a lot of samples
    • Challenge: sample-efficiency
  • Q-learning
    • Does not always work, but when it does, it is usually more sample-efficient
    • Challenge: exploration
  • Guarantees
    • Policy Gradient: converges to a local optimum of J(θ), which is often good enough
    • Q-learning: Zero guarantees since you are approximating the Bellman equation with a complicated function approximator
