[Reinforcement Learning] Notes on 李宏毅's Reinforcement Learning Course

Posted by ShawnD on June 15, 2020

Formula Derivation

Use $\pi_\theta(s)$ to denote a network with parameters $\theta$, i.e. an actor.

$s_i$ denotes an observation, $a_i$ an action, and $r_i$ a reward.

Total reward: $R_\theta = \sum_{t=1}^{T}r_t$

Even with the same actor, $R_\theta$ differs from run to run, because both the actor's sampling of actions and the game itself are random.

A play of the game (an episode) is denoted $\tau$:

  • \[\tau = \{s_1, a_1, r_1, s_2, a_2, r_2, ... , s_T, a_T, r_T\}\]
  • \[R(\tau) = \sum_{t=1}^{T}r_t\]
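
As a concrete illustration (my own sketch, not from the lecture), an episode $\tau$ and its total reward $R(\tau)$ map directly onto a rollout loop. The gymnasium environment and the random placeholder actor below are assumptions made only for this example.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")  # assumed environment; any episodic environment works

def play_one_episode(actor, max_steps=500):
    """Roll out one episode tau = {s_1, a_1, r_1, ...} and return (tau, R(tau))."""
    s, _ = env.reset()
    tau, total_reward = [], 0.0
    for _ in range(max_steps):
        a = actor(s)                                       # the actor picks an action
        s_next, r, terminated, truncated, _ = env.step(a)
        tau.append((s, a, r))
        total_reward += r                                  # accumulate R(tau) = sum_t r_t
        s = s_next
        if terminated or truncated:
            break
    return tau, total_reward

# placeholder "actor": a uniformly random policy
random_actor = lambda s: env.action_space.sample()
tau, R_tau = play_one_episode(random_actor)
```

Running this several times gives a different $R(\tau)$ each time, which is exactly the randomness mentioned above.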

If we use an actor to play the game, each $\tau$ has some probability of being sampled:

  • The probability depends on the actor's parameters $\theta$: $P(\tau \mid \theta)$

The expected reward is

\[\bar R_\theta = \sum_\tau R(\tau)P(\tau \mid \theta)\]

Playing the game N times with $\pi_\theta$ gives \(\{\tau^1, \tau^2, ..., \tau^N\}\), which is equivalent to sampling $\tau$ from $P(\tau \mid \theta)$ N times.

Therefore

\[\bar R_\theta = \sum_\tau R(\tau)P(\tau \mid \theta) \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n)\]
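
This Monte Carlo approximation is straightforward to mirror in code; the sketch below reuses the hypothetical `play_one_episode` helper from the rollout sketch above.

```python
def estimate_R_bar(actor, N=100):
    """Approximate bar-R_theta = sum_tau R(tau) P(tau | theta) by the mean return of N sampled episodes."""
    returns = [play_one_episode(actor)[1] for _ in range(N)]
    return sum(returns) / N

# e.g. estimate_R_bar(random_actor) -- the estimate itself fluctuates between calls
```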

Problem statement:

\[\theta^* = \arg \max_\theta \bar R_\theta, \quad \bar R_\theta = \sum_\tau R(\tau)P(\tau \mid \theta)\]

Gradient ascent (we maximize $\bar R_\theta$, so the parameters move in the direction of the gradient):

  • Start from $\theta^0$
  • $\theta^1 \leftarrow \theta^0 + \eta \nabla \bar R_{\theta^0}$
  • $\theta^2 \leftarrow \theta^1 + \eta \nabla \bar R_{\theta^1}$
\[\nabla \bar R_{\theta} = \sum_\tau R(\tau) \nabla P(\tau \mid \theta)\]

Since only $P(\tau \mid \theta)$ is differentiated, $R(\tau)$ can be non-differentiable; it can even be a black box.

\[\nabla \bar R_{\theta} = \sum_\tau R(\tau) \nabla P(\tau \mid \theta) = \sum_\tau R(\tau)P(\tau \mid \theta) \frac{\nabla P(\tau \mid \theta)}{P(\tau \mid \theta)} \\ = \sum_\tau R(\tau)P(\tau \mid \theta) \nabla \log P(\tau \mid \theta)\]

because

\[\frac{d \log f(x)}{dx} = \frac{1}{f(x)}\frac{df(x)}{dx}\]

Playing the game N times with $\pi_\theta$ gives \(\{\tau^1, \tau^2, ..., \tau^N\}\):

\[\sum_\tau R(\tau)P(\tau \mid \theta) \nabla \log P(\tau \mid \theta) \approx \frac{1}{N}\sum_{n=1}^{N} R(\tau^n)\nabla \log P(\tau^n \mid \theta)\]

where, expanding the definition of an episode,

\[\tau = \{s_1, a_1, r_1, s_2, a_2, r_2, ..., s_T, a_T, r_T\}\]

\[P(\tau \mid \theta) = p(s_1)\,p(a_1 \mid s_1, \theta)\,p(r_1, s_2 \mid s_1, a_1)\,p(a_2 \mid s_2, \theta)\,p(r_2, s_3 \mid s_2, a_2)\cdots\]

\[P(\tau \mid \theta) = p(s_1) \prod_{t=1}^{T} p(a_t \mid s_t, \theta)\,p(r_t, s_{t+1} \mid s_t, a_t)\]

\[\log P(\tau \mid \theta) = \log p(s_1) + \sum_{t=1}^{T}\big(\log p(a_t \mid s_t, \theta) + \log p(r_t, s_{t+1} \mid s_t, a_t)\big)\]

Ignoring the terms that do not contain $\theta$:

\[\nabla \log P(\tau \mid \theta) = \sum_{t=1}^{T} \nabla \log p(a_t \mid s_t, \theta)\]

The update rule and the resulting gradient estimate are

\[\theta^{new} \leftarrow \theta^{old} + \eta \nabla \bar R_{\theta^{old}}\]

\[\nabla \bar R_\theta \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n) \nabla \log P(\tau^n \mid \theta) = \frac{1}{N} \sum_{n=1}^{N} R(\tau^n) \sum_{t=1}^{T_n}\nabla \log p(a_t^n \mid s_t^n, \theta) \\ = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n) \nabla \log p(a_t^n \mid s_t^n, \theta)\]
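
Below is a minimal PyTorch-style sketch of this estimator (my own, not the lecture's code); the environment, network size, and hyperparameters are all assumptions. Minimizing the surrogate loss in the last lines performs the gradient-ascent update above, because the surrogate's gradient is minus the sampled estimate of $\nabla \bar R_\theta$.

```python
import gymnasium as gym
import torch
from torch.distributions import Categorical

env = gym.make("CartPole-v1")                        # assumed environment (4-dim state, 2 actions)
policy = torch.nn.Sequential(                        # assumed small policy network p(a | s, theta)
    torch.nn.Linear(4, 64), torch.nn.Tanh(), torch.nn.Linear(64, 2)
)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

N = 8                                                # episodes sampled per update
for update in range(200):
    log_probs, returns = [], []
    for _ in range(N):                               # sample tau^1 ... tau^N with the current actor
        s, _ = env.reset()
        ep_log_probs, R = [], 0.0
        done = False
        while not done:
            dist = Categorical(logits=policy(torch.as_tensor(s, dtype=torch.float32)))
            a = dist.sample()
            ep_log_probs.append(dist.log_prob(a))    # log p(a_t | s_t, theta)
            s, r, terminated, truncated, _ = env.step(a.item())
            R += r
            done = terminated or truncated
        log_probs.append(torch.stack(ep_log_probs).sum())  # sum_t log p(a_t^n | s_t^n, theta)
        returns.append(R)                                  # R(tau^n)

    returns_t = torch.as_tensor(returns, dtype=torch.float32)
    # gradient of this surrogate = -(1/N) sum_n R(tau^n) sum_t grad log p(a_t^n | s_t^n, theta),
    # so minimizing it is gradient ascent on bar-R_theta
    loss = -(returns_t * torch.stack(log_probs)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Note that the reward only enters as a scalar weight on the log-probabilities, which is why $R(\tau)$ never needs to be differentiable.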

Why take the log?

Because:

\[\nabla \log p(a_t^n \mid s_t^n, \theta) = \frac{\nabla p(a_t^n \mid s_t^n, \theta)}{p(a_t^n \mid s_t^n, \theta)}\]

This amounts to normalizing the gradient by the probability: an action that the current policy already assigns high probability is sampled more often, so without the division it would accumulate more gradient terms and be reinforced merely because it appears frequently, not because it earns more reward.

Suppose that in episode $\tau^n$ the machine takes action $a_t^n$ at state $s_t^n$:

  • if $R(\tau^n)$ is positive, tune $\theta$ to increase $p(a_t^n \mid s_t^n)$
  • if $R(\tau^n)$ is negative, tune $\theta$ to decrease $p(a_t^n \mid s_t^n)$

However, $R(\tau^n)$ may always be positive (in some games the reward is never negative).

In the ideal case, the probabilities of three actions a, b, and c each rise or fall according to their rewards.

In practice we only have a finite sample, and some of the actions may never be sampled.

When that happens, the probability of an unsampled action drops: every sampled action has its probability increased (all the weights $R(\tau^n)$ are positive), and since the probabilities must sum to one, the unsampled action loses probability mass.

The fix is to subtract a baseline $b$ of your own design:

\[\nabla \bar R_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n}(R(\tau^n) - b) \nabla \log p(a_t^n \mid s_t^n, \theta)\]

Only when $R(\tau^n)$ is larger than the baseline $b$ does the action's probability increase; if $R(\tau^n)$ is smaller than the baseline, the probability still decreases even though the reward is positive.
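
A sketch of the corresponding surrogate loss with a baseline; using the batch-mean return as $b$ is my own simple choice here, not something the notes specify.

```python
import torch

def surrogate_loss_with_baseline(returns_t, episode_log_probs):
    """Loss whose gradient is -(1/N) sum_n (R(tau^n) - b) sum_t grad log p(a_t^n | s_t^n, theta)."""
    b = returns_t.mean()                              # batch-mean return as a simple baseline b
    return -((returns_t - b) * episode_log_probs).mean()

# Toy example: all returns are positive, but the weight (R - b) is now signed,
# so below-average episodes push their actions' probabilities down.
returns_t = torch.tensor([10.0, 12.0, 8.0])
episode_log_probs = torch.tensor([-3.2, -2.9, -3.5], requires_grad=True)
surrogate_loss_with_baseline(returns_t, episode_log_probs).backward()
```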

mark: Rainbow

Proximal Policy Optimization (PPO)

From on-policy to off-policy

Take playing chess as an example:

on-policy: the agent learns from games it plays itself

off-policy: the agent learns by watching someone else play

Reference

  1. 【李宏毅】2020 最新课程 (完整版) 强化学习