[Deep Learning] Deriving the Forward and Backward Pass of an RNN

Posted by ShawnD on January 14, 2021

Notation

  • $x_t$: the input at time $t$
  • $s_{t-1}$: the state at the previous time step
  • $z$: $z = Vs_t$
  • $U$: the weight matrix applied to $x_t$
  • $W$: the weight matrix applied to $s_{t-1}$
  • $V$: the weight matrix applied to $s_t$
  • $E$: the loss function

Derivative of the softmax

\[a_j = \frac{e^{z_j}}{\sum_k e^{z_k}}\]

When $i = j$ (so that $\partial e^{z_j} / \partial z_i = e^{z_j}$):

\[\begin{aligned} \frac{\partial a_j}{\partial z_i} &= \frac{\partial}{\partial z_i} \left(\frac{e^{z_j}}{\sum_k e^{z_k}}\right) \\ &= \frac{e^{z_j} \cdot \sum_k e^{z_k} - e^{z_j} \cdot e^{z_j}}{(\sum_k e^{z_k})^2} \\ &= \frac{e^{z_j}}{\sum_k e^{z_k}} - \frac{e^{z_j}}{\sum_k e^{z_k}} \cdot \frac{e^{z_j}}{\sum_k e^{z_k}} \\ &= a_j(1-a_j) \end{aligned}\]

When $i \neq j$ (so that $\partial e^{z_j} / \partial z_i = 0$):

\[\begin{aligned} \frac{\partial a_j}{\partial z_i} &= \frac{\partial}{\partial z_i} \left(\frac{e^{z_j}}{\sum_k e^{z_k}}\right) \\ &= \frac{0 \cdot \sum_k e^{z_k} - e^{z_j} \cdot e^{z_i}}{(\sum_k e^{z_k})^2} \\ &= -\frac{e^{z_j}}{\sum_k e^{z_k}} \cdot \frac{e^{z_i}}{\sum_k e^{z_k}} = -a_j \cdot a_i \end{aligned}\]
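As a sanity check, the two cases combine into the Jacobian $\frac{\partial a}{\partial z} = \text{diag}(a) - aa^T$. Here is a minimal NumPy sketch comparing that analytic Jacobian against finite differences; the logit vector `z` is an arbitrary made-up example:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.5, -1.2, 0.3, 2.0, -0.7])   # arbitrary 5-class logits
a = softmax(z)

# Analytic Jacobian from the two cases above:
# da_j/dz_i = a_j(1 - a_j) if i == j, else -a_j * a_i.
J_analytic = np.diag(a) - np.outer(a, a)

# Central finite differences of da/dz_i, column by column.
eps = 1e-6
J_numeric = np.zeros((5, 5))
for i in range(5):
    dz = np.zeros(5)
    dz[i] = eps
    J_numeric[:, i] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.allclose(J_analytic, J_numeric, atol=1e-6))   # True
```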

Derivative of the cross-entropy

\[E_t(y_t, \hat y_t) = -y_t \log \hat y_t\] \[E(y, \hat y) = \sum_t E_t(y_t, \hat y_t) = -\sum_t y_t \log \hat y_t\] Since only the $t = i$ term of the sum depends on $\hat y_i$: \[\frac{\partial E}{\partial \hat y_i} = -\frac{y_i}{\hat y_i}\]
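A small numeric illustration of these two formulas; the one-hot target and the prediction below are made-up values, only there to fix shapes:

```python
import numpy as np

y = np.array([0.0, 0.0, 0.0, 1.0, 0.0])      # hypothetical one-hot target
y_hat = np.array([0.1, 0.1, 0.2, 0.5, 0.1])  # hypothetical softmax output

E = -np.sum(y * np.log(y_hat))               # cross-entropy: -sum_i y_i log y_hat_i
grad = -y / y_hat                            # dE/dy_hat_i = -y_i / y_hat_i

print(E, grad)   # the gradient is nonzero only at the true class
```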

The BPTT procedure for RNNs

The RNN equations:

\[s_t = \tanh(Ux_t + W s_{t-1})\] \[\hat y_t = \text{softmax}(V s_t)\]
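These two equations translate directly into a forward-pass sketch. All dimensions, weights, and inputs below are illustrative assumptions, not values from the post:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out, T = 3, 4, 5, 4               # hypothetical sizes

U = rng.normal(scale=0.1, size=(d_hid, d_in))    # weights for x_t
W = rng.normal(scale=0.1, size=(d_hid, d_hid))   # weights for s_{t-1}
V = rng.normal(scale=0.1, size=(d_out, d_hid))   # weights for s_t

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

xs = [rng.normal(size=d_in) for _ in range(T)]   # a made-up input sequence
s = np.zeros(d_hid)                              # convention: s_0 = 0
states, preds = [s], []
for x in xs:
    s = np.tanh(U @ x + W @ s)                   # s_t = tanh(U x_t + W s_{t-1})
    states.append(s)
    preds.append(softmax(V @ s))                 # y_hat_t = softmax(V s_t)
```

Keeping every $s_t$ in `states` is what makes BPTT possible later: the backward pass has to revisit each hidden state.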

The cross-entropy between the prediction $\hat y_t$ and the ground truth $y_t$ gives the per-step loss:

\[E_t(y_t, \hat y_t) = -y_t \log \hat y_t\] \[\begin{aligned} E(y, \hat y) &= \sum_t E_t(y_t, \hat y_t) \\ &= -\sum_t y_t \log \hat y_t \end{aligned}\]

In a sequence-labeling task we make a prediction and incur a loss at every time step, so the final loss is the sum of the per-step losses, as written above.

By linearity, the gradient of the total loss with respect to $W$ is then the sum of the per-step gradients:

\[\frac{\partial E}{\partial W} = \sum_t \frac{\partial E_t}{\partial W}\]

Below we work through time step 3 as an example; every other time step is handled in the same way.

First, take the derivative with respect to the parameter $V$:

\[\begin{aligned} \frac{\partial E_3}{\partial V} &= \frac{\partial E_3}{\partial \hat y_3} \frac{\partial \hat y_3}{\partial V} \\ &= \frac{\partial E_3}{\partial \hat y_3} \frac{\partial \hat y_3}{\partial z_3} \frac{\partial z_3}{\partial V} \\ &= [-\frac{y_3}{\hat y_3} \cdot \hat y_3(1- \hat y_3) - \sum_{i \neq 3, i=0}^4 \frac{y_i}{\hat y_i}(-\hat y_i \hat y_3)] \cdot s_3 \\ &= [-y_3(1 - \hat y_3) + \sum_{i \neq 3, i=0}^4 y_i \hat y_3] \cdot s_3 \\ &= [-y_3 + y_3 \hat y_3 + \sum_{i \neq 3, i=0}^4 y_i \hat y_3] \cdot s_3 \\ &= [-y_3 + \hat y_3 \sum_{i=0}^4 y_i] \cdot s_3 \\ &= (\hat y_3 - y_3) \cdot s_3 \end{aligned}\]

Here $i$ runs over the output classes (five classes, $i = 0, \dots, 4$, in this example), and inside the brackets the subscript 3 picks out class 3 of the output; the final step uses the fact that $y$ is one-hot, so $\sum_{i=0}^4 y_i = 1$.
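In matrix form (with $\hat y_3$ and $y_3$ as vectors over the classes), this result is the outer product $\frac{\partial E_3}{\partial V} = (\hat y_3 - y_3)\, s_3^T$. The sketch below, using made-up shapes and a random hidden state, verifies it against finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
d_hid, d_out = 4, 5                       # hypothetical sizes
V = rng.normal(scale=0.1, size=(d_out, d_hid))
s3 = np.tanh(rng.normal(size=d_hid))      # some hidden state s_3
y = np.zeros(d_out); y[2] = 1.0           # hypothetical one-hot target

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(Vm):
    return -np.sum(y * np.log(softmax(Vm @ s3)))

grad_analytic = np.outer(softmax(V @ s3) - y, s3)   # (y_hat - y) s_3^T

# Entry-by-entry central finite differences.
eps = 1e-6
grad_numeric = np.zeros_like(V)
for i in range(d_out):
    for j in range(d_hid):
        V_plus, V_minus = V.copy(), V.copy()
        V_plus[i, j] += eps; V_minus[i, j] -= eps
        grad_numeric[i, j] = (loss(V_plus) - loss(V_minus)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))   # True
```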

Next, differentiate with respect to $W$:

\[\frac{\partial E_3}{\partial W} = \frac{\partial E_3}{\partial \hat y_3} \frac{\partial \hat y_3}{\partial s_3} \frac{\partial s_3}{\partial W}\]

However, $s_3 = \tanh(Ux_3 + Ws_2)$ depends on both $W$ and $s_2$, and $s_2$ itself depends on $W$ (and on earlier states). We therefore cannot treat $s_2$ as a constant when differentiating, and must keep applying the chain rule back through time:

\[\begin{aligned} \frac{\partial E_3}{\partial W} &= \frac{\partial E_3}{\partial \hat y_3} \frac{\partial \hat y_3}{\partial s_3} \frac{\partial s_3}{\partial W} \\ &= \frac{\partial E_3}{\partial \hat y_3} \frac{\partial \hat y_3}{\partial s_3}(\frac{\partial s_3}{\partial s_3}\frac{\partial s_3}{\partial W} + \frac{\partial s_3}{\partial s_2}\frac{\partial s_2}{\partial W} + \frac{\partial s_3}{\partial s_1}\frac{\partial s_1}{\partial W} + \frac{\partial s_3}{\partial s_0}\frac{\partial s_0}{\partial W}) \end{aligned}\]

Collecting everything into one formula:

\[\frac{\partial E_3}{\partial W} = \sum_{k=0}^3 \frac{\partial E_3}{\partial \hat y_3} \frac{\partial \hat y_3}{\partial s_3} \frac{\partial s_3}{\partial s_k} \frac{\partial s_k}{\partial W}\]
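This sum over $k$ is exactly what backpropagation through time computes: one backward sweep from the output error accumulates one term per time step. Below is a minimal sketch under assumed conditions (three steps, loss only at the last step, $s_0 = 0$; all names and sizes are hypothetical), finished with a finite-difference spot check. The comment inside the loop also covers the $U$ case mentioned below:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out, T = 3, 4, 5, 3               # hypothetical sizes, T = 3 steps

U = rng.normal(scale=0.1, size=(d_hid, d_in))
W = rng.normal(scale=0.1, size=(d_hid, d_hid))
V = rng.normal(scale=0.1, size=(d_out, d_hid))
xs = [rng.normal(size=d_in) for _ in range(T)]   # made-up inputs
y = np.zeros(d_out); y[1] = 1.0                  # hypothetical one-hot target at t = 3

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(Wm):
    s = np.zeros(d_hid)                          # s_0 = 0
    states = [s]
    for x in xs:
        s = np.tanh(U @ x + Wm @ s)              # s_t = tanh(U x_t + W s_{t-1})
        states.append(s)
    return states, softmax(V @ states[-1])       # predict only at the last step

states, y_hat = forward(W)

# delta = dE_3/dh_t with h_t = U x_t + W s_{t-1}; note tanh'(h_t) = 1 - s_t^2.
delta = (1 - states[-1] ** 2) * (V.T @ (y_hat - y))
grad_W = np.zeros_like(W)
for t in range(T, 0, -1):              # terms k = 3, 2, 1; k = 0 vanishes (s_0 const)
    grad_W += np.outer(delta, states[t - 1])     # direct dependence of s_t on W
    # (for U, one would accumulate np.outer(delta, xs[t - 1]) here instead)
    delta = (1 - states[t - 1] ** 2) * (W.T @ delta)   # push the error one step back

# Finite-difference spot check on a single entry of W.
def loss(Wm):
    _, yh = forward(Wm)
    return -np.sum(y * np.log(yh))

eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 1] += eps; W_minus[0, 1] -= eps
print(grad_W[0, 1], (loss(W_plus) - loss(W_minus)) / (2 * eps))   # should agree
```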

Differentiating with respect to $U$ proceeds in exactly the same way as for $W$!
