
Simple Policy Gradient Training on an n-Armed Bandit

In this post, we build a simple policy gradient experiment on an “n-armed bandit” problem.

References

Reference implementation

Analysis

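Let \(out_i\) be the softmax probability of the chosen arm \(i\) under the weights \(w\), and let \(R\) be the reward received. The loss is the negative log-probability of the chosen arm scaled by the reward, so its gradient with respect to \(w_i\) is:

\[ \frac{\partial Loss}{\partial w_i} = \frac{\partial}{\partial w_i} \left(- \log(out_i) \times R \right)\]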

\[ \frac{\partial Loss}{\partial w_i} = -R \times \frac{\partial}{\partial w_i} \left(out_i\right) \times \frac{1}{out_i} \]

\begin{align} \frac{\partial}{\partial w_i} \left(out_i\right) & = \frac{\partial}{\partial w_i} \left(\frac{e^{w_i}}{\sum_k e^{w_k}}\right) \\ & = \frac{\partial}{\partial w_i} \left(e^{w_i}\right) \frac{1}{\sum_k e^{w_k}} + e^{w_i} \frac{\partial}{\partial w_i} \left(\frac{1}{\sum_k e^{w_k}}\right) \\ & = \frac{e^{w_i}}{\sum_k e^{w_k}} + e^{w_i} e^{w_i} \left( -\frac{1}{(\sum_k e^{w_k})^2}\right) \\ & = \frac{e^{w_i}}{\sum_k e^{w_k}} - \left( \frac{e^{w_i}}{\sum_k e^{w_k}} \right)^2 \\ & = out_i \cdot ( 1 - out_i) \end{align}
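The softmax derivative above can be sanity-checked numerically. The following is a minimal sketch, not part of the reference implementation: it compares the analytic result \(out_i \cdot (1 - out_i)\) against a central finite-difference estimate. The weight values, the index `i`, and the helper names `softmax` and `numeric_dout_dw` are illustrative assumptions.

```python
import math

def softmax(w):
    """Softmax over a list of weights, shifted by max(w) for numerical stability."""
    m = max(w)
    exps = [math.exp(x - m) for x in w]
    total = sum(exps)
    return [e / total for e in exps]

def numeric_dout_dw(w, i, eps=1e-6):
    """Central finite-difference estimate of d(out_i)/d(w_i)."""
    w_plus = list(w)
    w_plus[i] += eps
    w_minus = list(w)
    w_minus[i] -= eps
    return (softmax(w_plus)[i] - softmax(w_minus)[i]) / (2 * eps)

# Arbitrary illustrative weights (an assumption, not taken from the post).
w = [0.5, -1.0, 2.0]
i = 0

out = softmax(w)
analytic = out[i] * (1 - out[i])   # result of the derivation above
numeric = numeric_dout_dw(w, i)    # finite-difference estimate

print(analytic, numeric)
```

The two printed values should agree to several decimal places; a noticeable gap would point to an error in the derivation or in the code.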

\[ \frac{\partial Loss}{\partial w_i} = - R \cdot (1 - out_i)\]

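A gradient descent step with learning rate \(\alpha\) moves \(w_i\) against this gradient, which gives the update rule:

\[ w_i \leftarrow w_i + \alpha \cdot R \cdot (1 - out_i)\]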
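To see the update rule in action, here is a small sketch of a training loop on a bandit with Bernoulli rewards. It is written under stated assumptions and is not the reference implementation: the reward model (1 with probability `arm_probs[i]`, else 0), the function names `update` and `train_bandit`, and the values of `alpha`, `steps`, and `seed` are all hypothetical. Following the derivation above, only the weight of the chosen arm is updated; the full gradient of the loss also has components for the other weights, which this sketch leaves out.

```python
import math
import random

def softmax(w):
    """Softmax over a list of weights, shifted by max(w) for numerical stability."""
    m = max(w)
    exps = [math.exp(x - m) for x in w]
    total = sum(exps)
    return [e / total for e in exps]

def update(w, i, R, alpha):
    """Return new weights after one step: w_i <- w_i + alpha * R * (1 - out_i)."""
    out = softmax(w)
    new_w = list(w)
    new_w[i] += alpha * R * (1.0 - out[i])
    return new_w

def train_bandit(arm_probs, alpha=0.1, steps=2000, seed=0):
    """Train softmax weights on a Bernoulli bandit (assumed reward model)."""
    rng = random.Random(seed)
    w = [0.0] * len(arm_probs)
    for _ in range(steps):
        out = softmax(w)
        # Sample an arm according to the current policy.
        i = rng.choices(range(len(w)), weights=out)[0]
        # Assumed reward: 1 with probability arm_probs[i], otherwise 0.
        R = 1.0 if rng.random() < arm_probs[i] else 0.0
        w = update(w, i, R, alpha)
    return w

# Hypothetical three-armed bandit; the probabilities are illustrative only.
arm_probs = [0.2, 0.5, 0.8]
w = train_bandit(arm_probs)
print([round(p, 3) for p in softmax(w)])
```

Because the rewards here are never negative and no baseline is subtracted, weights can only grow, so a run may settle on a suboptimal arm; trying different seeds or a reward of -1 for failures is a simple way to explore that behaviour.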