Simple Policy gradient Training on armed bandit

In this post, we are going to build a simple Policy gradient experiment, on an “n-armed bandit” problem.


Reference implementation

  • So our reference implementation goes as follow:
    import numpy as np
    import random
    import tensorflow as tf
    import tensorflow.contrib.slim as slim
    import numpy as np
    from nv.core.utils import *
    # cf.
    # List out our bandit arms. 
    # Currently arm 4 (index #3) is set to most often provide a positive reward.
    bandit_arms = [0.2,0,-0.2,-2]
    num_arms = len(bandit_arms)
    def pullBandit(bandit):
        #Get a random number.
        result = np.random.randn(1)
        if result > bandit:
            #return a positive reward.
            return 1
            #return a negative reward.
            return -1
    def train_armed_bandit_network():
        # These two lines established the feed-forward part of the network. 
        weights = tf.Variable(tf.ones([num_arms]))
        output = tf.nn.softmax(weights)
        # The next six lines establish the training proceedure. We feed the reward and chosen action into the network
        # to compute the loss, and use it to update the network.
        reward_holder = tf.placeholder(shape=[1],dtype=tf.float32)
        action_holder = tf.placeholder(shape=[1],dtype=tf.int32)
        responsible_output = tf.slice(output,action_holder,[1])
        loss = -(tf.log(responsible_output)*reward_holder)
        optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)
        update = optimizer.minimize(loss)
        total_episodes = 1000 #Set total number of episodes to train agent on.
        total_reward = np.zeros(num_arms) #Set scoreboard for bandit arms to 0.
        init = tf.global_variables_initializer()
        # Launch the tensorflow graph
        with tf.Session() as sess:
            i = 0
            while i < total_episodes:
                #Choose action according to Boltzmann distribution.
                actions =
                a = np.random.choice(actions,p=actions)
                action = np.argmax(actions == a)
                reward = pullBandit(bandit_arms[action]) #Get our reward from picking one of the bandit arms.
                #Update the network.
                _,resp,ww =[update,responsible_output,weights], feed_dict={reward_holder:[reward],action_holder:[action]})
                #Update our running tally of scores.
                total_reward[action] += reward
                if i % 50 == 0:
                    logDEBUG("Running reward for the " + str(num_arms) + " arms of the bandit: " + str(total_reward))
        logDEBUG("\nThe agent thinks arm " + str(np.argmax(ww)+1) + " is the most promising....")
        if np.argmax(ww) == np.argmax(-np.array(bandit_arms)):
            logDEBUG("...and it was right!")
            logDEBUG("...and it was wrong!")


  • In the network above, we start with n weights \((w_1,w_2,\dots,w_n)\) where n is the number of arms we have on our bandit.
  • Then what we call the “Output” of the network, is a simple softmax taken on those weights, so \(out_i = \frac{e^{w_i}}{\sum_k e^{w_k}}\)
  • So when we run the network until the output production we get out the probability corresponding to each possible action (ie. each arm we can pull on the bandit), so we can see this output as the action probabilities: \((p_1,p_2,\dots,p_n)\)
  • We use those probabitities to select an action to perform (so we perform a so called Boltzman distribution here):
                #Choose action according to Boltzmann distribution.
                actions =
                a = np.random.choice(actions,p=actions)
                action = np.argmax(actions == a)
  • Note: personally I find the code above a bit confusing, I would rather do something like the following myself:
    actions = np.arange(num_actions) # keep this out of the loop obviously.
    probs =
    action = np.random.choice(actions,p=probs)
  • Then we check what kind of reward we get from the environment using this selected action, and the reward can be +1 or -1
  • So the next step is the actual update of the network: using the selected action \(a_i\) and the reward we just received \(R\), we build the loss value: \(loss = -log(out_i) * R\)
  • In the loss formula above the \(out_i\) value is always in (0.0, 1.0), so the log value is always negative, and thus, the loss always has the sign of the reward value. ⇒ So I guess if the loss is positive, the optimiser is going to increase the underlying weight, and if it's negative, it's going to decrease it instead (? This might need some clarification actually :-))
  • Clarification on the loss sign: let's consider the loss as a function of the weights, and from there try to compute the gradient, because the update the optimizer will apply eventually is something like: \(w_i = w_i - \alpha \times \frac{\partial Loss}{\partial w_i}\). So we have:

\[ \frac{\partial Loss}{\partial w_i} = \frac{\partial}{\partial w_i} \left(- log(out_i) \times R \right)\]

  • from a corner of my head I sort of remember that: \(ln'(x) = \frac1x\) and \((g(f))' = f' \times g'(f) \) (oh my… this is so far lol). So we can compute the following:

\[ \frac{\partial Loss}{\partial w_i} = -R \times \frac{\partial}{\partial w_i} \left(out_i\right) \times \frac{1}{out_i} \]

  • And then using the derivation formula: \((f*g)' = f'*g + g'*f\), we can continue on our path with:

\begin{align} \frac{\partial}{\partial w_i} \left(out_i\right) & = \frac{\partial}{\partial w_i} \left(\frac{e^{w_i}}{\sum_k e^{w_k}}\right) \\ & = \frac{\partial}{\partial w_i} \left(e^{w_i}\right) \frac{1}{\sum_k e^{w_k}} + e^{w_i} \frac{\partial}{\partial w_i} \left(\frac{1}{\sum_k e^{w_k}}\right) \\ & = \frac{e^{w_i}}{\sum_k e^{w_k}} + e^{w_i} e^{w_i} \left( -\frac{1}{(\sum_k e^{w_k})^2}\right) \\ & = \frac{e^{w_i}}{\sum_k e^{w_k}} - \left( \frac{e^{w_i}}{\sum_k e^{w_k}} \right)^2 \\ & = out_i \cdot ( 1 - out_i) \end{align}

  • So in the end we get:

\[ \frac{\partial Loss}{\partial w_i} = - R \cdot (1 - out_i)\]

  • And thus the weight update we will apply is going to be:

\[ w_i = w_i + \alpha \cdot R \cdot (1 - out_i)\]

  • ⇒ So yes we will increase the value of the weight if the reward is positive, and all is OK :-) (remember: \(out_i\) is a probability value, so in range [0,1])
  • And that's it for the algorithm! Nothing too fancy here… Actually, it could probably be done more efficiently without a neural network, but what I think I should take from that implementation is that we are learning the “weight or probability of the actions” we should take directly, and this is what is called learning a Policy apparently, and thus the Policy gradient training: Okay! Lesson learned!… Next ;-)!