no way to compare when less than two revisions
Differences
This shows you the differences between two versions of the page.
— | blog:2019:0308_full_policy_gradient [2020/07/10 12:11] (current) – created - external edit 127.0.0.1 | ||
---|---|---|---|
Line 1: | Line 1: | ||
+ | ====== Full policy Gradient agent for Reinforcement Learning ====== | ||
+ | |||
+ | {{tag> | ||
+ | |||
+ | This time we are going to handle the creation of a full policy gradient algorithm implementation training on the OpenAI CartPole environment. | ||
+ | As opposed to the previous [[.0307_simple_armed_bandit|simple policy gradient]] implementation, | ||
+ | |||
+ | ====== ====== | ||
+ | |||
+ | ===== References ===== | ||
+ | |||
+ | * [[https:// | ||
+ | * https:// | ||
+ | * http:// | ||
+ | * [[https:// | ||
+ | |||
+ | ===== Definitions ===== | ||
+ | |||
+ | * **Policy network**: A network that takes as input our current state, and outputs the action we should do (with probability distribution [ie. **stochastic** policy]) | ||
+ | |||
+ | * In vanilla supervised learning the objective is to maximize \(\sum_i logp(y_i|x_i)\) where \(x_i,y_i\) are training examples (such as images and their labels). | ||
+ | |||
+ | ===== Initial implementation ===== | ||
+ | |||
+ | * The reference implementation I will use is as follow: <sxh python> | ||
+ | import random | ||
+ | import gym | ||
+ | |||
+ | import tensorflow as tf | ||
+ | import tensorflow.contrib.slim as slim | ||
+ | import numpy as np | ||
+ | |||
+ | from nv.core.utils import * | ||
+ | |||
+ | # cf. https:// | ||
+ | |||
+ | def train_policy_network(): | ||
+ | logDEBUG(" | ||
+ | env = gym.make(' | ||
+ | |||
+ | # Hyperparameters | ||
+ | H = 10 # number of hidden layer neurons | ||
+ | batch_size = 5 # every how many episodes to do a param update? | ||
+ | learning_rate = 1e-2 # feel free to play with this to train faster or more stably. | ||
+ | gamma = 0.99 # discount factor for reward | ||
+ | |||
+ | D = 4 # input dimensionality | ||
+ | |||
+ | def discount_rewards(r): | ||
+ | """ | ||
+ | discounted_r = np.zeros_like(r) | ||
+ | running_add = 0 | ||
+ | for t in reversed(range(0, | ||
+ | running_add = running_add * gamma + r[t] | ||
+ | discounted_r[t] = running_add | ||
+ | return discounted_r | ||
+ | | ||
+ | tf.reset_default_graph() | ||
+ | |||
+ | # This defines the network as it goes from taking an observation of the environment to | ||
+ | # giving a probability of chosing to the action of moving left or right. | ||
+ | observations = tf.placeholder(tf.float32, | ||
+ | W1 = tf.get_variable(" | ||
+ | initializer=tf.contrib.layers.xavier_initializer()) | ||
+ | layer1 = tf.nn.relu(tf.matmul(observations, | ||
+ | W2 = tf.get_variable(" | ||
+ | initializer=tf.contrib.layers.xavier_initializer()) | ||
+ | score = tf.matmul(layer1, | ||
+ | probability = tf.nn.sigmoid(score) | ||
+ | |||
+ | # From here we define the parts of the network needed for learning a good policy. | ||
+ | tvars = tf.trainable_variables() | ||
+ | input_y = tf.placeholder(tf.float32, | ||
+ | advantages = tf.placeholder(tf.float32, | ||
+ | |||
+ | # The loss function. This sends the weights in the direction of making actions | ||
+ | # that gave good advantage (reward over time) more likely, and actions that didn't less likely. | ||
+ | loglik = tf.log(input_y*(input_y - probability) + (1 - input_y)*(input_y + probability)) | ||
+ | loss = -tf.reduce_mean(loglik * advantages) | ||
+ | newGrads = tf.gradients(loss, | ||
+ | |||
+ | # Once we have collected a series of gradients from multiple episodes, we apply them. | ||
+ | # We don't just apply gradeients after every episode in order to account for noise in the reward signal. | ||
+ | adam = tf.train.AdamOptimizer(learning_rate=learning_rate) # Our optimizer | ||
+ | W1Grad = tf.placeholder(tf.float32, | ||
+ | W2Grad = tf.placeholder(tf.float32, | ||
+ | batchGrad = [W1Grad, | ||
+ | updateGrads = adam.apply_gradients(zip(batchGrad, | ||
+ | | ||
+ | # Running the training: | ||
+ | xs, | ||
+ | running_reward = None | ||
+ | reward_sum = 0 | ||
+ | episode_number = 1 | ||
+ | total_episodes = 10000 | ||
+ | init = tf.global_variables_initializer() | ||
+ | |||
+ | # Launch the graph | ||
+ | with tf.Session() as sess: | ||
+ | rendering = False | ||
+ | sess.run(init) | ||
+ | observation = env.reset() # Obtain an initial observation of the environment | ||
+ | |||
+ | # Reset the gradient placeholder. We will collect gradients in | ||
+ | # gradBuffer until we are ready to update our policy network. | ||
+ | gradBuffer = sess.run(tvars) | ||
+ | for ix,grad in enumerate(gradBuffer): | ||
+ | gradBuffer[ix] = grad * 0 | ||
+ | | ||
+ | while episode_number <= total_episodes: | ||
+ | | ||
+ | # Rendering the environment slows things down, | ||
+ | # so let's only look at it once our agent is doing a good job. | ||
+ | # Note: I currently do not have support for OpenGL in custom python build. | ||
+ | # So the rendering below is not working. | ||
+ | # if reward_sum/ | ||
+ | # | ||
+ | # | ||
+ | | ||
+ | # Make sure the observation is in a shape the network can handle. | ||
+ | x = np.reshape(observation, | ||
+ | | ||
+ | # Run the policy network and get an action to take. | ||
+ | tfprob = sess.run(probability, | ||
+ | action = 1 if np.random.uniform() < tfprob else 0 | ||
+ | | ||
+ | xs.append(x) # observation | ||
+ | y = 1 if action == 0 else 0 # a "fake label" | ||
+ | ys.append(y) | ||
+ | |||
+ | # step the environment and get new measurements | ||
+ | observation, | ||
+ | reward_sum += reward | ||
+ | |||
+ | drs.append(reward) # record reward (has to be done after we call step() to get reward for previous action) | ||
+ | |||
+ | if done: | ||
+ | episode_number += 1 | ||
+ | # stack together all inputs, hidden states, action gradients, and rewards for this episode | ||
+ | epx = np.vstack(xs) | ||
+ | epy = np.vstack(ys) | ||
+ | epr = np.vstack(drs) | ||
+ | tfp = tfps | ||
+ | xs, | ||
+ | |||
+ | # compute the discounted reward backwards through time | ||
+ | discounted_epr = discount_rewards(epr) | ||
+ | # size the rewards to be unit normal (helps control the gradient estimator variance) | ||
+ | discounted_epr -= np.mean(discounted_epr) | ||
+ | discounted_epr //= np.std(discounted_epr) | ||
+ | | ||
+ | # Get the gradient for this episode, and save it in the gradBuffer | ||
+ | tGrad = sess.run(newGrads, | ||
+ | for ix,grad in enumerate(tGrad): | ||
+ | gradBuffer[ix] += grad | ||
+ | | ||
+ | # If we have completed enough episodes, then update the policy network with our gradients. | ||
+ | if episode_number % batch_size == 0: | ||
+ | sess.run(updateGrads, | ||
+ | for ix,grad in enumerate(gradBuffer): | ||
+ | gradBuffer[ix] = grad * 0 | ||
+ | | ||
+ | # Give a summary of how well our network is doing for each batch of episodes. | ||
+ | running_reward = reward_sum if running_reward is None else running_reward * 0.99 + reward_sum * 0.01 | ||
+ | logDEBUG(' | ||
+ | batch_size, reward_sum// | ||
+ | | ||
+ | if reward_sum// | ||
+ | logDEBUG(" | ||
+ | break | ||
+ | | ||
+ | reward_sum = 0 | ||
+ | | ||
+ | observation = env.reset() | ||
+ | | ||
+ | logDEBUG(" | ||
+ | |||
+ | * When executing this training function, we can see that we have progress in the network actions, and we eventually get the following results: < | ||
+ | 2019-03-08T14: | ||
+ | 2019-03-08T14: | ||
+ | 2019-03-08T14: | ||
+ | 2019-03-08T14: | ||
+ | 2019-03-08T14: | ||
+ | 2019-03-08T14: | ||
+ | 2019-03-08T14: | ||
+ | 2019-03-08T14: | ||
+ | 2019-03-08T14: | ||
+ | |||
+ | ===== Analysis ===== | ||
+ | |||
+ | * Now let's see if I can make sense of all the elements in the code above... | ||
+ | * The input size we use for the observations is **D=4**. | ||
+ | * Our network take as inputs a given number of observations, | ||
+ | * Those inputs go into an hidden layer of size H=10, so we have a matrix of weights of size (4, 10) | ||
+ | * Then we perform a relu activation, | ||
+ | * And we feed those 10 values into our second layer of size (10,1), to produce a final **score value** | ||
+ | * And finally we get the sigmoid of this to have a probability value for "going right" | ||
+ | |||
+ | * => So far, so good. But then comes the interesting part: the training backpropagation. | ||
+ | | ||
+ | * We collect all the trainable variables with **tf.trainable_variables()**, | ||
+ | |||
+ | * Then we compute the loss as: | ||
+ | |||
+ | \[Loss = - \sum_i log(y_i*(y_i - pi) + (1-y_i)*(y_i + pi)) * A_i\] | ||
+ | |||
+ | * The summation above is done on the number of observations/ | ||
+ | |||
+ | * **About this loss formula**: | ||
+ | * The "fake label" that we receive for this formula are defined as follow: <sxh python>y = 1 if action == 0 else 0 # a "fake label"</ | ||
+ | * So... the label we will use is the invert of the action we took ? That's surprising to me. | ||
+ | * The labels we provide will only be 0 or 1, so the loss value component for each i is really: | ||
+ | * if \(y_i = 1\): \(Loss_i = - (1 - p_i) * A_i\) | ||
+ | * if \(y_i = 0\): \(Loss_i = - p_i * A_i\) | ||
+ | * In both cases, the closer we are to the provided label, the less loss we take, so the better our state. | ||
+ | * And in both cases, the higher the avantage value \(A_i\) the lower the loss, so the better our state too. | ||
+ | * So the signs and the formula make sense, assuming that, < | ||
+ | * And if this is the case, it means that the next time we face those inputs, our probability should be closer to 1.0, so the action selection below will give us the action 1 more often, because we use the code: <sxh python> | ||
+ | tfprob = sess.run(probability, | ||
+ | action = 1 if np.random.uniform() < tfprob else 0 | ||
+ | | ||
+ | xs.append(x) # observation | ||
+ | y = 1 if action == 0 else 0 # a "fake label" | ||
+ | ys.append(y)</ | ||
+ | * But then if the selected action is 1, **we are going to use the fake label value: y=0** ? How could that be ? | ||
+ | * What happens if we try to invert this relation, and thus use instead: <sxh python>y = 1 if action == 1 else 0 # a "fake label"</ | ||
+ | * => With the change of the code mentioned above, the **average reward is going down**, so that's not working, and the invertion is really needed. But why ? :-S | ||
+ | * Let's rethink about the definition of the loss function... **okay I see now!** **Stupid me**... :-) The loss is a quantity we want to minimize, and as such for each component \(Loss_i\), when the label \(y_i = 1\) we want the probability to be as close to zero as possible, and when \(y_i = 0\), then we want the probability to be as close to 1.0 as possible! | ||
+ | * So \(p_i \to 1.0\) means we want the **action 0**, not action 1! So the inversion mentioned above is indeed legit, ouf! | ||
+ | |||
+ | * Then we compute the gradient of the loss value with respect to all the trainable variables with the **tf.gradients** function (Note that this function will return a sum of the derivative for each element in the input tensor, but here the loss is only a scalar value, so there is no summation to perform). | ||
+ | * So, in other words, we are computing the values: \( \forall i: \frac{\partial Loss(w)}{\partial w_i}\) at the " | ||
+ | * And we store those gradient values in a tensor that we call **newGrads** for later use, instead of applying them immediately. | ||
+ | |||
+ | |||
+ | * The last part of the network graph is used to apply backpropagation using gradients that we will manually provide (this is really starting to feel funny...) | ||
+ | * So basically, we run one episode to its end. And for each step, we collect: the current observation \(x\), the action selected by our network \(y\) (actually y is the opposite of the action a, but never mind), and then also the reward \(\mathbf{r}\) | ||
+ | * Then we stack all those values vertically, to build an input vertor \(\mathbf{x}\), | ||
+ | * **Important note**: I said above that the function **tf.trainable_variables** would provide us with "all the weights", | ||
+ | * So the " | ||
+ | * And finally, when we have collected enough gradients from some episodes (ie. " | ||
+ | |||
+ | * **On the advantage computation**: | ||
+ | |||
+ | \[A_t = \sum_{k=0}^\infty \gamma^k r_{t+k}\] | ||
+ | |||
+ | * Then we re-normalize the advantages to stabilize the training and more clearing separate very good and very bad actions. | ||
+ | |||
+ | * One thing I find a bit surprise on this point is the re-normalization step: <sxh python> | ||
+ | discounted_epr = discount_rewards(epr) | ||
+ | # size the rewards to be unit normal (helps control the gradient estimator variance) | ||
+ | discounted_epr -= np.mean(discounted_epr) | ||
+ | discounted_epr //= np.std(discounted_epr) | ||
+ | </ | ||
+ | |||
+ | * => above we see that we perform a **floor division** with the standard deviation... And I really don't see why. Trying without this floor division now... => **OK**, so it still works if we remove this floor division, and the convergence to a solution is actually **somewhat | ||
+ | |||
+ | ===== Conclusion ===== | ||
+ | |||
+ | * I still cannot believe this is working so well... but it is. It's amazing :-). | ||