====== QNetwork learning ======

{{tag>deep_learning}}

Continuing on my current "Reinforcement Learning" path, we are now going to try a Q-network implementation, which we will train on the FrozenLake environment again.

====== ======

===== References =====

  * [[https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0|Simple Reinforcement Learning with Tensorflow Part 0]]

===== Reference implementation =====

  * The reference implementation I will be using is as follows: <sxh python>import gym
import numpy as np
import random
import tensorflow as tf
from nv.core.utils import *

def train_qn_env(numEpisodes=2000, ename="FrozenLake-v0"):
    logDEBUG("Building FrozenLake environment...")
    env = gym.make(ename)

    # Implementation of the network:
    tf.reset_default_graph()

    # These lines establish the feed-forward part of the network used to choose actions
    inputs1 = tf.placeholder(shape=[1,16],dtype=tf.float32)
    W = tf.Variable(tf.random_uniform([16,4],0,0.01))
    Qout = tf.matmul(inputs1,W)
    predict = tf.argmax(Qout,1)

    # Below we obtain the loss by taking the sum of squares difference between the target and prediction Q values.
    nextQ = tf.placeholder(shape=[1,4],dtype=tf.float32)
    loss = tf.reduce_sum(tf.square(nextQ - Qout))
    trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
    updateModel = trainer.minimize(loss)

    # Actual training of the model:
    logDEBUG("Initializing training...")
    init = tf.global_variables_initializer()

    # Set learning parameters
    y = .99
    e = 0.1

    # Create lists to contain total rewards and steps per episode
    jList = []
    rList = []

    with tf.Session() as sess:
        sess.run(init)

        for i in range(numEpisodes):
            # Reset environment and get first new observation
            s = env.reset()
            rAll = 0
            d = False
            j = 0

            # The Q-Network
            while j < 99:
                j += 1

                # Choose an action greedily (with e chance of random action) from the Q-network
                a,allQ = sess.run([predict,Qout],feed_dict={inputs1:np.identity(16)[s:s+1]})
                if np.random.rand(1) < e:
                    a[0] = env.action_space.sample()

                # Get new state and reward from environment
                s1,r,d,_ = env.step(a[0])

                # Obtain the Q' values by feeding the new state through our network
                Q1 = sess.run(Qout,feed_dict={inputs1:np.identity(16)[s1:s1+1]})
                # Obtain maxQ' and set our target value for chosen action.
                maxQ1 = np.max(Q1)

                targetQ = allQ
                targetQ[0,a[0]] = r + y*maxQ1

                # Train our network using target and predicted Q values
                _,W1 = sess.run([updateModel,W],feed_dict={inputs1:np.identity(16)[s:s+1],nextQ:targetQ})
                rAll += r
                s = s1

                if d == True:
                    # Reduce chance of random action as we train the model.
                    e = 1./((i/50) + 10)
                    break

            jList.append(j)
            rList.append(rAll)

            logDEBUG("Episode %d/%d: reward: %f" % (i+1, numEpisodes, rAll))

    logDEBUG("Percent of successful episodes: " + str(sum(rList)/numEpisodes) + "%")

    # Return the per-episode steps and rewards so the curves below can be plotted:
    return jList, rList</sxh>
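
  * For completeness, here is a minimal sketch (not part of the original listing, and assuming matplotlib is available) of how the function can be run and the two curves below produced from the per-episode lists it returns: <sxh python>import matplotlib.pyplot as plt

# Hypothetical driver: run the reference training and plot the per-episode
# reward and number of steps (the two curves shown below).
jList, rList = train_qn_env(numEpisodes=2000)

plt.plot(rList)
plt.title("Reward per episode")
plt.show()

plt.plot(jList)
plt.title("Number of steps per episode")
plt.show()</sxh>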

  * So our reference results are as follows:

{{ projects:nervseed:qnetwork:1-ref_impl-reward.png?1000 }}

{{ projects:nervseed:qnetwork:1-ref_impl-numsteps.png?1000 }}

===== Initial investigations =====

  * The first thing we can notice about the previous implementation is that **it is slow**, much slower than the initial QTable version we played with... But this is not really surprising since we are "feeding samples" one by one to our network...

  * Let's start with a basic timing of the training session: the reference implementation above took about **81.27 seconds** to train on 2000 episodes.
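
  * (Nothing fancy is needed for such a measurement; a minimal sketch, assuming we simply wrap the training call in a wall-clock timer:) <sxh python>import time

# Wall-clock timing of one training run (assumption: the exact timing harness
# used for the figure above is not shown in the post, this is just one way to do it).
start = time.time()
train_qn_env(numEpisodes=2000)
logDEBUG("Training took %.2f seconds" % (time.time() - start))</sxh>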

===== Cleaning session run calls =====

  * Currently, we start with a given state \(S_t\), and we feed it into our network to compute the action values for that state: \(Q_{out} = S_t * W\). Then, on every step, we perform a single update/backpropagation of the loss in our network, and we run many more session executions than seem necessary in the process: **let's start with some cleaning already** (see the small sketch just below)!

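  * Here is a small standalone numpy sketch (not from the original post) of the key observation: with one-hot state encodings, feeding the full 16x16 identity matrix evaluates the Q values of **all** states in a single session.run() call, and row s of the result is exactly what the reference code obtained by feeding np.identity(16)[s:s+1]: <sxh python>import numpy as np

# Toy weight matrix standing in for W (16 states x 4 actions):
W = np.random.uniform(0, 0.01, (16, 4))

# Reference version: evaluate one state at a time with a one-hot input row.
s = 3
q_single = np.identity(16)[s:s+1].dot(W)   # shape (1, 4)

# Batched version: feed the whole identity matrix once -> Q values for all states.
q_all = np.identity(16).dot(W)             # shape (16, 4), i.e. simply W itself

assert np.allclose(q_single, q_all[s:s+1])</sxh>
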
  * So I refactored the implementation a little to remove the unneeded calls to session.run(), as follows: <sxh python>def train_qn_env_v2(numEpisodes=2000, ename="FrozenLake-v0"):
    logDEBUG("Building FrozenLake environment...")
    env = gym.make(ename)

    # Implementation of the network:
    tf.reset_default_graph()

    # These lines establish the feed-forward part of the network used to choose actions
    # inputs = tf.placeholder(shape=[1,16],dtype=tf.float32)
    inputs = tf.placeholder_with_default(np.identity(16, dtype=np.float32), (16,16), "states")

    W = tf.Variable(tf.random_uniform([16,4],0,0.01))
    Qout = tf.matmul(inputs,W)

    # Below we obtain the loss by taking the sum of squares difference between the target and prediction Q values.
    nextQ = tf.placeholder(shape=[16,4],dtype=tf.float32)
    loss = tf.reduce_sum(tf.square(nextQ - Qout))
    trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
    updateModel = trainer.minimize(loss)

    # Actual training of the model:
    logDEBUG("Initializing training...")
    init = tf.global_variables_initializer()

    # Set learning parameters
    y = .99
    e = 0.1

    # Array containing the total reward and number of steps per episode:
    rList = np.zeros((numEpisodes, 2))

    with tf.Session() as sess:
        sess.run(init)

        for i in range(numEpisodes):
            # Reset environment and get first new observation
            s = env.reset()
            rAll = 0
            d = False
            j = 0

            action = None

            # The Q-Network
            while j < 99:
                j += 1

                # Choose an action greedily (with e chance of random action) from the Q-network
                allQ = sess.run(Qout,feed_dict={})

                if np.random.rand(1) < e:
                    action = env.action_space.sample()
                else:
                    action = np.argmax(allQ[s,:])

                # Get new state and reward from environment
                s1,r,d,_ = env.step(action)

                # Now we already have the Q' values since we evaluated all states above:
                # Obtain maxQ' and set our target value for chosen action.
                maxQ1 = np.max(allQ[s1,:])

                targetQ = allQ
                targetQ[s,action] = r + y*maxQ1

                # Train our network using target and predicted Q values
                sess.run(updateModel,feed_dict={nextQ:targetQ})
                rAll += r
                s = s1

                if d == True:
                    # Reduce chance of random action as we train the model.
                    e = 1./((i/50) + 10)
                    break

            # And we assign our data for visualization:
            rList[i] = [rAll, j]

            if (i+1)%100==0:
                mvals = np.mean(rList[i-99:i+1], axis=0)
                logDEBUG("%d/%d: Mean reward: %f" % (i+1, numEpisodes, mvals[0]))

    return rList</sxh>

  * => And with that we already get approximately a **2x** speedup. Let's continue now and try to figure out how to train on multiple states at the same time...

===== Multi-inputs training =====

  * What if we could train on multiple inputs at the same time?

  * Back to the **transition matrix** concept I introduced in my [[.0305_prob_qtable_learning|previous post on Q tables]]: I would like to apply the same idea here. Basically, at each step we use the knowledge we have to update the \(Q_{next}\) matrix with the value \(r + \gamma \max_{a'} Q(S_{t+1}, a')\), but only for the **specific** current state. Yet if we also store the information we collect on transition probabilities and reward probabilities, we can update **all** the values in this \(Q_{next}\) matrix at the same time!

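  * In equation form, the target matrix built at each step in the code below can be written as (with \(\bar R(S,a)\) the empirical mean reward and \(\hat P(S'|S,a)\) the empirical transition frequencies accumulated so far):

\[Q_{next}(S,a) = \bar R(S,a) + \gamma \sum_{S'} \hat P(S'|S,a) \, \max_{a'} Q(S',a')\]
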
  * So, the third implementation I tried was the following: <sxh python>def train_qn_env_v3(numEpisodes=2000, ename="FrozenLake-v0"):
    logDEBUG("Building FrozenLake environment...")
    env = gym.make(ename)

    nstates = env.observation_space.n
    nactions = env.action_space.n

    # Implementation of the network:
    tf.reset_default_graph()

    # These lines establish the feed-forward part of the network used to choose actions
    # inputs = tf.placeholder(shape=[1,16],dtype=tf.float32)
    inputs = tf.placeholder_with_default(np.identity(16, dtype=np.float32), (16,16), "states")

    W = tf.Variable(tf.random_uniform([16,4],0,0.01))
    Qout = tf.matmul(inputs,W)

    # Below we obtain the loss by taking the sum of squares difference between the target and prediction Q values.
    nextQ = tf.placeholder(shape=[16,4],dtype=tf.float32)
    loss = tf.reduce_sum(tf.square(nextQ - Qout))
    trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
    updateModel = trainer.minimize(loss)

    # Actual training of the model:
    logDEBUG("Initializing training...")
    init = tf.global_variables_initializer()

    # Set learning parameters
    y = .99
    e = 0.1

    # Array containing the total reward and number of steps per episode:
    rList = np.zeros((numEpisodes, 2))

    # Here we will also use a state transition tensor:
    tProbs = np.zeros((nstates, nactions, nstates))

    # And a reward probability matrix:
    # Note: the first column of this matrix tells us how many
    # observations we have, and the second column tells us what
    # is the current total reward we got so far with this path.
    rProbs = np.zeros((nstates, nactions, 2))

    # Target Q matrix:
    targetQ = np.zeros((nstates, nactions))

    # epsilon value:
    eps = 1e-10

    with tf.Session() as sess:
        sess.run(init)

        for i in range(numEpisodes):
            # Reset environment and get first new observation
            s = env.reset()
            rAll = 0
            d = False
            j = 0

            action = None

            # The Q-Network
            while j < 99:
                j += 1

                # Choose an action greedily (with e chance of random action) from the Q-network
                allQ = sess.run(Qout,feed_dict={})

                if np.random.rand(1) < e:
                    action = env.action_space.sample()
                else:
                    action = np.argmax(allQ[s,:])

                # Get new state and reward from environment
                s1,r,d,_ = env.step(action)

                # Fill our transitions and reward matrices:
                tProbs[s, action, s1] += 1.0
                rProbs[s, action, :] += [1.0, r]

                # Now we update all the nextQ values together:
                # For each (S,a) pair in the targetQ matrix, we need to set the value
                # to: mean_reward + y * expected_next_state value:
                # So we start with the mean reward:
                targetQ = rProbs[:,:,1]/(rProbs[:,:,0]+eps)

                # Compute the transition weights from each state:
                weights = tProbs/(np.sum(tProbs, axis=2, keepdims=True)+eps)

                # So for each (S, a) pair, we have the nstates weights for each possible following state
                # in the weight matrix above.
                # We need to dot product this with the max Q'(S',a') value we currently have:
                maxQs = np.max(allQ, axis=1)

                targetQ += y * np.dot(weights, maxQs)

                # Train our network using target and predicted Q values
                sess.run(updateModel,feed_dict={nextQ:targetQ})
                rAll += r
                s = s1

                if d == True:
                    # Reduce chance of random action as we train the model.
                    e = 1./((i/50) + 10)
                    break

            # And we assign our data for visualization:
            rList[i] = [rAll, j]

            if (i+1)%100==0:
                mvals = np.mean(rList[i-99:i+1], axis=0)
                logDEBUG("%d/%d: Mean reward: %f" % (i+1, numEpisodes, mvals[0]))

    return rList</sxh>

  * And the results seem significantly better than with the previous implementations:

{{ projects:nervseed:qnetwork:2-transition_matrix-reward.png?1000 }}

===== Recursive Q target update =====

  * Now that we can update all the Q values together in our network, the next step I would like to take is to try to improve the quality of the **Q target matrix**: in the code above, we build this target matrix with a **single update cycle**, using the values currently available in our Q value matrix... But if we consider that this is the targetQ we would "like to reach", we can pretend that it is the one we have as our "Q value source", and thus recursively update our Q target again, and then do it again, and again... until our Q target matrix stabilizes... or explodes :-)

  * So I introduced the following function to compute the targetQ matrix recursively (but we need a counter, because it is not really converging): <sxh python>    def updateQTarget(sourceQ, counter=5, threshold=1e-4):
        # Now we update all the nextQ values together:
        # For each (S,a) pair in the targetQ matrix, we need to set the value
        # to: mean_reward + y * expected_next_state value:
        # So we start with the mean reward:
        targetQ = rProbs[:,:,1]/(rProbs[:,:,0]+eps)

        # Compute the transition weights from each state:
        weights = tProbs/(np.sum(tProbs, axis=2, keepdims=True)+eps)

        # So for each (S, a) pair, we have the nstates weights for each possible following state
        # in the weight matrix above.
        # We need to dot product this with the max Q'(S',a') value we currently have:
        maxQs = np.max(sourceQ, axis=1)

        targetQ += y * np.dot(weights, maxQs)

        # Compute the difference between the source and target matrices:
        tval = np.sum(np.abs(targetQ))
        sval = np.sum(np.abs(sourceQ))
        if tval==sval:
            return targetQ

        diff = 2.0*abs(tval - sval)/(sval+tval)

        if diff > threshold and counter>0:
            # logDEBUG("%d: Current targetQ change ratio: %f" % (counter, diff))
            return updateQTarget(targetQ, counter-1)

        return targetQ</sxh>

  * Then I update the targetQ matrix from the current allQ matrix in the training loop: <sxh python>                # Fill our transitions and reward matrices:
                tProbs[s, action, s1] += 1.0
                rProbs[s, action, :] += [1.0, r]

                targetQ = updateQTarget(allQ)

                # Train our network using target and predicted Q values
                sess.run(updateModel,feed_dict={nextQ:targetQ})
                rAll += r
                s = s1</sxh>

  * With this implementation, we got the following results:

{{ projects:nervseed:qnetwork:3-recursive_Q_target-reward.png?1000 }}

  * => This seems slightly worse than with the non-recursive target Q matrix computation. One thing I also noticed is that the recursive computation is not converging :-( And that's not quite what I would expect...

  * So what can we do to make it converge?

  * Let's rethink a bit how we compute our \(Q(S,a)\) values: first, we will now consider that we give a "value" to a given state on its own, \(Q(S)\). How do we define this state value? It should be defined as the **expected reward** if we reach this state.

  * Actually, we could even consider that the value depends on the path we take to reach that state, so the value we compute becomes \(Q(S_t, a_t, S_{t+1})\). To compute this value we rely on the observations collected during training: for each triplet \((S_i, a_j, S_f)\) we store the number of observations, as well as the cumulative reward.

  * Next we need to compute \(Q(S,a)\): this action can lead us to other states \(S_i\), each with a probability \(p_i\), and the reward we can expect in that case is the mean reward that we can extract from the \(Q(S, a, S_i)\) tensor described above.

  * So we have \[Q(S,a) = \sum_i p_i \times \left(\frac{Q[S, a, S_i, 1]}{Q[S, a, S_i, 0]} + \gamma \times Q(S_i)\right) \]

with: \[p_i = \frac{Q[S, a, S_i, 0]}{\sum_j Q[S, a, S_j, 0]} \]

  * Then we can compute the \(Q(S_i)\) values as: \(Q(S_i) = \max_a(Q(S_i, a)) \)
  * Then we can recursively recompute the \(Q(S_i,a)\) values until we get convergence?

  * Let's clarify this even further:
    * We call \(Q^t_{i,j}\) the value of the Q target matrix element at **state i** and **action j**.
    * We call \(R_{i,j,k}\) the estimated reward we get when transitioning to state k from the \((S_i,a_j)\) pair.
    * We call \(Q^s_k\) the intrinsic value of state k.
    * We call \(p_{i,j,k}\) the probability of the transition from the pair \((S_i,a_j)\) to state k.

  * With those definitions, our formula above becomes:

\[Q^t_{i,j} = \sum_k p_{i,j,k} \left( R_{i,j,k} + \gamma Q^s_k \right)\]

  * Then we should separate out the "recursive part", i.e. from state i we might get back to state i directly, so we have:

\[Q^t_{i,j} = \sum_{k \neq i} p_{i,j,k} \left( R_{i,j,k} + \gamma Q^s_k \right) + p_{i,j,i} \left( R_{i,j,i} + \gamma Q^s_i \right)\]

  * Now, by definition we have: \(Q^s_i = \max_{a_j}(Q^t_{i,j})\)

  * So the idea is to update all the \(Q^t_{i,j}\) and \(Q^s_i\) values **recursively** (a small standalone sketch of this iteration is given below).
/*
  * So, this max value has to verify:

\[ Q^s_i = \underset{j}{max}(Q^t_{i,j}) = \underset{j}{max}\left(\sum_{k \neq i} p_{i,j,k} \left( R_{i,j,k} + \gamma Q^s_k \right) + p_{i,j,i} \left( R_{i,j,i} + \gamma Q^s_i \right)\right)\]
*/

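  * To make the intent concrete, here is a small standalone sketch (hypothetical, not the exact code used for the experiments) of that coupled iteration on the empirical model, assuming we store per-triplet transition counts and cumulative rewards as described above, and stopping when the state values \(Q^s\) stabilize: <sxh python>import numpy as np

def iterate_q_target(counts, rewardSums, gamma=0.99, tol=1e-6, maxIters=1000):
    # counts[i, j, k]     : number of observed transitions (S_i, a_j) -> S_k
    # rewardSums[i, j, k] : total reward collected on those transitions
    eps = 1e-10
    p = counts / (np.sum(counts, axis=2, keepdims=True) + eps)   # p_{i,j,k}
    R = rewardSums / (counts + eps)                              # R_{i,j,k}
    nstates = counts.shape[0]
    Qs = np.zeros(nstates)                                       # Q^s_i
    Qt = np.zeros(counts.shape[:2])                              # Q^t_{i,j}
    for _ in range(maxIters):
        # Q^t_{i,j} = sum_k p_{i,j,k} * (R_{i,j,k} + gamma * Q^s_k)
        Qt = np.sum(p * (R + gamma * Qs[None, None, :]), axis=2)
        newQs = np.max(Qt, axis=1)                               # Q^s_i = max_j Q^t_{i,j}
        if np.max(np.abs(newQs - Qs)) < tol:
            break
        Qs = newQs
    return Qt, Qs</sxh>
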
  * But unfortunately, it seems that the update of the target Q matrix is **not converging**. I also tried some kind of "probability amplification", in the hope that this could help the convergence: <sxh python>        targetQ = np.exp(targetQ)
        targetQ /= np.sum(targetQ, axis=1, keepdims=True)  # the sum cannot be zero.</sxh>

  * And indeed, with this change it seems we are converging, but the results we can achieve this way are **not good**. So this path is not really worth it.

  * => Anyway, never mind :-)! Time to move to a new experiment now ;-)!