====== Simple QTable learning ======

{{tag>deep_learning}}

Today, I feel like trying an implementation of "Q table learning". Of course, the idea is to go much further than this, but we have to start with the basics, right? So let's begin.

====== ======

===== References =====

As usual, here are the main references I'm using for this experiment:

  * [[https://simoninithomas.github.io/Deep_reinforcement_learning_Course|Deep Reinforcement Learning Course]]: this guy is French... you just can't miss it (sorry man, but your accent is really thick :-)). Still, this seems to be a very interesting tutorial series, so I really want to go through it.
  * [[https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0|Simple Reinforcement Learning with Tensorflow Part 0]]: another very promising tutorial series on Deep Reinforcement Learning.

===== Prerequisites =====

  * First things first, I need the OpenAI gym python module, so let's install that: <code>nv_py_call_pip install gym</code>
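  * Just to make sure the install went through, a quick import check (a minimal sketch, nothing more; the two printed numbers should be the 16 states and 4 actions of the FrozenLake grid): <sxh python>import gym

# Simply build the environment used below to confirm the module is available:
env = gym.make('FrozenLake-v0')
print(env.observation_space.n, env.action_space.n)  # expected: 16 4</sxh>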

===== FrozenLake application =====

  * So I first tried with this simple implementation: <sxh python>import gym
import numpy as np
import random

from nv.core.utils import *

def train_frozenlake(numEpisodes):
    logDEBUG("Building FrozenLake environment...")
    env = gym.make('FrozenLake-v0')

    # Initialize the Q-table with all zeros:
    nstates = env.observation_space.n
    nactions = env.action_space.n
    Q = np.zeros([nstates, nactions])
    logDEBUG("Qtable shape: %s" % str(Q.shape))

    # Set learning parameters
    lr = .8  # learning rate
    y = .95  # gamma (ie. discount rate)

    # numEpisodes = 2000
    maxSteps = 99

    # Exploration parameters
    epsilon = 1.0                 # Exploration rate
    max_epsilon = 1.0             # Exploration probability at start
    min_epsilon = 0.01            # Minimum exploration probability
    decay_rate = 7.0/1800.0       # Exponential decay rate for exploration prob
    # decay rate detail: after 1800 episodes we reach a prob of exp(-7)~0.0009

    # Array containing the total reward and number of steps per episode:
    rList = np.zeros((numEpisodes, 2))

    for i in range(numEpisodes):
        # Reset environment and get the first observation:
        state = env.reset()

        totalReward = 0
        done = False
        step = 0

        # logDEBUG("Performing episode %d/%d..." % (i, numEpisodes))

        # The Q-Table learning algorithm:
        while step < maxSteps:
            step += 1

            # Check if we should do exploration or exploitation:
            exp_thres = random.uniform(0, 1)

            action = None
            if exp_thres < epsilon:
                # We do exploration:
                action = env.action_space.sample()
            else:
                # We do exploitation, so we use our current Qtable:
                action = np.argmax(Q[state, :])

            # Get new state and reward from the environment:
            newState, reward, done, info = env.step(action)

            # Update the Q-Table with new knowledge using the Bellman formula:
            Q[state, action] = Q[state, action] + lr*(reward + y*np.max(Q[newState, :]) - Q[state, action])

            # Update total reward:
            totalReward += reward

            # Update state:
            state = newState

            # Stop if we are done:
            if done:
                break

        # Then we reduce the exploration rate:
        epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*i)

        # And we assign our data for visualization:
        logDEBUG("%d/%d: Total reward: %f" % (i+1, numEpisodes, totalReward))
        rList[i] = [totalReward, step]

    # Compute the mean reward:
    mvals = np.mean(rList, axis=0)
    logDEBUG("Mean reward: %f" % mvals[0])

    # Return the data array:
    return rList</sxh>
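  * For reference, the update applied at each step in the loop above is the classical **Q-learning** rule (a sampled version of the Bellman equation). In LaTeX notation: <code latex>Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]</code> where alpha is the learning rate (lr = 0.8 above) and gamma is the discount factor (y = 0.95).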
  * And running this in jupyter with: <sxh python>from nv.core.utils import *
from nv.deep_learning.DQN_apps import train_frozenlake
import matplotlib.pyplot as plt
import pandas as pd

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'

dataDir = os.environ['NVSEED_DATA_DIR']+"/"
print("Data dir: ", dataDir)

exp="1-first_trial"
res = train_frozenlake(1000)

print(res)

plt.figure(figsize = (18,9))
plt.plot(range(res.shape[0]),res[:,0],color='b',label='Reward')
plt.plot(range(res.shape[0]),res[:,1],color='orange',label='Num steps')
plt.xlabel('Iteration')
plt.ylabel('Episode data')
plt.legend(fontsize=18)
# plt.grid(range(svals.shape[0]),axis='x', color='r', linestyle='-', linewidth=1)
# for xc in range(0,svals.shape[0]+1, lfreq):
#     plt.axvline(x=xc, color='k', linestyle='--', linewidth=1)
filename = dataDir+"deep_learning/tests/qtable/%s.png" % (exp)
plt.savefig(filename)
plt.show()</sxh>

  * And I got this kind of display:

{{ projects:nervseed:qtable:1-first_trial.png?1000 }}

  * So... the only "total rewards" that we seem to get here are "0" or "1"... How could that be? Well, actually, this is correct according to the documentation available for the "FrozenLake" environment:

<note>The episode ends when you reach the goal or fall in a hole. You receive a reward of 1 if you reach the goal, and zero otherwise.</note>

  * So now we should simply try to train for a longer period to see if we can really improve our results. With 20000 episodes we get this kind of results:

{{ projects:nervseed:qtable:2-long_training-reward.png?1000 }}

{{ projects:nervseed:qtable:2-long_training-numsteps.png?1000 }}
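  * Side note: since the per-episode reward is either 0 or 1, the raw reward curve is very noisy, so a rolling average makes the trend easier to read. A minimal sketch of how it could be smoothed (reusing the res array and the pandas import from the notebook above; the window size of 100 is an arbitrary choice): <sxh python>import pandas as pd
import matplotlib.pyplot as plt

# res is the (numEpisodes, 2) array returned by train_frozenlake():
rewards = pd.Series(res[:, 0])

# Rolling mean of the reward over the last 100 episodes:
smoothed = rewards.rolling(window=100).mean()

plt.figure(figsize=(18, 9))
plt.plot(smoothed, color='b', label='Mean reward (100 episodes)')
plt.xlabel('Episode')
plt.ylabel('Mean reward')
plt.legend(fontsize=18)
plt.show()</sxh>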

  * => We can see that the reward is increasing progressively, but I think we can go much higher... We also see that the number of steps is getting larger and larger, and I think it could be interesting to try to control that: if we add a small negative reward for each step (see the sketch below), then the agent should try to reduce the number of steps, no? ;-)... And **no** lol, it actually doesn't work very well... But I currently have no clear idea why :-(
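  * For reference, this kind of step penalty only requires a small change inside the while loop of train_frozenlake(). A minimal sketch of what the modified lines could look like (the stepPenalty name and the 0.01 value are just illustration choices, not what was actually used for the run above): <sxh python>stepPenalty = 0.01  # hypothetical small cost paid on every step

# ... inside the while loop, after taking the action:
newState, reward, done, info = env.step(action)

# Penalize each step so that shorter episodes become more attractive:
shapedReward = reward - stepPenalty

# And use the shaped reward in the Q-table update instead of the raw one:
Q[state, action] = Q[state, action] + lr*(shapedReward + y*np.max(Q[newState, :]) - Q[state, action])</sxh>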

  * With a more conventional decay setup (ie. a quick decay down to the 0.01 floor, see the sketch after the plots) we get this kind of results:

{{ projects:nervseed:qtable:3-conventional_decay-reward.png?1000 }}

{{ projects:nervseed:qtable:3-conventional_decay-numsteps.png?1000 }}
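  * For clarity, here is the kind of epsilon schedule I mean by "conventional decay" (the decay_rate value below is an illustrative assumption, not the exact value used for the plots above): <sxh python>import numpy as np

# Faster exponential decay towards the 0.01 exploration floor:
max_epsilon = 1.0
min_epsilon = 0.01
decay_rate = 0.005   # assumed value: epsilon gets close to min_epsilon after ~1000 episodes

# Same schedule as in train_frozenlake(), evaluated at the end of episode i:
for i in [0, 100, 500, 1000, 2000]:
    epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*i)
    print(i, epsilon)</sxh>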

  * What I find strange with those results is that we don't seem to eventually reach a **perfect Qtable**; yet, if you think about this kind of problem, it feels like this should be possible... So could it be that I have something going wrong here?

=> The typical Qtable array we get at the end of the training looks something like this: <code>[[ 8.45356320e-02  3.21190753e-02  8.59557936e-02  8.57750436e-02]
 [ 1.38008125e-02  1.58740568e-03  1.44658989e-02  8.69052154e-02]
 [ 1.00856435e-02  2.02433213e-03  8.28338186e-03  7.49612561e-02]
 [ 2.57502785e-03  1.08999445e-02  4.77150803e-07  7.53541474e-02]
 [ 9.62490068e-02  4.54455221e-02  9.53828356e-03  5.80393039e-02]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00]
 [ 1.89685598e-01  1.84186277e-04  2.81778415e-06  2.27085696e-08]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00]
 [ 4.37804809e-02  1.20205573e-01  8.06634119e-03  2.38534679e-01]
 [ 1.65409985e-02  7.52591077e-01  1.53091012e-03  1.23874446e-03]
 [ 2.63804747e-01  3.71528815e-03  3.09251232e-02  5.45746045e-03]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00]
 [ 2.64244622e-01  2.16863746e-03  5.37192468e-01  8.89917016e-02]
 [ 3.64068530e-01  4.66030522e-01  1.19812849e-01  4.60581823e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00]]</code>

  * And this is something I find strange in this case: we should be able to optimize this table much more, and on each row we would expect to see only one main non-zero value (the one corresponding to the best action in that state; see the sketch below for a quick way to read the greedy policy out of the table).
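  * To make that "one main value per row" idea concrete, here is a minimal sketch of how the greedy policy could be read out of the table (Q being the trained 16x4 array printed above): <sxh python>import numpy as np
from nv.core.utils import *

# For each state, report the greedy action and how large its Q-value is:
greedyActions = np.argmax(Q, axis=1)   # best action per state according to the table
greedyValues = np.max(Q, axis=1)       # value of that best action

for state in range(Q.shape[0]):
    logDEBUG("State %2d: greedy action = %d, value = %f" % (state, greedyActions[state], greedyValues[state]))</sxh>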

===== Conclusion =====

  => I'm not really satisfied with the Q Table training using the Bellman formula, and even if that may sound crazy, I have the feeling we could do better. So this is what I will try to clarify in a follow-up post.