====== Simple QTable learning ======

{{tag>deep_learning}}

Today, I feel like trying an implementation of "Q table learning". Of course, the idea is to go much further than this, but we have to start with the basics, right? So let's begin.

====== ======

===== References =====

As usual, here are the main references I'm using for this experiment:

  * [[https://simoninithomas.github.io/Deep_reinforcement_learning_Course|Deep Reinforcement Learning Course]]: this guy is French... you just can't miss it (sorry man, but your accent is really hard to follow :-)). Yet this seems to be a very interesting tutorial series, so I really want to go through it.
  * [[https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0|Simple Reinforcement Learning with Tensorflow Part 0]]: another very promising tutorial series on Deep Reinforcement Learning.

===== Prerequisites =====

  * First things first, I need the OpenAI gym python module, so let's install that: <code>nv_py_call_pip install gym</code>
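  * Just as a quick sanity check (this little snippet is my own addition, not part of the implementation below), we can verify that the module loads and that the FrozenLake environment has the expected 16 states and 4 actions: <sxh python>import gym

# Build the FrozenLake environment and check the state/action space sizes:
env = gym.make('FrozenLake-v0')
print("Number of states: %d" % env.observation_space.n)   # 16 states (4x4 grid)
print("Number of actions: %d" % env.action_space.n)       # 4 actions (left, down, right, up)

# And perform a single random step to confirm that everything works:
state = env.reset()
newState, reward, done, info = env.step(env.action_space.sample())
print("First transition: state=%d, reward=%.1f, done=%s" % (newState, reward, done))</sxh>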

===== FrozenLake application =====

  * So I first tried with this simple implementation: <sxh python>import gym
import numpy as np
import random

from nv.core.utils import *

def train_frozenlake(numEpisodes):
    logDEBUG("Building FrozenLake environment...")
    env = gym.make('FrozenLake-v0')

    # Initialize table with all zeros
    nstates = env.observation_space.n
    nactions = env.action_space.n
    Q = np.zeros([nstates,nactions])
    logDEBUG("Qtable shape: %s" % str(Q.shape))

    # Set learning parameters
    lr = .8  # learning rate
    y = .95  # gamma (ie. discount rate)

    # numEpisodes = 2000
    maxSteps = 99

    # Exploration parameters
    epsilon = 1.0                 # Exploration rate
    max_epsilon = 1.0             # Exploration probability at start
    min_epsilon = 0.01            # Minimum exploration probability
    decay_rate = 7.0/1800.0       # Exponential decay rate for exploration prob
    # decay rate detail: after 1800 episodes we reach a prob of exp(-7)~0.0009

    # Array containing the total reward and number of steps per episode:
    rList = np.zeros((numEpisodes, 2))

    for i in range(numEpisodes):
        # Reset environment and get first new observation
        state = env.reset()

        totalReward = 0
        done = False
        step = 0

        # logDEBUG("Performing episode %d/%d..." % (i, numEpisodes))

        # The Q-Table learning algorithm
        while step < maxSteps:
            step += 1

            # Check if we should do exploration or exploitation:
            exp_thres = random.uniform(0, 1)

            action = None
            if exp_thres < epsilon:
                # We do exploration:
                action = env.action_space.sample()
            else:
                # We do exploitation, so we use our current Qtable:
                action = np.argmax(Q[state,:])

            # Get new state and reward from environment
            newState,reward,done,info = env.step(action)

            # Update Q-Table with new knowledge using the Bellman formula:
            Q[state,action] = Q[state,action] + lr*(reward + y*np.max(Q[newState,:]) - Q[state,action])

            # Update total reward:
            totalReward += reward

            # Update state:
            state = newState

            # Stop if we are done:
            if done:
                break

        # Then we reduce the exploration rate:
        epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*i)

        # And we assign our data for visualization:
        logDEBUG("%d/%d: Total reward: %f" % (i+1, numEpisodes, totalReward))
        rList[i] = [totalReward, step]

    # Compute the mean reward:
    mvals = np.mean(rList, axis=0)
    logDEBUG("Mean reward: %f" % mvals[0])

    # Return the data array:
    return rList</sxh>
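  * As a side note on the **decay_rate** value used above: a quick check of the resulting epsilon schedule (just my own verification snippet) gives something like this: <sxh python>import numpy as np

# With decay_rate = 7.0/1800.0, the exploration term is down to exp(-7) ~ 0.0009
# after 1800 episodes, so epsilon is then basically stuck at its minimum value:
min_epsilon, max_epsilon, decay_rate = 0.01, 1.0, 7.0/1800.0

for i in [0, 500, 1000, 1800]:
    eps = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*i)
    print("Episode %d: epsilon = %.4f" % (i, eps))</sxh>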

  * And running this in Jupyter with: <sxh python>from nv.core.utils import *
from nv.deep_learning.DQN_apps import train_frozenlake
import matplotlib.pyplot as plt
import pandas as pd

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'

dataDir = os.environ['NVSEED_DATA_DIR']+"/"
print("Data dir: ", dataDir)

exp="1-first_trial"
res = train_frozenlake(1000)

print(res)

plt.figure(figsize = (18,9))
plt.plot(range(res.shape[0]),res[:,0],color='b',label='Reward')
plt.plot(range(res.shape[0]),res[:,1],color='orange',label='Num steps')
plt.xlabel('Iteration')
plt.ylabel('Episode data')
plt.legend(fontsize=18)
# plt.grid(range(svals.shape[0]),axis='x', color='r', linestyle='-', linewidth=1)
# for xc in range(0,svals.shape[0]+1, lfreq):
#     plt.axvline(x=xc, color='k', linestyle='--', linewidth=1)
filename = dataDir+"deep_learning/tests/qtable/%s.png" % (exp)
plt.savefig(filename)
plt.show()</sxh>

  * And I got this kind of display:

{{ projects:nervseed:qtable:1-first_trial.png?1000 }}

  * So... the only "total rewards" that we seem to get here are "0" or "1"... How could that be? Ah well, actually, this is correct according to the documentation available for the "FrozenLake" environment:

<note>The episode ends when you reach the goal or fall in a hole. You receive a reward of 1 if you reach the goal, and zero otherwise.</note>
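  * By the way, since the reward is either 0 or 1, the mean reward is really a success rate, so a convenient way to visualize the training progress is to plot a moving average of those rewards. A possible sketch (using the **res** array from the Jupyter snippet above; the 100-episode window is an arbitrary choice of mine): <sxh python>import pandas as pd
import matplotlib.pyplot as plt

# res[:,0] contains the 0/1 reward of each episode, so a rolling mean over
# e.g. 100 episodes gives us the success rate over time:
successRate = pd.Series(res[:,0]).rolling(100).mean()

plt.figure(figsize=(18,9))
plt.plot(successRate, color='g', label='Success rate (100-episode rolling mean)')
plt.xlabel('Episode')
plt.ylabel('Success rate')
plt.legend(fontsize=18)
plt.show()</sxh>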

  * So now we should just try to train for a longer period to see if we can really improve our results. With 20000 episodes we get this kind of result:

{{ projects:nervseed:qtable:2-long_training-reward.png?1000 }}

{{ projects:nervseed:qtable:2-long_training-numsteps.png?1000 }}

  * => We can see that the reward is increasing progressively, but I think we can go much higher. We also see that the number of steps is getting larger and larger, and I think it could be interesting to try to control that: if we add a small negative reward for each step, then the agent should try to reduce this number of steps, no? ;-) (something like the sketch shown just below)... And **no** lol, it actually doesn't work very well... But I currently have no clear idea why :-(
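  * To be clear about what I tried here, this is roughly what this "step penalty" variant looks like (just a sketch of the idea: the function name and the 0.01 penalty value are arbitrary choices of mine, not necessarily the best setup): <sxh python>import gym
import numpy as np
import random

def train_frozenlake_step_penalty(numEpisodes, stepPenalty=0.01):
    env = gym.make('FrozenLake-v0')
    Q = np.zeros([env.observation_space.n, env.action_space.n])
    lr, y = 0.8, 0.95
    epsilon, max_epsilon, min_epsilon, decay_rate = 1.0, 1.0, 0.01, 7.0/1800.0
    rList = np.zeros((numEpisodes, 2))

    for i in range(numEpisodes):
        state = env.reset()
        totalReward, step = 0, 0

        while step < 99:
            step += 1

            # Same epsilon-greedy action selection as before:
            if random.uniform(0, 1) < epsilon:
                action = env.action_space.sample()
            else:
                action = np.argmax(Q[state,:])

            newState, reward, done, info = env.step(action)

            # The only change: a small negative reward on every step, so that
            # longer episodes should get discouraged:
            shapedReward = reward - stepPenalty

            Q[state,action] += lr*(shapedReward + y*np.max(Q[newState,:]) - Q[state,action])

            totalReward += reward  # we still report the raw environment reward
            state = newState
            if done:
                break

        epsilon = min_epsilon + (max_epsilon - min_epsilon)*np.exp(-decay_rate*i)
        rList[i] = [totalReward, step]

    return rList</sxh>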

  * With a more conventional decay setup (ie. a quick decay towards the 0.01 value; see the indicative parameters after the plots below) we get this kind of result:

{{ projects:nervseed:qtable:3-conventional_decay-reward.png?1000 }}

{{ projects:nervseed:qtable:3-conventional_decay-numsteps.png?1000 }}
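  * Just to be a bit more concrete about what I mean by a "conventional" decay setup above (indicative values only, not necessarily the exact ones I used): <sxh python># Quick exponential decay of the exploration rate towards its minimum value:
epsilon = 1.0          # Initial exploration rate
max_epsilon = 1.0
min_epsilon = 0.01
decay_rate = 0.005     # epsilon gets close to min_epsilon within roughly 1000-1500 episodes</sxh>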
  * What I find strange with those results is that we don't seem to eventually reach a **perfect Qtable**, yet, if you think about this kind of problem, it feels like this should be possible... So could it be that I have something going wrong here? (one thing worth keeping in mind is that the default "FrozenLake-v0" environment is "slippery", ie. the transitions are stochastic, which might explain part of this)

=> The typical Qtable array we get at the end of the training is something like that: <code>[[  8.45356320e-02   3.21190753e-02   8.59557936e-02   8.57750436e-02]
 [  1.38008125e-02   1.58740568e-03   1.44658989e-02   8.69052154e-02]
 [  1.00856435e-02   2.02433213e-03   8.28338186e-03   7.49612561e-02]
 [  2.57502785e-03   1.08999445e-02   4.77150803e-07   7.53541474e-02]
 [  9.62490068e-02   4.54455221e-02   9.53828356e-03   5.80393039e-02]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  1.89685598e-01   1.84186277e-04   2.81778415e-06   2.27085696e-08]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  4.37804809e-02   1.20205573e-01   8.06634119e-03   2.38534679e-01]
 [  1.65409985e-02   7.52591077e-01   1.53091012e-03   1.23874446e-03]
 [  2.63804747e-01   3.71528815e-03   3.09251232e-02   5.45746045e-03]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  2.64244622e-01   2.16863746e-03   5.37192468e-01   8.89917016e-02]
 [  3.64068530e-01   4.66030522e-01   1.19812849e-01   4.60581823e-01]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]]</code>

  * And this is something I find strange in this case, because we should be able to optimize this table much more: on each line we should only see one main non-zero value (see the quick check below).
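  * As a quick check of that intuition, we could extract the greedy policy from the Qtable (ie. the argmax of each row) and evaluate it separately. A minimal sketch (assuming we have access to the final **Q** table, which means returning it from **train_frozenlake** as well, or running this at the end of the training): <sxh python>import gym
import numpy as np

def evaluate_greedy_policy(Q, numEpisodes=1000, maxSteps=99):
    """Run the greedy policy derived from Q, without any exploration."""
    env = gym.make('FrozenLake-v0')
    successes = 0.0

    for _ in range(numEpisodes):
        state = env.reset()
        for step in range(maxSteps):
            # Always take the action with the highest Q value in this state:
            action = np.argmax(Q[state,:])
            state, reward, done, info = env.step(action)
            if done:
                successes += reward  # reward is 1 only if we reached the goal
                break

    return successes / numEpisodes

# Hypothetical usage, assuming Q is the final table from the training above:
# print("Greedy policy success rate: %.3f" % evaluate_greedy_policy(Q))</sxh>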

===== Conclusion =====

  => I'm not really satisfied with this Q Table training using the Bellman formula, and even if that may sound crazy, I have the feeling we could do better. So this is what I will try in a following post, to clarify the situation.