====== Simple QTable learning ======

{{tag>deep_learning}}

Today, I feel like trying an implementation of "Q-table learning". Of course, the idea is to go much further than this eventually, but we have to start with the basics, right? So let's begin.

===== References =====

As usual, here are the main references I'm using for this experiment:

  * [[https://simoninithomas.github.io/Deep_reinforcement_learning_Course|Deep Reinforcement Learning Course]]: this guy is French... you just can't miss it (sorry man, but your accent is really hard :-)). Yet this seems to be a very interesting tutorial series, so I really want to try to go through it.
  * [[https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0|Simple Reinforcement Learning with Tensorflow Part 0]]: another very promising tutorial series on Deep Reinforcement Learning.

===== Prerequisites =====

  * First things first, I need the OpenAI gym python module, so let's install that:

<code>
nv_py_call_pip install gym
</code>

===== FrozenLake application =====

  * So I first tried with this simple implementation:

<code python>
import gym
import numpy as np
import random

from nv.core.utils import *


def train_frozenlake(numEpisodes):
    logDEBUG("Building FrozenLake environment...")
    env = gym.make('FrozenLake-v0')

    # Initialize the Q-table with all zeros:
    nstates = env.observation_space.n
    nactions = env.action_space.n
    Q = np.zeros([nstates, nactions])
    logDEBUG("Qtable shape: %s" % str(Q.shape))

    # Set learning parameters:
    lr = .8   # learning rate
    y = .95   # gamma (ie. discount rate)
    # numEpisodes = 2000
    maxSteps = 99

    # Exploration parameters:
    epsilon = 1.0            # Exploration rate
    max_epsilon = 1.0        # Exploration probability at start
    min_epsilon = 0.01       # Minimum exploration probability
    decay_rate = 7.0/1800.0  # Exponential decay rate for exploration prob
    # decay rate detail: after 1800 episodes we reach a prob of exp(-7)~0.0009

    # Array containing the total reward and number of steps per episode:
    rList = np.zeros((numEpisodes, 2))

    for i in range(numEpisodes):
        # Reset environment and get first new observation:
        state = env.reset()
        totalReward = 0
        done = False
        step = 0
        # logDEBUG("Performing episode %d/%d..." % (i, numEpisodes))

        # The Q-Table learning algorithm (epsilon-greedy action selection
        # followed by the Bellman update):
        while step < maxSteps:
            step += 1

            # Explore with probability epsilon, otherwise exploit the current Q-table:
            if random.uniform(0, 1) > epsilon:
                action = np.argmax(Q[state, :])
            else:
                action = env.action_space.sample()

            # Take the action and observe the new state and reward:
            newState, reward, done, _ = env.step(action)

            # Bellman update of the Q-table:
            Q[state, action] += lr * (reward + y * np.max(Q[newState, :]) - Q[state, action])

            totalReward += reward
            state = newState

            if done:
                break

        # Exponentially decay the exploration rate:
        epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * i)

        # Store the total reward and number of steps for this episode:
        rList[i] = [totalReward, step]

    return rList
</code>

  * And running this in jupyter with:

<code python>
from nv.core.utils import *
from nv.deep_learning.DQN_apps import train_frozenlake

import matplotlib.pyplot as plt
import pandas as pd
import os

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'

dataDir = os.environ['NVSEED_DATA_DIR']+"/"
print("Data dir: ", dataDir)

exp = "1-first_trial"

res = train_frozenlake(1000)
print(res)

plt.figure(figsize=(18, 9))
plt.plot(range(res.shape[0]), res[:, 0], color='b', label='Reward')
plt.plot(range(res.shape[0]), res[:, 1], color='orange', label='Num steps')
plt.xlabel('Iteration')
plt.ylabel('Episode data')
plt.legend(fontsize=18)

# plt.grid(range(svals.shape[0]), axis='x', color='r', linestyle='-', linewidth=1)
# for xc in range(0, svals.shape[0]+1, lfreq):
#     plt.axvline(x=xc, color='k', linestyle='--', linewidth=1)

filename = dataDir+"deep_learning/tests/qtable/%s.png" % (exp)
plt.savefig(filename)
plt.show()
</code>

  * And I got this kind of display:

{{ projects:nervseed:qtable:1-first_trial.png?1000 }}

  * So... the only "total rewards" that we seem to get here are "0" or "1"... How could that be??? Arrf, well, actually, this is correct according to the documentation available on the "FrozenLake" environment:

> The episode ends when you reach the goal or fall in a hole. You receive a reward of 1 if you reach the goal, and zero otherwise.

  * So now we should just try to train for a longer period to see if we can really improve our results. With 20000 episodes we get this kind of results:

{{ projects:nervseed:qtable:2-long_training-reward.png?1000 }}

{{ projects:nervseed:qtable:2-long_training-numsteps.png?1000 }}

  * => We can see that the reward is increasing progressively, but I think we can go much higher... We also see that the number of steps is getting larger and larger, and I think it could be interesting to try to control that: if we add a small negative reward for each step, then the agent should try to reduce the number of steps, no? ;-) (see the sketch just below for what I mean). And **no** lol, it actually doesn't work very well... But I currently have no clear idea why :-(
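  * To make that per-step penalty idea a bit more concrete, here is a minimal sketch of what such a variant could look like. This is **not** the exact code I tested: the name ''train_frozenlake_penalty'' and the ''stepPenalty=0.01'' value are just arbitrary examples, and the rest of the loop simply reuses the same training logic as above:

<code python>
import gym
import numpy as np
import random

def train_frozenlake_penalty(numEpisodes, stepPenalty=0.01):
    """Sketch of a train_frozenlake() variant with a small negative reward per step."""
    env = gym.make('FrozenLake-v0')
    Q = np.zeros([env.observation_space.n, env.action_space.n])

    # Same hyperparameters as in the function above:
    lr, y, maxSteps = 0.8, 0.95, 99
    epsilon, max_epsilon, min_epsilon, decay_rate = 1.0, 1.0, 0.01, 7.0/1800.0

    rList = np.zeros((numEpisodes, 2))

    for i in range(numEpisodes):
        state = env.reset()
        totalReward, done, step = 0.0, False, 0

        while step < maxSteps:
            step += 1

            # Epsilon-greedy action selection:
            if random.uniform(0, 1) > epsilon:
                action = np.argmax(Q[state, :])
            else:
                action = env.action_space.sample()

            newState, reward, done, _ = env.step(action)

            # The only real change: subtract a small constant penalty on every
            # step, so that longer episodes accumulate a lower total return.
            shapedReward = reward - stepPenalty

            # Bellman update, using the shaped reward:
            Q[state, action] += lr * (shapedReward + y * np.max(Q[newState, :]) - Q[state, action])

            totalReward += reward  # still report the raw environment reward
            state = newState
            if done:
                break

        epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * i)
        rList[i] = [totalReward, step]

    return rList
</code>

  * The intuition is that longer episodes now accumulate a more negative total return, so the greedy policy should eventually favour shorter paths to the goal; in practice, as noted above, this did not behave as nicely as I expected.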
  * With a more conventional decay setup (ie. a quick decay to the 0.01 value) we get this kind of results:

{{ projects:nervseed:qtable:3-conventional_decay-reward.png?1000 }}

{{ projects:nervseed:qtable:3-conventional_decay-numsteps.png?1000 }}

  * What I find strange with those results is that we don't seem to eventually reach a **perfect Qtable**; yet, if you think about this kind of problem, it feels like this should be possible... So could it be that I have something going wrong here? => The typical Qtable array we get at the end of the training looks something like this:

<code>
[[ 8.45356320e-02  3.21190753e-02  8.59557936e-02  8.57750436e-02]
 [ 1.38008125e-02  1.58740568e-03  1.44658989e-02  8.69052154e-02]
 [ 1.00856435e-02  2.02433213e-03  8.28338186e-03  7.49612561e-02]
 [ 2.57502785e-03  1.08999445e-02  4.77150803e-07  7.53541474e-02]
 [ 9.62490068e-02  4.54455221e-02  9.53828356e-03  5.80393039e-02]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00]
 [ 1.89685598e-01  1.84186277e-04  2.81778415e-06  2.27085696e-08]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00]
 [ 4.37804809e-02  1.20205573e-01  8.06634119e-03  2.38534679e-01]
 [ 1.65409985e-02  7.52591077e-01  1.53091012e-03  1.23874446e-03]
 [ 2.63804747e-01  3.71528815e-03  3.09251232e-02  5.45746045e-03]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00]
 [ 2.64244622e-01  2.16863746e-03  5.37192468e-01  8.89917016e-02]
 [ 3.64068530e-01  4.66030522e-01  1.19812849e-01  4.60581823e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00]]
</code>

  * And this is something I find strange in this case, because we could optimize this table much more: on each line we should eventually see only one main non-zero value (see the small greedy-policy check sketched at the end of this post).

===== Conclusion =====

=> I'm not really satisfied with the Q-table training using the Bellman formula, and even if that may sound crazy, I have the feeling we could do better. So this is what I will try to clarify in a following post.
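  * Side note on the Qtable question above: one simple way to check how good a given table really is, independently of its "shape", would be to extract the greedy policy (an argmax on each row) and measure its success rate over a batch of episodes. This is just a quick sketch; the ''evaluate_greedy_policy'' helper below is not part of the code used in this post:

<code python>
import gym
import numpy as np

def evaluate_greedy_policy(Q, numEpisodes=1000, maxSteps=99):
    """Play the greedy (argmax) policy derived from a Q-table and return its success rate."""
    env = gym.make('FrozenLake-v0')

    # The greedy policy: for each state, simply pick the action with the highest Q value.
    policy = np.argmax(Q, axis=1)

    successes = 0.0
    for _ in range(numEpisodes):
        state = env.reset()
        for _ in range(maxSteps):
            state, reward, done, _ = env.step(policy[state])
            if done:
                successes += reward  # reward is 1 only when the goal is reached
                break

    return successes / float(numEpisodes)
</code>

  * Since only the argmax of each row matters for the final behaviour, a table that doesn't look "perfect" (with a single dominant value per line) can still reach the goal fairly often; measuring the success rate this way would tell us how far we actually are from an optimal policy.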