Simple QTable learning

Today, I feel like trying an implementation of “Q-table learning”. Of course, the idea is to go much further than this, but we have to start with the basics, right? So let's begin.

References

As usual, here are the main references I'm using for this experiment:

Prerequisites

FrozenLake application

The episode ends when you reach the goal or fall in a hole. You receive a reward of 1 if you reach the goal, and zero otherwise.
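To make the setup concrete, here is a rough sketch of what a tabular Q-learning loop on this kind of lake can look like. It is not the exact code used for this experiment: it assumes the standard 4x4 map, deterministic (non-slippery) moves, and hyperparameters (`alpha`, `gamma`, `epsilon`) picked only for illustration. The `step` and `train` helpers are hypothetical names, written in plain Python so the snippet runs without any extra library.

```python
import random

# Assumed standard 4x4 layout (S=start, F=frozen, H=hole, G=goal), row by row.
LAKE = "SFFFFHFHFFFHHFFG"
N_STATES, N_ACTIONS = 16, 4  # assumed action order: 0=left, 1=down, 2=right, 3=up


def step(state, action):
    """Deterministic move (an assumption); returns (next_state, reward, done)."""
    row, col = divmod(state, 4)
    if action == 0:
        col = max(col - 1, 0)
    elif action == 1:
        row = min(row + 1, 3)
    elif action == 2:
        col = min(col + 1, 3)
    else:
        row = max(row - 1, 0)
    nxt = row * 4 + col
    tile = LAKE[nxt]
    # Reward of 1 only when the goal is reached; holes and goal end the episode.
    return nxt, (1.0 if tile == "G" else 0.0), tile in "HG"


def train(episodes=2000, alpha=0.8, gamma=0.95, epsilon=0.1, seed=0):
    """Epsilon-greedy tabular Q-learning; returns a 16x4 list of lists."""
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap the episode length
            if rng.random() < epsilon:
                action = rng.randrange(N_ACTIONS)
            else:
                # Greedy action, breaking ties at random.
                best = max(q[state])
                action = rng.choice(
                    [a for a in range(N_ACTIONS) if q[state][a] == best]
                )
            nxt, reward, done = step(state, action)
            # Bellman-style update: move Q(s, a) toward r + gamma * max Q(s', .)
            target = reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if done:
                break
    return q


if __name__ == "__main__":
    for row in train():
        print(["%.3f" % v for v in row])
```

Note that the rows for the holes and for the goal never get updated, since the episode stops as soon as we enter one of those states. That is why they stay at zero.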

⇒ The typical Q-table we get at the end of the training looks something like this (one row per state, so 16 rows, and one column per action, so 4 columns):

[[  8.45356320e-02   3.21190753e-02   8.59557936e-02   8.57750436e-02]
 [  1.38008125e-02   1.58740568e-03   1.44658989e-02   8.69052154e-02]
 [  1.00856435e-02   2.02433213e-03   8.28338186e-03   7.49612561e-02]
 [  2.57502785e-03   1.08999445e-02   4.77150803e-07   7.53541474e-02]
 [  9.62490068e-02   4.54455221e-02   9.53828356e-03   5.80393039e-02]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  1.89685598e-01   1.84186277e-04   2.81778415e-06   2.27085696e-08]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  4.37804809e-02   1.20205573e-01   8.06634119e-03   2.38534679e-01]
 [  1.65409985e-02   7.52591077e-01   1.53091012e-03   1.23874446e-03]
 [  2.63804747e-01   3.71528815e-03   3.09251232e-02   5.45746045e-03]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]
 [  2.64244622e-01   2.16863746e-03   5.37192468e-01   8.89917016e-02]
 [  3.64068530e-01   4.66030522e-01   1.19812849e-01   4.60581823e-01]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00]]
And this is something I find strange in this case: it feels like this table could be optimized much further, and on each row I would expect to see only one dominant non-zero value.
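To look at this a bit more closely, here is a small sketch that reads the greedy policy out of the table above and measures how "dominant" the best action is on each row. The values are the ones printed above, rounded to 4 decimals to keep the snippet short. The action order (0=left, 1=down, 2=right, 3=up) is an assumption based on the usual Gym FrozenLake convention, and `greedy_policy`, `render` and `dominance` are hypothetical helper names.

```python
# Q-table from above, rounded to 4 decimals (rows = states, columns = actions).
Q_TABLE = [
    [0.0845, 0.0321, 0.0860, 0.0858],
    [0.0138, 0.0016, 0.0145, 0.0869],
    [0.0101, 0.0020, 0.0083, 0.0750],
    [0.0026, 0.0109, 0.0000, 0.0754],
    [0.0962, 0.0454, 0.0095, 0.0580],
    [0.0, 0.0, 0.0, 0.0],
    [0.1897, 0.0002, 0.0000, 0.0000],
    [0.0, 0.0, 0.0, 0.0],
    [0.0438, 0.1202, 0.0081, 0.2385],
    [0.0165, 0.7526, 0.0015, 0.0012],
    [0.2638, 0.0037, 0.0309, 0.0055],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.2642, 0.0022, 0.5372, 0.0890],
    [0.3641, 0.4660, 0.1198, 0.4606],
    [0.0, 0.0, 0.0, 0.0],
]

ARROWS = "<v>^"  # assumed action order: 0=left, 1=down, 2=right, 3=up


def greedy_policy(q):
    """Index of the best action per state, or None for all-zero rows."""
    return [row.index(max(row)) if max(row) > 0.0 else None for row in q]


def render(policy, width=4):
    """Draw the policy as a grid of arrows ('.' marks an all-zero row)."""
    cells = ["." if a is None else ARROWS[a] for a in policy]
    return "\n".join(
        "".join(cells[i:i + width]) for i in range(0, len(cells), width)
    )


def dominance(row):
    """Share of the row total held by the best action (0.0 for empty rows)."""
    total = sum(row)
    return max(row) / total if total > 0.0 else 0.0


if __name__ == "__main__":
    print(render(greedy_policy(Q_TABLE)))
    for state, row in enumerate(Q_TABLE):
        print(state, round(dominance(row), 2))
```

With this table, the grid comes out as `>^^^`, `<.<.`, `^v<.`, `.>v.`. On some rows the best action clearly stands out (state 9, for instance), while on others (state 0 or state 14) several actions end up with almost the same value, which is exactly the part that bothers me.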

Conclusion

=> I'm not really satisfied with the Q-table training using the Bellman formula, and even if that may sound crazy, I have the feeling we could do better. So this is what I will try to clarify in a following post.