# QNetwork learning

Continuing on my current “Reinforcement Learning” path we are now going to try the Q network implementation that we will train on the Frozenlake environment again.

## Reference implementation

• The reference implementation I will be using is as follow:
import gym
import numpy as np
import random
import tensorflow as tf
from nv.core.utils import *

def train_qn_env(numEpisodes=2000, ename="FrozenLake-v0"):
logDEBUG("Building FrozenLake environment...")
env = gym.make(ename)

# Implementation of the network:
tf.reset_default_graph()

# These lines establish the feed-forward part of the network used to choose actions
inputs1 = tf.placeholder(shape=[1,16],dtype=tf.float32)
W = tf.Variable(tf.random_uniform([16,4],0,0.01))
Qout = tf.matmul(inputs1,W)
predict = tf.argmax(Qout,1)

# Below we obtain the loss by taking the sum of squares difference between the target and prediction Q values.
nextQ = tf.placeholder(shape=[1,4],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(nextQ - Qout))
updateModel = trainer.minimize(loss)

# Actual training of the model:
logDEBUG("Initializing training...")
init=tf.global_variables_initializer()

# Set learning parameters
y = .99
e = 0.1

# Create lists to contain total rewards and steps per episode
jList = []
rList = []

with tf.Session() as sess:
sess.run(init)

for i in range(numEpisodes):
# Reset environment and get first new observation
s = env.reset()
rAll = 0
d = False
j = 0

# The Q-Network
while j < 99:
j+=1

# Choose an action by greedily (with e chance of random action) from the Q-network
a,allQ = sess.run([predict,Qout],feed_dict={inputs1:np.identity(16)[s:s+1]})
if np.random.rand(1) < e:
a[0] = env.action_space.sample()

# Get new state and reward from environment
s1,r,d,_ = env.step(a[0])

# Obtain the Q' values by feeding the new state through our network
Q1 = sess.run(Qout,feed_dict={inputs1:np.identity(16)[s1:s1+1]})
# Obtain maxQ' and set our target value for chosen action.
maxQ1 = np.max(Q1)

targetQ = allQ
targetQ[0,a[0]] = r + y*maxQ1

# Train our network using target and predicted Q values
_,W1 = sess.run([updateModel,W],feed_dict={inputs1:np.identity(16)[s:s+1],nextQ:targetQ})
rAll += r
s = s1

if d == True:
# Reduce chance of random action as we train the model.
e = 1./((i/50) + 10)
break

jList.append(j)
rList.append(rAll)

logDEBUG("Episode %d/%d: reward: %f" % (i+1, numEpisodes, rAll))

logDEBUG("Percent of succesful episodes: " + str(sum(rList)/numEpisodes) + "%")
• So our reference results are as follow:

## Initial investigations

• First thing we can notice on the previous implementation is that it is slow, much slower than the initial QTable version we played with… But this is not really surprising since we are “feeding samples” one by one to our network…
• Let's start with a basic timing of the training session: the reference implementation above took about 81.27secs to train on 2000 episodes.

## Cleaning session run calls

• Currently, we start with a given state $S_t$, and we feed that into out network to compute the actions value for that state: $Q_{out} = S_t * W$. And then we continue performing each time a single change/backpropagation of loss in our network. And we are processing many more session executions that seems necessary in the process: Let's start with some cleaning already!
• So I just refactored a little the implementation to remove non-needed calls to session.run() as follow:
def train_qn_env_v2(numEpisodes=2000, ename="FrozenLake-v0"):
logDEBUG("Building FrozenLake environment...")
env = gym.make(ename)

# Implementation of the network:
tf.reset_default_graph()

# These lines establish the feed-forward part of the network used to choose actions
# inputs = tf.placeholder(shape=[1,16],dtype=tf.float32)
inputs = tf.placeholder_with_default(np.identity(16, dtype=np.float32), (16,16), "states")

W = tf.Variable(tf.random_uniform([16,4],0,0.01))
Qout = tf.matmul(inputs,W)

# Below we obtain the loss by taking the sum of squares difference between the target and prediction Q values.
nextQ = tf.placeholder(shape=[16,4],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(nextQ - Qout))
updateModel = trainer.minimize(loss)

# Actual training of the model:
logDEBUG("Initializing training...")
init=tf.global_variables_initializer()

# Set learning parameters
y = .99
e = 0.1

# Array containing the total reward and number of steps per episode:
rList = np.zeros((numEpisodes, 2))

with tf.Session() as sess:
sess.run(init)

for i in range(numEpisodes):
# Reset environment and get first new observation
s = env.reset()
rAll = 0
d = False
j = 0

action = None

# The Q-Network
while j < 99:
j+=1

# Choose an action by greedily (with e chance of random action) from the Q-network
allQ = sess.run(Qout,feed_dict={})

if np.random.rand(1) < e:
action = env.action_space.sample()
else:
action = np.argmax(allQ[s,:])

# Get new state and reward from environment
s1,r,d,_ = env.step(action)

# Now we already have the Q' values since we evaluated all states above:
# Obtain maxQ' and set our target value for chosen action.
maxQ1 = np.max(allQ[s1,:])

targetQ = allQ
targetQ[s,action] = r + y*maxQ1

# Train our network using target and predicted Q values
sess.run(updateModel,feed_dict={nextQ:targetQ})
rAll += r
s = s1

if d == True:
# Reduce chance of random action as we train the model.
e = 1./((i/50) + 10)
break

# And we assign our data for visualization:
rList[i] = [rAll, j]

if (i+1)%100==0:
mvals = np.mean(rList[i-99:i+1], axis=0)
logDEBUG("%d/%d: Mean reward: %f" % (i+1, numEpisodes, mvals[0]))

return rList
• ⇒ And with that we get approximately a x2 speedup already. Let's continue now and try to figure out how to train multiple states at the same time…

## Multi-inputs training

• What if we could train on multiple inputs at the same time ?
• Back to the transition matrix concept I introduced in my previous post on Q tables: I would like to try to introduce the same idea here: basically at each step, we use the knowledge we have to update the $Q_{next}$ matrix to contain the value: $r+ \gamma Q^n_{max}(S_{t+1}, a^n)$ , but only for the specific current state. Yet if we store the previous information we collected on transition probabilities, and reward probabilities, we could update all the values in this $Q_{next}$ matrix at the same time!
• So, the third implementation I tried was the following:
def train_qn_env_v3(numEpisodes=2000, ename="FrozenLake-v0"):
logDEBUG("Building FrozenLake environment...")
env = gym.make(ename)

nstates = env.observation_space.n
nactions = env.action_space.n

# Implementation of the network:
tf.reset_default_graph()

# These lines establish the feed-forward part of the network used to choose actions
# inputs = tf.placeholder(shape=[1,16],dtype=tf.float32)
inputs = tf.placeholder_with_default(np.identity(16, dtype=np.float32), (16,16), "states")

W = tf.Variable(tf.random_uniform([16,4],0,0.01))
Qout = tf.matmul(inputs,W)

# Below we obtain the loss by taking the sum of squares difference between the target and prediction Q values.
nextQ = tf.placeholder(shape=[16,4],dtype=tf.float32)
loss = tf.reduce_sum(tf.square(nextQ - Qout))
updateModel = trainer.minimize(loss)

# Actual training of the model:
logDEBUG("Initializing training...")
init=tf.global_variables_initializer()

# Set learning parameters
y = .99
e = 0.1

# Array containing the total reward and number of steps per episode:
rList = np.zeros((numEpisodes, 2))

# Here we will also use a state transition tensor:
tProbs = np.zeros((nstates, nactions, nstates))

# And a reward probability matrix:
# Note: the first column of this matrix tells us how many
# observations we have, and the second column tells us what
# is the current total reward we got so far with this path.
rProbs = np.zeros((nstates, nactions, 2))

# Target Q matrix:
targetQ = np.zeros((nstates, nactions))

# epsilon value:
eps=1e-10

with tf.Session() as sess:
sess.run(init)

for i in range(numEpisodes):
# Reset environment and get first new observation
s = env.reset()
rAll = 0
d = False
j = 0

action = None

# The Q-Network
while j < 99:
j+=1

# Choose an action by greedily (with e chance of random action) from the Q-network
allQ = sess.run(Qout,feed_dict={})

if np.random.rand(1) < e:
action = env.action_space.sample()
else:
action = np.argmax(allQ[s,:])

# Get new state and reward from environment
s1,r,d,_ = env.step(action)

# Fill our transitions and reward matrices:
tProbs[s, action, s1] += 1.0
rProbs[s, action, :] += [1.0, r]

# Now we update all the nextQ values together:
# For each (S,a) pair in the targetQ matrix, we need to set the value
# to: mean_reward + y * expected_next_state value:
targetQ = rProbs[:,:,1]/(rProbs[:,:,0]+eps)

# Compute the transition weights from each state:
weights = tProbs/(np.sum(tProbs, axis=2, keepdims=True)+eps)

# So for each (S, a) pair, we have the nstates weights for each possible following states
# in the weight matrix above.
# we need to dot product this with the max Q'(S',a') value we currently have:
maxQs = np.max(allQ, axis=1)

targetQ += y * np.dot(weights, maxQs)

# Train our network using target and predicted Q values
sess.run(updateModel,feed_dict={nextQ:targetQ})
rAll += r
s = s1

if d == True:
# Reduce chance of random action as we train the model.
e = 1./((i/50) + 10)
break

# And we assign our data for visualization:
rList[i] = [rAll, j]

if (i+1)%100==0:
mvals = np.mean(rList[i-99:i+1], axis=0)
logDEBUG("%d/%d: Mean reward: %f" % (i+1, numEpisodes, mvals[0]))

return rList
• And the results seem to be significantly better than with the previous implementations:

## Recursive Q target update

• The next step I would like to take now that we can update all the Q values together in our network would be to try to improve the quality of the Q target matrix: in the code above, we build this target matrix with 1 update cycle, using the values currently available in our Q value matrix… but then if we consider that this is the targetQ we would “like to reach”, we could pretent that this is the one we have as “Q value source”, and thus, recursively update our Q target again, and then do it again, and again… until our Q target matrix will stabilize… or explode
• So I introduced the following function to compute the targetQ matrix recursively (but we need a counter, because it's not really converging):
    def updateQTarget(sourceQ, counter=5, threshold=1e-4):
# Now we update all the nextQ values together:
# For each (S,a) pair in the targetQ matrix, we need to set the value
# to: mean_reward + y * expected_next_state value:
targetQ = rProbs[:,:,1]/(rProbs[:,:,0]+eps)

# Compute the transition weights from each state:
weights = tProbs/(np.sum(tProbs, axis=2, keepdims=True)+eps)

# So for each (S, a) pair, we have the nstates weights for each possible following states
# in the weight matrix above.
# we need to dot product this with the max Q'(S',a') value we currently have:
maxQs = np.max(sourceQ, axis=1)

targetQ += y * np.dot(weights, maxQs)

# Compute the difference between the source and target matrices:
tval = np.sum(np.sum(np.abs(targetQ)))
sval = np.sum(np.sum(np.abs(sourceQ)))
if tval==sval:
return targetQ

diff = 2.0*abs(tval - sval)/(sval+tval)

if diff > threshold and counter>0:
# logDEBUG("%d: Current targetQ change ratio: %f" % (counter, diff))
return updateQTarget(targetQ, counter-1)

return targetQ
• Then I update the targetQ matrix from the current allQ matrix in the training loop:
                # Fill our transitions and reward matrices:
tProbs[s, action, s1] += 1.0
rProbs[s, action, :] += [1.0, r]

targetQ = updateQTarget(allQ)

# Train our network using target and predicted Q values
sess.run(updateModel,feed_dict={nextQ:targetQ})
rAll += r
s = s1
• With this implementation, we got the following results:

• ⇒ This seems just a little bit less good than with the non-recursive target Q matrix computation. But one thing I noticed already is that the recursive computation is not converging And that's not quite what I would be expecting…
• So what can we do to make it converge ?
• Let's rethink a bit on how we can compute our $Q(S,a)$ values: first, we will now consider that we give a “value” to a given state on its own: $Q(S)$, how do we define this state value ? it should be defined as the expected reward if we reach this state.
• Actually we could even consider that the value depends on the path we take to reach that state: so the value we compute becomes $Q(S_t,a_t, S_{t+1})$. To compute this value we base ourself on the observations we have during our training, so for each triplet $(S_i, a_j, S_f)$ we will store the number of observations, as well as the cumulative reward.
• Next we need to compute Q(S,a) : this action can lead us to other states $S_i$, each with a probability $p_i$ and the reward we can expect in that case is the mean reward that we can extract from our $Q(S, a, S_i)$ tensor described above.
• So we have $Q(S,a) = \sum_i p_i \times \left(\frac{Q[S, a, S_i, 1]}{Q[S, a, S_i, 0]} + \gamma \times Q(S_i)\right)$

with: $p_i = \frac{Q[S, a, S_i, 0]}{\sum_j Q[S, a, S_j, 0]}$

• Then we can compute the $Q(S_i)$ values as: $Q(S_i) = {max}_a(Q(S_i, a))$
• Then we can recursively recompute the $Q(S_i,a)$ values until we get a convergence ?
• Let's clarify this even further:
• We call $Q^t_{i,j}$ the value of the Q target matrix element as state i, and action j
• We call $R_{i,j,k}$ the extimate reward we get when transitioning to state k from the $(S_i,a_j)$ pair.
• We call $Q^s_k$ the intrinsic value of the state k
• We call $p_{i,j,k}$ the probability of the transition from the pair $(S_i,a_j)$ to state k.
• We those definitions our formula above become:

$Q^t_{i,j} = \sum_k p_{i,j,k} \left( R_{i,j,k} + \gamma Q^s_k \right)$

• Then we should separate the “recursive part”, ie: from the state i, we could get back to state i directly maybe, so we have:

$Q^t_{i,j} = \sum_{k \neq i} p_{i,j,k} \left( R_{i,j,k} + \gamma Q^s_k \right) + p_{i,j,i} \left( R_{i,j,i} + \gamma Q^s_i \right)$

• Now, by definition we have: $Q^s_i = max_{a_j}(Q^t_{i,j})$
• So the idea is to update all the $Q^t_{i,j}$ and $Q^s_i$ values recursively.
• But unfortunately, it seems that the update of the target Q matrix is not converging. I also tried some kind of “probability amplification” in hope that this could help the convergence:
        targetQ = np.exp(targetQ)
targetQ /= np.sum(targetQ, axis=1, keepdims=True)  # the sum cannot be zero.
• And indeed, with this change it seems we are converging, but then the result we can achieve from this are not good. So this path is not really worth it.
• ⇒ Anyway, never mind ! Time to move to a new experiment now !