MNIST convolution network

In this post we build a convolutional network with the architecture suggested in the reference book: two convolutional layers, each followed by a max-pooling layer, then a fully connected layer (with dropout, p=0.5), and a final softmax layer.

Building the network

To build this network we use the previous implementation script as a reference and add the helper functions needed to build the convolutional and pooling layers:

def conv2d(input, weight_shape, bias_shape):
    # He initialization: fan-in = filter height * filter width * input channels
    count = weight_shape[0] * weight_shape[1] * weight_shape[2]
    weight_init = tf.random_normal_initializer(stddev=(2.0/count)**0.5)
    W = tf.get_variable("W", weight_shape, initializer=weight_init)
    bias_init = tf.constant_initializer(value=0)
    b = tf.get_variable("b", bias_shape, initializer=bias_init)
    conv_out = tf.nn.conv2d(input, W, strides=[1, 1, 1, 1], padding='SAME')
    return tf.nn.relu(tf.nn.bias_add(conv_out, b))

def max_pool(input, k=2):
    return tf.nn.max_pool(input, ksize=[1, k, k, 1], strides=[1, k, k, 1], padding='SAME')
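
As a quick sanity check on the initializer in conv2d, the standard deviation it computes is sqrt(2 / fan_in), where fan_in is the number of inputs feeding each output unit. The sketch below only reproduces that arithmetic for the two filter shapes used in this post. The helper name he_stddev is made up for illustration and is not part of the script.

```python
import math

def he_stddev(weight_shape):
    """Std-dev used by the conv2d helper: sqrt(2 / fan_in)."""
    # fan_in = filter_height * filter_width * input_channels
    fan_in = weight_shape[0] * weight_shape[1] * weight_shape[2]
    return math.sqrt(2.0 / fan_in)

# The two convolution filter shapes used in this post
for shape in ([5, 5, 1, 32], [5, 5, 32, 64]):
    print(shape, round(he_stddev(shape), 4))
```

The second layer has a much larger fan-in (5 * 5 * 32 = 800 against 25), so its weights start out with a noticeably smaller spread.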

Then we update the inference function:

def inference(x, keep_prob):
    x = tf.reshape(x, shape=[-1, 28, 28, 1])
    with tf.variable_scope("conv_1"):
        conv_1 = conv2d(x, [5, 5, 1, 32], [32])
        pool_1 = max_pool(conv_1)
    with tf.variable_scope("conv_2"):
        conv_2 = conv2d(pool_1, [5, 5, 32, 64], [64])
        pool_2 = max_pool(conv_2)
    with tf.variable_scope("fc"):
        pool_2_flat = tf.reshape(pool_2, [-1, 7 * 7 * 64])
        fc_1 = layer(pool_2_flat, [7*7*64, 1024], [1024])
        # apply dropout
        fc_1_drop = tf.nn.dropout(fc_1, keep_prob)

    with tf.variable_scope("output"):
        output = layer(fc_1_drop, [1024, 10], [10])
    return output
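
To get a feel for where the capacity of this network sits, we can count the trainable parameters per layer. This is only a sketch: it assumes the layer helper from the previous script creates one weight tensor of shape weight_shape and one bias vector of shape bias_shape, just like conv2d does. The name count_params is hypothetical.

```python
def count_params(weight_shape, bias_shape):
    """Number of trainable values in one layer: all weights plus all biases."""
    n_weights = 1
    for dim in weight_shape:
        n_weights *= dim
    return n_weights + bias_shape[0]

# Shapes copied from the inference function
layers = {
    "conv_1": ([5, 5, 1, 32], [32]),
    "conv_2": ([5, 5, 32, 64], [64]),
    "fc": ([7 * 7 * 64, 1024], [1024]),
    "output": ([1024, 10], [10]),
}

param_counts = {name: count_params(w, b) for name, (w, b) in layers.items()}
total_params = sum(param_counts.values())

for name, n in param_counts.items():
    print(name, n)
print("total", total_params)
```

Under these assumptions almost all of the roughly 3.27 million parameters sit in the fully connected layer, which is also the layer where dropout is applied.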

Figuring out layer widths/heights

Since we used padding='SAME' with a stride of 1 for our convolution layers, the output of each convolution has the same width and height as its input. The input size is 28×28 and the filter size is 5×5, so this corresponds to a zero padding of 2 pixels around the input images.

To compute the width/height of the output of a conv layer, we can use the formulas:

\[width_{out} = \left\lfloor\frac{width_{in} - extent + 2*padding}{stride} \right\rfloor + 1 \]
\[height_{out} = \left\lfloor \frac{height_{in} - extent + 2*padding}{stride}\right\rfloor + 1 \]

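The convolutions preserve the 28×28 size, and each 2×2 max pooling with a stride of 2 halves it. So after the first max pooling we have a dimension of 14×14, and after the second one we have a size of 7×7, which is the 7 * 7 * 64 used when flattening pool_2 in the inference function.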
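
The size bookkeeping can be sketched in a few lines of plain Python. The convolution helper uses the usual form floor((in - extent + 2 * padding) / stride) + 1, and the pooling helper uses the rule TensorFlow applies for padding='SAME', which is ceil(in / stride). Both function names are made up for this example.

```python
import math

def conv_out_size(size_in, extent, padding, stride):
    """Conv output size: floor((in - extent + 2 * padding) / stride) + 1."""
    return (size_in - extent + 2 * padding) // stride + 1

def same_pool_out_size(size_in, stride):
    """Output size of a padding='SAME' pooling layer: ceil(in / stride)."""
    return math.ceil(size_in / stride)

sizes = []
size = 28  # MNIST images are 28x28
for _ in range(2):
    # 5x5 convolution, stride 1, zero padding of 2 pixels
    size = conv_out_size(size, extent=5, padding=2, stride=1)
    sizes.append(size)
    # 2x2 max pooling with stride 2
    size = same_pool_out_size(size, stride=2)
    sizes.append(size)

print(sizes)
```

The list holds the width after conv_1, pool_1, conv_2 and pool_2 in that order, and the last entry is the 7 that appears in the reshape before the fully connected layer.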

Setting up Adam Optimizer

This time we train the network with the Adam optimizer, so I updated the training function:

def training(cost, global_step):
    tf.summary.scalar("cost", cost)
    optimizer = tf.train.AdamOptimizer(learning_rate)
    train_op = optimizer.minimize(cost, global_step=global_step)
    return train_op
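
For intuition about what the learning rate means with Adam, here is a minimal sketch of one update step for a single scalar parameter. It follows the textbook form of the algorithm and is not TensorFlow's exact implementation; the function adam_step and the gradient values are purely illustrative, and the beta1, beta2 and eps values are the commonly used defaults.

```python
import math

def adam_step(param, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam update for a single scalar parameter (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                # bias-corrected second moment
    new_param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return new_param, m, v

# Size of the very first step for a huge and a tiny gradient
first_steps = {}
for grad in (250.0, 0.004):
    new_param, _, _ = adam_step(param=1.0, grad=grad, m=0.0, v=0.0, t=1, lr=0.01)
    first_steps[grad] = 1.0 - new_param
    print(grad, first_steps[grad])
```

In this sketch both first steps come out at roughly 0.01, whatever the gradient magnitude: Adam normalizes the gradient, so the learning rate acts more like a per-parameter step size than a multiplier on the raw gradient.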

Proper setup with dropout

One additional detail to keep in mind is that dropout should only be active during training: we use a keep probability of 0.5 during training, but a value of 1.0 during evaluation. So I first tried a small change to the main loop to reflect this:

            output = inference(x, 0.5)
            cost = loss(output, y)

            global_step = tf.Variable(0, name='global_step', trainable=False)

            train_op = training(cost, global_step)

            # For the evaluation we use a dropout value of 1.0:
            eval_out = inference(x, 1.0)
            eval_op = evaluate(eval_out, y)
            summary_op = tf.summary.merge_all()

But of course, building the graph this way didn't work: calling “inference” twice means defining the layers (and their variables) twice. So instead I tried to set the reuse flag on the global variable scope:

        with tf.variable_scope("mlp_model") as scope:

            x = tf.placeholder("float", [None, 784]) # mnist data image of shape 28*28=784
            y = tf.placeholder("float", [None, 10]) # 0-9 digits recognition => 10 classes

            output = inference(x, 0.5)
            cost = loss(output, y)

            global_step = tf.Variable(0, name='global_step', trainable=False)

            train_op = training(cost, global_step)

            # For the evaluation we use a dropout value of 1.0,
            # reusing the variables created by the first inference() call:
            scope.reuse_variables()
            eval_out = inference(x, 1.0)

With this change I could launch the training, but the network did not seem to be learning anything:

2019-01-01T21:29:01.783901 [DEBUG] Epoch: 0001, cost=2.378644308
2019-01-01T21:29:02.068586 [DEBUG] Validation Error: 0.904200
2019-01-01T21:29:07.736113 [DEBUG] Epoch: 0002, cost=2.302584587
2019-01-01T21:29:07.779086 [DEBUG] Validation Error: 0.904200
2019-01-01T21:29:13.306788 [DEBUG] Epoch: 0003, cost=2.302585052
2019-01-01T21:29:13.349761 [DEBUG] Validation Error: 0.904200
2019-01-01T21:29:18.897362 [DEBUG] Epoch: 0004, cost=2.302586809
2019-01-01T21:29:18.940488 [DEBUG] Validation Error: 0.904200
2019-01-01T21:29:24.368556 [DEBUG] Epoch: 0005, cost=2.302585125
2019-01-01T21:29:24.412528 [DEBUG] Validation Error: 0.904200
2019-01-01T21:29:29.923250 [DEBUG] Epoch: 0006, cost=2.302585246
2019-01-01T21:29:29.967224 [DEBUG] Validation Error: 0.904200
2019-01-01T21:29:35.406020 [DEBUG] Epoch: 0007, cost=2.302585125
2019-01-01T21:29:35.448995 [DEBUG] Validation Error: 0.904200
2019-01-01T21:29:41.119687 [DEBUG] Epoch: 0008, cost=2.302585125
2019-01-01T21:29:41.166659 [DEBUG] Validation Error: 0.904200
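
The value the cost gets stuck on is worth a second look. Assuming the loss function from the previous script is a softmax cross-entropy using the natural logarithm, a network that assigns the same probability to each of the 10 classes has a cost of ln(10). The small check below is just that arithmetic:

```python
import math

num_classes = 10

# Cross-entropy of a uniform prediction: -log(1 / num_classes) = log(num_classes)
chance_cost = -math.log(1.0 / num_classes)

# Error rate when every example gets the same predicted class,
# assuming the classes are roughly balanced
chance_error = 1.0 - 1.0 / num_classes

print(round(chance_cost, 6))
print(chance_error)
```

A cost pinned at about 2.302585 together with a validation error close to 0.9 is therefore what a network that has collapsed to a constant, uninformative output looks like, and not just one that is learning slowly.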

I eventually found this page, which suggests turning the keep_prob value into a regular TensorFlow “placeholder”, which makes a lot of sense. So I updated the code accordingly:

prob = tf.placeholder_with_default(1.0, shape=())

# and later, when running the training op:
sess.run(train_op, feed_dict={x: minibatch_x, y: minibatch_y, prob: 0.5})
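
To see why a keep probability of 1.0 is the right value for evaluation, here is a plain-Python sketch of the behaviour tf.nn.dropout documents for its keep_prob argument: each element is kept with probability keep_prob and scaled by 1 / keep_prob, otherwise it is set to 0. The function below is an illustration only, not TensorFlow code, and the activation values are invented.

```python
import random

def dropout(values, keep_prob, rng):
    """Inverted dropout: keep each value with probability keep_prob, scaled by 1 / keep_prob."""
    out = []
    for v in values:
        if rng.random() < keep_prob:
            out.append(v / keep_prob)  # kept and rescaled
        else:
            out.append(0.0)            # dropped
    return out

rng = random.Random(0)
activations = [1.0, 2.0, 3.0, 4.0]

eval_out = dropout(activations, 1.0, rng)   # keep_prob=1.0: nothing is dropped
train_out = dropout(activations, 0.5, rng)  # keep_prob=0.5: each entry is 0 or doubled

print(eval_out)
print(train_out)
```

With keep_prob=1.0 the function is the identity, which is why a placeholder defaulting to 1.0 gives the evaluation behaviour without feeding anything, while training feeds 0.5 explicitly.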

And again this didn't seem to work: even with the placeholder, the network was still not learning anything, just as reported above (stuck on the same values after more than 100 epochs). So something else had to be wrong.

OK, found it: the learning rate of 0.01 I was using was too high. With a rate of 0.001 the training results look good. Also note that I slightly increased the minibatch size, as shown here:

learning_rate = 0.001
training_epochs = 300
# training_epochs = 2000
batch_size = 128
display_step = 1

With this last change I could achieve a test accuracy of 99.3% after 300 training epochs, which is exactly what we expected. So we are all good on this experiment :-).