CIFAR-10 convolution network with batch normalization

So, continuing our deep learning adventures, we will now try to build a convolution network with batch normalization for training on the CIFAR-10 dataset.

Preparing helper functions

The first step we need to take is to prepare the additional helper functions we will need. Namely, we need to support batch normalization in convolution layers:

def conv_batch_norm(x, n_out, phase_train):
    beta_init = tf.constant_initializer(value=0.0, dtype=tf.float32)
    gamma_init = tf.constant_initializer(value=1.0, dtype=tf.float32)
    beta = tf.get_variable("beta", [n_out], initializer=beta_init)
    gamma = tf.get_variable("gamma", [n_out], initializer=gamma_init)
    batch_mean, batch_var = tf.nn.moments(x, [0,1,2], name='moments')
    
    ema = tf.train.ExponentialMovingAverage(decay=0.9)
    ema_apply_op = ema.apply([batch_mean, batch_var])
    ema_mean, ema_var = ema.average(batch_mean), ema.average(batch_var)

    def mean_var_with_update():
        with tf.control_dependencies([ema_apply_op]):
            return tf.identity(batch_mean), tf.identity(batch_var)

    mean, var = tf.cond(phase_train, mean_var_with_update, lambda: (ema_mean, ema_var))

    normed = tf.nn.batch_norm_with_global_normalization(x, mean, var, beta, gamma, 1e-3, True)

    return normed

Then we also need support for batch normalization on fully connected layers:

def layer_batch_norm(x, n_out, phase_train):
    beta_init = tf.constant_initializer(value=0.0, dtype=tf.float32)
    gamma_init = tf.constant_initializer(value=1.0, dtype=tf.float32)
    beta = tf.get_variable("beta", [n_out], initializer=beta_init)
    gamma = tf.get_variable("gamma", [n_out], initializer=gamma_init)
    
    batch_mean, batch_var = tf.nn.moments(x, [0], name='moments')
    
    ema = tf.train.ExponentialMovingAverage(decay=0.9)
    ema_apply_op = ema.apply([batch_mean, batch_var])
    ema_mean, ema_var = ema.average(batch_mean), ema.average(batch_var)

    def mean_var_with_update():
        with tf.control_dependencies([ema_apply_op]):
            return tf.identity(batch_mean), tf.identity(batch_var)

    mean, var = tf.cond(phase_train, mean_var_with_update, lambda: (ema_mean, ema_var))

    x_r = tf.reshape(x, [-1, 1, 1, n_out])
    normed = tf.nn.batch_norm_with_global_normalization(x_r, mean, var, beta, gamma, 1e-3, True)

    return tf.reshape(normed, [-1, n_out])
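
To make the computation performed by these two helpers more concrete, here is a minimal NumPy sketch of the same idea: normalize over the batch axes, then scale with gamma and shift with beta, and track the batch statistics with an exponential moving average for use at evaluation time. The names batch_norm_np and ema_update are hypothetical, and the activations are random placeholders; this is only an illustration, not the TensorFlow implementation.

```python
import numpy as np

def batch_norm_np(x, axes, beta, gamma, eps=1e-3):
    """Normalize x over the given axes, then scale by gamma and shift by beta."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    normed = (x - mean) / np.sqrt(var + eps)
    return gamma * normed + beta, mean.squeeze(), var.squeeze()

def ema_update(shadow, value, decay=0.9):
    """One moving average step: shadow <- decay * shadow + (1 - decay) * value."""
    return decay * shadow + (1.0 - decay) * value

rng = np.random.default_rng(0)

# Hypothetical convolution activations: 8 images, 6x6 feature maps, 4 channels.
# For a convolution layer we reduce over batch, height and width: axes (0, 1, 2).
conv_act = rng.normal(loc=3.0, scale=2.0, size=(8, 6, 6, 4))
out, batch_mean, batch_var = batch_norm_np(conv_act, (0, 1, 2), beta=0.0, gamma=1.0)

# Hypothetical fully connected activations: 8 samples, 5 units.
# Here we only reduce over the batch axis: axes (0,).
fc_act = rng.normal(size=(8, 5))
fc_out, fc_mean, fc_var = batch_norm_np(fc_act, (0,), beta=0.0, gamma=1.0)

# During training, the moving averages follow the batch statistics:
running_mean = ema_update(np.zeros(4), batch_mean)
running_var = ema_update(np.ones(4), batch_var)
```

At evaluation time, the running averages would be used in place of the batch statistics, which is what the tf.cond on phase_train selects in the helpers above.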

Then we integrate those new helper functions into our network layer building functions:

def conv2d(input, weight_shape, bias_shape, phase_train, visualize=False):
    count = weight_shape[0] * weight_shape[1] * weight_shape[2]
    weight_init = tf.random_normal_initializer(stddev=(2.0/count)**0.5)
    W = tf.get_variable("W", weight_shape, initializer=weight_init)
    
    if visualize:
        filter_summary(W, weight_shape)

    bias_init = tf.constant_initializer(value=0)
    b = tf.get_variable("b", bias_shape, initializer=bias_init)
    
    conv_out = tf.nn.conv2d(input, W, strides=[1, 1, 1, 1], padding='SAME')

    logits = tf.nn.bias_add(conv_out, b)
    return tf.nn.relu(conv_batch_norm(logits, weight_shape[3], phase_train))

def layer(input, weight_shape, bias_shape, phase_train):
    weight_init = tf.random_normal_initializer(stddev=(2.0/weight_shape[0])**0.5)
    bias_init = tf.constant_initializer(value=0)
    
    W = tf.get_variable("W", weight_shape, initializer=weight_init)
    b = tf.get_variable("b", bias_shape, initializer=bias_init)
    logits = tf.matmul(input, W) + b
    
    return tf.nn.relu(layer_batch_norm(logits, weight_shape[1], phase_train))
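
As a side note on the weight initialization used in both functions: the standard deviation is sqrt(2 / fan_in), where fan_in is the number of inputs feeding each output unit. The small sketch below (the helper name he_stddev is hypothetical) just evaluates this formula for two of the weight shapes used later in this post.

```python
import math

def he_stddev(fan_in):
    """Standard deviation sqrt(2 / fan_in), as used by weight_init above."""
    return math.sqrt(2.0 / fan_in)

# Convolution layer with a 5x5 kernel and 3 input channels:
conv_shape = [5, 5, 3, 64]
conv_fan_in = conv_shape[0] * conv_shape[1] * conv_shape[2]  # 75
print(round(he_stddev(conv_fan_in), 4))  # 0.1633

# Fully connected layer with 384 inputs:
fc_shape = [384, 192]
print(round(he_stddev(fc_shape[0]), 4))  # 0.0722
```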

⇒ Actually, I first tested those network architecture changes with the MNIST dataset: there I could get rid of dropout and use a learning rate of 0.01 instead of 0.001, as expected, and still achieve a test accuracy of 99.41% with 300 training epochs! Not bad.

Retrieve CIFAR-10 dataset

So now we should move to the CIFAR-10 dataset, but where do we get that from?

⇒ The CIFAR-10 dataset can be downloaded from this location: https://www.cs.toronto.edu/~kriz/cifar.html (and we take the python version obviously)

The web page mentioned above also describes the layout of the dataset inside the archive file we just retrieved. So we start by extracting this package file:

tar xvzf cifar-10-python.tar.gz

Then we will have to load the training batch files and the test batch file. We can use this script as a template, but I wanted to adapt it to the same “interface” as the one we have been using so far for the MNIST dataset.

So I started with a simple app script and built the required support functions step by step. Finally I ended with this CIFAR loader module:

import numpy as np
import tensorflow as tf
import pickle
from nv.core.utils import *

from nv.deep_learning.ImageDataSet import *

"""
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 
training images and 10000 test images.
The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains 
exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random 
order, but some training batches may contain more images from one class than another. Between them, the training 
batches contain exactly 5000 images from each class.
"""

def unpickle(file):
    """load the cifar-10 data"""

    with open(file, 'rb') as fo:
        data = pickle.load(fo, encoding='bytes')
    return data

def dense_to_one_hot(labels_dense, num_classes=10):
    """Convert class labels from scalars to one-hot vectors."""
    num_labels = labels_dense.shape[0]
    index_offset = np.arange(num_labels) * num_classes
    labels_one_hot = np.zeros((num_labels, num_classes))
    labels_one_hot.flat[index_offset + labels_dense.ravel()] = 1
    return labels_one_hot

def read_data_sets(data_dir, one_hot=False, dtype=tf.float32):
    """Read the CIFAR datasets from a given folder and return separated sets for train/validation/test"""

    class DataSets(object):
        pass

    data_sets = DataSets()

    meta_data_dict = unpickle(data_dir + "/batches.meta")
    label_names = meta_data_dict[b'label_names']
    label_names = np.array(label_names)

    logDEBUG("Found the label names: %s" % label_names)

    # Next we load the train data:
    train_data = None
    # train_filenames = []
    train_labels = []

    # train_data_dict
    # 'batch_label': 'training batch 5 of 5'
    # 'data': ndarray
    # 'filenames': list
    # 'labels': list

    for i in range(1, 6):
        logDEBUG("Loading data_batch_%d..." % i)
        train_data_dict = unpickle(data_dir + "/data_batch_{}".format(i))
        if i == 1:
            train_data = train_data_dict[b'data']
        else:
            train_data = np.vstack((train_data, train_data_dict[b'data']))
        # train_filenames += train_data_dict[b'filenames']
        train_labels += train_data_dict[b'labels']

    # Not sure we need the following:
    # train_data = train_data.reshape((len(train_data), 3, 32, 32))
    # train_data = np.rollaxis(train_data, 1, 4)

    # Convert labels to numpy array:
    train_labels = np.array(train_labels)

    # logDEBUG("train_data head: %s" % train_data[:5, :5]) => we see here that each pixel component value is in the range 0,255 as expected.

    # logDEBUG("train_labels head: %s" % train_labels[:5])
    if one_hot:
        train_labels = dense_to_one_hot(train_labels)
    # logDEBUG("train_labels head bis: %s" % train_labels[:5])

    logDEBUG("Train samples shape: %s" % str(train_data.shape))
    logDEBUG("Num train labels: %d" % len(train_labels))

    logDEBUG("Loading test_batch...")
    test_data_dict = unpickle(data_dir + "/test_batch")
    test_data = test_data_dict[b'data']
    # test_filenames = test_data_dict[b'filenames']
    test_labels = test_data_dict[b'labels']

    # Not sure we need the following:
    # test_data = test_data.reshape((len(test_data), 3, 32, 32))    
    # test_data = np.rollaxis(test_data, 1, 4)
    
    # Convert test labels to numpy array:
    test_labels = np.array(test_labels)
    if one_hot:
        test_labels = dense_to_one_hot(test_labels)

    logDEBUG("Initial Test samples shape: %s" % str(test_data.shape))
    logDEBUG("Initial num test labels: %d" % len(test_labels))
    
    # Now we separate the validation set from the final test set:
    VALIDATION_SIZE = 5000
    validation_data = test_data[:VALIDATION_SIZE]
    validation_labels = test_labels[:VALIDATION_SIZE]
    test_data = test_data[VALIDATION_SIZE:]
    test_labels = test_labels[VALIDATION_SIZE:]

    logDEBUG("Validation samples shape: %s" % str(validation_data.shape))
    logDEBUG("Num validation labels: %s" % str(validation_labels.shape))
    logDEBUG("Test samples shape: %s" % str(test_data.shape))
    logDEBUG("Num test labels: %s" % str(test_labels.shape))

    data_sets.train = ImageDataSet(train_data, train_labels, dtype=dtype)
    data_sets.validation = ImageDataSet(validation_data, validation_labels, dtype=dtype)
    data_sets.test = ImageDataSet(test_data, test_labels, dtype=dtype)

    return data_sets
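
As a quick sanity check of the label conversion used in this loader, the sketch below runs dense_to_one_hot (copied here so the snippet is self-contained) on three made-up labels and compares the result with the shorter np.eye indexing idiom, which should give the same matrix.

```python
import numpy as np

def dense_to_one_hot(labels_dense, num_classes=10):
    """Convert class labels from scalars to one-hot vectors (same logic as in the loader)."""
    num_labels = labels_dense.shape[0]
    index_offset = np.arange(num_labels) * num_classes
    labels_one_hot = np.zeros((num_labels, num_classes))
    labels_one_hot.flat[index_offset + labels_dense.ravel()] = 1
    return labels_one_hot

# Three made-up class labels:
labels = np.array([3, 0, 9])
one_hot = dense_to_one_hot(labels)

# Alternative idiom: pick rows of the identity matrix.
alt = np.eye(10)[labels]

print(one_hot.shape)                 # (3, 10)
print(np.array_equal(one_hot, alt))  # True
```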

Augmenting the dataset

The script above will provide us with the raw images from the CIFAR-10 dataset, of size 32×32, but the reference book mentions that we should use 24×24 sub-images randomly cropped to augment the dataset.

I was initially thinking about training the network with the image sizes 32×32 directly, but now I'm pretty sure this will just overfit, so I won't even bother giving it a try.

I eventually found this page describing the transformations that should be performed on the dataset.

Now it makes sense to reshape the input dataset for further transformation with:

    train_data = train_data.reshape((len(train_data), 3, 32, 32))
    train_data = np.rollaxis(train_data, 1, 4)
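
To check what these two lines do, here is a small sketch using a synthetic row instead of real data. Each CIFAR-10 row stores the 1024 red values first, then the 1024 green values, then the 1024 blue values, so the reshape yields a channels-first array and the rollaxis call moves the channel axis to the end.

```python
import numpy as np

# Synthetic stand-in for one CIFAR-10 row: 1024 red, 1024 green, 1024 blue values.
row = np.concatenate([np.full(1024, 10), np.full(1024, 20), np.full(1024, 30)]).astype(np.uint8)
data = row.reshape(1, 3072)

images = data.reshape((len(data), 3, 32, 32))  # (N, channels, height, width)
images = np.rollaxis(images, 1, 4)             # (N, height, width, channels)

print(images.shape)     # (1, 32, 32, 3)
print(images[0, 0, 0])  # [10 20 30]

# np.transpose with an explicit axis order is an equivalent way to write this:
same = data.reshape((len(data), 3, 32, 32)).transpose(0, 2, 3, 1)
```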

Then we need a mechanism to retrieve the centered, fixed sub-images for evaluation and test. Since there is no randomness involved here, we might as well perform this step only once:

    # We start with casting the data to float32:
    test_data = normalizeImages(test_data, np.float32, 1.0/255.0)

    # Then we extract the centered images:
    # Compute the x and y offsets on each image:
    height = IMAGE_SIZE
    width = IMAGE_SIZE
    xoff = (32 - width)//2
    yoff = (32 - height)//2

    test_data = test_data[:,yoff:yoff+height,xoff:xoff+width,:]

    logDEBUG("Resized test data shape: %s" % str(test_data.shape))

    # Then we should perform the image "standardization" for each image:

    # "Linearly scales image to have zero mean and unit variance.
    # This op computes (x - mean) / adjusted_stddev, where mean is the average of all values in image, and adjusted_stddev = max(stddev, 1.0/sqrt(image.NumElements())).
    # stddev is the standard deviation of all values in image. It is capped away from zero to protect against division by 0 when handling uniform images."

    # To do so we need to compute the mean value for each row (i.e. each image) of the dataset:
    numels = width*height*3
    test_data = test_data.reshape((test_data.shape[0], numels))

    # axis=1 gives one mean/stddev per image; keepdims=True keeps them broadcastable:
    mean = np.mean(test_data, axis=1, keepdims=True)
    stddev = np.std(test_data, axis=1, keepdims=True)
    adjdev = np.maximum(stddev, 1.0/np.sqrt(numels))

    test_data = (test_data - mean)/adjdev
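
For reference, the per-image standardization described in the quoted text can be sketched as a standalone NumPy function. The name per_image_standardize is hypothetical and the input is random placeholder data; the key point is that each image gets its own mean and deviation, hence the reductions over axis 1 of the flattened array.

```python
import numpy as np

def per_image_standardize(images):
    """Scale each image to zero mean and unit variance, with the stddev capped away from zero."""
    n = images.shape[0]
    flat = images.reshape(n, -1).astype(np.float32)
    numels = flat.shape[1]
    mean = flat.mean(axis=1, keepdims=True)    # one mean per image
    stddev = flat.std(axis=1, keepdims=True)   # one stddev per image
    adjdev = np.maximum(stddev, 1.0 / np.sqrt(numels))
    return ((flat - mean) / adjdev).reshape(images.shape)

rng = np.random.default_rng(0)
images = rng.random((4, 24, 24, 3))  # placeholder images with values in [0, 1)
standardized = per_image_standardize(images)

# A uniform image has zero stddev; the cap avoids a division by zero:
uniform = per_image_standardize(np.ones((1, 24, 24, 3)))
```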

The “regular” way to achieve this seems rather to be putting all the images we want to use on a “queue”, where random processing can then be applied to them.

The next part is to perform the random transformations on the training data. But this part should be done randomly for each batch, so the logic should instead be provided in our dataset provider directly, using the following function that I created:

def nvRandomCropImages(data, height, width):
    # We assume that the data set we receive here has a shape: (num_images, img_height, img_width, 3)

    # We also assume that the base width/height values are larger than the target width/height values:
    assert height < data.shape[1]
    assert width < data.shape[2]
    
    numImgs = data.shape[0]

    numChannels = data.shape[3]

    # Then we prepare all the dimensions one by one:
    d0 = np.arange(numImgs)

    # For each image we will collect height*width*numChannels elements, so we should repeat the d0 dimension
    # accordingly:
    d0 = np.repeat(d0, height*width*numChannels).reshape(numImgs, height, width, numChannels)

    # Next we prepare the height dimension, for each image, we will select a random height offset value:
    maxhoff = data.shape[1]-height+1 # If the initial height is 6 for instance, and the target height is 2,
    # then the offset value can be 0,1,2,3,4 => so the max value *exclusive* is 6-2+1
    d1 = np.random.randint(0,maxhoff,numImgs).reshape(numImgs,1)

    # Now we only have *1* height offset value per image, we also need to select the following "height values"
    ext = np.arange(height).reshape(1, height)

    # Now if we add both d1 and ext vectors, the values will broadcast to select all the required height indices:
    d1 = (d1+ext).reshape(-1)

    # Finally, for each height value in the vector d1 we will select width*numChannels value, so we should repeat 
    # those height values accordingly:
    d1 = np.repeat(d1, width*numChannels).reshape(numImgs, height, width, numChannels)
    # print("d1: ", d1)

    # Next we prepare the width dimension indices:
    maxwoff = data.shape[2]-width+1
    d2 = np.random.randint(0,maxwoff,numImgs).reshape(numImgs,1)

    # Each image's width offset applies to all of its `height` rows, so we repeat it accordingly:
    # Note that we do not apply the repetition on the number of channels yet here:
    d2 = np.repeat(d2, height).reshape(-1,1)

    # then we apply the extension:
    ext = np.arange(width).reshape(1, width)

    d2 = (d2+ext).reshape(-1)

    # And finally we repeat here on the number of channels:
    d2 = np.repeat(d2, numChannels).reshape(numImgs, height, width, numChannels)
    # print("d2: ", d2)

    # Finally we prepare the channels dimension:
    d3 = np.zeros((numImgs*height*width), dtype=np.int32).reshape(-1,1)

    ext = np.arange(numChannels).reshape(1, numChannels)

    d3 = (d3+ext).reshape(numImgs, height, width, numChannels)
    # print("d3: ", d3)

    # Now we can extract the sub tensor:
    res = data[d0,d1,d2,d3]
    return res

The function above is pretty complex and a bit slow… so I'm not convinced anymore this is the right solution, and I will probably just get to the tensorflow queueing system at some point ;-)
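
For comparison, here is a simpler sketch of the same random crop that draws one offset pair per image and then uses plain slicing in a loop. The function name random_crop_images and the optional rng parameter are my own choices for this illustration, and I have not benchmarked it against the index-based version above.

```python
import numpy as np

def random_crop_images(data, height, width, rng=None):
    """Randomly crop each image of a (num_images, img_height, img_width, channels) array."""
    rng = np.random.default_rng() if rng is None else rng
    num_imgs, src_h, src_w, channels = data.shape
    assert height <= src_h and width <= src_w

    # One random (y, x) offset per image; the upper bound is exclusive.
    yoffs = rng.integers(0, src_h - height + 1, size=num_imgs)
    xoffs = rng.integers(0, src_w - width + 1, size=num_imgs)

    out = np.empty((num_imgs, height, width, channels), dtype=data.dtype)
    for i, (y, x) in enumerate(zip(yoffs, xoffs)):
        out[i] = data[i, y:y + height, x:x + width, :]
    return out

# Placeholder batch of 100 images of size 32x32 with 3 channels:
batch = np.zeros((100, 32, 32, 3), dtype=np.float32)
crops = random_crop_images(batch, 24, 24)
print(crops.shape)  # (100, 24, 24, 3)
```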

Building the network

Now we can build the network architecture for this experiment as follows:

def inference(x, phase_train):
    with tf.variable_scope("conv_1"):
        conv_1 = conv2d(x, [5, 5, 3, 64], [64], phase_train,visualize=False) #visualize=True ?
        pool_1 = max_pool(conv_1)

    with tf.variable_scope("conv_2"):
        conv_2 = conv2d(pool_1, [5, 5, 64, 64], [64], phase_train)
        pool_2 = max_pool(conv_2)

    with tf.variable_scope("fc_1"):
        dim = 1
        for d in pool_2.get_shape()[1:].as_list():
            dim *= d

        pool_2_flat = tf.reshape(pool_2, [-1, dim])
        fc_1 = layer(pool_2_flat, [dim, 384], [384], phase_train)
        
        # apply dropout
        # fc_1_drop = tf.nn.dropout(fc_1, keep_prob)
    
    with tf.variable_scope("fc_2"):
        fc_2 = layer(fc_1, [384, 192], [192], phase_train)
        
        # apply dropout
        # fc_2_drop = tf.nn.dropout(fc_2, keep_prob)
    
    with tf.variable_scope("output"):
        output = layer(fc_2, [192, 10], [10], phase_train)
    
    return output
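
The dim value computed in the fc_1 scope can be checked by hand. The sketch below assumes 24×24 input crops and assumes that max_pool (not shown in this post) is a 2×2 pooling with stride 2 and 'SAME' padding; under those assumptions each pooling halves the spatial size, while the convolutions (stride 1, 'SAME' padding) keep it unchanged.

```python
import math

def same_pool_out(size, stride=2):
    """Output size of a 'SAME'-padded pooling with the given stride (assumed 2 here)."""
    return math.ceil(size / stride)

side = 24            # assumed input crop size
for _ in range(2):   # two conv + pool stages; only the pooling changes the size
    side = same_pool_out(side)

channels = 64        # output channels of conv_2
dim = side * side * channels
print(side, dim)  # 6 2304
```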

The phase_train argument mentioned above is simply a boolean flag, and we create a placeholder for it in the main loop of our training script, with a default value set to False, which we override to True when running the training operation:

# Creating the placeholder:
phase_train = tf.placeholder_with_default(False, shape=())

# And later:
sess.run(train_op, feed_dict={x: minibatch_x, y: minibatch_y, phase_train: True})

Observed results

With the setup described above, and using a learning_rate of 0.01, and batch_size of 100, I could only reach a test accuracy of about 82.36% after 300 training epochs.

⇒ This is not so good (compared to the theoretical accuracy of 96.7% mentioned in the reference book), but at the same time, I'm not performing all the data augmentation steps that should be done (like horizontal mirror, random contrast, random brightness), and maybe my learning rate was too high anyway? So I will now try to train with a smaller learning rate (0.001) for more epochs (1000).

But well, it doesn't seem we will get much better results this way… So it's time to augment our data more cleverly using the TensorFlow queueing system.
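
Independently of the queueing system, the augmentation steps I skipped (horizontal mirror, random brightness, random contrast) can be sketched in plain NumPy to show what each one does. The function names, the max_delta value and the contrast range are illustrative assumptions, not values I actually trained with, and the images are assumed to be float arrays of shape (num_images, height, width, channels).

```python
import numpy as np

def random_flip_left_right(images, rng):
    """Mirror each image horizontally with probability 0.5."""
    flip = rng.random(images.shape[0]) < 0.5
    out = images.copy()
    out[flip] = out[flip, :, ::-1, :]
    return out

def random_brightness(images, rng, max_delta=0.2):
    """Add one random offset in [-max_delta, max_delta) to every pixel of each image."""
    delta = rng.uniform(-max_delta, max_delta, size=(images.shape[0], 1, 1, 1))
    return images + delta

def random_contrast(images, rng, lower=0.5, upper=1.5):
    """Scale each image around its per-channel mean by a random factor."""
    factor = rng.uniform(lower, upper, size=(images.shape[0], 1, 1, 1))
    mean = images.mean(axis=(1, 2), keepdims=True)  # per-image, per-channel mean
    return (images - mean) * factor + mean

rng = np.random.default_rng(0)
batch = rng.random((6, 24, 24, 3))  # placeholder batch with values in [0, 1)
augmented = random_contrast(random_brightness(random_flip_left_right(batch, rng), rng), rng)
print(augmented.shape)  # (6, 24, 24, 3)
```

These transformations would be applied to each training minibatch before standardization, so that the network sees a slightly different version of every image at each epoch.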