====== Using the TensorFlow queue system with the CIFAR dataset ======

{{tag>

So [[blog:

But now I'm back, and this time we are going to handle this queueing mechanism properly.

====== ======

This time we are going to use [[https://

First thing to note is that, this time, the input routines will expect us to use the binary dataset format, not the Python version. So I start by downloading that new version of the dataset from the page: http://

<code>
</code>
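Since the binary version of the dataset is just a flat stream of fixed-size records, it can even be decoded without TensorFlow at all. As a quick sanity check, here is a minimal sketch of that record layout (each 3073-byte record is 1 label byte followed by 3072 pixel bytes: 1024 red, then 1024 green, then 1024 blue values, row-major); the ''read_records'' helper is only an illustration, not part of the training code below.

<code python>
# Each CIFAR-10 binary record is 3073 bytes:
# 1 label byte + 32*32*3 = 3072 pixel bytes
# (1024 red values, then 1024 green, then 1024 blue, row-major 32x32).
RECORD_BYTES = 1 + 32 * 32 * 3  # 3073

def read_records(data):
    """Yield (label, pixels) tuples from raw CIFAR-10 binary data."""
    for offset in range(0, len(data) - RECORD_BYTES + 1, RECORD_BYTES):
        record = data[offset:offset + RECORD_BYTES]
        yield record[0], record[1:]
</code>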

Then we can use [[https://

**Important note**: what I finally understood in this process is that it is not easy to perform both training and evaluation of the network with this new approach. Actually, the recommended option seems to be to execute both separately: the train process would write checkpoints regularly, and the evaluation process would load those checkpoints before trying to evaluate the network predictions.

=> Or maybe I wasn't trying hard enough? As suggested on [[https://
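For reference, the queue-runner mechanism itself is nothing magical: background threads keep a bounded FIFO queue filled with batches while the main loop simply dequeues. The following is a plain-Python analogy of that producer/consumer pattern (not actual TensorFlow code); the ''make_batch'' callback and the ''None'' sentinel are illustrative choices of mine, not part of the TensorFlow API.

<code python>
import queue
import threading

def start_runner(q, make_batch, num_batches):
    """Background thread filling the queue, like a TF queue runner."""
    def runner():
        for i in range(num_batches):
            q.put(make_batch(i))  # blocks when the queue is full
        q.put(None)               # sentinel: no more batches
    t = threading.Thread(target=runner, daemon=True)
    t.start()
    return t

q = queue.Queue(maxsize=4)  # bounded, like a FIFOQueue capacity
start_runner(q, make_batch=lambda i: [i] * 3, num_batches=5)

batches = []
while True:
    batch = q.get()  # the "training loop" just dequeues
    if batch is None:
        break
    batches.append(batch)
</code>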

So I was actually able to use the queue inputs for the training part and the "
<code python>
if os.path.exists("..."):
    logDEBUG("...")
    shutil.rmtree("...")

with tf.Graph().as_default():

    with tf.variable_scope("..."):

        # We prepare the input queues:
        images, labels = CIFAR.distorted_inputs(data_dir, batch_size)
        # x = tf.placeholder("float", [None, 24, 24, 3])
        # y = tf.placeholder("float", [None])

        x = tf.placeholder_with_default(images, [None, 24, 24, 3])
        y = tf.placeholder_with_default(labels, [None])

        # Also retrieve the eval/test datasets:
        dataset = CIFAR.read_data_sets(root_path + "/...")

        phase_train = tf.placeholder_with_default(False, shape=())

        output = inference(x, phase_train)
        cost = loss(output, y)

        global_step = tf.Variable(0, trainable=False)

        train_op = training(cost, global_step)

        # eval_output = inference(x, phase_train)
        eval_op = evaluate(output, y)

        summary_op = tf.summary.merge_all()

        saver = tf.train.Saver()
        sess = tf.Session()

        # summary_writer = tf.summary.FileWriter("...")
        summary_writer = tf.summary.FileWriter("...", sess.graph)

        init_op = tf.global_variables_initializer()
        sess.run(init_op)

        # saver.restore(sess, "...")

        # Start the queue runners:
        logDEBUG("Starting queue runners...")
        tf.train.start_queue_runners(sess=sess)
        logDEBUG("Queue runners started.")

        # Training cycle:
        for step in range(max_steps):

            start_time = time.time()
            # sess.run(train_op, ...)

            # Compute average loss:
            # loss_value = sess.run(cost, ...)

            _, loss_value = sess.run([train_op, cost])
            duration = time.time() - start_time

            assert not np.isnan(loss_value), 'Model diverged with loss = NaN'

            if step % 10 == 0:
                num_examples_per_step = batch_size
                examples_per_sec = num_examples_per_step / duration
                sec_per_batch = float(duration)

                format_str = ('step %d, loss = %.2f (%.1f examples/sec; '
                              '%.3f sec/batch)')
                logDEBUG(format_str % (step, loss_value, examples_per_sec,
                                       sec_per_batch))

            if step % 100 == 0:
                summary_str = sess.run(summary_op)
                summary_writer.add_summary(summary_str, step)

            # Save the model checkpoint periodically:
            if step % 1000 == 0 or (step + 1) == max_steps:
                saver.save(sess, "...", global_step=step)

            if step % 200 == 0:
                accuracy = sess.run(eval_op, feed_dict={
                    x: dataset.validation.images,
                    y: dataset.validation.labels})
                logDEBUG("Validation accuracy: %f" % accuracy)

        logDEBUG("Optimization finished!")

        accuracy = sess.run(eval_op, feed_dict={
            x: dataset.test.images,
            y: dataset.test.labels})

        logDEBUG("Test accuracy: %f" % accuracy)
</code>

=> So with this setup, using 116990 training steps (i.e. corresponding to 300 epochs) we can reach a test accuracy of 81.5%, which is still not that impressive. But since the values reported by Google itself are rather in the range of 83%-86%, I'm starting to think maybe there was a typo in the book where we can read the values "
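As a sanity check on that step count (assuming the standard 50000 CIFAR-10 training images and a batch size of 128, which is a guess since the batch size is not shown in the log above):

<code python>
train_size = 50000  # CIFAR-10 training images
batch_size = 128    # assumed; not confirmed by the log above
epochs = 300

steps = epochs * train_size // batch_size
</code>

This gives 117187 steps, in the same ballpark as the 116990 steps mentioned above.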

=> Anyway, trying to train a bit longer now with 1000 epochs to confirm those results (and with a reduced learning rate): reaching **81.76%**, not really a large improvement.
