TensorFlow install adventures

So I've been reading the book Fundamentals of Deep Learning (2017) by Nikhil Buduma et al., and it made me realize I definitely need to get started on Deep Learning and Machine Learning in general. I mean, I've already taken a few courses, and I was lucky enough to follow them from some of the fathers of machine learning, like Andrew Ng and Geoffrey Hinton, so I already have a decent understanding of the base concepts at play. I built some networks at some point, so I have some minimal experience too, but then I simply moved on to other stuff, so it's high time I got back to it.

The idea here is thus to start playing with the MNIST dataset and to experiment with state-of-the-art deep learning models in TensorFlow.

Installing TensorFlow

Since TensorFlow is available as a Python library it can easily be installed with pip:

pip install --upgrade tensorflow-gpu

Once the TensorFlow installation succeeds, we run a minimal test to check that the library is properly installed:

import tensorflow as tf

deep_learning = tf.constant('Deep Learning')
session = tf.Session()
session.run(deep_learning)

But of course, this gave me an error:

ImportError: DLL load failed: The specified module could not be found.

Failed to load the native TensorFlow runtime.
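
When poking at an import failure like this one, it helps to capture the full traceback instead of letting the script die. Here is a minimal sketch of that idea; the try_import helper is a hypothetical name of mine, not something TensorFlow provides:

```python
import importlib
import traceback


def try_import(module_name):
    """Try to import a module by name.

    Returns (module, None) on success, or (None, traceback_text) when the
    import raises ImportError (which also covers ModuleNotFoundError).
    """
    try:
        return importlib.import_module(module_name), None
    except ImportError:
        # Keep the whole stack trace: the TensorFlow error message asks
        # for it when requesting help.
        return None, traceback.format_exc()
```

Calling try_import("tensorflow") would then give back either the module or the error text to inspect. Note that this only catches Python-level import errors; a hard abort inside native code would still terminate the interpreter.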

The error message didn't provide any clear detail, so I investigated and found this bug report page. From that page I downloaded the tensorflow_self_check.py script and tried to run it:

$ nv_call_python tensorflow_self_check.py
ERROR: Failed to import the TensorFlow module.

WARNING! This script is no longer maintained!
=============================================

Since TensorFlow 1.4, the self-check has been integrated with TensorFlow itself,
and any missing DLLs will be reported when you execute the `import tensorflow`
statement. The error messages printed below refer to TensorFlow 1.3 and earlier,
and are inaccurate for later versions of TensorFlow.

- Python version is 3.6.

- TensorFlow is installed at: D:\Projects\NervSeed\tools\windows\python-3.6\bin\lib\site-packages\tensorflow

- Could not load 'cudnn64_5.dll'. The GPU version of TensorFlow
  requires that this DLL be installed in a directory that is named in
  your %PATH% environment variable. Note that installing cuDNN is a
  separate step from installing CUDA, and it is often found in a
  different directory from the CUDA DLLs. You may install the
  necessary DLL by downloading cuDNN 5.1 from this URL:
  https://developer.nvidia.com/cudnn

- Could not find cuDNN.

So it would seem that I don't have cuDNN installed, which is absolutely true :-). But it also seems that the script I used above is not maintained anymore, so it might be a good idea to simply try the tensorflow import as it suggests:
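
The core of such a self-check can be approximated in a few lines by asking the dynamic loader directly. This is only a rough sketch, not the actual self-check script: the can_load helper is hypothetical, and the DLL names are my assumption of what CUDA 10.0 and cuDNN 7 ship on Windows. As far as I know, from Python 3.8 on, ctypes on Windows no longer searches %PATH% by default, so as written this is only meaningful on older interpreters such as the 3.6 used here.

```python
import ctypes


def can_load(library_name):
    """Return True if the dynamic loader can find and load the library."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        # Raised when the library (or one of its dependencies) is missing.
        return False


# Assumed DLL names for CUDA 10.0 and cuDNN 7 on Windows; adjust to the
# versions actually installed.
for name in ("cudart64_100.dll", "cudnn64_7.dll"):
    print(name, "OK" if can_load(name) else "MISSING")
```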

$ nv_call_python -i
Python 3.6.3 (v3.6.3:2c5fed8, Oct  3 2017, 18:11:49) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
Traceback (most recent call last):
  File "D:\Projects\NervSeed\tools\windows\python-3.6\bin\lib\site-packages\tensorflow\python\pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "D:\Projects\NervSeed\tools\windows\python-3.6\bin\lib\site-packages\tensorflow\python\pywrap_tensorflow_internal.py", line 28, in <module>

...

ImportError: DLL load failed: The specified module could not be found.


Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/errors

for some common reasons and solutions.  Include the entire stack trace
above this error message when asking for help.

⇒ This is the same message as in the initial test script used above, and it is certainly not as clear as the output of the self_check script… So let's focus on the hint we got from that one and try to install cuDNN before trying anything else.

cuDNN can be downloaded from: https://developer.nvidia.com/rdp/cudnn-download (note that registration with NVIDIA is required here)

⇒ So I downloaded cuDNN v7.4.2 for CUDA 10.0, which means I also need to upgrade CUDA itself.

CUDA 10.0 can be downloaded from: https://developer.nvidia.com/cuda-downloads

Note that for the Linux installation we should run the shell command:

sudo sh cuda_10.0.130_410.48_linux.run

cuDNN installation on Linux

cuDNN install instructions can be found on this page: https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html

Following the instructions from the link above we do:

$ tar xvzf /mnt/array1/softwares/Dev/cudnn-10.0-linux-x64-v7.4.2.24.tgz
$ sudo cp cuda/include/cudnn.h /usr/local/cuda/include
$ sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
$ sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
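
A quick way to double-check which cuDNN version actually landed in /usr/local/cuda is to read the version macros from the copied header. The sketch below assumes the cuDNN 7 layout, where cudnn.h defines CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL; the read_cudnn_version helper is a hypothetical name:

```python
import re


def read_cudnn_version(header_path):
    """Parse (major, minor, patch) from the #define lines of cudnn.h.

    Returns None if any of the three macros is missing.
    """
    with open(header_path, encoding="utf-8", errors="replace") as handle:
        text = handle.read()

    parts = []
    for key in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        # Only match plain numeric defines such as "#define CUDNN_MAJOR 7".
        match = re.search(r"^\s*#\s*define\s+" + key + r"\s+(\d+)",
                          text, re.MULTILINE)
        if match is None:
            return None
        parts.append(int(match.group(1)))
    return tuple(parts)
```

With the files copied as above, read_cudnn_version("/usr/local/cuda/include/cudnn.h") should report the version of the archive that was extracted.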

Then we should ensure that the cuDNN library is properly installed, so we try to build the test MNIST sample as suggested. The cuDNN samples package can be downloaded as a .deb file for Ubuntu 14.04 from the download site mentioned above. But since we didn't use the .deb file for the library installation, installing this package gives an error and leaves the doc package unconfigured (which is not critical from my perspective). The samples themselves still end up in /usr/src, so we can copy and build them:

$ cp -r /usr/src/cudnn_samples_v7 /home/kenshin/build/
$ cd /home/kenshin/build/cudnn_samples_v7/mnistCUDNN/
$ make clean && make
$ export LD_LIBRARY_PATH=/usr/local/cuda/lib64/
$ ./mnistCUDNN

⇒ And we get the "Test passed!" result as expected.

The cuDNN installation on Windows is similar: we simply need to copy the files extracted from the zip into the corresponding CUDA bin/include/lib folders.

Testing TensorFlow again

On both Windows and Linux, the CUDA/cuDNN libraries obviously have to be reachable by TensorFlow, so we need to update our path accordingly:

  • On Windows, we add to our path the bin folder, for instance in my case: D:\Apps\Cuda-10.0\bin
  • On Linux we would change the LD_LIBRARY_PATH:
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64/
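
To verify that the path update really exposes the libraries, we can list which directories of the search path contain a given file. This is a minimal sketch with a hypothetical find_in_search_path helper; the file names mentioned afterwards are my assumptions for cuDNN 7:

```python
import os


def find_in_search_path(file_name, env_var):
    """Return the directories listed in env_var that contain file_name."""
    hits = []
    # os.pathsep is ';' on Windows and ':' on Linux.
    for directory in os.environ.get(env_var, "").split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, file_name)):
            hits.append(directory)
    return hits
```

On Linux this would be called as find_in_search_path("libcudnn.so.7", "LD_LIBRARY_PATH"), and on Windows as find_in_search_path("cudnn64_7.dll", "PATH"). An empty list means the loader has no chance of finding the library through that variable.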

⇒ But… this gives me exactly the same error when trying to execute the import tensorflow statement in a simple python script :-(. So it now rather seems that tensorflow depends on a specific version of CUDA and cuDNN.

In the process I also tried to install the tensorflow package instead of tensorflow-gpu, only to get another error when trying to import the module:

$ nv_call_python -i
Python 3.6.3 (v3.6.3:2c5fed8, Oct  3 2017, 18:11:49) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2018-12-28 20:33:47.357524: F tensorflow/python/lib/core/bfloat16.cc:675] Check failed: PyBfloat16_Type.tp_base != nullptr

So it seems that training our little MNIST network will have to wait a little longer: I first have to clarify how to set up tensorflow properly with CUDA 10.0 and cuDNN 7.4.2, which are the versions I currently have installed.

⇒ Actually I found this tutorial on how to build tensorflow from sources on Windows: maybe I won't have a choice and will have to go this way?

And of course, I have now just found this page listing the valid combinations of TensorFlow/CUDA/cuDNN versions: version 1.12.0 is meant for CUDA v9 + cuDNN v7. Maybe I should simply install CUDA v9 then, instead of going into building from sources…
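
That kind of mismatch is easy to check up front once the table is known. The sketch below is only an illustration: it holds the single entry quoted above (1.12.0 with CUDA 9 and cuDNN 7), compares major versions only, which is a simplification of mine, and uses hypothetical names (TESTED_CONFIGS, is_supported):

```python
# TensorFlow release -> (CUDA major, cuDNN major).
# Only the 1.12.0 entry comes from the compatibility page mentioned above;
# any other release would have to be filled in from that page.
TESTED_CONFIGS = {
    "1.12.0": (9, 7),
}


def is_supported(tf_version, cuda_version, cudnn_version):
    """Compare installed CUDA/cuDNN major versions with the tested ones.

    Versions are strings such as "10.0" or "7.4.2". Returns True or False,
    or None when the TensorFlow release is not in the table.
    """
    required = TESTED_CONFIGS.get(tf_version)
    if required is None:
        return None
    cuda_major = int(cuda_version.split(".")[0])
    cudnn_major = int(cudnn_version.split(".")[0])
    return (cuda_major, cudnn_major) == required


# My current setup: CUDA 10.0 + cuDNN 7.4.2 with tensorflow-gpu 1.12.0.
print(is_supported("1.12.0", "10.0", "7.4.2"))
```

For my setup the CUDA major version differs from the tested one, which is consistent with the import failure seen above.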