IX - Neural Networks: Learning

9.1 - Cost Function

  • We use L to denote the total number of layers.
  • We use \(s_l\) = number of units in layer l (not counting the bias unit).
  • We also use K as the number of output units.
  • With binary classification we would have only one output unit. (K = 1).
  • With multiclass classification, we would have \(K \ge 3\).
  • The cost function we use here is a generalization of the Logistic regression cost function (with \(h_\Theta(x) \in \mathbb{R}^K\) and \((h_\Theta(x))_i = i^{\text{th}}~\text{output}\)):

\[J(\Theta) = - \frac{1}{m} \left[ \sum\limits_{i=1}^m \sum\limits_{k=1}^K y_k^{(i)}~log((h_\Theta(x^{(i)}))_k) + (1 -y_k^{(i)})~log(1 - (h_\Theta(x^{(i)}))_k) \right] + \frac{\lambda}{2m} \sum\limits_{l=1}^{L-1} \sum\limits_{i=1}^{s_l} \sum\limits_{j=1}^{s_{l+1}} (\Theta_{ji}^{(l)})^2 \]

9.2 - Backpropagation Algorithm

  • For each node, we are going to compute \(\delta_j^{(l)}\) representing the “error” of node j in layer l.
  • So, from a simple 4 layers network, we could compute the error on the output layer as: \(\delta_j^{(4)} = a_j^{(4)} - y_j = (h_\Theta(x))_j - y_j\).
  • Then we can continue with:

\[\delta^{(3)} = (\Theta^{(3)})^T\delta^{(4)}.*g'(z^{(3)}) \text{ where } g'(z^{(3)}) = a^{(3)} .* (1-a^{(3)})\] \[\delta^{(2)} = (\Theta^{(2)})^T\delta^{(3)}.*g'(z^{(2)}) \text{ where } g'(z^{(2)}) = a^{(2)} .* (1-a^{(2)})\]

  • Finally, it can be proved that, if we ignore the regularization term then we get:

\[\frac{\partial}{\partial\Theta_{ij}^{(l)}} J(\Theta) = a_j^{(l)}\delta_i^{(l+1)}\]

  • Backpropagation algorithm:
    1. We set \(\Delta_{ij}^{(l)} = 0\) for all i,j,l.
    2. For i=1 to m:
      1. Set \(a^{(1)} = x^{(i)}\)
      2. Perform forward propagation to compute \(a^{(l)}\) for l=2,3,…,L
      3. Using \(y^{(i)}\), compute \(\delta^{(L)} = a^{(L)} - y^{(i)}\)
      4. Then compute \(\delta^{(L-1)},\delta^{(L-2)},\dots,\delta^{(2)}\)
      5. Set \(\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)}\delta_i^{(l+1)}\). Note that a vectorized implementation of this would be: \(\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}~(a^{(l)})^T\).
    3. Then we compute the following term \(D_{ij}^{(l)} := \begin{cases} \frac{1}{m}\Delta_{ij}^{(l)} + \frac{\lambda}{m} \Theta_{ij}^{(l)} & \text{ if } j \neq 0 \\ \frac{1}{m}\Delta_{ij}^{(l)} & \text{ if } j = 0 \end{cases}\)

⇒ Finally, it can be prooved that \(\frac{\partial}{\partial\Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)}\).

9.3 - Backpropagation Intuition

  • If we select a single example i, then, formally: \(\delta_j^{(j)} = \frac{\partial}{\partial z_j^{(l)}} cost(i)\).

9.4 - Implementation Note: Unrolling parameters

  • for advanced optimization, we would need to write a function such as:
    function [jVal, gradient] = costFunction(theta)
    ...
    optTheta = fminunc(@costFunction,initialTheta, options);
  • But note that in the previous code, gradient, theta and initialTheta would be assumed to be vectors from \(\mathbb{R}^{n+1}\) by the optimization routine fminunc.
  • Here we have matrices instead, so we need to unroll those matrices into vectors. To do so in Octave, we can write:
    thetaVec = [ Theta1(:); Theta2(:); Theta3(:) ];
    DVec = [ D1(:); D2(:); D3(:)];
  • If we want to do the opposite, we use:
    Theta1 = reshape(thetaVec(1:110),10,11);
    Theta2 = reshape(thetaVec(111:220),10,11);
    Theta3 = reshape(thetaVec(221:231),1,11);

9.5 - Gradient checking

  • We can compute an approximation of the derivative of \(J(\Theta)\) with (taking an example here with \(\Theta \in \mathbb{R}\): \(\frac{d}{d\Theta} \approx \frac{J(\Theta+\epsilon) - J(\Theta-\epsilon)}{2\epsilon}\) (this is called the 2 sided difference, which gives a slightly more accurate approximation than the 1 sided difference).
  • Usually we can use \(\epsilon \approx 10^{-4}\).
  • So in Octave, we would write:
    gradApprox = (J(theta + EPSILON) - J(theta - EPSILON))/(2*EPSILON)
  • In the case \(\theta\) is a vector : \(\theta = [\theta_1,\theta_2,\theta_3,\dots,\theta_n]\). Then we can compute the gradient approximation as:

\[\frac{\partial}{\partial \theta_i} J(\theta) \approx \frac{J(\theta_1,\dots,\theta_i+\epsilon,\dots,\theta_n) - J(\theta_1,\cdots,\theta_i-\epsilon,\cdots,\theta_n)}{2\epsilon}\]

  • So in Octave, we would write:
    for i=1:n,
      thetaPlus = theta;
      thetaPlus(i) = thetaPlus(i) + EPSILON;
      thetaMinus = theta;
      thetaMinus(i) = thetaMinus(i) - EPSILON;
      gradApprox(i) = (J(thetaPlus) - J(thetaMinus))/(2*EPSILON);
    end;
  • Then we need to check that \(gradApprox \approx DVec\).
  • Note that we should ensure that gradient checking is disabled when we want to run the backprop code for learning (otherwise, its going to be extremely slow).

9.6 - Random Initialization

  • We need to initialize the initialTheta parameters. We can simply use \(initialTheta = zeros(n,1);\) for logistic regression. But this doesn't work for neural networks (otherwise the activation values will be the same and so will be the delta values).
  • Random initialization for the neural network is used to ensure symmetry breaking. Concretely, we initialize each value \(\Theta_{ij}^{(l)}\) to a random value in the range \([-\epsilon, \epsilon]\); For instance in Octave:
    Theta1 = rand(10,11)*(2*INIT_EPSILON) - INIT_EPSILON;
    Theta2 = rand(1,11)*(2*INIT_EPSILON) - INIT_EPSILON;
  • Note that in Octave rand(n,m) will create an nxm matrix will all elements set to random values between 0 and 1.

9.7 - Putting It Together

  • When training a neural network, we need to choose:
    • Number of input units : = number of features available.
    • Number of output units: = number of classes.
  • A reasonnable default is then to choose only 1 hidden layer.
  • Or if we use multiple hidden layers, then a reasonnable choice is to have the same number of units in each hidden layer. (usually, the mode units, the better).

9.8 - Autonomous Driving

  • This is just an example on autonomous driving with a neural network.