IX - Neural Networks: Learning
9.1 - Cost Function
- We use L to denote the total number of layers.
- We use \(s_l\) = number of units in layer l (not counting the bias unit).
- We also use K as the number of output units.
- With binary classification we would have only one output unit. (K = 1).
- With multiclass classification, we would have \(K \ge 3\).
- The cost function we use here is a generalization of the Logistic regression cost function (with \(h_\Theta(x) \in \mathbb{R}^K\) and \((h_\Theta(x))_i = i^{\text{th}}~\text{output}\)):
\[J(\Theta) = - \frac{1}{m} \left[ \sum\limits_{i=1}^m \sum\limits_{k=1}^K y_k^{(i)}~log((h_\Theta(x^{(i)}))_k) + (1 -y_k^{(i)})~log(1 - (h_\Theta(x^{(i)}))_k) \right] + \frac{\lambda}{2m} \sum\limits_{l=1}^{L-1} \sum\limits_{i=1}^{s_l} \sum\limits_{j=1}^{s_{l+1}} (\Theta_{ji}^{(l)})^2 \]
9.2 - Backpropagation Algorithm
- For each node, we are going to compute \(\delta_j^{(l)}\) representing the “error” of node j in layer l.
- So, from a simple 4 layers network, we could compute the error on the output layer as: \(\delta_j^{(4)} = a_j^{(4)} - y_j = (h_\Theta(x))_j - y_j\).
- Then we can continue with:
\[\delta^{(3)} = (\Theta^{(3)})^T\delta^{(4)}.*g'(z^{(3)}) \text{ where } g'(z^{(3)}) = a^{(3)} .* (1-a^{(3)})\] \[\delta^{(2)} = (\Theta^{(2)})^T\delta^{(3)}.*g'(z^{(2)}) \text{ where } g'(z^{(2)}) = a^{(2)} .* (1-a^{(2)})\]
- Finally, it can be proved that, if we ignore the regularization term then we get:
\[\frac{\partial}{\partial\Theta_{ij}^{(l)}} J(\Theta) = a_j^{(l)}\delta_i^{(l+1)}\]
- Backpropagation algorithm:
- We set \(\Delta_{ij}^{(l)} = 0\) for all i,j,l.
- For i=1 to m:
- Set \(a^{(1)} = x^{(i)}\)
- Perform forward propagation to compute \(a^{(l)}\) for l=2,3,…,L
- Using \(y^{(i)}\), compute \(\delta^{(L)} = a^{(L)} - y^{(i)}\)
- Then compute \(\delta^{(L-1)},\delta^{(L-2)},\dots,\delta^{(2)}\)
- Set \(\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)}\delta_i^{(l+1)}\). Note that a vectorized implementation of this would be: \(\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}~(a^{(l)})^T\).
- Then we compute the following term \(D_{ij}^{(l)} := \begin{cases} \frac{1}{m}\Delta_{ij}^{(l)} + \frac{\lambda}{m} \Theta_{ij}^{(l)} & \text{ if } j \neq 0 \\ \frac{1}{m}\Delta_{ij}^{(l)} & \text{ if } j = 0 \end{cases}\)
⇒ Finally, it can be prooved that \(\frac{\partial}{\partial\Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)}\).
9.3 - Backpropagation Intuition
- If we select a single example i, then, formally: \(\delta_j^{(j)} = \frac{\partial}{\partial z_j^{(l)}} cost(i)\).
9.4 - Implementation Note: Unrolling parameters
- for advanced optimization, we would need to write a function such as:
function [jVal, gradient] = costFunction(theta) ... optTheta = fminunc(@costFunction,initialTheta, options);
- But note that in the previous code, gradient, theta and initialTheta would be assumed to be vectors from \(\mathbb{R}^{n+1}\) by the optimization routine fminunc.
- Here we have matrices instead, so we need to unroll those matrices into vectors. To do so in Octave, we can write:
thetaVec = [ Theta1(:); Theta2(:); Theta3(:) ]; DVec = [ D1(:); D2(:); D3(:)];
- If we want to do the opposite, we use:
Theta1 = reshape(thetaVec(1:110),10,11); Theta2 = reshape(thetaVec(111:220),10,11); Theta3 = reshape(thetaVec(221:231),1,11);
9.5 - Gradient checking
- We can compute an approximation of the derivative of \(J(\Theta)\) with (taking an example here with \(\Theta \in \mathbb{R}\): \(\frac{d}{d\Theta} \approx \frac{J(\Theta+\epsilon) - J(\Theta-\epsilon)}{2\epsilon}\) (this is called the 2 sided difference, which gives a slightly more accurate approximation than the 1 sided difference).
- Usually we can use \(\epsilon \approx 10^{-4}\).
- So in Octave, we would write:
gradApprox = (J(theta + EPSILON) - J(theta - EPSILON))/(2*EPSILON)
- In the case \(\theta\) is a vector : \(\theta = [\theta_1,\theta_2,\theta_3,\dots,\theta_n]\). Then we can compute the gradient approximation as:
\[\frac{\partial}{\partial \theta_i} J(\theta) \approx \frac{J(\theta_1,\dots,\theta_i+\epsilon,\dots,\theta_n) - J(\theta_1,\cdots,\theta_i-\epsilon,\cdots,\theta_n)}{2\epsilon}\]
- So in Octave, we would write:
for i=1:n, thetaPlus = theta; thetaPlus(i) = thetaPlus(i) + EPSILON; thetaMinus = theta; thetaMinus(i) = thetaMinus(i) - EPSILON; gradApprox(i) = (J(thetaPlus) - J(thetaMinus))/(2*EPSILON); end;
- Then we need to check that \(gradApprox \approx DVec\).
- Note that we should ensure that gradient checking is disabled when we want to run the backprop code for learning (otherwise, its going to be extremely slow).
9.6 - Random Initialization
- We need to initialize the initialTheta parameters. We can simply use \(initialTheta = zeros(n,1);\) for logistic regression. But this doesn't work for neural networks (otherwise the activation values will be the same and so will be the delta values).
- Random initialization for the neural network is used to ensure symmetry breaking. Concretely, we initialize each value \(\Theta_{ij}^{(l)}\) to a random value in the range \([-\epsilon, \epsilon]\); For instance in Octave:
Theta1 = rand(10,11)*(2*INIT_EPSILON) - INIT_EPSILON; Theta2 = rand(1,11)*(2*INIT_EPSILON) - INIT_EPSILON;
- Note that in Octave rand(n,m) will create an nxm matrix will all elements set to random values between 0 and 1.
9.7 - Putting It Together
- When training a neural network, we need to choose:
- Number of input units : = number of features available.
- Number of output units: = number of classes.
- A reasonnable default is then to choose only 1 hidden layer.
- Or if we use multiple hidden layers, then a reasonnable choice is to have the same number of units in each hidden layer. (usually, the mode units, the better).
9.8 - Autonomous Driving
- This is just an example on autonomous driving with a neural network.