===== IX - Neural Networks: Learning =====
==== 9.1 - Cost Function ====
* We use **L** to denote the total number of layers.
* We use \(s_l\) = number of units in layer l (not counting the bias unit).
* We also use **K** as the number of output units.
* With binary classification we would have only one output unit. (K = 1).
* With multiclass classification, we would have \(K \ge 3\).
* The cost function we use here is a generalization of the Logistic regression cost function (with \(h_\Theta(x) \in \mathbb{R}^K\) and \((h_\Theta(x))_i = i^{\text{th}}~\text{output}\)):
\[J(\Theta) = - \frac{1}{m} \left[ \sum\limits_{i=1}^m \sum\limits_{k=1}^K y_k^{(i)}~log((h_\Theta(x^{(i)}))_k) + (1 -y_k^{(i)})~log(1 - (h_\Theta(x^{(i)}))_k) \right] + \frac{\lambda}{2m} \sum\limits_{l=1}^{L-1} \sum\limits_{i=1}^{s_l} \sum\limits_{j=1}^{s_{l+1}} (\Theta_{ji}^{(l)})^2 \]
==== 9.2 - Backpropagation Algorithm ====
* For each node, we are going to compute \(\delta_j^{(l)}\) representing the "error" of node j in layer l.
* So, from a simple 4 layers network, we could compute the error on the output layer as: \(\delta_j^{(4)} = a_j^{(4)} - y_j = (h_\Theta(x))_j - y_j\).
* Then we can continue with:
\[\delta^{(3)} = (\Theta^{(3)})^T\delta^{(4)}.*g'(z^{(3)}) \text{ where } g'(z^{(3)}) = a^{(3)} .* (1-a^{(3)})\]
\[\delta^{(2)} = (\Theta^{(2)})^T\delta^{(3)}.*g'(z^{(2)}) \text{ where } g'(z^{(2)}) = a^{(2)} .* (1-a^{(2)})\]
* Finally, it can be proved that, if we ignore the regularization term then we get:
\[\frac{\partial}{\partial\Theta_{ij}^{(l)}} J(\Theta) = a_j^{(l)}\delta_i^{(l+1)}\]
* Backpropagation algorithm:
- We set \(\Delta_{ij}^{(l)} = 0\) for all i,j,l.
- For i=1 to m:
- Set \(a^{(1)} = x^{(i)}\)
- Perform forward propagation to compute \(a^{(l)}\) for l=2,3,...,L
- Using \(y^{(i)}\), compute \(\delta^{(L)} = a^{(L)} - y^{(i)}\)
- Then compute \(\delta^{(L-1)},\delta^{(L-2)},\dots,\delta^{(2)}\)
- Set \(\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)}\delta_i^{(l+1)}\). Note that a vectorized implementation of this would be: \(\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}~(a^{(l)})^T\).
- Then we compute the following term \(D_{ij}^{(l)} := \begin{cases} \frac{1}{m}\Delta_{ij}^{(l)} + \frac{\lambda}{m} \Theta_{ij}^{(l)} & \text{ if } j \neq 0 \\ \frac{1}{m}\Delta_{ij}^{(l)} & \text{ if } j = 0 \end{cases}\)
=> Finally, it can be prooved that \(\frac{\partial}{\partial\Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)}\).
==== 9.3 - Backpropagation Intuition ====
* If we select a single example i, then, formally: \(\delta_j^{(j)} = \frac{\partial}{\partial z_j^{(l)}} cost(i)\).
==== 9.4 - Implementation Note: Unrolling parameters ====
* for advanced optimization, we would need to write a function such as:function [jVal, gradient] = costFunction(theta)
...
optTheta = fminunc(@costFunction,initialTheta, options);
* But note that in the previous code, gradient, theta and initialTheta would be assumed to be vectors from \(\mathbb{R}^{n+1}\) by the optimization routine fminunc.
* Here we have matrices instead, so we need to **unroll** those matrices into vectors. To do so in Octave, we can write:
thetaVec = [ Theta1(:); Theta2(:); Theta3(:) ];
DVec = [ D1(:); D2(:); D3(:)];
* If we want to do the opposite, we use:
Theta1 = reshape(thetaVec(1:110),10,11);
Theta2 = reshape(thetaVec(111:220),10,11);
Theta3 = reshape(thetaVec(221:231),1,11);
==== 9.5 - Gradient checking ====
* We can compute an approximation of the derivative of \(J(\Theta)\) with (taking an example here with \(\Theta \in \mathbb{R}\): \(\frac{d}{d\Theta} \approx \frac{J(\Theta+\epsilon) - J(\Theta-\epsilon)}{2\epsilon}\) (this is called the 2 sided difference, which gives a slightly more accurate approximation than the 1 sided difference).
* Usually we can use \(\epsilon \approx 10^{-4}\).
* So in Octave, we would write: gradApprox = (J(theta + EPSILON) - J(theta - EPSILON))/(2*EPSILON)
* In the case \(\theta\) is a vector : \(\theta = [\theta_1,\theta_2,\theta_3,\dots,\theta_n]\). Then we can compute the gradient approximation as:
\[\frac{\partial}{\partial \theta_i} J(\theta) \approx \frac{J(\theta_1,\dots,\theta_i+\epsilon,\dots,\theta_n) - J(\theta_1,\cdots,\theta_i-\epsilon,\cdots,\theta_n)}{2\epsilon}\]
* So in Octave, we would write: for i=1:n,
thetaPlus = theta;
thetaPlus(i) = thetaPlus(i) + EPSILON;
thetaMinus = theta;
thetaMinus(i) = thetaMinus(i) - EPSILON;
gradApprox(i) = (J(thetaPlus) - J(thetaMinus))/(2*EPSILON);
end;
* Then we need to check that \(gradApprox \approx DVec\).
* Note that we should **ensure** that gradient checking is **disabled** when we want to run the backprop code for learning (otherwise, its going to be extremely slow).
==== 9.6 - Random Initialization ====
* We need to initialize the **initialTheta** parameters. We can simply use \(initialTheta = zeros(n,1);\) for logistic regression. But this doesn't work for neural networks (otherwise the activation values will be the same and so will be the delta values).
* Random initialization for the neural network is used to ensure **symmetry breaking**. Concretely, we initialize each value \(\Theta_{ij}^{(l)}\) to a random value in the range \([-\epsilon, \epsilon]\); For instance in Octave:Theta1 = rand(10,11)*(2*INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(1,11)*(2*INIT_EPSILON) - INIT_EPSILON;
* Note that in Octave **rand(n,m)** will create an nxm matrix will all elements set to random values between 0 and 1.
==== 9.7 - Putting It Together ====
* When training a neural network, we need to choose:
* Number of input units : = number of features available.
* Number of output units: = number of classes.
* A reasonnable default is then to choose only 1 hidden layer.
* Or if we use multiple hidden layers, then a reasonnable choice is to have the same number of units in each hidden layer. (usually, the mode units, the better).
==== 9.8 - Autonomous Driving ====
* This is just an example on autonomous driving with a neural network.