===== IX - Neural Networks: Learning =====

==== 9.1 - Cost Function ====

  * We use **L** to denote the total number of layers.
  * We use \(s_l\) = number of units in layer l (not counting the bias unit).
  * We also use **K** as the number of output units:
    * with binary classification we would have only one output unit (K = 1);
    * with multiclass classification, we would have \(K \ge 3\).
  * The cost function we use here is a generalization of the logistic regression cost function (with \(h_\Theta(x) \in \mathbb{R}^K\) and \((h_\Theta(x))_k = k^{\text{th}}~\text{output}\)):

\[J(\Theta) = - \frac{1}{m} \left[ \sum\limits_{i=1}^m \sum\limits_{k=1}^K y_k^{(i)}\log((h_\Theta(x^{(i)}))_k) + (1 - y_k^{(i)})\log(1 - (h_\Theta(x^{(i)}))_k) \right] + \frac{\lambda}{2m} \sum\limits_{l=1}^{L-1} \sum\limits_{i=1}^{s_l} \sum\limits_{j=1}^{s_{l+1}} (\Theta_{ji}^{(l)})^2\]

==== 9.2 - Backpropagation Algorithm ====

  * For each node, we are going to compute \(\delta_j^{(l)}\), representing the "error" of node j in layer l.
  * So, for a simple 4-layer network, we can compute the error on the output layer as: \(\delta_j^{(4)} = a_j^{(4)} - y_j = (h_\Theta(x))_j - y_j\).
  * Then we can continue backwards with:

\[\delta^{(3)} = (\Theta^{(3)})^T\delta^{(4)} .* g'(z^{(3)}) \text{ where } g'(z^{(3)}) = a^{(3)} .* (1-a^{(3)})\]
\[\delta^{(2)} = (\Theta^{(2)})^T\delta^{(3)} .* g'(z^{(2)}) \text{ where } g'(z^{(2)}) = a^{(2)} .* (1-a^{(2)})\]

  * There is no \(\delta^{(1)}\) term, since the input layer holds the observed features and has no error associated with it.
  * Finally, it can be proved that, if we ignore the regularization term, we get: \[\frac{\partial}{\partial\Theta_{ij}^{(l)}} J(\Theta) = a_j^{(l)}\delta_i^{(l+1)}\]
  * Backpropagation algorithm (a minimal Octave sketch of these steps follows this section):
    - Set \(\Delta_{ij}^{(l)} = 0\) for all i, j, l.
    - For i = 1 to m:
      - Set \(a^{(1)} = x^{(i)}\)
      - Perform forward propagation to compute \(a^{(l)}\) for \(l = 2, 3, \dots, L\)
      - Using \(y^{(i)}\), compute \(\delta^{(L)} = a^{(L)} - y^{(i)}\)
      - Then compute \(\delta^{(L-1)}, \delta^{(L-2)}, \dots, \delta^{(2)}\)
      - Set \(\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)}\delta_i^{(l+1)}\). Note that a vectorized implementation of this would be: \(\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}~(a^{(l)})^T\).
    - Then compute \(D_{ij}^{(l)} := \begin{cases} \frac{1}{m}\Delta_{ij}^{(l)} + \frac{\lambda}{m} \Theta_{ij}^{(l)} & \text{ if } j \neq 0 \\ \frac{1}{m}\Delta_{ij}^{(l)} & \text{ if } j = 0 \end{cases}\)
  * Finally, it can be proved that \(\frac{\partial}{\partial\Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)}\).
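To make these steps concrete, here is a minimal sketch in Octave for a 3-layer network (one hidden layer, so only \(\delta^{(3)}\) and \(\delta^{(2)}\) are needed). The function name **nnGradient**, the one-hot label matrix **Y**, and the shapes in the comments are illustrative assumptions, not code from the lecture:

<code octave>
% Minimal backprop sketch for a 3-layer network. Assumed shapes:
%   Theta1: s2 x (s1+1),  Theta2: K x (s2+1)
%   X: m x s1 (one example per row),  Y: m x K (one-hot labels)
function [D1, D2] = nnGradient(Theta1, Theta2, X, Y, lambda)
  m = size(X, 1);
  Delta1 = zeros(size(Theta1));
  Delta2 = zeros(size(Theta2));
  for i = 1:m
    % Forward propagation; a1 and a2 get a bias unit prepended.
    a1 = [1; X(i, :)'];
    z2 = Theta1 * a1;
    a2 = [1; sigmoid(z2)];
    a3 = sigmoid(Theta2 * a2);                 % = h_Theta(x^(i))
    % Backward pass: output error, then hidden-layer error.
    delta3 = a3 - Y(i, :)';
    d2 = Theta2' * delta3;                     % includes a bias row...
    delta2 = d2(2:end) .* (a2(2:end) .* (1 - a2(2:end)));  % ...dropped here
    % Accumulate Delta^(l) += delta^(l+1) * (a^(l))'.
    Delta2 = Delta2 + delta3 * a2';
    Delta1 = Delta1 + delta2 * a1';
  end
  % Average, then regularize every column except the bias column (j = 0).
  D1 = Delta1 / m;
  D2 = Delta2 / m;
  D1(:, 2:end) = D1(:, 2:end) + (lambda / m) * Theta1(:, 2:end);
  D2(:, 2:end) = D2(:, 2:end) + (lambda / m) * Theta2(:, 2:end);
end

function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));
end
</code>

Note that ''a2(2:end) .* (1 - a2(2:end))'' is exactly \(g'(z^{(2)})\) from the formulas above, and that dropping the first entry of ''Theta2' * delta3'' removes the bias unit, which has no error term of its own.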
==== 9.3 - Backpropagation Intuition ====

  * If we select a single example i, then, formally: \(\delta_j^{(l)} = \frac{\partial}{\partial z_j^{(l)}} \text{cost}(i)\), where \(\text{cost}(i)\) is the (unregularized) cost contribution of example i. In other words, the \(\delta\) terms measure how much the cost for example i would change if we perturbed the weighted input \(z_j^{(l)}\).

==== 9.4 - Implementation Note: Unrolling Parameters ====

  * For advanced optimization, we would need to write a function such as:

<code octave>
function [jVal, gradient] = costFunction(theta)
  ...
optTheta = fminunc(@costFunction, initialTheta, options);
</code>

  * But note that in the previous code, the optimization routine fminunc assumes that gradient, theta and initialTheta are vectors in \(\mathbb{R}^{n+1}\).
  * Here we have matrices instead, so we need to **unroll** those matrices into vectors. To do so in Octave, we can write:

<code octave>
thetaVec = [ Theta1(:); Theta2(:); Theta3(:) ];
DVec = [ D1(:); D2(:); D3(:) ];
</code>

  * If we want to do the opposite (here assuming Theta1 and Theta2 are 10x11 matrices and Theta3 is 1x11), we use:

<code octave>
Theta1 = reshape(thetaVec(1:110), 10, 11);
Theta2 = reshape(thetaVec(111:220), 10, 11);
Theta3 = reshape(thetaVec(221:231), 1, 11);
</code>

==== 9.5 - Gradient Checking ====

  * We can compute an approximation of the derivative of \(J(\Theta)\) (taking an example here with \(\Theta \in \mathbb{R}\)) as: \(\frac{d}{d\Theta} J(\Theta) \approx \frac{J(\Theta+\epsilon) - J(\Theta-\epsilon)}{2\epsilon}\). This is called the two-sided difference, and it gives a slightly more accurate approximation than the one-sided difference.
  * Usually we can use \(\epsilon \approx 10^{-4}\).
  * So in Octave, we would write:

<code octave>
gradApprox = (J(theta + EPSILON) - J(theta - EPSILON)) / (2*EPSILON);
</code>

  * In the case where \(\theta\) is a vector, \(\theta = [\theta_1, \theta_2, \theta_3, \dots, \theta_n]\), we can compute the gradient approximation component by component:

\[\frac{\partial}{\partial \theta_i} J(\theta) \approx \frac{J(\theta_1,\dots,\theta_i+\epsilon,\dots,\theta_n) - J(\theta_1,\dots,\theta_i-\epsilon,\dots,\theta_n)}{2\epsilon}\]

  * So in Octave, we would write:

<code octave>
for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) = thetaPlus(i) + EPSILON;
  thetaMinus = theta;
  thetaMinus(i) = thetaMinus(i) - EPSILON;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2*EPSILON);
end;
</code>

  * Then we need to check that \(gradApprox \approx DVec\).
  * Note that we should **ensure** that gradient checking is **disabled** when we run the backprop code for learning (otherwise, it is going to be extremely slow).

==== 9.6 - Random Initialization ====

  * We need to initialize the **initialTheta** parameters. For logistic regression we can simply use ''initialTheta = zeros(n,1);'', but this doesn't work for neural networks: all units in a hidden layer would compute the same activation values, the delta values would also be identical, and the units would stay identical after every update.
  * Random initialization for the neural network is therefore used to ensure **symmetry breaking**. Concretely, we initialize each value \(\Theta_{ij}^{(l)}\) to a random value in the range \([-\epsilon, \epsilon]\). For instance, in Octave:

<code octave>
Theta1 = rand(10,11) * (2*INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(1,11) * (2*INIT_EPSILON) - INIT_EPSILON;
</code>

  * Note that in Octave **rand(n,m)** will create an n×m matrix with all elements set to random values between 0 and 1.

==== 9.7 - Putting It Together ====

  * When training a neural network, we need to choose:
    * the number of input units: the number of features available;
    * the number of output units: the number of classes.
  * A reasonable default is then to choose only 1 hidden layer.
  * If we use multiple hidden layers, a reasonable choice is to have the same number of units in each hidden layer (usually, the more units, the better). An end-to-end Octave sketch tying these pieces together is given at the end of these notes.

==== 9.8 - Autonomous Driving ====

  * This is just an example of autonomous driving with a neural network.
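To tie sections 9.4 to 9.7 together, here is a minimal end-to-end training sketch, assuming a single hidden layer and a hypothetical **nnCostFunction** that performs forward propagation and backprop (9.1-9.2) and returns the cost together with the unrolled gradient vector. The layer sizes, helper name and option values are illustrative assumptions, not course-provided code:

<code octave>
% Assumed architecture (illustrative values).
inputSize  = 400;   % s1: number of input features
hiddenSize = 25;    % s2: number of hidden units
numLabels  = 10;    % K: number of output units
lambda     = 1;
INIT_EPSILON = 0.12;

% Symmetry breaking (9.6): random values in [-INIT_EPSILON, INIT_EPSILON].
Theta1 = rand(hiddenSize, inputSize + 1) * (2*INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(numLabels, hiddenSize + 1) * (2*INIT_EPSILON) - INIT_EPSILON;

% Unroll (9.4) so that fminunc sees a single parameter vector.
initialTheta = [Theta1(:); Theta2(:)];

% nnCostFunction (hypothetical) returns [jVal, gradientVec];
% X and y are the training data.
costFunction = @(t) nnCostFunction(t, inputSize, hiddenSize, ...
                                   numLabels, X, y, lambda);

options  = optimset('GradObj', 'on', 'MaxIter', 100);
optTheta = fminunc(costFunction, initialTheta, options);

% Reshape the optimized vector back into the weight matrices (9.4).
Theta1 = reshape(optTheta(1:hiddenSize*(inputSize+1)), ...
                 hiddenSize, inputSize+1);
Theta2 = reshape(optTheta(hiddenSize*(inputSize+1)+1:end), ...
                 numLabels, hiddenSize+1);
</code>

Before the full training run, the gradient returned by nnCostFunction should be verified against gradApprox on a small network (9.5), and gradient checking should then be disabled.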