## IX - Neural Networks: Learning

### 9.1 - Cost Function

• We use L to denote the total number of layers.
• We use $s_l$ = number of units in layer l (not counting the bias unit).
• We also use K as the number of output units.
• With binary classification we would have only one output unit. (K = 1).
• With multiclass classification, we would have $K \ge 3$.
• The cost function we use here is a generalization of the Logistic regression cost function (with $h_\Theta(x) \in \mathbb{R}^K$ and $(h_\Theta(x))_i = i^{\text{th}}~\text{output}$):

$J(\Theta) = - \frac{1}{m} \left[ \sum\limits_{i=1}^m \sum\limits_{k=1}^K y_k^{(i)}~log((h_\Theta(x^{(i)}))_k) + (1 -y_k^{(i)})~log(1 - (h_\Theta(x^{(i)}))_k) \right] + \frac{\lambda}{2m} \sum\limits_{l=1}^{L-1} \sum\limits_{i=1}^{s_l} \sum\limits_{j=1}^{s_{l+1}} (\Theta_{ji}^{(l)})^2$

### 9.2 - Backpropagation Algorithm

• For each node, we are going to compute $\delta_j^{(l)}$ representing the “error” of node j in layer l.
• So, from a simple 4 layers network, we could compute the error on the output layer as: $\delta_j^{(4)} = a_j^{(4)} - y_j = (h_\Theta(x))_j - y_j$.
• Then we can continue with:

$\delta^{(3)} = (\Theta^{(3)})^T\delta^{(4)}.*g'(z^{(3)}) \text{ where } g'(z^{(3)}) = a^{(3)} .* (1-a^{(3)})$ $\delta^{(2)} = (\Theta^{(2)})^T\delta^{(3)}.*g'(z^{(2)}) \text{ where } g'(z^{(2)}) = a^{(2)} .* (1-a^{(2)})$

• Finally, it can be proved that, if we ignore the regularization term then we get:

$\frac{\partial}{\partial\Theta_{ij}^{(l)}} J(\Theta) = a_j^{(l)}\delta_i^{(l+1)}$

• Backpropagation algorithm:
1. We set $\Delta_{ij}^{(l)} = 0$ for all i,j,l.
2. For i=1 to m:
1. Set $a^{(1)} = x^{(i)}$
2. Perform forward propagation to compute $a^{(l)}$ for l=2,3,…,L
3. Using $y^{(i)}$, compute $\delta^{(L)} = a^{(L)} - y^{(i)}$
4. Then compute $\delta^{(L-1)},\delta^{(L-2)},\dots,\delta^{(2)}$
5. Set $\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)}\delta_i^{(l+1)}$. Note that a vectorized implementation of this would be: $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}~(a^{(l)})^T$.
3. Then we compute the following term $D_{ij}^{(l)} := \begin{cases} \frac{1}{m}\Delta_{ij}^{(l)} + \frac{\lambda}{m} \Theta_{ij}^{(l)} & \text{ if } j \neq 0 \\ \frac{1}{m}\Delta_{ij}^{(l)} & \text{ if } j = 0 \end{cases}$

⇒ Finally, it can be prooved that $\frac{\partial}{\partial\Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)}$.

### 9.3 - Backpropagation Intuition

• If we select a single example i, then, formally: $\delta_j^{(j)} = \frac{\partial}{\partial z_j^{(l)}} cost(i)$.

### 9.4 - Implementation Note: Unrolling parameters

• for advanced optimization, we would need to write a function such as:
function [jVal, gradient] = costFunction(theta)
...
optTheta = fminunc(@costFunction,initialTheta, options);
• But note that in the previous code, gradient, theta and initialTheta would be assumed to be vectors from $\mathbb{R}^{n+1}$ by the optimization routine fminunc.
• Here we have matrices instead, so we need to unroll those matrices into vectors. To do so in Octave, we can write:
thetaVec = [ Theta1(:); Theta2(:); Theta3(:) ];
DVec = [ D1(:); D2(:); D3(:)];
• If we want to do the opposite, we use:
Theta1 = reshape(thetaVec(1:110),10,11);
Theta2 = reshape(thetaVec(111:220),10,11);
Theta3 = reshape(thetaVec(221:231),1,11);

• We can compute an approximation of the derivative of $J(\Theta)$ with (taking an example here with $\Theta \in \mathbb{R}$: $\frac{d}{d\Theta} \approx \frac{J(\Theta+\epsilon) - J(\Theta-\epsilon)}{2\epsilon}$ (this is called the 2 sided difference, which gives a slightly more accurate approximation than the 1 sided difference).
• Usually we can use $\epsilon \approx 10^{-4}$.
• So in Octave, we would write:
gradApprox = (J(theta + EPSILON) - J(theta - EPSILON))/(2*EPSILON)
• In the case $\theta$ is a vector : $\theta = [\theta_1,\theta_2,\theta_3,\dots,\theta_n]$. Then we can compute the gradient approximation as:

$\frac{\partial}{\partial \theta_i} J(\theta) \approx \frac{J(\theta_1,\dots,\theta_i+\epsilon,\dots,\theta_n) - J(\theta_1,\cdots,\theta_i-\epsilon,\cdots,\theta_n)}{2\epsilon}$

• So in Octave, we would write:
for i=1:n,
thetaPlus = theta;
thetaPlus(i) = thetaPlus(i) + EPSILON;
thetaMinus = theta;
thetaMinus(i) = thetaMinus(i) - EPSILON;
end;
• Then we need to check that $gradApprox \approx DVec$.
• Note that we should ensure that gradient checking is disabled when we want to run the backprop code for learning (otherwise, its going to be extremely slow).

### 9.6 - Random Initialization

• We need to initialize the initialTheta parameters. We can simply use $initialTheta = zeros(n,1);$ for logistic regression. But this doesn't work for neural networks (otherwise the activation values will be the same and so will be the delta values).
• Random initialization for the neural network is used to ensure symmetry breaking. Concretely, we initialize each value $\Theta_{ij}^{(l)}$ to a random value in the range $[-\epsilon, \epsilon]$; For instance in Octave:
Theta1 = rand(10,11)*(2*INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(1,11)*(2*INIT_EPSILON) - INIT_EPSILON;
• Note that in Octave rand(n,m) will create an nxm matrix will all elements set to random values between 0 and 1.

### 9.7 - Putting It Together

• When training a neural network, we need to choose:
• Number of input units : = number of features available.
• Number of output units: = number of classes.
• A reasonnable default is then to choose only 1 hidden layer.
• Or if we use multiple hidden layers, then a reasonnable choice is to have the same number of units in each hidden layer. (usually, the mode units, the better).

### 9.8 - Autonomous Driving

• This is just an example on autonomous driving with a neural network.