VIII - Neural Networks: Representation
8.1 - Non-linear Hypothesis
- If we train a logistic regression algorithm with \(n\) features and include all the quadratic features \(x_ix_j\), we get approximately \(\frac{n^2}{2}\) features in total (there are \(\frac{n(n+1)}{2}\) products with \(i \le j\), which grows quadratically with \(n\)).
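As a quick sanity check on that count, here is a minimal sketch; the feature values themselves don't matter, only how many products \(x_ix_j\) with \(i \le j\) exist:

```python
from itertools import combinations_with_replacement

def num_quadratic_features(n):
    """Count the products x_i * x_j with i <= j for n raw features."""
    return sum(1 for _ in combinations_with_replacement(range(n), 2))

for n in (100, 1000):
    print(n, num_quadratic_features(n), n**2 // 2)
# n=100  -> 5050 exact vs ~5000;  n=1000 -> 500500 exact vs ~500000
```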
8.2 - Neurons and the Brain
- Origin of neural networks: try to mimic the brain.
- It was widely used in the 80s and early 90s.
- Today it is the state of the art for many applications.
- If we rewire the visual signal to the auditory cortex or the somatosensory cortex, that cortex learns to see (these are called neuro-rewiring experiments).
- We can learn to see with our tongue.
8.3 - Model Representation 1
- Neuron inputs: the dendrites.
- Neuron output: the axon.
- Neurons communicate with pulses of electricity.
⇒ Add single neuron drawing here
- Usually when drawing the neuron inputs we only draw \(x_1, x_2, x_3\), etc., not \(x_0\). \(x_0\) is called the bias unit (\(x_0 = 1\)).
- In neural networks, the parameters \(\theta\) are sometimes called “weights”.
⇒ Add neural network drawing here
- Layer 1 is called the input layer and the final layer is called the output layer.
- The layers in between are called hidden layers.
- \(a_i^{(j)}\) = “activation” of unit \(i\) in layer \(j\).
- \(\Theta^{(j)}\) = matrix of weights controlling the function mapping from layer \(j\) to layer \(j+1\).
- So, on the previous drawing we have:
\[a_1^{(2)} = g(\Theta_{10}^{(1)}x_0+\Theta_{11}^{(1)}x_1+\Theta_{12}^{(1)}x_2+\Theta_{13}^{(1)}x_3)\]
\[a_2^{(2)} = g(\Theta_{20}^{(1)}x_0+\Theta_{21}^{(1)}x_1+\Theta_{22}^{(1)}x_2+\Theta_{23}^{(1)}x_3)\]
\[a_3^{(2)} = g(\Theta_{30}^{(1)}x_0+\Theta_{31}^{(1)}x_1+\Theta_{32}^{(1)}x_2+\Theta_{33}^{(1)}x_3)\]
\[h_\Theta(x) = a_1^{(3)} = g(\Theta_{10}^{(2)}a_0^{(2)}+\Theta_{11}^{(2)}a_1^{(2)}+\Theta_{12}^{(2)}a_2^{(2)}+\Theta_{13}^{(2)}a_3^{(2)})\]
- If a network has \(s_j\) units in layer \(j\) and \(s_{j+1}\) units in layer \(j+1\), then \(\Theta^{(j)}\) will be of dimension \(s_{j+1} \times (s_j+1)\) (the \(+1\) accounts for the bias unit; in the drawing above, \(\Theta^{(1)}\) is \(3 \times 4\)). The sketch below walks through these equations numerically.
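A minimal NumPy sketch of the layer-2 and layer-3 computations for the drawn 3-3-1 network, with the sums written out as in the equations above. The weight values are arbitrary placeholders; \(g\) is the sigmoid.

```python
import numpy as np

def g(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

# Drawn network: 3 inputs, 3 hidden units, 1 output.
# Theta1 has shape s_2 x (s_1 + 1) = 3 x 4; Theta2 has shape 1 x 4.
Theta1 = 0.1 * np.ones((3, 4))           # placeholder weights
Theta2 = 0.1 * np.ones((1, 4))

x0, x1, x2, x3 = 1.0, 0.5, -1.2, 2.0     # x0 = 1 is the bias unit

# a_i^(2) = g(Theta_i0*x0 + Theta_i1*x1 + Theta_i2*x2 + Theta_i3*x3),
# one hidden activation per equation above:
a2 = [g(Theta1[i, 0]*x0 + Theta1[i, 1]*x1 + Theta1[i, 2]*x2 + Theta1[i, 3]*x3)
      for i in range(3)]

a2 = [1.0] + a2                          # prepend the bias unit a_0^(2) = 1
h = g(sum(Theta2[0, j] * a2[j] for j in range(4)))   # h_Theta(x) = a_1^(3)
print(h)
```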
8.4 - Model Representation 2
- We define \(z_1^{(2)} = \Theta_{10}^{(1)}x_0+\Theta_{11}^{(1)}x_1+\Theta_{12}^{(1)}x_2+\Theta_{13}^{(1)}x_3\), we define \(z_2^{(2)}\) and \(z_3^{(2)}\) similarly.
- So we have the vectors: \(x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}\) and we define \(z^{(2)} = \begin{bmatrix} z_1^{(2)} \\ z_2^{(2)} \\ z_3^{(2)} \end{bmatrix}\), we also define \(a^{(2)}\) similarly. Then we can use the vectorized computation: \(z^{(2)} = \Theta^{(1)}x\) and \(a^{(2)} = g(z^{(2)})\).
- Now to make things a bit easier, we can just define \(a^{(1)} = x\), so that we get \(z^{(2)} = \Theta^{(1)}a^{(1)}\).
- Also note that to compute the next layer we must also add the bias component \(a_0^{(2)} = 1\).
- Then we compute \(z^{(3)} = \Theta^{(2)}a^{(2)}\) and finally \(h_\Theta(x) = a^{(3)} = g(z^{(3)})\).
- The process of computing \(h_\Theta(x)\) is called forward propagation (see the vectorized sketch after this list).
⇒ Neural networks are learning their own features.
- The way the units are connected in a neural network is called the architecture.
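A compact vectorized sketch of forward propagation under the same assumptions (sigmoid activation, placeholder weights; `forward` is just an illustrative helper name):

```python
import numpy as np

def g(z):
    """Sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Thetas):
    """a^(1) = x; then z^(j+1) = Theta^(j) a^(j) and a^(j+1) = g(z^(j+1))."""
    a = np.asarray(x, dtype=float)
    for Theta in Thetas:
        a = np.concatenate(([1.0], a))   # add the bias unit a_0 = 1
        a = g(Theta @ a)
    return a                             # h_Theta(x)

# Same 3-3-1 architecture: Theta^(1) is 3x4, Theta^(2) is 1x4.
rng = np.random.default_rng(0)
Thetas = [rng.standard_normal((3, 4)), rng.standard_normal((1, 4))]
print(forward([0.5, -1.2, 2.0], Thetas))
```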
8.5 - Examples and Intuitions 1
- We consider here y = x1 XNOR x2 (i.e. y = NOT (x1 XOR x2)).
- We can compute the AND function and the OR function each with a single neuron (weights -30, 20, 20 for AND and -10, 20, 20 for OR), as the sketch below verifies.
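A minimal check of those two neurons on all four inputs; since \(g\) is the sigmoid, \(g(-10) \approx 0\) and \(g(10) \approx 1\):

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(w0, w1, w2, x1, x2):
    """A single sigmoid unit: g(w0 + w1*x1 + w2*x2)."""
    return g(w0 + w1*x1 + w2*x2)

print("x1 x2 AND OR")
for x1 in (0, 1):
    for x2 in (0, 1):
        a = neuron(-30, 20, 20, x1, x2)  # AND weights
        o = neuron(-10, 20, 20, x1, x2)  # OR weights
        print(x1, x2, round(a), round(o))
# -> 0 0 | 0 0 ; 0 1 | 0 1 ; 1 0 | 0 1 ; 1 1 | 1 1
```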
8.6 - Examples and Intuitions 2
- To compute negation (NOT x1), we can also use a single neuron with the weights (10, -20); combining such units gives a two-layer network that computes XNOR (sketch below).
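Putting the pieces together, here is a sketch of the XNOR network from the lecture: the hidden units compute x1 AND x2 and (NOT x1) AND (NOT x2) (weights 10, -20, -20), and the output unit ORs them:

```python
import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(w0, w1, w2, x1, x2):
    return g(w0 + w1*x1 + w2*x2)

def xnor(x1, x2):
    a1 = neuron(-30, 20, 20, x1, x2)     # x1 AND x2
    a2 = neuron(10, -20, -20, x1, x2)    # (NOT x1) AND (NOT x2)
    return neuron(-10, 20, 20, a1, a2)   # a1 OR a2

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xnor(x1, x2)))
# -> 1 exactly when x1 == x2
```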
8.7 - Multiclass Classification
- We just use multiple output units, one per class, where output unit k should be “1” exactly when an example of class k is found.
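A minimal sketch of prediction with such an output layer; the four class names follow the lecture's pedestrian/car/motorcycle/truck example, and the output values here are made up:

```python
import numpy as np

# Hypothetical outputs of a 4-unit output layer for one example
# (classes: pedestrian, car, motorcycle, truck).
h = np.array([0.05, 0.91, 0.03, 0.10])

# Training targets are the matching "one-hot" vectors,
# e.g. y = [0, 1, 0, 0] for a car.
print(np.argmax(h))   # -> 1, i.e. the "car" unit is closest to 1
```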