Gradient Descent is a general algorithm that can be used to minimize functions such as \(J(\theta_0, \theta_1, \ldots, \theta_n)\).
The idea is to start somewhere on the function "surface", then iteratively find the local direction of steepest descent and take a step in that direction, so as to get "down" as quickly as possible.
We use the symbol \(:=\) to denote assignment.
So gradient descent consists of repeating until convergence: \(\{\, \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \,\}\) for \(j = 0\) and \(j = 1\).
\(\alpha\) is called the learning rate; it controls the size of each step.
Note that the parameters \(\theta_0\) and \(\theta_1\) are updated simultaneously: both right-hand sides are evaluated with the current parameter values before either parameter is assigned its new value.
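As a minimal sketch of the update loop, here is gradient descent on a hypothetical convex cost \(J(\theta_0, \theta_1) = (\theta_0 - 1)^2 + (\theta_1 + 2)^2\), chosen only for illustration (its minimum is at \(\theta_0 = 1\), \(\theta_1 = -2\)); the function names, learning rate, and convergence tolerance are all assumptions, not part of the original notes:

```python
# Hypothetical cost, used only for illustration:
# J(theta0, theta1) = (theta0 - 1)^2 + (theta1 + 2)^2, minimized at (1, -2).
def dJ_dtheta0(theta0, theta1):
    return 2 * (theta0 - 1)

def dJ_dtheta1(theta0, theta1):
    return 2 * (theta1 + 2)

alpha = 0.1                # learning rate: controls the step size
theta0, theta1 = 0.0, 0.0  # arbitrary starting point on the surface

for _ in range(1000):
    # Simultaneous update: compute both new values from the *current*
    # parameters in temporaries before assigning either of them.
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    theta0, theta1 = temp0, temp1
    # Stop once the gradient is (near) zero, i.e. we have converged.
    if abs(dJ_dtheta0(theta0, theta1)) < 1e-9 and abs(dJ_dtheta1(theta0, theta1)) < 1e-9:
        break

print(theta0, theta1)  # approximately 1.0, -2.0
```

Updating \(\theta_0\) in place and then using the new value to update \(\theta_1\) would be a subtly different algorithm; the temporaries make the simultaneous update explicit.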