===== II. Linear Regression with One Variable =====

==== 2.1 - Model Representation ====
/* 2 - 1 - Model Representation (8 min) */

  * Notation:
    * **m**: number of training examples
    * **x**'s: "input" variable / feature
    * **y**'s: "output" variable / "target" variable
    * (x,y): a single training example
    * \((x^{(i)},y^{(i)})\): the i-th training example in the training set
  * Basic representation of supervised learning:

digraph finite_state_machine {
    rankdir=TB;
    size="7"
    node [shape = box];
    "Training Set" "Learning Algorithm" "Hypothesis h";
    "Training Set" -> "Learning Algorithm" [ label = "learning" ];
    "Learning Algorithm" -> "Hypothesis h" [ label = "building" ];
    "Size of house X" [ shape = plaintext];
    "Estimated price Y" [ shape = plaintext];
    "Size of house X" -> "Hypothesis h" [color = blue];
    "Hypothesis h" -> "Estimated price Y" [color = blue];
    {rank = same; "Size of house X"; "Hypothesis h"; "Estimated price Y"; }
}

  * Simple linear regression is represented as: \(h_{\theta}(x) = \theta_0 + \theta_1 x\)
  * **Linear regression with one variable** == **univariate linear regression**

==== 2.2 - Cost function ====
/* 2 - 2 - Cost Function (8 min) */

  * In \(h_{\theta}(x) = \theta_0 + \theta_1 x\), the \(\theta_i\)'s are the **parameters** of the model.
  * In linear regression we want to minimize the value of \(\sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2\)
  * So we define the **cost function**: \(J(\theta_0,\theta_1) = \frac{1}{2m} \sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2\)
  * **Cost function** == **squared error function**

==== 2.3 - Cost Function - Intuition I ====
/* 2 - 3 - Cost Function - Intuition I (11 min) */

==== 2.4 - Cost Function - Intuition II ====
/* 2 - 4 - Cost Function - Intuition II (9 min) */

  * We can draw **contour plots** to represent the \(J(\theta_0,\theta_1)\) function, and thus avoid relying on 3D plots.

==== 2.5 - Gradient Descent ====
/* 2 - 5 - Gradient Descent (11 min) */

  * **Gradient descent** is an algorithm that can be used to minimize arbitrary functions such as \(J(\theta_0,\theta_1,...,\theta_n)\).
  * The idea is to start somewhere on the function "surface", then iteratively find the local direction of steepest slope and move in that direction to get "down" as quickly as possible.
  * We use the symbol \(:=\) to denote assignment.
  * So gradient descent consists of repeating until convergence: \(\{ \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1)\}\) (for j=0 and j=1)
  * \(\alpha\) is called the **learning rate**.
  * Note that we simultaneously update the parameters \(\theta_0\) and \(\theta_1\).

==== 2.6 - Gradient Descent Intuition ====

  * When we compute the partial derivative we get the direction of the steepest **positive** slope. That is why we need to move in the opposite direction, hence \(\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} ...\) (notice the "-" sign).
  * If the learning rate is **too small**, gradient descent can be slow.
  * If the learning rate is **too large**, gradient descent can overshoot the minimum, fail to converge, or even diverge.
  * When gradient descent gets close to a local minimum it automatically takes smaller steps, because the slope gets smaller (see the sketch just below).
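To make the update rule from sections 2.5-2.6 concrete, here is a minimal Python sketch of gradient descent on a made-up one-parameter function \(J(t) = (t-3)^2\) (the function, learning rate, and iteration count are illustrative choices, not from the lecture; the linear regression version follows in section 2.7):

<code python>
# Gradient descent on a simple made-up convex function J(t) = (t - 3)^2,
# just to illustrate the update rule t := t - alpha * dJ/dt.
# (One parameter only; the linear-regression version is in section 2.7.)

def J(t):
    return (t - 3.0) ** 2

def dJ(t):
    return 2.0 * (t - 3.0)    # derivative of J

alpha = 0.1                   # learning rate (illustrative value)
t = 0.0                       # arbitrary starting point

for i in range(25):
    step = alpha * dJ(t)
    t = t - step              # move against the slope (note the "-" sign)
    if i % 5 == 0:
        # Steps shrink automatically as we approach the minimum at t = 3.
        print(f"iter {i:2d}: t = {t:.4f}, J(t) = {J(t):.6f}, step = {step:+.4f}")
</code>

Running it shows the step size shrinking as \(t\) approaches the minimum at 3, which is the behaviour described in the last bullet above.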
==== 2.7 - Gradient Descent for Linear Regression ====

  * For the previously defined cost function we can compute the derivatives:
    * \(\frac{\partial}{\partial \theta_0} J(\theta_0,\theta_1) = \frac{1}{m} \sum\limits_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) \)
    * \(\frac{\partial}{\partial \theta_1} J(\theta_0,\theta_1) = \frac{1}{m} \sum\limits_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) \cdot x^{(i)} \)
  * The cost function for linear regression is always a **convex function** (i.e. a bowl-shaped function), so gradient descent will always find its only minimum, which is the global one.
  * The gradient descent described here is sometimes referred to as **batch** gradient descent, because each step of gradient descent uses all the training examples in the calculation (a sketch putting these pieces together appears at the end of this page).
  * For this cost function we could also solve the problem with the **normal equation** (i.e. without iterating as in gradient descent), but gradient descent scales better to large datasets with many features.

==== 2.8 - What's Next ====

  * We will have to use linear algebra with matrices and vectors to move to higher-dimensional problems (i.e. more features). Linear algebra is introduced in the next chapter.
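Putting the cost function (section 2.2), the update rule (section 2.5), and the derivatives from section 2.7 together, here is a minimal sketch of batch gradient descent for univariate linear regression (plain Python with NumPy; the toy data, learning rate, and iteration count are made up for illustration):

<code python>
import numpy as np

# Toy training set (m = 4 examples), made up for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.5, 3.5, 4.5])
m = len(x)

alpha = 0.05                      # learning rate (illustrative value)
theta0, theta1 = 0.0, 0.0         # initial parameters

for _ in range(2000):             # fixed iteration count instead of a convergence test
    error = (theta0 + theta1 * x) - y    # h_theta(x^(i)) - y^(i), for all i at once
    grad0 = np.sum(error) / m            # dJ/dtheta_0
    grad1 = np.sum(error * x) / m        # dJ/dtheta_1
    # Simultaneous update: both gradients are computed before either theta changes.
    theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1

cost = np.sum(((theta0 + theta1 * x) - y) ** 2) / (2 * m)   # J(theta_0, theta_1)
print(theta0, theta1, cost)       # fitted intercept, slope, and final cost
</code>

Each iteration uses all m training examples to compute both gradients before either parameter is updated, which is the "batch" and "simultaneous update" behaviour described in sections 2.5 and 2.7.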