===== VII - Regularization =====

==== 7.1 - The problem of Overfitting ====

  * **Underfitting** occurs when we use a linear hypothesis for a dataset that would require a higher-order polynomial. We also say that the algorithm has **high bias** => we are not fitting the training data very well. "Bias" here should be understood as "preconception".
  * **Overfitting** occurs when we use a polynomial of too high an order to model the dataset, or more generally, when we have **too many features**. We also say that the algorithm has **high variance**.
  * Note that overfitting can occur in both linear regression and logistic regression.
  * To fix overfitting, we can:
    - Reduce the number of features:
      * We can manually select which features to keep.
      * We can use a **model selection algorithm**.
    - Use regularization:
      * We keep all the features, but we reduce the values of the parameters \(\theta_j\).

==== 7.2 - Cost function ====

  * When we keep the values of the parameters small:
    * We get a "simpler" hypothesis.
    * We are less prone to overfitting.
  * For regularized linear regression, we penalize large values of all the parameters \(\theta_j\), so we can use the modified cost function:

\[J(\theta) = \frac{1}{2m} \left[ \sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2 + \lambda \sum\limits_{j=1}^n \theta_j^2 \right]\]

  * Note that, by convention, we do not penalize \(\theta_0\). Also, \(\lambda\) is called the **regularization parameter**. It should be selected properly: if \(\lambda\) is too large we get underfitting, and if it is too small we do not remove the overfitting.

==== 7.3 - Regularized Linear Regression ====

  * The regularized version of the partial derivative of \(J(\theta)\) is as follows:

\[ \frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{m} \left( \sum\limits_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \right) + \frac{\lambda}{m} \theta_j \]

=> Note that the previous derivative **does not** apply to \(\theta_0\): for \(j = 0\) we keep the unregularized derivative.

  * From this, we can write the gradient descent update rule as follows:

\[ \theta_j := \theta_j \left(1 - \alpha \frac{\lambda}{m}\right) - \alpha \frac{1}{m} \sum\limits_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \]

  * In the case of the normal equation, the updated computation of \(\theta\) is:

\[\theta = (X^TX + \lambda ~ P)^{-1}X^Ty\]

where \(P\) is the identity matrix from \(\mathbb{R}^{(n+1) \times (n+1)}\) except that \(P_{11} = 0\).

  * Note that if \(m \le n\) then the matrix \(X^TX\) is non-invertible (i.e. singular or degenerate).
  * Fortunately, we can prove that with regularization, if \(\lambda \gt 0\) then the matrix \(X^TX + \lambda~P\) is invertible.

==== 7.4 - Regularized Logistic Regression ====

  * Again, we add to the cost function \(J(\theta)\) the term \(\frac{\lambda}{2m} \sum\limits_{j=1}^n \theta_j^2\).
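To make the formulas of sections 7.2 and 7.3 concrete, here is a minimal NumPy sketch of the regularized cost and of one gradient descent update (the function names are illustrative, not from the course); note how \(\theta_0\) is excluded from the penalty:

<code python>
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized linear regression cost J(theta).
    X is the m x (n+1) design matrix (first column of ones), y the targets,
    lam the regularization parameter lambda."""
    m = len(y)
    errors = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)      # theta_0 is not penalized
    return (np.sum(errors ** 2) + penalty) / (2 * m)

def gradient_descent_step(theta, X, y, alpha, lam):
    """One simultaneous regularized gradient descent update of all theta_j."""
    m = len(y)
    errors = X @ theta - y                      # h_theta(x) - y, shape (m,)
    grad = (X.T @ errors) / m                   # unregularized gradient
    reg = (lam / m) * theta
    reg[0] = 0.0                                # no regularization term for theta_0
    return theta - alpha * (grad + reg)
</code>

The update computed here is exactly \(\theta_j := \theta_j (1 - \alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum_i (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}\) for \(j \ge 1\), and the plain unregularized update for \(j = 0\).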
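The regularized normal equation of section 7.3 can be sketched in the same way (again an illustrative sketch, not the course's code), with ''P'' being the identity matrix whose top-left entry is zeroed out:

<code python>
import numpy as np

def normal_equation_regularized(X, y, lam):
    """Closed-form solution theta = (X^T X + lambda * P)^{-1} X^T y."""
    P = np.eye(X.shape[1])
    P[0, 0] = 0.0                               # do not penalize theta_0
    # For lambda > 0, X^T X + lambda * P is invertible even when m <= n.
    return np.linalg.solve(X.T @ X + lam * P, X.T @ y)
</code>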
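For section 7.4, the regularized logistic regression cost adds the same penalty term to the cross-entropy cost; a minimal sketch along the same lines (helper names are mine):

<code python>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_logistic_cost(theta, X, y, lam):
    """Cross-entropy cost plus the penalty (lambda / 2m) * sum_{j>=1} theta_j^2."""
    m = len(y)
    h = sigmoid(X @ theta)                      # hypothesis h_theta(x) in (0, 1)
    cross_entropy = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    penalty = (lam / (2 * m)) * np.sum(theta[1:] ** 2)   # theta_0 excluded
    return cross_entropy + penalty
</code>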