VII - Regularization
7.1 - The problem of Overfitting
- Underfitting occurs when we use a linear hypothesis for a dataset that would require a higher-order polynomial. We also say that the algorithm has high bias ⇒ it does not fit the training data well. “Bias” here should be understood as “preconception”.
- Overfitting occurs when we use too high an order of polynomial to model a dataset, or more generally, when we have too many features. We also say that the algorithm has high variance.
- Note that overfitting can be present in both linear regression and logistic regression.
- To fix overfitting, we can:
- Reduce the number of features.
- We can manually select which features to keep.
- We can use a model selection algorithm.
- Use regularization:
- We keep all the features, but we reduce the values of the parameters \(\theta_j\).
7.2 - Cost function
- When we keep the parameter values small:
- We get “Simpler” hypothesis.
- We are less prone to overfitting.
- For regularized linear regression, we penalize large values of all parameters \(\theta_j\), so we can use the modified cost function:
\[J(\theta) = \frac{1}{2m} \left[ \sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2 + \lambda \sum\limits_{j=1}^n \theta_j^2 \right]\]
- Note that, by convention, we do not penalize \(\theta_0\). Also, \(\lambda\) is called the regularization parameter. It must be chosen carefully: if it is too large we get underfitting, and if it is too small we do not remove the overfitting.
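- As a concrete sketch, this cost can be computed in a few lines of NumPy (the function name `regularized_cost` and the data layout are my own, not from the course):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized linear regression cost J(theta).

    X: m x (n+1) design matrix (first column of ones),
    y: m-vector of targets, lam: regularization parameter.
    """
    m = len(y)
    residuals = X @ theta - y               # h_theta(x^(i)) - y^(i) for all i
    penalty = lam * np.sum(theta[1:] ** 2)  # theta_0 is not penalized
    return (residuals @ residuals + penalty) / (2 * m)
```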
7.3 - Regularized Linear Regression
- The regularized version of the partial derivative of \(J(\theta)\) is as follows:
\[ \frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{m} \left( \sum\limits_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \right) + \frac{\lambda}{m} \theta_j \]
⇒ Note that the previous derivative does not apply to \(\theta_0\).
- From this we note that we can write the gradient descent update rule as follows:
\[ \theta_j := \theta_j \left(1 - \alpha \frac{\lambda}{m}\right) - \frac{\alpha}{m} \sum\limits_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \]
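- The update rule above translates directly into a vectorized NumPy step (a sketch; `gradient_step` is my own helper name):

```python
import numpy as np

def gradient_step(theta, X, y, alpha, lam):
    """One step of regularized gradient descent for linear regression."""
    m = len(y)
    grad = X.T @ (X @ theta - y) / m           # (1/m) * sum of (h - y) * x_j
    new_theta = theta * (1 - alpha * lam / m) - alpha * grad
    new_theta[0] = theta[0] - alpha * grad[0]  # theta_0: no shrinkage term
    return new_theta
```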
- In the case of the normal equation, the updated computation for \(\theta\) is:
\[\theta = (X^TX + \lambda ~ P)^{-1}X^Ty\]
where \(P\) is the identity matrix of \(\mathbb{R}^{(n+1) \times (n+1)}\) except that \(P_{11} = 0\).
- Note that if \(m \le n\), then the matrix \(X^TX\) is non-invertible (i.e. singular or degenerate).
- Fortunately, we can prove that with regularization, if \(\lambda > 0\), then the matrix \(X^TX + \lambda~P\) is invertible.
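- The regularized normal equation is also straightforward in NumPy (a sketch; `normal_equation_reg` is my own name, and `np.linalg.solve` is used instead of an explicit matrix inverse for numerical stability):

```python
import numpy as np

def normal_equation_reg(X, y, lam):
    """Closed-form theta for regularized linear regression."""
    P = np.eye(X.shape[1])  # (n+1) x (n+1) identity...
    P[0, 0] = 0             # ...with P_11 zeroed so theta_0 is not penalized
    return np.linalg.solve(X.T @ X + lam * P, X.T @ y)
```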
7.4 - Regularized Logistic Regression
- Again, we add to the cost function \(J(\theta)\) the term \(\frac{\lambda}{2m} \sum\limits_{j=1}^n \theta_j^2\), which gives:
\[J(\theta) = -\frac{1}{m} \sum\limits_{i=1}^m \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m} \sum\limits_{j=1}^n \theta_j^2\]
- The gradient descent update rule is then the same as for regularized linear regression, except that the hypothesis is now \(h_\theta(x) = \frac{1}{1 + e^{-\theta^Tx}}\).
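- A minimal NumPy sketch of this cost and its gradient (the function names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logistic_cost_reg(theta, X, y, lam):
    """Regularized logistic regression cost and gradient."""
    m = len(y)
    h = sigmoid(X @ theta)                  # h_theta(x) for all examples
    cost = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    cost += lam / (2 * m) * np.sum(theta[1:] ** 2)
    grad = X.T @ (h - y) / m
    grad[1:] += lam / m * theta[1:]         # theta_0 is not regularized
    return cost, grad
```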