VII - Regularization

7.1 - The Problem of Overfitting

7.2 - Cost Function

\[J(\theta) = \frac{1}{2m} \left[ \sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2 + \lambda \sum\limits_{j=1}^n \theta_j^2 \right]\]
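As a quick illustration, here is a minimal NumPy sketch of this regularized cost (the function and variable names are my own; `X` is assumed to be the design matrix with a leading column of ones, and \(\theta_0\) is excluded from the penalty, matching the sum over \(j = 1, \dots, n\)):

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized linear-regression cost J(theta) (illustrative sketch)."""
    m = len(y)
    residuals = X @ theta - y                 # h_theta(x^(i)) - y^(i) for all i
    penalty = lam * np.sum(theta[1:] ** 2)    # lambda * sum_{j=1..n} theta_j^2, theta_0 excluded
    return (residuals @ residuals + penalty) / (2 * m)
```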

7.3 - Regularized Linear Regression

\[ \frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{m} \left( \sum\limits_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \right) + \frac{\lambda}{m} \theta_j \]

⇒ Note that the previous derivative does not apply to \(\theta_0\): the bias term is not regularized, so \(\frac{\partial}{\partial\theta_0}J(\theta) = \frac{1}{m} \sum\limits_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)}\), with no \(\frac{\lambda}{m}\theta_0\) term.

\[ \theta_j := \theta_j \left(1 - \alpha \frac{\lambda}{m}\right) - \alpha \frac{1}{m} \sum\limits_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \]
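Below is a minimal NumPy sketch of one such gradient-descent step, under the same assumptions as before (illustrative names; `X` carries a leading column of ones). Note that \(\theta_0\) skips the shrinkage factor \((1 - \alpha \frac{\lambda}{m})\) because it is not regularized:

```python
import numpy as np

def gradient_step(theta, X, y, alpha, lam):
    """One regularized gradient-descent update for linear regression (sketch)."""
    m = len(y)
    error = X @ theta - y                       # shape (m,)
    grad = (X.T @ error) / m                    # (1/m) * sum_i (h - y) * x_j, shape (n+1,)
    new_theta = theta * (1 - alpha * lam / m) - alpha * grad
    new_theta[0] = theta[0] - alpha * grad[0]   # theta_0: plain update, no regularization
    return new_theta
```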

\[\theta = (X^TX + \lambda ~ P)^{-1}X^Ty\]

where \(P \in \mathbb{R}^{(n+1) \times (n+1)}\) is the identity matrix except that \(P_{11} = 0\), so that the bias term \(\theta_0\) is not regularized.
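A minimal sketch of this closed-form solution, assuming the same conventions as above (`lam` for \(\lambda\), a design matrix `X` of shape \(m \times (n+1)\)); `np.linalg.solve` is used instead of an explicit matrix inverse for numerical stability:

```python
import numpy as np

def normal_equation_regularized(X, y, lam):
    """theta = (X^T X + lambda * P)^(-1) X^T y, with P = identity but its top-left entry zeroed (sketch)."""
    P = np.eye(X.shape[1])
    P[0, 0] = 0.0                 # do not regularize the bias term theta_0
    return np.linalg.solve(X.T @ X + lam * P, X.T @ y)
```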

7.4 - Regularized Logistic Regression