
  • Underfitting occurs when we use a linear hypothesis for a dataset that would require a higher-order polynomial. We also say that the algorithm has high bias ⇒ we are not fitting the training data well. “Bias” here should be understood as “preconception”.
  • Overfitting occurs when we use a polynomial of too high an order to model a dataset, or, more generally, when we have too many features. We also say that the algorithm has high variance.
  • Note that overfitting can occur in both linear regression and logistic regression.
  • To fix overfitting, we can:
    1. Reduce the number of features.
      • We can manually select which features to keep.
      • We can use a model selection algorithm.
    2. Use regularization:
      • We keep all the features, but we reduce the values of the parameters \(\theta_j\).
  • When we keep the parameter values small:
    • We get a “simpler” hypothesis.
    • We are less prone to overfitting.
  • For regularized linear regression, we penalize large values of the parameters \(\theta_j\) (for \(j \ge 1\)) by using the modified cost function:

\[J(\theta) = \frac{1}{2m} \left[ \sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2 + \lambda \sum\limits_{j=1}^n \theta_j^2 \right]\]

  • Note that, by convention, we do not penalize \(\theta_0\). \(\lambda\) is called the regularization parameter. It must be chosen properly: if \(\lambda\) is too large we get underfitting, and if it is too small we do not remove the overfitting.
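  • As an illustration, here is a minimal NumPy sketch of this regularized cost (the name `regularized_cost` and the variables `X`, `y`, `lam` are illustrative, not from the course; `X` is assumed to already contain the leading column of ones):

    import numpy as np

    def regularized_cost(theta, X, y, lam):
        """Regularized linear regression cost J(theta).

        X: m x (n+1) design matrix (first column all ones), y: m targets,
        lam: the regularization parameter lambda.
        """
        m = len(y)
        residuals = X @ theta - y                 # h_theta(x^(i)) - y^(i) for every example
        penalty = lam * np.sum(theta[1:] ** 2)    # theta_0 (index 0) is not penalized
        return (np.sum(residuals ** 2) + penalty) / (2 * m)
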
  • The regularized version of the partial derivative of \(J(\theta)\) is as follows (for \(j \ge 1\)):

\[ \frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{m} \left( \sum\limits_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \right) + \frac{\lambda}{m} \theta_j \]

⇒ Note that the previous derivative does not apply to \(\theta_0\), which keeps its unregularized derivative \(\frac{1}{m} \sum\limits_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)}\).
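  • A matching NumPy sketch for the gradient (again, the names are illustrative; `X` is assumed to include the column of ones):

    import numpy as np

    def regularized_gradient(theta, X, y, lam):
        """Partial derivatives of J(theta) for regularized linear regression."""
        m = len(y)
        residuals = X @ theta - y
        grad = (X.T @ residuals) / m           # unregularized gradient, for all j
        grad[1:] += (lam / m) * theta[1:]      # add (lambda/m) * theta_j for j >= 1 only
        return grad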

  • From this we note that, for \(j \ge 1\), the gradient descent update rule can be written as follows:

\[ \theta_j := \theta_j \left(1 - \alpha \frac{\lambda}{m}\right) - \frac{\alpha}{m} \sum\limits_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \]
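  • Putting this together, a sketch of batch gradient descent using this update rule (illustrative names; `alpha` and `iterations` are assumed hyper-parameters):

    import numpy as np

    def gradient_descent(X, y, lam, alpha=0.01, iterations=1000):
        """Batch gradient descent with the regularized update rule above."""
        m, n_plus_1 = X.shape
        theta = np.zeros(n_plus_1)
        for _ in range(iterations):
            grad = (X.T @ (X @ theta - y)) / m                  # unregularized gradient
            theta_0 = theta[0] - alpha * grad[0]                # theta_0: no shrinkage factor
            theta_rest = theta[1:] * (1 - alpha * lam / m) - alpha * grad[1:]
            theta = np.concatenate(([theta_0], theta_rest))     # simultaneous update
        return theta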

  • In the case of the normal equation, the regularized computation of \(\theta\) is:

\[\theta = (X^TX + \lambda ~ P)^{-1}X^Ty\]

where \(P\) is the identity matrix of \(\mathbb{R}^{(n+1) \times (n+1)}\) except that \(P_{11} = 0\), so that \(\theta_0\) is not penalized.
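  • A possible NumPy sketch of this closed-form solution (illustrative names; `np.linalg.solve` is used instead of forming the inverse explicitly, which is numerically preferable):

    import numpy as np

    def normal_equation(X, y, lam):
        """Closed-form theta = (X'X + lambda * P)^-1 X'y."""
        n_plus_1 = X.shape[1]
        P = np.eye(n_plus_1)
        P[0, 0] = 0        # zero out the entry for theta_0 (P_11 in 1-based notation)
        return np.linalg.solve(X.T @ X + lam * P, X.T @ y)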

  • Note that if \(m \le n\), then the matrix \(X^TX\) is non-invertible (i.e. singular or degenerate).
  • Fortunately, we can prove that with regularization, if \(\lambda > 0\), then the matrix \(X^TX + \lambda~P\) is invertible.
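  • A short sketch of why this holds (my own argument, assuming \(x_0^{(i)} = 1\) for every example and \(m \ge 1\)): if \((X^TX + \lambda P)v = 0\) for some vector \(v\), then

\[ v^T (X^TX + \lambda P) v = \|Xv\|^2 + \lambda \sum\limits_{j=1}^n v_j^2 = 0, \]

so \(Xv = 0\) and \(v_1 = \dots = v_n = 0\); but then \(Xv = v_0 \cdot \mathbf{1} = 0\) forces \(v_0 = 0\), hence \(v = 0\) and the matrix is invertible.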
  • For regularized logistic regression, we again add to the cost function \(J(\theta)\) the term \(\frac{\lambda}{2m} \sum\limits_{j=1}^n \theta_j^2\).
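  • Assuming this note refers to regularized logistic regression, here is a minimal sketch of the corresponding cost (illustrative names; the cross-entropy part is the standard unregularized logistic cost):

    import numpy as np

    def regularized_logistic_cost(theta, X, y, lam):
        """Logistic regression cost: cross-entropy plus the penalty on theta_1..theta_n."""
        m = len(y)
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))               # sigmoid hypothesis h_theta(x)
        cross_entropy = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
        penalty = (lam / (2 * m)) * np.sum(theta[1:] ** 2)   # theta_0 excluded
        return cross_entropy + penalty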