
  • Classification problem ⇒ the value y to predict is discrete, for instance \(y \in \{ 0, 1\}\). In that case, 0 ⇒ Negative Class and 1 ⇒ Positive Class.
  • When we have more than 2 classes (e.g. \(y \in \{ 0, 1, 2, 3\}\)) the problem is called a multi-class classification problem; when we have 2 classes, it is a binary classification problem.
  • For classification we could still apply linear regression to get a hypothesis, and then simply use a threshold to get back to discrete values, e.g. \(h_\theta(x) \ge 0.5\) predicts “y=1”, and \(h_\theta(x) \lt 0.5\) predicts “y=0”. But this is not a good idea !! ⇒ it can give really wrong predictions (a single far-away example can shift the fitted line and move the threshold).
  • We will instead build a Logistic Regression algorithm producing hypotheses such that: \(0 \le h_\theta(x) \le 1\).
  • For the Logistic Regression model we use: \(h_\theta(x) = g(\theta^Tx)\) with \(g(z) = \frac{1}{1+e^{-z}}\). g(z) is called the sigmoid function or the logistic function. So we get: \(h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}\).
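  • A minimal Octave sketch of this hypothesis (the function names sigmoid and hypothesis are just illustrative, not from the course code):
    % logistic/sigmoid function, applied element-wise so z can be a scalar, vector or matrix
    function g = sigmoid(z)
      g = 1 ./ (1 + exp(-z));
    end

    % hypothesis h_theta(x) for a design matrix X (one training example per row)
    function h = hypothesis(theta, X)
      h = sigmoid(X * theta);
    end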
  • We still need to fit the parameters \(\theta\).
  • In this perspective, we then consider that \(h_\theta(x)\) is the estimated probability that y=1 on input x. So we can write: \(h_\theta(x) = P(y=1|x;\theta)\).
  • Note that, since y is either 0 or 1, we have \(P(y=0|x;\theta) = 1 - P(y=1|x;\theta)\).
  • For the sigmoid function, we notice that \(g(z) \ge 0.5\) for \(z \ge 0\); this happens when \(\theta^Tx \ge 0\).
  • The decision boundary is the line that separates the y=1 and y=0 areas on a 2D plot.
  • We can also have non-linear decision boundaries, and for instance predict y=1 if \(-1+x_1^2+x_2^2 \ge 0\). ⇒ This would define a circle of radius 1 as the decision boundary.
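  • A tiny Octave sketch of that non-linear boundary (the θ values here are purely illustrative):
    % hypothetical parameters so that theta'*features = -1 + x1^2 + x2^2
    theta = [-1; 1; 1];                         % for the features [1, x1^2, x2^2]
    x1 = 2; x2 = 0;                             % a point outside the unit circle
    predictY1 = ([1, x1^2, x2^2] * theta) >= 0; % gives 1 here, since -1 + 4 + 0 >= 0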
  • Back to linear regression, we could define: \(J(\theta) = \frac{1}{m} \sum\limits_{i=1}^m Cost(h_\theta(x^{(i)}),y^{(i)})\), where we define: \(Cost(h_\theta(x^{(i)}),y^{(i)}) = \frac{1}{2}(h_\theta(x^{(i)}) - y^{(i)})^2\).
  • To simplify the notation we get rid of the superscripts, and we write: \(Cost(h_\theta(x),y) = \frac{1}{2}(h_\theta(x) - y)^2\).
  • For logistic regression we cannot reuse this squared-error cost: with the sigmoid inside it, the resulting \(J(\theta)\) is non-convex, so gradient descent is not guaranteed to reach the global minimum.
  • So, instead, we define the following cost function for logistic regression:

\[Cost(h_\theta(x),y) = \begin{cases} -log(h_\theta(x)) \text{ if } y=1 \\ -log(1 - h_\theta(x)) \text{ if } y=0 \end{cases}\]

  • Note that if y=1 and \(h_\theta(x)=1\) then Cost = 0, but as \(h_\theta(x) \to 0\), then \(Cost \to \infty\). Symmetrically, if y=0 the cost is 0 when \(h_\theta(x)=0\) and goes to \(\infty\) as \(h_\theta(x) \to 1\).
  • Since we always have y=0 or y=1, we can write the cost function directly as:

\[Cost(h_\theta(x),y) = -y~log(h_\theta(x)) - (1-y)~log(1 - h_\theta(x))\]

  • So the final cost function is written as:

\[ J(\theta) = -\frac{1}{m} \sum\limits_{i=1}^m \left[ y^{(i)}~log(h_\theta(x^{(i)})) + (1-y^{(i)})~log(1 - h_\theta(x^{(i)})) \right] \]

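  • As a sketch, assuming a design matrix X (one example per row), a column vector y, and the sigmoid function above, this cost can be computed in a vectorized way in Octave:
    % vectorized logistic-regression cost J(theta)
    m = length(y);
    h = sigmoid(X * theta);                       % column vector of predictions
    J = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h));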
  • This cost function can be derived from the principle of maximum likelihood estimation in statistics.
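  • A quick sketch of that derivation: writing the label distribution as \(P(y|x;\theta) = h_\theta(x)^y (1 - h_\theta(x))^{1-y}\) and assuming independent training examples, the log-likelihood is:

\[ \ell(\theta) = \sum\limits_{i=1}^m y^{(i)}~log(h_\theta(x^{(i)})) + (1-y^{(i)})~log(1 - h_\theta(x^{(i)})) \]

  • So maximizing the likelihood over the training set is the same as minimizing \(J(\theta) = -\frac{1}{m}\ell(\theta)\).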
  • Now what is left is to compute the partial derivative of the cost function to be able to perform gradient descent. We have here:

\[ \frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{m} \sum\limits_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \]

⇒ This is exactly the same partial derivative as for linear regression! (except that h() is different).

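  • As a sketch, the corresponding vectorized gradient in Octave (same X, y, m and sigmoid as in the cost sketch above) would be:
    % vectorized gradient of J(theta): (1/m) * X' * (h - y)
    h = sigmoid(X * theta);
    grad = (1/m) * X' * (h - y);                  % (n+1) x 1 vector of partial derivatives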
  • Once we provide code to compute \(J(\theta)\) and \(\frac{\partial}{\partial\theta_j}J(\theta)\), then we can use any of the following algorithms to perform the optimization:
    • Gradient descent
    • Conjugate gradient
    • BFGS
    • L-BFGS
  • Advantages of the algorithms not covered in this class (all except Gradient descent):
    • No need to manually pick \(\alpha\).
    • Often faster than gradient descent.
  • Disadvantages:
    • More complex.
  • To implement this in Octave, we would write something like this (a toy example minimizing \((\theta_1-5)^2 + (\theta_2-5)^2\)):
    function [jVal, gradient] = costFunction(theta)
      % cost value: J(theta) = (theta1 - 5)^2 + (theta2 - 5)^2, minimized at theta = [5; 5]
      jVal = (theta(1)-5)^2 + (theta(2)-5)^2;
      % gradient: partial derivatives of J with respect to theta(1) and theta(2)
      gradient = zeros(2,1);
      gradient(1) = 2*(theta(1)-5);
      gradient(2) = 2*(theta(2)-5);
    end
  • Then, to actually use this method in Octave to find the optimal parameters, we write:
    % 'GradObj','on' tells fminunc that costFunction also returns the gradient; iterations are capped at 100
    options = optimset('GradObj', 'on', 'MaxIter', 100);
    initialTheta = zeros(2,1);
    [optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
  • exitFlag should be 1 (meaning the optimization converged); check help fminunc for details. Note that fminunc will not work if \(\theta\) is just a real number: it expects a vector with at least 2 elements.
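  • For logistic regression itself the same pattern applies; a possible sketch (logisticCost is a made-up name here, and X, y are passed in through an anonymous function):
    % cost + gradient of logistic regression, in the form expected by fminunc
    function [jVal, gradient] = logisticCost(theta, X, y)
      m = length(y);
      h = 1 ./ (1 + exp(-(X * theta)));
      jVal = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h));
      gradient = (1/m) * X' * (h - y);
    end

    % X and y are captured through an anonymous function wrapper
    [optTheta, functionVal, exitFlag] = fminunc(@(t) logisticCost(t, X, y), initialTheta, options);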
  • One-vs-all (also called One-vs-rest):
  • For each class, we train a binary classifier that isolates that class from the other classes. So if we have 3 classes, we compute the hypotheses \(h_\theta^{(1)}(x), h_\theta^{(2)}(x), h_\theta^{(3)}(x)\).
  • Then, to classify a new input x, we select the class that maximizes the corresponding hypothesis.
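  • A small Octave sketch of this prediction step (allTheta is a hypothetical K x (n+1) matrix whose k-th row holds the parameters of classifier k, and X has a leading column of ones):
    % evaluate all K classifiers at once: probs(i,k) = h_theta^(k)(x^(i))
    probs = 1 ./ (1 + exp(-(X * allTheta')));
    % pick, for each example, the class whose classifier gives the highest probability
    [maxProb, predictions] = max(probs, [], 2);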