VI - Logistic Regression
6.1 - Classification
- Classification problem ⇒ the value y to predict is discrete, for instance \(y \in \{ 0, 1\}\). In that case, 0 ⇒ Negative Class and 1 ⇒ Positive Class.
- When we have more than 2 classes (eg. \(y \in \{ 0, 1, 2, 3\}\)) the problem is called a multi-class classification problem; when we have 2 classes, it is a binary classification problem.
- For classification we could still apply linear regression to get a hypothesis, and then simply use a threshold to get back to discrete values, eg. \(h_\theta(x) \ge 0.5\) predicts “y=1”, and \(h_\theta(x) \lt 0.5\) predicts “y=0”. But this is not a good idea ⇒ it can give really wrong predictions (a single extra training example far from the others can shift the fitted line and move the threshold).
- We will create a Logistic Regression algorithm producing hypotheses such that: \(0 \le h_\theta(x) \le 1\).
6.2 - Hypothesis Representation
- for Logistic Regression Model we use: \(h_\theta(x) = g(\theta^Tx)\) with \(g(z) = \frac{1}{1+e^{-z}}\). g(z) is called the sigmoid function or the logistic function. So we get: \(h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}\).
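The hypothesis above can be sketched in a few lines of Python (an illustrative analogue; the course itself uses Octave):

```python
import numpy as np

def sigmoid(z):
    # Logistic function g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    # Hypothesis h_theta(x) = g(theta^T x)
    return sigmoid(theta @ x)
```

Note that g(z) maps any real z into (0, 1), which is what guarantees \(0 \le h_\theta(x) \le 1\).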
- We still need to fit the parameters \(\theta\).
- In this perspective, we then consider that \(h_\theta(x)\) is the estimated probability that y=1 on input x. So we can write: \(h_\theta(x) = P(y=1|x;\theta)\).
- Note that, since y is either 0 or 1, \(P(y=0|x;\theta) = 1 - P(y=1|x;\theta)\).
6.3 - Decision Boundary
- For the sigmoid function, we notice that \(g(z) \gt 0.5\) for \(z \gt 0\), which happens when \(\theta^Tx \gt 0\).
- The decision boundary is the boundary that separates the y=1 and y=0 regions on a 2D plot of the features.
- We can also have non-linear decision boundaries, and for instance predict y=1 if \(-1+x_1^2+x_2^2 \ge 0\). ⇒ This would define a circle of radius 1 as decision boundary.
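The circular boundary above can be checked directly (a tiny Python sketch of the prediction rule, assuming the classifier predicts the boundary itself as y=1):

```python
def predict_circle(x1, x2):
    # Predict y=1 when -1 + x1^2 + x2^2 >= 0,
    # i.e. when (x1, x2) lies on or outside the unit circle.
    return 1 if (-1 + x1**2 + x2**2) >= 0 else 0
```

Points inside the unit circle get y=0, points outside get y=1.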
6.4 - Cost Function
- Back to linear regression, we could define: \(J(\theta) = \frac{1}{m} \sum\limits_{i=1}^m Cost(h_\theta(x^{(i)}),y^{(i)})\), where we define: \(Cost(h_\theta(x^{(i)}),y^{(i)}) = \frac{1}{2}(h_\theta(x^{(i)}) - y^{(i)})^2\).
- To simplify the notation we get rid of the superscripts, and we write: \(Cost(h_\theta(x),y) = \frac{1}{2}(h_\theta(x) - y)^2\).
- For logistic regression we cannot use \(h_\theta(x)\) directly to build the Cost function, otherwise, the resulting \(J(\theta)\) is non-convex and we cannot use gradient descent on it.
- So, instead, we define the following cost function for logistic regression:
\[Cost(h_\theta(x),y) = \begin{cases} -log(h_\theta(x)) \text{ if } y=1 \\ -log(1 - h_\theta(x)) \text{ if } y=0 \end{cases}\]
- Note that if y=1 and \(h_\theta(x)=1\) then Cost = 0, but as \(h_\theta(x) \to 0\), then \(Cost \to \infty\).
6.5 - Simplified Cost Function and Gradient Descent
- Since we always have y=0 or y=1, we can write the cost function directly as:
\[Cost(h_\theta(x),y) = -y~log(h_\theta(x)) - (1-y)~log(1 - h_\theta(x))\]
- So the final cost function is written as:
\[ J(\theta) = -\frac{1}{m} \sum\limits_{i=1}^m \left[ y^{(i)}~log(h_\theta(x^{(i)})) + (1-y^{(i)})~log(1 - h_\theta(x^{(i)})) \right] \]
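This cost function translates directly into a vectorized Python sketch (an illustrative analogue of the course's Octave; X is the m×n design matrix):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # J(theta) = -(1/m) * sum( y*log(h) + (1-y)*log(1-h) )
    m = len(y)
    h = sigmoid(X @ theta)
    return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```

With \(\theta = 0\), every h is 0.5 and the cost is log 2 regardless of the labels, a handy sanity check.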
- This cost function can be derived from the principle of maximum likelihood estimation in statistics.
- Now what is left is to compute the partial derivative of the cost function to be able to perform gradient descent. We have here:
\[ \frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{m} \sum\limits_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \]
⇒ This is exactly the same partial derivative as for linear regression! (except that h() is different).
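One gradient descent step, using the partial derivative above, can be sketched in Python (an illustrative analogue; all \(\theta_j\) are updated simultaneously via the vectorized form):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(theta, X, y, alpha):
    # theta_j := theta_j - (alpha/m) * sum_i (h(x_i) - y_i) * x_ij
    m = len(y)
    grad = (X.T @ (sigmoid(X @ theta) - y)) / m
    return theta - alpha * grad
```

On a small separable data set, repeating this step drives the predictions toward the correct labels.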
6.6 - Advanced Optimization
- Once we provide code to compute \(J(\theta)\) and \(\frac{\partial}{\partial\theta_j}J(\theta)\), then we can use any of the following algorithms to perform the optimization:
- Gradient descent
- Conjugate gradient
- BFGS
- L-BFGS
- Advantages of the algorithms not seen in this class (all except Gradient descent):
- No need to manually pick \(\alpha\).
- Often faster than gradient descent.
- Disadvantages:
- More complex.
- To implement this in Octave, we would write something like:
```matlab
function [jVal, gradient] = costFunction(theta)
  jVal = (theta(1)-5)^2 + (theta(2)-5)^2;
  gradient = zeros(2,1);
  gradient(1) = 2*(theta(1)-5);
  gradient(2) = 2*(theta(2)-5);
end
```
- Then, to actually use this method in Octave to find the parameters, we need:
```matlab
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
```
- exitFlag should be 1 on convergence; check `help fminunc` for details. Note that fminunc will not work if \(\theta\) is just a real number — it requires a vector of dimension at least 2.
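As a cross-check, the same toy optimization can be written in Python with `scipy.optimize.minimize` (an analogue of the Octave code above, not part of the course; `jac=True` tells SciPy that the function returns both the cost and its gradient, like the `[jVal, gradient]` pair):

```python
import numpy as np
from scipy.optimize import minimize

def cost_and_grad(theta):
    # Same toy objective as the Octave example: (theta1-5)^2 + (theta2-5)^2
    j = (theta[0] - 5) ** 2 + (theta[1] - 5) ** 2
    grad = np.array([2 * (theta[0] - 5), 2 * (theta[1] - 5)])
    return j, grad

res = minimize(cost_and_grad, np.zeros(2), jac=True, method='BFGS',
               options={'maxiter': 100})
```

The minimum is reached at \(\theta = (5, 5)\), mirroring the Octave result in optTheta.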
6.7 - Multiclass Classification: One-vs-all
- One-vs-all ⇔ One-vs-rest
- For each class, we train a binary classifier that isolates that class from the other classes. So if we have 3 classes, we compute the hypotheses \(h_\theta^{(1)}(x), h_\theta^{(2)}(x), h_\theta^{(3)}(x)\).
- Then, to classify a new input x, we select the class i that maximizes the corresponding hypothesis \(h_\theta^{(i)}(x)\).
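The one-vs-all prediction rule can be sketched in Python (an illustrative analogue; `thetas` holds one already-trained parameter vector per class):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_all(thetas, x):
    # Evaluate every per-class classifier h_theta^{(i)}(x)
    # and pick the class whose score is largest.
    scores = [sigmoid(theta @ x) for theta in thetas]
    return int(np.argmax(scores))
```

Each classifier answers "is it my class vs. the rest?"; the argmax resolves ties between the K answers in favor of the most confident one.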