===== VI - Logistic Regression =====

==== 6.1 - Classification ====

  * Classification problem => discrete value for the **y** to predict, for instance \(y \in \{ 0, 1\}\). In that case, 0 => **Negative Class** and 1 => **Positive Class**.
  * When we have more than 2 classes (eg. \(y \in \{ 0, 1, 2, 3\}\)) the problem is called a **multi-class classification problem**; when we have 2 classes, it is a **binary classification problem**.
  * For classification we could still apply linear regression to get a hypothesis, and then simply use a threshold to get back to discrete values, eg. \(h_\theta(x) \ge 0.5\) predicts "y=1", and \(h_\theta(x) \lt 0.5\) predicts "y=0". But this is not a good idea!! => It can give really wrong predictions (a single far-away training example can shift the fitted line and move the threshold).
  * We will instead build a **Logistic Regression** algorithm producing hypotheses such that: \(0 \le h_\theta(x) \le 1\).

==== 6.2 - Hypothesis Representation ====

  * For the Logistic Regression Model we use: \(h_\theta(x) = g(\theta^Tx)\) with \(g(z) = \frac{1}{1+e^{-z}}\). g(z) is called the **sigmoid function** or the **logistic function**. So we get: \(h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}\).
  * We still need to fit the parameters \(\theta\).
  * In this perspective, we consider that \(h_\theta(x)\) is the estimated probability that y=1 on input x. So we can write: \(h_\theta(x) = P(y=1|x;\theta)\).
  * Note that, since y is always either 1 or 0, we have \(P(y=0|x;\theta) = 1 - P(y=1|x;\theta)\).

==== 6.3 - Decision Boundary ====

  * For the sigmoid function, we notice that \(g(z) \ge 0.5\) for \(z \ge 0\); this happens when \(\theta^Tx \ge 0\).
  * The **decision boundary** is the line that separates the y=1 and y=0 areas on a 2D plot.
  * We can also have non-linear decision boundaries, and for instance predict y=1 if \(-1+x_1^2+x_2^2 \ge 0\). => This would define a circle of radius 1 as decision boundary.
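To make the hypothesis and the non-linear decision boundary above concrete, here is a minimal sketch (in Python rather than Octave, purely for illustration); the `predict` helper and the squared-feature layout are assumptions, not part of the notes:

```python
import math

def sigmoid(z):
    # The logistic function g(z) = 1 / (1 + e^(-z)).
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    # Hypothesis h_theta(x) = g(theta^T x); theta and x are plain lists.
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

# Non-linear boundary from the notes: predict y=1 if -1 + x1^2 + x2^2 >= 0,
# which we get with features [1, x1^2, x2^2] and theta = [-1, 1, 1].
theta = [-1.0, 1.0, 1.0]

def predict(x1, x2):
    # h >= 0.5 exactly when theta^T x >= 0, i.e. outside the unit circle here.
    return 1 if h(theta, [1.0, x1 ** 2, x2 ** 2]) >= 0.5 else 0

print(predict(0.0, 0.0))  # inside the unit circle -> 0
print(predict(2.0, 0.0))  # outside the unit circle -> 1
```

Note that the threshold at 0.5 corresponds to \(\theta^Tx = 0\), since g(0) = 0.5.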
==== 6.4 - Cost Function ====

  * For linear regression, we could define: \(J(\theta) = \frac{1}{m} \sum\limits_{i=1}^m Cost(h_\theta(x^{(i)}),y^{(i)})\), where we define: \(Cost(h_\theta(x^{(i)}),y^{(i)}) = \frac{1}{2}(h_\theta(x^{(i)}) - y^{(i)})^2\).
  * To simplify the notation we get rid of the superscripts, and we write: \(Cost(h_\theta(x),y) = \frac{1}{2}(h_\theta(x) - y)^2\).
  * For logistic regression we cannot plug \(h_\theta(x)\) into this squared-error Cost function: because the sigmoid is non-linear, the resulting \(J(\theta)\) is **non-convex** and gradient descent is not guaranteed to reach its global minimum.
  * So, instead, we define the following cost function for logistic regression: \[Cost(h_\theta(x),y) = \begin{cases} -log(h_\theta(x)) \text{ if } y=1 \\ -log(1 - h_\theta(x)) \text{ if } y=0 \end{cases}\]
  * Note that if y=1 and \(h_\theta(x)=1\) then Cost = 0, but as \(h_\theta(x) \to 0\), then \(Cost \to \infty\).

==== 6.5 - Simplified Cost Function and Gradient Descent ====

  * Since we **always** have y=0 or y=1, we can write the cost function in a single expression: \[Cost(h_\theta(x),y) = -y~log(h_\theta(x)) - (1-y)~log(1 - h_\theta(x))\]
  * So the final cost function is written as: \[ J(\theta) = -\frac{1}{m} \sum\limits_{i=1}^m \left[ y^{(i)}~log(h_\theta(x^{(i)})) + (1-y^{(i)})~log(1 - h_\theta(x^{(i)})) \right] \]
  * This function can be derived from the principle of **maximum likelihood estimation** in statistics.
  * Now what is left is to compute the partial derivatives of the cost function to be able to perform gradient descent. We have here: \[ \frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{m} \sum\limits_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \] => This is exactly the same partial derivative as for linear regression! (except that h() is different).
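The cost and gradient formulas above can be sketched directly in code; this is a plain-Python illustration (not the course's Octave implementation), with `cost_and_grad` and the tiny dataset being assumptions made up for the example:

```python
import math

def cost_and_grad(theta, X, y):
    # J(theta) and its partial derivatives for logistic regression.
    # X is a list of feature rows (each with a leading 1 for the intercept),
    # y a list of 0/1 labels. A readable sketch, not an optimized version.
    m = len(y)
    J = 0.0
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        z = sum(t * xij for t, xij in zip(theta, xi))
        hi = 1.0 / (1.0 + math.exp(-z))               # h_theta(x^(i))
        J += -yi * math.log(hi) - (1 - yi) * math.log(1 - hi)
        for j in range(len(theta)):
            grad[j] += (hi - yi) * xi[j]              # (h - y) * x_j
    return J / m, [g / m for g in grad]

# Sanity check: with theta = 0, h = 0.5 everywhere, so J = log(2) ~ 0.693.
X = [[1.0, 0.0], [1.0, 1.0]]
y = [0, 1]
J, grad = cost_and_grad([0.0, 0.0], X, y)
print(J, grad)
```

Gradient descent then repeatedly updates \(\theta_j := \theta_j - \alpha \, grad[j]\), exactly as for linear regression.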
==== 6.6 - Advanced Optimization ====

  * Once we provide code to compute \(J(\theta)\) and \(\frac{\partial}{\partial\theta_j}J(\theta)\), we can use any of the following algorithms to perform the optimization:
    * Gradient descent
    * Conjugate gradient
    * BFGS
    * L-BFGS
  * Advantages of the algorithms not seen in this class (all except Gradient descent):
    * No need to manually pick \(\alpha\).
    * Often faster than gradient descent.
  * Disadvantages:
    * More complex.
  * To implement this in Octave, we would write something like:

<code octave>
function [jVal, gradient] = costFunction(theta)
  jVal = (theta(1)-5)^2 + (theta(2)-5)^2;
  gradient = zeros(2,1);
  gradient(1) = 2*(theta(1)-5);
  gradient(2) = 2*(theta(2)-5);
end
</code>

  * Then to actually use this method in Octave to find the parameters we need:

<code octave>
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
</code>

  * exitFlag should be 1, check **help fminunc** for details. Note that fminunc will not work if \(\theta\) is just a real number.

==== 6.7 - Multiclass Classification: One-vs-all ====

  * **One-vs-all** <=> **One-vs-rest**
  * For each class, we train a binary classifier that isolates that class from the other classes. So if we have 3 classes, we compute the hypotheses \(h_\theta^{(1)}(x), h_\theta^{(2)}(x), h_\theta^{(3)}(x)\).
  * Then to classify a new input x, we select the class that maximizes the corresponding hypothesis: \(\max\limits_i h_\theta^{(i)}(x)\).
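As a rough Python analogue of the fminunc example, the sketch below minimizes the same toy objective (minimum at \(\theta = (5, 5)\)). It stands in for fminunc with plain fixed-step gradient descent rather than the smarter algorithms listed above; the `gradient_descent` helper and its parameters are assumptions for illustration only:

```python
def cost_function(theta):
    # Same toy objective as the Octave costFunction: (t1-5)^2 + (t2-5)^2.
    j_val = (theta[0] - 5) ** 2 + (theta[1] - 5) ** 2
    gradient = [2 * (theta[0] - 5), 2 * (theta[1] - 5)]
    return j_val, gradient

def gradient_descent(cost_fn, theta, alpha=0.1, iters=100):
    # Plays the role of fminunc here, minus the automatic step-size choice:
    # repeatedly step against the gradient returned by cost_fn.
    for _ in range(iters):
        _, grad = cost_fn(theta)
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

opt_theta = gradient_descent(cost_function, [0.0, 0.0])
print(opt_theta)  # converges close to [5, 5]
```

The interface is the key point: the optimizer only ever sees `cost_function`, so swapping in conjugate gradient or (L-)BFGS would not change the caller's code.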