===== VI - Logistic Regression =====
==== 6.1 - Classification ====
* Classification problem => the value **y** to predict is discrete, for instance \(y \in \{ 0, 1\}\). In that case, 0 => **Negative Class** and 1 => **Positive Class**.
* When we have more than 2 classes (e.g. \(y \in \{ 0, 1, 2, 3\}\)) the problem is called a **multi-class classification problem**; when we have 2 classes, it is a **binary classification problem**.
* For classification we could still apply linear regression to get a hypothesis, and then simply use a threshold to get back to discrete values, e.g. \(h_\theta(x) \ge 0.5\) predicts "y=1", and \(h_\theta(x) \lt 0.5\) predicts "y=0". But this is not a good idea !! => a single outlier can shift the fitted line and give really wrong predictions, and \(h_\theta(x)\) can take values outside \([0, 1]\).
* We will create a **Logistic Regression** algorithm producing hypotheses such that: \(0 \le h_\theta(x) \le 1\).
==== 6.2 - Hypothesis Representation ====
* for Logistic Regression Model we use: \(h_\theta(x) = g(\theta^Tx)\) with \(g(z) = \frac{1}{1+e^{-z}}\). g(z) is called the **sigmoid function** or the **logistic function**. So we get: \(h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}\).
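The definition above can be sketched directly in Python (a minimal illustration; the names ''sigmoid'' and ''hypothesis'' are assumed, not from the course):

```python
import math

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z}), maps any real z into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    # h_theta(x) = g(theta^T x), with theta and x as plain Python lists
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)
```

Note that sigmoid(0) = 0.5, and large |z| pushes the output toward 0 or 1, so the hypothesis always stays in (0, 1).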
* We still need to fit the parameters \(\theta\).
* In this perspective, we then consider that \(h_\theta(x)\) is the estimated probability that y=1 on input x. So we can write: \(h_\theta(x) = P(y=1|x;\theta)\).
* Note that, since y is either 0 or 1, \(P(y=0|x;\theta) = 1 - P(y=1|x;\theta)\).
==== 6.3 - Decision Boundary ====
* For the sigmoid function, we notice that \(g(z) \ge 0.5\) for \(z \ge 0\); this happens when \(\theta^Tx \ge 0\).
* **Decision boundary** is the line that separates the y=1 and y=0 areas on a 2D plot.
* We can also have non-linear decision boundaries, and for instance predict y=1 if \(-1+x_1^2+x_2^2 \ge 0\). => This would define a circle of radius 1 as decision boundary.
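This circular boundary is easy to check numerically (a sketch; the helper ''predict'' is just an illustrative name):

```python
def predict(x1, x2):
    # predict y=1 exactly when -1 + x1^2 + x2^2 >= 0,
    # i.e. when (x1, x2) lies on or outside the unit circle
    return 1 if -1 + x1**2 + x2**2 >= 0 else 0
```

Points inside the unit circle get y=0, points on or outside it get y=1.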
==== 6.4 - Cost Function ====
* Recall that for linear regression we could define: \(J(\theta) = \frac{1}{m} \sum\limits_{i=1}^m Cost(h_\theta(x^{(i)}),y^{(i)})\), where we define: \(Cost(h_\theta(x^{(i)}),y^{(i)}) = \frac{1}{2}(h_\theta(x^{(i)}) - y^{(i)})^2\).
* To simplify the notation we get rid of the superscripts, and we write: \(Cost(h_\theta(x),y) = \frac{1}{2}(h_\theta(x) - y)^2\).
* For logistic regression we cannot use this squared-error cost with \(h_\theta(x)\) directly: the resulting \(J(\theta)\) would be **non-convex**, so gradient descent would not be guaranteed to find the global minimum.
* So, instead, we define the following cost function for logistic regression:
\[Cost(h_\theta(x),y) = \begin{cases} -log(h_\theta(x)) \text{ if } y=1 \\ -log(1 - h_\theta(x)) \text{ if } y=0 \end{cases}\]
* Note that if y=1 and \(h_\theta(x)=1\) then Cost = 0, but as \(h_\theta(x) \to 0\), then \(Cost \to \infty\).
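The two branches translate directly into code (a sketch; ''cost'' is an illustrative name):

```python
import math

def cost(h, y):
    # piecewise logistic cost: -log(h) when y=1, -log(1-h) when y=0
    return -math.log(h) if y == 1 else -math.log(1.0 - h)
```

As noted above, cost(1.0, 1) is exactly 0, while the cost grows without bound as the prediction h approaches 0 for a y=1 example.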
==== 6.5 - Simplified Cost Function and Gradient Descent ====
* Since we **always** have y=0 or y=1, we can write the cost function directly as:
\[Cost(h_\theta(x),y) = -y~log(h_\theta(x)) - (1-y)~log(1 - h_\theta(x))\]
* So the final cost function is written as:
\[ J(\theta) = -\frac{1}{m} \sum\limits_{i=1}^m y^{(i)}~log(h_\theta(x^{(i)})) + (1-y^{(i)})~log(1 - h_\theta(x^{(i)})) \]
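In code, \(J(\theta)\) is the average of the simplified per-example cost over the training set (a sketch with plain Python lists; the names are assumptions, not from the course):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def J(theta, X, y):
    # J(theta) = -(1/m) * sum_i [ y_i*log(h_i) + (1-y_i)*log(1-h_i) ]
    m = len(y)
    total = 0.0
    for x_i, y_i in zip(X, y):
        h = sigmoid(sum(t * xj for t, xj in zip(theta, x_i)))
        total += y_i * math.log(h) + (1 - y_i) * math.log(1.0 - h)
    return -total / m
```

With \(\theta = 0\) every prediction is 0.5, so each example costs \(-\log(0.5) = \log 2\); parameters that fit the data drive the cost toward 0.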
* The function can be derived from the principle of **maximum likelihood estimation** in statistics.
* Now what is left is to compute the partial derivative of the cost function to be able to perform gradient descent. We have here:
\[ \frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{m} \sum\limits_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \]
=> This is exactly the same partial derivative as for linear regression! (except that h() is different).
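The gradient formula can be sketched the same way (illustrative names again; only \(h\) differs from the linear-regression version):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(theta, X, y):
    # dJ/dtheta_j = (1/m) * sum_i (h_theta(x_i) - y_i) * x_ij
    m = len(y)
    grad = [0.0] * len(theta)
    for x_i, y_i in zip(X, y):
        h = sigmoid(sum(t * xj for t, xj in zip(theta, x_i)))
        for j, x_ij in enumerate(x_i):
            grad[j] += (h - y_i) * x_ij / m
    return grad
```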
==== 6.6 - Advanced Optimization ====
* Once we provide code to compute \(J(\theta)\) and \(\frac{\partial}{\partial\theta_j}J(\theta)\), then we can use any of the following algorithms to perform the optimization:
* Gradient descent
* Conjugate gradient
* BFGS
* L-BFGS
  * Advantages of the algorithms not seen in this class (all except Gradient descent):
* No need to manually pick \(\alpha\).
* Often faster than gradient descent.
* Disadvantages:
* More complex.
* To implement this in Octave, we would write something like:
<code octave>
function [jVal, gradient] = costFunction(theta)
  jVal = (theta(1)-5)^2 + (theta(2)-5)^2;
  gradient = zeros(2,1);
  gradient(1) = 2*(theta(1)-5);
  gradient(2) = 2*(theta(2)-5);
end
</code>
* Then to actually use this method in Octave to find the parameters we need:
<code octave>
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
</code>
* exitFlag should be 1, check **help fminunc** for details. Note that fminunc will not work if \(\theta\) is just a real number.
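For intuition, the same toy objective can be minimized with plain gradient descent (a Python sketch, not the course's Octave code; note we must pick \(\alpha\) by hand, which is exactly what the fminunc-style methods avoid):

```python
def cost_and_gradient(theta):
    # same toy objective as the Octave example: (t1-5)^2 + (t2-5)^2
    j = (theta[0] - 5) ** 2 + (theta[1] - 5) ** 2
    grad = [2 * (theta[0] - 5), 2 * (theta[1] - 5)]
    return j, grad

def gradient_descent(theta, alpha=0.1, iters=100):
    # repeatedly step opposite the gradient; alpha is hand-picked
    for _ in range(iters):
        _, grad = cost_and_gradient(theta)
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

opt = gradient_descent([0.0, 0.0])
# opt approaches the minimizer (5, 5)
```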
==== 6.7 - Multiclass Classification: One-vs-all ====
* **One-vs-all** <=> **One-vs-rest**
* For each Class, we will train a binary classifier that isolates that class from the other classes. So if we have 3 classes, we will compute the hypotheses \(h_\theta^{(1)}(x), h_\theta^{(2)}(x), h_\theta^{(3)}(x)\).
* Then to classify a new input x, we select the class that maximizes the corresponding hypothesis.
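Prediction then reduces to an argmax over the per-class hypotheses (a sketch; the three classifiers are stubbed here with fixed, made-up \(\theta\) vectors purely for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def predict_one_vs_all(thetas, x):
    # pick the class whose binary classifier is most confident
    scores = [h(theta, x) for theta in thetas]
    return max(range(len(scores)), key=lambda k: scores[k])

# three illustrative classifiers on features (1, x1)
thetas = [[2.0, -3.0], [0.0, 0.0], [-2.0, 3.0]]
```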