\[Cost(h_\theta(x),y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y=1 \\ -\log(1 - h_\theta(x)) & \text{if } y=0 \end{cases}\]
\[Cost(h_\theta(x),y) = -y~\log(h_\theta(x)) - (1-y)~\log(1 - h_\theta(x))\]
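Substituting the two possible labels shows that this single expression reproduces both cases:

\[ y=1: \quad Cost = -1 \cdot \log(h_\theta(x)) - 0 \cdot \log(1 - h_\theta(x)) = -\log(h_\theta(x)) \]
\[ y=0: \quad Cost = -0 \cdot \log(h_\theta(x)) - 1 \cdot \log(1 - h_\theta(x)) = -\log(1 - h_\theta(x)) \]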
\[ J(\theta) = -\frac{1}{m} \sum\limits_{i=1}^m \left[ y^{(i)}~\log(h_\theta(x^{(i)})) + (1-y^{(i)})~\log(1 - h_\theta(x^{(i)})) \right] \]
\[ \frac{\partial}{\partial\theta_j}J(\theta) = \frac{1}{m} \sum\limits_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \]
⇒ This is exactly the same partial derivative as for linear regression! (except that the hypothesis h is different: here it is the sigmoid of the linear combination rather than the linear combination itself).
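A minimal vectorized sketch of this cost and gradient in Octave, assuming a design matrix X whose rows are the training examples, a 0/1 label vector y, and the illustrative name logisticCost:

```matlab
% Sketch of a vectorized cost/gradient for logistic regression (names illustrative).
% Assumes: X is m-by-n with one training example per row, y is m-by-1 with 0/1 labels,
% theta is n-by-1.
function [jVal, gradient] = logisticCost(theta, X, y)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));                          % sigmoid hypothesis for every example
  jVal = -(1/m) * (y' * log(h) + (1 - y)' * log(1 - h));   % J(theta) from the formula above
  gradient = (1/m) * X' * (h - y);                         % partial derivatives for every theta_j
end
```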
```matlab
function [jVal, gradient] = costFunction(theta)
  % Toy cost (theta1 - 5)^2 + (theta2 - 5)^2, minimized at theta = (5, 5)
  jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
  gradient = zeros(2, 1);
  gradient(1) = 2 * (theta(1) - 5);   % d jVal / d theta1
  gradient(2) = 2 * (theta(2) - 5);   % d jVal / d theta2
end
```
```matlab
options = optimset('GradObj', 'on', 'MaxIter', 100);   % use the supplied gradient, cap the iterations
initialTheta = zeros(2, 1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
```
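For this toy cost the minimum lies at θ = (5, 5), so optTheta should come back close to [5; 5], with functionVal near 0 and a positive exitFlag indicating convergence.

The same call pattern works for the logistic regression cost. A sketch, assuming the illustrative logisticCost function above and a design matrix X and label vector y already in the workspace: wrap the extra arguments in an anonymous function so that fminunc sees a function of theta alone.

```matlab
% Sketch: minimizing the logistic regression cost with fminunc.
% Assumes X (m-by-n) and y (m-by-1) exist and logisticCost is the illustrative
% function from the earlier sketch.
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(size(X, 2), 1);                 % one parameter per feature column
[optTheta, functionVal, exitFlag] = fminunc(@(t) logisticCost(t, X, y), initialTheta, options);
```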