IV. Linear Regression with Multiple Variables
4.1 - Multiple Features
- When we have multiple features, we use the following notation convention:
- m : number of training examples,
- n : number of features,
- \(x^{(i)}\) : input (i.e. the feature vector) of the \(i\)th training example.
- \(x^{(i)}_j\) : value of feature \(j\) in the \(i\)th training example.
- The hypothesis will now be in the form:
\[h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + ... + \theta_nx_n\]
- For convenience we define \(x_0=1\), so for every training example \(i\) we set \(x^{(i)}_0 = 1\).
- So, with this convention, the feature vector becomes an element of \(\mathbb{R}^{n+1}\) of the form: \(x = \left[ \begin{matrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{matrix}\right]\).
- We may also represent the parameters of the hypothesis as a vector of \(\mathbb{R}^{n+1}\) : \(\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}\)
- So we can now write the hypothesis as: \(h_\theta(x) = \theta^Tx\).
- We also call linear regression with multiple variables multivariate linear regression.
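As a minimal sketch in Python/NumPy (the course itself uses Octave), with made-up house numbers: once \(x_0 = 1\) is prepended, the hypothesis is a single dot product.

```python
import numpy as np

# Hypothetical feature vector with the convention x_0 = 1 prepended.
x = np.array([1.0, 2104.0, 3.0])     # [x_0, size, number of bedrooms]
theta = np.array([80.0, 0.1, 50.0])  # [theta_0, theta_1, theta_2] (made up)

# h_theta(x) = theta^T x
h = theta @ x
print(h)  # 80 + 0.1*2104 + 50*3 = 440.4
```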
4.2 - Gradient descent for Multiple Variables
- The parameters for the hypothesis are \(\theta_0,\theta_1,\theta_2,\dotsc,\theta_n\); we can think of these parameters simply as a vector \(\theta \in \mathbb{R}^{n+1} \).
- The cost function here is: \(J(\theta_0,\theta_1,\dotsc,\theta_n) = \frac{1}{2m} \sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2\)
- So we just consider the cost function as: \(J(\theta) = \frac{1}{2m} \sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2\)
- So the gradient descent algorithm consists of updating the parameters (simultaneously) until convergence with:
\[\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1,\dotsc,\theta_n)\]
- When computing the partial derivative we get the following update rule: \[\theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) x^{(i)}_j \]
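The update rule above can be computed for all \(j\) at once. Here is a sketch of one simultaneous update in Python/NumPy, on a tiny made-up dataset where X already carries the \(x_0 = 1\) column:

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])   # each row is (x^(i))^T with x_0 = 1
y = np.array([1.0, 2.0, 3.0])
theta = np.zeros(2)
alpha = 0.1
m = len(y)

# theta_j := theta_j - alpha * (1/m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i),
# vectorized over all j simultaneously:
theta = theta - alpha / m * X.T @ (X @ theta - y)
print(theta)  # [0.2, 0.4666...]
```

Starting from \(\theta = 0\), the residuals are \(-y\), so the first update is \(\theta = \frac{\alpha}{m} X^T y\).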
4.3 - Gradient descent in Practice 1 - Feature Scaling
- If we make sure that the features are on a similar scale, then gradient descent can converge more quickly.
- For instance, we can divide each feature by its maximal value, so that we get features \(-1 \le x_j \le 1\) (provided the feature values are positive). The exact range limits are not very important, so other similar ranges like \(0 \le x_1 \le 3\) and \(-4 \le x_2 \le 2\) are fine too.
- We can also do mean normalization : we replace \(x_j\) with \(x_j - \mu_j\) where \(\mu_j = \frac{1}{m} \sum\limits_{i=1}^m x^{(i)}_j\) is the mean of feature \(j\) over the training set, to make features have approximately zero mean (this cannot apply to \(x_0 = 1\)).
- The general mean normalization rule is then \(x_j := \frac{x_j - \mu_j}{s_j}\) where \(s_j\) is the range of the feature (e.g. max − min). Note that we can also use the standard deviation for \(s_j\).
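Mean normalization with the standard deviation as \(s_j\) can be sketched in Python/NumPy as follows (the feature values are made up):

```python
import numpy as np

# Columns: house size and number of bedrooms (hypothetical values).
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])

mu = X.mean(axis=0)        # mu_j: mean of each feature
s = X.std(axis=0)          # s_j: standard deviation of each feature
X_norm = (X - mu) / s      # x_j := (x_j - mu_j) / s_j

print(X_norm.mean(axis=0))  # approximately zero for each feature
```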
4.4 - Gradient descent in Practice 2 - Learning rate
- Depending on the application, gradient descent might converge in 30 iterations, in 3,000 iterations, or in 3,000,000 iterations.
- We could also create an automatic convergence test, declaring for instance that convergence has occurred if \(J(\theta)\) decreases by less than, say, \(10^{-3}\) on a new iteration.
- We can analyze the behavior of gradient descent by plotting the value of \(J(\theta)\) as a function of the number of iterations.
- If this value of \(J(\theta)\) is increasing, gradient descent is not converging and we should use a smaller value for the learning rate \(\alpha\).
- For a sufficiently small value of \(\alpha\), the value of \(J(\theta)\) should decrease on every iteration. But if \(\alpha\) is too small, gradient descent can be slow to converge.
- Concretely, we can plot \(J(\theta)\) for different values of \(\alpha\): 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …
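The ideas above can be sketched in Python/NumPy: run gradient descent for a few values of \(\alpha\) on a made-up dataset, record \(J(\theta)\) per iteration, and stop when the decrease falls below \(10^{-3}\). The dataset, iteration cap, and \(\alpha\) grid are all illustrative choices.

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # x_0 = 1 prepended
y = np.array([2.0, 2.5, 3.0])
m = len(y)

def cost(theta):
    r = X @ theta - y
    return r @ r / (2 * m)          # J(theta)

def run(alpha, iters=200, tol=1e-3):
    theta = np.zeros(2)
    history = [cost(theta)]
    for _ in range(iters):
        theta -= alpha / m * X.T @ (X @ theta - y)
        history.append(cost(theta))
        if history[-2] - history[-1] < tol:  # automatic convergence test
            break
    return theta, history

for alpha in (0.01, 0.1, 0.3):
    theta, history = run(alpha)
    print(alpha, len(history), history[-1])
```

Plotting each `history` against the iteration count gives exactly the \(J(\theta)\)-versus-iterations curves described above.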
4.5 - Features and Polynomial Regression
- We can build new features from the base features, for instance by multiplying basic features, or taking the ratios, etc…
- If we use a polynomial regression, for instance to predict the price of a house from its size, then we have \(h_\theta(x) = \theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3\). So we can choose to write : \(\begin{cases} x_1 = (size) \\ x_2 = (size)^2 \\ x_3 = (size)^3 \end{cases}\) and with this we get back to a regular linear regression model. Note that in that case, feature scaling becomes very important for efficient gradient descent.
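A short Python/NumPy sketch of this feature construction, with made-up sizes, shows why scaling matters: the polynomial columns span wildly different magnitudes.

```python
import numpy as np

size = np.array([1000.0, 1500.0, 2000.0])  # hypothetical house sizes

# Build x_1 = size, x_2 = size^2, x_3 = size^3 as columns:
X = np.column_stack([size, size**2, size**3])

# The columns now span ~1e3 to ~1e9, so gradient descent would struggle;
# mean-normalize each column before training:
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
```

After scaling, the three columns all have zero mean and unit standard deviation, and ordinary linear-regression gradient descent applies unchanged.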
4.6 - Normal Equation
- If we take the simple 1D example, so with \(\theta \in \mathbb{R}\), we have \(J(\theta) = a\theta^2 + b\theta + c\). To minimize this function we would compute its derivative: \(\frac{d}{d\theta} J(\theta) = 2a\theta + b \) and set this derivative to 0.
- For the general case we have \(\theta \in \mathbb{R}^{n+1}\), so we have to set all the partial derivatives to 0: \(\frac{\partial}{\partial\theta_j} J(\theta) = 0 \quad \text{for every } j \). Then we have to solve these equations for \(\theta_0,\theta_1,\dotsc,\theta_n\).
- So, taking all the training features, we can build the matrix \(X \in \mathbb{R}^{m \times (n+1)} \) (n+1 because we add the feature \(x_0 = 1\)). And the vector \(y \in \mathbb{R}^m \) of all target values.
- Then the values of \(\theta\) that will minimize the cost function is given by \(\theta = (X^TX)^{-1}X^Ty\).
- General formulation:
- we have m examples: \( (x^{(1)},y^{(1)}), \dotsc, (x^{(m)},y^{(m)}) \)
- we have n features: \( x^{(i)} = \begin{bmatrix} x_0^{(i)} \\ x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_n^{(i)} \end{bmatrix} \in \mathbb{R}^{n+1} \)
- The design matrix \(X\) mentioned before is then: \( X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix} \in \mathbb{R}^{m \times (n+1)} \)
- in Octave, we would compute this with the command:
pinv(X'*X)*X'*y
- Feature scaling is not needed when using this normal equation computation.
- For the normal equation, we need to compute the inverse matrix \((X^TX)^{-1} \in \mathbb{R}^{(n+1) \times (n+1)}\). Computing a matrix inverse has a complexity of about \(O(n^3)\), which can be very slow when n is large.
- For more than about n = 10,000 features, we should use gradient descent instead.
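The Octave one-liner above translates directly to Python/NumPy. Here is a sketch on a tiny made-up dataset (X already carries the \(x_0 = 1\) column), chosen so the fit is exact:

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])   # design matrix with x_0 = 1 column
y = np.array([2.0, 2.5, 3.0])

# theta = (X^T X)^{-1} X^T y, using the pseudo-inverse as in the
# Octave command pinv(X'*X)*X'*y:
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)  # [1.5, 0.5], since y = 1.5 + 0.5 * x exactly
```

No feature scaling, learning rate, or iteration is needed: the minimizer is obtained in closed form.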
4.7 - Normal Equation Noninvertibility
- Non invertible matrices are called singular or degenerate matrices.
- Two major reasons why \(X^TX\) could be non-invertible:
- Redundant features, e.g. \(x_1\) : size in feet\(^2\) and \(x_2\) : size in m\(^2\) (the features are linearly dependent).
- Too many features (e.g. \(m \le n\)); in that case we can delete some features or use regularization.
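The redundant-feature case can be sketched in Python/NumPy: a feature duplicated in different units makes \(X^TX\) singular, yet `pinv` still returns a sensible (minimum-norm least-squares) solution where a plain inverse would fail. The values and unit conversion below are illustrative.

```python
import numpy as np

x1 = np.array([10.0, 20.0, 30.0])  # size in feet^2 (made up)
x2 = 0.0929 * x1                   # same size in m^2 (approx. conversion)
X = np.column_stack([np.ones(3), x1, x2])
y = np.array([1.0, 2.0, 3.0])

A = X.T @ X
print(np.linalg.matrix_rank(A))    # 2 < 3: A is singular (degenerate)

# pinv still yields a theta that fits the data:
theta = np.linalg.pinv(A) @ X.T @ y
print(np.allclose(X @ theta, y))   # True
```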