IV. Linear Regression with Multiple Variables
4.1 - Multiple Features
- When we have multiple features, we use the following notation convention:
- m : number of training examples,
- n : number of features,
- \(x^{(i)}\) : input (i.e. the feature vector) of the \(i\)th training example.
- \(x^{(i)}_j\) : value of feature \(j\) in the \(i\)th training example.
- The hypothesis will now be in the form:
\[h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + ... + \theta_nx_n\]
- For convenience we define \(x_0=1\), so for every training example \(i\) we set \(x^{(i)}_0 = 1\).
- So, with this convention, the feature vector becomes an element of \(\mathbb{R}^{n+1}\) of the form: \(x = \left[ \begin{matrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{matrix}\right]\).
- We may also represent the parameters of the hypothesis as a vector of \(\mathbb{R}^{n+1}\) : \(\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}\)
- So we can now write the hypothesis as: \(h_\theta(x) = \theta^Tx\).
- We also call linear regression with multiple variables multivariate linear regression.
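As a minimal sketch in Python/NumPy (the course itself uses Octave), with made-up house numbers: once \(x_0 = 1\) is prepended, the hypothesis is a single dot product.

```python
import numpy as np

# Hypothetical feature vector with the convention x_0 = 1 prepended.
x = np.array([1.0, 2104.0, 3.0])     # [x_0, size, number of bedrooms]
theta = np.array([80.0, 0.1, 50.0])  # [theta_0, theta_1, theta_2] (made up)

# h_theta(x) = theta^T x
h = theta @ x
print(h)  # 80 + 0.1*2104 + 50*3 = 440.4
```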
4.2 - Gradient descent for Multiple Variables
- The parameters for the hypothesis are \(\theta_0,\theta_1,\theta_2,\dotsc,\theta_n\); we can think of these parameters simply as a vector \(\theta \in \mathbb{R}^{n+1} \).
- The cost function here is: \(J(\theta_0,\theta_1,\dotsc,\theta_n) = \frac{1}{2m} \sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2\)
- So we just consider the cost function as: \(J(\theta) = \frac{1}{2m} \sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2\)
- So the gradient descent algorithm consists of updating the parameters (simultaneously) until convergence with:
\[\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1,\dotsc,\theta_n)\]
- When computing the partial derivative we get the following update rule: \[\theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) x^{(i)}_j \]
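The update rule above can be computed for all \(j\) at once. Here is a sketch of one simultaneous update in Python/NumPy, on a tiny made-up dataset where X already carries the \(x_0 = 1\) column:

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])   # each row is (x^(i))^T with x_0 = 1
y = np.array([1.0, 2.0, 3.0])
theta = np.zeros(2)
alpha = 0.1
m = len(y)

# theta_j := theta_j - alpha * (1/m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i),
# vectorized over all j simultaneously:
theta = theta - alpha / m * X.T @ (X @ theta - y)
print(theta)  # [0.2, 0.4666...]
```

Starting from \(\theta = 0\), the residuals are \(-y\), so the first update is \(\theta = \frac{\alpha}{m} X^T y\).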
4.3 - Gradient descent in Practice 1 - Feature Scaling
- If we make sure that the features are on a similar scale, then gradient descent can converge more quickly.
- For instance, we can divide each feature by its maximal value, so that we get features \(-1 \le x_j \le 1\) (provided the feature values are positive). The exact range limits are not very important, so other similar ranges like \(0 \le x_1 \le 3\) and \(-4 \le x_2 \le 2\) are fine too.
- We can also do mean normalization : we replace \(x_j\) with \(x_j - \mu_j\) where \(\mu_j = \frac{1}{m} \sum\limits_{i=1}^m x^{(i)}_j\) is the mean of feature \(j\) over the training set, to make features have approximately zero mean (this cannot apply to \(x_0 = 1\)).
- The general mean normalization rule is then \(x_j := \frac{x_j - \mu_j}{s_j}\) where \(s_j\) is the range of the feature (e.g. max − min). Note that we can also use the standard deviation for \(s_j\).
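Mean normalization with the standard deviation as \(s_j\) can be sketched in Python/NumPy as follows (the feature values are made up):

```python
import numpy as np

# Columns: house size and number of bedrooms (hypothetical values).
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])

mu = X.mean(axis=0)        # mu_j: mean of each feature
s = X.std(axis=0)          # s_j: standard deviation of each feature
X_norm = (X - mu) / s      # x_j := (x_j - mu_j) / s_j

print(X_norm.mean(axis=0))  # approximately zero for each feature
```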
4.4 - Gradient descent in Practice 2 - Learning rate
- Depending on the application, gradient descent might converge in 30 iterations, in 3,000 iterations, or in 3,000,000 iterations.
- We could also create an automatic convergence test, declaring for instance that convergence has occurred if \(J(\theta)\) decreases by less than, say, \(10^{-3}\) on a new iteration.
- We can analyze the behavior of gradient descent by plotting the value of \(J(\theta)\) as a function of the number of iterations.
- If this value of \(J(\theta)\) is increasing, gradient descent is not converging and we should use a smaller value for the learning rate \(\alpha\).
- For a sufficiently small value of \(\alpha\), the value of \(J(\theta)\) should decrease on every iteration. But if \(\alpha\) is too small, gradient descent can be slow to converge.
- Concretely, we can plot \(J(\theta)\) for different values of \(\alpha\): 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …
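The ideas above can be sketched in Python/NumPy: run gradient descent for a few values of \(\alpha\) on a made-up dataset, record \(J(\theta)\) per iteration, and stop when the decrease falls below \(10^{-3}\). The dataset, iteration cap, and \(\alpha\) grid are all illustrative choices.

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # x_0 = 1 prepended
y = np.array([2.0, 2.5, 3.0])
m = len(y)

def cost(theta):
    r = X @ theta - y
    return r @ r / (2 * m)          # J(theta)

def run(alpha, iters=200, tol=1e-3):
    theta = np.zeros(2)
    history = [cost(theta)]
    for _ in range(iters):
        theta -= alpha / m * X.T @ (X @ theta - y)
        history.append(cost(theta))
        if history[-2] - history[-1] < tol:  # automatic convergence test
            break
    return theta, history

for alpha in (0.01, 0.1, 0.3):
    theta, history = run(alpha)
    print(alpha, len(history), history[-1])
```

Plotting each `history` against the iteration count gives exactly the \(J(\theta)\)-versus-iterations curves described above.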
4.5 - Features and Polynomial Regression
- We can build new features from the base features, for instance by multiplying basic features, or taking the ratios, etc…
- If we use a polynomial regression, for instance to predict the price of a house from its size, then we have \(h_\theta(x) = \theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3\). So we can choose to write : \(\begin{cases} x_1 = (size) \\ x_2 = (size)^2 \\ x_3 = (size)^3 \end{cases}\) and with this we get back to a regular linear regression model. Note that in that case, feature scaling becomes very important for efficient gradient descent.
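A short Python/NumPy sketch of this feature construction, with made-up sizes, shows why scaling matters: the polynomial columns span wildly different magnitudes.

```python
import numpy as np

size = np.array([1000.0, 1500.0, 2000.0])  # hypothetical house sizes

# Build x_1 = size, x_2 = size^2, x_3 = size^3 as columns:
X = np.column_stack([size, size**2, size**3])

# The columns now span ~1e3 to ~1e9, so gradient descent would struggle;
# mean-normalize each column before training:
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
```

After scaling, the three columns all have zero mean and unit standard deviation, and ordinary linear-regression gradient descent applies unchanged.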
4.6 - Normal Equation
- If we take the simple 1D example, so with \(\theta \in \mathbb{R}\), we have \(J(\theta) = a\theta^2 + b\theta + c\). To minimize this function we would compute its derivative: \(\frac{d}{d\theta} J(\theta) = 2a\theta + b \) and set this derivative to 0.
- For the general case we have \(\theta \in \mathbb{R}^{n+1}\), so we have to set all the partial derivatives to 0: \(\frac{\partial}{\partial\theta_j} J(\theta) = 0 \quad \text{for every } j \). Then we have to solve these equations for \(\theta_0,\theta_1,\dotsc,\theta_n\).
- So, taking all the training features, we can build the matrix \(X \in \mathbb{R}^{m \times (n+1)} \) (n+1 because we add the feature \(x_0 = 1\)). And the vector \(y \in \mathbb{R}^m \) of all target values.
- Then the values of \(\theta\) that will minimize the cost function is given by \(\theta = (X^TX)^{-1}X^Ty\).
- General formulation:
- we have m examples: \( (x^{(1)},y^{(1)}), \dotsc, (x^{(m)},y^{(m)}) \)
- we have n features: \( x^{(i)} = \begin{bmatrix} x_0^{(i)} \\ x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_n^{(i)} \end{bmatrix} \in \mathbb{R}^{n+1} \)
- The design matrix \(X\) mentioned before is then: \( X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix} \in \mathbb{R}^{m \times (n+1)} \)
- in Octave, we would compute this with the command:
pinv(X'*X)*X'*y
- Feature scaling is not needed when using this normal equation computation.
- For the normal equation, we need to compute the inverse matrix \((X^TX)^{-1} \in \mathbb{R}^{(n+1) \times (n+1)}\). Computing a matrix inverse has a complexity of about \(O(n^3)\), which can be very slow when n is large.
- For more than about n = 10,000 features, we should use gradient descent instead.
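The Octave one-liner above translates directly to Python/NumPy. Here is a sketch on a tiny made-up dataset (X already carries the \(x_0 = 1\) column), chosen so the fit is exact:

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])   # design matrix with x_0 = 1 column
y = np.array([2.0, 2.5, 3.0])

# theta = (X^T X)^{-1} X^T y, using the pseudo-inverse as in the
# Octave command pinv(X'*X)*X'*y:
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)  # [1.5, 0.5], since y = 1.5 + 0.5 * x exactly
```

No feature scaling, learning rate, or iteration is needed: the minimizer is obtained in closed form.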
4.7 - Normal Equation Noninvertibility
- Non invertible matrices are called singular or degenerate matrices.
- Two major reasons why \(X^TX\) could be non-invertible:
- Redundant features, e.g. \(x_1\) : size in feet\(^2\) and \(x_2\) : size in m\(^2\) (the features are linearly dependent).
- Too many features (e.g. \(m \le n\)); in that case we can delete some features or use regularization.
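The redundant-feature case can be sketched in Python/NumPy: a feature duplicated in different units makes \(X^TX\) singular, yet `pinv` still returns a sensible (minimum-norm least-squares) solution where a plain inverse would fail. The values and unit conversion below are illustrative.

```python
import numpy as np

x1 = np.array([10.0, 20.0, 30.0])  # size in feet^2 (made up)
x2 = 0.0929 * x1                   # same size in m^2 (approx. conversion)
X = np.column_stack([np.ones(3), x1, x2])
y = np.array([1.0, 2.0, 3.0])

A = X.T @ X
print(np.linalg.matrix_rank(A))    # 2 < 3: A is singular (degenerate)

# pinv still yields a theta that fits the data:
theta = np.linalg.pinv(A) @ X.T @ y
print(np.allclose(X @ theta, y))   # True
```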