IV. Linear Regression with Multiple Variables

4.1 - Multiple Features

  • When we have multiple features, we use the following notation convention:
    • n : number of features,
    • \(x^{(i)}\) : input (eg. features) of the ith training example.
    • \(x^{(i)}_j\) : value of feature j in the ith training example.
  • The hypothesis will now be in the form:

\[h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \ldots + \theta_nx_n\]

  • For convinience we define \(x_0=1\), so for all training example i we define \(x^{(i)}_0 = 1\).
  • So, with this convention, the feature vector becomes an \(\mathbb{R}^{n+1}\) element of the form: \(x = \left[ \begin{matrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{matrix}\right]\).
  • We may also represent the parameters of the hypothesis as a vector of \(\mathbb{R}^{n+1}\) : \(\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}\)
  • So we can now right the hypothesis as: \(h_\theta(x) = \theta^Tx\).
  • We also call a linear regression with multiple variables a Multivariate linear regression.

4.2 - Gradient descent for Multiple Variables

  • The parameters for the hypothesis are \(\theta_0,\theta_1,\theta_2,\dotsc,\theta_n\), we can think of those parameters simply as a vector \(\theta \in \mathbb{R}^{n+1} \).
  • The cost function here is: \(J(\theta_0,\theta_1,\dotsc,\theta_n) = \frac{1}{2m} \sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2\)
  • So we just consider the cost function as: \(J(\theta) = \frac{1}{2m} \sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2\)
  • So the gradient descent algorithm consist in updating the parameters (simultaneously) until convergence with:

\[\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1,\dotsc,\theta_n)\]

  • When computing the partial derivative we get the following update rule: \[\theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) x^{(i)}_j \]

4.3 - Gradient descent in Practive 1 - Feature Scaling

  • If we make sure that the features are on a similar scale, then gradient descent can converge more quickly.
  • For instance, we can divide each feature by its maximal value, so that we get features \(-1 \le x_i \le 1\) (provided the features values are positive). The range limits here are not very important, so we could have other similar range values like \(0 \le x_1 \le 3\) and \(-4 \le x_2 \le 2\), etc.
  • We can also do mean normalization : we replace \(x_i\) with \(x_i - \mu_i\) where \(\mu_i = \frac{1}{m} \sum\limits_{i=1}^m x_i\) to make features have approximately zero mean (cannot apply to \(x_0 = 1\) ).
  • The general mean normalization rule is then \(x_i := \frac{x_i - \mu_i}{s_i}\) where \(s_i\) is the range of the feature (eg max - min values). Note that we can also use the standard deviation instead for \(s_i\).

4.4 - Gradient descent in Practive 2 - Learning rate

  • Depending on the application gradient descent could converge in 30 iterations or sometimes in 3000 iterations, or maybe 3000000 iterations, etc…
  • We could also create an automatic convergence test saying for instance that convergence occured if \(J(\theta)\) is not changing by more that, say, 10-3 on a new iteration.
  • We can analysis the behavior of gradient descent by plotting the minimal value of \(J(\theta)\) found after a given number of iterations.
  • if this value of \(J(\theta)\) is increasing then it means we are not converging and we should be using a smaller value for the learning rate \(\alpha\).
  • For sufficiently small value of \(\alpha\), the value of \(J(\theta)\) should decrease on evrey iteration. But if \(\alpha\) is too small, then gradient descent can be slow to converge.
  • Concretely we can plot for different values of \alpha: 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …

4.5 - Features and Polynomial Regression

  • We can build new features from the base features, for instance by multiplying basic features, or taking the ratios, etc…
  • If we use a polinomial regression, for instance to predict the price of a house from its size, then we have \(h_\theta(x) = \theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3\). So we can choose to write : \(\begin{cases} x_1 = (size) \\ x_2 = (size)^2 \\ x_3 = (size)^3 \end{cases}\) and with this we get back to a regular linear regression model. Note that in that case, feature scaling can become very important for efficient gradient descent.

4.6 - Normal Equation

  • If we take the simple 1D example, so wit \(\theta \in \mathbb{R}\), we have \(J(\theta) = a\theta^2 + b\theta + c\). To minimise this function we would compute its derivative: \(\frac{d}{d\theta} J(\theta) = 2a\theta + b \) and we set this derivative to 0.
  • For the general case we have \(\theta \in \mathbb{R}\), so we have to set all the partial derivative to 0: \(\frac{\partial}{\partial\theta_j} J(\theta) = 0 \text{(for every j)} \). Then we have to solve these equations for \(\theta_0,\theta_1,\dotsc,\theta_n\).
  • So, taking all the training features, we can build the matrix \(X \in \mathbb{R}^{m \times (n+1)} \) (n+1 because we add the feature \(x_0 = 1\)). And the vector \(y \in \mathbb{R}^m \) of all target values.
  • Then the values of \(\theta\) that will minimize the cost function is given by \(\theta = (X^TX)^{-1}X^Ty\).
  • General formulation:
    • we have m examples: \( (x^{(1)},y^{(1)}), \dotsc, (x^{(m)},y^{(m)}) \)
    • we have n features: \( x^{(i)} = \begin{bmatrix} x_0^{(i)} \\ x_1^{(i)} \\ x_1^{(i)} \\ \vdots \\ x_n^{(i)} \end{bmatrix} \in \mathbb{R}^{n+1} \)
    • The design matrix X, mentioned before is then: \( X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}\ \in \mathbb{R}^{m \times (n+1)} \)
  • in Octave, we would compute this with the command:
    pinv(X'*X)*X'*y
  • Feature scaling is not needed when using this normal equation computation.
  • For normal equation, we need to compute the inverse matrix \((X^TX)^{-1} \in \mathbb{R}^{(n+1) \times (n+1)}\). Computing an matrix inverse has a complexity of about \(O(n^3)\) which can be very slow when n is large.
  • For more that n=10000 features, we should use gradient descent.

4.7 - Normal Equation Noninvertibility

  • Non invertible matrices are called singular or degenerate matrices.
  • Two major reason why \(X^TX\) could be non-invertible:
    1. Redundant features like x1 : size in feet2 and x2 : size in m2.
    2. Too many features (eg. \(m \le n\)), in that case we can delete some features or use regularization.