## IV. Linear Regression with Multiple Variables

### 4.1 - Multiple Features

• When we have multiple features, we use the following notation conventions:
• $m$ : number of training examples,
• $n$ : number of features,
• $x^{(i)}$ : input (i.e. the features) of the $i$-th training example,
• $x^{(i)}_j$ : value of feature $j$ in the $i$-th training example.
• The hypothesis will now be in the form:

$h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \ldots + \theta_nx_n$

• For convenience we define $x_0 = 1$, so for every training example $i$ we set $x^{(i)}_0 = 1$.
• So, with this convention, the feature vector becomes an $\mathbb{R}^{n+1}$ element of the form: $x = \left[ \begin{matrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{matrix}\right]$.
• We may also represent the parameters of the hypothesis as a vector of $\mathbb{R}^{n+1}$ : $\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}$
• So we can now write the hypothesis as: $h_\theta(x) = \theta^Tx$.
• Linear regression with multiple variables is also called multivariate linear regression.
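The vectorized hypothesis $h_\theta(x) = \theta^Tx$ can be sketched in Python/NumPy (the course itself uses Octave); the feature values and parameters below are made-up illustrations:

```python
import numpy as np

# Two training examples with n = 4 features each (illustrative values).
X = np.array([[2104.0, 5.0, 1.0, 45.0],
              [1416.0, 3.0, 2.0, 40.0]])
m = X.shape[0]

# Prepend the convention x_0 = 1 to every example.
X = np.hstack([np.ones((m, 1)), X])

theta = np.array([80.0, 0.1, 3.0, 0.01, -2.0])  # theta in R^(n+1)

# h_theta(x^{(i)}) = theta^T x^{(i)}, computed for all examples at once.
predictions = X @ theta
print(predictions)
```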

### 4.2 - Gradient descent for Multiple Variables

• The parameters of the hypothesis are $\theta_0,\theta_1,\theta_2,\dotsc,\theta_n$; we can think of them simply as a vector $\theta \in \mathbb{R}^{n+1}$.
• The cost function here is: $J(\theta_0,\theta_1,\dotsc,\theta_n) = \frac{1}{2m} \sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2$
• So we just consider the cost function as: $J(\theta) = \frac{1}{2m} \sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2$
• The gradient descent algorithm then consists of updating the parameters (simultaneously) until convergence with:

$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1,\dotsc,\theta_n)$

• When computing the partial derivative we get the following update rule: $\theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) x^{(i)}_j$
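The update rule above can be sketched in vectorized form as $\theta := \theta - \frac{\alpha}{m} X^T(X\theta - y)$; here is a minimal NumPy sketch (the data, learning rate, and iteration count are illustrative assumptions):

```python
import numpy as np

# Tiny dataset: x_0 = 1 column included, and y = 2 * x_1 exactly.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
m = len(y)

alpha = 0.1
theta = np.zeros(2)
for _ in range(2000):
    gradient = X.T @ (X @ theta - y) / m  # partial derivatives of J(theta)
    theta = theta - alpha * gradient      # simultaneous update of all theta_j

print(theta)  # approaches [0, 2] for this data
```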

### 4.3 - Gradient descent in Practice 1 - Feature Scaling

• If we make sure that the features are on a similar scale, then gradient descent can converge more quickly.
• For instance, we can divide each feature by its maximal value, so that we get features with $-1 \le x_i \le 1$ (provided the feature values are positive). The exact range limits are not very important, so similar ranges such as $0 \le x_1 \le 3$ or $-4 \le x_2 \le 2$ are also acceptable.
• We can also do mean normalization: we replace $x_i$ with $x_i - \mu_i$, where $\mu_i = \frac{1}{m} \sum\limits_{k=1}^m x_i^{(k)}$ is the average of feature $i$ over the training set, so that the features have approximately zero mean (this cannot be applied to $x_0 = 1$).
• The general mean normalization rule is then $x_i := \frac{x_i - \mu_i}{s_i}$, where $s_i$ is the range of the feature (i.e. max − min). Note that we can also use the standard deviation for $s_i$.
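Mean normalization with the range as $s_i$ can be sketched as follows (a NumPy sketch with made-up feature values; the course uses Octave):

```python
import numpy as np

# Three examples, two features (illustrative house sizes and room counts).
X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [1534.0, 4.0]])

mu = X.mean(axis=0)                  # per-feature mean mu_i
s = X.max(axis=0) - X.min(axis=0)    # per-feature range; X.std(axis=0) also works
X_norm = (X - mu) / s

print(X_norm.mean(axis=0))  # approximately zero for every feature
```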

### 4.4 - Gradient descent in Practice 2 - Learning rate

• Depending on the application, gradient descent could converge in 30 iterations, 3,000 iterations, or even 3,000,000 iterations.
• We could also create an automatic convergence test, declaring for instance that convergence occurred if $J(\theta)$ changes by less than some small threshold, say $10^{-3}$, on a new iteration.
• We can analyze the behavior of gradient descent by plotting the value of $J(\theta)$ after each iteration.
• If this value of $J(\theta)$ is increasing, gradient descent is not converging and we should use a smaller value for the learning rate $\alpha$.
• For a sufficiently small value of $\alpha$, $J(\theta)$ should decrease on every iteration. But if $\alpha$ is too small, gradient descent can be slow to converge.
• Concretely, we can plot $J(\theta)$ for different values of $\alpha$: 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, …
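Tracking $J(\theta)$ across iterations for several learning rates can be sketched like this (a NumPy sketch; the dataset and the particular $\alpha$ values are illustrative):

```python
import numpy as np

# Tiny dataset with x_0 = 1 included and y = 2 * x_1.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
m = len(y)

def run(alpha, iters=100):
    """Run gradient descent and record J(theta) after each iteration."""
    theta = np.zeros(2)
    history = []
    for _ in range(iters):
        theta = theta - alpha * X.T @ (X @ theta - y) / m
        history.append(((X @ theta - y) ** 2).sum() / (2 * m))
    return history

for alpha in (0.001, 0.01, 0.1):
    J = run(alpha)
    print(alpha, J[-1])  # a larger (but still stable) alpha converges faster here
```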

### 4.5 - Features and Polynomial Regression

• We can build new features from the base features, for instance by multiplying basic features together, taking ratios, etc.
• If we use polynomial regression, for instance to predict the price of a house from its size, then we have $h_\theta(x) = \theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3$. So we can choose to write: $\begin{cases} x_1 = (size) \\ x_2 = (size)^2 \\ x_3 = (size)^3 \end{cases}$ and with this we get back to a regular linear regression model. Note that in this case, feature scaling becomes very important for efficient gradient descent.
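Building the polynomial features $x_1 = size$, $x_2 = size^2$, $x_3 = size^3$ can be sketched as follows (a NumPy sketch with made-up sizes):

```python
import numpy as np

size = np.array([1.0, 2.0, 3.0, 4.0])  # illustrative house sizes

# Design matrix with x_0 = 1 and the derived polynomial features.
X = np.column_stack([np.ones_like(size), size, size ** 2, size ** 3])
print(X.shape)  # one row per example, columns x_0..x_3

# Without scaling, size^3 dwarfs size, which is why feature scaling matters:
print(X[:, 1].max(), X[:, 3].max())
```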

### 4.6 - Normal Equation

• If we take the simple 1D example, so with $\theta \in \mathbb{R}$, we have $J(\theta) = a\theta^2 + b\theta + c$. To minimize this function we compute its derivative $\frac{d}{d\theta} J(\theta) = 2a\theta + b$ and set it to 0.
• For the general case we have $\theta \in \mathbb{R}^{n+1}$, so we have to set all the partial derivatives to 0: $\frac{\partial}{\partial\theta_j} J(\theta) = 0$ (for every $j$). Then we solve these equations for $\theta_0,\theta_1,\dotsc,\theta_n$.
• So, taking all the training features, we can build the matrix $X \in \mathbb{R}^{m \times (n+1)}$ (n+1 because we add the feature $x_0 = 1$). And the vector $y \in \mathbb{R}^m$ of all target values.
• Then the value of $\theta$ that minimizes the cost function is given by $\theta = (X^TX)^{-1}X^Ty$.
• General formulation:
• we have m examples: $(x^{(1)},y^{(1)}), \dotsc, (x^{(m)},y^{(m)})$
• we have n features: $x^{(i)} = \begin{bmatrix} x_0^{(i)} \\ x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_n^{(i)} \end{bmatrix} \in \mathbb{R}^{n+1}$
• The design matrix $X$ mentioned before is then: $X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix} \in \mathbb{R}^{m \times (n+1)}$
• In Octave, we would compute this with the command: `pinv(X'*X)*X'*y`
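The same computation can be sketched in NumPy, mirroring the Octave command (the dataset below is illustrative):

```python
import numpy as np

# Design matrix with x_0 = 1, and targets satisfying y = 2 * x_1 exactly.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

# Normal equation: theta = pinv(X^T X) X^T y
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)  # recovers [0, 2] for this data
```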
• Feature scaling is not needed when using this normal equation computation.
• For the normal equation, we need to compute the inverse of the matrix $X^TX \in \mathbb{R}^{(n+1) \times (n+1)}$. Computing a matrix inverse has a complexity of about $O(n^3)$, which can be very slow when $n$ is large.
• For more than $n = 10000$ features, we should use gradient descent.

### 4.7 - Normal Equation Noninvertibility

• Non-invertible matrices are called singular or degenerate matrices.
• Two major reasons why $X^TX$ could be non-invertible:
1. Redundant (linearly dependent) features, like $x_1$ : size in feet$^2$ and $x_2$ : size in m$^2$.
2. Too many features (e.g. $m \le n$); in that case we can delete some features or use regularization.