===== IV. Linear Regression with Multiple Variables =====

==== 4.1 - Multiple Features ====

* When we have multiple features, we use the following notation convention:
  * **n** : number of features,
  * \(x^{(i)}\) : input (i.e. the features) of the i-th training example,
  * \(x^{(i)}_j\) : value of feature j in the i-th training example.
* The hypothesis now takes the form: \[h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \dotsb + \theta_nx_n\]
* For convenience we define \(x_0 = 1\), so for every training example i we set \(x^{(i)}_0 = 1\).
* With this convention, the feature vector becomes an element of \(\mathbb{R}^{n+1}\): \(x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}\).
* We may also represent the **parameters** of the hypothesis as a vector of \(\mathbb{R}^{n+1}\): \(\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}\).
* We can now write the hypothesis as: \(h_\theta(x) = \theta^Tx\).
* Linear regression with multiple variables is also called **multivariate linear regression**.

==== 4.2 - Gradient descent for Multiple Variables ====

* The **parameters** of the hypothesis are \(\theta_0,\theta_1,\theta_2,\dotsc,\theta_n\); we can think of them simply as a vector \(\theta \in \mathbb{R}^{n+1}\).
* The cost function is: \(J(\theta_0,\theta_1,\dotsc,\theta_n) = \frac{1}{2m} \sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2\)
* So we simply write the cost function as: \(J(\theta) = \frac{1}{2m} \sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2\)
* The gradient descent algorithm consists in updating the parameters (**simultaneously**) until convergence with: \[\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1,\dotsc,\theta_n)\]
* Computing the partial derivative gives the following update rule: \[\theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) x^{(i)}_j \]

==== 4.3 - Gradient descent in Practice 1 - Feature Scaling ====

* If we make sure that the features are on a similar scale, gradient descent can converge more quickly.
* For instance, we can divide each feature by its maximal value so that we get features \(-1 \le x_i \le 1\) (provided the feature values are positive). The exact range limits are not very important; other similar ranges such as \(0 \le x_1 \le 3\) or \(-4 \le x_2 \le 2\) are fine too.
* We can also do **mean normalization**: we replace \(x_i\) with \(x_i - \mu_i\), where \(\mu_i = \frac{1}{m} \sum\limits_{k=1}^m x^{(k)}_i\) is the mean of feature i over the training set, so that features have approximately zero mean (this does not apply to \(x_0 = 1\)).
* The general mean normalization rule is then \(x_i := \frac{x_i - \mu_i}{s_i}\) where \(s_i\) is the range of the feature (i.e. max - min). Note that we can also use the standard deviation for \(s_i\). A small Octave sketch of this, together with the gradient descent update of section 4.2, is given below.
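To make the update rule and mean normalization concrete, here is a minimal Octave sketch. The function names (featureNormalize, gradientDescentMulti) and the variable layout are assumptions made for this illustration only, not something prescribed above.

<code octave>
% Mean normalization: X is assumed to be the m x n matrix of raw features
% (without the x0 = 1 column); returns the normalized features, the means
% mu_i and the standard deviations used as s_i.
function [X_norm, mu, sigma] = featureNormalize(X)
  mu = mean(X);                  % 1 x n vector of feature means
  sigma = std(X);                % 1 x n vector of standard deviations (s_i)
  X_norm = (X - mu) ./ sigma;    % x_i := (x_i - mu_i) / s_i, column-wise
end

% Vectorized gradient descent: X is assumed to be the m x (n+1) matrix
% WITH the x0 = 1 column, y the m x 1 target vector.
function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
  m = length(y);
  J_history = zeros(num_iters, 1);
  for iter = 1:num_iters
    h = X * theta;                                % h_theta(x^(i)) for every example
    theta = theta - (alpha / m) * (X' * (h - y)); % simultaneous update of all theta_j
    J_history(iter) = (1 / (2 * m)) * sum((X * theta - y) .^ 2);  % cost J(theta)
  end
end
</code>

A possible usage (names and numbers are hypothetical): normalize the raw features, prepend the column of ones with X = [ones(m, 1), X_norm], then call [theta, J_history] = gradientDescentMulti(X, y, zeros(n + 1, 1), 0.01, 400) and plot J_history to check that \(J(\theta)\) decreases on every iteration (section 4.4).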
==== 4.4 - Gradient descent in Practice 2 - Learning rate ====

* Depending on the application, gradient descent may converge in 30 iterations, or in 3000 iterations, or maybe in 3000000 iterations, etc.
* We could also create an automatic convergence test, declaring for instance that convergence occurred if \(J(\theta)\) changes by less than, say, \(10^{-3}\) on a new iteration.
* We can analyze the behavior of gradient descent by plotting the value of \(J(\theta)\) reached after a given number of iterations.
* If this value of \(J(\theta)\) is increasing, it means we are not converging and we should use a smaller value for the learning rate \(\alpha\).
* For a sufficiently small value of \(\alpha\), the value of \(J(\theta)\) should decrease on every iteration. But if \(\alpha\) is too small, gradient descent can be slow to converge.
* Concretely, we can plot \(J(\theta)\) for different values of \(\alpha\): 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ...

==== 4.5 - Features and Polynomial Regression ====

* We can build new features from the base features, for instance by multiplying basic features together, taking ratios, etc.
* If we use a polynomial regression, for instance to predict the price of a house from its size, then we have \(h_\theta(x) = \theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3\). So we can choose to write: \(\begin{cases} x_1 = (size) \\ x_2 = (size)^2 \\ x_3 = (size)^3 \end{cases}\) and with this we get back to a regular linear regression model. Note that in this case feature scaling becomes very important for efficient gradient descent.

==== 4.6 - Normal Equation ====

* If we take the simple 1D example, so with \(\theta \in \mathbb{R}\), we have \(J(\theta) = a\theta^2 + b\theta + c\). To minimize this function we compute its derivative \(\frac{d}{d\theta} J(\theta) = 2a\theta + b\) and set it to 0.
* For the general case we have \(\theta \in \mathbb{R}^{n+1}\), so we have to set all the partial derivatives to 0: \(\frac{\partial}{\partial\theta_j} J(\theta) = 0\) (for every j). Then we have to solve these equations for \(\theta_0,\theta_1,\dotsc,\theta_n\).
* So, taking all the training features, we can build the matrix \(X \in \mathbb{R}^{m \times (n+1)}\) (n+1 because we add the feature \(x_0 = 1\)), and the vector \(y \in \mathbb{R}^m\) of all target values.
* The value of \(\theta\) that minimizes the cost function is then given by \(\theta = (X^TX)^{-1}X^Ty\).
* General formulation:
  * we have **m** examples: \( (x^{(1)},y^{(1)}), \dotsc, (x^{(m)},y^{(m)}) \)
  * we have **n** features: \( x^{(i)} = \begin{bmatrix} x_0^{(i)} \\ x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_n^{(i)} \end{bmatrix} \in \mathbb{R}^{n+1} \)
  * the **design matrix** X mentioned before is then: \( X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix} \in \mathbb{R}^{m \times (n+1)} \)
* In Octave, we would compute this with the command: pinv(X'*X)*X'*y (see the worked sketch after section 4.7).
* Feature scaling is not needed when using the normal equation.
* For the normal equation, we need to compute the inverse matrix \((X^TX)^{-1} \in \mathbb{R}^{(n+1) \times (n+1)}\). Computing a matrix inverse has a complexity of about \(O(n^3)\), which can be very slow when n is large.
* For more than about n = 10000 features, we should use gradient descent instead.

==== 4.7 - Normal Equation Noninvertibility ====

* Non-invertible matrices are called **singular** or **degenerate** matrices.
* Two major reasons why \(X^TX\) could be non-invertible:
  - Redundant features, e.g. \(x_1\): size in feet\(^2\) and \(x_2\): size in m\(^2\) (the two are linearly dependent).
  - Too many features (e.g. \(m \le n\)); in that case we can delete some features or use regularization.
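Here is a small self-contained Octave sketch of the normal equation built around the pinv(X'*X)*X'*y command mentioned in section 4.6; the data set is made up purely for illustration.

<code octave>
% Normal equation sketch: theta = (X'X)^(-1) X'y, computed with pinv.
% Hypothetical data: house size (feet^2) and number of bedrooms -> price.
X_data = [2104 3; 1600 3; 2400 3; 1416 2];
y = [400; 330; 369; 232];
m = size(X_data, 1);

X = [ones(m, 1), X_data];        % design matrix with the x0 = 1 column
theta = pinv(X' * X) * X' * y;   % closed-form solution; no feature scaling or learning rate needed

% Prediction for a new example, using h_theta(x) = theta^T x:
price = [1, 1650, 3] * theta;
</code>

Using pinv (the pseudo-inverse) rather than inv also returns a usable \(\theta\) when \(X^TX\) is singular, e.g. in the redundant-feature or \(m \le n\) cases discussed in section 4.7.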