===== IV. Linear Regression with Multiple Variables =====

==== 4.1 - Multiple Features ====

* When we have multiple features, we use the following notation convention:
  * **n** : number of features,
  * \(x^{(i)}\) : input (i.e. the features) of the i-th training example,
  * \(x^{(i)}_j\) : value of feature j in the i-th training example.
* The hypothesis now takes the form: \[h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2 + \dotsb + \theta_nx_n\]
* For convenience we define \(x_0 = 1\), so for every training example i we set \(x^{(i)}_0 = 1\).
* With this convention, the feature vector becomes an element of \(\mathbb{R}^{n+1}\): \(x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}\).
* We may also represent the **parameters** of the hypothesis as a vector of \(\mathbb{R}^{n+1}\): \(\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}\).
* We can now write the hypothesis as: \(h_\theta(x) = \theta^Tx\).
* Linear regression with multiple variables is also called **multivariate linear regression**.

==== 4.2 - Gradient descent for Multiple Variables ====

* The **parameters** of the hypothesis are \(\theta_0,\theta_1,\theta_2,\dotsc,\theta_n\); we can think of them simply as a vector \(\theta \in \mathbb{R}^{n+1}\).
* The cost function is: \(J(\theta_0,\theta_1,\dotsc,\theta_n) = \frac{1}{2m} \sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2\)
* So we simply write the cost function as: \(J(\theta) = \frac{1}{2m} \sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2\)
* The gradient descent algorithm consists in updating the parameters (**simultaneously**) until convergence with: \[\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0,\theta_1,\dotsc,\theta_n)\]
* Computing the partial derivative gives the following update rule: \[\theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) x^{(i)}_j \]

==== 4.3 - Gradient descent in Practice 1 - Feature Scaling ====

* If we make sure that the features are on a similar scale, gradient descent can converge more quickly.
* For instance, we can divide each feature by its maximal value so that we get features \(-1 \le x_i \le 1\) (provided the feature values are positive). The exact range limits are not very important; other similar ranges such as \(0 \le x_1 \le 3\) or \(-4 \le x_2 \le 2\) are fine too.
* We can also do **mean normalization**: we replace \(x_i\) with \(x_i - \mu_i\), where \(\mu_i = \frac{1}{m} \sum\limits_{k=1}^m x^{(k)}_i\) is the mean of feature i over the training set, so that features have approximately zero mean (this does not apply to \(x_0 = 1\)).
* The general mean normalization rule is then \(x_i := \frac{x_i - \mu_i}{s_i}\) where \(s_i\) is the range of the feature (i.e. max - min). Note that we can also use the standard deviation for \(s_i\). A small Octave sketch of this, together with the gradient descent update of section 4.2, is given below.
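To make the update rule and mean normalization concrete, here is a minimal Octave sketch. The function names (featureNormalize, gradientDescentMulti) and the variable layout are assumptions made for this illustration only, not something prescribed above.

<code octave>
% Mean normalization: X is assumed to be the m x n matrix of raw features
% (without the x0 = 1 column); returns the normalized features, the means
% mu_i and the standard deviations used as s_i.
function [X_norm, mu, sigma] = featureNormalize(X)
  mu = mean(X);                  % 1 x n vector of feature means
  sigma = std(X);                % 1 x n vector of standard deviations (s_i)
  X_norm = (X - mu) ./ sigma;    % x_i := (x_i - mu_i) / s_i, column-wise
end

% Vectorized gradient descent: X is assumed to be the m x (n+1) matrix
% WITH the x0 = 1 column, y the m x 1 target vector.
function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
  m = length(y);
  J_history = zeros(num_iters, 1);
  for iter = 1:num_iters
    h = X * theta;                                % h_theta(x^(i)) for every example
    theta = theta - (alpha / m) * (X' * (h - y)); % simultaneous update of all theta_j
    J_history(iter) = (1 / (2 * m)) * sum((X * theta - y) .^ 2);  % cost J(theta)
  end
end
</code>

A possible usage (names and numbers are hypothetical): normalize the raw features, prepend the column of ones with X = [ones(m, 1), X_norm], then call [theta, J_history] = gradientDescentMulti(X, y, zeros(n + 1, 1), 0.01, 400) and plot J_history to check that \(J(\theta)\) decreases on every iteration (section 4.4).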
==== 4.4 - Gradient descent in Practice 2 - Learning rate ====

* Depending on the application, gradient descent may converge in 30 iterations, or in 3000 iterations, or maybe in 3000000 iterations, etc.
* We could also create an automatic convergence test, declaring for instance that convergence occurred if \(J(\theta)\) changes by less than, say, \(10^{-3}\) on a new iteration.
* We can analyze the behavior of gradient descent by plotting the value of \(J(\theta)\) reached after a given number of iterations.
* If this value of \(J(\theta)\) is increasing, it means we are not converging and we should use a smaller value for the learning rate \(\alpha\).
* For a sufficiently small value of \(\alpha\), the value of \(J(\theta)\) should decrease on every iteration. But if \(\alpha\) is too small, gradient descent can be slow to converge.
* Concretely, we can plot \(J(\theta)\) for different values of \(\alpha\): 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, ...

==== 4.5 - Features and Polynomial Regression ====

* We can build new features from the base features, for instance by multiplying basic features together, taking ratios, etc.
* If we use a polynomial regression, for instance to predict the price of a house from its size, then we have \(h_\theta(x) = \theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3\). So we can choose to write: \(\begin{cases} x_1 = (size) \\ x_2 = (size)^2 \\ x_3 = (size)^3 \end{cases}\) and with this we get back to a regular linear regression model. Note that in this case feature scaling becomes very important for efficient gradient descent.

==== 4.6 - Normal Equation ====

* If we take the simple 1D example, so with \(\theta \in \mathbb{R}\), we have \(J(\theta) = a\theta^2 + b\theta + c\). To minimize this function we compute its derivative \(\frac{d}{d\theta} J(\theta) = 2a\theta + b\) and set it to 0.
* For the general case we have \(\theta \in \mathbb{R}^{n+1}\), so we have to set all the partial derivatives to 0: \(\frac{\partial}{\partial\theta_j} J(\theta) = 0\) (for every j). Then we have to solve these equations for \(\theta_0,\theta_1,\dotsc,\theta_n\).
* So, taking all the training features, we can build the matrix \(X \in \mathbb{R}^{m \times (n+1)}\) (n+1 because we add the feature \(x_0 = 1\)), and the vector \(y \in \mathbb{R}^m\) of all target values.
* The value of \(\theta\) that minimizes the cost function is then given by \(\theta = (X^TX)^{-1}X^Ty\).
* General formulation:
  * we have **m** examples: \( (x^{(1)},y^{(1)}), \dotsc, (x^{(m)},y^{(m)}) \)
  * we have **n** features: \( x^{(i)} = \begin{bmatrix} x_0^{(i)} \\ x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_n^{(i)} \end{bmatrix} \in \mathbb{R}^{n+1} \)
  * the **design matrix** X mentioned before is then: \( X = \begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix} \in \mathbb{R}^{m \times (n+1)} \)
* In Octave, we would compute this with the command: pinv(X'*X)*X'*y (see the worked sketch after section 4.7).
* Feature scaling is not needed when using the normal equation.
* For the normal equation, we need to compute the inverse matrix \((X^TX)^{-1} \in \mathbb{R}^{(n+1) \times (n+1)}\). Computing a matrix inverse has a complexity of about \(O(n^3)\), which can be very slow when n is large.
* For more than about n = 10000 features, we should use gradient descent instead.

==== 4.7 - Normal Equation Noninvertibility ====

* Non-invertible matrices are called **singular** or **degenerate** matrices.
* Two major reasons why \(X^TX\) could be non-invertible:
  - Redundant features, e.g. \(x_1\): size in feet\(^2\) and \(x_2\): size in m\(^2\) (the two are linearly dependent).
  - Too many features (e.g. \(m \le n\)); in that case we can delete some features or use regularization.
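Here is a small self-contained Octave sketch of the normal equation built around the pinv(X'*X)*X'*y command mentioned in section 4.6; the data set is made up purely for illustration.

<code octave>
% Normal equation sketch: theta = (X'X)^(-1) X'y, computed with pinv.
% Hypothetical data: house size (feet^2) and number of bedrooms -> price.
X_data = [2104 3; 1600 3; 2400 3; 1416 2];
y = [400; 330; 369; 232];
m = size(X_data, 1);

X = [ones(m, 1), X_data];        % design matrix with the x0 = 1 column
theta = pinv(X' * X) * X' * y;   % closed-form solution; no feature scaling or learning rate needed

% Prediction for a new example, using h_theta(x) = theta^T x:
price = [1, 1650, 3] * theta;
</code>

Using pinv (the pseudo-inverse) rather than inv also returns a usable \(\theta\) when \(X^TX\) is singular, e.g. in the redundant-feature or \(m \le n\) cases discussed in section 4.7.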