===== XIV - Dimensionality Reduction =====

==== 14.1 - Motivation 1: Data Compression ====

  * Can be used for instance to reduce from 2D features to 1D features (e.g. if we have 2 redundant features measuring the same length).
  * Data compression also allows the learning algorithms to run faster.

==== 14.2 - Motivation 2: Visualization ====

  * Can also be useful to visualize a large dataset with many features (by reducing it to 2 dimensions, for instance).

==== 14.3 - Principal Component Analysis Problem Formulation ====

  * PCA tries to find a lower-dimensional surface onto which to project the data so as to minimize the length of the segment from each point to its projection (i.e. to minimize the **projection error**).
  * Generally PCA is used to reduce from n dimensions to k dimensions, i.e. to find k vectors \(u^{(1)},u^{(2)},\dots,u^{(k)}\) onto which to project the data so as to minimize the projection error.
  * PCA is **not** linear regression: linear regression minimizes the squared error of the **prediction** (i.e. the error measured along a vertical line), whereas PCA minimizes the **projection** error (measured orthogonally to the projection surface). Moreover, in PCA there is no "special" value "y"; we only have the features \(x_1,x_2,\dots,x_n\) and all of them are treated equally.

==== 14.4 - Principal Component Analysis Algorithm ====

  * Data preprocessing: perform mean normalization (and, if needed, feature scaling):
    * Compute the mean \(\mu_j = \frac{1}{m} \sum\limits_{i=1}^m x_j^{(i)}\)
    * Replace each \(x_j^{(i)}\) with \(x_j^{(i)} - \mu_j\)
    * If some features have different scales, rescale them to get comparable ranges.
  * Algorithm to reduce from n dimensions to k dimensions:
    * Compute the **covariance matrix**: \(\Sigma = \frac{1}{m} \sum\limits_{i=1}^m (x^{(i)})(x^{(i)})^T\)
    * Then compute the eigenvectors of the matrix \(\Sigma\): [U, S, V] = svd(Sigma);
    * **SVD** => Singular Value Decomposition (the **svd** function is a bit more numerically stable than the **eig** function in Octave).
    * The U matrix returned by Octave will contain the eigenvectors we need in its columns: \(U = \begin{bmatrix} \vdots & \vdots & \vdots & & \vdots \\ u^{(1)} & u^{(2)} & u^{(3)} & \dots & u^{(n)} \\ \vdots & \vdots & \vdots & & \vdots\end{bmatrix} \in \mathbb{R}^{n \times n} \).
    * If we want to reduce to k dimensions, we just take the first k vectors from this matrix.
    * Next we build the matrix \(U_{reduce} = \begin{bmatrix} \vdots & \vdots & \vdots & & \vdots \\ u^{(1)} & u^{(2)} & u^{(3)} & \dots & u^{(k)} \\ \vdots & \vdots & \vdots & & \vdots\end{bmatrix} \in \mathbb{R}^{n \times k} \)
    * Finally, we can compute the projections as \(z^{(i)} = (U_{reduce})^T \cdot x^{(i)} \).
    * In Octave we would use something like: [U, S, V] = svd(Sigma); Ureduce = U(:,1:k); z = Ureduce'*x;
  * If we have a matrix of all the examples \( x^{(i)} \) given by \(X = \begin{bmatrix} \cdots & (x^{(1)})^T & \cdots \\ \cdots & (x^{(2)})^T & \cdots \\ & \vdots & \\ \cdots & (x^{(m)})^T & \cdots \end{bmatrix} \), then the matrix \(\Sigma\) can be computed in Octave as: Sigma = (1/m) * X' * X; (a sketch putting all these steps together follows below).
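To make the steps above concrete, here is a minimal Octave sketch, assuming the examples are stored row-wise in an m x n matrix X and that the target dimension k has already been chosen (the names X, mu, X_norm, Sigma, Ureduce, Z are illustrative, not taken from the course code):

<code octave>
% Minimal PCA sketch: assumes X is an m x n matrix of examples (one per row)
% and k is the target number of dimensions; variable names are illustrative.
m = size(X, 1);

% Mean normalization (optionally also divide each column by its standard
% deviation if the features have very different scales).
mu = mean(X);                    % 1 x n row vector of feature means
X_norm = X - repmat(mu, m, 1);   % subtract the mean from every example

% Covariance matrix (n x n) and its eigenvectors via SVD.
Sigma = (1/m) * (X_norm' * X_norm);
[U, S, V] = svd(Sigma);

% Keep the first k eigenvectors and project the data.
Ureduce = U(:, 1:k);             % n x k
Z = X_norm * Ureduce;            % m x k; row i is (Ureduce' * x^(i))'
</code>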
==== 14.5 - Choosing the Number of Principal Components ====

  * k: number of principal components.
  * Average squared projection error: \(\frac{1}{m} \sum\limits_{i=1}^m \| x^{(i)} - x_{approx}^{(i)} \|^2 \)
  * Total variation in the data: \( \frac{1}{m} \sum\limits_{i=1}^m \| x^{(i)} \|^2 \)
  * Typically we choose k to be the smallest value such that: \( Err = \frac{\frac{1}{m} \sum\limits_{i=1}^m \| x^{(i)} - x_{approx}^{(i)} \|^2}{\frac{1}{m} \sum\limits_{i=1}^m \| x^{(i)} \|^2} \le 0.01 \) => 99% of the variance is retained.
  * So we could start with k=1, then 2, 3, etc. until we meet the previous requirement.
  * But actually, the SVD decomposition gives us the matrix S, which is diagonal, and the previous error ratio can be computed directly as \( Err = 1 - \frac{\sum\limits_{i=1}^k S_{ii}}{\sum\limits_{i=1}^n S_{ii}} \). Alternatively, we can check whether \( \frac{\sum\limits_{i=1}^k S_{ii}}{\sum\limits_{i=1}^n S_{ii}} \ge 0.99 \). (This latter ratio is actually a measure of the **variance retained** by the PCA dimensionality reduction; an Octave sketch of this check is given at the end of this chapter.)

==== 14.6 - Reconstruction from Compressed Representation ====

  * We can reconstruct an approximation of the vector x with: \(x_{approx} = U_{reduce} \cdot z\)
  * The reconstruction will actually lie on the PCA projection surface, so it is a good approximation only if the projection error is small (but at least the reconstructed vector is an n-dimensional vector instead of a k-dimensional one).

==== 14.7 - Advice for Applying PCA ====

  * The mapping should be defined by running PCA only on the **training set**. Then we can apply the same transformation to examples from the cross validation set or the test set (see the sketch below).
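Here is a minimal Octave sketch of the variance-retained check from section 14.5, assuming Sigma is the covariance matrix built as in section 14.4 and that we target 99% of retained variance (variable names are illustrative):

<code octave>
% Pick the smallest k retaining at least 99% of the variance
% (assumes Sigma is the n x n covariance matrix built as in 14.4).
[U, S, V] = svd(Sigma);
s = diag(S);                                 % diagonal entries S_11, ..., S_nn
variance_retained = cumsum(s) / sum(s);      % retained variance for k = 1, ..., n
k = find(variance_retained >= 0.99, 1);      % smallest k meeting the requirement
fprintf('k = %d retains %.2f%% of the variance\n', k, 100 * variance_retained(k));
</code>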
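And a minimal sketch, under the same assumptions, of how the mapping learned on the training set (mu and Ureduce from the sketch after section 14.4) could be reused on cross validation or test examples, together with the reconstruction of section 14.6; Xtest is an illustrative name for the new examples stored row-wise:

<code octave>
% Reuse the mapping learned on the training set (mu, Ureduce) on new examples,
% e.g. a cross validation or test set stored row-wise in Xtest (illustrative name).
Xtest_norm = Xtest - repmat(mu, size(Xtest, 1), 1);   % same mean normalization as training
Ztest = Xtest_norm * Ureduce;                         % each row is (Ureduce' * x)', i.e. z

% Reconstruction from the compressed representation: x_approx = Ureduce * z
% (still in the mean-normalized space; add mu back to return to the original scale).
Xtest_approx = Ztest * Ureduce';
</code>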