===== XIV - Dimensionality Reduction =====
==== 14.1 - Motivation 1: Data Compression ====
* Can be used, for instance, to reduce from 2D features to 1D features (e.g. if we have 2 redundant features both measuring a length).
* Data compression can also allow faster execution of the algorithms.
==== 14.2 - Motivation 2: Visualization ====
* Can also be useful to visualize a large dataset with many features (by reducing it to 2 or 3 dimensions, for instance).
==== 14.3 - Principal Component Analysis Problem Formulation ====
* PCA tries to find a surface onto which to project the data so as to minimize the distance from each point to its projection (i.e. minimize the **projection error**).
* Generally PCA is used to reduce from n-dimensions to k-dimensions. Thus finding k vectors \(u^{(1)},u^{(2)},\dots,u^{(k)}\) onto which to project the data, so as to minimize the projection error.
* PCA is **not** linear regression: in linear regression we minimize the squared error of the **prediction** (i.e. the error measured along a vertical line), whereas in PCA we minimize the **projection** error (measured orthogonally to the line). Moreover, in PCA there is no "special" value "y"; we only have the features \(x_1,x_2,\dots,x_n\) and all of them are treated equally.
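The contrast above can be checked numerically. A minimal NumPy sketch on hypothetical 2D points (the course itself uses Octave): the vertical squared error of the least-squares line is always at least as large as the orthogonal projection error onto the first principal direction.

```python
import numpy as np

# Hypothetical toy 2D data (illustrative values only)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 1.1, 1.9, 3.1])
pts = np.column_stack([x, y])

# Linear regression: minimize vertical errors (y - prediction)^2
a, b = np.polyfit(x, y, 1)
vertical_err = np.sum((y - (a * x + b)) ** 2)

# PCA: minimize orthogonal (projection) errors onto the first principal direction
pts_c = pts - pts.mean(axis=0)
U, S, Vt = np.linalg.svd((1.0 / len(pts)) * pts_c.T @ pts_c)
u1 = U[:, 0]                               # first principal direction u^(1)
proj = np.outer(pts_c @ u1, u1)            # projection of each point onto u^(1)
projection_err = np.sum(np.linalg.norm(pts_c - proj, axis=1) ** 2)
```

For any line, the orthogonal distance of a point is never larger than its vertical distance, which is why `projection_err` ends up below `vertical_err`.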
==== 14.4 - Principal Component Analysis Algorithm ====
* Data preprocessing: perform **mean normalization** (and feature scaling if needed):
* Compute the mean \(\mu_j = \frac{1}{m} \sum\limits_{i=1}^m x_j^{(i)}\)
* Replace each \(x_j^{(i)}\) with \(x_j^{(i)} - \mu_j\)
* If some features have different scales, then rescale them to get comparable ranges.
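The preprocessing steps above can be sketched in NumPy (the course uses Octave; the data matrix `X` below is a hypothetical example with m examples in rows and n features in columns):

```python
import numpy as np

# Hypothetical data matrix: 3 examples, 2 features with very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

mu = X.mean(axis=0)           # per-feature mean mu_j
X_norm = X - mu               # mean normalization: replace x_j with x_j - mu_j
sigma = X_norm.std(axis=0)    # per-feature scale (here the standard deviation)
X_norm = X_norm / sigma       # feature scaling to comparable ranges
```

After these two steps every feature has zero mean and unit scale, so no single feature dominates the covariance matrix.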
* Algorithm to reduce from n-dimensions to k-dimensions:
* Compute the **covariance matrix**: \(\Sigma = \frac{1}{m} \sum\limits_{i=1}^m (x^{(i)})(x^{(i)})^T\)
* Then compute the eigenvectors of the matrix \(\Sigma\): [U, S, V] = svd(Sigma);
* **SVD** => Singular Value Decomposition (the **svd** function is a bit more numerically stable than the **eig** function in Octave).
* The U matrix returned by Octave will contain the eigenvectors we need in columns: \(U = \begin{bmatrix} \vdots & \vdots & \vdots & & \vdots \\ u^{(1)} & u^{(2)} & u^{(3)} & \dots & u^{(n)} \\ \vdots & \vdots & \vdots & & \vdots\end{bmatrix} \in \mathbb{R}^{n \times n} \).
* And if we want to reduce to k dimensions, we just need to take the first k vectors from this matrix.
* Next we build the matrix \(U_{reduce} = \begin{bmatrix} \vdots & \vdots & \vdots & & \vdots \\ u^{(1)} & u^{(2)} & u^{(3)} & \dots & u^{(k)} \\ \vdots & \vdots & \vdots & & \vdots\end{bmatrix} \in \mathbb{R}^{n \times k} \)
* Finally, we can compute the projections as:\(z^{(i)} = (U_{reduce})^T \cdot x^{(i)} \).
* In Octave we would use something like:<code octave>
[U, S, V] = svd(Sigma);
Ureduce = U(:,1:k);
z = Ureduce'*x;
</code>
* If we have a matrix of all the examples \( x^{(i)} \) given by \(X = \begin{bmatrix} \cdots & (x^{(1)})^T & \cdots \\ \cdots & (x^{(2)})^T & \cdots \\ & \vdots & \\ \cdots & (x^{(m)})^T & \cdots \end{bmatrix} \), then the matrix \(\Sigma\) can be computed in Octave as: Sigma = (1/m) * X' * X;
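Putting the whole algorithm together, here is a NumPy sketch of the Octave steps above (random data stands in for a real, mean-normalized training set; `np.linalg.svd` plays the role of Octave's `svd`):

```python
import numpy as np

np.random.seed(0)
m, n, k = 100, 3, 2                  # 100 examples, 3 features, reduce to 2
X = np.random.randn(m, n)            # hypothetical data matrix, examples in rows
X = X - X.mean(axis=0)               # mean normalization

Sigma = (1.0 / m) * X.T @ X          # covariance matrix, n x n

# SVD of the covariance matrix; the columns of U are the eigenvectors u^(1)..u^(n)
U, S, Vt = np.linalg.svd(Sigma)

U_reduce = U[:, :k]                  # keep the first k eigenvectors, n x k
Z = X @ U_reduce                     # projections z^(i) = U_reduce' * x^(i), m x k
```

Note that `X @ U_reduce` applies \(z^{(i)} = (U_{reduce})^T x^{(i)}\) to every example at once, since the examples are stored as rows of `X`.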
==== 14.5 - Choosing the Number of Principal Components ====
* k: number of principal components.
* Average squared projection error: \(\frac{1}{m} \sum\limits_{i=1}^m \| x^{(i)} - x_{approx}^{(i)} \|^2 \)
* Total variation in the data: \( \frac{1}{m} \sum\limits_{i=1}^m \| x^{(i)} \|^2 \)
* Typically we choose k to be the smallest value so that: \( Err = \frac{\frac{1}{m} \sum\limits_{i=1}^m \| x^{(i)} - x_{approx}^{(i)} \|^2}{\frac{1}{m} \sum\limits_{i=1}^m \| x^{(i)} \|^2} \le 0.01 \) => 99% of variance is retained.
* So we can start with k=1, then 2, 3, etc., until we meet the previous requirement.
* But actually, the SVD decomposition gives us the matrix S, which is a diagonal matrix, and the previous error ratio can be computed as: \( Err= 1 - \frac{\sum\limits_{i=1}^k S_{ii}}{\sum\limits_{i=1}^n S_{ii}} \). Alternatively we can check whether \( \frac{\sum\limits_{i=1}^k S_{ii}}{\sum\limits_{i=1}^n S_{ii}} \ge 0.99 \). (This latter ratio is actually a measure of the **variance retained** by the PCA dimension-reduction process.)
==== 14.6 - Reconstruction from Compressed Representation ====
* We can reconstruct an approximation of the vector x with: \(x_{approx} = U_{reduce} * z\)
* The approximation will actually lie on the PCA projection surface, so this is a good approximation only if the projection error is small (but at least the reconstructed vector is an n-dimensional vector instead of a k-dimensional one).
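A NumPy sketch of compression followed by reconstruction (hypothetical random data in place of a real training set):

```python
import numpy as np

np.random.seed(2)
m, n, k = 50, 4, 2
X = np.random.randn(m, n)                 # hypothetical data, examples in rows
X = X - X.mean(axis=0)                    # mean normalization

U, S, Vt = np.linalg.svd((1.0 / m) * X.T @ X)
U_reduce = U[:, :k]

Z = X @ U_reduce                          # compress:   m x k
X_approx = Z @ U_reduce.T                 # reconstruct x_approx = U_reduce * z, m x n
```

Every row of `X_approx` lies in the k-dimensional subspace spanned by \(u^{(1)},\dots,u^{(k)}\), which is why re-projecting it gives back exactly `Z`.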
==== 14.7 - Advice for Applying PCA ====
* The mapping should be defined by running PCA only on the **training set**. But then we can use the same transformation for examples from the cross-validation set or the test set.
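A minimal sketch of this advice in NumPy (the split sizes and data are hypothetical): the mean \(\mu\) and \(U_{reduce}\) are computed from the training set only, then reused unchanged on the cross-validation set.

```python
import numpy as np

np.random.seed(3)
X_train = np.random.randn(80, 5)          # hypothetical training set
X_cv = np.random.randn(20, 5)             # hypothetical cross-validation set

# Fit the mapping (mu and U_reduce) on the training set ONLY
mu = X_train.mean(axis=0)
m = X_train.shape[0]
Sigma = (1.0 / m) * (X_train - mu).T @ (X_train - mu)
U, S, Vt = np.linalg.svd(Sigma)
U_reduce = U[:, :2]

# Reuse the SAME mu and U_reduce for the cross-validation (or test) examples
Z_train = (X_train - mu) @ U_reduce
Z_cv = (X_cv - mu) @ U_reduce
```

Recomputing the mean or the eigenvectors on the CV/test data would leak information from those sets into the feature mapping, so they must be treated as fixed once learned.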