Dimensionality reduction can be used, for instance, to reduce 2D features to 1D (e.g. if we have 2 features that both measure a length).
Data compression also allows learning algorithms to run faster.
14.2 - Motivation 2: Visualization
Dimensionality reduction is also useful to visualize large datasets with many features (by reducing to 2 or 3 dimensions, for instance).
14.3 - Principal Component Analysis Problem Formulation
PCA tries to find a lower-dimensional surface onto which to project the data so as to minimize the squared lengths of the segments from each point to its projection (i.e. the projection error).
Generally, PCA is used to reduce from n dimensions to k dimensions: it finds k vectors \(u^{(1)},u^{(2)},\dots,u^{(k)}\) onto which to project the data so as to minimize the projection error.
PCA is not linear regression: in linear regression we minimize the squared prediction error (i.e. errors measured along vertical lines), whereas in PCA we minimize the projection error, measured orthogonally to the projection surface. Moreover, in PCA there is no “special” value “y”; we only have the features \(x_1,x_2,\dots,x_n\), and all of them are treated equally.
14.4 - Principal Component Analysis Algorithm
Data preprocessing: perform mean normalization (and optionally feature scaling), as sketched just after this list:
Compute the mean \(\mu_j = \frac{1}{m} \sum\limits_{i=1}^m x_j^{(i)}\)
Replace each \(x_j^{(i)}\) with \(x_j^{(i)} - \mu_j\)
If some features have different scales, then rescale them to get comparable ranges.
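A minimal Octave sketch of this preprocessing step (the names X_norm, mu and s are illustrative, not from the lectures):
% X is the m x n matrix of training examples, one example per row
mu = mean(X);            % 1 x n row vector of feature means
X_norm = X - mu;         % mean normalization (Octave broadcasts mu across the rows)
s = std(X_norm);         % 1 x n row vector of standard deviations
X_norm = X_norm ./ s;    % optional feature scaling to get comparable ranges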
Algorithm to reduce from n-dimensions to k-dimensions:
Compute the covariance matrix: \(\Sigma = \frac{1}{m} \sum\limits_{i=1}^m (x^{(i)})(x^{(i)})^T\)
Then compute the eigenvectors of the matrix \(\Sigma\):
[U, S, V] = svd(Sigma);
SVD ⇒ Singular Value Decomposition (the svd function is a bit more stable than the eig function in Octave)
The U matrix returned by Octave will contain the eigenvectors we need as columns: \(U = \begin{bmatrix} \vdots & \vdots & \vdots & & \vdots \\ u^{(1)} & u^{(2)} & u^{(3)} & \dots & u^{(n)} \\ \vdots & \vdots & \vdots & & \vdots\end{bmatrix} \in \mathbb{R}^{n \times n} \).
And if we want to reduce to k dimensions, we just take the first k columns of this matrix: \(U_{reduce} \in \mathbb{R}^{n \times k}\). Each example is then projected with \(z = U_{reduce}^T \, x \in \mathbb{R}^k\).
If we have a matrix of all the examples \( x^{(i)} \) given by \(X = \begin{bmatrix} \cdots & (x^{(1)})^T & \cdots \\ \cdots & (x^{(2)})^T & \cdots \\ & \vdots & \\ \cdots & (x^{(m)})^T & \cdots \end{bmatrix} \), then the matrix \(\Sigma\) can be computed in Octave as:
Sigma = (1/m)* X' * X;
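Putting the pieces together, here is a sketch of the whole reduction in Octave (assuming the preprocessing above; Ureduce and Z are illustrative names):
% X_norm is the m x n mean-normalized data matrix
Sigma = (1/m) * X_norm' * X_norm;   % n x n covariance matrix
[U, S, V] = svd(Sigma);             % eigenvectors in the columns of U
Ureduce = U(:, 1:k);                % keep the first k columns (n x k)
Z = X_norm * Ureduce;               % m x k matrix whose i-th row is (z^(i))^T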
14.5 - Choosing the Number of Principal Components
Total variation in the data: \( \frac{1}{m} \sum\limits_{i=1}^m \| x^{(i)} \|^2 \)
Typically we choose k to be the smallest value such that: \( Err = \frac{\frac{1}{m} \sum\limits_{i=1}^m \| x^{(i)} - x_{approx}^{(i)} \|^2}{\frac{1}{m} \sum\limits_{i=1}^m \| x^{(i)} \|^2} \le 0.01 \), where \(x_{approx}^{(i)}\) is the projection of \(x^{(i)}\) ⇒ 99% of the variance is retained.
So we can start with k=1, then 2, 3, etc., until we meet the previous requirement.
But actually, the SVD decomposition gives us the matrix S, which is diagonal, and the previous error ratio can be computed as: \( Err = 1 - \frac{\sum\limits_{i=1}^k S_{ii}}{\sum\limits_{i=1}^n S_{ii}} \). Alternatively, we can check whether \( \frac{\sum\limits_{i=1}^k S_{ii}}{\sum\limits_{i=1}^n S_{ii}} \ge 0.99 \) (this latter ratio is actually a measure of the variance retained by the PCA dimensionality reduction).
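In Octave this check reduces to a cumulative sum over the diagonal of S (a sketch; sv and variance_retained are illustrative names):
sv = diag(S);                              % the singular values S_11, ..., S_nn
variance_retained = cumsum(sv) / sum(sv);  % retained variance for k = 1, ..., n
k = find(variance_retained >= 0.99, 1);    % smallest k retaining 99% of the variance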
14.6 - Reconstruction from Compressed Representation
We can reconstruct an approximation of the vector x with: \(x_{approx} = U_{reduce} \, z \in \mathbb{R}^n\)
The approximation will actually lie on the PCA projection surface, so it is a good approximation only if the projection error is small (but at least the reconstructed vector is n-dimensional instead of k-dimensional).
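In Octave the reconstruction is (a sketch, reusing the Ureduce and Z defined earlier):
x_approx = Ureduce * z;    % reconstruct a single example (n x 1)
X_approx = Z * Ureduce';   % reconstruct all examples at once (m x n)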
14.7 - Advice for Applying PCA
The mapping should be defined by running PCA only on the training set; we can then apply the same transformation to examples from the cross validation set or the test set.
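For instance (a sketch; X_train, X_cv and X_test are illustrative names, and feature scaling is omitted for brevity):
% Fit the mapping on the training set only
mu = mean(X_train);
X_train_norm = X_train - mu;
Sigma = (1/m) * X_train_norm' * X_train_norm;
[U, S, V] = svd(Sigma);
Ureduce = U(:, 1:k);
Z_train = X_train_norm * Ureduce;
% Reuse the same mu and Ureduce for the other sets
Z_cv = (X_cv - mu) * Ureduce;
Z_test = (X_test - mu) * Ureduce;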