Dimensionality reduction can be used, for instance, to reduce 2D features to 1D (e.g. if we have 2 features that both measure a length).
Data compression also allows learning algorithms to run faster.
14.2 - Motivation 2: Visualization
Dimensionality reduction is also useful to visualize large datasets with many features (by reducing to 2 or 3 dimensions, for instance).
14.3 - Principal Component Analysis Problem Formulation
PCA tries to find a lower-dimensional surface onto which to project the data so as to minimize the squared lengths of the segments from each point to its projection (i.e. the projection error).
Generally, PCA is used to reduce from n dimensions to k dimensions: it finds k vectors \(u^{(1)},u^{(2)},\dots,u^{(k)}\) onto which to project the data so as to minimize the projection error.
PCA is not linear regression: in linear regression we minimize the squared prediction error (i.e. errors measured along vertical lines), whereas in PCA we minimize the projection error, measured orthogonally to the projection surface. Moreover, in PCA there is no “special” value “y”; we only have the features \(x_1,x_2,\dots,x_n\), and all of them are treated equally.
14.4 - Principal Component Analysis Algorithm
Data preprocessing: perform mean normalization (and optionally feature scaling), as sketched just after this list:
Compute the mean \(\mu_j = \frac{1}{m} \sum\limits_{i=1}^m x_j^{(i)}\)
Replace each \(x_j^{(i)}\) with \(x_j^{(i)} - \mu_j\)
If some features have different scales, then rescale them to get comparable ranges.
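A minimal Octave sketch of this preprocessing step (the names X_norm, mu and s are illustrative, not from the lectures):
% X is the m x n matrix of training examples, one example per row
mu = mean(X);            % 1 x n row vector of feature means
X_norm = X - mu;         % mean normalization (Octave broadcasts mu across the rows)
s = std(X_norm);         % 1 x n row vector of standard deviations
X_norm = X_norm ./ s;    % optional feature scaling to get comparable ranges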
Algorithm to reduce from n-dimensions to k-dimensions:
Compute the covariance matrix: \(\Sigma = \frac{1}{m} \sum\limits_{i=1}^m (x^{(i)})(x^{(i)})^T\)
Then compute the eigenvectors of the matrix \(\Sigma\):
[U, S, V] = svd(Sigma);
SVD ⇒ Singular Value Decomposition (the svd function is a bit more stable than the eig function in Octave)
The U matrix returned by Octave will contain the eigenvectors we need as columns: \(U = \begin{bmatrix} \vdots & \vdots & \vdots & & \vdots \\ u^{(1)} & u^{(2)} & u^{(3)} & \dots & u^{(n)} \\ \vdots & \vdots & \vdots & & \vdots\end{bmatrix} \in \mathbb{R}^{n \times n} \).
And if we want to reduce to k dimensions, we just take the first k columns of this matrix: \(U_{reduce} \in \mathbb{R}^{n \times k}\). Each example is then projected with \(z = U_{reduce}^T \, x \in \mathbb{R}^k\).
If we have a matrix of all the examples \( x^{(i)} \) given by \(X = \begin{bmatrix} \cdots & (x^{(1)})^T & \cdots \\ \cdots & (x^{(2)})^T & \cdots \\ & \vdots & \\ \cdots & (x^{(m)})^T & \cdots \end{bmatrix} \), then the matrix \(\Sigma\) can be computed in Octave as:
Sigma = (1/m)* X' * X;
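Putting the pieces together, here is a sketch of the whole reduction in Octave (assuming the preprocessing above; Ureduce and Z are illustrative names):
% X_norm is the m x n mean-normalized data matrix
Sigma = (1/m) * X_norm' * X_norm;   % n x n covariance matrix
[U, S, V] = svd(Sigma);             % eigenvectors in the columns of U
Ureduce = U(:, 1:k);                % keep the first k columns (n x k)
Z = X_norm * Ureduce;               % m x k matrix whose i-th row is (z^(i))^T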
14.5 - Choosing the Number of Principal Components
Total variation in the data: \( \frac{1}{m} \sum\limits_{i=1}^m \| x^{(i)} \|^2 \)
Typically we choose k to be the smallest value such that: \( Err = \frac{\frac{1}{m} \sum\limits_{i=1}^m \| x^{(i)} - x_{approx}^{(i)} \|^2}{\frac{1}{m} \sum\limits_{i=1}^m \| x^{(i)} \|^2} \le 0.01 \), where \(x_{approx}^{(i)}\) is the projection of \(x^{(i)}\) ⇒ 99% of the variance is retained.
So we can start with k=1, then 2, 3, etc., until we meet the previous requirement.
But actually, the SVD decomposition gives us the matrix S, which is diagonal, and the previous error ratio can be computed as: \( Err = 1 - \frac{\sum\limits_{i=1}^k S_{ii}}{\sum\limits_{i=1}^n S_{ii}} \). Alternatively, we can check whether \( \frac{\sum\limits_{i=1}^k S_{ii}}{\sum\limits_{i=1}^n S_{ii}} \ge 0.99 \) (this latter ratio is actually a measure of the variance retained by the PCA dimensionality reduction).
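In Octave this check reduces to a cumulative sum over the diagonal of S (a sketch; sv and variance_retained are illustrative names):
sv = diag(S);                              % the singular values S_11, ..., S_nn
variance_retained = cumsum(sv) / sum(sv);  % retained variance for k = 1, ..., n
k = find(variance_retained >= 0.99, 1);    % smallest k retaining 99% of the variance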
14.6 - Reconstruction from Compressed Representation
We can reconstruct an approximation of the vector x with: \(x_{approx} = U_{reduce} \, z \in \mathbb{R}^n\)
The approximation will actually lie on the PCA projection surface, so it is a good approximation only if the projection error is small (but at least the reconstructed vector is n-dimensional instead of k-dimensional).
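In Octave the reconstruction is (a sketch, reusing the Ureduce and Z defined earlier):
x_approx = Ureduce * z;    % reconstruct a single example (n x 1)
X_approx = Z * Ureduce';   % reconstruct all examples at once (m x n)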
14.7 - Advice for Applying PCA
The mapping should be defined by running PCA only on the training set; we can then apply the same transformation to examples from the cross validation set or the test set.
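For instance (a sketch; X_train, X_cv and X_test are illustrative names, and feature scaling is omitted for brevity):
% Fit the mapping on the training set only
mu = mean(X_train);
X_train_norm = X_train - mu;
Sigma = (1/m) * X_train_norm' * X_train_norm;
[U, S, V] = svd(Sigma);
Ureduce = U(:, 1:k);
Z_train = X_train_norm * Ureduce;
% Reuse the same mu and Ureduce for the other sets
Z_cv = (X_cv - mu) * Ureduce;
Z_test = (X_test - mu) * Ureduce;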