===== XIV - Dimensionality Reduction =====
==== 14.1 - Motivation 1: Data Compression ====
* Can be used, for instance, to reduce from 2D features to 1D features (e.g. if we have 2 redundant features both measuring a length).
* Data compression can also allow faster execution of the algorithms.
==== 14.2 - Motivation 2: Visualization ====
* Can also be useful to visualize a large dataset with many features (by reducing it to 2 or 3 dimensions, for instance).
==== 14.3 - Principal Component Analysis Problem Formulation ====
* PCA tries to find a surface onto which to project the data so as to minimize the distance from each point to its projection (i.e. minimize the **projection error**).
* Generally PCA is used to reduce from n-dimensions to k-dimensions. Thus finding k vectors \(u^{(1)},u^{(2)},\dots,u^{(k)}\) onto which to project the data, so as to minimize the projection error.
* PCA is **not** linear regression: in linear regression we minimize the squared error of the **prediction** (i.e. the error measured along a vertical line), whereas in PCA we minimize the **projection** error (measured orthogonally to the line). Moreover, in PCA there is no "special" value "y"; we only have the features \(x_1,x_2,\dots,x_n\) and all of them are treated equally.
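The contrast above can be checked numerically. A minimal NumPy sketch on hypothetical 2D points (the course itself uses Octave): the vertical squared error of the least-squares line is always at least as large as the orthogonal projection error onto the first principal direction.

```python
import numpy as np

# Hypothetical toy 2D data (illustrative values only)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 1.1, 1.9, 3.1])
pts = np.column_stack([x, y])

# Linear regression: minimize vertical errors (y - prediction)^2
a, b = np.polyfit(x, y, 1)
vertical_err = np.sum((y - (a * x + b)) ** 2)

# PCA: minimize orthogonal (projection) errors onto the first principal direction
pts_c = pts - pts.mean(axis=0)
U, S, Vt = np.linalg.svd((1.0 / len(pts)) * pts_c.T @ pts_c)
u1 = U[:, 0]                               # first principal direction u^(1)
proj = np.outer(pts_c @ u1, u1)            # projection of each point onto u^(1)
projection_err = np.sum(np.linalg.norm(pts_c - proj, axis=1) ** 2)
```

For any line, the orthogonal distance of a point is never larger than its vertical distance, which is why `projection_err` ends up below `vertical_err`.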
==== 14.4 - Principal Component Analysis Algorithm ====
* Data preprocessing: perform **mean normalization** (and feature scaling if needed):
* Compute the mean \(\mu_j = \frac{1}{m} \sum\limits_{i=1}^m x_j^{(i)}\)
* Replace each \(x_j^{(i)}\) with \(x_j^{(i)} - \mu_j\)
* If some features have different scales, then rescale them to get comparable ranges.
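The preprocessing steps above can be sketched in NumPy (the course uses Octave; the data matrix `X` below is a hypothetical example with m examples in rows and n features in columns):

```python
import numpy as np

# Hypothetical data matrix: 3 examples, 2 features with very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

mu = X.mean(axis=0)           # per-feature mean mu_j
X_norm = X - mu               # mean normalization: replace x_j with x_j - mu_j
sigma = X_norm.std(axis=0)    # per-feature scale (here the standard deviation)
X_norm = X_norm / sigma       # feature scaling to comparable ranges
```

After these two steps every feature has zero mean and unit scale, so no single feature dominates the covariance matrix.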
* Algorithm to reduce from n-dimensions to k-dimensions:
* Compute the **covariance matrix**: \(\Sigma = \frac{1}{m} \sum\limits_{i=1}^m (x^{(i)})(x^{(i)})^T\)
* Then compute the eigenvectors of the matrix \(\Sigma\): [U, S, V] = svd(Sigma);
* **SVD** => Singular Value Decomposition (the **svd** function is a bit more numerically stable than the **eig** function in Octave).
* The U matrix returned by Octave will contain the eigenvectors we need in columns: \(U = \begin{bmatrix} \vdots & \vdots & \vdots & & \vdots \\ u^{(1)} & u^{(2)} & u^{(3)} & \dots & u^{(n)} \\ \vdots & \vdots & \vdots & & \vdots\end{bmatrix} \in \mathbb{R}^{n \times n} \).
* And if we want to reduce to k dimensions, we just need to take the first k vectors from this matrix.
* Next we build the matrix \(U_{reduce} = \begin{bmatrix} \vdots & \vdots & \vdots & & \vdots \\ u^{(1)} & u^{(2)} & u^{(3)} & \dots & u^{(k)} \\ \vdots & \vdots & \vdots & & \vdots\end{bmatrix} \in \mathbb{R}^{n \times k} \)
* Finally, we can compute the projections as:\(z^{(i)} = (U_{reduce})^T \cdot x^{(i)} \).
* In Octave we would use something like:<code octave>
[U, S, V] = svd(Sigma);
Ureduce = U(:,1:k);
z = Ureduce'*x;
</code>
* If we have a matrix of all the examples \( x^{(i)} \) given by \(X = \begin{bmatrix} \cdots & (x^{(1)})^T & \cdots \\ \cdots & (x^{(2)})^T & \cdots \\ & \vdots & \\ \cdots & (x^{(m)})^T & \cdots \end{bmatrix} \), then the matrix \(\Sigma\) can be computed in Octave as: Sigma = (1/m) * X' * X;
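Putting the whole algorithm together, here is a NumPy sketch of the Octave steps above (random data stands in for a real, mean-normalized training set; `np.linalg.svd` plays the role of Octave's `svd`):

```python
import numpy as np

np.random.seed(0)
m, n, k = 100, 3, 2                  # 100 examples, 3 features, reduce to 2
X = np.random.randn(m, n)            # hypothetical data matrix, examples in rows
X = X - X.mean(axis=0)               # mean normalization

Sigma = (1.0 / m) * X.T @ X          # covariance matrix, n x n

# SVD of the covariance matrix; the columns of U are the eigenvectors u^(1)..u^(n)
U, S, Vt = np.linalg.svd(Sigma)

U_reduce = U[:, :k]                  # keep the first k eigenvectors, n x k
Z = X @ U_reduce                     # projections z^(i) = U_reduce' * x^(i), m x k
```

Note that `X @ U_reduce` applies \(z^{(i)} = (U_{reduce})^T x^{(i)}\) to every example at once, since the examples are stored as rows of `X`.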
==== 14.5 - Choosing the Number of Principal Components ====
* k: number of principal components.
* Average squared projection error: \(\frac{1}{m} \sum\limits_{i=1}^m \| x^{(i)} - x_{approx}^{(i)} \|^2 \)
* Total variation in the data: \( \frac{1}{m} \sum\limits_{i=1}^m \| x^{(i)} \|^2 \)
* Typically we choose k to be the smallest value so that: \( Err = \frac{\frac{1}{m} \sum\limits_{i=1}^m \| x^{(i)} - x_{approx}^{(i)} \|^2}{\frac{1}{m} \sum\limits_{i=1}^m \| x^{(i)} \|^2} \le 0.01 \) => 99% of variance is retained.
* So we can start with k=1, then 2, 3, etc., until we meet the previous requirement.
* But actually, the SVD decomposition gives us the matrix S, which is a diagonal matrix, and the previous error ratio can be computed as: \( Err= 1 - \frac{\sum\limits_{i=1}^k S_{ii}}{\sum\limits_{i=1}^n S_{ii}} \). Alternatively we can check whether \( \frac{\sum\limits_{i=1}^k S_{ii}}{\sum\limits_{i=1}^n S_{ii}} \ge 0.99 \). (This latter ratio is actually a measure of the **variance retained** by the PCA dimension-reduction process.)
==== 14.6 - Reconstruction from Compressed Representation ====
* We can reconstruct an approximation of the vector x with: \(x_{approx} = U_{reduce} * z\)
* The approximation will actually lie on the PCA projection surface, so this is a good approximation only if the projection error is small (but at least the reconstructed vector is an n-dimensional vector instead of a k-dimensional one).
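A NumPy sketch of compression followed by reconstruction (hypothetical random data in place of a real training set):

```python
import numpy as np

np.random.seed(2)
m, n, k = 50, 4, 2
X = np.random.randn(m, n)                 # hypothetical data, examples in rows
X = X - X.mean(axis=0)                    # mean normalization

U, S, Vt = np.linalg.svd((1.0 / m) * X.T @ X)
U_reduce = U[:, :k]

Z = X @ U_reduce                          # compress:   m x k
X_approx = Z @ U_reduce.T                 # reconstruct x_approx = U_reduce * z, m x n
```

Every row of `X_approx` lies in the k-dimensional subspace spanned by \(u^{(1)},\dots,u^{(k)}\), which is why re-projecting it gives back exactly `Z`.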
==== 14.7 - Advice for Applying PCA ====
* The mapping should be defined by running PCA only on the **training set**. But then we can use the same transformation for examples from the cross-validation set or the test set.
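A minimal sketch of this advice in NumPy (the split sizes and data are hypothetical): the mean \(\mu\) and \(U_{reduce}\) are computed from the training set only, then reused unchanged on the cross-validation set.

```python
import numpy as np

np.random.seed(3)
X_train = np.random.randn(80, 5)          # hypothetical training set
X_cv = np.random.randn(20, 5)             # hypothetical cross-validation set

# Fit the mapping (mu and U_reduce) on the training set ONLY
mu = X_train.mean(axis=0)
m = X_train.shape[0]
Sigma = (1.0 / m) * (X_train - mu).T @ (X_train - mu)
U, S, Vt = np.linalg.svd(Sigma)
U_reduce = U[:, :2]

# Reuse the SAME mu and U_reduce for the cross-validation (or test) examples
Z_train = (X_train - mu) @ U_reduce
Z_cv = (X_cv - mu) @ U_reduce
```

Recomputing the mean or the eigenvectors on the CV/test data would leak information from those sets into the feature mapping, so they must be treated as fixed once learned.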