public:courses:machine_learning:machine_learning:dimensionality_reduction

  • Can be used for instance to reduce 2D features to 1D features (e.g. if we have 2 redundant features that both measure the same length).
  • Data compression can also allow faster execution of the algorithms.
  • Can also be useful to visualize a large dataset with multiple features (by reducing to 2 dimensions for instance).
  • PCA tries to find a lower-dimensional surface onto which to project the data so as to minimize the lengths of the segments from each point to its projection (i.e. minimize the projection error).
  • Generally, PCA is used to reduce from n dimensions to k dimensions: it finds k vectors \(u^{(1)},u^{(2)},\dots,u^{(k)}\) onto which to project the data, so as to minimize the projection error.
  • PCA is not linear regression: in linear regression we try to minimize the squared error of the prediction (i.e. the error measured along a vertical line), whereas in PCA we try to minimize the projection error, measured orthogonally to the projection line. Moreover, in PCA there is no “special” value \(y\); we only have the features \(x_1,x_2,\dots,x_n\), and all of them are treated equally.
  • Data preprocessing: perform mean normalization (and feature scaling if needed; see the sketch after this list):
    • Compute the mean \(\mu_j = \frac{1}{m} \sum\limits_{i=1}^m x_j^{(i)}\)
    • Replace each \(x_j^{(i)}\) with \(x_j^{(i)} - \mu_j\)
    • If some features have different scales, then rescale them to get comparable ranges.
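    • A minimal Octave sketch of this preprocessing step (assuming X is the m×n matrix of training examples, one example per row; variable names here are ours):
      mu = mean(X);                              % 1 x n vector of feature means
      X_norm = bsxfun(@minus, X, mu);            % mean normalization
      sigma = std(X_norm);                       % per-feature standard deviations
      X_norm = bsxfun(@rdivide, X_norm, sigma);  % optional rescaling to comparable ranges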
  • Algorithm to reduce from n-dimensions to k-dimensions:
    • Compute the covariance matrix: \(\Sigma = \frac{1}{m} \sum\limits_{i=1}^m (x^{(i)})(x^{(i)})^T\)
    • Then compute the eigenvectors from the matrix \(\Sigma\):
      [U, S, V] = svd(Sigma);
    • SVD ⇒ Singular Value Decomposition (the svd function is numerically a bit more stable than the eig function in Octave).
    • The U matrix returned from Octave will contain the eigenvectors we need in columns: \(U = \begin{bmatrix} \vdots & \vdots & \vdots & & \vdots \\ u^{(1)} & u^{(2)} & u^{(3)} & \dots & u^{(n)} \\ \vdots & \vdots & \vdots & & \vdots\end{bmatrix} \in \mathbb{R}^{n \times n} \).
    • And if we want to reduce to k dimensions, we just need to take the first k vectors from this matrix.
  • Next we build the matrix \(U_{reduce} = \begin{bmatrix} \vdots & \vdots & \vdots & & \vdots \\ u^{(1)} & u^{(2)} & u^{(3)} & \dots & u^{(k)} \\ \vdots & \vdots & \vdots & & \vdots\end{bmatrix} \in \mathbb{R}^{n \times k} \)
  • Finally, we can compute the projections as: \(z^{(i)} = (U_{reduce})^T \cdot x^{(i)}\).
  • In Octave we would use something like:
    [U, S, V] = svd(Sigma);    % eigenvectors of Sigma in the columns of U
    Ureduce = U(:,1:k);        % keep the first k eigenvectors (columns)
    z = Ureduce'*x;            % project the example x onto the k principal components
  • If we have a matrix of all the examples \( x^{(i)} \) given by \(X = \begin{bmatrix} \cdots & (x^{(1)})^T & \cdots \\ \cdots & (x^{(2)})^T & \cdots \\ & \vdots & \\ \cdots & (x^{(m)})^T & \cdots \end{bmatrix} \). Then the matrix \(\Sigma\) can be computed in Octave as:
     Sigma = (1/m)* X' * X;
  • k: number of principal components.
  • Average squared projection error: \(\frac{1}{m} \sum\limits_{i=1}^m \| x^{(i)} - x_{approx}^{(i)} \|^2 \)
  • Total variation in the data: \( \frac{1}{m} \sum\limits_{i=1}^m \| x^{(i)} \|^2 \)
  • Typically we choose k to be the smallest value so that: \( Err = \frac{\frac{1}{m} \sum\limits_{i=1}^m \| x^{(i)} - x_{approx}^{(i)} \|^2}{\frac{1}{m} \sum\limits_{i=1}^m \| x^{(i)} \|^2} \le 0.01 \) ⇒ 99% of variance is retained.
  • So we can start with k=1, then 2, 3, etc., until we meet the previous requirement.
  • But actually, the SVD decomposition gives us the matrix S, which is a diagonal matrix, and the previous error ratio can be computed as: \( Err = 1 - \frac{\sum\limits_{i=1}^k S_{ii}}{\sum\limits_{i=1}^n S_{ii}} \). Alternatively, we can check whether \( \frac{\sum\limits_{i=1}^k S_{ii}}{\sum\limits_{i=1}^n S_{ii}} \ge 0.99 \) (this latter ratio is actually a measure of the variance retained by the PCA dimension-reduction process).
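  • As a sketch in Octave (variable names are ours), the smallest such k could be found as:
    [U, S, V] = svd(Sigma);
    s = diag(S);                      % diagonal entries S_11, ..., S_nn
    retained = cumsum(s) / sum(s);    % variance retained for k = 1, 2, ..., n
    k = find(retained >= 0.99, 1);    % smallest k retaining at least 99% of the variance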
  • We can reconstruct an approximation of the vector x with: \(x_{approx} = U_{reduce} * z\)
  • The approximation will actually lie on the PCA projection surface, so it is a good approximation only if the projection error is small (but at least the reconstructed vector is an n-dimensional vector instead of a k-dimensional one).
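  • A vectorized Octave sketch (our notation), projecting and then reconstructing all examples at once, given the normalized data matrix X_norm and the Ureduce matrix from above:
    Z = X_norm * Ureduce;       % m x k matrix whose rows are the projections z^(i)
    X_approx = Z * Ureduce';    % m x n reconstructions lying on the PCA surface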
  • The mapping should be defined by running PCA only on the training set. We can then re-use that same transformation for examples from the cross-validation set or the test set.
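  • A minimal sketch of this workflow in Octave (Xtrain, Xtest and k are assumed to be defined; variable names are ours):
    % Fit the mapping on the training set only:
    m = size(Xtrain, 1);
    mu = mean(Xtrain);
    Xtrain_norm = bsxfun(@minus, Xtrain, mu);
    Sigma = (1/m) * (Xtrain_norm' * Xtrain_norm);
    [U, S, V] = svd(Sigma);
    Ureduce = U(:, 1:k);
    Ztrain = Xtrain_norm * Ureduce;
    % Re-use the same mu and Ureduce (no refitting) on the test set:
    Ztest = bsxfun(@minus, Xtest, mu) * Ureduce;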