===== XII. Support Vector Machines =====

==== 12.1 - Optimization Objective ====

  * SVMs sometimes give a cleaner way than neural networks to learn non-linear hypotheses.
  * We start from the "elemental" cost function of logistic regression, which is \(-\log \frac{1}{1+e^{-z}} \) for an example where y=1 and \(-\log \left(1 - \frac{1}{1+e^{-z}}\right)\) when y=0. We then replace these with the simplified cost functions \(cost_1(z)\) and \(cost_0(z)\), whose cost is 0 respectively when z > 1 and when z < -1 (and which grow linearly elsewhere).
  * So for the support vector machine we start from the cost function: \[J(\theta) = \frac{1}{m} \sum\limits_{i=1}^m \left[ y^{(i)} cost_1(\theta^Tx^{(i)}) + (1-y^{(i)})cost_0(\theta^Tx^{(i)}) \right] + \frac{\lambda}{2m} \sum\limits_{j=1}^n \theta_j^2\]
  * By convention, for SVMs, we remove the \(\frac{1}{m}\) coefficient.
  * Also by convention, for SVMs we do not use the \(\lambda\) parameter: instead we control the weight of the first part of the equation. So we effectively replace \(A + \lambda B\) by \(C A + B\). From that perspective we can consider that \(C = \frac{1}{\lambda}\).
  * So the overall cost function for the SVM is rather: \[J(\theta) = C \sum\limits_{i=1}^m \left[y^{(i)} cost_1(\theta^Tx^{(i)}) + (1-y^{(i)})cost_0(\theta^Tx^{(i)}) \right] + \frac{1}{2} \sum\limits_{j=1}^n \theta_j^2\]
  * Finally, the SVM hypothesis does not output a probability; instead it outputs the prediction 1 or 0 directly: \(h_\theta(x) = \begin{cases} 1 & \text{if }\theta^Tx \ge 0 \\ 0 & \text{otherwise} \end{cases}\)

==== 12.2 - Large Margin Intuition ====

  * SVM <=> **Large margin classifier** (i.e. it keeps large margins around the decision boundary):
    * if y=1, we want \(\theta^Tx \ge 1\) (not just > 0),
    * if y=0, we want \(\theta^Tx \le -1\) (not just < 0).

==== 12.3 - Mathematics Behind Large Margin Classification ====

  * Given 2 vectors **u** and **v**, \(u^Tv\) is sometimes called the inner product of u and v.
  * We can also write \(u^Tv = p \cdot \|u\| = \|u\| \, \|v\| \cos(\phi)\), where p is the (signed) length of the projection of v onto u and \(\phi\) is the angle between u and v.
  * So \(\theta^Tx^{(i)} = p^{(i)} \cdot \|\theta\|\), where \(p^{(i)}\) is the projection of the vector \(x^{(i)}\) onto the vector \(\theta\).

==== 12.4 - Kernels I ====

  * We now use the notation \(f_i\) to denote the features, so that we compute the hypothesis as \(h_\theta(x) = \theta^Tf\).
  * We can select **landmark** points \(l^{(i)}\), for now manually.
  * Given an example x, we then define the new feature \(f_i = similarity(x,l^{(i)}) = e^{- \frac{\|x - l^{(i)}\|^2}{2\sigma^2}}\).
  * The similarity function used above is also called a **kernel**, and in this case we use a **Gaussian kernel**: \(k(x,l^{(i)}) = similarity(x,l^{(i)}) \).

==== 12.5 - Kernels II ====

  * Choosing the landmarks => we can start by placing a landmark at the location of each training example \(x^{(i)}\), so we end up with m landmarks.
  * Then for a given example x, we compute the features \(f_1, f_2, \dots, f_m\), and by convention we can still define \(f_0 = 1\) (see the Octave sketch at the end of this section).
  * So we move from \(x^{(i)} \in \mathbb{R}^{n+1}\) to \( f^{(i)} \in \mathbb{R}^{m+1}\). Now we have \(\theta \in \mathbb{R}^{m+1} \) too.
  * To do the training we minimize: \[C \sum\limits_{i=1}^m \left[y^{(i)} cost_1(\theta^Tf^{(i)}) + (1-y^{(i)})cost_0(\theta^Tf^{(i)}) \right] + \frac{1}{2} \sum\limits_{j=1}^n \theta_j^2 \] => here we have **n = m**.
  * In most SVM implementations, the "regularization element" \( \sum\limits_{j=1}^n \theta_j^2 = \theta^T\theta\) (if we ignore \(\theta_0\)) is actually written \(\theta^TM\theta\) instead, where M is a matrix that depends on the type of kernel we use.
  * Large C => lower bias, higher variance.
  * Small C => higher bias, lower variance.
  * We also need to choose \(\sigma^2\):
    * large value: features vary more smoothly => higher bias, lower variance,
    * small value: features vary less smoothly => lower bias, higher variance.
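To make the feature construction of sections 12.4 and 12.5 concrete, here is a minimal Octave sketch (the function name gaussianFeatures and its arguments are illustrative, not from any particular SVM package): given an example x and the m landmarks stored as the rows of L (here, the training examples themselves), it returns the feature vector \(f \in \mathbb{R}^{m+1}\) with \(f_0 = 1\).

<code octave>
% Minimal sketch of the Gaussian-kernel feature mapping (illustrative names).
% L is an m x n matrix whose rows are the landmarks (here, the training
% examples), x is a 1 x n example, and sigma is the kernel width.
function f = gaussianFeatures(x, L, sigma)
  m = size(L, 1);
  f = zeros(m + 1, 1);
  f(1) = 1;                            % f_0 = 1 by convention
  for i = 1:m
    d = x(:) - L(i, :)';               % x - l^(i) as a column vector
    f(i + 1) = exp(-(d' * d) / (2 * sigma^2));
  end
end
</code>

The hypothesis then simply predicts 1 whenever \(\theta^Tf \ge 0\), exactly as in section 12.1 but with f in place of x.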
==== 12.6 - Using an SVM ====

  * Use an existing package, for instance liblinear or libsvm.
  * We need to specify the parameter C and the kernel.
  * We can use no kernel at all <=> **linear kernel** (to use for instance when n is large and m is small).
  * Or use a Gaussian kernel, in which case we also need to choose \(\sigma^2\). In Octave, the kernel could be written as:

<code octave>
function f = kernel(x1, x2)
  % Gaussian kernel; sigma is assumed to be defined beforehand (e.g. as a global)
  f = exp(-(x1 - x2)' * (x1 - x2) / (2 * sigma^2));
end
</code>

  * => We need to perform feature scaling before using the Gaussian kernel.
  * Valid kernels need to satisfy a technical condition called **Mercer's theorem**, to avoid divergence in SVM packages.
  * There are also polynomial kernels: \((x^Tl)^2\), \((x^Tl)^3\), \((x^Tl + 5)^4\), etc. (they usually perform worse than Gaussian kernels).
  * Or the string kernel, chi-square kernel, histogram intersection kernel, etc.

=== Multi-class classification ===

  * Use one-vs-all => train K SVMs, then pick the class with the largest \((\theta^{(i)})^Tx\) (see the sketch below).
  * If n is large relative to m, then use logistic regression or an SVM without a kernel (i.e. a linear kernel).
  * If n is small and m is intermediate (n = 1-1000, m = 10-10000), then use an SVM with a Gaussian kernel.
  * If n is small and m is large (n = 1-1000, m = 50000+), then try to create more features manually and use logistic regression or an SVM without a kernel (SVM packages will have a hard time when m is that big).
  * For all the situations above a neural network would work fine too, but it might be slower to train.
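To illustrate the one-vs-all step above, here is a minimal Octave sketch (the function name predictOneVsAll and the layout of Theta are assumptions made for this example, not part of any package): Theta holds one row of parameters per class, and we predict the class whose SVM returns the largest score.

<code octave>
% One-vs-all prediction sketch (illustrative names).
% Theta is a K x (n+1) matrix whose i-th row is theta^(i), the parameters of
% the SVM trained to separate class i from all the others; x is an (n+1) x 1
% example with x(1) = 1 for the intercept term.
function c = predictOneVsAll(Theta, x)
  scores = Theta * x;        % scores(i) = (theta^(i))^T x
  [~, c] = max(scores);      % predict the class with the largest score
end
</code>

The same idea applies when each SVM is trained on kernel features f rather than on x directly.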