Support Vector Machines

  • SVMs are sometimes a cleaner way than neural networks to learn non-linear hypotheses.
  • We start from the per-example cost of logistic regression, which is \(-\log \frac{1}{1+e^{-z}} \) for an example where y=1 and \(-\log (1 - \frac{1}{1+e^{-z}})\) when y=0. We then replace these with simplified piecewise-linear costs \(cost_1(z)\) and \(cost_0(z)\) that are 0 when \(z \ge 1\) and when \(z \le -1\) respectively.
  • So for support vector machine, we have the cost function:

\[J(\theta) = \frac{1}{m} \sum\limits_{i=1}^m \left[ y^{(i)} cost_1(\theta^Tx^{(i)}) + (1-y^{(i)})cost_0(\theta^Tx^{(i)}) \right] + \frac{\lambda}{2m} \sum\limits_{j=1}^n \theta_j^2\]
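
The course does not pin down the exact slope of \(cost_1\) and \(cost_0\); a minimal Octave sketch using the standard hinge shape (zero past the margin, linear inside it) could look like this, with the original logistic costs shown for comparison:

    % Hedged sketch: piecewise-linear surrogates for the logistic costs.
    % cost1(z) is 0 for z >= 1, cost0(z) is 0 for z <= -1 (a slope of 1 is assumed here).
    cost1 = @(z) max(0, 1 - z);    % used when y = 1
    cost0 = @(z) max(0, 1 + z);    % used when y = 0

    % Logistic-regression per-example costs, for comparison:
    logcost1 = @(z) -log(1 ./ (1 + exp(-z)));
    logcost0 = @(z) -log(1 - 1 ./ (1 + exp(-z)));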

  • By convention, for SVM, we remove the \(\frac{1}{m}\) coefficient.
  • Also by convention, for SVM we do not use the \(\lambda\) parameter: instead we control the weight of the first part of the objective. So we effectively replace \(A + \lambda B\) with \(C A + B\); from that perspective we can think of \(C\) as \(\frac{1}{\lambda}\).
  • So the overall cost function for SVM is rather:

\[J(\theta) = C \sum\limits_{i=1}^m \left[y^{(i)} cost_1(\theta^Tx^{(i)}) + (1-y^{(i)})cost_0(\theta^Tx^{(i)}) \right] + \frac{1}{2} \sum\limits_{j=1}^n \theta_j^2\]
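
As a rough illustration only (the hinge form of \(cost_1\)/\(cost_0\) and the function name are assumptions, not from the course), this objective could be evaluated in Octave as:

    % Hedged sketch of the SVM objective above.
    % X is m x (n+1) with a leading column of ones, y is m x 1 in {0,1}, theta is (n+1) x 1.
    function J = svm_cost(theta, X, y, C)
      z = X * theta;
      cost1 = max(0, 1 - z);                 % assumed hinge form of cost_1
      cost0 = max(0, 1 + z);                 % assumed hinge form of cost_0
      reg = 0.5 * sum(theta(2:end) .^ 2);    % theta_0 is not regularized
      J = C * sum(y .* cost1 + (1 - y) .* cost0) + reg;
    end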

  • Finally, the SVM hypothesis does not output a probability; instead, it outputs a prediction of 1 or 0 directly: \(h_\theta(x) = \begin{cases} 1 & \text{if }\theta^Tx \ge 0 \\ 0 & \text{otherwise} \end{cases}\)
  • SVM ⇔ Large margin classifier (i.e. it keeps large margins around the decision boundary)
  • if y=1, we want \(\theta^Tx \ge 1\) (not just >0)
  • if y=0, we want \(\theta^Tx \le -1\) (not just <0)
  • Given 2 vectors u and v, \(u^Tv\) is sometimes called the inner product of u and v.
  • We can also write \(u^Tv = p \cdot \|u\| = cos(\theta) \cdot \|u\| \cdot \|v\|\), where p is the (signed) projection of v onto u. For example, with \(u = (4, 0)\) and \(v = (3, 3)\): \(u^Tv = 12\), the projection of v onto u is \(p = 3\), and \(p \cdot \|u\| = 3 \times 4 = 12\).
  • So \(\theta^Tx^{(i)} = p^{(i)} \cdot \|\theta\|\), where \(p^{(i)}\) is the projection of the vector \(x^{(i)}\) onto the vector \(\theta\).
  • We now use the notation \(f_i\) to denote the features, so that we compute the hypothesis as \(h_\theta(x) = \theta^Tf\).
  • For now, we select the landmark points \(l^{(i)}\) manually.
  • Given an example x, we then define the new feature \(f_i = similarity(x,l^{(i)}) = e^{- \frac{\|x - l^{(i)}\|^2}{2\sigma^2}}\).
  • The similarity function used above is also called a kernel, and in this case we use a Gaussian kernel: \(k(x,l^{(i)}) = similarity(x,l^{(i)})\).
  • Choosing the landmarks ⇒ we can start by placing a landmark at the location of each training example \(x^{(i)}\), so we end up with m landmarks.
  • Then for a given example x, we compute the features \(f_1, f_2, \dots, f_m\), and by convention we can still define \(f_0 = 1\) (a sketch of this feature mapping is given below, after the training objective).
  • So we move from \(x^{(i)} \in \mathbb{R}^{n+1}\) to \( f^{(i)} \in \mathbb{R}^{m+1}\). Now we have \(\theta \in \mathbb{R}^{m+1} \) too.
  • To do the training we minimize:

\[C \sum\limits_{i=1}^m \left[y^{(i)} cost_1(\theta^Tf^{(i)}) + (1-y^{(i)})cost_0(\theta^Tf^{(i)}) \right] + \frac{1}{2} \sum\limits_{j=1}^n \theta_j^2 \]

⇒ here we have n = m.
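
A minimal sketch of the feature mapping described above, assuming the landmarks are simply the training examples (the function name is illustrative):

    % Hedged sketch: map an example x (n x 1, without the bias term) to the
    % kernel features f_0, ..., f_m, using every training example as a landmark.
    function f = gaussian_features(x, L, sigma)
      % L is m x n, one landmark l^(i) per row (here, the training examples).
      m = size(L, 1);
      f = zeros(m + 1, 1);
      f(1) = 1;                                   % f_0 = 1 by convention
      for i = 1:m
        d = x - L(i, :)';                         % difference to landmark l^(i)
        f(i + 1) = exp(-(d' * d) / (2 * sigma^2));
      end
    end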

  • For most SVM implementations, the regularization term \( \sum\limits_{j=1}^n \theta_j^2 = \theta^T\theta\) (if we ignore \(\theta_0\)) is actually written \(\theta^TM\theta\) instead, where M is a matrix that depends on the type of kernel we use.
  • Large C ⇒ lower bias, high variance.
  • Small C ⇒ higher bias, low variance.
  • We also need to choose \(\sigma^2\):
    • large value: features vary more smoothly ⇒ higher bias, lower variance.
    • small value: features vary less smoothly ⇒ lower bias, higher variance.
  • Use liblinear or libsvm for instance.
  • We need to specify the parameter C and the kernel.
    • We can use no kernel at all ⇔ linear kernel (to use for instance when n is large and m is small).
    • Use a Gaussian kernel, in which case we also need to choose \(\sigma^2\). In Octave, this could be written as (passing \(\sigma\) in as a parameter):
      function f = kernel(x1, x2, sigma)
        % Gaussian similarity: exp(-||x1 - x2||^2 / (2*sigma^2))
        f = exp(- (x1 - x2)' * (x1 - x2) / (2 * sigma^2));
      end

⇒ We need to perform feature scaling before using the Gaussian kernel.
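
For example, a simple standardization (zero mean and unit variance per feature) before computing the kernel features could look like this in Octave; the variable names are illustrative:

    % Hedged sketch: scale each feature of X (m x n) before using the Gaussian kernel,
    % so that no single feature dominates the squared distance ||x - l||^2.
    mu = mean(X);
    s = std(X);
    s(s == 0) = 1;                 % avoid dividing by zero for constant features
    X_norm = (X - mu) ./ s;        % reuse the same mu and s to scale any new example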

  • Valid kernels need to satisfy a technical condition called Mercer's Theorem, so that the optimizations used in SVM packages do not diverge.
  • We also have polynomial kernels: \((x^Tl)^2\), \((x^Tl)^3\), \((x^Tl + 5)^4\), etc… (they usually perform worse than Gaussian kernels).
  • Or String kernel, chi-square kernel, histogram intersection kernel, etc…

Multi-class classification

  • Use one-vs-all ⇒ train K SVMs, then pick the class with the largest \((\theta^{(i)})^Tx\) (see the sketch after this list).
  • If n is large relative to m, then use logistic regression or an SVM without a kernel (i.e. a linear kernel).
  • If n is small and m is intermediate (n=1-1000, m=10-10,000), then use an SVM with a Gaussian kernel.
  • If n is small and m is large (n=1-1000, m=50,000+), then try to create more features manually, then use logistic regression or an SVM without a kernel (SVM packages will have a hard time if m is that big).
  • In all the situations above a neural network will work fine too, but it might be slower to train.
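
A minimal sketch of the one-vs-all decision rule mentioned above (Theta is assumed to hold one parameter vector per class, one column each; the function name is illustrative):

    % Hedged sketch: one-vs-all prediction with K SVMs.
    % Theta is (n+1) x K, one column of parameters per class; x is (n+1) x 1 with bias term.
    function k = predict_one_vs_all(Theta, x)
      scores = Theta' * x;       % K x 1 vector of (theta^(i))' * x, one entry per class
      [~, k] = max(scores);      % pick the class with the largest value
    end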