public:courses:machine_learning:machine_learning:clustering

  • A Clustering algorithm finds clusters in datasets with no labels.
  • We randomly initialize two points, called the cluster centroids.
  • Then, we have two steps in the K-means iterative algorithm:
    1. Cluster assignment step: Assign each example in the dataset to the cluster of the closest centroid.
    2. Move the centroids to the new mean location of its cluster.
  • Until the centroids are not moving anymore.

Algorithm

  1. Randomly initialier K cluster centroids \(\mu_1,\mu2, \dots, \mu_K \in \mathbb{R}^n\).
  2. Repeat:
    1. for i=1 to m: \(c^{(i)}\) := index (from 1 to K) of cluster centroid closest to \(x^{(i)}\) (we use the distance \(\|x-\mu_k\|\)).
    2. for k=1 to K: \(\mu_k\) := average (mean) of points assiged to cluster k.
  • Finally if there is a cluster with no point assigned to it, we can:
    1. eliminate that cluster,
    2. randomly re-initialize the position of the cluster centroid.

K-means for non-separated clusters

  • K-means also works when the data is not separated into clusters.
  • \(c^{(i)}\) : index of cluster to which example i is currently assigned.
  • \(\mu_k\) : cluster centroid k
  • \(\mu_{c^{(i)}}\): cluster centroid of cluster to which example i has been assigned.
  • The cost function for the K-means can then be written as:

\[J(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K) = \frac{1}{m} \sum\limits_{i=1}^m \|x^{(i)} - \mu_{c^{(i)}}\|^2 \]

  • This cost function is sometimes called the distortion cost function.
  • We should have K < m.
  • Randomly pick K training examples and assign them as cluster centroids.
  • To avoid local optimun, we can run K-means about 50 -1000 times. Then we pick the result that provide the lowest value for the cost function. This can be useful for k = 2-10.
  • Elbow method : distortion goes rapidly first, then after the elbow it goes slowly (when drawing Cost function with respect to the number of clusters).
  • public/courses/machine_learning/machine_learning/clustering.txt
  • Last modified: 2020/07/10 12:11
  • by 127.0.0.1