===== XIII. Clustering =====

==== 13.1 - Unsupervised Learning: Introduction ====

  * A **Clustering algorithm** finds clusters in datasets with no labels.

==== 13.2 - K-Means Algorithm ====

  * We randomly initialize K points, called the **cluster centroids**.
  * Then, the K-means iterative algorithm alternates two steps:
    - Cluster assignment step: assign each example in the dataset to the cluster of the closest centroid.
    - Move centroid step: move each centroid to the mean location of the points assigned to its cluster.
  * We repeat these two steps until the centroids stop moving.

=== Algorithm ===

  - Randomly initialize K cluster centroids \(\mu_1, \mu_2, \dots, \mu_K \in \mathbb{R}^n\).
  - Repeat:
    - for i = 1 to m: \(c^{(i)}\) := index (from 1 to K) of the cluster centroid closest to \(x^{(i)}\) (using the distance \(\|x^{(i)} - \mu_k\|\)).
    - for k = 1 to K: \(\mu_k\) := average (mean) of the points assigned to cluster k.

  * If a cluster ends up with no points assigned to it, we can either:
    - eliminate that cluster, or
    - randomly re-initialize the position of its centroid.

=== K-means for non-separated clusters ===

  * K-means also works when the data is not separated into well-defined clusters; it simply partitions the data into K groups.

==== 13.3 - Optimization Objective ====

  * \(c^{(i)}\): index of the cluster to which example \(i\) is currently assigned.
  * \(\mu_k\): cluster centroid \(k\).
  * \(\mu_{c^{(i)}}\): cluster centroid of the cluster to which example \(i\) has been assigned.
  * The cost function for K-means can then be written as:
\[J(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K) = \frac{1}{m} \sum\limits_{i=1}^m \|x^{(i)} - \mu_{c^{(i)}}\|^2 \]
  * This cost function is sometimes called the **distortion cost function**.
  * The cluster assignment step minimizes J with respect to the assignments \(c^{(i)}\) (holding the centroids fixed), and the move centroid step minimizes J with respect to the centroids \(\mu_k\) (holding the assignments fixed), so J never increases across iterations.

==== 13.4 - Random Initialization ====

  * We should have K < m.
  * Randomly pick K training examples and use them as the initial cluster centroids.
  * To avoid bad local optima, we can run K-means many times (typically 50 to 1000) with different random initializations, then pick the result with the lowest value of the cost function. This is mainly useful for small K (around 2 to 10).
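The algorithm, the distortion cost, and the multiple-restart initialization above can be sketched in Python with NumPy. This is a minimal illustration, not a production implementation: the helper names (''kmeans'', ''distortion'', ''best_of''), the toy parameters, and the choice to keep an empty cluster's old centroid are my own assumptions.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Basic K-means on X of shape (m, n); returns assignments c and centroids mu."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick K distinct training examples as centroids.
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iters):
        # Cluster assignment step: index of the closest centroid for each x^(i).
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (m, K)
        c = dists.argmin(axis=1)
        # Move centroid step: each mu_k becomes the mean of its assigned points.
        # If a cluster is empty, we keep its old centroid (one could instead
        # eliminate it or re-initialize it, as noted above).
        new_mu = np.array([X[c == k].mean(axis=0) if np.any(c == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):  # centroids stopped moving
            break
        mu = new_mu
    return c, mu

def distortion(X, c, mu):
    """Distortion cost J = (1/m) * sum_i ||x^(i) - mu_{c^(i)}||^2."""
    return np.mean(np.sum((X - mu[c]) ** 2, axis=1))

def best_of(X, K, runs=50):
    """Multiple random restarts: keep the run with the lowest cost J."""
    results = [kmeans(X, K, seed=s) for s in range(runs)]
    return min(results, key=lambda r: distortion(X, *r))
```

On two well-separated blobs, ''best_of(X, 2)'' recovers centroids near the blob centers regardless of an occasional bad initialization, which is exactly the point of the restart strategy.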
==== 13.5 - Choosing the number of clusters ====

  * **Elbow method**: plot the cost function J against the number of clusters K. The distortion decreases rapidly at first, then slowly after the "elbow"; we pick the value of K at the elbow.
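The elbow method can be sketched as follows, assuming NumPy and a toy three-blob dataset of my own making; the helper ''kmeans_cost'' is a hypothetical name that re-runs a basic K-means with a few random restarts and returns the best final J.

```python
import numpy as np

def kmeans_cost(X, K, runs=20, n_iters=50):
    """Best distortion J over several random restarts of a basic K-means."""
    best = np.inf
    for seed in range(runs):
        rng = np.random.default_rng(seed)
        # Initialize centroids at K randomly chosen training examples.
        mu = X[rng.choice(len(X), size=K, replace=False)]
        for _ in range(n_iters):
            # Assign each point to its closest centroid, then move centroids;
            # empty clusters keep their old centroid.
            c = np.linalg.norm(X[:, None] - mu[None], axis=2).argmin(axis=1)
            mu = np.array([X[c == k].mean(axis=0) if np.any(c == k) else mu[k]
                           for k in range(K)])
        best = min(best, np.mean(np.sum((X - mu[c]) ** 2, axis=1)))
    return best

# Toy data: three well-separated blobs, so the elbow should appear at K = 3.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(center, 0.2, size=(30, 2))
               for center in (0.0, 4.0, 8.0)])
costs = {K: kmeans_cost(X, K) for K in range(1, 7)}
# Plotting costs against K: the distortion drops sharply up to K = 3,
# then flattens out -- that bend is the "elbow".
```

In practice the curve is often smooth with no clear elbow, in which case K is better chosen from the downstream purpose of the clustering.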