===== XIII. Clustering =====

==== 13.1 - Unsupervised Learning: Introduction ====

  * A **Clustering algorithm** finds clusters in datasets with no labels.

==== 13.2 - K-Means Algorithm ====

  * We randomly initialize K points, called the **cluster centroids**.
  * Then, the K-means iterative algorithm alternates two steps:
    - Cluster assignment step: assign each example in the dataset to the cluster of the closest centroid.
    - Move centroid step: move each centroid to the mean location of the points assigned to its cluster.
  * We repeat these two steps until the centroids stop moving.

=== Algorithm ===

  - Randomly initialize K cluster centroids \(\mu_1, \mu_2, \dots, \mu_K \in \mathbb{R}^n\).
  - Repeat:
    - for i = 1 to m: \(c^{(i)}\) := index (from 1 to K) of the cluster centroid closest to \(x^{(i)}\) (using the distance \(\|x^{(i)} - \mu_k\|\)).
    - for k = 1 to K: \(\mu_k\) := average (mean) of the points assigned to cluster k.

  * If a cluster ends up with no points assigned to it, we can either:
    - eliminate that cluster, or
    - randomly re-initialize the position of its centroid.

=== K-means for non-separated clusters ===

  * K-means also works when the data is not separated into well-defined clusters; it simply partitions the data into K groups.

==== 13.3 - Optimization Objective ====

  * \(c^{(i)}\): index of the cluster to which example \(i\) is currently assigned.
  * \(\mu_k\): cluster centroid \(k\).
  * \(\mu_{c^{(i)}}\): cluster centroid of the cluster to which example \(i\) has been assigned.
  * The cost function for K-means can then be written as:
\[J(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K) = \frac{1}{m} \sum\limits_{i=1}^m \|x^{(i)} - \mu_{c^{(i)}}\|^2 \]
  * This cost function is sometimes called the **distortion cost function**.
  * The cluster assignment step minimizes J with respect to the assignments \(c^{(i)}\) (holding the centroids fixed), and the move centroid step minimizes J with respect to the centroids \(\mu_k\) (holding the assignments fixed), so J never increases across iterations.

==== 13.4 - Random Initialization ====

  * We should have K < m.
  * Randomly pick K training examples and use them as the initial cluster centroids.
  * To avoid bad local optima, we can run K-means many times (typically 50 to 1000) with different random initializations, then pick the result with the lowest value of the cost function. This is mainly useful for small K (around 2 to 10).
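The algorithm, the distortion cost, and the multiple-restart initialization above can be sketched in Python with NumPy. This is a minimal illustration, not a production implementation: the helper names (''kmeans'', ''distortion'', ''best_of''), the toy parameters, and the choice to keep an empty cluster's old centroid are my own assumptions.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Basic K-means on X of shape (m, n); returns assignments c and centroids mu."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick K distinct training examples as centroids.
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iters):
        # Cluster assignment step: index of the closest centroid for each x^(i).
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (m, K)
        c = dists.argmin(axis=1)
        # Move centroid step: each mu_k becomes the mean of its assigned points.
        # If a cluster is empty, we keep its old centroid (one could instead
        # eliminate it or re-initialize it, as noted above).
        new_mu = np.array([X[c == k].mean(axis=0) if np.any(c == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):  # centroids stopped moving
            break
        mu = new_mu
    return c, mu

def distortion(X, c, mu):
    """Distortion cost J = (1/m) * sum_i ||x^(i) - mu_{c^(i)}||^2."""
    return np.mean(np.sum((X - mu[c]) ** 2, axis=1))

def best_of(X, K, runs=50):
    """Multiple random restarts: keep the run with the lowest cost J."""
    results = [kmeans(X, K, seed=s) for s in range(runs)]
    return min(results, key=lambda r: distortion(X, *r))
```

On two well-separated blobs, ''best_of(X, 2)'' recovers centroids near the blob centers regardless of an occasional bad initialization, which is exactly the point of the restart strategy.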
==== 13.5 - Choosing the number of clusters ====

  * **Elbow method**: plot the cost function J against the number of clusters K. The distortion decreases rapidly at first, then slowly after the "elbow"; we pick the value of K at the elbow.
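The elbow method can be sketched as follows, assuming NumPy and a toy three-blob dataset of my own making; the helper ''kmeans_cost'' is a hypothetical name that re-runs a basic K-means with a few random restarts and returns the best final J.

```python
import numpy as np

def kmeans_cost(X, K, runs=20, n_iters=50):
    """Best distortion J over several random restarts of a basic K-means."""
    best = np.inf
    for seed in range(runs):
        rng = np.random.default_rng(seed)
        # Initialize centroids at K randomly chosen training examples.
        mu = X[rng.choice(len(X), size=K, replace=False)]
        for _ in range(n_iters):
            # Assign each point to its closest centroid, then move centroids;
            # empty clusters keep their old centroid.
            c = np.linalg.norm(X[:, None] - mu[None], axis=2).argmin(axis=1)
            mu = np.array([X[c == k].mean(axis=0) if np.any(c == k) else mu[k]
                           for k in range(K)])
        best = min(best, np.mean(np.sum((X - mu[c]) ** 2, axis=1)))
    return best

# Toy data: three well-separated blobs, so the elbow should appear at K = 3.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(center, 0.2, size=(30, 2))
               for center in (0.0, 4.0, 8.0)])
costs = {K: kmeans_cost(X, K) for K in range(1, 7)}
# Plotting costs against K: the distortion drops sharply up to K = 3,
# then flattens out -- that bend is the "elbow".
```

In practice the curve is often smooth with no clear elbow, in which case K is better chosen from the downstream purpose of the clustering.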