In this topic, you will learn about K-means, arguably the most well-known clustering algorithm. It's easy to understand and implement, has a wide range of real-world applications, and is very efficient. K-means can be used for document clustering, market segmentation, image compression, and in a myriad of other cases. Excited? Let's find out more about it!
Main idea
Clustering is the task of dividing observations into groups. K-means is an unsupervised algorithm, which means that we don't need labels (also known as ground truth) to train the model. We start by picking the number of future clusters, which is the main hyperparameter that we must specify before training.
We can describe the simplest version of the algorithm in 5 steps:
1. Choose the number of clusters (K).
2. Place K cluster centroids (the centers of the future clusters) at random positions.
3. Assign every point to a cluster by finding its closest centroid.
4. Recompute the centers of the clusters.
5. Repeat steps 3 and 4 until the assignment of data points no longer changes after the centroids are recomputed.
This version of K-means is also known as Lloyd's K-means.
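To make these steps concrete, here is a minimal sketch of Lloyd's K-means in Python with NumPy. The function name `kmeans` and the stopping details are our own illustrative choices, not a reference implementation:

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """A minimal sketch of Lloyd's K-means; `points` is an (n, d) NumPy array."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: k is given; pick k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Step 3: assign every point to the closest centroid (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Step 5: stop as soon as the assignment no longer changes.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old position if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels
```

Calling `kmeans(data, k=2)` on an (n, 2) array returns the final centroid coordinates and a cluster label for every point.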
Example
Let's imagine we have 20 data points and need to assign each of them to one of two clusters.
Initialize: First, we place two cluster centroids, a blue one and a red one, at random positions. Note that this is a naive approach; there are more efficient initialization techniques, which we'll discuss below.
Assign: As the next step, we need to calculate the distance from each data point to both cluster centroids. For this purpose, we'll use the Euclidean distance between two points $p = (p_1, p_2)$ and $q = (q_1, q_2)$:

$$d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}$$

Let's take an arbitrary point and calculate its distance to both centroids.
If the distance to the red centroid is smaller, that is, $d_{\text{red}} < d_{\text{blue}}$, the point belongs to the red cluster. We'll repeat this process for all data points, so that every point ends up labeled with the color of its nearest centroid.
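As a quick sketch, this assignment step takes just a couple of lines in NumPy. The coordinates below are made up for illustration; they are not the values from the original example:

```python
import numpy as np

point = np.array([2.0, 3.0])            # an arbitrary data point
centroids = np.array([[1.0, 1.0],       # blue centroid (made-up coordinates)
                      [3.0, 2.0]])      # red centroid (made-up coordinates)

# Euclidean distance from the point to each centroid.
distances = np.linalg.norm(centroids - point, axis=1)
print(distances)           # [2.236..., 1.414...]
print(distances.argmin())  # 1 -> the red centroid is closer
```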
Recompute coordinates for centroids: Now we recompute the coordinates of the centroids. To do this, we compute the mean of all data points belonging to a specific cluster; this mean becomes the new position of that centroid.
Then we'll repeat the two previous steps: assign each data point to the label of the nearest cluster centroid and recompute the coordinates of the centroids. The K-means algorithm terminates when a new iteration brings no update to the coordinates of the cluster centroids.
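In practice, you rarely need to implement this loop yourself. Here is a usage sketch with scikit-learn's `KMeans` estimator; the toy data is invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: two loose blobs (the values are invented for illustration).
data = np.array([[1.0, 2.0], [1.5, 1.8], [1.1, 0.9],
                 [8.0, 8.2], [7.7, 9.1], [8.4, 7.6]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(model.labels_)           # cluster index for every point
print(model.cluster_centers_)  # final centroid coordinates
```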
Limitations of K-means
K-means, however, has several limitations:
- Optimal K. It may be relatively difficult to find the optimal number of clusters K.
- Bad initialization = suboptimal results. The initialization step is crucial for finding good centroid positions: K-means heavily depends on it, and with a poor initialization the algorithm may get stuck in a local optimum.
- Different cluster sizes and densities. Data whose clusters vary in density and size may not be clustered correctly.
- Non-spherical clusters. K-means won't work well if the clusters don't have a roughly spherical shape.
Initialization
Let's talk about how to initialize centroid positions in a better way. In the traditional version, we pick random points from our data to be the centroids. However, this is not optimal: the centroids may end up too close to each other. Suppose we'd like to initialize the centroids far enough from each other. One possible way to accomplish this goal is the k-means++ algorithm:
1. Randomly pick a data point to be the first centroid.
2. For every data point, compute the distance to the nearest of the already selected centroids.
3. Choose a data point to be the next centroid, so that the probability of a point being chosen is proportional to its squared distance to the nearest already selected centroid.
4. Repeat steps 2 and 3 until the positions of all K centroids are found.
So the intuition behind k-means++ is: the farther a point is from the already selected centroids, the more likely it is to become the next one.
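Here is a minimal sketch of this seeding procedure in Python with NumPy (the helper name `kmeans_pp_init` is ours):

```python
import numpy as np

def kmeans_pp_init(points, k, seed=0):
    """A sketch of k-means++ seeding; `points` is an (n, d) NumPy array."""
    rng = np.random.default_rng(seed)
    # Step 1: the first centroid is a uniformly random data point.
    centroids = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        # Step 2: squared distance from each point to its nearest chosen centroid.
        d2 = np.min([np.sum((points - c) ** 2, axis=1) for c in centroids], axis=0)
        # Step 3: sample the next centroid proportionally to that squared distance.
        probs = d2 / d2.sum()
        centroids.append(points[rng.choice(len(points), p=probs)])
    return np.array(centroids)
```

The centroids returned by this helper would then be used as the starting positions for Lloyd's algorithm above.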
Conclusion
In this topic, we've learned about the K-means technique. Here are the main takeaways:
- Lloyd's K-means: a 5-step algorithm that consists of choosing the number of clusters, placing the centroids, assigning each point to its nearest centroid, recomputing the centroid positions, and repeating the last two steps until the assignment stops changing.
- Limitations of K-means: finding the optimal number of clusters, sensitivity to initialization, and dependence on the shape and relative sizes of the clusters.
- The k-means++ initialization technique chooses each new centroid among the data points with probability proportional to its squared distance to the nearest previously selected centroid.