Clustering is a type of unsupervised machine learning algorithm. In unsupervised learning, we don't need to label our dataset, which is exactly why it is called unsupervised.
In this topic, you'll learn about clustering and how it divides data into clusters (sets), where each cluster consists of similar elements. For example, if you want to build a spam filtering method, a clustering algorithm will group all spam emails into one set and non-spam emails into another.
What is clustering
In the introduction, we mentioned dividing data into sets/clusters. Let's look at a few examples.
Suppose the image above illustrates the spam filtering example: we divide our data into two sets, spam and non-spam emails. It looks simple, but only at first glance.
As you can see from the pictures above, separating data into clusters becomes more challenging as the data grows more complex. The data can be very diverse and hard to describe. Moreover, most real-world data is multidimensional and difficult to visualize, which complicates matters further: you can't estimate in advance which approach will be the most effective.
Types of clustering algorithms
Like in any other machine learning problem, the data can be fuzzy (as in the above images). It can significantly change the model's performance and predictions. Also, the boundary between clusters can be implicit. The initial number of classes is unknown, so you don't have an exact formulation of the problem. How to navigate through this uncertainty? The answer is hard and soft clustering.
Hard clustering algorithms assign each data point to exactly one cluster. The most famous hard clustering algorithm is arguably K-means. K-means starts by placing k centroids (initially at random positions) and then repeats two steps: it assigns each data point to its nearest centroid, and it moves each centroid to the mean of the points assigned to it. Spam filtering is one example of a hard clustering task.
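The K-means steps above can be sketched with scikit-learn (assuming it is installed); the toy data and the choice of n_clusters=2 are made up purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: two well-separated blobs (values chosen only for illustration).
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.1]])

# n_clusters=2 is an assumption here; in practice k has to be chosen,
# e.g. with the elbow method or silhouette scores.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.labels_)           # hard assignment: exactly one cluster per point
print(km.cluster_centers_)  # the centroid of each cluster
```

Note that the result is a *hard* assignment: `labels_` contains a single cluster index per point, with no notion of partial membership.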
Another commonly used technique is hierarchical clustering. A hierarchical clustering algorithm is useful if your data contains nested structures, such as Netflix movie directories. Hierarchical clustering builds a tree of nested clusters, which can be visualized as a dendrogram. A typical example is agglomerative hierarchical clustering (AGNES).
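A minimal sketch of agglomerative (bottom-up) clustering with SciPy, assuming it is available; the data and the cut into two flat clusters are arbitrary choices for the example:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two tight pairs of points (illustrative data only).
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])

# Agglomerative linkage: 'ward' merges the pair of clusters whose merge
# least increases the within-cluster variance. Z encodes the dendrogram.
Z = linkage(X, method="ward")

# Cut the dendrogram so that exactly 2 flat clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The matrix `Z` is the dendrogram itself; `scipy.cluster.hierarchy.dendrogram(Z)` would plot the tree structure mentioned above.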
Soft clustering algorithms assign each point to a number of clusters, each with some probability. A typical example of a soft clustering approach is Gaussian mixture models.
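A short sketch of soft clustering with a Gaussian mixture model in scikit-learn (assuming it is installed); the 1-D toy data and n_components=2 are assumptions for the example:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two loose groups of 1-D points (illustrative data only).
X = np.array([[0.0], [0.2], [0.1], [4.0], [4.2], [3.9]])

# n_components=2 is an assumption for this toy data.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Soft assignment: one probability per cluster for each point;
# each row sums to 1 across the clusters.
probs = gmm.predict_proba(X)
print(probs.round(3))
```

Contrast this with K-means: instead of a single cluster index per point, every point gets a probability of belonging to each cluster.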
Common challenges in clustering
Using clustering on your dataset is a challenging task, and we outline some of the main limitations below:
- The dataset might lack labels, so you don't know in advance how many clusters there should be. Additionally, many clustering algorithms depend on initial parameters, which requires an understanding of the underlying data.
- Many clustering algorithms make assumptions about the data. For example, k-means only works well if your clusters have a roughly spherical shape.
- Clustering operates on distances, which raises two issues: first, the results depend on the chosen distance metric (such as Euclidean or Manhattan). Second, since distances are computed across features, it is usually necessary to bring all features to the same scale, so standardization or normalization should be part of the preprocessing stage.
- There are multiple approaches to distinguishing the clusters (such as distance-based or density-based), and it might not be obvious which one should be chosen for a particular problem.
- It is hard to visualize what our dataset would look like and which algorithm to use on it. This issue becomes even more apparent when we are working with higher dimensions.
- Sometimes the separation rule can be hard to comprehend.
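To illustrate the scaling point from the list above, here is a minimal preprocessing sketch with scikit-learn (assuming it is installed); the feature values are invented for the example:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales, e.g. age vs. yearly income.
# Without scaling, distance-based clustering would be dominated by income.
X = np.array([[25, 40000.0],
              [30, 42000.0],
              [45, 90000.0]])

# StandardScaler rescales each feature to zero mean and unit variance.
scaled = StandardScaler().fit_transform(X)
print(scaled.mean(axis=0).round(6))  # approximately 0 for each feature
print(scaled.std(axis=0).round(6))   # approximately 1 for each feature
```

After scaling, both features contribute comparably to Euclidean distances, which is what distance-based algorithms like K-means implicitly assume.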
The previously mentioned issues can be mitigated through the application of different heuristic strategies and careful assessment of the results.
Applications of clustering
Clustering is a powerful tool for many industries, from bioinformatics (looking for similar genomes) to marketing (finding clients with similar tastes). At the moment, clustering is used in image segmentation (computer vision), in the logistics for clustering locations, in the delivery of goods, in advertising for recommendation systems, and in many other fields.
Conclusion
Here is what you've learned in this introductory topic:
- Clustering is an unsupervised machine learning task that splits data into clusters.
- Using the K-means algorithm, we randomly initialize k centroids and iteratively minimize the distance between each data point and the centroid of its cluster.
- Hierarchical clustering decomposes data into a tree of nested clusters based on group similarities.
- Despite its complexity, clustering is a powerful tool to deal with unlabeled data.