Clustering is a type of unsupervised machine learning algorithm. In unsupervised learning, we don't need to label our dataset, which is exactly why it is called unsupervised.
In this topic, you'll learn about clustering and how it divides data into clusters (sets), where each cluster consists of similar elements. For example, if you want to build a spam filtering method, a clustering algorithm will group all spam emails into one set and non-spam emails into another.
What is clustering
In the introduction, we mentioned dividing data into sets/clusters. Let's look at a few examples.
Suppose the image above illustrates the spam filtering example: we divide our data into two sets, spam and non-spam emails. It looks simple, but only at first glance.
As you can see from the pictures above, separating data into clusters becomes more challenging as the data grows more complex. The data can be very diverse and hard to describe. Moreover, most real-world data is multidimensional and difficult to visualize, which complicates matters further: you can't estimate in advance which approach will be the most effective.
Types of clustering algorithms
Like in any other machine learning problem, the data can be fuzzy (as in the above images). It can significantly change the model's performance and predictions. Also, the boundary between clusters can be implicit. The initial number of classes is unknown, so you don't have an exact formulation of the problem. How to navigate through this uncertainty? The answer is hard and soft clustering.
Hard clustering algorithms assign each data point to exactly one cluster. The most famous hard clustering algorithm is arguably K-means. K-means starts by placing k centroids (initially at random positions) and then repeats two steps: it assigns each data point to its nearest centroid, and it moves each centroid to the mean of the points assigned to it. Spam filtering is one example of a hard clustering task.
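The K-means steps above can be sketched with scikit-learn (assuming it is installed); the toy data and the choice of n_clusters=2 are made up purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: two well-separated blobs (values chosen only for illustration).
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.1]])

# n_clusters=2 is an assumption here; in practice k has to be chosen,
# e.g. with the elbow method or silhouette scores.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.labels_)           # hard assignment: exactly one cluster per point
print(km.cluster_centers_)  # the centroid of each cluster
```

Note that the result is a *hard* assignment: `labels_` contains a single cluster index per point, with no notion of partial membership.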
Another commonly used technique is hierarchical clustering. A hierarchical clustering algorithm is useful if your data contains nested structures, such as Netflix movie directories. Hierarchical clustering builds a tree of nested clusters, which can be visualized as a dendrogram. A typical example is agglomerative hierarchical clustering (AGNES).
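A minimal sketch of agglomerative (bottom-up) clustering with SciPy, assuming it is available; the data and the cut into two flat clusters are arbitrary choices for the example:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two tight pairs of points (illustrative data only).
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])

# Agglomerative linkage: 'ward' merges the pair of clusters whose merge
# least increases the within-cluster variance. Z encodes the dendrogram.
Z = linkage(X, method="ward")

# Cut the dendrogram so that exactly 2 flat clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The matrix `Z` is the dendrogram itself; `scipy.cluster.hierarchy.dendrogram(Z)` would plot the tree structure mentioned above.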
Soft clustering algorithms assign each point to a number of clusters, each with some probability. A typical example of a soft clustering approach is Gaussian mixture models.
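A short sketch of soft clustering with a Gaussian mixture model in scikit-learn (assuming it is installed); the 1-D toy data and n_components=2 are assumptions for the example:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two loose groups of 1-D points (illustrative data only).
X = np.array([[0.0], [0.2], [0.1], [4.0], [4.2], [3.9]])

# n_components=2 is an assumption for this toy data.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Soft assignment: one probability per cluster for each point;
# each row sums to 1 across the clusters.
probs = gmm.predict_proba(X)
print(probs.round(3))
```

Contrast this with K-means: instead of a single cluster index per point, every point gets a probability of belonging to each cluster.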
Common challenges in clustering
Using clustering on your dataset is a challenging task, and we outline some of the main limitations below:
- The dataset might lack labels, so you don't know in advance how many clusters there should be. Additionally, many clustering algorithms depend on initial parameters, which requires an understanding of the underlying data.
- Many clustering algorithms make assumptions about the data. For example, k-means only works well if your clusters have a roughly spherical shape.
- Clustering operates on distances, which raises two issues: first, the results depend on the chosen distance metric (such as Euclidean or Manhattan). Second, since distances are computed across features, it is usually necessary to bring all features to the same scale, so standardization or normalization should be part of the preprocessing stage.
- There are multiple approaches to distinguishing the clusters (such as distance-based or density-based), and it might not be obvious which one should be chosen for a particular problem.
- It is hard to visualize what our dataset would look like and which algorithm to use on it. This issue becomes even more apparent when we are working with higher dimensions.
- Sometimes the separation rule can be hard to comprehend.
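To illustrate the scaling point from the list above, here is a minimal preprocessing sketch with scikit-learn (assuming it is installed); the feature values are invented for the example:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales, e.g. age vs. yearly income.
# Without scaling, distance-based clustering would be dominated by income.
X = np.array([[25, 40000.0],
              [30, 42000.0],
              [45, 90000.0]])

# StandardScaler rescales each feature to zero mean and unit variance.
scaled = StandardScaler().fit_transform(X)
print(scaled.mean(axis=0).round(6))  # approximately 0 for each feature
print(scaled.std(axis=0).round(6))   # approximately 1 for each feature
```

After scaling, both features contribute comparably to Euclidean distances, which is what distance-based algorithms like K-means implicitly assume.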
The previously mentioned issues can be mitigated through the application of different heuristic strategies and careful assessment of the results.
Applications of clustering
Clustering is a powerful tool for many industries, from bioinformatics (looking for similar genomes) to marketing (finding clients with similar tastes). At the moment, clustering is used in image segmentation (computer vision), in the logistics for clustering locations, in the delivery of goods, in advertising for recommendation systems, and in many other fields.
Conclusion
Here is what you've learned in this introductory topic:
- Clustering is an unsupervised machine learning task that splits data into clusters.
- Using the K-means algorithm, we randomly initialize k centroids and iteratively minimize the distance between each data point and the centroid of its cluster.
- Hierarchical clustering decomposes data into a tree of nested clusters based on group similarities.
- Despite its complexity, clustering is a powerful tool to deal with unlabeled data.