
DBSCAN


DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. DBSCAN is non-parametric, meaning it does not make any assumptions about the data distribution. In this topic, we will consider the main properties of DBSCAN.

The key concepts

As a high-level overview: unlike centroid-based clustering, DBSCAN clusters based on densely populated areas, producing clusters of varying shapes and sizes. It builds on the idea that a cluster is an area with a high concentration of samples, separated from other dense areas by regions of low concentration. In this section, we will get familiar with the main definitions required to understand the DBSCAN procedure.

  • minPts: the minimum number of points that must lie within the ε-neighborhood of a data point for it to be considered a core point. The point itself counts toward minPts.
  • ε (epsilon): the radius used to determine the neighborhood of a data point. Any data point within the ε-neighborhood of a point is considered its neighbor.
  • Core points: a data point is considered a core point if it has at least minPts data points within its ε-neighborhood.
  • Border points: data points located within the ε-neighborhood of a core point, but without enough neighbors to be classified as core themselves. They lie on the outskirts of clusters and can be associated with more than one cluster.
  • Noise points: points that are neither core nor border and are not assigned to any cluster. These are typically isolated points or points in low-density regions.
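To make these definitions concrete, here is a minimal sketch in Python that labels each point as core, border, or noise by brute-force distance computation. The helper name `classify_points` is made up for illustration; this is not how production DBSCAN implementations index neighbors.

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise' (illustrative sketch)."""
    n = len(X)
    # Pairwise Euclidean distances (fine for small datasets).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # A point's eps-neighborhood includes the point itself.
    neighbors = dists <= eps
    is_core = neighbors.sum(axis=1) >= min_pts
    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append("core")
        elif np.any(is_core & neighbors[i]):
            labels.append("border")  # within eps of some core point
        else:
            labels.append("noise")
    return labels

# Four tightly packed points, one outskirt point, one isolated point.
X = np.array([[0, 0], [0, 0.5], [0.5, 0], [0.5, 0.5], [1.2, 0], [10, 10]])
print(classify_points(X, eps=1.0, min_pts=4))
```

With eps = 1.0 and minPts = 4, the four packed points are core, the point at (1.2, 0) is a border point (inside a core point's neighborhood but with only 3 neighbors of its own), and the isolated point is noise.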

Illustration for the concepts of noise, border, and core points

Together, minPts and ε define the notion of density in DBSCAN. The point types can be understood in the context of density:

  • All neighboring points within the ε-neighborhood of a core point are deemed to be in the same cluster as that core point, a concept referred to as direct density reachability.
  • If any of these neighboring points are also core points, their respective neighborhoods are included as well, a concept known as density reachability.
  • Points in this set that are not core points are referred to as border points, while all points within the same set are regarded as density connected.
  • Any points that are not density reachable from any core point are treated as noise and are not included in any cluster.

In the illustration below, arrows indicate direct density reachability. Points B and C are density connected, since both are density reachable from A. N is not density reachable from any core point and is considered a noise point.

Illustration for the ideas of direct density reachability, density reachability, and density connectedness

The general procedure

In order to initialize DBSCAN, we need to determine the following set of parameters:

  • Distance metric
  • The minimum number of points, minPts
  • The neighborhood radius, ε

A bit further down the line, we will consider some heuristics for setting minPts and ε. As for the metric, Euclidean distance is the most common choice, but other metrics apply under certain conditions: for binary data, the Jaccard distance might be better suited, while for textual data, cosine distance is often used.

Considering the definitions above, the DBSCAN procedure could be summarized as follows:

  1. Initialize the distance metric, ε, and minPts, and mark all points in the dataset as unvisited.
  2. The algorithm starts with an arbitrary unvisited point in the dataset and identifies its neighboring points based on ε.
  3. If there is a sufficient number of neighboring points (at least minPts, counting the point itself), a new cluster is created and the current point is labeled as a core point. The points within its ε-neighborhood are added to the same cluster.
  4. These new points are then evaluated in turn. If they also have a sufficient number of neighbors, they are designated as core points, and their neighbors are added to the same cluster as well. This process continues until there are no new points to add to the cluster.
  5. The algorithm then proceeds with the next unvisited point in the dataset and repeats steps 2-4. If a point does not have a sufficient number of neighboring points and is not part of any cluster, it is labeled as noise.
  6. The process ends when all points in the dataset have been marked as visited.
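To see the procedure in action, here is a short example using scikit-learn's `DBSCAN` on a toy two-moons dataset; the eps and min_samples values are illustrative choices for this particular dataset, not universal defaults.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: a shape centroid-based methods struggle with.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps and min_samples (minPts) are illustrative choices for this dataset.
db = DBSCAN(eps=0.2, min_samples=4).fit(X)

# Points labeled -1 are noise; the rest carry cluster indices 0, 1, ...
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
n_noise = int(np.sum(db.labels_ == -1))
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

DBSCAN recovers the two moons as separate clusters, something k-means cannot do for this shape regardless of initialization.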

Below is a great illustration of the DBSCAN process:

Hyperparameter selection

How can one choose appropriate values for ε and minPts? Various heuristics have been proposed over the years, and we take a look at some of them below.

In order to select the minPts, the following aspects could be considered:

  • Usually, minPts is set to twice the dataset dimensionality, i.e., minPts = 2 · dim. For 2-dimensional data, this gives minPts = 4.
  • For a dataset that is either very large, noisy, high dimensional, or contains many duplicates, increasing minPts may result in better performance.

Setting ε is a bit trickier.

  • Domain knowledge might help to determine the ε value. For instance, when clustering GPS locations, a domain expert might propose that objects within a distance of 1 kilometer should be considered neighbors. From there, a suitable distance function can be selected, and ε is adjusted as required (along with minPts).
  • The k-distance plot can also be used to determine ε. The idea is to compute the distance from each point to its k-th closest neighbor, where k is set equal to minPts. These k-distances are then sorted and plotted in ascending order. The goal is to identify the "knee" or "elbow" of the plot, which indicates a good ε value.

    A k-dist plot for epsilon selection

    In the plot above, the "elbow" is at approximately 0.25 on the k-distance axis, so the initial ε might be set to 0.25.
    The procedure for ε selection with the k-distance plot is:
    1. Compute the distance from each point to its k-th closest neighbor (k here is equal to minPts).
    2. Sort the distance values in ascending order and plot the k-distance graph.
    3. Select the elbow or knee on the k-distance plot. The point with the sharpest bend can be chosen as the optimal ε. If the k-distance plot has no visible "breaking" point, another approach such as OPTICS (a variation of DBSCAN that does not require ε initialization) can be used instead.
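The k-distance computation in steps 1-2 can be sketched with scikit-learn's `NearestNeighbors`; the dataset here is synthetic, and the plotting step is left as a comment.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

min_pts = 4  # k is set equal to minPts
# Queried on its own training data, kneighbors counts each point as its
# own nearest neighbor (distance 0), so the last column is the distance
# to the min_pts-th point of the neighborhood, matching the convention
# that minPts includes the point itself.
nn = NearestNeighbors(n_neighbors=min_pts).fit(X)
distances, _ = nn.kneighbors(X)
k_distances = np.sort(distances[:, -1])

# Plot k_distances against the point index (e.g., with matplotlib)
# and read eps off the elbow of the curve.
```

The resulting `k_distances` array is exactly the sorted curve from the plot above; the value at the elbow becomes the candidate ε.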

There is another way to tell whether ε should be corrected after the first run. If the majority of the data ends up in the biggest component (the largest cluster), it suggests that ε is too large. If the biggest component contains 20% to 50% of the clustered points, it is recommended to rerun DBSCAN with a smaller ε. It is also helpful to look at the noise ratio: typically, a preferable noise share ranges from 1% to 30%. When ε is set too low, more points are labeled as noise; increasing ε brings more points into clusters.
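These diagnostics are easy to compute from DBSCAN's output labels. A small sketch follows; `dbscan_diagnostics` is a hypothetical helper, and the thresholds quoted above are rules of thumb, not hard limits.

```python
import numpy as np

def dbscan_diagnostics(labels):
    """Noise ratio and largest-cluster share from DBSCAN labels (noise = -1)."""
    labels = np.asarray(labels)
    noise_ratio = float(np.mean(labels == -1))
    clustered = labels[labels != -1]
    if clustered.size == 0:
        return noise_ratio, 0.0  # everything is noise: eps is far too small
    counts = np.bincount(clustered)
    largest_share = float(counts.max() / clustered.size)
    return noise_ratio, largest_share

# Example: 10 points, two clusters and two noise points.
noise, share = dbscan_diagnostics([0, 0, 0, 0, 0, 0, 1, 1, -1, -1])
print(noise, share)  # 0.2 and 0.75
```

Here the noise ratio (20%) is within the acceptable range, but cluster 0 holds 75% of the clustered points, hinting that a smaller ε is worth trying.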

Also, a hyperparameter search technique such as grid search might be used to find the most suitable minPts and ε values.

Some considerations for DBSCAN

In this section, we will look at the advantages and disadvantages of DBSCAN.

The pros of DBSCAN:

  • As opposed to most centroid-based clustering methods, DBSCAN can detect clusters of arbitrary shapes and does not require the number of clusters to be set beforehand;
  • It handles noise directly by assigning it a dedicated label.

The cons of DBSCAN:

  • The quality depends on the initialized parameters (the distance metric, ε, and minPts), which have to be set manually;
  • Since a cluster is a dense region of samples separated by regions of lower density, a single global choice of ε and minPts can cause issues on a dataset with varying cluster densities (ε and minPts might not be optimal for all regions);
  • DBSCAN is affected by the curse of dimensionality, where the notion of distance becomes less meaningful as the number of dimensions increases (a phenomenon known as distance concentration: as the number of dimensions grows, all pairwise distances may converge to the same value). Thus, DBSCAN may not be a suitable choice for a high-dimensional dataset;
  • Since DBSCAN works with distances, the scale of the features affects the results (so feature standardization might be required during preprocessing).
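The last point is straightforward to address by standardizing features before clustering. A sketch using scikit-learn's `StandardScaler` in a pipeline follows; the data is made up for illustration, with one feature (income) on a much larger scale than the other (age).

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Features on very different scales: without standardization, the income
# column would dominate every Euclidean distance.
X = np.array([[25, 30_000], [27, 32_000], [26, 31_000],
              [52, 90_000], [54, 95_000], [53, 92_000]])

# Scale each feature to zero mean and unit variance, then cluster.
pipe = make_pipeline(StandardScaler(), DBSCAN(eps=0.8, min_samples=2))
labels = pipe.fit_predict(X)
print(labels)
```

On the standardized features, the two age/income groups fall into two separate clusters with no noise; on the raw features, an ε tuned for ages would be meaningless for incomes.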

Conclusion

As a result, you are now familiar with:

  • The core concepts of DBSCAN (such as density, which is defined by ε and minPts, and the distinction between core, border, and noise points);
  • The pros and cons of the algorithm;
  • Some of the proposed heuristics for determining the minPts and ε values.