Intelligent Systems and Applications in Engineering

Abstract: In recent years, data analysis has become increasingly important as data volumes grow. Clustering, which groups objects according to their similarity, plays an important role in data analysis. DBSCAN is one of the most effective and popular density-based clustering algorithms and has been successfully applied in many areas. However, determining the values of its input parameters, the neighborhood radius Eps and the minimum number of points MinPts, is a challenging task. The values of these parameters significantly affect the clustering performance of the algorithm. In this study, we propose the AE-DBSCAN algorithm, which includes a new method to determine the value of the neighborhood radius Eps automatically. The experimental evaluations showed that the proposed method outperformed the classical method.


Introduction
Clustering is one of the most important data analysis methods; it groups unlabeled data based on their similarities. It is used in many areas for various applications, such as image segmentation, document retrieval, meteorology, and pattern recognition [7,9,10]. The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering algorithm discovers clusters based on two user-given parameters: the neighborhood radius, Eps, and the minimum number of points, MinPts [1]. DBSCAN has the potential of discovering clusters with different densities, shapes, and sizes. It is also independent of data ordering and can handle noisy datasets [5,7,10]. However, clustering with DBSCAN is challenging because the values of Eps and MinPts are difficult to determine [5], especially for a user who has no experience with the dataset to be clustered. For these reasons, automatic techniques should be developed to determine the values of these parameters. In this study, we propose the AE-DBSCAN algorithm, which includes a new method to automatically determine the value of the neighborhood radius, Eps. The main idea of the proposed method is to assign the first sharp change in the k-dist plot as the epsilon value. To find the first sharp change, we first generate the k-dist plot of the dataset and then take the first slope that is above the mean plus the standard deviation of all nonzero slopes. The rest of the paper is organized as follows. Section 2 presents the discussion of related works. Section 3 gives the basic concepts of the DBSCAN algorithm. Section 4 presents the proposed method. Section 5 presents the experimental evaluations, and Section 6 presents the conclusions and future works.

Related Works
In the literature, clustering techniques can be broadly categorized as partitional, hierarchical, density-based, grid-based, and model-based clustering algorithms [7]. This study deals with the density-based clustering algorithm DBSCAN. Partitional clustering algorithms, such as K-means, PAM (Partitioning Around Medoids), CLARA (Clustering Large Applications), and CLARANS (Clustering Large Applications based on Randomized Search), partition objects according to their similarity [7]. Hierarchical clustering algorithms, such as BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies), CURE (Clustering Using Representatives), and ROCK (A Robust Clustering Algorithm for Categorical Attributes), produce a tree structure to represent clusters [7,12,13]. Grid-based clustering algorithms, such as STING and WaveCluster, discover clusters by dividing the dataset into grids [7,11,14]. Model-based clustering algorithms use probability models, such as the Gaussian model and Latent Dirichlet Allocation, to cluster objects [7]. Density-based clustering algorithms assign dense regions as clusters and are able to cluster arbitrarily shaped noisy datasets. Examples of density-based clustering algorithms are OPTICS, DENCLUE, CLIQUE, and DBSCAN [1,3,7]. The DBSCAN algorithm has inspired several studies since it was first proposed, owing to its potential to discover clusters with different shapes and sizes in noisy data [1-10]. Despite its popularity, determining the input parameters of DBSCAN is challenging. Because of this limitation, methods for automatically determining the values of these parameters have been proposed in the literature. In APSCAN, a parameter-free density-based clustering algorithm, affinity propagation clustering is used to detect local densities and to determine the values of the input parameters [17]. Zhou [...] shaped clusters without requiring any input parameters [19]. Hou et al. proposed a parameter-free clustering algorithm based on the combination of the DSets (dominant sets) and DBSCAN algorithms [20]. In the AGED algorithm, the value of the epsilon parameter of DBSCAN is determined based on the local densities [21]. Wang et al. proposed a modified DBSCAN algorithm to automatically determine the epsilon values for different data distributions [22]. Lakshmi et al. proposed an efficient density-based subspace clustering method that computes the value of epsilon dynamically for each subspace based on the maximum spread of the data [23]. Daszykowski et al. developed an analytical approach to determine the value of epsilon using a gamma formula [24]. In this study, we propose the AE-DBSCAN algorithm. It includes a new method that automatically determines the value of the neighborhood radius parameter, Eps, of the DBSCAN algorithm by utilizing the k-dist list.

Basic Concepts
The DBSCAN algorithm requires two input parameters: Eps, which determines the neighbouring area of an object (or point), and MinPts, which is the minimum number of points within the Eps radius. The basic concepts of the method are given as follows [1,7]; we then present the modeling of AE-DBSCAN.

Definition 1: (Eps-neighbourhood) For a given dataset D, the Eps-neighbourhood of a point p is the set of points q within a given radius Eps of p, expressed as {q ∈ D | dist(p, q) ≤ Eps}.
Definition 2: (Directly density reachable) A point p is directly density reachable from a point q if p is within the Eps-neighbourhood of q and the Eps-neighbourhood of q contains at least MinPts points.
Definition 3: (Density reachable) A point p is density reachable from q with respect to Eps and MinPts if there is a chain of points p_1, …, p_n with p_1 = q and p_n = p such that p_{i+1} is directly density reachable from p_i.
For the list k-dist = (k_1, k_2, k_3, ..., k_n), the slope of point k_i with respect to the next point k_{i+1} is defined as the slope of the line segment k_i k_{i+1}. To find the slope of a point with respect to another point, different definitions or line segment representations can be used. In this study, to calculate the slope of a point k_i, we used its successor k_{i+1} in the line segment representation.
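Definitions 1 and 2 can be illustrated with a short Python sketch. It follows the standard DBSCAN formulation, in which p is directly density reachable from q only when q itself has at least MinPts points in its Eps-neighbourhood. The function names, the Euclidean distance, and the toy dataset are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def eps_neighbourhood(D, p, eps):
    """Indices q with dist(D[p], D[q]) <= eps (Definition 1).

    Assumption: Euclidean distance; the point p is part of its own
    neighbourhood because dist(p, p) = 0.
    """
    dists = np.linalg.norm(D - D[p], axis=1)
    return np.flatnonzero(dists <= eps)

def directly_density_reachable(D, p, q, eps, min_pts):
    """True if p is in the Eps-neighbourhood of q and that neighbourhood
    holds at least min_pts points (Definition 2)."""
    n_q = eps_neighbourhood(D, q, eps)
    return bool((n_q == p).any()) and len(n_q) >= min_pts

# Hypothetical toy dataset: three nearby points and one far away.
D = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0]])
reachable_1_from_0 = directly_density_reachable(D, 1, 0, eps=1.0, min_pts=3)
reachable_0_from_1 = directly_density_reachable(D, 0, 1, eps=1.0, min_pts=3)
```

In this toy dataset the relation is not symmetric: point 1 is directly density reachable from point 0, but not the other way round, because the neighbourhood of point 1 holds only two points.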

The Proposed Method AE-DBSCAN
This section presents the proposed algorithm AE-DBSCAN. In contrast to the classical DBSCAN algorithm, it finds the Eps value automatically. The proposed AE-DBSCAN algorithm requires a dataset and a k value (or MinPts) as inputs. The algorithm has two stages: determining the value of Eps and clustering the dataset. The first stage discovers the value of the neighbourhood radius Eps, and this Eps value is then used in the second stage, together with the k (or MinPts) value, to discover the clusters in the dataset. The clustering stage works similarly to the classical DBSCAN algorithm. The pseudo code of the algorithm is given in Algorithm 1.
9. Retrieve all points density-reachable from p with respect to Eps.
10. If the neighbourhood of p contains at least MinPts points, then p is a core point and a cluster is formed.
11. If p is not a core point but one of its neighbours is a core point, p is assigned as a border point.
12. If p is neither a core point nor a border point, mark it as a noise point.
Output the discovered clusters.
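The labelling steps of the clustering stage can be sketched in Python as follows. This is a minimal sketch of classical DBSCAN-style labelling rather than the authors' implementation: the function name, the brute-force Euclidean distance matrix, and the convention that a point counts towards its own neighbourhood are assumptions.

```python
import numpy as np

def dbscan_labels(X, eps, min_pts):
    """Return (cluster label per point, 'core'/'border'/'noise' per point).

    Assumptions: Euclidean distance, O(n^2) pairwise matrix, and a point
    is included in its own Eps-neighbourhood. Label -1 means noise.
    """
    X = np.asarray(X, dtype=float)
    n = len(X)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    neighbours = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_pts for nb in neighbours])

    labels = np.full(n, -1)
    cluster = 0
    for p in range(n):
        if not is_core[p] or labels[p] != -1:
            continue                      # only unvisited core points start a cluster
        labels[p] = cluster
        stack = [p]
        while stack:                      # collect everything density-reachable from p
            q = stack.pop()               # q is always a core point
            for r in neighbours[q]:
                if labels[r] == -1:
                    labels[r] = cluster
                    if is_core[r]:
                        stack.append(r)   # border points are labelled but not expanded
        cluster += 1

    kinds = np.where(is_core, "core", np.where(labels >= 0, "border", "noise"))
    return labels, kinds

# Hypothetical toy data: a small group plus one isolated point.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.9], [10.0, 10.0]]
labels, kinds = dbscan_labels(X, eps=1.0, min_pts=3)
```

In AE-DBSCAN, `eps` would be the value found in the first stage and `min_pts` the user-given k.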
The first stage determines the value of Eps by utilizing the k-dist plot, in which sharp changes are candidate Eps values. In the proposed approach, the k-dist values are first calculated by taking the distance of each point to its k-th nearest neighbour. The k-dist values are then sorted, and the k-dist plot is drawn from these sorted values. To determine the sharp changes, we calculated the slope of each point with respect to the next point. The slope of a point k_i is calculated as the absolute difference between the k-th neighbour distances of k_i and k_{i+1}. Then, we calculated the mean and standard deviation of the non-zero slopes. Slopes whose values are zero are excluded at this stage in order to focus on the changes in the k-dist plot. In our method, the first slope that is above mean(slopes)+standard deviation(slopes) is determined, and the k-dist value corresponding to this slope is selected as the Eps value. We also tested two other strategies to find the Eps value. In the first strategy, the first slope that is above mean(slopes)+2×standard deviation(slopes) is determined, and the corresponding k-dist value is selected as the Eps value. In the second strategy, the first slope that is between mean(slopes)-standard deviation(slopes) and mean(slopes)+standard deviation(slopes) is determined, and the corresponding k-dist value is selected as the Eps value. However, our empirical results showed that the method which finds the first slope above mean(slopes)+standard deviation(slopes) gave the best results. In the clustering stage, the AE-DBSCAN algorithm clusters the dataset using k as MinPts together with the discovered Eps value. This stage is run for each point to discover clusters by marking each point as a core point, a border point, or a noise point.
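The Eps determination stage described above can be sketched in Python. Several details are assumptions rather than statements from the paper: the k-dist values are sorted in ascending order, the distance is Euclidean, the population standard deviation (NumPy's default) is used, and the "corresponding k-dist value" of slope i is taken to be k_i rather than k_{i+1}. The function names are hypothetical.

```python
import numpy as np

def k_dist_sorted(X, k):
    """Ascending list of each point's distance to its k-th nearest neighbour."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))   # full pairwise matrix, O(n^2) memory
    dist.sort(axis=1)                          # column 0 is the point itself (distance 0)
    return np.sort(dist[:, k])

def estimate_eps(k_dist):
    """First k-dist value whose slope exceeds mean + std of the non-zero slopes.

    Returns None when no slope qualifies (for example, a constant k-dist list).
    """
    k_dist = np.asarray(k_dist, dtype=float)
    slopes = np.abs(np.diff(k_dist))           # slope of k_i w.r.t. k_{i+1}
    nonzero = slopes[slopes > 0]               # zero slopes are excluded from the statistics
    if nonzero.size == 0:
        return None
    threshold = nonzero.mean() + nonzero.std()
    above = np.flatnonzero(slopes > threshold)
    if above.size == 0:
        return None
    return float(k_dist[above[0]])

# Hypothetical sorted k-dist list with one sharp jump between 1.3 and 3.0.
eps = estimate_eps([1.0, 1.0, 1.1, 1.2, 1.3, 3.0, 3.1])
```

For a dataset `X` and a chosen `k`, the two functions would be combined as `estimate_eps(k_dist_sorted(X, k))`, and the result passed to the clustering stage together with `k` as MinPts.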

Experimental Results
We compared the performance of the proposed method AE-DBSCAN with that of the analytical method [24]. Experiments were performed to answer the following four questions:
• What is the effect of different densities?
• What is the effect of different sizes?
• What is the effect of adherent clusters?
• What is the effect of the MinPts parameter?
Experiments were conducted on an Intel Core i7 2.4 GHz computer with 8 GB RAM.

Performances of Different Strategies to Determine Eps
In this section, we compared the performance of the epsilon finding strategies using the Compound Dataset (Dataset1), the Complex9 Dataset (Dataset2), and the R15 Dataset (Dataset3). The first strategy finds the epsilon value by finding the k-dist value whose slope is above mean(slopes)+2×standard deviation(slopes), and the second strategy finds the epsilon value by finding the k-dist value whose slope is between mean(slopes)-standard deviation(slopes) and mean(slopes)+standard deviation(slopes).
In this experiment, the k values were selected as 3 for the Compound Dataset, 7 for the Complex9 Dataset, and 4 for the R15 Dataset. Fig. 2 presents the clustering results of the first strategy, Fig. 3 presents the clustering results of the second strategy, and Fig. 4 presents the clustering results of the proposed AE-DBSCAN algorithm for the three datasets.
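The three thresholding strategies compared in this section can be placed side by side in a small Python sketch that operates on an already sorted k-dist list. The function name, the strategy labels, the use of the population standard deviation, the choice of k_i as the value corresponding to slope i, and the restriction of the band strategy to non-zero slopes are all assumptions for illustration.

```python
import numpy as np

def first_slope_eps(k_dist, strategy="mean+std"):
    """Pick Eps from a sorted k-dist list using one of three slope rules.

    'mean+std'    : first slope above mean + std (the proposed rule)
    'mean+2std'   : first slope above mean + 2*std
    'within_band' : first non-zero slope strictly between mean - std and mean + std
    Mean and std are computed over the non-zero slopes only.
    """
    k = np.asarray(k_dist, dtype=float)
    slopes = np.abs(np.diff(k))
    nonzero = slopes[slopes > 0]
    if nonzero.size == 0:
        return None
    m, s = nonzero.mean(), nonzero.std()
    if strategy == "mean+std":
        hit = slopes > m + s
    elif strategy == "mean+2std":
        hit = slopes > m + 2 * s
    elif strategy == "within_band":
        hit = (slopes > 0) & (slopes > m - s) & (slopes < m + s)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    idx = np.flatnonzero(hit)
    return float(k[idx[0]]) if idx.size else None

# Hypothetical k-dist list with a moderate jump (1.5 -> 2.7) and a large one (3.2 -> 5.7).
k_dist = [1.0, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 2.7, 2.8, 2.9, 3.0, 3.1, 3.2, 5.7]
results = {name: first_slope_eps(k_dist, name)
           for name in ("mean+std", "mean+2std", "within_band")}
```

On this made-up list the three rules select different k-dist values, which illustrates why the choice of threshold can change the resulting clustering.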

The Effect of Different Density
In this experiment, we used the Compound Dataset to evaluate the performance of the proposed method AE-DBSCAN and the analytical method [24], since the dataset has clusters with different densities. For the AE-DBSCAN algorithm, the value of k (or MinPts) was set to 3, and for the analytical method the value of k was set to 3 and 16. The results of the experiment can be seen in Fig. 5. As can be seen, the AE-DBSCAN method finds the clusters correctly with a clustering accuracy of 96.56% (Fig. 5(a)). In contrast, the analytical method cannot discover all the clusters. When k is 3, the analytical method assigns some points of the clusters as noise (Fig. 5(b)). When k is 16, the analytical method merges some of the clusters and so cannot discover all clusters completely (Fig. 5(c)). The clustering accuracy of the analytical method is 89.15% for k=3 and 62.61% for k=16.

The Effect of Different Size
In this experiment, we used the Complex9 Dataset to evaluate the performance of the proposed method AE-DBSCAN and the analytical method [24], since it has different-sized clusters. For the AE-DBSCAN algorithm, the value of k (or MinPts) was set to 7, and for the analytical method the value of k was set to 7 and 4. The clustering results of both methods can be seen in Fig. 6. The clustering accuracy of AE-DBSCAN is 99.81%, and the clustering accuracy of the analytical method is 90.42% for k=7 and 99.78% for k=4 (Fig. 6).

The Effect of Adherent Clusters
In this experiment, we used the R15 Dataset to evaluate the performance of the proposed method AE-DBSCAN and the analytical method [24], since the dataset has adherent clusters. For the AE-DBSCAN algorithm, the value of k (or MinPts) was set to 4, and for the analytical method the value of k was set to 2 and 4. The clustering results of both methods can be seen in Fig. 7. The clustering accuracy of AE-DBSCAN is 96.13%, and the clustering accuracy of the analytical method is 76.9% for k=2 and 58.18% for k=4 (Fig. 7).

The Effect of k (or MinPts) Parameter
In this section, we evaluated the effect of the k (or MinPts) parameter on the proposed AE-DBSCAN algorithm. First, we evaluated how the clustering accuracy changes as the value of k increases on the three datasets. As can be seen in Fig. 8, the clustering accuracy gets worse for low and high k values.
In the experiments, we found that the clustering accuracy is higher when the value of MinPts is around 3 or 4. In the second experiment, we evaluated the effect of the MinPts value on the number of clusters (Fig. 9). As seen in Fig. 9, the number of clusters decreases as the value of MinPts increases.

Conclusion
DBSCAN has the potential of discovering clusters with different densities, shapes, and sizes. It is independent of data ordering and can handle noisy datasets. However, determining the value of the neighborhood radius Eps is a difficult task. In this study, we proposed AE-DBSCAN, which includes a new method for automatically determining the value of the neighborhood radius Eps. Experimental results showed that the proposed AE-DBSCAN outperformed the analytical method [24]. As future work, we plan to evaluate the proposed method on more datasets, to improve it, and to compare its performance with that of other algorithms.