SDCOR: Scalable Density-based Clustering for Local Outlier Detection in Massive-Scale Datasets

Sayyed-Ahmad Naghavi-Nozad, Maryam Amir Haeri, and Gianluigi Folino

arXiv:2006.07616v6 [cs.LG] 26 Oct 2020

Abstract—This paper presents a batch-wise density-based clustering approach for local outlier detection in massive-scale datasets. Unlike the well-known traditional algorithms, which assume that all the data is memory-resident, our proposed method is scalable and processes the input data chunk-by-chunk within the confines of a limited memory buffer. In the first phase, a temporary clustering model is built; it is then incrementally updated by analyzing consecutive memory-loads of points. Subsequently, at the end of scalable clustering, the approximate structure of the original clusters is obtained. Finally, by another scan of the entire dataset and using a suitable criterion, an outlying score, called SDCOR (Scalable Density-based Clustering Outlierness Ratio), is assigned to each object. Evaluations on real-life and synthetic datasets demonstrate that the proposed method has a low linear time complexity and is more effective and efficient compared to the best-known conventional density-based methods, which need to load all the data into memory, and also to some fast distance-based methods, which can operate on data resident on disk.

 Index Terms—local outlier detection, massive-scale datasets, scalable, density-based clustering, anomaly detection


• Sayyed-Ahmad Naghavi-Nozad is with the Computer Engineering Department, Amirkabir University of Technology, Tehran, Iran; Email: sa_na33@aut.ac.ir
• Maryam Amir Haeri is with the Department of Computer Science, University of Kaiserslautern, Kaiserslautern, Germany; Email: haeri@cs.uni-kl.de
• Gianluigi Folino is with ICAR-CNR, Rende, Italy; Email: gianluigi.folino@icar.cnr.it
• Sayyed-Ahmad Naghavi-Nozad is the corresponding author.

1 INTRODUCTION

Outlier detection, which is a noticeable and open line of research [1], [2], [3], is a fundamental issue in data mining. Outliers are rare objects that deviate from well-defined notions of expected behavior, and discovering them is sometimes compared with searching for a needle in a haystack, because their rate of occurrence is much smaller than that of normal objects. Outliers often disturb the learning procedure of most analytical models, and thus capturing them is very important, because it can enhance model accuracy and reduce the computational load of the algorithm. However, outliers are not always a nuisance; sometimes they are of special interest to the data analyst, as in problems such as monitoring cellular phone activity to detect fraudulent usage, like stolen phone airtime. Outlier detection can also be considered a preprocessing step, useful before applying any other advanced data mining analytics, and it has a wide range of applicability in many research areas including intrusion detection, activity monitoring, satellite image analysis, and medical condition monitoring [2], [4].

Outliers can generally be partitioned into two categories, namely global and local. Global outliers are objects which show significantly abnormal behavior in comparison to the rest of the data, and thus in some cases they are considered point anomalies. On the contrary, local outliers only deviate significantly w.r.t. a specific neighborhood of the object [1], [5]. In [6], [7], it is noted that the concept of local outlier is more comprehensive than that of global outlier; i.e., a global outlier could also be considered a local one, but not vice versa. This is the reason that makes finding local outliers much more cumbersome.

In recent years, advances in data acquisition have made available massive collections of data, which contain valuable information in diverse fields like business, medicine, society, and government. As a result, common traditional software methods for processing and managing this massive amount of data are no longer efficient, because most of these methods assume that the data is memory-resident, and their computational complexity on large-scale datasets is prohibitive.

To overcome the above-mentioned issues, we propose a new scalable, density-based clustering method for local outlier detection in massive-scale datasets that cannot be loaded into memory at once, employing a chunk-by-chunk load procedure. In practice, the proposed approach is a clustering method for very large datasets, in which outlier identification comes afterwards as a side effect; it is inspired by a scaling clustering algorithm for very large databases, named after its authors, BFR [8]. BFR has a strong assumption on the structure of the existing clusters, which ought to be Gaussian distributed with uncorrelated attributes, and, more importantly, it does not account for noise. In short, this clustering algorithm works as follows. First, it reads the data as successive (preferably random) samples, so that each sample can be stored in the memory buffer, and then it updates the current clustering model over the contents of the buffer. Based on the updated model, singleton data are classified into three groups: some of them can be discarded after updating the sufficient statistics (discard set DS); some can be moderated through compression and abstracted as sufficient statistics (compression set CS); and some demand to be retained in the buffer (retained set RS).
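To make the role of sufficient statistics concrete, the sketch below shows how a BFR-style discard-set summary can be kept as the triple (N, SUM, SUMSQ), from which the mean and per-attribute variance are recovered without storing the raw points. This is illustrative Python only, not the BFR or SDCOR implementation; the class and field names are invented for the example.

import numpy as np

class DiscardSetSummary:
    """Sufficient statistics (N, SUM, SUMSQ) for one BFR-style cluster."""
    def __init__(self, dim):
        self.n = 0                      # number of absorbed points
        self.s = np.zeros(dim)          # per-attribute sum
        self.sq = np.zeros(dim)         # per-attribute sum of squares

    def absorb(self, points):
        """Fold a batch of points (shape: m x dim) into the summary."""
        points = np.atleast_2d(points)
        self.n += points.shape[0]
        self.s += points.sum(axis=0)
        self.sq += (points ** 2).sum(axis=0)

    @property
    def mean(self):
        return self.s / self.n

    @property
    def variance(self):
        # Per-attribute variance; BFR assumes uncorrelated attributes.
        return self.sq / self.n - self.mean ** 2

# Usage: absorb one memory-load of points, then discard the raw points.
summary = DiscardSetSummary(dim=3)
summary.absorb(np.random.default_rng(0).normal(size=(1000, 3)))
print(summary.mean, summary.variance)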

Like BFR, our proposed method operates within the confines of a limited memory buffer. Thus, by assuming that an interface to the database allows the algorithm to load an arbitrary number of requested data points, whether sequentially or in randomized order, we are forced to load the data chunk-by-chunk, so that there is enough space for both loading and processing each chunk at the same time. The proposed approach is based on clustering, and therefore it must prevent outliers from playing any role in forming and updating clusters. After processing each chunk, the algorithm should combine its approximate results with those of previous chunks, in such a way that the final approximate result can compete with the result obtained by processing the entire dataset at once. An algorithm which is capable of handling data in such an incremental way and finally provides an approximate result is, from an operational perspective, called a scalable algorithm [9]. Moreover, from an algorithmic point of view, scalability means that the algorithm complexity should be nearly linear or sublinear w.r.t. the problem size [10].

In more detail, the proposed method comprises three steps. In the first step, a primary random sampling is carried out to create an abstraction of the whole data on which the algorithm works. Then, an initial clustering model is built and some information required for the next phase of incremental clustering is acquired. In the second step, on the basis of the chunk of data currently loaded into memory, a scalable density-based clustering algorithm is executed in order to identify dense regions, which leads to incrementally building some clusters, named mini-clusters or sub-clusters. When all chunks are processed, the final clustering model is built by merging the information obtained through these mini-clusters. Finally, by applying a Mahalanobis distance criterion [11], [12] to the entire dataset, an outlying score is assigned to each object.

In summary, the main contributions of our proposed method are listed as follows:

• There is no need to know the real number of original clusters.
• It works best with Gaussian clusters having correlated or uncorrelated features, but it can also work well with convex-shaped clusters following an arbitrary distribution.
• It has a linear time complexity with a low constant.
• In spite of working in a scalable manner and operating on chunks of data, in terms of detection accuracy it is still able to compete with conventional density-based methods, which maintain all the data in memory, and also with some fast distance-based methods, which do not require loading the entire data into memory at their training stage.

The paper is organized as follows: Section 2 discusses some related works in the field of outlier detection. In Section 3, we present detailed descriptions of the proposed approach. In Section 4, experimental results and analysis on various real and synthetic datasets are provided. Finally, conclusions are given in Section 5.

2 RELATED WORKS

Outlier detection methods can be divided into the following six categories [13]: extreme value analysis, probabilistic methods, distance-based methods, information-theoretic methods, clustering-based methods, and density-based methods.

In extreme value analysis, the overall population is supposed to have a unique probability density distribution, and only those objects at the very ends of it are considered as outliers. In particular, these types of methods are useful to find global outliers [13], [14].

In probabilistic methods, we assume that the data were generated from a mixture of different distributions as a generative model, and we use the same data to estimate the parameters of the model. After determining the specified parameters, outliers will be those objects with a low likelihood of being generated by this model [13]. Schölkopf et al. [15] propose a supervised approach, in which a probabilistic model w.r.t. the input data is provided, so that it can fit normal data in the best possible way. In this manner, the goal is to discover the smallest region that contains most of the normal objects; data outside this region are supposed to be outliers. This method is, in fact, an extended version of Support Vector Machines (SVM), which has been improved to cope with imbalanced data. In practice, in this way, a small number of outliers is considered as belonging to the rare class and the rest of the data as belonging to normal objects.

In distance-based methods, distances among all objects are computed to detect outliers. An object is assumed to be a distance-based outlier iff it lies at least a distance δ away from at least a fraction π of the other objects in the dataset [16]. Bay and Schwabacher [17] propose an optimized nested-loop algorithm based on the k-nearest-neighbor distances among objects, which has a near linear time complexity and is shortly named ORCA. ORCA shuffles the input dataset into random order using a disk-based algorithm, and processes it in blocks, as there is no need to load the entire data into memory. It keeps looking for a user-defined number of data points as potential anomalies, as well as for their anomaly scores. The cut-off value is set as the minimum outlierness score of the set, and it is updated whenever a data point in another block of data has a higher score. Data points which obtain a score lower than the cut-off are pruned; this pruning scheme only expedites the distance computation when the data are ordered in an uncorrelated manner. ORCA's worst-case time complexity is O(n²), and the I/O cost for the data accesses is quadratic. For the anomaly definition, it can use either the kth nearest neighbor or the average distance of the k nearest neighbors. Furthermore, in [18], a Local Distance-based Outlier Factor (LDOF) is proposed to find outliers in scattered datasets, which employs the relative distance from each data object to its nearest neighbors. Sp [19] is a simple and rapid distance-based method that utilizes the nearest-neighbor distance on a small sample from the dataset. It takes a small random sample of the entire dataset, and then assigns an outlierness score to each point, as the distance from the point to its nearest neighbor in the sample set. Therefore, this method has a linear time complexity in the number of objects, dimensions, and samples, and also a constant space complexity, which makes it ideal for analyzing massive datasets.
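The Sp scoring rule just described fits in a few lines. The following sketch is a plain NumPy illustration (not the author's implementation); sample size and seed are arbitrary example choices.

import numpy as np

def sp_scores(X, sample_size=20, seed=0):
    """Score each row of X by its Euclidean distance to the nearest point of a small random sample."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
    S = X[idx]
    # Distances from every object to every sample point (n x s), then the row minimum.
    d = np.linalg.norm(X[:, None, :] - S[None, :, :], axis=2)
    return d.min(axis=1)

X = np.random.default_rng(1).normal(size=(10000, 3))
scores = sp_scores(X)        # higher score = more outlying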

Information-theoretic methods can be regarded as being at almost the same level as distance-based and other deviation-based models. The only exception is that in such methods, a fixed concept of deviation is determined first, and then anomaly scores are established by evaluating the model size for this specific type of deviation; unlike the other usual approaches, in which a fixed model is determined initially and, employing it, the anomaly scores are obtained by measuring the size of the deviation from such a model [13]. Wu and Wang [20] propose a single-parameter method for outlier detection in categorical data using the new concept of holoentropy; by utilizing it, a formal definition of outliers and an optimization model for outlier detection are presented. According to this model, a function for the outlier factor is defined which is solely based on the object itself, not the entire data, and it can be updated efficiently.

Clustering-based methods use a global analysis to detect crowded regions, and outliers will be those objects not belonging to any cluster [13]. A Cluster-Based Local Outlier Factor (CBLOF) in [21] and a Cluster-Based Outlier Factor (CBOF) in [22] are presented. In both of them, after the clustering procedure is carried out, with regard to a specific criterion, clusters are divided into two groups of large and small clusters; it is assumed that outliers lie in the small clusters. At last, the distance of each object to its nearest large cluster is used in different ways to define the outlier score.

In density-based methods, the local density of each object is calculated in a specific way and then utilized to define the outlier scores. Given an object, the lower its local density compared to that of its neighbors, the more likely it is that the object is an outlier. The density around the points can be calculated using many techniques, most of which are distance-based [6], [13]. For example, Breunig et al. [6] propose a Local Outlier Factor (LOF) that uses the distance values of each object to its nearest neighbors to compute local densities. However, LOF has a drawback: the scores obtained through this approach are not globally comparable between all objects in the same dataset, or even across different datasets. The authors of [23] introduce the Local Outlier Probability (LoOP), which is an enhanced version of LOF. LoOP gives each object a score in the interval [0,1], which is the probability of the object being an outlier, and is widely interpretable across various situations. An INFLuenced Outlierness (INFLO) score is presented in [24], which adopts both the nearest neighbors and the reverse nearest neighbors of an object to estimate its relative density distribution. Moreover, Tang and He [25] propose a local density-based outlier detection approach, concisely named RDOS, in which the local density of each object is approximated with local Kernel Density Estimation (KDE) through its nearest neighbors. In this approach, not only the k-Nearest-Neighbors (kNN) of an object are taken into account but, in addition, the Reverse-Nearest-Neighbors (RNN) and the Shared-Nearest-Neighbors (SNN) are considered for density distribution estimation as well.
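As a point of reference for the density-based scores discussed above, the snippet below computes LOF scores with scikit-learn. This is a sketch, not one of the implementations evaluated in this paper, and the neighborhood size is an example value.

from sklearn.neighbors import LocalOutlierFactor
import numpy as np

X = np.random.default_rng(0).normal(size=(5000, 5))
lof = LocalOutlierFactor(n_neighbors=20)    # k nearest neighbors used for the local density estimate
lof.fit(X)
scores = -lof.negative_outlier_factor_      # roughly 1 for inliers, larger for local outliers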
Furthermore, Wahid and Rao have recently proposed three different density-based methods, shortly named RKDOS [26], ODRA [27] and NaNOD [28]. RKDOS is mostly similar to RDOS: using both the kNN and RNN of each point, the density around it is estimated employing a Weighted Kernel Density Estimation (WKDE) strategy with a flexible kernel width. Moreover, the Gaussian kernel function is adopted to corroborate the smoothness of the estimated local densities, and the adaptive kernel bandwidth enhances the discriminating ability of the algorithm for a better separation of outliers from inliers. ODRA is specialized for finding local outliers in high dimensions and is established on a relevant-attribute selection approach. There are two major phases in the algorithm; in the preliminary one, unimportant features and data objects are pruned to achieve better performance. In other words, ODRA first tries to distinguish those attributes which contribute substantially to anomaly detection from the other, irrelevant ones; also, a significant number of normal points residing in the dense regions of the data space are discarded for the sake of easing the detection process. In the second phase, like RDOS, a KDE strategy based on the kNN, RNN and SNN of each data object is utilized to evaluate the density at the corresponding locations, and finally an outlying degree is assigned to each point. In NaNOD, the concept of Natural Neighbor (NaN) is used to adaptively attain a proper value for k (the number of neighbors), called here the Natural Value (NV), while the algorithm does not ask for any input parameter to acquire it. Thus, the parameter-choosing challenge can be mitigated through this procedure. Moreover, a WKDE scheme is applied to assess the local density of each point; as it takes benefit of both the kNN and RNN categories of nearest neighbors, the anomaly detection model becomes adjustable to various kinds of data patterns.

Beyond the mentioned six categories of outlier detection methods, there is another state-of-the-art ensemble method based on the novel concept of isolation, named iForest [29], [30]. The main idea behind iForest is elicited from a well-known ensemble method called Random Forests [31], which is mostly employed in classification problems. In this approach, first it is essential to build an ensemble of isolation trees (iTrees) for the input dataset; then outliers are those objects with a short average path length on the corresponding trees. For building every iTree, after acquiring a random sub-sample of the entire data, it is recursively partitioned by randomly choosing an attribute and then randomly specifying a split value within the interval between the minimum and maximum values of the chosen attribute. Since this type of partitioning can be expressed by a tree structure, the number of partitions required to isolate an object is equal to the path length from the root node to the terminating node in the respective iTree. In such a situation, the tree branches which bear outliers have a remarkably lower depth, because these outlying points fairly deviate from the normal conduct of the data. Therefore, data instances which possess substantially shorter paths from the root in various iTrees are more likely to be outliers.

One chief issue with using such an isolation-based method is that, with an increasing number of dimensions, if an incorrect attribute choice for splitting is made at the higher levels of the iTree, then the probability of the detection outcomes being misled grows considerably. Nevertheless, the advantage of iForest is that, by considering the idea of isolation, it can utilize sub-sampling in a more sophisticated way than the existing methods, and provide an algorithm with a low linear time complexity and a low memory requirement [32]. Moreover, in [33], [34], an enhanced version of iForest, named iNNE, is proposed which tries to overcome some drawbacks of the main method, including the incapability of efficiently detecting local anomalies, anomalies with a low number of pertinent attributes, global anomalies that are disguised through axis-parallel projections, and finally anomalies in a dataset containing various modes.
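The isolation idea translates directly into a few lines with scikit-learn's IsolationForest. This is again only a sketch, not the exact configuration used in the cited works; the ensemble size and sub-sample size below are example values.

from sklearn.ensemble import IsolationForest
import numpy as np

X = np.random.default_rng(0).normal(size=(20000, 8))
iforest = IsolationForest(n_estimators=100, max_samples=256, random_state=0)
iforest.fit(X)
scores = -iforest.score_samples(X)   # higher = shorter average path length = more anomalous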

Remark 1. Given that during scalable clustering we use the Mahalanobis distance measure to assign each object to a mini-cluster, and that the size of the temporary clusters is much smaller than that of the original clusters, it is worth mentioning an important matter here. With respect to [12], [35], [36], [37], in the case of high-dimensional data, classical approaches based on the Mahalanobis distance are usually not applicable, because when the cardinality of a cluster is less than or equal to its dimensionality, the sample covariance matrix becomes singular and not invertible [38]; hence, the corresponding Mahalanobis distance is no longer reliable.

Therefore, to overcome such a problem, in a preprocessing step we need to resort to dimensionality reduction approaches. However, due to the serious dependence of some dimensionality reduction methods like PCA [39] on the original attributes, and the consequent high computational load caused by the huge volume of the input data, we need to look for alternative methods to determine a basis for data projection.

A simple and computationally inexpensive alternative is the use of random basis projections [40], [41], [42]. The main characteristic of these types of methods is that they approximately preserve pairwise Euclidean distances between data points; in addition, the dimension of the transformed space is independent of the original dimension and relies only logarithmically on the number of data points. Finally, after such a preprocessing step, we can be optimistic that the singularity problem will not be present during the clustering procedures; or, in case it is present, we have a suitable mechanism to handle it.
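As an illustration of such a preprocessing step, the sketch below projects high-dimensional data onto a Gaussian random basis with scikit-learn; the target dimension is only an example choice (the principled value follows from the Johnson-Lindenstrauss bound), and this snippet is not part of the SDCOR implementation.

import numpy as np
from sklearn.random_projection import GaussianRandomProjection

n, p = 10000, 500
X = np.random.default_rng(0).normal(size=(n, p))
# A target dimension could also be derived from sklearn's johnson_lindenstrauss_min_dim.
proj = GaussianRandomProjection(n_components=50, random_state=0)
X_low = proj.fit_transform(X)   # 10000 x 50; pairwise Euclidean distances roughly preserved
print(X_low.shape)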
Remark 2. As stated earlier, our proposed approach is inspired by BFR. However, BFR by default uses the K-means algorithm [43] in almost all of its clustering procedures. In addition to the drawback of K-means of depending seriously on foreknowing the true number of original clusters, K-means performs poorly in the presence of outliers; therefore, we need to resort to a clustering approach which is resistant to anomalies in the data. Here, we prefer to employ the density-based clustering method DBSCAN [44]¹. Other noise-tolerant clustering algorithms could be utilized as a substitute option too, but definitely with their own specific configurations.

However, DBSCAN is strongly dependent on the choice of its parameters. Thus, we are forced to utilize some optimization algorithm to find the optimal values for these parameters. Here, we prefer to use the evolutionary algorithm PSO [45].

1. Although we will demonstrate that sometimes throughout scalable clustering, even DBSCAN may produce some mini-clusters that cannot abide outliers, because of some data assignment restraints; hence, we will be forced to use the same K-means algorithm to fix the issue.
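The sketch below illustrates the kind of parameter search meant here, but with a simple random search standing in for PSO; the fitness function (mean silhouette of the non-noise points) and the search ranges are assumptions of this example, not the objective used in the paper.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def tune_dbscan(S, n_trials=30, seed=0):
    """Randomly search (Eps, MinPts) on the sampled data S; a simplified stand-in for PSO."""
    rng = np.random.default_rng(seed)
    best = (None, -np.inf)
    for _ in range(n_trials):
        eps = rng.uniform(0.05, 2.0)
        minpts = int(rng.integers(3, 30))
        labels = DBSCAN(eps=eps, min_samples=minpts).fit_predict(S)
        core = labels != -1                          # ignore noise points
        if core.sum() < len(S) // 2 or len(set(labels[core])) < 2:
            continue                                 # reject degenerate clusterings
        fitness = silhouette_score(S[core], labels[core])
        if fitness > best[1]:
            best = ((eps, minpts), fitness)
    return best[0]

S, _ = make_blobs(n_samples=2000, centers=3, cluster_std=0.6, random_state=1)
print(tune_dbscan(S))     # (Eps, MinPts) found for the sampled data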
Remark 3. Another important difference between the proposed method and BFR concerns the volume of structural information which they need to store for the clustering procedures. As the proposed method, differently from BFR, can also handle Gaussian clusters with correlated attributes, the covariance matrix is not always diagonal and may have many non-zero elements. Therefore, the proposed method consumes more space than BFR for building the clustering structures.

Since the Mahalanobis distance criterion is crucially based on the covariance matrix, this matrix is literally the most prominent property of each sub-cluster. However, owing to the high computational expense of computing the Mahalanobis distance in high-dimensional spaces, as in [35], [46], we use the properties of the Principal Components (PCs) in the transformed space. Therefore, the covariance matrix of each mini-cluster becomes diagonal, and by transforming each object to the new space of the mini-cluster, like BFR, we can calculate the Mahalanobis distance without the need for matrix inversion.

According to [47], when the covariance matrix is diagonal, the corresponding Mahalanobis distance becomes the same as the normalized Euclidean distance. Moreover, we can establish a threshold value for defining the Mahalanobis radius. If the value of this threshold is, e.g., 4, it means that all points on this radius are as far as four standard deviations from the mean; if we denote the number of dimensions by p, the size of this Mahalanobis radius is equal to 4√p.
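A sketch of this computation follows (illustrative Python; variable names are not from the paper). After projecting an object onto a mini-cluster's retained principal components, the Mahalanobis distance reduces to a normalized Euclidean distance over the component scores, so no covariance inverse is ever formed.

import numpy as np

def mahalanobis_via_pcs(x, mean, components, sqrt_variances):
    """Mahalanobis distance of x to a mini-cluster summarized by its mean,
    its principal components (columns) and the square roots of the PC variances."""
    z = (x - mean) @ components                 # coordinates in the PC space
    return np.sqrt(np.sum((z / sqrt_variances) ** 2))

# Toy mini-cluster: a correlated 2-D Gaussian.
rng = np.random.default_rng(0)
C = rng.multivariate_normal([0, 0], [[3.0, 1.2], [1.2, 1.0]], size=2000)
mean = C.mean(axis=0)
variances, components = np.linalg.eigh(np.cov(C, rowvar=False))   # ascending eigenvalues
order = np.argsort(variances)[::-1]                                # keep all PCs here (p' = p = 2)
components, variances = components[:, order], variances[order]

d = mahalanobis_via_pcs(np.array([4.0, 2.0]), mean, components, np.sqrt(variances))
# A threshold of, e.g., 4 standard deviations corresponds to a radius of 4*sqrt(p).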
3 PROPOSED APPROACH

The proposed method consists of three major phases. In the first phase, a preliminary random sampling is conducted in order to obtain the main premises on which the algorithm works, i.e. some information on the original clusters and some parameters useful for the incremental clustering. In the second phase, a scalable density-based clustering algorithm is carried out in order to recognize dense areas, on the basis of the chunk of data points currently loaded in memory. Clusters built incrementally in this phase are called mini-clusters or sub-clusters, and they form the temporary clustering model. After loading each chunk of data, according to the points already loaded in memory and those left undecided from previous chunks, and by employing the Mahalanobis distance measure with respect to the density-based clustering criteria, we update the temporary clustering model, which consists of making some changes to existing mini-clusters or adding new sub-clusters.

Note that, in the whole scalable clustering procedure, our endeavor is aimed at not letting outliers participate actively in forming and updating any mini-cluster; thus, after processing all the chunks, there will be some objects in the buffer that remain undecided. Some of these data are true outliers, while some others are inliers which, due to constraints, have failed to play any effective role in forming a sub-cluster. Finally, all these undecided points are cleared from the buffer, while only the structural information of the temporary clusters is maintained in memory. Then, at the last part of scalable clustering, depending on the mini-clusters associated to each initial mini-cluster out of the "Sampling" stage, we combine them to obtain the final clusters, whose structure is approximately the same as that of the original clusters.

At last, in the third phase of the proposed approach, w.r.t. the final clustering model gained out of the second phase, once again we process the entire dataset in chunks, to give each object an outlying score, according to the same Mahalanobis distance criterion. Fig. 1 illustrates the software architecture of the approach. Moreover, in Table 1, the main notations used in the paper are summarized.

TABLE 1
Major notations

Notation              Description
[X]_{n×p}             Input dataset X with n objects and p dimensions
x ∈ X                 An instance in X
S ⊂ X                 A random sample of X
X_i ⊂ X               A set of some points in buffer
Y ⊂ X                 A chunk of data
Ω                     A partition of some data points in memory into mini-clusters as {X_1, · · · , X_k}
ℭ                     Number of objects in Y
{Γ}_{8×L}             Information array of temporary clusters with 8 properties for each mini-cluster
{Γ_j}_{8×1} ∈ Γ       Information standing for the j-th mini-cluster in Γ, 1 ≤ j ≤ L
X_{Γ_j}               Mini-cluster points associated with Γ_j, which are removed from buffer
ℜ                     Retained set of objects in RAM buffer
3.1 Sampling

Algorithm 1: Framework of SDCOR
  Input: [X]_{n×p} - the n-by-p input dataset X; the random sampling rate; Λ - PC total variance ratio; the membership threshold; the pruning threshold; C_Eps - sampling Epsilon coefficient; C_MinPts - sampling MinPts coefficient
  Output: the outlying score for each object in X
  Phase 1 - Sampling:
    Step 1. Take a random sample S of X according to the sampling rate
    Step 2. Employ the PSO algorithm to find the optimal values for the DBSCAN parameters Eps_S and MinPts_S, required for clustering S
    Step 3. Run DBSCAN on S using the obtained optimal parameters, and reserve the count of discovered mini-clusters, T, as the true number of original clusters in the data
    Step 4. Build the very first array of mini-cluster information (temporary clustering model) out of the result of Step 3, w.r.t. Algorithm 2
    Step 5. Reserve the covariance determinant values of the initial sub-clusters as a vector of length T, for the maximum covariance determinant condition
    Step 6. Clear S from the RAM and maintain the initial temporary clustering model in the buffer
  Phase 2 - Scalable Clustering:
    Prepare the input data to be processed chunk by chunk, so that each chunk can be fit and processed in the RAM buffer at the same time
    Step 1. Load the next available chunk of data into the RAM
    Step 2. Update the temporary clustering model over the contents of the buffer, w.r.t. Algorithm 3
    Step 3. If there is any available unprocessed chunk, go to Step 1
    Step 4. Build the final clustering model, w.r.t. Algorithm 7, using the temporary clustering model obtained out of the previous steps
  Phase 3 - Scoring:
    According to the final clustering model, for each data point x ∈ X, employ the Mahalanobis distance criterion to find the closest final cluster; lastly, assign x to that cluster and use the criterion value as the object's outlierness score

If the sampling rate is set too low, there may not be a sufficient number of samples from each original cluster, and thus the clustering structure obtained after applying the sampling may not be representative of the original one. Hence, outliers could be misclassified during scalable clustering. In such a situation, obtaining a satisfactory random sample would require processing the entire dataset. In the following, the percentage of sampled data is referred to as the sampling rate.

After obtaining the sampled data, we run the DBSCAN algorithm to cluster them, and assume that the number of mini-clusters obtained through this is the same as the number T of original clusters in the main dataset. We reserve T for later use. Besides, we presume that the location (centroid) and the shape (covariance structure) of such sub-clusters are very close to the original ones. In Section 4, we will show that even by using a small rate of random sampling, the mentioned properties of the sampled clusters are similar enough to those of the original ones. The idea behind making these primary sub-clusters, which are so similar to the original clusters of the input dataset in terms of basic characteristics, is that we intend to determine a Mahalanobis radius which collapses a specific percentage of objects belonging to each original cluster, and to let other sub-clusters be created around this folding area during successive memory-loads of points; ultimately, by merging these mini-clusters, we will obtain the approximate structure of the original clusters.

As we adopt PSO to attain optimal values of the parameters Eps_S and MinPts_S for DBSCAN on the sampled data, it is essential to stipulate that, since the density of the sampled distribution is much lower than that of the original data, we cannot use the same parameters for the original distribution while applying DBSCAN on the objects loaded in memory during scalable clustering. Thus, we have to use a coefficient in the interval [0,1] for each parameter. It is also necessary to mention that Eps is much more sensitive than MinPts: a slight change in the value of Eps may cause a serious deviation in the clustering result, but this does not apply to MinPts. We denote the mentioned coefficients by C_Eps and C_MinPts, for Eps_S and MinPts_S respectively; their values can be obtained from the user, but the best values gained out of our experiments are 0.5 and 0.9 for C_Eps and C_MinPts, respectively.

Now, before building the first clustering model, we need to extract some information out of the sampled clusters obtained through DBSCAN, and store it in a special array. As stated earlier about the benefit of using properties of principal components for high-dimensional data, we need to find those PCs that give higher contributions to the cluster representation. To this aim, we sort the PCs on the basis of their corresponding variances in descending order, and then we choose the topmost PCs having a share of total variance at least equal to Λ percent. We call these PCs superior components and denote their number as p′. Let x be an object among the n total objects in the dataset [X]_{n×p}, belonging to the temporary cluster [X_j]_{n_j×p}; then the information about this sub-cluster, stored as {Γ_j}_{8×1} in the array of the temporary clustering model {Γ}_{8×L}, is as follows:

1) Mean vector in the original space, $\mu_{X_j} = \frac{1}{n_j}\sum_{x \in X_j} x$
2) Scatter matrix in the original space, $S_{X_j} = \sum_{x \in X_j} (x - \mu_{X_j})(x - \mu_{X_j})^T$
3) The p′ superior components, derived from the covariance matrix $\Sigma_{X_j} = \frac{1}{n_j - 1} S_{X_j}$, which form the columns of the transformation matrix $A_{X_j}$
4) Mean vector in the transformed space, $\mu'_{X_j} = \mu_{X_j} A_{X_j}$
5) Square roots of the top p′ PC variances, $[\sqrt{\lambda_1}, \cdots, \sqrt{\lambda_{p'}}]$
6) Size of the mini-cluster, n_j
7) Value of p′
8) Index of the nearest initial mini-cluster, in {1, · · · , T}

Algorithm 2 demonstrates the process of obtaining this information for each mini-cluster attained through either the "Sampling" or the "Scalable Clustering" stage and adding it to the temporary clustering model. As for the eighth property of the information array, for every sampled cluster it is set to the corresponding original cluster number; and for any of the sub-clusters achieved out of incremental clustering, it is set to the index of the nearest initial sub-cluster².

2. It is clear that all those sampled points upon which the initial clustering model is built will be met again throughout scalable clustering. But as they are totally randomly sampled, it can be asserted that they will not impair the structure of the final clustering model.
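A compact sketch of how the eight-property record above could be assembled for one mini-cluster is given below (plain NumPy; the function and field names are illustrative, and the superior-component selection follows the Λ-percent rule described in the text). Algorithm 2 formalizes the same steps within SDCOR.

import numpy as np

def make_miniclust_record(Xj, total_variance_ratio=0.9, nearest_init_idx=0):
    """Summarize mini-cluster Xj (n_j x p) into the eight properties described above."""
    n_j, p = Xj.shape
    mean = Xj.mean(axis=0)                                   # 1) mean vector
    centered = Xj - mean
    scatter = centered.T @ centered                          # 2) scatter matrix
    cov = scatter / (n_j - 1)
    variances, components = np.linalg.eigh(cov)
    order = np.argsort(variances)[::-1]
    variances, components = variances[order], components[:, order]
    share = np.cumsum(variances) / variances.sum()
    p_prime = int(np.searchsorted(share, total_variance_ratio) + 1)
    A = components[:, :p_prime]                              # 3) superior components
    return {
        "mean": mean,
        "scatter": scatter,
        "components": A,
        "transformed_mean": mean @ A,                        # 4) mean in the PC space
        "sqrt_variances": np.sqrt(variances[:p_prime]),      # 5) sqrt of top PC variances
        "size": n_j,                                         # 6)
        "p_prime": p_prime,                                  # 7)
        "nearest_init": nearest_init_idx,                    # 8)
    }

rec = make_miniclust_record(np.random.default_rng(0).normal(size=(500, 6)))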
Algorithm 2: [Γ] = MiniClustMake(Γ, Ω, Λ, index of the nearest initial mini-cluster)
  Input: Γ - current array of mini-cluster information; Ω = {X_1, · · · , X_k} - a partition of some data points in memory into mini-clusters; Λ - PC share of total variance; index of the nearest initial mini-cluster
  Output: Γ - updated temporary clustering model
  1.  l ← L
  2.  foreach mini-cluster X_i, 1 ≤ i ≤ k do
  3.    Apply PCA on X_i and obtain its PC coefficients and variances. Then choose p′ as the number of those top PC variances whose share of total variance is at least Λ percent
  4.    Γ{1, l+i} ← mean vector of X_i
  5.    Γ{2, l+i} ← scatter matrix of X_i
  6.    Γ{3, l+i} ← top p′ PC coefficients corresponding to the top p′ PC variances
  7.    Γ{4, l+i} ← transformed mean vector, as Γ{1, l+i} · Γ{3, l+i}
  8.    Γ{5, l+i} ← square roots of the top p′ PC variances
  9.    Γ{6, l+i} ← number of objects in X_i
  10.   Γ{7, l+i} ← value of p′
  11.   if the given index ≡ 0 then    /* "Sampling" stage */
  12.     Γ{8, l+i} ← i
  13.   else                           /* "Scalable Clustering" stage */
  14.     Γ{8, l+i} ← the given index
  15.   end
  16. end

According to [48], [49], when a multivariate Gaussian distribution is contaminated with some outliers, the corresponding covariance determinant is no longer robust and is significantly larger than that of the main cluster. Following this contamination, the corresponding (classical) Mahalanobis contour line³ also becomes wider than that of the pure (robust) cluster, as it also contains abnormal data. So, it makes sense that there is a direct relationship between the value of the covariance determinant of an arbitrary cluster and the wideness of its tolerance ellipse, which can also be referred to as the spatial volume of the cluster. Moreover, through contamination, this volume can increase and become harmful.

Since during scalable clustering new objects arrive over time and mini-clusters grow gradually, it is possible for a mini-cluster with an irregular shape to accept some outliers. Then, the resulting covariance matrix is no longer robust, and the corresponding Mahalanobis contour line keeps getting wider too, which could cause the absorption of many other outliers. Therefore, to impede the creation of these voluminous non-convex sub-clusters, which could be contaminated with myriad outliers, we have to put a limit on the covariance determinant of every sub-cluster discovered through scalable clustering. Here, we follow a heuristic approach and employ, as the limit, the covariance determinant of the nearest initial sub-cluster obtained out of the "Sampling" stage.

The problem that outliers can be included in clusters and can no longer be detected is called the "masking effect". Note that we are using the mentioned constraint only when sub-clusters are created for the first time, not while they are growing over time. The justification is that, while an object is about to be assigned to a mini-cluster, the other constraint on the Mahalanobis radius, within reason, hinders outliers from being accepted as a member. In other words, when the Mahalanobis distance of an outlier is more than the predefined radius threshold, it cannot be assigned to the sub-cluster, provided that threshold is set to a fair value. Hence, we do not check the covariance determinant of sub-clusters while they are growing.

Here, the first phase of the proposed approach is finished, and we need to clear the RAM buffer of any sampled data and only maintain the initial information obtained about the existing original clusters.

3. Since the terms "Mahalanobis contour line" and "tolerance ellipse" are the same in essence, we will use them interchangeably in this paper.

3.2 Scalable Clustering

During this phase, we have to process the entire dataset chunk-by-chunk, such that for each chunk there is enough space in memory for both loading and processing it all at the same time. After loading each chunk, we update the temporary clustering model according to the data points which are currently loaded in memory from the current chunk or retained from other previously loaded chunks. Finally, after processing all the chunks, the final clustering model is built out of the temporary clustering model. A detailed description of this phase is provided as follows.

3.2.1 Updating the temporary clustering model w.r.t. contents of buffer

After loading each chunk into memory, the temporary clustering model is updated upon the objects belonging to the currently loaded chunk and the other ones sustained from the formerly loaded data. First of all, the algorithm checks for the possible membership of each point of the present chunk to any existing mini-cluster in the temporary clustering model⁴.

Then, after the probable assignments of current chunk points, there is some primary and secondary information of the modified sub-clusters that shall be updated. After this update, the structure of the altered sub-clusters changes, and thus they might still be capable of absorbing more inliers. Therefore, the algorithm checks again for the likely memberships of the sustained points in memory to the updated sub-clusters. This updating and assignment-checking cycle keeps going until there is no retained point that could be assigned to an updated mini-cluster.

When the membership evaluation of the present chunk and retained objects is carried out, the algorithm tries to cluster the remaining sustained objects in memory, regarding the density-based clustering criteria which have been constituted at the "Sampling" phase. After the new mini-clusters are created out of the last retained points, there is the possibility that some sustained inliers in the buffer might not have been capable of participating actively in forming new sub-clusters, because of the density-based clustering standards, though they could be assigned to them, considering the firstly settled membership measures. Hence, the algorithm goes another time into the cycle of assignment and updating, like what was done in the earlier steps.

4. After this step, the unassigned objects of the lastly and previously loaded chunks will be all together considered as retained or sustained objects in the buffer.
Algorithm 3 demonstrates the steps necessary for updating the temporary clustering model, w.r.t. an already loaded chunk of data and other undecided objects retained in memory from before. The following subsections describe these steps in detail.

... does, some information connected to the corresponding sub-cluster shall be updated.

Algorithm 4: [Γ, ...]

Algorithm 5: [Γ, ...]
Fig. 3. Breaking a non-convex mini-cluster with a very wide tolerance ellipse into smaller pieces by the K-means algorithm. Panels: (a) the whole cluster with a distinct non-convex sub-cluster and some outliers around it; (b) dividing the non-convex sub-cluster using K-means with K=2; (c) dividing the non-convex sub-cluster using K-means with K=3.

... illustrated with magenta pentagons⁷.

As is evident, the irregular mini-cluster can absorb some local outliers, as its tolerance ellipse covers a remarkable space outside the containment area of the original cluster. Moreover, the covariance determinant value of the irregular mini-cluster is equal to 15.15, which is almost twice that of the initial mini-cluster, equal to 7.88. Thus, to fix this concern, considering the proportion between the covariance determinant of an arbitrary cluster and its spatial volume, we decide to divide the irregular mini-cluster into smaller coherent pieces, with smaller covariance determinants and also more limited Mahalanobis radii. Therefore, we heuristically set the threshold value for the covariance determinant of any newly discovered sub-cluster or subdivided sub-cluster as that of the nearest initial mini-cluster⁸.

For the division process of an irregular sub-cluster, we prefer to adopt the K-means algorithm. However, K-means can produce some incoherent subdivided sub-clusters in such cases⁹, as shown in Fig. 3b. In Fig. 3b, two smaller mini-clusters, produced as the K-means result, are represented in different colors, with covariance determinants of 1.63 and 7.71 for the coherent blue and incoherent red mini-clusters respectively. The associated centroids and tolerance ellipses are denoted by black crosses and black solid lines, respectively. It is clear that even incoherent subdivided sub-clusters, though with a covariance determinant less than or equal to the predefined threshold, can be as hazardous as non-convex sub-clusters with a high value of the covariance determinant, as their tolerance ellipses could get out of the scope of the corresponding main cluster and suck outliers in¹⁰.

An alternative for this is to use hierarchical clustering algorithms, which typically present a higher computational load than K-means. For this matter, we decide to adopt a K-means variant in which, for every subdivided sub-cluster obtained through K-means, we apply DBSCAN again to verify its cohesion. Finally, we increase the value of K for K-means, from 2 up to a value for which¹¹ three conditions are met for every subdivided sub-cluster: to be coherent, not to be singular, and to have a covariance determinant not greater than the predefined threshold.

Fig. 3c illustrates a scenario in which three smaller mini-clusters are represented as the K-means output in different colors; centroids and tolerance ellipses are shown as in Fig. 3b. The covariance determinant values are equal to 0.12, 1.02 and 0.10 for the green, red and blue subdivided sub-clusters respectively. As is obvious, all subdivided sub-clusters are coherent and not singular, with much smaller determinants than the threshold, and accordingly much tighter spatial volumes. Ultimately, after attaining acceptable sub-clusters, it is time to update the temporary clustering model w.r.t. them, with regard to Algorithm 2. Algorithm 6 shows all the steps required to cluster the data points retained in the memory buffer.

3.2.1.5 Trying to assign retained objects in buffer to newly created mini-clusters: After checking whether the retained objects in the buffer are capable of forming a new mini-cluster, it is necessary to examine the remaining retained objects once more, w.r.t. Algorithm 5; i.e., it ought to be inspected whether or not they could be assigned to the newly created mini-clusters, in the same cycle of membership checking and mini-cluster updating, like what was done in subsection 3.2.1.3. The reason for this concern is that, due to limitations connected to the utilized density-based clustering algorithm, such objects may not have been capable of being an active member of any of the newly created sub-clusters, even though they lie within the associated accepted Mahalanobis radius. Hence, it becomes essential to check again the assignment of these latter retained objects.

Fig. 4 demonstrates an intuitive example of such a situation, in which some objects, according to the density restrictions, cannot be assigned to a cluster, even though they lie in the accepted Mahalanobis radius of that cluster. Objects that have had the competence to form a cluster are shown with black solid circles; and those which are not a part of the cluster, but reside in its accepted Mahalanobis radius, which is represented by a blue solid line, are denoted as red triangles.

7. Here, to challenge the performance of our method, we place the outliers very close to the original cluster. But in reality, it is not usually like that, and outliers often have a significant distance from every normal cluster in the data.

8. This threshold is defined, for any sub-cluster discovered near the i-th initial mini-cluster through scalable clustering, as the covariance determinant of that initial mini-cluster, 1 ≤ i ≤ T. The proximity measure for this nearness is the Mahalanobis distance of the newly discovered sub-cluster's centroid from the initial mini-clusters. We consider any sub-cluster with a covariance determinant greater than such a threshold as a candidate for a non-convex sub-cluster, whose spatial volume could cover some significant space out of the scope of the related original cluster.

10. However, the divided mini-cluster could be significantly smaller in size and determinant; still, because of the lack of coherency, some PC variances could be much larger than others; therefore, the corresponding tolerance ellipse will be more stretched along those PCs, and thus harmful.

11. Here, the upper bound for K is ⌊| ...
Algorithm 6: [Γ, ...]
Each final cluster is represented by a centroid and a covariance matrix Σ_i. Hence, w.r.t. Algorithm 7, if a final cluster contains only one temporary cluster, then the final centroid and the final covariance matrix are the same as those of the temporary cluster. Otherwise, in the case of containing more than one temporary cluster, we utilize the sizes and centroids of the associated mini-clusters to obtain the final mean.

For acquiring the final covariance matrix of a final cluster comprising multiple temporary clusters, for each of the associated mini-clusters, and w.r.t. its centroid and covariance structure, we regenerate a specific amount of fresh data points with a Gaussian distribution. We define the regeneration size of each mini-cluster as the product of the sampling rate (which was used at the "Sampling" stage) and the cardinality of that mini-cluster; this is necessary for saving some free space in memory while regenerating the approximate structure of an original cluster. We consider all regenerated objects of all sub-clusters belonging to a final cluster as a unique and coherent cluster, and obtain the final covariance structure out of it.

But before using the covariance matrix of such a regenerated cluster, we need to mitigate the effect of some generated outliers, which can unavoidably be created during the regeneration process and can potentially prejudice the final accuracy outcomes. For this purpose, we need to prune this transient final cluster, according to the Mahalanobis threshold obtained from the user. Thus, regenerated objects having a Mahalanobis distance from the regenerated cluster greater than this threshold times √p will be obviated. Now, we can compute the ultimate covariance matrix out of such a pruned regenerated cluster, and then remove the transient cluster. This procedure is conducted in sequence for every final cluster which consists of more than one temporary cluster.

Fig. 5a demonstrates a dataset consisting of two dense Gaussian clusters with some local outliers around them. This figure is, in fact, a sketch of what the proposed method looks like at the final steps of scalable clustering, before building the final clustering model. The green dots are normal objects belonging to a mini-cluster. The red dots are temporary outliers, and the magenta square points represent the temporary centroids. Fig. 5b demonstrates both the temporary means and the final means, represented by solid circles and triangles respectively, with a different color for each final cluster. Fig. 5c colorfully demonstrates the pruned regenerated data points for every final cluster besides the final means, denoted as dots and triangles respectively.

Fig. 5. Proposed method appearance at the last steps of scalable clustering. Panels: (a) after processing the last chunk; (b) temporary means and final means; (c) regenerated objects and final means.
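A sketch of the regeneration step for one final cluster made of several mini-clusters is shown below (illustrative NumPy; the field names are invented, and the pruning radius follows the threshold-times-√p rule mentioned above).

import numpy as np

def final_cluster_covariance(miniclusters, sampling_rate, mahal_threshold, seed=0):
    """Regenerate Gaussian points from each mini-cluster summary, prune, and merge."""
    rng = np.random.default_rng(seed)
    regenerated = []
    for mc in miniclusters:   # each mc: {'mean', 'cov', 'n'}
        m = max(2, int(round(sampling_rate * mc["n"])))
        regenerated.append(rng.multivariate_normal(mc["mean"], mc["cov"], size=m))
    Z = np.vstack(regenerated)                       # transient final cluster
    mean = Z.mean(axis=0)
    cov = np.cov(Z, rowvar=False)
    d = np.sqrt(np.einsum("ij,jk,ik->i", Z - mean, np.linalg.inv(cov), Z - mean))
    keep = d <= mahal_threshold * np.sqrt(Z.shape[1])    # prune regenerated outliers
    return Z[keep].mean(axis=0), np.cov(Z[keep], rowvar=False)

mcs = [{"mean": np.zeros(2), "cov": np.eye(2), "n": 5000},
       {"mean": np.array([1.0, 1.0]), "cov": 0.5 * np.eye(2), "n": 3000}]
mu, cov = final_cluster_covariance(mcs, sampling_rate=0.01, mahal_threshold=2)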
Now, after obtaining the final clustering model, the second phase of the proposed approach is finished. In the following subsection, the third phase, named "Scoring", is presented, and we describe how to give each object a score of outlierness w.r.t. the final clustering model obtained out of scalable clustering.

3.3 Scoring

In this phase, w.r.t. the final clustering model obtained through scalable clustering, we give each data point an outlying rank. Therefore, like in phase two, once more we need to process the whole dataset in chunks, and use the same Mahalanobis distance criterion to find the closest final cluster to each object. This local Mahalanobis distance [13] is assigned to the object as its outlying score. The higher the distance, the more likely it is that the object is an outlier. We name the score obtained out of our proposed approach SDCOR, which stands for "Scalable Density-based Clustering Outlierness Ratio"¹².

12. Due to the high computational load associated with calculating the Mahalanobis distance in high dimensions, one can still gain the benefit of using properties of principal components for computing the outlying scores, like what was done during scalable clustering.
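The scoring pass can be sketched as the following chunked loop (illustrative Python; the final clusters are assumed here to be given as mean/inverse-covariance pairs, whereas, as noted in the footnote, the actual computation can instead be carried out in the PC space).

import numpy as np

def score_chunks(chunks, final_clusters):
    """Assign each object the Mahalanobis distance to its closest final cluster."""
    scores = []
    for chunk in chunks:                               # one memory-load at a time
        d = np.stack([
            np.sqrt(np.einsum("ij,jk,ik->i", chunk - c["mean"], c["cov_inv"], chunk - c["mean"]))
            for c in final_clusters
        ], axis=1)                                     # chunk_size x T distances
        scores.append(d.min(axis=1))                   # SDCOR score: distance to the nearest cluster
    return np.concatenate(scores)

clusters = [{"mean": np.zeros(2), "cov_inv": np.eye(2)},
            {"mean": np.array([6.0, 6.0]), "cov_inv": np.eye(2) / 2}]
data = np.random.default_rng(0).normal(size=(10000, 2))
scores = score_chunks(np.array_split(data, 10), clusters)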
3.4 Algorithm complexity

Here, at first, we analyze the time complexity of the proposed approach. For the first two phases, "Sampling" and "Scalable Clustering", the most expensive operations are the application of DBSCAN to the objects residing in memory after loading each chunk, and the application of PCA to each mini-cluster to obtain and update its secondary information.

Let ℭ be the number of data points contained in a chunk. Considering the three-sigma rule of thumb, in every memory-load of points the majority of these points lies in the accepted Mahalanobis neighborhood of the current temporary clusters and is assigned to them; this effect escalates over time with the increasing number of sub-clusters created during each memory process. Also, by utilizing an indexing structure for the k-NN queries, the time complexity of the DBSCAN algorithm will be O(ϵℭ log(ϵℭ)), where ϵ is a low constant close to zero. Note that exercising an indexing structure to decrease the computational complexity of answering the k-NN queries is only applicable in lower dimensions; for high-dimensional data, we need some sort of sequential scan to handle the k-NN queries, which leads to an average time complexity of O(ϵℭ²) for DBSCAN.

Applying PCA to the mini-clusters is O(p³, p²ℭ) [49], where p stands for the dimensionality of the input dataset. But according to our strong assumption that p < ℭ, applying PCA will be O(p³). Hence, the first two phases of the algorithm will totally take O(ϵℭ², p³).

The last phase of the algorithm, considering that it only consists of calculating the Mahalanobis distance of the total n objects in the input dataset to the T final clusters, is O(nT); and as T ≪ n, the time complexity of this phase will be O(n). The overall time complexity is thus at most O(ϵℭ², p³) + O(n). However, in addition to ϵ being a truly tiny constant, it is evident that both p and ℭ are negligible w.r.t. n; therefore, we can state that the time complexity of our algorithm is linear.

The analysis of the algorithm's space complexity is twofold. First, we consider the information stored in memory for the clustering models; second, the amount of space required for processing the data resident in RAM. With respect to the fact that the most voluminous parts of the temporary and of the final clustering models are the scatter and covariance matrices respectively, and that L ≫ T, the space complexity of the first part will be O(Lp²). For the second part, given that in each memory-load of points the most expensive operations belong to the clustering algorithms, DBSCAN and K-means, and regarding the linear space complexity of these methods, the overall space complexity will be O(ℭ + Lp²).
4 EXPERIMENTS

In this section, we conduct effectiveness and efficiency tests to analyze the performance of the proposed method. All the experiments were executed on a laptop having a 2.5 GHz Intel Core i5 processor, 6 GB of memory and a Windows 7 operating system. The code was implemented using MATLAB 9, and for the sake of reproducibility, it is published on GitHub¹³.

For the effectiveness test, we analyze the accuracy and stability of the proposed method on some real-life and synthetic datasets, compared to two state-of-the-art density-based methods, namely LOF and LoOP; a fast K-means variant optimized for clustering large-scale data, named X-means [50]; and two state-of-the-art distance-based methods, namely Sp and ORCA. Subsequently, for the efficiency test, experiments are conducted on some synthetic datasets in order to assess how the final accuracy varies when the number of outliers is increased. In addition, more experiments are executed to appraise the scalability of the proposed algorithm and the effect of the random sampling rate on the final detection validity.

4.1 Accuracy and stability analysis

Here, the evaluation results of the experiments conducted on various real and synthetic datasets, with a diversity of sizes and dimensionalities, are presented in order to demonstrate the accuracy and stability of the proposed approach.

4.1.1 Real datasets

Some public and large-scale real benchmark datasets, taken from UCI [51], and preprocessed and labeled by ODDS [52], are used in our experiments. They are representative of different domains in science and the humanities. Table 2 shows the characteristics of all test datasets, namely the numbers of objects (#n), attributes (#p) and outliers (#o). In addition, for the outliers in each dataset, their share of the total objects in the corresponding dataset is reported in percentage terms. In all of our experiments, the Area Under the Curve (AUC) (of the curve of detection rate versus false alarm rate) [1], [2] is used to evaluate the detection performance of the compared algorithms. The AUC and runtime results¹⁴ of the different methods are summarized in Table 3 and Table 4, respectively. The boldfaced AUC and runtime indicate the best method for a particular dataset.

TABLE 2
Properties of the datasets used in the efficacy experiments

                    Dataset             #n          #p   #o (%o)
Real datasets       Shuttle             49,097       9   3,511 (7.15%)
                    Smtp (KDDCUP99)     95,156       3   30 (0.03%)
                    ForestCover         286,048     10   2,747 (0.96%)
                    Http (KDDCUP99)     567,498      3   2,211 (0.38%)
Synth. datasets     Data1               500,000     30   5,000 (1.00%)
                    Data2               1,000,000   40   10,000 (1.00%)
                    Data3               1,500,000   50   15,000 (1.00%)
                    Data4               2,000,000   60   20,000 (1.00%)

13. https://github.com/sana33/SDCOR
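For reference, AUC values of the kind reported in Table 3 can be reproduced from any score vector with a standard routine, as in the sketch below (scikit-learn; the label convention, 1 for outliers, and the toy data are assumptions of this snippet).

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = np.r_[np.zeros(950), np.ones(50)]                     # 1 marks a labeled outlier
scores = np.r_[rng.normal(0, 1, 950), rng.normal(3, 1, 50)]    # higher score = more outlying
print(roc_auc_score(y_true, scores))                           # area under the detection/false-alarm curve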
in order to assess how the final accuracy varies when
the number of outliers is increased. In addition, more ex- For the competing density-based methods, the parame-
periments are executed to appraise the scalability of the ters are set as suggested, i.e. = 10, = 50
proposed algorithm, and the effect of random sampling rate in LOF and = 30, = 3 in LoOP. For X-means, the
on the final detection validity. minimum and maximum number of clusters are set to 1 and
 15 respectively; maximum number of iterations and number
 of times to split a cluster are set to 50 and 6 respectively as
4.1 Accuracy and stability analysis
 well.
Here, the evaluation results of the experiments conducted In ORCA, the parameter denotes the number of near-
on various real and synthetic datasets, with a diversity of est neighbors, which by using higher values for that, the
size and dimensionality, are presented in order to demon- execution time will also increase. Here, the suggested value
strate the accuracy and stability of the proposed approach. of , equal to 5, is utilized in all of our experiments. The
 parameter specifies the maximum number of anomalies to
4.1.1 Real datasets
Some public and large-scale real benchmark datasets, taken from UCI [51], and preprocessed and labeled by ODDS [52], are used in our experiments. They are representative of different domains in science and the humanities. Table 2 shows the characteristics of all test datasets, namely the numbers of objects (#n), attributes (#p) and outliers (#o). In addition, for the outliers in each dataset, their share of the total objects in the corresponding dataset is reported in percentage terms. In all of our experiments, the Area Under the Curve (AUC) (curve of detection rate versus false alarm rate) [1], [2] is used to evaluate the detection performance of the compared algorithms. The AUC and runtime results14 of the different methods are summarized in Table 3 and Table 4, respectively. The bold-faced AUC and runtime indicate the best method for a particular dataset.

14. For the proposed method, as we know the true structural characteristics of all the real and synthetic data, we compute the exact anomaly score, based on the Mahalanobis distance criterion, for each object and report the resulting optimal AUC next to the result attained by SDCOR. Moreover, we do not take into account the time required for finding the optimal parameters of DBSCAN as a part of the total runtime. Furthermore, as X-means assumes the input data to be free of noise, after obtaining the final clustering outcome, the Euclidean distance of each object to its closest centroid is assigned to it as an outlier score, and hence the corresponding AUC could be calculated.
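To make this evaluation protocol concrete, the following minimal MATLAB sketch computes such an AUC from an outlier score vector and the ground-truth labels, using the X-means-style scoring of footnote 14 (distance of each object to its closest centroid) as the example score; the variables X, centroids and labels are hypothetical placeholders and are not part of the released implementation.

    % Hypothetical inputs: X (n-by-p data), centroids (k-by-p cluster centers),
    % labels (n-by-1 ground truth, 1 = outlier, 0 = inlier).
    D      = pdist2(X, centroids);                   % distances from every object to every centroid
    scores = min(D, [], 2);                          % X-means-style outlier score (cf. footnote 14)
    [~, ~, ~, auc] = perfcurve(labels, scores, 1);   % AUC of the detection-rate/false-alarm-rate curve
    fprintf('AUC = %.3f\n', auc);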

TABLE 2
Properties of the datasets used in the efficacy experiments

                    Dataset            #n          #p    #o (%o)
Real Datasets       Shuttle            49,097       9    3,511 (7.15%)
                    Smtp (KDDCUP99)    95,156       3    30 (0.03%)
                    ForestCover        286,048     10    2,747 (0.96%)
                    Http (KDDCUP99)    567,498      3    2,211 (0.38%)
Synth. Datasets     Data1              500,000     30    5,000 (1.00%)
                    Data2              1,000,000   40    10,000 (1.00%)
                    Data3              1,500,000   50    15,000 (1.00%)
                    Data4              2,000,000   60    20,000 (1.00%)
For the competing density-based methods, the parameters are set as suggested in the corresponding papers, i.e., the lower and upper bounds on the neighborhood size are set to 10 and 50 in LOF, and the neighborhood size and the normalization factor λ are set to 30 and 3 in LoOP. For X-means, the minimum and maximum number of clusters are set to 1 and 15, respectively; the maximum number of iterations and the number of times to split a cluster are set to 50 and 6, respectively.
In ORCA, the parameter k denotes the number of nearest neighbors; using higher values for it also increases the execution time. Here, the suggested value of k, equal to 5, is utilized in all of our experiments. The parameter N specifies the maximum number of anomalies to be reported. If N is set to a small value, then ORCA increases the running cut-off quickly and therefore, more searches will be pruned off, which results in a much faster runtime. Hence, as the correct number of anomalies is not assumed to be known in advance by the algorithm, we set N = n/8 as a reasonable value, where n stands for the cardinality of the input data.
For Sp, the sample size is set to the default value suggested by the author, equal to 20, in our experiments. Furthermore, for SDCOR, each dataset is divided into 10 chunks, to simulate the capability of the algorithm for scalable processing. Moreover, for the algorithms that have random elements, including Sp and the proposed method, their detection results are reported in the format µ ± σ over 40 independent runs, where µ and σ stand for the mean value and the standard deviation of the AUC values.
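As a small illustration of this reporting protocol (not code from the paper), the following MATLAB sketch aggregates the AUC over repeated runs; run_sdcor_once is a hypothetical wrapper around one complete execution of a randomized detector on the data X with ground-truth labels.

    nRuns = 40;                                % number of independent runs, as in the experiments
    aucs  = zeros(nRuns, 1);
    for r = 1:nRuns
        aucs(r) = run_sdcor_once(X, labels);   % hypothetical wrapper returning one AUC value
    end
    fprintf('AUC = %.3f +/- %.3f\n', mean(aucs), std(aucs));   % reported as mean +/- standard deviation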
Additionally, as we are trying to keep outliers from participating actively in the process of forming and updating mini-clusters during scalable clustering, two of the input parameters of the proposed algorithm are particularly critical: (i) the random sampling rate, which influences the parameter that bounds the volume of mini-clusters at the time of their creation; and (ii) the membership threshold, which is useful to restrain the volume of mini-clusters while they are incrementally growing over time.
Note that the sampling rate should not be set too low, as the singularity problem might then happen during the "Sampling" phase, or some original clusters may not take their initial density-based form, owing to the lack of enough data points15. The same concern applies to the membership threshold, as a too low value could cause the number of sub-clusters created over scalable clustering to become too large, which leads to a much higher computational load. In addition, a too high value brings the risk of outliers getting joined to normal sub-clusters; this escalates over time and increases the "False Negative" rate. Here, in all experiments, the membership threshold is set to 2.

15. Hence, one can state that there is a direct relationship between the random sampling rate and the true number of original clusters in the data. In other words, for datasets with a high frequency and variety of clusters, we are forced to take higher ratios of random sampling, to avoid both problems of singularity and misclustering in the "Sampling" stage. Here, regarding our pre-knowledge about the real and synthetic data, we have set the sampling rate to 0.5%.
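To illustrate how such a membership threshold can act during incremental clustering (a sketch under our own assumptions, not the exact routine of the released code), the following MATLAB fragment accepts a point into an existing mini-cluster only if its Mahalanobis distance to that mini-cluster falls below the threshold; x, mu and Sigma are hypothetical placeholders for a point, a mini-cluster mean and its covariance matrix.

    threshold = 2;                          % membership threshold used in all experiments
    d  = x(:)' - mu(:)';                    % deviation of the point from the mini-cluster mean
    md = sqrt((d / Sigma) * d');            % Mahalanobis distance w.r.t. the mini-cluster covariance
    isMember = (md <= threshold);           % true: absorb the point and update the mini-cluster;
                                            % false: leave it out, so outliers do not inflate the cluster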
As for the real datasets, Table 3 shows that in terms of AUC, SDCOR is more accurate than all the other competing methods, with the exception of the Smtp dataset, on which LoOP achieves the best result, though not by a remarkable margin. Moreover, it is obvious that the results attained by the proposed approach are almost the same as the optimal ones, and the average line indicates that SDCOR performs overall much better than all the other methods. More importantly, SDCOR is effective on the largest dataset, Http. In addition, by considering the standard deviations of the AUC values for Sp and SDCOR, it is evident that SDCOR is much more stable than Sp: the average standard deviation for SDCOR is 0.007, which is negligible compared to that of Sp, equal to 0.116. This is due to Sp using only one very small sample, which causes the algorithm to go through large variations in its final accuracy.
For X-means, as it is not compatible with anomalies in the input data, it severely fails at separating outliers from the normal clusters; this is clear from its average AUC for the real datasets, which shows that outliers are totally misclassified.

Furthermore, Table 4 reveals that SDCOR performs much better than the other competing methods in terms of execution time, except for Sp, which is slightly faster than the proposed method. Although Sp is the fastest among the compared algorithms, as noted above, its AUC suffers from large variations and is lower than that of SDCOR. Moreover, w.r.t. the two state-of-the-art density-based methods, it is evident that there is a huge difference in running time between SDCOR and these methods; this is due to the fact that SDCOR, differently from LOF and LoOP, does not need to compute the pairwise distances of all the objects in a dataset.

As stated at the beginning of this paper, the strong assumption of SDCOR concerns the structure of the existing clusters in the input dataset, which should have a Gaussian distribution. In practice though, w.r.t. [53], owing to the Central Limit Theorem, a fair amount of real-world data follow the Gaussian distribution. Furthermore, w.r.t. [54], [55], [56], even when the original variables are not Normal, employing properties of PCs for detecting outliers is possible and the corresponding results will be reliable, since, given that PCs are linear functions of p random variables, an appeal to the Central Limit Theorem may justify the approximate Normality of the PCs even in cases where the original variables are not Normal. Regarding this concern, it is plausible to set up more formal statistical tests for outliers based on PCs, supposing that the PCs have Gaussian distributions. Moreover, the use of the Mahalanobis distance criterion for outlier detection is only viable for convex-shaped clusters; otherwise, outliers could be assigned to an irregular (density-based) cluster due to the masking effect and thus be misclassified.
 4.1.2 Synthetic datasets
Experiments on synthetic datasets are conducted in an ideal setting, since these datasets follow the strong assumptions of our algorithm on the structure of the existing clusters, which should be Gaussians. Moreover, the generated outliers are typically more outstanding than those in the real data, and the outlier ground truth can be utilized to evaluate whether an outlier detection algorithm is able to locate them. The experiments carried out on four artificial datasets are reported at the bottom of Table 3, along with their execution times in Table 4. Each dataset consists of 6 Gaussian clusters, and outliers take up 1 percent of its volume.

In more detail, for each dataset having p dimensions, we build Gaussian clusters with arbitrary mean vectors, chosen far enough from each other to hinder possible overlappings among the multidimensional clusters. As for the covariance matrix, first, we create a matrix [A]p×p whose elements are uniformly distributed in the interval [0,1]. Then, we randomly select half of the elements in A and make them negative. Finally, the corresponding covariance matrix is obtained in the form [Σ]p×p = AᵀA. Now, w.r.t. the mean vector (location) of each cluster and its covariance matrix (shape of the cluster), we can generate an arbitrary number of data points from a p-variate Gaussian distribution.
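The following MATLAB sketch mirrors this generation procedure for a single cluster; the cluster size nk and the mean vector mu are hypothetical inputs, and mvnrnd (Statistics Toolbox) draws the p-variate Gaussian sample.

    p   = 30;                                 % dimensionality of the dataset (e.g., Data1)
    nk  = 1000;                               % hypothetical number of points in this cluster
    mu  = 100 * rand(1, p);                   % hypothetical arbitrary mean vector (cluster location)
    A   = rand(p);                            % p-by-p matrix with entries uniform in [0,1]
    idx = randperm(p*p, round(p*p/2));        % pick half of the entries at random ...
    A(idx) = -A(idx);                         % ... and make them negative
    Sigma  = A' * A;                          % positive semidefinite covariance (cluster shape)
    Xk     = mvnrnd(mu, Sigma, nk);           % nk points from the p-variate Gaussian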
Moreover, to eliminate marginal noisy objects