An Overview of Multiple Outliers in Multidimensional Data

Outlier detection is an interesting and important part of data analysis because outliers can affect inference. The literature abounds with procedures for detecting and testing single outliers in sample data. However, the presence of two or more outliers in multivariate data makes detection and testing considerably more complicated, because the majority of the outliers are invisible to many of the methods. This is due to the masking effect, which renders classical and related methods unsuitable for outlier identification. The difficulty of detection increases with the number of outliers and with the dimension of the data, because outliers can be extreme in a growing number of directions. This study provides an overview of multivariate outlier detection methods, motivated by their growing importance in a wide variety of practical situations.


Introduction
Statisticians have long been interested in finding "outlying", "unusual", or "unrepresentative" observations as a precursor to data analysis. Data that were incorrectly entered, or that do not belong to the population from which the rest of the data came, can bias the estimates and give misleading results. Methods have been devised to identify and/or accommodate outlying observations in a variety of situations. With recent advances in technology, scientists are collecting large data sets, and analysts are digging deeper to unravel the mysteries of the data. It is therefore important to have a good methodology for dealing with rogue observations that might not be noticed in a typical data analysis.
The basic definition of an outlying observation is a data point, or set of points, that does not fit the model of the rest of the data. More specific definitions have been given, such as: an outlier is a point such that "in observing a set of observations in some practical situation one (or more) of the observations 'jars' stands out in contrast to other observations, as extreme value" [1].
An outlying observation, or 'outlier', is one that appears to deviate markedly from other members of the sample in which it occurs [2].
An outlier is "an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism" [3].
However, the words "stands out", "appears to deviate", and "arouse suspicions" imply some kind of subjectivity or preconceived ideas about what the data should look like. Formal methods also often rely on distributional assumptions, but they reduce the amount of subjectivity in data analyses that employ outlier detection.
There are two basic reasons to search for outliers: i) interest in the outliers for their own sake, and ii) the influence the outliers could have on the results obtained from the rest of the data. The 1949 English case of Hadlum vs. Hadlum provides a good example of interest in outliers for their own sake. Mr. Hadlum appealed the rejection of an earlier petition for divorce on grounds of adultery. Mrs. Hadlum had given birth to a child (who she claimed was fathered by her husband) on August 12, 1945, 349 days after Mr. Hadlum had left the country. The average gestation period for a human female is 280 days, so the question arose whether 349 days was simply a large observation or whether that data point belonged to another population, namely that of women who conceived much later than August 28, 1944 [1]. Another example where outliers themselves are of primary importance involves air safety, as discussed in [4]. Further applications of outlier identification to homeland security are described in an article by Banks [5].
Conversely, imagine a scientist studying a certain type of mosquito. If there were other types of mosquitoes in his data collection, he would not be interested in their characteristics; he would simply want to remove those observations or ensure that they do not influence the statistical estimates for the original population. In such a situation, the techniques should accommodate the outliers without needing to detect and reject them in the estimation, and are hence called robust. Thus, robustness signifies insensitivity to small deviations from the assumptions [6].
Outliers are those points located "far away" from the majority of the data that "probably do not" follow the assumed model. A simple plot of the data, such as a scatter plot, stem-and-leaf plot, or QQ-plot, can often reveal which points are outliers. This is sometimes called the "interocular test" because it hits between the eyes.
Tukey [7] introduced the boxplot, the most popular graphical procedure for detecting outliers in univariate data. The boxplot rule declares observations to be outliers if they lie outside the interval
$$\left(Q_1 - k(Q_3 - Q_1),\; Q_3 + k(Q_3 - Q_1)\right) \qquad (2.0.1)$$
where $Q_i$ is the $i$th quartile. The common choices for $k$ are 1.5 for flagging "out" values and 3.0 for flagging "far out" observations. Since this rule is not sample-size dependent, the probability of declaring outliers when none exist changes with the number of observations. In that sense, it differs from standard outlier identification rules, which are set at probability α of identifying outliers when none exist.
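The boxplot labeling rule can be illustrated with a minimal sketch. The function name `boxplot_outliers` and the sample data are hypothetical, and the "inclusive" quartile convention of Python's `statistics.quantiles` is an assumption, since several quartile definitions are in use and the rule does not fix one.

```python
import statistics

def boxplot_outliers(values, k=1.5):
    """Return the values lying outside (Q1 - k*(Q3 - Q1), Q3 + k*(Q3 - Q1))."""
    # Quartile convention is an assumption; other conventions shift the fences slightly.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < lower or v > upper]

# Hypothetical sample: nine regular values and one gross error.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
out_values = boxplot_outliers(sample, k=1.5)      # "out" values
far_out_values = boxplot_outliers(sample, k=3.0)  # "far out" values
print(out_values, far_out_values)
```

Because `k` is fixed regardless of the sample size, applying the same function to larger clean samples will flag points more often, which is the sample-size dependence noted above.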
Hoaglin et al. [8] showed that the popular boxplot outlier labeling rule is highly liberal, with a 50% chance of labeling at least one outlier for a random normal sample of size 75. Hoaglin and Iglewicz [9] modified the rule to make it sample-size dependent, so that this probability stays at 5% for normal samples of up to 300 observations. Banerjee and Iglewicz [10] expanded this modified rule to handle large samples and a great variety of continuous univariate distributions. Kimber [11] slightly modified the standard boxplot outlier-labeling rule for skewed distributions by replacing the interval in (2.0.1) with one that also involves M, the sample median. Kimber also used k = 1.5 and studied the exponential distribution, including cases where the data are right censored, using the Kaplan-Meier estimator to estimate the median and quartiles for censored data. Van der Loo [12] introduced two univariate outlier detection methods. In both methods, the distribution of the bulk of the observed data is approximated by regressing the observed values on their estimated QQ-plot positions using a model cumulative distribution function.
Outliers in structured situations such as regression models and designed experiments have been studied by numerous authors, including Gentleman and Wilk ([13], [14]), John and Draper [15], Prescott [16], and John [17]; these studies are based on residuals. Balasooriya et al. [18] carried out an empirical study on several data sets to identify the best of seven commonly used methods for identifying outliers in linear regression models. The methods they compared are due to Tietjen et al. [19], Prescott [16], Andrews and Pregibon [20], Cook and Weisberg [21], Cook [22], and Draper and John [23]. On the basis of their study, they observed that the methods do not always agree and suggested a judicious combination of procedures.
Their empirical studies also revealed that the results tend to disagree strongly in the case of multiple outliers. Balasooriya and Tse [24] studied the relative performance of five widely used test statistics for detecting outliers using the Monte Carlo method. Through this study, they identified the test statistic based on studentized residuals proposed by Tietjen et al. [19] as the best procedure for detecting a single outlier.

Multivariate Outliers
Multivariate outliers pose bigger challenges than univariate ones, because simple visual detection is virtually impossible: the outliers do not "stick out" on the end [25]. Even plotting the data in bivariate form with a systematic rotation of coordinate pairs will not help. Barnett and Lewis [1] and Beckman and Cook [26] presented several key concepts that point to the relevance of multivariate outlier detection methods for anomaly detection.
The breakdown point is an important measure used to describe the resistance of robust estimators to outliers. Following Hodges [27] and Hampel ([28], [29]), the breakdown point of an estimator is the fraction of arbitrary contaminating observations that can be present in a sample before the value of the estimator can become arbitrarily large. Lopuhaä and Rousseeuw [30] presented more formal definitions of the breakdown point for location and covariance estimators. For a location estimator $\hat{t}$ at a collection $X$ of $n$ observations, the breakdown point $\varepsilon^*(\hat{t}, X)$ is defined as:
$$\varepsilon^*(\hat{t}, X) = \min_{1 \le m \le n} \left\{ \frac{m}{n} : \sup_{\tilde{X}} \left\| \hat{t}(X) - \hat{t}(\tilde{X}) \right\| = \infty \right\} \qquad (3.1)$$
where $\tilde{X}$ is a collection of observations corrupted by replacing $m$ of the observations in $X$ with arbitrary values. From (3.1), it can be seen that the breakdown point for a location estimator is the smallest fraction of a sample that can be corrupted by outliers before the distance between the estimate from the original sample and the estimate from the corrupted sample can become arbitrarily large.
The formal definition of the breakdown point for a covariance estimator $\hat{C}$ is given by:
$$\varepsilon^*(\hat{C}, X) = \min_{1 \le m \le n} \left\{ \frac{m}{n} : \sup_{\tilde{X}} D\left(\hat{C}(X), \hat{C}(\tilde{X})\right) = \infty \right\} \qquad (3.2)$$
where $\tilde{X}$ again denotes $X$ with $m$ of its $n$ observations replaced by arbitrary values, $D(A, B) = \max\left\{ \left|\lambda_1(A) - \lambda_1(B)\right|,\; \left|\lambda_p(A)^{-1} - \lambda_p(B)^{-1}\right| \right\}$, and $\lambda_i(A)$ is the $i$th ordered eigenvalue of $A$, with $\lambda_1(A) \ge \dots \ge \lambda_p(A)$. In other words, (3.2) states that the breakdown point for a covariance estimator is the smallest fraction of a sample that can be corrupted by outliers before the difference between the largest eigenvalues of the original and corrupted covariance estimates becomes arbitrarily large, or the smallest eigenvalue of the corrupted estimate becomes arbitrarily close to zero. In the context of estimating the mean vector and covariance matrix for a sample of data, it is advantageous to use estimators with a high breakdown point, close to the theoretical limit of 50%, as explained by Rousseeuw and Leroy [31].
Unfortunately, the breakdown points of the classical mean and covariance estimators are only 1/n, where n is the sample size [32]. Hence, the classical mean and covariance estimators can produce unbounded estimates, in the sense of (3.1) and (3.2), with as little as one contaminating observation present in the sample.
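A small numerical illustration may help here. The sketch below uses hypothetical univariate data and a hypothetical helper `corrupt_one`; it replaces a single observation with increasingly extreme values and compares the sample mean, whose breakdown point is 1/n, with the sample median, a high-breakdown location estimator used here only for contrast.

```python
import statistics

# Hypothetical clean sample of n = 7 measurements.
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7]

def corrupt_one(sample, value):
    """Replace the last observation with an arbitrary value."""
    return sample[:-1] + [value]

for bad in (1e3, 1e6, 1e9):
    corrupted = corrupt_one(clean, bad)
    # The mean follows the contaminating value without bound; the median does not move.
    print(bad, statistics.mean(corrupted), statistics.median(corrupted))
```

One corrupted observation out of seven is enough to carry the mean arbitrarily far, whereas the median stays at 10.0 for every value tried.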
The influence function is also an important robustness measure; it measures the effect on an estimator of adding a small mass at a specific point [33]. Robust estimators ideally have a bounded influence function, which means that a small contamination at any point can only have a small effect on the estimator [34]. As discussed in Hampel et al. [33], the importance of the influence function lies in the fact that it can describe the effect of an infinitesimal contamination at the point x on the estimate T, standardized by the mass of the contaminant. It gives a picture of the asymptotic bias caused by contamination in the data.
If an estimator is affine equivariant, stretching or rotating the data changes the estimate in the corresponding way, so that quantities derived from it, such as Mahalanobis distances, are not affected. Dropping this requirement greatly increases the number of available estimators, and in many cases non-affine equivariant estimators perform better than affine equivariant ones.
In addition to estimator breakdown, the phenomenon of outlier masking also argues for the use of outlier-resistant methods for detecting multidimensional outliers. Masking refers to the condition in which very strong outliers distort non-robust mean and covariance estimates to such a degree that weaker outliers appear ordinary in terms of their Mahalanobis distances. If there are one or more distant outliers and one or more less distant outliers in the same direction, the more distant outliers could shift the mean significantly in that direction, and also increase the standard deviation, to such an extent that the lesser outliers fall less than 2 or 3 standard deviations from the sample mean and go undetected. The degree of masking is measured in terms of an increase in Type II error, or false negatives, since observations that are truly outlying are classified as part of the uncontaminated population of data.
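Masking can be seen even in one dimension. The following sketch uses hypothetical data containing one extreme outlier (1000) and one moderate outlier (20). It compares classical scores based on the mean and standard deviation with scores based on the median and the median absolute deviation (MAD). The MAD-based score, the constant 1.4826 (which makes the MAD consistent for normal data), and the cutoff of 3 are illustrative choices, not a method prescribed in the text.

```python
import statistics

def classical_scores(values):
    """Distance from the mean in units of the sample standard deviation."""
    m, s = statistics.mean(values), statistics.stdev(values)
    return [abs(v - m) / s for v in values]

def robust_scores(values):
    """Distance from the median in units of the scaled MAD."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return [abs(v - med) / (1.4826 * mad) for v in values]

# Hypothetical sample: ten regular values, a moderate outlier and an extreme one.
values = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10, 20, 1000]
classical = classical_scores(values)
robust = robust_scores(values)

# The extreme value inflates the mean and standard deviation,
# so the moderate outlier (index 10) looks ordinary to the classical score.
print(round(classical[10], 2), round(robust[10], 2))
```

With these data the classical score of the value 20 is well below 3, while its robust score is well above 3: the extreme observation masks the moderate one.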
Becker and Gather [35] developed the masking breakdown point of an outlier detection method, which specifies the smallest fraction of outliers in a sample that can induce the masking effect. Becker and Gather prove that the masking breakdown point for an outlier detection method that uses a mean and covariance estimator is bounded by the breakdown points of these two estimators. Further, if the two estimators have the same breakdown point, then the masking breakdown point of the detector is equal to the estimator breakdown point. An immediate conclusion from these findings is that non-robust Mahalanobis distance-based outlier detection methods can be affected by masking in the presence of a single outlying observation.
A further reason for employing multivariate outlier detection methods for anomaly detection is to combat the swamping effect. Whereas masking refers to the increase in Type II error due to the presence of outliers, swamping refers to the increase in Type I error caused by outliers. Hadi [36] observed that not all observations with large [Mahalanobis distance] values are necessarily outliers. For example, a small cluster of outliers will attract [the mean vector] and will inflate [the covariance estimate] in its direction and away from some other observations which belong to the pattern suggested by the majority of observations. To guard against this source of false alarms, multivariate outlier detection methods should use robust estimation methods for the mean vector and covariance matrix. Following this strategy helps ensure that the false alarm rate of an anomaly detection method is in line with the accepted α-level for the method.
Various methods have been proposed over the years to detect outliers. They are broadly classified into robust distance-based methods and non-traditional methods.
The robust distance methods use some form of robust estimation to obtain mean vector and covariance estimates for the data. The Mahalanobis distance is then computed for each observation using these robust estimates, and observations whose distances exceed a critical value (generally taken from the Chi-square distribution if the data are multivariate normal) are labeled as outliers. The non-traditional methods exploit some alternative statistic that is presumably better at revealing outliers, or computationally easier, than distances based on robust mean and covariance estimates. Both classes of methods are discussed in detail in the following sections.
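The generic robust distance scheme can be sketched as follows. The function names `squared_mahalanobis`, `chi2_cutoff_2d` and `flag_outliers` are hypothetical, the robust center and covariance are assumed to be supplied by whichever robust estimator is chosen, and the closed-form Chi-square quantile used here is valid only for p = 2 (for general p a statistical library would be needed).

```python
import math
import numpy as np

def squared_mahalanobis(X, center, cov):
    """Squared Mahalanobis distance of each row of X from `center` under `cov`."""
    diff = np.asarray(X, dtype=float) - np.asarray(center, dtype=float)
    # Solve cov @ z = diff.T rather than forming the inverse explicitly.
    solved = np.linalg.solve(np.asarray(cov, dtype=float), diff.T).T
    return np.einsum("ij,ij->i", diff, solved)

def chi2_cutoff_2d(quantile=0.975):
    """Chi-square quantile with 2 degrees of freedom (closed form, p = 2 only)."""
    return -2.0 * math.log(1.0 - quantile)

def flag_outliers(X, center, cov, quantile=0.975):
    """Label observations whose squared distance exceeds the Chi-square cutoff."""
    return squared_mahalanobis(X, center, cov) > chi2_cutoff_2d(quantile)

# Hypothetical observations and (assumed robust) estimates.
X = np.array([[1.0, 1.0], [3.0, 0.0], [-0.5, 0.2]])
flags = flag_outliers(X, center=[0.0, 0.0], cov=np.eye(2))
print(flags)
```

The detector itself is the same in every robust distance method; the methods differ in how the center and covariance passed to it are estimated.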

Robust Distance-based Methods
Numerous robust distance-based outlier detection methods have evolved over the last two decades; the main ones are presented below in order.

M-Estimation Method
One of the earliest robust distance-based methods was proposed by Campbell [37], who suggested using M-estimators to obtain robust mean vector and covariance matrix estimates. M-estimators themselves were originally proposed by Maronna [38] as an affine equivariant method for obtaining robust mean vectors and covariance matrices for possible use in linear discrimination, principal component analysis, and outlier detection. The M-estimates of a location vector $t$ and a scatter matrix $V$ are defined as the solution to the following system of equations:
$$\frac{1}{n}\sum_{i=1}^{n} u_1(d_i)\,(x_i - t) = 0, \qquad \frac{1}{n}\sum_{i=1}^{n} u_2(d_i^2)\,(x_i - t)(x_i - t)^{T} = V \qquad (3.5)$$
where $d_i^2 = (x_i - t)^{T} V^{-1} (x_i - t)$ is the squared Mahalanobis distance of observation $x_i$, and $u_1$ and $u_2$ are functions of the Mahalanobis distance chosen on the basis of certain assumptions. In general, these functions serve as weighting functions that reduce the impact that outlying observations have on the mean and covariance estimates. Different forms of the weighting functions have been proposed in the literature. Iterative methods are typically employed to solve (3.5), but there is no guarantee of attaining the global optimum. As determined by Maronna [38], a weakness of these estimators is a breakdown point of at most 1/(p+1), where p is the dimension of the data, which can be problematic in high-dimensional space.

MVE and MCD Methods
As a high-breakdown-point alternative to M-estimation, Rousseeuw [39] proposed the Minimum Volume Ellipsoid (MVE) and the Minimum Covariance Determinant (MCD) as methods for estimating the location and scatter of the data. The MVE method searches for the ellipsoid of minimal volume that encompasses at least h of the observations, with h taken as [n/2] + 1, where n is the number of observations. The mean vector estimate is the center of the ellipsoid, and the covariance estimate is the matrix defining the ellipsoid, multiplied by a correction factor to achieve consistency with a multivariate normal distribution. In a similar manner, the MCD looks for the sub-sample of h observations whose covariance matrix has the smallest determinant. The mean vector is then taken as the mean of the h observations, and the covariance estimate is the covariance of the h observations multiplied by a consistency factor. The MVE or MCD estimates are then used to compute the Mahalanobis distances of all the observations in order to detect outliers. The advantage of the MVE and MCD is their high breakdown point of 50%, which makes them very useful for highly contaminated data. A disadvantage of these estimators is the combinatorial optimization problem that must be solved to find their exact solutions. In practice, search heuristics are employed to find approximate solutions.
A practical means of searching for an approximate MVE solution was proposed by Rousseeuw and Leroy [31], and again by Rousseeuw and van Zomeren [40]. This method, referred to as the resampling method, entails drawing m sub-samples of size p + 1 from the original data, where m is chosen to ensure a high probability that at least one sub-sample will be free of outliers. For each sub-sample, the covariance matrix is computed and either inflated or deflated to include h of the observations from the original sample. The volumes of the m resulting ellipsoids are then approximated, and the one with the minimum volume is used to form the MVE estimate.
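The choice of m can be made concrete under a simple model. The sketch below assumes that sub-samples are drawn independently and that each observation is an outlier with probability ε (the contamination fraction), so that a sub-sample of size p + 1 is clean with probability (1 − ε)^(p+1). Both the model and the function name `subsamples_needed` are assumptions made for illustration.

```python
import math

def subsamples_needed(p, contamination, prob=0.99):
    """Smallest m with P(at least one outlier-free sub-sample of size p+1) >= prob."""
    clean = (1.0 - contamination) ** (p + 1)   # P(one sub-sample is outlier-free)
    if clean >= 1.0:
        return 1                               # no contamination: one draw suffices
    # Solve 1 - (1 - clean)^m >= prob for m.
    return math.ceil(math.log(1.0 - prob) / math.log(1.0 - clean))

# Hypothetical settings: m grows quickly with dimension p and contamination.
for p in (2, 5, 10):
    print(p, subsamples_needed(p, contamination=0.3), subsamples_needed(p, contamination=0.5))
```

Since the true contamination fraction is usually unknown, m has to be chosen against a guessed worst case, which is one of the practical difficulties of the resampling approach.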
To improve the efficiency of the MVE estimate, Rousseeuw and Leroy [31] go on to recommend a reweighting step in which the mean vector and covariance matrix are recomputed using only the observations whose squared Mahalanobis distance relative to the MVE mean vector and covariance matrix falls below a suitable quantile of a Chi-square distribution with p degrees of freedom. This reweighting step is also recommended by Rousseeuw and van Zomeren [40], while Lopuhaä and Rousseeuw [30] show that it preserves the breakdown point of the MVE.
When the MVE and MCD estimation methods were originally proposed by Rousseeuw [39], the MVE received more of the initial attention for outlier detection because it was computationally less expensive to find an approximate MVE solution. However, Butler et al. [41] showed that the MCD has better statistical efficiency than the MVE, since the MCD is asymptotically normal. Additionally, Davies [42] showed that the MVE has a lower convergence rate than the MCD. According to Rousseeuw and Van Driessen [43], these theoretical findings, combined with the need for accurate estimators in outlier detection schemes, led the MCD to gain favor over the MVE as the preferred robust estimator for outlier detection. The main drawback of the MCD, however, was the high computational complexity of searching the space of half-samples of a dataset to find the covariance matrix with minimum determinant.
To address this problem, Rousseeuw and Van Driessen [43] proposed the FAST-MCD outlier detection method, which uses a key theoretical finding in conjunction with a partitioning method to search rapidly for an approximate MCD solution. The primary theorem proved by Rousseeuw and Van Driessen states that if one starts with a half-sample of the data, orders the entire data set by the Mahalanobis distances derived from the half-sample's mean vector and covariance matrix, and selects a new half-sample from the observations with the smallest distances, then the covariance determinant of the new half-sample will be less than or equal to that of the old half-sample. By repeatedly applying this theorem to a dataset (each application is referred to as a C-step), it is possible to converge to at least a locally optimal MCD solution. A further finding based on experimental results indicates that if the starting half-sample is capable of converging to a good solution, the covariance determinant will begin to converge rapidly after only two C-steps.
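The C-step can be sketched in a few lines. The code below is only an illustration of the concentration idea on synthetic data: the names `c_step` and `subset_det`, the contaminated sample, the single random starting subset and the cap of 10 iterations are all assumptions, and the partitioning and nesting used by FAST-MCD are not reproduced.

```python
import numpy as np

def c_step(X, subset_idx, h):
    """One concentration step: keep the h observations closest to the current fit."""
    sub = X[subset_idx]
    center = sub.mean(axis=0)
    cov = np.cov(sub, rowvar=False)
    diff = X - center
    # Squared Mahalanobis distances of all observations from the subset's fit.
    d2 = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
    return np.argsort(d2)[:h]

def subset_det(X, idx):
    """Determinant of the covariance matrix of the selected observations."""
    return np.linalg.det(np.cov(X[idx], rowvar=False))

# Synthetic data: 40 regular points plus 10 shifted points acting as outliers.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(40, 2)), rng.normal(loc=8.0, size=(10, 2))])
n, p = X.shape
h = n // 2 + 1

idx = rng.choice(n, size=h, replace=False)   # random starting half-sample
dets = [subset_det(X, idx)]
for _ in range(10):
    new_idx = c_step(X, idx, h)
    dets.append(subset_det(X, new_idx))
    if set(new_idx) == set(idx):             # subset unchanged: converged
        break
    idx = new_idx

print(dets)
```

The recorded determinants never increase from one step to the next, as the theorem states. A single start only reaches a local optimum, which is why many starting subsets are used in practice.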

Stahel-Donoho Estimator
In addition to suggesting the MVE and MCD estimators for use in robust distance outlier detectors, Rousseeuw and Leroy [31] also allude to using Stahel-Donoho estimators in the robust distance computation. These estimators, proposed independently by Stahel [44] and Donoho [45], compute the mean vector and covariance matrix by assigning decreasing weight to observations that are outlying relative to some projection of the data onto univariate space. Specifically, the outlyingness of an observation $x_i$ is defined to be:
$$u_i = \sup_{v} \frac{\left| x_i^{T} v - \operatorname{med}_j\left(x_j^{T} v\right) \right|}{\operatorname{med}_k \left| x_k^{T} v - \operatorname{med}_j\left(x_j^{T} v\right) \right|}$$
where $v$ is a $p$-dimensional projection vector. Upon determining the $u_i$ for all observations, the mean vector and covariance matrix are estimated as:
$$t = \frac{\sum_{i=1}^{n} w(u_i)\, x_i}{\sum_{i=1}^{n} w(u_i)}, \qquad V = \frac{\sum_{i=1}^{n} w(u_i)\,(x_i - t)(x_i - t)^{T}}{\sum_{i=1}^{n} w(u_i)}$$
where $w(u_i)$ is a positive, decreasing weighting function.
The Stahel-Donoho estimator is an attractive robust estimator because it has a high breakdown point which asymptotically approaches 50%, as shown by Donoho [45]. However, as explained by Rousseeuw and Leroy [31], the primary difficulty with these estimators is the computation of the outlyingness values. Apparently, no satisfactory method has been proposed to find these values, which has prevented these estimators from seeing any practical use in outlier detection. However, Gasko and Donoho [46] propose a method that uses these estimators to identify leverage points in multiple regression data.

Hadi's Forward Search Method
Returning to the MVE-based outlier detection method proposed by Rousseeuw and Leroy [31] and Rousseeuw and van Zomeren [40], Hadi [36] identified several limitations of the approach. First, the user must decide upon the number of sub-samples to use in the resampling scheme. This choice is not obvious, since it depends on the presumably unknown fraction of outliers in the data. A second limitation is that the covariance matrices for the sub-samples are estimated using only p + 1 observations, which could lead to singularities or highly inaccurate estimates. The final problem highlighted by Hadi is that several of the sub-samples may have covariance determinants close to zero, leaving the user with the task of choosing which sub-sample to use to form the MVE estimate. Since these sub-samples may have considerably different covariance structures, the resulting MVE estimates are also likely to be different. Thus, choosing the correct sub-sample is not obvious.
To correct for the limitations of the original MVE resampling method, Hadi proposed an MVE-based, non-affine equivariant outlier detection method that begins by computing the vector of coordinate-wise medians for the original data. The median vector is then used to estimate the covariance matrix for the data. These location and covariance estimates are then used to compute robust Mahalanobis distances for the observations. The [(n+p+1)/2] observations with the smallest distances are identified and used to form classical mean vector and covariance estimates and a new set of distances for all the observations. From this latest set of distances, the p + 1 observations with the smallest distances are selected to form what is referred to as the basic subset. This basic subset is analogous to a sub-sample in the MVE resampling method, with two notable differences. First, the basic subset is composed of the observations closest to the centroid of the sample, as determined by the robust, coordinate-wise median Mahalanobis distances. Second, there is only one basic subset in Hadi's method, as opposed to potentially hundreds of sub-samples in the resampling MVE method. This considerable reduction in the number of subsets makes Hadi's method less computationally complex and faster to execute.

Atkinson's Forward Search Method
Sharing Hadi's concerns about the MVE resampling method, Atkinson [47] proposed an affine equivariant forward search algorithm similar in nature to Hadi's method. Atkinson's forward search method begins by randomly selecting a subset of m = p+1 observations and using this subset to estimate a mean vector and covariance matrix. The covariance matrix is inflated or deflated to include h of the original observations, and the volume of the resulting ellipsoid is recorded. The adjusted covariance matrix is then used to compute the squared Mahalanobis distances for all observations, and the m+1 observations with the smallest distances are used to repeat the process, while any observations whose squared distances exceed a critical Chi-square threshold are identified as potential outliers. When m = n, the entire process is repeated with a new random subset of m = p+1 observations. After the algorithm has been run for the desired number of random starting subsets, the adjusted covariance matrix that gave the smallest volume over all trials can be used for the final robust mean and covariance estimates and subsequent outlier detection. However, Atkinson does not recommend identifying outliers in this manner. Rather, he uses a graphical method known as the stalactite plot to analyze which of the observations consistently emerged as outliers at each stage of the algorithm. Atkinson's method is well illustrated in Atkinson [48].

Hawkins' Feasible Solution Algorithm
Motivated by the need for efficient starting solutions for M-estimation and other iterative robust estimators, Hawkins [49] proposed the Feasible Solution Algorithm (FSA) for obtaining approximations to Rousseeuw's MCD estimator. Hawkins also suggests that the MCD estimate resulting from the FSA can be used to detect outliers using the usual robust distance scheme. The FSA begins by assuming that there are at most k outliers in the data. A random sample of n − k observations is then selected from the original sample of n observations, with the remaining k observations trimmed from the data. The randomly selected observations are used to form an initial mean vector and covariance estimate, along with the corresponding covariance determinant.
Next, for each possible pair of observations with one observation coming from the randomly selected subset and the other from the trimmed subset, an updating formula provided by Hawkins is used to determine the reduction in the covariance determinant if the pair of observations is interchanged between subsets. The pair of observations that produces the greatest reduction in the covariance determinant is then swapped, and the process is repeated until no swap can be identified that reduces the determinant. The resulting subset of n − k observations, which leaves no scope for further improvement, is referred to as a feasible solution. The entire process is then repeated to find additional feasible solutions. The final MCD estimate is obtained from the feasible solution that produced the smallest covariance determinant.

Hybrid Algorithm
The robust distance outlier detection methods discussed thus far follow one of three strategies: 1) use of what Rocke and Woodruff [50] refer to as smooth estimators, such as M-estimators or Stahel-Donoho estimators; 2) use of combinatorial estimators such as the MVE or MCD; and 3) use of forward search methods as proposed by Hadi and Atkinson. In an effort to unify these strategies under one outlier detection method, Rocke and Woodruff [50] proposed a hybrid algorithm for the detection of outliers. This method is the culmination of the research of Rocke and Woodruff [51], Woodruff and Rocke [52], Woodruff and Rocke [53] and Rocke [54].
This high breakdown point, affine equivariant detection method is composed of two phases. The objective of Phase I is to obtain a robust estimate of the data set's location and shape. This estimate is achieved by first using Hawkins' FSA to obtain an approximate MCD estimate of the location and shape. The MCD estimate is then used as the starting point of Atkinson's forward search method, in place of the mean vector and covariance matrix of a random subset of p+1 points originally suggested by Atkinson. The non-outlying points identified by Atkinson's method are used to compute the starting mean vector and covariance matrix estimates for a modified, high breakdown point M-estimation method proposed by Rocke [54]. The rationale for obtaining the final estimates in this manner is that the forward search method achieves better results given a good starting point, while M-estimation is also more likely to find the globally optimal solution if the initial estimate is close to this solution. An additional feature of the Phase I process is a partitioning scheme designed to counter the fact that MCD computations grow exponentially with the sample size. Rather than attempting to apply the compound MCD, forward selection, and M-estimation method to the entire data set, the original data are randomly partitioned into a user-specified number of subsets. Robust estimates are then obtained for each subset, and the covariance estimate with the minimum determinant is used in the next phase.
Phase II of the compound estimation method involves computing the Mahalanobis squared distances for all the observations using the robust estimates from Phase I, scaling these squared distances so that they are consistent with distances obtained from multivariate normal data, and comparing the scaled distances to a suitable threshold from a Chi-square distribution with p degrees of freedom.

Smallest Half-Volume and Resampling by Half-Means Methods
Rocke and Woodruff's hybrid algorithm represents a combination of two somewhat theoretical approaches to detecting outliers. The main drawback of the MCD and M-estimation strategies for robust distance detection is their large computational burden, which limits their utility for large-scale problems. As a less formal, intuitive alternative for outlier detection on large datasets, Egan and Morgan [55] propose the Smallest Half-Volume (SHV) method. The basic premise behind the SHV method is that good observations in a dataset will tend to cluster closely together in Euclidean space. To identify a cluster of good data, the method begins by mean-centering and standardizing each column of the data matrix using the respective column mean and standard deviation. This process is referred to as auto-scaling. Using the auto-scaled data, an n × n distance matrix is formed in which element $d_{ij}$ is the Euclidean distance from observation i to observation j. Thus, column j of the distance matrix records how close observation j is to all other observations. With this idea in mind, each column of the distance matrix is sorted in ascending order. For each sorted column, the sum of the first n/2 distances is computed. The column with the smallest sum is identified, and the n/2 observations used in computing this column's sum are labeled as good data. The good data are then used to form a robust mean vector and covariance matrix, and to re-perform the auto-scaling procedure. To detect outliers, the mean vector and covariance estimates are used as robust inputs to the classic Mahalanobis distance detector.
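The SHV steps described above can be sketched as follows. This is an illustrative reading of the description rather than Egan and Morgan's own implementation: the function name `smallest_half_volume`, the use of integer division for n/2, the sample standard deviation for auto-scaling, and the example data are assumptions, and the final re-scaling and Mahalanobis detection steps are omitted.

```python
import numpy as np

def smallest_half_volume(X):
    """Return indices of the 'good' half of the data, with its mean and covariance."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    half = n // 2
    # Auto-scaling: center and standardize each column.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # n x n matrix of Euclidean distances between auto-scaled observations.
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    order = np.argsort(D, axis=0)                       # sort each column
    sums = np.take_along_axis(D, order, axis=0)[:half].sum(axis=0)
    best = int(np.argmin(sums))                         # column with smallest sum
    good = np.sort(order[:half, best])                  # its nearest observations
    return good, X[good].mean(axis=0), np.cov(X[good], rowvar=False)

# Hypothetical data: eight clustered points followed by two distant points.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
              [0.5, 0.5], [0.2, 0.8], [0.8, 0.2], [0.5, 0.1],
              [50.0, 50.0], [60.0, -40.0]])
good, center, cov = smallest_half_volume(X)
print(good, center)
```

The full distance matrix needs memory of order n², so for very large datasets this direct construction would have to be replaced by something more economical.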
In the same article in which the SHV method is proposed, Egan and Morgan [55] also developed the Resampling by Half-Means (RHM) method for detecting outliers. This method makes use of the auto-scaling concept to create samples of robust distances for the observations. The RHM method begins by randomly selecting n/2 observations from the dataset without replacement. Each of the selected observations is used to form a row of the matrix $X^{(i)}$, where i denotes the iteration of the method. The mean and standard deviation are computed for each column of $X^{(i)}$. These estimates are then used to auto-scale the original data matrix. The magnitude of each row of the auto-scaled matrix is computed, which is equivalent to computing the distance of each auto-scaled observation from the centroid of the data. The distances for the n observations are saved in the vector $l^{(i)}$ which, in turn, constitutes the $i$th column of a matrix L. This process is repeated for iteration i+1 until the desired number of iterations is reached. After the last iteration is complete, each column of L is sorted in ascending order. For each of the sorted columns, the observations corresponding to the largest 5% of the distances are identified. Outliers are identified as those observations whose distances appear in the upper 5% of distances an unusually large number of times. Unfortunately, no guidance is provided as to how many appearances are indicative of an outlier, and thus the method ultimately relies on the subjective judgment of the analyst.
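A sketch of the RHM procedure, following the description above, is given below. The function name `resampling_by_half_means`, the default of 200 iterations, the fixed random seed, and the synthetic data are assumptions. The function returns how many times each observation falls in the upper 5% of distances, leaving the final judgment to the analyst, as the method does.

```python
import numpy as np

def resampling_by_half_means(X, n_iter=200, top_frac=0.05, seed=0):
    """Count how often each observation is among the largest distances."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    n_top = max(1, int(round(top_frac * n)))    # size of the upper 5% group
    counts = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        rows = rng.choice(n, size=n // 2, replace=False)   # half-sample X^(i)
        mu = X[rows].mean(axis=0)
        sd = X[rows].std(axis=0, ddof=1)
        # Auto-scale ALL observations with the half-sample estimates,
        # then take the length of each auto-scaled row (vector l^(i)).
        lengths = np.linalg.norm((X - mu) / sd, axis=1)
        counts[np.argsort(lengths)[-n_top:]] += 1
    return counts

# Synthetic data: 19 regular points and one gross outlier in the last row.
demo_rng = np.random.default_rng(1)
X = np.vstack([demo_rng.normal(size=(19, 2)), [[100.0, 100.0]]])
counts = resampling_by_half_means(X)
print(counts)
```

In this toy example the last observation is counted in every iteration. With real data the counts are less clear-cut, which is where the subjective judgment mentioned above comes in.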

Bivariate Boxplot Method
An informal method for detecting outliers in univariate data is to construct a boxplot that visually depicts the location, spread, and skewness of the data. Zani et al. [56] develop a method for building a bivariate boxplot and suggest how it may be used to find multivariate outliers. To build the bivariate boxplot for a pair of variables, the inner region for the plot (analogous to the univariate boxplot's inter-quartile region) is determined through the use of convex hull peeling, originally proposed by Bebbington [57]. Convex hull peeling entails identifying the observations on the convex hull of the bivariate data cloud, trimming these observations from the dataset, and repeating the process until only a desired percentage of the original observations remain. For the purpose of the bivariate boxplot, Zani et al. suggest trimming the data until 50% of the observations remain. These observations define the inner region for the boxplot. To ensure a smooth ellipse that visually depicts this inner region, Zani et al. use the method of B-splines [58] to fit a curve to the convex hull of the inner region. The centroid for the boxplot is computed as the arithmetic mean of the observations contained in the inner region.
To detect multivariate outliers, Zani et al. recommend constructing a bivariate boxplot for every pair of variables. Any observation that is outside the 90% convex hull in any of the plots is removed from the data set. The remaining observations are then used as the starting point for the forward search method of Hadi ([36], [59]) or Atkinson [47]. The authors claim that using the bivariate boxplot in this manner makes the forward search more computationally efficient, presumably because the initial basic subset for the search should contain considerably more than p+1 points.
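The convex hull peeling step can be illustrated with a small pure-Python sketch. The helper names `convex_hull` and `peel_to_fraction` are hypothetical, the hull is found with Andrew's monotone chain algorithm (a choice made here, not one attributed to Zani et al.), and peeling stops at the first layer that leaves no more than the requested fraction of points, which is only one possible reading of the stopping rule.

```python
def convex_hull(points):
    """Hull vertices of 2-D points (tuples), via Andrew's monotone chain."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def peel_to_fraction(points, keep=0.5):
    """Strip successive convex hulls until at most `keep` of the points remain."""
    remaining = list(points)
    target = keep * len(remaining)
    while len(remaining) > target:
        hull = set(convex_hull(remaining))
        inner = [p for p in remaining if p not in hull]
        if not inner:  # stop rather than peel every point away
            break
        remaining = inner
    return remaining

# Usage: the inner region of a 5 x 5 grid and its centroid.
grid = [(x, y) for x in range(5) for y in range(5)]
inner = peel_to_fraction(grid, keep=0.5)
centroid = (sum(p[0] for p in inner) / len(inner),
            sum(p[1] for p in inner) / len(inner))
```

Points lying exactly on a hull edge without being vertices are left for the next layer in this sketch; the B-spline smoothing of the inner hull is not covered.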

BACON Method
The desire to find an outlier detection method that is applicable to very large datasets is echoed by Billor et al. [60]. However, where the FASTMCD method attempts to use nesting and C-steps to search for an optimal solution, Billor et al. make two observations concerning robust distance computation as a guide to developing the Blocked Adaptive Computationally Efficient Outlier Nominator (BACON). The first observation is that the added computational complexity of trying to find optimal robust estimators may not be justified by significantly better outlier detection. The second observation is that insisting upon a completely affine equivariant method may add substantial computational complexity to an algorithm without a proportional improvement in the detection of outliers. Using these two observations, Billor et al. develop BACON as a method that "abandons" optimality conditions in favor of a very fast outlier detection strategy that can be run in a non-robust, affine equivariant mode with a breakdown point of 20%, or in a robust, near-affine equivariant mode with a breakdown point of 40%.
The BACON method is derived from the forward search method of Hadi ([36], [59]), and begins its search for outliers in much the same manner by selecting an initial basic subset of good observations. The manner in which the initial basic subset is chosen depends on whether the user wishes to have a lower breakdown point method that is affine equivariant, or a high breakdown point method that is not completely affine equivariant. In the former case, the initial basic subset contains the p+1 observations with the smallest Mahalanobis distances relative to the mean vector and covariance matrix for the entire dataset. In the latter case, the basic subset is formed from the p+1 observations with the smallest distances relative to the component-wise median of the observations and the covariance matrix derived from this median vector. Using the component-wise median makes the BACON method more robust to outliers at the expense of affine equivariance, since the median estimator is not affine equivariant. Once the initial basic subset is selected, its mean vector and covariance matrix are estimated and used to compute Mahalanobis distances for all observations. Once these distances are obtained, they can be compared to the square root of an appropriate quantile from the Chi-Squared distribution with p degrees of freedom.

Kurtosis Method
In spite of its computational and other difficulties, the Stahel-Donoho estimator was nevertheless important in leading to the development of other estimators. One way to reduce the extreme computational burden is to decrease the number of examined projections. Peña and Prieto [61] presented the Kurtosis method, which projects the data onto a set of 2p directions, where p is the dimension of the data. These directions are chosen so as to maximize or minimize the kurtosis coefficient of the projected data. The kurtosis coefficient is a measure of how peaked or flat the distribution is. Datasets with high kurtosis tend to have a sharp density peak near the mean, decline rather rapidly, and have heavy tails. Symmetric outliers lead to heavy tails and thus, higher kurtosis. A small amount of asymmetric contamination would also increase the kurtosis. The kurtosis coefficient is also affected by modality, as a large number of asymmetric outliers would start introducing bimodality, leading to a very low value of kurtosis.
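The effect of contamination on the kurtosis coefficient can be checked numerically. The sketch below uses the fourth standardized moment (which equals 3 for a normal distribution); the function name `kurtosis_coefficient` and the contamination settings are illustrative assumptions, not values taken from Peña and Prieto.

```python
import numpy as np

def kurtosis_coefficient(x):
    """Fourth standardized moment of a univariate sample."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    return float(np.mean(z ** 4) / np.mean(z ** 2) ** 2)

rng = np.random.default_rng(1)
clean = rng.normal(size=2000)
# A few symmetric outliers on both sides: heavier tails, higher kurtosis.
symmetric = np.concatenate([clean, [8.0] * 20, [-8.0] * 20])
# A large one-sided cluster (30%): bimodality, low kurtosis.
bimodal = np.concatenate([clean[:1400], 6.0 + 0.1 * rng.normal(size=600)])

k_clean = kurtosis_coefficient(clean)
k_symmetric = kurtosis_coefficient(symmetric)
k_bimodal = kurtosis_coefficient(bimodal)
```

Here `k_symmetric` comes out well above 3 and `k_bimodal` well below 3, which is why both the maximizing and the minimizing directions are of interest.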
Peña and Prieto [61] thus argue that searching for outliers along the projections that maximize and minimize the kurtosis coefficient would be very promising. The exact solution of the kurtosis maximization and minimization problems requires a global solution, which is not efficient, so they settle instead for p local maximizers and p local minimizers. They show that computing a local maximizer or minimizer corresponds to finding either (1) the direction from the center of the data straight to the outliers or (2) a direction orthogonal to it. As it is not known which of these two directions has been found, the data need to be projected onto a subspace orthogonal to the computed directions, and another local solution obtained. This process has to be repeated a maximum of p times to find the desired direction, yielding a total of 2p examined directions: p directions for the maximum and p for the minimum.
For each of these 2p directions, Peña and Prieto [61] determine outlyingness based on the univariate median and Median Absolute Deviation (MAD). If a point is an outlier in any of these directions, that is, considering its maximum deviation from the median, it is labeled a potential outlier. The mean and covariance are then computed based on all points not considered to be potential outliers, followed by a robust Mahalanobis distance for each point. If the Mahalanobis distance for any point exceeds the critical value of a $\chi^2$ distribution with p degrees of freedom, it is declared an outlier.

OGK Method
Maronna and Zamar [62] proposed a general method to obtain positive-definite and approximately affine-equivariant robust scatter matrices starting from any pair-wise robust scatter matrix. This method was applied to the robust covariance estimate of Gnanadesikan and Kettenring [25]. The resulting multivariate location and scatter estimates are called orthogonalized Gnanadesikan-Kettenring (OGK) estimates and are calculated as follows:
1. Let m(.) and s(.) be robust univariate estimators of location and scale.
2. Construct $y_i = D^{-1} x_i$ for i = 1, …, n with $D = \mathrm{diag}(s(X_1), \ldots, s(X_p))$.
3. Compute the matrix $U = (u_{jk})$ with $u_{jk} = \frac{1}{4}\left[ s(Y_j + Y_k)^2 - s(Y_j - Y_k)^2 \right]$ for $j \neq k$ and $u_{jj} = 1$.
4. Compute the matrix E of eigenvectors of U and a) project the data on these eigenvectors, i.e. $V = YE$; b) compute 'robust variances' of $V = (V_1, \ldots, V_p)$, i.e. $\Lambda = \mathrm{diag}(s^2(V_1), \ldots, s^2(V_p))$; c) set the p × 1 vector $\hat{\mu}(Y) = Em$ where $m = (m(V_1), \ldots, m(V_p))^T$, and compute the positive definite matrix $\hat{\Sigma}(Y) = E \Lambda E^T$.
5. Transform back to X, i.e. $\hat{\mu}_{OGK} = D\hat{\mu}(Y)$ and $\hat{\Sigma}_{OGK} = D\hat{\Sigma}(Y)D^T$.
Once these raw estimates are computed, they can be used to compute the robust Mahalanobis distance $d_i$ of each observation $x_i$ with respect to $\hat{\mu}_{OGK}$ and $\hat{\Sigma}_{OGK}$. If the Mahalanobis distance for any observation exceeds the critical value $c = \chi^2_p(0.9)\,\mathrm{med}(d_1, \ldots, d_n)/\chi^2_p(0.5)$, it is declared an outlier. By using this cut-off value and the robust Mahalanobis distance, a weight function can be defined and, as in the FASTMCD algorithm, the estimate is improved by a weighting step, which yields the weighted OGK estimates of location and scatter.
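The raw OGK steps listed above can be sketched as follows. The sketch assumes the median for m(.) and the normal-consistent MAD for s(.), and it assumes the usual Gnanadesikan-Kettenring identity u_jk = [s(Y_j + Y_k)² − s(Y_j − Y_k)²]/4 for the off-diagonal entries of U; the function names `mad` and `ogk` are hypothetical, and the final reweighting step is not included.

```python
import numpy as np

def mad(x):
    """Median absolute deviation, scaled for consistency at the normal."""
    return 1.4826 * np.median(np.abs(x - np.median(x)))

def ogk(X, m=np.median, s=mad):
    """Raw OGK location and scatter estimates (no reweighting step)."""
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    # Step 2: scale every column by its robust scale, Y = X D^{-1}.
    d = np.array([s(X[:, j]) for j in range(p)])
    Y = X / d
    # Step 3: pairwise robust "correlation" matrix U (assumed GK identity).
    U = np.eye(p)
    for j in range(p):
        for k in range(j + 1, p):
            u = 0.25 * (s(Y[:, j] + Y[:, k]) ** 2 - s(Y[:, j] - Y[:, k]) ** 2)
            U[j, k] = U[k, j] = u
    # Step 4: eigenvectors of U, projections V = Y E, robust variances.
    _, E = np.linalg.eigh(U)
    V = Y @ E
    lam = np.array([s(V[:, j]) ** 2 for j in range(p)])
    mu_Y = E @ np.array([m(V[:, j]) for j in range(p)])
    sigma_Y = E @ np.diag(lam) @ E.T
    # Step 5: transform back to the original scale with D.
    mu = d * mu_Y
    sigma = np.diag(d) @ sigma_Y @ np.diag(d)
    return mu, sigma
```

Because the diagonal of Λ consists of squared robust scales, the returned scatter matrix is positive definite whenever every projected scale is non-zero, even when U itself is not.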

Comedian Approach
Sajesh and Srinivasan [63] proposed a method for the detection of outliers in multivariate data based on the comedian, an alternative measure of dependence between two random variables introduced by Falk [64]. Let X and Y be two random variables; then the comedian of X and Y is defined as $$\mathrm{COM}(X, Y) = \mathrm{med}\{(X - \mathrm{med}(X))(Y - \mathrm{med}(Y))\}$$ where med denotes the median. It generalizes the Median Absolute Deviation (MAD), as it equals $\mathrm{MAD}^2$ when X = Y, and it also has the highest possible breakdown point [64].
Comedian parallels COV(X, Y), but COV(X, Y) requires the existence of the first two moments of X and Y, whereas COM(X, Y) always exists. The comedian is symmetric, location invariant and scale equivariant, i.e., COM(X, aY + b) = a COM(X, Y) = a COM(Y, X). Hall and Welsh [65] discussed the strong consistency and asymptotic normality of MAD. Falk [64] established similar results for the comedian. In a similar way, a natural median-based alternative to the coefficient of correlation is the correlation median $$\delta(X, Y) = \delta = \frac{\mathrm{COM}(X, Y)}{\mathrm{MAD}(X)\,\mathrm{MAD}(Y)} \qquad (3.11)$$ with $\delta \in [-1, 1]$ for bivariate data. The correlation median can therefore serve as a measure of dependence between the components X and Y of a normal vector (X, Y) [64]. Sajesh and Srinivasan [63] make use of the multivariate version of the comedian estimate for the detection of outliers. Let X be an n × p data matrix with rows $x_i^T$ (i = 1, 2, …, n) and columns $X_j$ (j = 1, 2, …, p). Then the comedian matrix COM(X) is defined as $$\mathrm{COM}(X) = (\mathrm{COM}(X_i, X_j)), \quad i, j = 1, 2, \ldots, p.$$
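The comedian, the correlation median and the comedian matrix are straightforward to compute on a sample. The sketch below assumes the sample version COM(X, Y) = med{(X − med X)(Y − med Y)} and an unscaled MAD, so that COM(X, X) equals MAD(X)² (exactly so for odd sample sizes); the function names are hypothetical.

```python
import numpy as np

def com(x, y):
    """Sample comedian: median of the products of median-centered values."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.median((x - np.median(x)) * (y - np.median(y))))

def mad_raw(x):
    """Unscaled median absolute deviation."""
    x = np.asarray(x, dtype=float)
    return float(np.median(np.abs(x - np.median(x))))

def correlation_median(x, y):
    """Correlation median: COM(X, Y) / (MAD(X) MAD(Y))."""
    return com(x, y) / (mad_raw(x) * mad_raw(y))

def comedian_matrix(X):
    """p x p matrix with entries COM(X_i, X_j) for the columns of X."""
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    return np.array([[com(X[:, i], X[:, j]) for j in range(p)]
                     for i in range(p)])

# Usage: one gross outlier (100) barely affects the comedian.
x = [1, 2, 3, 4, 100]
y = [2, 1, 4, 3, 5]
c_xy = com(x, y)
```

As the surrounding text notes, the comedian matrix built this way need not be positive semi-definite, so it is not used directly as a scatter estimate.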
The problem of non-positive semi-definiteness of estimators frequently occurs in the robust estimation of a covariance matrix. Rousseeuw and Molenberghs [66] proposed several methods to deal with this problem. Maronna and Zamar [62] proposed a general method to obtain positive-definite and approximately affine equivariant robust scatter matrices. Sajesh and Srinivasan [63] adopted the following steps to overcome the non-positive semi-definiteness of the comedian matrix and to obtain robust estimates for location and scatter.
The estimates can be improved through an iterative process, by replacing δ with S and repeating steps (i), (ii) and (iii). The primary interest is the detection of outliers, using a robust Mahalanobis distance RD(x i , m) computed from the estimates S and m defined in (3.14).
Outliers are flagged by comparing these distances with a suitable cut-off value cv: if RD(x i , m) > cv, the corresponding observation x i is considered an outlier. By using this cut-off value and the robust Mahalanobis distance, a weight function can be defined and robust estimates for location and scatter can be obtained. These estimates are positive definite and approximately affine equivariant. In addition, the estimates obtained by the comedian method have a high breakdown value, which helps in detecting large clusters of outliers. Various numerical studies indicate that the efficiency of the method increases with the dimension of the dataset.

Other Distance-based Methods
Oyeyemi and Ipinyomi proposed a robust method of estimating a covariance matrix in a multivariate data set. The proposed method performs favourably in the detection of a single outlier or a few outliers, especially for small sample sizes and when the magnitude of the outliers is relatively small. Outlier detection is also important for time series data. Ren et al. [68] proposed a method of outlier detection for time series data, aimed mainly at the multivariate type. An improved ant colony algorithm is used to cluster the data when classifying the time series. Both inner-cluster and inter-cluster distances are considered to ensure the accuracy of the clustering. Objects that change significantly relative to their neighbors are identified as outliers. The presence of missing values is more a rule than an exception in business surveys and poses additional severe challenges to outlier detection. Todorov et al. [69] compared, through a simulation study, some multivariate outlier detection methods that can cope with incomplete data, and identified methods that find the outliers with a low false discovery rate. Rousseeuw and Hubert [70] presented an overview of several robust methods and outlier detection tools suitable for univariate, low-dimensional, and high-dimensional data, such as estimation of location and scatter, linear regression, principal component analysis, and classification.

Non-Traditional Methods
A common limitation with all robust distance-based outlier detection methods is the requirement to find a subset of outlier-free data from which robust estimates of the mean vector and covariance matrix can be obtained. Unfortunately, there is no existing method that can find an outlier-free subset with 100% certainty. In other words, there is always a chance that the "outlier-free" sample contains some outliers.
Researchers have proposed alternative non-traditional outlier detection methods that attempt to avoid robust Mahalanobis distances altogether. In the following paragraphs, the significant non-traditional outlier detection methods found in the technical literature are outlined. As in the previous section, these methods are discussed in chronological order to illustrate how these methods have evolved over time.

Principal Component Methods
One of the earliest distance-free methods for detecting multiple outliers in multivariate data is described by Gnanadesikan and Kettenring [25] and is originally attributed to Rao [71]. This method makes the assumption that the dataset falls in the linear subspace defined by the first p − q principal components of the sample covariance matrix. Under this assumption, it is argued that outliers will have a large deviation from this subspace as measured by the sum of the squared magnitudes of their projections onto the last q eigenvectors. More specifically, outliers in an n × p dataset, Y, are observations, $y_j$, with large values of $$d_j^2 = \sum_{i=1}^{q} \left[ I_i^T (y_j - \bar{y}) \right]^2 \qquad (3.17)$$ where $I_i$ = the eigenvector corresponding to the i-th smallest eigenvalue of the covariance matrix and $\bar{y}$ = the mean vector of Y. Gnanadesikan and Kettenring [25] suggest analyzing the $d_j^2$ values through the use of a gamma probability plot where the shape parameter is estimated using a method proposed by Wilk and Gnanadesikan [72]. In addition to Rao's method, Gnanadesikan and Kettenring suggest other informal uses of the principal component scores for detecting outliers. Unfortunately, a limitation of these methods is that they are devoid of any formal tests of significance, relying upon the analyst to subjectively determine how an outlier should manifest itself.
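The statistic attributed to Rao can be sketched as the squared length of each centered observation's projection onto the eigenvectors with the q smallest eigenvalues. The function name `minor_component_distances` is hypothetical, and the sketch uses the classical sample covariance matrix, as the description above does.

```python
import numpy as np

def minor_component_distances(Y, q):
    """d_j^2: squared projections onto the q smallest-eigenvalue eigenvectors."""
    Y = np.asarray(Y, dtype=float)
    centered = Y - Y.mean(axis=0)
    cov = np.cov(Y, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)  # eigenvalues come back in ascending order
    minor = eigvecs[:, :q]            # eigenvectors of the q smallest eigenvalues
    return np.sum((centered @ minor) ** 2, axis=1)

# Usage: 50 points close to the plane x3 = x1 + x2, plus one point off it.
rng = np.random.default_rng(3)
plane = rng.normal(size=(50, 2))
third = plane.sum(axis=1) + 0.01 * rng.normal(size=50)
Y = np.vstack([np.column_stack([plane, third]), [0.0, 0.0, 3.0]])
d2 = minor_component_distances(Y, q=1)
```

The off-plane point receives by far the largest value of `d2`; how large is "large" is left to the gamma probability plot or to the analyst.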

Mahalanobis Distance Decomposition Method
As an alternative to computing robust Mahalanobis distances to detect outliers, Kim [73] derives two decompositions of the Mahalanobis distance and uses scatter plots of the component terms to uncover outlying observations. Thus, rather than using the Mahalanobis distances themselves to find outliers, Kim suggests analyzing the constituent parts of the Mahalanobis distances for an observation to determine how the distance was achieved. Kim provides no guidance on the distribution of the components of the Mahalanobis distance, thus requiring subjective analysis of the suggested scatter plots to identify outliers.

Projection Pursuit Detection
In order to avoid the masking and swamping effects associated with the classical Mahalanobis distance detector as well as the computational complexities of robust distance detection methods, Pan et al. [74] proposed a method that uses univariate projections of the original data and univariate outlier detection to identify multivariate outliers. The method begins by projecting the original data onto a vector located on the p-dimensional unit hypersphere. Based on tests with relatively small datasets, Pan et al. demonstrated that their method is effective at detecting outliers while achieving relatively low false alarm rates. No evidence is provided to suggest this method is scalable to larger problems. In fact, using this method for high dimensional datasets can be problematic since the number of projection vectors generated to achieve uniform coverage of the p-dimensional unit hypersphere grows non-linearly with p. Further discussion of this problem is provided by Fang and Wang [75].
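The general idea of projecting onto directions on the unit hypersphere and testing each projection with a univariate rule can be sketched as below. This is not Pan et al.'s procedure: the directions are drawn at random rather than laid out for uniform coverage, and the univariate rule (absolute deviation from the median divided by the MAD) is an assumption chosen for simplicity. The function name `projection_outlyingness` is hypothetical.

```python
import numpy as np

def projection_outlyingness(X, n_dirs=500, seed=0):
    """Largest robust z-score of each row over random unit directions."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Random directions, normalized so they lie on the unit hypersphere.
    dirs = rng.normal(size=(n_dirs, X.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    proj = X @ dirs.T                      # n x n_dirs univariate projections
    med = np.median(proj, axis=0)
    mad = np.median(np.abs(proj - med), axis=0)
    return np.max(np.abs(proj - med) / mad, axis=1)
```

The number of directions needed for reasonable coverage grows quickly with p, which is the scalability concern raised in the text.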

Juan-Prieto Method
Empirical tests conducted by Juan and Prieto [76] indicate that the robust distance methods of Rocke and Woodruff [50], Hawkins [49], Rousseeuw and Van Driessen [43], and Maronna and Yohai [77] have difficulty in detecting clusters of concentrated outliers, particularly when the clusters are relatively close to the good data. To overcome this perceived weakness of robust distance methods, Juan and Prieto suggested a distance-free method based on angles. Specifically, the authors state that the projections of observations on the p-dimensional unit hypersphere are uniformly distributed when the observations have an ellipsoidal distribution, as shown by Eaton [78]. Based on this characteristic, Juan and Prieto claimed that the angles between the projected observation vectors and an arbitrary reference direction, $u_0$, have a Beta distribution. The form of the Beta distribution is provided by the authors.
To detect outliers, the original observations are projected onto the unit hypersphere. A reference direction, $u_0$, is then selected using a method suggested by Juan and Prieto [76]. The angles between the projected observations and $u_0$ are then computed. The authors then suggest using a QQ-plot of the angles to determine if they follow the Beta distribution. Alternatively, the distributional fit of the angles can be assessed by analyzing the spacings between the ordered values of $F(w_i)$, the theoretical distribution function of the angles evaluated at each angle, $w_i$. If the angles actually follow the prescribed distribution, the spacings should be uniformly distributed. To test this hypothesis, Juan and Prieto suggested using the distribution of the largest spacing in a uniform sample introduced by David [79]. From this distribution function, a critical value for the largest spacing can be computed and the largest spacing tested for significance. If the test fails, any corresponding observations preceding the largest spacing are considered outliers. To detect multiple outlier clusters, this entire process is repeated until the spacings indicate uniformity of the angles.

Chiang-Pell-Seasholtz PCA Method
Where Gnanadesikan and Kettenring [25] proposed somewhat informal methods for using PCA to find multivariate outliers, Chiang et al. [80] presented a PCA method that included significance tests for outliers. The method began by performing a PCA on the original data to arrive at the p × a matrix, P, containing the eigenvectors corresponding to the a largest eigenvalues. In addition to testing if an observation is an outlier using the components for the a largest eigenvalues, Chiang et al. also suggest testing the observation using the p − a components for the remaining eigenvalues. To perform this test, the authors recommend using the Q-statistic of Jackson and Mudholkar [81]. The threshold value for the Q-statistic is provided by Chiang et al. If the test statistics for an observation exceed their respective critical values, the observation is labeled an outlier and removed from the data set. Once all observations are tested, the entire process is repeated using only the non-outlying observations. The algorithm terminates when no additional observations are labeled outliers between iterations, or when the total number of outliers detected reaches n/2.

Max-Eigen Difference (MED) Method
Adding to the arsenal of principal component-based outlier detection methods, Gao et al. [82] proposed the Max-Eigen Difference (MED) method. The method proceeds by computing the eigenvalues and eigenvectors of the sample covariance matrix of the entire dataset. For each observation, $x_i$, the eigenvalues and eigenvectors are then computed for the covariance matrix obtained when $x_i$ is removed from the dataset.
Gao et al. [82] demonstrated that large MED values indicate outlier observations. Specifically, the decomposition illustrates that an observation with a large MED may indicate: i) the observation has a first principal component score that is much larger than the other observations; ii) the observation may have relatively large scores on the other component axes; and iii) the observation is not close to the centroid of the data. An observation with large MED may possess any combination of these characteristics. Based on the properties of the MED, Gao et al. [82] recommended detecting outliers by plotting the MED values against the observation indices. Any observations that appear to have a large MED relative to the other observations are labeled as outliers. This labeling is a subjective decision made by the analyst.

Other Non-Traditional Methods
Singh et al. [83] have proposed an unsupervised clustering scheme for isolating atypical behaviors, a parameterless outlier detection method based on wavelets, and a new feature for characterizing intrusions based on the repetition of an intrusion attempt from one system to another. Al-Zoubi [84] discussed a method based on clustering approaches for outlier detection using the PAM clustering algorithm. Ueda [85] presented a simple and efficient method to detect multiple outliers using a modification of Akaike's Information Criterion. It is well known that if a multivariate outlier has one or more missing component values, then multiple imputation (MI) methods tend to impute non-extreme values, making the outlier less extreme and less likely to be detected. Dang and Serfling [86] proposed nonparametric depth-based multivariate outlier identifiers for this type of data. Two criteria, an 'outlier recovery probability' and a 'relative accuracy measure', are developed, based on depth functions. Yan [87] proposed a novel method integrating the self-organizing map (SOM) with the adaptive non-linear map (ANLM) to facilitate visualizing and detecting outliers in high dimensional complex data.
It is interesting to note that the volume of data collected is increasing exponentially with the availability of powerful storage devices. It is equally important to analyze the data and draw meaningful inferences because of their impact on various applications; for example, the collection and analysis of genomic data is important in the study of human diseases and drug discovery. In this process, the detection of multiple outliers in multidimensional data is crucial because of its influence on the inferential process and its interpretation in a wide variety of applications.

Comparative Study
Since a comparison of all the above-mentioned methods would be tedious, four recent and important methods, namely FAST-MCD, Kurtosis, OGK and Comedian, are selected. The performance of these methods is evaluated through simulation using two measures, Success Rate (SR) and False Detection Rate (FDR). While the success rate measures the detection of true outliers, the FDR appraises the false detection of normal observations as outliers. In other words, the success rate is a measure of masking, as it reveals the number of true outliers which are not detected, and the false detection rate is a measure of swamping, as it shows the number of inliers detected as outliers.
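The two measures can be computed from the indices of the planted outliers and the indices flagged by a method. The exact formulas are not stated in the text, so the sketch below assumes that SR is the percentage of true outliers that are flagged and FDR is the percentage of inliers that are flagged; the function name is hypothetical.

```python
def success_and_false_detection_rates(true_outliers, flagged, n):
    """Return (SR, FDR) in percent for one sample of n observations."""
    true_outliers, flagged = set(true_outliers), set(flagged)
    n_inliers = n - len(true_outliers)
    # SR: share of the true outliers that were detected (low SR = masking).
    sr = 100.0 * len(true_outliers & flagged) / len(true_outliers)
    # FDR: share of the inliers wrongly flagged (high FDR = swamping).
    fdr = 100.0 * len(flagged - true_outliers) / n_inliers
    return sr, fdr

# Usage: 10 planted outliers, 8 of them found, 2 inliers wrongly flagged.
sr, fdr = success_and_false_detection_rates(
    true_outliers=range(90, 100),
    flagged=list(range(90, 98)) + [3, 4],
    n=100,
)
```

In a simulation these values would be averaged over the replicated samples for each parameter combination.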
For a given contamination level α, a set of 100(1−α) observations from an N(0, I) distribution with dimension p has been generated and 100α additional observations are added from an N(ξu, λI) distribution, where u denotes the vector (1, 1, …, 1) T . This experiment has been conducted for different values of the sample-space dimension p (p = 5, 10, 20) and the contamination level α (α = 0.1, 0.2, 0.3). To check the efficiency of detecting very small deviations, the experiment has been conducted for small values of ξ (ξ = 5, 10) and λ (λ = 0.01, 0.25, 1). For each set of values, 100 samples have been generated and the parameters are estimated. To attain a 50% breakdown value for the FAST-MCD method, the subset size h = [(n+p+1)/2] is used. The scale estimate Q n proposed by Croux and Rousseeuw [88] is used as the initial scale estimate for OGK. MATLAB codes have been used for the Kurtosis, FAST-MCD and Comedian methods, and OGK is available in an R package.
Regarding the value of λ, Comedian performs consistently well for all values, except for two cases where the method scored less than a 95% success rate. It is important to note that, for λ = 1, the success rate of the Kurtosis method decreases with increasing dimension: it is 97% for p = 5, 75% for p = 10 and 49% for p = 20. Peña and Prieto [61] also show similar results and state that this case tends to be one of the most difficult ones for the kurtosis algorithm because the objective function is nearly constant for all directions and, for finite samples, it tends to present many local minimizers, particularly along directions that are nearly orthogonal to the outliers. The opposite holds for FAST-MCD: for ξ = 5, the only cases in which it attains a success rate above 95% are those with λ = 1, and its behavior is worse for both of the remaining values of λ. Unlike FAST-MCD and Kurtosis, OGK performs almost steadily for all values of λ.
The amount of contamination α is an important parameter to be analyzed because most outlier detection methods fall short when there is a large amount of contamination. Comedian achieved success rates of 95% or more for all values of α, except in two cases. Regarding the amount of contamination, the Kurtosis method seems to perform better for p = 5, while its behavior is worse for large values of the sample-space dimension. FAST-MCD copes with 30% contamination only in the two cases p = 5, λ = 1 and p = 10, λ = 1, and fails to do so in the remaining cases. The worst case is ξ = 5, λ = 0.01 and α = 0.3, where FAST-MCD has a 0% success rate for all values of p. Even though OGK performs better than FAST-MCD and Kurtosis, OGK fails to achieve a success rate of at least 95% in any of the cases with ξ = 5, λ = 0.01 and α = 0.3. For ξ = 10, Comedian and OGK attain optimal success rates in all cases, while the success rates of FAST-MCD and Kurtosis follow the same pattern as for ξ = 5.
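The contamination design used in this simulation, with clean observations from N(0, I) and outliers from N(ξu, λI), can be reproduced with a short generator. The function name `contaminated_sample`, the seed handling and the rounding of 100α to an integer are assumptions made for illustration; the original study used MATLAB and R code that is not shown in the text.

```python
import numpy as np

def contaminated_sample(n=100, p=5, alpha=0.1, xi=5.0, lam=0.25, seed=0):
    """One sample: n(1 - alpha) rows from N(0, I), n*alpha from N(xi*u, lam*I)."""
    rng = np.random.default_rng(seed)
    n_out = int(round(alpha * n))
    clean = rng.normal(size=(n - n_out, p))
    # lam is a variance, so the standard deviation of each coordinate is sqrt(lam).
    outliers = xi + np.sqrt(lam) * rng.normal(size=(n_out, p))
    X = np.vstack([clean, outliers])
    is_outlier = np.concatenate([np.zeros(n - n_out, dtype=bool),
                                 np.ones(n_out, dtype=bool)])
    return X, is_outlier

# Usage: the grid of settings described in the text, one sample each.
settings = [(p, a, xi, lam)
            for p in (5, 10, 20) for a in (0.1, 0.2, 0.3)
            for xi in (5, 10) for lam in (0.01, 0.25, 1)]
X, is_outlier = contaminated_sample(p=5, alpha=0.1, xi=5, lam=0.25)
```

Small λ produces a tight cluster of outliers, which is the configuration the text identifies as hardest for FAST-MCD.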
False detection rate is also an important property to be examined for comparison of various methods.

Conclusion
Successful identification of outliers has a very close connection with robust estimation. Classical estimators such as the mean and covariance matrix are not suitable for data containing outliers and can cause the statistical analysis to produce results exactly opposite to the correct conclusions. Thus, robust statistical techniques that can be computed in a reasonable time should be used if outliers are thought to be present [93]. Within the field of statistics, there are two broad approaches to outlier identification. Distance-based methods, such as MCD, Comedian and BACON, are based on obtaining robust estimates of the mean and covariance matrix so that a robust Mahalanobis distance can be computed for each point. Promising avenues for future research include finding robust covariance estimates that can be quickly computed, while still maintaining robustness against outliers in a variety of configurations. Non-traditional methods aim to find the best projections that reveal the outliers in a highly visible placement. Such approaches can find outliers in a wide variety of configurations since the original placement of outliers is transformed to more informative projections. However, such methods tend to be very computationally intensive and are not currently suitable for large datasets. It remains to be seen whether computationally efficient methods of projection pursuit can be found to enable this strategy to be used in data-mining and similar applications.
The performance of four recent methods is evaluated through simulation using Success Rate and False Detection Rate. The simulation study has examined a wide range of situations by varying the parameters. Results show that, when compared to the other methods, the Comedian method detects the outliers efficiently in terms of both success rate and false detection rate. Also, the efficiency of the Comedian method increases with the dimension of the data.

Table 3.1 presents the success rates of the Comedian, Kurtosis, FAST-MCD and OGK methods for each set of the parameter values. For ξ = 5, except in two situations (p = 5, α = 0.3, λ = 0.01 and p = 10, α = 0.3, λ = 0.01), the Comedian method attains a success rate of 95% or more. The sample-space dimension p has a significant influence on outlier detection methods. It is interesting to note that, for each value of α and λ, the success rate of Comedian increases with increasing dimension. Although the success rate of OGK also increases with dimension, the rate of increase is low compared to that of Comedian. The success rate of Kurtosis seems not to be much influenced by the sample-space dimension. This is not the case for FAST-MCD, whose success rate decreases with increasing dimension. For example, for α = 0.3, λ = 0.25 and ξ = 5, the success rate of FAST-MCD decreases from 60% for p = 5 to 41% for p = 10 and 0% for p = 20.

Table 3.2 gives the false detection rates of the Comedian, Kurtosis, FAST-MCD and OGK methods from 100 independent samples. Based on the success rates, the Comedian and OGK methods provide almost equal results. But a comparison based on FDR supports the supremacy of Comedian over the other methods. It is clear that in every situation Comedian possesses a much lower false detection rate than the other methods, and that this rate decreases with increasing dimension. Also, the maximum false detection rate is 6 for Comedian, 45 for Kurtosis, 63 for FAST-MCD and 20 for OGK, based on all the situations considered.