A New Method for Tracking Configuration for Dirichlet Process Sampling

The method of fitting a hierarchical model with Dirichlet process mixing is a versatile tool for data analysts. It has been applied to density estimation, classification, clustering, and high-dimensional data analysis. Many computing algorithms have been proposed to evaluate this mixture. Different labels in the algorithm that assign data points into clusters may actually yield the same partition configuration. This paper makes this notion rigorous by establishing an equivalence theorem. We therefore recommend adding a step that checks for equivalent configurations to the algorithms for evaluating hierarchical Dirichlet process mixing models, especially when cluster assignments are the major goal of the analysis.


Introduction
The Dirichlet process (DP) has received a lot of attention since its development by Ferguson (1973). It constructs a prior measure on a large space of distribution functions. It allows users to specify two parameters α and G_0, where the distribution function G_0 reflects the prior mean ('center') of the unknown distribution G chosen by the DP, and the positive real number α characterizes the prior precision. A larger α indicates stronger prior belief, that is, smaller variation of the unknown distribution G around G_0. Ferguson (1973) also showed that the DP is intuitive and interpretable for some nonparametric statistical problems. Consequently, the Dirichlet process has been applied to many areas including text mining, machine learning, bioinformatics, biostatistics, and phylogenetics, where Bayesian nonparametric inference on an unknown distribution is called for.
Mixture of Dirichlet processes (MDP) originated with Antoniak (1974) as a further development of the DP, where a smoothing kernel with an unknown parameter θ is convolved with an unknown distribution chosen from a DP. Antoniak (1974) studied properties of this mixture and showed how it can be applied to measurement error, empirical Bayes, and quantal bioassay problems. Later, Lo (1984) developed it further to construct rigorously a prior measure on a space of density functions. Kuo (1983) extended the MDP to a Bayesian experimental design problem to study the optimal dosages for quantal bioassay with the potency curve chosen from a DP. Kuo (1986a) also applied the MDP to empirical Bayes problems. Suppose we observe a set of n observations y = (y_1, ..., y_n). We consider a hierarchical model where the ith observation is modeled by the density f_i(y_i | θ_i) with an unobserved latent variable θ_i. We then assume the latent variables θ_1, ..., θ_n are independent and identically distributed with a common unknown distribution G chosen from DP(α, G_0). Consequently, Neal (2000) pointed out that the name Dirichlet process mixture (DPM) model would be more appropriate. The Bayes solution to this mixture problem can be written concisely as in Kuo (1980, 1986a). However, the number of mixing components, which is given by the Bell exponential number (Berge, 1971), increases rapidly with the sample size due to the discrete nature of the DP. Therefore, many computing algorithms have been proposed to evaluate this mixture, such as Kuo (1986b), West (1995, 1998), MacEachern and Müller (1998), Neal (2000), Ishwaran and James (2001), Walker (2007), Dunson (2010), and Kalli et al. (2010). All these algorithms need to specify the latent allocation process that assigns each observation to a cluster. The confusion comes from the fact that different allocations (usually designated by labels) in the algorithm may actually denote the same cluster configuration.
Overlooking this subtlety may lead to incorrect conclusions. In this paper, we clarify this issue. We define labels to be in the same equivalence class if they lead to the same partition of the latent parameter space. We therefore suggest adding an equivalence-checking step to the computing algorithm when correct clustering allocation is essential to the solution. Section 2 contains basic formulations of the DP and MDP. Section 3 lists two sampling algorithms for evaluating the MDP. One, called the truncated blocked Gibbs sampler, is based on Ishwaran and James (2001), where the number of atoms of the DP is truncated at a fixed number determined before running the sampler. The other is based on Kalli et al. (2010), where the number of atoms in the DP is determined by slice sampling in each iteration. Section 4 discusses the equivalence formulation, which needs to be included in the sampling procedure for the MDP, especially when the cluster allocation is of primary interest in the study. Section 5 contains a simulation study and a real data analysis to illustrate our method. We conclude the paper with a brief discussion in Section 6.

Introduction to the Dirichlet Process
The DP, denoted by DP(α, G_0), allows users to select two components: a base distribution G_0 that defines the center of the DP, and a concentration parameter α that reflects the strength of prior belief in the base distribution G_0. If G is taken as a sample from DP(α, G_0), then by construction we have E[G(A)] = G_0(A) and Var(G(A)) = G_0(A)[1 − G_0(A)]/(α + 1) for any measurable set A. If we think of the DP as constructing a tube of random cumulative distribution functions G centered around G_0, then the variance formula shows how α controls the variation of this tube in an inverse relation. Suppose the prior on a random distribution is DP(α, G_0); then Ferguson (1973) showed that the posterior distribution of this random distribution given a sample of size n (from this distribution) is also a DP with new parameters: a concentration parameter α + n, and a base distribution (αG_0 + nF_n)/(α + n), where F_n is the empirical distribution function.
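The conjugate posterior update above is simple enough to compute directly. The following sketch is our own illustration (function names are ours, not the paper's): it evaluates the posterior mean of G(−∞, x], namely (αG_0(x) + nF_n(x))/(α + n), for a standard normal G_0.

```python
import math

def normal_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def dp_posterior_mean_cdf(x, data, alpha, g0_cdf=normal_cdf):
    """Posterior mean of G(-inf, x] under a DP(alpha, G0) prior given data,
    i.e. (alpha * G0(x) + n * Fn(x)) / (alpha + n)."""
    n = len(data)
    fn = sum(1 for y in data if y <= x) / n  # empirical CDF Fn(x)
    return (alpha * g0_cdf(x) + n * fn) / (alpha + n)

# Example: alpha = 2 and a single observation at 0 gives (2*0.5 + 1)/3 = 2/3.
print(dp_posterior_mean_cdf(0.0, [0.0], alpha=2.0))
```

The formula makes the shrinkage explicit: the posterior center is a convex combination of the prior center G_0 and the empirical distribution F_n, with weights α/(α + n) and n/(α + n).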

A Constructive Representation of the Dirichlet Process
In order to sample a random distribution from the DP, it is helpful to know a constructive definition of the DP. There are several versions of it, including the Pólya urn scheme of Blackwell and MacQueen (1973), the stick-breaking process of Sethuraman and Tiwari (1982) and Sethuraman (1994), and the Chinese restaurant process of Aldous (1985). We summarize the Sethuraman (1994) construction here. Suppose G is a random sample chosen from DP(α, G_0). Then G can be represented by

G = Σ_{k=1}^∞ w_k δ_{φ_k},

where δ_{φ_k} is a measure of mass 1 concentrated at φ_k, the locations φ_k of the random jumps of the distribution G are independent and identically distributed (i.i.d.) from G_0, and the stick-breaking weights w_k are given by

w_k = v_k ∏_{l=1}^{k−1} (1 − v_l),  (2.2)

where the v_l are i.i.d. from a beta density Be(1, α). So this representation shows that the DP selects discrete distributions with probability 1. Each distribution has atoms chosen as a sample from G_0, with weights defined as the stick-breaking weights given in (2.2).
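As an illustration, the stick-breaking construction in (2.2) can be sketched in a few lines. This is our own truncated approximation (the function name and truncation level are our choices, not part of the paper): break a stick of length 1 repeatedly, keeping the pieces as weights.

```python
import random

def stick_breaking(alpha, g0_sampler, n_atoms, rng):
    """Truncated stick-breaking draw approximating G ~ DP(alpha, G0):
    atoms phi_k ~ G0 i.i.d., weights w_k = v_k * prod_{l<k} (1 - v_l),
    v_l ~ Beta(1, alpha)."""
    weights, atoms = [], []
    remaining = 1.0  # length of stick left: prod_{l<k} (1 - v_l)
    for _ in range(n_atoms):
        v = rng.betavariate(1.0, alpha)
        weights.append(remaining * v)
        remaining *= 1.0 - v
        atoms.append(g0_sampler(rng))
    return atoms, weights

# One truncated draw from DP(1, N(0,1)) with 100 atoms.
atoms, w = stick_breaking(1.0, lambda r: r.gauss(0.0, 1.0), 100, random.Random(42))
```

With moderate α the leftover stick after 100 breaks is negligible, so the truncated weights sum to essentially 1; larger α spreads the mass over more atoms, which matches its role as a precision parameter.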

Hierarchical Dirichlet Process Mixing Model
The hierarchical model of a compound decision problem starts with modeling each observation with an unobserved latent variable at the first stage. Then the DP is used as a prior over the common distribution shared by the latent parameters at the second stage:

y_i | θ_i ~ f_i(y_i | θ_i), independently for i = 1, ..., n,
θ_i | G ~ G, i.i.d.,
G ~ DP(α, G_0).  (2.3)

This model was first proposed and called MDP by Antoniak (1974). Instead, we will call it DPM (Dirichlet process mixture) as in Neal (2000) to emphasize that the mixing is done by the DP construction.
The mixture model allows borrowing information across components. Its formulation is flexible and adaptable, yet quite efficient in combating the curse of dimensionality. It avoids specifying a fixed number of clusters for the latent population. It is an ideal candidate for clustering problems where the number of distinct clusters is unknown beforehand. Although the DP assumes that there are infinitely many clusters in the latent population, the posterior distribution of the number of clusters for the latent variables has a sparseness-favoring structure due to the discrete nature of the DP. Suppose we have latent variables θ_1, ..., θ_n in k clusters with k ≤ n. When an (n + 1)st latent variable is added, the new θ_{n+1} is assigned to a new cluster with probability α/(α + n), and to a previous cluster j with probability n_j/(α + n), where n_j is the number of θs in the jth cluster. Note that Σ_{j=1}^k n_j = n. The prior expected number of clusters for the latent variables of size n is proportional to α log n. Thus, the number of clusters a priori increases slowly with the sample size, at a rate determined by α. The posterior distribution of the number of clusters depends not only on n and α, but is also sensitive to the choice of G_0, as pointed out by Dunson (2010). Therefore, he argued for the need to choose G_0 with careful thought; he also suggested standardizing the data first and specifying G_0 with location zero and scale one in order to have good practical performance.
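The sequential allocation rule above is easy to verify numerically. Below is a small sketch (ours, not the paper's) of the exact prior expected number of clusters, Σ_{i=0}^{n−1} α/(α + i), which grows like α log n, together with a simulator of the sequential (Chinese restaurant) scheme.

```python
import random

def expected_clusters(alpha, n):
    """Exact prior expectation of the number of clusters among n latent
    variables: the (i+1)st variable opens a new cluster w.p. alpha/(alpha+i)."""
    return sum(alpha / (alpha + i) for i in range(n))

def sample_crp(alpha, n, rng):
    """One draw of cluster sizes from the sequential allocation scheme."""
    sizes = []
    for i in range(n):
        # New cluster w.p. alpha/(alpha+i); existing cluster j w.p. n_j/(alpha+i).
        u = rng.uniform(0.0, alpha + i)
        if u < alpha or not sizes:
            sizes.append(1)
        else:
            u -= alpha
            for j, nj in enumerate(sizes):
                if u < nj:
                    sizes[j] += 1
                    break
                u -= nj
            else:
                sizes[-1] += 1  # guard against a floating-point edge case
    return sizes
```

For example, with α = 1 the expected number of clusters is 1 for n = 1 and 1.5 for n = 2, and by n = 50 it is only about 4.5, illustrating the slow logarithmic growth.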

Bayes Solution
Typically, the objective is to estimate each of the latent parameters θ_i for i = 1, ..., n. The above DPM setting allows multivariate observations and multivariate latent parameters for each component in the mixture model. However, in the following expression for the posterior mean of each latent variable, we use the univariate version for a more streamlined presentation. To derive the posterior mean of θ_i, we usually integrate out the random G. Following Antoniak (1974), the posterior distribution of G is a mixture of DPs with concentration parameter α + n, base distribution (αG_0 + Σ_{j=1}^n δ_{θ_j})/(α + n), and mixing distribution H(θ | y); that is,

G | y ~ ∫ DP(α + n, αG_0 + Σ_{j=1}^n δ_{θ_j}) dH(θ | y).

Then, by applying Lo (1984), we can show that the posterior distribution of θ = (θ_1, ..., θ_n) given y satisfies

π(dθ | y) ∝ [∏_{i=1}^n f_i(y_i | θ_i)] π(dθ),

where π(dθ) is the joint prior of the latent variables induced by the DP (the Pólya urn scheme).
So the posterior mean for θ_i, for any i, can be written as

θ̂_i = E(θ_i | y) = Σ_C p(C | y) E(θ_i | y, C),  (2.4)

where the sum is over all partition configurations C of the latent variables θ_1, ..., θ_n.
A direct derivation of the above expression, given in Kuo (1980), is omitted here. Although expression (2.4) looks deceptively simple, it is actually a weighted mean over many distributions. Antoniak (1974) gave a detailed account of calculating the probabilities of the various configurations of the latent variables. Taking n = 3 and i = 3 as an example, (2.4) is a weighted mean of 5 terms, obtained from the cases (1) θ_1, θ_2, θ_3 all distinct, (2) θ_1 = θ_2 ≠ θ_3, (3) θ_1 = θ_3 ≠ θ_2, (4) θ_2 = θ_3 ≠ θ_1, and (5) θ_1 = θ_2 = θ_3, respectively. So the first term (for case (1)) is a product of three integrals, where each kernel is integrated against its prior latent-variable density. Each of the next three terms (for cases (2), (3), and (4)) is a product of two integrals: the first integral has the two observations sharing the same latent variable integrated against the prior latent density, and the second integral has the remaining kernel integrated against the latent density G_0. For the fifth term, all observations share the same latent variable θ, integrated against the G_0(dθ) distribution. The number of components in the mixture distribution, given by the Bell exponential number, goes up to 15, 52, 203, and 877 for n = 4, 5, 6, and 7. So many computational algorithms have been developed to evaluate the posterior distribution. In the following we discuss two algorithms that handle computation for large samples: the truncated blocked Gibbs sampler of Ishwaran and James (2001), and the slice sampler of Kalli et al. (2010).
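The Bell numbers quoted above can be checked with the standard Bell-triangle recursion; the short sketch below is our own illustration, not from the paper. Note that B_3 = 5 matches the five configurations listed for n = 3.

```python
def bell_numbers(n_max):
    """Bell numbers B_1, ..., B_{n_max} via the Bell triangle: each row starts
    with the last entry of the previous row, and each subsequent entry adds
    the entry above-left; B_n is the last entry of row n."""
    row = [1]
    bells = [1]  # B_1 = 1
    for _ in range(n_max - 1):
        new_row = [row[-1]]
        for x in row:
            new_row.append(new_row[-1] + x)
        row = new_row
        bells.append(row[-1])
    return bells

print(bell_numbers(7))  # [1, 2, 5, 15, 52, 203, 877]
```

The super-exponential growth of these counts is exactly why exhaustive enumeration of configurations is infeasible beyond very small n, motivating the MCMC samplers discussed next.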

Sampling from the DPM Process
3.1. Sampling from DP
From Sethuraman's representation of the DP (Sethuraman, 1994), we can derive an equivalent representation for a DPM process:

y_i | s_i, θ* ~ f_i(y_i | θ*_{s_i}),
P(s_i = j | w) = w_j, j = 1, 2, ...,
θ*_j ~ G_0, i.i.d., w ~ GEM(α),

where s_i is a discrete latent allocation variable that assigns the ith latent variable to one of the latent classes; s_i = j denotes that the ith latent variable belongs to the jth cluster. The GEM condition on the weights w is the stick-breaking construction described in (2.2); the name is attributed to Griffiths, Engen, and McCloskey (Ewens and Tavaré, 1998). We next describe two samplers for the DPM: the truncated blocked Gibbs sampler and the slice Gibbs sampler.
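To make the allocation variables concrete, here is a minimal generative sketch of this representation (our own illustration, assuming a normal kernel with variance 1 and G_0 = N(0, 1), neither of which is fixed by the paper at this point).

```python
import random

def sample_dpm(n, alpha, rng, n_atoms=200):
    """Generate (y, s) from the stick-breaking representation of a DPM:
    w ~ GEM(alpha) (truncated at n_atoms), theta*_j ~ G0 = N(0,1) i.i.d.,
    s_i ~ Categorical(w), y_i ~ N(theta*_{s_i}, 1)."""
    weights, thetas = [], []
    remaining = 1.0
    for _ in range(n_atoms):
        v = rng.betavariate(1.0, alpha)
        weights.append(remaining * v)
        remaining *= 1.0 - v
        thetas.append(rng.gauss(0.0, 1.0))
    weights[-1] += remaining  # fold truncation leftover into the last atom
    y, s = [], []
    for _ in range(n):
        s_i = rng.choices(range(n_atoms), weights=weights)[0]
        s.append(s_i)
        y.append(rng.gauss(thetas[s_i], 1.0))
    return y, s

y, s = sample_dpm(10, 1.0, random.Random(1))
```

Observations sharing the same s_i share the same latent θ*, which is exactly the clustering structure that the samplers below must track.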

Truncated Blocked Gibbs Sampler for DPM:
This algorithm, developed by Ishwaran and James (2001), truncates the infinitely many atoms of the random mixing distribution G to a fixed finite number of atoms beforehand. This is done by truncating the stick-breaking weights to at most R terms, where R is prespecified, the v_l are defined as in (2.2) for l < R, and v_R = 1. So the number of atoms of G is at most R. The latent parameters θ are then clustered and updated with dimension at most R, where the clusters are represented by their distinct values θ*_1, ..., θ*_R. The MCMC steps are as follows:
Step 1. Initialization.
Step 2. Update the discrete random variable s_i for each i, i = 1, ..., n, where P(s_i = j | ...) ∝ w_j f_i(y_i | θ*_j), with j = 1, ..., R.
Step 3. For each j, j = 1, 2, ..., R, sample θ*_j with p(θ*_j | ...) ∝ G_0(dθ*_j) ∏_{i: s_i = j} f_i(y_i | θ*_j). These θ*s can be updated simultaneously.
Step 4. Update v_j, j = 1, ..., R − 1 (v_R = 1 stays fixed), with v_j | ... ~ Be(1 + n_j, α + m_j), where n_j = Σ_{k=1}^n I(s_k = j) and m_j = Σ_{k=1}^n I(s_k > j).
Step 5. Go back to Step 2 and repeat.
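The steps above can be sketched for a concrete special case. The following is our own minimal illustration of the truncated blocked Gibbs sampler for a normal DPM (normal kernel with variance 1 and G_0 = N(0, 1/τ), matching the simulation setting in Section 5); the function name and these distributional choices are ours, not a verbatim transcription of the paper's algorithm.

```python
import math
import random

def blocked_gibbs(y, alpha, tau, R, n_iter, rng):
    """Truncated blocked Gibbs sketch for y_i | s ~ N(theta*_{s_i}, 1),
    theta*_j ~ G0 = N(0, 1/tau), w from truncated stick-breaking with v_R = 1.
    Returns the list of sampled allocation vectors s."""
    n = len(y)
    theta = [rng.gauss(0.0, 1.0 / math.sqrt(tau)) for _ in range(R)]
    v = [rng.betavariate(1.0, alpha) for _ in range(R - 1)] + [1.0]
    samples = []
    for _ in range(n_iter):
        # Stick-breaking weights from v.
        w, rem = [], 1.0
        for vj in v:
            w.append(rem * vj)
            rem *= 1.0 - vj
        # Step 2: P(s_i = j) proportional to w_j * N(y_i; theta_j, 1).
        s = []
        for yi in y:
            probs = [wj * math.exp(-0.5 * (yi - tj) ** 2)
                     for wj, tj in zip(w, theta)]
            s.append(rng.choices(range(R), weights=probs)[0])
        # Step 3: normal-normal conjugate update for each theta*_j.
        for j in range(R):
            ys = [yi for yi, si in zip(y, s) if si == j]
            prec = tau + len(ys)              # posterior precision
            mean = sum(ys) / prec             # posterior mean
            theta[j] = rng.gauss(mean, 1.0 / math.sqrt(prec))
        # Step 4: v_j ~ Be(1 + n_j, alpha + m_j), j < R.
        for j in range(R - 1):
            nj = sum(1 for si in s if si == j)
            mj = sum(1 for si in s if si > j)
            v[j] = rng.betavariate(1.0 + nj, alpha + mj)
        samples.append(s)
    return samples

draws = blocked_gibbs([0.0, 0.1, 5.0, 5.2], alpha=1.0, tau=1.0,
                      R=4, n_iter=100, rng=random.Random(0))
```

Empty clusters (no i with s_i = j) simply draw θ*_j afresh from the prior, which is what keeps the block update of all R atoms valid.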
When s^(m_1) ∼ s^(m_2), we can claim that s^(m_1) and s^(m_2) belong to the same configuration in MCMC sampling from the DP process. The following theorem characterizes the conditions that determine whether s^(m_1) and s^(m_2) are equivalent, that is, whether they belong to the same configuration. The proof is straightforward.
It is clear that K^(1) = K^(2) = K^(3) = 3, which satisfies Theorem 4.1 (i). Since K^(4) = 2, which differs from the other three iterations, Theorem 4.1 (i) rules out that s^(4) is equivalent to any of s^(1), s^(2), and s^(3). We continue by checking condition (ii) of Theorem 4.1.
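A practical way to implement the equivalence check is to relabel each allocation vector by order of first appearance, so that every label vector inducing the same partition maps to one canonical form. The sketch below is our own illustration of this idea (the paper's Theorem 4.1 states the equivalence conditions directly).

```python
def canonical(s):
    """Relabel an allocation vector by order of first appearance, so that
    label vectors inducing the same partition map to the same tuple."""
    relabel, out = {}, []
    for label in s:
        if label not in relabel:
            relabel[label] = len(relabel)
        out.append(relabel[label])
    return tuple(out)

def equivalent(s1, s2):
    """True iff s1 and s2 induce the same partition configuration: same
    number of distinct clusters and the same members in each cluster."""
    return canonical(s1) == canonical(s2)

print(equivalent([1, 1, 2, 3], [4, 4, 7, 2]))  # True: both give {1,2},{3},{4}
print(equivalent([1, 1, 2, 3], [1, 2, 2, 3]))  # False: different partitions
```

Tabulating the canonical forms across MCMC iterations automatically merges label-switched copies of the same configuration, which is exactly the bookkeeping recommended in this paper.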

Simulation Study
In this subsection, we generated two data sets, with sample sizes 5 and 10 each, from two different mixtures of normal distributions: the first mixture consists of 3 well-separated normal distributions with equal weights, and the second mixture is similar but not as well separated. We applied the DPM in (2.3) with f_i a normal density with mean θ_i and variance 1. The base distribution G_0 was specified as N(0, 1/τ). We varied the precision τ in G_0 with τ = 1, 0.1, or 0.00001, and α = 0.1, 1, or 10. We then applied the truncated blocked Gibbs sampler with R fixed at the sample size. We report the results from each MCMC run with 20,000 iterations after 1,000 burn-in iterations. In the following, we list the populations, the simulated data sets (denoted by 1a, 1b, 2a, and 2b) from each population, the resulting most probable cluster configurations, and tables of the probabilities of the most probable configuration for various choices of α and τ. Across the choices of α and τ, the resulting most probable configuration was always the same except in the 1b scenario, so we summarize them before the table of probabilities. The results are expected and reasonable. Moreover, the columns with τ = 0.00001 show the sampler converges to the resulting configuration with high probability. Viewing each fixed column, we can also observe that the probabilities tend to decrease as α increases, which is also expected because the number of clusters increases with α.

Real Data Analysis
The real data set consists of (y_1, ..., y_12) = (8, 4, 0, 0, 0, 0, 1, 4, 4, 0, 0, 0) (n = 12), which are pollen counts over 12 days collected in the late season of 1991 at Kalamazoo, Michigan, taken from Chen and Ibrahim (2000). We fit the hierarchical mixture model (2.3) to it with a Poisson model in the first stage and G_0 a gamma distribution G(a, b) with mean a/b in the second stage. We used the blocked truncated Gibbs sampler, truncating the model to at most three clusters. We report the three most frequent configurations for α = 0.1, 1, or 10, and four choices of hyperparameters a and b, with an MCMC run of 10,000 iterations with 1,000 iterations as burn-in, in Table 5.3. The hyperparameters a = 0.591 and b = 0.338 in the second block were chosen from the matching functional form consideration for the parametric empirical Bayes model, that is, the marginal mean and variance of y are matched to the empirical mean and variance of y. The other choices were more arbitrary, except that the mean of G_0 was chosen to be the empirical mean or close to it, and the variance of G_0 was increased roughly tenfold as we go down each block of the table. The results show two clusters are well supported for various choices of a, b, and α. The first cluster consists of the pollen counts (8, 4, 4, 4) observed on days 1, 2, 8, and 9. The second cluster consists of all zero pollen counts, observed on days 3, 4, 5, 6, 10, 11, and 12. The count 1, observed on day 7, is clustered into the second cluster of zero counts for the first two choices of G_0 (small variance) and into the first cluster for the last two choices of G_0 (large variance). When α was chosen to be 0.1, the results also lend some support to a single cluster, with 6% to 3% probability depending on the hyperparameters chosen. This is expected due to the higher tendency for the atoms of the DP in the mixing distribution to double up when α is small.
When α = 10, the results also lend some support to three clusters (the maximum number allowed), with 3% to 8% probability, where the count 1 observed on day 7 often forms the third cluster.

Discussion
In this paper, we have discussed a hierarchical Dirichlet process mixing model as in (2.3). We wrote it in a very general form to allow different components to have different models, and to allow vector-valued observations and vector-valued latent parameters for each component. The DP prior constructed on the distribution of the latent parameter relaxes the strong parametric assumptions made in the usual hierarchical model. We discussed two MCMC algorithms for updating the parameters. The blocked truncated Gibbs sampler restricts the number of atoms in the DP to at most R; the slice Gibbs sampler updates the number of atoms in the DP by slice sampling, with no a priori restriction on it. When applying these algorithms to formulate cluster configurations, allocation indices (labels) are usually not uniquely defined. In order to preserve the posterior distribution of the latent variables given the data, we establish an equivalence theorem for when two allocation index vectors from two iterations of the MCMC run are equivalent. This theorem establishes that the two sets of indices not only need to have the same number of distinct clusters, but also need to have the same components in each cluster. We recommend adding this equivalence check to the DPM algorithm. It identifies permutations of labels that lead to the same cluster configuration, effectively recognizing the label-switching problem and tabulating the equivalent cluster configurations correctly. We have conducted a simulation study and a real data analysis using the blocked truncated Gibbs sampler and the configuration-checking algorithm of Section 4. Our simulation study shows that our method is efficient, and our real data analysis with varying prior choices provides more insight into the sensitivity of the results to the prior choice.