On the Mixture Gaussian Copula to study the suitability of Diagnostic Tests

In this article, we develop a new copula, the “Mixture Gaussian Copula”, to study the suitability of diagnostic tests in the same manner as ROC curves are used in similar studies.


Introduction
For some time now, Markov models and hidden Markov models have been in use in speech recognition, meteorology, biometry and many other fields. In biometry, especially in epidemiological modeling, DNA sequencing and micro-array analysis, the hidden Markov model is commonly used (see Ibe (2009) for details and the literature). Copulas, on the other hand, provide a general way of formulating a multivariate distribution so that dependence can be introduced in a reasonable manner. They rest on the simple idea that a joint distribution can be represented as a transformation of the underlying marginal distributions (see Sklar 1959). There are several copulas, and they differ in the strength of the dependence and the direction of the association they can represent. Copula models fall under either the family of Archimedean copulas or the non-Archimedean copulas; the Gaussian copulas belong to the non-Archimedean family. In this paper, we consider the Gaussian copula or, more specifically, a mixture of Gaussian copulas. For a literature review, interested readers are referred to Nelsen (2006).
In the recent past, Krzanowski and Hand (2009), Pepe (2003), Zhou et al. (2002) and others have investigated the use of the Receiver Operating Characteristic curve (ROC curve) in the context of screening and diagnostic testing in the medical field. Pundir (2011), Gonen (2007) and Shultz (1995) used the ROC curve to study the effectiveness of medical diagnostic tests based on a single variable and on multiple variables. This paper uses both the Markov model and the copula model to structure a medical diagnostic test, and uses the ROC curve to compare the effectiveness of two medical diagnostic tests. Furthermore, the method is computationally less intensive while maintaining a fair degree of precision in diagnosis.
ISBN-1391-4987 IASSL
We divide the paper into several sections. First, we present the methodology based on the Markov chain to model the (disease, no disease) states. Next, we use copulas, especially the mixture Gaussian copula, to model the distribution of the responses gathered from the sample. The ROC-curve-based analysis of the responses and the robustness of the estimates are presented in the last section.

Methodology
Here, the main objective is to study the suitability of a diagnostic test to classify a person as either healthy (H) or disease stricken. Suppose that a given percentage of a population is healthy and the others are stricken by a disease.
Due to fear that this disease may spread to the healthy population, every person in the population is advised to undergo a treatment for this disease. In order to ascertain the effectiveness of this treatment, we take two measurements from the same group of people, one before the treatment and the other after it. Let the pre-treatment and post-treatment measurements be taken at time 1 and time 2 respectively, and let the transition from time 1 to time 2 occur according to a Markov chain with a specified transition probability matrix.
Note that the measurements $S_1$ and $S_2$ are dependent, and we propose a new copula, the "Mixture Gaussian Copula", to model the correlation structure. Generally speaking, copulas are used for modeling joint distributions from the marginal distributions.

Copula Modeling
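In this section, we present the definition of a copula and the formula for the Gaussian copula.
Definition: A copula is a multivariate joint distribution defined on the $k$-dimensional unit cube $[0,1]^k$ such that every marginal distribution is uniform on $[0,1]$.
Gaussian Copula: For a correlation matrix $R$,
$$C_R(u_1, \dots, u_k) = \Phi_R\left(\Phi^{-1}(u_1), \dots, \Phi^{-1}(u_k)\right),$$
where $\Phi_R$ is the joint distribution function of a $k$-variate normal distribution with zero means, unit variances and correlation matrix $R$, and $\Phi^{-1}$ is the standard normal quantile function.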
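For the bivariate case, the Gaussian copula density has a closed form that is easy to evaluate. The following is a minimal sketch using only the Python standard library; the function name `gaussian_copula_density` and the parameter name `rho` are assumptions made for illustration.

```python
from math import exp, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def gaussian_copula_density(u, v, rho):
    """Bivariate Gaussian copula density c(u, v; rho), for 0 < u, v < 1 and |rho| < 1."""
    # Map the uniform margins to standard normal scores.
    x = _STD_NORMAL.inv_cdf(u)
    y = _STD_NORMAL.inv_cdf(v)
    one_minus_r2 = 1.0 - rho * rho
    # Ratio of the bivariate normal density to the product of its
    # two standard normal marginal densities.
    exponent = -(rho * rho * (x * x + y * y) - 2.0 * rho * x * y) / (2.0 * one_minus_r2)
    return exp(exponent) / sqrt(one_minus_r2)
```

With `rho = 0` the density is identically 1, which corresponds to independence of the two margins.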

Mixture Gaussian Copula:
We define the mixture Gaussian Copula density for the four component normal mixture as follows.
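As a sketch, a four-component mixture of bivariate Gaussian copula densities can be written in the generic form below. The symbols $p_i$ (mixing proportions) and $\rho_i$ (component correlations) are notation assumed here for illustration and may differ from the parametrization intended in this paper; $c(u_1, u_2; \rho)$ denotes the bivariate Gaussian copula density with correlation $\rho$.

```latex
c_{\mathrm{mix}}(u_1, u_2) \;=\; \sum_{i=1}^{4} p_i \, c(u_1, u_2; \rho_i),
\qquad p_i \ge 0, \qquad \sum_{i=1}^{4} p_i = 1 .
```

Because a convex combination of copulas is again a copula, a density of this form still has uniform margins.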
The area under the ROC curve (AUROC) is a measure of the accuracy of a diagnostic test. In probabilistic terms, it is the probability that the test value of a randomly chosen diseased subject exceeds that of a randomly chosen healthy subject (when larger values indicate disease). This technique is found to be very effective in evaluating the performance of a diagnostic test.
In our case, the situation is different, although the objective remains the same. We evaluate the performance of the test based on the probability that $S_2 < S_1$, where $S_1$ and $S_2$ are the health-related measurements at time 1 and time 2 respectively. We assume that the joint distribution can be modeled as a mixture Gaussian copula.
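To make the target quantity concrete, the sketch below evaluates the probability that the second measurement is smaller than the first under a simplifying assumption: the pair of measurements follows a finite mixture of bivariate normal distributions. This is an illustrative stand-in, not the copula construction of this paper, and the function name and the tuple layout `(weight, mu1, mu2, sd1, sd2, rho)` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def prob_s2_less_than_s1(components):
    """P(S2 < S1) for a mixture of bivariate normals (illustrative assumption).

    components: iterable of (weight, mu1, mu2, sd1, sd2, rho); weights sum to 1.
    """
    std_normal = NormalDist()
    total = 0.0
    for weight, mu1, mu2, sd1, sd2, rho in components:
        # Within a component, S1 - S2 is normal with mean mu1 - mu2
        # and the standard deviation computed here.
        sd_diff = sqrt(sd1 ** 2 + sd2 ** 2 - 2.0 * rho * sd1 * sd2)
        total += weight * std_normal.cdf((mu1 - mu2) / sd_diff)
    return total
```

A positive correlation shrinks the spread of the difference, so for a fixed gap in means it pushes the probability further away from 0.5.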
Furthermore, a copula must reproduce the marginal distributions when the joint distribution is collapsed over the other variable. Equating the marginal distributions therefore gives three equations in four variables, indicating that there could be multiple solutions for the mixing proportions of the mixture Gaussian copula model.
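The non-uniqueness can be illustrated with a short sketch. It assumes, purely for illustration, that the four mixture components correspond to the four (time 1, time 2) state pairs and that the three equations fix the healthy fraction at time 1, the healthy fraction at time 2, and the total of one. The function and variable names are hypothetical.

```python
def mixing_proportion_family(healthy_t1, healthy_t2, steps=5):
    """Feasible (p_hh, p_hd, p_dh, p_dd) under the assumed marginal constraints.

    Assumed constraints (not taken from the paper):
      p_hh + p_hd = healthy_t1, p_hh + p_dh = healthy_t2, all four sum to 1.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    # One free parameter remains; non-negativity bounds it to [lo, hi].
    lo = max(0.0, healthy_t1 + healthy_t2 - 1.0)
    hi = min(healthy_t1, healthy_t2)
    solutions = []
    for i in range(steps):
        p_hh = lo + (hi - lo) * i / (steps - 1)
        p_hd = healthy_t1 - p_hh
        p_dh = healthy_t2 - p_hh
        p_dd = 1.0 - p_hh - p_hd - p_dh  # may differ from 0 by rounding error
        solutions.append((p_hh, p_hd, p_dh, p_dd))
    return solutions
```

Every tuple returned satisfies the same three constraints, which is why additional information is needed to pin down a single set of proportions.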
Note that based on this Copula density, ( )

Flu season for Children
As we know, flu seasons come and go, and children, like the elderly, are very vulnerable to the flu. Some children have to see their physicians in order to recover, while others recover on their own without any treatment. There are also healthy children who are not affected by the flu at all. So, one can use a Markov chain to describe the state space of the children's health status during a flu season.
Suppose that a test is carried out during the flu season by taking health-related measurements from the children at the outset of the flu season and again towards its end. The purpose is to study the effectiveness of the medical test in diagnosing flu in children.
Let
$S_1$ = health-related measurement at the beginning of the flu season,
$S_2$ = health-related measurement at the end of the flu season.
The estimates of $P(S_2 < S_1)$ or $P(S_2 > S_1)$ can provide a true picture of the health status if the measurement $S = (S_1, S_2)$ for this study is carefully chosen.

Example
Here, we discuss a situation where the transition from healthy (H) to diseased, or vice versa, takes place according to a first-order Markov chain with a specified transition probability matrix. At time 1, 80% of the population was healthy and the remaining 20% were disease stricken. At time 2, 90% of the population was healthy and the remaining 10% were disease stricken.
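The relation between the two state distributions can be checked with a few lines of code. The transition matrix below is hypothetical: it is one of many matrices that carry the time 1 distribution (80% healthy, 20% diseased) to the time 2 distribution (90% healthy, 10% diseased), and it is not claimed to be the matrix used in this example.

```python
def propagate(initial, transition):
    """One step of a Markov chain: row vector `initial` times matrix `transition`."""
    n_states = len(transition[0])
    return [
        sum(initial[i] * transition[i][j] for i in range(len(initial)))
        for j in range(n_states)
    ]


# HYPOTHETICAL transition matrix, chosen only so that the marginals match.
# Rows: state at time 1; columns: state at time 2; order: healthy, diseased.
HYPOTHETICAL_P = [
    [0.95, 0.05],  # healthy  -> healthy, diseased
    [0.70, 0.30],  # diseased -> healthy, diseased
]

time1 = [0.80, 0.20]
time2 = propagate(time1, HYPOTHETICAL_P)  # approximately [0.90, 0.10]
```

Any other matrix whose rows sum to one and that satisfies the same two marginal constraints would serve equally well for this check.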

Test # 1
Suppose that for a diagnostic test (say Test # 1) the pre-treatment and post-treatment measurements are $S_1$ and $S_2$ respectively, both generated according to mixture normal distributions. The copula-based probability estimate obtained in equation (13) is very close to the corresponding empirical estimate of 0.54. This agreement supports our theory that the mixture Gaussian copula is a very good model for checking the suitability of a diagnostic test to decide whether a child is healthy or not when there is a correlation structure. Diagnostic Test # 1 indicates that only 54% of the children are healthy after the treatment, when nearly 80% should be healthy during the post-treatment period. So, diagnostic Test # 1 is not good and a different diagnostic test (say Test # 2) is needed.

Robustness Study (based on Test # 1 Results)
Here, we study the robustness of the copula-based estimate of the probability under other possible choices of the mixing proportions.

ROC Curve Comparison
The main feature of this article is the use of the mixture Gaussian copula to model the dependence between the observations gathered at time 1 and time 2 about a disease, together with ROC curves to compare the performance of the diagnostic tests. In the context of the following ROC curves only, the term "time1" represents the probability that a health-related measurement collected at time 1 exceeds a certain threshold. Similarly, the term "time2" represents the probability that a health-related measurement collected at time 2 exceeds the same threshold. Each ROC curve was drawn with the variable "time2" on the X-axis and the variable "time1" on the Y-axis. The area under the ROC curve (AUROC) is used as a measure for comparing the diagnostic tests.
Although there can be several estimates for the mixing proportions in the mixture Gaussian copula model, our study shows that the copula-based probability estimates are fairly robust, and hence there is very little difference in the area under the ROC curve based on this copula for any given diagnostic test. Moreover, the copula-based estimate agreed with the empirical estimate.
Here, we compare two different diagnostic tests, namely Test # 1 and Test # 2. We note that the area under the ROC curve for Test # 1 is much higher than that for Test # 2. So, we conclude that Test # 2 is better than Test # 1 in the context of diagnostic testing. This mixture Gaussian copula based model is fairly precise with respect to diagnosis and is computationally less intensive.
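The construction of such a curve can be sketched empirically: for each threshold, compute the fraction of time 2 measurements and the fraction of time 1 measurements at or above it, plot the former on the X-axis and the latter on the Y-axis, and integrate with the trapezoidal rule. The function names below are hypothetical, and the sketch uses sample proportions rather than the copula-based probabilities.

```python
def exceedance_roc(sample_x, sample_y):
    """Points (P(X >= c), P(Y >= c)) over observed thresholds c, from (0, 0) to (1, 1)."""
    # Sweep thresholds from the largest observed value down to the smallest.
    thresholds = sorted(set(sample_x) | set(sample_y), reverse=True)
    points = [(0.0, 0.0)]
    for c in thresholds:
        frac_x = sum(v >= c for v in sample_x) / len(sample_x)
        frac_y = sum(v >= c for v in sample_y) / len(sample_y)
        points.append((frac_x, frac_y))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points ordered by x."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

For example, `area_under_curve(exceedance_roc(time2_values, time1_values))` gives the area for one test, where `time2_values` and `time1_values` are placeholder names for the two lists of measurements.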
Note: in the Gaussian copula, $\Phi^{-1}$ denotes the inverse of the cumulative standard normal distribution function.
Concept and its application
ROC curves are used for classification in the context of assessing the performance of a diagnostic test. They are of immense help in the area of clinical diagnosis. ROC curves can also be used for comparing two or more diagnostic tests. An ROC curve is a graph of the true positive rate against the false positive rate in medical diagnosis.

Figure 1: ROC Curve based on first set of measurements

Figure 2: ROC Curve based on second set of measurements