Estimation of Population Median in Presence of Non-Response Under Two-Phase Sampling

The present investigation deals with the problem of estimation of population median in presence of non-response under two-phase (double) sampling. Using information on two auxiliary variables, four general classes of estimators are suggested for four different realistic situations of non-response. It is shown that several estimators can be generated from the proposed classes of estimators. The proposed classes of estimators are compared with some contemporary estimators of population median under similar realistic situations. The merits of the proposed strategies are interpreted through empirical studies carried out over three natural population data sets and one artificially generated population data set, which establish the effectiveness of the suggested classes of estimators. Suitable recommendations to the survey statistician are made.
Key words: Median estimation, two-phase sampling, non-response, study variable, auxiliary variable, bias, mean square error.


Introduction
In survey sampling, statisticians often come across variables that have highly skewed distributions, such as income and expenditure. In such situations, the estimation of the median deserves special attention. Kuk and Mak (1989) were the first to introduce the estimation of the population median of the study variable using auxiliary information in survey sampling. Francisco and Fuller (1991) also dealt with the problem of estimation of the median as part of the estimation of a finite population distribution function. Later on, Singh et al. (2001), Singh et al. (2006), Singh and Priyanka (2008) and Jhajj and Bhangu (2013) contributed towards the improvement of estimation procedures of population median using information on one or two auxiliary variables.
It is worth mentioning that all these developments in the estimation of population median are based on complete response from the sampled units. However, no efforts have been made to estimate the population median in presence of non-response in the sampled units. Non-response is one major problem encountered by practitioners in the field of sample surveys. For example, in the case of income from milk yield surveys, the animal may be sold or may die during the survey period; in the case of vegetable or fruit surveys, the yield of some pickings may be damaged or lost, or the enumerators may fail to record them. Thus, the observations may be missing for some of the time stages. Such non-response (missingness) can have different patterns and causes. In surveys covering human populations, in most cases, information is not obtained from all the units in the survey at the first attempt, even after some call-backs. For example, selected families may not be at home at the time of the survey or may not cooperate with the interviewer even if contacted. This is particularly true in mail surveys, in which questionnaires are mailed to the sampled respondents who are requested to send back their returns by some deadline. As many respondents do not reply, the available sample of returns is incomplete. An estimate obtained from such incomplete data may be misleading, especially when the respondents differ from the non-respondents.
In order to reduce the effect of non-response in estimation of population mean, Hansen and Hurwitz (1946) gave a technique of sub-sampling of the non-responding group. Following the Hansen and Hurwitz (1946) technique, several authors including Cochran (1977), Tripathi and Khare (1997), Tabasum and Khan (2004) and Singh and Kumar (2010 a, b) have contributed towards the improvement of the estimation procedures of population mean in presence of non-response using information on an auxiliary variable.
In many situations, information on the auxiliary variable may be readily available on all the units of the population; for example, the tonnage (or seat capacity) of each vehicle or ship is known in survey sampling of transportation, and the number of beds in different hospitals may be known in hospital surveys. When such information is lacking, it is sometimes relatively cheap to take a large preliminary sample in which the auxiliary variable alone is measured. This technique is known as double sampling or two-phase sampling. Tabasum and Khan (2004) have mentioned that the procedure of double sampling can be applied in a household survey where the household size is used as an auxiliary variable for the estimation of family expenditure. Information can be obtained completely on the family size, while there may be non-response on the household expenditure.
Motivated by the above arguments and following the technique of sub-sampling of the non-responding group, we have proposed four general classes of estimators of population median for four different situations of non-response in two-phase sampling. The superiority of the proposed classes of estimators over some contemporary median estimators under similar realistic situations has been established through theoretical and empirical comparisons.
The next section discusses the formulation of the classes of estimators. Sections 3 and 4 discuss their properties. Sections 5 and 6 discuss the merits of the proposed strategies and Section 7 concludes the paper.
ISSN-2424-6271 IASSL

Formulation of the Classes of Estimators
Consider a finite population $U = (U_1, U_2, U_3, \ldots, U_N)$ of N units. Let y, x and z be the variable under study, the first auxiliary variable and the second auxiliary variable respectively. Let $y_k$, $x_k$ and $z_k$ be the values of the variables y, x and z respectively for the k-th (k = 1, 2, …, N) unit in the population. When the population median $M_x$ of the auxiliary variable x is unknown, our purpose is to estimate the population median $M_y$ of the study variable y from a sample obtained through a two-phase selection. Permitting simple random sampling without replacement (SRSWOR) design in each phase, the two-phase sampling scheme will be as follows:
i. The first phase sample $S_{n'}$ of size $n'$ is drawn to observe the variable x only in order to furnish an estimate of $M_x$.
ii. The second phase sample $S_n$ ($S_n \subset S_{n'}$) of size n is drawn to observe the variable y only.
Assuming that the population median $M_x$ of the auxiliary variable x is known, Kuk and Mak (1989) suggested a ratio estimator for the population median $M_y$ of the study variable y as $\hat{M}_y \, (M_x / \hat{M}_x)$ (equation (1)), where $\hat{M}_y$ and $\hat{M}_x$ are the sample estimators of $M_y$ and $M_x$ respectively based on a sample $S_n$ of size n. Let $y_{(1)}, y_{(2)}, \ldots, y_{(n)}$ be the y values of sample units in ascending order. Further, let t be an integer such that $y_{(t)} \le M_y \le y_{(t+1)}$.
Encouraged by the above works, we have considered that in the first phase sample $S_{n'}$ of size $n'$ all the units supply information on the auxiliary variables x and z, and that in the second phase sample $S_n$ of size n, $n_1$ units supply information on y and $n_2$ units refuse to respond.
Considering the non-response situation in the second phase sample, one may form an estimator by utilizing the information only from the respondents, or take a sub-sample of the non-respondents and re-contact them. Following the Hansen and Hurwitz (1946) technique of sub-sampling the non-responding group, adopted for estimation of the population mean, a sub-sample of m units is selected at random (without replacement) from the $n_2$ non-respondent units and is enumerated by direct interview. It is assumed that response is obtained for all the m units and that the whole population (i.e., U) consists of two non-overlapping strata of $N_1$ and $N_2$ units. The stratum of $N_1$ responding units, denoted by $U_1$, would respond on the first call at the second phase, and the stratum of $N_2 = N - N_1$ non-responding units, denoted by $U_2$, would not respond on the first call at the second phase but will respond on the second call. Further, we assume that the strata sizes $N_1$ and $N_2$ are not known in advance; see Tripathi and Khare (1997). The stratum weights of the responding and non-responding groups are given by $W_1 = N_1/N$ and $W_2 = N_2/N$ respectively [...].
If non-response occurs on the study variable y as well as on the auxiliary variable x in the second phase sample, the estimators $t_1$ and $t_2$ may be considered in the following form as [...], where $d^{**}$ is a suitably chosen real constant such that the variance of the estimator $t_2^{**}$ is minimum.
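The two-phase selection with sub-sampling of the non-respondents described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the estimator of this paper: the function names, the parameter `k` (taken here so that m is about $n_2/k$) and the use of a weighted empirical distribution function to obtain a median are all assumptions introduced for illustration.

```python
import random

def weighted_median(values, weights):
    """Smallest value at which the weighted empirical CDF reaches 0.5."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    cum = 0.0
    for v, w in pairs:
        cum += w
        if cum >= 0.5 * total:
            return v
    return pairs[-1][0]

def hh_median(y, responds, n_first, n_second, k, rng):
    """Two-phase SRSWOR with sub-sampling of second-phase non-respondents.

    y        : study-variable values for the whole population U
    responds : True if the unit answers on the first call (stratum U_1)
    k        : hypothetical inverse sub-sampling fraction (k >= 1), m ~ n_2 / k
    """
    N = len(y)
    first = rng.sample(range(N), n_first)      # first-phase sample of size n'
    second = rng.sample(first, n_second)       # second-phase sample of size n
    resp = [i for i in second if responds[i]]          # n_1 respondents
    nonresp = [i for i in second if not responds[i]]   # n_2 non-respondents
    m = max(1, round(len(nonresp) / k)) if nonresp else 0
    sub = rng.sample(nonresp, m) if m else []  # re-contacted by direct interview
    values = [y[i] for i in resp] + [y[i] for i in sub]
    # A respondent has weight 1/n; each re-contacted unit stands in for
    # n_2/m non-respondents, so it has weight n_2/(m*n).
    weights = ([1.0 / n_second] * len(resp)
               + [len(nonresp) / (m * n_second)] * len(sub))
    return weighted_median(values, weights)
```

For instance, `hh_median(y, responds, 60, 30, 2, random.Random(1))` draws $n' = 60$ and $n = 30$ units and re-contacts roughly half of the second-phase non-respondents.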
Motivated by the above suggestions, and following the two-phase sampling structure defined above with the assumption that the population median $M_x$ of the auxiliary variable x is unknown, we propose the following four general classes of estimators of the population median $M_y$ of the study variable y, applicable to four different situations of non-response.

Situation I:
In this case, we assume that non-response occurs on the study variable y as well as on the auxiliary variable x in the second phase sample of size n, and that the population median $M_z$ of the second auxiliary variable z is known. Accordingly, we have suggested the general class of estimators $T_1$ of the population median $M_y$ in two-phase sampling as [...] where [...].

Situation II:
In this situation, we assume that non-response occurs on the study variable y as well as on the auxiliary variables x and z in the second phase sample of size n, and that the population median $M_z$ of the auxiliary variable z is unknown. Considering these aspects, we have formed the general class of estimators $T_2$ of $M_y$ in two-phase sampling as [...] where [...].

Situation III:
In this case, we assume that non-response occurs only on the study variable y, while complete information on the auxiliary variable x is available in the second phase sample of size n, and that the population median $M_z$ of the second auxiliary variable z is known. Considering this situation, we have proposed the general class of estimators $T_3$ of the population median $M_y$ in two-phase sampling as [...]. We treat the composite function [...] where [...].

Situation IV:
In this case, we assume that in the second phase sample non-response is found on the study variable y and on the auxiliary variable z, whose population median $M_z$ is unknown, while complete information about the auxiliary variable x is available. Considering this situation, we have suggested the general class of estimators $T_4$ of the population median $M_y$ in two-phase sampling as [...] where [...].
Proceeding as above, it can be found that the classes of estimators $T_3$ and $T_4$ are also very wide and the following estimators can be identified as their members.
Estimators belonging to the class $T_3$: [...]
Estimators belonging to the class $T_4$: [...]

Bias and Mean Square Errors of the Proposed Classes of Estimators
The bias and mean square errors (M. S. E.s) of the proposed classes of estimators $T_i$ (i = 1, 2, ..., 4) to the first order of approximations are derived under large sample approximations using the following transformations: [...]. Further, we have the following expectations: [...], where it is assumed that as N → ∞ (obviously then $N_1$ and $N_2$ both → ∞), the distribution of the variables (x, y, z) approaches a continuous distribution with marginal densities of x, y and z that are positive. It may be noted that under these conditions, the sample median $\hat{M}_y$ is consistent and asymptotically normal (Gross, 1980).
Now, to express $T_1$ in terms of e's, we expand [...] in a third order Taylor's series and we have [...], where $(q_{22}, q_{33}, q_{44}, q_{55}, q_{12}, q_{13}, q_{14}, q_{15}, q_{23}, q_{24}, q_{25}, q_{34}, q_{35}, q_{45})$ are the second order partial derivatives [...].
Taking expectations on both sides of the equations (20) - (23) and using the results from equation (17), we obtain the expressions for bias B(.) and mean square errors M(.) of the classes of estimators $T_i$ (i = 1, 2, ..., 4) to the first order of approximations as [...]. These mean square errors depend on the values of the derivatives (the d's, c's, p's and q's). Therefore, we desire to minimize the mean square errors of the classes of estimators $T_i$. We differentiate the equations (28) - (31) with respect to these derivatives [...]. Substituting the optimum values of the derivatives in equations (28) - (31), we have the minimum M. S. E.s [...], which involve unknown population parameters such as the correlation coefficients ρ. Thus, to use such estimators one has to use guessed or estimated values of these parameters. Guessed values of these population parameters can be obtained either from past data or from experience gathered over time; see Murthy (1967) and Tracy et al. (1996). If such guessed values are not available, then it is advisable to use sample data to estimate these parameters, as suggested by Silverman (1986) and Singh et al. (2001). In case non-response occurs in the sample data, it is advised to utilize the technique of sub-sampling of the non-responding group to estimate these parameters, as suggested in this paper. It can be seen that the mean square errors of the proposed classes of estimators remain the same up to the first order of approximations, even if the population parameters are replaced by their respective sample estimates.
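As an illustration of estimating such parameters from sample data, the following Python sketch evaluates a Gaussian kernel density estimate at a given point using the rule-of-thumb bandwidth $h = 0.9 \min(\hat\sigma, \mathrm{IQR}/1.34)\, n^{-1/5}$ associated with Silverman (1986). The function names are hypothetical, and `median_variance` uses the standard first-order SRSWOR approximation $(1/n - 1/N)\,\{4 f_y(M_y)^2\}^{-1}$ for the variance of a sample median under complete response; it is an assumption here, not one of the expressions of this paper.

```python
import math
import statistics

def kde_at(sample, point):
    """Gaussian kernel density estimate at `point` (rule-of-thumb bandwidth)."""
    n = len(sample)
    sd = statistics.stdev(sample)
    q1, _, q3 = statistics.quantiles(sample, n=4)
    iqr = q3 - q1
    spread = min(sd, iqr / 1.34) if iqr > 0 else sd
    h = 0.9 * spread * n ** (-0.2)
    kernel_sum = sum(math.exp(-0.5 * ((point - v) / h) ** 2) for v in sample)
    return kernel_sum / (n * h * math.sqrt(2.0 * math.pi))

def median_variance(n, N, density_at_median):
    """First-order variance of the sample median under SRSWOR, full response."""
    return (1.0 / n - 1.0 / N) / (4.0 * density_at_median ** 2)
```

In use, `kde_at(sample, statistics.median(sample))` gives a plug-in value for the density at the median, which can then be passed to `median_variance`.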

Efficiency Comparisons of the Proposed Classes of Estimators  
It is important to investigate the situations under which our proposed classes of estimators [...] $\hat{M}_y^{*}$, $t_i^{*}$ and $t_i^{**}$ (i = 1, 2). Proceeding as in sections 3 and 4, the variance V(.), minimum V(.) and M. S. E.s of these estimators to the first order of approximations are obtained as [...].

Efficiency Comparisons of the Classes of Estimators $T_1$ and $T_2$
When non-response is observed on the study variable y as well as on the auxiliary variable x in the second phase sample of size n, we compare the efficiencies of our proposed classes of estimators $T_i$ (i = 1, 2) under their respective optimality conditions with the estimators $\hat{M}_y^{*}$ and $t_i^{*}$ (i = 1, 2) and present them below.

(a) Efficiency Comparisons of the class of estimators $T_1$:
It could be concluded from equations (33) and (37) - (39) that [...]. It can be observed from equation (42) that the class of estimators $T_1$ is always preferable over $\hat{M}_y^{*}$, as [...]. Proceeding as above, it can be observed from equations (34) and (37) that [...], which is possible when [...]. The conditions stated in equation (45) [...] hold when conditions of the form −1 < ρ < 1 on the correlation coefficients involved are met.

Numerical Illustrations
We have chosen three natural population data sets and one artificially generated population data set to illustrate the performance of our proposed classes of estimators. The source of the populations, the nature of the variables y, x, z and the values of the various parameters are given as follows.
Table 1: Parametric values of different populations.

Natural population data sets
For completing the data sets of above populations we have taken

   
We assume that, at random, r % of the whole population (i.e. U) shows non-response; these units constitute the data set of the non-responding group (i.e. $U_2$) for the variables where non-response occurs.
To have a tangible idea about the performance of the proposed classes of estimators

Conclusions
The following conclusions can be drawn from the present study. From the efficiency comparisons in section 5, it is observed that the suggested classes of estimators $T_i$ (i = 1, 2, ..., 4) compare favourably with the competing estimators under similar non-response situations. Hence, the proposed classes of estimators are more justifiable in comparison with previous work of a similar nature, as they unify several desirable results, including effective handling of the various realistic situations of non-response. Therefore, they may be recommended to survey statisticians and practitioners for application to real life problems.
Let $p = t/n$ be the proportion of y values in the sample that are less than or equal to the median value $M_y$, an unknown population parameter. If $\hat{p}$ is a predictor of p, the sample median $\hat{M}_y$ can be written in terms of quantiles as $\hat{Q}_y(\hat{p})$ where $\hat{p} = 0.5$. For describing the estimator in equation (1), Kuk and Mak (1989) defined a matrix of proportions $P_{ij}(x, y)$ [...]. The proportions $P_{ij}(x, y)$ are usually unknown but can be estimated by $\hat{p}_{ij}(x, y)$ based on a similar cross-classification of the sample. It may be noted that the estimator defined in equation (1) is based on prior knowledge of the population median $M_x$ of the auxiliary variable x. In many situations of practical importance the population median $M_x$ may not be known. Motivated by this point, Singh et al. (2001) discussed the problem of estimating the population median $M_y$ in two-phase sampling and suggested ratio and regression type estimators of $M_y$ [...].
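The ratio-type estimators mentioned above can be illustrated with a minimal Python sketch under complete response. The function names are hypothetical, and the two-phase form, in which the unknown $M_x$ is replaced by the first-phase sample median of x, is stated here as an assumption about the ratio-type estimator rather than quoted from Singh et al. (2001).

```python
import statistics

def ratio_median_known(y_sample, x_sample, M_x):
    """Ratio estimator of M_y when the population median M_x is known."""
    return statistics.median(y_sample) * M_x / statistics.median(x_sample)

def ratio_median_two_phase(y_sample, x_sample, x_first_phase):
    """Two-phase analogue: M_x is replaced by the first-phase median of x."""
    M_x_first = statistics.median(x_first_phase)
    return ratio_median_known(y_sample, x_sample, M_x_first)
```

With `y_sample = [2, 4, 6]`, `x_sample = [1, 2, 3]` and a known $M_x = 4$, the first function returns 4 × 4 / 2 = 8.0.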

$U_1$ denotes the stratum of units that would respond on the first call at the second phase.
The arguments of the function defining the class $T_1$ assume values in a closed convex subspace $R_4$ of the four dimensional real space containing the point of the corresponding population medians, and the function is continuous and bounded in $R_4$.
The function defining the class $T_2$ satisfies regularity conditions similar to those given for the function in equation (10). It can be observed from equation (11) that any parametric function that reduces to $M_y$ at the point of the population medians, for all $M_y$, can generate an estimator of $M_y$. For example, we present below a few ratio, product, regression and exponential type estimators as members of the class of estimators $T_2$. [...]
The function [...] satisfies regularity conditions similar to those presented for the class of estimators $T_1$. The transformations are of the form [...] (1 + e), such that $|e_i|$ and $|e_{ij}|$ are < 1 (i = 0, 1, ..., 3; j = 1, 2). Now, we define the following two matrices of proportions $P_{ij}(y, z)$ and $P_{ij}(x, z)$; i, j = 1, 2, of the units of the population U [...]. The corresponding matrices of proportions of (x, y), (y, z) and (x, z); i, j = 1, 2, for the units in the population $U_2$ (non-responding group of the population U) can be defined in the same way as the matrices of proportions $P_{ij}(x, y)$, $P_{ij}(y, z)$ and $P_{ij}(x, z)$; i, j = 1, 2.
It is to be noted that $\rho_{xy} = 4P_{11}(x, y) - 1$ is the correlation coefficient between the variables x and y based on the population U, and goes from −1 to 1 as $P_{11}(x, y)$ increases from 0 to 1/2. Similarly, $\rho_{xz} = 4P_{11}(x, z) - 1$ and $\rho_{yz} = 4P_{11}(y, z) - 1$ are the correlation coefficients between the respective variables based on the population U, and $\rho_{xy(2)}$, $\rho_{xz(2)}$ and $\rho_{yz(2)}$, defined analogously from the corresponding proportions, are the correlation coefficients between the respective variables based on the non-responding part ($U_2$) of the population.
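The relation $\rho = 4P_{11} - 1$ can be checked numerically with a short Python sketch. The function names are hypothetical; $P_{11}$ is taken here as the proportion of units whose two values are both at or below their respective medians, which is the usual reading of the cross-classification. For a finite data set with tied or odd-sized medians the value may fall slightly outside [−1, 1].

```python
import statistics

def p11(u, v):
    """Proportion of units with u <= median(u) and v <= median(v)."""
    med_u, med_v = statistics.median(u), statistics.median(v)
    hits = sum(1 for a, b in zip(u, v) if a <= med_u and b <= med_v)
    return hits / len(u)

def median_corr(u, v):
    """Correlation-type coefficient rho = 4 * P11 - 1."""
    return 4.0 * p11(u, v) - 1.0
```

For `u = [1, 2, 3, 4]`, `median_corr(u, u)` gives 1.0 (since $P_{11} = 1/2$), and `median_corr(u, [4, 3, 2, 1])` gives −1.0 (since $P_{11} = 0$).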
[...], and each of the error terms e has zero expectation, where [...] $f_y$, $f_z$ [...] denote the marginal densities for the whole population U and the non-responding part of the population (i.e. $U_2$) respectively. This assumption holds in particular under a super population model framework, treating the values of (x, y, z) in the population as a realization of N independent observations from a continuous distribution. It is assumed that [...].
[...] are the second order partial derivatives of the function in equation (10). Expressing [...] in terms of e's and neglecting the terms of e's having power greater than two, we have the expansions of $T_1$, $T_2$, $T_3$ and $T_4$ as [...], in which the coefficients are the first and second order partial derivatives denoted by d, c, p and q respectively. Here $(c_{22}, \ldots, c_{35}, c_{45})$ are the second order partial derivatives of [...], and [...] are the population medians of the variables y, x and z respectively based on the non-responding part $U_2$.

Remark 3.1.
The bias and mean square errors of the various estimators (indicated in section 2) belonging to the classes of estimators $T_i$ (i = 1, 2, ..., 4) can be easily obtained by substituting the suitable values of the derivatives in equations (24) - (31), as suggested for the population mean estimation technique by Singh et al. (2007).

Minimum M. S. E. of the Classes of Estimators $T_i$ (i = 1, 2, ..., 4)
It is obvious from the equations (28) - (31) and remark 3.1 that the mean square errors of the proposed classes of estimators $T_i$ (i = 1, 2, ..., 4) depend on the different values of the derivatives (the d's, c's, p's and q's). We differentiate these equations with respect to the derivatives and equate the results to zero. Thus, we have obtained the optimum values of the derivatives for minimizing the M. S. E. of the classes of estimators [...].

Efficiency Comparisons of the Classes of Estimators $T_3$ and $T_4$
We wish to compare the efficiencies of the classes of estimators $T_3$ and $T_4$ under their respective optimality conditions with the estimators $\hat{M}_y^{*}$ and $t_i^{**}$ (i = 1, 2) when there is non-response only on the study variable y but complete information on the auxiliary variable x is available in the second phase sample of size n.
(a) Efficiency Comparisons of $T_3$:
Proceeding as above for the comparisons of efficiencies of the class of estimators $T_3$ [...] (i = 1, 2) from equations (35), (37), (40) and (41), we observe that [...].
Comparison of the efficiencies of the class of estimators $T_4$ with the estimator $\hat{M}_y^{*}$ from equations (36) and (37) reveals that the class of estimators $T_4$ is preferable over the estimator $\hat{M}_y^{*}$ [...]. Following the comparison technique discussed in section 5.1.(b), it can be easily verified that the condition stated in equation (48) occurs when $-1 < \rho_{xz} < 1$.
(ii) Similarly, a comparison of the efficiencies of the class of estimators $T_4$ with the estimator $t_i^{**}$ from equations (36) and (41) indicates that the class of estimators $T_4$ is more efficient than the [...] {see for instance Singh and Priyanka (2008)}, where
$\mu_y$, $\mu_x$, $\mu_z$: Population means of the variables y, x and z based on the whole population U.
$\sigma_y$, $\sigma_x$, $\sigma_z$: Population standard deviations of the respective variables based on the whole population.
[...]: Population means of the variables y, x and z based on the non-responding part of the population.
[...]: Population standard deviations of the respective variables based on the non-responding part of the population.
Artificially Generated Population
We have generated three sets of independent random numbers of size N (N = 100), namely $x_k$, $y_k$ and $z_k$ (k = 1, 2, 3, ..., N), from a standard normal distribution with the help of R-software. Further, motivated by the artificial population generation techniques adopted by Singh and Deo (2003) and Singh et al. (2001), we have generated the following transformed variables of the population U with the values $\rho_{xy} = 0.8$, $\rho_{xz} = 0.6$, $\sigma_y^2 = 100$, [...]
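One common way to build such a population from independent standard normal numbers is sketched below in Python (the paper itself used R-software). The exact transformation used in the paper is not shown, so the construction, the function names and the default means and variances of x and z are assumptions made only for illustration; only $\rho_{xy} = 0.8$, $\rho_{xz} = 0.6$, $\sigma_y^2 = 100$ and N = 100 come from the text.

```python
import math
import random

def generate_population(N=100, rho_xy=0.8, rho_xz=0.6, var_y=100.0,
                        var_x=100.0, var_z=100.0,
                        mu_x=50.0, mu_y=50.0, mu_z=50.0, seed=0):
    """Return a list of (x, y, z) triples with the target correlations.

    var_x, var_z and the three means are hypothetical defaults.
    """
    rng = random.Random(seed)
    sd_x, sd_y, sd_z = math.sqrt(var_x), math.sqrt(var_y), math.sqrt(var_z)
    population = []
    for _ in range(N):
        # three independent standard normal numbers per unit
        xs, ys, zs = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        x = mu_x + sd_x * xs
        y = mu_y + sd_y * (rho_xy * xs + math.sqrt(1 - rho_xy ** 2) * ys)
        z = mu_z + sd_z * (rho_xz * xs + math.sqrt(1 - rho_xz ** 2) * zs)
        population.append((x, y, z))
    return population

def correlation(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)
```

Under this particular construction y and z are related only through x, so the implied value of $\rho_{yz}$ is $\rho_{xy}\rho_{xz} = 0.48$; a different transformation would be needed for any other target value.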

For the classes of estimators $T_i$ (i = 1, 2, ..., 4), we have computed the percent relative efficiencies (PREs) of the estimators $T_i$ and of the estimators $t_i^{*}$ and $t_i^{**}$ (i = 1, 2) under their respective optimality conditions with respect to the sample median estimator $\hat{M}_y^{*}$. The findings are displayed in Tables 2 and 3, where the percent relative efficiency (PRE) of an estimator t with respect to $\hat{M}_y^{*}$ is designated as [...], in which M(t) denotes the M. S. E./minimum M. S. E./minimum variance of an estimator t.
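The computation of a percent relative efficiency can be sketched as follows in Python. The conventional definition, 100 times the ratio of the reference M. S. E. to the M. S. E. of the estimator under study, is assumed here because the defining expression is not shown above; the function name is hypothetical.

```python
def percent_relative_efficiency(mse_reference, mse_estimator):
    """PRE of an estimator t with respect to a reference estimator.

    mse_reference : M. S. E. (or variance) of the reference estimator
    mse_estimator : M. S. E., minimum M. S. E. or minimum variance of t
    """
    if mse_estimator <= 0:
        raise ValueError("mse_estimator must be positive")
    return 100.0 * mse_reference / mse_estimator
```

A value above 100 means that t is more efficient than the reference; for example, a reference M. S. E. of 4.0 and an estimator M. S. E. of 2.0 give a PRE of 200.0.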

The classes of estimators $T_i$ (i = 1, 2, ..., 4) are always more efficient than the sample median estimator $\hat{M}_y^{*}$ under their respective optimality conditions, and the classes of estimators $T_1$ and $T_3$ are always preferable over the estimators $t_i^{*}$ and $t_i^{**}$ (i = 1, 2) respectively under similar non-response situations. From Table 2, it is vindicated that, for different values of the correlation coefficients (i.e. $\rho_{yz}$, $\rho_{xy}$, $\rho_{xz}$, $\rho_{yz(2)}$, $\rho_{xy(2)}$, [...]), [...] (i = 1, 2). It is also noted that for high positive values of these correlation coefficients, the classes of estimators $T_i$ yield substantial gain in efficiency over the estimator $\hat{M}_y^{*}$ (especially visible from populations I and II).
[...] This situation, but in presence of complete response from the sampled units, has been discussed among others by Singh et al. (2006).
It may be noted from the above equation that $T_1$ is always more efficient than the estimator [...].

Population I-Source: Statistical Abstract of the United States, 2012 (Table No. 233)
The present data relate to the state wise educational attainment of the United States in the year 2012. Advanced degree or more in the year 2009 is taken as the study variable y, while Bachelor's degree or more in the years 2007 and 2006 are taken as the auxiliary variables x and z respectively. The first 10 states have been considered as the non-responding part of the population for the variables where non-response occurs.

Population II-Source: Statistical Abstract of the United States, 2012 (Table No. 629)
The state wise total unemployment of the civilian labour force of the United States in the year 2012 has been taken under study. Percent of total unemployed in the years 2010, 2009 and 2008 are considered as the variables y, x and z respectively. The first 10 states have been considered as the non-responding group of the population for the variables where non-response occurs.

Population III-Source: Cochran (1977), p. 34
This data set indicates the weekly expenditure (y), the number of persons (x) and the weekly family income (z) of 33 low income families. The first 7 families are considered as the non-responding group of the population for the variables where non-response occurs.

Table 2: PREs of the different estimators with respect to $\hat{M}_y^{*}$ for natural population data sets.

Table 3: PREs of the different estimators with respect to $\hat{M}_y^{*}$ for artificially generated population taking $n' = 30$ and $n = 15$.