Maximum Empirical Likelihood Estimation In A Heteroscedastic Linear Regression Model With Possibly Missing Responses

A heteroscedastic linear regression model is considered in which responses are allowed to be missing at random and the conditional variance is modeled as a function of the mean response. Maximum empirical likelihood estimation is studied for an empirical likelihood with an increasing number of estimated constraints. The resulting estimator is shown to be asymptotically normal and can outperform the ordinary least squares estimator.


Introduction
Consider the heteroscedastic linear regression model in which the response variable $Y$ is linked to the $q$-dimensional covariate vector $X$ by the formula
\[ Y = \theta^\top Z + \varepsilon, \qquad Z = m(X), \]
where $m$ is a known measurable function from $\mathbb{R}^q$ into $\mathbb{R}^p$, $\theta$ is an unknown vector in $\mathbb{R}^p$, the error variable $\varepsilon$ is conditionally centered, i.e., $E[\varepsilon \mid X] = 0$, and its conditional variance $\sigma^2(X) = E[\varepsilon^2 \mid X]$ is bounded and bounded away from zero. In order to identify the regression parameter, it is assumed that the matrix $M = E[Z Z^\top]$ is well defined and positive definite. For $q = 1$, a possible choice of $m$ is $m(x) = (1, x, x^2, \ldots, x^{p-1})^\top$, $x \in \mathbb{R}$, which corresponds to polynomial regression. The case $p = 2$ yields the classical heteroscedastic simple linear regression model. For $q = 2$, a possible choice of $m$ is given by $m(x_1, x_2) = (1, x_1, x_2, x_1^2, x_1 x_2, x_2^2)^\top$, $x_1, x_2 \in \mathbb{R}$.
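To make the role of $m$ concrete, here is a minimal sketch of the two design maps mentioned above; the function names are ours and purely illustrative.

```python
import numpy as np

def m_poly(x, p):
    """q = 1: polynomial regression design vector m(x) = (1, x, ..., x^(p-1))."""
    return np.array([x ** k for k in range(p)])

def m_quad(x1, x2):
    """q = 2, p = 6: quadratic design vector m(x1, x2) = (1, x1, x2, x1^2, x1*x2, x2^2)."""
    return np.array([1.0, x1, x2, x1 ** 2, x1 * x2, x2 ** 2])

# The model then reads Y = theta' m(X) + eps with E[eps | X] = 0.
print(m_poly(2.0, 4))      # [1. 2. 4. 8.]
print(m_quad(1.0, -2.0))   # [ 1.  1. -2.  1. -2.  4.]
```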
In the ideal situation one observes the pair $(X, Y)$. In real data sets, however, one frequently encounters missing values. Here we allow the response $Y$ to be missing. Then one observes $(\delta, X, \delta Y)$ with $\delta$ an indicator random variable. The interpretation is that for $\delta = 1$ one observes the full pair $(X, Y)$, while for $\delta = 0$ one observes only the covariate $X$. We make the common assumption that the response is missing at random. This means that the conditional probability of $\delta = 1$ given $(X, Y)$ depends on $X$ alone,
\[ P(\delta = 1 \mid X, Y) = P(\delta = 1 \mid X) = \pi(X). \]
Monographs on missing data are Little and Rubin (2002) and Tsiatis (2006). We assume throughout that the conditional probability $\pi(X)$ is bounded away from zero. This implies that $E[\delta]$ is positive.
The data in our model are $(\delta_1, X_1, \delta_1 Y_1), \ldots, (\delta_n, X_n, \delta_n Y_n)$, which are independent copies of the triple $(\delta, X, \delta Y)$. We set $Z_j = m(X_j)$ and let $N = \delta_1 + \cdots + \delta_n$ denote the number of complete observations. A possible estimator of $\theta$ is the least squares estimator $\hat\theta_L$ based on the complete observations,
\[ \hat\theta_L = \arg\min_{\vartheta \in \mathbb{R}^p} \sum_{j=1}^n \delta_j (Y_j - \vartheta^\top Z_j)^2. \]
If $\sigma^2$ were known, we could use the weighted least squares estimator $\hat\theta_W$ to estimate the regression parameter $\theta$. This estimator minimizes the weighted sum of squares
\[ Q(\vartheta) = \sum_{j=1}^n \delta_j \frac{(Y_j - \vartheta^\top Z_j)^2}{\sigma^2(X_j)} \]
and satisfies the stochastic expansion
\[ n^{1/2}(\hat\theta_W - \theta) = H^{-1} n^{-1/2} \sum_{j=1}^n \delta_j \frac{Z_j \varepsilon_j}{\sigma^2(X_j)} + o_P(1) \tag{1.1} \]
with
\[ H = E\Big[\frac{\delta Z Z^\top}{\sigma^2(X)}\Big] = E\Big[\frac{\pi(X) Z Z^\top}{\sigma^2(X)}\Big]. \]
Note that $H$ is invertible because $M$ is invertible and $\pi/\sigma^2$ is bounded and bounded away from zero. Thus $n^{1/2}(\hat\theta_W - \theta)$ is asymptotically normal with mean vector $0$ and dispersion matrix $H^{-1}$. Since we treat $\sigma^2$ as unknown, the weighted least squares estimator is no longer available. For this reason we call $\hat\theta_W$ the oracle weighted least squares estimator. A natural approach is to minimize, instead of $Q(\vartheta)$, the weighted sum of squares in which an estimator $\hat\sigma^2$ replaces the unknown $\sigma^2$. Working with a kernel estimator of $\sigma^2$ based on least squares residuals, Müller and Van Keilegom (2012) show that under additional assumptions the resulting estimator is asymptotically equivalent to the oracle weighted least squares estimator. This generalizes the work of Carroll (1982), who was the first to obtain such a result without missing responses, i.e., in the case when $\delta$ is identically 1. Similar results without missing responses were obtained by Müller and Stadtmüller (1987), Robinson (1987) and Schick (1987) for various nonparametric estimators of $\sigma^2$ under differing conditions.
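Both estimators just discussed can be written down in a few lines; the following sketch (names ours) computes the complete-case least squares estimator and, when the values $\sigma^2(X_j)$ are supplied, the oracle weighted least squares estimator.

```python
import numpy as np

def complete_case_ols(delta, Z, Y):
    """theta_L: least squares based on the complete observations only."""
    idx = delta == 1
    theta, *_ = np.linalg.lstsq(Z[idx], Y[idx], rcond=None)
    return theta

def oracle_wls(delta, Z, Y, sigma2):
    """theta_W: weighted least squares with weights 1/sigma^2(X_j);
    an oracle because sigma^2 is unknown in practice."""
    idx = delta == 1
    Zc, Yc, w = Z[idx], Y[idx], 1.0 / sigma2[idx]
    # Solve the weighted normal equations (sum w Z Z') theta = sum w Z Y.
    return np.linalg.solve(Zc.T @ (w[:, None] * Zc), Zc.T @ (w * Yc))
```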
Schick (2013) has shown that one can construct an estimator that is asymptotically equivalent to the oracle weighted least squares estimator without constructing an estimator of the variance function $\sigma^2$. He treated the case $q = 1$ with missing responses. He avoided estimation of the variance function by working with a guided maximum empirical likelihood estimator associated with an empirical likelihood with an increasing number of estimated constraints,
\[ S_n(\vartheta) = \sup\Big\{ \prod_{j=1}^n n p_j : p_1, \ldots, p_n \ge 0,\ \sum_{j=1}^n p_j = 1,\ \sum_{j=1}^n p_j \, \delta_j (Y_j - \vartheta^\top Z_j) \, v_{r_n}(G_j) = 0 \Big\}, \]
with $r_n$ a positive integer tending to infinity with $n$, $v_r$ a function from $[0, 1]$ into $\mathbb{R}^{r+1}$ for each positive integer $r$, and $G_j$ an estimator of the covariate distribution function evaluated at $X_j$. His estimator maximizes the restriction of $S_n$ to the random ball centered at the least squares estimator with radius $C(\log n / n)^{1/2}$ for some constant $C$,
\[ \hat\theta_S = \arg\max_{\|\vartheta - \hat\theta_L\| \le C (\log n / n)^{1/2}} S_n(\vartheta). \]
He obtained the asymptotic equivalence of this estimator and the oracle weighted least squares estimator (by establishing the expansion (1.1) with $\hat\theta_S$ in place of $\hat\theta_W$) under a growth condition on $r_n$ and mild assumptions on the functions $v_r$.
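In practice, profile values such as $S_n(\vartheta)$ are evaluated through the standard dual representation of empirical likelihood. The following minimal Newton sketch assumes that standard dual form and omits the safeguards a production implementation needs.

```python
import numpy as np

def log_el(g):
    """Dual computation of the log empirical likelihood: for constraint
    vectors g_j (the rows of g), log S_n = -sum_j log(1 + lam' g_j),
    where lam solves sum_j g_j / (1 + lam' g_j) = 0.  Plain Newton
    iteration; a careful implementation needs step-halving to keep
    all weights 1 + lam' g_j positive."""
    n, d = g.shape
    lam = np.zeros(d)
    for _ in range(50):
        w = 1.0 / (1.0 + g @ lam)                 # must stay positive
        grad = g.T @ w                            # sum_j g_j / (1 + lam' g_j)
        hess = -(g * (w ** 2)[:, None]).T @ g
        lam += np.linalg.solve(hess, -grad)
        if np.linalg.norm(grad) < 1e-10:
            break
    return -np.sum(np.log1p(g @ lam))
```

In the present setting the rows of `g` would be $\delta_j (Y_j - \vartheta^\top Z_j) v_{r_n}(G_j)^\top$; maximizing the returned value over the ball around $\hat\theta_L$ then gives $\hat\theta_S$.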
With U denoting the uniform distribution on [0,1], he required these functions to satisfy the following conditions.
(C1) There are positive constants $c_0, c_1, c_2, c_3$ such that the inequalities hold for all $x$ and $y$ in $[0, 1]$ and all unit vectors $u$ in $\mathbb{R}^{r+1}$.
(C2) For every

Guided maximum empirical likelihood estimation was introduced and studied by Peng and Schick (2012) for a fixed number of constraints. As shown in Schick (2013), possible choices for the function $v_r$ are a trigonometric choice and a spline choice. The first choice consists of the first $1 + r$ elements of the trigonometric basis of $L_2(U)$. The components of the second choice form a basis of the linear splines with knots at the points $0/r, \ldots, r/r$.
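For concreteness, here is one way to implement the two bases. The paper describes the first choice only as the first $1 + r$ elements of the trigonometric basis, so the cosine system and its scaling below are assumptions on our part; the hat functions are the canonical basis of the linear splines with the stated knots, again with a scaling of our choosing.

```python
import numpy as np

def v_trig(x, r):
    """One concrete trigonometric choice: the orthonormal cosine system
    1, sqrt(2) cos(pi x), ..., sqrt(2) cos(r pi x) in L2(U)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    rows = [np.ones_like(x)]
    rows += [np.sqrt(2.0) * np.cos(k * np.pi * x) for k in range(1, r + 1)]
    return np.vstack(rows)              # shape (r + 1, len(x))

def v_spline(x, r):
    """Hat-function basis of the linear splines on [0, 1] with knots
    0/r, 1/r, ..., r/r; component k is max(0, 1 - r |x - k/r|)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    knots = np.arange(r + 1) / r
    return np.vstack([np.maximum(0.0, 1.0 - r * np.abs(x - t)) for t in knots])
```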
In many situations the conditional variance function can be modeled as a function of the mean response as expressed in the following assumption.
(A0) There is a measurable function $\tau$ from $\mathbb{R}$ into some compact subset of $(0, \infty)$ such that
\[ \sigma^2(X) = \tau^2(\theta^\top Z) \quad \text{almost surely}. \]
We now make this assumption and study the guided maximum empirical likelihood estimator
\[ \hat\theta = \arg\max_{\|\vartheta - \hat\theta^*\| \le C(\log n / n)^{1/2}} \hat S_n(\vartheta) \]
associated with the modified empirical likelihood $\hat S_n(\vartheta)$, which is obtained from $S_n(\vartheta)$ by replacing $G_j$ by $\hat R_j(\hat\theta^*)$. Here
\[ \hat R_j(\vartheta) = \frac{1}{N} \sum_{i=1}^n \delta_i \mathbf{1}[\vartheta^\top Z_i \le \vartheta^\top Z_j], \]
and $\hat\theta^*$ is a discretized version of $\hat\theta_L$ which is obtained by matching $\hat\theta_L$ with the closest point in the grid
\[ \{ c\, n^{-1/2} (i_1, \ldots, i_p)^\top : i_1, \ldots, i_p \text{ integers} \}, \]
where $c$ is a positive constant. We work with a discretized version to simplify our proofs. Indeed, discretized estimators can be treated as non-stochastic sequences in the proofs. Discretization was introduced by Le Cam (1960) for precisely this reason and has become a popular tool in the construction of efficient estimators in semiparametric models; see Bickel (1982) and Schick (1986, 1987).
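A small sketch of the two ingredients just introduced, under our reading of the definitions above: $\hat R_j(\vartheta)$ as the complete-case empirical distribution function of $\vartheta^\top Z$ evaluated at $\vartheta^\top Z_j$, and Le Cam discretization onto a grid of mesh $c\,n^{-1/2}$.

```python
import numpy as np

def R_hat(theta, delta, Z):
    """For each j, the fraction of complete cases i with
    theta' Z_i <= theta' Z_j (empirical analogue of G_theta)."""
    t = Z @ theta
    tc = np.sort(t[delta == 1])          # complete-case values
    return np.searchsorted(tc, t, side="right") / tc.size

def discretize(theta_ls, n, c=1.0):
    """Match theta_L with the closest point of the grid
    {c n^(-1/2) (i_1, ..., i_p) : i_k integers}."""
    mesh = c / np.sqrt(n)
    return mesh * np.round(theta_ls / mesh)
```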
We need the following assumptions.

(A1) There is a neighborhood $N$ of $\theta$, a finite constant $K$ and a positive $\alpha$, $\alpha \le 1$, such that a Hölder-type inequality holds for all $\vartheta \in N$ and all $y, z \in \mathbb{R}$.
(A2) The matrix is positive definite with

We are ready to state our main result.
The theorem shows that the new estimator is no longer equivalent to the oracle weighted least squares estimator. Simulations in Section 2 show that the new estimator can significantly outperform the complete-case least squares estimator $\hat\theta_L$ in certain situations, but may be worse in others.
A proof of the theorem is given in Section 4. A discussion of our assumptions and some preparatory results are in Section 3. The results of a simulation study are in Section 2.

Simulations
In order to assess our method, we performed a small simulation study using R. We compared several versions of our guided maximum empirical likelihood estimator with the least squares estimator (OLSE) and the oracle weighted least squares estimator (WLSE). We took $p = q = 2$, $\theta = (1, 1)^\top$, $\delta = 1$, $Z = m(X) = X$, with $X$ a bivariate normal random vector with mean vector $(1, -2)^\top$ and diagonal dispersion matrix with diagonal entries $1$ and $4$, and $\varepsilon = \tau(\theta^\top X)\zeta$, with $\zeta$ standard normal and independent of $X$. For this choice, assumption (A1) holds with $\alpha = 1$; see Remark 3 below.
We looked at four cases for $\tau$, namely $\tau = \tau_A$, $\tau = \tau_B$, $\tau = \tau_C$ and $\tau = \tau_D$, where, in particular, $\tau_D$ is given by $\tau_D^2(x) = \sin(x) + 1.1$. We ran simulations for these choices of $\tau$ with sample sizes $n = 100$ and $n = 200$, and $2000$ repetitions. In the tables we report $100$ times the simulated mean square error for each estimator. The simulated mean square error of an estimator $\hat\theta$ based on $2000$ repetitions is defined by
\[ \frac{1}{2000} \sum_{i=1}^{2000} \|\hat\theta_i - \theta\|^2, \]
with $\hat\theta_i$ the result of the $i$-th repetition; it estimates $E[\|\hat\theta - \theta\|^2]$. We write GT($r$) for our estimator when using the trigonometric basis and $r_n = r$, and GS($r$) for our estimator when using the spline basis and $r_n = r$. Table 2.1 reports the results for OLSE, WLSE and GT($r$) with $r = 1, \ldots, 5$, while Table 2.2 reports the results for OLSE, WLSE and GS($r$) with $r = 1, \ldots, 5$. We used the same sample to construct each of the twelve estimators. Thus the columns OLSE and WLSE are the same for both tables.
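The simulations were run in R; the following Python sketch mirrors the design for the one variance function whose formula is given above, $\tau_D^2(x) = \sin(x) + 1.1$, and reports $100$ times the simulated mean square error of the OLSE. The empirical likelihood estimators GT($r$) and GS($r$) would additionally require the maximization of $\hat S_n$ and are omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([1.0, 1.0])

def tau_D(x):
    # tau_D^2(x) = sin(x) + 1.1
    return np.sqrt(np.sin(x) + 1.1)

def mse_olse(n, reps=2000, tau=tau_D):
    """100 x simulated mean square error of the OLSE: delta = 1, Z = X,
    X ~ N((1, -2), diag(1, 4)), eps = tau(theta' X) * zeta."""
    sq_err = np.empty(reps)
    for i in range(reps):
        X = np.array([1.0, -2.0]) + np.array([1.0, 2.0]) * rng.standard_normal((n, 2))
        Y = X @ theta + tau(X @ theta) * rng.standard_normal(n)
        th, *_ = np.linalg.lstsq(X, Y, rcond=None)
        sq_err[i] = np.sum((th - theta) ** 2)
    return 100.0 * sq_err.mean()

print(mse_olse(100), mse_olse(200))
```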
We can see that our proposed estimator performs better than the OLSE for the choices $\tau_A$ and $\tau_B$ in all cases when $r$ is at least $3$, while for the choices $\tau_C$ and $\tau_D$ it performs worse. Also, the performance of our estimator is influenced by the choice of $r$, although the choice $r = 4$ performs quite well in all cases. There seems to be little difference between the two choices of basis.

Comments and Remarks
We begin with some comments on our assumptions and then discuss their implications.
Remark 1. For $\alpha = 1$, the requirements on $r_n$ are equivalent to $r_n^5 \log n = o(n)$, while for $\alpha < 1$, they are equivalent to $r_n^5 = o(n^\alpha)$. For example, if $\alpha$ equals $1/2$, then we need $r_n^{10} = o(n)$.
The next two remarks address assumption (A1).
Remark 2. Let $F$ denote the distribution function of $\theta^\top Z$. We shall see that (A1) holds if $F$ is Hölder. Indeed, for random variables $S$ and $T$ and every positive $B$ we have
\[ P(S + T \le y) \le P(S \le y + B) + P(|T| \ge B) \]
and
\[ P(S + T \le y) \ge P(S \le y - B) - P(|T| \ge B). \]
Applying this with $S = \theta^\top Z$ and $T = (\vartheta - \theta)^\top Z$, we obtain the desired inequality. Now assume that $F$ is Hölder with exponent $\kappa$, $0 < \kappa \le 1$. Since $F$ is bounded, it is also Hölder for any exponent in the interval $(0, \kappa)$. Using the Hölder property of $F$, we derive the desired results from the above inequality.
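The following display spells out how the sandwich bound is typically combined with Markov's inequality; the precise constants in the original argument may differ.

```latex
% Markov's inequality controls the perturbation term: with
% T = (\vartheta - \theta)^\top Z and B > 0,
\[
  P(|T| \ge B) \;\le\; \frac{E\,|(\vartheta - \theta)^\top Z|}{B}
  \;\le\; \frac{\|\vartheta - \theta\|\, E\|Z\|}{B}.
\]
% If F is H\"older with exponent \kappa, the sandwich bound then gives an
% error of order B^\kappa + \|\vartheta - \theta\|/B; choosing
% B = \|\vartheta - \theta\|^{1/(1+\kappa)} balances both terms and yields
% a bound of order \|\vartheta - \theta\|^{\kappa/(1+\kappa)}.
```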
Remark 3. Assume that $Z$ has an elliptically contoured distribution conditionally given $\delta = 1$. This means that given $\delta = 1$, the random vector $Z$ has the same distribution as $\mu + \Sigma \eta$, where $\mu$ is a vector in $\mathbb{R}^p$, $\Sigma$ is a positive definite $p \times p$ matrix, and $\eta$ is a spherically symmetric random vector, i.e., $Q\eta$ has the same distribution as $\eta$ for each orthogonal $p \times p$ matrix $Q$. Then the random variables $\{u^\top \eta : \|u\| = 1\}$ have the same distribution function $F$. Thus we derive that the conditional distribution function of $\vartheta^\top Z$ given $\delta = 1$ equals
\[ t \mapsto F\Big(\frac{t - \vartheta^\top \mu}{\|\Sigma \vartheta\|}\Big) \]
for every non-zero vector $\vartheta$ in $\mathbb{R}^p$. It is now easy to see that (A1) holds with $\alpha = 1$ if $F$ has a density $f$ satisfying $\sup_x (1 + |x|) f(x) < \infty$ and if $\theta$ is not the zero vector. This shows that in the setting of our simulations (A1) holds with $\alpha = 1$.
Remark 4. Let us briefly mention an example where (A2) fails to hold. Take $\delta$ identically equal to $1$ and $Z = X = (X_1, X_2)^\top$, where $X$ is a bivariate standard normal random vector. Then, for $\theta = (1, 1)^\top$, the conditional distribution of $X_1$ given $\theta^\top X = X_1 + X_2$ is the same as that of $X_2$ given $X_1 + X_2$, and
\[ E[X_1 \mid X_1 + X_2] = E[X_2 \mid X_1 + X_2] = \frac{X_1 + X_2}{2}, \]
so that $X - E[X \mid \theta^\top X]$ is concentrated on the one-dimensional subspace spanned by $(1, -1)^\top$.

Let us now discuss implications of our assumptions and in the process derive important quantities and results needed in the proof of the theorem. For $\vartheta \in \mathbb{R}^p$, we let $G_\vartheta$ denote the conditional distribution function of $\vartheta^\top Z$ given $\delta = 1$, i.e.,
\[ G_\vartheta(t) = P(\vartheta^\top Z \le t \mid \delta = 1), \qquad t \in \mathbb{R}, \]
and set $R = G_\theta(\theta^\top Z)$, and obtain from the third part of (C1) a corresponding inequality. In view of the identity $A_n = E[\delta v_{r_n}(R) W]$, it follows from (A2) and the properties of $V_n$ that $A_n$ is eventually of full rank $p$. Finally, using condition (C2), one verifies the required approximation as in Schick (2013).
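A quick numerical check of Remark 4, under our reading that the matrix in (A2) involves the conditionally centered vector $Z - E[Z \mid \theta^\top Z]$ (an assumption on our part): with $X$ bivariate standard normal and $\theta = (1, 1)^\top$, this vector lies on a line, so the corresponding second-moment matrix is singular.

```python
import numpy as np

rng = np.random.default_rng(1)
n, theta = 200_000, np.array([1.0, 1.0])

# X bivariate standard normal, delta = 1.  With s = X1 + X2 we have
# E[X | s] = (s/2, s/2), so X - E[X | s] = ((X1 - X2)/2, (X2 - X1)/2)
# lies on the line spanned by (1, -1).
X = rng.standard_normal((n, 2))
s = X @ theta
resid = X - np.column_stack((s / 2, s / 2))
W = resid.T @ resid / n
print(W)                        # approx [[ 0.5, -0.5], [-0.5,  0.5]]
print(np.linalg.eigvalsh(W))    # smallest eigenvalue approx 0
```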


Table 2.1: Simulated Mean Square Errors with Trigonometric Basis. Each entry is 100 times the simulated mean square error of the corresponding estimator, for two sample sizes and two choices of τ.

Table 2.2: Simulated Mean Square Errors with Spline Basis.