Wavelet Estimation: Minimax Theory and Application

For now wavelet estimation is an important part of the nonparametric curve estimation. The minimax theory is well developed and there are many interesting practical applications where wavelets perfectly fit purposes of the data analysis. This paper presents a new theoretical result and a new application. It is well known in the statistical wavelet theory that for a Besov space of order α the minimax rate of the mean integrated squared error (MISE) convergence is n−2α/(2α+1). The developed in the paper local minimax approach shows that this rate is too slow and that any Besov’s function of order α can be estimated with a faster rate. Furthermore, an adaptive estimator is proposed that is minimax under the classical and new minimax approaches. Then an interesting and challenging application of the adaptive wavelet estimator in the fMRI study of neuroplasticity is discussed.


Introduction
Consider a classical nonparametric regression problem where based on n observations the problem is to estimate a square-integrable on [0, 1] function f (t).Here l are independent standard normal random variables, σ is an unknown positive constant which may be referred as the scale (σ 2 is also called variance or volatility depending on a particular application), and the quality of an estimate f of f is measured by the mean integrated squared error (MISE) defined as We make a traditional in the wavelet literature assumption that the estimated function (signal) f belongs to a Besov space B α pqQ of order α which will be defined shortly.Note that the familiar Hölder space of order α corresponds to B α ∞∞Q .The book by Meyer (1992) is a good reference on Besov spaces.Suppose that an (inhomogeneous) wavelet basis on [0, 1] is given, {φ j 0 k (x) = 2 j 0 /2 φ(2 j 0 x−k), k = 0, 1, . . ., 2 j 0 −1} and {ψ jk (x) = 2 j/2 ψ(2 j x−k), k = 0, 1, . . ., 2 j − 1, j ≥ j 0 }.Here φ and ψ are the scaling (father) function and mother wavelet, respectively.Then, for any positive integer j 0 the estimated function has a formal orthogonal expansion (called the wavelet multiresolution expansion) where wavelet coefficients are defined by κ j 0 k = 1 0 f (t)φ jk (t)dt and θ jk = 1 0 f (t)ψ jk (t)dt.
Customary estimates of the wavelet coefficients, recommended in the literature, are empirical wavelet coefficients κjk = n −1 n l=1 Y l φ j 0 k (l/n) and θjk = n −1 n l=1 Y l ψ jk (l/n), which are in practice are calculated using a cascade algorithm supported by all wavelet software packages (see a discussion in Efromovich 1999, Vidakovic 1999and Nason 2008).Now we are in a position to remember the important characterization of a Besov function space B α pqQ of order α via wavelet coefficients.Namely, if the following classical restrictions, that the wavelet regularity is larger than α, p ≥ 1, q ≥ 1, and α + 1/2 − 1/p > 0, hold, then ( It is known that the global minimax (over a Besov class of order α) rate of the MISE convergence is n −2α/(2α+1) , that is where the infimum is taken over all possible estimators f based on n observations Y n := (Y 1 , . . ., Y n ) defined in (1.1) and parameters of the Besov space, and C * is a positive constant.Furthermore, there exists a data-driven estimator, based solely on observations Y n , whose MISE attains the lower bound (1.5) up to a constant factor.Lower bound (1.5) is the foundation of the modern wavelet literature and it is the main aim of all wavelet estimators to attain this rate of the MISE convergence; see a discussion in Efromovich (1999) and Nason (2008).So far no methodology challenging this minimax approach has been proposed.
In this paper we explore the following natural question.In a practical application there is always a single underlying regression function.It is well known that there is no feasible lower bound for the MISE when a single function is estimated because a "lucky" estimate, which is equal to the underlying function, has zero MISE.The latter is the reason why the global minimax over a Besov space is considered.Nonetheless, can we consider a subfamily of Besov functions of order α that are close in L ∞ -norm to a single pivotal function and show that the local minimax over this subfamily may increase the rate of the MISE convergence?In other words, can a local minimax methodology challenge the classical rate n −2α/(2α+1) ?The answer is "yes" and this issue constitutes the main theoretical part of the paper presented in Section 2. Section 3 is devoted to an interesting new application of the developed wavelet estimator in the study of neuroplasticity based on fMRI data.Proofs are deferred to Section 4.

Local Minimax Theory
We begin with definition of the proposed local Besov space of order α with the pivot f 0 , θ jk ψ jk (t); (2.1) f ∈ B α pqQ 0 ; (2.6) Let us comment on the proposed local Besov class.Lines (2.1)-(2.3)indicate that we consider functions that on the coarsest scales (up to the scale (2α + 1) −1 log 2 (λ n n)) are equal to the pivot f 0 .Line (2.4) implies a natural restriction on the pivot which should belong to the Besov space B α pqQ 0 .Please note that Q 0 is the exact parameter of this Besov space corresponding to the pivot f 0 .Line (2.5) is another restriction on the pivot, and it states that the pivot cannot be a parametric function which is defined by a finite number of wavelet coefficients.Line (2.6) is the main restriction on the considered Besov's functions.It states that all functions from the local class must belong to the same Besov space as the pivot does.This is a natural restriction.Finally, line (2.7) defines the boundary between the coarsest scales, where considered Besov's functions are identical to the pivot, and the finest scales where we have a class of possible functions whose wavelet coefficients must satisfy the restriction (2.6).Now let us explain why the class B α pq (f 0 ) can be called local.It follows from (2.5) and (2.7) that where the infimum is taken over all estimates f (t, Y n , f 0 , α, p, q) based on data Y n and parameters f 0 , α, p, q of the local Besov space, and λ n is defined in (2.7).
What is the practical impact of the above-formulated lower bound?Because for each pivot f 0 the sequence λ n vanishes as n → ∞, we have a lower bound which indicates the possibility for a faster MISE decrease than the classical minimax lower bound (1.5) indicates.As a result, if we are able to propose an estimator that attains the lower bound (2.9), then this new lower bound dominates the classical (1.5).
Let us consider a wavelet estimator which belongs to a class of blockwiseshrinkage estimators discussed in Efromovich (1999Efromovich ( , 2000a)).Set k∈T jsn θjk ψ jk (t). (2.10) Here κj 0 k and θjk are empirical wavelet coefficients defined below line (1.2) in the Introduction, s}, L jn is the length of blocks on jth resolution scale, J * = log 2 n and x is the rounded down x, and (2.11) The estimate of σ 2 is the robust estimate based on the sample median for wavelet coefficients on the finest scale; see its discussion in Efromovich (1999).
There is a number of appropriate lengths L jk of blocks T jsn and threshold levels ν jk that may be used to attain the lower bound (2.9).For instance, we can use classical increasing blocks of the length L jn = b n 2 (j−j 0 )/3 and ν jn = 1, here b n is the largest dyadic (i.e., 2 k , k = 1, 2, . . . where λ n is defined in (2.7);For a Besov space We conclude that the lower bound (2.9) is bona fide and it is attainable by a data-driven estimator.This result supports our above-made conjecture that the classical minimax rate n −2α/(2α+1) of the MISE convergence is too slow due to the global nature of the minimax, and we can replace this rate by a faster one if the local minimax is considered.Theorem 2.2 also indicates that the same wavelet estimator can be both locally and globally minimax.
The obtained result resembles the phenomenon of superefficiency discussed in Efromovich (1999).Superefficiency is the property of an estimator to exhibit faster rates of convergence than the global minimax indicates.As a result, superefficiency is a curvewise property which is not completely surprising because the global minimax cares only about the most difficult regression functions for estimation, and those functions change with n.The proposed local minimax methodology is innovative and different from the superefficiency which just states the fact that the rate may be faster.Indeed, the local minimax explains that a Besov's function of order α is estimated with the rate λ 1/(2α+1) n n −2α/(2α+1) where λ n = o n (1) and this sequence is explicitly defined in (2.7).Researchers at the Southwestern Medical School explored the possibility of using alterations in resting brain activity for finding potential markers for brain plasticity; details of the study can be found Tung et al. ( 2013) and here we just briefly describe them.Using 3T MR scanner, blood-oxygenation-leveldependent (BOLD) signal fMRI data were obtained for healthy right-handed adults.The motor cortex model studies were based on a five minute pre-training session, followed by a 23 minute training session, and finished with a 5 minute post-training run.During the training session, a subject was shown a white cross-hair and was instructed to press a button three times with the thumb when a change in color was detected.The color change occurred randomly every 27-32 seconds, end there were a total of 40 trials during the motor task period.The data from the experiment raw data consists of 300 time points for each of pre and post training sessions, and four sessions of 340 time points during the training part; the data was measured every second.All data was transformed into Talairach coordinates using AFNI program and standardized mask ROI applied to all subjects to select the motor cortex voxels' time series.This preprocessed data was used in our analysis.

Application: Neuroplasticity
The aim of the experiments is to determine the changes in resting state motor cortex network following the button clicking training.The increased correlation between the right and left motor cortices in post-training resting state compared to pre-training resting state is a popular indicator of training efficiency.A popular approach in fMRI data statistical analysis is to average the voxels' time series data over all the ROI for each hemisphere and work with a single averaged time course signal.Following this method, for each subject the data Clearly, the averaging over all voxels in ROI smooths the time series signal and avoids a problem of multiple comparisons adjustment.On the other hand, in each ROI slice there are 80-130 voxels in the left and right motor cortex areas and there are 11 slices where ROI is clearly visible on fMRI scan.Can this information be used to shed a new light on the neural plasticity?
The proposed solution, based on the developed wavelet estimator, is as follows.The correlation between two functions f 1 (t) and f 2 (t), observed in corresponding regressions (1.1), can be calculated as the correlation between the proposed wavelet estimates f1 (t) and f2 (t).This step takes care about extremely large noise in fMRI data.As a result, because the large noise is no longer an issue, the averaging is not needed and we may treat data on a voxel level.Furthermore, using an appropriate thresholding for calculated correlations allows us to adjust for multiple comparisons because, as it has been explained earlier, we are dealing with the case when a number of comparisons (pairs of voxels) is much larger than the number of observations.Let us present a particular outcome of the study.For 11 fMRI scan slices, where motor cortex area was visible, the correlation matrix for all voxel pairs in right and left motor cortices is calculated.Then for each slice the number of activated pairs (defined as pairs with correlation exceeding a threshold level) is calculated.Results for a particular subject are presented in Table 3.1, and note that a similar outcome has been observed for other subjects.The results clearly indicate that in each slice there are more activated pairs of voxels in the post-training session rather than in the pre-training session.We may conclude that the simple clicking exercise does change the brain network.Furthermore, in addition to the conclusion of Tung et al. (2013), that training of the brain increases the correlation between averaged activities in the right and left hemispheres measured by their BOLD signals, we can also conclude that the training dramatically increases the number of voxel pairs with significant correlation.In other words, the training increases not only correlation in activities of the brain network connecting the right and left hemispheres, but the training also increases the number of neurons participating in the brain network.The latter may be of a special interest for finding a cure for the abovementioned brain diseases because our analysis indicates that new neurons may be trained to substitute disease-affected areas of the brain.

Proofs
Proof of Theorem 2.1.Denote by K := K(n, f 0 , α, p, q) the smallest dyadic integer larger than [λ n n] 1/(2α+1) , and also set J := log 2 (K).The Bessel inequality allows us to write for any estimate f (t) considered in Theorem 2.1, where θJk and θ Jk are wavelet coefficients of f and f , respectively.
Inequality (4.1) explains the idea of establishing the lower bound (2.9).Namely, we are replacing a local Besov space B α pq (f 0 ) by its K-dimensional parametric subspace 3) is used.
Our next step is a classical one when the minimax MISE over B K is bounded from below by a Bayes MISE with a corresponding distribution of parameters {θ Jk , k = 0, 1, . . ., K − 1}.To do this, we introduce K independent standard normal random variables ζ 0 , . . ., ζ K−1 .We begin with proving a pivotal result which states that for some positive constant a the probability Note that (4.3) is equivalent to Using definition of K and (2.8) we get Set where γ ∈ (0, 1/4).Using (4.5) and Chebyshev's inequality we get for all sufficiently large n, Inequality (4.6) verifies (4.4) which in its turn verifies (4.3).Now we can define new random variables Note that, according to (4.2) and (4.3), a stochastic function (process) f * (t), defined as f (t) in (4.2) only with θ Jk being replaced by Z k , belongs to B K with the probability tending to 1 as n increases.
We made all preliminary steps that allow us to use results of Efromovich (2000b) and conclude that Here Cs are generic positive constants.Inequality (4.7), together with the relation B K ⊂ B α pq (f 0 ), prove Theorem 2.1.
Proof of Theorem 2.2.In what follows Cs denote generic positive constants.We begin with a projection pseudo-estimate fP (t) = where J is the smallest integer such that 2 J > [λ n n] 1/(2α+1) .Note that the projection estimate uses the same empirical wavelet coefficients as the blockwiseshrinkage estimate.
Further, according to Efromovich (1999Efromovich ( ,2000a)), we have the following inequality for the mean squared errors of the estimates, Using the Parseval identity we can get a classical decomposition of the MISE into variance and integrated squared bias terms, which together with (4.9), used to evaluate the variance part, yields Note that This inequality yields that the variance part of the MISE of the projection estimate has the order indicated by the lower bound (2.9).Let us explore the integrated squared bias term in (4.10).The following inequality holds, To check (4.12) we note that it is trivial for p = 2, and for p > 2 we can write using the Hölder inequality, which yields (4.12).
Also using (2.7), for j > J and f ∈ B α pq (f 0 ) we get the following inequality Then (4.12) and (4.13) yield j>J Using this inequality together with (4.11) in (4.10) establishes the important result about local minimax of the projection estimate, namely Now we are in a position to present a sequence of steps following Efromovich (1986Efromovich ( ,1999)).The first step is to introduce an oracle-estimate f * (t) := where J * is defined above line (2.11) and µ * jk are the individual shrinkage oracle-coefficients µ * jk = θ 2 jk /(θ 2 jk + σ 2 n −1 ).Note that these are oraclecoefficients because they depend on wavelet coefficients of an estimated regression function and the variance in the regression model (1.1).It is shown in Efromovich (1999Efromovich ( ,2000a) that this oracle-estimator dominates the projection estimator and then

.16)
The oracle-estimate (4.15) is the most accurate one known in the literature, but a data-driven estimate cannot match its performance.This is why in Efromovich (1986) for the case of a trigonometric basis, and then in Efromovich (2000a) for a wavelet basis, a blockwise-shrinkage oracle estimate has been proposed, It is easy to see that the proposed data-driven estimate (2.10) is motivated (it is actually the plug-in estimate) by the blockwise-shrinkage oracle (4.17).While this blockwise-shrinkage oracle is dominated by the oracle (4.15), it is well known and this is verified by a direct calculation that this oracle is also rate minimax, namely  (2.12).The proof of (2.13) is identical only in (4.16) and (4.18) the supremum over f ∈ B α pqQ yields Cn −2α/(2α+1) .Theorem 2.2 is proved.

Conclusion
This is a tradition in the wavelet literature to measure quality of a wavelet estimate via comparison of the MISE convergence with the rate n −2α/(2α+1) .This rate is the lower bound in a global minimax where the supremum is considered over all functions from a Besov space of order α.On the other hand, in all practical situation we are dealing with a particular underlying function.It is desirable to measure quality of estimation for a particular function at hand, but we do know that such a program cannot be fulfilled because there is no feasible lower bound for a risk calculated at a particular function.To bridge these two approaches and to get a sharper lower bound, it is proposed to consider a local minimax approach where all considered functions belong to the same Besov space as the pivot does and all these function converge to the pivot (in some norm) as the sample size increases.An appropriate local Besov space is proposed, and its feasibility is supported by: A lower bound that indicates the faster rate of the MISE convergence than the traditional n −2α/(2α+1) ; A datadriven estimator whose MISE attains the faster rate of the MISE convergence indicated by the lower bound.Furthermore, the proposed estimator adapts to all parameters of the underlying local Besov space.Oracle-inequalities are also established.The developed estimator is tested on fMRI data exploring the phenomenon of neuroplasticity.Future research will be devoted to considering more general regression models, including heteroscedasticity, dependent errors, and missing observations.
Neuroplasticity is the lifelong ability of the brain to reorganize neural pathways based on new experiences and learning.Functional MRI is widely used to investigate brain networks and develop possible treatment for patients with stroke, Alzheimer disease, traumatic brain injury, multiple sclerosis, and schizophrenia, see a discussion in Dimyan and Cohen (2011), Greicius et al. (2004) and Tung et al. (2013).
would consist of the right and left motor cortex hemisphere time series during pre-training and post-training resting state.The correlation is computed using Pearson cross correlation for both pre-training and post-training averaged signals.To test for statistical significance, two-tail paired Student's t-test is used for comparisons of pre-training and post-training correlation indices.Results, presented in Tung et al. (2013), indicate that there is a statistically significant increase in the correlation during the post-training resting state.
number which is at most log 2 (n).Another possibility is to use "constant" blocks and thresholds, for instance L jn = b n and ν jk = 4.The following result holds for these two particular examples of the blockwise-shrinkage estimator.