Continuity Correction
Rabi Bhattacharya, The University of Arizona, USA
According to the central limit theorem (CLT), the distribution function $F_n$ of a normalized sum $n^{-1/2}(X_1 + \cdots + X_n)$ of $n$ independent random variables $X_1, \ldots, X_n$, having a common distribution with mean zero and variance $\sigma^2 > 0$, converges to the distribution function $\Phi_\sigma$ of the normal distribution with mean zero and variance $\sigma^2$, as $n \to \infty$. We will write $\Phi$ for $\Phi_\sigma$ in the case $\sigma = 1$. The densities of $\Phi_\sigma$ and $\Phi$ are denoted by $\phi_\sigma$ and $\phi$, respectively. When the $X_j$ are discrete, $F_n$ has jumps and the normal approximation is not very good unless $n$ is sufficiently large. This problem most commonly occurs in statistical tests and estimation involving the normal approximation to the binomial and, in its multi-dimensional version, in Pearson's frequency chisquare tests, or in tests for association in categorical data. Applying the CLT to a binomial random variable $T$ with distribution $B(n, p)$, with mean $np$ and variance $npq$ ($q = 1 - p$), the normal approximation is given, for integers $0 \le a \le b \le n$, by
$\displaystyle P(a \le T \le b) \approx \Phi\left(\frac{b - np}{\sqrt{npq}}\right) - \Phi\left(\frac{a - np}{\sqrt{npq}}\right).$
(1)
Here $\approx$ indicates that the difference between its two sides goes to zero as $n \to \infty$. In particular, when $a = b$, the binomial probability $P(T = b)$ is approximated by zero. This error is substantial if $n$ is not very large. One way to improve the approximation is to think graphically of each integer value $b$ of $T$ being uniformly spread over the interval $[b - 1/2, b + 1/2]$. This is the so-called histogram approximation, and leads to the continuity correction given by replacing $\{a \le T \le b\}$ by $\{a - 1/2 \le T \le b + 1/2\}$:
$\displaystyle P(a \le T \le b) \approx \Phi\left(\frac{b + 1/2 - np}{\sqrt{npq}}\right) - \Phi\left(\frac{a - 1/2 - np}{\sqrt{npq}}\right).$
(2)
To give an idea of the improvement due to this correction, let . Then , whereas the approximation (1) gives a probability , and the continuity correction (2) yields . Analogous continuity corrections apply to the Poisson distribution with a large mean.
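The comparison between the exact binomial probability and the approximations (1) and (2) can be checked numerically. The following is a minimal sketch, not a definitive implementation: the values n = 20, p = 0.4 and b = 7 are illustrative assumptions chosen here, the helper names are hypothetical, and the one-sided probability P(T <= b) is approximated by the upper-limit term of (1) and of (2) alone.

```python
from math import comb, erf, sqrt

def norm_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def binom_cdf(b, n, p):
    """Exact P(T <= b) for T with distribution B(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(b + 1))

# Illustrative values (assumptions made for this sketch).
n, p, b = 20, 0.4, 7
mean = n * p
sd = sqrt(n * p * (1 - p))

exact = binom_cdf(b, n, p)
plain = norm_cdf((b - mean) / sd)            # upper-limit term of (1)
corrected = norm_cdf((b + 0.5 - mean) / sd)  # upper-limit term of (2)

print(f"exact P(T <= {b})        = {exact:.4f}")
print(f"normal approximation     = {plain:.4f}")
print(f"with continuity correction = {corrected:.4f}")
```

For these values the corrected approximation lies much closer to the exact probability than the uncorrected one.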
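For a precise mathematical justification of the continuity correction consider, in general, i.i.d. integer-valued random variables $X_1, \ldots, X_n$, with lattice span 1, mean $\mu$, variance $\sigma^2$, and finite moments of order at least four. The distribution function $F_n$ of $n^{-1/2}(X_1 + \cdots + X_n - n\mu)$ may then be approximated by the Edgeworth expansion (see Bhattacharya and Ranga Rao (1976), p. 239, or Gnedenko and Kolmogorov (1954), p. 213)
$\displaystyle F_n(x) = \Phi_\sigma(x) - n^{-1/2}\, \frac{\mu_3}{6\sigma^2} \left(\frac{x^2}{\sigma^2} - 1\right) \phi_\sigma(x) - n^{-1/2}\, S_1(n\mu + n^{1/2}x)\, \phi_\sigma(x) + O(n^{-1}),$
(3)
where $\mu_3 = E(X_1 - \mu)^3$, and $S_1(y)$ is the right continuous periodic function $y - 1/2$ (mod 1), which vanishes when $y = 1/2$. Thus, when $a$ is an integer and $x = (a - n\mu)/\sqrt{n}$, replacing $a$ by $a + 1/2$ (or $a - 1/2$) on the right side of (3) gets rid of the discontinuous term involving $S_1$.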
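The effect of the half-integer shift can be illustrated numerically for the simplest lattice case. The sketch below is an illustration under stated assumptions, not a proof: it takes Bernoulli summands with p = 0.4 (so the sum is binomial), evaluates the distribution function at every integer a, and records the largest error of the normal approximation with and without replacing a by a + 1/2. The function name `max_errors` and the sample sizes are hypothetical choices made here.

```python
from math import comb, erf, sqrt

def norm_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def max_errors(n, p):
    """Largest error, over integers a, of the plain and the half-shifted
    normal approximations to P(T <= a) for T with distribution B(n, p)."""
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    cdf = 0.0
    worst_plain = 0.0
    worst_shifted = 0.0
    for a in range(n + 1):
        cdf += comb(n, a) * p**a * (1 - p)**(n - a)
        worst_plain = max(worst_plain, abs(cdf - norm_cdf((a - mean) / sd)))
        worst_shifted = max(worst_shifted,
                            abs(cdf - norm_cdf((a + 0.5 - mean) / sd)))
    return worst_plain, worst_shifted

# Assumed sample sizes, chosen only to show the trend.
results = {n: max_errors(n, 0.4) for n in (25, 100, 400)}
for n, (plain, shifted) in results.items():
    print(f"n = {n:3d}: plain {plain:.4f}, shifted {shifted:.4f}")
```

The shift removes the error caused by the jumps. What remains comes from the smooth terms of the expansion, such as the skewness term, which the continuity correction does not address.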
Consider next the continuity correction for the (Mann-Whitney-)Wilcoxon two sample test. Here one wants to test nonparametrically if one distribution $G$ is stochastically larger than another distribution $F$, with distribution functions $G(\cdot)$, $F(\cdot)$. Then the null hypothesis is $H_0: F(x) = G(x)$ for all $x$, and the alternative is $H_1: G(x) \le F(x)$ for all $x$, with strict inequality for some $x$. The test is based on independent random samples $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ from the two unknown continuous distributions $F$ and $G$, respectively. The test statistic is $W_s$ = the sum of the ranks of the $Y_j$ in the combined sample of $X_i$'s and $Y_j$'s. The test rejects $H_0$ if $W_s \ge c$, where $c$ is chosen such that the probability of rejection under $H_0$ is a given level $\alpha$. It is known (see Lehmann (1975), pp. 5-18) that, under $H_0$, $W_s$ is asymptotically normal and $E(W_s) = n(m+n+1)/2$, $\mathrm{Var}(W_s) = mn(m+n+1)/12$. Since $W_s$ is integer-valued, the continuity correction yields
$\displaystyle P(W_s \ge c \mid H_0) \approx 1 - \Phi(z),$
(4)
where $z = \left(c - 1/2 - n(m+n+1)/2\right)/\sqrt{mn(m+n+1)/12}$.
As an example, let , , . Then , and its normal approximation is . The continuity correction yields the better approximation .
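A comparison of this kind can be reproduced by enumerating the null distribution of the rank sum directly. The sketch below is a minimal illustration: the sample sizes m = 4 and n = 6 and the critical value c = 40 are assumptions chosen here, not values from the text. Under the null hypothesis every set of n ranks out of m + n is equally likely for the second sample, so the exact tail probability is a simple count.

```python
from itertools import combinations
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Hypothetical sample sizes and critical value (assumptions for this sketch).
m, n, c = 4, 6, 40
N = m + n

# Under H0, each choice of n ranks from 1..N is equally likely.
rank_sums = [sum(ranks) for ranks in combinations(range(1, N + 1), n)]
exact = sum(s >= c for s in rank_sums) / len(rank_sums)

mean = n * (N + 1) / 2
sd = sqrt(m * n * (N + 1) / 12)

plain = 1 - norm_cdf((c - mean) / sd)            # no correction
corrected = 1 - norm_cdf((c - 0.5 - mean) / sd)  # continuity correction (4)

print(f"exact     P(W >= {c}) = {exact:.4f}")
print(f"normal approximation = {plain:.4f}")
print(f"continuity corrected = {corrected:.4f}")
```

Full enumeration is practical only for small samples, since the number of rank subsets grows combinatorially. That is the setting in which the corrected normal approximation is most useful.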
The continuity correction is also often used in contingency tables for testing for association between two categories. It is simplest to think of this as a two-sample problem for comparing two proportions $p_1, p_2$ of individuals with a certain characteristic (e.g., smokers) in two populations (e.g., men and women), based on two independent random samples of sizes $n_1, n_2$ from the two populations, with $n = n_1 + n_2$. Let $r_1, r_2$ be the numbers in the samples possessing the characteristic. Suppose first that we wish to test $H_0: p_1 = p_2$ against $H_1: p_1 < p_2$. Consider the test which rejects $H_0$, in favor of $H_1$, if $r_2 \ge c(r)$, where $r = r_1 + r_2$, and $c(r)$ is chosen so that the conditional probability (under $H_0$) of $r_2 \ge c(r)$, given $r_1 + r_2 = r$, is $\alpha$. This is the uniformly most powerful unbiased (UMPU) test of its size (see Lehmann (1959), pp. 140-146, or Kendall and Stuart (1973), pp. 570-576). The conditional distribution of $r_2$, given $r_1 + r_2 = r$, is hypergeometric under $H_0$, and the test using it is called Fisher's exact test. On the other hand, if $n_1$ and $n_2$ are sufficiently large, the normal approximation is generally used to reject $H_0$. Note that the (conditional) expectation and variance of $r_2$ are $n_2 r/n$ and $n_1 n_2 r(n-r)/[n^2(n-1)]$, respectively (see Lehmann (1975), p. 216). The normalized statistic is then
$\displaystyle t = [r_2 - n_2r/n]/ \sqrt{ n_1n_2r(n-r)/[n^2(n-1)]},$
(5)
and $H_0$ is rejected when $t$ exceeds $z_{1-\alpha}$, the $(1-\alpha)$th quantile of $\Phi$. For the continuity correction, one subtracts $1/2$ from the numerator in (5), and rejects $H_0$ if this adjusted $t$ exceeds $z_{1-\alpha}$. Against the two-sided alternative $H_1: p_1 \ne p_2$, Fisher's UMPU test rejects $H_0$ if $r_2$ is either too large or too small. The corresponding continuity corrected test rejects $H_0$ if either the adjusted $t$, obtained by subtracting $1/2$ from the numerator in (5), exceeds $z_{1-\alpha/2}$, or if the adjusted $t$, obtained by adding $1/2$ to the numerator in (5), is smaller than $-z_{1-\alpha/2}$. This may be compactly expressed as
Reject $H_0$ if
$\displaystyle \frac{\left(|r_2 - n_2r/n| - 1/2\right)^2}{n_1n_2r(n-r)/[n^2(n-1)]} > \chi^2_{1-\alpha}(1),$
(6)
where $\chi^2_{1-\alpha}(1)$ is the $(1-\alpha)$th quantile of the chisquare distribution with 1 degree of freedom. This two-sided continuity correction was originally proposed by F. Yates in 1934, and it is known as Yates' correction. For numerical improvements due to the continuity corrections above, we refer to Kendall and Stuart (1973), pp. 575-576, and Lehmann (1975), pp. 215-217. For a critique, see Conover (1974). If the sampling of $n$ units is done at random from a population with two categories (men and women), then the UMPU test is still the same as Fisher's test above, conditioned on fixed marginals $n_1$ (and, therefore, $n_2$) and $r$.
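The one-sided test based on (5), its continuity-corrected version, and the statistic in (6) can be compared with Fisher's exact test on a small table. The sketch below is a minimal illustration: the counts are hypothetical data invented for this example, and the exact conditional p-value is computed from the hypergeometric probabilities of $r_2$ given the margins. It is not a substitute for a vetted statistical library.

```python
from math import comb, erf, sqrt

def norm_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Hypothetical 2x2 data (assumptions for this sketch): sample sizes and
# numbers possessing the characteristic in each sample.
n1, r1 = 20, 5
n2, r2 = 20, 12
n, r = n1 + n2, r1 + r2

def cond_pmf(k):
    """P(r2 = k | r1 + r2 = r) under H0 (hypergeometric)."""
    return comb(n2, k) * comb(n1, r - k) / comb(n, r)

# Fisher's exact one-sided p-value: P(r2 >= observed | r).
fisher_p = sum(cond_pmf(k) for k in range(r2, min(n2, r) + 1))

mean = n2 * r / n
var = n1 * n2 * r * (n - r) / (n**2 * (n - 1))

t_plain = (r2 - mean) / sqrt(var)        # statistic (5)
t_corr = (r2 - 0.5 - mean) / sqrt(var)   # 1/2 subtracted from the numerator
p_plain = 1 - norm_cdf(t_plain)
p_corr = 1 - norm_cdf(t_corr)

# Left side of (6), to be compared with a chisquare quantile (1 d.f.).
yates = (abs(r2 - mean) - 0.5) ** 2 / var

print(f"Fisher exact p      = {fisher_p:.4f}")
print(f"normal p, plain     = {p_plain:.4f}")
print(f"normal p, corrected = {p_corr:.4f}")
print(f"Yates statistic     = {yates:.3f}")
```

When $r_2$ exceeds its conditional mean, the statistic in (6) is the square of the corrected one-sided statistic. This is why the two-sided rule can be stated with a single chisquare quantile.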
Finally, extensive numerical computations in Bhattacharya and Chan (1996) show that the chisquare approximation to the distribution of Pearson's frequency chisquare statistic is reasonably good for degrees of freedom 2 and 3, even in cases of small sample sizes, extreme asymmetry, and values of expected cell frequencies much smaller than 5. One theoretical justification for this may be found in the classic work of Esseen (1945), which shows that the error of chisquare approximation is $O(n^{-d/(d+1)})$ for degrees of freedom $d$.
REFERENCES
Bhattacharya, R.N. and Chan, N.H. (1996). Comparisons of chisquare, Edgeworth expansions and bootstrap approximations to the distribution of the frequency chisquare. Sankhya, Ser. A, 58, 57-68.
Bhattacharya, R.N. and Ranga Rao, R. (1976). Normal Approximation and Asymptotic Expansions. Wiley, New York.
Conover, W.J. (1974). Some reasons for not using Yates' continuity correction on contingency tables. J. Amer. Statist. Assoc. 69, 374-376.
Esseen, C.G. (1945). Fourier analysis of distribution functions: A mathematical study of the Laplace-Gaussian law. Acta Math. 77, 1-125.
Gnedenko, B.V. and Kolmogorov, A.N. (1954). Limit Distributions of Sums of Independent Random Variables. English translation by K.L. Chung. Addison-Wesley, Reading, Massachusetts.
Kendall, M.G. and Stuart, A. (1973). The Advanced Theory of Statistics. Vol. 2, 3rd edition. Griffin, London.
Lehmann, E.L. (1959). Testing Statistical Hypotheses. Wiley, New York.
Lehmann, E.L. (1975). Nonparametrics: Statistical Methods Based on Ranks. (With the special assistance of D'Abrera, H.J.M.) Holden-Day, Oakland, California.
Footnotes
- Reprinted with permission from Lovric, Miodrag (2011), International Encyclopedia of Statistical Science. Heidelberg: Springer Science+Business Media, LLC