Friday, February 16, 2018

Camel

Yesterday, I claimed that the shape of the sample variance distribution was "wrong". A reasonable question to ask then is, "how so?" Thanks to my handy-dandy simulator, getting an answer to that is relatively easy.

First, we'll state the theoretical "right" shape. If the blocksums were independent and identically distributed according to a normal distribution (none of those things are true for any sample, but let's move on), the normed sample variance would be chi-squared. (By normed, I mean that we're looking at the sample variance as a proportion of the population variance and adjusting for the sample size.)
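Written out, with r for the number of blocks in a sample, s^2 for the sample variance of the blocksums x_1, ..., x_r, and \sigma^2 for the population variance (these symbols are labels chosen for this sketch, not notation from the simulator), the standard result is:

```latex
\frac{(r-1)\,s^2}{\sigma^2} \sim \chi^2_{r-1},
\qquad
s^2 = \frac{1}{r-1}\sum_{i=1}^{r}\left(x_i - \bar{x}\right)^2
```

The (r-1) factor is the adjustment for sample size, so the normed statistic has mean r-1 and variance 2(r-1).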

A shocking amount of public policy gets built around research where those assumptions go unquestioned. True iid samples are about as common as hen's teeth. But, fortunately for the world's sloppy researchers, the violations of those assumptions generally don't get you into too much trouble, especially when you are dealing with averages and sums, both of which converge to normal if you make your sample large enough.

But that's the rub: generally doesn't mean always. And, in the case of correlated query sums, it just ain't so. I ran the simulator for r=5 (that is, I pulled 1000 samples of 5 blocks each) from both the uniform data and the correlated batch data. In the case of the uniform data, things line up really nicely, even for such a small sample size. The correlation between blocks isn't enough to mess up the iid assumption and the sums themselves are, in fact, normally distributed.
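For anyone who wants to poke at the well-behaved case without my simulator, here is a minimal sketch in Python. It is not the simulator itself: the function name, the mean of 100 and the standard deviation of 15 are made-up stand-ins, and the blocksums are drawn as truly iid normal, which is exactly the assumption under test.

```python
import numpy as np

def normed_sample_variances(n_samples=1000, r=5, mu=100.0, sigma=15.0, seed=0):
    """Draw n_samples samples of r iid normal blocksums (hypothetical mu, sigma)
    and return (r - 1) * s^2 / sigma^2 for each sample."""
    rng = np.random.default_rng(seed)
    blocksums = rng.normal(mu, sigma, size=(n_samples, r))
    s2 = blocksums.var(axis=1, ddof=1)   # unbiased sample variance per sample
    return (r - 1) * s2 / sigma**2       # normed by the known population variance

stats = normed_sample_variances()
print("mean:", stats.mean())      # theory: r - 1 = 4
print("variance:", stats.var())   # theory: 2 * (r - 1) = 8
```

A histogram of `stats` should sit close to the chi-squared density with 4 degrees of freedom, give or take sampling error from only 1000 samples.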


The "theory" line shows the chi-squared density with 4 (r-1) degrees of freedom. When we look at the correlated data, well, it's something quite different.

Yikes! It's not even unimodal, much less properly skewed. Granted, this is for a very small sample size, but the effect doesn't really go away until many blocks have been sampled. Note that the mean is 4 (well, actually 3.85, but that's just sampling error), just like the uniform case. The right side mode doesn't change things too much: in that case the variance estimate is very high, so the confidence interval will be wide and we'll keep on sampling. It's the left side mode that's the killer. There the variance estimate is far too low, which yields an artificially narrow confidence interval. Any stopping rule based on that will be prone to shutting down too soon.

Of course, that can happen in the uniform case as well, but the t-statistic accounts for that and keeps the overall chance of shutting down early at 5% (or whatever Type-I risk you want to accept). In the correlated case, even with the adjustment for sample size, far too often you're still going to think you're good when you're not.
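To make the coverage point concrete, here is a rough sketch that checks how often a 95% t-interval built from 5 blocks actually contains the true mean. Everything in it is an assumption for illustration: `coverage`, `normal_blocks` and `lumpy_blocks` are hypothetical names, and the "lumpy" generator is a toy stand-in (20% of blocks hold a big batch, the rest hold almost nothing), not the correlated batch data from the simulator. The only hard number is 2.776, the two-sided 95% critical value for Student's t with 4 degrees of freedom.

```python
import numpy as np

T_CRIT_4DF = 2.776  # two-sided 95% critical value, Student's t, 4 degrees of freedom

def coverage(draw_blocks, true_mean, n_samples=20000, r=5, seed=0):
    """Fraction of r-block samples whose 95% t-interval contains true_mean."""
    assert r == 5, "T_CRIT_4DF is only valid for r - 1 = 4 degrees of freedom"
    rng = np.random.default_rng(seed)
    x = draw_blocks(rng, (n_samples, r))
    mean = x.mean(axis=1)
    half_width = T_CRIT_4DF * x.std(axis=1, ddof=1) / np.sqrt(r)
    return float(np.mean(np.abs(mean - true_mean) <= half_width))

def normal_blocks(rng, shape):
    # Well-behaved case: iid normal blocksums (hypothetical mean 10, sd 3).
    return rng.normal(10.0, 3.0, size=shape)

def lumpy_blocks(rng, shape):
    # Toy stand-in for batched data: 20% of blocks carry a batch worth about 50,
    # the others are near zero, so the true mean is 0.2 * 50 = 10.
    hit = rng.random(shape) < 0.2
    return np.where(hit, rng.normal(50.0, 1.0, size=shape),
                    rng.normal(0.0, 0.1, size=shape))

print("normal coverage:", coverage(normal_blocks, 10.0))
print("lumpy coverage: ", coverage(lumpy_blocks, 10.0))
```

In the normal case the coverage should land near 95%. In the toy lumpy case it falls well short, mostly because samples where all five blocks miss the batch produce a tiny variance estimate and a confident interval around the wrong answer. That is the left side mode at work.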

Another reasonable question to ask is: "Since it's clearly not chi-squared, what is it?" That's a tougher question to answer. I'm sure I could grind out the exact distribution, but there really isn't much point. We have to come up with another way to prevent early termination of the sampling, and that's what the more sophisticated methods will be used for.
