Thursday, February 15, 2018

Out of shape

Not me. Well, maybe I am a bit, but I'm talking about the shape of the distribution of the sample variance of block sums when observations are correlated. I ran a simulation today to illustrate how simple block sampling doesn't work when the data are correlated. You'd think it would. Sure, the variance is higher, but you're still sampling sums from a finite population of blocks. As long as you do that, the estimate of the total sum will converge to normal and the sample variance will be an unbiased estimator of the true variance. Yes, it's a sample variance, not a true variance, so we have to use a t-distribution instead of a Normal for our confidence interval (not that there's much difference after we've taken a decent-sized sample). Even so, why do we get this?


The horizontal axis is the number of blocks sampled out of a population of 1,000. Note that with uniformly scattered data the 95% confidence interval includes the true sum about 95% of the time, as it should. Not so when the data are correlated: we have to sample many more blocks before the confidence interval correctly reflects the variability of our estimate.
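
For what it's worth, here's a rough sketch (in Python) of the kind of simulation I'm describing. The 10% relevance rate, the lognormal values, and the way the "correlated" layout packs all the relevant rows together are illustrative stand-ins, not the exact setup behind the plot:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

N_BLOCKS, ROWS_PER_BLOCK = 1000, 100
N_ROWS = N_BLOCKS * ROWS_PER_BLOCK

# Made-up data: 10% of rows are relevant to the query, with skewed values;
# the other 90% contribute nothing to the sum being estimated.
row_values = np.where(rng.random(N_ROWS) < 0.10,
                      rng.lognormal(0.0, 1.0, size=N_ROWS), 0.0)

def layout(order):
    """Pack the rows into blocks in the given order; return per-block sums."""
    return row_values[order].reshape(N_BLOCKS, ROWS_PER_BLOCK).sum(axis=1)

# Uniform layout: relevant rows scattered at random across the blocks.
uniform = layout(rng.permutation(N_ROWS))
# Correlated layout: relevant rows packed together, as if the table were
# stored in an order correlated with the query predicate.
correlated = layout(np.argsort(row_values > 0))

def coverage(block_sums, n_sample, n_trials=2000, alpha=0.05):
    """Fraction of trials whose t-based CI for the total sum covers the truth."""
    true_total = block_sums.sum()
    t_crit = stats.t.ppf(1 - alpha / 2, df=n_sample - 1)
    fpc = (N_BLOCKS - n_sample) / (N_BLOCKS - 1)   # finite-population correction
    hits = 0
    for _ in range(n_trials):
        s = rng.choice(block_sums, size=n_sample, replace=False)
        est_total = N_BLOCKS * s.mean()
        se_total = N_BLOCKS * np.sqrt(fpc * s.var(ddof=1) / n_sample)
        hits += abs(est_total - true_total) <= t_crit * se_total
    return hits / n_trials

for n in (25, 50, 100, 250):
    print(f"n={n:3d}   uniform: {coverage(uniform, n):.3f}   "
          f"correlated: {coverage(correlated, n):.3f}")
```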

It's not that the variance estimate is biased: the mean of the simulated estimates was pretty close to the truth at all sample sizes. The problem is the distribution of the sample variance. When a sample gets blocks with no (or hardly any) rows relevant to the query, the variance estimate is understated. In the more common case where you don't get a zero block, the variance is slightly overstated, so the mean comes out fine. But those understated cases give you a false sense of security that you have a reasonably tight estimate of the sum when, in fact, you don't. Any stopping rule based on that variance would be very prone to stopping early.
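
To see that concretely, here's a quick look at the sampling distribution of the variance estimate itself, reusing the same made-up correlated layout as the sketch above (again, a stand-in for the real data, not the original simulation):

```python
import numpy as np

rng = np.random.default_rng(7)

# Rebuild the made-up correlated layout from the sketch above: 10% of rows are
# relevant, packed into roughly 100 of the 1000 blocks; the rest sum to zero.
N_BLOCKS, ROWS_PER_BLOCK = 1000, 100
row_values = np.where(rng.random(N_BLOCKS * ROWS_PER_BLOCK) < 0.10,
                      rng.lognormal(0.0, 1.0, size=N_BLOCKS * ROWS_PER_BLOCK), 0.0)
correlated = (row_values[np.argsort(row_values > 0)]
              .reshape(N_BLOCKS, ROWS_PER_BLOCK).sum(axis=1))

n_sample, n_trials = 50, 10_000
true_var = correlated.var(ddof=1)   # the quantity the sample variance targets

var_est = np.array([
    rng.choice(correlated, size=n_sample, replace=False).var(ddof=1)
    for _ in range(n_trials)
])

# On average the estimate lands near the truth, i.e. no real bias ...
print(f"true variance: {true_var:.0f}   mean of estimates: {var_est.mean():.0f}")
# ... but the distribution is lopsided: samples that happen to draw few (or
# none) of the relevant blocks report a much smaller variance, and those are
# exactly the samples that would make a naive stopping rule quit early.
print(f"P(estimate < half the true variance): {np.mean(var_est < 0.5 * true_var):.3f}")
print(f"P(estimate is exactly zero):          {np.mean(var_est == 0):.3f}")
```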

The distinction is really important. As I posted a few weeks ago, the estimator of the total sum based on the block sums converges to normal very quickly, even when the underlying distribution is crazy skewed. It's the distribution of the sample variance that's messing us up. Why is that important? Because the thing random block sampling doesn't do is look at the underlying distribution. This helps motivate our study of methods that dig deeper into the shape of the data than just recording the first two moments.
