It's not that the variance estimate is biased; the mean of the simulated estimates was pretty close to the true value at all sample sizes. The problem is the distribution of the sample variance. When a sample gets a block with no (or hardly any) rows relevant to the query, the variance estimate is understated. In the more common case where you don't get a zero block, the variance is slightly overstated, so the mean comes out fine. But those understated cases give you a false sense of security: you think you have a reasonably tight estimate of the sum when, in fact, you don't. Any stopping rule based on the sample variance would be very prone to stopping early.
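To make the effect concrete, here is a minimal sketch in Python. It is a toy model, not the simulation behind the results above, and everything in it is an assumption: the function name `simulate_block_variance`, the 2% chance of a zero block, the sample of 20 blocks, and especially the choice to model the sum of a non-empty block as a zero-centered normal draw (think offsetting debits and credits). It reports the average ratio of the sample variance to the true block-sum variance, overall and split by whether the sample happened to include a zero block.

```python
import random


def simulate_block_variance(n_sampled=20, p_zero=0.02, trials=20000, seed=1):
    """Toy model; every parameter here is an illustrative assumption.

    Each sampled block sum is 0 with probability p_zero (no relevant rows),
    otherwise a draw from N(0, 1). Returns the mean of
    (sample variance / true variance), overall and split by whether the
    sample contained at least one zero block.
    """
    rng = random.Random(seed)
    # The block-sum mean is 0, so the true variance is (1 - p_zero) * 1.
    true_var = 1.0 - p_zero
    with_zero, without_zero = [], []

    for _ in range(trials):
        sample = []
        has_zero = False
        for _ in range(n_sampled):
            if rng.random() < p_zero:
                sample.append(0.0)  # block with no rows relevant to the query
                has_zero = True
            else:
                sample.append(rng.gauss(0.0, 1.0))
        mean = sum(sample) / n_sampled
        # Usual unbiased sample variance (divide by n - 1).
        var = sum((x - mean) ** 2 for x in sample) / (n_sampled - 1)
        (with_zero if has_zero else without_zero).append(var / true_var)

    all_ratios = with_zero + without_zero
    return {
        "overall": sum(all_ratios) / len(all_ratios),
        "with_zero_block": sum(with_zero) / len(with_zero),
        "without_zero_block": sum(without_zero) / len(without_zero),
        "share_with_zero_block": len(with_zero) / trials,
    }


if __name__ == "__main__":
    for name, value in simulate_block_variance().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions the overall ratio should sit near 1, while samples with a zero block average below 1 and samples without one average slightly above it. The gap is small in a toy this tame; the point is only the direction, which a stopping rule keyed to the sample variance would pick up on.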
The distinction is really important. As I posted a few weeks ago, the estimator of the total sum based on the block sums converges to a normal distribution very quickly, even when the underlying distribution is crazy skewed. It's the distribution of the sample variance that's messing us up. Why is that important? Because the one thing random block sampling doesn't do is look at the underlying distribution. This helps motivate our study of methods that dig deeper into the shape of the data than just recording the first two moments.