Sunday, February 11, 2018

Finite comprehension

Last week, my adviser pointed out an error in my variance calculation when using the Bag of Little Bootstraps (BLB) method. That didn't bother me much. As I noted yesterday, I get details wrong all the time. Fix 'em and get on with life.

What bothered me was that, if he was right, my straightforward estimate from random block sampling would have a higher variance than BLB. I was sure that the scaled block average was a minimum variance unbiased estimator (MVUE). Granted, bootstrap estimators are a bit whackadoodle, being non-deterministic and all, but they are still estimators based on the data and if this one was coming in lower, then the scaled block sum couldn't possibly be the MVUE.

So, back to a very careful pass through the derivation followed by actually simulating the results to make sure they were matching theory. The problem, it turns out, is the whole confusion you get with bootstrap estimators over what it is you're actually estimating.

Normally, the sample is assumed to be a finite sample from an infinite population. You fold the sample over on itself to generate a bunch of samples and that lets you get some idea of what the variance should be on any sample from that infinite population. BLB obfuscates that even more because now we have all these little sub-samples which themselves get bootstrapped. Under that model, my variance calc was wrong.

But, that's not what we're doing. The "sample" is actually the population. The population sum is not a random variable, it's a population parameter. Even in the Bayesian world, one has to concede that there is no actual randomness in the sum. It is what it is. So, what we're really looking at is bootstrapping the block sums to create an estimator and then looking at the variance of that estimator around its mean plus the variance of the mean of that estimator. The first part is the block variance scaled up to the size of the total population.

I simulated 1000 blocks and then ran the BLB on each block. The variance of the block sums was 457 and the average variance of the estimators was 464,934, or 1017 times as much. So far, things are checking out just fine. But, we're not done. That's just how much variance we expect in the empirical distribution of the bootstrap samples. Since the bootstrap samples are all taken from the same block, the expected variance is (n/b)σ² = Nσ², where n is the total number of rows, b is the block size, and N is the total number of blocks. So, if I want to estimate the block variance, I just divide my empirical variance by the number of blocks. Great, that was the whole point: use BLB to get the block variance.
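Here's a minimal sketch of that check. This is made-up Gaussian data with a smaller N and b than my actual run (so the raw numbers differ), but the ratio of the average bootstrap variance to the block-sum variance should still land near N = n/b:

```python
import numpy as np

rng = np.random.default_rng(0)
N, b = 100, 100        # number of blocks, rows per block (smaller than my run)
n = N * b              # total number of rows

# made-up data: iid rows laid out as N blocks of b rows each
data = rng.normal(10.0, 2.0, size=(N, b))

sigma2 = data.sum(axis=1).var(ddof=1)    # empirical block-sum variance

# for each block, bootstrap the sum of n draws taken with replacement
B = 50                                   # bootstrap replicates per block
boot_vars = [rng.choice(row, size=(B, n)).sum(axis=1).var(ddof=1)
             for row in data]

ratio = np.mean(boot_vars) / sigma2      # should come out near N
print(ratio)
```

The block variance itself is then recovered by dividing the average bootstrap variance by N, which is the point of the exercise.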

Now, let's look at the variance of our estimator, Uj. Uj is the sum of n observations sampled with replacement from block j. Pretty clearly, E(Uj) = (n/b)Sj = NSj, where Sj is the block sum. We also know that NSj is an unbiased estimator of the total sum S (the MVUE from Sj at that). Let's take the variance of the difference:

Var(S - NSj) = Var(S) + N²Var(Sj) = 0 + N²σ²

Wait a minute, that's literally a thousand times greater (since N=1000 in my simulation)! Yup, and the sim came back with just that number. The variance across the BLB estimators from all blocks was 456,998,006, almost exactly a million times the block variance. So, while I haven't made as much progress on the new stuff my adviser wanted me to look at, at least I can sleep a bit better knowing that my base case is using the MVUE and all this other stuff is just to help figure out what that variance is.
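The same kind of toy setup reproduces that factor of N². Again, this is a sketch with made-up data and a smaller N than my actual sim, taking one BLB estimate per block and looking at the variance across blocks:

```python
import numpy as np

rng = np.random.default_rng(1)
N, b = 500, 20         # smaller than the N = 1000 in my simulation
n = N * b

data = rng.normal(10.0, 2.0, size=(N, b))
sigma2 = data.sum(axis=1).var(ddof=1)    # block-sum variance

# one BLB estimate per block: the sum of n draws with replacement
U = np.array([rng.choice(row, size=n).sum() for row in data])

# variance across the per-block estimators: roughly N**2 times
# the block variance, not N times
ratio = U.var(ddof=1) / sigma2
print(ratio)
```

The across-block variance is dominated by N²Var(Sj); the bootstrap noise only adds a term on the order of Nσ², which is negligible by comparison.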

Incidentally, if you grind out the covariance terms (which I'm starting to get good at), it turns out that the estimator created by simply averaging the BLB estimates for the r blocks you've sampled is as good as using the simple block estimator. Really! The variances come out to be exactly the same: N²σ²(N-r)/[r(N-1)]. That's fine. There's no rule that MVUEs are unique. I just wanted to be sure I had one.
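That formula is easy to check by Monte Carlo. This sketch (hypothetical N, b, and r, made-up block sums) averages the scaled sums of r blocks drawn without replacement, i.e. the noiseless limit of averaging BLB estimates, and compares the empirical variance against N²σ²(N-r)/[r(N-1)]:

```python
import numpy as np

rng = np.random.default_rng(2)
N, b, r = 200, 20, 40
S = rng.normal(10.0, 2.0, size=(N, b)).sum(axis=1)   # the N block sums
sigma2 = S.var(ddof=0)   # population variance of the block sums
                         # (ddof=0 is what matches the formula)

# repeatedly sample r blocks without replacement and average N*S_j
trials = 10000
est = np.array([N * S[rng.choice(N, size=r, replace=False)].mean()
                for _ in range(trials)])

empirical = est.var(ddof=1)
theory = N**2 * sigma2 * (N - r) / (r * (N - 1))
print(empirical / theory)    # should be near 1
```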
