First, the method. Sample r blocks and get all the partitions. Then run a bootstrap: keep resampling from the partitions with replacement until you're back at your sample size. (I also tried the simpler version that resamples the blocks themselves rather than the partitions; it didn't make much difference either way, though the partition samples were a little smoother.) That gives you a distribution on your sample sum. Now scale that up to your total size (multiply by N/r) and you have the distribution of your total-sum estimator. Find the 2.5th and 97.5th percentiles and that's your 95% confidence interval.
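Here's a minimal sketch of that procedure (assuming NumPy, and resampling block sums directly, i.e., the simpler variant mentioned above; the partition version would resample partition sums instead, and all names are mine):

```python
import numpy as np

def bootstrap_total_ci(block_sums, N, n_boot=10_000, alpha=0.05, rng=None):
    """Percentile bootstrap CI for the total sum.

    block_sums: the r sampled block sums (the simpler variant; the
    partition variant would resample partition sums instead).
    N: total number of blocks in the population.
    """
    rng = np.random.default_rng(rng)
    block_sums = np.asarray(block_sums)
    r = len(block_sums)
    # Resample r values with replacement, n_boot times; sum each resample
    # to get a bootstrap distribution of the sample sum.
    idx = rng.integers(0, r, size=(n_boot, r))
    boot_sums = block_sums[idx].sum(axis=1)
    # Scale the sample-sum distribution up to the full population.
    boot_totals = boot_sums * (N / r)
    lo, hi = np.percentile(boot_totals, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi
```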
Except it isn't.
If we plot the number of samples where the confidence interval does not contain the true sum, we get this:
Yeah, so the "confidence" interval doesn't seem to have much to do with our confidence. Note that this is an empirical distribution, so it's not like we've got a parameter wrong somewhere. The empirical distribution itself is way too narrow when the sample is small and too wide when it gets larger. To see how much, I scaled each confidence interval by a factor so that 5% of the intervals would miss the mark. I then tried to find a relationship between that factor and the sample size. From the shape of the graph, a reciprocal relationship seemed a good bet and, indeed, it fits fairly well.
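For what it's worth, the miss-rate experiment behind that plot looks roughly like this, reusing bootstrap_total_ci from above (population_blocks, a hypothetical array of all N block sums, and the trial count are placeholders):

```python
import numpy as np

def miss_rate(population_blocks, r, n_trials=1_000, seed=0):
    """Fraction of trials whose 95% bootstrap CI misses the true total."""
    rng = np.random.default_rng(seed)
    population_blocks = np.asarray(population_blocks)
    N = len(population_blocks)
    true_total = population_blocks.sum()
    misses = 0
    for _ in range(n_trials):
        # Draw r blocks without replacement, as in the actual sampling.
        sample = rng.choice(population_blocks, size=r, replace=False)
        lo, hi = bootstrap_total_ci(sample, N, rng=rng)
        misses += not (lo <= true_total <= hi)
    return misses / n_trials
```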
I could just move on, but I'd really like to understand what's going on here because accurately simulating distributions is a pretty important part of sampling. And, I can't just write in the paper, "Then divide by r for no apparent reason and it works."
So, why would the empirical distribution be squeezed by the reciprocal of the sample size? And why does the relationship continue to hold even as the confidence interval goes from being too wide to too narrow (one would expect it to be asymptotic)? Finally, 1/r doesn't do anything different when the number of blocks, N, goes to infinity. I need a formula that converges to 1 when N >> r.
That last bit got me thinking this has something to do with the fact that the relationship between the block variance and the error in estimating the true sum is different in the finite and infinite cases. Recall that the relationship between the block variance and the estimator error in the finite case is (N/r)(N-r). In the infinite case, where you're not really estimating the sum, but rather the population mean scaled up by a factor of N (we're treating block sums as atomic observations for the moment), the relationship between the block variance and the estimator error is N². When you divide those two, you get (1/r)(N-r)/N. When N is fixed, that gives the same reciprocal arrangement as above. When r is fixed and N goes to infinity, the (N-r)/N part converges to 1. So, it looks good.
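Writing that division out, with σ_b² standing in for the block variance (my notation; it cancels in the ratio):

```latex
\frac{\tfrac{N}{r}(N-r)\,\sigma_b^2}{N^2\,\sigma_b^2}
  \;=\; \frac{1}{r}\cdot\frac{N-r}{N}
  \;=\; \frac{1}{r}\left(1-\frac{r}{N}\right)
```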
I'm not quite ready to give out high fives, though. First and foremost, I shouldn't be expanding the confidence interval by the size of the variance; I should be expanding it by the standard deviation. The relationship to the square root of the reciprocal mitigates the annoying kink on the left, but the right-hand side is looking like it might not fit. At any rate, the graph is hardly proof of anything. I need to derive the actual bootstrap distribution.
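If that reasoning does hold up, the adjustment would be to rescale the interval's half-width by the square root of that ratio. A sketch, with corrected_ci as a hypothetical helper and the factor itself still conjecture:

```python
import math

def corrected_ci(lo, hi, N, r):
    """Rescale a percentile CI about its center by the conjectured
    standard-deviation factor sqrt((1/r) * (N - r) / N) -- unverified."""
    center = (lo + hi) / 2
    factor = math.sqrt((1 / r) * (N - r) / N)
    half = (hi - lo) / 2 * factor
    return center - half, center + half
```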
The casualty in all this is that I haven't done any of the work on the Dirichlet stuff, other than reading about it. That's not really the message I wanted to take into tomorrow's meeting with my adviser, but at least I have an algorithm that works (albeit inexplicably).