As I was typesetting yesterday's derivation into my paper, I suddenly realized that the equation involves one more random variable than I had accounted for, and it's a really important one. In case it's not obvious, extra random variables tend to increase randomness, and that increases variance.
I had been a bit bugged by the fact that the covariance term between the various partitions should have been more significant. Yet, the results from the simulations were matching pretty closely without it, so I figured I'd come back to that later. Well, that's because I had baked the randomness out of my variance estimator as well.
The "missing" random variable is the number of rows in each partition of the batch (bk in the equations from yesterday). For a fixed set of partition sizes, the formula is spot on, but it doesn't account for the fact that the sizes are not fixed. Every block is chopped up a little differently. I got away with it because my estimator also fixed them: I just took the average value across all blocks. That's not a bad estimate when partitions tend to be at least as large as a block. When you have a mix of small and large partitions, the variance can be understated. If the small partitions have one distribution and the large partitions have another that is quite different, the variance can be significantly understated.
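As a rough illustration of that understatement, here is a minimal simulation sketch. It is a toy model, not the estimator from the derivation: it assumes rows land in partitions multinomially with hypothetical shares `probs`, and that each row is a Gaussian draw with a hypothetical per-partition mean and standard deviation (`means`, `sds`). It compares the variance of the block total when the partition sizes are pinned at their averages against the variance when the sizes are redrawn for every block.

```python
import math
import random
import statistics


def random_sizes(block_size, probs, rng):
    """Draw one block's partition row counts (multinomial; an assumed model)."""
    counts = [0] * len(probs)
    for k in rng.choices(range(len(probs)), weights=probs, k=block_size):
        counts[k] += 1
    return counts


def block_sum(sizes, means, sds, rng):
    """Sum of all row values in one block, given the row count per partition."""
    total = 0.0
    for b, mu, sd in zip(sizes, means, sds):
        # The sum of b independent N(mu, sd^2) rows is N(b*mu, b*sd^2),
        # so one draw per partition stands in for b row-level draws.
        total += rng.gauss(b * mu, sd * math.sqrt(b))
    return total


def simulate(block_size=100, n_blocks=4000, seed=1):
    rng = random.Random(seed)
    probs = [0.7, 0.2, 0.1]    # hypothetical partition shares
    means = [1.0, 5.0, 20.0]   # hypothetical per-row means, deliberately different
    sds = [1.0, 1.0, 1.0]      # hypothetical per-row standard deviations

    # Case 1: every block uses the average partition sizes.
    avg_sizes = [round(block_size * p) for p in probs]
    fixed = [block_sum(avg_sizes, means, sds, rng) for _ in range(n_blocks)]

    # Case 2: partition sizes are redrawn for every block.
    varying = [
        block_sum(random_sizes(block_size, probs, rng), means, sds, rng)
        for _ in range(n_blocks)
    ]
    return statistics.variance(fixed), statistics.variance(varying)


fixed_var, varying_var = simulate()
print(f"sizes fixed at their average: {fixed_var:.1f}")
print(f"sizes redrawn per block:      {varying_var:.1f}")
```

Under these assumed parameters the theoretical values are about 100 for the fixed case and about 3300 for the varying case, so nearly all of the variance comes from the partition sizes moving around. Making the three means equal closes the gap, which matches the point that the understatement is worst when the partitions have very different distributions.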
More importantly, from the standpoint of understanding the problem, making bk a random variable sends the variance term way up, and the covariance term brings it back down (the covariance is always negative because, in a fixed-size block, if one partition is bigger, at least one other partition has to be smaller). This dynamic helps explain what's driving the overall variance. Some samples have a high variance because the observations are very volatile. Others are high because the partitions are really different. The latter case is a much bigger problem from a sampling perspective, so it's helpful to know that's what you're dealing with.
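The negative covariance between partition sizes is easy to check empirically. The sketch below assumes, purely for illustration, that the rows of a fixed-size block are assigned to partitions multinomially with hypothetical shares `probs`; it then estimates the covariance matrix of the partition row counts across many blocks.

```python
import random


def partition_counts(block_size, probs, rng):
    """Row count per partition for one block (multinomial; an assumed model)."""
    counts = [0] * len(probs)
    for k in rng.choices(range(len(probs)), weights=probs, k=block_size):
        counts[k] += 1
    return counts


def sample_cov(xs, ys):
    """Unbiased sample covariance of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


rng = random.Random(7)
block_size = 200
probs = [0.5, 0.3, 0.2]  # hypothetical partition shares

blocks = [partition_counts(block_size, probs, rng) for _ in range(3000)]
columns = list(zip(*blocks))  # one sequence of counts per partition
cov = [[sample_cov(a, b) for b in columns] for a in columns]

for row in cov:
    print("  ".join(f"{value:8.2f}" for value in row))
```

Every off-diagonal entry comes out negative, and each row of the matrix sums to zero up to rounding, because the counts in a block always add up to the same block size. For a multinomial split the theoretical covariance between partitions i and j is `-block_size * probs[i] * probs[j]`.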
How much the covariance term brings it down depends very much on the distribution of the hit rates for each partition. If there are a lot of partitions with no hits, the covariance will be zero between those, so the overall variance stays pretty high. That is exactly as it should be. In the extreme case, where all the hits are concentrated in one partition, your estimator is basically useless (analogous to an infinite variance) until you find that partition.
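A small worked calculation makes the dependence on hit rates concrete. This is a sketch under stated assumptions, not the full variance expression: it again assumes multinomial partition sizes with hypothetical shares `probs`, treats each partition's hit rate as a known constant, and computes only the part of the variance that comes from the sizes varying. That part splits into a variance term (sum of hit rate squared times Var(bk)) and a covariance term (sum over pairs of hit-rate products times Cov(bj, bk)).

```python
def size_variance_terms(block_size, probs, hit_rates):
    """Split Var(sum_k hit_rate_k * b_k) into its variance and covariance parts.

    Assumes multinomial sizes: Var(b_k) = n*p_k*(1-p_k), Cov(b_j, b_k) = -n*p_j*p_k.
    """
    n = block_size
    var_term = sum(h * h * n * p * (1 - p) for p, h in zip(probs, hit_rates))
    cov_term = 0.0
    for j, (pj, hj) in enumerate(zip(probs, hit_rates)):
        for k, (pk, hk) in enumerate(zip(probs, hit_rates)):
            if j != k:
                # A partition with no hits zeroes out every product it appears in.
                cov_term += hj * hk * (-n * pj * pk)
    return var_term, cov_term


probs = [0.25, 0.25, 0.25, 0.25]      # hypothetical partition shares
spread = [0.2, 0.2, 0.2, 0.2]         # hits spread evenly across partitions
concentrated = [0.8, 0.0, 0.0, 0.0]   # same expected hits, all in one partition

spread_var, spread_cov = size_variance_terms(100, probs, spread)
conc_var, conc_cov = size_variance_terms(100, probs, concentrated)

print(f"spread:       var={spread_var:.2f} cov={spread_cov:.2f} "
      f"total={spread_var + spread_cov:.2f}")
print(f"concentrated: var={conc_var:.2f} cov={conc_cov:.2f} "
      f"total={conc_var + conc_cov:.2f}")
```

Both cases have the same expected number of hits per block. With the hits spread evenly, the covariance term cancels the variance term entirely, since it no longer matters how the block is chopped up. With the hits concentrated in one partition, the covariance term is zero and nothing offsets the larger variance term.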
The rub is that this does not simplify the variance equations. It's not that I can't do the derivations; that's just more work (tedious and error-prone, to be sure, but just work). It's that the more complicated the expression, the more difficult it is to build a decent estimator and, more importantly, the more difficult it is to get a distribution on that estimator.
At least I've found some more math.