Saturday, March 24, 2018

Edge cases

I haven't been posting because, frankly, working out these distributions (and typesetting them) is consuming pretty much every free moment. I did derive a somewhat interesting result tonight that deserves some mention.

Because I don't want to make any assumptions about the underlying distributions, I'm doing all my variance calculations using the method of moments (compute the first two moments from the sample and plug them back into the variance equation). I'm doing this on just about everything: the observation amounts, the hit rate, the number of blocks in a partition... For the most part, it is then just a matter of cranking out the formulas. Of course, it doesn't hurt to check your work by plugging in some numbers where you already know the answer and seeing that you get what you expect.

So, I did that. I set the number of partitions per block exactly equal to 1 and the variance of the block hit rate came out to exactly the variance of the partition hit rate. Except that it didn't. It came out to the population variance. But, if this is a sample, it really should come out to the sample variance. This isn't too terribly surprising. After all, I'm just matching up the moments, so there's no distinction made between a sample moment and a population moment. Still, it's unsettling.
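The same effect is easy to reproduce in the plain i.i.d. case, stripped of all the block and partition machinery. The sketch below is only an illustration: the function name `moment_variance` and the hit-rate values are made up for the example, not taken from my actual calculations. It computes the plug-in variance from the first two raw moments and compares it against the population and sample variances from Python's standard library.

```python
import statistics


def moment_variance(xs):
    """Plug-in variance: second raw moment minus the squared first moment."""
    n = len(xs)
    m1 = sum(xs) / n                    # first raw moment (the mean)
    m2 = sum(x * x for x in xs) / n     # second raw moment
    return m2 - m1 * m1


# Hypothetical hit rates, chosen only for illustration.
hit_rates = [0.2, 0.5, 0.3, 0.9, 0.6]

print("method of moments:  ", moment_variance(hit_rates))
print("population variance:", statistics.pvariance(hit_rates))  # divides by n
print("sample variance:    ", statistics.variance(hit_rates))   # divides by n - 1
```

Up to floating-point rounding, the first two numbers agree and the third is larger, which is the same mismatch described above.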

Then, I set the number of partitions to be exactly the row count for the block; basically my uniform case. And, again, it came out just right except that it was matching as a population rather than a sample.

This doesn't much matter when the sample gets large, but at small sample sizes it can throw things off quite a bit. I can adjust for it easily enough, but coming up with a mathematically sound way of doing that (as opposed to just messing with priors until it works) might be a bit of a trick. This seems like a good question for my adviser. It's entirely possible there's a standard trick that I just don't know about.
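For reference, here is how the two estimates relate in the simplest setting of n independent, identically distributed observations. This is the textbook result (Bessel's correction), written with my own notation for the plug-in estimate; whether the same factor carries over cleanly to the block and partition quantities is exactly the open question, so treat it as a sketch of the size of the gap rather than the fix.

```latex
% Plug-in (method of moments) variance from the first two sample moments
\hat{\sigma}^2_{\mathrm{MoM}}
  = \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2
  = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2

% It is biased low as an estimate of the true variance
E\!\left[\hat{\sigma}^2_{\mathrm{MoM}}\right] = \frac{n-1}{n}\,\sigma^2

% The usual sample variance rescales it
s^2 = \frac{n}{n-1}\,\hat{\sigma}^2_{\mathrm{MoM}}
```

With n = 5 the factor n/(n-1) is 1.25, so the sample variance is 25% larger than the plug-in value; with n = 100 the difference is about 1%.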
