The concept is simple enough. For every block sampled, you separate out the partitions in that block. Then, for each partition, you generate a distribution for rho (the hit rate). Lots of distributions work for this and they all converge to the same thing, so I just picked the easiest one to defend: the likelihood function for the data given rho. Using such a distribution essentially yields a Bayesian posterior under a non-informative prior. In the specific case of estimating a hit rate, that posterior is simply the Beta distribution, with parameters given by the number of hits and misses observed.
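Here's a minimal sketch of that per-partition posterior in Python. The function name is mine, and note that using the raw counts directly as the Beta parameters corresponds to an improper Beta(0, 0) prior, so it assumes at least one hit and one miss have been observed:

```python
from scipy.stats import beta

def rho_posterior(hits, misses):
    """Beta posterior for the hit rate rho of one partition."""
    # The observed counts are used directly as the Beta parameters
    # (an improper Beta(0, 0) prior), so hits > 0 and misses > 0.
    return beta(hits, misses)

post = rho_posterior(hits=12, misses=88)
print(post.mean())          # 0.12, the observed hit rate
print(post.interval(0.95))  # central 95% credible interval
```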
The kernel distribution is just the average of all the individual partition distributions, weighted by how many rows were observed in each partition. If the data were iid, that would be all there is to it. Since it isn't, we also have to account for the fact that our sample is correlated, so it will vary more than we might otherwise expect. This can be done either by giving more weight to the prior or less weight to the data. Since the result is much the same either way, I took the latter approach and scale my hits and misses by how much of the sample has been read so far.
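A sketch of that weighted average, with made-up partition counts. Multiplying both counts by the fraction of the sample read so far is my reading of the down-weighting described above; the exact scaling factor is an assumption:

```python
import numpy as np
from scipy.stats import beta

def kernel_pdf(x, partitions, fraction_read):
    """Row-weighted mixture of the per-partition Beta posteriors.

    partitions:    list of (hits, misses) pairs, one per partition
    fraction_read: share of the sample consumed so far, in (0, 1]
    """
    x = np.asarray(x, dtype=float)
    rows = np.array([h + m for h, m in partitions], dtype=float)
    weights = rows / rows.sum()          # weight by rows observed
    pdf = np.zeros_like(x)
    for w, (h, m) in zip(weights, partitions):
        # Scaling the counts down widens each posterior, standing in
        # for the extra variance of a correlated sample.
        pdf += w * beta.pdf(x, h * fraction_read, m * fraction_read)
    return pdf

x = np.linspace(0.001, 0.999, 500)
density = kernel_pdf(x, [(12, 88), (1, 200), (40, 10)], fraction_read=0.005)
```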
This yields the following expected distribution for rho when 1/200 of the population has been sampled:
Not surprisingly, the spike at zero gets picked up very quickly. If a partition is of any size at all and returns no rows, that's a pretty good indication that rho is zero for that partition. The rest of the distribution, however, is pretty fuzzy. Doubling the sample size helps.
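To put a rough number on "pretty good indication": with zero hits, the counts-only posterior is degenerate, so this quick check falls back to a uniform Beta(1, 1) prior (an assumption on my part) and asks how much posterior mass stays above a 1% hit rate as the misses pile up:

```python
from scipy.stats import beta

# With zero hits, n misses, and a uniform prior, the posterior is
# Beta(1, n + 1), so P(rho > eps) = (1 - eps) ** (n + 1).
for n in (50, 200, 1000):
    tail = beta(1, n + 1).sf(0.01)   # P(rho > 1%) after n misses
    print(f"{n:5d} misses -> P(rho > 1%) = {tail:.3g}")
# 50 misses still leave ~60% of the mass above 1%;
# 1000 misses leave ~0.004%.
```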
It's starting to look like there are four underlying populations, but it's still not clear how much of the population is in each part. At 5% of the population sampled, we've pretty much got it. We may have to keep sampling to further reduce the variance on our estimator, but our estimate of the block variance is pretty solid at this point.
In the simple block sum case, the best we can do is weight the data so that the convergence reflects that uncertainty. This can be tuned based on experience with the data but, since every query is different, it's always going to be a bit of a guessing game. At any rate, since the distribution of the sample sum is a function of the distribution of rho, we can use this estimate of the distribution to construct a distribution and confidence interval for our estimator.
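One way to do that last step is by Monte Carlo: draw rho from the mixture and push each draw through the estimator. The sketch below does the simplest version, scaling rho by a population row count N (assumed known); this stands in for the full block-sum estimator rather than reproducing it:

```python
import numpy as np

rng = np.random.default_rng(17)

def sample_rho(partitions, fraction_read, size):
    """Draw rho from the row-weighted mixture of scaled Beta posteriors."""
    rows = np.array([h + m for h, m in partitions], dtype=float)
    idx = rng.choice(len(partitions), size=size, p=rows / rows.sum())
    counts = np.array(partitions, dtype=float)[idx] * fraction_read
    return rng.beta(counts[:, 0], counts[:, 1])

N = 10_000_000                   # population row count (assumed known)
rhos = sample_rho([(12, 88), (1, 200), (40, 10)], 0.05, 100_000)
totals = rhos * N                # each draw implies a population total
lo, hi = np.percentile(totals, [2.5, 97.5])
print(f"95% interval for the total: [{lo:,.0f}, {hi:,.0f}]")
```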