This post is mainly for my adviser, as I told him I'd have the implementation details section done tonight and might not get to prettying up this graph. Under the heading of "How can we tell if this is just plain wrong?", my thought was that a simple heuristic might be to use not a minimum sample size but rather a minimum non-zero sample size. That is, where we get in trouble is when we don't have enough non-zero blocks. So, I modified the simple block variance sampler to keep going until it had the specified number of non-zero blocks instead of stopping at a fixed block count. The results are below:
As you can see, at really small sample sizes this technique helps a lot (the horizontal scale is logarithmic; the points are 5, 10, 20, 50, 100, 250). That is, waiting for 5 non-zero blocks yields much better confidence intervals than simply stopping at 5 blocks no matter what. However, by 10 that advantage has gone away. Somewhere between 5 and 10 non-zero blocks is enough for the method to work.
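For concreteness, here's a minimal sketch of a stop-at-k-non-zero-blocks sampler in the spirit of the modification described above. The toy sparse data source, the normal-approximation interval, and the function names are placeholders for illustration, not the actual sampler used to produce the plot.

```python
import math
import random


def sample_until_nonzero_blocks(draw_block, min_nonzero, z=1.96, max_blocks=100_000):
    """Draw blocks until min_nonzero of them are non-zero, then return a
    normal-approximation interval as (mean, half_width, blocks_used)."""
    sums, nonzero = [], 0
    while nonzero < min_nonzero and len(sums) < max_blocks:
        s = draw_block()
        sums.append(s)
        if s != 0:
            nonzero += 1
    n = len(sums)
    mean = sum(sums) / n
    var = sum((x - mean) ** 2 for x in sums) / (n - 1) if n > 1 else float("inf")
    return mean, z * math.sqrt(var / n), n


def toy_block(p_nonzero=0.1, mean_if_nonzero=1.0):
    # Toy sparse block source: ~90% of block sums are zero (an assumption, not the post's data).
    return random.expovariate(1.0 / mean_if_nonzero) if random.random() < p_nonzero else 0.0


est, half_width, n_blocks = sample_until_nonzero_blocks(toy_block, min_nonzero=5)
print(f"estimate {est:.3f} +/- {half_width:.3f} from {n_blocks} blocks")
```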
So, this is a really simple heuristic: run a few queries using whichever method you want and plot the performance when the stopping rule is some number of total blocks versus the same number of non-zero blocks. Where the two curves converge, that's your minimum non-zero count. Don't even bother testing the width of the confidence interval if you don't have that many non-zero blocks; it can't be trusted.
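A rough sketch of that calibration check, again with a toy sparse source standing in for a real query: run repeated estimates under both stopping rules at each candidate count and compare how often the nominal 95% interval covers the true mean. The trial count and the coverage criterion are assumptions; plug in your own sampler and stopping points.

```python
import math
import random


def toy_block(p_nonzero=0.1, mean_if_nonzero=1.0):
    # Same toy sparse source as above; swap in your own query's block sampler.
    return random.expovariate(1.0 / mean_if_nonzero) if random.random() < p_nonzero else 0.0


def run_once(stop_rule, n_target, z=1.96):
    """One estimate, stopping at n_target total blocks or n_target non-zero blocks."""
    sums, nonzero = [], 0
    while True:
        s = toy_block()
        sums.append(s)
        if s != 0:
            nonzero += 1
        reached = nonzero if stop_rule == "nonzero" else len(sums)
        if reached >= n_target:
            break
    n = len(sums)
    mean = sum(sums) / n
    var = sum((x - mean) ** 2 for x in sums) / (n - 1) if n > 1 else float("inf")
    return mean, z * math.sqrt(var / n)


def coverage(stop_rule, n_target, true_mean, trials=1000):
    # Fraction of nominal 95% intervals that actually cover the true mean.
    hits = 0
    for _ in range(trials):
        mean, half_width = run_once(stop_rule, n_target)
        hits += abs(mean - true_mean) <= half_width
    return hits / trials


true_mean = 0.1 * 1.0  # p_nonzero * mean_if_nonzero for the toy source
for n in (5, 10, 20, 50):
    print(f"n={n}: total-blocks coverage {coverage('total', n, true_mean):.2f}, "
          f"non-zero-blocks coverage {coverage('nonzero', n, true_mean):.2f}")
```

Where the two coverage columns stop differing is the point at which the non-zero minimum no longer buys you anything, which is exactly the convergence point described above.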