It seems to work great for driving out the big variability, but then it wanders around a bit. That doesn't bother me, but the fact that the confidence interval doesn't indicate as much bothers me a lot. In just about every trial, the confidence bounds exclude the real answer for an extended stretch between 300 and 400 blocks sampled. They generally do a pretty good job everywhere else. I could just arbitrarily widen the confidence interval, but I'd like to know why I have to do that.
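Roughly, that kind of coverage can be measured with a loop like the one below; run_trial and true_total are placeholder names for the test harness, not the real code, so treat this as a sketch of the check rather than the check itself.

```python
# Empirical coverage per sample count, assuming a hypothetical run_trial(seed)
# that yields a (lower, upper) confidence bound after each block is sampled,
# and a known true_total for the test data set.

def coverage_by_sample_count(run_trial, true_total, n_trials=100, n_blocks=500):
    """Fraction of trials whose interval contains the true total, per sample count."""
    hits = [0] * n_blocks
    for seed in range(n_trials):
        for i, (lower, upper) in enumerate(run_trial(seed)):
            if i >= n_blocks:
                break
            if lower <= true_total <= upper:
                hits[i] += 1
    return [h / n_trials for h in hits]
```

For a nominal 95% interval, empirical coverage dipping well below 0.95 in the 300-400 block range is exactly the symptom described above.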
My first thought was that modeling the block sum given a hit as Uniform(0, max possible) was bogus. Turns out, it's not. I checked the data against that assumption and it came back looking pretty darn uniform.
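A check along those lines can be as simple as a Kolmogorov-Smirnov test against Uniform(0, max possible). Here's a minimal sketch; block_sums and max_possible are placeholder names for whatever the test harness actually calls them.

```python
# Test the non-zero block sums against a Uniform(0, max_possible) model.
from scipy import stats

def check_uniform(block_sums, max_possible):
    # A large p-value means the data are consistent with the uniform
    # assumption (it doesn't prove the assumption is right).
    statistic, p_value = stats.kstest(block_sums, "uniform", args=(0, max_possible))
    return statistic, p_value
```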
My guess is that my test data set is simply too small. Since the total sum is just that, a sum, I'm counting on the central limit theorem to give me a normally distributed error. Between the fact that the sums in different strata have significantly different variances and the relatively small number of non-zero entries, the error is still more spread out than the normal approximation assumes. Also, we're sampling without replacement. Once more than half the data has been sampled, that's going to start cramping the whole "the block sum is an independent random variable" thing. This algorithm was conceived as one that would sample a small enough percentage of the total that none of that matters.
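For reference, the standard stratified-sampling adjustment for sampling without replacement is the finite population correction, which shrinks each stratum's variance as the sampled fraction grows. A minimal sketch, assuming samples is the list of block sums drawn from one stratum and N is that stratum's total block count (both names are mine):

```python
import statistics

def stratum_total_and_variance(samples, N):
    n = len(samples)
    mean = statistics.fmean(samples)                      # needs n >= 1
    var = statistics.variance(samples) if n > 1 else 0.0  # sample variance
    fpc = 1.0 - n / N        # finite population correction: goes to 0 as n -> N
    est_total = N * mean
    est_var = N * N * fpc * var / n
    return est_total, est_var
```

With a small sampling fraction the correction is close to 1, which is why an algorithm designed to sample only a small percentage can get away with ignoring it.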
Hooking this thing to the real data set will require a new data layer. I've architected it so that the data layer can be swapped out for something high-performance fairly easily, but there's still the non-trivial matter of writing a high-performance data layer and hooking it to a 4TB data set that isn't really set up to be read this way.
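Something along these lines is what a swappable data layer could look like; the class and method names are illustrative, not the actual interface.

```python
# A minimal sketch of the seam between the sampler and the storage.
from abc import ABC, abstractmethod

class BlockSource(ABC):
    """Abstraction over wherever the blocks actually live."""

    @abstractmethod
    def block_count(self) -> int:
        """Total number of blocks available for sampling."""

    @abstractmethod
    def read_block_sum(self, block_id: int) -> float:
        """Fetch one block and return its sum; this is the expensive call."""
```

The test harness could sit on an in-memory implementation of this, while the production version would have to stream from the 4TB store.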
In the meantime, I think I do have enough to start writing. The confidence intervals aren't way off. I think this could be presented as a first cut.