First the Bag of Little Bootstraps method with no filtering of rows, that is, basically iid distributions:
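For readers who haven't seen Bag of Little Bootstraps before, here's a minimal sketch of the idea in the iid setting: take small subsamples, bootstrap each back up to full size with multinomial weights, and average the resulting confidence bounds. The parameter choices (`s`, `b_exp`, `r`) are illustrative defaults, not the ones used for the graphs below.

```python
import numpy as np

def blb_ci_mean(data, s=20, b_exp=0.6, r=50, alpha=0.05, rng=None):
    """Bag of Little Bootstraps confidence interval for the mean.

    s: number of subsamples; each has size b = n**b_exp.
    r: bootstrap resamples per subsample.
    """
    rng = np.random.default_rng(rng)
    n = len(data)
    b = int(n ** b_exp)
    half_widths = []
    for _ in range(s):
        sub = rng.choice(data, size=b, replace=False)
        # Resample the little subsample back up to size n via
        # multinomial counts, then compute each weighted mean.
        means = []
        for _ in range(r):
            counts = rng.multinomial(n, np.ones(b) / b)
            means.append(np.dot(counts, sub) / n)
        lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
        half_widths.append((hi - lo) / 2)
    est = float(np.mean(data))
    hw = float(np.mean(half_widths))
    return est - hw, est + hw
```

The multinomial trick is what makes BLB cheap: each resample only ever touches the `b` distinct values in the subsample, no matter how large `n` is.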
As with previous examples, all three distributions have the same inter-quartile range. The difference is the weight on the tails. The horizontal axis is number of blocks read.
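To make the matched-IQR setup concrete, here's one way to generate three distributions with the same inter-quartile range but increasingly heavy tails. The middle distribution is my guess (a Laplace); the post doesn't say which one was actually used. Scales are chosen so each IQR equals 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Normal: IQR = 2 * 0.674490 * sigma, so sigma = 2 / 1.34898
normal = rng.normal(scale=2 / (2 * 0.674490), size=n)
# Laplace (assumed middle case): IQR = 2 * b * ln 2
laplace = rng.laplace(scale=1 / np.log(2), size=n)
# Standard Cauchy: quartiles at +/-1, so IQR = 2 already
cauchy = rng.standard_cauchy(size=n)

for name, x in [("normal", normal), ("laplace", laplace), ("cauchy", cauchy)]:
    q1, q3 = np.quantile(x, [0.25, 0.75])
    print(f"{name}: IQR = {q3 - q1:.3f}")
```

Matching on IQR rather than variance is the only sensible choice here, since the Cauchy has no variance (or mean) to match on.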
Not surprisingly, BLB does great with the normal distribution. Then again, just about any algorithm works with iid normal data; that's why so many people assume it even when they shouldn't. The heavier the tail, the worse things get, which is also what you'd expect. That said, even with Cauchy data the algorithm holds its own. It appears the iid assumption is enough.
CISS does a bit better with the heavy-tailed distributions, but not much. To make comparisons a bit easier, I've used the same vertical scale from one method to the next. I did have to use a larger scale for Cauchy because it's so much more variable, but it's the same on both graphs.
That's a mess. In the first two, the variance collapses because there are too many blocks where all the records are excluded (I generated this data set with attribute correlation, which cranks up the probability of missing all records in a block). Oddly, the Cauchy approximation is the best of the three, though I think that gets chalked up to dumb luck.
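A quick illustration of that failure mode, with made-up block size and selectivity: when the predicate-matching rows are correlated with block position, they pile into a few blocks and leave most blocks with zero matching records, whereas iid placement scatters them almost everywhere.

```python
import numpy as np

rng = np.random.default_rng(0)
n_blocks, block_size, p_match = 1000, 100, 0.05
n_match = int(n_blocks * block_size * p_match)

# iid case: matching rows scattered uniformly across all blocks
iid = np.zeros(n_blocks * block_size, dtype=bool)
iid[rng.choice(iid.size, n_match, replace=False)] = True
empty_iid = int(np.sum(iid.reshape(n_blocks, block_size).sum(axis=1) == 0))

# correlated case (extreme version): matching rows packed contiguously,
# filling the first 50 blocks and leaving the other 950 empty
corr = np.zeros_like(iid)
corr[:n_match] = True
empty_corr = int(np.sum(corr.reshape(n_blocks, block_size).sum(axis=1) == 0))

print(empty_iid, empty_corr)  # correlated case has far more empty blocks
```

Any block-level bootstrap that sees mostly all-empty blocks will badly misjudge the variance, which is exactly the collapse in the first two graphs.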
On the other hand...
Now we're talking. Good convergence and tight confidence bounds to match.