The idea is to use a Dirichlet Process to pull the partition distributions out of the data. That's an interesting exercise, but there's no way that's faster than just reading all the data and getting the real answer. However, it does offer an interesting bridge to D-BESt, which is still a piece of research I very much like.
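To make "pulling the partition distributions out of the data" concrete, here is a minimal sketch of one way it could look; this is my illustration, not the actual D-BESt machinery. It assumes each block can be summarized by a few cheap statistics (mean, spread, fraction of rows matching the predicate), and it uses scikit-learn's truncated Dirichlet Process mixture as a stand-in for a full DP so the data decides how many distinct block-level distributions there really are.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

# Hypothetical setup: 300 blocks, each summarized by (mean, std, fraction of
# rows matching the predicate). Because the data is correlated, many blocks
# share a summary; here three underlying "partition distributions" generate them.
true_groups = rng.integers(0, 3, size=300)
centers = np.array([[10.0, 2.0, 0.8],   # blocks full of relevant rows
                    [0.0, 0.5, 0.0],    # blocks with nothing we care about
                    [5.0, 1.0, 0.3]])   # mixed blocks
block_summaries = centers[true_groups] + rng.normal(scale=0.2, size=(300, 3))

# A truncated DP mixture infers how many distinct block distributions exist;
# n_components is only an upper bound on that number, not a fixed choice.
dp = BayesianGaussianMixture(
    n_components=20,
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=0.1,
    random_state=0,
)
labels = dp.fit_predict(block_summaries)
print("distinct block distributions found:", len(np.unique(labels)))
```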
The train of thought goes like this:
- When the data is correlated, "random" sampling doesn't work very well.
- Instead of trying to un-correlate the data, we use D-BESt to crank the correlation up even higher, then exploit that correlation to limit the number of blocks we have to look at.
- That works, but the results from the Dirichlet distribution let us do even better on two fronts. First, the distribution tells us how well D-BESt is working. Second, and this is the great part in my mind, it lets us identify groups of blocks with similar distributions, treat a sample from one block as representative of the whole group, and leverage that in our estimate (see the sketch after this list). That basically takes care of the irrelevant-data problem noted before: somebody can add a trillion new rows we don't care about and we still won't care. The "map" of distributions to blocks lets us steer away from them and only consider the mass of data that supports non-zero sums.
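Here is a hedged sketch of how that block-to-distribution map could pay off in an estimate; again, mine, not D-BESt itself, with all sizes and distributions invented for illustration. Given a group label per block (from a fit like the one above), we scan a few blocks per group, treat them as representative, and scale by the group's size; groups whose samples carry no relevant mass get skipped entirely.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented layout: 300 blocks of 1,000 rows each, each block drawn from one of
# three block-level distributions. Group 1 is the "trillion rows we don't care
# about" case: none of its rows contribute to the sum.
n_blocks, rows_per_block = 300, 1_000
group_of_block = rng.integers(0, 3, size=n_blocks)  # assumed known from a DP fit

def make_block(g):
    if g == 0:  # mostly relevant rows
        return rng.normal(10.0, 2.0, rows_per_block) * (rng.random(rows_per_block) < 0.8)
    if g == 1:  # irrelevant rows: zero contribution
        return np.zeros(rows_per_block)
    return rng.normal(5.0, 1.0, rows_per_block) * (rng.random(rows_per_block) < 0.3)

blocks = [make_block(g) for g in group_of_block]
true_sum = sum(b.sum() for b in blocks)

# Scan a few blocks per group, treat them as representative of the group, and
# scale by the group's block count. Groups whose samples contribute nothing
# are skipped, so piles of irrelevant blocks cost almost nothing to ignore.
samples_per_group, estimate, blocks_read = 3, 0.0, 0
for g in np.unique(group_of_block):
    members = np.flatnonzero(group_of_block == g)
    picked = rng.choice(members, size=min(samples_per_group, members.size), replace=False)
    blocks_read += picked.size
    mean_block_sum = np.mean([blocks[b].sum() for b in picked])
    if mean_block_sum == 0.0:
        continue
    estimate += mean_block_sum * members.size

print(f"true sum: {true_sum:,.0f}")
print(f"estimate: {estimate:,.0f} from {blocks_read} of {n_blocks} blocks")
```

The point of the toy run is only that the estimate comes from a handful of blocks per group rather than a scan, and the all-zero group never costs more than its few probe blocks no matter how many blocks land in it.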
Of course, that's a whole bunch of work and won't make it into this paper. But laying the groundwork for it might have some merit (though I'd still rather just get the basic results out first).