I really hope at least some of my readers get the reference to Louis CK's "Bag of ..." routine.
For those that don't, oh well, you could always google it. Meanwhile, here's the math:
Reference: Kleiner, A, Talwalkar, A, Sarkar, P, Jordan, M: A scalable bootstrap for massive data. Journal of the Royal Statistical Society, 2014 Part 4, 795-816.
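The bootstrap is a technique for evaluating the quality of an estimator (its standard error, say) that's been around for a while. The name comes from the common phrase "pulling oneself up by one's bootstraps." It's a nod to the fact that the appraisal uses the same data that the estimator is derived from: you resample the observed data with replacement, recompute the estimator on each resample, and look at how much the results vary. That said, the technique is sound and widely used.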
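For anyone who hasn't seen it written down, here's a minimal sketch of the basic bootstrap in Python. It's my own illustration, not code from the paper; the function name `bootstrap_std_error`, the choice of the mean as the estimator, and the 200 resamples are all just assumptions for the example.

```python
import random
import statistics

def bootstrap_std_error(data, estimator, n_resamples=200, seed=0):
    """Estimate the standard error of `estimator` by resampling `data` with replacement."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        # Each resample has the same size as the original data set.
        resample = rng.choices(data, k=n)
        estimates.append(estimator(resample))
    # The spread of the recomputed estimates approximates the estimator's standard error.
    return statistics.stdev(estimates)

# Illustrative data: 500 draws from a standard normal (an assumption for the demo).
demo_rng = random.Random(1)
sample = [demo_rng.gauss(0.0, 1.0) for _ in range(500)]
se_mean = bootstrap_std_error(sample, statistics.mean)
print(f"bootstrap standard error of the mean: {se_mean:.4f}")
```

For the mean of 500 unit-variance points, theory says the standard error should be close to 1/sqrt(500), or about 0.045, so the printed value should land in that neighborhood.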
The problem is that with really big data sets, it becomes computationally infeasible in its basic form, since every resample is as large as the full data set. The authors of this paper show how, rather than using the entire data set at once, one can chop the data set into lots of small chunks, apply the bootstrap to each chunk (resampling each chunk back up to the size of the full data set) and essentially average the results to get an assessment of comparable quality. They call this the "Bag of Little Bootstraps".
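Here's a rough sketch of how I read the procedure, again in Python and again my own illustration rather than the authors' code. The names `blb_std_error` and `weighted_mean`, the chunk size of n^0.6, and the counts of 5 chunks and 50 resamples per chunk are assumptions picked for the demo. The key trick is that a resample of size n drawn from a chunk of b distinct points can be stored as b counts, so the estimator only ever touches b values.

```python
import random
import statistics

def weighted_mean(values, counts):
    """Mean of a resample represented as (distinct value, repeat count) pairs."""
    return sum(v * c for v, c in zip(values, counts)) / sum(counts)

def blb_std_error(data, n_subsets=5, subset_size=None, n_resamples=50, seed=0):
    """Sketch of a Bag of Little Bootstraps estimate of the mean's standard error."""
    rng = random.Random(seed)
    n = len(data)
    b = subset_size or int(n ** 0.6)  # chunk size; n**0.6 is an assumed choice
    per_subset = []
    for _ in range(n_subsets):
        # Uniformly sample a small chunk without replacement.
        subset = rng.sample(data, b)
        estimates = []
        for _ in range(n_resamples):
            # Resample n points from the b chunk points, kept as multinomial counts.
            # (Drawing the n indices one by one is simple but not the fastest way.)
            counts = [0] * b
            for idx in rng.choices(range(b), k=n):
                counts[idx] += 1
            estimates.append(weighted_mean(subset, counts))
        per_subset.append(statistics.stdev(estimates))
    # Average the per-chunk assessments.
    return statistics.mean(per_subset)

# Illustrative data: 2000 draws from a standard normal (an assumption for the demo).
blb_rng = random.Random(2)
big_sample = [blb_rng.gauss(0.0, 1.0) for _ in range(2000)]
se_blb = blb_std_error(big_sample)
print(f"BLB standard error of the mean: {se_blb:.4f}")
```

With 2000 unit-variance points the target is roughly 1/sqrt(2000), about 0.022, even though each chunk here holds only 95 points.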
What interests me is that they have stuck with uniform sampling of items to get their smaller samples. Why? Is there some reason you couldn't bias the tails? If that still produces good results, that would be a big deal for my problem space. So, I need to see if anybody has addressed that. If not, it's a great opportunity for a result that's much more theoretical than what I have been doing while still wholly relevant.
I'm talking with my adviser about it tomorrow.