The first technique I'll use as a comparison is the "Bag of Little Bootstraps," which I wrote about last spring. It doesn't appear that I can derive a theoretical superiority, so I'll have to actually code the algorithm and run it on my data. I might want to run it on some simulated data as well. The long and short of it (mostly long) is that I'm in for some more development work.
The good news is that I think I can adapt my basic sampling engine to handle the selection, so all I need to write from scratch is the actual computation, which isn't particularly tough.
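For reference, here is a minimal sketch of what that computation might look like, assuming the statistic of interest is a plain mean and the quality measure is its standard error. This is not my sampling engine and not a tuned implementation; the function name `blb_stderr_of_mean` and the parameters `gamma`, `n_subsets`, and `n_resamples` are hypothetical names chosen for illustration. The idea is to draw a few small subsets without replacement, resample each one back up to the full size n using multinomial counts, measure the spread of the estimate within each subset, and average those spreads.

```python
import random
from collections import Counter
from statistics import mean, stdev


def blb_stderr_of_mean(data, gamma=0.7, n_subsets=5, n_resamples=50, seed=0):
    """Sketch: estimate the standard error of the mean of `data` (a list).

    All parameter names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    n = len(data)
    b = max(2, int(n ** gamma))  # size of each little subset, b = n^gamma

    subset_errors = []
    for _ in range(n_subsets):
        # Draw a small subset of b distinct rows (without replacement).
        subset = rng.sample(data, b)

        estimates = []
        for _ in range(n_resamples):
            # Resample n rows from the b-row subset. Only the per-row
            # counts are kept, so each resample is b weights, not n rows.
            counts = Counter(rng.choices(range(b), k=n))
            weighted_sum = sum(subset[i] * c for i, c in counts.items())
            estimates.append(weighted_sum / n)

        # Spread of the estimate within this subset.
        subset_errors.append(stdev(estimates))

    # Average the per-subset quality measures.
    return mean(subset_errors)


if __name__ == "__main__":
    gen = random.Random(1)
    sample = [gen.gauss(0.0, 1.0) for _ in range(2000)]
    print(blb_stderr_of_mean(sample))
```

Note that this toy version draws the counts by making n individual picks, which is fine for a few thousand rows but would need a proper multinomial sampler (and rows that don't live in memory) at any real scale.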
One thing that did amuse me a bit (I'll admit I'm being just a bit smug here) was their section discussing "large-scale experiments on a distributed computing platform." It described a data set of 6,000,000 rows. Um, guys, you're short a few zeros on that number. My data set has 60,000,000,000 rows.