Before getting on to today's topic, let me just say that, if you liked yesterday's post (or, if you thought I might have been on to something but didn't say it well enough), Nate Silver posted pretty much the same thing over on 538. Except that, since blogging about polling is his full-time job, he did it better than me.
Anyway, life goes on. This isn't the first time in history a bigoted narcissist has risen to power and we're still here.
I'm writing up the results section for my CISS paper. It's true that I really need to just finish this thing, but all this talk about polling errors got me thinking about whether I should run it against a much larger data set. Basically, I've been testing it against a reasonably consistent slice of our business: the projections from a few business units, roughly 5% of our total business. Since I was pulling the entirety of the data for each BU, I wasn't in danger of eliminating the heavy tails or correlations that motivated the algorithm in the first place. However, it is still a small sample of the total. A poll, so to speak, not the election.
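For what it's worth, here's roughly what that sampling choice looks like in code. This is just a sketch; the column names ("bu", "projection") and the 5% fraction are placeholders rather than the real schema, but it shows why pulling whole BUs keeps the structure that random row sampling would wash out.

```python
import pandas as pd

def sample_whole_bus(df: pd.DataFrame, frac: float = 0.05, seed: int = 0) -> pd.DataFrame:
    """Keep every row for a randomly chosen subset of BUs.
    Within-BU correlations and heavy tails survive intact."""
    bus = df["bu"].drop_duplicates().sample(frac=frac, random_state=seed)
    return df[df["bu"].isin(bus)]

def sample_random_rows(df: pd.DataFrame, frac: float = 0.05, seed: int = 0) -> pd.DataFrame:
    """Sample rows independently of their BU. This is the shortcut that would
    have diluted exactly the correlation structure the algorithm is meant to handle."""
    return df.sample(frac=frac, random_state=seed)
```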
So, why not run the election? Sure, the results aren't likely to differ much, but they will almost certainly differ in noticeable ways. Last spring, I decided against this simply because I wanted to keep the data in memory, which would make everything run a lot faster and make results analysis easier. However, it's really not that much work to write a real data layer. And, now that I've got a Hadoop cluster at my disposal, it will run plenty fast even on the "full" set. I use quotes there because I have no intention of running it on the 65,000,000,000-row data set that was the original motivation for this work; I'll use the initiative that has just over a billion rows.
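If I do go that route, the "real data layer" doesn't have to be anything fancy. A minimal sketch, assuming the initiative data lands somewhere Spark can read (the path, table layout, and column names below are all hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ciss-full-run").getOrCreate()

# ~1B rows: let the cluster handle partitioning instead of holding it all in memory.
projections = spark.read.parquet("hdfs:///data/initiative/projections")

# Pull one BU at a time back to the driver for the existing in-memory algorithm,
# rather than materializing the whole initiative at once.
for bu_row in projections.select("bu").distinct().toLocalIterator():
    bu_df = projections.filter(projections["bu"] == bu_row["bu"]).toPandas()
    # ... run the current in-memory code on bu_df ...
```

The point being that the algorithm itself doesn't change; only the loading step does.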