Basically, the data is distributed with a very heavy tail. What I mean by that is that most of the observations sit pretty tight to the mean, but some are waaaaaaaaay far away. Making it worse is the fact that the mean is pretty close to zero. So, if you're trying to estimate the sum, those outlying observations are crucial. If your true sum is, say, 100 million, and you miss a single observation of 10 billion, your estimate just went negative and grew by two orders of magnitude.
That's a pretty big miss.
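To see how badly a naive uniform sample behaves here, here's a tiny Python simulation with made-up numbers (not the actual cash-flow data): lots of small values near zero plus a handful of enormous ones, with a 1% row sample scaled back up to estimate the total.

```python
import random

# Made-up heavy-tailed data: mostly small cash flows near zero,
# plus a few enormous outliers that dominate the total.
random.seed(42)
data = [random.gauss(0, 50) for _ in range(100_000)]
data += [10_000_000_000, -3_000_000_000, 7_500_000_000]

true_total = sum(data)
fraction = 0.01  # naive uniform 1% row sample

for trial in range(5):
    sample = [x for x in data if random.random() < fraction]
    estimate = sum(sample) / fraction  # scale back up to the population
    print(f"trial {trial}: estimate = {estimate:,.0f}  (true = {true_total:,.0f})")

# Whether a giant observation happens to land in the sample swings the
# estimate by orders of magnitude -- exactly the "big miss" problem.
```

Run it a few times and the estimate jumps by orders of magnitude depending on which, if any, of the three outliers the sample happens to catch.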
So, my focus for the moment isn't correlation, but accounting for the big misses. (Those actuaries are pretty sharp; they've been telling me this all along.) However, just because a problem is hard doesn't make it impossible (or so I keep telling my daughter when she doesn't want to do her homework). Here's the plan in a nutshell (a rough code sketch follows the list):
- Partition the data not by attributes, but by the order of magnitude (probably log2, but that's an implementation detail) of the measure.
- When running a query, start with the partition that has the biggest values. Read them all; you can't afford to miss anything in this group.
- Repeat through all the partitions, changing the proportion of rows sampled based on how much of the total you've already captured. For example, if the big partition only gave you a single row, you should probably sample everything in the next partition, too. But, if the big partition gave you a bunch, you can start being more selective, because you know that even ALL the rows in a lower partition can't offset a bunch of rows in a higher one.
- When your sampling threshold gets to 0% (meaning you don't need to sample any rows from that partition), break. This is your optimal stopping rule.
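Here's a rough sketch of that loop in Python. The partitioning and the descending walk are straight from the list above; the rule for shrinking the sample fraction is a placeholder of my own (the real rule, per the questions below, would come from a Bayesian posterior), and all the names are made up.

```python
import math
import random
from collections import defaultdict

def partition_by_magnitude(rows):
    """Bucket rows by the base-2 order of magnitude of the measure.
    Zeros add nothing to the sum, so they get a throwaway bucket."""
    partitions = defaultdict(list)
    for value in rows:
        bucket = math.floor(math.log2(abs(value))) if value != 0 else None
        partitions[bucket].append(value)
    return partitions

def estimate_total(rows, step=0.25):
    """Walk partitions from biggest magnitude to smallest, sampling less as we go."""
    partitions = partition_by_magnitude(rows)
    estimate = 0.0
    fraction = 1.0  # read the top partition in full; can't afford to miss anything there
    for bucket in sorted((b for b in partitions if b is not None), reverse=True):
        if fraction <= 0:
            break  # the stopping rule: nothing further down needs sampling
        if fraction >= 1.0:
            sample = partitions[bucket]
        else:
            sample = [v for v in partitions[bucket] if random.random() < fraction]
        if sample:
            estimate += sum(sample) / fraction  # scale the sample back up
        # Placeholder shrink rule; the idea is to set this from the posterior instead.
        fraction = max(0.0, fraction - step)
    return estimate
```

With `step=0.25` the loop stops sampling after the fourth partition it touches; in the real thing, the fraction would fall out of the posterior width rather than a fixed schedule.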
Questions:
- How do you determine what percentage to sample? This is where it ties back to Bayesian stats (a rather important tie-in, since this is a project linked to a Bayesian stats course). We need some sort of prior on the distribution, and the sample size would be set so that the highest-density interval of the posterior, at the desired confidence level, is narrow enough. (There's a sketch of this mechanic after the list.)
- Where do we get the prior? In the data I'm working with, we have the advantage of actuals. This is a much smaller data set that has a very similar distribution to projected future cash flows.
- This is all well and good if the data is static, but are you really going to re-partition all your data when new stuff comes in? I don't think that's necessary. My thought is to fall back on the block sampling that I've been wanting to use all along. Within a block, data would all fall into the same partition. When new data comes in, blocks are added to partitions, but existing data doesn't get moved around. The problem then becomes one of picking how many blocks to sample rather than how many rows. Once you have a block, you read everything.
- Doesn't that get you right back where you started? What if all the important data lands in a single block? That's the rub, of course. And it gets worse as you increase the grain of the query, since a query with many attributes specified will be going after fewer rows, which may all reside in a single block. However, I think that setting the sample percentage based on the posterior from the previous steps saves us: if things aren't converging, we'll sample a lot more, maybe everything.
- This isn't very complicated; are you sure nobody's thought of it already? They may well have. Fortunately, this is a project for a class, not a dissertation. If it turns out that the work's already been done, I'll still have enough work adapting it to my data set to justify it as graduate work. If not, there should be at least a paper in it; maybe a real start on the dissertation itself.
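For the "what percentage?" question, here's a minimal sketch of the mechanic. Everything in it is my own assumption rather than the actual model: it uses a normal-normal conjugate update purely for illustration (the cash-flow data is heavy-tailed, so the real likelihood would need to be something fatter-tailed), and the prior standard deviation would be fit from the actuals mentioned above.

```python
import math

def posterior_sd(prior_sd, data_sd, n):
    """Posterior standard deviation of a partition's mean after seeing n rows
    (normal-normal conjugate update; a stand-in for the real model)."""
    if n == 0:
        return prior_sd
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = n / data_sd ** 2
    return math.sqrt(1.0 / (prior_precision + data_precision))

def need_another_block(prior_sd, data_sd, rows_seen, tolerance, z=1.96):
    """Keep pulling blocks from this partition while the ~95% highest-density
    interval for its mean is still wider than the error we can live with.
    (For a normal posterior the HDI is just the central interval.)"""
    hdi_width = 2 * z * posterior_sd(prior_sd, data_sd, rows_seen)
    return hdi_width > tolerance
```

The sampling loop from the earlier sketch would call something like this after each block it reads from a partition; once it comes back False, that partition's effective sample percentage has hit the stopping condition, which ties in with the non-convergence point above: as long as the posterior interval stays wide, we just keep reading.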